Handling special characters from html files with Power BI (M language).

In a first blog post, I explained how to list all the Data Platform MVP by parsing a single web page. In a second blog post, I parsed the individual pages of each MVP to get information about her awards. The third blog post was about social media and how to connect to the GitHub API to retrieve more info about activities on GitHub. The two next blog posts ([1] and [2]) are illustrating how to parse rather complex Json responses and retrieve the commits’ activity and the issues’ activity of a repository.

In this new blog post, I’ll try solve an issue that is there since the first blog post. When parsing the MVP website and retrieving the first and last names, I had an issue with some special characters: my own name was retrieved as “Cédric Charlier” in place of “Cédric Charlier”. For some reasons some special characters are encoded (but some are not). Anyway we need to decode them. This additional blog post is trying to solve more or less the same issue. The difference between them is the type of encoding of the html entities. In this case, they are encoded with numbers on the other blog post they are encoded with the entity’s name.

From a technical perspective, I’ll solve this issue with a recursion but also with some error handling in Power Query/M language.

mvp-analysis-powerbi-012

The first step is to analyze the problem that we’ve in front of us. The special characters encoded have always the same pattern. A ampersand (“&”) followed by a cross (“#”), a set of digits and a semi-colon (“;”) to end the fragment. We’ll first need to identify this kind of pattern in the string. I’ll simplify to only search for the ampersand and the semi-colon.

let
    ConvertHtmlSpecialChars = (string as text) =>
let
    start = Text.PositionOf(string, "&"),
    end = Text.PositionOf(string, ";"),

Then, I’ll extract this fragment from the string. Then I’ll confirm that this framment is effectively starting by “&” and ending by “#”. If it’s the case then I’ll return the number in this fragment.

    value = Text.Range(string, start, end-start+1),
    cleanValue =
        if
            Text.StartsWith(value, "&")
        then
            Text.Start(
                Text.End(
                    value
                    , Text.Length(value)-2
                )
                ,Text.Length(value)-3
            )
        else
            null,

Now, that I’ve this number, I can translate it to a normal character with the help of the function Character.FromNumber and replace the fragment by the new value.

    replace = Character.FromNumber(Number.FromText(cleanValue)),
    replaced = Text.Replace(string, value, replace),

Now, I need to handle the fact that I can have more than one special character in my string and I must process each of them. Once again recursion will help me!

    continue =
        if string=replaced
        then replaced
        else ConvertHtmlSpecialChars(replaced)
in
    continue

That’s it? Not really. My function is crashing easily. If I’ve no special characters in my string, the function Text.Range(string, start, end-start+1) will crash because -1 (start’s value) is not a valid. But I’ve many other places where I could have an issue.

A little bit of M theory to understand the basics of error handling in this language. When an error is thrown by a function this error is propagated to the caller and so on. It’s really practical because you can catch the error at a higher level and you don’t need to intercept all the potential errors.

In this case, I’ll catch the potential error between the replacement and my recursive condition.

newString = try replaced otherwise string,

The keyword try means that we’re expecting a potential issue during the call to the function replaced. The keyword otherwise specifies that if we effectively have an error during the call to the function replaced then we need to use another function in place of. In this case, the function is just the value of string (the initial value). So, if at any moment during the substitution, I’ve an error then I’ll use the original string.

I still need to update the recursive condition to use this new function:

continue =
        if string=newString
        then newString
        else ConvertHtmlSpecialChars(newString)
in
    continue
in
    ConvertHtmlSpecialChars

Now, I’ll need to add the call to this function to the table dealing the MVPs’ names:

#"Html Special Chars" =
        Table.TransformColumns(
            #"Renamed Columns"
            ,{{"MVP name", ConvertHtmlSpecialChars}}
        )
in
    #"Html Special Chars"

And now I can compare the before

mvp-analysis-powerbi-013

and after

mvp-analysis-powerbi-014

5 comments

  1. This seems clever but I cant follow how to implement it. Is it meant to be used as a function? Some of your fragments seem be edits or replaces for earlier fragments? Can you post the code as a clean gist?

Leave a comment