Handling Html entities when parsing web pages with Power Query/M Language

In a previous blog post, I explained how to handle the html entities when they were number-based such as À or &#9827 ;. This post covers the other possibility to define an html entity, by name: À or ♣.

Before going further, let’s notice that normally, entities defined by their name are case-sensitive. It means that À resulting in “À” is different to à resulting in “à”. Our code will also support mistake in casing, if and only if this mistake will not result in an ambiguity. Based on that À won’t be allowed but ♣ will be.

The code is relatively straightforward, create a list of of all the known html entities and the corresponding values.

let
   list =
   {
       [Name="Agrave", Number=192]
       , [Name="Aacute", Number=193]
       , [Name="Acirc", Number=194]
       , [Name="Atilde", Number=195]
...

Then parse the entity’s name and look for the corresponding record in the list. The search must be executed with the specified casing, if nothing is found use a case-insensitive approach. If there is no match or more than one then raise an error.

        },
        token = if Text.StartsWith(entity, "&")
                   and Text.EndsWith(entity, ";")
                then substring
                else entity,
        substring = Text.Range(entity, 1, Text.Length(entity)-2),
        correctCasing = List.Select(
                 list
                 , each [Name]=token),
        anyCasing = List.Select(
                 list
                 , each Text.Upper([Name])=Text.Upper(token)),
        single =  try List.Single(correctCasing)[Number]
                  otherwise List.Single(anyCasing)[Number],
        handleError = Error.Record(
                       "Invalid parameter"
                       , "Unable to translate the html entity &" & token & ";"
                       , if List.Count(anyCasing)=0 then
                            "No match with the list of known entities."
                         else
                            "Multiple matches with the list of known entities. Check your the casing of your entity's name."),
        char = try Character.FromNumber(single)
               otherwise handleError
    in
        char;

This function is available as a gist at https://gist.github.com/Seddryck/36d24e05c233cccf90e6a88a1af39c68

Advertisements

One comment

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s