Power BI

Encoding basic authentication in a URL with Power Query M language

More than one year ago, I wrote a blog post about how to get data from Neo4j with Power BI and yesterday, Chris Skardon published a cover enriched with more details (useful if you’re new to Power BI). But Chris did more than a cover: he also changed the way to authenticate. In my version, I circumnavigated the authentication obstacle by adjusting the server settings to let me run anonymous queries; Chris overcame it by providing a header to the web request (see below).

let
  Source = Web.Contents(
    "http://localhost:7474/db/data/transaction/commit",
    [
      Headers = [#"Authorization" = "Basic bmVvNGo6bmVv"],
      Content = Text.ToBinary("{
        ""statements"" : [ {
          ""statement"" : ""...""} ]
        }")
    ]
  )
in
  Source

Chris explained that he encoded his user/password combination with the help of C# or PowerShell. Then he added this encoded string in the header for the Authorization tag, after the Basic keyword.

This is a good idea, but I thought it would also be worthwhile to explain how to execute this base64 encoding without relying on an external process … just with the Power Query M language!

The first step is to correctly format the string to encode. The expected format is user:password. I can create two parameters, User and Password, and then create a new query named Credentials that applies the format.

let
    Source = User & ":" & Password,

After this, you need to convert your text to bytes with the help of the function Text.ToBinary.

    Bytes = Text.ToBinary(Source),

When you have the bytes, you need to encode them to a string with the help of the Base64 binary encoding.

    TextForm = Binary.ToText(Bytes, BinaryEncoding.Base64)
in
    TextForm

You can now edit the main query and replace Headers = [#"Authorization" = "Basic bmVvNGo6bmVv"], with Headers = [#"Authorization" = "Basic " & Credentials],.

That’s it, the encoding is performed in Power Query M language.

Full code:

let
    // build the user:password string from the two parameters
    Source = User & ":" & Password,
    // convert the text to bytes
    Bytes = Text.ToBinary(Source),
    // encode the bytes into a base64 string
    TextForm = Binary.ToText(Bytes, BinaryEncoding.Base64)
in
    TextForm
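
For completeness, here is a sketch of the main query once it relies on the Credentials query defined above (same local Neo4j endpoint as in the original example):

let
  Source = Web.Contents(
    "http://localhost:7474/db/data/transaction/commit",
    [
      Headers = [#"Authorization" = "Basic " & Credentials],
      Content = Text.ToBinary("{
        ""statements"" : [ {
          ""statement"" : ""...""} ]
        }")
    ]
  )
in
  Source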

Running a native query within an M project

A few weeks ago, Microsoft published the Power Query SDK, an extension of Visual Studio, on the Visual Studio Marketplace. It is available for free for VS2015 and VS2017. This extension is really welcome in the world of “Power Query M language” aficionados. By using it, you can save a lot of time when trying to debug a query or when you want to manually test your implementation. Naturally, it’s just a version 1.0.1 and there is room for improvement (having more than one file in a .mproj, support for debugging, …) but it’s much better than relying on the “advanced editor” of Power BI (the name “advanced editor” is somewhat funny for what is just a textbox).

I had just been using it for a few days when I found an interesting case: my query contained a native query.

Sql.Database(
   "server",
   "db",
   [Query = "select * from myTable where field=" & value]
)

When I tried to execute it, I received this message from the Power Query SDK:

The evaluation requires a permission that has not been provided. Data source kind: ‘SQL’. Permission kind: ‘NativeQuery’.

[Image: evaluation-requires-permission-not-provided]

No luck with my google-fu, so I checked the GitHub repository linked to data connectors and documenting this SDK. Nothing was available in the documentation, but an attribute in the project definition caught my attention: AllowNativeQuery. This attribute is false by default; I changed it to true:

<AllowNativeQuery>True</AllowNativeQuery>
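
In the .mproj file, this attribute sits among the other project properties; a sketch of the surrounding XML (the exact sibling elements may differ in your project):

<PropertyGroup>
  ...
  <AllowNativeQuery>True</AllowNativeQuery>
</PropertyGroup>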

I reloaded the project and, no surprise, that’s where the magic happened. I can now run my native queries within Visual Studio for this project!

Sorting a table based on your own criteria with Power Query (M language)

Earlier, I published a blog post about A case insensitive approach to sort a table with Power Query (M language) and I received a few pertinent comments from “Nick”. I agree with most of his comments, and some remarks about the syntax were fully valid, but the last part of one of his comments was not true (or at least is not true anymore). To quote him:

I wanted to use this method to sort on two fields, I guess I’ll have to go back to converting them to upper-case first, sorting on the upper-case fields and then removing them again

I’m a bit overloaded and it took me a very long time to check. But in the end, it’s really possible to define multiple criteria to sort a table.

To illustrate the how-to, I’m starting from my previous example. I’ll now sort the dataset based on the label (column Item) without taking the case into account [first criterion – no change] and, in case of equality, I’ll then sort the rows based on the price (descending) [second criterion to be introduced].

To test this, I’m introducing a few changes in my dataset. I’m adding a tenth record with the item Fishing Rod and updating the price for all the Fishing Rod and Bait items.

let
  Source = Table.FromRecords(
    {
      [OrderID=1, CustomerID=1, Item = "Fishing rod", Price = 100.0],
      [OrderID=2, CustomerID=1, Item = "1 lb. worms", Price = 5.0],
      [OrderID=3, CustomerID=2, Item = "Fishing net", Price = 25.0],
      [OrderID=4, CustomerID=3, Item = "Fish tazer", Price = 200.0],
      [OrderID=5, CustomerID=3, Item = "Bandaids", Price = 2.0],
      [OrderID=6, CustomerID=1, Item = "Tackle box", Price = 20.0],
      [OrderID=7, CustomerID=5, Item = "Bait", Price = 4.25],
      [OrderID=8, CustomerID=5, Item = "Fishing Rod", Price = 120.0],
      [OrderID=9, CustomerID=6, Item = "Bait", Price = 3.25],
      [OrderID=10, CustomerID=5, Item = "Fishing Rod", Price = 80.0]
    }
  ),
  #"Changed Type" =
    Table.TransformColumnTypes(
      Source,
      {
        {"Item", type text},
        {"Price", type number}
      }
    ),

The basic principle for sorting based on your own criteria has not changed: you need to specify a list of criteria as the second argument of the Table.Sort function.

#"Sorted Rows" =
  Table.Sort(
    #"Changed Type",
    {
       each Text.Upper([Item]),
       each [Price]*-1
    }
  )

and this method definitely gives the expected result:

[Image: sort-table-multiple-criteria]
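
As a side note, instead of the [Price]*-1 trick, Table.Sort also accepts {criterion, order} pairs; a quick sketch of what should be an equivalent call:

#"Sorted Rows" =
  Table.Sort(
    #"Changed Type",
    {
       {each Text.Upper([Item]), Order.Ascending},
       {each [Price], Order.Descending}
    }
  )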

A case insensitive approach to sort a table with Power Query (M language)

Have you ever tried to sort a table based on a text field? The result is usually a surprise for most people. The M language has a specific implementation of the sort engine for text where upper case letters are always ordered before lower case letters. It means that Z is always before a. In the example below, Fishing Rod is sorted before Fishing net.

[Image: sort-001]

The classical trick to escape from this weird behavior is to create a new column containing the upper case version of the text that will be used to sort your table, then to configure the sort operation on this newly created column. This is a two-step approach (three steps, if you take into account the need to remove the new column afterwards). Nothing bad with this, except that it obfuscates the code, and I hate that.

[Image: sort-002]

The documentation of Table.Sort is really sparse, especially regarding the last parameter, comparisonCriteria. I never found any real documentation on this parameter, and most of the available examples just use the name of one of the fields of the table. Hereby, I’ll give you a trick to apply a case insensitive sort to a table in one step.

The most important thing to understand is that comparisonCriteria doesn’t need to be a column; it can be any function. Based on this, it’s a safe move to change the code {"columnSort"} into {each Text.Upper(_[columnSort])} … and that’s it!

[Image: sort-003]

let
  Source = Table.FromRecords(  
    {  
      [OrderID=1, CustomerID=1, Item = "Fishing rod", Price = 100.0],  
      [OrderID=2, CustomerID=1, Item = "1 lb. worms", Price = 5.0],  
      [OrderID=3, CustomerID=2, Item = "Fishing net", Price = 25.0],  
      [OrderID=4, CustomerID=3, Item = "Fish tazer", Price = 200.0],  
      [OrderID=5, CustomerID=3, Item = "Bandaids", Price = 2.0],  
      [OrderID=6, CustomerID=1, Item = "Tackle box", Price = 20.0],  
      [OrderID=7, CustomerID=5, Item = "Bait", Price = 3.25],  
      [OrderID=8, CustomerID=5, Item = "Fishing Rod", Price = 100.0],  
      [OrderID=9, CustomerID=6, Item = "Bait", Price = 3.25]  
    }  
  ),
  #"Changed Type" = 
    Table.TransformColumnTypes(Source,{{"Item", type text}}),
  #"Sorted Rows" = 
    Table.Sort(#"Changed Type",{each Text.Upper([Item])})
in
  #"Sorted Rows"

Web.Contents and 404 page not found in Power BI

The M language (empowering Power Query and Power BI) supports the try ... otherwise ... feature. It is handy to catch errors when processing some of the rows and handle these cases in alternative paths.

Unfortunately, this feature is not intuitive when surrounding a call to the function Web.Contents(). Intuitively, you could expect that an answer 404: page not found would start the otherwise path. But it’s not the case: an error will be thrown!

To correctly manage this kind of error, you’ll need to add a second parameter to your Web.Contents call (after the url): [ManualStatusHandling={404}]. This parameter tells Web.Contents() that you don’t want to apply the default behavior of Power BI when receiving a 404 error: you’ll handle this case by yourself and not stop the load of the table.

The pragmatic option is probably to continue your transformation flow independently of the response received from the web server. Your next step will (probably) miserably fail and you’ll be able to add a try ... otherwise ... on this step to handle an error.

But, if you really want to do this cleanly, you’ll have to parse the metadata of the result of Web.Contents. The function Value.Metadata gives you this opportunity. In the metadata, you’ll find a Response.Status field where you’ll be able to check the value received: 404, 100, … Based on this value, you can decide which path to follow with a simple if ... then ... else.

let
  Source = Web.Contents(
    "http://unexisting.com",
    [ManualStatusHandling={404}]
  ),
  ResponseCode = Value.Metadata(Source)[Response.Status],
  NextStep = if ResponseCode=404 then ... else ...
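
To make the pattern concrete, here is a minimal, self-contained sketch; the URL and the fallback record are hypothetical, and it assumes the endpoint returns Json on success:

let
    Source = Web.Contents(
        "http://example.com/api/items",
        [ManualStatusHandling={404}]
    ),
    ResponseCode = Value.Metadata(Source)[Response.Status],
    // return a hypothetical fallback record on 404, parse the response otherwise
    Result =
        if ResponseCode = 404
        then [error = "not found"]
        else Json.Document(Source)
in
    Result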

Handling enum values in fact tables with Power BI

In a previous blog post, I wrote about the analysis of a survey’s results and specifically how to build a hierarchy when this notion was not really transparent in the source data. Then, in a second post, I explained how I avoided creating some temporary tables by using transpose and unpivot.

I’ll now continue my journey and explain how to build the tables to manage the results. Some of the challenges that I’ll have to face are:

  • Answers are an enum where I’m expecting values such as NONE, LARGELY, FULLY, … [Green area]
  • NOT APPLICABLE should be treated as a no-answer (which is also possible).
  • DON’T KNOW should be considered as an answer but not taken into account when I create an aggregation (average) of the results received.
[Image: survey-excel-file]

To start, I’ll create two tables, not one. In the first, I’ll create a new dimension with the list of possible answers to the questions. I can do this manually with the “Enter Data” feature.

[Image: enter-data]
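
The exact content of this Value table is the one shown in the screenshot; as an illustration, an equivalent hand-written version could look like the sketch below (DON’T KNOW mapped to 0 as decided above; the intermediate labels SLIGHTLY and PARTIALLY are invented for the sketch):

let
    Source = Table.FromRecords({
        [Label = "DON'T KNOW", Value = 0],
        [Label = "NONE", Value = 1],
        [Label = "SLIGHTLY", Value = 2],
        [Label = "PARTIALLY", Value = 3],
        [Label = "LARGELY", Value = 4],
        [Label = "FULLY", Value = 5]
    })
in
    Source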

For the second, I’ll restart from the previously created table Results-Staging. I’ll just merge this table with the newly created Value table.

let
    Source = #"Results-Staging",
    #"Merged Queries" =
        Table.NestedJoin(
            Source,
            {"Answer"},
            Value,
            {"Label"},
            "NewColumn",
            JoinKind.Inner
        ),
    #"Expanded NewColumn" =
        Table.ExpandTableColumn(#"Merged Queries", "NewColumn", {"Value"}, {"Value"}),
    #"Removed Columns1" =
        Table.RemoveColumns(#"Expanded NewColumn", {"Answer"}),
    #"Filtered Rows1" =
        Table.SelectRows(#"Removed Columns1", each ([Value] <> null))
in
    #"Filtered Rows1"

Due to the usage of an inner join, all the questions where the answer was not one of the predefined values of the enumeration are discarded. It’s good news because it’ll remove the yellow area and the red area, but also all the NOT APPLICABLE answers!

[Image: results-final]

Our model looks like:

[Image: survey-model]

We still have to implement a few measures. The count of answers with a given label for a category or a respondent can easily be achieved with this straightforward DAX formula:

CountOfAnswer = COUNT(Results[Value])

We’ve previously decided that a DON’T KNOW shouldn’t influence our average of the answers. To apply this decision, we just need to filter out all the values equal to 0 (the enum value of DON’T KNOW). Then we calculate the average and subtract 1 to get a value between 0 and 4. Coooool, except that if we have no value different from 0, it will return -1 … not what we’re expecting. We’ll need to validate that the average is not blank before subtracting 1.

AverageOfResponse =
   VAR NonZeroValues =
       CALCULATE(
           AVERAGE('Value'[Value]),
           FILTER('Value', 'Value'[Value]<>0)
       )
   RETURN
       IF(ISBLANK(NonZeroValues), BLANK(), NonZeroValues-1)

That’s it! Now I can use the DON’T KNOW values in a count of answers received, but they don’t influence my average:

[Image: results-charts]

Merge on a filtered set of rows without creating a new table (Power BI)

In a previous blog post, I wrote about the analysis of a survey’s results and specifically how to build a hierarchy when this notion was not really transparent in the source data.

I’ll now continue my journey and explain how to build the table/dimension “Respondent”. To summarize the challenges that I’ll face:

  • Each respondent means a new column (and not a new row), and each additional piece of information for the respondent adds a new row (and not a new column) [Orange area]
  • Some information is available in the list of answers given by the respondent and not in the information at the top [Yellow area]
  • For maintenance reasons, I’ll try to avoid creating temporary tables that will only be used as a subset of another table
[Image: survey-excel-file]

The first step is really easy: we’ll just keep the first rows by removing all the rows where the column “A” has a value. When done, we can remove this column, which will only contain null values.

let
   #"Filtered Rows" =
      Table.SelectRows(
          Source,
          each ([Column1] = null)
      ),
   #"Removed Columns" =
      Table.RemoveColumns(
           #"Filtered Rows",
           {"Column1"}
      ),

The next step is to transform each row into a column and vice versa. Usually, people with a SQL background try to apply a pivot or unpivot to achieve this task. There is an easier way with the M language … just use the function Transpose:

#"Transposed Table" = Table.Transpose(#"Removed Columns"),

You’ll obtain the following result:

[Image: respondent-transpose]

It’s now time to promote the first row as header.

#"Promoted Headers" = Table.PromoteHeaders(#"Transposed Table"),

Now that we already have a bit of information about our respondents, we’ll need to add the block of yellow information to this table.

To achieve this, I’ll first create a new table named Results-Staging containing the gray area of my Excel sheet. Let’s start by removing the column “B”, promoting the first row as header and filtering the rows to remove those within the orange area (column “A” is null).

let
    #"Removed Columns" = Table.RemoveColumns(Source,{"Column2"}),
    #"Promoted Headers" = Table.PromoteHeaders(#"Removed Columns"),
    #"Filtered Rows" = Table.SelectRows(
        #"Promoted Headers",
        each ([Column1] <> null)
    ),

[Image: results-gray]

Then, I’ll change this format to have one column with the QuestionId, another with the respondent and finally a third with the answer. To transform the current data table into this format, I’ll need to unpivot all columns except Column1.

    #"Unpivoted Other Columns" =
        Table.UnpivotOtherColumns(
            #"Filtered Rows",
            {"Column1"},
            "Attribute",
            "Value"
        ),
    #"Renamed Columns" =
         Table.RenameColumns(
              #"Unpivoted Other Columns",
              {
                    {"Column1", "QuestionId"},
                    {"Attribute", "Interviewee"},
                    {"Value", "Answer"}
               }
          )
in
    #"Renamed Columns"

[Image: results-unpivot]

In this table, only the results with a QuestionId equal to 111 really interest me for a merge with the existing table Respondent. If you’re familiar with the UI of Power BI Desktop, you’ll probably think of creating a new table referencing this one, then filtering on QuestionId equal to 111 and finally merging. It’ll work, but applying this strategy could result in many “temporary” tables. A lot of these small tables, used only for a few steps before merging with other tables, tend to be a nightmare for maintenance. You can use the “Advanced formula editor” to avoid displaying this kind of temporary table and embed them in your main table.

In the code of the Respondent table, add a step to filter the content of the table Results-Staging.

#"Age table" =
   Table.SelectRows(
      #"Results-Staging",
      each [QuestionId] = 111
   ),

If you take a look at the code, the identifier #"Results-Staging" doesn’t exist in the code associated with the table Respondent. In fact, it’s the end result of the query Results-Staging!

Now that we have our filtered set of rows corresponding to the ages of the respondents, I can join it with the content of the table Respondent.

#"Merged Queries" =
   Table.NestedJoin(
       #"Promoted Headers",
       {"Name"},
       #"Age table",
       {"Interviewee"},
       "NewColumn",
       JoinKind.LeftOuter
   ),

With a bit more expanding and renaming, I can finally get my table Respondent.
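
These last steps are not detailed here; a sketch of what they could look like (the final column name Age is an assumption based on the screenshot):

#"Expanded NewColumn" =
   Table.ExpandTableColumn(
       #"Merged Queries",
       "NewColumn",
       {"Answer"},
       {"Age"}
   )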

[Image: respondent-final]

In the next blog post, I’ll show how to manage the table of the results.

Creating a hierarchy when the source has the levels flattened in a unique column (Power BI)

Over a few blog posts, I’ll explain how to create a model for an analysis of the answers to a survey. This series is based on a real-life experience that I’m fighting right now, but I completely changed the type of questions and the answers.

My first challenge is to create a clean dimension with the hierarchy of questions, based on an input file where this notion of hierarchy was not explicit and everything was flattened into two columns.

Before jumping into the action, I’ll describe the file containing my data. I received a wonderful Excel file with the results of the survey. This file is a bit weird; let’s summarize the format:

  • The first column contains a “number” corresponding to a hierarchy organizing the different questions. Each question is part of a sub-category and each sub-category is part of a category. The second column contains the question itself or the label of the (sub-)category (Gray area)
  • An additional column is created for each respondent (columns C to F)
  • The first rows are personal information, not answers to questions (Orange area)
  • Some personal information (age) is also available in the list of answers to questions (Yellow area)
  • Most of the questions accept predefined values, including “Not applicable” and “Don’t know” (Green area)
  • Some people were not allowed to answer some questions; I’ve no value for them in the file (Red area)
[Image: survey-excel-file]

I decided to apply a few “business rules” to analyze the results:

  • “Not applicable” should be considered as “no answer” and not taken into account. This rule must apply both when I’m aggregating the answers and when I’m counting the answers by question
  • “Don’t know” is a valid answer that I’d like to see when I’m analyzing how many answers I received, but not when I’m aggregating the answers to get a single value.

To load this Excel file in Power BI, I’ll just use standard functions and define a first table “Source” with its “Enable load” option turned off.

My next task will be to create a table (or dimension) with the different questions. I also want to include a hierarchy in this dimension: I should be able to browse the questions by categories and sub-categories.

Let’s create a new table named “Question” by referencing the “Source” table. Then remove all columns other than A and B.

[Image: question-null-id]

We’ll also need to remove the rows where the content of “Column1” is null (these rows are personal information, not questions/categories).
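
A sketch of these first steps (assuming the default column names Column1 and Column2):

#"Removed Other Columns" =
    Table.SelectColumns(Source, {"Column1", "Column2"}),
#"Filtered Rows" =
    Table.SelectRows(#"Removed Other Columns", each [Column1] <> null),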

At the beginning of this post, I said that the first column was a number … to be precise, it’s not a whole number (as I expected) but a decimal number! The root cause is rounding errors in Excel … apparently, the original file was not configured correctly and the columns were not “text”:

[Image: question-rounding]

To solve this issue, I’ll just apply the rounding myself with 0 decimals … not an “up” or “down” rounding but just a normal rounding. Then I’ll change the type of this column to “text” because I won’t use this column for aggregations.

#"Rounded Off" =
   Table.TransformColumns(
      #"Renamed Columns",
      {{
         "QuestionId",
         each Number.Round(_, 0), type number
      }}
    ),
#"Changed Type" =
    Table.TransformColumnTypes(
       #"Rounded Off",
      {{"QuestionId", type text}}
    ),

Based on the column “QuestionId”, I can determine whether a row is a question, a sub-category or a category by just taking a look at the length of this text: 1 digit means a category, 2 a sub-category and 3 a question … let’s add a column with this information.

#"Added level" = Table.AddColumn(
   #"Changed Type",
   "Level",
   each
      if
         Text.Length([QuestionId])=1
      then "category"
      else if
         Text.Length([QuestionId])=2
      then "sub-category"
      else "question"),

Then I’ll add two additional columns with the ids of the sub-category and of the category. To obtain these values, I just need to extract the first two digits for a sub-category and the very first digit for a category.

#"Added CategoryId" =
   Table.AddColumn(
      #"Added level",
      "CategoryId",
      each Text.Start([QuestionId],1)
   ),
#"Added SubCategoryId" =
   Table.AddColumn(
      #"Added CategoryId",
      "SubCategoryId",
      each Text.Start([QuestionId],2)
   ),

[Image: question-level]

Once I’ve calculated the id of the sub-category, I need to replace it with its label. To achieve this, I’ll just merge this table with itself, based on the SubCategoryId being equal to the QuestionId:

[Image: question-merge]

#"Merged Queries" =
    Table.NestedJoin(
        #"Added SubCategoryId",
        {"SubCategoryId"},
        #"Added SubCategoryId",
        {"QuestionId"},
        "NewColumn",JoinKind.LeftOuter),

Expand the table and select the column Question, which you should rename Sub-category. Apply the same tactic for the addition of the label of the category and you should have the table below.
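
A sketch of this expand step (the exact column names are assumptions based on the screenshots):

#"Expanded NewColumn" =
    Table.ExpandTableColumn(
        #"Merged Queries",
        "NewColumn",
        {"Question"},
        {"Sub-category"}
    ),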

[Image: question-categories]

This table has too many rows because we still have the rows for the categories and sub-categories. We can just filter them out using the column Level.

#"Filtered Rows1" =
     Table.SelectRows(
          #"Renamed Columns2",
          each ([Level] = "question")
     ),

The column Level is now pointless (one unique value, equal to “question”) and we can safely remove it (together with the id columns of the category and sub-category) to get a clean dimension table:

[Image: question-final]

In the next blog post, I’ll explain how to build my “interviewee” dimension.

Parsing a string with Power BI (M language)

A few days ago, I read an interesting blog post by Lukas Lötters [b|t] where he explains how to use regular expressions in Power BI (M language) by running an R script. It’s a really smart way to solve the issue. But I wanted to try to solve this problem without delegating the parsing of the string to R (just for the challenge!).

In a nutshell, the challenge is to find a substring in a string. This substring is identified by the pattern \((19|20)\d{2}. If you’re not familiar with regular expressions, it just means an opening parenthesis followed by “19” or “20” and two additional digits.

[Image: regex_meme]

Let’s start by defining a few test-cases that the function to implement will need to pass:

  • “The Shawshank Redemption (1994)” / result: 1994
  • “My Dad Says (2010) {Dog Ed Pursuit (#1.7)}” / result: 2010
  • “Years Younger (2004/I)” / result: 2004
  • “abcd(19xy) hqskj dhq (2020)” / result: 2020
  • “fdsdf (1995) sdfsdf (19)” / result: 1995
  • “hkjehr (195x) (1993)” / result: 1993
  • “hkjehr (19-9) (2005)” / result: 2005
  • “hkjehr (199.) (2000)” / result: 2000
  • “hkjehr (19+9) (1993)” / result: 1993

The first three test cases are directly retrieved from the original file to parse; the others are just additional potential cases directly produced by my own brain.

To solve this challenge, I’ll create a function expecting a text.

let
    ExtractYear = (string as text) =>

My first step will be to identify all the opening parentheses in the string. Usually, we use the function Text.PositionOf to get the first occurrence of a character in a string. But if we specify the parameter Occurrence.All, this function returns a list with all the occurrences.

let
    starts = Text.PositionOf(string, "(", Occurrence.All),

The previous function returns a list of positions in the string. I’ll now extract the 4 characters following each parenthesis and replace the previous content of my list with these 4 characters; List.Transform is a good friend for this kind of task.

    blocks =
        List.Transform(
            starts
            , each Text.Range(string, _ + 1, 4)
        ),

Unfortunately, I’m not sure that I have 4 characters after the parenthesis in my string (test case 5). I need to be careful and handle the error. We can implement this with a try ... otherwise ... syntax.

    blocks =
        List.Transform(
            starts
            , each try Text.Range(string, _ + 1, 4) otherwise null
        ),

Now, for each opening parenthesis, I have the next 4 characters (or null if they don’t exist). I’ll need to filter this list to only take into account the blocks starting with “19” or “20”. To apply this filter to my list, I’ll use the function List.Select.

    blocks1920 =
        List.Select(
            blocks
            , each
                Text.StartsWith(_, "19")
                or Text.StartsWith(_, "20")
        ),

This filter will already considerably reduce the list of potential years. The only test cases where I still have several potential years are test cases 6 and onward. Example: hkjehr (195x) (1993). This test case has 2 remaining elements in the list: 195x and 1993.

To continue to filter out candidate “years”, I need to check that the characters in positions 3 and 4 are digits. I have not found a native function in M to test that a character is a digit, so I’ll implement this myself. But first, let’s extract the characters in positions 3 and 4 and pass them as parameters to a to-be-created function named isDigit. This new function will return true/false (a logical in M).

    blocks1920dd =
        List.Select(
            blocks1920
            , each
                isDigit(Text.Range(_,2,1))
                and isDigit(Text.Range(_,3,1))
        ),

To validate that a character is a digit, I decided to rely on characters and encoding. All the digits are encoded between positions 48 and 57 in the ASCII table.

[Image: ascii-chars]

My function isDigit will just test that the character, transformed to a number (with the help of Character.ToNumber), is between the same conversion applied to “0” and “9”.

    isDigit = (x as text) as logical =>
        Character.ToNumber(x)>=Character.ToNumber("0")
        and Character.ToNumber(x)<=Character.ToNumber("9"),

This internal function accepts a parameter. Take a look at the syntax, where I stipulated this parameter after the equals symbol and before the arrow symbol (just as for a normal function).

When all these tests have been applied to filter the candidates for the release year, I’m expecting to have just one remaining candidate! I’ll assert this with the function List.Single. This function throws an error if the list is empty or has more than one element.

    value = List.Single(blocks1920dd)

That’s it! My newly created function validates all the test cases that I set at the beginning.

[Image: challenge-completed]

Full code:

let
    ExtractYear = (string as text) =>
let
    starts = Text.PositionOf(string, "(", Occurrence.All),
    blocks =
        List.Transform(
            starts
            , each try Text.Range(string, _ + 1, 4) otherwise null
        ),
    blocks1920 =
        List.Select(
            blocks
            , each
                Text.StartsWith(_, "19")
                or Text.StartsWith(_, "20")
        ),
    blocks1920dd =
        List.Select(
            blocks1920
            , each
                isDigit(Text.Range(_,2,1))
                and isDigit(Text.Range(_,3,1))
        ),
    isDigit = (x as text) as logical =>
        Character.ToNumber(x)>=Character.ToNumber("0")
        and Character.ToNumber(x)<=Character.ToNumber("9"),
    value = List.Single(blocks1920dd)
in
    value
in
    ExtractYear
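
Assuming the query holding this function is named ExtractYear, it can then be invoked directly (ExtractYear("The Shawshank Redemption (1994)") returns "1994") or applied to a column of titles; the table and column names below are hypothetical:

#"Added Year" =
    Table.AddColumn(
        #"Previous Step",
        "Year",
        each ExtractYear([Title])
    )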

Handling special characters from html files with Power BI (M language)

In a first blog post, I explained how to list all the Data Platform MVPs by parsing a single web page. In a second blog post, I parsed the individual pages of each MVP to get information about her awards. The third blog post was about social media and how to connect to the GitHub API to retrieve more info about activities on GitHub. The next two blog posts ([1] and [2]) illustrated how to parse rather complex Json responses and retrieve the commits’ activity and the issues’ activity of a repository.

In this new blog post, I’ll try to solve an issue that has been there since the first blog post. When parsing the MVP website and retrieving the first and last names, I had an issue with some special characters: my own name was retrieved as “C&#233;dric Charlier” in place of “Cédric Charlier”. For some reason, some special characters are encoded (but some are not). Anyway, we need to decode them.

From a technical perspective, I’ll solve this issue with a recursion but also with some error handling in Power Query/M language.

[Image: mvp-analysis-powerbi-012]

The first step is to analyze the problem that we have in front of us. The encoded special characters always have the same pattern: an ampersand (“&”) followed by a hash (“#”), a set of digits and a semicolon (“;”) to end the fragment. We’ll first need to identify this kind of pattern in the string. I’ll simplify and only search for the ampersand and the semicolon.

let
    ConvertHtmlSpecialChars = (string as text) =>
let
    start = Text.PositionOf(string, "&"),
    end = Text.PositionOf(string, ";"),

Then, I’ll extract this fragment from the string and confirm that it effectively starts with “&#” and ends with “;”. If it’s the case, I’ll return the number contained in this fragment.

    value = Text.Range(string, start, end-start+1),
    cleanValue =
        if
            Text.StartsWith(value, "&")
        then
            Text.Start(
                Text.End(
                    value
                    , Text.Length(value)-2
                )
                ,Text.Length(value)-3
            )
        else
            null,

Now that I have this number, I can translate it to a normal character with the help of the function Character.FromNumber and replace the fragment with the new value.

    replace = Character.FromNumber(Number.FromText(cleanValue)),
    replaced = Text.Replace(string, value, replace),

Now, I need to handle the fact that I can have more than one special character in my string, and I must process each of them. Once again, recursion will help me!

    continue =
        if string=replaced
        then replaced
        else ConvertHtmlSpecialChars(replaced)
in
    continue

That’s it? Not really. My function crashes easily. If I have no special characters in my string, the function Text.Range(string, start, end-start+1) will crash because -1 (start’s value) is not a valid position. And I have many other places where I could have an issue.

A little bit of M theory to understand the basics of error handling in this language: when an error is thrown by a function, it is propagated to the caller and so on. It’s really practical because you can catch the error at a higher level; you don’t need to intercept all the potential errors.

In this case, I’ll catch the potential error between the replacement and my recursive condition.

newString = try replaced otherwise string,

The keyword try means that we’re expecting a potential issue during the evaluation of the step replaced. The keyword otherwise specifies that, if we effectively get an error when evaluating replaced, then we need to use another expression instead. In this case, it’s just the value of string (the initial value). So, if at any moment during the substitution I get an error, then I’ll use the original string.

I still need to update the recursive condition to use this new step:

continue =
        if string=newString
        then newString
        else ConvertHtmlSpecialChars(newString)
in
    continue
in
    ConvertHtmlSpecialChars
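
A quick sanity check of the function (hypothetical invocation):

// should return "Cédric Charlier"
ConvertHtmlSpecialChars("C&#233;dric Charlier")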

Now, I’ll need to add a call to this function to the table dealing with the MVPs’ names:

#"Html Special Chars" =
        Table.TransformColumns(
            #"Renamed Columns"
            ,{{"MVP name", ConvertHtmlSpecialChars}}
        )
in
    #"Html Special Chars"

And now I can compare the before

[Image: mvp-analysis-powerbi-013]

and after

[Image: mvp-analysis-powerbi-014]