Power BI

Extracting key phrases with the Text Analytics API and Power BI

In a previous series of blog posts, I explained how to grab the list of Data Platform MVPs from a website and enrich it with additional data such as their GitHub contributions. My last article also explained how to use the Face API to guess the gender of these MVPs. I'll now parse their biographies to extract key phrases and display the most frequent themes in these texts. To support this need, I'll use the Text Analytics API provided by Azure Cognitive Services.

The Text Analytics API is a suite of text analytics web services built with best-in-class Microsoft machine learning algorithms. The API can be used to analyze unstructured text for tasks such as sentiment analysis, key phrase extraction and language detection. No training data is needed to use this API; just bring your text data. It uses advanced natural language processing techniques to deliver best-in-class predictions.

The first step of this exercise is to grab the biography. We previously discussed how to parse a web page and retrieve pertinent information. In this case we need to look for the HTML tag <pre>, extract the text and clean it up.

let
    GetBiography = (url as text) as text =>
let
    // Download the page and split it into lines
    source = Web.Contents(url),
    lines = Lines.FromBinary(source),
    // Keep the first line containing the <pre> tag
    biography = List.Select(lines, each Text.Contains(_, "<pre>")){0},
    // Extract the text between the tags and decode the html entities
    cleanedBiography = ConvertHtmlSpecialChars(Text.BeforeDelimiter(Text.AfterDelimiter(Text.Clean(Text.Trim(biography)), ">"), "<"))
in
    try cleanedBiography otherwise ""
in
    GetBiography

We need to set up a Text Analytics API service on our Azure account. If you don't know how to do this, refer to the previous article of this series. About pricing, selecting the free (F0) tier is probably the best approach: we'll make a very limited usage of this API, with just one call per refresh.

text-analytics-api-pricing.png
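
The queries below reference two parameters, TextApiUrl and TextApiKey, defined in the same way as for the Face API. As a purely illustrative sketch (the region, API version and exact path depend on where and how you created the service, and the key is obviously a placeholder), they could look like:

TextApiUrl = "https://westeurope.api.cognitive.microsoft.com/text/analytics/v2.0/keyPhrases"
TextApiKey = "<your-subscription-key>"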

Indeed, unlike the Face API, we don't need to perform a call to the Text Analytics API for each MVP. We only need to call this API once with all the biographies. The API expects a list of records (with the language, an id and the biography). The language will always be "en" (standing for English) and for the id, we'll use the MvpId. Preparing this structure is performed in the function PrepareKeyPhrases.

let
    PrepareKeyPhrases = () =>
let
    source = Table.SelectColumns(#"All info",{"MvpId", "Biography"}),
    filtered = Table.SelectRows(source, each [Biography] <> ""),
    addEnglish = Table.AddColumn(filtered, "language", each "en"),
    renamed = Table.RenameColumns(addEnglish,{{"Biography", "text"}, {"MvpId", "id"}}),
    reordered = Table.ReorderColumns(renamed,{"language", "id", "text"}),
    list = Table.ToRecords(reordered)
in
    list
in
    PrepareKeyPhrases

Once we have the expected input, we just need an additional function to call the Text Analytics API and its keyPhrases method. This Power Query/M function will be named GetKeyPhrases, will receive as parameter the result of PrepareKeyPhrases and will use the same approach as for the Face API, detailed in this article. The most complex part of this query is to transform the structure (list of records) that we previously prepared into JSON text. We can achieve this by converting the structure to a JSON document with Json.FromValue and then serializing the newly created binary back to text with the help of Text.FromBinary.

let
    GetKeyPhrases = (list as list) =>

let
    headers = [#"Content-Type" = "application/json", #"Ocp-Apim-Subscription-Key" = TextApiKey],
    jsonText = Text.FromBinary(Json.FromValue(list)),
    content = "{ ""documents"": " & jsonText & "}",
    source = Web.Contents(TextApiUrl, [Headers = headers, Content=Text.ToBinary(content)])
in
    source
in
    GetKeyPhrases

We’ll need to provide a table for these keyPhrases. This new table will be linked to the main table with all the MVP details in the model.

let
    source = GetKeyPhrases(PrepareKeyPhrases()),
    json = Json.Document(source,65001)[documents],
    toTable = Table.FromList(json, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    expandColumns = Table.ExpandRecordColumn(toTable, "Column1", {"id", "keyPhrases"}, {"MvpId", "Word"}),
    expandList = Table.ExpandListColumn(expandColumns, "Word"),
    nonNullRows = Table.SelectRows(expandList, each [Word] <> null),
    capitalized = Table.TransformColumns(nonNullRows,{{"Word", Text.Proper, type text}}),
    cleaned = Table.TransformColumns(capitalized,{{"Word", Text.Clean, type text}}),
    trimmed = Table.TransformColumns(cleaned,{{"Word", Text.Trim, type text}})
in
    trimmed

Now that we have this list of keywords, we can freely use the Word Cloud custom visual.

text-analytics-word-cloud.

It’s interesting to compare some of the keywords extracted for the recently awarded MVPs.

text-analytics-api-new-mvp

and the long-time MVPs.

text-analytics-api-long-time-mvp

At first sight, "book" appears to be a keyword associated with long-time MVPs, and "Power BI" with the recently awarded ones.

Using the Face API with Power BI

In a previous series of blog posts, I explained how to grab the list of Data Platform MVPs from a website and enrich it with additional data such as their GitHub contributions. In this sequel, I'll also add the gender of the MVPs … based on their profile picture. To achieve this I'll rely on the Face API, part of the Azure Cognitive Services.

Let's start by creating a service on your Azure subscription. If you have no Azure subscription, don't panic: you can create a new one and you'll receive some free credits to try a few services, including the Face API. Also, if you want, you can use the Face API for free with the F0 level of service, which limits you to twenty calls per minute.

To create a Face API service in your subscription, click on “add resources” and search for “face”, select the Face API:
face-api

Then select the Face service in the list:
face-api

After this selection, give a name to your service, select the resources group and make a choice about the level of service that you’re expecting:

face-api-pricing

Now, we'll go back to Power BI and create a new query, or to be exact a new function in Power Query. This function will call the Face API, passing the url of the picture as a parameter. The return value will be a text with the possible values: male, female or unknown. I can already create two parameters related to the Face API:

  • The base url, dependent on where your service is hosted. I named this parameter FaceApiUrl. This information is located in the Azure portal, in the overview of your Face API service. face-api-url
  • The key of your service. I named this parameter FaceApiKey; the value is also located in the Azure portal, in the keys section (an illustrative example of both parameters is shown below). face-api-key
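
As a purely illustrative sketch (the region depends on where your service is hosted and the key is a placeholder), these two parameters could look like:

FaceApiUrl = "https://westeurope.api.cognitive.microsoft.com/face/v1.0/detect"
FaceApiKey = "<your-subscription-key>"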

Based on the documentation of the Face API, I'll have to send a POST request to the service. The request must be sent to the base url, specifying which attributes the API should compute. In this specific case I'm only interested in the gender. I can build a record for the query parameters:

query = [#"returnFaceAttributes"= "gender"]

As previously explained by Chris Webb (blog post) or Erik Svensen (blog post), to submit a POST request you'll need two tricks:

  1. You must specify a content. This action will switch the execution context of Web.Contents from a GET request to a POST request
  2. The text of the content must be transformed to a binary

To create the content, I'll have to refer to the documentation of the Face API: I need to create a JSON document with just one attribute, the url of the picture.

content = "{ ""url"": """ & url & """}"

To submit a valid request, I'll also have to provide the correct key (credential) to the Face API and specify that my content-type is application/json. These two pieces of information must be specified in the headers of my request. The Headers field is also expecting a record, so I'll submit the following construction.

headers =
   [#"Content-Type" = "application/json",
   #"Ocp-Apim-Subscription-Key" = FaceApiKey]

Now that the different components of my request are built, I just need to invoke the request with the help of the function Web.Contents().

response = Web.Contents(
   FaceApiUrl,
   [
      Query = query,
      Headers = headers,
      Content=Text.ToBinary(content)
   ]
)

Once we get the response, we just need to parse it to extract the gender.

gender = Json.Document(response,65001){0}[faceAttributes][gender]

The complete method should handle some unexpected errors, such as a missing picture or a picture whose quality is not high enough to detect the gender.

let
   GetGender = (url as text) =>
let
   query = [#"returnFaceAttributes"= "gender"],
   headers = [#"Content-Type" = "application/json", #"Ocp-Apim-Subscription-Key" = FaceApiKey],
   content = "{ ""url"": """ & url & """}" ,
   source = Web.Contents(FaceApiUrl, [Query = query, Headers = headers, Content = Text.ToBinary(content)]),
   gender = Json.Document(source,65001){0}[faceAttributes][gender]
in
   try gender otherwise "unknown"
in
   GetGender

In the end, I'm able to create this visual and see that the percentage of women among the Data Platform MVPs is still really low.

face-api-gender

Table.Distinct and case pitfalls with Power Query/M language

In a previous post, I already blogged about case pitfalls in the Power Query/M language. By default, comparisons are case-sensitive, and that can be an issue when sorting lists or tables, but also when using the List.Distinct and Table.Distinct functions.

To explain the issue, I had the following table:

distinct-case

Note that at the top of the table, the value ADO.NET appears twice, once with the "Net" part in uppercase and once in lowercase. If I apply a Table.Distinct, the two values will be considered distinct and the duplicate won't be removed.

distinct-case-sensitive

Will it be an issue? If this column is your key and it sits on the one side of a one-to-many relationship in your model, then yes, it will be.

duplicate-error-model

The question on the table is: how do we manage this case? The MSDN documentation informs us that a second parameter can be defined for Table.Distinct:

Removes duplicate rows from a table, ensuring that all remaining rows are distinct.

Table.Distinct(
   table as table, 
   optional equationCriteria as any
   ) as table  

But this page offers no additional info about what the equationCriteria is. You'll have more luck if you read the documentation for the function List.Distinct. There, you have a few samples and you can understand that you can use any function, including the built-in Comparer functions. The comparers are also documented in MSDN.

In this case, the most obvious choice is to use the comparer named Comparer.OrdinalIgnoreCase, to be sure that the comparison of the different items won't be impacted by differences in case.
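
As a minimal sketch, assuming the duplicated values sit in a column named Skill (a hypothetical column name), the call becomes:

noDuplicates = Table.Distinct(Source, {"Skill", Comparer.OrdinalIgnoreCase})

The second argument pairs the column to compare with the comparer to use, so the two spellings of ADO.NET are now treated as the same value and only one row is kept.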

distinct-case-insensitive.

Handling Html entities when parsing web pages with Power Query/M Language

In a previous blog post, I explained how to handle html entities when they are number-based, such as &#192; or &#9827;. This post covers the other possibility to define an html entity, by name: &Agrave; or &clubs;.

Before going further, note that entities defined by their name are normally case-sensitive: &Agrave; results in "À" while &agrave; results in "à". Our code will also support mistakes in casing, but only if the mistake doesn't result in an ambiguity. Based on that, &AGRAVE; won't be allowed but &CLUBS; will be.

The code is relatively straightforward: create a list of all the known html entities and their corresponding values.

let
   list =
   {
       [Name="Agrave", Number=192]
       , [Name="Aacute", Number=193]
       , [Name="Acirc", Number=194]
       , [Name="Atilde", Number=195]
...

Then parse the entity's name and look for the corresponding record in the list. The search must first be executed with the specified casing; if nothing is found, use a case-insensitive approach. If there is no match, or more than one, raise an error.

        },
        token = if Text.StartsWith(entity, "&")
                   and Text.EndsWith(entity, ";")
                then substring
                else entity,
        substring = Text.Range(entity, 1, Text.Length(entity)-2),
        correctCasing = List.Select(
                 list
                 , each [Name]=token),
        anyCasing = List.Select(
                 list
                 , each Text.Upper([Name])=Text.Upper(token)),
        single =  try List.Single(correctCasing)[Number]
                  otherwise List.Single(anyCasing)[Number],
        handleError = Error.Record(
                       "Invalid parameter"
                       , "Unable to translate the html entity &" & token & ";"
                       , if List.Count(anyCasing)=0 then
                            "No match with the list of known entities."
                         else
                            "Multiple matches with the list of known entities. Check your the casing of your entity's name."),
        char = try Character.FromNumber(single)
               otherwise handleError
    in
        char;

This function is available as a gist at https://gist.github.com/Seddryck/36d24e05c233cccf90e6a88a1af39c68
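
Assuming the gist function is bound to a name such as ConvertHtmlEntityByName (a hypothetical name, purely for illustration), it would behave as described above:

ConvertHtmlEntityByName("&Agrave;")  // "À"
ConvertHtmlEntityByName("&agrave;")  // "à"
ConvertHtmlEntityByName("&CLUBS;")   // "♣" (wrong casing but unambiguous, so the case-insensitive fallback applies)
ConvertHtmlEntityByName("&AGRAVE;")  // error: ambiguous between &Agrave; and &agrave;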

Passing a function to another function to improve maintainability in Power Query/M Language

Developing is relatively easy but maintaining is harder. Code that is easy to maintain must strike a good balance between expressiveness and compactness. In the Power Query/M language, passing a function to another function as a parameter helps keep your code compact and lets you reuse large blocks of code where the differences are tiny. Have you ever coded the same function twice, once using Text.Upper and once using Text.Lower? Then this blog post could interest you. Of course, abusing this feature could result in obfuscated code, so use it cautiously.

To illustrate this, I'll use the same use-case as in this blog post, where I was trying to parse the results of my daughters at their gymnastics championship. One of the (many) difficulties in this parsing was to determine the first name, last name and team. These three pieces of information were on the same line, separated by a space. The most common pattern looks like:

CHARLIER Clémence A.R. GymBraine
CHARLIER Alice A.R. GymBraine

The pattern could have been

Lastname FirstName Team

But you'll quickly find some exceptions and difficulties. First notable pitfall: you can't determine how many words compose the team's name. It could be two (A.R. GymBraine), three (ABC Gym Dinant) or four (Royale Garde Saint Jean), optionally with hyphens, quotes, dots and parentheses. Nothing reliable; you should take the hypothesis that everything after the first name is the team's name.

Some people have a compound last name such as VAN DER PERRE. We can't really rely on the hypothesis that the last name is just the first word. The pattern is rather that the last name is fully capitalized.

For the first names, we have the same kind of issue: the first names Anne-laure and Lilly-shiva are compound but easily identifiable because there is no space between their parts. Unfortunately some first names don't fit this assumption: I found a Marie antoinette and an Adina - ioana. Fortunately, you'll notice that in both cases only the first letter is uppercase, all remaining letters are lowercase, and the team's name always starts with an uppercase letter!

From that point the pattern is the following:

LAST NAME First name Team's name

The delimiter condition for the last name is that the next two characters are an uppercase followed by a lowercase (Fi in the example above). For the first name, the delimiter condition is a space followed by an uppercase (space + T in the example above).

Cooooool … we “just” have to code that.

First step, let's create a helper function to check if a character is uppercase or lowercase. A naïve implementation could be:

isUppercase = (c as text) => c = Text.Upper(c)

But this code will return true for a space or a hyphen. That's not what we need: we need to strictly limit ourselves to letters. In a previous blog post, I already explained how we can use Character.FromNumber and Character.ToNumber to work with the ASCII table and check whether a character is a digit or not. We can apply the same approach for uppercase and lowercase.

isUppercase = (c as text) => 
  Character.ToNumber(c) >= Character.ToNumber("A") 
      and Character.ToNumber(c) <= Character.ToNumber("Z") 

I could do the same for lowercase, replacing "A" by "a" and "Z" by "z", and create a new function. Sure, but this is probably a good place to use, for the first time, the pattern of passing a function to another function. I'll rewrite the previous function as:

isCase = (c as text, case as function) => 
  Character.ToNumber(c) >= Character.ToNumber(case("A")) 
      and Character.ToNumber(c) <= Character.ToNumber(case("Z")) 

Then, I could create two additional helpers:

isUpperCase = (c as text) => isCase(c, (x)=>Text.Upper(x)),
isLowerCase = (c as text) => isCase(c, (x)=>Text.Lower(x)),

Did you notice how the function Text.Upper was passed as a parameter?
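
Note that, since functions are first-class values in M, the lambda wrappers aren't strictly needed; an equivalent and slightly shorter version passes the library functions directly:

isUpperCase = (c as text) => isCase(c, Text.Upper),
isLowerCase = (c as text) => isCase(c, Text.Lower),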

Now that we can test the casing of a character, we need to tackle another challenge: walk through the pairs of consecutive characters in a string and find the first occurrence matching the corresponding delimiter condition.

Let's start by building a list of the pairs of consecutive characters. To do that we'll use List.Generate and its not-so-well-known fourth parameter, the "selector". This parameter lets you define a function that generates the member of the list instead of just using the iterator.

let
   PositionOf = (message as text, condition as function) as text => 
    let
       list = List.Generate(
            ()=>0,
            each _ < Text.Length(message)-1,
            each _+1, 
            each [
                i = _, 
                x = Text.Range(message, _, 1), 
                y = Text.Range(message, _+1, 1)]
        ),

This function runs from the first character of the string to the penultimate one and composes, for each position, a record with the position of the pair in the string, the first character of the pair and the second one.

Then, we'll need to select the first record matching a condition to get the position. The condition changes depending on whether we're looking for the first name or the last name … so let it be a function!

       first = List.First(
                 List.Select(list, each condition(_))
               )[i], 
        word = Text.Start(message, first)
    in
        word
in
    PositionOf

The conditions for the first name and last name will be expressed as:

conditionLastName = (r as record) => 
    isUpperCase(r[x]) and isLowerCase(r[y]),
conditionFirstName = (r as record) => 
    r[x]=" " and isUpperCase(r[y]),

To test the previous code, we can implement the following test routine:

    list = {
              "ALPHA Beta L'équipe", 
              "ALPHA GOGO Beta Team's name", 
              "ALPHA GOGO Beta - gamma My 'team'"},
    temp = List.Transform(
              list,
              each [ 
                 full = _,
                 lastName = 
                    Text.Trim(PositionOf(_, conditionLastName)),
                 firstName = 
                    PositionOf(Text.Range(_, Text.Length(lastName)+1),
                               conditionFirstName),
                 team = Text.Range(_, 
                           Text.Length(firstName) +
                           Text.Length(lastName) + 1)
           ]),
    result = Table.ExpandRecordColumn(
                Table.FromList(temp, 
                    Splitter.SplitByNothing()),
                "Column1", 
                {"full","lastName","firstName","team"})
in
    result

And we’ll get the following result:

function-end-result

Writing a first data connector for Power BI

Since mid-2017, it's possible to write your own custom data connectors. This feature offers the opportunity to encapsulate the more difficult or technical code into a data connector and distribute it to business analysts. The key benefit is to let them concentrate on what they are supposed to do: business analysis … and not spend hours putting plumbing code in place to get the data from your data source.

Currently, one of the supported use-cases for data connectors is to brand a supported data source (ODBC) or enable DirectQuery for a given ODBC data source. We won't focus on that use-case in the next articles but instead on the idea of creating a user-friendly experience over a REST API.

Before going further, you must download the Power Query SDK. This package is available as an extension of Visual Studio. It brings several benefits, such as a powerful editor for the Power Query/M language with autocompletion and IntelliSense, and limited features to debug your code (but at this moment nothing to monitor performance, and no breakpoints). This SDK can be frustrating in some situations, but it's still a version 1: when something sounds clumsy in Visual Studio, just try it in Power BI to confirm the correct behavior of your code. There is room for improvement but it's already useful.

In this first article, I'll create a basic data connector that just returns a list of numbers from 1 to 50. To achieve that, I'll create a new project in Visual Studio: in the templates under the "Power Query" tab, I'll select "Data Connector Project" and click on "OK".

vs-select-project-power-query

Visual Studio will create a boilerplate project with a file having the extension .pq, another with .query.pq, but also a list of image files and a resources file. The file with the extension ".pq" will contain your code for the connector (at this moment it's not possible to split your code into several files). The file ".query.pq" will contain the query that you want to run in debug mode (it can reference the functions defined in your ".pq" file), as illustrated below.
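
For instance, a minimal debug query in the ".query.pq" file could simply invoke the exposed function (a sketch, assuming the default project name PQExtension1):

// Test query executed when debugging the connector from Visual Studio
let
    result = PQExtension1.Contents()
in
    result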

vs-power-query-solution

If you open the ".pq" file, you'll find a "hello world" function named PQExtension1.Contents.

vs-power-query-default

You can replace the body of this function with the code to create a series from 1 to 50:

source = {1..50}
vs-power-query-code

A few important remarks about this function. If you're used to developing custom functions with the Power Query/M language, you'll notice that I explicitly set the empty parentheses to declare that this function expects no parameters, something that you can usually skip when writing functions in Power BI. In fact, these empty parentheses have quite a few consequences, but I won't deep-dive into the Power Query/M language at this moment; just remember that the functions you expose through a data connector require these parentheses (empty, or with some parameters if you need them).

Second key remark: the semicolon (";") at the end of my function. This is the case for all functions that you'll define in the ".pq" files.
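
Putting these remarks together, a minimal ".pq" file for this connector could look like the following sketch (the DataSource.Kind and Publish record definitions generated by the template are not shown here):

// PQExtension1.pq -- sketch of the connector's section document
section PQExtension1;

// "shared" exposes the function outside the section, the empty parentheses
// declare a parameterless function and the trailing semicolon closes the member.
[DataSource.Kind="PQExtension1", Publish="PQExtension1.Publish"]
shared PQExtension1.Contents = () =>
    let
        source = {1..50}
    in
        source;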

If you click on the start button or hit F5, Visual Studio will start your Data connector in debug mode. This first start will just give you the opportunity to specify some credentials. Credentials are managed by DataSource.Kind. Just set the Credential Type to anonymous for this sample and click on “Set Credential”.

vs-power-query-credentials

This screen should only display on your first debugging session, as long as you don't change your DataSource.Kind. Once the credentials have been specified, click again on the start button to see the result of your data connector within Visual Studio:

vs-power-query-output

Now that the connector is working within Visual Studio, you can deploy it to Power BI. To achieve that task, you just need to copy the result of the build executed by Visual Studio (a .mez file) and paste it into the folder %userprofile%\documents\Microsoft Power BI Desktop\Custom Connectors\.
The following bat file just executes this action and starts Power BI:

CD /d %~dp0
SET destination="%userprofile%\documents\Microsoft Power BI Desktop\Custom Connectors\"
XCOPY "PQExtension1\bin\debug\*.mez" %destination% /Y /F
"C:\Program Files\Microsoft Power BI Desktop\bin\PBIDesktop.exe"
PAUSE

Once Power BI has started, click on "Get Data" and select "Other"; you'll find your data connector in the list. Select it and click on OK.

vs-power-query-get-data

It will open the “Power Query editor” and display the result. From this point you can continue to work on your data as usual and add steps if needed.

vs-power-query-get-data-pq-editor

You can also open the “Advanced editor” and check the code that has been generated when you selected the data connector.

vs-power-query-advanced-editor

The code is just calling the function that you created in your connector, hiding all the complex code from the business analyst. Great!

In the next articles of this series, I'll go further, get data from a REST API and propose a great user interface to the business analyst.

Converting datetime from UTC to Local time with Power Query M language

This article explains how to convert from UTC to Local time with the Power Query M Language. It sounds easy but it’s not so trivial.

The Power Query M language supports a datetimezone type and exposes a function DateTimeZone.ToLocal. These artefacts do the job and you can quickly create the following function, converting a UTC datetime to a local datetime:

let
  UtcToLocal = (utc as datetime) =>
    let
      utcToLocal = 
        DateTimeZone.RemoveZone(
          DateTimeZone.ToLocal(
            DateTime.AddZone(utc,0)
          )
        )
     in
       utcToLocal
in
    UtcToLocal

I’m from Belgium (UTC+1 during winter time and UTC+2 during summer time) and the following conversions are just fine:

UtcToLocal(#datetime(2018, 3, 21, 17, 0, 0)) = 
  #datetime(2018, 3, 21, 18, 0, 0)   #Winter time => +1
UtcToLocal(#datetime(2018, 3, 30, 17, 0, 0)) = 
  #datetime(2018, 3, 30, 19, 0, 0)   #Summer time => +2

When I try to convert datetimes around the switch from summer time to winter time, I should get the local time 02:00:00 twice, once at 00:00:00 UTC and once at 01:00:00 UTC. That's indeed the case:

UtcToLocal(#datetime(2017, 10, 29, 0, 0, 0)) = 
  #datetime(2017, 10, 29, 2, 0, 0)   #Summer time => +2
UtcToLocal(#datetime(2017, 10, 29, 1, 0, 0)) = 
  #datetime(2017, 10, 29, 2, 0, 0)   #Winter time => +1

Everything is fine … except if I share my code with someone from another time zone. The function DateTimeZone.ToLocal relies on the regional settings of the machine running the query, whereas in my case the conversion should always be from UTC to "Brussels time".

I didn’t find any other way to ensure that I’m always converting from UTC to “Brussels time” than implementing the conversion by myself. That’s the goal of the following function:

UtcToLocalTime = (date as date, time as time) =>

In Europe, the two pivotal daylight saving dates are the last Sundays of March and October. Translated into the M language, it means the first day of the week (assuming the week starts on a Sunday) containing the last day of the month (which is always the 31st).

lastSundayOfOctober = 
  Date.StartOfWeek(#date(Date.Year(date), 10, 31), Day.Sunday),
lastSundayOfMarch = 
  Date.StartOfWeek(#date(Date.Year(date), 3, 31), Day.Sunday),

Based on the two pivotal dates, I can check if I’m in winter time or summer time

isSummerTime = 
   (date > lastSundayOfMarch and date < lastSundayOfOctober) 
   or (date = lastSundayOfOctober and time < #time(1,0,0)),

And based on that adjust the time zone

timeZone = 1 + Number.From(isSummerTime),
localTime = 
  DateTime.From(date) 
  + #duration(0,Time.Hour(time),Time.Minute(time),Time.Second(time))  
  + #duration(0, timeZone, 0, 0),

I can also check if the specified date and time correspond to the doubled hour of the last Sunday of October and, in that case, append a symbol to the textual representation of the date:

isWinterDouble = 
   (date = lastSundayOfOctober) and time = #time(1,0,0),
localString = 
   DateTime.ToText(localTime, "yyyy-MM-dd HH:mm:ss") 
   & Text.Repeat("*", Number.From(isWinterDouble)),

The final result is a record with the local date, time and textual representation:

    record = 
        [
            LocalDate = Date.From(localTime), 
            LocalTime = Time.From(localTime), 
            LocalDateTime = localTime, 
            LocalString = localString
        ]
in
    record;
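
As an illustration, invoking the function around the October switch (values worked out from the snippets above, so consider this a sketch rather than a verified output) would give:

UtcToLocalTime(#date(2017, 10, 29), #time(0, 0, 0))[LocalString]
// "2017-10-29 02:00:00"   (still summer time, +2)
UtcToLocalTime(#date(2017, 10, 29), #time(1, 0, 0))[LocalString]
// "2017-10-29 02:00:00*"  (winter time, +1, flagged as the doubled hour)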

This function returns the expected result:

UtcToLocal

Improving the performance when merging two tables with the Power Query M language

In a previous post, I explained how to parse a semi-structured flat file with the help of a range-join. When you just have a few thousand lines in your semi-structured flat files, this works really well, but when your files get bigger and bigger, this strategy starts to collapse and becomes really slow. Fortunately, Chris Webb explained in a recent blog post how adding a primary key to a table can help improve the performance of a merge. I wanted to test that on this specific case.

I started from my previously created table categories, which was in this state:

semi-structured-categories-range

I could have created a primary key on this table with any column, but I seriously doubt it would result in any performance improvement: the main reason is the use of a custom range-join and not a merge provided by the M language. To work around that, I decided to change the structure of my table and use a classical merge. Instead of having one row per category with the bounds of its range as fields, I want one row for each value between the bounds. My table will be larger but I'll be able to use a traditional merge. The final result is the following:

semi-structured-expanded-categories

To achieve that result, I created a function to generate a list of numbers contained between the lower and the upper bound given by the two fields Start Range and End Range. I added this list as a new column of my existing table.

#"RowList" = (start, end) => 
    List.Generate(() => start, each _<=end, each _+1),
#"List" = 
    Table.AddColumn(
       #"Parsed Categories", 
       "List", 
       each RowList(_[Start Range], _[End Range])
    ),

Then, I expanded the list into new rows:

#"Expanded List" = Table.ExpandListColumn(List, "List"),

And to be sure that a primary key was added to my table, I added a step to remove the duplicates based on my new "List" column.

#"Removed Duplicates" = Table.Distinct(#"Expanded List", {"List"}),

Then, on my main table participants, I replaced the range-join by a native merge (Merge queries in the UI or Table.NestedJoin in the code) using this new column.

#"Merged Queries" = 
     Table.NestedJoin(
         #"Removed Duplicates",
         {"Index"},
         categories,
         {"List"},
         "categories",
         JoinKind.LeftOuter
     ),

And you know what? I'm happy … performance has improved from a laborious "more than 3 minutes" to a wonderful "less than 15 seconds".

droopy-happy

Parsing semi-structured text files with the Power Query M language

My daughters participated in the provincial gymnastics championships this winter, and both qualified for the regional championships. Between the two events, I analyzed the results of the different provinces with Power BI to guess what they could expect at the regional level.

The first unexpected issue was the format of the results: PDF files. I tried several tricks, such as opening them in Word or other processors and converting them to text files, but finally the best and easiest solution was to directly copy/paste them into text files before loading them in Power BI.

Once loaded in Power BI, I had a huge table with the following pattern

##Category 1
Participant 1.1
Participant 1.2
Participant 1.3
##Category 2
Participant 2.1
Participant 2.2
Participant 2.3
Participant 2.4
Participant 2.5

I needed to transform that semi-structured file to the following structured table with two columns:

Participant 1.1     Category 1
Participant 1.2     Category 1
Participant 1.3     Category 1
Participant 2.1     Category 2
Participant 2.2     Category 2
Participant 2.3     Category 2
Participant 2.4     Category 2
Participant 2.5     Category 2

To achieve that transformation, I added an index to my table of flat results. Then, creating a reference query, I isolated the categories by filtering out all the rows not starting with ##. The result was similar to this table:

semi-structured-find-categories

My goal was to determine the range of lines corresponding to each category. Based on the line index of a participant, I must be able to determine her category. The index of the categories will tell me the first and last lines corresponding to a given category. I already had the starting line index of each category but not the corresponding last line index. This value is simply the start index of the next category minus one. The following code adds this information and handles the last category with a try ... otherwise ... using the last line of my results file.

#"Filtered Rows" = 
   Table.SelectRows(Source, 
      each Text.StartsWith([Column1], "##")),
#"Renamed Columns" = 
   Table.RenameColumns(#"Filtered Rows",{{"Index", "Start range"}}),
#"Added Index" = 
    Table.AddIndexColumn(#"Renamed Columns", "Local Index", 0, 1),
#"Added Custom" = 
    Table.AddColumn(#"Added Index", "End Range", 
       each 
           try #"Renamed Columns"{[Local Index]+1}[Start range]-1 
           otherwise Table.Max(Source, "Index")[Index]),
#"Removed Columns" = 
    Table.RemoveColumns(#"Added Custom",{"Local Index"})
semi-structured-categories-range

Now, in another query, I'll filter out all the categories and keep the participants and the line index. Based on this index, I'll do a range lookup on the other table, containing the categories, and add the returned information as a new column to my table of participants (range join).

The first step is to create a function to handle a range join.

let
    JoinRange = (value as any, array as table) as any =>
    let
        matchingRows = 
             Table.SelectRows(
                 array, 
                 each value>=[Start Range] and value<=[End Range]
             ),
        returnValue = matchingRows{0}[Text]
    in
        returnValue
in
    JoinRange

Then I needed to make use of this function in a new table "participants":

let
    Source = flat,
    #"Filtered Rows" = 
       Table.SelectRows(
          Source, 
          each not Text.StartsWith([Column1], "##")
       ),
    LookupCategory = 
        Table.AddColumn(
           #"Filtered Rows", 
           "Category", 
           each JoinRange(_[Index], categories)
        ),
    #"Removed Columns" = 
        Table.RemoveColumns(LookupCategory,{"Index"})
in
    #"Removed Columns"

and I finally get the expected result:

semi-structured-participants

Conditionally expanding table’s columns with Power Query M language

In one of my projects, I have to deal with XML files. The structure of the XML is usually flat and I don’t need to expand columns to get access to the values I’m looking for. Loading the following file


<?xml version="1.0" encoding="UTF-8"?>
<file>
  <row>
    <col1>23.25</col1>
    <col2>alpha</col2>
  </row>
  <row>
    <col1>99.12</col1>
    <col2>beta</col2>
  </row>
  <row>
    <col1>44.25</col1>
    <col2>gamma</col2>
  </row>
</file>

with this M code:

let
    Source =
       Xml.Tables(
          File.Contents("C:\Users\cedri\Projects\file.xml")
       ),
    #"Expanded Table" =
       Table.ExpandTableColumn(
           Source, "Table", {"col1", "col2"},
           {"Table.col1", "Table.col2"}
       ),
    #"Removed Columns" =
        Table.RemoveColumns(#"Expanded Table",{"Name"})
in
    #"Removed Columns"

will result in the following table

table-without-nil

But from time to time, I had some errors when loading some of the XML files. After investigation, I found that in a few rare cases, I could get a null value in some columns. The example below shows a null value for col1 in the second row.

<?xml version="1.0" encoding="UTF-8"?>
<file>
  <row>
    <col1>23.25</col1>
    <col2>alpha</col2>
  </row>
  <row>
    <col1 nil="true"/>
    <col2>beta</col2>
  </row>
  <row>
    <col1>44.25</col1>
    <col2>gamma</col2>
  </row>
</file>

resulting in this kind of table:

table-with-nil

If my next step is a conversion to a numeric value for col1, it’d result in an error:

table-with-error

I wanted to avoid this kind of issue and expand the column to get the Element:Text when some of the values were null, and just use the column as-is when all the values were filled.

To manage this I created a new function expecting a table and a list of columns to conditionally expand as parameters.

Custom.ExpandTable = (table as table, columnNames as list) =>

The core idea of this function is to get the schema, check the type of the columns and, if this type is a Table.Type or an Any.Type, expand this column.

To get the schema of an M table, I can rely on the function Table.Schema(...). It returns a table with a row for each column. Extracting the row that I'm looking for, based on the column's name, is not so difficult:

columnType =
  (columnName as text) as text =>
  Table.SingleRow(
    Table.SelectRows(
      Table.Schema(table), each [Name] = columnName
    )
  )[TypeName],

Once I have the type of my column, I still need to check if this type corresponds to a table or an any field and, in that case, expand the column:

expandColumn =
  (table as table, columnName as text) as table =>
    if columnType(columnName) = "Table.Type"
      or columnType(columnName) = "Any.Type"
    then
      Table.ExpandTableColumn(
        table, columnName, {"Element:Text"}, {columnName}
      ) else table,

Finally, I need to apply the previous function to each column name in my list. List.Accumulate is your friend for this operation:

expandedTable =
  List.Accumulate(
    columnNames,
    table,
    (state, current) => expandColumn(state, current)
  )

This function is used in my main M code to conditionally expand the column, based on whether this column was loaded as a simple type or not, avoiding errors during the next steps.

The full code is available here under:

let
    Source = Xml.Tables(File.Contents("C:\Users\cedri\Projects\file.xml")),
    #"Expanded Table" = Table.ExpandTableColumn(Source, "Table", {"col1", "col2"}, {"Table.col1", "Table.col2"}),
    #"Removed Columns" = Table.RemoveColumns(#"Expanded Table",{"Name"}),
    #"Expanded Table.col1" = ExpandTable(#"Removed Columns", {"Table.col1"}),
    #"Changed Type" = Table.TransformColumnTypes(#"Expanded Table.col1",{{"Table.col1", type number}})
in
    #"Changed Type"

let ExpandTable = (table as table, columnNames as list) =>
    let
        columnType = (columnName as text) as text => Table.SingleRow(Table.SelectRows(Table.Schema(table), each [Name] = columnName))[TypeName],
        expandColumn = (table as table, columnName as text) as table => if columnType(columnName) = "Table.Type" or columnType(columnName) = "Any.Type" then Table.ExpandTableColumn(table, columnName, {"Element:Text"}, {columnName}) else table,
        expandedTable = List.Accumulate(columnNames, table, (state, current) => expandColumn(state, current))
    in
        expandedTable
in
  ExpandTable