Power BI

Web.Contents and 404 page not found in Power BI

The M language (powering Power Query and Power BI) supports the try ... otherwise ... construct. It is handy for catching errors when processing some of the rows and handling these cases in alternative paths.

Unfortunately, this feature is not intuitive when it surrounds a call to the function Web.Contents(). You could expect that a 404: page not found response would trigger the otherwise path. But that’s not the case: an error is thrown!

To correctly manage this kind of error, you’ll need to add a second parameter to your Web.Contents call (after the url): [ManualStatusHandling={404}]. This parameter tells Web.Contents() that you don’t want to apply the default behavior of Power BI when receiving a 404 error: you’ll handle this case yourself and not stop the load of the table.

The pragmatic option is probably to continue your transformation flow regardless of the response received from the web server. Your next step will (probably) fail miserably and you’ll be able to add a try ... otherwise ... on this step to handle the error.

But, if you really want to do this cleanly, you’ll have to parse the metadata of the result of Web.Contents. The function Value.Metadata gives you this opportunity. In the metadata, you’ll find a Response.Status field where you’ll be able to check the value received: 404, 100, … Based on this value you can decide which path to follow with a simple if … then … else.

let
  Source = Web.Contents(
    "http://unexisting.com"
    , [ManualStatusHandling={404}]
  ),
  ResponseCode = Value.Metadata(Source)[Response.Status],
  NextStep = if ResponseCode=404 then ... else ...
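The same branching idea can be sketched in Python for readers less familiar with M; the function name and the fallback value here are purely illustrative, not part of Power BI:

```python
def handle_response(status_code, body):
    """Branch on the HTTP status instead of letting a 404 abort the whole load,
    mimicking what ManualStatusHandling={404} enables for Web.Contents."""
    if status_code == 404:
        return []                 # alternative path: an empty result set
    return body.splitlines()      # normal path: process the payload

print(handle_response(404, ""))       # []
print(handle_response(200, "a\nb"))   # ['a', 'b']
```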

Handling enum values in fact tables with Power BI

In a previous blog post, I wrote about the analysis of survey results and specifically how to build a hierarchy when this notion was not really explicit in the source data. Then, in a second post, I explained how I avoided creating temporary tables and used transpose and unpivot instead.

I’ll now continue my journey and explain how to build the tables to manage the results. Some of the challenges that I’ll have to face are:

  • Answers are an enum where I’m expecting values such as NONE, LARGELY, FULLY, … [Green area]
  • NOT APPLICABLE should be treated as a no-answer (which is also possible).
  • I DON’T KNOW should be considered as an answer but not taken into account when I create an aggregation (average) of the results received.
survey-excel-file
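To make these rules concrete, here is a small Python sketch; the enum labels and numeric codes are assumptions for illustration (the post only names NONE, LARGELY, FULLY and the fact that DON’T KNOW maps to 0):

```python
# Assumed mapping: DON'T KNOW gets the enum value 0, the real answers 1..5.
ENUM = {"DON'T KNOW": 0, "NONE": 1, "SLIGHTLY": 2, "PARTIALLY": 3,
        "LARGELY": 4, "FULLY": 5}

def encode(answer):
    """NOT APPLICABLE (or no value at all) is treated as a no-answer."""
    if answer is None or answer == "NOT APPLICABLE":
        return None
    return ENUM[answer]

raw = ["FULLY", "NOT APPLICABLE", "DON'T KNOW", "NONE"]
answered = [v for v in (encode(a) for a in raw) if v is not None]  # counted as answers
scored = [v for v in answered if v != 0]   # used for the average only
```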

To start, I’ll create two tables, not one. In the first, I’ll create a new dimension with the list of possible answers to the questions. I can do this manually with the “Enter Data” feature.

enter-data

For the second, I’ll restart from the previously created table Results-Staging. I’ll just merge this table with the newly (manually) created Value table.

let
    Source = #"Results-Staging",
    #"Merged Queries" = Table.NestedJoin(#"Source",{"Answer"},Value,{"Label"},"NewColumn",JoinKind.Inner),
    #"Expanded NewColumn" = Table.ExpandTableColumn(#"Merged Queries", "NewColumn", {"Value"}, {"Value"}),
    #"Removed Columns1" = Table.RemoveColumns(#"Expanded NewColumn",{"Answer"}),
    #"Filtered Rows1" = Table.SelectRows(#"Removed Columns1", each ([Value] <> null))
in
    #"Filtered Rows1"

Due to the usage of an inner join, all the questions where the answer was not one of the predefined values of the enumeration are discarded. It’s good news because it’ll remove the yellow area and the red area, but also all the NOT APPLICABLE answers!

results-final

Our model looks like:

survey-model

We still have to implement a few measures. The count of answers with a given label for a category or a respondent can easily be achieved with a straightforward DAX formula:

CountOfAnswer = COUNT(Results[Value])

We’ve previously decided that a DON’T KNOW shouldn’t influence our average of the answers. To apply this decision, we just need to filter out all the values equal to 0 (the enum value of DON’T KNOW). Then we calculate the average and subtract 1 to get a value between 0 and 4. Coooool, except that if we have no value different from 0, it will return -1 … not what we’re expecting. We’ll need to validate that the average is not blank before subtracting 1.

AverageOfResponse =
   VAR NonZeroValues =
       CALCULATE(
           AVERAGE('Value'[Value]),
           FILTER('Value', 'Value'[Value]<>0)
       )
   RETURN
       IF(ISBLANK(NonZeroValues), BLANK(), NonZeroValues-1)
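The guard against the empty case can be sketched in Python as a cross-language illustration of the DAX logic (the function name is mine):

```python
def average_of_response(values):
    """Average the enum values, ignore DON'T KNOW (0), shift to a 0-4 scale,
    and return None (the BLANK() analogue) when nothing remains to average."""
    scored = [v for v in values if v != 0]
    if not scored:
        return None   # without this guard we would wrongly return -1
    return sum(scored) / len(scored) - 1

print(average_of_response([3, 5, 0]))   # 3.0
print(average_of_response([0, 0]))      # None
```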

That’s it, now I can use the DON’T KNOW values in a count of answers received but they don’t influence my average:

results-charts

 

Merge on a filtered set of rows without creating a new table (Power BI).

In a previous blog post, I wrote about the analysis of survey results and specifically how to build a hierarchy when this notion was not really explicit in the source data.

I’ll now continue my journey and explain how to build the table/dimension “respondent”. To summarize the challenges that I’ll face:

  • Each respondent means a new column (and not a new row), and each additional piece of information about the respondent adds a new row (and not a new column) [Orange area]
  • Some information is available in the list of answers given by the respondent and not in the information at the top [Yellow area]
  • For maintenance reasons, I’ll try to avoid creating temporary tables that would only be used as a subset of another table
survey-excel-file

The first step is really easy: we’ll just keep the first rows by removing all the rows where column “A” has a value. When done, we can remove this column, which now only contains null values.

let
   Source = Source,
   #"Filtered Rows" =
      Table.SelectRows(
          Source,
          each ([Column1] = null)
      ),
   #"Removed Columns" =
      Table.RemoveColumns(
           #"Filtered Rows",
           {"Column1"}
      ),

The next step is to transform each row into a column and vice versa. Usually, people with a SQL background try to apply a pivot or unpivot to achieve this task. There is an easier way with the M language … just use the function Transpose.

#"Transposed Table" = Table.Transpose(#"Removed Columns"),

You’ll obtain the following result:

respondent-transpose
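For readers who want to see the mechanics, a transpose is trivial to emulate in Python with zip (the sample data is made up):

```python
rows = [
    ["Name", "Alice", "Bob"],   # one row per piece of information,
    ["Age",  "34",    "29"],    # one column per respondent
]
# zip(*rows) pairs the elements column by column, which is exactly a transpose.
transposed = [list(column) for column in zip(*rows)]
print(transposed)   # [['Name', 'Age'], ['Alice', '34'], ['Bob', '29']]
```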

It’s now time to promote the first row as header.

#"Promoted Headers" = Table.PromoteHeaders(#"Transposed Table"),

Now that we already have a bit of information about our respondents, we’ll need to add the block of yellow information to this table.

To achieve this, I’ll first create a new table named Results-Staging containing the gray area of my Excel sheet. Let’s start by removing column “B”, promoting the first row as header and filtering out the rows of the orange area (where column “A” is null).

let
    Source = Source,
    #"Removed Columns" = Table.RemoveColumns(Source,{"Column2"}),
    #"Promoted Headers" = Table.PromoteHeaders(#"Removed Columns"),
    #"Filtered Rows" = Table.SelectRows(
        #"Promoted Headers",
        each ([Column1] <> null)
    ),
results-gray

Then, I’ll change this format to have one column with the QuestionId, another with the Respondent and finally a third with the response. To transform the current data table into this format, I’ll need to unpivot all columns except the first one (the question id); the unpivoted columns become the Attribute and Value columns.

    #"Unpivoted Other Columns" =
        Table.UnpivotOtherColumns(
            #"Filtered Rows",
            {"Column1"},
            "Attribute",
            "Value"
        ),
    #"Renamed Columns" =
         Table.RenameColumns(
              #"Unpivoted Other Columns",
              {
                    {"Column1", "QuestionId"},
                    {"Attribute", "Interviewee"},
                    {"Value", "Answer"}
               }
          )
in
    #"Renamed Columns"
results-unpivot
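The unpivot operation itself is easy to picture in Python (a made-up sample with dict-based rows, mirroring what Table.UnpivotOtherColumns does):

```python
def unpivot(table, key_column):
    """Turn every non-key column into its own row: keep one column fixed
    and melt the rest into (key, column-name, value) triples."""
    melted = []
    for row in table:
        for column, value in row.items():
            if column != key_column:
                melted.append({"QuestionId": row[key_column],
                               "Interviewee": column,
                               "Answer": value})
    return melted

rows = [{"Q": "111", "Alice": "34", "Bob": "29"}]
result = unpivot(rows, "Q")
```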

In this table, only the results with a QuestionId equal to 111 really interest me for a merge with the existing table Respondent. If you’re familiar with the UI of Power BI Desktop, then you’ll probably think to create a new table referencing this one, then filter on QuestionId equal to 111 and finally merge. It’ll work, but applying this strategy could result in many “temporary” tables. A lot of these small tables, used only for a few steps before merging with other tables, tend to be a nightmare for maintenance. You can use the “Advanced formula editor” to avoid displaying this kind of temporary table and embed them in your main table.

In the code of the Respondent table, add a step to filter the content of the table Results-Staging.

#"Age table" =
   Table.SelectRows(
      #"Results-Staging",
      each [QuestionId] = 111
   ),

If you take a look at the code, the function #"Results-Staging" doesn’t exist in the code associated with the table Respondent. In fact, it’s the end result of the function/table Results-Staging!

Now that we have our filtered set of rows corresponding to the ages of the respondents, I can join it with the content of the table Respondent.

#"Merged Queries" =
   Table.NestedJoin(
       #"Promoted Headers",
       {"Name"},
       #"Age table",
       {"Interviewee"},
       "NewColumn",
       JoinKind.LeftOuter
   ),

With a bit more expanding and renaming, I can finally get my table Respondent.

respondent-final

In the next blog post, I’ll show how to manage the table of the results.

 

Creating a hierarchy when the source has the levels flattened in a unique column (Power BI)

In a few blog posts, I’ll explain how to create a model for an analysis of answers to a survey. This post is based on a real-life case that I’m fighting right now, but I completely changed the type of questions and the answers.

My first challenge is to create a clean dimension with the hierarchy of questions, based on an input file where this notion of hierarchy was not explicit and everything was flattened into two columns.

Before jumping into the action, I’ll describe the file containing my data. I received a wonderful Excel file with the results of the survey. This file is a bit weird; let’s summarize the format:

  • the first column contains a “number” corresponding to a hierarchy organizing the different questions. Each question is part of a sub-category and each sub-category is part of a category. The second column contains the question itself or the label of the (sub-)category (Gray area)
  • an additional column is created for each respondent (columns C to F)
  • the first rows are personal information, not answers to questions (Orange area)
  • some personal information (age) is also available in the list of answers to questions (Yellow area)
  • most of the questions accept predefined values, including “Not applicable” and “Don’t know” (Green area)
  • some people were not allowed to answer some questions; I have no value for them in the file (Red area)
survey-excel-file

I decided to apply a few “business rules” to analyze the results:

  • “Not applicable” should be considered as “no answer” and not taken into account. This rule must apply both when I’m aggregating the answers and when I’m counting the answers by question
  • “Don’t know” is a valid answer that I’d like to see when I’m analyzing how many answers I received, but not when I’m aggregating the answers to get a single value.

To load this Excel file in Power BI, I’ll just use standard functions and define a first table “Source” whose load to the report is disabled.

My next task will be to create a table (or dimension) with the different questions. I also want to include a hierarchy in this dimension: I should be able to browse the questions by categories and sub-categories.

Let’s create a new table named “Question” by referencing the “Source” table. Then remove all the columns other than A and B.

question-null-id

We’ll also need to remove the rows where the content of “Column1” is null (these rows are personal information, not questions/categories).

At the beginning of this post, I said that the first column was a number … to be precise, it’s not a whole number (as I expected) but a decimal number! The root cause is rounding errors in Excel … apparently the original file was not configured correctly and the columns were not “text”:

question-rounding

To solve this issue, I’ll just apply the rounding myself with 0 decimals … not an “up” or “down” rounding but just a normal rounding. Then I’ll change the type of this column to “text” because I won’t use this column for aggregations.

#"Rounded Off" =
   Table.TransformColumns(
      #"Renamed Columns",
      {{
         "QuestionId",
         each Number.Round(_, 0), type number
      }}
    ),
#"Changed Type" =
    Table.TransformColumnTypes(
       #"Rounded Off",
      {{"QuestionId", type text}}
    ),

Based on the column “QuestionId”, I can determine if a row is a question, a sub-category or a category just by looking at the length of this text: 1 digit means a category, 2 a sub-category and 3 a question … let’s add a column with this information.

#"Added level" = Table.AddColumn(
   #"Changed Type",
   "Level",
   each
      if
         Text.Length([QuestionId])=1
      then "category"
      else if
         Text.Length([QuestionId])=2
      then "sub-category"
      else "question"),

Then I’ll add two additional columns with the ids of the sub-category and the category. To obtain these values, I just need to extract the first two digits for a sub-category and the very first digit for a category.

#"Added CategoryId" =
   Table.AddColumn(
      #"Added level",
      "CategoryId",
      each Text.Start([QuestionId],1)
   ),
#"Added SubCategoryId" =
   Table.AddColumn(
      #"Added CategoryId",
      "SubCategoryId",
      each Text.Start([QuestionId],2)
   ),
question-level
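The length-based classification and the prefix extraction can be condensed into a few lines of Python (the function name is mine, for illustration only):

```python
def describe(question_id):
    """1 digit = category, 2 = sub-category, 3 = question;
    the parent ids are simply prefixes of the flattened id."""
    level = {1: "category", 2: "sub-category"}.get(len(question_id), "question")
    return {"Level": level,
            "CategoryId": question_id[:1],     # first digit
            "SubCategoryId": question_id[:2]}  # first two digits

print(describe("1")["Level"])    # category
print(describe("12")["Level"])   # sub-category
print(describe("123"))           # {'Level': 'question', 'CategoryId': '1', 'SubCategoryId': '12'}
```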

Once I’ve calculated the id of the sub-category, I need to replace it with its label. To achieve this, I’ll just merge this table with itself, matching the SubCategoryId to the QuestionId:

question-merge
#"Merged Queries" =
    Table.NestedJoin(
        #"Added SubCategoryId",
        {"SubCategoryId"},
        #"Added SubCategoryId",
        {"QuestionId"},
        "NewColumn",JoinKind.LeftOuter),

Expand the table and select the column Question, which you should rename to sub-category. Apply the same tactic for the addition of the label of the category and you should have the table below.

question-categories.jpg

This table has too many rows because we still have rows for the categories and sub-categories. We can just filter them out using the column Level.

#"Filtered Rows1" =
     Table.SelectRows(
          #"Renamed Columns2",
          each ([Level] = "question")
     ),

The column Level is now pointless (one unique value equal to “question”) and we can safely remove it (along with the columns for the ids of category and sub-category) to have a clean dimension table:

question-final

In the next blog post, I’ll explain how to build my “interviewee” dimension.

Parsing a string with Power BI (M language)

A few days ago, I read an interesting blog post by Lukas Lötters [b|t] where he explains how to use regular expressions in Power BI (M language) by running an R script. It’s a really smart way to solve the issue. But I wanted to try to solve this problem without delegating the parsing of the string to R (just for the challenge!).

In a nutshell, the challenge is to find a substring in a string. This substring is identified by the pattern \((19|20)\d{2}. If you’re not familiar with regular expressions, it just means an opening parenthesis followed by “19” or “20” and two additional digits.

regex_meme.jpg

Let’s start by defining a few test-cases that the function to implement will need to pass:

  • “The Shawshank Redemption (1994)” / result: 1994
  • “My Dad Says (2010) {Dog Ed Pursuit (#1.7)}” / result: 2010
  • “Years Younger (2004/I)” / result: 2004
  • “abcd(19xy) hqskj dhq (2020)” / result: 2020
  • “fdsdf (1995) sdfsdf (19)” / result: 1995
  • “hkjehr (195x) (1993)” / result: 1993
  • “hkjehr (19-9) (2005)” / result: 2005
  • “hkjehr (199.) (2000)” / result: 2000
  • “hkjehr (19+9) (1993)” / result: 1993

The first three test cases are directly retrieved from the original file to parse; the others are just additional potential cases straight out of my own brain.

To solve this challenge, I’ll create a function expecting a text.

let
    ExtractYear = (string as text) =>

My first step will be to identify all the opening parentheses in the string. Usually, we use the function Text.PositionOf to get the first occurrence of a character in a string. But if we specify the parameter Occurrence.All, this function returns a list with all the occurrences.

let
    starts = Text.PositionOf(string, "(", Occurrence.All),

The previous function returns a list of positions in the string. I’ll now extract the 4 characters following each parenthesis and replace the previous content of my list with these 4 characters; List.Transform is a good friend for this kind of task.

    blocks =
        List.Transform(
            starts
            , each Text.Range(string, _ + 1, 4)
        ),

Unfortunately, I’m not sure that I have 4 characters after the parenthesis in my string (test case 5). I need to be careful and handle the error. We can implement this with a try ... otherwise ... syntax.

    blocks =
        List.Transform(
            starts
            , each try Text.Range(string, _ + 1, 4) otherwise null
        ),

Now, for each opening parenthesis, I have the next 4 characters (or null if they don’t exist). I’ll need to filter this list of 4-character blocks to only keep those starting with “19” or “20”. To apply this filter to my list, I’ll use the function List.Select.

    blocks1920 =
        List.Select(
            blocks
            , each
                Text.StartsWith(_, "19")
                or Text.StartsWith(_, "20")
        ),

This filter will already considerably reduce the list of potential years. The only test cases where I still have several potential years are test case 6 and the next ones, e.g. hkjehr (195x) (1993). This test case has 2 remaining elements in the list: 195x and 1993.

To continue filtering out candidate “years”, I need to check that the characters in positions 3 and 4 are digits. I haven’t found a native function in M to test that a character is a digit, so I’ll implement it myself. But first, let’s extract the characters in positions 3 and 4 and pass them as parameters to a to-be-created function named isDigit. This new function will return true/false (a logical in M).

    blocks1920dd =
        List.Select(
            blocks1920
            , each
                isDigit(Text.Range(_,2,1))
                and isDigit(Text.Range(_,3,1))
        ),

To validate that a character is a digit, I decided to rely on characters and encoding. All the digits are encoded between positions 48 and 57 in the ASCII table.

ascii-chars

My function isDigit will just test that the character, transformed to a number (with the help of Character.ToNumber), is between the same conversion applied to “0” and “9”.

    isDigit = (x as text) as logical =>
        Character.ToNumber(x)>=Character.ToNumber("0")
        and Character.ToNumber(x)<=Character.ToNumber("9"),

This internal function accepts a parameter. Take a look at the syntax: the parameter is stipulated after the equal symbol and before the arrow symbol (just as for a normal function).

When all these tests have been applied to filter the candidates for the release year, I’m expecting to have just one remaining candidate! I’ll assert this with the function List.Single. This function throws an error if the list is empty or has more than one element.

    value = List.Single(blocks1920dd)

That’s it! My newly created function validates all the cases that I set at the beginning.

challenge-completed

Full code:

let
    ExtractYear = (string as text) =>
let
    starts = Text.PositionOf(string, "(", Occurrence.All),
    blocks =
        List.Transform(
            starts
            , each try Text.Range(string, _ + 1, 4) otherwise null
        ),
    blocks1920 =
        List.Select(
            blocks
            , each
                Text.StartsWith(_, "19")
                or Text.StartsWith(_, "20")
        ),
    blocks1920dd =
        List.Select(
            blocks1920
            , each
                isDigit(Text.Range(_,2,1))
                and isDigit(Text.Range(_,3,1))
        ),
    isDigit = (x as text) as logical =>
        Character.ToNumber(x)>=Character.ToNumber("0")
        and Character.ToNumber(x)<=Character.ToNumber("9"),
    value = List.Single(blocks1920dd)
in
    value
in
    ExtractYear
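For comparison, here is the same algorithm transposed to Python (my port, kept deliberately close to the M steps); it passes the test cases listed above:

```python
def extract_year(string):
    """Find the unique '(19xx' or '(20xx' fragment and return its 4 digits."""
    candidates = []
    for position, character in enumerate(string):
        if character == "(":
            # The 4 characters after the parenthesis (may be shorter near the end,
            # which replaces the try ... otherwise of the M version).
            block = string[position + 1 : position + 5]
            if (len(block) == 4
                    and block[:2] in ("19", "20")
                    and block[2:].isdigit()):
                candidates.append(block)
    [year] = candidates   # raises, like List.Single, for 0 or 2+ candidates
    return year

print(extract_year("abcd(19xy) hqskj dhq (2020)"))   # 2020
```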

Handling special characters from html files with Power BI (M language).

In a first blog post, I explained how to list all the Data Platform MVP by parsing a single web page. In a second blog post, I parsed the individual pages of each MVP to get information about her awards. The third blog post was about social media and how to connect to the GitHub API to retrieve more info about activities on GitHub. The next two blog posts ([1] and [2]) illustrated how to parse rather complex Json responses and retrieve the commits’ activity and the issues’ activity of a repository.

In this new blog post, I’ll try to solve an issue that has been there since the first blog post. When parsing the MVP website and retrieving the first and last names, I had an issue with some special characters: my own name was retrieved as “C&#233;dric Charlier” in place of “Cédric Charlier”. For some reason some special characters are encoded (but some are not). Anyway, we need to decode them.

From a technical perspective, I’ll solve this issue with a recursion but also with some error handling in Power Query/M language.

mvp-analysis-powerbi-012

The first step is to analyze the problem in front of us. The encoded special characters always have the same pattern: an ampersand (“&”) followed by a hash (“#”), a set of digits and a semicolon (“;”) to end the fragment. We’ll first need to identify this kind of pattern in the string. I’ll simplify and only search for the ampersand and the semicolon.

let
    ConvertHtmlSpecialChars = (string as text) =>
let
    start = Text.PositionOf(string, "&"),
    end = Text.PositionOf(string, ";"),

Then, I’ll extract this fragment from the string and confirm that it effectively starts with “&#”. If it’s the case, then I’ll return the number in this fragment.

    value = Text.Range(string, start, end-start+1),
    cleanValue =
        if
            Text.StartsWith(value, "&#")
        then
            Text.Start(
                Text.End(
                    value
                    , Text.Length(value)-2
                )
                ,Text.Length(value)-3
            )
        else
            null,

Now that I have this number, I can translate it to a normal character with the help of the function Character.FromNumber and replace the fragment with the new value.

    replace = Character.FromNumber(Number.FromText(cleanValue)),
    replaced = Text.Replace(string, value, replace),

Now, I need to handle the fact that I can have more than one special character in my string and I must process each of them. Once again, recursion will help me!

    continue =
        if string=replaced
        then replaced
        else ConvertHtmlSpecialChars(replaced)
in
    continue

That’s it? Not really. My function crashes easily. If I have no special characters in my string, the function Text.Range(string, start, end-start+1) will crash because -1 (start’s value) is not a valid position. And I have many other places where I could have an issue.

A little bit of M theory to understand the basics of error handling in this language: when an error is thrown by a function, this error is propagated to the caller and so on. It’s really practical because you can catch the error at a higher level and you don’t need to intercept all the potential errors.

In this case, I’ll catch the potential error between the replacement and my recursive condition.

newString = try replaced otherwise string,

The keyword try means that we’re expecting a potential issue during the evaluation of replaced. The keyword otherwise specifies that if we effectively have an error during the evaluation of replaced, then we need to use another expression in its place. In this case, the expression is just the value of string (the initial value). So, if at any moment during the substitution I have an error, I’ll use the original string.

I still need to update the recursive condition to use this new function:

continue =
        if string=newString
        then newString
        else ConvertHtmlSpecialChars(newString)
in
    continue
in
    ConvertHtmlSpecialChars
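The recursion plus the “fall back to the original string” safety net translate almost one-to-one to Python (my port, for illustration; like the M version, it only handles numeric entities):

```python
def convert_html_special_chars(string):
    """Recursively decode numeric entities such as '&#233;' into characters."""
    start = string.find("&")
    end = string.find(";", max(start, 0))
    if start == -1 or end == -1:
        return string                # nothing that looks like an entity
    fragment = string[start:end + 1]                 # e.g. '&#233;'
    if not (fragment.startswith("&#") and fragment[2:-1].isdigit()):
        return string                # play it safe, like try ... otherwise
    replaced = string.replace(fragment, chr(int(fragment[2:-1])))
    return convert_html_special_chars(replaced)     # handle the next entity

print(convert_html_special_chars("C&#233;dric Charlier"))   # Cédric Charlier
```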

Now, I’ll need to add a call to this function to the table dealing with the MVPs’ names:

#"Html Special Chars" =
        Table.TransformColumns(
            #"Renamed Columns"
            ,{{"MVP name", ConvertHtmlSpecialChars}}
        )
in
    #"Html Special Chars"

And now I can compare the before

mvp-analysis-powerbi-013

and after

mvp-analysis-powerbi-014

Do … While loop with Power BI (M language).

In a first blog post, I explained how to list all the Data Platform MVP by parsing a single web page. In a second blog post, I parsed the individual pages of each MVP to get information about her awards. The third blog post was about social media and how to connect to the GitHub API to retrieve more info about activities on GitHub. The next blog post illustrated how to parse a rather complex Json response and retrieve the commits’ activity of a repository with the help of a projection.

In this new blog post, I’ll make an additional query to GitHub to retrieve the issues’ activity of a repository. Unlike the previous calls to the GitHub API, the response will be split into different pages. It means that one call to the API won’t be enough to retrieve the whole list of issues. Indeed, issues are returned 42 by 42 and I’ll need to retrieve one, two or more pages of information to get my count of issues created during the last 52 weeks.

Let’s start by taking a look at the GitHub API to retrieve information about issues’ activity: /repos/:login/:repos/issues. But that’s a bit more complex: if you use this method as-is, it will return only the open issues, in an arbitrary order. Fortunately, we can override this behavior by applying modifiers:

/repos/:login/:repos/issues?state=all&sort=created

state identifies the state of the issues to be returned; I’m enforcing to receive all the issues independently of their current state. With the sort parameter, I’m requesting to order the items by creation date. By default the sort is descending, so I don’t need to specify it explicitly.

With this request, I’ll retrieve the last 42 issues created in a repository. If I want to retrieve the 42 previous ones, I’ll need to add an additional parameter.

/repos/:login/:repos/issues?state=all&sort=created&page=2

The parameter page lets me navigate in my list of issues (ordered by creation date).

If I want to retrieve all the issues, I need to go through all the pages. But how many pages do I need to retrieve? It depends: some repositories have no issues, others have thousands.

Many programming languages are imperative (and not functional or declarative) and have a do...while structure to perform loops that stop when a condition is met (in this case, the condition is probably that the page returns no issue). The M language is a declarative language and has no structure to handle loops. It’s not really a problem because we can overcome this limitation by using recursion.

The initial state of the recursion will be to perform a query on the GitHub API with the parameter page set to 1. Then we’ll parse the response and check if we have issues created during the last 52 weeks. If it’s the case, we’ll perform an additional request, incrementing the parameter page by 1, until we receive no issue created during the last 52 weeks.

Because the function is recursive, we need to provide an additional parameter (page).

let
    GetIssuesDuringPeriod = (
        login as text
        , repository as text
        , countOfWeeks as number
        , page as number
    ) =>

Then we’ll call the GitHub API with the correct modifiers and the parameter page.

let
    #"List of issues" =
        Json.Document(
            Web.Contents(
                "https://api.github.com/repos/"
                & login & "/" & repository
                & "/issues?state=all&sort=created&page="
                & Text.From(page)
            )
        ),

We’ll consume the result of the query by analyzing the response and counting all the issues created less than 52 weeks ago.

The first step returns the date corresponding to 52 weeks ago. That date will be compared to the value of the field created_at of each issue. If the creation date is after that date, we’ll add 1 to our count of issues. If not, we won’t change our count.

To achieve this, I’m using a projection named List.Accumulate. I’m setting the initial value (s … stands for seed) to 0 and, for each issue, adding 1 to the seed or not based on the date comparison. This function returns the seed when all items have been iterated.

#"StartDate" =
        Date.AddWeeks(
            DateTime.LocalNow()
            ,  -countOfWeeks
        ),

    #"CountOfIssues" =
        List.Accumulate(
            #"List of issues"
            , 0
            , (s, c) =>
                if
                    DateTime.FromText(c[created_at], "en-us")
                    >=#"StartDate"
                then
                    s+1
                else
                    s
        ),
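List.Accumulate works like a fold/reduce in other languages; the same counting can be written in Python with functools.reduce (the sample issue dates are fabricated for the demo):

```python
from datetime import datetime, timedelta
from functools import reduce

start_date = datetime.now() - timedelta(weeks=52)

issues = [                                   # stand-ins for the parsed JSON items
    {"created_at": datetime.now() - timedelta(weeks=2)},    # recent enough
    {"created_at": datetime.now() - timedelta(weeks=80)},   # too old
]

count = reduce(
    lambda seed, issue: seed + 1 if issue["created_at"] >= start_date else seed,
    issues,
    0,   # the seed, i.e. the second argument of List.Accumulate
)
print(count)   # 1
```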

Now, I’ll need to take a decision: do I iterate on an additional page or not? If the count returned by the previous function was 0, it means that I had no issue created during the last 52 weeks on this page. Due to the ordering enforced in my request, I know that I can stop there. If the result was greater than 0, I’ll perform an additional iteration of my recursive function to check if I have additional issues on the next page. To achieve this, I’ll call the function itself and increment the parameter page. The result of this new call will be added to the result of the current call.

#"Recursive" =
        if
            #"CountOfIssues"> 0
        then
            #"CountOfIssues"
            + GetIssuesDuringPeriod(
                login
                , repository
                , countOfWeeks
                , page+1
            )
        else
            0
in
    #"Recursive"
in
    GetIssuesDuringPeriod
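In an imperative language, the same pattern would be an ordinary loop; below is a Python sketch where a stubbed page fetcher stands in for the GitHub call (no network involved):

```python
def count_recent_issues(fetch_page_count):
    """Keep requesting pages until one contributes no recent issue,
    which is the stop condition of the recursive M function."""
    total, page = 0, 1
    while True:
        count = fetch_page_count(page)
        if count == 0:
            return total
        total += count
        page += 1

# Stub: page 1 has 42 recent issues, page 2 has 7, page 3 has none.
pages = {1: 42, 2: 7}
print(count_recent_issues(lambda n: pages.get(n, 0)))   # 49
```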

Now, I can add the result of this function to the table about GitHub repositories:

#"Added issue activity" =
        Table.AddColumn(
            #"Added commit activity"
            , "Count of issues created this year"
            , each GetIssuesDuringPeriod(
                [Login]
                , [#"Repository's name"]
                , 52
                , 1
            )
        )
in
    #"Added issue activity"

And I can create the following visualization to check the popularity of repositories by language (C#, R, JavaScript)

mvp-analysis-powerbi-011

and by MVP

mvp-analysis-powerbi-010

The next blog post of this series will be about parsing special html characters and translating them into correct Unicode values.

 

 

Projections with Power BI (M language)

In a previous blog post, I explained how to list all the Data Platform MVP by parsing a single web page. In a second blog post, I parsed the individual pages of each MVP to get information about her awards. It was then time to get information about social media for each MVP and connect to the GitHub API to retrieve more info about their activities on GitHub. That third blog post ended with a remark about the limited interest of the values available (count of public repositories, followers) and the need to retrieve more pertinent info such as the commit activity on the repositories owned by the MVPs.

From a technical perspective, I'll show how to make a projection in Power BI (M language). A projection is the process of selecting data in a different shape than the entity being queried. There are many projection-like operations (think of GroupBy or SingleOrDefault in LINQ), but I'll illustrate this with the lesser-known List.Transform.
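As a minimal sketch (with a made-up list of records, not data from the MVP site), List.Transform projects each element of a list into a new shape:

let
    Source = {
        [name = "repoA", total = 3],
        [name = "repoB", total = 5]
    },
    // keep only the field total of each record
    Totals = List.Transform(Source, each _[total])
in
    Totals // {3, 5}

The original list of records is untouched; the result is a brand new list holding only the projected values.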

Let’s try this! We first need the list of all the repositories of an MVP. The starting point is the list of social media, filtered on the website “GitHub”. Then to get a list of repositories for the login of the MVP, I’ll use the GitHub API /users/:login/repos.

let
    Source = #"Social Media",
    #"Removed Columns" =
        Table.RemoveColumns(
            Source
            ,{"Url"}
        ),
    #"Filtered to Github website" =
        Table.SelectRows(
            #"Removed Columns"
            , each ([Website] = "GitHub")
        ),
    #"Get repositories from Github" =
        Table.AddColumn(
            #"Filtered to Github website"
            , "Custom"
            , each Web.Contents(
                "https://api.github.com/users/"
                & [Login] & "/repos"
            )
        ),

This gives me a nice table with all the information about the repositories, in binary format:

mvp-analysis-powerbi-007

In contrast with the previous post, expanding the response is a bit easier. I won't use the pivot functions, just a couple of expands.

    #"Expanded Transform Binary from Github" =
        Table.ExpandListColumn(
            #"Transform JSON to Table"
            , "Transform Binary from Github"
        ),
    #"Expanded Transform Binary from Github1" =
         Table.ExpandRecordColumn(
            #"Expanded Transform Binary from Github"
            , "Transform Binary from Github"
            ,{
                "name", "owner", "html_url"
                , "description", "fork"
                , "stargazers_url", "contributors_url"
                , "homepage", "stargazers_count"
                , "watchers_count", "language"
                , "forks_count", "open_issues_count"
                , "forks", "open_issues", "watchers"
            }
                ,{"name", "owner", "html_url"
                , "description", "fork", "stargazers_url"
                , "contributors_url", "homepage"
                , "stargazers_count", "watchers_count", "language"
                , "forks_count", "open_issues_count"
                , "forks", "open_issues", "watchers"
            }
        ),
    #"Expanded owner" =
        Table.ExpandRecordColumn(
            #"Expanded Transform Binary from Github1"
            , "owner"
            , {"login"}
            , {"owner.login"}
        ),

A bit of cleaning by removing and renaming some columns:

    #"Removed Columns1" =
        Table.RemoveColumns(
            #"Expanded owner"
            ,{"Custom"}
        ),
    #"Renamed Columns" =
        Table.RenameColumns(
            #"Removed Columns1"
            ,{{"stargazers_count", "Count of stars"}
            , {"watchers_count", "Count of watchers"}
            , {"forks_count", "Count of forks"}
            , {"name", "Repository's name"}
            , {"open_issues_count", "Count of issues"}}
        ),

Filtering out the forks:

    #"Filtered Rows" =
        Table.SelectRows(
            #"Renamed Columns"
            , each ([fork] = false)
        ),

And I’ve the following result:

mvp-analysis-powerbi-008.png

That's cool but it's not finished: I still haven't got any information about commits. To achieve this, I'll need to make an additional query to the GitHub API and request the commit activity of each repository: /repos/:login/:repository/stats/commit_activity.

Let’s build a function to achieve this.

let
    GetCommitsDuringPeriod = (
        login as text
        , repository as text
        , countOfWeeks as number
    ) =>
let
    #"List of commits by week" =
        Json.Document(
            Web.Contents(
                "https://api.github.com/repos/"
                & login
                & "/"
                & repository
                & "/stats/commit_activity"
            )
        ),

This API sends a response with a rather complex structure: a list of records, one per week, each containing the commits for each weekday, the commits of the whole week (named total, equal to the sum of the commits of each weekday) and the week identification (a Unix timestamp).
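Abridged and with made-up figures, the response looks like this (one record per week, the days array running from Sunday to Saturday):

    [
      {
        "days": [0, 3, 26, 20, 39, 1, 0],
        "total": 89,
        "week": 1336280400
      },
      ...
    ]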

mvp-analysis-powerbi-009

I'm not interested in most of this information; I just need to keep the field total for the last 52 weeks. This is where I'll apply a projection, with the function List.Transform. This function lets me transform my initial list into a new one: for each element, I'll keep just the field total.

    #"List of totals" =
        List.Transform(
            List.LastN(
                #"List of commits by week"
                , countOfWeeks
            )
            , each _[total]
        ),

In this case, the transformation returns a list of numbers (one per total). Again, I'm not really interested in the detail, so I can sum all these values with the help of the function List.Sum.

    #"CountOfCommits" = List.Sum(#"List of totals")
in
    #"CountOfCommits"
in
    GetCommitsDuringPeriod

Now I can easily integrate my function into the previous table: I just need to add a column calling this function.

    #"Added commit activity" = Table.AddColumn(
            #"Filtered Rows"
            , "Count of commits this year"
            , each GetCommitsDuringPeriod(
                      [Login]
                      , [#"Repository's name"]
                      , 52
                 )
            )
in
    #"Added commit activity"

In the next blog post, I’ll add the issues’ activity for each repository. From a logical point of view, it’ll be the same process but the technical challenge will be the pagination of the results of the GitHub API.

Retrieve GitHub data with Power BI

In a previous blog post, I explained how to list all the Data Platform MVPs by parsing a single web page. In a second blog post, I parsed the individual pages of each MVP to get information about her awards. It's now time to get information about social media for each MVP. Then we'll parse this information to get the login of each MVP on GitHub, and finally use this login to connect to the GitHub API and retrieve more info.

From a technical perspective, I’ll show

  • how to query an external API to get some information
  • how to return more than one value for a function
  • how to build a function to identify some patterns
  • how to transform pairs of key/value into several columns.

Ready? Let’s start …

The first task is to grab the social media information on the profile page. In previous blog posts, we stored this individual page in a table; the starting point will be this table. Once again, we apply the strategy of the HTML pattern to identify the lines useful for our task.

let
    IndividualPageTags = #"Individual page",
    #"Added SocialMediaTable" = Table.AddColumn(IndividualPageTags, "Tags", each GetSocialMedia([IndividualPage])),
    #"Removed Columns" = Table.RemoveColumns(#"Added SocialMediaTable",{"IndividualPage"}),
    #"Expanded Tags" = Table.ExpandListColumn(#"Removed Columns", "Tags"),
    #"Extract url" = Table.AddColumn(#"Expanded Tags", "Url", each GetUrlFromTag([Tags])),
    #"Removed Columns1" = Table.RemoveColumns(#"Extract url",{"Tags"}),
    #"Filtered out null" = Table.SelectRows(#"Removed Columns1", each ([Url] <> null)),

Now that we've a list of links to the MVPs' social media, we still have to identify the type of social media (GitHub, Facebook, LinkedIn, …) and extract the login/username from the URL. We'll create a function to achieve this task, but it means that our function will not return a single field but two (the type of social media and the login). With the M language, it's possible to return a record (a record is a container with one or more fields, each field identified by a name).

To build the record, we use the function Record.AddField.

let
    GetWebsiteInfo = (url as text) =>
let
    // bind the trimmed url under a new name: "url = Text.Trim(url, ...)" would be a cyclic reference in M
    trimmedUrl = Text.Trim(url, {" ", "/"}),
    possibilities = {
        [pattern = "facebook"  , value = "Facebook"],
        [pattern = "linkedin"  , value = "LinkedIn"],
        [pattern = "github"    , value = "GitHub"],
        [pattern = "twitter"   , value = "Twitter"]
    },
    Website = List.Single(List.Select(possibilities, each Text.Contains(trimmedUrl, _[pattern])))[value],
    Login = Text.Range(trimmedUrl, Text.PositionOf(trimmedUrl, "/", Occurrence.Last)+1),
    Records = Record.AddField(Record.AddField([], "Website", Website ), "Login", Login)
in
    Records
in
    GetWebsiteInfo

We can add this function to our previous code. The call to the function will add a record to each line. This record must be expanded to reveal the values for the website and the login.

    #"Get Website and Login" = Table.AddColumn(#"Filtered out null", "Custom", each GetWebsiteInfo([Url])),
    #"Expanded Website and Login" = Table.ExpandRecordColumn(#"Get Website and Login", "Custom", {"Website", "Login"}, {"Website", "Login"})
in
    #"Expanded Website and Login"
mvp-analysis-powerbi-006.png

Cool, now that we've a list of social media accounts with information that we can parse, we can use them to contact GitHub and get more info about the GitHub accounts. First, we need to start from the social media table and filter it to keep only the lines referring to GitHub.

let
    Source = #"Social Media",
    #"Removed Columns" = Table.RemoveColumns(Source,{"Url"}),
    #"Filtered to Github website" = Table.SelectRows(#"Removed Columns", each ([Website] = "GitHub")),

GitHub exposes an interesting API that we can use to get some information about accounts, repositories, issues and much more. This API is exposed over HTTPS and returns JSON documents. We can easily interact with it with the functions Web.Contents and Json.Document. To achieve this task, we need to build a URL with the format users/:login, where :login is a variable corresponding to the login of the user.

    #"Get user info from Github" = Table.AddColumn(#"Filtered to Github website", "Custom", each Web.Contents("https://api.github.com/users/" & [Login])),
    #"Transform JSON to Table" = Table.AddColumn(#"Get user info from Github", "Transform Binary from Github", each Record.ToTable(Json.Document([Custom]))),

Before executing this code, Power BI will ask us about security settings for the GitHub API. This API is a new source, and each source must define its own security. The GitHub API can be contacted anonymously, with basic authentication, or with OAuth2. We could go the anonymous way, but we'd quickly hit the rate limit for unauthenticated calls (60 requests per hour). I selected basic authentication and registered my own GitHub credentials into Power BI. Once done, we can execute the load of this new table.

Now we've information about the GitHub accounts and we need to reformat it. We'll first expand the JSON table received from GitHub: it's a simple set of key/value pairs. For convenience, we need to exclude the key plan (this info is only available for your own account). Then it will be time to remove some columns that are now useless.

    #"Expanded Table to Name/Value pairs" = Table.ExpandTableColumn(#"Transform JSON to Table", "Transform Binary from Github", Table.ColumnNames(#"Transform JSON to Table"[#"Transform Binary from Github"]{0})),
    #"Filtered out ""Plan""" = Table.SelectRows(#"Expanded Table to Name/Value pairs", each ([Name] <> "plan")),
    #"Removed Columns1" = Table.RemoveColumns(#"Filtered out ""Plan""",{"Website", "Login", "Custom"}),
mvp-analysis-powerbi-004

Based on this content, we can pivot the data to transform the key/value pairs into columns. The pivot function expects a summarize function (usually List.Sum). Keep in mind that here the values are text, not numbers. We specify List.Max instead: we'll have exactly one value for each pivoted row anyway, and this function accepts text arguments.

    #"Pivoted Key/Value" = Table.Pivot(#"Removed Columns1", List.Distinct(#"Removed Columns1"[Name]), "Name", "Value", List.Max),
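As a minimal sketch (with a made-up key/value table, not the real GitHub response), the pivot turns each distinct key into a column:

let
    // two key/value rows for a single, hypothetical MVP
    Source = Table.FromRecords({
        [MvpId = 1, Name = "login", Value = "jdoe"],
        [MvpId = 1, Name = "followers", Value = "42"]
    }),
    // one column per distinct key; List.Max summarizes the single text value of each cell
    Pivoted = Table.Pivot(Source, List.Distinct(Source[Name]), "Name", "Value", List.Max)
in
    Pivoted // one row: MvpId = 1, login = "jdoe", followers = "42"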


mvp-analysis-powerbi-005.png

When done, we need some additional cleaning:

    #"Removed Columns2" = Table.RemoveColumns(#"Pivoted Key/Value",{"id", "gravatar_id", "url", "subscriptions_url", "organizations_url", "repos_url", "events_url", "received_events_url", "type", "site_admin", "private_gists", "total_private_repos", "owned_private_repos", "disk_usage", "collaborators"}),
    #"Renamed Columns" = Table.RenameColumns(#"Removed Columns2",{{"login", "GitHub Login"}, {"name", "GitHub fullname"}, {"public_repos", "Count of public repos"}, {"followers", "Count of followers"}})
in
    #"Renamed Columns"

Now, with this new table, we can build the following report.

mvp-analysis-powerbi-003

I've a few remarks about this report. First, it's a bit weird to read that only 35 MVPs have registered a GitHub account on their profile pages (less than 10%). Some have probably forgotten to register this information.

The count of followers and the count of public repositories are not the best indicators of someone's popularity or activity on GitHub. I'll improve this in the next blog post.

Scan web pages with Power BI

In a previous blog post, I explained how to list all the Data Platform MVPs by parsing a single web page. It was great, but I only got a little information about each MVP (name and country). It'd be great to also know since when they've been awarded, how many awards they've received, and their links to social media (in the next blog post, I'll use these links to get more info about GitHub accounts).

In the previous blog post, I had built a ProfileUrl field with the URL of the individual profile page; it's time to use it.

The first task is to grab the awards' information on the profile page: categories, first year and count of awards. But before going further, I need to take into account that I'll probably use the individual profile page of each MVP more than once (I'll also use these pages to get data about social media). Based on this, it's better to avoid duplicating the code that loads these pages (it will also save several downloads of each page).

let
   IndividualPageTags = #"MvpId and Name",
   #"Added Page" = Table.AddColumn(IndividualPageTags, "IndividualPage", each Web.Contents([ProfileUrl])),
   #"Removed Columns" = Table.RemoveColumns(#"Added Page",{"Name", "ProfileUrl", "PhotoUrl", "Index"})
in
   #"Removed Columns"

To get the information on this page, we'll use the same strategy as in the first part of this post: transform the binary into a list of lines and filter them on a pattern.

let
   IndividualPageTags = #"Individual page",
   #"Added AwardsTable" = Table.AddColumn(#"IndividualPageTags", "Awards", each GetAwards([IndividualPage])),
   #"Expanded Custom" = Table.ExpandTableColumn(#"Added AwardsTable", "Awards", {"Award", "FirstYear", "CountOfAwards"}, {"AwardsList", "FirstYear", "CountOfAwards"})
in
   #"Expanded Custom"

The filtering task is performed by the function GetAwards. This function accepts, as a parameter, the individual profile page of the MVP (type binary). It looks for the pattern div class="ly infoContent". After cleaning, we've three rows: the first contains the different categories, the second the first year, and the third the count of awards. It's possible to transform these three rows into three columns using the Table.Transpose function.

let
    AwardsInfo = (page as binary) =>
let
    Lines = Lines.FromBinary(page),
    Tags = List.Select(Lines, each Text.Contains(_, "<div class=""ly infoContent""")),
    #"Converted to Table" = Table.FromList(Tags, Splitter.SplitTextByDelimiter(">"), null, null, ExtraValues.Error),
    #"Changed Type" = Table.TransformColumnTypes(#"Converted to Table",{{"Column1", type text}, {"Column2", type text}, {"Column3", type text}}),
    #"Removed Columns" = Table.RemoveColumns(#"Changed Type",{"Column1", "Column3"}),
    #"Replaced Value" = Table.ReplaceValue(#"Removed Columns","</div","",Replacer.ReplaceText,{"Column2"}),
    #"Transposed Table" = Table.Transpose(#"Replaced Value"),
    #"Renamed Columns" = Table.RenameColumns(#"Transposed Table",{{"Column1", "Award"}, {"Column2", "FirstYear"}, {"Column3", "CountOfAwards"}})
in
    #"Renamed Columns"
in
    AwardsInfo

When we’ve retrieved the awards’ information, we can add the data about first year and count of awards to our main table.

let
    Source = #"MvpId and Name",
    #"Merged MvpId and Country" = Table.NestedJoin(Source,{"Index"},Country,{"Index"},"NewColumn",JoinKind.Inner),
    #"Expanded Country" = Table.ExpandTableColumn(#"Merged MvpId and Country", "NewColumn", {"Country"}, {"Country"}),
    #"Removed Index" = Table.RemoveColumns(#"Expanded Country",{"Index"}),
    #"Merged with Awards" = Table.NestedJoin(#"Removed Index",{"MvpId"},Awards,{"MvpId"},"NewColumn",JoinKind.LeftOuter),
    #"Expanded Awards" = Table.ExpandTableColumn(#"Merged with Awards", "NewColumn", {"FirstYear", "CountOfAwards"}, {"FirstYear", "CountOfAwards"})
in
    #"Expanded Awards"

The relation between an MVP and the awards' categories is one-to-many. We need to build a second table with all the awards' categories received by each MVP. We first need to transform the list of categories into multiple rows: we split the string on the comma separator, unpivot the data on the MVP id, and remove the now-useless column Attribute. To clean the remaining spaces around some categories, we just need to trim them.

let
    Source = Awards,
    #"Removed Columns" = Table.RemoveColumns(Source,{"IndividualPage", "FirstYear", "CountOfAwards"}),
    #"Split Column by Delimiter" = Table.SplitColumn(#"Removed Columns","AwardsList",Splitter.SplitTextByDelimiter(",", QuoteStyle.Csv),{"AwardsList.1", "AwardsList.2"}),
    #"Unpivoted Columns" = Table.UnpivotOtherColumns(#"Split Column by Delimiter", {"MvpId"}, "Attribute", "Value"),
    #"Removed Columns1" = Table.RemoveColumns(#"Unpivoted Columns",{"Attribute"}),
    #"Trimmed Text" = Table.TransformColumns(#"Removed Columns1",{{"Value", Text.Trim}})
in
    #"Trimmed Text"

It’s now time to check the relations between our main table (All info) and the new table (Awards Nomination).

mvp-analysis-powerbi-003

We can now build a new report with some information about the awards received.

mvp-analysis-powerbi-002

The next step is to retrieve the information about social media for each MVP and to use this information to retrieve additional info about their usage of GitHub … but it’s for the next blog post.