Handling enum values in fact tables with Power BI

In a previous blog post, I wrote about the analysis of survey results and specifically how to build a hierarchy when this notion was not explicit in the source data. Then, in a second post, I explained how I avoided creating temporary tables by using transpose and unpivot.

I’ll now continue my journey and explain how to build the tables to manage the results. Some of the challenges that I’ll have to face are:

  • Answers are an enum where I’m expecting values such as NONE, LARGELY, FULLY, … [Green area]
  • NOT APPLICABLE should be treated as a no-answer (which is also possible).
  • I DON’T KNOW should be considered as an answer but not taken into account when I create an aggregation (average) of the results received.
survey-excel-file

To start, I’ll create two tables and not one. In the first I’ll create a new dimension with the list of possible answers to the questions. I can do this manually with the “Enter Data” feature.

enter-data
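
If you prefer to keep the list of values visible in the code rather than hidden behind the “Enter Data” binary, a hand-written equivalent could look like the sketch below. The labels and codes are assumptions based on the enumeration described above (DON’T KNOW mapped to 0, the other answers to increasing values); adapt them to your own survey.

let
    Value = #table(
        type table [Label = text, Value = number],
        {
            {"DON'T KNOW", 0},
            {"NONE", 1},
            {"LARGELY", 4},   // intermediate labels omitted, purely illustrative
            {"FULLY", 5}
        }
    )
in
    Value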

For the second, I’ll restart from the previously created table Results-Staging. I’ll just merge this table with the newly created Value table.

let
    Source = #"Results-Staging",
    #"Merged Queries" = Table.NestedJoin(#"Source",{"Answer"},Value,{"Label"},"NewColumn",JoinKind.Inner),
    #"Expanded NewColumn" = Table.ExpandTableColumn(#"Merged Queries", "NewColumn", {"Value"}, {"Value"}),
    #"Removed Columns1" = Table.RemoveColumns(#"Expanded NewColumn",{"Answer"}),
    #"Filtered Rows1" = Table.SelectRows(#"Removed Columns1", each ([Value] <> null))
in
    #"Filtered Rows1"

Due to the usage of an inner join, all the questions where the answer was not one of the predefined values of the enumeration are discarded. That’s good news because it removes the yellow area, the red area, but also all the NOT APPLICABLE answers!

results-final

Our model looks like:

survey-model

We still have to implement a few measures. The count of answers with a given label for a category or a respondent can easily be achieved with a straightforward DAX formula:

CountOfAnswer = COUNT(Results[Value])

We’ve previously decided that a DON’T KNOW shouldn’t influence our average of the answers. To apply this decision, we just need to filter the values and remove all those equal to 0 (the enum value of DON’T KNOW). Then we calculate the average and subtract 1 to get a value between 0 and 4. Cool, except that if we have no non-zero values, it will return -1 … not what we’re expecting. We need to check that the average is not blank before subtracting 1.

AverageOfResponse =
   VAR NonZeroValues =
       CALCULATE(
           AVERAGE('Value'[Value]),
           FILTER('Value', 'Value'[Value]<>0)
       )
   RETURN
       IF(ISBLANK(NonZeroValues), BLANK(), NonZeroValues-1)

That’s it, now I can use the DON’T KNOW values in a count of answers received but they don’t influence my average:

results-charts

 

Merge on a filtered set of rows without creating a new table (Power BI).

In a previous blog post, I wrote about the analysis of survey results and specifically how to build a hierarchy when this notion was not explicit in the source data.

I’ll now continue my journey and explain how to build the table/dimension “respondent”. To summarize the challenges that I’ll face:

  • Each respondent means a new column (and not a new row), and each additional piece of information about the respondent adds a new row (and not a new column) [Orange area]
  • Some information is only available in the list of answers given by the respondent and not in the header block at the top [yellow area]
  • For maintenance reasons, I’ll try to avoid creating temporary tables that would only be used as a subset of another table
survey-excel-file

The first step is really easy: we’ll just keep the first rows by removing all the rows where column “A” has a value. When that’s done, we can remove this column, which now only contains null values.

let
   Source = Source,
   #"Filtered Rows" =
      Table.SelectRows(
          Source,
          each ([Column1] = null)
      ),
   #"Removed Columns" =
      Table.RemoveColumns(
           #"Filtered Rows",
           {"Column1"}
      ),

The next step is to transform each row into a column and vice versa. People with a SQL background usually try to apply a pivot or unpivot to achieve this task. There is an easier way with the M language … just use the function Table.Transpose.

#"Transposed Table" = Table.Transpose(#"Removed Columns"),

You’ll obtain the following result:

respondent-transpose

It’s now time to promote the first row as the header.

#"Promoted Headers" = Table.PromoteHeaders(#"Transposed Table"),

Now that we have a bit of information about our respondents, we need to add the block of yellow information to this table.

To achieve this, I’ll first create a new table named Results-Staging containing the gray area of my Excel sheet. Let’s start by removing column “B”, promoting the first row as the header, and filtering out the rows of the orange area (where column “A” is null).

let
    Source = Source,
    #"Removed Columns" = Table.RemoveColumns(Source,{"Column2"}),
    #"Promoted Headers" = Table.PromoteHeaders(#"Removed Columns"),
    #"Filtered Rows" = Table.SelectRows(
        #"Promoted Headers",
        each ([Column1] <> null)
    ),
results-gray

Then, I’ll change this format to have one column with the QuestionId, another with the interviewee and finally a third with the answer. To transform the current table into this format, I need to unpivot all the columns except the first one (the question id).

    #"Unpivoted Other Columns" =
        Table.UnpivotOtherColumns(
            #"Filtered Rows",
            {"Column1"},
            "Attribute",
            "Value"
        ),
    #"Renamed Columns" =
         Table.RenameColumns(
              #"Unpivoted Other Columns",
              {
                    {"Column1", "QuestionId"},
                    {"Attribute", "Interviewee"},
                    {"Value", "Answer"}
               }
          )
in
    #"Renamed Columns"
results-unpivot

In this table, only the results with a QuestionId equal to 111 really interest me for a merge with the existing table Respondent. If you’re familiar with the UI of Power BI Desktop, you’ll probably think of creating a new table referencing this one, filtering on QuestionId equal to 111 and finally merging. It’ll work, but applying this strategy could result in many “temporary” tables. A lot of these small tables, used only for a few steps before merging with other tables, tend to be a nightmare for maintenance. You can use the Advanced Editor to avoid exposing this kind of temporary table and embed the filtering directly in your main table.

In the code of the Respondent table, add a step to filter the content of the table Results-Staging.

#"Age table" =
   Table.SelectRows(
      #"Results-Staging",
      each [QuestionId] = 111
   ),

If you take a look at the code, the identifier #"Results-Staging" isn’t defined anywhere in the query associated with the table Respondent. In fact, it’s the end result of the query/table Results-Staging!

Now that we have our filtered set of rows corresponding to the ages of the respondents, I can join it with the content of the table Respondent.

#"Merged Queries" =
   Table.NestedJoin(
       #"Promoted Headers",
       {"Name"},
       #"Age table",
       {"Interviewee"},
       "NewColumn",
       JoinKind.LeftOuter
   ),

With a bit more expanding and renaming, I can finally get my table Respondent.

respondent-final
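
For reference, those remaining steps could look like the following sketch; the exact column names are assumptions (the nested table exposes the columns of Results-Staging, so I expand its Answer column and rename it to Age).

    #"Expanded NewColumn" =
        Table.ExpandTableColumn(
            #"Merged Queries",
            "NewColumn",
            {"Answer"},
            {"Age"}
        )
in
    #"Expanded NewColumn"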

In the next blog post, I’ll show how to manage the table of the results.

 

Creating a hierarchy when the source has the levels flattened in a unique column (Power BI)

Over a few blog posts, I’ll explain how to create a model for the analysis of answers to a survey. This series is based on a real-life case that I’m fighting with right now, but I’ve completely changed the type of questions and the answers.

My first challenge is to create a clean dimension with the hierarchy of questions, based on an input file where this notion of hierarchy was not explicit and everything was flattened into two columns.

Before jumping into the action, I’ll describe the file containing my data. I received a wonderful Excel file with the results of the survey. This file is a bit weird; let’s summarize the format:

  • the first column contains a “number” corresponding to a hierarchy organizing the different questions. Each question is part of a sub-category and each sub-category is part of a category. The second column contains the question itself or the label of the (sub-)category (Gray area)
  • an additional column is created for each respondent (columns C to F)
  • the first rows are personal information, not answers to questions (Orange area)
  • some personal information (age) is also available in the list of answers to questions (Yellow area)
  • most of the questions accept predefined values, including “Not applicable” and “Don’t know” (Green area)
  • some people were not allowed to answer some questions, so I have no value for them in the file (Red area)
survey-excel-file

I decided to apply a few “business rules” to analyze the results:

  • “Not applicable” should be considered as “no answer” and not taken into account. This rule must apply when I’m aggregating the answers as well as when I’m counting the answers by question.
  • “Don’t know” is a valid answer that I’d like to see when I’m analyzing how many answers I received, but not when I’m aggregating the answers to get a single value.

To load this Excel file in Power BI, I’ll just use standard functions and define a first table “Source” for which the load to the report is disabled.

My next task will be to create a table (or dimension) with the different questions. I also want to include a hierarchy in this dimension: I should be able to browse the questions by categories and sub-categories.

Let’s create a new table named “Question” by referencing the “Source” table. Then remove all columns other than A and B.

question-null-id

We’ll also need to remove the rows where the content of “Column1” is null (these rows are personal information, not questions or categories).
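
These first steps are not shown in the post; a minimal sketch could look like this (the step and column names are assumptions, chosen to match the steps shown below):

let
    #"Question Source" = Source,
    #"Kept Columns" =
        Table.SelectColumns(
            #"Question Source",
            {"Column1", "Column2"}
        ),
    #"Filtered Rows" =
        Table.SelectRows(
            #"Kept Columns",
            each [Column1] <> null
        ),
    #"Renamed Columns" =
        Table.RenameColumns(
            #"Filtered Rows",
            {{"Column1", "QuestionId"}, {"Column2", "Question"}}
        ),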

At the beginning of this post, I said that the first column was a number … to be precise, it’s not a whole number (as I expected) but a decimal number! The root cause is rounding errors in Excel … apparently the original file was not configured correctly and the columns were not set to “text”:

question-rounding

To fix this issue, I’ll just apply the rounding myself with 0 decimals … not an “up” or “down” rounding but a standard rounding. Then I’ll change the type of this column to “text” because I won’t use it for aggregations.

#"Rounded Off" =
   Table.TransformColumns(
      #"Renamed Columns",
      {{
         "QuestionId",
         each Number.Round(_, 0), type number
      }}
    ),
#"Changed Type" =
    Table.TransformColumnTypes(
       #"Rounded Off",
      {{"QuestionId", type text}}
    ),

Based on the column “QuestionId”, I can determine if a row is a question, a sub-category or a category just by looking at the length of this text: 1 digit means a category, 2 a sub-category and 3 a question … let’s add a column with this information.

#"Added level" = Table.AddColumn(
   #"Changed Type",
   "Level",
   each
      if
         Text.Length([QuestionId])=1
      then "category"
      else if
         Text.Length([QuestionId])=2
      then "sub-category"
      else "question"),

Then I’ll add two additional columns with the id of the sub-category and of the category. To obtain these values, I just need to extract the first two digits for a sub-category and the first digit for a category.

#"Added CategoryId" =
   Table.AddColumn(
      #"Added level",
      "CategoryId",
      each Text.Start([QuestionId],1)
   ),
#"Added SubCategoryId" =
   Table.AddColumn(
      #"Added CategoryId",
      "SubCategoryId",
      each Text.Start([QuestionId],2)
   ),
question-level

Once I’ve calculated the “Id” of the sub-category, I need to replace it with its label. To achieve this, I’ll just merge this table with itself, matching the SubCategoryId with the QuestionId:

question-merge
#"Merged Queries" =
    Table.NestedJoin(
        #"Added SubCategoryId",
        {"SubCategoryId"},
        #"Added SubCategoryId",
        {"QuestionId"},
        "NewColumn",JoinKind.LeftOuter),

Expand the table and select the column Question, which you should rename to Sub-category. Apply the same tactic to add the label of the category and you should have the table below.

question-categories.jpg
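
For completeness, the expand and the second merge (for the category label) could look like this sketch; the step and column names are illustrative, and a final renaming step (the #"Renamed Columns2" referenced below) follows:

    #"Expanded SubCategory" =
        Table.ExpandTableColumn(
            #"Merged Queries",
            "NewColumn",
            {"Question"},
            {"Sub-category"}
        ),
    #"Merged Queries1" =
        Table.NestedJoin(
            #"Expanded SubCategory",
            {"CategoryId"},
            #"Expanded SubCategory",
            {"QuestionId"},
            "NewColumn",
            JoinKind.LeftOuter
        ),
    #"Expanded Category" =
        Table.ExpandTableColumn(
            #"Merged Queries1",
            "NewColumn",
            {"Question"},
            {"Category"}
        ),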

This table has too many rows because we still have rows for the categories and sub-categories. We can just filter them out using the column level.

#"Filtered Rows1" =
     Table.SelectRows(
          #"Renamed Columns2",
          each ([Level] = "question")
     ),

The column Level is now pointless (a single value equal to “question”) and we can safely remove it (along with the id columns for the category and sub-category) to have a clean dimension table:
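
That clean-up could be as simple as the following sketch (the step name is illustrative):

    #"Removed Other Columns" =
        Table.RemoveColumns(
            #"Filtered Rows1",
            {"Level", "CategoryId", "SubCategoryId"}
        )
in
    #"Removed Other Columns"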

question-final

In the next blog post, I’ll explain how to build my “interviewee” dimension.

Parsing a string with Power BI (M language)

A few days ago, I read an interesting blog post by Lukas Lötters [b|t] where he explains how to use regular expressions in Power BI (M language) by running an R script. It’s a really smart way to solve the issue. But I wanted to try to solve this problem without delegating the parsing of the string to R (just for the challenge!).

In a nutshell, the challenge is to find a substring in a string. This substring is identified by the pattern \((19|20)\d{2}. If you’re not familiar with regular expressions, it just means an opening parenthesis followed by “19” or “20” and two additional digits.

regex_meme.jpg

Let’s start by defining a few test-cases that the function to implement will need to pass:

  • “The Shawshank Redemption (1994)” / result: 1994
  • “My Dad Says (2010) {Dog Ed Pursuit (#1.7)}” / result: 2010
  • “Years Younger (2004/I)” / result: 2004
  • “abcd(19xy) hqskj dhq (2020)” / result: 2020
  • “fdsdf (1995) sdfsdf (19)” / result: 1995
  • “hkjehr (195x) (1993)” / result: 1993
  • “hkjehr (19-9) (2005)” / result: 2005
  • “hkjehr (199.) (2000)” / result: 2000
  • “hkjehr (19+9) (1993)” / result: 1993

The first three test-cases are directly retrieved from the original file to parse; the others are just additional potential cases produced by my own brain.

To solve this challenge, I’ll create a function expecting a text.

let
    ExtractYear = (string as text) =>

My first step will be to identify all the opening parentheses in the string. Usually, we use the function Text.PositionOf to get the first occurrence of a character in a string, but if we specify the parameter Occurrence.All, this function returns a list with all the occurrences.
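
For instance (purely illustrative, not a step of the final function), applied to the fourth test-case, the returned positions are zero-based:

    Text.PositionOf("abcd(19xy) hqskj dhq (2020)", "(", Occurrence.All)
    // returns {4, 21}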

let
    starts = Text.PositionOf(string, "(", Occurrence.All),

The previous function returns a list of positions in the string. I’ll now extract the 4 characters following each parenthesis and replace the previous content of my list with these 4 characters; List.Transform is a good friend for this kind of task.

    blocks =
        List.Transform(
            starts
            , each Text.Range(string, _ + 1, 4)
        ),

Unfortunately, I’m not sure that I have 4 characters after each parenthesis in my string (test-case 5). I need to be careful and handle the error. We can implement this with a try ... otherwise ... expression.

    blocks =
        List.Transform(
            starts
            , each try Text.Range(string, _ + 1, 4) otherwise null
        ),

Now, for each opening parenthesis, I have the next 4 characters (or null if they don’t exist). I need to filter this list to only keep the fragments starting with “19” or “20”. To apply this filter to my list, I’ll use the function List.Select.

    blocks1920 =
        List.Select(
            blocks
            , each
                Text.StartsWith(_, "19")
                or Text.StartsWith(_, "20")
        ),

This filter already reduces the list of potential years considerably. The only test-cases where I still have several candidates are test-case 6 and the following ones. Example: hkjehr (195x) (1993). This test-case has 2 remaining elements in the list: 195x and 1993.

To continue filtering out candidate “years”, I need to check that the characters in positions 3 and 4 are digits. I haven’t found a native function in M to test that a character is a digit, so I’ll implement it myself. But first, let’s extract the characters in positions 3 and 4 and pass them as parameters to a to-be-created function named isDigit. This new function will return true/false (a logical in M).

    blocks1920dd =
        List.Select(
            blocks1920
            , each
                isDigit(Text.Range(_,2,1))
                and isDigit(Text.Range(_,3,1))
        ),

To validate that a character is a digit, I decided to rely on characters and encoding. All the digits are encoded between positions 48 and 57 in the ASCII table.

ascii-chars

My function isDigit will just test that the character, transformed into a number (with the help of Character.ToNumber), is between the same conversion applied to “0” and “9”.

    isDigit = (x as text) as logical =>
        Character.ToNumber(x)>=Character.ToNumber("0")
        and Character.ToNumber(x)<=Character.ToNumber("9"),

This internal function accepts a parameter. Take a look at the syntax: the parameter is stipulated after the equals symbol and before the arrow symbol (just as for a normal function).

Once all these tests have been applied to filter the candidate release years, I’m expecting to have just one remaining candidate! I’ll assert this with the function List.Single, which throws an error if the list is empty or has more than one element.

    value = List.Single(blocks1920dd)

That’s it! My newly created function passes all the test-cases that I set at the beginning.

challenge-completed

Full code:

let
    ExtractYear = (string as text) =>
let
    starts = Text.PositionOf(string, "(", Occurrence.All),
    blocks =
        List.Transform(
            starts
            , each try Text.Range(string, _ + 1, 4) otherwise null
        ),
    blocks1920 =
        List.Select(
            blocks
            , each
                Text.StartsWith(_, "19")
                or Text.StartsWith(_, "20")
        ),
    blocks1920dd =
        List.Select(
            blocks1920
            , each
                isDigit(Text.Range(_,2,1))
                and isDigit(Text.Range(_,3,1))
        ),
    isDigit = (x as text) as logical =>
        Character.ToNumber(x)>=Character.ToNumber("0")
        and Character.ToNumber(x)<=Character.ToNumber("9"),
    value = List.Single(blocks1920dd)
in
    value
in
    ExtractYear
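
To use it, invoke the function on the column holding the titles, for instance with an added column; the table name #"Movies" and the column Title are illustrative, not taken from the original file:

    #"Added Year" =
        Table.AddColumn(
            #"Movies",
            "Year",
            each ExtractYear([Title])
        )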

Use test-cases to validate your design with stakeholders

This post is part of a series about specifications, requirements, work items and test-cases in the field of Business Intelligence solutions. The first part is about why I hate and wouldn’t recommend specifications, the second is about how to write good requirements for BI solutions. Then, in the third part, I set the focus on work items and what they mean to me. I’ll now explain the role of test-cases in this strategy to deliver good BI solutions.

Let’s go back and see what is on the table until now:

  • Requirements:
    • A description of the business process to be analyzed
    • List of “what” (and not “how”) the stakeholders are expecting to do with our solution.
    • Information about the end-users (experience, …)
  • Work items:
    • A general design, the solution that I’m planning to build
    • A list of deliverables to be implemented, each of them small and independent (meaning that they can be discarded or postponed).

Until now, we haven’t validated anything. The general design and the work items are just the team’s vision of what could be the best solution for the end-users. If we implement this as-is, we take a lot of risks. Too many! If something is not correctly described in the requirements, or something in the business process is not understood, we’ll only find out when revealing the solution to the end-users, and it will be too late. We must have a handshake between the implementation team and the stakeholders before implementing the whole solution, ideally before writing the first lines of code.

invisible-handshake-with-giant

Keep in mind that you should never ask the stakeholders to give a “go” for the whole solution. Would you ask them to confirm that the solution you have in your head is exactly what they want, based on just a high-level description (slides, whiteboard, …)? It’d be impossible (even for you). That’s the same as expecting you to give a precise number of hours to implement a solution based on just a few lines of requirements.

We need a validation from the stakeholders, but what can we ask them to validate? Work items? The main issue with work items is that it’s difficult for end-users to get a good vision of what the solution will be: “what they will have in their hands”! Work items are usually too small to give them this overview. And if we ask them to validate something else than the work items, we’ll also be responsible for keeping the work items and this “other thing” consistent with each other during the whole implementation process.

So, the first option is to write a 3000-page document with a really detailed explanation of what will be built. Cross your fingers that the stakeholders have time to read it (from experience, they won’t … and yes, sooner or later it will be your problem). If they do have time, let me guess that they won’t understand more than 50% of it (again, sooner or later, it will be an issue for you). Don’t forget to pray that you haven’t made any small mistakes in this description, or it will be your responsibility: you’ll need to deliver it (even if it’s not possible) … and don’t forget that this document must be written before the development, so without any compiler to check your rough ideas. Oh, and finally, you’ll need to maintain the link between this document and your work items. From my experience, go and buy a lottery ticket instead, you have more chances to succeed.

lottery-ticket.png

A second option? Write test-cases. Most industry practices ask stakeholders to validate “designs” or “specifications” but rarely “test-cases”. When you think about it, it’s weird: test-cases are much more precise and less abstract than the other two, so they are easier for stakeholders to understand, and there is less room to argue about the interpretation of a test-case than about the interpretation of a specification. Funny. From my experience of the software industry, it’s usually because consultancy companies don’t write test-cases … they rely on the end-users to effectively test! And both (stakeholders and consultants) think that they are saving money with this “strategy”. Are we serious?

are-you-serious

Let’s go back to test-cases. A common definition is “A test-case is a set of conditions under which we will determine whether an application, software system or one of its features is working as it was originally established for it to do”.

We can reverse the definition and say that “the way the application, software system or one of its features is expected to work is defined by a set of conditions named a test-case”. It means that to express the result that I want (the what), I’m explaining what I’ll execute (the how). It gives a very good opportunity to keep the discussion with the stakeholders at the what-level and not the how-level. When you’re validating the test-case, you ask them to validate that it’s effectively the expected result … and not that it’s effectively how they were dreaming of implementing it.

The difference between validating a “specification” and a “test-case” can be explained in a few words: don’t ask them to validate a formula, ask them to validate the output of the formula when it receives a given set of input parameters. That’s completely different.

Let me be clear on one important topic: validating a test-case has two aspects. The first is obvious: when you’re giving these parameters, you’re expecting this result. The second is less obvious: you must also validate that the end-users are effectively able to give the parameters. For many reasons (usage of codes in place of labels, data not available at the moment of the request, …), the solution might not be implementable or usable!

How should you build the test-cases? From the end-user’s point of view: show them the report or pivot table that they are expecting/will receive. Don’t bother them with validation of the ETL … they don’t care and it’s not their problem, it’s yours. I agree that 50% of the errors will be in the ETL, but it’s not something that you should discuss with end-users! Focus on deliverables: a dimension, a measure-group, a report, an alert!

point-of-view.png

The big issue with BI solutions is usually that when we’re speaking about input parameters, we’re speaking about thousands or millions of values. Building datasets for test-cases can be a huge task. My recommendation is to use real, already existing data as much as possible. If you can, use production data (from the previous years). If you can’t, argue for it and show the stakeholders the cost of building datasets … they could change their mind. If it’s really not possible, you’ll need to build datasets. But in both cases, really think about whether you need to assert the final result on all the data; wouldn’t a subset be enough?

To illustrate this, I’ll explain my own case. At my current job, we have complex aggregations over time, and the daylight saving switches are a nightmare for us. Some testers would try to validate all the aggregations at the day level for the 14 years of data. It would take them days to implement … and who will validate such a result? Do you really think that someone will spend his time validating 5000 complex sums/averages/weighted averages? No one. I’m recommending to test one standard day and the two daylight saving switches of a same year. 3 tiny test-cases versus one huge test-case.

There is another big advantage to using tiny test-cases instead of huge ones. When my test is successful for the standard day and the summer daylight saving switch but not for the winter one, I’ll directly understand that I have a problem loading my data for the winter shift and nowhere else. But when your huge test-case is failing, you must first analyze it deeply to check what exactly is failing, and it could take a lot of time before realizing that the issue is only with the winter shift.

But what happens if my solution is correct for these 3 days but not for the others? I could have a bug where the solution is not loading the data before the year 2015! That’s a possibility, and it’s why I should have some tiny tests with an aggregation at the year level to ensure that I’m not only loading the last years of data. … but check my words … my test has changed! I’m no longer testing that I have a correct aggregation, I’m testing that I have plausible values! Something that is usually much quicker to write and to validate.

The building of the set of test-cases should be done with the stakeholders. Let’s go back to an illustration from the real world. A few years ago, one of my testers spent many hours implementing a mock-up of a pivot table. The mock-up was built with real figures and he spent a lot of time calculating the expected content of each cell (around 16). After one day of work, he met the subject-matter expert. In less than one second, the answer was there: “No, something is wrong!”. How could he say in less than one second that it was wrong? First, we had an issue with units (KW or MW), meaning that all the figures were a thousand times larger than expected. For him it was obvious; for us, we had no idea if it was 45 or 45,000. In less than one second the subject-matter expert could identify that something was wrong but also explain how to fix it. Great! But there was something else, another issue. The expert started to sort the members differently and then identified another mistake: some values were expected to be zero and they weren’t. The expert explained that the lower part of the matrix should be zero due to the business process, and it wasn’t the case. We discussed the potential root causes of our mistakes and identified a wrong interpretation of a business rule described in the requirements. In just a few minutes we captured 3 potential mistakes in the development (units, sorting of members, business rule misinterpretation) … They continued to work together for one hour and reviewed the whole test-suite, adding test-cases that only subject-matter experts could bring to the table and simplifying many test-cases. Great added value for everyone.

work-together.jpg

Unlike writing work items, writing test-cases can be done by junior and senior developers alike. What … developers writing tests? If you or your team members can’t write tests by yourselves, how could you hope to develop with a minimum of bugs? How can you anticipate all potential issues? That’s the definition of testing. If a tester has absolutely no notion of development, I have huge doubts that he can anticipate potential issues; it’s just monkey testing. It has some value, but does it sound serious to build your testing strategy exclusively on monkey testing? I doubt it.

Nevertheless, if your team is totally new to testing, it could be useful to add a test manager to your team for a few months (or years). He should be able to review the test-cases with you, help you identify alternative paths … help your team reach the next level of maturity in testing. But your developers should write the test-cases by themselves. Not necessarily the test-cases for what they are implementing: they can pair with another implementer, one testing the work item that the other is coding (and vice versa).

Writing a test-case is a slow (but valuable) process. I usually recommend a 1:1 ratio between the time of writing (not executing) the test-case and the time to develop the code. Keep in mind that I’m not just writing a test-case on a sheet of paper: I’m also validating my solution with the stakeholders before it’s effectively coded! Reusability? Once validated, I’ll use this artefact to build a robust test-suite that will be automated to validate my development. I’m also building a really good artefact to detect possible issues and impacts of change requests. Much more than a test that I’d throw away within 10 days. Still not convinced? Have you noticed that I’m not writing documentation … my test-cases are my documentation. Just by running the test-suites, I can confirm that my documentation is in line with my development … living and always up-to-date documentation is priceless.

priceless

 

Writing work items and not tasks

This post is part of a series about specifications, requirements, work items and test-cases in the field of Business Intelligence solutions. The first part is about why I don’t want specifications and the second is about how to write good requirements. This third part sets the focus on work items before jumping to the test-cases to validate with stakeholders.

I always expect the whole team to have a good understanding of the business we’re working for (even and especially when it’s a really complex business). The main reason is that it reminds them that our goal is to deliver working software to someone who expects specific benefits from using it … not just to deliver a software corresponding to some specifications. To gain this good understanding of the business, everyone needs to read (and understand) the requirements.

The key question is: “Why do we write work items, if everyone is aware of the requirements?”. The answer is really simple: for a given problem, there are many ways to solve it (probably many poor solutions but a few great ones). The first reason to write work items is to help the team go in the same direction.

What are work items in this case? They are the solution, not the problem. This solution must be debated with as many team members as possible to be sure that we’re covering all the aspects of the problem. It’s what I’m naming the design. I’m not expecting to have interactions with business stakeholders during this phase; it’s not their role to validate that I have a good solution because, at this moment, they would probably not understand what they will effectively receive, and that should be their unique concern (the what and not the how). This validation from stakeholders or end-users comes later, during the writing of test-cases.

The design phase is not something with a clear start and a clear end. It’s something that starts at the reception of the requirements and is ongoing until the end of the implementation. Anyway, you should have a good overview of your final design before starting the implementation.

At the end of this “main” design phase, we need to write down the target solution somewhere, to be sure we effectively implement what has been debated (no, a picture of the whiteboard is rarely a good idea). It’s also a good idea to write down our design in case we need to adapt it. Indeed, if the design is written somewhere, we’ll clearly see the impact of our changes (what to re-do, effect on planning, …), but also what was poorly taken into account during the initial design phase … it’s usually an excellent lesson learnt.

Until now, I’m speaking about a design and not about work items. What’s the difference? Work items are a decomposition of the design. The design is unique, work items are many.

When the team knows what should be done (the solution/the design), we also need to split the work between the different team members. With one big document describing the solution, it would be a pain. To overcome this pain, I’d recommend creating work items to subdivide the work to be executed by the different resources and to indicate the relations between them (precedence, …).

where-to-start

In order to achieve this, I have a personal rule: no work item should require more than 40 hours of implementation work (dev + test). If at any moment this rule is violated, the work item must be split (it could be during the design, during the development or during the testing).

But it’s not so easy: we also need to be able to decide that a work item will be discarded due to a lack of time to implement it, or because the ratio of implementation cost to added value is too low. To achieve this, we can’t split the work between layers (database, ETL, cube, report). If we need to build a dimension, I can’t split the work by layer and take the risk that we decide we have no time to implement the ETL when the cube, database and reports are already “done”. We always need the implementation in all our layers or the new dimension will be pointless! So don’t split work items by technical layer.

Another good reason to avoid the split by layer is the lack of responsibility of the developers when the work items are expressed this way. Even if the team has done an amazing job during the design phase, it won’t be perfect. We’ll meet unexpected issues (data quality or technical) and we’ll need to change our plans. When you’re dealing with the dimension from A to Z, you know for sure that any tiny problem during the ETL will create bigger issues during the implementation of the cube. So you need to fix the issue in the ETL … even if it was not initially described in the design. When splitting by layer, it’s really easy for a developer to not care about a tiny problem (to keep his personal planning) and let the next layers fix the issue. It will drive you to a poor solution where all issues are fixed in the last layers with tricks … not something that you’d like to maintain.

What’s a work item? It’s a small part of the global design that can be implemented in less than 40 hours, that can be discarded from the scope, and that will give business value to the end-users when implemented.

Added Value - Highway Sign

What about a deployment to QA, a code review or a refactoring? Are they work items? Not really. Naturally, if you don’t deploy, your solution is pointless, but it doesn’t mean that it’s really part of your design. So it’s not a work item but a task. And the deployment script? Again not a work item, because it won’t directly give added value to the end-users and you can’t really discard it. So again, a task, not a work item. Tasks should not be discussed with the end-users (and surely not validated) but must be included in the planning.

Back to the work items: what do they look like? Our work items are usually entitled:

  • Create a new dimension D
  • Add attributes X and Y to existing dimension T
  • Build a hierarchy Z on existing dimension U
  • Create new measure-group M and link it to dimension D and T
  • Create new report “M by X and Y”

Sometimes, with experience, we can split a work item into two distinct work items, with the essential attributes of the dimension in the first one and some nice-to-haves in the second. It will be easier to discard or postpone the second work item. But that’s already an advanced level.

Then there is the description of the work item, and it’s usually where things become less straightforward … especially for really good developers. The usual problem with poor work items is not how many we have, rarely their duration, and not their title, but almost always their content.

Poor work items are usually too detailed. They don’t describe the solution at the conceptual level but at the logical or physical level. Most of the time, the writers do this to help their teammates … but just the opposite happens.

An illustration? Surely! Imagine your master data solution has two distinct fields, first name and last name, and you only need one field, name. At the conceptual level, you’ll express that you’re expecting a unique field being the concatenation of the first then the last name, separated by a single space.

Note that I never said whether it should be done at the ETL level or in a view just before the cube. I have not explicitly written how to perform the concatenation! It would have been easier for me to write something like first + ' ' + last … except that if the fields first name and last name are not trimmed in the master data, no one would trim them. Why? Because it was not requested! It’s so detailed that it would be too easy for the implementer to say: “I just applied your formula and it doesn’t work, not my problem”, or even “I thought that you specifically wanted to not trim them!”. Stay at the conceptual level, make explicit the result that you want to achieve and not the how-to. If the developer is not sure about the need to trim or not, he will ask the question if you stayed at the conceptual level, not if you explicitly requested a first + ' ' + last.

Keep in mind that you cannot develop a whole solution without a few bugs. When you develop, you have a few lifeguards such as a compiler and a few manual runs … writing detailed work items is just denying this evidence and telling others: “take a look, I can design a bug-free solution without a compiler and without testing” … and everybody knows that you’re wrong. Stay at the conceptual level.

If most of your work items are tasks in disguise, then the development process becomes task-focused (doing things) instead of delivery-focused (creating value) … and that’s usually my root issue with poorly written work items.

doing-things

How to help the juniors without going too deep into the work items?

  1. Before they start to implement, ask them to explain how they will implement the work item. Don’t be too hard on them; keep in mind that you’re asking them to code on a whiteboard (without a compiler, …). It’s difficult, but at least you can point them to a few possible issues and, especially, be sure that they have understood the work item.
  2. During the development, review their code with them every couple of hours. Ask them to explain what they are doing and why they are making these choices. Propose alternatives. Let them make their own choices.
  3. Write the tests in parallel with the development, ask them to use the tests during the whole development process (not only at the end when the whole development is done), and discuss any divergence between the code and the tests.

You should notice a few constants in the three suggestions:

  • don’t do their job
  • put them in front of their responsibilities: don’t make the choice for them!
  • work in parallel, not ahead or after … with them.
writer.jpg

Who should write the list of work items? I’ll give two answers to this question. At the beginning, it’s important to do an excellent job when creating the work items. By creating, I mean deciding if it’s one, two or three work items, checking the precedence constraints (and avoiding loops) and giving good titles to these work items. From my point of view, it means that most junior team members won’t be able to do it. If they are, I’d strongly recommend giving them the senior adjective!

This is not a full-time job. If you have someone in your team writing work items for days, something is wrong in your process. It usually takes a couple of days of work to write the work items for a couple of weeks of implementation. If you have a huge team, consider having several writers, but it’s really important that they have a good overview of what the others are writing. I’d also recommend not having one dedicated resource to write the work items: the job owner must change at every release … minimum. It’s really important that senior implementers keep their hands in the code and do their part of the job by coding … not just by reviewing code and writing work items.

Finally, work items are alive and must be regularly updated, reviewed and sometimes discarded. Every team member has the right to propose an adaptation and is responsible for updating a work item, not just the initial writer. Every update must be internally debated with as many team members as possible and with the initial writer.

You can use the work items in your discussions with stakeholders. They will be your unit to discuss what must be postponed or discarded. If your work items are described in too technical terms, it will be difficult for your stakeholders to discuss at this level … another good reason to stay at a higher level.

Keep in mind that discussing an artefact to confirm whether it must be prioritized or not is not the same as asking for a validation of the content of the work item. I stated before that work items shouldn’t be validated and I stand by it … test-cases will help us confirm with the stakeholders whether our design is correct or not! And that’s the scope of the next blog post.

Handling special characters from HTML files with Power BI (M language).

In a first blog post, I explained how to list all the Data Platform MVPs by parsing a single web page. In a second blog post, I parsed the individual page of each MVP to get information about her awards. The third blog post was about social media and how to connect to the GitHub API to retrieve more info about activities on GitHub. The next two blog posts ([1] and [2]) illustrated how to parse rather complex Json responses and retrieve the commits’ activity and the issues’ activity of a repository.

In this new blog post, I’ll try to solve an issue that has been there since the first blog post. When parsing the MVP website and retrieving the first and last names, I had an issue with some special characters: my own name was retrieved as “C&#233;dric Charlier” in place of “Cédric Charlier”. For some reason some special characters are encoded (but some are not). Anyway, we need to decode them.

From a technical perspective, I’ll solve this issue with a recursion but also with some error handling in Power Query/M language.

mvp-analysis-powerbi-012

The first step is to analyze the problem in front of us. The encoded special characters always have the same pattern: an ampersand (“&”) followed by a hash (“#”), a set of digits and a semi-colon (“;”) to end the fragment. We’ll first need to identify this kind of pattern in the string. I’ll simplify and only search for the ampersand and the semi-colon.

let
    ConvertHtmlSpecialChars = (string as text) =>
let
    start = Text.PositionOf(string, "&"),
    end = Text.PositionOf(string, ";"),

Then, I’ll extract this fragment from the string and confirm that it effectively starts with “&”. If it’s the case, I’ll strip the leading “&#” and the trailing “;” to keep only the number in this fragment.

    value = Text.Range(string, start, end-start+1),
    cleanValue =
        if
            Text.StartsWith(value, "&")
        then
            Text.Start(
                Text.End(
                    value
                    , Text.Length(value)-2
                )
                ,Text.Length(value)-3
            )
        else
            null,

Now that I have this number, I can translate it into a normal character with the help of the function Character.FromNumber and replace the fragment with the new value.

    replace = Character.FromNumber(Number.FromText(cleanValue)),
    replaced = Text.Replace(string, value, replace),

Now, I need to handle the fact that I can have more than one special character in my string and I must process each of them. Once again recursion will help me!

    continue =
        if string=replaced
        then replaced
        else ConvertHtmlSpecialChars(replaced)
in
    continue

That’s it? Not really. My function crashes easily. If I have no special characters in my string, the function Text.Range(string, start, end-start+1) will crash because -1 (the value of start) is not a valid position. And there are many other places where I could have an issue.

A little bit of M theory to understand the basics of error handling in this language: when an error is thrown by a function, this error is propagated to the caller and so on. It’s really practical because you can catch the error at a higher level; you don’t need to intercept every potential error individually.
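
As a small illustration of this propagation (not part of the original query), the error raised in the first step below bubbles up through the second one and is only caught by the try at the end:

let
    a = Text.Range("abc", 10, 1),        // raises an error: the offset is out of range
    b = "prefix-" & a,                   // the error simply propagates through this step
    result = try b otherwise "fallback"  // and is caught here
in
    result                               // returns "fallback"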

In this case, I’ll catch the potential error between the replacement and my recursive condition.

newString = try replaced otherwise string,

The keyword try means that we’re expecting a potential issue during the evaluation of the step replaced. The keyword otherwise specifies that if we effectively get an error during the evaluation of replaced, then we need to use another value in its place. In this case, that value is just string (the initial value). So, if at any moment during the substitution I get an error, I fall back to the original string.

I still need to update the recursive condition to use this new function:

continue =
        if string=newString
        then newString
        else ConvertHtmlSpecialChars(newString)
in
    continue
in
    ConvertHtmlSpecialChars

Now, I need to add the call to this function to the table dealing with the MVPs’ names:

#"Html Special Chars" =
        Table.TransformColumns(
            #"Renamed Columns"
            ,{{"MVP name", ConvertHtmlSpecialChars}}
        )
in
    #"Html Special Chars"

And now I can compare the before

mvp-analysis-powerbi-013

and after

mvp-analysis-powerbi-014

Do … While loop with Power BI (M language).

In a first blog post, I explained how to list all the Data Platform MVPs by parsing a single web page. In a second blog post, I parsed the individual page of each MVP to get information about her awards. The third blog post was about social media and how to connect to the GitHub API to retrieve more info about activities on GitHub. The next blog post illustrated how to parse a rather complex Json response and retrieve the commits’ activity of a repository with the help of a projection.

In this new blog post, I’ll make an additional query to GitHub to retrieve the issues’ activity of a repository. Unlike the previous calls to the GitHub API, the response will be split into different pages. It means that one call to the API won’t be enough to retrieve the whole list of issues. Indeed, issues are returned 42 at a time, and I’ll need to retrieve one, two or more pages of information to get my count of issues created during the last 52 weeks.

Let’s start by taking a look at the GitHub API method to retrieve information about issues’ activity: /repos/:login/:repos/issues. But that’s a bit more complex: if you use this method as-is, it will return only the open issues and in an arbitrary sort order. Fortunately, we can override this behavior by applying modifiers:

/repos/:login/:repos/issues?state=all&sort=created

The state modifier identifies the state of the issues to be returned; I’m enforcing the retrieval of all the issues independently of their current state. With the sort parameter, I’m requesting to order the items by creation date. By default the sort is descending, so I don’t need to specify it explicitly.

With this request, I’ll retrieve the last 42 issues created in a repository. If I want to retrieve the previous 42, I’ll need to add an additional parameter.

/repos/:login/:repos/issues?state=all&sort=created&page=2

The parameter page lets me navigate in my list of issues (ordered by creation date).

If I want to retrieve all the issues, I need to go through all the pages. But how many pages do I need to retrieve? It depends: some repositories have no issues, others have thousands.

Many programming languages are imperative (and not functional or declarative) and have a do...while structure to perform loops that stop when a condition is met (in this case, the condition is probably that the page returns no issue). The M language is a declarative language and has no structure to handle loops. It’s not really a problem because we can overcome this limitation by using recursion.

The initial state of the recursion will be to perform a query on the GitHub API with the parameter page set to 1. Then we’ll parse the response and check if we have issues created during the last 52 weeks. If it’s the case, we’ll perform an additional request, incrementing the parameter page by 1, until we receive no issue created during the last 52 weeks.
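
Before diving into the real function, here is a minimal, self-contained sketch of that pattern (purely illustrative): the function calls itself with page+1 until a stop condition is met, exactly like a do…while loop would.

let
    CountDown = (page as number) as number =>
        if page > 3
        then 0                            // stop condition: nothing left to process
        else 1 + @CountDown(page + 1)     // handle this page, then move to the next one
in
    CountDown(1)                          // returns 3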

Because the function is recursive, we need to provide an additional parameter (page).

let
    GetIssuesDuringPeriod = (
        login as text
        , repository as text
        , countOfWeeks as number
        , page as number
    ) =>

Then we’ll call the GitHub API with the correct modifiers and the parameter page.

let
    #"List of issues" =
        Json.Document(
            Web.Contents(
                "https://api.github.com/repos/"
                & login & "/" & repository
                & "/issues?state=all&sort=created&page="
                & Text.From(page)
            )
        ),

We’ll consume the result of the query by analyzing the response and counting all the issues created less than 52 weeks ago.

The first step returns the date corresponding to 52 weeks ago. That date will be compared to the value of the field created_at of each issue. If the creation date is after that start date, we’ll add 1 to our count of issues; if not, we won’t change our count.

To achieve this I’m using a projection named List.Accumulate. I’m setting the initial value (s stands for seed) to 0 and, for each issue, I’m adding 1 to the seed or not based on the date comparison. This function returns the seed once all items have been iterated.

#"StartDate" =
        Date.AddWeeks(
            DateTime.LocalNow()
            ,  -countOfWeeks
        ),

    #"CountOfIssues" =
        List.Accumulate(
            #"List of issues"
            , 0
            , (s, c) =>
                if
                    DateTime.FromText(c[created_at], "en-us")
                    >=#"StartDate"
                then
                    s+1
                else
                    s
        ),

Now, I need to take a decision: do I iterate on an additional page or not? If the count returned by the previous step is 0, it means that this page had no issue created during the last 52 weeks. Due to the enforced ordering in my request, I know that I can stop there. If the result is greater than 0, I’ll perform an additional iteration of my recursive function to check if I have additional issues on the next page. To achieve this I’ll call the function itself, incrementing the parameter page. The result of this new call will be added to the result of the current call.

#"Recursive" =
        if
            #"CountOfIssues"> 0
        then
            #"CountOfIssues"
            + GetIssuesDuringPeriod(
                login
                , repository
                , countOfWeeks
                , page+1
            )
        else
            0
in
    #"Recursive"
in
    GetIssuesDuringPeriod

Now, I can add the result of this function to the table about GitHub repositories:

#"Added issue activity" =
        Table.AddColumn(
            #"Added commit activity"
            , "Count of issues created this year"
            , each GetIssuesDuringPeriod(
                [Login]
                , [#"Repository's name"]
                , 52
                , 1
            )
        )
in
    #"Added issue activity"

And I can create the following visualization to check the popularity of repositories by language (C#, R, JavaScript)

mvp-analysis-powerbi-011

and by MVP

mvp-analysis-powerbi-010

The next blog post of this series will be about parsing special HTML characters and translating them into correct Unicode values.

 

 

Projections with Power BI (M language)

In a previous blog post, I explained how to list all the Data Platform MVPs by parsing a single web page. In a second blog post, I parsed the individual page of each MVP to get information about her awards. It was then time to get information about social media for each MVP and to connect to the GitHub API to retrieve more info about their activities on GitHub. This third blog post ended with a remark about the limited interest of the values available (count of public repositories, followers) and the need to retrieve more pertinent info, such as the commit activity on the repositories owned by the MVPs.

From a technical perspective, I’ll show how to make a projection in Power BI (M language). A projection is the process of selecting data in a different shape rather than the specific entity being queried. There are many kinds of projections, such as GroupBy or SingleOrDefault, but I’ll illustrate this with the lesser-known Transform.
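
As a tiny illustration of a projection with List.Transform (not taken from the queries of this post), the list of records below is reduced to the list of its total fields:

let
    weeks = {[week = 1, total = 3], [week = 2, total = 5]},
    totals = List.Transform(weeks, each _[total])    // returns {3, 5}
in
    totals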

Let’s try this! We first need the list of all the repositories of an MVP. The starting point is the list of social media, filtered on the website “GitHub”. Then to get a list of repositories for the login of the MVP, I’ll use the GitHub API /users/:login/repos.

let
    Source = #"Social Media",
    #"Removed Columns" =
        Table.RemoveColumns(
            Source
            ,{"Url"}
        ),
    #"Filtered to Github website" =
        Table.SelectRows(
            #"Removed Columns"
            , each ([Website] = "GitHub")
        ),
    #"Get repositories from Github" =
        Table.AddColumn(
            #"Filtered to Github website"
            , "Custom"
            , each Web.Contents(
                "https://api.github.com/users/"
                & [Login] & "/repos"
            )
        ),

This gives me a nice table with all the information about the repositories in a binary format:

mvp-analysis-powerbi-007

Unlike in the previous post, expanding the response is a bit easier. I won’t use the pivot functions, just some expands.
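
One step is not shown in the post: the one that parses each binary response into a JSON value. It probably looks something like the sketch below; I’m guessing its exact shape, but the step and column names are chosen to match the ones referenced in the next snippet.

    #"Transform JSON to Table" =
        Table.AddColumn(
            #"Get repositories from Github"
            , "Transform Binary from Github"
            , each Json.Document([Custom])
        ),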

    #"Expanded Transform Binary from Github" =
        Table.ExpandListColumn(
            #"Transform JSON to Table"
            , "Transform Binary from Github"
        ),
    #"Expanded Transform Binary from Github1" =
         Table.ExpandRecordColumn(
            #"Expanded Transform Binary from Github"
            , "Transform Binary from Github"
            ,{
                "name", "owner", "html_url"
                , "description", "fork"
                , "stargazers_url", "contributors_url"
                , "homepage", "stargazers_count"
                , "watchers_count", "language"
                , "forks_count", "open_issues_count"
                , "forks", "open_issues", "watchers"
            }
            ,{
                "name", "owner", "html_url"
                , "description", "fork"
                , "stargazers_url", "contributors_url"
                , "homepage", "stargazers_count"
                , "watchers_count", "language"
                , "forks_count", "open_issues_count"
                , "forks", "open_issues", "watchers"
            }
        ),
    #"Expanded owner" =
        Table.ExpandRecordColumn(
            #"Expanded Transform Binary from Github1"
            , "owner"
            , {"login"}
            , {"owner.login"}
        ),

A bit of cleaning by removing and renaming some columns

    #"Removed Columns1" =
        Table.RemoveColumns(
            #"Expanded owner"
            ,{"Custom"}
        ),
    #"Renamed Columns" =
        Table.RenameColumns(
            #"Removed Columns1"
            ,{{"stargazers_count", "Count of stars"}
            , {"watchers_count", "Count of watchers"}
            , {"forks_count", "Count of forks"}
            , {"name", "Repository's name"}
            , {"open_issues_count", "Count of issues"}}
        ),

Filtering out the forks

#"Filtered Rows" = Table.SelectRows( #"Renamed Columns" , each ([fork] = false) ),

And I have the following result:

mvp-analysis-powerbi-008.png

That’s cool, but it’s not finished: I still don’t have any information about commits. To get it, I’ll need to make an additional query to the GitHub API and request the commit activity of each repository: /repos/:login/:repository/stats/commit_activity.

Let’s build a function to achieve this.

let
    GetCommitsDuringPeriod = (
        login as text
        , repository as text
        , countOfWeeks as number
    ) =>
let
    #"List of commits by week" =
        Json.Document(
            Web.Contents(
                "https://api.github.com/repos/"
                & login
                & "/"
                & repository
                & "/stats/commit_activity"
            )
        ),

This API sends a response with a rather complex structure: a list of records, one per week, each containing the commits for every weekday, the commits of the whole week (named total, equal to the sum of the weekday values) and the week identifier (a Unix timestamp).

mvp-analysis-powerbi-009
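Once parsed with Json.Document, each element of that list becomes an M record shaped more or less like this (the numbers are invented):

[
    days = {0, 3, 26, 20, 39, 1, 0},   // commits per weekday (Sunday to Saturday)
    total = 89,                        // sum of the seven weekday values
    week = 1478390400                  // Unix timestamp identifying the week
]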

I’m not interested in most of this information; I just need to keep the total for the last 52 weeks. This is where I’ll apply a projection, using the function List.Transform. This function lets me express that I want to transform my initial list into a new one: for each element, I’ll keep only the field total.

    #"List of totals" =
        List.Transform(
            List.LastN(
                #"List of commits by week"
                , countOfWeeks
            )
            , each _[total]
        ),

In this case, the transformation returns a list of numbers (one per weekly total). Again, I’m not really interested in the detail, so I can sum all these values with the function List.Sum.

    #"CountOfCommits" = List.Sum(#"List of totals")
in
    #"CountOfCommits"
in
    GetCommitsDuringPeriod
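On made-up data, the two inner steps of this function behave as follows:

let
    // Three fake weekly records shaped like the API response.
    weeks = {
        [days = {0,1,2,3,4,0,0}, total = 10, week = 1451779200],
        [days = {1,0,0,2,0,0,0}, total = 3,  week = 1452384000],
        [days = {0,0,5,0,2,0,0}, total = 7,  week = 1452988800]
    },
    // Keep the last two weeks and project each record onto its total: {3, 7}
    totals = List.Transform(List.LastN(weeks, 2), each _[total]),
    // Aggregate the projected totals: 10
    countOfCommits = List.Sum(totals)
in
    countOfCommits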

Now I can easily integrate my function into the previous query: I just need to add a column that calls it.

    #"Added commit activity" = Table.AddColumn(
            #"Filtered Rows"
            , "Count of commits this year"
            , each GetCommitsDuringPeriod(
                      [Login]
                      , [#"Repository's name"]
                      , 52
                 )
            )
in
    #"Added commit activity"

In the next blog post, I’ll add the issue activity for each repository. From a logical point of view, it’ll be the same process, but the technical challenge will be handling the pagination of the GitHub API results.

Retrieve GitHub data with Power BI

In a previous blog post, I explained how to list all the Data Platform MVPs by parsing a single web page. In a second blog post, I parsed the individual page of each MVP to get information about her awards. It’s now time to get information about social media for each MVP. Then we’ll parse this information to get each MVP’s GitHub login and finally use this login to connect to the GitHub API and retrieve more info.

From a technical perspective, I’ll show

  • how to query an external API to get some information
  • how to return more than one value for a function
  • how to build a function to identify some patterns
  • how to transform pairs of key/value into several columns.

Ready? Let’s start …

The first task is to grab the social media information on the profile page. In the previous blog posts, we stored each individual page in a table; that table will be our starting point. Once again, we apply the strategy of HTML patterns to identify the lines that are useful for our task.

let
    IndividualPageTags = #"Individual page",
    #"Added SocialMediaTable" = Table.AddColumn(IndividualPageTags, "Tags", each GetSocialMedia([IndividualPage])),
    #"Removed Columns" = Table.RemoveColumns(#"Added SocialMediaTable",{"IndividualPage"}),
    #"Expanded Tags" = Table.ExpandListColumn(#"Removed Columns", "Tags"),
    #"Extract url" = Table.AddColumn(#"Expanded Tags", "Url", each GetUrlFromTag([Tags])),
    #"Removed Columns1" = Table.RemoveColumns(#"Extract url",{"Tags"}),
    #"Filtered out null" = Table.SelectRows(#"Removed Columns1", each ([Url] <> null)),

Now that we have a list of links to the MVPs’ social media, we still have to identify the type of social media (GitHub, Facebook, LinkedIn, …) and to extract the login/username from the URL. To achieve this task we’ll create a function, but it means that our function will not return a single field but two (the type of social media and the login). With the M language, it’s possible to return a record (a record is a container with one or more fields, each field identified by a name).

To build the record, we use the function Record.AddField.
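For instance, chaining two calls on the empty record [] builds a record with two fields (the values are made up):

let
    // Each call returns a new record with one extra field.
    sample = Record.AddField(Record.AddField([], "Website", "GitHub"), "Login", "someuser")
in
    sample   // [Website = "GitHub", Login = "someuser"]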

let
    GetWebsiteInfo = (url as text) =>
let
    // Trim spaces and trailing slashes; a distinct name avoids a cyclic reference with the url parameter.
    cleanUrl = Text.Trim(url, {" ", "/"}),
    possibilities = {
        [pattern = "facebook"  , value = "Facebook"],
        [pattern = "linkedin"  , value = "LinkedIn"],
        [pattern = "github"    , value = "GitHub"],
        [pattern = "twitter"   , value = "Twitter"]
    },
    Website = List.Single(List.Select(possibilities, each Text.Contains(cleanUrl, _[pattern])))[value],
    Login = Text.Range(cleanUrl, Text.PositionOf(cleanUrl, "/", Occurrence.Last)+1),
    Records = Record.AddField(Record.AddField([], "Website", Website ), "Login", Login)
in
    Records
in
    GetWebsiteInfo
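Called with a sample URL (a made-up account), the function returns a record holding both fields:

    GetWebsiteInfo("https://github.com/someuser/")
    // returns [Website = "GitHub", Login = "someuser"]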

We can add this function to our previous code. The call to the function adds a record to each row; this record must then be expanded to reveal the values for the website and the login.

    #"Get Website and Login" = Table.AddColumn(#"Filtered out null", "Custom", each GetWebsiteInfo([Url])),
    #"Expanded Website and Login" = Table.ExpandRecordColumn(#"Get Website and Login", "Custom", {"Website", "Login"}, {"Website", "Login"})
in
    #"Expanded Website and Login"
mvp-analysis-powerbi-006.png

Cool, now that we have a list of social media entries with information we can parse, we can use them to contact GitHub and get more info about each GitHub account. First, we start from the social media table and filter it to keep only the rows referring to GitHub.

let
    Source = #"Social Media",
    #"Removed Columns" = Table.RemoveColumns(Source,{"Url"}),
    #"Filtered to Github website" = Table.SelectRows(#"Removed Columns", each ([Website] = "GitHub")),

GitHub exposes an interesting API that we can use to get some information about accounts, repositories, issues and much more. This API is exposed over HTTP and returns JSON documents. We can easily interact with it through the functions Web.Contents and Json.Document. For this task, we need to build a URL with the format /users/:login, where :login is a variable corresponding to the login of the user.
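In isolation, with a hard-coded (made-up) login, the pair of functions boils down to this:

let
    // Call the API for one user and parse the JSON response into a record.
    response = Web.Contents("https://api.github.com/users/someuser"),
    user = Json.Document(response),
    followers = user[followers]
in
    followers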

    #"Get user info from Github" = Table.AddColumn(#"Filtered to Github website", "Custom", each Web.Contents("https://api.github.com/users/" & [Login])),
    #"Transform JSON to Table" = Table.AddColumn(#"Get user info from Github", "Transform Binary from Github", each Record.ToTable(Json.Document([Custom]))),
 

Before executing this code, Power BI will ask us about the security settings for the GitHub API. This API is a new source, and each source must define its own security. The GitHub API can be contacted anonymously, with basic authentication or with OAuth2. We could use the anonymous way, but we’d quickly hit the rate limit on anonymous calls (60 requests per hour). I selected basic authentication and registered my own GitHub credentials into Power BI. Once that’s done, we can load this new table.

Now we have information about the GitHub accounts and we need to reformat it. We’ll first expand the JSON table received from GitHub; it’s a simple set of key/value pairs. For convenience, we exclude the key plan (this info is only available for your own account). Then it’s time to remove some columns that are now useless.

    #"Expanded Table to Name/Value pairs" = Table.ExpandTableColumn(#"Transform JSON to Table", "Transform Binary from Github", Table.ColumnNames(#"Transform JSON to Table"[#"Transform Binary from Github"]{0})),
    #"Filtered out ""Plan""" = Table.SelectRows(#"Expanded Table to Name/Value pairs", each ([Name] <> "plan")),
    #"Removed Columns1" = Table.RemoveColumns(#"Filtered out ""Plan""",{"Website", "Login", "Custom"}),
mvp-analysis-powerbi-004

Based on this content, we can pivot the data to transform the key/value pairs into columns. The pivot function expects a summarize function (usually List.Sum), but keep in mind that here the values are text, not numbers. We specify List.Max instead: there will only be one value for each pivoted cell anyway, and this function accepts text arguments.

    #"Pivoted Key/Value" = Table.Pivot(#"Removed Columns1", List.Distinct(#"Removed Columns1"[Name]), "Name", "Value", List.Max),

 

mvp-analysis-powerbi-005.png

When done, we need some additional cleaning:

    #"Removed Columns2" = Table.RemoveColumns(#"Pivoted Key/Value",{"id", "gravatar_id", "url", "subscriptions_url", "organizations_url", "repos_url", "events_url", "received_events_url", "type", "site_admin", "private_gists", "total_private_repos", "owned_private_repos", "disk_usage", "collaborators"}),
    #"Renamed Columns" = Table.RenameColumns(#"Removed Columns2",{{"login", "GitHub Login"}, {"name", "GitHub fullname"}, {"public_repos", "Count of public repos"}, {"followers", "Count of followers"}})
in
    #"Renamed Columns"

Now, with this new table, we can build the following report.

mvp-analysis-powerbi-003

I have a few remarks about this report. First, it’s a bit weird to read that only 35 MVPs have registered a GitHub account on their profile pages (that’s less than 10%). Some have probably just forgotten to register this information.

The count of followers and the count of public repositories are not the best indicators of someone’s popularity or activity on GitHub. I’ll improve this in a next blog post.