Parse a web page with Power BI

A few days ago, I found the excellent blog post of James Dixon about an age and sex analysis of Microsoft USA MVPs. I really like F# because it’s so efficient for this kind of tasks: in just a few lines of code you can parse the web, call an API and build a great data frame. Just after reading this article, I started to try to code the same project in Power BI/Power Query (M language) but during this journey, I changed a bit the idea to finally focus on Data Platform MVP and their Github accounts.

Let’s start by getting a list of Data Platform MVP. Microsoft doesn’t expose an API for its MVP website, so we need to parse the website. This website expose a search engine where it’s easy to filter on the Award Category “Data Platform”. By default, the search results page returns the 18 first results and let’s you navigate to next pages. If you click on the bottom figures, you can change the count of MVPs by page to 36 or 48. If you’re focus on your task then you’ll see that this value (18,36 or 48) is available as a parameter in the url. I tried to change this value to 1000 in the url, without any problem: the web page was listing 1000 MVPs. Based on this, it’s possible to render all the data platform MVP on one single page with the following url: https://mvp.microsoft.com/en-us/MvpSearch?&ex=Data Platform&ps=1000&pn=1

Now, that we’ve a way to retrieve all the MVP, let’s try to consume this from Power BI. Unfortunately, the website doesn’t return a table but a rather complex list of div tags. From my personal experience, it’s often the case when you try to parse a website. I could try to build a parser with M functions but it would be rather complex and there is an easier solution. I’ll first convert the binary object into a set of lines. For debugging purpose I’m also creating a parameter LimitRowsForDevelopment that I set to 700 to speed up the grabbing process when developing and I switch it back to 10000000 when running “in production”.

let
    Source = Web.Contents("https://mvp.microsoft.com/en-us/MvpSearch?&ex=Data Platform&ps=1000"),
    Lines = Lines.FromBinary(Source),
    #"Kept First Items" = List.FirstN(Lines,LimitRowsForDevelopment)
in
    #"Kept First Items"

When I’ve this list of lines, I need to extract a subset of this list where all the lines have the following patterns each Text.Contains(_, "<a href")and Text.Contains("</a>"). It will give me a list of links to the individual profile pages but also the name of the MVP.

I’s time to parse the link found above and extract the Mvp Id (a few figures identifying the MVP).

This can easily be achieved with three functions. The first one is a kind of overload of Text.PositionOf where you can specify the position of the occurrence.

let
    GetPositionInText = (string as nullable text, substring as text, position as number) =>

let
    Source = if string = null then 0 else List.First(List.Skip(Text.PositionOf(string , substring, Occurrence.All), position-1))
in
    Source
in
    GetPositionInText

This function list all the occurrences of the character Text.PositionOf, skip the n-1 first occurrences List.Skip and take the first of the remaining occurrences List.First.

When we’ve this quick and useful function we can parse the a href tag and ask Power BI to get the value between the second quote and the last occurrence of a dash.

let
    GetMvpIdFromUrl = (url as text) =>

let
    EndPos = GetPositionInText(url, """", 2),
    SubText = Text.Range(url, 0, EndPos),
    StartPos = Text.PositionOf(SubText, Character.FromNumber(45), Occurrence.Last)+1,
    Source = Text.Range(SubText, StartPos, EndPos-StartPos)
in
    Source
in
    GetMvpIdFromUrl

This function uses a few tricks not well-known about the M language. First it’s possible to define intermediate steps as variables that you can consume later (It’s the case for EndPos, SubText and StartPos). Another trick useful with special characters is the possibility to identify them as a number and transform them into the characters with the function Character.FromNumber.

For the name of the MVP, we use the same strategy and following function:

let
    GetNameFromUrl = (url as text) =>

let
    StartPos = GetPositionInText(url, ">", 1)+1,
    EndPos = GetPositionInText(url, "<", 2),
    Source = Text.Replace(Text.Range(url, StartPos, EndPos-StartPos), "  ", " ")
in
    Source
in
    GetNameFromUrl

Now back to the main table and we can extract names and MVP ids with the following code:

let
    LinkTags = List.Select(#"Listing page", each Text.Contains(_, "<a href=""/en-us/mvp/") and Text.Contains(_,"</a>")),
    #"Converted to Table" = Table.FromList(LinkTags , Splitter.SplitByNothing(), {"OriginalUrl"}, null, ExtraValues.Error),
    #"Get Mvp Id" = Table.AddColumn(#"Converted to Table", "MvpId", each GetMvpIdFromUrl([OriginalUrl])),
    #"Get Name" = Table.AddColumn(#"Get Mvp Id", "Name", each GetNameFromUrl([OriginalUrl])),
    #"Removed Columns" = Table.RemoveColumns(#"Get Name",{"OriginalUrl"}),
    #"Added Index" = Table.AddIndexColumn(#"Removed Columns", "Index", 0, 1)
in
    #"Added Index"

The last line of code is the addition of an index that we’ll use just after. Indeed, I’ll go back to my list of lines retrieved in #"Listing page" and now extract the country. To achieve this I’ll again keep all the rows with a given pattern and then perform string manipulation on the remaining lines.

let
    CountryTags = List.Select(#"Listing page", each (Text.Contains(_, "<span class=""subItemContent"">") and Text.Contains(_, "</span>"))),
    #"Converted to Table" = Table.FromList(CountryTags, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    #"Split Column by Delimiter" = Table.SplitColumn(#"Converted to Table","Column1",Splitter.SplitTextByEachDelimiter({">"}, QuoteStyle.Csv, false),{"Column1.1", "Country"}),
    #"Removed Columns" = Table.RemoveColumns(#"Split Column by Delimiter",{"Column1.1"}),
    #"Replaced Value" = Table.ReplaceValue(#"Removed Columns","</span>","",Replacer.ReplaceText,{"Country"}),
    #"Added Index" = Table.AddIndexColumn(#"Replaced Value", "Index", 0, 1)
in
    #"Added Index"

Now that I’ve the two tables loaded (one with Name and MVP id, a second with Country), I can merge them to create a unique dataset with the help of the index.

let
    Source = #"MvpId and Name",
    #"Merged MvpId and Country" = Table.NestedJoin(Source,{"Index"},Country,{"Index"},"NewColumn",JoinKind.Inner),
    #"Expanded Country" = Table.ExpandTableColumn(#"Merged MvpId and Country", "NewColumn", {"Country"}, {"Country"}),
    #"Removed Index" = Table.RemoveColumns(#"Expanded Country",{"Index"}),

...

The MVP id is also really helpful to build two additional urls. The first one will redirect to the profile page of the MVP and the second will be used to retrieve the picture of the MVP.

...

    #"Added ProfileUrl" = Table.AddColumn(#"Removed Index", "ProfileUrl", each "https://mvp.microsoft.com/en-us/PublicProfile/" & [MvpId]),
    #"Added PhotoUrl" = Table.AddColumn(#"Added ProfileUrl", "PhotoUrl", each "https://mvp.microsoft.com/en-us/PublicProfile/Photo/" & [MvpId]),
in
    #"Added PhotoUrl"

Well, it’s now time to make some visualizations to show the results.

mvp-analysis-powerbi-001

In a next blog post, I’ll show how to extract information not available in the search page but only on the profile pages of the MVP. I’ll include these info into the existing dataset. Then it will be time to merge this dataset with some information that I’ll retrieve from Github.

Advertisements

5 comments

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s