Scan web pages with Power BI

In a previous blog post, I explained how to list all the Data Platform MVP by parsing a single web page. It was great but I just get a few information about the MVP (name and country), it’d be great to also know since when they are MVP, how many awards and links to social media (in the next blog post I’ll use these links to get more info about GitHub accounts).

In the previous blog post, I add build a ProfileUrl field with the url of the individual profile page, it’s time to use it.

The first task is to grab the awards’ information on the profile page: categories, first year and count of awards. But before going further I need to take into account that I’ll probably use the individual profile page of each MVP more than once (I’ll also use these pages to get data about Social Media). Based on this, it’s better to avoid to duplicate the code to load these pages everywhere (will also save me from several download for each pages).

   IndividualPageTags = #"MvpId and Name",
   #"Added Page" = Table.AddColumn(#"MvpId and Name", "IndividualPage", each Web.Contents([ProfileUrl])),
   #"Removed Columns" = Table.RemoveColumns(#"Added Page",{"Name", "ProfileUrl", "PhotoUrl", "Index"})
   #"Removed Columns"

To get the information on this âge we’ll use the same strategy than in the first part of this post. Transform the binary to a list of lines and filter them on a pattern.

   IndividualPageTags = #"Individual page",
   #"Added AwardsTable" = Table.AddColumn(#"IndividualPageTags", "Awards", each GetAwards([IndividualPage])),
   #"Expanded Custom" = Table.ExpandTableColumn(#"Added AwardsTable", "Awards", {"Award", "FirstYear", "CountOfAwards"}, {"AwardsList", "FirstYear", "CountOfAwards"})
   #"Expanded Custom"

The filtering task is performed by the function GetAwards. This function accepts, as a parameter, the individual profile page of the MVP. The type is binary. This function is looking for the pattern div class="ly infoContent". After cleaning, we’ve three rows. The first one contains the different categories, the second the first year and finally the count of awards. It’s possible to transform the three rows into three columns using the Table.Transpose function.

    AwardsInfo = (page as binary) =>
    Lines = Lines.FromBinary(page),
    Tags = List.Select(Lines, each (Text.Contains(_, "
<div class=""ly infoContent"">"))),
    #"Converted to Table" = Table.FromList(Tags, Splitter.SplitTextByDelimiter(">"), null, null, ExtraValues.Error),
    #"Changed Type" = Table.TransformColumnTypes(#"Converted to Table",{{"Column1", type text}, {"Column2", type text}, {"Column3", type text}}),
    #"Removed Columns" = Table.RemoveColumns(#"Changed Type",{"Column1", "Column3"}),
    #"Replaced Value" = Table.ReplaceValue(#"Removed Columns","</div","",Replacer.ReplaceText,{"Column2"}),
    #"Transposed Table" = Table.Transpose(#"Replaced Value"),
    #"Renamed Columns" = Table.RenameColumns(#"Transposed Table",{{"Column1", "Award"}, {"Column2", "FirstYear"}, {"Column3", "CountOfAwards"}})
    #"Renamed Columns"

When we’ve retrieved the awards’ information, we can add the data about first year and count of awards to our main table.

    Source = #"MvpId and Name",
    #"Merged MvpId and Country" = Table.NestedJoin(Source,{"Index"},Country,{"Index"},"NewColumn",JoinKind.Inner),
    #"Expanded Country" = Table.ExpandTableColumn(#"Merged MvpId and Country", "NewColumn", {"Country"}, {"Country"}),
    #"Removed Index" = Table.RemoveColumns(#"Expanded Country",{"Index"}),
    #"Merged with Awards" = Table.NestedJoin(#"Removed Index",{"MvpId"},Awards,{"MvpId"},"NewColumn",JoinKind.LeftOuter),
    #"Expanded Awards" = Table.ExpandTableColumn(#"Merged with Awards", "NewColumn", {"FirstYear", "CountOfAwards"}, {"FirstYear", "CountOfAwards"})
    #"Expanded Awards"

The relation between an MVP and the awards’ categories is a one-to-many. We need to build a second table with all the awards’ categories received by each MVP. To achieve this we first need to transform the list of categories into multiples rows. To achieve this, we need to split our string with the separator coma. then we can unpivot the data on the MVP Id and then remove the useless column Attribute. To clean the remaining spaces of some categories, we just need to trim.

    Source = Awards,
    #"Removed Columns" = Table.RemoveColumns(Source,{"IndividualPage", "FirstYear", "CountOfAwards"}),
    #"Split Column by Delimiter" = Table.SplitColumn(#"Removed Columns","AwardsList",Splitter.SplitTextByDelimiter(",", QuoteStyle.Csv),{"AwardsList.1", "AwardsList.2"}),
    #"Unpivoted Columns" = Table.UnpivotOtherColumns(#"Split Column by Delimiter", {"MvpId"}, "Attribute", "Value"),
    #"Removed Columns1" = Table.RemoveColumns(#"Unpivoted Columns",{"Attribute"}),
    #"Trimmed Text" = Table.TransformColumns(#"Removed Columns1",{{"Value", Text.Trim}})
    #"Trimmed Text"

It’s now time to check the relations between our main table (All info) and the new table (Awards Nomination).


We can now build a new report with some information about the awards received.


The next step is to retrieve the information about social media for each MVP and to use this information to retrieve additional info about their usage of GitHub … but it’s for the next blog post.



Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s