Retrieve GitHub data with Power BI

In a previous blog post, I explained how to list all the Data Platform MVP by parsing a single web page. In a second blog post, I parsed the individual pages of each MVP to get information about her awards. It’s now time to get information about social media for each MVP. Then we’ll parse these informations to get the login of each MVP on GitHub and finally use this login to connect to the GitHub API and retrieve more info.

From a technical perspective, I’ll show

  • how to query an external API to get some information
  • how to return more than one value for a function
  • how to build a function to identify some patterns
  • how to transform pairs of key/value into several columns.

Ready? Let’s start …

The first task is to grab the social media information on the profile page. In previous blog posts, we stored this individual page in a table, the starting point will be this table. Once again, we apply the strategy of the html pattern to identify the useful lines for our task.

let
    IndividualPageTags = #"Individual page",
    #"Added SocialMediaTable" = Table.AddColumn(IndividualPageTags, "Tags", each GetSocialMedia([IndividualPage])),
    #"Removed Columns" = Table.RemoveColumns(#"Added SocialMediaTable",{"IndividualPage"}),
    #"Expanded Tags" = Table.ExpandListColumn(#"Removed Columns", "Tags"),
    #"Extract url" = Table.AddColumn(#"Expanded Tags", "Url", each GetUrlFromTag([Tags])),
    #"Removed Columns1" = Table.RemoveColumns(#"Extract url",{"Tags"}),
    #"Filtered out null" = Table.SelectRows(#"Removed Columns1", each ([Url] <> null)),

Now that we’ve a list of links to the MVP’s social medias, we still have to identify the type of social media (GitHub, Facebook, LinkedIn, …) and to extract the login/username from the url. To achieve this task we’ll create a function but it means that our function will not return a single field but two fields (the type of social media and the login). With the M language, it’s possible to return a record (a record is a container with one or more fields, each field is identified by a name).

To build the record, we use the function Record.AddField.

let
    GetWebsiteInfo = (url as text) =>
let
    url = Text.Trim(url, {" ", "/"}),
    possibilities = {
        [pattern = "facebook"  , value = "Facebook"],
        [pattern = "linkedin"  , value = "LinkedIn"],
        [pattern = "github"    , value = "GitHub"],
        [pattern = "twitter"   , value = "Twitter"]
    },
    Website = List.Single(List.Select(possibilities, each Text.Contains(url, _[pattern])))[value],
    Login = Text.Range(url, Text.PositionOf(url, "/", Occurrence.Last)+1),
    Records = Record.AddField(Record.AddField([], "Website", Website ), "Login", Login)
in
    Records
in
    GetWebsiteInfo

We can add this function to our previous code. The call to the function will add a record to each line. This record must be expanded to reveal the values for the website and the login.

    #"Get Website and Login" = Table.AddColumn(#"Filtered out null", "Custom", each GetWebsiteInfo([Url])),
    #"Expanded Website and Login" = Table.ExpandRecordColumn(#"Get Website and Login", "Custom", {"Website", "Login"}, {"Website", "Login"})
in
    #"Expanded Website and Login"
mvp-analysis-powerbi-006.png

Cool, now that we’ve a list of social medias with information that we can parse, we can use them to contact GitHub and get more info about the GitHub account. First we need to start from the social media table and filter to hold lines referring to GitHub.

let
    Source = #"Social Media",
    #"Removed Columns" = Table.RemoveColumns(Source,{"Url"}),
    #"Filtered to Github website" = Table.SelectRows(#"Removed Columns", each ([Website] = "GitHub")),

GitHub exposes an interesting API that we can use to get some information about accounts, repositories, issues and much more. This API is exposed on http and returns a JSON document. We can easily interact with this API with the functions Web.Contents and Json.Document. To achieve this task , we need to build an url with the format users/:login where :login is a variable corresponding to the login of the user.

    #"Get user info from Github" = Table.AddColumn(#"Filtered to Github website", "Custom", each Web.Contents("https://api.github.com/users/" & [Login])),
    #"Transform JSON to Table" = Table.AddColumn(#"Get user info from Github", "Transform Binary from Github", each Record.ToTable(Json.Document([Custom]))),
 

Before executing this code Power BI will ask us about security settings on the GitHub API. This API is a new source and each source must define it’s own security. The GitHub API can be contacted with <em>anonymous</em>, <em>basic authentication</em>, <em>OAuth2</em>. We could use the <em>anonymous</em> way but we’ll be quickly limited because of the limitation of 50 calls/day to the API. I selected <em>basic authentication</em> and registered my own GitHub credentials into Power BI. When it’s done we can now execute the load of this new table.

Now, we’ve information about GitHub accounts and we need to reformat it. We’ll first expand the Json table received from GitHub. It’s a simple set of key/value pairs. For some convenience, we need to  exclude the key equal to plan (this info is only available for your account). Then, it will be time to remove some columns that are now useless.

    #"Expanded Table to Name/Value pairs" = Table.ExpandTableColumn(#"Transform JSON to Table", "Transform Binary from Github", Table.ColumnNames(#"Transform JSON to Table"[#"Transform Binary from Github"]{0})),
    #"Filtered out ""Plan""" = Table.SelectRows(#"Expanded Table to Name/Value pairs", each ([Name] <> "plan")),
    #"Removed Columns1" = Table.RemoveColumns(#"Filtered out ""Plan""",{"Website", "Login", "Custom"}),
mvp-analysis-powerbi-004

Based on this content, we can pivot the data to transform the key/value pairs into columns. The pivot function is expecting a summarize function (usually List.Sum). Keep in mid that here the values are text and not numbers. We should specify List.Max because anyway we’ll have one value for each pivoted row and this function accepts a text argument.

    #"Pivoted Key/Value" = Table.Pivot(#"Removed Columns1", List.Distinct(#"Removed Columns1"[Name]), "Name", "Value", List.Max),

 

mvp-analysis-powerbi-005.png

When done, we need some additional cleaning:

    #"Removed Columns2" = Table.RemoveColumns(#"Pivoted Key/Value",{"id", "gravatar_id", "url", "subscriptions_url", "organizations_url", "repos_url", "events_url", "received_events_url", "type", "site_admin", "private_gists", "total_private_repos", "owned_private_repos", "disk_usage", "collaborators"}),
    #"Renamed Columns" = Table.RenameColumns(#"Removed Columns2",{{"login", "GitHub Login"}, {"name", "GitHub fullname"}, {"public_repos", "Count of public repos"}, {"followers", "Count of followers"}})
in
    #"Renamed Columns"

Now, with this new table, we can build the following report.

mvp-analysis-powerbi-003

I’ve a few remarks about this report. First, it’s a bit weird to read that only 35 MVP have registered a GitHub account on their profile pages (it’s less than 10%). Some have probably forgotten to register this information.

Count of followers and public repositories are not the best parameters to determine the popularity or the activity on GitHub of someone. I’ll improve this in a next blog post.

Advertisements

4 comments

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s