Extracting key phrases with the Text Analytics API and Power BI

On a previous series of blog posts, I explained how to grab the list of Data Platform MVPs from a website and add additional data about them as such as their GitHub contributions. My last article also explained how to use the Face API to guess the gender of these MVP. I’ll now parse their biography to extract key phrases and display the most used themes in these texts. To support this need, I’ll use the Text Analytics API provided by Azure Cognitive services.

The Text Analytics API is a suite of text analytics web services built with best-in-class Microsoft machine learning algorithms. The API can be used to analyze unstructured text for tasks such as sentiment analysis, key phrase extraction and language detection. No training data is needed to use this API; just bring your text data. This API uses advanced natural language processing techniques to deliver best in class predictions.

The first step of this exercice is to grab the biography. We previously discussed how to parse a web page and retrieve pertinent information. In that case we need to look for the html tag <pre>, extract the text and clean it up.

let
    GetBiography = (url as text) as text=>
let
    source = Web.Contents(url),
    lines = Lines.FromBinary(source),
    biography = List.Select(lines, each Text.Contains(_, "<pre>")){0},
    cleanedBiography = ConvertHtmlSpecialChars(Text.BeforeDelimiter(Text.AfterDelimiter(Text.Clean(Text.Trim(biography)),">"), "<"))
in
    try cleanedBiography otherwise ""
in
    GetBiography

We need to setup a Text Analytics API on our Azure account. If you don’t know how to do this refer to the previous article of this series. About pricing, selecting the free (FO) possibility is probably the best approach. We’ll have a very limited usage of this API with just one call by run.

text-analytics-api-pricing.png

Indeed, at the difference of the Face API, we don’t need to perform a call for each MVP to the Text Analytics API. We only need to call this API once with all the biographies. The API is expecting an list of records (with the language, an id and the biography). The language will always be “en” (standing for English) and for the id, we’ll use the MvpId. Preparing this structure is performed in a function PrepareKeyPhrases.

let
    PrepareKeyPhrases = () =>
let
    source = Table.SelectColumns(#"All info",{"MvpId", "Biography"}),
    filtered = Table.SelectRows(source, each [Biography] <> ""),
    addEnglish = Table.AddColumn(filtered, "language", each "en"),
    renamed = Table.RenameColumns(addEnglish,{{"Biography", "text"}, {"MvpId", "id"}}),
    reordered = Table.ReorderColumns(renamed,{"language", "id", "text"}),
    list = Table.ToRecords(reordered)
in
    list
in
    PrepareKeyPhrases

Once we have the expected input for the query, we just need an additional function to call the Text Analytics API and the method keyPhrases. This power query-m function will be named GetKeyPhrases, receive in parameter the result of PrepareKeyPhrases and will use the same approach than for the Face API, detailed in this article. The most complex part of this query is to transform the structure (list of records) that we previously prepared to a Json text. We can achieve this by transforming the structure to a Json document with Json.FromValue and then we need to serialize this by transforming our newly created binary to text with the help of Text.FromBinary.

let
    GetKeyPhrases = (list as list) =>

let
    headers = [#"Content-Type" = "application/json", #"Ocp-Apim-Subscription-Key" = TextApiKey],
    jsonText = Text.FromBinary(Json.FromValue(list)),
    content = "{ ""documents"": " & jsonText & "}",
    source = Web.Contents(TextApiUrl, [Headers = headers, Content=Text.ToBinary(content)])
in
    source
in
    GetKeyPhrases

We’ll need to provide a table for these keyPhrases. This new table will be linked to the main table with all the MVP details in the model.

let
    source = GetKeyPhrases(PrepareKeyPhrases()),
    json = Json.Document(source,65001)[documents],
    toTable = Table.FromList(json, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    expandColumns = Table.ExpandRecordColumn(toTable, "Column1", {"id", "keyPhrases"}, {"MvpId", "Word"}),
    expandList = Table.ExpandListColumn(expandColumns, "Word"),
    nonNullRows = Table.SelectRows(expandList, each [Word] <> null),
    capitalized = Table.TransformColumns(nonNullRows,{{"Word", Text.Proper, type text}}),
    cleaned = Table.TransformColumns(capitalized,{{"Word", Text.Clean, type text}}),
    trimmed = Table.TransformColumns(cleaned,{{"Word", Text.Trim, type text}})
in
    trimmed

Now that we’ve this list of keywords, we can freely use the Word Cloud custom visual.

text-analytics-word-cloud.

It’s interesting to compare some of the keywords extracted for the recently awarded MVPs.

text-analytics-api-new-mvp

and the long time MVPs.

text-analytics-api-long-time-mvp

At a first sight, book sounds as a keyword associated to long-time MVP and Power BI to the recently awarded.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s