Return intermediate vertices by creating aliases with the Gremlin language

In previous blog posts, I explained the use-case, how to load data into an Azure Cosmos DB graph database with the bulk executor library, how to write your first queries with the Gremlin language, how to manage vertices’ properties and the first steps to traverse a graph. This blog post is dedicated to explaining how you can traverse a graph and return vertices that are not at the end of the path.

Until now, we’ve always returned the last step of the query. But that’s not always what we want! In the previous query, we returned the karatekas who had defeated the karatekas who defeated my daughter. One issue with this query was that we didn’t know who defeated whom.
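
As a reminder, the query from the previous post looked roughly like this (it only returned the winners’ names):

g.V()
.has("karateka", "fullName", "Charlier, Alice")
.out("loses")
.in("wins")
.out("loses")
.in("wins")
.values("fullName")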

We can fix this problem by returning tuples containing the two karatekas corresponding to the vertices in green on this drawing.

To achieve this, we’ll need to give an alias to the selection performed at the third intermediate step (and also to the last one). Gremlin has an operator named as to assign aliases.

The following query simply assigns aliases to the interesting steps:

g.V()
.has("karateka", "fullName", "Charlier, Alice")
.out("loses")
.in("wins")
.as("loser")
.out("loses")
.in("wins")
.as("winner")

At this point, if you executed the query, the result would be identical to the previous one. Just assigning aliases is not enough; you also have to instruct Gremlin to make use of them. The simplest usage is to select them. This operation ensures that the aliased vertices are part of the result of the query.

g.V()
.has("karateka", "fullName", "Charlier, Alice")
.out("loses")
.in("wins")
.as("loser")
.out("loses")
.in("wins")
.as("winner")
.select("winner", "loser")

The result of this query is a list of tuples, each containing a loser and a winner:

[
  {
    "loser": {
      "id": "karateka.52",
      "label": "karateka",
      "type": "vertex",
      "properties": {
        ...
        "fullName": [
          {
            "id": "3dec83c1-5f2e-41d4-98e9-8371cf6017b4",
            "value": "B****, Rania"
          }
        ],
        ...
      }
    },
    "winner": {
      "id": "karateka.59",
      "label": "karateka",
      "type": "vertex",
      "properties": {
        ...
        "fullName": [
          {
            "id": "44fdc18a-a3e7-47ba-8ce5-d7599548c267",
            "value": "S****, Kiara"
          }
        ],
        ...
      }
    }
  },
  {
    "loser": {
      ...
    },
    "winner": {
      ...
    }
  },
  ...
]

It’s usually not desirable to get the whole list of properties. Most of the time we’re only interested in a subset of the properties, in this case the fullName. To project the vertices to a single property, we can use the by operator provided by Gremlin.

g.V()
.has("karateka", "fullName", "Charlier, Alice")
.out("loses")
.in("wins")
.as("loser")
.out("loses")
.in("wins")
.as("winner")
.select("winner", "loser")
.by("fullName")

This query limits the result to the full names of the losers and the winners:

[
  {
    "loser": "B****, Rania",
    "winner": "S****, Kiara"
  },
  ...
  {
    "loser": "P****, Denisa",
    "winner": "D*****, Louise"
  }
]

The next blog post will be about filtering during the traversal of a graph.


Basic traversing of a graph with the Gremlin language

In previous blog posts, I explained the use-case, how to load data into an Azure Cosmos DB graph database with the bulk executor library, how to write your first queries with the Gremlin language and how to manage vertices’ properties. This blog post is dedicated to explaining how to jump from a vertex to an adjacent vertex using the existing edges.

You’ll have to learn a few additional operations to traverse a graph. The first set of operations are inE and outE. These two operations let you select all the edges respectively ending at and starting from the selected vertex. The example below shows the result (in green) of the outE operation for a given vertex.

The following query returns all the edges labelled participates that start from the vertex representing a given karateka.

g.V().has("karateka", "fullName", "Charlier, Alice").outE("participates")

The result of this query is a list of edges. From this result we can see that the starting node is always the given karateka and that each edge links to a bout.

[
  {
    "id": "4a33a619-593a-469e-88ff-983e951149ed",
    "label": "participates",
    "type": "edge",
    "inVLabel": "bout",
    "outVLabel": "karateka",
    "inV": "bout.120",
    "outV": "karateka.70",
    "properties": {
      "color": "aka"
    }
  },
  {
    "id": "517e4b17-95ea-4ad0-a74c-45c083587a46",
    "label": "participates",
    "type": "edge",
    "inVLabel": "bout",
    "outVLabel": "karateka",
    "inV": "bout.116",
    "outV": "karateka.70",
    "properties": {
      "color": "aka"
    }
  },
  ... (many more objects)
]
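
The inE step goes in the opposite direction. As a quick sketch (this query doesn’t appear in the original post), the following would select the participates edges arriving at one of the bouts returned above:

g.V("bout.120").inE("participates")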

The same kind of operations exists for selecting the vertices at the start or at the end of a selected edge. These operations are named outV and inV respectively.

The following query starts from a karateka, jumps to the edges labelled participates and then jumps to all the vertices at the end of the previously selected edges.

g.V()
.has("karateka", "fullName", "Charlier, Alice")
.outE("participates")
.inV()

The result of this query is a list of bouts (based on our previously defined graph schema).

[
  {
    "id": "bout.120",
    "label": "bout",
    "type": "vertex",
    "properties": {
      "pk": [
        {
          "id": "bout.120|pk",
          "value": 1
        }
      ]
    }
  },
  {
    "id": "bout.116",
    "label": "bout",
    "type": "vertex",
    "properties": {
      "pk": [
        {
          "id": "bout.116|pk",
          "value": 1
        }
      ]
    }
  },
  ...
]

Most of the time, you don’t really want to select the edges. They are just a means to go from one vertex to an adjacent vertex. For convenience, Gremlin supports two operations, out and in. They are the equivalent of outE followed by inV and of inE followed by outV, respectively.
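
As a quick illustration, the two traversals below should return exactly the same bouts (the first one reuses the outE/inV query from above):

g.V().has("karateka", "fullName", "Charlier, Alice").outE("participates").inV()

g.V().has("karateka", "fullName", "Charlier, Alice").out("participates")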

The following drawing shows that, starting from the black vertex and using an in operation, you’ll directly select the three adjacent vertices.

A good use of graph traversal is to find the names of all the karatekas having defeated a given karateka. To write this query, we first select the initial karateka, then follow the edges labelled loses to reach all the bouts where this karateka has been defeated. From this list of bouts, we just need to follow the edges labelled wins to find the winners.

The following query applies this pattern:

g.V()
.has("karateka", "fullName", "Charlier, Alice")
.out("loses")
.in("wins")

The result of this query is a list of karatekas. If we just want to display their names, we can add a values operation.

g.V()
.has("karateka", "fullName", "Charlier, Alice")
.out("loses")
.in("wins")
.values("fullName")

The result will be a list of strings (results have been anonymized for GDPR reasons):

[
  "B****, Rania",
  "H****, Maëlie",
  "P****, Denisa"
]

We can go a bit further and check whether these karatekas have themselves already been defeated by someone. To achieve this, I apply the exact same pattern and follow the loses and wins edges from the second karateka.

g.V()
.has("karateka", "fullName", "Charlier, Alice")
.out("loses")
.in("wins")
.out("loses")
.in("wins")
.values("fullName")

The result is also a list of strings:

[
  "D****, Louise",
  "F****, Julia",
  "F****, Julia",
  "S****, Kiara",
  "K****, Lee-lou",
  "Z****, Manon",
  "E****, Lena",
  "A****, Ezdina",
  "B****, Margaux",
  "B****, Zineb",
  "B****, Rania",
  "D****, Louise"
]

As you can see, the first and last entries are identical, and so are the second and third. The reason is that these two karatekas have defeated one of the three karatekas listed above twice (or two of them once each). It’s really important to understand that Gremlin doesn’t automatically deduplicate the vertices. If you want to achieve this, just use the dedup function:

g.V()
.has("karateka", "fullName", "Charlier, Alice")
.out("loses")
.in("wins")
.out("loses")
.in("wins")
.dedup()
.values("fullName")

Note that I applied the dedup function to the vertices and not to the property fullName. The reason is to avoid considering two karatekas that are just homonyms as duplicates.
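
To make the difference concrete, here is a variant (purely illustrative, not taken from the original post) that deduplicates the names themselves; with this version, two distinct karatekas who happen to share the same fullName would collapse into a single entry:

g.V()
.has("karateka", "fullName", "Charlier, Alice")
.out("loses")
.in("wins")
.out("loses")
.in("wins")
.values("fullName")
.dedup()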

The next blog post will be about the step modulator … and how to return a result where the selected vertices are not the last vertices traversed by the query!

Querying, adding or dropping vertices’ properties with the Gremlin language

In previous blog posts, I explained the use-case, how to load data into an Azure Cosmos DB graph database with the bulk executor library and how to write your first queries with the Gremlin language. This blog post is dedicated to explaining how to query, add or remove properties on a vertex.

Properties are always attached to a vertex or an edge and give some information about this vertex or edge. Using Azure Cosmos DB, you’ll always have a property named pk that is mandatory and automatically generated by Azure Cosmos DB:

      "pk": [
        {
          "id": "karateka.70|pk",
          "value": 1
        }
      ],

pk stands for partition key. Partition keys are an important feature of Azure Cosmos DB but I’ll explain them in a future dedicated post.

Other properties always have the same structure. The name of the JSON field is the name of the Gremlin property. Each JSON field contains an array of values for this property: in Gremlin, a property can hold several values, and we’ll come back to this later in this post. Each property value has an id, a GUID automatically assigned by the Azure Cosmos DB engine, and a second field named value holding the effective value of the property.

      "fullName": [
        {
          "id": "5c4a9e2e-4c62-41ab-9732-9bd39c0a6837",
          "value": "Charlier, Alice"
        }
      ],
      "firstName": [
        {
          "id": "b5c99f14-2bd2-4700-a42f-3bf97042e351",
          "value": "Alice"
        }
      ],
      "lastName": [
        {
          "id": "5cef6756-8ddc-45bf-95b9-3ee858aa0a12",
          "value": "Charlier"
        }
      ]

If you want to add a property to a vertex, you don’t need to perform any explicit change to the underlying schema. Graph databases are schema-less, which means that you can quickly add a new property just by specifying it. Naturally, this has some drawbacks: any typo could create a new property instead of editing an existing one.

To add a property to a vertex, you first need to select this vertex as explained in the previous blog post and then apply the property operator. This operation expects the property’s name and value as parameters:

g.V()
  .has("karateka", "fullName", "Charlier, Alice")
  .property("nationality", "Belgian")

Executing this query will return the selected vertex and you’ll quickly see that the property has been added.

[
  {
    "id": "karateka.70",
    "label": "karateka",
    "type": "vertex",
    "properties": {
      "pk": [
        {
          "id": "karateka.70|pk",
          "value": 1
        }
      ],
      ...
      "nationality": [
        {
          "id": "f0387c84-42e4-43ea-badb-7bfecc75443e",
          "value": "Belgian"
        }
      ]
    }
  }
]

It’s possible to add a property to more than one vertex. To achieve this, just select multiple vertices and define the common property. In the example below, I’m assigning the same birthday to my twin daughters.

g.V()
  .has("karateka", "fullName", within(
      ["Charlier, Alice", "Charlier, Clémence"]))
  .property("birthDate", "2010-07-04")

As explained above, Gremlin natively supports multi-valued properties. If you want to define a property holding more than one value, you’ll have to pass the keyword list as the first parameter of the property operator. The next query adds a few middle names to a karateka.

g.V()
  .has("karateka", "fullName", "Charlier, Alice")
  .property(list, "middleName", "Coralie")
  .property(list, "middleName",  "Florence")

The result is the following:

     "middleName": [
        {
          "id": "a4ec9a1d-caf0-4ba5-9611-eae448f6c684",
          "value": "Coralie"
        },
        {
          "id": "ba1ce36a-1369-4c30-a4b2-08391c9b5a30",
          "value": "Florence"
        }
      ]

To return the values of some properties for a given vertex, you’ll need to apply the values operator and specify the properties’ names.

g.V()
  .has("karateka", "fullName", "Charlier, Alice")
  .values("lastName", "firstName")

This query returns a really flat and compact JSON document with just the values of the properties.

[
  "Charlier",
  "Alice"
]

When selecting multiple vertices, this operator can be of little use because the values are not grouped by vertex and you don’t know which value maps to which property. If you want a more explicit view of the properties, you can use the valueMap operator.

g.V()
  .has("karateka", "lastName", "Charlier")
  .valueMap("nationality", "firstName", "lastName")

The result of this query will be an explicit JSON document, listing all the requested properties and grouping them by vertex. Once again, if a property doesn’t exist for a given vertex, it won’t be an issue.

[
  {
    "nationality": [
      "Belgian"
    ],
    "firstName": [
      "Alice"
    ],
    "lastName": [
      "Charlier"
    ]
  },
  {
    "firstName": [
      "Clémence"
    ],
    "lastName": [
      "Charlier"
    ]
  }
]

If you want to remove a property, you’ll have to select it and then drop it.

g.V()
  .has("karateka", "fullName", "Charlier, Alice")
  .properties("birthDate")
  .drop()

It’s also possible to drop several properties on several vertices. If some of the properties don’t exist, no exception is raised: once again, a benefit of schema-less databases.

g.V()
  .has("karateka", "fullName", 
     within(["Charlier, Alice", "Charlier, Clémence"]))
  .properties("middleName", "birthDate")
  .drop()

Exactly the same approach can be applied to an edge to add or drop properties!
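
A minimal sketch of the same steps applied to edges, assuming edge properties behave like vertex properties in the Cosmos DB Gremlin API (the property name verified is purely illustrative and not part of the dataset used in this series):

g.V()
  .has("karateka", "fullName", "Charlier, Alice")
  .outE("participates")
  .property("verified", true)

g.V()
  .has("karateka", "fullName", "Charlier, Alice")
  .outE("participates")
  .properties("verified")
  .drop()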

The next step in this series of blog posts is to traverse the graph, jumping from one vertex to another using the edges.


First steps with Gremlin to query a graph loaded in Azure Cosmos DB

In previous blog posts, I explained the use-case and also how to load data into an Azure Cosmos DB graph database with the bulk executor library. This blog post is dedicated to explaining how to perform basic queries with Gremlin on this dataset.

You can use two tools to query a graph hosted by Azure Cosmos DB. The first tool is embedded in the Azure Portal and is named Data Explorer.

cosmos-data-explorer

The second tool is a freely available extension of Visual Studio Code named Azure Cosmos DB. This extension lets you browse and query your MongoDB databases both locally and in the cloud using scrapbooks, but also write queries in Gremlin and display the results as a graph or as JSON documents.

cosmos-visual-code-extension

Before we go further, a little bit of theory about the Gremlin query language.

  • Each query starts with g
  • V() stands for vertices and returns one or more vertices
  • E() stands for edges and returns one or more edges
  • hasLabel("label") filters vertices/edges based on their label (type)
  • hasId("id") filters vertices/edges based on their id (which must be unique)
  • has("propertyName", "value") filters vertices based on the value of any property

With these few elements, it’s already possible to write some interesting queries. The first two queries respectively retrieve all the bouts and all the karatekas:

g.V().hasLabel("bout")
g.V().hasLabel("karateka")

If you want to retrieve a specific karateka and you know her id, you can use either of the two following variants. Both return a unique result.

g.V().hasId("karateka.1") 
g.V("karateka.1")

The graph view doesn’t really provide much value here.

cosmos-query-id-graph

But the JSON view offers the opportunity to confirm that when returning a vertex, we’re also returning each of its properties!

cosmos-query-id-json

Most of the time, you don’t know the id of a vertex and you’ll need to perform a search through the graph to find it. In the first example, we’re looking for a karateka named Alice and in the second we’re looking for two karatekas.

g.V().has("karateka", "fullName", "Charlier, Alice") 
g.V().has("karateka", "fullName", 
   within(["Charlier, Alice", "Charlier, Clémence"]))

The first query is identical in terms of result to the following query:

g.V().has("fullName", "Charlier, Alice")

But the first version is more performant. Indeed, by specifying that you’re looking for a karateka, the engine avoids searching through all the vertices that don’t have the label karateka. It’s a best practice to always specify this information when possible.
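
If you prefer, the label filter can also be written as an explicit hasLabel step; assuming standard Gremlin behaviour, the following should be equivalent to the three-argument has used above:

g.V().hasLabel("karateka").has("fullName", "Charlier, Alice")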

In the next blog posts we’ll see how to add or remove some properties on a vertex or an edge.

Loading data into an Azure Cosmos DB graph database with the bulk executor library

In a previous blog post, I started to explain how to model karate tournaments’ results within a graph database. Now, we’ll see how we can efficiently load these data into Azure Cosmos DB with the bulk executor library. This library allows you to perform bulk operations in Azure Cosmos DB through bulk import and bulk update APIs (currently limited to SQL and Gremlin API).

One of the biggest benefits of using this library is that you don’t need to handle request rate limiting, request timeouts, and other transient exceptions; the library takes care of these tedious tasks for you. Another good point is that it significantly reduces the client-side compute resources needed to saturate the throughput allocated to a container.

For this case, my data is stored in a kind of flat file. I’ll explain the format below but, if you want to see the full picture, it’s an intermediary file with raw data already aligned with the structure of my graph. In this file, you’ll find a section defining karatekas, another for tournaments, but also bouts, divisions, etc. All these entities are vertices, and each section starts with the label of the set of vertices between brackets. The next line contains the names of the properties separated by a semicolon (;). The following lines contain all the vertices and their respective properties.

[tournament]
id;name;date
1;FFKAMA Championship 2017;2017-01-29
2;FFKAMA Championship 2018;2018-01-28
3;GFK Championship 2018;2018-04-15
...

The transition between two sets of vertices or edges is marked by an empty line.

The label of a set of edges starts with a dash (-) and ends with a dash and a greater-than symbol (forming an arrow). Example: -belongs to->. As with vertices, the second line identifies the properties. Two important properties are assigned to every edge: the source and the destination. These two properties are vertices and are identified as such by brackets surrounding the property name. They are always the first and second properties.

-loses->
[karateka];[bout];point;senshu
33;67;0;
75;68;
30;69;4;false
...

The example above represents the edges labeled “loses”. These edges represent one of the possible links between a karateka (source) and a bout (destination). Both source and destination are identified by their corresponding id. As you can see, each property, except the source and destination, can be unset.

Now that we have our flat file to import, we need to write a reader. This reader is implemented in the class GraphCsvReader and retrieves the labels and properties of each vertex and edge stored in the flat file. Once read, all this information is transformed into vertices or edges; that’s the role of the class CosmosDbBulkGraphBuilder. Once we have these objects, we forward them to the bulk executor library provided by Microsoft in order to load them into Azure Cosmos DB. This library is leveraged in the class CosmosDbBulkWriter.

read-transform-bulk-load-process

A few key insights about the code of CosmosDbBulkWriter.

If you’re familiar with SqlConnection or OleDbConnection for SQL Server or most relational databases, you could be surprised not to find an equivalent for Azure Cosmos DB. The closest match to the notion of a connection is the class DocumentClient. It is a client-side logical representation of Azure Cosmos DB and is used to configure and execute requests against the service.

The GraphBulkExecutor, provided by the bulk executor library, is the class that effectively bulk loads the data into Azure Cosmos DB. Before instantiating and initializing an object of this type, you should set the retry options of the DocumentClient to high values; after the initialization, you can reset them to lower values. Indeed, the initialization of a GraphBulkExecutor will probably put a higher workload than usual on your Azure Cosmos DB account and generate some internal, hidden throttled requests.

If you want more details about the code, feel free to take a look at the GitHub project. About performance, the dataset that I’m using has around 3000 elements (676 vertices and 2323 edges). In terms of entities, it means more than 200 karatekas, the results of 350 bouts and the rankings of 130 karatekas (I didn’t load the ranking of a tournament if I had the results of each bout), belonging to 25 tournaments. Reading my flat file and creating the vertices and edges on the client side takes less than 70 ms … extremely fast. The loading time depends on how many RU/s you assign to your collection.

If you assign 400 RU/s, you’ll need around 45 s. Switching to 800 RU/s brings the wait down to 22 s. Scaling up to 2000 RU/s decreases the time to load the dataset to around 8.5 s. That’s in line with the first results … if I check how many RU/s are consumed during the first two attempts, I get around 400 RU/s and 2000 RU/s … logical. Changing the setting to 8000 RU/s only decreases the time required to load the dataset to 4.5 s … meaning that I’m limited to an effective value of 4300 RU/s! The main reason for this is that my client is the bottleneck and is not able to send the data as fast as Azure Cosmos DB can consume it.

cosmos-load-time

Now that we’ve loaded the dataset, we’ll see how to query this database with the use of Gremlin in the next blog post.


Encrypting connection strings for NBi

In some companies, encryption of connection strings and other security-related information is required and checked by security policies. NBi supports these requirements and offers the possibility to encrypt the connection strings in the config file. NBi tries not to reinvent the wheel and, as such, relies on the .NET framework’s native features for the encryption of connection strings.

The first step to consider when using encrypted connection strings is to move them to the config file. In your test-suite you should find the following reference to a connection string named myDB:

<settings>
  <default apply-to="system-under-test">
    <connectionString>@myDB</connectionString>
  </default>
</settings>

And in the config file, you should have the information about the myDB connection string:

<configuration>
  <configSections>
    <section name="nbi" type="NBi.NUnit.Runtime.NBiSection, NBi.NUnit.Runtime"/>
  </configSections>
  <nbi testSuite="TestSuite.nbits"/>
  <connectionStrings>
    <add name="myDB" connectionString="Provider=SQLNCLI11;Server=****.database.windows.net;Database=AdventureWorks2012;UId=****;Pwd=****;" />
  </connectionStrings>
</configuration>

The following walkthrough uses the RsaProtectedConfigurationProvider. This provider uses RSA encryption to encrypt and decrypt configuration data.
Before you and the user running your test-suite can decrypt encrypted information in the config file, your identities must have read access to the encryption key that is used to encrypt and decrypt the protected sections. This walkthrough uses the default provider specified in the Machine.config file, named RsaProtectedConfigurationProvider. The RSA key container used by this default provider is named NetFrameworkConfigurationKey.

To grant read access to an RSA encryption key, you’ll need to use aspnet_regiis. This utility can be found in any version of the .NET framework installed on your machine; the latest version is available at C:\Windows\Microsoft.NET\Framework\v4.0.30319. Granting read access to the user xyz can be done with the following command (you’ll probably need admin rights to execute it successfully):

aspnet_regiis -pa "NetFrameworkConfigurationKey" "xyz"

Once read access is granted to your development account, don’t forget to also add the account executing your test-suite (if they are different accounts).

The next step is to rename your config file to web.config. It might sound weird, but the encryption tool doesn’t look for any files other than web.config files!

After renaming your config file to web.config, you’ll need to use aspnet_regiis again to encrypt your file. Use the following command:

aspnet_regiis -PeF "connectionStrings" "C:\Full Path\MyTestSuite"

Note that the section name connectionStrings is case-sensitive: mind the upper-case “S”. You should also be careful that the last parameter is the path of your project (and not the path of your config file). Last but not least, the path of your project shouldn’t end with a backslash “\”.

Now, you should open the web.config file and you’ll find the following section:

<connectionStrings configProtectionProvider="RsaProtectedConfigurationProvider">
  <EncryptedData Type="http://www.w3.org/2001/04/xmlenc#Element"
    xmlns="http://www.w3.org/2001/04/xmlenc#">
    <EncryptionMethod Algorithm="http://www.w3.org/2001/04/xmlenc#tripledes-cbc" />
    <KeyInfo xmlns="http://www.w3.org/2000/09/xmldsig#">
      <EncryptedKey xmlns="http://www.w3.org/2001/04/xmlenc#">
        <EncryptionMethod Algorithm="http://www.w3.org/2001/04/xmlenc#rsa-1_5" />
        <KeyInfo xmlns="http://www.w3.org/2000/09/xmldsig#">
          <KeyName>Rsa Key</KeyName>
        </KeyInfo>
        <CipherData>
          <CipherValue>aloQVZs+EU3icflVqx+kl9TgCrJZ+qw+fMG5zu0y5SbjkflNgOPtv/Id0H07jNHW5QXA5dcTUa8vMXb4evJMqv281/PTnTq9D4+YtpS5n2eeoGNrlkenHA4L2hOkwbO5A5M8hRAm6MMCjWuvmIgxnczH+BY6tAMAfyU53cjkeWyYOL5SBbmeq0iZ3xcm256VDojqQUdddhuLzlBDQ/FPKeDEJhV9TsbQmaWxmkQ7ftWKsVrhgkzIiqlVjyUw/KM6S2iW/CwayOXhyOZhxYqZAVy6BmaE943/Hoky/UG8E1aOaLBrmUEt+ahl7hru/RZb2wNacGqCO5y+X8TqFdpk0g==</CipherValue>
        </CipherData>
      </EncryptedKey>
    </KeyInfo>
    <CipherData>
      <CipherValue>b2K1WbCd+0mOj5L6xL3ZczWqsgNwdV/RP6jqEA7U2ULYigXF7VccUS7LP7FIRGfVWPcgxQvVHTXanvfY+HKv6J8QfJV7IUopcrn9PYZYQBjm5gZ61AZA5ePfI16GaLsoPk4+VGyxjNCXwoaNSRLgUotA5vyA1cb7VuKKbGZMYixb7L9xPUj9sm5kb9r2PLGjjWDBKhGKBTxn6dGDnUHMjaBk1IXixuj8z1kx4l4CwvUtyLExddeWVu32PkgkIczlzxNb5VG11lI3M6KcptI/fkCCg9Vs/ZFd</CipherValue>
    </CipherData>
  </EncryptedData>
</connectionStrings>

Your connection string is now encrypted. You can reverse the encryption by using the following command:

aspnet_regiis -PdF "connectionStrings" "C:\Full Path\MyTestSuite"

To use your encrypted config file (don’t forget to re-encrypt it if you just decrypted it), you’ll need to rename it back to its original name.

Then, just run your test-suite! There is nothing else to configure. With the help of native .NET features, NBi will understand that the config file is encrypted, where the encryption key is stored and how to decrypt it.

Extracting key phrases with the Text Analytics API and Power BI

In a previous series of blog posts, I explained how to grab the list of Data Platform MVPs from a website and add additional data about them, such as their GitHub contributions. My last article also explained how to use the Face API to guess the gender of these MVPs. I’ll now parse their biographies to extract key phrases and display the most used themes in these texts. To support this need, I’ll use the Text Analytics API provided by Azure Cognitive Services.

The Text Analytics API is a suite of text analytics web services built with best-in-class Microsoft machine learning algorithms. The API can be used to analyze unstructured text for tasks such as sentiment analysis, key phrase extraction and language detection. No training data is needed to use this API; just bring your text data. This API uses advanced natural language processing techniques to deliver best in class predictions.

The first step of this exercise is to grab the biographies. We previously discussed how to parse a web page and retrieve pertinent information. In this case we need to look for the HTML tag <pre>, extract the text and clean it up.

let
    GetBiography = (url as text) as text=>
let
    source = Web.Contents(url),
    lines = Lines.FromBinary(source),
    biography = List.Select(lines, each Text.Contains(_, "<pre>")){0},
    cleanedBiography = ConvertHtmlSpecialChars(Text.BeforeDelimiter(Text.AfterDelimiter(Text.Clean(Text.Trim(biography)),">"), "<"))
in
    try cleanedBiography otherwise ""
in
    GetBiography

We need to set up a Text Analytics API service on our Azure account. If you don’t know how to do this, refer to the previous article of this series. About pricing, selecting the free tier (F0) is probably the best approach. We’ll have a very limited usage of this API, with just one call per run.

text-analytics-api-pricing.png

Indeed, unlike with the Face API, we don’t need to perform one call to the Text Analytics API per MVP. We only need to call this API once with all the biographies. The API expects a list of records (with the language, an id and the biography). The language will always be “en” (standing for English) and, for the id, we’ll use the MvpId. Preparing this structure is performed in the function PrepareKeyPhrases.

let
    PrepareKeyPhrases = () =>
let
    source = Table.SelectColumns(#"All info",{"MvpId", "Biography"}),
    filtered = Table.SelectRows(source, each [Biography] <> ""),
    addEnglish = Table.AddColumn(filtered, "language", each "en"),
    renamed = Table.RenameColumns(addEnglish,{{"Biography", "text"}, {"MvpId", "id"}}),
    reordered = Table.ReorderColumns(renamed,{"language", "id", "text"}),
    list = Table.ToRecords(reordered)
in
    list
in
    PrepareKeyPhrases

Once we have the expected input for the query, we just need an additional function to call the Text Analytics API and its keyPhrases method. This Power Query M function is named GetKeyPhrases, receives the result of PrepareKeyPhrases as a parameter and uses the same approach as for the Face API, detailed in this article. The most complex part of this query is transforming the structure (a list of records) that we previously prepared into JSON text. We can achieve this by converting the structure to a JSON document with Json.FromValue and then serializing the newly created binary to text with the help of Text.FromBinary.

let
    GetKeyPhrases = (list as list) =>

let
    headers = [#"Content-Type" = "application/json", #"Ocp-Apim-Subscription-Key" = TextApiKey],
    jsonText = Text.FromBinary(Json.FromValue(list)),
    content = "{ ""documents"": " & jsonText & "}",
    source = Web.Contents(TextApiUrl, [Headers = headers, Content=Text.ToBinary(content)])
in
    source
in
    GetKeyPhrases

We’ll need to provide a table for these keyPhrases. This new table will be linked to the main table with all the MVP details in the model.

let
    source = GetKeyPhrases(PrepareKeyPhrases()),
    json = Json.Document(source,65001)[documents],
    toTable = Table.FromList(json, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    expandColumns = Table.ExpandRecordColumn(toTable, "Column1", {"id", "keyPhrases"}, {"MvpId", "Word"}),
    expandList = Table.ExpandListColumn(expandColumns, "Word"),
    nonNullRows = Table.SelectRows(expandList, each [Word] <> null),
    capitalized = Table.TransformColumns(nonNullRows,{{"Word", Text.Proper, type text}}),
    cleaned = Table.TransformColumns(capitalized,{{"Word", Text.Clean, type text}}),
    trimmed = Table.TransformColumns(cleaned,{{"Word", Text.Trim, type text}})
in
    trimmed

Now that we have this list of keywords, we can freely use the Word Cloud custom visual.

text-analytics-word-cloud.

It’s interesting to compare some of the keywords extracted for the recently awarded MVPs.

text-analytics-api-new-mvp

and the long-time MVPs.

text-analytics-api-long-time-mvp

At first sight, book sounds like a keyword associated with long-time MVPs and Power BI with the recently awarded ones.

Using the Face API with Power BI

In a previous series of blog posts, I explained how to grab the list of Data Platform MVPs from a website and add additional data about them, such as their GitHub contributions. In this sequel, I’ll also add the gender of the MVPs … based on their profile pictures. To achieve this I’ll rely on the Face API, part of the Azure Cognitive Services.

Let’s start by creating a service on your Azure subscription. If you have no Azure subscription, don’t panic: you can create a new one and you’ll receive some free credits to try a few services, including the Face API. Also, if you want, you can use the Face API for free with the F0 service level, which limits calls to twenty per minute.

To create a Face API service in your subscription, click on “add resources”, search for “face” and select the Face API:
face-api

Then select the Face service in the list:
face-api

After this selection, give a name to your service, select the resource group and choose the level of service that you’re expecting:

face-api-pricing

Now, we’ll go back to Power BI and create a new query, or to be exact a new function in Power Query. This function will call the Face API, passing the url of the picture as a parameter. The return value will be a text with the possible values male, female or unknown. I can already create two parameters related to the Face API:

  • The base url, which depends on where your service is hosted. I named this parameter FaceApiUrl. This information is located in the Azure portal, in the overview of your Face API service. face-api-url
  • The key of your service. I named this parameter FaceApiKey; the value is also located in the Azure portal, but in the keys section. face-api-key

Based on the documentation of the Face API, I’ll have to send a POST request to the service. The request must be sent to a base url, specifying which attributes should be computed by the API. In this specific case I’m only interested in the gender. I can build a record for the query parameters:

query = [#"returnFaceAttributes"= "gender"]

As previously explained by Chris Webb (blog post) or Erik Svensen (blog post), to submit a POST request you’ll need two tricks:

  1. You must specify a content. This switches the execution context of Web.Contents from a GET request to a POST request.
  2. The text of the content must be transformed into a binary.

To create the content, I refer to the documentation of the Face API: I need to create a JSON document with just one attribute, the url of the picture.

content = "{ ""url"": """ & url & """}"

To submit a valid request, I also have to pass the correct key (credential) to the Face API and specify that my content-type is application/json. These two pieces of information must be specified in the headers of my request. The headers field also expects a record, so I’ll use the following construction.

headers =
   [#"Content-Type" = "application/json",
   #"Ocp-Apim-Subscription-Key" = FaceApiKey]

Now that the different components of my request are built, I just need to invoke the request with the help of the function Web.Contents().

response = Web.Contents(
   FaceApiUrl,
   [
      Query = query,
      Headers = headers,
      Content=Text.ToBinary(content)
   ]
)

Once we get the response, we just need to parse it to extract the gender.

gender = Json.Document(response,65001){0}[faceAttributes][gender]

The complete function should handle some unexpected errors, such as a missing picture or a picture whose quality is not high enough to detect the gender.

let
   GetGender = (url as text) =>
let
   query = [#"returnFaceAttributes"= "gender"],
   headers = [#"Content-Type" = "application/json", #"Ocp-Apim-Subscription-Key" = FaceApiKey],
   content = "{ ""url"": """ & url & """}" ,
   source = Web.Contents(FaceApiUrl, [Query = query, Headers = headers,    Content=Text.ToBinary(content)]),
   gender = Json.Document(source,65001){0}[faceAttributes][gender]
in
   try gender otherwise "unknown"
in
   GetGender

In the end, I’m able to create this visual and see that the percentage of women among the Data Platform MVPs is still really low.

face-api-gender

Analyzing sport results with a graph and Azure Cosmos DB

In previous blog posts, I used Power BI to analyze the results of my daughters participating in gymnastics competitions. In this new series, I’ll analyze their results in karate (kumite to be precise) … using a larger set of tools: Azure Cosmos DB, Power BI and Azure Databricks.

Let’s start with the analysis to perform: I want to understand their results (is a defeat expected or could it be considered a counter-performance) or predict them (what’s the likelihood that we’ll have to stay at the sports hall until the podium ceremony).

Before diving into the results, a few things that you should know about karate tournaments:

  • The process is the same as in any tennis tournament: the bout’s winner goes to the next round and the loser is out.
  • Except that if you’ve been defeated by someone who reaches the final, you’ll be enlisted in a second stage with the other competitors defeated by the same finalist and you’ll compete for the bronze medal.
  • It means that there are two bronze medals (because there are two finalists) and not one.
  • There is always a winner of a bout: in case of a draw at the end of the time allocated to a bout, the first competitor who scored is the winner (it’s a bit more complex but let’s keep it simple). This advantage is named “senshu”. In case of 0-0, the judges take the decision.
  • Categories are based on age. In most sports, your age is determined by your birth year … not in karate, where it’s based on your birthdate. It means that you can be in the same category as another competitor for months and suddenly, in the middle of the season, be in another category just because one of you has a birthday.
  • Categories are also based on weight. But the weight bounds are not the same at each tournament, and weight is also something really dynamic with children.

Who said it would be easy to analyze? Based on the last bullets, I decided to collect all the results of the categories surrounding my daughters’ age (fortunately, they are twins) and to take all the sub-categories based on weight.

Based on SportsML wording, all the karatekas participating in the same category of a tournament are part of a division.

Now that we have our goal and some rules, let’s download the results. We have a major issue: results are partial. For some competitions, you only have the podium. It means that for some tournaments, I have no notion of the bouts’ results, just the final standings.

Based on that, we have the following (main) entities: the tournament, the category (age/weight) derived into a division, the bouts (part of a round and a stage) and the competitors (karatekas). For the relations, we have: participates/wins/loses (between karatekas and bouts), belongs to (between tournaments, divisions, stages, rounds and bouts) and ranks (between a karateka and a division). I could go further and gather information about clubs and nationalities, but it’s out of scope at this point.

I took the decision to model this as a graph. Why? Three important factors drove me to this choice:

  • First argument: I’m not really interested in the entities themselves but more in their relations. That mostly excludes column-based and document models.
  • Second argument: I’ll look for patterns such as “Who has been defeated by A and has defeated C?”
  • karate-query-1

    or do we have any cycle such as “A defeated B, who defeated C, but C defeated A”.

    karate-query-2
    This kind of query is complex to express in SQL compared to graph query languages (Gremlin or Cypher).

  • Last argument: the partial results. A graph can easily handle partial information with its schema-less approach. You shouldn’t consider this argument from a storage point of view (I’m aware of the null value in relational databases). This argument makes sense if you consider that relations can be inferred in a graph, which is harder to put in place with a relational model.

Based on the considerations above, my schema looks like:
karate-full-schema

As you can see, the “defeated by” relation used in the query above is something that can be inferred from the wins/loses pattern between two karatekas and a bout.

karate-inferred-relation
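
To give an idea of what this inferred relation looks like once the graph can be queried, here is the Gremlin traversal used in the follow-up posts of this series to go from a karateka to the karatekas who defeated her:

g.V()
.has("karateka", "fullName", "Charlier, Alice")
.out("loses")
.in("wins")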

In the next post, I’ll explain how to load the dataset with the Azure Cosmos DB bulk executor library; then I’ll cover how to query the database with Gremlin to get the first insights about my daughters’ results.

Generating tests requiring a synchronous iteration over multiple arrays with genbiL

Today’s challenge is to create a test-suite, with genbiL, where each test is an all-rows assertion with an undetermined count of predicates. The predicates will be combined with an or operator. In practice, each test will have the following pattern for the assertion:

<all-rows>
  <combination operator="or">
    <predicate operand="...">
      <equal>...</equal>
    </predicate> 
  </combination>
</all-rows>

In the second part of this blog post, I already explained how to deal with multiple values for a template variable … so what’s different? The case that we’re facing is somewhat similar but not identical. In the previous blog post, each variable of the template (.nbitt) was set to an array by genbiL and the template iterated over the values of the array. But the iterations were independent of each other.

If you take a look at the predicate case that we’re facing now, we don’t have one value to iterate over but two: the operand (attribute of the xml element predicate) and the reference (value between the equal elements). More importantly, we need to iterate synchronously over these two values. It means that each time I iterate over a value of the operand, I also need to iterate over a value of the reference. Let’s take a tangible case: my first predicate is on field “foo” and I want to check if this field is equal to “1”. My second predicate is on field “bar” and checks if this field is equal to “5”. This should lead to the following assertion:

<all-rows>
  <combination operator="or">
    <predicate operand="foo">
      <equal>1</equal>
    </predicate> 
    <predicate operand="bar">
      <equal>5</equal>
    </predicate> 
  </combination>
</all-rows>

That clearly means that if I create a template variable named operand set with the values foo and bar, and a second template variable named reference set with the values 1 and 5, I can’t first iterate over the first variable and then over the second, or I’d get something such as:

<all-rows>
  <combination operator="or">
    <predicate operand="foo">
    </predicate> 
    <predicate operand="bar">
    </predicate> 
      <equal>1</equal>
      <equal>5</equal>
  </combination>
</all-rows>

Fortunately, StringTemplate is a powerful engine and manages this kind of synchronous iteration over two or more arrays (StringTemplate uses the wording multi-valued attribute rather than array). To achieve that, you’ll need to define an anonymous template in your template and pass two variables to this anonymous template.

$xArray,yArray:{ x,y | ...}$

This syntax means that you want to synchronously iterate over two arrays. This template iterates max(n,m) times where n and m are the lengths of xArray and yArray, respectively.

In our specific case the anonymous template will be:

$operands,references:{operand,reference|<predicate operand="$operand$">
          <equal>$reference$</equal>
        </predicate> 
        }$

There is nothing fancy to do in the genbiL code to support this kind of template: just load the cases and transform some variables into arrays.

case load file 'Acceptance\GenbiL\Resources\Predicate.csv';
case split columns 'operands', 'references' with value '/';

template load file 'Acceptance\GenbiL\Resources\Predicate.nbitt';
suite generate;
suite save as 'Acceptance\GenbiL\Resources\Predicate.nbits';

The csv file would be (.csv):

operands;references
foo/bar;1/5

and the full template (.nbitt):

<test name="All rows from '$field$' validate the rules.">
  <trait name="template-name">predicate.nbitt</trait>
  <system-under-test>
    <resultSet>
      <query>
        select * from myTable
      </query>
    </resultSet>
  </system-under-test>
  <assert>
    <all-rows>
      <combination operator="or">
        $operands, references:{operand, reference|<predicate operand="$operand$">
          <equal>$reference$</equal>
        </predicate> 
        }$
      </combination>
    </all-rows>
  </assert>
</test>