Running automated tests with the Azure Cosmos DB Emulator on AppVeyor

When developing an application relying on Azure Cosmos DB as the data backend, it’s always valuable to develop and run automated tests to check the integration between the business logic layer and the data layer. Nevertheless, creating a dedicated database hosted on Azure is an expensive and not very flexible solution. Fortunately, Microsoft provides the Azure Cosmos DB Emulator to develop and test applications interacting with Azure Cosmos DB. We can leverage this emulator to run our automated tests during the continuous build process. As usual, I’ll demonstrate the how-to with the CI/CD service AppVeyor, but this should be applicable to other CI/CD services (on-premises or cloud-based).

The good news with AppVeyor is that a recent version of the Azure Cosmos DB Emulator is already provided in the build image, so we don’t need to install anything beforehand. If you really want the latest version of this emulator, or if you’re using another CI/CD service, I’ll explain at the end of this blog post how to download and install the emulator (and also how to uninstall it).

The first step is to start the emulator. The emulator is located at C:\Program Files\Azure Cosmos DB Emulator\CosmosDB.Emulator.exe and this executable expects arguments defining the action to execute. By default, without any argument, you’ll just start the executable with its UI. In our case, we need to ask the emulator to start without displaying any UI, and we can also ask it to not start the Data Explorer (which is dedicated to UI interactions that we won’t be able to use during a build anyway). The two arguments to pass to the executable are: /NoUI /NoExplorer.

By default in v2.7, the emulator only enables the SQL API. If you want to use other APIs such as Cassandra, Gremlin or MongoDB, you’ll have to enable the corresponding endpoints. In this case, we’ll enable the Gremlin endpoint by providing the argument /EnableGremlinEndpoint. It’s also possible to define on which port the emulator will be listening for this endpoint, but I usually don’t change it. If you want an overview of all the flags that you can activate or disable in the emulator, just check on your own computer by submitting the command CosmosDB.Emulator.exe /help.

To start the emulator with the default endpoint for Gremlin in the context of a CI/CD service, the final command is:

"C:\Program Files\Azure Cosmos DB Emulator\CosmosDB.Emulator.exe"  /NoExplorer /NoUI /EnableGremlinEndpoint

You can place this command in a batch file named start-emulator.cmd.

If you just submit this command in the install step of your CI/CD workflow, you’ll face difficult times. Indeed, the emulator just keeps running and listening on its ports, so this command never returns. On a CI/CD build, it means that the command will run on the main process of the build and block everything; eventually your build will hang. To avoid this, you’ll need to execute this command in another process. With PowerShell, you can do this with the cmdlet Start-Process "start-emulator.cmd".
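
A minimal sketch of such an install step, assuming the batch file start-emulator.cmd sits at the root of the build folder:

# Launch the emulator in a separate process so that the install step returns immediately
Start-Process -FilePath ".\start-emulator.cmd"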

The emulator always takes a few seconds to fully start. It’s important to check that the port is open before effectively using it, or you could get unexpected exceptions due to the unavailability of the emulator. To mitigate this, start the emulator as soon as possible (before your build) and use it as late as possible (run the tests relying on it as the last tests). Nevertheless, it’s always a good idea to check that the emulator is correctly running before using it. To perform this check, you can assert that the port is effectively listening. The following PowerShell script does the job by trying to connect to the port at regular intervals as long as the port is not available. When the port is available, the script displays a positive message and continues.

$attempt = 0; $max = 5
do {
	$client = New-Object System.Net.Sockets.TcpClient([System.Net.Sockets.AddressFamily]::InterNetwork)
	try {    
		$client.Connect("127.0.0.1", 8901)
		write-host "Gremlin endpoint listening. Connection successful."
	}
	
	catch {
		$client.Close()
		if($attempt -eq $max) {
			write-host "Gremlin endpoint is not listening. Aborting connection."
		} else {
			[int]$sleepTime = 5 * (++$attempt)
			write-host "Gremlin endpoint is not listening. Retry after $sleepTime seconds..."
			sleep $sleepTime;
		}
	}
}while(!$client.Connected -and $attempt -lt $max)

Now that the emulator is started, you’ll need to create a database and a collection. It’s important to note that even if you want to create a graph database (Gremlin) and the Gremlin API is listening on port 8901 (by default), you’ll have to connect to the SQL API on port 8081 (by default) to create the database and the collection. This is valid for all API kinds and not just for Gremlin: if you want to manage your Azure Cosmos DB instance, you must connect to the SQL API.

To create a database and a collection (if they don’t exist), you’ll need the Microsoft.Azure.DocumentDB package. Once installed, you’ll be able to use the methods CreateDatabaseIfNotExistsAsync and CreateDocumentCollectionIfNotExistsAsync of the class DocumentClient. The response of these two methods contains information about what happened (the database/collection already existed or has just been created).

using (var client = new DocumentClient(sqlApiEndpoint, authKey))
{
    var database = new Database() { Id = databaseId };
    var databaseResponse = client.CreateDatabaseIfNotExistsAsync(database).Result;
    switch (databaseResponse.StatusCode)
    {
        case System.Net.HttpStatusCode.OK:
            Console.WriteLine($"Database {databaseId} already exists.");
            break;
        case System.Net.HttpStatusCode.Created:
            Console.WriteLine($"Database {databaseId} created.");
            break;
        default:
            throw new ArgumentException($"Can't create database {databaseId}: {databaseResponse.StatusCode}");
    }

    var databaseUri = UriFactory.CreateDatabaseUri(databaseId);
    var collectionResponse = client.CreateDocumentCollectionIfNotExistsAsync(databaseUri, new DocumentCollection() { Id = collectionId }).Result;
    switch (collectionResponse.StatusCode)
    {
        case System.Net.HttpStatusCode.OK:
            Console.WriteLine($"Collection {collectionId} already exists.");
            break;
        case System.Net.HttpStatusCode.Created:
            Console.WriteLine($"Database {collectionId} created.");
            break;
        default:
            throw new ArgumentException($"Can't create database {collectionId}: {collectionResponse.StatusCode}");
    }
}

Once the database is created, you’ll need to populate it before you can effectively run your tests on it. This is usually something that I code in the OneTimeSetUp of my test-suite. To proceed, you must create a few Gremlin commands to add vertices and edges such as:

g.V().drop()
g.addV('person').property('pk', 1).property('id', 'thomas').property('firstName', 'Thomas').property('age', 44)
g.addV('person').property('pk', 1).property('id', 'mary').property('firstName', 'Mary').property('lastName', 'Andersen').property('age', 39)
g.addV('person').property('pk', 1).property('id', 'ben').property('firstName', 'Ben').property('lastName', 'Miller')
g.addV('person').property('pk', 1).property('id', 'robin').property('firstName', 'Robin').property('lastName', 'Wakefield')
g.V().has('firstName','Thomas').addE('knows').to(g.V().has('firstName','Mary'))
g.V().has('firstName','Thomas').addE('knows').to(g.V().has('firstName','Ben'))
g.V().has('firstName','Ben').addE('knows').to(g.V().has('firstName','Robin'))

Then you’ll have to execute them. A tiny reminder: don’t forget that you’ll have to use the Gremlin API endpoint and no longer the SQL API endpoint! The driver to connect to this API is the Gremlin.Net driver, available on NuGet, and its source code is available on GitHub in the TinkerPop repository. Once again, the usage is relatively straightforward: you first create an instance of a GremlinServer and then instantiate a GremlinClient to connect to this server. Using the method SubmitAsync of the class GremlinClient, you’ll send your commands to the Azure Cosmos DB Emulator.

var gremlinServer = new GremlinServer(hostname, port, enableSsl, username, password);

using (var gremlinClient = new GremlinClient(gremlinServer, new GraphSON2Reader(), new GraphSON2Writer(), GremlinClient.GraphSON2MimeType))
{
    foreach (var statement in Statements)
    {
        var query = gremlinClient.SubmitAsync(statement);
        Console.WriteLine($"Setup database: { statement }");
        query.Wait();
    }
}

Now that your database is populated, you can freely connect to it and run your automated tests. Writing these tests is another story (and another blog post)!

Back to the first step … as stated before, I took the hypothesis that the Azure Cosmos DB Emulator was already installed on your CI/CD image and that the version suited your needs. If that’s not the case, it’s not the end of the world: you can install it by yourself.

Before you start uninstalling and installing this software, it can be useful to check the version of the current installation. The following PowerShell script displays it:

Write-Host "Version of ComosDB Emulator:" ([System.Diagnostics.FileVersionInfo]::GetVersionInfo("C:\Program Files\Azure Cosmos DB Emulator\CosmosDB.Emulator.exe").FileVersion)

If you need to install a new version of the Azure Cosmos DB Emulator, the first action is to remove the existing instance. If you don’t, the new installation will succeed but the emulator will never start correctly. The black magic to remove installed software is the Windows Management Instrumentation Command-line (wmic). This command is surely not fast, but it does the job.

wmic product where name="Azure Cosmos DB Emulator" call uninstall

Then, you can download the latest version of the Azure Cosmos DB Emulator at the following shortcut url: https://aka.ms/cosmosdb-emulator

Start-FileDownload 'https://aka.ms/cosmosdb-emulator' -FileName 'c:\projects\cosmos-db.msi'

Once downloaded, the installation is started with the following command:

cmd /c start /wait msiexec /i "c:\projects\cosmos-db.msi" /qn /quiet /norestart /log install.log

You should now be able to install, start and populate your Azure Cosmos DB Emulator on your favorite CI/CD server … just a few tests to write and you’ll have some end-to-end tests running with the Azure Cosmos DB Emulator.

The following appveyor.yml statements should do the job:

install:
# Reinstall the Azure Cosmos DB Emulator and start it (start-emulator.cmd enables the Gremlin endpoint)
- wmic product where name="Azure Cosmos DB Emulator" call uninstall
- ps: Start-FileDownload 'https://aka.ms/cosmosdb-emulator' -FileName 'c:\projects\cosmos-db.msi'
- cmd: cmd /c start /wait msiexec /i "c:\projects\cosmos-db.msi" /qn /quiet /norestart /log install.log  
- ps: Write-Host "Version of ComosDB Emulator:" ([System.Diagnostics.FileVersionInfo]::GetVersionInfo("C:\Program Files\Azure Cosmos DB Emulator\CosmosDB.Emulator.exe").FileVersion)
- ps: Start-Process "start-emulator.cmd"

Don’t forget to check/wait for the Emulator before executing your tests:

test_script:
- ps: .\wait-connect-cosmosdb.ps1

Return intermediate vertices by creating aliases with the Gremlin language

In previous blog posts, I explained the use-case, how to load data into an Azure Cosmos DB graph database with the bulk executor library, how to write your first queries with the Gremlin language, how to manage vertices’ properties and the first steps to traverse a graph. This blog post is dedicated to explaining how you can traverse a graph and return some vertices that are not at the end of the path.

Until now, we’ve always returned the last step of our query. But that’s not always what we want! In the previous query, we returned the karatekas having defeated the karatekas who defeated my daughter. One of the issues with this query was that we didn’t know who defeated whom.

We can fix this problem by returning tuples containing the two karatekas corresponding to the vertices in green on this drawing.

To achieve this, we’ll need to give an alias to the selection performed at the third intermediate step (and also to the last one). Gremlin has an operator named as to give aliases.

The following query is just assigning aliases to the interesting steps:

g.V()
.has("karateka", "fullName", "Charlier, Alice")
.out("loses")
.in("wins")
.as("loser")
.out("loses")
.in("wins")
.as("winner")

At this point, if you executed the query, the result would be identical to what it was previously. Just giving aliases is not enough; you’ll also have to instruct Gremlin to make some use of them. The simplest usage is to select them. This operation ensures that the aliased vertices are part of the result of the query.

g.V()
.has("karateka", "fullName", "Charlier, Alice")
.out("loses")
.in("wins")
.as("loser")
.out("loses")
.in("wins")
.as("winner")
.select ("winner", "loser")

The result of this query is a list of tuples containing each time a loser and a winner:

{
"loser": {
"id": "karateka.52",
"label": "karateka",
"type": "vertex",
"properties": {
…,
"fullName": [
{
"id": "3dec83c1-5f2e-41d4-98e9-8371cf6017b4",
"value": "B****, Rania"
}
],
...
]
}
},
"winner": {
"id": "karateka.59",
"label": "karateka",
"type": "vertex",
"properties": {
...
"fullName": [
{
"id": "44fdc18a-a3e7-47ba-8ce5-d7599548c267",
"value": "S****, Kiara"
}
...
},
{
"loser": {
...
},
"winner": {
...
},
...

It’s usually not desirable to get the whole list of properties. Most of the time we’re only interested in a subset of the properties, in this case the fullName. To project the vertices to a single property, we can use the operation by provided by Gremlin.

g.V()
.has("karateka", "fullName", "Charlier, Alice")
.out("loses")
.in("wins")
.as("loser")
.out("loses")
.in("wins")
.as("winner")
.select ("winner", "loser")
.by("fullName")

This query is limiting the result to the full names of the losers and the winners:

[
{
"loser": "B****, Rania",
"winner": "S****, Kiara"
},
...
{
"loser": "P****, Denisa",
"winner": "D*****, Louise"
}
]

The next blog post will be about filtering during the traversal of a graph.

Basic traversing of a graph with the Gremlin language

In previous blog posts, I explained the use-case, how to load data into an Azure Cosmos DB graph database with the bulk executor library, how to write your first queries with the Gremlin language and how to manage vertices’ properties. This blog post is dedicated to explaining how to jump from a vertex to an adjacent vertex using the existing edges.

You’ll have to learn a few additional operations to traverse a graph. The first set of operations are inE and outE. These two operations let you select all the edges respectively ending at and starting from the selected vertex. The example hereunder shows the result (in green) of the operation outE for a given vertex.

The following query returns all the edges having the label participates and starting from the vertex representing a given karateka.

g.V().has("karateka", "fullName", "Charlier, Alice").outE("participates")

The result of this query is a list of edges. From this result we can see that the starting node is always the given karateka and that each edge links to a bout.

[
{
"id": "4a33a619-593a-469e-88ff-983e951149ed",
"label": "participates",
"type": "edge",
"inVLabel": "bout",
"outVLabel": "karateka",
"inV": "bout.120",
"outV": "karateka.70",
"properties": {
"color": "aka"
}
},
{
"id": "517e4b17-95ea-4ad0-a74c-45c083587a46",
"label": "participates",
"type": "edge",
"inVLabel": "bout",
"outVLabel": "karateka",
"inV": "bout.116",
"outV": "karateka.70",
"properties": {
"color": "aka"
}
},
… (many more objects)

The same kind of operations exists for selecting the vertices at the start or the end of a selected edge. These operations are named outV and inV respectively.

The following query starts from a karateka, then jumps to the edges having the label participates and then jumps to all the vertices at the end of the previously selected edges.

g.V()
.has("karateka", "fullName", "Charlier, Alice")
.outE("participates")
.inV()

The result of this query is a list of bouts (based on our previously defined graph schema).

[
{
"id": "bout.120",
"label": "bout",
"type": "vertex",
"properties": {
"pk": [
{
"id": "bout.120|pk",
"value": 1
}
]
}
},
{
"id": "bout.116",
"label": "bout",
"type": "vertex",
"properties": {
"pk": [
{
"id": "bout.116|pk",
"value": 1
}
]
}
},

Most of the time, you don’t really want to select the edges. They are just a means to go from one vertex to an adjacent vertex. For convenience, Gremlin supports two operations, in and out. They are the equivalent of, respectively, outE followed by inV and inE followed by outV.

The following drawing explains that starting from the black vertex and using an in operation, you’ll directly select the three adjacent vertices.
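
As an illustration (a sketch reusing the karateka vertex of this series), the two traversals below should return exactly the same vertices; the second is simply the shorthand of the first:

g.V().has("karateka", "fullName", "Charlier, Alice").outE("participates").inV()
g.V().has("karateka", "fullName", "Charlier, Alice").out("participates")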

A good use of graph traversal is to get the names of all the karatekas having defeated a given karateka. To write this query, we first select the initial karateka, then follow the edges labelled loses to reach all the bouts where this karateka has been defeated. Having this list of bouts, we just need to follow the edges labelled wins to know the name of the winner.

The following query applies this pattern:

g.V()
.has("karateka", "fullName", "Charlier, Alice")
.out("loses")
.in("wins")

The result of this query is a list of karatekas. If we want to just display their names, we can add a values operation.

g.V()
.has("karateka", "fullName", "Charlier, Alice")
.out("loses")
.in("wins")
.values("fullName")

The result will be a list of strings (results have been anonymized for GDPR reasons):

[
"B****, Rania",
"H****, Maëlie",
"P****, Denisa"
]

We can go a bit further and check if these karatekas have already been defeated by someone or not. To achieve this, I’ll apply the exact same pattern and use the edges loses and wins from the second karateka.

g.V()
.has("karateka", "fullName", "Charlier, Alice")
.out("loses")
.in("wins")
.out("loses")
.in("wins")
.values("fullName")

The result is also a list of strings:

[
"D****, Louise",
"F****, Julia",
"F****, Julia",
"S****, Kiara",
"K****, Lee-lou",
"Z****, Manon",
"E****, Lena",
"A****, Ezdina",
"B****, Margaux",
"B****, Zineb",
"B****, Rania",
"D****, Louise"
]

As you can see, the first and last entries, but also the second and third, are identical. The reason is that these two karatekas have each defeated one of the three karatekas listed above twice (or two of them once). It’s really important to understand that Gremlin doesn’t automatically deduplicate the vertices. If you want to achieve this, just use the function dedup.

g.V()
.has("karateka", "fullName", "Charlier, Alice")
.out("loses")
.in("wins")
.out("loses")
.in("wins")
.dedup()
.values("fullName")

Note that I applied the function dedup to the vertices and not to the property fullName. The reason is to avoid considering two karatekas who are just homonyms as duplicates.

The next blog post will be about the step modulator … and how to return a result where the selected vertices are not the last vertices traversed by the query!

Querying, adding or dropping vertices’ properties with the Gremlin language

In previous blog posts, I explained the use-case, how to load data into an Azure Cosmos DB graph database with the bulk executor library and also how to write your first queries with the Gremlin language. This blog post is dedicated to explaining how to add or remove properties on a vertex.

Properties are always attached to a vertex or an edge and give some information about this vertex or edge. Using Azure Cosmos DB, you’ll always have a property named pk that is mandatory and automatically generated by Azure Cosmos DB:

      "pk": [
        {
          "id": "karateka.70|pk",
          "value": 1
        }
      ],

pk stands for partition key. Partition keys are an important feature of Azure Cosmos DB but I’ll explain them in a future dedicated post.

Other properties always have the same structure. The name of the JSON field is the name of the Gremlin property. Each JSON field contains an array of values for this property; indeed, in Gremlin each property is multi-valued, and we’ll come back to this later in this post. Each property’s value has an id, a GUID automatically attributed by the Azure Cosmos DB engine, and a second field named value keeping track of the effective value of the property.

      "fullName": [
        {
          "id": "5c4a9e2e-4c62-41ab-9732-9bd39c0a6837",
          "value": "Charlier, Alice"
        }
      ],
      "firstName": [
        {
          "id": "b5c99f14-2bd2-4700-a42f-3bf97042e351",
          "value": "Alice"
        }
      ],
      "lastName": [
        {
          "id": "5cef6756-8ddc-45bf-95b9-3ee858aa0a12",
          "value": "Charlier"
        }

If you want to add a property to a vertex, you don’t need to perform any explicit change to the underlying schema. Graph databases are schema-less. It means that you can quickly add a new property by just specifying it. Naturally, this has some drawbacks and any typo could create a new property instead of editing an existing one.

To add a property to a vertex, you first need to select this vertex as explained in the previous blog post and then apply the operator property. This operation expects the property’s name and the property’s value as parameters:

g.V()
  .has("karateka", "fullName", "Charlier, Alice")
  .property("nationality", "Belgian")

Executing this query will return the selected vertex and you’ll quickly see that the property has been added.

[
  {
    "id": "karateka.70",
    "label": "karateka",
    "type": "vertex",
    "properties": {
      "pk": [
        {
          "id": "karateka.70|pk",
          "value": 1
        }
      ],
      ...
      "nationality": [
        {
          "id": "f0387c84-42e4-43ea-badb-7bfecc75443e",
          "value": "Belgian"
        }
      ]
    }
  }
]

It’s possible to add a property to more than one vertex. To achieve this, just select multiple vertices and define the common property. In the example below, I’m assigning the same birthday to my twin daughters.

g.V()
  .has("karateka", "fullName", within(
      ["Charlier, Alice", "Charlier, Clémence"]))
  .property("birthDate", "2010-07-04")

As explained above, Gremlin natively supports multi-valued properties. If you want to define a property having more than one value, you’ll have to pass the keyword list as the first parameter of the operator property. The next query adds a few middle names to a karateka.

g.V()
  .has("karateka", "fullName", "Charlier, Alice")
  .property(list, "middleName", "Coralie")
  .property(list, "middleName",  "Florence")

The result is the following:

     "middleName": [
        {
          "id": "a4ec9a1d-caf0-4ba5-9611-eae448f6c684",
          "value": "Coralie"
        },
        {
          "id": "ba1ce36a-1369-4c30-a4b2-08391c9b5a30",
          "value": "Florence"
        }
      ]

To return the values of some properties for a given vertex, you’ll need to apply the operator values and specify the properties’ names.

g.V()
  .has("karateka", "fullName", "Charlier, Alice")
  .values("lastName", "firstName")

This query returns a really flat and compact JSON document with just the values of the properties.

[
  "Charlier",
  "Alice"
]

When selecting multiple vertices, this operator can be less useful because the values are not grouped by vertex and you don’t know the mapping between the property and the value. If you want a more explicit view of the properties, you can use the operator valueMap.

g.V()
  .has("karateka", "lastName", "Charlier")
  .valueMap("nationality", "firstName", "lastName")

The result of this query will be an explicit JSON document, listing all the requested properties and grouping them by vertex. Once again, if a property doesn’t exist for a given vertex, it won’t be an issue.

[
  {
    "nationality": [
      "Belgian"
    ],
    "firstName": [
      "Alice"
    ],
    "lastName": [
      "Charlier"
    ]
  },
  {
    "firstName": [
      "Clémence"
    ],
    "lastName": [
      "Charlier"
    ]
  }
]

If you want to remove a property, you’ll have to select it and then drop it.

g.V()
  .has("karateka", "fullName", "Charlier, Alice")
  .properties("birthDate")
  .drop()

It’s also possible to drop several properties on several vertices. If the properties don’t exist, it won’t raise an exception; once again, a benefit of schema-less databases.

g.V()
  .has("karateka", "fullName", 
     within(["Charlier, Alice", "Charlier, Clémence"]))
  .properties("middleName", "birthDate")
  .drop()

Exactly the same approach can be applied to an edge to add or drop properties!
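
As a quick sketch (the property name verified is purely illustrative), adding and then dropping a property on the edges labelled participates would look like this:

g.E().hasLabel("participates").property("verified", true)
g.E().hasLabel("participates").properties("verified").drop()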

The next step in this series of blog posts is to traverse the graph, jumping from one vertex to another using the edges.


First steps with Gremlin to query a graph loaded in Azure Cosmos DB

In previous blog posts, I explained the use-case and also how to load data into an Azure Cosmos DB graph database with the bulk executor library. This blog post is dedicated to explaining how to perform basic queries with Gremlin on this dataset.

You can use two tools to query a graph hosted by Azure Cosmos DB. The first tool is embedded in the Azure Portal and is named Data Explorer.

cosmos-data-explorer

The second tool is an extension of Visual Studio Code that is freely available and is named Azure Cosmos DB. This extension lets you browse and query your MongoDB databases both locally and in the cloud using scrapbooks, but also lets you write queries in Gremlin and display the results as a graph or as JSON documents.

cosmos-visual-code-extension

Before we go further, a little bit of theory about the Gremlin query language.

  • Each query starts with g
  • V() stands for vertices and returns one or more vertices
  • E() stands for edges and returns one or more edges
  • hasLabel("label") filters vertices/edges based on their label (type)
  • hasId("id") filters vertices (/edges) based on their id (must be unique)
  • has("propertyName", "value") filters vertices based on the value of any property

With these few elements, it’s already possible to write some interesting queries. The first two queries will respectively retrieve all the bouts and all the karatekas:

g.V().hasLabel("bout")
g.V().hasLabel("karateka")

If you want to retrieve a specific karateka and you know her id, you can apply one of the two variants below. Both return a unique result.

g.V().hasId("karateka.1") 
g.V("karateka.1")

The graph view doesn’t really provide much value.

cosmos-query-id-graph

But the JSON view offers the opportunity to confirm that when returning a vertex, we’re also returning each of its properties!

cosmos-query-id-json

Most of the time, you don’t know the id of a vertex and you’ll need to perform a search through the graph to find it. In the first example, we’re looking for a karateka named Alice and in the second we’re looking for two karatekas.

g.V().has("karateka", "fullName", "Charlier, Alice") 
g.V().has("karateka", "fullName", 
   within(["Charlier, Alice", "Charlier, Clémence"]))

The first query is identical in terms of result to the following query:

g.V().has("fullName", "Charlier, Alice")

But the first version is more performant. Indeed, by specifying that you’re looking for a karateka, the engine will avoid searching within all the vertices that don’t have the label karateka. It’s a best practice to always specify this information when possible.

In the next blog post, we’ll see how to add or remove some properties on a vertex or an edge.

Loading data into an Azure Cosmos DB graph database with the bulk executor library

In a previous blog post, I started to explain how to model karate tournaments’ results within a graph database. Now, we’ll see how we can efficiently load these data into Azure Cosmos DB with the bulk executor library. This library allows you to perform bulk operations in Azure Cosmos DB through bulk import and bulk update APIs (currently limited to SQL and Gremlin API).

One of the biggest benefits of using this library is that you don’t need to handle rate limiting of requests, request timeouts, and other transient exceptions: the library does these tedious tasks for you. Another good point is the fact that it significantly reduces the client-side compute resources needed to saturate the throughput allocated to a container.

For this case, my data is stored in a kind of flat file. I’ll explain the format hereunder but, if you want to see the full picture, it’s an intermediary file with raw data already aligned with the structure of my graph. In this file, you’ll find a section defining karatekas, another defining tournaments, but also bouts, divisions etc. All these entities are vertices and each section starts with the label of the set of vertices between brackets. The next line contains the names of the properties separated by a semicolon (;). The following lines contain all the vertices and their respective properties.

[tournament]
id;name;date
1;FFKAMA Championship 2017;2017-01-29
2;FFKAMA Championship 2018;2018-01-28
3;GFK Championship 2018;2018-04-15
...

The transition between two sets of vertices or edges is marked by an empty line.

The label of a set of edges is identified by starting with a dash (-) and ending with a dash and a greater-than symbol (an arrow), e.g. -belongs to->. As with vertices, the second line identifies the properties. Two important properties are assigned to any edge: the source and the destination. These two properties are vertices and are identified as such by brackets surrounding the name of the property. They are always located as the first and second properties.

-loses->
[karateka];[bout];point;senshu
33;67;0;
75;68;
30;69;4;false
...

The example above represents the edges labeled “loses”. These edges represent one of the possible links between a karateka (source) and a bout (destination). Both source and destination are identified by their corresponding id. As you can see, each property, except the source and destination, can be unset.

Now that we have our flat file to import, we’ll need to write a reader. This reader is implemented in the class GraphCsvReader. It retrieves the labels and properties for each vertex and edge stored in the flat file. Once read, all this information is transformed into vertices or edges; that’s the role of the class CosmosDbBulkGraphBuilder. Once we have these objects, we forward them to the bulk executor library provided by Microsoft in order to load them into Azure Cosmos DB. This library is leveraged in the class CosmosDbBulkWriter.

read-transform-bulk-load-process

A few key insights about the code of CosmosDbBulkWriter.

If you’re familiar with SqlConnection or OleDbConnection for SQL Server or most relational databases, you could be surprised not to find an equivalent for Azure Cosmos DB. The best match for the notion of connection is the class DocumentClient. It represents a client-side logical representation of the Azure Cosmos DB service. This client is used to configure and execute requests against the service.

The GraphBulkExecutor, provided by the bulk executor library, is the class effectively bulk loading the data into Azure Cosmos DB. Before instantiating and initializing an object of this type, you should set the retry options of the DocumentClient to high values. After the initialization, you can reset them to lower values. Indeed, the initialization of a GraphBulkExecutor will probably request a higher workload than usual on your Azure Cosmos DB account and generate some internal and hidden throttled requests.
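
As an illustration, here is a minimal sketch of this initialization sequence (inside an async method). The class names come from the Microsoft.Azure.CosmosDB.BulkExecutor and Microsoft.Azure.Documents namespaces, but the endpoint, ids, partition keys and retry values are placeholders, and the GremlinEdge argument order should be double-checked against the library documentation:

// Connect to the SQL API endpoint and read the target collection (endpoint, key and ids are placeholders).
var client = new DocumentClient(new Uri("https://localhost:8081"), authKey);
var collection = (await client.ReadDocumentCollectionAsync(
    UriFactory.CreateDocumentCollectionUri(databaseId, collectionId))).Resource;

// Set the retry options to high values before initializing the bulk executor ...
client.ConnectionPolicy.RetryOptions.MaxRetryWaitTimeInSeconds = 30;
client.ConnectionPolicy.RetryOptions.MaxRetryAttemptsOnThrottledRequests = 9;

var bulkExecutor = new GraphBulkExecutor(client, collection);
await bulkExecutor.InitializeAsync();

// ... and reset them afterwards so that the bulk executor handles throttling itself.
client.ConnectionPolicy.RetryOptions.MaxRetryWaitTimeInSeconds = 0;
client.ConnectionPolicy.RetryOptions.MaxRetryAttemptsOnThrottledRequests = 0;

// Vertices and edges are expressed with the GremlinVertex/GremlinEdge element types.
var vertex = new GremlinVertex("karateka.70", "karateka");
vertex.AddProperty("pk", 1);
vertex.AddProperty("fullName", "Charlier, Alice");

var edge = new GremlinEdge(
    "participates.1", "participates",
    "karateka.70", "bout.120", "karateka", "bout", 1, 1);

await bulkExecutor.BulkImportAsync(new object[] { vertex, edge });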

If you want more details about the code, feel free to take a look at the GitHub project. About performance: the dataset that I’m using has around 3,000 elements (676 vertices and 2,323 edges). In terms of entities, it means more than 200 karatekas, the results of 350 bouts and the rankings of 130 karatekas (I didn’t load the ranking of a tournament if I had the results of each bout), belonging to 25 tournaments. To read my flat file and create vertices and edges on the client side, I need less than 70ms … extremely fast. The loading time depends on how many RU/s you’ve assigned to your collection.

If you’re assigning 400 RU/s, you’ll need around 45s. Switching to 800 RU/s cuts the wait to 22s. Scaling up to 2000 RU/s decreases the time to load the dataset to around 8.5s. That’s in line with the first results … if I check how many RU/s are consumed during the first two attempts, I get around 400 RU/s and 2000 RU/s … logical. Changing the setting to 8000 RU/s only decreases the time required to load the dataset to 4.5s … meaning that I’m limited to an effective value of 4300 RU/s! The main reason for this is that my client is the bottleneck and is not able to send the data at the speed that Azure Cosmos DB can consume it.

cosmos-load-time

Now that we’ve loaded the dataset, we’ll see how to query this database with the use of Gremlin in the next blog post.


Encrypting connection strings for NBi

In some companies, encryption of connection strings and other security-related information is required and checked by security policies. NBi supports these requirements and offers the opportunity to encrypt the connection strings in the config file. NBi tries not to reinvent the wheel and, as such, relies on the .NET framework’s native features for the encryption of connection strings.

The first step to consider when using encrypted connection strings is to move them to the config file. In your test-suite you should find the following reference to a connection string named myDB:

<settings>
  <default apply-to="system-under-test">
    <connectionString>@myDB</connectionString>
  </default>
</settings>

And in the config file, you should have the information about the myDB connection string:

<configuration>
  <configSections>
    <section name="nbi" type="NBi.NUnit.Runtime.NBiSection, NBi.NUnit.Runtime"/>
  </configSections>
  <nbi testSuite="TestSuite.nbits"/>
  <connectionStrings>
    <add name="myDB" connectionString="Provider=SQLNCLI11;Server=****.database.windows.net;Database=AdventureWorks2012;UId=****;Pwd=****;" />
  </connectionStrings>
</configuration>

The following walkthrough uses the RsaProtectedConfigurationProvider, which relies on RSA encryption to encrypt and decrypt configuration data. Before you and the user running your test-suite can decrypt encrypted information in the config file, your identities must have read access to the encryption key that is used to encrypt and decrypt the encrypted sections. This walkthrough uses the default RsaProtectedConfigurationProvider specified in the Machine.config file; the RSA key container that it uses is named NetFrameworkConfigurationKey.

To grant read access to an RSA encryption key, you’ll need to use aspnet_regiis. This utility can be found in any version of the .NET framework installed on your machine; the latest version is available at C:\Windows\Microsoft.NET\Framework\v4.0.30319. Granting read access to the user xyz can be done with the following command (you’ll probably need admin rights to execute it successfully):

aspnet_regiis -pa "NetFrameworkConfigurationKey" "xyz"

Once read access is granted to your development account, don’t forget to also add the account executing your test-suite (if they are different accounts).

The next step is to rename your config file to web.config. It might sound weird, but the encryption tool doesn’t look for files other than web.config files!

After renaming your config file to web.config, you’ll need to use aspnet_regiis again to encrypt your file. Use the following command:

aspnet_regiis -PeF "connectionStrings" "C:\Full Path\MyTestSuite"

Note that the name of the section connectionStrings is case-sensitive! Mind the uppercase “S”. You should also be careful with the last parameter setting the path of your project (and not the path of your config file). Last but not least, the path of your project shouldn’t contain a trailing backslash “\”.

Now, you should open the web.config file and you’ll find the following section:

<connectionStrings configProtectionProvider="RsaProtectedConfigurationProvider">
  <EncryptedData Type="http://www.w3.org/2001/04/xmlenc#Element"
    xmlns="http://www.w3.org/2001/04/xmlenc#">
    <EncryptionMethod Algorithm="http://www.w3.org/2001/04/xmlenc#tripledes-cbc" />
    <KeyInfo xmlns="http://www.w3.org/2000/09/xmldsig#">
      <EncryptedKey xmlns="http://www.w3.org/2001/04/xmlenc#">
        <EncryptionMethod Algorithm="http://www.w3.org/2001/04/xmlenc#rsa-1_5" />
        <KeyInfo xmlns="http://www.w3.org/2000/09/xmldsig#">
          <KeyName>Rsa Key</KeyName>
        </KeyInfo>
        <CipherData>
          <CipherValue>aloQVZs+EU3icflVqx+kl9TgCrJZ+qw+fMG5zu0y5SbjkflNgOPtv/Id0H07jNHW5QXA5dcTUa8vMXb4evJMqv281/PTnTq9D4+YtpS5n2eeoGNrlkenHA4L2hOkwbO5A5M8hRAm6MMCjWuvmIgxnczH+BY6tAMAfyU53cjkeWyYOL5SBbmeq0iZ3xcm256VDojqQUdddhuLzlBDQ/FPKeDEJhV9TsbQmaWxmkQ7ftWKsVrhgkzIiqlVjyUw/KM6S2iW/CwayOXhyOZhxYqZAVy6BmaE943/Hoky/UG8E1aOaLBrmUEt+ahl7hru/RZb2wNacGqCO5y+X8TqFdpk0g==</CipherValue>
        </CipherData>
      </EncryptedKey>
    </KeyInfo>
    <CipherData>
      <CipherValue>b2K1WbCd+0mOj5L6xL3ZczWqsgNwdV/RP6jqEA7U2ULYigXF7VccUS7LP7FIRGfVWPcgxQvVHTXanvfY+HKv6J8QfJV7IUopcrn9PYZYQBjm5gZ61AZA5ePfI16GaLsoPk4+VGyxjNCXwoaNSRLgUotA5vyA1cb7VuKKbGZMYixb7L9xPUj9sm5kb9r2PLGjjWDBKhGKBTxn6dGDnUHMjaBk1IXixuj8z1kx4l4CwvUtyLExddeWVu32PkgkIczlzxNb5VG11lI3M6KcptI/fkCCg9Vs/ZFd</CipherValue>
    </CipherData>
  </EncryptedData>
</connectionStrings>

Your connection string is now encrypted. You can reverse the encryption by using the following command:

aspnet_regiis -PdF "connectionStrings" "C:\Full Path\MyTestSuite"

To use your encrypted config file (don’t forget to re-encrypt it if you just decrypted it), you’ll need to rename it back to its original name.

Then, just run your test-suite! Nothing else to configure: with the help of native .NET features, NBi will understand that the config file is encrypted, where the encryption key is stored and how to decrypt it.

Extracting key phrases with the Text Analytics API and Power BI

In a previous series of blog posts, I explained how to grab the list of Data Platform MVPs from a website and add additional data about them, such as their GitHub contributions. My last article also explained how to use the Face API to guess the gender of these MVPs. I’ll now parse their biographies to extract key phrases and display the most used themes in these texts. To support this need, I’ll use the Text Analytics API provided by Azure Cognitive Services.

The Text Analytics API is a suite of text analytics web services built with best-in-class Microsoft machine learning algorithms. The API can be used to analyze unstructured text for tasks such as sentiment analysis, key phrase extraction and language detection. No training data is needed to use this API; just bring your text data. This API uses advanced natural language processing techniques to deliver best in class predictions.

The first step of this exercise is to grab the biography. We previously discussed how to parse a web page and retrieve pertinent information. In this case we need to look for the html tag <pre>, extract the text and clean it up.

let
    GetBiography = (url as text) as text=>
let
    source = Web.Contents(url),
    lines = Lines.FromBinary(source),
    biography = List.Select(lines, each Text.Contains(_, "<pre>")){0},
    cleanedBiography = ConvertHtmlSpecialChars(Text.BeforeDelimiter(Text.AfterDelimiter(Text.Clean(Text.Trim(biography)),">"), "<"))
in
    try cleanedBiography otherwise ""
in
    GetBiography

We need to set up a Text Analytics API on our Azure account. If you don’t know how to do this, refer to the previous article of this series. About pricing, selecting the free (F0) tier is probably the best approach: we’ll have a very limited usage of this API with just one call per run.

text-analytics-api-pricing.png

Indeed, unlike the Face API, we don’t need to perform a call for each MVP to the Text Analytics API. We only need to call this API once with all the biographies. The API is expecting a list of records (with the language, an id and the biography). The language will always be “en” (standing for English) and for the id, we’ll use the MvpId. Preparing this structure is performed in a function PrepareKeyPhrases.

let
    PrepareKeyPhrases = () =>
let
    source = Table.SelectColumns(#"All info",{"MvpId", "Biography"}),
    filtered = Table.SelectRows(source, each [Biography] <> ""),
    addEnglish = Table.AddColumn(filtered, "language", each "en"),
    renamed = Table.RenameColumns(addEnglish,{{"Biography", "text"}, {"MvpId", "id"}}),
    reordered = Table.ReorderColumns(renamed,{"language", "id", "text"}),
    list = Table.ToRecords(reordered)
in
    list
in
    PrepareKeyPhrases

Once we have the expected input for the query, we just need an additional function to call the Text Analytics API and its method keyPhrases. This Power Query M function will be named GetKeyPhrases, receives as parameter the result of PrepareKeyPhrases and uses the same approach as for the Face API, detailed in this article. The most complex part of this query is to transform the structure (list of records) that we previously prepared into JSON text. We can achieve this by transforming the structure into a JSON document with Json.FromValue and then serializing it by transforming our newly created binary into text with the help of Text.FromBinary.

let
    GetKeyPhrases = (list as list) =>

let
    headers = [#"Content-Type" = "application/json", #"Ocp-Apim-Subscription-Key" = TextApiKey],
    jsonText = Text.FromBinary(Json.FromValue(list)),
    content = "{ ""documents"": " & jsonText & "}",
    source = Web.Contents(TextApiUrl, [Headers = headers, Content=Text.ToBinary(content)])
in
    source
in
    GetKeyPhrases

We’ll need to provide a table for these key phrases. This new table will be linked to the main table with all the MVP details in the model.

let
    source = GetKeyPhrases(PrepareKeyPhrases()),
    json = Json.Document(source,65001)[documents],
    toTable = Table.FromList(json, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    expandColumns = Table.ExpandRecordColumn(toTable, "Column1", {"id", "keyPhrases"}, {"MvpId", "Word"}),
    expandList = Table.ExpandListColumn(expandColumns, "Word"),
    nonNullRows = Table.SelectRows(expandList, each [Word] <> null),
    capitalized = Table.TransformColumns(nonNullRows,{{"Word", Text.Proper, type text}}),
    cleaned = Table.TransformColumns(capitalized,{{"Word", Text.Clean, type text}}),
    trimmed = Table.TransformColumns(cleaned,{{"Word", Text.Trim, type text}})
in
    trimmed

Now that we have this list of keywords, we can freely use the Word Cloud custom visual.

text-analytics-word-cloud.

It’s interesting to compare some of the keywords extracted for the recently awarded MVPs.

text-analytics-api-new-mvp

and the long time MVPs.

text-analytics-api-long-time-mvp

At first sight, book sounds like a keyword associated with long-time MVPs, and Power BI with the recently awarded ones.

Using the Face API with Power BI

In a previous series of blog posts, I explained how to grab the list of Data Platform MVPs from a website and add additional data about them, such as their GitHub contributions. In this sequel, I’ll also add the gender of the MVPs … based on their profile pictures. To achieve this, I’ll rely on the Face API, part of the Azure Cognitive Services.

Let’s start by creating a service on your Azure subscription. If you have no Azure subscription, don’t panic: you can create a new one and you’ll receive some free credits to try a few services including the Face API. Also, if you want, you can use the Face API for free with the service level F0, which limits the calls to twenty per minute.

To create a Face API service in your subscription, click on “add resources” and search for “face”, select the Face API:
face-api

Then select the Face service in the list:
face-api

After this selection, give a name to your service, select the resources group and make a choice about the level of service that you’re expecting:

face-api-pricing

Now, we’ll go back to Power BI and create a new query, or to be exact, a new function in Power Query. This function will call the Face API, passing the url of the picture as a parameter. The return will be a text with the possible values: male, female or unknown. I can already create two parameters related to the Face API:

  • The base url, dependent on where your service is hosted. I named this parameter FaceApiUrl. This information is located in the Azure portal, in the overview of your Face API service. face-api-url
  • The key of your service. I named this parameter FaceApiKey and the value is also located in the Azure portal, but in the keys section. face-api-key

Based on the documentation of the Face API, I’ll have to send a POST request to the service. The request must be sent to a base url, specifying as parameters the attributes to be computed by the API. In this specific case I’m only interested in the gender. I can build a record for the query parameters:

query = [#"returnFaceAttributes"= "gender"]

As previously explained by Chris Webb (blog post) or Erik Svensen (blog post), to submit a POST request you’ll need two tricks.

  1. You must specify a content. This action will switch the execution context of Web.Contents from a GET request to a POST request
  2. The text of the content must be transformed to a binary

To create the content, I’ll have to refer to the documentation of the Face API and create a JSON document with just one attribute: the url of the picture.

content = "{ ""url"": """ & url & """}"

To submit a valid request, I’ll also have to submit the correct key (credential) to the Face API and specify that my content-type is application/json. These two pieces of information must be specified in the headers of my request. The field headers is also expecting a record, so I’ll submit the following construction.

headers =
   [#"Content-Type" = "application/json",
   #"Ocp-Apim-Subscription-Key" = FaceApiKey]

Now that the different components of my request are built, I just need to invoke the request with the help of the function Web.Contents().

response = Web.Contents(
   FaceApiUrl,
   [
      Query = query,
      Headers = headers,
      Content=Text.ToBinary(content)
   ]
)

Once we get the response, we just need to parse it to extract the gender.

gender = Json.Document(source,65001){0}[faceAttributes][gender]

The complete function should handle some unexpected errors, such as a missing picture or a picture whose quality is not high enough to detect the gender.

let
   GetGender = (url as text) =>
let
   query = [#"returnFaceAttributes"= "gender"],
   headers = [#"Content-Type" = "application/json", #"Ocp-Apim-Subscription-Key" = FaceApiKey],
   content = "{ ""url"": """ & url & """}" ,
   source = Web.Contents(FaceApiUrl, [Query = query, Headers = headers,    Content=Text.ToBinary(content)]),
   gender = Json.Document(source,65001){0}[faceAttributes][gender]
in
   try gender otherwise "unknown"
in
   GetGender

At the end, I’m able to create this visual and see that the percentage of women among the Data Platform MVPs is still really low.

face-api-gender

Analyzing sport results with a graph and Azure Cosmos DB

In previous blog posts, I used Power BI to analyze the results of my daughters participating in gymnastics competitions. In this new series, I’ll analyze their results in karate (kumite to be precise) … using a larger set of tools: Azure Cosmos DB, Power BI and Azure Databricks.

Let’s start with the analysis to perform: I want to understand their results (is it an expected defeat or is it something that could be considered a counter-performance?) or predict them (what’s the likelihood that we must stay at the sports hall until the podium ceremony?).

Before diving into the results, here are a few things that you should know about karate tournaments:

  • The process is the same as in any tennis tournament: the bout’s winner goes to the next round and the loser is out.
  • Except that if you’ve been defeated by someone going to the final, you’ll be enlisted in a second stage with the other competitors defeated by the same finalist and you’ll run for the bronze medal.
  • It means that there are two bronze medals (because there are two finalists) and not one.
  • There is always a bout’s winner: in case of a draw at the end of the time allocated to a bout, the first competitor who scored is the winner (it’s a bit more complex but let’s keep this simple). This advantage is named “senshu”. In case of 0-0, the judges will take the decision.
  • Categories are based on age. In most sports, your age category is determined by your birth year … not in karate, where it’s based on your birthdate. It means that you can be in the same category as another competitor for months and suddenly, in the middle of the season, be in another category just because one of you has had a birthday.
  • Categories are also based on weight. But the weight bounds are not the same at each tournament, and weight is also something really dynamic with children.

Who thought that it would be easy to analyze? Based on the last bullets, I took the decision to collect all the results of the categories surrounding the age of my daughters (fortunately, they are twins) and to take all the sub-categories based on weight.

Based on SportsML wording, all the karatekas participating in the same category of a tournament are part of a division.

Now that we have our goal and some rules, let’s download the results. We have a major issue: results are partial. For some competitions, you only have the podium. It means that for some tournaments, I have no notion of the bouts’ results but just the final standings.

Based on that, we have the following (main) entities: the tournament, the category (age/weight) derived into a division, the bouts (part of a round and a stage) and the competitors (karatekas). For the relations, we have: participates/wins/loses (between karatekas and bouts), belongs to (between tournaments, divisions, stages, rounds and bouts), ranks (between a karateka and a division). I could go further by gathering information about clubs and nationalities, but it’s out of scope at this point.

I took the decision to model this as a graph. Why? Three important factors drove me to this choice:

  • First argument: I’m not really interested in the entities but more in their relations. That mostly excludes column-based and document models.
  • Second argument: I’ll look for patterns such as “Who has been defeated by A and has defeated C?”
  • karate-query-1

    or have we got any cycle such as “A defeated B who defeated C but A defeated C”.

    karate-query-2
    This kind of query is complex to model in SQL compared to graph query languages (Gremlin or Cypher).

  • Last argument: the partial results. A graph can easily handle partial information with its schema-less approach. You shouldn’t consider this argument from a storage point of view (I’m aware of null values in relational databases). This argument makes sense if you consider that the relations could be inferred in a graph, something that is harder to put in place with a relational model.

 
Based on the considerations above, my schema looks like:
karate-full-schema

As you can see, the “defeated by” relation used in the query above is something that can be inferred from the wins/loses pattern between two karatekas and a bout.

karate-inferred-relation

In the next post, I’ll explain how to load the dataset with the Azure Cosmos DB bulk executor library; then I’ll cover how to query the database with Gremlin to get the first insights about my daughters’ results.