Azure Cosmos DB

Return intermediate vertices by creating aliases with the Gremlin language

In previous blog posts, I explained the use-case, how to load data into an Azure Cosmos DB graph database with the bulk executor library, how to write your first queries with the Gremlin language, how to manage vertices’ properties and the first steps to traverse a graph. This blog post is dedicated to explaining how you can traverse a graph and return vertices that are not at the end of the path.

Until now, we’ve always returned the last step of the query. But that’s not always what we want! In the previous query, we returned the karatekas having defeated the karatekas who defeated my daughter. One of the issues with this query was that we didn’t know who defeated whom.

We can fix this problem by returning tuples containing the two karatekas corresponding to the vertices in green on this drawing.

To achieve this, we’ll need to give an alias to the selection performed at the third intermediate step (and also to the last one). Gremlin has an operator named as to assign aliases.

The following query is just assigning aliases to the interesting steps:

g.V()
.has("karateka", "fullName", "Charlier, Alice")
.out("loses")
.in("wins")
.as("loser")
.out("loses")
.in("wins")
.as("winner")

At this point, if you executed the query, the result would be identical to what it was previously. Just assigning aliases is not enough, you’ll also have to instruct Gremlin to make some use of them. The simplest usage is to select them. This operation ensures that the aliased vertices are part of the result of the query.

g.V()
.has("karateka", "fullName", "Charlier, Alice")
.out("loses")
.in("wins")
.as("loser")
.out("loses")
.in("wins")
.as("winner")
.select("winner", "loser")

The result of this query is a list of tuples, each containing a loser and a winner:

[
  {
    "loser": {
      "id": "karateka.52",
      "label": "karateka",
      "type": "vertex",
      "properties": {
        ...
        "fullName": [
          {
            "id": "3dec83c1-5f2e-41d4-98e9-8371cf6017b4",
            "value": "B****, Rania"
          }
        ],
        ...
      }
    },
    "winner": {
      "id": "karateka.59",
      "label": "karateka",
      "type": "vertex",
      "properties": {
        ...
        "fullName": [
          {
            "id": "44fdc18a-a3e7-47ba-8ce5-d7599548c267",
            "value": "S****, Kiara"
          }
        ],
        ...
      }
    }
  },
  {
    "loser": {
      ...
    },
    "winner": {
      ...
    }
  },
  ...
]

Retrieving the whole list of properties is usually not what we want. Most of the time, we’ll only be interested in a subset of the properties, in this case the fullName. To project the vertices to a single property, we can use the operation by provided by Gremlin.

g.V()
.has("karateka", "fullName", "Charlier, Alice")
.out("loses")
.in("wins")
.as("loser")
.out("loses")
.in("wins")
.as("winner")
.select("winner", "loser")
.by("fullName")

This query is limiting the result to the full names of the losers and the winners:

[
  {
    "loser": "B****, Rania",
    "winner": "S****, Kiara"
  },
  ...
  {
    "loser": "P****, Denisa",
    "winner": "D*****, Louise"
  }
]

The next blog post will be about filtering during the traversal of a graph.


Basic traversing of a graph with the Gremlin language

In previous blog posts, I explained the use-case, how to load data into an Azure Cosmos DB graph database with the bulk executor library, how to write your first queries with the Gremlin language and how to manage vertices’ properties. This blog post is dedicated to explaining how to jump from a vertex to an adjacent vertex using the existing edges.

You’ll have to learn a few additional operations to traverse a graph. The first set of operations are inE and outE. These two operations let you select all the edges respectively ending at and starting from the selected vertex. The example here under shows the result (in green) of the operation outE for a given vertex.

The following query is returning all the edges having the label participates and starting from the vertex representing a given karateka.

g.V().has("karateka", "fullName", "Charlier, Alice").outE("participates")

The result of this query is a list of edges. From this result, we can see that the starting node is always the given karateka and that each edge links to a bout.

[
  {
    "id": "4a33a619-593a-469e-88ff-983e951149ed",
    "label": "participates",
    "type": "edge",
    "inVLabel": "bout",
    "outVLabel": "karateka",
    "inV": "bout.120",
    "outV": "karateka.70",
    "properties": {
      "color": "aka"
    }
  },
  {
    "id": "517e4b17-95ea-4ad0-a74c-45c083587a46",
    "label": "participates",
    "type": "edge",
    "inVLabel": "bout",
    "outVLabel": "karateka",
    "inV": "bout.116",
    "outV": "karateka.70",
    "properties": {
      "color": "aka"
    }
  },
  ... (many more objects)
]

The same kind of operations exists for selecting the vertices being the end or the start of a selected edge. These operations are respectively named inV and outV.

The following query starts from a karateka, then jumps to the edges having the label participates and then jumps to all the vertices being the end of the previously selected edges.

g.V()
.has("karateka", "fullName", "Charlier, Alice")
.outE("participates")
.inV()

The result of this query is a list of bouts (based on our previously defined graph schema).

[
  {
    "id": "bout.120",
    "label": "bout",
    "type": "vertex",
    "properties": {
      "pk": [
        {
          "id": "bout.120|pk",
          "value": 1
        }
      ]
    }
  },
  {
    "id": "bout.116",
    "label": "bout",
    "type": "vertex",
    "properties": {
      "pk": [
        {
          "id": "bout.116|pk",
          "value": 1
        }
      ]
    }
  },
  ...
]

Most of the time, you don’t really want to select the edges. They are just a means to go from one vertex to an adjacent vertex. For convenience, Gremlin supports two operations, in and out. They are the equivalent of, respectively, outE followed by inV and inE followed by outV.
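As an illustration on our dataset, the two following queries should therefore return exactly the same list of bouts:

g.V().has("karateka", "fullName", "Charlier, Alice").outE("participates").inV()
g.V().has("karateka", "fullName", "Charlier, Alice").out("participates")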

The following drawing explains that starting from the black vertex and using an in operation, you’ll directly select the three adjacent vertices.

A good usage of graph traversal is to find the names of all the karatekas having defeated a given karateka. To write this query, we’ll first select the initial karateka, then follow the edges labelled loses to reach all the bouts where this karateka has been defeated. Having this list of bouts, we just need to follow the edges labelled wins to know the name of each winner.

The following query applies this pattern:

g.V()
.has("karateka", "fullName", "Charlier, Alice")
.out("loses")
.in("wins")

The result of this query is a list of karatekas. If we just want to display their names, we can add a values operation.

g.V()
.has("karateka", "fullName", "Charlier, Alice")
.out("loses")
.in("wins")
.values("fullName")

The result will be a list of strings (results have been anonymized for GDPR reasons):

[
  "B****, Rania",
  "H****, Maëlie",
  "P****, Denisa"
]

We can go a bit further and check whether these karatekas have themselves already been defeated by someone. To achieve this, I’ll apply the exact same pattern and use the edges loses and wins from the second karateka.

g.V()
.has("karateka", "fullName", "Charlier, Alice")
.out("loses")
.in("wins")
.out("loses")
.in("wins")
.values("fullName")

The result is also a list of strings:

[
  "D****, Louise",
  "F****, Julia",
  "F****, Julia",
  "S****, Kiara",
  "K****, Lee-lou",
  "Z****, Manon",
  "E****, Lena",
  "A****, Ezdina",
  "B****, Margaux",
  "B****, Zineb",
  "B****, Rania",
  "D****, Louise"
]

As you can see, the first and last entries, but also the second and third, are identical. The reason is that these two karatekas have defeated one of the three karatekas listed above twice (or two of them once each). It’s really important to understand that Gremlin doesn’t automatically deduplicate the vertices. If you want to achieve this, just use the function dedup:

g.V()
.has("karateka", "fullName", "Charlier, Alice")
.out("loses")
.in("wins")
.out("loses")
.in("wins")
.dedup()
.values("fullName")

Note that I applied the function dedup to the vertices and not to the property fullName. The reason is to avoid considering as duplicates two karatekas who are just homonyms.
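To see the difference, compare the query above with this variant, deduplicating the values instead of the vertices (a sketch: in the presence of homonyms, it would collapse two distinct karatekas into a single name):

g.V()
.has("karateka", "fullName", "Charlier, Alice")
.out("loses")
.in("wins")
.out("loses")
.in("wins")
.values("fullName")
.dedup()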

The next blog post will be about the step modulator … and how to return a result where the selected vertices are not the last vertices traversed by the query!

Querying, adding or dropping vertices’ properties with the Gremlin language

In previous blog posts, I explained the use-case, how to load data into an Azure Cosmos DB graph database with the bulk executor library but also how to write your first queries with the Gremlin language. This blog post is dedicated to explaining how to add or remove properties on a vertex.

Properties are always attached to a vertex or an edge and give some information about this vertex or edge. Using Azure Cosmos DB, you’ll always have a property named pk that is mandatory and automatically generated by Azure Cosmos DB.

      "pk": [
        {
          "id": "karateka.70|pk",
          "value": 1
        }
      ],

pk stands for partition key. Partition keys are an important feature of Azure Cosmos DB but I’ll explain them in a future dedicated post.

Other properties always have the same structure. The name of the JSON field is the name of the Gremlin property. Each JSON field contains an array of values for this property. Indeed, in Gremlin each property is multi-valued; we’ll come back to this later in this post. Each property’s value has an id, a GUID automatically attributed by the Azure Cosmos DB engine, and a second field named value keeping track of the effective value of the property.

      "fullName": [
        {
          "id": "5c4a9e2e-4c62-41ab-9732-9bd39c0a6837",
          "value": "Charlier, Alice"
        }
      ],
      "firstName": [
        {
          "id": "b5c99f14-2bd2-4700-a42f-3bf97042e351",
          "value": "Alice"
        }
      ],
      "lastName": [
        {
          "id": "5cef6756-8ddc-45bf-95b9-3ee858aa0a12",
          "value": "Charlier"
        }
      ]

If you want to add a property to a vertex, you don’t need to perform any explicit change to the underlying schema. Graph databases are schema-less. It means that you can quickly add a new property by just specifying it. Naturally, it has some drawbacks and any typo could create a new property instead of editing an existing one.

To add a property to a vertex, you first need to select this vertex as explained in the previous blog post and then apply the operator property. This operator expects the property’s name and the property’s value as parameters:

g.V()
  .has("karateka", "fullName", "Charlier, Alice")
  .property("nationality", "Belgian")

Executing this query will return the selected vertex and you’ll quickly see that the property has been added.

[
  {
    "id": "karateka.70",
    "label": "karateka",
    "type": "vertex",
    "properties": {
      "pk": [
        {
          "id": "karateka.70|pk",
          "value": 1
        }
      ],
      ...
      "nationality": [
        {
          "id": "f0387c84-42e4-43ea-badb-7bfecc75443e",
          "value": "Belgian"
        }
      ]
    }
  }
]

It’s possible to add a property to more than one vertex. To achieve this, just select multiple vertices and define the common property. In the example below, I’m assigning the same birthday to my twin daughters.

g.V()
  .has("karateka", "fullName", within(
      ["Charlier, Alice", "Charlier, Clémence"]))
  .property("birthDate", "2010-07-04")

As explained above, Gremlin natively supports multi-valued properties. If you want to define a property having more than one value, you’ll have to pass the keyword list as the first parameter of the operator property. The next query adds a few middle names to a karateka.

g.V()
  .has("karateka", "fullName", "Charlier, Alice")
  .property(list, "middleName", "Coralie")
  .property(list, "middleName",  "Florence")

The result is the following:

     "middleName": [
        {
          "id": "a4ec9a1d-caf0-4ba5-9611-eae448f6c684",
          "value": "Coralie"
        },
        {
          "id": "ba1ce36a-1369-4c30-a4b2-08391c9b5a30",
          "value": "Florence"
        }
      ]

To return the values of some properties for a given vertex, you’ll need to apply the operator values and specify the properties’ names.

g.V()
  .has("karateka", "fullName", "Charlier, Alice")
  .values("lastName", "firstName")

This query returns a really flat and compact JSON document with just the values of the properties.

[
  "Charlier",
  "Alice"
]

When selecting multiple vertices, this operator can be useless because the values are not grouped by vertex and you don’t know the mapping between a property and a value. If you want a more explicit view on the properties, you can use the operator valueMap.

g.V()
  .has("karateka", "lastName", "Charlier")
  .valueMap("nationality", "firstName", "lastName")

The result of this query will be an explicit JSON document, listing all the requested properties and grouping them by vertex. Once again, if a property doesn’t exist for a given vertex, it won’t be an issue.

[
  {
    "nationality": [
      "Belgian"
    ],
    "firstName": [
      "Alice"
    ],
    "lastName": [
      "Charlier"
    ]
  },
  {
    "firstName": [
      "Clémence"
    ],
    "lastName": [
      "Charlier"
    ]
  }
]

If you want to remove a property, you’ll have to select it and then drop it.

g.V()
  .has("karateka", "fullName", "Charlier, Alice")
  .properties("birthDate")
  .drop()

It’s also possible to drop several properties on several vertices. If the properties don’t exist, it won’t raise an exception: once again, a benefit of schema-less databases.

g.V()
  .has("karateka", "fullName", 
     within(["Charlier, Alice", "Charlier, Clémence"]))
  .properties("middleName", "birthDate")
  .drop()

Exactly the same approach can be applied to an edge to add or drop properties!
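As a quick sketch (assuming, as in the rest of this series, that the edges labelled participates carry a color property), dropping a property on edges looks like this:

g.V()
  .has("karateka", "fullName", "Charlier, Alice")
  .outE("participates")
  .properties("color")
  .drop()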

The next step in this series of blog posts is to traverse the graph, jumping from one vertex to another using the edges.


First steps with Gremlin to query a graph loaded in Azure Cosmos DB

In previous blog posts, I explained the use-case and also how to load data into an Azure Cosmos DB graph database with the bulk executor library. This blog post is dedicated to explaining how to perform basic queries with Gremlin on this dataset.

You can use two tools to query a graph hosted by Azure Cosmos DB. The first tool is embedded in the Azure Portal and is named Data Explorer.

cosmos-data-explorer

The second tool is an extension of Visual Studio Code that is freely available and is named Azure Cosmos DB. This extension lets you browse and query your MongoDB databases both locally and in the cloud using scrapbooks, but also write queries in Gremlin and display the results as a graph or as JSON documents.

cosmos-visual-code-extension

Before we go further, a little bit of theory about the Gremlin query language.

  • Each query starts with g
  • V() stands for vertices and returns one or more vertices
  • E() stands for edges and returns one or more edges
  • hasLabel("label") filters vertices/edges based on their label (type)
  • hasId("id") filters vertices (or edges) based on their id (must be unique)
  • has("propertyName", "value") filters vertices based on the value of any property

With these few elements, it’s already possible to write some interesting queries. The first two queries will respectively retrieve all the bouts and all the karatekas:

g.V().hasLabel("bout")
g.V().hasLabel("karateka")

If you want to retrieve a specific karateka and you know her id, you can use one of the two following variants. Each will return a unique result.

g.V().hasId("karateka.1") 
g.V("karateka.1")

The graph view is not really providing any value.

cosmos-query-id-graph

But the JSON view offers the opportunity to confirm that when returning a vertex, we’re also returning each of its properties!

cosmos-query-id-json

Most of the time, you don’t know the id of a vertex and you’ll need to perform a search through the graph to find it. In the first example, we’re looking for a karateka named Alice and in the second, we’re looking for two karatekas.

g.V().has("karateka", "fullName", "Charlier, Alice") 
g.V().has("karateka", "fullName", 
   within(["Charlier, Alice", "Charlier, Clémence"]))

The first query is identical in terms of result to the following query:

g.V().has("fullName", "Charlier, Alice")

But the first version is more performant. Indeed, by specifying that you’re looking for a karateka, the engine will avoid searching within all the vertices that don’t have the label karateka. It’s a best practice to always specify this information when possible.

In the next blog post, we’ll see how to add or remove some properties on a vertex or an edge.

Loading data into an Azure Cosmos DB graph database with the bulk executor library

In a previous blog post, I started to explain how to model karate tournaments’ results within a graph database. Now, we’ll see how we can efficiently load these data into Azure Cosmos DB with the bulk executor library. This library allows you to perform bulk operations in Azure Cosmos DB through bulk import and bulk update APIs (currently limited to SQL and Gremlin API).

One of the biggest benefits of using this library is that you don’t need to handle rate limiting of requests, request timeouts, and other transient exceptions. The library does these tedious tasks for you. Another good point is the fact that it significantly reduces the client-side compute resources needed to saturate the throughput allocated to a container.

For this case, my data are stored in a kind of flat file. I’ll explain the format here under but if you want to see the full picture, that’s an intermediary file with raw data already aligned with the structure of my graph. In this file, you’ll find a section defining karatekas, another defining tournaments but also bouts, divisions etc. All these entities are vertices and each section starts with the label of the set of vertices between brackets. The next line contains the names of the properties separated by a semicolon (;). The following lines contain all the vertices and their respective properties.

[tournament]
id;name;date
1;FFKAMA Championship 2017;2017-01-29
2;FFKAMA Championship 2018;2018-01-28
3;GFK Championship 2018;2018-04-15
...

The transition between two sets of vertices or edges is marked by an empty line.

The label of a set of edges is identified by starting with a dash (-) and ending with a dash and a greater-than symbol (the symbol of an arrow), e.g. -belongs to->. As with vertices, the second line identifies the properties. Two important properties are assigned to any edge: the source and the destination. These two properties are vertices and are identified as such by brackets surrounding the name of the property. They are always located as the first and second properties.

-loses->
[karateka];[bout];point;senshu
33;67;0;
75;68;
30;69;4;false
...

The example above represents the edges labeled “loses”. These edges represent one of the possible links between a karateka (source) and a bout (destination). Both source and destination are identified by their corresponding id. As you can see, each property, except the source and destination, can be unset.

Now that we have our flat file to import, we’ll need to write a reader. This reader is implemented in the class GraphCsvReader. It retrieves the labels and properties of each vertex and edge stored in the flat file. Once read, all this information is transformed into vertices or edges. That’s the role of the class CosmosDbBulkGraphBuilder. Once we have these objects, we forward them to the bulk executor library provided by Microsoft in order to load them into Azure Cosmos DB. This library is leveraged in the class CosmosDbBulkWriter.

read-transform-bulk-load-process
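As an overview, the chaining of these three classes looks roughly like this (a sketch with hypothetical method names and file path; only the class names come from the project, so check the GitHub repository for the real signatures):

// Hypothetical wiring: read the flat file, build graph elements, bulk load them.
var content = new GraphCsvReader().Read(@"data\karate.csv");
var elements = new CosmosDbBulkGraphBuilder().Build(content);
new CosmosDbBulkWriter().Write(elements);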

A few key insights about the code of CosmosDbBulkWriter.

If you’re familiar with SqlConnection or OleDbConnection for SQL Server or most of the relational databases, you could be surprised to not find an equivalent for Azure Cosmos DB. The best match for the notion of connection is the class DocumentClient. It represents a client-side logical representation of Azure Cosmos DB and is used to configure and execute requests against the service.

The GraphBulkExecutor, provided by the bulk executor library, is the class effectively bulk loading the data to Azure Cosmos DB. Before instantiating and initializing an object of this type, you should set the retry options of the DocumentClient to high values. After the initialization, you can reset them to lower values. Indeed, the initialization of a GraphBulkExecutor will probably request a higher workload than usual on your Azure Cosmos DB account and generate some internal and hidden throttled requests.
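A sketch of this dance, assuming client is the DocumentClient and collection the DocumentCollection created beforehand (the values themselves are just reasonable defaults suggested by the library’s documentation):

// High retry values while the GraphBulkExecutor initializes itself...
client.ConnectionPolicy.RetryOptions.MaxRetryWaitTimeInSeconds = 30;
client.ConnectionPolicy.RetryOptions.MaxRetryAttemptsOnThrottledRequests = 9;

var executor = new GraphBulkExecutor(client, collection);
executor.InitializeAsync().Wait();

// ...then reset them: the bulk executor library handles throttling itself.
client.ConnectionPolicy.RetryOptions.MaxRetryWaitTimeInSeconds = 0;
client.ConnectionPolicy.RetryOptions.MaxRetryAttemptsOnThrottledRequests = 0;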

If you want more details about the code, feel free to take a look at the GitHub project. About performance: the dataset that I’m using has around 3,000 elements (676 vertices and 2,323 edges). In terms of entities, it means more than 200 karatekas, the results of 350 bouts and the rankings of 130 karatekas (I didn’t load the ranking of a tournament if I had the results of each bout) belonging to 25 tournaments. To read my flat file and create the vertices and edges on the client side, I need less than 70ms … extremely fast. The loading time depends on how many RU/s you assign to your collection.

If you’re assigning 400 RU/s, you’ll need around 45s. Switching to 800 RU/s will let you wait during 22s. Scaling up to 2,000 RU/s will decrease the time to load the dataset to around 8.5s. That’s in line with the first results … if I check how many RU/s are consumed during the first two attempts, I get around 400 RU/s and 2,000 RU/s … logical. Changing the settings to 8,000 RU/s will only decrease the time required to load the dataset to 4.5s … meaning that I’m limited to an effective value of around 4,300 RU/s! The main reason is that my client is the bottleneck and is not able to send the data at the speed Azure Cosmos DB could consume it.

cosmos-load-time

Now that we’ve loaded the dataset, we’ll see how to query this database with the use of Gremlin in the next blog post.


Analyzing sport results with a graph and Azure Cosmos DB

In previous blog posts, I made use of Power BI to analyze the results of my daughters participating in gymnastics competitions. In this new series, I’ll analyze their results in karate (kumite to be precise) … using a larger set of tools: Azure Cosmos DB, Power BI and Azure Databricks.

Let’s start with the analysis to perform: I want to understand their results (is it an expected defeat or is it something that could be seen as a counter-performance) or predict them (what’s the likelihood that we must stay at the sports hall until the podium ceremony).

A few things that you should know about karate tournaments before diving into the results:

  • The process is the same as in any tennis tournament: the bout’s winner goes to the next round and the loser is out.
  • Except that if you’ve been defeated by someone going to the final, you’ll be enlisted for a second stage with the other competitors defeated by the same finalist and you’ll run for the bronze medal
  • It means that you have two bronze medals (because you have two finalists) and not one
  • There is always a bout’s winner: in case of a draw at the end of the time allocated to a bout, the first competitor who scored is the winner (it’s a bit more complex but let’s keep this simple). This advantage is named “senshu”. In case of 0-0, the judges will take the decision
  • Categories are based on age. In most sports, your age is determined by your birth year … not in karate, where it’s based on your birthdate. It means that you can be in the same category as another competitor for months and suddenly, in the middle of the season, be in another category just because one of you gets his birthday.
  • Categories are based on weight. But the bounds for the weight are not the same at each tournament, and the weight is also something really dynamic with children.

Who thought that it would be easy to analyze? Based on the last bullets, I took the decision to collect all the results of the categories surrounding the age of my daughters (fortunately, they are twins) and I’ll take all the sub-categories based on weight.

Based on sportsML wording, all the karatekas participating in the same category of a tournament are part of a division.

Now that we have our goal and some rules, let’s download the results. We have a major issue: results are partial. For some competitions, you only have the podium. It means that for some tournaments, I’ve no notion of the bouts’ results but just the final standing.

Based on that, we have the following (main) entities: the tournament, the category (age/weight) derived into a division, the bouts (each part of a round and a stage) and the competitors (karatekas). For the relations, we have: participates/wins/loses (between karatekas and bouts), belongs to (between tournaments, divisions, stages, rounds and bouts) and ranks (between a karateka and a division). I could go further gathering information about clubs and nationalities but it’s out of scope at this point.

I took the decision to model this as a graph. Why? Three important factors drove me to this choice:

  • First argument: I’m not really interested by the entities but more by their relations. That mostly excludes column-based and document models.
  • Second argument: I’ll look for patterns such as “Who has been defeated by A and has defeated C?”

    karate-query-1

    or have we got any cycle such as “A defeated B who defeated C but A defeated C”?

    karate-query-2

    This kind of query is complex to model in SQL compared to graph query languages (Gremlin or Cypher).

  • Last argument: the partial results. A graph can easily handle partial information with its schema-less approach. You shouldn’t consider this argument from a storage point of view (I’m aware of the null value in relational databases). This argument makes sense if you consider that the relations can be inferred in a graph, which is harder to put in place with a relational model.

Based on the considerations above, my schema looks like:
karate-full-schema

As you can see, the “defeated by” relation used in the query above is something that can be inferred by the pattern wins/loses between two karatekas and a bout.

karate-inferred-relation
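As a teaser of the Gremlin queries coming in the next posts, the inferred relation “X has been defeated by Y” boils down to a traversal sketched as follows (the "..." stands for the karateka’s full name):

g.V().has("karateka", "fullName", "...")
  .out("loses")
  .in("wins")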

In the next post, I’ll explain how to load the dataset with the Azure Cosmos DB bulk executor library, then I’ll cover how to query the database with Gremlin to get the first insights about my daughters’ results.

Automating the testing of gremlin queries with NBi

For five years now, you’ve been able to run automated tests on your SQL Server databases and SSAS cubes with the help of the open-source framework NBi. This framework was extended to support other relational databases (using the OleDb or ODBC drivers) in 2017. In 2018, I’m adding support for graph databases. Earlier this year, I released an extension for the Neo4j bolt driver supporting Cypher queries and another compatible with some of the Azure Cosmos DB APIs, notably the Graph API. This new extension supports all the Apache Tinkerpop-enabled databases and is tested on Azure Cosmos DB and Tinkerpop Server but should also work with other graph databases supporting Gremlin such as JanusGraph, Titan and OrientDB. The code for this extension is hosted on GitHub. Note that Azure Cosmos DB is now recommending to use the Gremlin driver in place of the Graph API dlls, so this extension should also be your first target if you want to connect to an Azure Cosmos DB database and run Gremlin queries.

If you’re used to NBi, you can directly jump to the next paragraph. For newcomers, NBi is an open-source framework dedicated to the testing of BI solutions: it supports complex comparison of query results, asserting conditions on each row of a result-set, running ETLs (SSIS), validating queries from reports (SSRS) and many other possibilities. On top of this, it has an advanced feature to automate the process of creating test-suites.

In order to set up an environment for testing some Gremlin queries, you must download the version 1.18 of NBi and its extension for Gremlin in version 1.0. Unzip the content of the NBi download and then partially override it with the content of the extension NBi.Gremlin (dlls from both packages must be deployed in a unique directory). If NUnit 2.6.4 is not available on your computer, don’t forget to download it. If you need more info about how to set up a NBi’s extension, read the documentation.

When NBi and its extension are installed, create your test-suite with its config and NUnit project file as you’d normally do for any NBi test-suite (or, if you’re in a hurry, download the files with the examples here, but don’t forget to update the connection-string when trying to run it). If you need more info about how to set up a test-suite, read here.

For this example, I started a Tinkerpop Server and created some vertexes and edges with:

g.V().drop()
g.addV('person').property('id', 'thomas')
  .property('firstName', 'Thomas').property('age', 44)
g.addV('person').property('id', 'mary')
  .property('firstName', 'Mary')
  .property('lastName', 'Andersen')
  .property('age', 39)
g.addV('person').property('id', 'ben')
  .property('firstName', 'Ben')
  .property('lastName', 'Miller')
g.addV('person').property('id', 'robin')
  .property('firstName', 'Robin')
  .property('lastName', 'Wakefield')
g.V().has('firstName','Thomas').addE('knows')
  .to(g.V().has('firstName','Mary'))
g.V().has('firstName','Thomas').addE('knows')
  .to(g.V().has('firstName','Ben'))
g.V().has('firstName','Ben').addE('knows')
  .to(g.V().has('firstName','Robin'))

When the environment is correctly configured, you must edit your config file to reference the extension NBi.Core.Gremlin.

<configuration>
  <configSections>
    <section name="nbi" type="NBi.NUnit.Runtime.NBiSection, NBi.NUnit.Runtime"/>
  </configSections>
  <nbi testSuite="...">
    <extensions>
      <add assembly="NBi.Core.Gremlin"/>
    </extensions>
  </nbi>
</configuration>

You also need to edit your config file to add all the binding redirections that the Azure Cosmos DB driver comes with. To achieve this, copy/paste the content of the runtime element of NBi.Core.CosmosDb.dll.config into a new runtime element in the config file of your test-suite (FoF.config).

<runtime>
    <assemblyBinding xmlns="urn:schemas-microsoft-com:asm.v1">
      <dependentAssembly>
        <assemblyIdentity name="System.Diagnostics.DiagnosticSource" publicKeyToken="cc7b13ffcd2ddd51" culture="neutral" />
        <bindingRedirect oldVersion="0.0.0.0-4.0.1.0" newVersion="4.0.1.0" />
      </dependentAssembly>
      <dependentAssembly>
        <assemblyIdentity name="System.Net.Http" publicKeyToken="b03f5f7f11d50a3a" culture="neutral" />
        <bindingRedirect oldVersion="0.0.0.0-4.2.0.0" newVersion="4.2.0.0" />
      </dependentAssembly>
      <dependentAssembly>
        <assemblyIdentity name="System.Net.WebSockets" publicKeyToken="b03f5f7f11d50a3a" culture="neutral" />
        <bindingRedirect oldVersion="0.0.0.0-4.0.2.0" newVersion="4.0.1.0" />
      </dependentAssembly>
      <dependentAssembly>
        <assemblyIdentity name="System.Net.WebSockets.Client" publicKeyToken="b03f5f7f11d50a3a" culture="neutral" />
        <bindingRedirect oldVersion="0.0.0.0-4.0.2.0" newVersion="4.0.1.0" />
      </dependentAssembly>
      <dependentAssembly>
        <assemblyIdentity name="Newtonsoft.Json" publicKeyToken="30ad4fe6b2a6aeed" culture="neutral" />
        <bindingRedirect oldVersion="0.0.0.0-11.0.0.0" newVersion="11.0.0.0" />
      </dependentAssembly>
    </assemblyBinding>
</runtime>

The concept of connection-string is not familiar to Tinkerpop or Gremlin but, to match with relational databases, we use a concatenation of the usual parameters defined to connect to an instance supporting Gremlin queries. You must provide the hostname, the port, the enableSsl, the username and the password. You must also add a tip for NBi to identify the correct type of connection-string by adding an api tag with the value gremlin. If your database is an Azure Cosmos DB instance, then your username will be /dbs/{db}/colls/{coll} where {db} stands for the name of the database and {coll} for the name of the collection. Also, note that the hostname is a URL on the subdomain .gremlin.cosmosdb.azure.com.

hostname=https://gremlin-demo.gremlin.azure.com;
port=443;
enableSsl=true;
username=database=/dbs/Friends/colls/FoF;
password=F...==;
api=gremlin;

The same connection-string for a Tinkerpop Server, running on your computer, will be

hostname=localhost;
port=8182;
enableSsl=false;
username=database=Anything;
password=Anything;
api=gremlin;

Note that username and password are not expected by a Tinkerpop Server. Anyway, they can’t be omitted but any value will do.

In your test-suite, querying a relational database with SQL or a Tinkerpop-enabled database with a Gremlin query is no different. You’re still using the result-set and query elements. The Gremlin query is specified within the query element. It doesn’t support parameters at the moment but you’re still free to use template-tokens.

Gremlin queries can return complex results such as paths that are not easily mapped to a table (result-set). Currently, NBi’s support for Gremlin is limited to queries returning a list of vertexes, a list of edges or a list of values.

In this first test, NBi is asserting that the query returning Thomas’ friends of friends is effectively returning the correct result: in this case, a unique vertex representing a person named Robin.

FoF-Thomas-Robin

<test name="Thomas' friends of friends" uid="0001">
    <system-under-test>
      <resultSet>
        <query>
          g.V().has('firstName', 'Thomas').outE('knows').inV().hasLabel('person').outE('knows').inV().hasLabel('person')
        </query>
      </resultSet>
    </system-under-test>
    <assert>
      <equalTo values-default-type="text">
        <resultSet>
          <row>
            <cell>robin</cell>
            <cell>person</cell>
            <cell>vertex</cell>
            <cell>Robin</cell>
            <cell>Wakefield</cell>
          </row>
        </resultSet>
      </equalTo>
    </assert>
  </test>

In this second test, we’re asserting that all the vertexes have a property age that is either missing or greater than 20.

<test name="All of them are older than 20 (or unknown)" uid="0002">
    <system-under-test>
      <resultSet>
        <query>
          g.V()
        </query>
      </resultSet>
    </system-under-test>
    <assert>
      <all-rows>
        <combination operator="or">
          <predicate operand="age">
            <null/>
          </predicate>
          <predicate operand="age">
            <more-than>20</more-than>
          </predicate>
        </combination>
      </all-rows>
    </assert>
  </test>

And finally, in this last test, we’re checking that the result-set returned by a projection matches the expected result.

  <test name="Count of person's relationships" uid="0003">
    <system-under-test>
      <resultSet>
        <query>
          g.V().project('FirstName','KnowsCount').by('firstName').by(out().count())
        </query>
      </resultSet>
    </system-under-test>
    <assert>
      <equalTo>
        <resultSet>
          <row>
            <cell>Mary</cell>
            <cell>0</cell>
          </row>
          <row>
            <cell>Robin</cell>
            <cell>0</cell>
          </row>
          <row>
            <cell>Ben</cell>
            <cell>1</cell>
          </row>
          <row>
            <cell>Thomas</cell>
            <cell>2</cell>
          </row>
        </resultSet>
      </equalTo>
    </assert>
  </test>

FoF-running-test-suite

Our test-suite is green … time to write some additional tests!

To know more about NBi, check the website at www.nbi.io and if you’ve any question or suggestion, feel free to ask on Twitter or on the Github repository (section “issues”). The code of this extension is also available on Github.

Creating and querying a database using the Graph API on the Azure Cosmos DB emulator

When developing on Azure Cosmos DB, Microsoft’s globally distributed, horizontally partitioned, multi-model database service, it’s useful to use the local emulator. At the moment, the web interface of the data explorer is oriented towards the SQL API and it sounds impossible to create a graph query on the emulator. This is false: the UI is not aware of these features, but the engine supporting the Graph API is there.

You can access the emulator engine by means of the .Net libraries. Create a new console project in Visual Studio and add the package Microsoft.Azure.Graphs. This package is in preview, so you’ll need to tell the package manager that you’re accepting previews.

Install-Package Microsoft.Azure.Graphs -pre

This package will come with many friends, notably the reference to Microsoft.Azure.Documents.Client, the core library for everything related to Azure Cosmos DB and the reference to Newtonsoft.Json for everything related to Json serialization.

cosmosdb-console-001

For the Newtonsoft.Json library, you’ll have some version conflicts in the bunch of dependencies. These issues are fixed by applying some binding redirections in the app.config file.

cosmosdb-console-002
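As a sketch, the redirect for Newtonsoft.Json takes this shape in app.config (the version numbers here are placeholders; align them with the version actually restored in your project):

<runtime>
  <assemblyBinding xmlns="urn:schemas-microsoft-com:asm.v1">
    <dependentAssembly>
      <assemblyIdentity name="Newtonsoft.Json" publicKeyToken="30ad4fe6b2a6aeed" culture="neutral" />
      <!-- placeholder versions: use the Newtonsoft.Json version restored by NuGet -->
      <bindingRedirect oldVersion="0.0.0.0-9.0.0.0" newVersion="9.0.0.0" />
    </dependentAssembly>
  </assemblyBinding>
</runtime>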

To create the database, you won’t need to use methods specific to the Graph API. A database is agnostic of the API that you’ll use for querying. If the Azure Portal asks you to specify an API when creating a database online, it’s just to display a UI that is easier to use with the selected API, nothing else.

The endpoint and authkey for the emulator are usually set to:

var endpoint = new Uri("https://localhost:8081");
var authKey = "C2y6yDjf5/R+ob0N8A7Cgv30VRDJIWEHLM+4QDU5DE2nQ9nDuVTqobD4b8mGGyPMbIZnqyMsEcaGQy67XIw/Jw==";

The method CreateDatabaseIfNotExistsAsync is useful for creating a database and skipping this step if the database already exists.

using (var client = new DocumentClient(endpoint, authKey))
{
  var databaseResponse = client.CreateDatabaseIfNotExistsAsync(
      new Database() { Id = databaseId }
  ).Result;
}

It could be interesting to check the status code of this request. If the status code is OK, it means that the database already existed. For a newly created database, the status code would be Created.

switch (databaseResponse.StatusCode)
{
  case System.Net.HttpStatusCode.OK:
    Console.WriteLine($"Database {databaseId} already exists.");
    break;
  case System.Net.HttpStatusCode.Created:
    Console.WriteLine($"Database {databaseId} created.");
    break;
  default:
    throw new ArgumentException(
        $"Can't create database {databaseId}: {databaseResponse.StatusCode}"
    );
}

Once the database is created, you’ll need to create the collection. Same remark: a collection is not dependent on a specific API.

var databaseUri = UriFactory.CreateDatabaseUri(databaseId);
var collectionResponse = client.CreateDocumentCollectionIfNotExistsAsync(
     databaseUri
     , new DocumentCollection() { Id = collectionId }
).Result; 
switch (collectionResponse.StatusCode)
{
    case System.Net.HttpStatusCode.OK:
        Console.WriteLine($"Collection {collectionId} already exists.");
        break;
    case System.Net.HttpStatusCode.Created:
        Console.WriteLine($"Collection {collectionId} created.");
        break;
    default:
        throw new ArgumentException(
           $"Can't create collection {collectionId}: {collectionResponse.StatusCode}"
        );
}

You can check in the web UI of the emulator and you’ll see the newly created database and collection. If they don’t appear, try to open the page in another tab.

Now that the database and the collection exist, we can start to query the graph. At this step, I’ll need to use some methods specific to the Graph API. The most notable one is the extension method CreateGremlinQuery. This method accepts two parameters: the collection to query and the statement.

var queryCount = client.CreateGremlinQuery<dynamic>(collectionResponse,
     "g.V().count()");

This will just instantiate the query, not execute it. To execute it, you’ll need to call the method ExecuteNextAsync, which returns the first set of results. In this case, we know that we’re expecting exactly one result and can go with this shortcut.

var count = queryCount.ExecuteNextAsync().Result.First();
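When you can’t assume that a single page of results is enough, the usual pattern (relying on the HasMoreResults property of the query object returned by CreateGremlinQuery) is to drain the pages in a loop:

var query = client.CreateGremlinQuery<dynamic>(collectionResponse, "g.V()");
while (query.HasMoreResults)
{
    // Each call returns the next page of results.
    foreach (var result in query.ExecuteNextAsync<dynamic>().Result)
    {
        Console.WriteLine(result);
    }
}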

With the help of the method CreateGremlinQuery, you can also create some statements to create nodes or edges. The node or edge will be created when the method ExecuteNextAsync is executed.

var queryNode = client.CreateGremlinQuery<dynamic>(collectionResponse,
    "g.addV('person').property('id', 'thomas')" +
    ".property('firstName', 'Thomas').property('age', 44)");
var result = queryNode.ExecuteNextAsync().Result;

Counting items with the SQL API on Azure Cosmos DB

Azure Cosmos DB is Microsoft’s globally distributed, horizontally partitioned, multi-model database service. The term “multi-model database service” means that you can seamlessly access the content of your database from different APIs. One of these APIs is named SQL and is a kind of mix between JSON and SQL. If you’re used to ANSI-SQL, it’s probably the easiest API to start with.

This SQL API supports the keyword COUNT but it’s not necessarily the keyword that will be the most useful when you want to know how many elements are available in an array.

Before going further, to illustrate this article, we’ll need to set up a database and a collection (named families); you can achieve this with the UI of the portal. Then load a few items into your collection. To achieve this task, click on “New document” in the documents tab of your collection, paste the json corresponding to the Andersen family from this page and click on “Save”. Then click again on “New document” and paste the content of the json related to the Wakefield family from the same page. Don’t forget to click on “Save”. Your two families are now part of the database’s content and we’ll be able to query them.
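For reference, the documents share a structure along these lines (a heavily trimmed sketch; see the linked page for the full documents). Note that the Wakefield children are identified by givenName rather than firstName and that some children have no pets array at all, which the coalesce example further down takes advantage of.

{
  "id": "AndersenFamily",
  "children": [
    {
      "firstName": "Henriette Thaulow",
      "pets": [ { "givenName": "Fluffy" } ]
    }
  ]
}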

The first example is about counting the children of each family. In ANSI-SQL, the approach would probably be to select the familyId and then count the children grouped by familyId. With the SQL API of Cosmos DB, you’ll need to take another approach and use the keyword ARRAY_LENGTH. This keyword returns the number of elements of the specified array expression. In our case, the array expression is just the array containing the children of a family. Our select becomes:

SELECT
f.id AS FamilyId,
ARRAY_LENGTH(f.children) AS ChildrenCount
FROM
families f

The result is effectively a list of FamilyId and ChildrenCount.

cosmosdb-count-001

We can do something a bit more complex and check how many pets each child has. To succeed, we’ll need to start from the children level and then take the length of the sub-array pets. This can be done using the keywords JOIN and IN to force an iteration on the existing arrays. In this case, we set the granularity of the query to a child and, for each of them, we’re retrieving her familyId, givenName and the count of her pets.

SELECT
f.id as FamilyId,
c.givenName,
ARRAY_LENGTH(c.pets) as PetsCount
FROM
families f
JOIN c IN f.children

If you want to improve the rendering of the result, you can use the coalesce operator ??. If a value is not returned by the query, the result can be replaced by another value. In this case, if the givenName is not returned, we’ll use the firstName and if the array of pets doesn’t exist, we’ll set the count of pets to zero.

SELECT
f.id as FamilyId,
c.givenName ?? c.firstName,
ARRAY_LENGTH(c.pets) ?? 0 as PetsCount
FROM
families f
JOIN c IN f.children

cosmosdb-count-002

Last question to answer: how many pets have we got in our database? To answer this kind of question, you’ll need to perform an aggregation. The idea is to know how many pets each child has and then sum the results obtained for each child. Translated into SQL:

SELECT
SUM(ARRAY_LENGTH(c.pets)) as PetsCount
FROM
families f
JOIN c IN f.children

cosmosdb-count-003

Another way to achieve the same result is to change the granularity of the query and start from the pets. In this case, we’ll need to count the pets and we can finally use the keyword COUNT.

SELECT
COUNT(p) as PetsCount
FROM
families f
JOIN c IN f.children
JOIN p IN c.pets

cosmosdb-count-003

Deploying the Azure Cosmos DB emulator when building with AppVeyor

Azure Cosmos DB is Microsoft’s globally distributed, horizontally partitioned, multi-model database service. AppVeyor is a Continuous Integration solution for Windows. The key question is: how can I test my solution, built on AppVeyor, with an Azure Cosmos DB back-end? Before going further, you need to know that Azure Cosmos DB has an emulator that you can run on any Windows machine. You can use it to develop but also to test your solution. I’d not recommend load tests but unit or integration tests are supported.

The first step is to download the emulator (only 26MB) on your AppVeyor build machine. You can do this during the install step with the help of this one-line PowerShell script:

Start-FileDownload 'https://aka.ms/cosmosdb-emulator' -FileName 'c:\projects\cosmos-db.msi'

When the file is downloaded, you will need to execute it to install the emulator on the machine. Using a few parameters of the installer, this can be achieved without the need to click on buttons.

cmd /c start /wait msiexec /i "c:\projects\cosmos-db.msi" /qn /quiet /norestart /log install.log

Then, once the software is installed, you’ll need to run it. As it will run in the background, you won’t need the UI or the Data Explorer. These options can be turned off by means of the parameters /NoUI and /NoExplorer.

"C:\Program Files\Azure Cosmos DB Emulator\CosmosDB.Emulator.exe"  /NoExplorer /NoUI

To avoid running it interactively in the build execution context, you’ll need to start it in another process. It can be done with the PowerShell cmdlet Start-Process.
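A sketch, reusing the default installation path from the command above:

Start-Process -FilePath "C:\Program Files\Azure Cosmos DB Emulator\CosmosDB.Emulator.exe" -ArgumentList "/NoExplorer", "/NoUI"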

Finally, you cannot expect this kind of software to start in a millisecond. You’ll have to poll and check whether a connection is possible or not. That’s the goal of this PowerShell script:

$max = 5
$attempt = 0
$client = New-Object System.Net.Sockets.TcpClient([System.Net.Sockets.AddressFamily]::InterNetwork)

do {
	$attempt++
	try {    
		$client.Connect("127.0.0.1", 8081)
		write-host "CosmosDB started"
	}
	catch {
		$client.Close()
		if($attempt -eq $max) {
			write-host "CosmosDB was not started"
		} else {
			[int]$sleepTime = 3*$attempt
			write-host "CosmosDB is not started. Retry after $sleepTime seconds..."
			sleep $sleepTime;
		}
	}
}while(!$client.Connected -and $attempt -lt $max)

If the emulator has not started after these five attempts, something is going wrong and you should step out of this loop.

The full listing with all the actions taken during the install step of your build process is visible here:

appveyor-cosmosdb-emulator

If you have troubles, it can be useful to check whether the port 8081 is listening: this can easily be done with the help of the command netstat -ab that you can add as a fifth action in your install step.