Graph databases

Analyzing sport results with a graph and Azure Cosmos DB

On previous blog posts, I made usage of Power BI to analyze the results of my daughters participating in gymnastics competitions. On this new series, I’ll analyze their results in karate (kumite to be precise) … using a larger set of tools: Azure Cosmos DB, Power BI and ML.Net.

Let’s start with the analysis to perform: I want to understand their results (is it an expected defeat or is it something that could be set as a counter-performance) or predict them (what’s the likelihood that we must stay at the sport hall until the podium ceremony).

Before going further, a few things that you should know about Karate tournaments before diving into the results:

  • The process is the same than in any Tennis tournament: bout’s winner goes the next round and loser is out.
  • At the exception that if you’ve been defeated by someone going to the final, you’ll be enlisted for a second stage with other competitors defeated by the same finalist and you’ll run for the bronze medal
  • It means that you’ve two bronze medals (because you’ve two finalists) and not one
  • There is always a bout’s winner: in case of draw at the end of the time allocated to a bout, the first competitor who scored is the winner (it’s a bit more complex but let’s keep this simple). This advantage is named “senshu”. In case of 0-0, judges will take the decision
  • Categories are based on age. In most sports, your age is determined based on your birth year … not in Karate where it’s based on your birthdate. It means that you can be in the same category than another competitor during months and suddenly, in the middle of the season, be in another category just because one of you get his birthday.
  • Categories are based on weight. But the bounds for the weight are not the same at each tournament, and the weight is also something really dynamic at this age.

Who thought that it would be easy to analyze? Based on the last bullets, I took the decision to collect all the results of the categories surrounding the age of my daughters (hopefully, they are twins) and I’ll take all the sub-categories based on weight.

Based on sportsML wording all the karatekas participating in the same category of a tournament are part of a division.

Now, that we have our goal and some rules, let’s download the results. We’ve a major issue: results are partial. For some competitions, you only have the podium. It means that for some tournaments, I’ve no notion of bouts’ result but just the final standing.

Based on that, we’ve the following (main) entities: we’ve the tournament, the category (age/weight) derived into a division, the bouts (part of a round and a stage) and the competitors (karatekas). For the relations, we’ve: participates/wins/loses (between karatekas and bouts), belongs to (between tournament, divisions, stages, rounds and bouts), ranks (between karateka and a division). I could go further gathering information about club and nationalities but it’s out-of-scope at this point.

I took the decision to model this as a graph. Why? Three important factors drove me to this choice:

  • First argument, I’m not really interested by the entities but more by their relations. That mostly exclude column-based and document models.
  • Second argument, I’ll look for patterns such as “Who has been defeated by A and has defeated C?”
  • karate-query-1

    or have we got any cycle such as “A defeated B who defeated C but A defeated C”.

    karate-query-2
    This kind of queries are complex to model in SQL compared to graph query languages (Gremlin or Cypher).

  • Last argument: the partial results. A graph can easily handle partial information with its schema-less approach. You shouldn’t consider this argument with a storage point-of-view (I’m aware of the null value in relational databases). This argument makes sense if you consider that the relations could be inferred in a graph and that is harder to put in place with a relational model.

 
Based on consideration above my schema looks like:
karate-full-schema

As you could see, the “defeated by” used in the query above is something that could be inferred by the pattern wins/loses between two karatekas and a bout.

karate-inferred-relation

In next post, I’ll explain how to load the dataset with the Azure Cosmos DB bulk executor library, then I’ll cover how to query the database with Gremlin to get the first insights about my daughters’ results.

Advertisements

Automating the testing of gremlin queries with NBi

Since five years, you can run automated tests on your SQL server databases and SSAS cubes with the help of the open-source framework NBi. This framework was extended to support other relational databases (using the OleDb or ODBC drivers) in 2017. In 2018, I’m adding support for graph databases. Earlier this year, I released an extension for the Neo4j bolt driver supporting Cypher queries and another compatible with some of the Azure Cosmos DB API and notably the Graph API. This new extension is supporting all the Apache Tinkerpop-enabled databases and is tested on Azure Cosmos Db and Tinkerpop Server but should also work other graph databases supporting gremlin such as JanusGraph, Titan and OrientDB. The code for this extension is hosted on GitHub.Note that Azure Cosmos DB is now recommending to use the Gremlin driver in place of the Graph API dlls, so this extension should also be your first target if you want to connect to Azure Cosmos DB database and you want to run gremlin queries.

If you’re used to NBi, you can directly jump to the next paragraph. For newcomers, NBi is an open-source framework dedicated to the testing of BI solutions: it supports complex comparison of queries results, asserting conditions on each row of a result-set, running ETL (SSIS), validating queries from reports (SSRS) or many other possibilities. On top of this, it has an advanced feature to automate the process of creating test-suites.

In order to setup an environment for testing some gremlin queries, you must download the version 1.18 of NBi and its extension for Gremlin in version 1.0. Unzip the content of the NBi download and then partially override it with the content of the extension in the NBi.Gremlin (dlls from both packages must be deployed in a unique directory). If NUnit 2.6.4 is not available on your computer don’t forget to download it. If you need more info about how to setup a NBi’s extension, read the documentation.

When NBi and its extension are installed, create your test-suite with its config and NUnit project file as you’d normally do it for any NBi test-suite (or, if you’re in the hurry, download the files with the examples here, but don’t forget to update the connection-string when trying to run it). If you need more info about how to setup a test-suite, read the here.

For this example, I started a Tinkerpop Server and created some vertexes and edges with:

g.V().drop()
g.addV('person').property('id', 'thomas')
  .property('firstName', 'Thomas').property('age', 44)
g.addV('person').property('id', 'mary')
  .property('firstName', 'Mary')
  .property('lastName', 'Andersen')
  .property('age', 39)
g.addV('person').property('id', 'ben')
  .property('firstName', 'Ben')
  .property('lastName', 'Miller')
g.addV('person').property('id', 'robin')
.property('firstName', 'Robin')
  .property('lastName', 'Wakefield')
g.V().has('firstName','Thomas').addE('knows')
  .to(g.V().has('firstName','Mary'))
g.V().has('firstName','Thomas').addE('knows')
  .to(g.V().has('firstName','Ben'))
g.V().has('firstName','Ben').addE('knows')
  .to(g.V().has('firstName','Robin'))

When the environment is correctly configured, you must edit your config file to reference the extension NBi.Core.Gremlin.

<configuration>
  <configSections>
    <section name="nbi" type="NBi.NUnit.Runtime.NBiSection, NBi.NUnit.Runtime"/>
  </configSections>
  <nbi testSuite="...">
    <extensions>
      <add assembly="NBi.Core.Gremlin"/>
    </extensions>
  </nbi>
</configuration>

You also need to edit your config file. Add all the binding redirections that the Azure Cosmos DB is coming with. To achieve this, copy/paste the content of the runtime element of NBi.Core.CosmosDb.dll.config into a new runtime element in the config file of your test-suite FoF.config.

<runtime>
    <assemblyBinding xmlns="urn:schemas-microsoft-com:asm.v1">
      <dependentAssembly>
        <assemblyIdentity name="System.Diagnostics.DiagnosticSource" publicKeyToken="cc7b13ffcd2ddd51" culture="neutral" />
        <bindingRedirect oldVersion="0.0.0.0-4.0.1.0" newVersion="4.0.1.0" />
      </dependentAssembly>
      <dependentAssembly>
        <assemblyIdentity name="System.Net.Http" publicKeyToken="b03f5f7f11d50a3a" culture="neutral" />
        <bindingRedirect oldVersion="0.0.0.0-4.2.0.0" newVersion="4.2.0.0" />
      </dependentAssembly>
      <dependentAssembly>
        <assemblyIdentity name="System.Net.WebSockets" publicKeyToken="b03f5f7f11d50a3a" culture="neutral" />
        <bindingRedirect oldVersion="0.0.0.0-4.0.2.0" newVersion="4.0.1.0" />
      </dependentAssembly>
      <dependentAssembly>
        <assemblyIdentity name="System.Net.WebSockets.Client" publicKeyToken="b03f5f7f11d50a3a" culture="neutral" />
        <bindingRedirect oldVersion="0.0.0.0-4.0.2.0" newVersion="4.0.1.0" />
      </dependentAssembly>
      <dependentAssembly>
        <assemblyIdentity name="Newtonsoft.Json" publicKeyToken="30ad4fe6b2a6aeed" culture="neutral" />
        <bindingRedirect oldVersion="0.0.0.0-11.0.0.0" newVersion="11.0.0.0" />
      </dependentAssembly>
    </assemblyBinding>
</runtime>

The concept of connection-string is not familiar to Tinkerpop or Gremlin but to match with relational databases, we use a concatenation of the usual parameters defined to connect to an instance supporting gremlin queries. You must provide the hostname, the port, the enableSsl, the username and password. You must also add a tip for NBi to identify the correct type of connection-string by adding an api tag with the value gremlin.If your database is a Azure Cosmos DB instance, then your username will be /dbs/{db}/colls/{coll} where {db} stands for the name of the database and {coll} for the name of the collection. Also, note that the hostname is a url part of the subdomain .gremlin.cosmosdb.azure.com.

hostname=https://gremlin-demo.gremlin.azure.com;
port=443;
enableSsl=true;
username=database=/dbs/Friends/colls/FoF;
password=F...==;
api=gremlin;

The same connection-string for a Tinkerpop Server, running on your computer, will be

hostname=localhost;
port=8182;
enableSsl=false;
username=database=Anything;
password=Anything;
api=gremlin;

Note that username and password are not expected by a Tinerpop Server. Anyway, they can’t be omited but any value will make it.

In your test-suite, querying a relational database with SQL or a Tinkerpop-enabled database with a gremlin query is not different. You’re still using the result-set and query elements. The gremlin query is specified within the query. It doesn’t support parameters at the moment but you’re still free to use template-tokens.

Gremlin queries can return complex results such as paths that are not easily mapped to a table (result-set). Currently, NBi support for Gremlin is limited to queries returning a list of Vertexes or a list of Edges or a list of values.

In this first test, NBi is asserting that the query to return Thomas’ friends of friends is effectively returning the correct result. In this case a unique vertex representing a person named Robin.

FoF-Thomas-Robin
<test name="Thomas' friends of friends" uid="0001">
    <system-under-test>
      <resultSet>
        <query>
          g.V().has('firstName', 'Thomas').outE('knows').inV().hasLabel('person').outE('knows').inV().hasLabel('person')
        </query>
      </resultSet>
    </system-under-test>
    <assert>
      <equalTo values-default-type="text">
        <resultSet>
          <row>
            <cell>robin</cell>
            <cell>person</cell>
            <cell>vertex</cell>
            <cell>Robin</cell>
            <cell>Wakefield</cell>
          </row>
        </resultSet>
      </equalTo>
    </assert>
  </test>

In this second test, we’re asserting that the all the vertexes contain a property age greater than 20.

<test name="All of them are older than 20 (or unknown)" uid="0002">
    <system-under-test>
      <resultSet>
        <query>
          g.V()
        </query>
      </resultSet>
    </system-under-test>
    <assert>
      <all-rows>
        <combination operator="or">
          <predicate operand="age">
            <null/>
          </predicate>
          <predicate operand="age">
            <more-than>20</more-than>
          </predicate>
        </combination>
      </all-rows>
    </assert>
  </test>

And finally, in this last test, we’re checking that the result-set returned by a projection& matches the expected result.

  <test name="Count of person's relationships" uid="0003">
    <system-under-test>
      <resultSet>
        <query>
          g.V().project('FirstName','KnowsCount').by('firstName').by(out().count())
        </query>
      </resultSet>
    </system-under-test>
    <assert>
      <equalTo>
        <resultSet>
          <row>
            <cell>Mary</cell>
            <cell>0</cell>
          </row>
          <row>
            <cell>Robin</cell>
            <cell>0</cell>
          </row>
          <row>
            <cell>Ben</cell>
            <cell>1</cell>
          </row>
          <row>
            <cell>Thomas</cell>
            <cell>2</cell>
          </row>
        </resultSet>
      </equalTo>
    </assert>
  </test>
FoF-running-test-suite

Our test-suite is green … time to write some additional tests!

To know more about NBi, check the website at www.nbi.io and if you’ve any question or suggestion, feel free to ask on Twitter or on the Github repository (section “issues”). The code of this extension is also available on Github.

Automating the testing of an Azure Cosmos DB instance with NBi

Since five years, you can run automated tests on your SQL server databases and SSAS cubes with the help of the open-source framework NBi. This framework was extended to support other relational databases (using the OleDb or ODBC drivers) in 2017. To give a really good start to the year 2018, NBi is now supporting Azure Cosmos DB databases using the Graph API (gremlin queries)!

If you’re used to NBi, you can directly jump to the next paragraph. For newcomers, NBi is an open-source framework dedicated to the testing of BI solutions: it supports complex comparison of queries results, asserting conditions on each row of a result-set, running ETL (SSIS), validating queries from reports (SSRS) or many other possibilities. On top of this, it has an advanced feature to automate the process of creating test-suites.

In order to setup an environment for testing some gremlin queries through the Graph API of Azure Cosmos DB, you must download the version 1.18 of NBi and its extension for Gremlin in version 1.0. Unzip the content of the NBi download and then partially override it with the content of the extension in the NBi.CosmosDb (dlls from both packages must be deployed in a unique directory). If NUnit 2.6.4 is not available on your computer don’t forget to download it. If you need more info about how to setup a NBi’s extension, read the documentation.

When NBi and its extension are installed, create your test-suite with its config and NUnit project file as you’d normally do it for any NBi test-suite (or, if you’re in the hurry, download the files with the examples here, but don’t forget to update the connection-string when trying to run it). If you need more info about how to setup a test-suite, read the here.

For this example, I created a database on Azure Cosmos DB (named Friends) containing a graph (named FoF). I created some vertexes and edges with:

g.V().Drop()
g.addV('person').property('id', 'thomas')
  .property('firstName', 'Thomas')
  .property('age', 44)
g.addV('person').property('id', 'mary')
  .property('firstName', 'Mary')
  .property('lastName', 'Andersen')
  .property('age', 39)
g.addV('person').property('id', 'ben')
  .property('firstName', 'Ben')
  .property('lastName', 'Miller')
g.addV('person').property('id', 'robin')
  .property('firstName', 'Robin')
  .property('lastName', 'Wakefield')
g.V('thomas').addE('knows').to(g.V('mary'))
g.V('thomas').addE('knows').to(g.V('ben'))
g.V('ben').addE('knows').to(g.V('robin'))

When the environment is correctly configured, you must edit your config file to reference the extension NBi.Core.CosmosDB.

<configuration>
  <configSections>
    <section name="nbi" type="NBi.NUnit.Runtime.NBiSection, NBi.NUnit.Runtime"/>
  </configSections>
  <nbi testSuite="...">
    <extensions>
      <add assembly="NBi.Core.CosmosDb"/>
    </extensions>
  </nbi>
</configuration>

You also need to edit your config file. Add all the binding redirections that the Azure Cosmos DB is coming with. To achieve this, copy/paste the content of the runtime element of NBi.Core.CosmosDb.dll.config into a new runtime element in the file FoF.config.

The concept of connection-string is not familiar to Azure Cosmos DB but to match with relational databases, we use a concatenation of the usual parameters defined to connect to an Azure Cosmos DB instance. You must provide the endpoint, the authkey, the database Id and the collection Id, concatenated into named-tokens separated by semi-colons. For the collection Id, you must use the name graph and not collection, in prevision of NBi’s compatibility with other APIs supported by Cosmos DB.

Endpoint=https://gremlin-demo.documents.azure.com:443;AuthKey=F...==;database=Friends;graph=FoF

In your test-suite, querying a relational database or your Azure Cosmos DB database with the Graph API is not different. You’re still using the result-set and query elements. The gremlin query is specified within the query. It doesn’t support parameters at the moment but you’re still free to use template-tokens.

Graph queries can return complex results such as paths that are not easily mapped to a table (result-set). Currently, NBi supports for Cosmos DB is limited to queries returning a list of Vertexes or a list of Edges or a list of values.

In this first test, NBi is asserting that the query to return Thomas’ friends of friends is effectively returning the correct result. In this case a unique vertex representing a person named Robin.

FoF-Thomas-Robin
<test name="Thomas' friends of friends" uid="0001">
  <system-under-test>
    <resultSet>
      <query>
        g.V('thomas').outE('knows').inV().hasLabel('person').outE('knows').inV().hasLabel('person')
      </query>
    </resultSet>
  </system-under-test>
  <assert>
    <equalTo values-default-type="text">
      <resultSet>
        <row>
          <cell>robin</cell>
          <cell>person</cell>
          <cell>vertex</cell>
          <cell>Robin</cell>
          <cell>Wakefield</cell>
        </row>
      </resultSet>
    </equalTo>
  </assert>
</test>

In this second test, we’re asserting that the all the vertexes contain a family name correctly capitalized:

<test name="All of them are older than 20" uid="0002">
  <system-under-test>
    <resultSet>
      <query>
        g.V()
      </query>
    </resultSet>
  </system-under-test>
  <assert>
    <all-rows>
      <predicate operand="firstName" type="text">
        <matches-regex>^([A-Z][a-z]+)$</matches-regex>
      </predicate>
    </all-rows>
  </assert>
</test>

And finally, in this last test, we’re checking that the result-set returned by a projection matches the expected result.

<test name="Count of person's relationships" uid="0003">
  <system-under-test>
    <resultSet>
      <query>
        g.V().project('FirstName','KnowsCount').by('firstName').by(out().Count())
      </query>
    </resultSet>
  </system-under-test>
  <assert>
    <equalTo>
      <resultSet>
        <row>
          <cell>Mary</cell>
          <cell>0</cell>
        </row>
        <row>
          <cell>Robin</cell>
          <cell>0</cell>
        </row>
        <row>
          <cell>Ben</cell>
          <cell>1</cell>
        </row>
        <row>
          <cell>Thomas</cell>
          <cell>2</cell>
        </row>
      </resultSet>
    </equalTo>
  </assert>
</test>
FoF-running-test-suite

Our test-suite is green … time to write new tests!

To know more about NBi, check the website at www.nbi.io and if you’ve any question or suggestion, feel free to ask on Twitter
or on Github. The code of this extension is also available on Github.