Author: seddryck

Analyzing sport results with a graph and Azure Cosmos DB

In previous blog posts, I used Power BI to analyze the results of my daughters participating in gymnastics competitions. In this new series, I’ll analyze their results in karate (kumite to be precise) … using a larger set of tools: Azure Cosmos DB, Power BI and ML.Net.

Let’s start with the analysis to perform: I want to understand their results (is it an expected defeat or could it be considered a counter-performance?) or predict them (what’s the likelihood that we’ll have to stay at the sports hall until the podium ceremony?).

Before diving into the results, here are a few things you should know about karate tournaments:

  • The process is the same as in any tennis tournament: the bout’s winner goes to the next round and the loser is out.
  • Except that, if you’ve been defeated by someone who reaches the final, you’ll be enlisted in a second stage with the other competitors defeated by the same finalist, and you’ll compete for the bronze medal.
  • It means that there are two bronze medals (one per finalist) and not just one.
  • There is always a bout’s winner: in case of a draw at the end of the time allocated to a bout, the first competitor who scored is the winner (it’s a bit more complex, but let’s keep this simple). This advantage is named “senshu”. In case of 0-0, the judges take the decision.
  • Categories are based on age. In most sports, your age is determined by your birth year … not in karate, where it’s based on your birthdate. It means that you can be in the same category as another competitor for months and suddenly, in the middle of the season, end up in another category just because one of you has a birthday.
  • Categories are also based on weight. But the weight bounds are not the same at each tournament, and weight is also something really dynamic at this age.

Who thought it would be easy to analyze? Based on the last bullets, I decided to collect all the results of the categories surrounding my daughters’ age (fortunately, they are twins) and to take all the sub-categories based on weight.

In SportsML wording, all the karatekas participating in the same category of a tournament are part of a division.

Now that we have our goal and some rules, let’s download the results. We have a major issue: results are partial. For some competitions, you only have the podium. It means that for some tournaments, I have no notion of the bouts’ results but just the final standing.

Based on that, we have the following (main) entities: the tournament, the category (age/weight) derived into a division, the bouts (part of a round and a stage) and the competitors (karatekas). For the relations, we have: participates/wins/loses (between karatekas and bouts), belongs to (between tournaments, divisions, stages, rounds and bouts) and ranks (between a karateka and a division). I could go further and gather information about clubs and nationalities but it’s out-of-scope at this point.

I took the decision to model this as a graph. Why? Three important factors drove me to this choice:

  • First argument: I’m not really interested in the entities themselves but more in their relations. That mostly excludes column-based and document models.
  • Second argument: I’ll look for patterns such as “Who has been defeated by A and has defeated C?”

    (figure: karate-query-1)

    or whether we have any cycle such as “A defeated B who defeated C but A defeated C”.

    (figure: karate-query-2)

    These kinds of queries are complex to express in SQL compared to graph query languages (Gremlin or Cypher).

  • Last argument: the partial results. A graph can easily handle partial information thanks to its schema-less approach. You shouldn’t consider this argument from a storage point of view (I’m aware of the null value in relational databases). This argument makes sense if you consider that relations can be inferred in a graph, and that is harder to put in place with a relational model.

 
Based on the considerations above, my schema looks like this:
(figure: karate-full-schema)

As you can see, the “defeated by” relation used in the query above is something that can be inferred from the wins/loses pattern between two karatekas and a bout.

(figure: karate-inferred-relation)
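To make this more tangible, here is a minimal Gremlin sketch of that inference. The labels and property names (karateka, bout, wins, loses, name) are naming assumptions for illustration, not the final schema:

// Two karatekas and the bout between them; the assumption is that
// a karateka "wins" or "loses" a bout (edges directed from karateka to bout).
g.addV('karateka').property('name', 'A')
g.addV('karateka').property('name', 'B')
g.addV('bout').property('name', 'bout1')
g.V().has('karateka', 'name', 'A').addE('wins').to(g.V().has('bout', 'name', 'bout1'))
g.V().has('karateka', 'name', 'B').addE('loses').to(g.V().has('bout', 'name', 'bout1'))

// Inferred "defeated by": from B, follow the bout that B lost,
// then walk back to the karateka who won that same bout.
g.V().has('karateka', 'name', 'B').out('loses').in('wins').values('name')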

In the next post, I’ll explain how to load the dataset with the Azure Cosmos DB bulk executor library, then I’ll cover how to query the database with Gremlin to get the first insights about my daughters’ results.


Generating tests requiring a synchronous iteration over multiple arrays with genbiL

Today’s challenge is to create a test-suite, with genbiL, where each test is an all-rows assertion with an undetermined count of predicates. The predicates will be combined with an or operator. Pragmatically, each test will follow this pattern for the assertion:

<all-rows>
  <combination operator="or">
    <predicate operand="...">
      <equal>...</equal>
    </predicate> 
  </combination>
</all-rows>

In the second part of this blog post, I already explained how to deal with multiple values for a variable of the template … so what’s different? The case that we’re facing is somewhat similar but not identical. In the previous blog post, each variable of the template (.nbitt) was set to an array by genbiL and the template iterated over the values of the array. But the iterations were independent of each other.

If you take a look at the predicate case that we’re facing now, we don’t have one value to iterate over but two: the operand (attribute of the xml element predicate) and the reference (value between the equal elements). More importantly, we need to iterate synchronously over these two values. It means that each time I iterate over a value of the operand, I also need to iterate over a value of the reference. Let’s take a tangible case: my first predicate is on field “foo” and I want to check if this field is equal to “1”. My second predicate is on field “bar” and will check if this field is equal to “5”. This should lead to the following assertion:

<all-rows>
  <combination operator="or">
    <predicate operand="foo">
      <equal>1</equal>
    </predicate> 
    <predicate operand="bar">
      <equal>5</equal>
    </predicate> 
  </combination>
</all-rows>

That clearly means that if I create a template-variable named operand set with the values foo and bar, and a second template-variable named reference set with the values 1 and 5, I can’t first iterate over the first variable and then over the second, or I’d end up with something such as:

<all-rows>
  <combination operator="or">
    <predicate operand="foo">
    </predicate> 
    <predicate operand="bar">
    </predicate> 
      <equal>1</equal>
      <equal>5</equal>
  </combination>
</all-rows>

Fortunately, StringTemplate is a powerful engine and manages this kind of synchronous iteration over two or more arrays (StringTemplate uses the wording multi-valued attribute rather than array). To achieve that, you’ll need to define an anonymous template in your template and pass two variables to this anonymous template.

$xArray,yArray:{ x,y | ...}$

This syntax means that you want to synchronously iterate over two arrays. This template iterates max(n,m) times where n and m are the lengths of xArray and yArray, respectively.

In our specific case the anonymous-template will be:

$operands,references:{operand,reference|<predicate operand="$operand$">
          <equal>$reference$</equal>
        </predicate> 
        }$

There is nothing fancy to do in the genbiL code to support this kind of template: just load the cases and transform some variables into arrays.

case load file 'Acceptance\GenbiL\Resources\Predicate.csv';
case split columns 'operands', 'references' with value '/';

template load file 'Acceptance\GenbiL\Resources\Predicate.nbitt';
suite generate;
suite save as 'Acceptance\GenbiL\Resources\Predicate.nbits';

The csv file would be (.csv):

operands;references
foo/bar;1/5

and the full template (.nbitt):

<test name="All rows from '$field$' validate the rules.">
  <trait name="template-name">predicate.nbitt</trait>
  <system-under-test>
    <resultSet>
      <query>
        select * from myTable
      </query>
    </resultSet>
  </system-under-test>
  <assert>
    <all-rows>
      <combination operator="or">
        $operands, references:{operand, reference|<predicate operand="$operand$">
          <equal>$reference$</equal>
        </predicate> 
        }$
      </combination>
    </all-rows>
  </assert>
</test>

Using NBi with hierarchy’s level in MDX queries

NBi is a toolset dedicated to testing BI and data-centric applications. One of the key features is to compare the results returned by two queries. This can be used to compare the result of a query on the relational data warehouse with the result on the OLAP cube. A pitfall that you could face when using NBi with MDX queries is the count of columns returned by this kind of query when a hierarchy level is defined on an axis.

When executing the following query (on the Adventure Works 2012 sample database/cube), you’ll see two columns in the result displayed by SSMS. It’s probably what you’re expecting: you’re only selecting one specific level of the hierarchy [Date].[Calendar] and one measure.

(figure: hierarchy_ssms)
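The query itself only appears in the screenshot above, but based on the result-sets shown further down, it was along these lines (a sketch against the Adventure Works cube; the exact slicing on CY 2008 is my assumption):

select
   [Measures].[Order Count] on 0,
   [Date].[Calendar].[Calendar Quarter].members on 1
from
   [Adventure Works]
where
   [Date].[Calendar Year].&[2008]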

You’re probably expecting that NBi will also consider two columns. Unfortunately, it’s not the case: NBi will consider 4 columns! What are the additional and unexpected columns? The [Date].[Calendar].[Calendar Year] and [Date].[Calendar].[Calendar Semester] levels are also returned. In reality, this is not something specific to NBi: it’s just the standard behaviour of the ADOMD library, and SSMS is cheating when displaying only one column for the level!

If we don’t care about this specificity of NBi and set up a test where the MDX query is the system-under-test, then we’ll have to face the following exception:

NBi.NUnit.Runtime.CustomStackTraceErrorException :
The column with index '[Date].[Calendar].[Calendar Semester].[MEMBER_CAPTION]'
 is expecting a numeric value but the first row of your result set contains a value 'H1 CY 2008'
 not recognized as a valid numeric value or a valid interval.
(figure: Hierarchy_UnexpectedColumns)

When comparing the second column of your assertion (the orders’ count, so the value, by default a numeric value) to the second column of your system-under-test (what you’re expecting to be the orders’ count but which is in reality the [Date].[Calendar].[Calendar Semester]) … NBi is not able to convert “H1 CY 2008” to a numeric value.

To overcome this issue, the easiest way is to fake, in the SQL query, the additional columns returned by the MDX query. In place of having the following SQL query in your assertion:

select
	'Q' + cast([CalendarQuarter] as char(1)) + ' CY 2008' as QY
	, count(distinct SalesOrderNumber)
from
	dbo.factInternetSales f
inner join
	Dimdate d
	on d.DateKey = f.OrderDateKey
where
	f.OrderDateKey between 20080000 and 20089999
group by
	[CalendarQuarter]

Just fake the two additional columns, filling them with NULL.

select
	null as [Year]
	, null as [HY]
	, 'Q' + cast([CalendarQuarter] as char(1)) + ' CY 2008' as QY
	, count(distinct SalesOrderNumber)
from
	...

But it’s not the end! You’ll have to inform NBi that it shouldn’t care about these two additional columns. This can be done in the equalTo element, where you can specify to ignore some columns. Ignoring means that these columns are neither keys nor values … they don’t interfere in the comparison process. To implement that, you’ll define the role as ignore for these columns.

<equalTo keys="all-except-last">
   <column index="0" role="ignore"/>
   <column index="1" role="ignore"/>
   <query>
      select
        null as [Year]
        , null as [HY]
        , 'Q' + cast([CalendarQuarter] as char(1)) + ' CY 2008' as QY
        , count(distinct SalesOrderNumber)
      from
        ...
   </query>
</equalTo>

Now that you’ve explicitly told NBi that it should expect four columns and not care about the first two, you’ll be able to run your test and find real issues or successes!

NBi.NUnit.Runtime.TestSuite.Order's count by Quarter Year for CY 2008:
NBi.NUnit.Runtime.CustomStackTraceAssertionException : Execution of the query doesn't match the expected result 

  Expected: 
Result-set with 3 rows

     #0 (Column1) | #1 (Column2) | #2 (Column3) | #3 (Column4)   
     (Text)       | (Text)       | KEY (Text)   | VALUE (Numeric)
     ------------ | ------------ | ------------ | --------------- 
     (null)       | (null)       | Q1 CY 2008   | 6080           
     (null)       | (null)       | Q2 CY 2008   | 6886           
     (null)       | (null)       | Q3 CY 2008   | 976            


  But was:  
Result-set with 3 rows

     #0 ([Date].[Calendar].[Calendar Year].[MEMBER_CAPTION]) | #1 ([Date].[Calendar].[Calendar Semester].[MEMBER_CAPTION]) | #2 ([Date].[Calendar].[Calendar Quarter].[MEMBER_CAPTION]) | #3 ([Measures].[Order Count])
     (Text)                                                  | (Text)                                                      | KEY (Text)                                                 | VALUE (Numeric)              
     ------------------------------------------------------- | ----------------------------------------------------------- | ---------------------------------------------------------- | ----------------------------- 
     CY 2008                                                 | H1 CY 2008                                                  | Q1 CY 2008                                                 | 6082                         
     CY 2008                                                 | H1 CY 2008                                                  | Q2 CY 2008                                                 | 6886                         
     CY 2008                                                 | H2 CY 2008                                                  | Q3 CY 2008                                                 | 976                          





Non matching value rows:
------------------------

Result-set with 1 row

     #0 ([Date].[Calendar].[Calendar Year].[MEMBER_CAPTION]) | #1 ([Date].[Calendar].[Calendar Semester].[MEMBER_CAPTION]) | #2 ([Date].[Calendar].[Calendar Quarter].[MEMBER_CAPTION]) | #3 ([Measures].[Order Count])
     (Text)                                                  | (Text)                                                      | KEY (Text)                                                 | VALUE (Numeric)              
     ------------------------------------------------------- | ----------------------------------------------------------- | ---------------------------------------------------------- | ----------------------------- 
     CY 2008                                                 | H1 CY 2008                                                  | Q1 CY 2008                                                 | 6082 <> 6080                 

Using NBi to assert calculations based on measures

A few days ago on GitHub, I received a request to explain how to assert the results of some calculations with measures and also how to automate these tests with NBi. This blog post is the complete answer to this question.

To illustrate the case, I’ll use the Adventure Works 2012 multidimensional cube, but you can do the same with any other database. In this cube, you have a measure-group with the measures Internet Gross Profit, Internet Sales Amount and Internet Gross Profit Margin. As you can guess, the last measure is the result of the division of the first two (expressed as a percentage). My test will assert that this is still the case when I’m slicing by customers’ country.

To achieve this, I’m defining a system-under-test with a result-set defined by an MDX query:

<system-under-test>
  <resultSet>
    <query>
      <![CDATA[
      select
      {
        [Measures].[Internet Gross Profit],
        [Measures].[Internet Sales Amount],
        [Measures].[Internet Gross Profit Margin]   
      } on 0,
      {
        [Customer].[Country].members
      } on 1
      from
        [Adventure Works]
      ]]>
    </query>
  </resultSet>
</system-under-test>

Asserting the calculations’ result can be done with an all-rows assertion and a predicate where you’ll check that the content of the last measure is equal to the quotient of the first two.

Due to the complex names of columns for MDX queries, it’s recommended to use aliases. With NBi, we can define an alias based on the column’s position (starting at 0). Position 0 will be the column [Customer].[Country].members; we can skip it. Columns 1 to 3 will receive aliases:

<alias column-index="1">Profit</alias>
<alias column-index="2">Sales</alias>
<alias column-index="3">Margin</alias>

The test will need to compute the quotient between Profit and Sales and check that this value is equal to the value returned by the cube in the column Margin. This can be expressed in the following expression:

<expression name="calculate">
    Round(Profit/Sales, 4) = Round(Margin, 4)
</expression>

The Round function is provided by the underlying library for the evaluation of these expressions: NCalc. In this case, it’s extremely useful to avoid rounding issues during computations. Due to the percentage, I didn’t round to 2 digits after the decimal separator but to 4.

Based on the expression defined above, we can use the predicate true. Each row not validating the equality defined above will return false and not validate the predicate. The test is an all-rows assertion, so if any row doesn’t validate the predicate, the test will fail.

Some developers will perhaps try to use the predicate equal in place of true but, unfortunately, at this moment NBi can only handle a fixed value for the equal predicate and it’s not possible to specify a column. This limitation rules out the approach with equal.

The full code for this test is

<test name="Profit, Sales and Margin">
  <system-under-test>
    <resultSet>
      <query>
        <![CDATA[
      select
      {
          [Measures].[Internet Gross Profit],
          [Measures].[Internet Sales Amount],
          [Measures].[Internet Gross Profit Margin] 
      } on 0,
      {
          [Customer].[Country].members
      } on 1
      from
          [Adventure Works]
      ]]>
      </query>
    </resultSet>
  </system-under-test>
  <assert>
    <all-rows>
      <alias column-index="1">Profit</alias>
      <alias column-index="2">Sales</alias>
      <alias column-index="3">Margin</alias>
      <expression name="calculate">
          Round(Profit/Sales, 4) = Round(Margin, 4)
      </expression>
      <predicate operand="calculate" type="boolean">
        <true />
      </predicate>
    </all-rows>
  </assert>
</test>

The second part of this blog post is about automating the generation of this kind of test. Let’s start by defining our test-cases. To define them, I’ll put the aliases that I’d like to use (column alias), the expected formula (column expression), the exact names of the measures (column measure) and how many digits are needed for rounded comparisons (column round). In this example, I created two test-cases. The first test-case will check that Profit is equal to Sales minus Cost and the second one asserts that Margin is equal to Profit divided by Sales.

alias;expression;measure;round
Sales,Cost,Profit;Sales-Cost;[Measures].[Internet Sales Amount],[Measures].[Internet Total Product Cost],[Measures].[Internet Gross Profit];2
Profit,Sales,Margin;Profit/Sales;[Measures].[Internet Gross Profit],[Measures].[Internet Sales Amount],[Measures].[Internet Gross Profit Margin];4

To load this CSV file in genbiL, I’ll need to use the load file function:

case load file 'calculations.csv';

Note that in this CSV file each column is separated by a semicolon (;). In the columns alias and measure, I have more than one value and each of these values is separated by a comma (,).

To parse this CSV file with genbiL, I’ll need to use the function split. This function transforms a single value into an array of values using a separator.

case split column 'alias' with value ',';
case split column 'measure' with value ',';

After these two lines of code, the test-cases’ scope consists of two cases and each of them has an array of three elements in the columns alias and measure.

The next steps are straightforward: loading the template (which we’ll compose just after this), generating the tests and saving them.

template load file 'calculations.nbitt';
suite generate;
suite save 'calculations.nbits';

Once we have the CSV file and the genbiL script, we still need to build a template. There are two tricky parts in this template. The first one is to generate the list of measures in the MDX queries. Elements of this list must be separated by a comma. If you have three elements, you’ll need two commas.

select
   {
      [Measures].[Internet Gross Profit],
      [Measures].[Internet Sales Amount],
      [Measures].[Internet Gross Profit Margin]
   } on 0,

To achieve that with the StringTemplate engine, you’ll need to use the separator keyword (doc) for any variable containing an array.

select
   {
      $measure; separator=", 
"$
   } on 0,

The second tricky part sounds similar but is really different: defining the list of aliases. In this case there is no separator; if you have three elements, you’ll need exactly the same xml element three times. An additional detail is that we need to add figures starting at 1 and increasing with each new element. Fortunately, StringTemplate has the concept of anonymous templates, which can be really helpful here. In this case we explain that, for each value contained in the array of column alias, we have to create an xml element alias and add an attribute column-index with an incremental value. Note that the incremental value is automatically assigned to $i$ by StringTemplate and starts at 1. Each element contained in the array alias will be assigned to the variable x in the anonymous template. The anonymous template is surrounded by the pipe and the right curly brace.

$alias:{ x | <alias column-index="$i$">$x$</alias>}$

The template also uses the StringTemplate functions trunc, returning all the elements of an array except the last one, and last, returning the last element of an array.
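For the first test-case defined above (aliases Sales, Cost and Profit), the test name built with these two functions renders as “Sales and Cost to calculate Profit”.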

The full listing for the template is available below:

  <test name="$trunc(alias); separator=" and "$ to calculate $last(alias)$">
    <system-under-test>
      <resultSet>
        <query>
          <![CDATA[
          select
          {
         $measure; separator=", 
"$
          } on 0,
          {
          [Customer].[Country].members
          } on 1
          from
          [Adventure Works]
          ]]>
        </query>
      </resultSet>
    </system-under-test>
    <assert>
      <all-rows>
        $alias:{ x | <alias column-index="$i$">$x$</alias>}$
        <expression name="calculate">
           Round($expression$, $round$) = Round($last(alias)$, $round$)
        </expression>
        <predicate operand="calculate" type="boolean">
          <true />
        </predicate>
      </all-rows>
    </assert>
  </test>

And voilà, now you just need to run your genbiL script and you’ll get your two tests generated. Feel free to add new cases!

Automating the testing of Elasticsearch queries with NBi

For five years now, you’ve been able to run automated tests on your SQL Server databases and SSAS cubes with the help of the open-source framework NBi. This framework was extended to support other relational databases (using the OleDb or ODBC drivers) in 2017. In 2018, I’m adding support for NoSQL databases. Earlier this year, I released some extensions for graph databases; this new extension targets the Elasticsearch RESTful search and analytics engine. The code for this extension to the NBi framework is hosted on GitHub.

If you’re used to NBi, you can directly jump to the next paragraph. For newcomers, NBi is an open-source framework dedicated to the testing of BI and data-centric solutions: it supports complex comparison of query results, asserting conditions on each row of a result-set, running ETL (SSIS), validating queries from reports (SSRS) and many other possibilities. On top of this, it has an advanced feature to automate the process of creating test-suites.

In order to set up an environment for testing some queries on an Elasticsearch instance, you must download version 1.18 of NBi and its extension for Elasticsearch in version 1.0. Unzip the content of the NBi download and then partially override it with the content of the NBi.Elasticsearch extension (dlls from both packages must be deployed in a unique directory). If NUnit 2.6.4 is not available on your computer, don’t forget to download it. If you need more info about how to set up an NBi extension, read the documentation.

When NBi and its extension are installed, create your test-suite with its config and NUnit project files as you’d normally do for any NBi test-suite (or, if you’re in a hurry, download the files with the examples here, but don’t forget to update the connection-string when trying to run it). If you need more info about how to set up a test-suite, read the documentation here.

For this example, I started an Elasticsearch instance and loaded the sample data from the Bank example:

curl -H "Content-Type: application/json" -XPOST "localhost:9200/bank/_doc/_bulk?pretty&refresh" --data-binary "@accounts.json"

When the environment is correctly configured, you must edit your config file to reference the extension NBi.Core.Elasticsearch.

<configuration>
  <configSections>
    <section name="nbi" type="NBi.NUnit.Runtime.NBiSection, NBi.NUnit.Runtime"/>
  </configSections>
  <nbi testSuite="...">
    <extensions>
      <add assembly="NBi.Core.Elasticsearch"/>
    </extensions>
  </nbi>
</configuration>

The concept of a connection-string is not familiar to Elasticsearch but, to match with relational databases, we use a concatenation of the usual parameters defined to connect to an Elasticsearch instance with the low-level client available in C#. You must provide a URL with the hostname, the port, the username and the password. The protocol should be elasticsearch at the moment, to make a distinction with other databases relying on the http protocol.

elasticsearch://user:password@localhost:9200

In your test-suite, querying a relational database with SQL or an Elasticsearch cluster with a query written in the Query DSL is not different. You’re still using the result-set and query elements. The query is specified within the query element.

Elasticsearch queries can return complex results such as hits, aggregations or counts. Currently, NBi’s support for Elasticsearch is limited to queries returning hits or aggregations.

This very first test asserts that the query searching for the 5 persons with the highest balances effectively returns 5 rows.

<test name="5 highest balances returns 5 rows" uid="0001">
  <system-under-test>
    <resultSet>
      <query>
        <![CDATA[
        GET bank/_search
        {
           "query": {"match_all": { }}
           , "size": 5
           , "_source": ["gender", "age", "balance"]
           , "sort": [ { "balance" : {"order" : "desc"}}]
        }
        ]]>
      </query>
    </resultSet>
  </system-under-test>
  <assert>
    <row-count>
      <equal>5</equal>
    </row-count>
  </assert>
</test>

In the second test, the same query as in the previous example should return rows with an age between 20 and 40 and a balance greater than $48,000.

<test name="5 highest balance" uid="0001">
  <system-under-test>
    <resultSet>
      <query>
        <![CDATA[
        GET bank/_search
        {
           "query": {"match_all": { }}
           , "size": 5
           , "_source": ["gender", "age", "balance"]
           , "sort": [ { "balance" : {"order" : "desc"}}]
        }
        ]]>
      </query>
    </resultSet>
  </system-under-test>
  <assert>
    <all-rows>
      <combination operator="and">
        <predicate operand="age">
          <within-range>[20;40]</within-range>
        </predicate>
        <predicate operand="balance">
          <more-than>48000</more-than>
        </predicate>
      </combination>
    </all-rows>
  </assert>
</test>

The next test checks the result of an aggregation: more specifically, that the three states with the highest average balance are Washington, Alabama and Rhode Island, but also that the count of documents and the average balance are more or less in the expected range.

<test name="Top 3 of average balance by state" uid="0002">
  <system-under-test>
    <resultSet>
      <query>
        <![CDATA[
        GET /bank/_search
        {
          "size": 0,
          "aggs": {
            "group_by_state": {
              "terms": {
                "field": "state.keyword",
                "size" : 3,
                "order": {
                  "average_balance": "desc"
                }
              },
              "aggs": {
                "average_balance": {
                  "avg": {
                    "field": "balance"
                  }
                }
              }
            }
          }
        }
        ]]>
      </query>
    </resultSet>
  </system-under-test>
  <assert>
    <equalTo tolerance="1">
      <resultSet>
        <row>
          <cell column-name="key">WA</cell>
          <cell column-name="doc_count">(+)</cell>
          <cell column-name="average_balance">43265</cell>
        </row>
        <row>
          <cell>AL</cell>
          <cell>6</cell>
          <cell>41418</cell>
        </row>
        <row>
          <cell>RI</cell>
          <cell>[5;10]</cell>
          <cell>40040</cell>
        </row>
      </resultSet>
    </equalTo>
  </assert>
</test>
(figure: FoF-running-test-suite)

Our test-suite is green … time to write some additional tests! Feel free to explore the other kinds of tests that NBi supports and report your ideas for improvement on the GitHub issues for NBi.Elasticsearch.

To know more about NBi, check the website at www.nbi.io and if you have any questions or suggestions, feel free to ask on Twitter or on the GitHub repository (section “issues”). The code of this extension is also available on GitHub.

Automating the testing of Neo4j cypher queries with NBi

For five years now, you’ve been able to run automated tests on your SQL Server databases and SSAS cubes with the help of the open-source framework NBi. This framework was extended to support other relational databases (using the OleDb or ODBC drivers) in 2017. In 2018, I’m adding support for graph databases. Earlier this year, I released an extension for the Azure Cosmos DB API, notably the Graph API, and another for Tinkerpop-enabled databases supporting Gremlin queries. This new extension supports Neo4j databases and the Cypher query language. The code for this extension is hosted on GitHub.

If you’re used to NBi, you can directly jump to the next paragraph. For newcomers, NBi is an open-source framework dedicated to the testing of BI solutions: it supports complex comparison of query results, asserting conditions on each row of a result-set, running ETL (SSIS), validating queries from reports (SSRS) and many other possibilities. On top of this, it has an advanced feature to automate the process of creating test-suites.

In order to set up an environment for testing some Cypher queries, you must download version 1.18 of NBi and its extension for Neo4j in version 1.0. Unzip the content of the NBi download and then partially override it with the content of the NBi.Neo4j extension (dlls from both packages must be deployed in a unique directory). If NUnit 2.6.4 is not available on your computer, don’t forget to download it. If you need more info about how to set up an NBi extension, read the documentation.

When NBi and its extension are installed, create your test-suite with its config and NUnit project file as you’d normally do for any NBi test-suite (or, if you’re in a hurry, download the files with the examples here, but don’t forget to update the connection-string when trying to run it). If you need more info about how to set up a test-suite, read the documentation here.

For this example, I started a Neo4j instance and created some vertexes and edges based on the classical movies database shipped with Neo4j:

:play movies graph

When the environment is correctly configured, you must edit your config file to reference the extension NBi.Core.Neo4j.

<configuration>
  <configSections>
    <section name="nbi" type="NBi.NUnit.Runtime.NBiSection, NBi.NUnit.Runtime"/>
  </configSections>
  <nbi testSuite="...">
    <extensions>
      <add assembly="NBi.Core.Neo4j"/>
    </extensions>
  </nbi>
</configuration>

The concept of a connection-string is not familiar to Neo4j but, to match with relational databases, we use a concatenation of the usual parameters defined to connect to a Neo4j instance with the bolt driver. You must provide a URL with the hostname, the port, the username and the password.

bolt://user:password@localhost:7474

In your test-suite, querying a relational database with SQL or a Neo4j instance with a Cypher query is not different. You’re still using the result-set and query elements. The Cypher query is specified within the query element. It supports parameters and template-tokens.

Cypher queries can return complex results such as paths that are not easily mapped to a table (result-set). Currently, NBi’s support for Cypher is limited to queries returning a list of values.

In this test, NBi asserts that the query returning Tom Hanks’ movies released after 2000 effectively returns what we’re expecting.

<test name="Tom Hanks' movies after 2000" uid="0001">
  <system-under-test>
    <resultSet>
      <query>
        MATCH (tom:Person {name: "Tom Hanks"})-[:ACTED_IN]->(tomHanksMovies)
        WHERE tomHanksMovies.released>2000
        RETURN tomHanksMovies.title, tomHanksMovies.released
      </query>
    </resultSet>
  </system-under-test>
  <assert>
    <equalTo>
      <resultSet>
        <row>
          <cell>Charlie Wilson's War</cell>
          <cell>2007</cell>
        </row>
        <row>
          <cell>The Polar Express</cell>
          <cell>2004</cell>
        </row>
        <row>
          <cell>The Da Vinci Code</cell>
          <cell>2006</cell>
        </row>
        <row>
          <cell>Cloud Atlas</cell>
          <cell>2012</cell>
        </row>
      </resultSet>
    </equalTo>
  </assert>
</test>
(figure: FoF-running-test-suite)

Our test-suite is green … time to write some additional tests!

To know more about NBi, check the website at www.nbi.io and if you have any questions or suggestions, feel free to ask on Twitter or on the GitHub repository (section “issues”). The code of this extension is also available on GitHub.

Automating the testing of gremlin queries with NBi

For five years now, you’ve been able to run automated tests on your SQL Server databases and SSAS cubes with the help of the open-source framework NBi. This framework was extended to support other relational databases (using the OleDb or ODBC drivers) in 2017. In 2018, I’m adding support for graph databases. Earlier this year, I released an extension for the Neo4j bolt driver supporting Cypher queries and another compatible with some of the Azure Cosmos DB APIs, notably the Graph API. This new extension supports all the Apache Tinkerpop-enabled databases: it is tested on Azure Cosmos DB and Tinkerpop Server but should also work with other graph databases supporting Gremlin such as JanusGraph, Titan and OrientDB. The code for this extension is hosted on GitHub. Note that Azure Cosmos DB is now recommending the Gremlin driver in place of the Graph API dlls, so this extension should also be your first choice if you want to connect to an Azure Cosmos DB database and run Gremlin queries.

If you’re used to NBi, you can directly jump to the next paragraph. For newcomers, NBi is an open-source framework dedicated to the testing of BI solutions: it supports complex comparison of query results, asserting conditions on each row of a result-set, running ETL (SSIS), validating queries from reports (SSRS) and many other possibilities. On top of this, it has an advanced feature to automate the process of creating test-suites.

In order to set up an environment for testing some Gremlin queries, you must download version 1.18 of NBi and its extension for Gremlin in version 1.0. Unzip the content of the NBi download and then partially override it with the content of the NBi.Gremlin extension (dlls from both packages must be deployed in a unique directory). If NUnit 2.6.4 is not available on your computer, don’t forget to download it. If you need more info about how to set up an NBi extension, read the documentation.

When NBi and its extension are installed, create your test-suite with its config and NUnit project file as you’d normally do for any NBi test-suite (or, if you’re in a hurry, download the files with the examples here, but don’t forget to update the connection-string when trying to run it). If you need more info about how to set up a test-suite, read the documentation here.

For this example, I started a Tinkerpop Server and created some vertexes and edges with:

g.V().drop()
g.addV('person').property('id', 'thomas')
  .property('firstName', 'Thomas').property('age', 44)
g.addV('person').property('id', 'mary')
  .property('firstName', 'Mary')
  .property('lastName', 'Andersen')
  .property('age', 39)
g.addV('person').property('id', 'ben')
  .property('firstName', 'Ben')
  .property('lastName', 'Miller')
g.addV('person').property('id', 'robin')
  .property('firstName', 'Robin')
  .property('lastName', 'Wakefield')
g.V().has('firstName','Thomas').addE('knows')
  .to(g.V().has('firstName','Mary'))
g.V().has('firstName','Thomas').addE('knows')
  .to(g.V().has('firstName','Ben'))
g.V().has('firstName','Ben').addE('knows')
  .to(g.V().has('firstName','Robin'))

When the environment is correctly configured, you must edit your config file to reference the extension NBi.Core.Gremlin.

<configuration>
  <configSections>
    <section name="nbi" type="NBi.NUnit.Runtime.NBiSection, NBi.NUnit.Runtime"/>
  </configSections>
  <nbi testSuite="...">
    <extensions>
      <add assembly="NBi.Core.Gremlin"/>
    </extensions>
  </nbi>
</configuration>

You also need to edit your config file and add all the binding redirections that Azure Cosmos DB comes with. To achieve this, copy/paste the content of the runtime element of NBi.Core.CosmosDb.dll.config into a new runtime element in the config file of your test-suite (FoF.config).

<runtime>
    <assemblyBinding xmlns="urn:schemas-microsoft-com:asm.v1">
      <dependentAssembly>
        <assemblyIdentity name="System.Diagnostics.DiagnosticSource" publicKeyToken="cc7b13ffcd2ddd51" culture="neutral" />
        <bindingRedirect oldVersion="0.0.0.0-4.0.1.0" newVersion="4.0.1.0" />
      </dependentAssembly>
      <dependentAssembly>
        <assemblyIdentity name="System.Net.Http" publicKeyToken="b03f5f7f11d50a3a" culture="neutral" />
        <bindingRedirect oldVersion="0.0.0.0-4.2.0.0" newVersion="4.2.0.0" />
      </dependentAssembly>
      <dependentAssembly>
        <assemblyIdentity name="System.Net.WebSockets" publicKeyToken="b03f5f7f11d50a3a" culture="neutral" />
        <bindingRedirect oldVersion="0.0.0.0-4.0.2.0" newVersion="4.0.1.0" />
      </dependentAssembly>
      <dependentAssembly>
        <assemblyIdentity name="System.Net.WebSockets.Client" publicKeyToken="b03f5f7f11d50a3a" culture="neutral" />
        <bindingRedirect oldVersion="0.0.0.0-4.0.2.0" newVersion="4.0.1.0" />
      </dependentAssembly>
      <dependentAssembly>
        <assemblyIdentity name="Newtonsoft.Json" publicKeyToken="30ad4fe6b2a6aeed" culture="neutral" />
        <bindingRedirect oldVersion="0.0.0.0-11.0.0.0" newVersion="11.0.0.0" />
      </dependentAssembly>
    </assemblyBinding>
</runtime>

The concept of a connection-string is not familiar to Tinkerpop or Gremlin but, to match with relational databases, we use a concatenation of the usual parameters defined to connect to an instance supporting Gremlin queries. You must provide the hostname, the port, the enableSsl flag, the username and the password. You must also add a tip for NBi to identify the correct type of connection-string by adding an api tag with the value gremlin. If your database is an Azure Cosmos DB instance, then your username will be /dbs/{db}/colls/{coll} where {db} stands for the name of the database and {coll} for the name of the collection. Also, note that the hostname is a URL in the subdomain .gremlin.cosmosdb.azure.com.

hostname=https://gremlin-demo.gremlin.azure.com;
port=443;
enableSsl=true;
username=database=/dbs/Friends/colls/FoF;
password=F...==;
api=gremlin;

The same connection-string for a Tinkerpop Server, running on your computer, will be

hostname=localhost;
port=8182;
enableSsl=false;
username=database=Anything;
password=Anything;
api=gremlin;

Note that username and password are not expected by a Tinkerpop Server. They can’t be omitted, but any value will do.

In your test-suite, querying a relational database with SQL or a Tinkerpop-enabled database with a Gremlin query is not different. You’re still using the result-set and query elements. The Gremlin query is specified within the query element. It doesn’t support parameters at the moment but you’re still free to use template-tokens.

Gremlin queries can return complex results such as paths that are not easily mapped to a table (result-set). Currently, NBi’s support for Gremlin is limited to queries returning a list of vertexes, a list of edges or a list of values.

In this first test, NBi asserts that the query returning Thomas’ friends of friends effectively returns the correct result: in this case, a unique vertex representing a person named Robin.

(figure: FoF-Thomas-Robin)
<test name="Thomas' friends of friends" uid="0001">
    <system-under-test>
      <resultSet>
        <query>
          g.V().has('firstName', 'Thomas').outE('knows').inV().hasLabel('person').outE('knows').inV().hasLabel('person')
        </query>
      </resultSet>
    </system-under-test>
    <assert>
      <equalTo values-default-type="text">
        <resultSet>
          <row>
            <cell>robin</cell>
            <cell>person</cell>
            <cell>vertex</cell>
            <cell>Robin</cell>
            <cell>Wakefield</cell>
          </row>
        </resultSet>
      </equalTo>
    </assert>
  </test>

In this second test, we’re asserting that all the vertexes have an age property greater than 20, or no age at all.

<test name="All of them are older than 20 (or unknown)" uid="0002">
    <system-under-test>
      <resultSet>
        <query>
          g.V()
        </query>
      </resultSet>
    </system-under-test>
    <assert>
      <all-rows>
        <combination operator="or">
          <predicate operand="age">
            <null/>
          </predicate>
          <predicate operand="age">
            <more-than>20</more-than>
          </predicate>
        </combination>
      </all-rows>
    </assert>
  </test>

And finally, in this last test, we’re checking that the result-set returned by a projection matches the expected result.

  <test name="Count of person's relationships" uid="0003">
    <system-under-test>
      <resultSet>
        <query>
          g.V().project('FirstName','KnowsCount').by('firstName').by(out().count())
        </query>
      </resultSet>
    </system-under-test>
    <assert>
      <equalTo>
        <resultSet>
          <row>
            <cell>Mary</cell>
            <cell>0</cell>
          </row>
          <row>
            <cell>Robin</cell>
            <cell>0</cell>
          </row>
          <row>
            <cell>Ben</cell>
            <cell>1</cell>
          </row>
          <row>
            <cell>Thomas</cell>
            <cell>2</cell>
          </row>
        </resultSet>
      </equalTo>
    </assert>
  </test>
(figure: FoF-running-test-suite)

Our test-suite is green … time to write some additional tests!

To know more about NBi, check the website at www.nbi.io and if you have any questions or suggestions, feel free to ask on Twitter or on the GitHub repository (section “issues”). The code of this extension is also available on GitHub.

Table.Distinct and case pitfalls with Power Query/M language

In a previous post, I already blogged about case pitfalls in the Power Query/M language. By default, comparisons are case-sensitive and that could be an issue when sorting lists or tables, but also when using the List.Distinct and Table.Distinct functions.

To explain the issue, I’ll use the following table:

(figure: distinct-case)

Note that at the top of the table, I have the value ADO.NET twice, once with uppercase and once with lowercase for the “Net” part. If I apply Table.Distinct, the two values will be considered as distinct and the duplicate won’t be removed.

(figure: distinct-case-sensitive)

Will it be an issue? If this is your key and it’s on the one side of a one-to-many relationship in your model, then yes, it will be an issue.

(figure: duplicate-error-model)

The question on the table is how to manage this case. The MSDN documentation informs us that a second parameter can be defined for Table.Distinct:

Removes duplicate rows from a table, ensuring that all remaining rows are distinct.

Table.Distinct(
   table as table, 
   optional equationCriteria as any
   ) as table  

But no additional info about what the equationCriteria is can be found on this page. You’ll have more luck if you read the documentation for the function List.Distinct. There, you have a few samples and you can understand that you’ll be able to use any function, but also the Comparer list of functions. The comparers are also documented on MSDN.

In that case, the most obvious choice is to use the comparer named Comparer.OrdinalIgnoreCase to be sure that the comparison of the different items won’t be impacted by the difference in casing.

(figure: distinct-case-insensitive)
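For reference, a minimal sketch of what this looks like in M, assuming the column holding these values is named Label (adapt the name to your own table):

let
    Source = Table.FromRecords({
        [Label = "ADO.NET"],
        [Label = "ADO.Net"],  // same entry, different casing for the "Net" part
        [Label = "OLE DB"]
    }),
    // The second parameter (equationCriteria) takes the column name and a comparer,
    // so duplicates are detected regardless of casing.
    Deduplicated = Table.Distinct(Source, {"Label", Comparer.OrdinalIgnoreCase})
in
    Deduplicated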

Handling Html entities when parsing web pages with Power Query/M Language

In a previous blog post, I explained how to handle html entities when they are number-based such as &#192; or &#9827;. This post covers the other possibility to define an html entity, by name: &Agrave; or &clubs;.

Before going further, let’s note that, normally, entities defined by their name are case-sensitive. It means that &Agrave;, resulting in “À”, is different from &agrave;, resulting in “à”. Our code will also support mistakes in casing, if and only if the mistake doesn’t result in an ambiguity. Based on that, &AGRAVE; won’t be allowed but &CLUBS; will be.

The code is relatively straightforward: create a list of all the known html entities and the corresponding values.

let
   list =
   {
       [Name="Agrave", Number=192]
       , [Name="Aacute", Number=193]
       , [Name="Acirc", Number=194]
       , [Name="Atilde", Number=195]
...

Then parse the entity’s name and look for the corresponding record in the list. The search must first be executed with the specified casing; if nothing is found, use a case-insensitive approach. If there is no match, or more than one, then raise an error.

        },
        token = if Text.StartsWith(entity, "&")
                   and Text.EndsWith(entity, ";")
                then substring
                else entity,
        substring = Text.Range(entity, 1, Text.Length(entity)-2),
        correctCasing = List.Select(
                 list
                 , each [Name]=token),
        anyCasing = List.Select(
                 list
                 , each Text.Upper([Name])=Text.Upper(token)),
        single =  try List.Single(correctCasing)[Number]
                  otherwise List.Single(anyCasing)[Number],
        handleError = Error.Record(
                       "Invalid parameter"
                       , "Unable to translate the html entity &" & token & ";"
                       , if List.Count(anyCasing)=0 then
                            "No match with the list of known entities."
                         else
                            "Multiple matches with the list of known entities. Check your the casing of your entity's name."),
        char = try Character.FromNumber(single)
               otherwise handleError
    in
        char;

This function is available as a gist at https://gist.github.com/Seddryck/36d24e05c233cccf90e6a88a1af39c68
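Assuming you bind the function from the gist to a name, say HtmlEntityToChar (a hypothetical name, use whatever you called your query), it can be used like this:

let
    // HtmlEntityToChar is a hypothetical name for the function defined above
    agrave = HtmlEntityToChar("&Agrave;"),  // exact casing, returns "À"
    clubs  = HtmlEntityToChar("&CLUBS;")    // casing mistake, unambiguous, resolved to "♣"
in
    {agrave, clubs}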

Passing a function to another function to improve maintainability in Power Query/M Language

Developing is relatively easy but maintaining is harder. Code that is relatively easy to maintain must strike a good balance between expressiveness and compactness. In the Power Query/M language, passing a function to another function, as a parameter, helps keep your code compact and lets you reuse large sets of code where the difference is tiny. Have you ever coded the same function twice, once using Text.Upper and once using Text.Lower? Then this blog post could interest you. Of course, abusing this feature could result in obfuscated code, so use it cautiously.

To illustrate this, I’ll use the same use-case as in this blog post where I was trying to parse the results of my daughters at their gymnastics championship. One of the (many) difficulties in this parsing was to determine the first name, last name and team. These three pieces of information were on the same line, separated by spaces. In the most common pattern, it looks like this:

CHARLIER Clémence A.R. GymBraine
CHARLIER Alice A.R. GymBraine

The pattern could have been

Lastname FirstName Team

But you’ll quickly find some exceptions and difficulties. First notable pitfall: you can’t determine how many words compose the team’s name. It could be two (A.R. GymBraine), three (ABC Gym Dinant) or four (Royale Garde Saint Jean), with optional hyphens, quotes, dots and parentheses. Nothing reliable; you should take the hypothesis that everything after the first name is the team’s name.

Some people have a composed last name such as VAN DER PERRE, so we can’t really rely on the hypothesis that the last name is just the first word. The pattern is rather that the last name is fully capitalized.

For the first names, we have the same kind of issue: the first names Anne-laure and Lilly-shiva are composed but easily identifiable because there is no space between their parts. Unfortunately, some first names are not in line with the previous assumption. I found a Marie antoinette and another Adina - ioana. Fortunately, you’ll notice that in both cases only the first letter is uppercase, all remaining letters are lowercase, and the team’s name always starts with an uppercase!

From that point the pattern is the following:

LAST NAME First name Team's name

The delimiter condition for the last name is that the next two characters are an uppercase followed by a lowercase (Fi in the case above). For the first name, the delimiter condition is a space followed by an uppercase (space + T in the example above).

Cooooool … we “just” have to code that.

First step, let’s create a helper function to check if a character is uppercase or lowercase. A naïve implementation could be:

isUppercase = (c as text) => c= Text.Upper(c) 

But this code will return true for a space or a hyphen. That’s not what we need: we need to strictly limit ourselves to letters. In a previous blog post, I already explained how we can use Character.FromNumber and Character.ToNumber to rely on the ASCII table and check whether a character is a digit or not. We can apply the same approach with uppercase and lowercase.

isUppercase = (c as text) =>
  Character.ToNumber(c) >= Character.ToNumber("A")
      and Character.ToNumber(c) <= Character.ToNumber("Z")

I could do the same for lowercase, replacing “A” by “a” and “Z” by “z”, and create a new function. Sure, but it’s probably a good place to use, for the first time, the pattern of passing a function to another function. I’ll rewrite the previous function as:

isCase = (c as text, case as function) =>
  Character.ToNumber(c) >= Character.ToNumber(case("A"))
      and Character.ToNumber(c) <= Character.ToNumber(case("Z"))

Then, I could create two additional helpers:

isUpperCase = (c as text) => isCase(c, (x)=>Text.Upper(x)),
isLowerCase = (c as text) => isCase(c, (x)=>Text.Lower(x)),

Did you notice how the function Text.Upper was passed as a parameter?

Now that we can test the casing of a character, we need to tackle another challenge: parsing two consecutive characters in a string and finding the first occurrence matching the corresponding delimiter condition.

Let’s start by building a list of the pairs of consecutive characters. To do that, we’ll use List.Generate and its not-so-well-known fourth parameter, the selector. This parameter lets you define a function that generates the members of the list in place of using the iterator.

let
    PositionOf = (message as text, condition as function) as text => 
    let
       list = List.Generate(
            ()=>0,
            each _ < Text.Length(message)-1,
            each _+1, 
            each [
                i = _, 
                x = Text.Range(message, _, 1), 
                y = Text.Range(message, _+1, 1)]
        ),

This function runs from the first character of the string to the penultimate character and composes a record with, respectively, the position of the pair of characters in the string, the first character of the pair and the second one.

Then, we need to select the first record matching a condition to get the position. The condition will change depending on whether we’re looking for the first name or the last name … so let it be a function!

       first = List.First(
                 List.Select(list, each condition(_))
               )[i], 
        word = Text.Start(message, first)
    in
        word
in
    PositionOf

The conditions for the first name and last name will be expressed as:

conditionLastName = (r as record) => 
    isUpperCase(r[x]) and isLowerCase(r[y]),
conditionFirstName = (r as record) => 
    r[x]=" " and isUpperCase(r[y]),

To test the previous code, we can implement the following test routine:

    list = {
              "ALPHA Beta L'équipe", 
              "ALPHA GOGO Beta Team's name", 
              "ALPHA GOGO Beta - gamma My 'team'"},
    temp = List.Transform(
              list,
              each [ 
                 full = _,
                 lastName = 
                    Text.Trim(PositionOf(_, conditionLastName)),
                 firstName = 
                    PositionOf(Text.Range(_, Text.Length(lastName)+1),
                               conditionFirstName),
                 team = Text.Range(_, 
                           Text.Length(firstName) +
                           Text.Length(lastName) + 1)
           ]),
    result = Table.ExpandRecordColumn(
                Table.FromList(temp, 
                    Splitter.SplitByNothing()),
                "Column1", 
                {"full","lastName","firstName","team"})
in
    result

And we’ll get the following result:

(figure: function-end-result)