Month: November 2018

Analyzing sport results with a graph and Azure Cosmos DB

On previous blog posts, I made usage of Power BI to analyze the results of my daughters participating in gymnastics competitions. On this new series, I’ll analyze their results in karate (kumite to be precise) … using a larger set of tools: Azure Cosmos DB, Power BI and ML.Net.

Let’s start with the analysis to perform: I want to understand their results (is it an expected defeat or is it something that could be set as a counter-performance) or predict them (what’s the likelihood that we must stay at the sport hall until the podium ceremony).

Before going further, a few things that you should know about Karate tournaments before diving into the results:

  • The process is the same than in any Tennis tournament: bout’s winner goes the next round and loser is out.
  • At the exception that if you’ve been defeated by someone going to the final, you’ll be enlisted for a second stage with other competitors defeated by the same finalist and you’ll run for the bronze medal
  • It means that you’ve two bronze medals (because you’ve two finalists) and not one
  • There is always a bout’s winner: in case of draw at the end of the time allocated to a bout, the first competitor who scored is the winner (it’s a bit more complex but let’s keep this simple). This advantage is named “senshu”. In case of 0-0, judges will take the decision
  • Categories are based on age. In most sports, your age is determined based on your birth year … not in Karate where it’s based on your birthdate. It means that you can be in the same category than another competitor during months and suddenly, in the middle of the season, be in another category just because one of you get his birthday.
  • Categories are based on weight. But the bounds for the weight are not the same at each tournament, and the weight is also something really dynamic at this age.

Who thought that it would be easy to analyze? Based on the last bullets, I took the decision to collect all the results of the categories surrounding the age of my daughters (hopefully, they are twins) and I’ll take all the sub-categories based on weight.

Based on sportsML wording all the karatekas participating in the same category of a tournament are part of a division.

Now, that we have our goal and some rules, let’s download the results. We’ve a major issue: results are partial. For some competitions, you only have the podium. It means that for some tournaments, I’ve no notion of bouts’ result but just the final standing.

Based on that, we’ve the following (main) entities: we’ve the tournament, the category (age/weight) derived into a division, the bouts (part of a round and a stage) and the competitors (karatekas). For the relations, we’ve: participates/wins/loses (between karatekas and bouts), belongs to (between tournament, divisions, stages, rounds and bouts), ranks (between karateka and a division). I could go further gathering information about club and nationalities but it’s out-of-scope at this point.

I took the decision to model this as a graph. Why? Three important factors drove me to this choice:

  • First argument, I’m not really interested by the entities but more by their relations. That mostly exclude column-based and document models.
  • Second argument, I’ll look for patterns such as “Who has been defeated by A and has defeated C?”
  • karate-query-1

    or have we got any cycle such as “A defeated B who defeated C but A defeated C”.

    karate-query-2
    This kind of queries are complex to model in SQL compared to graph query languages (Gremlin or Cypher).

  • Last argument: the partial results. A graph can easily handle partial information with its schema-less approach. You shouldn’t consider this argument with a storage point-of-view (I’m aware of the null value in relational databases). This argument makes sense if you consider that the relations could be inferred in a graph and that is harder to put in place with a relational model.

 
Based on consideration above my schema looks like:
karate-full-schema

As you could see, the “defeated by” used in the query above is something that could be inferred by the pattern wins/loses between two karatekas and a bout.

karate-inferred-relation

In next post, I’ll explain how to load the dataset with the Azure Cosmos DB bulk executor library, then I’ll cover how to query the database with Gremlin to get the first insights about my daughters’ results.

Advertisements

Generating tests requiring a synchronous iteration over multiple arrays with genbiL

Todays’ challenge is to create a test-suite, with genbiL, where each test is a all-rows assertion with an undetermined count of predicates. The predicates will be combined with an or operator. Pragmatically, each test will have the following pattern for the assertion:

<all-rows>
  <combination operator="or">
    <predicate operand="...">
      <equal>...</equal>
    </predicate> 
  </combination>
</all-rows>

In the second part of this blog post, I already explained how to deal with multiple values for a variable of the template … what’s different? The case that we’re facing is somewhat similar but not identical. In the previous blog post, each variable of the template (.nbitt) was set to an array by genbiL and the template iterated over the values of the array. But the iterations were independent of each other.

If you take a look to the predicate case that we’re facing now, we haven’t one value to iterate over but two: the operand(attribute of the xml element predicate) and the reference (value between the equal elements). Much more important, we need to iterate synchronously between these two values. It means that each time I iterate for a value of the operand, I also need to iterate over a value for the reference. Let’s take a tangible case: my first predicate would be on field “foo” and I want to check if this field is equal to “1”. My second predicate is on field “bar” and will check if this field is equal to “5”. This should lead to the following assertion:

<all-rows>
  <combination operator="or">
    <predicate operand="foo">
      <equal>1</equal>
    </predicate> 
    <predicate operand="bar">
      <equal>5</equal>
    </predicate> 
  </combination>
</all-rows>

That clearly means that if I create a template-variable named operand and set it with values foo and bar and a second template-variable named reference to set with values 1 and 5. I can’t first iterate on the first variable and then on the second variable or I’d have something such as:

<all-rows>
  <combination operator="or">
    <predicate operand="foo">
    </predicate> 
    <predicate operand="bar">
    </predicate> 
      <equal>1</equal>
      <equal>5</equal>
  </combination>
</all-rows>

Hopefully, StringTemplate is a powerful engine and manage this kind of synchronous iterations over two or more arrays (StringTemplate use the wording multi-valued attribute and not array). To achieve that, you’ll need to define an anonymous-template in your template and specify two variables to this anonymous-template.

$xArray,yArray:{ x,y | ...}$

This syntax means that you want to synchronously iterate over two arrays. This template iterates max(n,m) times where n and m are the lengths of xArray and yArray, respectively.

In our specific case the anonymous-template will be:

$operands,references:{operand,reference|<predicate operand="$operand$">
          <equal>$reference$</equal>
        </predicate> 
        }$

There is nothing fancy in the genbiL code to do to support this kind of template, just load the cases and transform some variables into arrays.

case load file 'Acceptance\GenbiL\Resources\Predicate.csv';
case split columns 'operands', 'references' with value '/';

template load file 'Acceptance\GenbiL\Resources\Predicate.nbitt';
suite generate;
suite save as 'Acceptance\GenbiL\Resources\Predicate.nbits';

The csv file would be (.csv):

operands;references
foo/bar;1/5

and the full template (.nbitt):

<test name="All rows from '$field$' validate the rules.">
  <trait name="template-name">predicate.nbitt</trait>
  <system-under-test>
    <resultSet>
      <query>
        select * from myTable
      </query>
    </resultSet>
  </system-under-test>
  <assert>
    <all-rows>
      <combination operator="or">
        $operands, references:{operand, reference|<predicate operand="$operand$">
          <equal>$reference$</equal>
        </predicate> 
        }$
      </combination>
    </all-rows>
  </assert>
</test>