Writing work items and not tasks

This post is part of a series about specifications, requirements, work items and test-cases in the field of Business Intelligence solutions. The first part is about why I don’t want specifications and the second is about how to write good requirements. This third part sets the focus on work items, before jumping to the test-cases to validate with stakeholders.

I always expect the whole team to have a good understanding of the business we’re working for (even, and especially, when it’s a really complex business). The main reason is that it reminds them that our goal is to deliver working software to someone who expects specific benefits from using it … not just to deliver software corresponding to some specifications. To gain this good understanding of the business, everyone needs to read (and understand) the requirements.

The key question is: “Why do we write work items, if everyone is aware of the requirements?”. The answer is really simple: if you just have a problem, there are many ways to solve it (probably many poor solutions but a few great ones). The first reason to write work items is to help the team go in the same direction.

What are work items in this case? They are the solution, not the problem. This solution must be debated with as many team members as possible to be sure that we’re covering all the aspects of the problem. That’s what I call the design. I’m not expecting interactions with business stakeholders during this phase: it’s not their role to validate that I’ve a good solution. At this moment, they would probably not understand what they will effectively receive, and that should be their unique concern (the what and not the how). This validation by stakeholders or end-users comes later, during the writing of the test-cases.

The design phase is not something with a clear start and a clear end. It starts with the reception of the requirements and is ongoing until the end of the implementation. Anyway, you should have a good overview of your final design before starting the implementation.

At the end of this “main” design phase, we need to write the target solution down somewhere to be sure we effectively implement what has been debated (no, a picture of the whiteboard is rarely a good idea). Writing down our design also helps when we need to adapt it: if the design is written somewhere, we’ll clearly see the impact of our changes (what to re-do, effect on planning, …) but also what was poorly taken into account during the initial design phase … usually an excellent lesson learnt.

Until now, I’ve been speaking about a design and not about work items. What’s the difference? Work items are a decomposition of the design: the design is unique, work items are many.

When the team knows what should be done (the solution/the design), we also need to split the work between the different team members. With one big document describing the solution, that would be a pain. To overcome it, I’d recommend creating work items to subdivide the work to be executed by the different resources, and indicating the relations between them (precedence, …).

where-to-start

In order to achieve this, I’ve a personal rule: no work item should require more than 40 hours of implementation work (dev + test). If at any moment this rule is broken, the work item must be split (it could happen during the design, during the development or during the testing).

But it’s not so easy: we also need to be able to decide that a work item will be discarded due to a lack of time to implement it, or because its ratio of implementation cost to added value is too low. This means that we can’t split the work between layers (database, ETL, cube, report). If we need to build a dimension, I can’t split the work by layer and take the risk that we decide we’ve no time to implement the ETL when the cube, database and reports are already “done“. We always need the implementation in all our layers or this new dimension will be pointless! So don’t split work items by technical layers.

Another good reason to avoid the split by layer is the lack of responsibility of the developers when the work items are expressed this way. Even if the team has done an amazing job during the design phase, it won’t be perfect. We’ll meet unexpected issues (data quality or technical) and we’ll need to change our plans. When you’re dealing with the dimension from A to Z, you know for sure that any tiny problem during the ETL will create bigger issues during the implementation of the cube. So you need to fix the issue in the ETL … even if it was not initially described in the design. When splitting by layer, it’s really easy for the developer to not care about a tiny problem (to keep his personal planning) and let the next layers fix the issues. It will drive you to a poor solution where all issues are fixed in the last layers with tricks … not something that you’d like to maintain.

What’s a work item? It’s a small part of the global design that can be implemented in less than 40 hours, that can be discarded from the scope, and that will give business value to the end-user when implemented.

Added Value - Highway Sign

What about a deployment to QA, a code review or a refactoring? Are they work items? Not really. Naturally, if you don’t deploy, your solution is pointless, but that doesn’t mean it’s really part of your design. So it’s not a work item but a task. And the deployment script? Again not a work item, because it won’t directly give added value to the end-users and you can’t really discard it. So again, it’s a task, not a work item. Tasks should not be discussed with the end-users (and surely not validated by them) but must be included in the planning.

Back to the work items: what do they look like? Ours are usually entitled:

  • Create a new dimension D
  • Add attributes X and Y to existing dimension T
  • Build a hierarchy Z on existing dimension U
  • Create new measure-group M and link it to dimension D and T
  • Create new report “M by X and Y”

Sometimes, with experience, we can split a work item into two distinct work items, with the essential attributes of the dimension in the first one and some nice-to-haves in the second. It will be easier to discard or postpone the second work item. But that’s already an advanced level.

Then there is the description of the work item, and that’s usually where things become less straightforward … especially for really good developers. The usual problem with poor work items is not how many we have, rarely their duration, and not their title either, but always their content.

Poor work items are usually too detailed. They don’t describe the solution at the conceptual level but at the logical or physical level. Most of the time, their authors do this to help their teammates … but just the opposite happens.

An illustration? Surely! Suppose your master data solution has two distinct fields, first name and last name, and you only need one field, name. At the conceptual level, you’ll express that you’re expecting a single field being the concatenation of the first then the last name, separated by a single space.

Note that I never said whether it should be done at the ETL level or in a view just before the cube. I haven’t explicitly written how to perform the concatenation! It would have been easier for me to write something like first + ' ' + last … except that if the fields first name and last name are not trimmed in the master data, no one would trim them. Why? Because it was not requested! It’s so detailed that it would be too easy for the implementer to say, “I just applied your formula and it doesn’t work, not my problem”, or even, “I thought that you specifically wanted not to trim them!”. Stay at the conceptual level: make explicit the result that you want to achieve, not the how-to. If you stayed at the conceptual level, a developer who is not sure about the need to trim will ask the question; he won’t if you explicitly requested first + ' ' + last.
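To make the trap concrete, here is a minimal sketch (in M, with hypothetical values; this is exactly the kind of code that should stay out of the work item and live only in the implementation):

let
    // Hypothetical, non-trimmed values coming from the master data
    first = "John ",
    last = " Doe",

    // Blindly applying the requested formula keeps the stray spaces
    naive = first & " " & last,

    // Implementing the conceptual expectation: exactly one space between the names
    expected = Text.Trim(first) & " " & Text.Trim(last)
in
    expected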

Keep in mind that you cannot develop a whole solution without a few bugs. When you develop, you have a few safety nets such as a compiler and a few manual runs … writing detailed work items is just denying this evidence and telling others, “look, I can develop a bug-free solution without a compiler and without testing” … and everybody knows that you’re wrong. Stay at the conceptual level.

If most of your work items are tasks in disguise, then the development process becomes task-focused (doing things) instead of delivery-focused (creating value) … and that’s usually my root issue with poorly written work items.

doing-things

How can we help the juniors without going too deep in the work items?

  1. Before they start to implement, ask them to explain how they will implement the work item. Don’t be too harsh with them; keep in mind that you’re asking them to code on a whiteboard (without a compiler, …). It’s difficult, but at least you can point them to a few possible issues and, especially, be sure that they have understood the work item.
  2. During the development, review their code with them every couple of hours. Ask them to explain what choices they are making and why. Propose alternatives. Let them make their own choices.
  3. Write the tests in parallel with the development, ask them to use the tests during the whole development process (not only at the end when the whole development is done) and discuss any divergence between the code and the tests.

You should notice a few common points in these three suggestions:

  • don’t do their job
  • put them in front of their responsibilities: don’t make the choice for them!
  • work in parallel, not ahead or after … with them.
writer.jpg

Who should write the list of work items? I’ll give two answers to this question. At the beginning, it’s important to do an excellent job when creating the work items. By creating, I mean deciding if it’s one or two or three work items, checking the precedence constraints (and avoiding loops) and giving good titles to these work items. From my point of view, it means that most junior team members won’t be able to do it. If they are, I’d strongly recommend giving them the senior adjective!

This is not a full-time job. If you’ve someone in your team writing work items for days, something is wrong in your process. It usually takes a couple of days of work to write the work items for a couple of weeks of implementation. If you’ve a huge team, consider having several writers, but it’s really important that they have a good overview of what the others are writing. I’d also recommend not having one dedicated resource to write the work items: the job owner must change at every release … minimum. It’s really important that senior implementers keep their hands in the code and do their part of the job by coding … not just by reviewing code and writing work items.

But finally, work items are alive and they must be regularly updated, reviewed and sometimes discarded. Every team member has the right to propose an adaptation and is responsible for updating a work item, not just the initial writer. Every update must be internally debated with as many team members as possible and with the initial writer.

You can use the work items in your discussions with stakeholders. They will be your unit to discuss what must be postponed or discarded. If your work items are described in overly technical terms, it will be difficult for your stakeholders to discuss at this level … another good reason to stay at a higher level.

Keep in mind that discussing an artefact to confirm whether it must be prioritized or not is not the same as asking for a validation of the content of the work item. I stated before that work items shouldn’t be validated and I stand by it … test-cases will help us confirm with stakeholders whether our design is correct or not! And that’s the scope of the next blog post.

Handling special characters from html files with Power BI (M language).

In a first blog post, I explained how to list all the Data Platform MVPs by parsing a single web page. In a second blog post, I parsed the individual pages of each MVP to get information about her awards. The third blog post was about social media and how to connect to the GitHub API to retrieve more info about activities on GitHub. The next two blog posts ([1] and [2]) illustrated how to parse rather complex Json responses and retrieve the commits’ activity and the issues’ activity of a repository.

In this new blog post, I’ll try to solve an issue that has been there since the first blog post. When parsing the MVP website and retrieving the first and last names, I had an issue with some special characters: my own name was retrieved as “C&#233;dric Charlier” in place of “Cédric Charlier”. For some reason, some special characters are encoded (but others are not). Anyway, we need to decode them.

From a technical perspective, I’ll solve this issue with a recursion but also with some error handling in Power Query/M language.

mvp-analysis-powerbi-012

The first step is to analyze the problem in front of us. The encoded special characters always follow the same pattern: an ampersand (“&”), followed by a hash (“#”), a set of digits and a semi-colon (“;”) to end the fragment. We’ll first need to identify this kind of pattern in the string. I’ll simplify and only search for the ampersand and the semi-colon.

let
    ConvertHtmlSpecialChars = (string as text) =>
let
    start = Text.PositionOf(string, "&"),
    end = Text.PositionOf(string, ";"),

Then, I’ll extract this fragment from the string and confirm that it effectively starts with “&#”. If it’s the case, I’ll return the number contained in this fragment.

    value = Text.Range(string, start, end-start+1),
    cleanValue =
        if
            Text.StartsWith(value, "&#")
        then
            Text.Start(
                Text.End(
                    value
                    , Text.Length(value)-2
                )
                ,Text.Length(value)-3
            )
        else
            null,

Now that I’ve this number, I can translate it to a normal character with the help of the function Character.FromNumber and replace the fragment by the new value.

    replace = Character.FromNumber(Number.FromText(cleanValue)),
    replaced = Text.Replace(string, value, replace),
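As a quick standalone check (not part of the function being built here), the fragment “&#233;” from the example above indeed maps back to the expected character:

let
    // 233 is the code found between "&#" and ";" in the encoded fragment
    decoded = Character.FromNumber(233)
in
    decoded  // returns "é"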

Now, I need to handle the fact that I can have more than one special character in my string and I must process each of them. Once again recursion will help me!

    continue =
        if string=replaced
        then replaced
        else ConvertHtmlSpecialChars(replaced)
in
    continue

That’s it? Not really. My function crashes easily. If there are no special characters in my string, the call Text.Range(string, start, end-start+1) will crash because -1 (the value of start when no ampersand is found) is not a valid position. And there are many other places where I could have an issue.

A little bit of M theory to understand the basics of error handling in this language: when an error is thrown by a function, it is propagated to the caller and so on. It’s really practical because you can catch the error at a higher level and you don’t need to intercept all the potential errors.

In this case, I’ll catch the potential error between the replacement and my recursive condition.

newString = try replaced otherwise string,

The keyword try means that we’re expecting a potential issue during the evaluation of replaced. The keyword otherwise specifies that, if we effectively get an error while evaluating replaced, another expression must be used in its place. In this case, that expression is just the value of string (the initial value). So, if at any moment during the substitution I get an error, I’ll fall back to the original string.
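For readers new to this construct, here is a minimal, self-contained sketch of try … otherwise (the values are purely illustrative):

let
    // The conversion fails, so the expression after otherwise is used instead
    safeNumber = try Number.FromText("not a number") otherwise 0
in
    safeNumber  // returns 0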

I still need to update the recursive condition to use this new function:

continue =
        if string=newString
        then newString
        else ConvertHtmlSpecialChars(newString)
in
    continue
in
    ConvertHtmlSpecialChars

Now, I’ll need to add the call to this function to the table dealing with the MVPs’ names:

#"Html Special Chars" =
        Table.TransformColumns(
            #"Renamed Columns"
            ,{{"MVP name", ConvertHtmlSpecialChars}}
        )
in
    #"Html Special Chars"

And now I can compare the before

mvp-analysis-powerbi-013

and after

mvp-analysis-powerbi-014

Do … While loop with Power BI (M language).

In a first blog post, I explained how to list all the Data Platform MVPs by parsing a single web page. In a second blog post, I parsed the individual pages of each MVP to get information about her awards. The third blog post was about social media and how to connect to the GitHub API to retrieve more info about activities on GitHub. The next blog post illustrated how to parse a rather complex Json response and retrieve the commits’ activity of a repository with the help of a projection.

In this new blog post, I’ll make an additional query to GitHub to retrieve the issues’ activity of a repository. Unlike the previous calls to the GitHub API, the response will be cut into different pages. It means that one call to the API won’t be enough to retrieve the whole list of issues. Indeed, issues are returned 42 by 42 and I’ll need to retrieve one, two or more pages of information to get my count of issues created during the last 52 weeks.

Let’s start by taking a look at the GitHub API method to retrieve information about issues’ activity: /repos/:login/:repos/issues. But that’s a bit more complex: if you use this method as-is, it will return only the open issues, in an arbitrary order. Fortunately, we can override this behavior by applying modifiers:

/repos/:login/:repos/issues?state=all&sort=created

The state parameter identifies the state of the issues to be returned; I’m enforcing the retrieval of all the issues independently of their current state. With the sort parameter, I’m requesting to order the items by creation date. By default the sort is descending, so I don’t need to specify it explicitly.

With this request, I’ll retrieve the 42 last issues created in a repository. If I want to retrieve the 42 previous ones, I’ll need to add an additional parameter.

/repos/:login/:repos/issues?state=all&sort=created&page=2

The page parameter lets me navigate through my list of issues (ordered by creation date).

If I want to retrieve all the issues, I need to go through all the pages. But how many pages do I need to retrieve? It depends: some repositories have no issues, others have thousands.

Many programming languages are imperative (and not functional or declarative) and have a do…while structure to perform loops that stop when a condition is met (in this case, the condition will probably be that the page returns no issue). The M language is a declarative language and has no structure to handle loops. It’s not really a problem because we can overcome this limitation by using recursion.
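As a minimal, self-contained sketch (unrelated to GitHub, hypothetical numbers only) of emulating a do…while loop with recursion in M, note the @ operator, which lets a function defined inside a let call itself:

let
    // Count how many times a value can be doubled before reaching a limit
    CountDoublings = (value as number, limit as number) as number =>
        if value >= limit
        then 0
        else 1 + @CountDoublings(value * 2, limit)
in
    CountDoublings(1, 100)  // returns 7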

The initial state of the recursion will be to perform a query on the GitHub API with the parameter page set to 1. Then we’ll parse the response and check if we’ve issues created during the last 52 weeks. If it’s the case, we’ll perform an additional request, incrementing the parameter page by 1, and so on until we receive no issue created during the last 52 weeks.

Because the function is recursive, we need to provide an additional parameter (page).

let
    GetIssuesDuringPeriod = (
        login as text
        , repository as text
        , countOfWeeks as number
        , page as number
    ) =>

Then we’ll call the GitHub API with the correct modifiers and the parameter page.

let
    #"List of issues" =
        Json.Document(
            Web.Contents(
                "https://api.github.com/repos/"
                & login & "/" & repository
                & "/issues?state=all&sort=created&page="
                & Text.From(page)
            )
        ),

We’ll consume the result of the query by analyzing the response and counting all the issues created less than 52 weeks ago.

The first step returns the date corresponding to 52 weeks ago. That date will be compared to the value of the field created_at of each issue. If the issue was created after that date (so during the last 52 weeks), we’ll add 1 to our count of issues. If not, we won’t change our count.

To achieve this, I’m using the function List.Accumulate. I’m setting the initial value (s, which stands for seed) to 0 and, for each issue, I’m adding 1 to the seed or not, based on the date comparison. This function returns the seed once all the items have been iterated.
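Before looking at the actual query, here is a minimal, generic example of List.Accumulate (hypothetical numbers): the seed starts at 0 and is incremented only for the even values, mirroring the counting pattern used below.

let
    countOfEvens =
        List.Accumulate(
            {1, 2, 3, 4, 5, 6}
            , 0
            , (s, c) => if Number.Mod(c, 2) = 0 then s + 1 else s
        )
in
    countOfEvens  // returns 3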

#"StartDate" =
        Date.AddWeeks(
            DateTime.LocalNow()
            ,  -countOfWeeks
        ),

    #"CountOfIssues" =
        List.Accumulate(
            #"List of issues"
            , 0
            , (s, c) =>
                if
                    DateTime.FromText(c[created_at], "en-us")
                    >=#"StartDate"
                then
                    s+1
                else
                    s
        ),

Now, I’ll need to take a decision: should I iterate on an additional page or not? If the count returned by the previous step was 0, it means that I had no issue created during the last 52 weeks on this page. Due to the ordering enforced in my request, I know that I can stop there. If the count was greater than zero, I’ll perform an additional iteration of my recursive function to check if I’ve additional issues on the next page. To achieve this, I’ll call the function itself with an incremented page parameter. The result of this new call will be added to the result of the current call.

#"Recursive" =
        if
            #"CountOfIssues"> 0
        then
            #"CountOfIssues"
            + GetIssuesDuringPeriod(
                login
                , repository
                , countOfWeeks
                , page+1
            )
        else
            0
in
    #"Recursive"
in
    GetIssuesDuringPeriod

Now, I can add the result of this function to the table about GitHub repositories:

#"Added issue activity" =
        Table.AddColumn(
            #"Added commit activity"
            , "Count of issues created this year"
            , each GetIssuesDuringPeriod(
                [Login]
                , [#"Repository's name"]
                , 52
                , 1
            )
        )
in
    #"Added issue activity"

And I can create the following visualization to check the popularity of repositories by language (C#, R, JavaScript)

mvp-analysis-powerbi-011

and by MVP

mvp-analysis-powerbi-010

The next blog post of this series will be about parsing special HTML characters and translating them into the correct Unicode values.


Projections with Power BI (M language)

In a previous blog post, I explained how to list all the Data Platform MVPs by parsing a single web page. In a second blog post, I parsed the individual pages of each MVP to get information about her awards. It was time to get information about social media for each MVP and connect to the GitHub API to retrieve more info about their activities on GitHub. That third blog post ended on a remark about the limited interest of the available values (count of public repositories, followers) and the need to retrieve more pertinent info such as the commit activity on the repositories owned by the MVPs.

From a technical perspective, I’ll show how to make a projection in Power BI (M language). A projection is the process of selecting data in a different shape than the specific entity being queried. There are many projection-style operations such as GroupBy or SingleOrDefault, but I’ll illustrate this with the less known Transform.
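As a minimal sketch of the idea (the records and values are hypothetical), List.Transform projects each element of a list into a new shape, here keeping only one field of each record:

let
    repositories = {
        [name = "nbi", stars = 12],
        [name = "genbiL", stars = 3]
    },
    names = List.Transform(repositories, each _[name])
in
    names  // returns {"nbi", "genbiL"}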

Let’s try this! We first need the list of all the repositories of an MVP. The starting point is the list of social media, filtered on the website “GitHub”. Then to get a list of repositories for the login of the MVP, I’ll use the GitHub API /users/:login/repos.

let
    Source = #"Social Media",
    #"Removed Columns" =
        Table.RemoveColumns(
            Source
            ,{"Url"}
        ),
    #"Filtered to Github website" =
        Table.SelectRows(
            #"Removed Columns"
            , each ([Website] = "GitHub")
        ),
    #"Get repositories from Github" =
        Table.AddColumn(
            #"Filtered to Github website"
            , "Custom"
            , each Web.Contents(
                "https://api.github.com/users/"
                & [Login] & "/repos"
            )
        ),

This gives me a nice table with all the information about repositories, in a binary format:

mvp-analysis-powerbi-007

Unlike the previous post, expanding the response is a bit easier. I won’t use the pivot functions, just some expands.

    #"Expanded Transform Binary from Github" =
        Table.ExpandListColumn(
            #"Transform JSON to Table"
            , "Transform Binary from Github"
        ),
    #"Expanded Transform Binary from Github1" =
         Table.ExpandRecordColumn(
            #"Expanded Transform Binary from Github"
            , "Transform Binary from Github"
            ,{
                "name", "owner", "html_url"
                , "description", "fork"
                , "stargazers_url", "contributors_url"
                , "homepage", "stargazers_count"
                , "watchers_count", "language"
                , "forks_count", "open_issues_count"
                , "forks", "open_issues", "watchers"
            }
                ,{"name", "owner", "html_url"
                , "description", "fork", "stargazers_url"
                , "contributors_url", "homepage"
                , "stargazers_count", "watchers_count", "language"
                , "forks_count", "open_issues_count"
                , "forks", "open_issues", "watchers"
            }
        ),
    #"Expanded owner" =
        Table.ExpandRecordColumn(
            #"Expanded Transform Binary from Github1"
            , "owner"
            , {"login"}
            , {"owner.login"}
        ),

A bit of cleaning by removing and renaming some columns

    #"Removed Columns1" =
        Table.RemoveColumns(
            #"Expanded owner"
            ,{"Custom"}
        ),
    #"Renamed Columns" =
        Table.RenameColumns(
            #"Removed Columns1"
            ,{{"stargazers_count", "Count of stars"}
            , {"watchers_count", "Count of watchers"}
            , {"forks_count", "Count of forks"}
            , {"name", "Repository's name"}
            , {"open_issues_count", "Count of issues"}}
        ),

Filtering out the forks

#"Filtered Rows" = Table.SelectRows( #"Renamed Columns" , each ([fork] = false) ),

And I’ve the following result:

mvp-analysis-powerbi-008.png

That’s cool but it’s not finished: I still don’t have any information about commits. To achieve this, I’ll need to make an additional query to the GitHub API and request the commit activity of each repository: /repos/:login/:repos/stats/commit_activity.

Let’s build a function to achieve this.

let
    GetCommitsDuringPeriod = (
        login as text
        , repository as text
        , countOfWeeks as number
    ) =>
let
    #"List of commits by week" =
        Json.Document(
            Web.Contents(
                "https://api.github.com/repos/"
                & login
                & "/"
                & repository
                & "/stats/commit_activity"
            )
        ),

This API sends a response with a rather complex structure: a list of records giving, week by week, the commits on each weekday, the commits of the whole week (equal to the sum of the commits on each weekday, named total) and the week identification (a Unix timestamp).

mvp-analysis-powerbi-009

I’m not interested in most of this information; I just need to keep the value of total for the last 52 weeks. This is where I’ll apply a projection, using the function List.Transform. This function lets me specify how to transform my initial list into a new one: for each element, I’ll just keep the field total.

    #"List of totals" =
        List.Transform(
            List.LastN(
                #"List of commits by week"
                , countOfWeeks
            )
            , each _[total]
        ),

In this case, the transformation returns a list of numbers (corresponding to each total). Again, I’m not really interested in the detail and I can sum all these values with the help of the function List.Sum.

    #"CountOfCommits" = List.Sum(#"List of totals")
in
    #"CountOfCommits"
in
    GetCommitsDuringPeriod

Now, I can easily integrate my function into the previous table; I just need to add a column calling this function.

    #"Added commit activity" = Table.AddColumn(
            #"Filtered Rows"
            , "Count of commits this year"
            , each GetCommitsDuringPeriod(
                      [Login]
                      , [#"Repository's name"]
                      , 52
                 )
            )
in
    #"Added commit activity"

In the next blog post, I’ll add the issues’ activity for each repository. From a logical point of view, it’ll be the same process but the technical challenge will be the pagination of the results of the GitHub API.

Retrieve GitHub data with Power BI

In a previous blog post, I explained how to list all the Data Platform MVPs by parsing a single web page. In a second blog post, I parsed the individual pages of each MVP to get information about her awards. It’s now time to get information about social media for each MVP. Then we’ll parse this information to get the login of each MVP on GitHub and finally use this login to connect to the GitHub API and retrieve more info.

From a technical perspective, I’ll show

  • how to query an external API to get some information
  • how to return more than one value for a function
  • how to build a function to identify some patterns
  • how to transform pairs of key/value into several columns.

Ready? Let’s start …

The first task is to grab the social media information on the profile page. In previous blog posts, we stored this individual page in a table; the starting point will be this table. Once again, we apply the strategy of the html pattern to identify the lines useful for our task.

let
    IndividualPageTags = #"Individual page",
    #"Added SocialMediaTable" = Table.AddColumn(IndividualPageTags, "Tags", each GetSocialMedia([IndividualPage])),
    #"Removed Columns" = Table.RemoveColumns(#"Added SocialMediaTable",{"IndividualPage"}),
    #"Expanded Tags" = Table.ExpandListColumn(#"Removed Columns", "Tags"),
    #"Extract url" = Table.AddColumn(#"Expanded Tags", "Url", each GetUrlFromTag([Tags])),
    #"Removed Columns1" = Table.RemoveColumns(#"Extract url",{"Tags"}),
    #"Filtered out null" = Table.SelectRows(#"Removed Columns1", each ([Url] <> null)),

Now that we’ve a list of links to the MVPs’ social media, we still have to identify the type of social media (GitHub, Facebook, LinkedIn, …) and to extract the login/username from the url. To achieve this task we’ll create a function, but it means that our function will not return a single field but two fields (the type of social media and the login). With the M language, it’s possible to return a record (a record is a container with one or more fields, each field being identified by a name).

To build the record, we use the function Record.AddField.

let
    GetWebsiteInfo = (url as text) =>
let
    url = Text.Trim(url, {" ", "/"}),
    possibilities = {
        [pattern = "facebook"  , value = "Facebook"],
        [pattern = "linkedin"  , value = "LinkedIn"],
        [pattern = "github"    , value = "GitHub"],
        [pattern = "twitter"   , value = "Twitter"]
    },
    Website = List.Single(List.Select(possibilities, each Text.Contains(url, _[pattern])))[value],
    Login = Text.Range(url, Text.PositionOf(url, "/", Occurrence.Last)+1),
    Records = Record.AddField(Record.AddField([], "Website", Website ), "Login", Login)
in
    Records
in
    GetWebsiteInfo
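As a side note, the same record could also be built with a record literal instead of the nested Record.AddField calls; this small standalone sketch (hypothetical values) behaves identically:

let
    // Hypothetical values, just to show the record literal syntax
    Website = "GitHub",
    Login = "jane-doe",
    Records = [Website = Website, Login = Login]
in
    Records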

We can add this function to our previous code. The call to the function will add a record to each line. This record must be expanded to reveal the values for the website and the login.

    #"Get Website and Login" = Table.AddColumn(#"Filtered out null", "Custom", each GetWebsiteInfo([Url])),
    #"Expanded Website and Login" = Table.ExpandRecordColumn(#"Get Website and Login", "Custom", {"Website", "Login"}, {"Website", "Login"})
in
    #"Expanded Website and Login"
mvp-analysis-powerbi-006.png

Cool, now that we’ve a list of social media accounts with information that we can parse, we can use them to contact GitHub and get more info about the GitHub accounts. First we need to start from the social media table and filter it to keep only the lines referring to GitHub.

let
    Source = #"Social Media",
    #"Removed Columns" = Table.RemoveColumns(Source,{"Url"}),
    #"Filtered to Github website" = Table.SelectRows(#"Removed Columns", each ([Website] = "GitHub")),

GitHub exposes an interesting API that we can use to get some information about accounts, repositories, issues and much more. This API is exposed over HTTP and returns JSON documents. We can easily interact with it using the functions Web.Contents and Json.Document. To achieve this task, we need to build a url with the format users/:login where :login is a variable corresponding to the login of the user.

    #"Get user info from Github" = Table.AddColumn(#"Filtered to Github website", "Custom", each Web.Contents("https://api.github.com/users/" & [Login])),
    #"Transform JSON to Table" = Table.AddColumn(#"Get user info from Github", "Transform Binary from Github", each Record.ToTable(Json.Document([Custom]))),

Before executing this code, Power BI will ask us about the security settings for the GitHub API. This API is a new source and each source must define its own security. The GitHub API can be contacted anonymously, with basic authentication or with OAuth2. We could use the anonymous way but we’d quickly be limited by the low quota applied to anonymous calls to the API. I selected basic authentication and registered my own GitHub credentials into Power BI. When it’s done, we can execute the load of this new table.

Now we’ve information about the GitHub accounts and we need to reformat it. We’ll first expand the Json table received from GitHub; it’s a simple set of key/value pairs. For convenience, we need to exclude the key equal to plan (this info is only available for your own account). Then, it will be time to remove some columns that are now useless.

    #"Expanded Table to Name/Value pairs" = Table.ExpandTableColumn(#"Transform JSON to Table", "Transform Binary from Github", Table.ColumnNames(#"Transform JSON to Table"[#"Transform Binary from Github"]{0})),
    #"Filtered out ""Plan""" = Table.SelectRows(#"Expanded Table to Name/Value pairs", each ([Name] <> "plan")),
    #"Removed Columns1" = Table.RemoveColumns(#"Filtered out ""Plan""",{"Website", "Login", "Custom"}),
mvp-analysis-powerbi-004

Based on this content, we can pivot the data to transform the key/value pairs into columns. The pivot function expects a summarize function (usually List.Sum). Keep in mind that here the values are text and not numbers, so we specify List.Max instead: we’ll have exactly one value for each pivoted row anyway, and this function accepts text arguments.
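Before the real step, here is a minimal standalone sketch (hypothetical key/value pairs) of how Table.Pivot turns name/value rows into columns, with List.Max as the aggregator since the values are text:

let
    source = Table.FromRecords({
        [Name = "login", Value = "jane-doe"],
        [Name = "name", Value = "Jane Doe"]
    }),
    // One column per distinct key, one row with the corresponding values
    pivoted = Table.Pivot(source, List.Distinct(source[Name]), "Name", "Value", List.Max)
in
    pivoted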

    #"Pivoted Key/Value" = Table.Pivot(#"Removed Columns1", List.Distinct(#"Removed Columns1"[Name]), "Name", "Value", List.Max),


mvp-analysis-powerbi-005.png

When done, we need some additional cleaning:

    #"Removed Columns2" = Table.RemoveColumns(#"Pivoted Key/Value",{"id", "gravatar_id", "url", "subscriptions_url", "organizations_url", "repos_url", "events_url", "received_events_url", "type", "site_admin", "private_gists", "total_private_repos", "owned_private_repos", "disk_usage", "collaborators"}),
    #"Renamed Columns" = Table.RenameColumns(#"Removed Columns2",{{"login", "GitHub Login"}, {"name", "GitHub fullname"}, {"public_repos", "Count of public repos"}, {"followers", "Count of followers"}})
in
    #"Renamed Columns"

Now, with this new table, we can build the following report.

mvp-analysis-powerbi-003

I’ve a few remarks about this report. First, it’s a bit weird to read that only 35 MVPs have registered a GitHub account on their profile pages (that’s less than 10%). Some have probably forgotten to register this information.

Count of followers and public repositories are not the best parameters to determine the popularity or the activity on GitHub of someone. I’ll improve this in a next blog post.

Scan web pages with Power BI

In a previous blog post, I explained how to list all the Data Platform MVPs by parsing a single web page. It was great but I only got a little information about each MVP (name and country); it’d be great to also know since when they have been MVPs, how many awards they have received, and the links to their social media (in the next blog post I’ll use these links to get more info about GitHub accounts).

In the previous blog post, I had built a ProfileUrl field with the url of the individual profile page; it’s time to use it.

The first task is to grab the awards’ information on the profile page: categories, first year and count of awards. But before going further, I need to take into account that I’ll probably use the individual profile page of each MVP more than once (I’ll also use these pages to get data about social media). Based on this, it’s better to avoid duplicating the code to load these pages everywhere (it will also save me from downloading each page several times).

let
   IndividualPageTags = #"MvpId and Name",
   #"Added Page" = Table.AddColumn(#"MvpId and Name", "IndividualPage", each Web.Contents([ProfileUrl])),
   #"Removed Columns" = Table.RemoveColumns(#"Added Page",{"Name", "ProfileUrl", "PhotoUrl", "Index"})
in
   #"Removed Columns"

To get the information on this page, we’ll use the same strategy as in the first part of this post: transform the binary into a list of lines and filter them on a pattern.

let
   IndividualPageTags = #"Individual page",
   #"Added AwardsTable" = Table.AddColumn(#"IndividualPageTags", "Awards", each GetAwards([IndividualPage])),
   #"Expanded Custom" = Table.ExpandTableColumn(#"Added AwardsTable", "Awards", {"Award", "FirstYear", "CountOfAwards"}, {"AwardsList", "FirstYear", "CountOfAwards"})
in
   #"Expanded Custom"

The filtering task is performed by the function GetAwards. This function accepts, as a parameter, the individual profile page of the MVP (the type is binary) and looks for the pattern div class="ly infoContent". After cleaning, we’ve three rows: the first one contains the different categories, the second the first year and the third the count of awards. It’s possible to transform the three rows into three columns using the Table.Transpose function.

let
    AwardsInfo = (page as binary) =>
let
    Lines = Lines.FromBinary(page),
    Tags = List.Select(Lines, each (Text.Contains(_, "<div class=""ly infoContent"">"))),
    #"Converted to Table" = Table.FromList(Tags, Splitter.SplitTextByDelimiter(">"), null, null, ExtraValues.Error),
    #"Changed Type" = Table.TransformColumnTypes(#"Converted to Table",{{"Column1", type text}, {"Column2", type text}, {"Column3", type text}}),
    #"Removed Columns" = Table.RemoveColumns(#"Changed Type",{"Column1", "Column3"}),
    #"Replaced Value" = Table.ReplaceValue(#"Removed Columns","</div","",Replacer.ReplaceText,{"Column2"}),
    #"Transposed Table" = Table.Transpose(#"Replaced Value"),
    #"Renamed Columns" = Table.RenameColumns(#"Transposed Table",{{"Column1", "Award"}, {"Column2", "FirstYear"}, {"Column3", "CountOfAwards"}})
in
    #"Renamed Columns"
in
    AwardsInfo

When we’ve retrieved the awards’ information, we can add the data about first year and count of awards to our main table.

let
    Source = #"MvpId and Name",
    #"Merged MvpId and Country" = Table.NestedJoin(Source,{"Index"},Country,{"Index"},"NewColumn",JoinKind.Inner),
    #"Expanded Country" = Table.ExpandTableColumn(#"Merged MvpId and Country", "NewColumn", {"Country"}, {"Country"}),
    #"Removed Index" = Table.RemoveColumns(#"Expanded Country",{"Index"}),
    #"Merged with Awards" = Table.NestedJoin(#"Removed Index",{"MvpId"},Awards,{"MvpId"},"NewColumn",JoinKind.LeftOuter),
    #"Expanded Awards" = Table.ExpandTableColumn(#"Merged with Awards", "NewColumn", {"FirstYear", "CountOfAwards"}, {"FirstYear", "CountOfAwards"})
in
    #"Expanded Awards"

The relation between an MVP and the awards’ categories is a one-to-many. We need to build a second table with all the awards’ categories received by each MVP. To achieve this, we first need to transform the list of categories into multiple rows by splitting the string on the comma separator. Then we can unpivot the data on the MVP Id and remove the now useless column Attribute. To clean the remaining spaces in some categories, we just need to trim them.

let
    Source = Awards,
    #"Removed Columns" = Table.RemoveColumns(Source,{"IndividualPage", "FirstYear", "CountOfAwards"}),
    #"Split Column by Delimiter" = Table.SplitColumn(#"Removed Columns","AwardsList",Splitter.SplitTextByDelimiter(",", QuoteStyle.Csv),{"AwardsList.1", "AwardsList.2"}),
    #"Unpivoted Columns" = Table.UnpivotOtherColumns(#"Split Column by Delimiter", {"MvpId"}, "Attribute", "Value"),
    #"Removed Columns1" = Table.RemoveColumns(#"Unpivoted Columns",{"Attribute"}),
    #"Trimmed Text" = Table.TransformColumns(#"Removed Columns1",{{"Value", Text.Trim}})
in
    #"Trimmed Text"

It’s now time to check the relations between our main table (All info) and the new table (Awards Nomination).

mvp-analysis-powerbi-003

We can now build a new report with some information about the awards received.

mvp-analysis-powerbi-002

The next step is to retrieve the information about social media for each MVP and to use this information to retrieve additional info about their usage of GitHub … but it’s for the next blog post.

Parse a web page with Power BI

A few days ago, I found the excellent blog post of James Dixon about an age and sex analysis of Microsoft USA MVPs. I really like F# because it’s so efficient for this kind of task: in just a few lines of code you can parse the web, call an API and build a great data frame. Just after reading this article, I started to code the same project in Power BI/Power Query (M language) but, during this journey, I changed the idea a bit to finally focus on Data Platform MVPs and their GitHub accounts.

Let’s start by getting a list of Data Platform MVPs. Microsoft doesn’t expose an API for its MVP website, so we need to parse the website. This website exposes a search engine where it’s easy to filter on the Award Category “Data Platform”. By default, the search results page returns the first 18 results and lets you navigate to the next pages. If you click on the figures at the bottom, you can change the count of MVPs per page to 36 or 48. If you pay attention, you’ll notice that this value (18, 36 or 48) is available as a parameter in the url. I tried to change this value to 1000 in the url, without any problem: the web page listed 1000 MVPs. Based on this, it’s possible to render all the Data Platform MVPs on one single page with the following url: https://mvp.microsoft.com/en-us/MvpSearch?&ex=Data Platform&ps=1000&pn=1

Now that we’ve a way to retrieve all the MVPs, let’s try to consume this from Power BI. Unfortunately, the website doesn’t return a table but a rather complex list of div tags. From my personal experience, it’s often the case when you try to parse a website. I could try to build a parser with M functions but it would be rather complex and there is an easier solution: I’ll first convert the binary object into a set of lines. For debugging purposes, I’m also creating a parameter LimitRowsForDevelopment that I set to 700 to speed up the grabbing process when developing and that I switch back to 10000000 when running “in production”.

let
    Source = Web.Contents("https://mvp.microsoft.com/en-us/MvpSearch?&ex=Data Platform&ps=1000"),
    Lines = Lines.FromBinary(Source),
    #"Kept First Items" = List.FirstN(Lines,LimitRowsForDevelopment)
in
    #"Kept First Items"

When I’ve this list of lines, I need to extract the subset of lines matching both of the following patterns: Text.Contains(_, "<a href") and Text.Contains(_, "</a>"). It will give me a list of links to the individual profile pages but also the name of the MVP.

It’s time to parse the link found above and extract the MVP Id (a few digits identifying the MVP).

This can easily be achieved with three functions. The first one is a kind of overload of Text.PositionOf where you can specify which occurrence you’re looking for.

let
    GetPositionInText = (string as nullable text, substring as text, position as number) =>

let
    Source = if string = null then 0 else List.First(List.Skip(Text.PositionOf(string , substring, Occurrence.All), position-1))
in
    Source
in
    GetPositionInText

This function lists all the occurrences of the substring with Text.PositionOf, skips the n-1 first occurrences with List.Skip and takes the first of the remaining occurrences with List.First.
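A small usage example (the href fragment is hypothetical and it assumes the GetPositionInText query defined above): asking for the second occurrence of the double quote returns the position of the quote closing the href attribute.

let
    sample = "<a href=""/en-us/mvp/12345-Jane-Doe"">Jane Doe</a>",
    // Position (0-based) of the second double quote in the fragment
    secondQuote = GetPositionInText(sample, """", 2)
in
    secondQuote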

When we’ve this quick and useful function we can parse the a href tag and ask Power BI to get the value between the second quote and the last occurrence of a dash.

let
    GetMvpIdFromUrl = (url as text) =>

let
    EndPos = GetPositionInText(url, """", 2),
    SubText = Text.Range(url, 0, EndPos),
    StartPos = Text.PositionOf(SubText, Character.FromNumber(45), Occurrence.Last)+1,
    Source = Text.Range(SubText, StartPos, EndPos-StartPos)
in
    Source
in
    GetMvpIdFromUrl

This function uses a few not-so-well-known tricks of the M language. First, it’s possible to define intermediate steps as variables that you can consume later (it’s the case for EndPos, SubText and StartPos). Another trick, useful with special characters, is the possibility to identify a character by its character code and to transform that code into the character with the function Character.FromNumber.
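A minimal sketch of these two tricks together (illustrative values only): intermediate variables consumed by later steps, and characters built from their character codes.

let
    // 45 is the character code of the dash, 34 the one of the double quote
    dash = Character.FromNumber(45),
    quote = Character.FromNumber(34),
    // Intermediate variables can be reused by any later step
    wrapped = quote & dash & quote
in
    wrapped  // returns the dash wrapped in double quotes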

For the name of the MVP, we use the same strategy and the following function:

let
    GetNameFromUrl = (url as text) =>

let
    StartPos = GetPositionInText(url, ">", 1)+1,
    EndPos = GetPositionInText(url, "<", 2),
    Source = Text.Replace(Text.Range(url, StartPos, EndPos-StartPos), "  ", " ")
in
    Source
in
    GetNameFromUrl

Now back to the main table and we can extract names and MVP ids with the following code:

let
    LinkTags = List.Select(#"Listing page", each Text.Contains(_, "<a href=""/en-us/mvp/") and Text.Contains(_,"</a>")),
    #"Converted to Table" = Table.FromList(LinkTags , Splitter.SplitByNothing(), {"OriginalUrl"}, null, ExtraValues.Error),
    #"Get Mvp Id" = Table.AddColumn(#"Converted to Table", "MvpId", each GetMvpIdFromUrl([OriginalUrl])),
    #"Get Name" = Table.AddColumn(#"Get Mvp Id", "Name", each GetNameFromUrl([OriginalUrl])),
    #"Removed Columns" = Table.RemoveColumns(#"Get Name",{"OriginalUrl"}),
    #"Added Index" = Table.AddIndexColumn(#"Removed Columns", "Index", 0, 1)
in
    #"Added Index"

The last line of code is the addition of an index that we’ll use just after. Indeed, I’ll go back to my list of lines retrieved in #"Listing page" and now extract the country. To achieve this I’ll again keep all the rows with a given pattern and then perform string manipulation on the remaining lines.

let
    CountryTags = List.Select(#"Listing page", each (Text.Contains(_, "<span class=""subItemContent"">") and Text.Contains(_, "</span>"))),
    #"Converted to Table" = Table.FromList(CountryTags, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    #"Split Column by Delimiter" = Table.SplitColumn(#"Converted to Table","Column1",Splitter.SplitTextByEachDelimiter({">"}, QuoteStyle.Csv, false),{"Column1.1", "Country"}),
    #"Removed Columns" = Table.RemoveColumns(#"Split Column by Delimiter",{"Column1.1"}),
    #"Replaced Value" = Table.ReplaceValue(#"Removed Columns","</span>","",Replacer.ReplaceText,{"Country"}),
    #"Added Index" = Table.AddIndexColumn(#"Replaced Value", "Index", 0, 1)
in
    #"Added Index"

Now that I’ve the two tables loaded (one with Name and MVP id, a second with Country), I can merge them to create a unique dataset with the help of the index.

let
    Source = #"MvpId and Name",
    #"Merged MvpId and Country" = Table.NestedJoin(Source,{"Index"},Country,{"Index"},"NewColumn",JoinKind.Inner),
    #"Expanded Country" = Table.ExpandTableColumn(#"Merged MvpId and Country", "NewColumn", {"Country"}, {"Country"}),
    #"Removed Index" = Table.RemoveColumns(#"Expanded Country",{"Index"}),

...

The MVP id is also really helpful to build two additional urls. The first one will redirect to the profile page of the MVP and the second will be used to retrieve the picture of the MVP.

...

    #"Added ProfileUrl" = Table.AddColumn(#"Removed Index", "ProfileUrl", each "https://mvp.microsoft.com/en-us/PublicProfile/" & [MvpId]),
    #"Added PhotoUrl" = Table.AddColumn(#"Added ProfileUrl", "PhotoUrl", each "https://mvp.microsoft.com/en-us/PublicProfile/Photo/" & [MvpId]),
in
    #"Added PhotoUrl"

Well, it’s now time to make some visualizations to show the results.

mvp-analysis-powerbi-001

In a next blog post, I’ll show how to extract information that is not available on the search page but only on the profile pages of the MVPs. I’ll include this info into the existing dataset. Then it will be time to merge this dataset with some information that I’ll retrieve from GitHub.

Regression test-suite for BI solution – Part 2

In the previous blog post of this series, I’ve explained how to query a DMV and how to modify a little bit the set of test-cases generated by the query. In this post, we’ll see how to join two sets of test-cases to obtain a unique set.

Now that you’ve loaded the content of a first DMV, we can apply the same recipe to the second DMV. Well, it’s not so easy: if we just do this, the second call to case load will override the result of the first query. To keep two sets of test-cases in memory, in genbiL, you’ll have to give them a name. This is done through the keyword scope (more info). This action must be executed whenever you want to work on a different set of test-cases.

case scope 'dimensions';
case load query
{
 ...
}
on '...';

case scope 'hierarchies';
case load query
{
 ...
}
on '...';

To apply actions (filter, load, hold, …) to a given set, you first need to issue the scope action with the name of the set on which you want to work. It’s possible to switch the scope at any moment and you can go back to a scope previously loaded.

To join the two sets, you’ll need to use the command cross (more info) with a join. But, before crossing the two existing datasets, you need to switch to a third, empty scope that will contain the result of the cross.

case scope 'dimensions and hierarchies';
case cross 'dimensions' with 'hierarchies' on 'dimension_unique_name';

Now that you’ve understood how to cross (join) two datasets, you can safely load some other DMVs and combine them to build a global picture of your cube. First step: load measures and measure groups, then combine them. Second step: use the DMV returning the intersections between dimensions and measure groups. If you combine this with the two already-combined datasets, you’ll have the interesting interactions between hierarchies and measures: probably the best starting point to automatically generate queries to run on both your new cube and your old cube and compare the result-sets. Let’s do this:

Create a new template to support the query-to-query comparison (save it as regression\my-template.nbitt):

<test name="Compare '$Measure_Caption$' by '$hierarchy_caption$'">
  <description>Assert that the same query, 
  executed on two instances of the cube, 
  returns the same result-set. Specifically, 
  for measure '$measure_caption$' on hierarchy 
  '$hierarchy_caption$'</description>
  <edition author="$username$" created="$now$"/>
  <category>Measuregroup '$measuregroup_name$'</category>
  <trait name="Template">my-template.nbitt</trait>
  <system-under-test>
    <execution>
      <query>
        <![CDATA[
        select 
          $measure_unique_name$ on 0
          , non empty($hierarchy_unique_name$.children) on 1 
        from 
          [$cube_name$] 				
        ]]>
      </query>
    </execution>
  </system-under-test>
  <assert>
    <equalTo keys="all-except-last" tolerance="$tolerance$">
      <query>
        <![CDATA[
        select 
          $measure_unique_name$ on 0
          , non empty($hierarchy_unique_name$.children) on 1 
        from 
          [$cube_name$]
        ]]>
      </query>
    </equalTo>
  </assert>
</test>

Add the following genbiL code to load all the DMVs:

case scope 'dimensions';
case load query
{
	select
		[Dimension_Unique_Name]
		, [Dimension_Caption]
	from
		[$system].[MDSchema_dimensions]
	where
		[Cube_Name]='your-cube-name'
		and [Catalog_Name]='your-initial-catalog'
		and Dimension_Is_Visible
}
on 'Provider=MSOLAP.4;Data Source=your-data-source;Initial Catalog=your-initial-catalog';

case scope 'hierarchies';
case load query
{
	select
		[Dimension_Unique_Name]
		, [Hierarchy_Unique_Name]
		, [Hierarchy_Caption]
	from
		[$system].[MDSchema_hierarchies]
	where
		[Cube_Name]='your-cube-name'
		and [Catalog_Name]='your-initial-catalog'
		and Hierarchy_Is_Visible
}
on 'Provider=MSOLAP.4;Data Source=your-data-source;Initial Catalog=your-initial-catalog';

case scope 'levels';
case load query
{
	select
		[Hierarchy_Unique_Name]
		, [Level_Unique_Name]
		, [Level_Caption]
	from
		[$system].[MDSchema_levels]
	where
		[Cube_Name]='your-cube-name'
		and [Catalog_Name]='your-initial-catalog'
		and Level_Is_Visible
		and [Level_Number]<>'0'
}
on 'Provider=MSOLAP.4;Data Source=your-data-source;Initial Catalog=your-initial-catalog';

case scope 'measureGroups';
case load query
{
	select
		[MeasureGroup_Name]
	from
		[$system].[MDSchema_measureGroups]
	where
		[Cube_Name]='your-cube-name'
		and [Catalog_Name]='your-initial-catalog'
}
on 'Provider=MSOLAP.4;Data Source=your-data-source;Initial Catalog=your-initial-catalog';
case filter distinct;

case scope 'measures';
case load query
{
	select
		[Cube_Name]
		, [Measure_Unique_Name]
		, [Measure_Caption]
		, [MeasureGroup_Name]
		, [Measure_Display_Folder]
	from
		[$system].[MDSchema_measures]
	where
		[Cube_Name]='your-cube-name'
		and [Catalog_Name]='your-initial-catalog'
		and Measure_Is_Visible
}
on 'Provider=MSOLAP.4;Data Source=your-data-source;Initial Catalog=your-initial-catalog';

case scope 'links';
case load query
{
	select
		[MeasureGroup_Name]
		, [Dimension_Unique_Name]
	from
		[$system].[MDSCHEMA_MeasureGroup_Dimensions]
	where
		[Cube_Name]='your-cube-name'
		and [Catalog_Name]='your-initial-catalog'
		and Dimension_Is_Visible
}
on 'Provider=MSOLAP.4;Data Source=your-data-source;Initial Catalog=your-initial-catalog';

Once we’ve loaded the separate DMVs, we can cross them together to obtain our model:

case scope 'dimensions-measures';
case cross 'measures' with 'links' on 'MeasureGroup_Name';
case cross 'dimensions-measures' with 'dimensions' on 'Dimension_Unique_Name';
case cross 'dimensions-measures' with 'hierarchies' on 'Dimension_Unique_Name';

Add a field tolerance of 0.001

case add column 'tolerance' values '0.001';

Then, generate the tests based on the test-cases loaded in the newly created scope and the template built above:

template load file 'Regression\my-template.nbitt';
suite generate;

And that’s it.

Now it’s up to you to build other queries and to filter out the test-cases that aren’t pertinent.

SQL Saturday in Vienna and MS Cloud Summit in Paris to start in 2017

I’m sitting in my couch, watching a movie with my daughters, reminiscing about holidays, and getting ready for the first speaking engagements of 2017.

sql-saturday-579-vienna

I’m completely thrilled to be invited to Vienna on January 20th to speak at SQL Saturday #579. I’ll be tackling a rather difficult subject … testing BI solutions. During this session, we’ll look at the features of the open-source framework named nbi. This framework provides support for automated tests in the fields of databases, cubes, reports and ETLs, without the need for .Net skills. The demos will show us the best approaches to quickly and effectively assert the quality of BI developments. We’ll go a step further, generating the tests with an interesting system of templates and test-case sources.

mscloudsummit-paris-2017

I’ll also be speaking (in French) in Paris at the MS Cloud Summit Paris 2017. The topic will be Azure Stream Analytics. This is a completely new session and it’s taking a lot of patience, perseverance and hard work to get ready (and, to be honest, it’s not 100% finished). Building the slide deck is the quick and easy part; designing the demo scenarios, scripting and testing the demos is the time-consuming part (then will come the practicing part).

For both conferences, if Business Intelligence is not your cup of tea, there are plenty of other topics to choose from, with sessions geared toward Database Development, Database Administration/Security and even, in Paris, Office 365, Power Flow, Power Apps, Azure Functions, … Both conferences offer a wide variety of content for everyone.

See you there?

Requirements for BI solutions

A recurrent question that I receive after a conference or during a course sounds really simple but is rather complex to answer: how do you write specifications for Business Intelligence solutions? That will be the starting point of this series of four blog posts about specifications, requirements, work items and test-cases.

The methodology that I advocate is very business-focused. We need to understand the business processes, the objectives, the analytic requirements and their potential business impact, and ultimately the project success criteria.

A global Business Requirements deliverable typically covers the following topics: a description of the critical business and analytic needs, including their semantics; the existing systems and data supporting the required capabilities; the initial project success criteria; and the key issues/risks with tactical and strategic recommendations.

Business Process to be analyzed

Keep in mind that the requirements are not there to define the solution but to make the problem explicit. Following this idea, don’t try to model a report by yourself, but describe your business process(es) and your analytical needs. Don’t ask for a report; describe the level of maturity of your end-users and what they will do with the data/information. Keep in mind that a BI solution can go further than a data warehouse, a cube or a report: it can also include training, coaching/support and even new staff members.

I don’t care about the format of the requirements. They can be written on any medium that is readable and shareable in the company. But keep in mind that implementers should be able to read this document from A to Z. Avoid writing documentation that relies on hyperlinks as the main mechanism to go from one section to another. It’s great to have hyperlinks to facilitate navigation in some parts of the document, but readers need to be sure, at any moment, about what has been read and what has not. Based on this, we don’t recommend writing your requirements as a website; a Word document is a better fit.

Strictly avoid UML or E/R schemas (or any software-related notation). Firstly, the level of expertise with the notation varies from one reader/writer to another, resulting in different and incorrect interpretations of the same schema: some readers/writers will understand some details, some will not. Keep in mind also that these notations are usually designed to build low-level specifications; BI requirements must stay at a higher level. Last but not least, the requirements will be validated by stakeholders: do you really think your CxO understands the E/R notation?

“A drawing is more explicit than a paragraph” is a sentence that I value. But I usually add that “a drawing without explanations has no value”. If you include schemas or drawings, be sure that they don’t carry information that isn’t explicitly written in the paragraphs surrounding them. As a guideline, assume that I won’t read the schemas and drawings; they will just be used to double-check my own understanding. Do the same!

The first section of your document should be a broad introduction to your business. We’re dealing with BI; we can’t be experts in every process of the company, so you need to provide enough information about the processes without forcing us to read pages of details. The first section deals with this idea: a description of the business process to be analyzed with the BI solution.

This section of the document should explain the business process that you want to analyze through Business Intelligence. Let me stress this: at this step, I’m not interested in knowing “how you’ll analyze” but in “what will be analyzed”: the existing business process generating the data that we’ll transform into information to be analyzed.

If you want to analyze the relation between the customers’ budget and the sales, then I expect you to describe “how customers are buying things on your website”, “how customers are buying in your shops” and “how we get information about their budget”. In other words, what are the different activities of the business processes “Internet Sales” and “Shop Sales”, including especially but not exclusively the activities “CRM” and “Order”.

It’s really a description of the business processes that I’m expecting, not of an application process. At this stage, information about the software supporting the business processes is more a distraction than a help.

This part of the document is critical: the more correct and exhaustive it is, the faster we’ll be able to start the development.

Define the terms

Vocabulary, or semantics, is essential in data modelling and therefore in Business Intelligence. You need to provide a definition of the terms used in the business processes described previously.

This vocabulary will also be reused in the resulting BI solution. So if you’re using the word “Customer” in your requirements, we’ll use “Customer”; don’t expect “Client” in the implementation.

Avoid the usage of synonyms. They probably make for a better writing style, but they’re also ambiguous. If “Customer” and “Client” have the same meaning, choose one of them and use it until the end of the document. I’m not expecting a Goncourt or Nobel Prize in Literature, just clear business requirements.

Be explicit in your definitions. I don’t expect, and am not interested in, a list of items with two words standing in for a definition. Give us a context and rules to apply to validate your vocabulary.

Example: Define “customer” for a hotel chain

Possibilities include:

  • A customer is someone who makes a reservation
  • A customer is someone who has had at least one stay at a hotel
  • A customer may be a business or individual who rents a hotel facility
  • A customer is someone who signs up for the hotel loyalty program

Candidate definition:

A customer is an individual, group, or business who has received one or more hotel services, including hotel stays, facility rentals, and/or other additional hotel services, such as spa services. Hotel services may or may not have been paid for by the customer. A customer does not have to complete a hotel stay to be considered a customer.

It’s usually easier to understand and assimilate the definitions if they are given in-text with the business process description. I don’t recommend a dictionary or glossary disconnected from the business process description: it’s usually difficult to maintain and difficult to read.

I often receive documents where the semantics of the drawings and of the text are not aligned. It’s really confusing. Even if you’re copy-pasting schemas from other documents, ensure that the semantics are the same as in the other parts of your document. If, for editorial reasons, you can’t update the text on the drawings, then clearly state that the terminology is not identical and give clear correspondences.

Analytics targets

In the second section of the document, we’re dealing with the analytics that should be conducted on top of the previously described Business Processes.

The topic is not only what we want to analyze but also the skills of the users. The more skilled the users are, the more degrees of freedom we’ll offer. But if your users have absolutely no IT knowledge and no time, we’ll probably deliver a report to their mailbox.

Users: skills, mobility and audience

Typically, users will fall into one of three categories in relation to the timeliness of data required:

  • Report broadcasts and scheduling: Many BI users won’t require real-time data access. For example, the marketing department might rely on weekly sales reports to gauge the effectiveness of brand strategies and specific marketing campaigns over a significant period of time. Daily, or real-time, reports – such as transaction-level reports – would provide too much detail. The strength of BI in this instance is its ability to track and identify trends over time, while simultaneously allowing users to connect shifts in the data to real-life events (marketing campaigns) to support strategy development.
  • Alerts: In some instances, certain user groups won’t need to actively monitor business activities, but just the ability to be alerted to abnormal or unacceptable circumstances – when key metrics fall outside predefined parameters. For example, the accounts payable team doesn’t need constant data access to perform its daily tasks. However, if an invoice is about to pass its payment deadline, an alert can grab their attention to ensure that it’s processed in a timely manner.
  • On-demand data access: Some users will require the ability to directly access and search real-time, or near real-time, operational data and reports, as well as perform ad-hoc queries. For example, BI users coordinating the intersection of ‘just in time’ manufacturing and supply chain operations would demand the ability to access, analyze and act on up-to-the-minute reports and data to ensure the timely delivery of stock from the warehouse, to transport operators, and to retail outlets.

Be sure to give a correct overview of the skills, mobility and point-in-time needs of the users who will use the solution. It’s expected that you’ll have more than one type of user; different categories of users will access different artefacts of the solution.

Analytics needs

Articulate your reflection in terms of the questions that will be asked. Don’t stop at the first question. If users tell you that they want to “monitor quality”, go further: ask them to explain how they define “quality”. Is it the difference between two measures, in percentage or in absolute value? Ask them what they will effectively monitor. Will they check whether the quality improves? Do they want to receive alerts when the quality decreases? Is a single measure below expectation enough to ring the alert bell?

It happens that users have no answer to these questions; they just want the raw data. In such cases, they also ask to be able to define a lot of parameters by themselves. It’s a common and normal situation. It probably means that you need to add a data-discovery phase to your business intelligence path: self-service BI is part of the BI toolset precisely to provide data-discovery features to end-users.

You don’t need to be especially exhaustive in the definition of the analytic needs. BI models answer a family of questions, not a single question. So if you’ve forgotten an analysis in your requirements, there’s a fair chance that you’ll still be able to answer that question with the existing models. But it’s only a chance: other families of analytic needs will require different models. So it’s better to describe dissimilar needs than many variations of the same one.

Business Impact

The last part of the document is optional: we can build a solution without it. Even so, I’m still interested and will surely read it in detail. With this section, I’ll probably provide a better solution and be able to share your business objectives.

This part of the document matters most for the stakeholders, giving them some support when requesting a budget. It explains the expected return on investment and, especially, how you think it will be achieved. What are the changes in the business processes that you’ll support with this new application? How do these information needs map to the business goals?