Auditing login attempts with SQL Server

Today’s quest is similar to the blog post that I wrote two months ago about auditing security changes: we’re the admins of our non-production SQL Server environments and we’d like to put some auditing in place. In this case, we’d like to store all the login attempts on our database/server.

The core issue is that we have many users and logins on our databases and we have serious doubts about their respective needs. The root cause is identified: sometimes, for a short period of time, we make exceptions to our own rules and let a few friendly projects access our DEV database. In other cases, we connect our own solution in the DEV environment to the QA environment of another solution. Why? Planning, data quality issues, … we have valid reasons to do it … but these exceptions should be removed as soon as possible. And you know what? People forget. Nowadays, on our largest solution, we have 20 users but only 7 of them are expected and documented … the others should be removed. But before executing this cleanup, we’d like to be sure that these users are not actually used by other solutions. If they are, we’ll first need to update the configuration of the corresponding solution.

To know whether a user is actually in use on a database, you need to capture the connection attempts (successes and failures). Technically, you have several ways to implement this in SQL Server, but I decided to use the audit feature to accomplish this task.

who-is-attempting-to-log-to-your-database

The first step is always to create an audit on the server or database (here, I’d like to track the whole server).

create server audit
    [Audit Login Informations]
to file
(   FILEPATH = N'N:\Temp\Trace\'
    , MAXSIZE = 1024 MB
    , MAX_FILES = 10
    , RESERVE_DISK_SPACE = OFF
)
with
(   QUEUE_DELAY = 1000
    , ON_FAILURE = CONTINUE
    , AUDIT_GUID = '33e055a9-1924-47c3-9777-e8875a5bcd37'
);

This audit must then be activated:

alter server audit
    [Audit Login Informations]
with (state=on);
go

Then the specification of this audit must be described. In our case, we’ll intercept five groups of events related to login attempts (successful or failed), both for server logins and for contained database authentication, plus database logouts.

create server audit specification
    [Audit Login Informations specification]
for server audit [Audit Login Informations]
    add (FAILED_DATABASE_AUTHENTICATION_GROUP)
    , add (FAILED_LOGIN_GROUP)
    , add (SUCCESSFUL_DATABASE_AUTHENTICATION_GROUP)
    , add (SUCCESSFUL_LOGIN_GROUP)
    , add (DATABASE_LOGOUT_GROUP)
with (state=on);
go
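If you want to double-check that both the audit and its specification are in place and enabled, the catalog views expose their state. A quick sanity check (just a sketch):

select
    a.name
    , s.status_desc
    , sp.is_state_enabled as specification_enabled
from
    sys.server_audits a
inner join
    sys.dm_server_audit_status s
    on s.audit_id = a.audit_id
left join
    sys.server_audit_specifications sp
    on sp.audit_guid = a.audit_guid
where
    a.name = 'Audit Login Informations';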

In my experience, this kind of script is more useful and less brittle if you add a cleanup phase before creating your audit and its specification.

Let’s start by testing whether the audit already exists. The view sys.dm_server_audit_status lets you check whether the audit exists and whether it’s running. If it does, you need to stop it and drop it (along with its specification).

if exists(
    select
         *
    from
         sys.dm_server_audit_status
    where
         name='Audit Login Informations'
)
begin
    alter server audit specification
        [Audit Login Informations specification]
    with (state=off);

    drop server audit specification
        [Audit Login Informations specification];

    alter server audit
        [Audit Login Informations]
    with (state=off);

    drop server audit
        [Audit Login Informations];
end

That’s it! You can now query your audit to get all the connection attempts on your server by means of this simple query:

declare  @LogFilePath varchar(255);
set @LogFilePath = N'N:\TEMP\Trace\';

declare  @AuditName varchar(255);
set @AuditName = N'Audit Login Informations';

select
    @LogFilePath=log_file_path
from
    sys.server_file_audits
where
    name=@AuditName;

set @LogFilePath=@LogFilePath
    + REPLACE(@AuditName, ' ', '%5')
    + '*.sqlaudit';

select
    session_server_principal_name
    ,database_name
    ,object_name
    ,m.class_type_desc
    ,statement
    ,convert(datetime,
        switchoffset(
            convert(datetimeoffset, event_time)
            , datename(TzOffset, sysdatetimeoffset())
        )
    )
from
    sys.fn_get_audit_file (@LogFilePath,default,default) f
inner join
    sys.dm_audit_class_type_map m
    on m.class_type=f.class_type
order by
    event_time desc;
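Since the end goal is to find out which of these logins are still actually used, a small aggregation on top of the same audit files is handy. Here is a sketch, reusing the @LogFilePath variable built above:

-- count the attempts per login and database to spot which
-- of the undocumented logins are still actively used
select
    session_server_principal_name
    , database_name
    , sum(case when succeeded = 1 then 1 else 0 end) as successful_attempts
    , sum(case when succeeded = 0 then 1 else 0 end) as failed_attempts
    , max(event_time) as last_attempt_utc
from
    sys.fn_get_audit_file (@LogFilePath, default, default)
group by
    session_server_principal_name
    , database_name
order by
    successful_attempts desc;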
Illustration created with the help of icons from The Noun Project: Database by Dmitry Mirolyubov, Sightview by Andy Ivandikov, Lock by Maxim Kulikov. Creative Commons licence.

Managing firewall rules for Azure databases with SSMS

When you create a new Azure database, you usually need to open the firewall to remotely administrate or query this database with SSMS. One option is to create rules from the Azure Portal. It’s surely a convenient way to do it when you create the database, but I prefer to keep my toolset to a minimum; when the Azure portal is not already open, I’d rather not open it just to define a few firewall rules.

Opening the firewall with SSMS is a kind of chicken-and-egg problem: to connect to your database/server, you need to open the firewall. Fortunately, SSMS has a nice set of dialogs that call the underlying API of the Azure Portal and open the firewall for the computer running SSMS.

Attempt to connect to your Azure database.

ssms-azure-01

If the firewall is not already open for this IP, you’ll receive the following screen:

ssms-azure-02

The first step is to identify yourself with your Azure subscription.

ssms-azure-03

If you created your account a long time ago, you could receive this screen … If your account is linked to professional resources (granted by your company), the right choice is probably the first one. If you’re developing from home with resources that you pay for yourself, the correct choice is probably the second one.

ssms-azure-05

When you’re successfully identified, you can create a new rule: open the firewall for your own IP or for a subnet.

ssms-azure-06

Now, you can connect and query your database or the virtual master database (open the folder System Databases).

ssms-azure-07

Firewall rules that apply to the whole server are stored in the master database; you can list them by querying the view sys.firewall_rules.

ssms-azure-08

As you can see, I’ve been a bit lazy and haven’t cleaned up these rules for a long time. If you want to remove some rules, you can use the stored procedure sp_delete_firewall_rule.

ssms-azure-10
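If you prefer scripting over clicking, the equivalent T-SQL looks roughly like this (run against the virtual master database; the rule name below is just an example, pick the real ones from sys.firewall_rules):

-- list the current server-level rules
select name, start_ip_address, end_ip_address
from sys.firewall_rules;

-- then drop the rules you don't need anymore
exec sp_delete_firewall_rule @name = N'ClientIPAddress_2016-01-01_10-00-00';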

It’ll clean up your list of firewall rules.

ssms-azure-11

 

If you want, you can also create firewall rules at the database level by connecting to the database and using sys.database_firewall_rules (and the related stored procedures). Firewall rules at the database level are evaluated before those at the server level. To improve performance, server-level firewall rules are temporarily cached at the database level. To refresh the cache, see DBCC FLUSHAUTHCACHE.
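For completeness, here is a sketch of how such database-level rules can be managed with T-SQL while connected to the user database (the rule name and IP addresses are just examples):

-- create (or update) a database-level rule
exec sp_set_database_firewall_rule
    @name = N'HomeOffice'
    , @start_ip_address = '131.107.0.1'
    , @end_ip_address = '131.107.0.1';

-- list the database-level rules
select name, start_ip_address, end_ip_address
from sys.database_firewall_rules;

-- remove the rule
exec sp_delete_database_firewall_rule @name = N'HomeOffice';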

Web.Contents and 404 page not found in Power BI

The M language (powering Power Query and Power BI) supports the try ... otherwise ... construct. It is handy to catch errors when processing some of the rows and handle these cases in alternative paths.

Unfortunately, this feature is not intuitive when it surrounds a call to the function Web.Contents(). Intuitively, you could expect that a 404: page not found response would trigger the otherwise path. But that’s not the case: an error will be thrown!

To correctly manage this kind of error, you’ll need to add a second parameter to your Web.Contents call (after the url): [ManualStatusHandling={404}]. This parameter tells Web.Contents() that you don’t want to apply the default behavior of Power BI when receiving a 404 error: you’ll handle this case yourself and not stop the load of the table.

The pragmatic option is probably to continue your transformation flow regardless of the response received from the web server. Your next step will (probably) fail miserably and you’ll be able to add a try ... otherwise ... on that step to handle the error.

But, if you really want to do this cleanly, you’ll have to parse the metadata of the result of Web.Contents. The function Value.Metadata gives you this opportunity. In the metadata, you’ll find a Response.Status field where you’ll be able to check the value received: 404, 100, … Based on this value, you can decide which path to follow with a simple if ... then ... else.

let
  Source = Web.Contents(
    "http://unexisting.com"
    , [ManualStatusHandling={404}]
  ),
  ResponseCode = Value.Metadata(Source)[Response.Status],
  NextStep = if ResponseCode=404 then ... else ...
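Putting the pieces together, a minimal sketch could look like the query below. I’m assuming a JSON payload here; adapt the last step to your real source:

let
  // the URL is fictitious; ManualStatusHandling prevents the default error on a 404
  Source = Web.Contents(
    "http://unexisting.com",
    [ManualStatusHandling = {404}]
  ),
  ResponseCode = Value.Metadata(Source)[Response.Status],
  Result =
    if ResponseCode = 404
    then null                    // the resource is missing: return a placeholder
    else Json.Document(Source)   // otherwise parse the payload
in
  Result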

How to connect to PostgreSQL with NBi?

I often receive questions such as “Does NBi work with PostgreSQL?” or “Is it possible to use another database than SQL Server?”. The answer is definitely: YES! NBi works with all databases supporting OleDB or ODBC connections.

Yesterday, I published a blog post about connecting to MySQL; today we’ll see how to connect to PostgreSQL from NBi by using both the ODBC and the OleDB connectors. Unlike MySQL, PostgreSQL has at least one ODBC driver and at least one OleDB provider that are still maintained (and even actively developed).

I’ll start with the ODBC driver for PostgreSQL. You can find the binaries on the official PostgreSQL website. Download them and install the driver. You can check that the ODBC driver has been correctly installed in the “Programs and Features” panel:

postgresql-odbc-connector

 

Another great way to check that the ODBC drivers are installed is to use PowerShell and the cmdlet

Get-OdbcDriver

This cmdlet returns the following list, where you should find something related to PostgreSQL.

postgresql-odbc-driver

This cmdlet also gives you the exact name of each driver, something my google-fu usually fails miserably at finding. In this case, I have two drivers named PostgreSQL ANSI(x64) and PostgreSQL Unicode(x64). Be sure to use these exact names in your connection strings!
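By the way, Get-OdbcDriver accepts a -Name parameter with wildcards, so you can (as far as I know) filter the list directly:

Get-OdbcDriver -Name "PostgreSQL*"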

When these connectors are successfully installed, edit your test-suite and define the connection string to your PostgreSQL server as:

<query
   connectionString=
      "Driver={PostgreSQL ANSI(x64)};
       Server=127.17.0.2;
       Database=postgres;
       UId=CI-Build;
       Pwd=Xyz;" 
>
  <![CDATA[
     select 'Hello world'
  ]]>
</query>

That’s it!

For the OleDB provider, I’ll use the commercial version provided by Intellisoft and available on this page. This product comes with a free trial. Once downloaded and installed, you can check that it has correctly been registered by running the following PowerShell cmdlet:

(New-Object system.data.oledb.oledbenumerator).GetElements()

and it should contain the following result:

postgresql-oledb-provider

Unfortunately, this method doesn’t return the nickname of the provider that we must use in our connection string. In this specific case, this name is PGNP.1.

Because this OleDB provider is not pre-registered by NBi, we need to map it to the OleDb namespace as explained in the documentation of NBi. The config file must be updated with:

<nbi testSuite="PostgreSQL-TestSuite.nbits">
  <providers>
    <add id="PGNP.1" invariant-name="System.Data.OleDb"/>
  </providers>
</nbi>

Once that’s done, you can safely edit the connection string in your test-suite.

<query
  connectionString=
    "Provider=PGNP.1;
     Data Source=127.17.0.2;
     Initial Catalog=postgres;
     User ID=CI-Build;
     Password=Xyz;"
>
  <![CDATA[
     select 'Hello World'
  ]]>
</query>

We could move this connection string to the settings (references or defaults) or to the config file.

You can now configure a connection to a PostgreSQL instance without any problem!

How to connect to MySQL with NBi?

I often receive questions such as “Does NBi work with MySQL/PostgreSQL?” or “Is it possible to use another database than SQL Server?”. The answer is definitely: YES! NBi works with all databases supporting OleDB or ODBC connections.

Today, I’ll specifically answer the question about MySQL; tomorrow, I’ll publish another blog post about PostgreSQL. For MySQL, I’ll connect from NBi by using an ODBC connector. Indeed, MySQL has no support for an OleDB connector. To be 100% transparent, some old OleDB connectors exist, but there doesn’t seem to be any active development on these projects.

Start by downloading the latest version of the ODBC driver for MySQL from the MySQL website, then install it. You can check that the ODBC driver has been correctly installed in the “Programs and Features” panel:

mysql-odbc-connector

Another great way to check that the ODBC drivers are installed is to use PowerShell and the cmdlet

Get-OdbcDriver

This cmdlet returns the following list, where you should find something related to MySQL.

mysql-odbc-driver

Do you need the 32-bit (x86) or 64-bit (x64) driver? It depends on the application using the ODBC driver … both need to match. When using the nunit-console or the NUnit GUI, you’ll need the 64-bit version. But if you’re running your tests from Visual Studio, then you’ll need the 32-bit version. You can install both on the same computer, but it’s apparently recommended to install the 64-bit version first and then the 32-bit one. I haven’t checked this personally.
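If you’re unsure which flavours are already installed, the same cmdlet can (if I remember correctly) filter by name and platform:

Get-OdbcDriver -Name "MySQL*" -Platform "32-bit"
Get-OdbcDriver -Name "MySQL*" -Platform "64-bit"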

When the connector is successfully installed, edit your test-suite and define the connection string to your MySQL server as:

<query
   connectionString=
      "Driver={MySQL ODBC 5.3 UNICODE Driver};
       Server=127.17.0.3;
       Database=adventureworks;
       User=CI-Build;
       Password=Xyz;" 
>
  <![CDATA[
     select count(*) from adventureworks.address
   ]]>
</query>

We could also use this connection string in our settings (defaults or references) or in our config file.

Now, you can use your MySQL instance exactly the same way you’d use a SQL Server instance.

Handling enum values in fact tables with Power BI

In a previous blog post, I wrote about the analysis of survey results and specifically how to build a hierarchy when this notion was not really transparent in the source data. Then, in a second post, I explained how I avoided creating temporary tables by using transpose and unpivot.

I’ll now continue my journey and explain how to build the tables to manage the results. Some of the challenges that I’ll have to face are:

  • Answers are an enum where I’m expecting values such as NONE, LARGELY, FULLY, … [Green area]
  • NOT APPLICABLE should be treated as a no-answer (which is also a possibility in the data).
  • I DON’T KNOW should be considered as an answer, but not taken into account when I create an aggregation (average) of the results received.
survey-excel-file

To start, I’ll create two tables, not one. In the first, I’ll create a new dimension with the list of possible answers to the questions. I can do this manually with the “Enter Data” feature.

enter-data
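Behind the scenes, the “Enter Data” feature just generates an M expression; conceptually it’s equivalent to something like the sketch below. The labels and numeric codes are largely assumptions on my side, except DON’T KNOW = 0, which is used later when computing the average:

let
    // manually entered enumeration of the possible answers
    Source = #table(
        type table [Label = text, Value = Int64.Type],
        {
            {"DON'T KNOW", 0},
            {"NONE", 1},
            {"SLIGHTLY", 2},
            {"PARTIALLY", 3},
            {"LARGELY", 4},
            {"FULLY", 5}
        }
    )
in
    Source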

For the second, I’ll restart from the previously created table Results-Staging. I’ll just merge this table with the newly created Value table.

let
    Source = #"Results-Staging",
    #"Merged Queries" = Table.NestedJoin(#"Source",{"Answer"},Value,{"Label"},"NewColumn",JoinKind.Inner),
    #"Expanded NewColumn" = Table.ExpandTableColumn(#"Merged Queries", "NewColumn", {"Value"}, {"Value"}),
    #"Removed Columns1" = Table.RemoveColumns(#"Expanded NewColumn",{"Answer"}),
    #"Filtered Rows1" = Table.SelectRows(#"Removed Columns1", each ([Value] <> null))
in
    #"Filtered Rows1"

Due to the usage of an inner join, all the questions where the answer was not one of the predefined values of the enumeration are discarded. That’s good news because it removes the yellow area and the red area, but also all the NOT APPLICABLE answers!

results-final

Our model looks like:

survey-model

We still have to implement a few measures. The count of answers with a given label for a category or a respondent can easily be achieved with the straightforward DAX formula:

CountOfAnswer = COUNT(Results[Value])

We’ve previously decided that a DON’T KNOW shouldn’t influence our average of the answers. To apply this decision, we just need to filter out all the values equal to 0 (the enum value of DON’T KNOW) before calculating the average, and then subtract 1 to get a value between 0 and 4. Coooool, except that if we have no value other than 0, it will return -1 … not what we’re expecting. We’ll need to check that the average is not blank before subtracting 1.

AverageOfResponse =
   VAR NonZeroValues =
       CALCULATE(
           AVERAGE('Value'[Value]),
           FILTER('Value', 'Value'[Value]<>0)
       )
   RETURN
       IF(ISBLANK(NonZeroValues), BLANK(), NonZeroValues-1)

That’s it! Now I can include the DON’T KNOW values in the count of answers received, but they don’t influence my average:

results-charts

 

Merge on a filtered set of rows without creating a new table (Power BI).

In a previous blog post, I wrote about the analysis of survey results and specifically how to build a hierarchy when this notion was not really transparent in the source data.

I’ll now continue my journey and explain how to build the table/dimension “respondent”. To summarize the challenges that I’ll face:

  • Each respondent means a new column (and not a new row), and each additional piece of information about the respondent adds a new row (and not a new column) [Orange area]
  • Some information is only available in the list of answers given by the respondent and not in the information at the top [Yellow area]
  • For maintenance reasons, I’ll try to avoid creating temporary tables that would only be used as a subset of another table
survey-excel-file

The first step is really easy: we’ll just keep the first rows by removing all the rows where column “A” has a value. When that’s done, we can remove this column, which now only contains null values.

let
   Source = Source,
   #"Filtered Rows" =
      Table.SelectRows(
          Source,
          each ([Column1] = null)
      ),
   #"Removed Columns" =
      Table.RemoveColumns(
           #"Filtered Rows",
           {"Column1"}
      ),

The next step is to transform each row into a column and vice versa. People with a SQL background usually try to apply a pivot or unpivot to achieve this task. There is an easier way with the M language … just use the function Transpose:

#"Transposed Table" = Table.Transpose(#"Removed Columns"),

You’ll obtain the following result:

respondent-transpose

It’s now time to promote the first row as headers.

#"Promoted Headers" = Table.PromoteHeaders(#"Transposed Table"),

Now that we already have a bit of information about our respondents, we need to add the block of yellow information to this table.

To achieve this, I’ll first create a new table named Results-Staging containing the gray area of my Excel sheet. Let’s start by removing column “B”, promoting the first row as headers and filtering out the rows of the orange area (where column “A” is null).

let
    Source = Source,
    #"Removed Columns" = Table.RemoveColumns(Source,{"Column2"}),
    #"Promoted Headers" = Table.PromoteHeaders(#"Removed Columns"),
    #"Filtered Rows" = Table.SelectRows(
        #"Promoted Headers",
        each ([Column1] <> null)
    ),
results-gray

Then, I’ll change this format to have one column with the QuestionId, another with the respondent and a third with the response. To transform the current table into this format, I’ll unpivot all columns except the first one; the unpivot produces the Attribute and Value columns that I’ll rename afterwards.

    #"Unpivoted Other Columns" =
        Table.UnpivotOtherColumns(
            #"Filtered Rows",
            {"Column1"},
            "Attribute",
            "Value"
        ),
    #"Renamed Columns" =
         Table.RenameColumns(
              #"Unpivoted Other Columns",
              {
                    {"Column1", "QuestionId"},
                    {"Attribute", "Interviewee"},
                    {"Value", "Answer"}
               }
          )
in
    #"Renamed Columns"
results-unpivot

In this table, only the results with a QuestionId equal to 111 really interest me for a merge with the existing table Respondent. If you’re familiar with the UI of Power BI Desktop, you’ll probably think of creating a new table referencing this one, then filtering on QuestionId equal to 111 and finally merging. It’ll work, but applying this strategy could result in many “temporary” tables. A lot of these small tables, used only for a few steps before being merged with other tables, tend to be a nightmare for maintenance. You can use the Advanced Editor to avoid exposing this kind of temporary table and embed the corresponding steps in your main table.

In the code of the Respondent table, add a step to filter the content of the table Results-Staging.

#"Age table" =
   Table.SelectRows(
      #"Results-Staging",
      each [QuestionId] = 111
   ),

If you take a look at the code, the identifier #"Results-Staging" doesn’t exist anywhere else in the code associated with the table Respondent. In fact, it’s the end result of the query/table Results-Staging!

Now that we have our filtered set of rows corresponding to the ages of the respondents, I can join it with the content of the table Respondent.

#"Merged Queries" =
   Table.NestedJoin(
       #"Promoted Headers",
       {"Name"},
       #"Age table",
       {"Interviewee"},
       "NewColumn",
       JoinKind.LeftOuter
   ),

With a bit more expanding and renaming, I can finally get my table Respondent.

respondent-final
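For the record, those last expanding/renaming steps could look like this in M (a sketch; the target column name Age is my assumption based on the screenshot):

#"Expanded NewColumn" =
   Table.ExpandTableColumn(
      #"Merged Queries",
      "NewColumn",
      {"Answer"},
      {"Age"}
   )
in
   #"Expanded NewColumn"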

In the next blog post, I’ll show how to manage the table of the results.

 

Creating a hierarchy when the source has the levels flattened in a unique column (Power BI)

Over a few blog posts, I’ll explain how to create a model for analyzing the answers to a survey. This series is based on a real-life case that I’m fighting with right now, but I’ve completely changed the type of questions and the answers.

My first challenge is to create a clean dimension with the hierarchy of questions, based on an input file where this notion of hierarchy was not explicit and everything was flattened into two columns.

Before jumping into action, I’ll describe the file containing my data. I received a wonderful Excel file with the results of the survey. This file is a bit weird; let’s summarize the format:

  • the first column contains a “number” corresponding to a hierarchy organizing the different questions. Each question is part of a sub-category and each sub-category is part of a category. The second column contains the question itself or the label of the (sub-)category (Gray area)
  • an additional column is created for each respondent (columns C to F)
  • the first rows are personal information, not answers to questions (Orange area)
  • some personal information (age) is also available in the list of answers to questions (Yellow area)
  • most of the questions accept predefined values, including “Not applicable” and “Don’t know” (Green area)
  • some people were not allowed to answer certain questions, so I have no value for them in the file (Red area)
survey-excel-file

I decided to apply a few “business rules” to analyze the results:

  • “Not applicable” should be considered as “no answer” and not taken into account. This rule must apply both when I’m aggregating the answers and when I’m counting the answers by question
  • “Don’t know” is a valid answer that I’d like to see when I’m analyzing how many answers I received, but not when I’m aggregating the answers to get a single value.

To load this Excel file in Power BI, I’ll just use standard functions and define a first table “Source” that won’t be enabled to load in the report.

My next task is to create a table (or dimension) with the different questions. I also want to include a hierarchy in this dimension: I should be able to browse the questions by category and sub-category.

Let’s create a new table named “Question” by referencing the “Source” table, then remove all columns other than A and B.

question-null-id

We’ll also need to remove the rows where the content of “Column1” is null (these rows contain personal information, not questions/categories).
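A sketch of these first two steps in M (the default column names Column1/Column2 are my assumption, matching the later renames):

    #"Removed Other Columns" =
        Table.SelectColumns(
            Source,
            {"Column1", "Column2"}
        ),
    #"Filtered Rows" =
        Table.SelectRows(
            #"Removed Other Columns",
            each ([Column1] <> null)
        ),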

At the beginning of this post, I said that the first column was a number … to be precise, it’s not a whole number (as I expected) but a decimal number! The root cause is rounding errors in Excel … apparently the original file was not configured correctly and the columns were not set to “text”:

question-rounding

To fix this issue, I’ll just apply the rounding myself with 0 decimals … not an “up” or “down” rounding, just a normal rounding. Then I’ll change the type of this column to “text” because I won’t use this column for aggregations.

#"Rounded Off" =
   Table.TransformColumns(
      #"Renamed Columns",
      {{
         "QuestionId",
         each Number.Round(_, 0), type number
      }}
    ),
#"Changed Type" =
    Table.TransformColumnTypes(
       #"Rounded Off",
      {{"QuestionId", type text}}
    ),

Based on the column “QuestionId”, I can determine whether a row is a question, a sub-category or a category just by looking at the length of this text: 1 digit means a category, 2 a sub-category and 3 a question. Let’s add a column with this information.

#"Added level" = Table.AddColumn(
   #"Changed Type",
   "Level",
   each
      if
         Text.Length([QuestionId])=1
      then "category"
      else if
         Text.Length([QuestionId])=2
      then "sub-category"
      else "question"),

Then I’ll add two additional columns with the Id of the sub-category and the category. To obtain these values, I just need to extract the first two digits for a sub-category and the very first digit for a category.

#"Added CategoryId" =
   Table.AddColumn(
      #"Added level",
      "CategoryId",
      each Text.Start([QuestionId],1)
   ),
#"Added SubCategoryId" =
   Table.AddColumn(
      #"Added CategoryId",
      "SubCategoryId",
      each Text.Start([QuestionId],2)
   ),
question-level

Once I’ve calculated the “Id” of the sub-category, I need to replace it with its label. To achieve this, I’ll just merge this table with itself, based on the SubCategoryId being equal to the QuestionId:

question-merge
#"Merged Queries" =
    Table.NestedJoin(
        #"Added SubCategoryId",
        {"SubCategoryId"},
        #"Added SubCategoryId",
        {"QuestionId"},
        "NewColumn",JoinKind.LeftOuter),

Expand the table and select the column Question, which you should rename to Sub-Category. Apply the same tactic to add the label of the Category and you should have the table below.

question-categories.jpg
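In M, that expand-and-rename step translates into something like this (a sketch; adapt the column names to your own earlier renames):

#"Expanded NewColumn" =
    Table.ExpandTableColumn(
        #"Merged Queries",
        "NewColumn",
        {"Question"},
        {"Sub-Category"}
    ),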

This table has too many rows because we still have rows for the categories and sub-categories. We can just filter them out using the column Level.

#"Filtered Rows1" =
     Table.SelectRows(
          #"Renamed Columns2",
          each ([Level] = "question")
     ),

The column Level is now pointless (a single value equal to “question”) and we can safely remove it (along with the Id columns for category and sub-category) to get a clean dimension table:

question-final

In the next blog post, I’ll explain how to build my “interviewee” dimension.

Parsing a string with Power BI (M language)

A few days ago, I read an interesting blog post by Lukas Lötters [b|t] where he explains how to use regular expressions in Power BI (M language) by running an R script. It’s a really smart way to solve the issue. But I wanted to try to solve this problem without delegating the parsing of the string to R (just for the challenge!).

In a nutshell, the challenge is to find a substring in a string. This substring is identified by the pattern \((19|20)\d{2}. If you’re not familiar with regular expressions, it just means an opening parenthesis followed by “19” or “20” and two additional digits.

regex_meme.jpg

Let’s start by defining a few test-cases that the function to implement will need to pass:

  • “The Shawshank Redemption (1994)” / result: 1994
  • “My Dad Says (2010) {Dog Ed Pursuit (#1.7)}” / result: 2010
  • “Years Younger (2004/I)” / result: 2004
  • “abcd(19xy) hqskj dhq (2020)” / result: 2020
  • “fdsdf (1995) sdfsdf (19)” / result: 1995
  • “hkjehr (195x) (1993)” / result: 1993
  • “hkjehr (19-9) (2005)” / result: 2005
  • “hkjehr (199.) (2000)” / result: 2000
  • “hkjehr (19+9) (1993)” / result: 1993

The first three test-cases come directly from the original file to parse; the others are just additional potential cases produced by my own brain.

To solve this challenge, I’ll create a function expecting a text parameter.

let
    ExtractYear = (string as text) =>

My first step is to identify all the opening parentheses in the string. Usually, we use the function Text.PositionOf to get the first occurrence of a character in a string. But if we specify the parameter Occurrence.All, this function returns a list with all the occurrences.

let
    starts = Text.PositionOf(string, "(", Occurrence.All),

The previous function returns a list of positions in the string. I’ll now extract the 4 characters following each parenthesis and replace the previous content of my list with these 4 characters; List.Transform is a good friend for this kind of task.

    blocks =
        List.Transform(
            starts
            , each Text.Range(string, _ + 1, 4)
        ),

Unfortunately, I’m not sure that I have 4 characters after the parenthesis in my string (test-case 5). I need to be careful and handle the error. We can implement this with a try ... otherwise ... expression.

    blocks =
        List.Transform(
            starts
            , each try Text.Range(string, _ + 1, 4) otherwise null
        ),

Now, for each opening parenthesis, I have the next 4 characters (or null if they don’t exist). I’ll need to filter this list of 4-character blocks to only keep those starting with “19” or “20”. To apply this filter to my list, I’ll use the function List.Select.

    blocks1920 =
        List.Select(
            blocks
            , each
                Text.StartsWith(_, "19")
                or Text.StartsWith(_, "20")
        ),

This filter already considerably reduces the list of potential years. The only test-cases where I still have several potential years are test-case 6 and the following ones. Example: hkjehr (195x) (1993). This test-case has 2 remaining elements in the list: 195x and 1993.

To continue filtering out candidate “years”, I need to check that the characters in positions 3 and 4 are digits. I haven’t found a native function in M to test that a character is a digit, so I’ll implement it myself. But first, I extract the characters in positions 3 and 4 and pass them as parameters to a to-be-created function named isDigit. This new function will return true/false (a logical in M).

    blocks1920dd =
        List.Select(
            blocks1920
            , each
                isDigit(Text.Range(_,2,1))
                and isDigit(Text.Range(_,3,1))
        ),

To validate that a character is a digit, I decided to rely on characters and their encoding: all the digits are encoded between positions 48 and 57 in the ASCII table.

ascii-chars

My function isDigit will just test that the character, converted to a number (with the help of Character.ToNumber), lies between the same conversion applied to “0” and “9”.

    isDigit = (x as text) as logical =>
        Character.ToNumber(x)>=Character.ToNumber("0")
        and Character.ToNumber(x)<=Character.ToNumber("9"),

This internal function accepts a parameter. Take a look at the syntax: the parameter is declared after the equal sign and before the arrow symbol (just as for a normal function).

When all these tests have been applied to filter the candidates for the release year, I’m expecting to have just one remaining candidate! I assert this with the function List.Single. This function throws an error if the list is empty or has more than one element.

    value = List.Single(blocks1920dd)

That’s it! My newly created function passes all the test-cases that I set at the beginning.

challenge-completed

Full code:

let
    ExtractYear = (string as text) =>
let
    starts = Text.PositionOf(string, "(", Occurrence.All),
    blocks =
        List.Transform(
            starts
            , each try Text.Range(string, _ + 1, 4) otherwise null
        ),
    blocks1920 =
        List.Select(
            blocks
            , each
                Text.StartsWith(_, "19")
                or Text.StartsWith(_, "20")
        ),
    blocks1920dd =
        List.Select(
            blocks1920
            , each
                isDigit(Text.Range(_,2,1))
                and isDigit(Text.Range(_,3,1))
        ),
    isDigit = (x as text) as logical =>
        Character.ToNumber(x)>=Character.ToNumber("0")
        and Character.ToNumber(x)<=Character.ToNumber("9"),
    value = List.Single(blocks1920dd)
in
    value
in
    ExtractYear
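To try the function against a couple of the test-cases, you can invoke it from another query (assuming it has been saved under the name ExtractYear):

let
    Samples = {
        "The Shawshank Redemption (1994)",
        "hkjehr (195x) (1993)"
    },
    Years = List.Transform(Samples, each ExtractYear(_))
in
    Years    // expected: {"1994", "1993"}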

Use test-cases to validate your design with stakeholders

This post is part of a series about specifications, requirements, work items and test-cases in the field of Business Intelligence solutions. The first part is about why I hate specifications and wouldn’t recommend them, the second is about how to write good requirements for BI solutions. Then, in the third part, I set the focus on work items and what they mean to me. I’ll now explain the role of test-cases in this strategy to deliver good BI solutions.

Let’s go back and see what is on the table so far:

  • Requirements:
    • A description of the business process to be analyzed
    • List of “what” (and not “how”) the stakeholders are expecting to do with our solution.
    • Information about the end-users (experience, …)
  • Work items
    • A general design, the solution that I’m planning to build
    • A list of deliverables to be implemented, each of them small and independent (meaning that they can be discarded or postponed).

Until now, we haven’t validated anything. The general design and work items are just the team’s vision of what could be the best solution for the end-users. If we implement this as-is, we take a lot of risks. Too many! If something is not correctly described in the requirements, or something in the business process is misunderstood, we’ll only find out when revealing the solution to the end-users, and by then it will be too late. We must have a handshake between the implementation team and the stakeholders before implementing the whole solution, ideally before writing the first lines of code.

invisible-handshake-with-giant

Keep in mind that you should never ask the stakeholders to give a “go” for the whole solution. Would you ask them to confirm that the solution you have in your head is exactly what they want based on just a high-level description (slides, whiteboard, …)? It’d be impossible (even for you). That’s the same as expecting you to give a precise number of hours to implement a solution from just a few lines of requirements.

We need a validation from the stakeholders, but what can we ask them to validate? Work items? The main issue with work items is that it’s difficult for end-users to get a good vision of what the solution will be: what they will actually have in their hands! Work items are usually too small to give them this overview. And if we ask them to validate something other than the work items, we’ll also be responsible for keeping the work items and this “other thing” in sync during the whole implementation process.

So, the first option is to write a 3000-page document with a really detailed explanation of what will be built. Cross your fingers that the stakeholders have time to read it (in my experience, they won’t … and yes, sooner or later it will be your problem). If they do have time, let me guess that they won’t understand more than 50% of it (again, sooner or later, it will be an issue for you). Don’t forget to pray that you haven’t made any small mistakes in this description, or it will be your responsibility! You’ll need to deliver it (even if it’s not possible) … and don’t forget that this document must be written before the development, so without any compiler to check your rough ideas. Oh, and finally, you’ll need to maintain the link between this document and your work items. From my experience, go and buy a lottery ticket instead: you have more chances to succeed.

lottery-ticket.png

A second option? Write test-cases. Most industry practices ask stakeholders to validate “designs” or “specifications” but rarely “test-cases”. When you think about it, it’s weird: test-cases are much more precise and less abstract than the two others, so they are easier for stakeholders to understand, and there are fewer opportunities to argue about the interpretation of a test-case than about the interpretation of a specification. Funny. From my experience of the software industry, it’s usually because consultancy companies don’t write test-cases … they rely on end-users to do the testing! And both (stakeholders and consultants) think that they are saving money with this “strategy”. Are we serious?

are-you-serious

Let’s go back to test-cases. A common definition is “A test-case is a set of conditions under which we will determine whether an application, software system or one of its features is working as it was originally established for it to do”.

We can reverse the definition and say that “the way the application, software system or one of its features is expected to work is defined by a set of conditions named a test-case”. It means that to express the result that I want (the what), I’m explaining what I’ll execute (the how). This gives a very good opportunity to keep the discussion with the stakeholders at the what-level and not the how-level. When you’re validating the test-case, you ask them to confirm that it’s effectively the expected result … and not that it’s effectively how they were dreaming of implementing it.

The difference between validating a “specification” and a “test-case” can be explained in a few words: don’t ask them to validate a formula, ask them to validate the output of the formula when it receives a given set of input parameters. That’s completely different.

Let me be clear on one important topic: validating a test-case has two aspects. The first is obvious: when you give these parameters, you expect this result. The second is less obvious: you must also validate that the end-users are effectively able to provide the parameters. For many reasons (usage of codes in place of labels, data not available at the moment of the request, …), the solution might not be implementable or usable!

How should you build the test-cases? From the end-user point of view: show them the report or pivot table that they are expecting/will receive. Don’t bother them with validation of the ETL … they don’t care and it’s not their problem, it’s yours. I agree that 50% of the errors will be in the ETL, but it’s not something that you should discuss with end-users! Focus on deliverables: a dimension, a measure-group, a report, an alert!

point-of-view.png

The big issue with BI solutions is usually that when we’re speaking about input parameters, we’re speaking about thousands or millions of values. Building datasets for test-cases can be a huge task. My recommendation is to use real and already existing data as much as possible; if you can, use production data (from the previous years). If you can’t, argue for it and show the stakeholders the cost of building datasets … they could change their mind. If it’s really not possible, you’ll need to build datasets. But in both cases, really question the need to assert the final result on all the data: a subset is often enough.

To illustrate this, I’ll explain my own case. At my current job, we have complex aggregations over time and the daylight saving switches are a nightmare for us. Some testers would try to validate all the aggregations at the day level for the 14 years of data. It would take them days to implement … and who will validate such a result? Do you really think that someone will spend his time validating 5000 complex sums/averages/weighted averages? No one. I recommend testing one standard day and the two daylight saving switches of a same year. 3 tiny test-cases versus one huge test-case.

There is another big advantage of tiny test-cases over huge test-cases. When my test is successful for the standard day and for the summer daylight saving but not for winter, I’ll directly understand that I have a problem loading my data for the winter shift and nowhere else. But when a huge test-case is failing, you must first analyze it deeply to check what exactly is failing, and it could take a lot of time before realizing that the issue only concerns the winter shift.

But what happens if my solution is correct for these 3 days but not for the others? I could have a bug where the solution doesn’t load the data before the year 2015! That’s a possibility, and it’s why I should have some tiny tests with an aggregation at the year level to ensure that I’m not only loading the last years of data. … but check my words … my test has changed! I’m no longer testing that I have a correct aggregation, I’m testing that I have plausible values! Something that is usually much quicker to write and to validate.

The building of the set of test-cases should be done with the stakeholders. Let’s go back to an illustration from the real world. A few years ago, one of my testers spent many hours implementing a mock-up of a pivot table. The mock-up was filled with real figures and he spent a lot of time calculating the expected content of each cell (around 16). After one day of work, he met the subject-matter expert. In less than a second, the answer was there: “No, something is wrong!”. How could he say in less than a second that it was wrong? First, we had an issue with units (kW or MW), meaning that all the figures were a thousand times larger than expected. For him it was obvious; for us, we had no idea whether it was 45 or 45,000. In less than a second the subject-matter expert could identify that something was wrong, but also explain how to fix it. Great! But there was something else, another issue. The expert then sorted the members differently and identified another mistake: some values were expected to be zero and they weren’t. The expert explained that the lower part of the matrix should be zero due to the business process, and it wasn’t the case. We discussed the potential root causes of our mistakes and identified a wrong interpretation of a business rule described in the requirements. In just a few minutes we caught 3 potential mistakes in the development (units, sorting of members, business rule misinterpretation) … They continued to work together for one hour and reviewed the whole test-suite, adding test-cases that only subject-matter experts could bring to the table and simplifying many test-cases. Great added value for everyone.

work-together.jpg

Unlike writing work items, writing test-cases can be done by junior and senior developers alike. What … developers writing tests? If you or your team members can’t write tests yourselves, how can you hope to develop with a minimum of bugs? How can you anticipate all potential issues? That’s the essence of testing. If a tester has absolutely no notion of development, I have huge doubts that he can anticipate potential issues; it’s just monkey testing. It has some value, but does it sound serious to build your testing strategy exclusively on monkey testing? I doubt it.

Nevertheless, if your team is totally new to testing, it could be useful to add a test manager to your team for a few months (or years). He should be able to review the test-cases with you, help you identify alternative paths … help your team reach the next level of maturity in testing. But your developers should write the test-cases themselves. Not necessarily the test-cases for what they are implementing: they can pair with another implementer, one testing the work item that the other is coding (and vice versa).

Writing a test-case is a slow (but valuable) process. I usually recommend a 1:1 ratio between the time spent writing (not executing) the test-case and the time spent developing the code. Keep in mind that I’m not just writing a test-case on a sheet of paper: I’m also validating my solution with the stakeholders before it’s effectively coded! Reusability? Once validated, I’ll use this artefact to build a robust test-suite that will be automated to validate my development. I’m also building a really good artefact to detect possible issues and impacts of change requests. Much more than a test that I’d throw away within 10 days. Still not convinced? Have you noticed that I’m not writing documentation … my documentation is my test-cases. By just running the test-suites, I can confirm that my documentation is in line with my development … living and always up-to-date documentation is priceless.

priceless