Parsing a string with Power BI (M language)

A few days ago, I read an interesting blog post of Lukas Lötters [b|t] where he is explaining how to use regular expressions in Power BI (M language) by running an R script. It’s a really smart way to solve the issue. But I wanted to try to resolve this problem without delegating the parsing of the string to R (just for the challenge!).

In a nutshell, the challenge is to find a substring in a string. This substring is identified by a pattern \((19|20)\d{2}. If you’re not familiar with regular expressions, it just means a parenthesis followed by “19” or “20” and two additional digits.

regex_meme.jpg

Let’s start by defining a few test-cases that the function to implement will need to pass:

  • “The Shawshank Redemption (1994)” / result: 1994
  • “My Dad Says (2010) {Dog Ed Pursuit (#1.7)}” / result: 2010
  • “Years Younger (2004/I)” / result: 2004
  • “abcd(19xy) hqskj dhq (2020)” / result: 2020
  • “fdsdf (1995) sdfsdf (19)” / result: 1995
  • “hkjehr (195x) (1993)” / result: 1993
  • “hkjehr (19-9) (2005)” / result: 2005
  • “hkjehr (199.) (2000)” / result: 2000
  • “hkjehr (19+9) (1993)” / result: 1993

The three first test-cases are directly retrieved from the original file to parse, for the others, they are just additional potential cases directly produced by my own brain.

To solve this challenge, I’ll create a function expecting a text.

let
    ExtractYear = (string as text) =>

My first step will be to identify all the opening parenthesis in the string. Usually, we use the function Text.PositionOf to get the first occurrence of a character in a string. But if we specify the parameter Occurrence.All, this function is returning a list with all the occurrences.

let
    starts = Text.PositionOf(string, "(", Occurrence.All),

The previous function is returning a list of positions in the string. I’ll now extract the 4 characters following the parenthesis and will replace the previous content of my list with these 4 characters, List.Transform is a good friend for this kind of task.

    blocks =
        List.Transform(
            starts
            , each Text.Range(string, _ + 1, 4)
        ),

Unfortunately, I’m not sure that I’ve 4 characters after the parenthesis in my string (test-case 5). I need to be careful and handle the error. We can implement this in a try ... otherwise ... syntax.

    blocks =
        List.Transform(
            starts
            , each try Text.Range(string, _ + 1, 4) otherwise null
        ),

Now, for each opening parenthesis, I’ve the next 4 characters (or null if they are not existing). I’ll need to filter this list of 4 characters to only take into account those starting by “19” or “20”. To apply this filter to my list, I’ll use the function List.Select.

    blocks1920 =
        List.Select(
            blocks
            , each
                Text.StartsWith(_, "19")
                or Text.StartsWith(_, "20")
        ),

This filter will already considerably reduce the list of potentials year. The only test-cases where I still have many potential years are the test-case 6 and next. Expl: hkjehr (195x) (1993). This test-case has 2 remaining elements in the list: 195x and 1993.

To continue to filter out candidate “year”, I need to check that the characters in position 3 and 4 are digits. I have not found a native function in M to test that a character is a digit, so I’ll implement this by myself. But first step, extract characters in position 3 and 4 and pass them as parameter to the to-be-created function named isDigit. This new function will return true/false (a logical in M).

    blocks1920dd =
        List.Select(
            blocks1920
            , each
                isDigit(Text.Range(_,2,1))
                and isDigit(Text.Range(_,3,1))
        ),

To validate that a character is a digit, I decided to rely on characters and encoding. All the digits are encoded between positions 48 and 57 in the ASCII table.

ascii-charsMy function isDigit will just test that the character, transformed to a number (by the help of Character.ToNumber) is between the same conversion applied to “0” and “9”.

    isDigit = (x as text) as logical =>
        Character.ToNumber(x)>=Character.ToNumber("0")
        and Character.ToNumber(x)<=Character.ToNumber("9"),

This internal function is accepting a parameter. Take a look to the syntax where I stipulated this parameter after the equal symbol and before the arrow symbol (Just as for a normal function).

When all these tests have been applied to filter the candidate for the release year I’m expecting to have just one remaining candidate! I’ll assert this with the function List.Single. This function is throwing an error if the list is empty or has more than one element.

    value = List.Single(blocks1920dd)

That’s it! My newly created function is validating all the cases that I set at the beginning.

challenge-completed

Full code:

let
    ExtractYear = (string as text) =>
let
    starts = Text.PositionOf(string, "(", Occurrence.All),
    blocks =
        List.Transform(
            starts
            , each try Text.Range(string, _ + 1, 4) otherwise null
        ),
    blocks1920 =
        List.Select(
            blocks
            , each
                Text.StartsWith(_, "19")
                or Text.StartsWith(_, "20")
        ),
    blocks1920dd =
        List.Select(
            blocks1920
            , each
                isDigit(Text.Range(_,2,1))
                and isDigit(Text.Range(_,3,1))
        ),
    isDigit = (x as text) as logical =>
        Character.ToNumber(x)>=Character.ToNumber("0")
        and Character.ToNumber(x)<=Character.ToNumber("9"),
    value = List.Single(blocks1920dd)
in
    value
in
    ExtractYear
Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s