Parsing semi-structured text files with the Power Query M language

My daughters participated to the provincial championships of gymnastics during this winter, both qualified for the regional championships. Between the two events, I analyzed the results of the different provinces with Power BI to guess what they could expect at the regional level.

The first unexpected issue was the file results: PDF files. I tried several tips such as open them in Word or with other processors and try to convert them to text files but finally the best and easiest solution was to directly copy/paste them in text files before loading them in Power BI.

Once loaded in Power BI, I had a huge table with the following pattern

##Category 1
Participant 1.1
Participant 1.2
Participant 1.3
##Category 2
Participant 2.1
Participant 2.2
Participant 2.3
Participant 2.4
Participant 2.5

I needed to transform that semi-structured file to the following structured table with two columns:

Participant 1.1     Category 1
Participant 1.2     Category 1
Participant 1.3     Category 1
Participant 2.1     Category 2
Participant 2.2     Category 2
Participant 2.3     Category 2
Participant 2.4     Category 2
Participant 2.5     Category 2

To achieve that transformation, I added an index to my table of flat results. Then creating a reference query, I isolated the categories by filtering out all the rows not starting by ##. The result was similar to that table:


My goal was to determine a range of lines corresponding to each category. Based on the line index of a participant, I must be able to determine her category. The index of categories will tell me the first and last line corresponding to a given category. I already had the starting line index of the categories but not the corresponding last line index . This value is simply the value of the start index of the next category minus one. The following code is adding this information and managing the last category with a try ... otherwise ... using the last line of my result file.

#"Filtered Rows" = 
      each Text.StartsWith([Column1], "##")),
#"Renamed Columns" = 
   Table.RenameColumns(#"Filtered Rows",{{"Index", "Start range"}}),
#"Added Index" = 
    Table.AddIndexColumn(#"Renamed Columns", "Local Index", 0, 1),
#"Added Custom" = 
    Table.AddColumn(#"Added Index", "End Range", 
           try #"Renamed Columns"{[Local Index]+1}[Start range]-1 
           otherwise Table.Max(Source, "Index")[Index]),
#"Removed Columns" = 
    Table.RemoveColumns(#"Added Custom",{"Local Index"})

Now, on another function, I’ll filter out all the categories and hold the participants and the line index. Based on this index, I’ll do a range loookup on the other table, containing the categories, and add the returned information as a new column to my table of participants (range join).

The first step is to create a function to handle a range join.

    JoinRange = (value as any, array as table) as any =>
        matchingRows = 
                 each value>=[Start Range] and value<=[End Range]
        returnValue = matchingRows{0}[Text]

Then I needed to make usage of this function in a new table "participants":

    Source = flat,
    #"Filtered Rows" = 
          each not Text.StartsWith([Column1], "##")
    LookupCategory = 
           #"Filtered Rows", 
           each JoinRange(_[Index], categories)
    #"Removed Columns" = 
    #"Removed Columns"

and I finally get the expected result:



Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s