Parsing columns of CSV files with inconsistant naming and invalid content with Power Query/M

I’m currently working on the setup of a data quality dashboard and Power BI was our natural choice to display valuable information. For an advanced use-case, we needed to consume CSV files containing all the rows not validating a given data quality rule (violations). Each CSV file was completely different because each rule is consuming different information during the run. It means that in one file I could be facing the column EmployeeId and BirthDate but in the next file I could find a PublishDate and Area followed by Value. In the case of a specific project, all the data quality rules report a field related to a date information. The idea of the report was to count the violations by this date. Unfortunately, we quickly found a few issues with our CSV files:

  • Most of the time the column containing the date was named “DateTime” but it was not always the case
  • In some files, the first row (containing the header and so the columns’ name) was empty
  • In some files, the transformation from a JSON structure to a CSV was not correctly handled by the parser and we have the following information in place of the expected date: "value": "2018-01-01". The expected date is there but we need some additional parsing to extract it.

At the beginning, some of us thought that it would be easy to fix these issues by returning to the data quality team and ask them to fix these issues but it was not so easy. Identifing the rules needing a fix would be huge task (the CSV files are not created if the test is successful, maling it impossible to address this issue in one run and other impediments). I took the decision to go over this issue with the implementation of the following heuristic:

  • if the CSV has a column DateTime then we’ll use it
  • if the header is empty or no column is named DateTime then use the first column
  • if the content of the selected column is not a date then try to parse it as the inner content of a JSON element.

To implement this parsing, I started by the list of files and the table created from the CSV. In this list I’ll add a new column corresponding to the list of DateTime found in the CSV file applying the previously described heuristics. To implement theses heuristics, I created a function expecting one parameter, the original table retrieved when opening the CSV file.

let
    getDateTimes = (rawTable as table) =>
let
   …
in
   …
in
   getDateTimes

The first step is to promote the first row as the header containing all the column names.

csvTable = Table.PromoteHeaders(rawTable, [PromoteAllScalars=true])

Then we’ll need to create an inline function to assert if the header contains the expected column named DateTime.

hasDateTimeColumn = List.Contains(
    Table.ColumnNames(csvTable)
    , "DateTime"
)

We’ll consume this inline function to determine which column, we’ll need to keep. If this column exists then we’ll use it else we’ll use the first column.

targetColumn = if hasDateTimeColumn 
    then {"DateTime"} 
    else {List.First(Table.ColumnNames(csvTable))}

Now that we’ve identified the column, we can just keep this column and remove unneeded columns.

rawDateTimeTable = Table.SelectColumns(csvTable, targetColumn) 

Unfortunatelly, we cannot be sure of the name of this column and it’s easier if we can correctly named it. We’ll renamed the first column with the expected name.

renamedDateTimeTable = Table.RenameColumns(
    rawDateTimeTable
    , {{List.First(targetColumn), "DateTime"}}
)

Now, we’ll need to get the value from this field. It could directly be the value or it could be a subpart of the a JSON element. To handle the different paths to obtain this value we’ll need to create a new internal function named getDateTime. Each row of the table will have to call this function to get its corresponding value. We can achieve this with the help of the native function Table.TransformColumns

parseDateTimes = Table.TransformColumns(
    renamedDateTimeTable
    , {"DateTime", getDateTime}
)

The code of the function getDateTime is simply applying a trial and error. We first try to parse the value as a date. If it doesn’t work, we’ll consider to format the value in a JSON element and then extract the inner value. We can jump from the first path to the second simply with a try … otherwise ... structure.

getDateTime = (value) => 
    let
        datePath = DateTime.From(value)
        , jsonPath = Json.Document(
            Text.Combine({"{", value, "}"})
        )[value]
    in
        try datePath otherwise jsonPath

Once, we’ve extracted the value, we can cast it as a date. That will enable us to check if our heuristic was correct or not beacuse at the end we’re expecting to get a date!

typedDateTimeTable = Table.TransformColumnTypes(
    parseDateTimes
    , {{"DateTime", type datetime}}
)

Finally, we just need to return the column DateTime.

typedDateTimeTable[DateTime]

When I explained to the team in charge of Power BI that we would not fix the issues at the root cause and that they’ll have to provide the data, they told me it was impossible. Sometimes, I believe in as many as 6 impossible things before breakfast! Eventually, it took me around 30 minutes to write this code on a bed in an hotel room. The heuristics were covering 99% of the cases. At contrario, solving the respective issues in the data quality tool and in the parsers would have taken many days (but will be done … later).

The full code:

let
    getDateTimes = (rawTable as table) =>
let    
    csvTable = Table.PromoteHeaders(rawTable, [PromoteAllScalars=true])
    , hasDateTimeColumn = List.Contains(Table.ColumnNames(csvTable), "DateTime")
    , targetColumn = if hasDateTimeColumn then {"DateTime"} else {List.First(Table.ColumnNames(csvTable))}
    , rawDateTimeTable = Table.SelectColumns(csvTable, targetColumn) 
    , renamedDateTimeTable = Table.RenameColumns(rawDateTimeTable, {{List.First(targetColumn), "DateTime"}})
    , parseDateTimes = Table.TransformColumns(renamedDateTimeTable, {"DateTime", getDateTime})
    , typedDateTimeTable = Table.TransformColumnTypes(parseDateTimes, {{"DateTime", type datetime}})
    , getDateTime = (value) => 
        let
            datePath = DateTime.From(value)
            , jsonPath = Json.Document(Text.Combine({"{", value, "}"}))[value]
        in
            try datePath otherwise jsonPath
in
    typedDateTimeTable[DateTime]
in
    getDateTimes

One comment

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s