Creating a hierarchy when the source has the levels flattened in a unique column (Power BI)

With a few blog posts, I’ll explain how to create a model for an analysis of answers to a survey. This post is based on a real-life experience that I’m fighting right now, but I completely changed the type of questions and the answers.

My first challenge is to create a clean dimension with the hierarchy of questions based on an input file were this notion of hierarchy was not explicit and everything was flatten in two columns.

Before jumping into the action, I’ll explain the file containing my data. I received a wonderful Excel file with the results of the survey. This file is a bit weird, let’s summarize the format:

  • first column contains a “number” corresponding to a hierarchy to organize the different questions. Each question is part of a sub-category and each sub-category is part of category. The second column contains the question it-self or the label of the (sub-)category (Gray area)
  • an additional column is created for each respondent. (columns C to F)
  • first rows are personal information, not answers to questions (Orange area)
  • some personal information (age) are also available in the list of answers to questions (Yellow area)
  • Most of the question accepts predefined values, including “Not applicable” and “Don’t know” (Green area)
  • Some people were not allowed to answer some questions, I’ve no value for them in the file (Red area)
survey-excel-file

I decided to apply a few “business rules” to analyze the results,

  • “Not applicable” should be considered as “no answer” and not taken into account. This rule must apply when I’m aggregating the answers or counting the answers by questions
  • “Don’t know” is a valid answer that I’d like to see when I’m analyzing the how many answers I received but not when I’m aggregating the answers to get a single value.

To load this Excel file in Power BI, I’ll just use standard functions and define a first table “Source” that won’t be enabled to load in report.

My next task will be to create a table (or dimension) with the different questions. I also want to include a hierarchy in this dimension: I should be able to browse the questions by categories and sub-categories.

Let’s create a new table named “Question” by referencing the “Source” table. Then remove the other columns than A and B.

question-null-id

We’ll also need to remove the rows where the content of “Column1” is null (these rows are personal information, not questions/categories).

At the beginning of this post I said that the first column was a number … to be precise it’s not a whole number (as I expected) but a decimal number! The root cause are rounding errors in Excel … apparently the original file was not configured correctly and the columns were not “text”:

question-rounding

To solution this issue, I’ll just apply the rounding by myself with 0 decimals … not “up” or “down” rounding but just a normal rounding. Then I’ll change the type of this column to “text” because I won’t use this column for aggregations.

#"Rounded Off" =
   Table.TransformColumns(
      #"Renamed Columns",
      {{
         "QuestionId",
         each Number.Round(_, 0), type number
      }}
    ),
#"Changed Type" =
    Table.TransformColumnTypes(
       #"Rounded Off",
      {{"QuestionId", type text}}
    ),

Based on the column “QuestionId”, I can determine if this row is a question, sub-category or category by just taking a look to the length of this text. 1 digit means a category, 2 for a sub-category and 3 for a question … let’s add a column with this information.

#"Added level" = Table.AddColumn(
   #"Changed Type",
   "Level",
   each
      if
         Text.Length([QuestionId])=1
      then "category"
      else if
         Text.Length([QuestionId])=2
      then "sub-category"
      else "question"),

Then I’ll add two additional columns with the Id of the sub-category and the category. To obtain these values, I just need to extract the two first digits for a sub-category and the very first for a category.

#"Added CategoryId" =
   Table.AddColumn(
      #"Added level",
      "CategoryId",
      each Text.Start([QuestionId],1)
   ),
#"Added SubCategoryId" =
   Table.AddColumn(
      #"Added CategoryId",
      "SubCategoryId",
      each Text.Start([QuestionId],2)
   ),
question-level

Once I’ve calculated the “Id” for the sub-category, I’d need to replace it by the label. To achieve this, I’ll just do a merge of this table on itself based on the SubCategoryId equal to the QuestionId:

question-merge
#"Merged Queries" =
    Table.NestedJoin(
        #"Added SubCategoryId",
        {"SubCategoryId"},
        #"Added SubCategoryId",
        {"QuestionId"},
        "NewColumn",JoinKind.LeftOuter),

Expand the table and select the column Question that you should rename sub-category. Apply the same tactic for the addition of the label of the Category and you should have the table bellow.

question-categories.jpg

This table has too many rows because we still have rows for the categories and sub-categories. We can just filter them out using the column level.

#"Filtered Rows1" =
     Table.SelectRows(
          #"Renamed Columns2",
          each ([Level] = "question")
     ),

The column level is now pointless (one unique value equal to “question”) and we can safely remove it (and also the columns for Ids of category and sub-category) to have a clean dimension table:

question-final

In the next blog post, I’ll explain how to build my “interviewee” dimension.

Advertisements

3 comments

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s