
Power Query (M)agic: Dynamically remove leading rows AND columns



Does Power Query sometimes seem too rigid? It’s great for defining a sequence of transformations to clean your data but requires the incoming data to be in a consistent format. Today, I want to talk about a technique for dynamically handling sheets with different structures.

You can download a sample .pbix to follow along with me here.

Let’s look at my sample data. Suppose my sales manager sends me a monthly sales report structured similarly to that below:

[Image: sample monthly sales report]

This table looks pretty clean, right? Using Power Query, you can load this sheet, remove the first two rows and columns, unpivot the data, and you have the numbers you’re looking for in tabular format.

Patting yourself on the back, you file the report away until February rolls around. The sales manager sends you the new numbers, you hit refresh, and… an error! Peeking at the data, you see that he changed the layout of the file and also added new salespeople to the matrix!

[Image: February report with a changed layout]

[Image: February report with new salespeople added]

As enjoyable as it would be just to tell the manager to stick to a consistent format, sometimes this isn’t realistic or feasible. So, let’s look at how we can use Power Query to handle this chaos:

  1.  Identify the header name for the column that demarcates the beginning of the matrix.
  2.  Automatically detect rows up to that header row.
  3.  Automatically detect columns up to that header column.
  4.  Remove the extra rows and columns, leaving you with a squeaky-clean dataset.
  5.  Unpivot the matrix for each sheet and append into a single dataset.

STEP 1: IDENTIFY THE TARGET HEADER ROW AND COLUMN:

Looking back at our sample data, you can see that there is a common cell in all our sheets that lies at the upper left of the data we want to extract. In this case, the target cell is the “Region” header. We want to remove all rows ABOVE that cell and all columns LEFT of that cell without hard-coding the location of that cell.

STEP 2: LOCATE THE HEADER ROW DYNAMICALLY:

We need some way to identify where the header row starts so that we can remove rows up to that point. You'd think this functionality would be built in by default, but surprisingly it is not! Luckily our friend Ken Puls over at ExcelGuru.ca came up with a solution for this, which I've adapted slightly for our purposes.

Load your first worksheet into Power Query, add an Index column, and filter the table to the target value from step 1:

  1. Add Column > Index column > From 0.
  2. Add a custom step (click the fx button in the formula bar).
  3. Replace the formula with some custom M code: Table.FindText(#"Starting Table w/Index", "Region"). Name the step "Filtered to Header Row."
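
For reference, here's a minimal sketch of what the query's M might look like at this point (the source step is illustrative; yours will reflect your own workbook):

let
    // Hypothetical source: the January sheet loaded from the workbook
    Source = Excel.CurrentWorkbook(){[Name = "JanSales"]}[Content],
    // Tag every row with its position
    #"Starting Table w/Index" = Table.AddIndexColumn(Source, "Index", 0, 1),
    // Keep only the row(s) containing the target header text
    #"Filtered to Header Row" = Table.FindText(#"Starting Table w/Index", "Region")
in
    #"Filtered to Header Row"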

[Image: the custom step formula in the formula bar]

Note that you'll want to replace the highlighted text ("Region") with the Target Header name from step 1. Table.FindText() scans the entire table for the target text and returns any row containing that value, so be careful that your dataset doesn't have that exact target word in other places!

Now we have our header row isolated, along with the Index value for that row. We'll come back to this "Filtered to Header Row" step shortly.

[Image: the isolated header row with its Index value]

STEP 3: DETECT THE COLUMNS TO DELETE

Let’s move on to the more difficult part: dynamically removing leading columns. We have a bunch of columns, and we want to eliminate the first X, up to our target column. We’ll leverage the technique above and add some P3 secret sauce.

First, transpose the #"Filtered to Header Row" step and add an index. That turns the single-row table into a single-column table that we can use to identify the columns to remove.

  1. Transform > Transpose.
  2. Add an index column: Add Column > Index column > From 1.
  3. To handle blank columns in the header row (always a possibility in dirty data), add a custom column that fills in a name for any nulls: Add Column > Custom Column > if [Column1] = null then "Column" & Number.ToText([Index]) else [Column1]. (See the sketch below.)
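
Strung together, the generated steps look roughly like this (a sketch; step names are illustrative):

#"Transposed Table" = Table.Transpose(#"Filtered to Header Row"),
#"Added Index" = Table.AddIndexColumn(#"Transposed Table", "Index", 1, 1),
// Fill in a synthetic name for any blank header cells
#"Added ColumnName" = Table.AddColumn(
    #"Added Index",
    "ColumnName",
    each if [Column1] = null then "Column" & Number.ToText([Index]) else [Column1]
)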

Our goal is to delete all columns left of the “Region” column (or above “Region” in the transposed table) so let’s find the index of that row:

  1. Right Click Column1 > Text Filters > Equals > “Region”.

[Image: Text Filters > Equals > "Region"]

We’re building upon Ken’s technique of finding the Index that corresponds to a target cell but this time with a transposed table. Since we’ll reference this number a couple of times, let’s Drill Down to just the Index number so that we have an easily referenceable step:

  1. Right-Click on the Index value > Drill Down.
  2. Rename that step to “TargetColumnIndex”.

Now, jump around a bit and reference the original column list and filter it down to include ONLY the rows that have an index number less than the target column.

  1. Click the fx button to insert a New Step.
  2. Revise the M code to point to the full column list: = Table.SelectRows(#"Added ColumnName", each [Index] < #"TargetColumnIndex").

Let's break down what we're doing here: the outer Table.SelectRows filters the table from the #"Added ColumnName" step down to all rows that have an Index less than the "TargetColumnIndex" value we isolated a couple of steps ago.

[Image: the filtered list of leading columns]

Finally, remove the helper columns, keeping only "ColumnName", and you have a nice list of columns to exclude!

STEP 4: REMOVE THE ORIGINAL EXTRA COLUMNS AND ROWS

We now have all the pieces we need to eliminate our junk rows and columns! Let’s jump back to our original query and clean it up.

Create a new step and change its code to reference the Starting Table:

  1. Click fx > rename the step to "Back to Starting Table" > change the code to = #"Starting Table".
  2. Home > "Remove Top Rows." Enter any value for the parameter.
  3. Edit the M code directly and change the 2nd parameter of Table.Skip so that it references the step where we isolated the header row number earlier: Table.Skip(#"Back to Starting Table", #"Filtered to Header Row"[Index]{0}).
  4. Home > Use First Row as Headers.
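
Pieced together, steps 1–4 correspond to M along these lines (a sketch; the step names match the ones used above):

#"Back to Starting Table" = #"Starting Table",
// Skip everything above the header row, using the Index we isolated earlier
#"Removed Top Rows" = Table.Skip(#"Back to Starting Table", #"Filtered to Header Row"[Index]{0}),
#"Promoted Headers" = Table.PromoteHeaders(#"Removed Top Rows")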

[Image: the table after dynamically removing the top rows]

Boom! We’ve dynamically eliminated the top rows!

Now for the final act, we'll tweak Table.RemoveColumns (the function Power Query writes when you do "Remove Columns") to take a dynamic parameter! Remember that list of columns we generated earlier, the ones we want to exclude? That's what we'll feed into Table.RemoveColumns.

First, select any one of the junk columns and right-click > “Remove Columns.” Take a look at the formula that Power Query generated.

  1.  Table.RemoveColumns(#"Promoted Headers", {"Column1"}).

We know that Table.RemoveColumns requires a list for its second argument, so we can reference the list of columns to remove from an earlier step:

  1. Table.ToList(#"Columns to Remove").
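
So the revised step ends up looking something like this (assuming the column list from Step 3 lives in a step named #"Columns to Remove"):

= Table.RemoveColumns(#"Promoted Headers", Table.ToList(#"Columns to Remove"))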

Now we’re left with a clean matrix which we can easily unpivot without any problems.

    1. Right-click the Region column > Unpivot Other Columns.
    2. Rename the columns to something useful.

[Image: the final unpivoted table]

STEP 5: CONVERT THE MAGIC QUERY TO A FUNCTION

The final step is to convert your magic query into a function so that you can apply this dynamic transformation to each file/worksheet that needs to be unpivoted.
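
One minimal way to wrap it (a sketch only; the parameter names are illustrative, and the column-removal and unpivot steps are elided):

// fnCleanMatrix: a hypothetical wrapper around the steps above.
// Takes a raw sheet and the target header text; returns the trimmed table.
(SourceTable as table, TargetHeader as text) as table =>
let
    #"Starting Table w/Index" = Table.AddIndexColumn(SourceTable, "Index", 0, 1),
    #"Filtered to Header Row" = Table.FindText(#"Starting Table w/Index", TargetHeader),
    #"Removed Top Rows" = Table.Skip(SourceTable, #"Filtered to Header Row"[Index]{0}),
    #"Promoted Headers" = Table.PromoteHeaders(#"Removed Top Rows")
    // ...then remove the leading columns and unpivot, exactly as in steps 3 and 4
in
    #"Promoted Headers"

Each worksheet can then be cleaned with something like fnCleanMatrix(Sheet, "Region") before appending the results into a single dataset.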

Conclusion:

Using the technique of identifying a target row and column, you can create extremely powerful and dynamic queries that handle input files that aren't always perfect. Because let's face it: you can't always rely on training or data validation to prevent end users from modifying your perfect templates. Our job as data modelers is to make the end-user experience as friendly as possible by foreseeing and handling exceptions.

If you have any ideas on how you might use this in your reports, please feel free to share in the comments section!

We get it:  you probably arrived here via Google, and now that you’ve got what you needed, you’re leaving. And we’re TOTALLY cool with that – we love what we do more than enough to keep writing free content.  And besides, what’s good for the community (and the soul) is good for the brand.

But before you leave, we DO want you to know that instead of struggling your way through a project on your own, you can hire us to accelerate you into the future. Like, 88 mph in a flux-capacitor-equipped DeLorean – THAT kind of acceleration. C’mon McFly. Where you’re going, you won’t need roads.


Power Query (M)agic: Right-Click Your Way To Data Cleaning Happiness

As a ‘water cooler’ of sorts for this community, we meet some amazing people. Doug Burke shares with us how easy Power Query makes it to do powerful cleaning with a click of the mouse.

The Problem

If you’re like me, you get crappy data from your source systems.  Not that the data is inaccurate. No. It’s just that the data does not fit your view of the world. It doesn’t suit your needs right out of the box. That makes it crappy.

You need to clean it to make it usable. You need the ‘power’ of Power Query to turn night into day.

Did you know that Power Query’s most potent data cleanups are just a right-click away? Let me share the secret.

It’s a simple three-step process:

1. Import data into Power Query

2. Right-click a column

3. Select a data cleaning function

Let’s take a look . . .

[Image: the right-click column menu in the Power Query Editor]

Try it yourself. Import some data. Right-click a column. Look at all those functions.

I count 42 data cleaning possibilities just from what’s visible. That doesn’t even include the additional transformation powers of ‘Group By,’ ‘Add Column From Examples’ and ‘Add as New Query.’

A whole lot of something is going on here.

Let’s start with a simple example. We’ll use right-click-column to fix inconsistent customer names.

• Customer XYZ sometimes is listed as XY&Z. We need to remove the ‘&’ symbol.

[Images: fixing the inconsistent customer name via the right-click menu]

Your first reaction may be "So what? I can do the same thing with Excel's Search and Replace." True, but you would miss out on Power Query creating a series of "Applied Steps" that are consistent, repeatable, and set you up for more data transformations within the Power Query Editor.

Another example, this time using two columns. Find the unique instances of products bought by customers. It’s not as simple as it sounds because different customers purchase different products. You want to quickly see the unique combinations of which customers bought what products and exclude the duplicates.

[Images: finding the unique customer/product combinations via the right-click menu]

A quick review of using right-click column data cleanups:

  • Consistent
  • Repeatable
  • Sets you up for more data transformations within the Power Query Editor

How about one more example to drive the point home? This time, let’s fill in missing customer names. This is a frequent problem if your source data is an existing formatted report which excludes repeating values. Power Query can handle it in a snap.

[Images: filling in the missing customer names via the right-click menu]

'Right-click-column' is a fast, convenient way to get the job done. No need to scour the Power Query ribbons for column cleanup functions. There are so many readily-available combinations that it boggles my mind.

Bonus: when done, you have a sequenced workflow (known as 'Applied Steps') that can be repeated the next time you get data.

Simple, right? That’s the point – – Power Query makes it simple. Right-click-column makes it simpler! Win-win!!

I am not saying this is the only way to clean up data. Or even the best way. But I am saying it's a quick way. And sometimes it's better to be quick than good.

Full disclosure: I’m using Excel 2013 Professional Plus 64-bit (2.49.4831.381). Your mileage may vary depending on your version of Excel and Power Query.

No one knows the Power BI ecosystem better than the folks who started the whole thing – us.

Let us guide your organization through the change required to become a data-oriented culture while squeezing every last drop* of value out of your Microsoft platform investment.

* – we reserve the right to substitute Olympic-sized swimming pools of value in place of “drops” at our discretion.


Power Query (M)agic: Consecutive Days (Part 2)

Krissy’s back, tell a friend!
Guess who’s back, Guess who’s back,
Guess who’s back. Guess who’s back…

Well, as probably most of our readers know, Rob is well known for a GOOD movie reference in his blog posts every now and then.

Well, P3 World… I might be soon known for my GOOD song lyric references in my blog posts!  Only time will tell. Ha, ha!

So, with the intro out of the way, let’s get to it!

Two weeks ago I wrote a blog post about how to create a set of data with Consecutive Days using SQL or Power Query. It was UNEXPECTED and SUPER EXCITING to have Adam Saxton, from Guy in a Cube, include a review of my blog post in his Weekly Round Up. Thanks Adam! I really appreciate the "…many ways to skin a cat" reference that was included in the review! It's like he KNOWS me and that I love a good cat reference, or better yet a good cat meme (…in addition to my love of angry rap music), so applause for that!

I also had lots of GREAT comments from our readers – THANK YOU so much for taking the time to share and contribute your ideas with me, our team and our other P3 readers!

But I’d have to say what SURPRISED me the most about the comments was that the feedback focused mostly around the M code and the SQL code used in the example and how to streamline that code to make it better.  WOW! I totally appreciate the attention to detail and the coding best practices shared here!

However, what I was REALLY expecting was someone to ask me WHY?

WHY would I need to do this and WHAT are some USE CASES where I might need to leverage this kind of data preparation technique?  Maybe many of you have experienced the need for this technique before and this was obvious to you?  Maybe you already know, all too well, the pitfalls of the “Dirty Data” that we BI professionals all live in?

Well, I think there are many USE CASES where this technique could be applied.

I think it’s a good idea that I share at least one example USE CASE – just to close the loop and help our readers better understand WHY and WHEN you might need to use this technique.

Employee hour data comes from a timekeeping system that handles Exempt and Non-Exempt employees differently.

  • The Non-Exempt employees have regular hour data for every Mon. thru Fri. business day – these employees clock in and clock out daily.
  • The Exempt employees ONLY have data for days when they have taken time off for PTO, Sick Time or other Leave of Absences. It is assumed that the Exempt employee works 40 hours/week or 8 hours/day unless there is PTO, Sick Time or other Leave of Absence taken for a business day.

[Image: Non-Exempt employee hour data]

[Image: Exempt employee hour data]

If one were to take the data “as-is” and load the Employee Hour data into a data model for analysis, we would not be able to accurately see Total Hours worked for both the Non-Exempt and Exempt employees.

In the example data for the chart shown below, there are three Non-Exempt employees (with 0-8 hours tracked for each Mon.-Fri. business day) and there are three Exempt employees (with only 0-8 hours for ONLY when the employee has taken PTO, Sick, or Leave Of Absence hours).

[Image: Total Hours by day for all six employees combined]

The Problem

At first glance, it looks like Total Hours worked is high for Fri., 1/12/2018. In fact, that is NOT the case at all!

Let’s further look at some simple charts for the Non-Exempt and Exempt employees, separately.

We can CLEARLY see that Fridays and Mondays are low Total Hours days for the Non-Exempt employees. It is NOT as clear for the Exempt employees. The Total Hours shown are not actually the Total Hours worked but rather the Total Hours NOT Worked! We know this by looking at the source data!

So, what is a Data Modeler to do in this situation?

First, I typically talk through the problem “using just words”.  Then, I begin to mock up the business problem and start to work on a possible solution.  Next, I test the solution to make sure that I’m getting the results that I want.  Finally, I apply the solution to my “real-world” data and QA my results.

If there’s extra time (which there never seems to be, so you must MAKE time) I will optimize and document my solution.

So, let’s get started…

I can use the Non-Exempt employee data “as-is”.

For the Exempt employee data, I will need to generate eight hours for a set of Consecutive Days (specifically, Mon. thru Fri. business days) and then SUBTRACT the PTO, Sick Time or other Leave of Absence hours provided.
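
As a rough sketch of that logic in Power Query (the step and column names here are assumptions based on the mock-up, not necessarily the ones in the download file), the net hours per day for an Exempt employee could come from a custom column like this:

// After expanding each Exempt employee to one row per Mon.-Fri. business day,
// net hours = the 8-hour baseline minus any PTO/Sick/Leave hours logged that day
Table.AddColumn(
    #"Exempt Business Days",   // hypothetical step: one row per employee per business day
    "Hours Worked",
    each 8 - (if [Leave Hours] = null then 0 else [Leave Hours]),
    type number
)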

You can get your hands on my mock up by clicking the link to download the Excel file here.

In this file, I've created example data for six employees (3 Non-Exempt & 3 Exempt).

The detail for the steps used to prepare the data can be referenced in the Power Query Editor.

To launch the Power Query Editor in Excel 2016, select Data > Queries & Connections. Then either double-click any query, or right-click any query and select Edit from the menu.

[Image: launching the Power Query Editor]

The test solution can be found on the sheet/tab called "p2". In the image below, we can see all six employees shown for each Mon.–Fri. business day, and the hours each employee worked on that day.

[Image: the correct solution, with all six employees on every business day]

[Image: the inaccurate result from loading the data as-is]

Without the Consecutive Days approach, we’d end up with two different answers, one of which is NOT correct!

This is also a good time to really stress the importance of understanding your data!  Without knowing that the data is tracked differently based on the employee type, you could end up reporting inaccurate findings.

Well, my friends… I don't have real-world data that I'm going to apply this logic against for THIS blog post. But if YOU have this USE CASE, or one that is similar, then YOU certainly can apply what you learned here against your real-world data!

Using my example data, we can create a Pivot Table and use conditional formatting to create a Heat Map representation of the data – and quickly see where employees are NOT working a full eight hours!

[Image: Heat Map pivot table]

I’ll leave that to you, our awesome readers – as you continue to create awesome and amazing solutions using Power Query, Power Pivot and Power BI! Additionally, you can reference some of these best practices that I’ve demonstrated and called out in the Excel file that is available for download here.

… and with that, to close out this post in true lyric form… "Third verse, same as the first"? Well, in this case, I've taken and adapted this from the 3rd verse of Eminem's "Without Me":

…I’ve been dope, suspenseful with cat meme potential ever since

Microsoft turned Excel into a BI utensil!

Have a great week!  And “Peace Out”!

We like to think that is no accident.  We’re different.  First of a new breed – the kind who can speak tech, biz, and human all at the same time.

Want this kind of readily-absorbable, human-oriented Power BI instruction for your team? Hire us for a private training at your facility, OR attend one of our public workshops!


Power Query – Consecutive Days

Hello P3 World! My name is Krissy Dyess and I have had the most amazing experience working as a full-time consultant at Power Pivot Pro since April 17, 2017!

Today is 1/9/2018 so that would be…

266 calendar days, if you DON’T include today’s date

267 calendar days, if you DO include today’s date

182 days, if you count ONLY business days, excluding Saturdays/Sundays & 6 holidays

… and there are many other FLAVORS of responses that could be listed here as well!

Funny, how a simple intro can branch off into many different paths of interpretations – this is pretty much the world of data and analysis that we live in as data professionals!

For any given question, there can be MANY different ANSWERS, depending on what question you are REALLY trying to answer. Additionally, there can be MANY different SOLUTIONS!

In my 20+ years working with data, it seems like a good deal of my time is spent PREPARING the data. What does this mean? It means applying logic to structure the data in a way that is best for your data model and/or analysis.

Today, I would like to share a common data problem and 2 different SOLUTIONS used to PREPARE a set of data for Power BI.

And like most problems, there is usually more than just one or two ways to create a solution. So, if any of our readers out there would like to share any additional ways to solve this problem, please feel free to leave your ideas in the comments below!

Sometimes it's necessary to create a data set that contains a consecutive list of dates and a value for a data point that will be measured or visualized. In this example, the Source Data shown has 7 records, starting with 07/31/2017 and going through 08/10/2017. Notice that there is a gap between 08/05/2017 and 08/10/2017.

[Image: source data and end goal]

The end goal is to have consecutive days. Now, the end goal could be consecutive days through the last date available (08/10/2017), or through today's date, etc. In my example, I will go through today's date, but depending on your problem and your end goal this could be different.

Why a solution in SQL? Well… because until Power Query came around, SQL was my best friend whenever I needed to do any data manipulation!

For those of you looking to learn SQL, this will be a good example to follow along with. For those of you that don’t care to learn SQL, I’ll also show another solution using Power Query.

I'm using SQL Server Management Studio (SSMS) against my local instance, and creating a New Query window.

[Image: New Query]

In order to create a table and populate that table with example data, I select New Query and use the following scripts:

SQL Step 1

[Image: SQL script for Step 1]

SQL Step 2

[Images: SQL script for Step 2]

The SQL Script is entered in the New Query window, Execute or F5 is used to run the SQL script and the results are displayed in the Results Pane.

[Image: the executed query and its Results Pane]

Voila! I’ve created a table with example data by running the above code!

[Image: source data]

And we can see the 4-day gap between 08/05/2017 and 08/10/2017.

The next step would be to have a complete consecutive set of calendar days available in a table. Now, you might have a calendar table that you can already use for this step or you might not! There are many ways to create a calendar table! But for this example, I’m going to show a different technique to get a set of consecutive calendar dates for the time frame I’m interested in building.

SQL Step 3

[Images: SQL script for Step 3]

Voila! I’ve created a consecutive calendar table with example data by running the above code!

NOTE:  If you are following along, your dates will be different based on when you run the code – this doesn’t matter for understanding the concept.  The purpose is to just get a set of dates, they can be any dates.

[Image: consecutive calendar table]

SQL Step 4

[Images: SQL script for Step 4]

We can see that we have 161 rows of data starting 07/31/2017 going out until 01/07/2018.

[Image: joining to the calendar table]

SQL Step 5 – Final Step

Which brings us to the final step: join the result of the above step with the actual store quantity data to get the actual quantity for the days where a quantity was available. To do this, I created the final SQL query:

[Images: the final join SQL]

And now we have the final SQL query & result set –

[Images: the final SQL query and results]

Now, can we get to the same results using Power Query? Yes, we sure can!

Power Query Step 1

Let's start with the example data. Since I started with SQL, I can just copy and paste my same example data from SQL into Excel. Then in Excel, I can use Insert > Table to create a table named 'StoreQuantity'. So far, so good!

[Image: source data]

Power Query Step 2

Then since I’m using Excel 2016, I can use Data > From Table/Range to load and launch the Power Query Editor window with the example data.

[Image: Data > From Table/Range]

Power Query Step 3

 

The exact same example source data is now loaded into the Power Query Editor! Voila!

[Image: the Power Query Editor with the example data]

Power Query Step 4

I’ve renamed the source data query to ‘Staging_StoreQuantity’. And I’ve created a “Staging” folder for the data preprocessing steps.

[Image: the renamed query in the Staging folder]

Power Query Step 5

For the next step, I can right-click on 'Staging_StoreQuantity' and choose Reference or Duplicate to create another starting source of data.

Now, there are performance reasons that I might choose Reference over Duplicate.

If I were to choose Duplicate, Power Query will locate and retrieve the data again. Depending on the source of the data and the size of the data, this could take some time. However, if I choose Reference, then I’m using the data that is already loaded in the Power Query Editor. So, in this case, I’m choosing the Reference option!

[Image: referencing the source data]

Power Query Step 6

I’m going to rename the new query to ‘Staging_StoreQuantityAllDays’. And use the Home > Group By button to create a unique list by [store_id] of the first [create_date] where we have a quantity.

[Image: the Group By dialog]
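
Under the hood, that Group By dialog generates M along these lines (a sketch; the column names follow the example data):

Table.Group(
    Source,
    {"store_id"},
    {{"min_create_date", each List.Min([create_date]), type date}}
)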

Power Query Step 7

Next, I can add a custom column expression using Add Column > Custom Column and the Power Query List.Dates function to create a list of consecutive dates for each [store_id]. I will use the [min_create_date] as the starting date in the expression and go out 365 days.

List.Dates(#date(Date.Year([min_create_date]), Date.Month([min_create_date]), Date.Day([min_create_date])), 365, #duration(1, 0, 0, 0))

[Image: the Custom Column dialog]

After adding the [dt] custom column as a List, you can use the Expand to New Rows…

[Image: Expand to New Rows]

… to get one row for each item in the List. Boom!

[Image: the distinct list expanded to one row per date]
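
That expand step corresponds to a single line of M (a sketch; the step name is illustrative):

Table.ExpandListColumn(#"Added Custom", "dt")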

Just a few more steps: change the [dt] column type to date, and filter to keep only the records through the current date… and Boom!

We now have a table with consecutive dates for the [store_id] starting with 07/31/2017 through to today!

[Image: consecutive date table]

Power Query Step 8

Finally, I can use the Home > Merge Queries as New to create the final query.

[Image: Merge Queries as New]

Staging_StoreQuantity comes through as the resulting column; we use the Expand double arrows to select ONLY the [qty] column from our original data source. I deselect 'Use original column name as prefix' so that the column comes back as [qty] rather than Staging_StoreQuantity.qty.

[Image: the Expand double arrows]

I renamed the final query to 'StoreQuantity' and created a new folder called "Data Model" to store the final query that will be loaded to the data model. And that's it!

[Image: the final query]

You can see that we can use either SQL or Power Query to take source data and manipulate the data into the SAME desired results!

Which method is BETTER? Well, that depends. If you don’t have SQL or prefer not to use SQL, then Power Query is a GREAT option!

As for me, I have to refer to the song… “Make New Friends (Power Query) But Keep The Old (SQL)… One Is Silver And The Other Gold!”

Happy New Year Everyone!

Please see the attached Excel file for the Power Query solution and the attached SQL script file for the SQL solution.

It takes a special kind of mindset to “bend” data (and software!) to the human will.  As this article demonstrates, we at PowerPivotPro can twist Power BI into a pretzel if that’s what an organization needs. (A robust, trustworthy, industrial-strength pretzel of course).

The data-oriented challenges facing your business require BOTH a nimble toolset like Power BI AND a nimble mindset to go with it. And as Val Kilmer / Doc Holliday once said, we're your huckleberry.

Connect with the Experts


Jittered Scatter Charts in Power BI via DAX and Power Query

In a recent post, I hinted that I'd be using the Football Project V2 as my "platform" for some future techniques-oriented posts. That future starts today! And while I certainly have enjoyed writing my more change-management-oriented posts this year, like this one and this one, it's good for me to still get my hands dirty a bit from time to time.

Besides, there’s a certain “back to our roots” quality to be found when I write DAX and M posts.  That’s how it all started here.

In that previous post about the football project, I showed you a report without explaining any of it.  In particular, the “football field” scatter chart representation of a passing chart served as the focal point, and it’s a pretty “hot” visual…

[Image: the "football field" passing chart]

This Post is Primarily About How I “Jittered” the Highlighted Cluster of Dots to make Said Dots More Visible

[Image: close-up of the jittered cluster of dots]

I can’t make this PBIX downloadable since it contains intellectual property of Cian Fahey, but here’s a quick glance at what the primary data table contains:

[Image: the PassData table]

This Shows Most of the Columns of My PassData Table
(Ex: in the Second Row, Tom Brady threw an accurate pass 50 yards in the air, outside
of the numbers painted along the left side of the field.  Brandin Cooks caught it
and then ran 4 additional yards as indicated by the [YAC] column)

And here’s what that second row of data looks like when it’s ultimately plotted on my scatter chart:

[Image: the plotted dot for the second row of data]

The Dot Corresponding to the 2nd Row of Data Above: 50 Yards in the Air, Outside the Numbers Left
(I don’t try to visualize the 4 yards he ran after catching it – at least not yet)

Right before I started building this visual, I’d fortunately been reading an article by my good friend Chandoo, in which he “jittered” some dots in an Excel scatter chart.

So when I encountered the problem indicated in the left side of the image above, I knew exactly what I had to do, but this time with a Power BI twist!

I know.  Many of you are probably most-interested in the underlying image part of this technique, or at least more so than you are interested in the jittering.

Not that you’re interested in football, of course, but there are MANY other cases where a custom underlying image from your business would/will be appropriate.

So let’s give it a quick treatment.  Here’s what I did:

  1. I purchased a piece of football field clipart, and trimmed it down to the size I needed (few QB’s can throw a ball more than about 57 yards in the air, so I didn’t need the entire field).
  2. I inserted a scatter chart into my report, set the background image of the plot area to be my modified clipart file, and changed the Image Fit to “Fit” rather than Normal:

[Image: plot area background settings]

  3. I then translated the [Pass Distance] column from my data into a [PreciseY] column using a DAX calculated column (this formula assumes the [Pass Distance] column contains values ranging from –5 to 55, and maps them to a Y-value ranging from 0 to 61):

[Image: the PreciseY formula]
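
The screenshot of that formula didn't survive here, but given the stated mapping (–5…55 onto 0…61), a linear rescale along these lines reproduces it (a sketch, not necessarily the exact original formula):

PreciseY = ( PassData[Pass Distance] + 5 ) * ( 61 / 60 )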

  4. Then I used [PreciseY] as the Y Axis column in the field list:

[Image: the field list with PreciseY on the Y Axis]

  5. Next, I set the Y-Axis of the Scatter Chart so that it ranged from 0 to 62 (experimentation showed me that 62 worked better than 61 – I encourage you to NOT consider this an exact mathematical science, and instead, just trust your eyes):

[Image: Y-axis range settings]

  6. Lastly, I repeated a similar process to generate a [PreciseX] column, based on a related table lookup (note that I only get five different original values for "X" location, as compared to the hyper-granular and precise Y location I get from the [Pass Distance] column):

 

[Image: the PreciseX formula]

[Image: the XCoordLookupTable]

  7. Which of course also necessitates that I set the range of the X axis of the chart as well. Note that I used 0–100 even though my [PreciseX] only goes from 10–91, because I do NOT want dots appearing on the far right and left edges, as that would place them "out of bounds" from a football perspective:

[Image: X-axis range settings]

OK, that sets up our problem nicely – our dots are going to be pretty well-spaced by default along the Y-axis, but the small number of distinct values of X is going to lead to some serious “clobbering.”

And since we have a “bigger” problem on the X axis, that is the one I chose to jitter, but with a slight “nod” to the Y axis as well (as you will see).

I quickly realized I needed a new column in my model, one that uniquely identified each dot.

Why is that?  Well, I want every single row of my data to get its “own” dot on the graph, and the Power BI scatter chart is quite insistent on there being such a column:

[Image: the scatter chart's Details field well]

Now, sometimes you may WANT multiple rows to combine into one dot, but in this particular case, I want to see each row of my source data as its own dot.

When adding a new calculated column, there are LOTS of ways to uniquely "stamp" each row with its own distinct value. I could do this in DAX, but it would require concatenating/combining enough columns together (in this case, probably [Game #], [Qtr], and [Time], since no two rows can "happen" at the same time in the same game).

But for other reasons that you will see shortly, I need the unique identifier to be a number, and I don’t want to go through the contortions of converting text values to numeric, plus as you can see, the data is incomplete in the [Time] column (lots of blanks).

So I went back into the power query that fetches this data in the first place, and added an Index column:

[Image: adding an Index column to the query]

[Image: the Index column after running the modified query]
(Not sure why I chose to start the column at zero. Doesn't matter, 1 would have been fine too)
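
That Index step is one line of generated M (a sketch; "Source" stands for whatever step precedes it):

= Table.AddIndexColumn(Source, "Index", 0, 1)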

Now, I drop that [Index] column on the Details dropzone of the scatter chart (as pictured a few images up), and I’m now assured of “one dot per row of data.”


Everything we’ve done so far gets me to the place where my dots are all appearing in their [PreciseX] and [PreciseY] locations, but “clobbering” each other:

[Image: our "story so far": the scatter chart field list at left, dots hiding each other at right]

We need an alternate version of our [PreciseX] column, one that slightly shifts the X-position of a dot left or right to avoid clobbering its close neighbors.  Just a little nudge left or right, one that doesn’t move it TOO much.

But how do we know which dots even NEED jittering/shifting?

If a dot is completely off on its own, its X coordinate doesn’t need to be adjusted.  We only need to pay attention to dots that are clumped together.

So…  how close together is too close? I experimented a bit before coming to the following conclusion.

A “cluster” is defined by:

  • A shared [PreciseX] value, AND
  • A [PreciseY] within a 2-unit “block”

In English, what I decided was that if two dots shared a [PreciseX] and were “really close” in [PreciseY] value, they were “clobbering” each other and required jittering.

I added these two calculated columns using DAX:

[Image: the DepthRange and Cluster calculated columns]

The formulas are:

DepthRange =
IF (
    [Pass Distance] < 0,
    -1,
    IF ( [Pass Distance] <= 2, 0, 1 + INT ( [Pass Distance] / 2 ) )
)

Cluster =
"X: " & [PreciseX]
    & ", Depth: "
    & [DepthRange]

DepthRange simply “compresses” pass distances into 2-yard increments.  (I could have used [PreciseY] as the input here instead of [Pass Distance], again with the “divide by two” approach – this value is never going to be displayed to the user, so it doesn’t need to be semantically meaningful, but it DOES need to “lump” similar Y coordinates together, which it does).

Cluster then merely concatenates [PreciseX] and [DepthRange] together, but in a somewhat-friendly format (which helped me to debug).

Now, I add three more columns – two intermediate columns that “build up” to the third, which is the [JitteredX] we desire:

[Image: the three jitter columns]

The formulas are:

RankWithinLocationCluster =
RANKX (
    FILTER ( PassData, PassData[Cluster] = EARLIER ( PassData[Cluster] ) ),
    PassData[Index]
)

JitterXAdjustment =
CALCULATE (
    AVERAGE ( PassData[RankWithinLocationCluster] ),
    FILTER ( PassData, PassData[Cluster] = EARLIER ( PassData[Cluster] ) )
)
    - [RankWithinLocationCluster]

JitteredX =
[PreciseX]
    + ( [JitterXAdjustment] * 2 )

And here they are explained:

  • RankWithinLocationCluster – among all rows/dots that share a Cluster value with this dot, how does this one “rank” in terms of its [Index] value?  In this case, we’re not ranking in the “scoring” sense – the Index value is pretty damn arbitrary here, but since [Index] is numeric and unique, we can use RANKX to essentially make sure that we don’t move dots around and simply “jitter” them right back on top of one another.
  • JitterXAdjustment – ok, THEN we calculate the average of the ranks within this cluster, and subtract THIS dot's rank within the cluster. This yields a set of positive/negative/zero integer values, with NO duplicate values within a cluster, which allows us to shift both left and right for a nice even "spread." Furthermore, in clusters of size 1, we get a 0 here (no shift). This also happens to handle odd- and even-sized clusters quite well (if you have three dots in a cluster, their ranks will average to 2, and the middle dot's rank will be 2, so it stays put).
  • JitteredX – I then take [JitterXAdjustment], multiply it by 2 to make the “spread” a bit wider (again, an eyeball-driven adjustment), then add that to [PreciseX] to yield my NEW x-coordinate, [JitteredX].

Instead of [PreciseX], we now use [JitteredX] as the X-Axis field, as illustrated here…

[Image: JitteredX as the X-Axis field]

And…  we’re done!

I suspect/hope that the two techniques demonstrated in this post (underlying image with translated X/Y coordinates, and then the jittering) will get your mental engines revving hard, and there will be questions like “wait, can I use this to do XYZ?” and/or “what if you wanted to jitter both X and Y?”

Bring ‘em on.  Even if you “just” already know how you’re going to apply this, we’re curious/excited to hear it.

The comment thread is now officially open for business!


Power Query (M)agic Part 3: Divide & Conquer

Welcome back P3 Nation! By now we're in full swing with this Power Query (M)agic series, three posts and counting! You're now knowledgeable enough about Power Query (if you read Part One and Part Two, anyway) that I want to throw some cool techniques at you involving M code. The time has come to take off the metaphorical training wheels, dive into the deep end, leave the nest, take a leap of faith (boy does this guy love idioms…), because frankly you are READY FOR IT. "I know it, you know it, everyone knows it."

"Take the blue pill, the story ends…take the red pill, you stay in Wonderland, and I show you how deep the (PQ) rabbit hole goes." If you're still reading, then you took the red pill and chose…wisely. Today I'll be showing you how to add custom M code using the Advanced Editor. The Advanced Editor is where I LIVE and BREATHE when using Power Query, so welcome to my universe.

Specifically, I’ll be demonstrating two important ways to visually organize your query by renaming your applied steps and adding visual dividers between them. This technique is especially useful when you have a lot of applied steps, and want to add visual cues between related applied steps.

Since you’ve read the previous post in this series about folders, you’ve seen that visual aids can be VERY HELPFUL when building queries! These techniques are by no means necessary, but they can be a tremendous help in organizing your model. Anything that makes it easier to step back into a model months after development is a win in my book.

RENAMING STEPS > TAKING NOTES!

So I'll be the first to tell you that, despite my best efforts, I HATE taking notes. It takes time and energy I choose not to give, especially when I need to store them separately in OneNote or a Word document. A beautiful feature of Power Query is the ability to rename any applied step to whatever you desire, essentially allowing you to write a single line of notes for each step! Corner cutters of the world unite!

This saves so much time, and it’s done easily enough by right-clicking on ANY step, selecting Rename, then changing the original title to whatever friendly name you want. My two cents: rename it to something that will make sense to anyone coming back to this model months (or years!) down the road.

Simply right-click on the step you want to change, then select Rename from the dropdown list:

[Image: the Rename option in the right-click menu]

But wait, there's more…you can also add notes to each individual step! You can do this in the SAME dropdown menu as above. What's even cooler is that those notes show up as a tooltip when you hover. So in essence you can create high-level summary notes by renaming the step. Then for steps that are more complicated (think Rube Goldberg complicated), you can add additional tooltip notes. Badass, that's what that is. Honestly I didn't know this existed for the LONGEST TIME! I never would have guessed additional notes would be buried in the properties menu.

Just right-click on the step you want to add notes to, select Properties, then add your notes:

[Image: the Step Properties dialog]

Hovering over your step now results in tooltips displaying:

[Image: the tooltip showing step notes]

The Way I Used To Do It (Old & Busted)

A comment on this post enlightened me to a better way to do this! So thanks to Milan Grahovac for the new hotness! I’ve decided to keep the old method in this post, because it does give a good intro to the Advanced Editor. Plus I’m sure some people will appreciate two ways to do something. With that said, let’s walk through the old method first.

Dividers can help you look like a Power Query Ninja if anyone ever digs into your models. Have any of you EVER looked at a workbook (or code) that was organized or had added comments, and NOT thought it looked professional? Take a minute to ponder that…you’ll realize that it almost ALWAYS left a better impression than one that didn’t.

I'm sure many of you are wondering "What the heck does this technique look like?!" Let's see an example! I create a calendar table in many of the models I build. Building it requires a lot of steps, many of which are related and can easily be grouped together. Below is a before-and-after screenshot of the applied steps from that query, with and without dividers.

Giant hand man (let’s call him Steve from now on) is back to pass judgement on my organizational skills:

[Image: applied steps before and after adding dividers]

A keen observer (sarcasm much?) will notice the extra lines I've added in the right image, visually grouping sections of my applied steps. Adding these is easily accomplished with a few lines in the Advanced Editor window. It's really easy to add them, but can be tricky to remove them later.

Before I demo the technique, let me explain some important Power Query fundamentals. These concepts will help you understand how the applied steps work, and how they connect to each other. It’ll be important to know this if you ever want to remove these dividers later, or add/move steps to your query.

Fun Fact! Whenever you make a new transformation (new step) to your list, by default it references the previous step. In the image below the highlighted step (Added Calendar Year) references the previous step (Added Calendar Month), in the body of the M code. Every step will have a reference similar to the one you see here (make note as we’ll circle back to this later).

Example of how Applied Steps relate to each other:

[Image: a step referencing the previous step in the formula bar]

Interested in Learning How to Do this Kind of Thing?

Our New Online Power Query Course just launched this year. You’ll learn Power Query from the best in the business, two Excel MVPs Ken Puls and Miguel Escobar. We’ve included 7 Hours of Self-Paced Video over 31 Courses, with lots of example files, and a Digital Badge powered by Acclaim. CLICK HERE to claim your spot!


Let’s start by walking through how to add these dividers into our Applied Steps. First order of business, you need to know where the Advanced Editor Window is located! It’s found under the View ribbon (same place you turn on the formula bar to see the M Code).

Opening the Advanced Editor window:

[Image: opening the Advanced Editor from the ribbon]

Advanced Editor Window – Not as scary as it looks, promise:

[Image: the unedited Advanced Editor window]

As you gaze into the depths of code within the Advanced Editor window, you can see that each step is nicely given its own line. This is the typical layout for any unmodified query in Power Query. It makes it relatively easy to see where one applied step ends and the next begins. Next I'm going to insert a dividing line into the code, to provide some degree of separation. I'll go ahead and insert it right before I start making all my custom columns (e.g. before #"Added Calendar Month").

Advanced Editor window AFTER adding in the code for my dividing line:

[Image: the Advanced Editor after adding the dividing line]

Let's break down this line of code to understand why I added what I did. Each applied step at its core is essentially two parts. The first part is the applied step name, and the second part is the data output. The name in this instance is #"===START Custom Columns===". NOTE that you'll ONLY see the applied step name in the Advanced Editor window; the Formula Bar only shows the data output.

For any applied step name, you technically don't NEED the #"…" wrapper before and after the name unless it contains spaces. However, as a best practice, I always use it out of habit and for consistency. And the equals signs in my step name aren't anything special or required, but rather a way to fill out the row and create a distinctive look compared to other steps.

The second part of any applied step is the data output. This is where the transformation goes whenever you create a new applied step. Some data (either a table output or single value) is REQUIRED here for any applied step. Hence the reason I’ve created “dummy” data in mine as a single numerical value. It doesn’t do anything other than make sure this artificial step works.
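
Here's roughly what that looks like in the Advanced Editor (a sketch of a small calendar query; the steps surrounding the divider are illustrative):

let
    Source = List.Dates(#date(2018, 1, 1), 365, #duration(1, 0, 0, 0)),
    #"Converted to Table" = Table.FromList(Source, Splitter.SplitByNothing(), {"Date"}),
    #"Changed to DATETYPE" = Table.TransformColumnTypes(#"Converted to Table", {{"Date", type date}}),

    // The divider: a dummy step whose output (the value 1) is never used
    #"===START Custom Columns===" = 1,

    // The next real step references the step BEFORE the divider
    #"Added Calendar Month" = Table.AddColumn(#"Changed to DATETYPE", "Calendar Month", each Date.Month([Date])),
    #"Added Calendar Year" = Table.AddColumn(#"Added Calendar Month", "Calendar Year", each Date.Year([Date]))
in
    #"Added Calendar Year"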

Applied Step results after adding my dividing line (Note the 1 value output when selected):

[Image: the divider step selected, showing the value 1]

The Applied Step Added Calendar Month still references the SAME STEP before the addition of the divider:

[Image: the step reference in the formula bar]

The divider step is essentially hidden between the other two, there to provide visual separation and nothing else. I’m a HUGE FAN of this technique because it’s a great way to slowly introduce users to the logic and components that make up each applied step. Now if you want to delete a divider step, you can easily just select the delete (X) icon next to the step to get rid of it, just like any other step.

Power Query can get finicky if you start trying to add and/or move steps around after creating your divider steps. If you do this, there are a few ways that a new or existing step could accidentally get referenced. If that happens it will temporarily break the query unless you redirect the step back to the correct one.

Error message after moving a step in front of my custom step:

[Image: the error message]

Moving the Added Calendar Year applied step in front of my custom step caused it to break…but why? Well, whenever a step is reordered, it will always default to referencing the previous step in the list. So by reordering this step right after my custom one, it assumes it should reference the divider as its data source…which we DO NOT WANT.

Circling back to the M code basics you learned earlier, you can FIX THIS by reassigning that source back to the correct step. You’ll want to change the step name in the Formula Bar to point it back to the Changed to DATETYPE applied step. Once you’ve done that the divider step will go back to ninja mode, hidden amongst the other steps.

Query working correctly after updating the step (data) source for Added Calendar Year to Changed to DATETYPE:

[Image: the corrected step reference]

Alright folks, now let me show you how to do this WITHOUT the Advanced Editor! Again, as Milan Grahovac graciously mentioned in the comments, there's an easy way to add these divider lines without touching M code. Same concept, but without the risk of accidentally breaking a step if you delete the custom line of code. That same dropdown menu we go to when renaming a step has another option we can use: a button called Insert Step After that essentially references the previous step, but without doing any actual transformations. What this means is no performance hit, and a beautifully integrated step you can rename as a divider!

Same process as renaming. Right click, and select Insert Step After:

[Image: Insert Step After in the right-click menu]

Boom, new custom step:

[Image: the new custom step]

The beauty of this is twofold: no Advanced Editor needed, and it's less prone to break. You can observe how this custom step simply references the previous step, without any transformation applied. So this won't slow down your refresh AT ALL. All it does is provide us a line in our Applied Steps to use as a divider or section break. Just rename it like you would any other step. Now, the old method is still a great gateway drug to the world of M coding. Next thing you know you'll be inserting functions, folding queries, and appending tables, oh my! If this managed to whet your appetite for the possibilities with M, then mission accomplished for me. Stay tuned for the next post in this series! Until then folks!


Tuesday and Thursday Posts on DAX, Power BI, Power Query, and Other Data Topics – FOREVER

On September 5, 2011, I announced an ambitious initiative:  I committed to a blog post EVERY Tuesday and Thursday, indefinitely.  At the time I wasn’t sure if I could sustain it for more than a month or two, but a funny thing happened and I sustained it (with some help for sure) for about four solid years.  At that point, the pressures of running a full-on consulting company “overflowed the river banks” and our regularly-scheduled program petered out.  But those were some glorious years, and I credit that clockwork, “come-hell-or-high-water” schedule for a lot of the growth of this site (and subsequently, our company).

Beginning this coming week, you will see posts go live here EVERY Tuesday and Thursday morning.  From now until the end of time.

You see, the same thing that distracted me from the schedule in the first place (having a company that depends on me) has also become a massive advantage. We have many amazing people here on staff – people with amazing stories and techniques they can share. We have the ability to hire an editor to keep the quality bar high when I'm not available. We have someone who can "whip" the schedule and make sure we hold true to our promise. And yes, that whole process ALSO holds ME more accountable, so that I myself post more frequently as well.

(Speaking of the advantages of being a robust consulting company, have you considered hiring us to implement some kick-ass analytics for your firm, and/or adopt an empowered, non-dependent, business-driven data culture?  No one does it better or more rapidly than we do.  Not even close.)

This morning we held the final kickoff meeting for this new process.  The team is in place.  We already have a healthy series of posts in the hopper awaiting finishing touches.  It’s go time.

So, while I hate comparing myself to Longshanks, and certainly don’t condone HIS policies, well, the first part of the quote is just too perfect, and it forced me to learn some rudimentary Latin, which can’t be all that bad.

Why commit to this schedule forever? Two reasons.

1) I want YOU, dear readers, to be able to depend upon this.  I’ve been listening to podcasts lately and have really grown to love even just the “looking forward to the next one” feeling.  I want that feeling for OUR data community.

2) Once we commit, we can’t go back.  So I’ll leave you with one last movie quote…

[Image: movie quote]

See you Tuesday… :)


PowerPivotPro

How to Streamline Query Times to Handle Billions of Records

Here at Sisense, we love a challenge, so when a client comes to us and tells us they need to find a way to run queries on billions of records without this slowing them down, our ears perk up and we leap at the chance to find a solution.

In fact, that’s how we recently found ourselves testing a billion transactional records and three million dimensional records – totaling a whopping 500GB of data – with 100 concurrent users and up to 38 concurrent queries, with a total setup time of just two hours… and an average query time of 0.1 seconds!

But wait! I’m getting ahead of myself. Let’s start by talking through some of the issues that affect how fast you can query data.

How Are You Storing Your Data?

Let’s start with the obvious: data warehousing.

Typically, working with masses of data means you also need extensive data warehousing in place to handle it, alongside Extract-Transform-Load tools that upload data from the original source on a regular basis (Extract), adjust formats and resolve conflicts to make the datasets compatible (Transform), and then deliver all of this data into the analytical repository where it’s ready for you to run queries, calculations, and trend analysis (Load).

This creates a single version of the truth – a source of data that brings together all your disparate pieces into one place. While this is great, there are also some drawbacks to data warehousing.

First of all, data warehouses are highly structured, and the row-and-column schema can be overly restrictive for some forms of data. Also, the sheer volume of data quickly overloads most systems, which grind to a halt if you run queries that attempt to tap into the entire data pool.

Then, there are data marts.

To help tackle the issues that come with working with huge data sets, many IT teams deploy data marts alongside their databases. These essentially siphon off access to a smaller chunk of the data – and then you select which data marts each department or user has access to. The outcome of this is that you put less pressure on your hardware, as your computer is tapping into smaller pools of data, but the flipside is that you have vastly reduced access to the organization’s total data assets in the first place.

At the other end of the scale, you have data lakes.

These are used to store massive amounts of unstructured data, helping to bypass some of the issues that come with using conventional data warehouses. They also make sandboxing easier, allowing you to try out different data models and transformations before you settle on a final schema for your data warehouse – to avoid getting trapped into something that doesn’t work for you.

The trouble with data lakes is that, while they offer formidable capacity for storing data, you do need to have all kinds of tools in place to interface between the data lake and your data warehouse, or with your end data analytics tool if you want to skip the need for warehousing on top. Systems like this that use data lakes aren’t exactly agile, so your IT team will need to be pretty heavily involved in order to extract the insights you want.

Alternatively, you might deal with unstructured data using an unconventional data storage option.

For example, you might use a NoSQL database like MongoDB.

This gives you tons of freedom in terms of the kind of data you add and store, and the way that you choose to store it. MongoDB also makes use of sharding techniques to avoid piling the pressure on your IT infrastructure, allowing for (pretty much) infinite scaling.

The downside, of course, is that the thing that makes this so great – the unstructured, NoSQL architecture – also makes it tricky to feed this data straight into a reporting tool or analytics platform. You need a way to clean up and reconcile the data first.

What About Tools Used for Analysis?

Dynamic DBMS tools like PostgreSQL can open doors.

PostgreSQL is an open-source relational database that’s widely used for analytics and reporting, and it allows you to work with an enormous variety of data types – including native data types that give you much more freedom as you come to build and manipulate a BI solution, and “array” types, which help you to aggregate query results rapidly on an ad hoc basis.

Introducing PostgreSQL into the mix can be massively helpful in bringing together your disparate strands – but again, it can’t do everything. It can’t help much with qualitative data, and as a traditional relational database (which wasn’t built to handle Big Data volumes) it will buckle under huge amounts of information.

You can also use R for high-end predictive analytics.

Lastly, once you have a solid BI system in place, you can add another layer of awesomeness by using R to build working models for statistical analysis, quickly and easily. R is incredibly versatile, and allows you to move away from static reporting by programming a system for analysis that you can adapt and improve as you go.

The thing is, though, this is an add-on: it doesn’t replace your current BI or data analytics system. R is an excellent programming language that can help you generate predictive analytics fast, but you need to have a rock-solid system in place for handling and preparing data in the first place.

How to Streamline Everything

I know what you’re thinking: I said I was going to explain how to streamline your data queries to help you generate results faster, but so far, all I’ve done is dangle some potential solutions and then show you how they fall short!

That’s because I haven’t revealed the secret sauce that binds all these pieces together in perfect harmony.

As you can see, each of the tools we’ve discussed fixes one problem in the storage, flow, and use of data within your organization, but none of them helps with the big picture.

That’s where Sisense’s Elasticube comes in.

The Elasticube allows you to store data or drag it in directly from your existing stores at lightning speed, giving users unfettered access to their entire pool of data, whatever format it’s kept in (unless you choose to stagger permissions). Thanks to clever use of In-Chip Processing and a Columnar Database structure, you tap into only the data you need for the query, without restricting yourself permanently, as you would with a data mart.

You can then reconcile and harmonize this data with minimal hassle to treat all these strands as a single data source for the purpose of analysis and reporting.

Still within the Elasticube, you can map and manipulate these data sources to build your own dashboards and run your own queries at incredible speed.

Plus, using our range of custom-built connectors, you can link your Sisense Elasticube directly to MongoDB, PostgreSQL and other DBMS tools, and you can integrate Sisense with R for even more in-depth predictive analytics.

Where the Magic Happens

So that’s the big secret. Using the Sisense Elasticube, I was able to set up a system in 120 minutes that could run concurrent queries on data representing one billion online purchases, from three million origins/destinations, with an average query time of 0.1 seconds and a maximum query time of just 3 seconds.

Pretty impressive, huh? Here’s what it looked like:

[Image: performance test results]

And here’s an example dashboard that we used to display the results in real time:

[Image: example real-time dashboard]

How’s that for streamlined?

Want to see exactly how I built this super-streamlined query model for yourself? Click here for the detailed tutorial.


Blog – Sisense

Power Query (M)agic Part 2: Organizing Your Queries

Welcome back, P3 Nation, for Part 2 of our Power Query (M)agic series! Today I’ll be addressing an issue many of us have experienced: what to do with the LAUNDRY LIST of PQ queries you have. Since it feels really awkward to say Power Query queries… for the rest of this post they will simply be called queries. Keeping it simple, folks! Any respectable data model often ends up with a plethora of queries, and the more queries you have, the harder it is to navigate or find the one you want. This is where today’s solution comes in!

Technique #2 – Queries Are Like Documents In A File Cabinet

How do you organize documents? You put them in FOLDERS! If you’ve ever used Power Query to pull data into a data model in Excel or Power BI, then you know how many different queries a model can have. The most I’ve had in a model I’ve designed is 78 – crazy, right? So it’s easy to imagine how a list of queries that long can get really complex, especially when you’re trying to find one to edit.

In PQ, Organizing Your Queries Is All About Making GROUPS!

A great (and easy) way that Power Query solves this issue is a feature that lets you add folders and even sub-folders. It’s a great way to organize and group your queries any way you’d like. I’ll show you how to create these groups first, then I’ll explain the groups I typically like to make.

Giant hand man doesn’t like this uncategorized mess:

[Screenshot: query list without groups]

STEP 1 – Right click anywhere in the Queries section and select New Group:

[Screenshot: creating a New Group]

STEP 2 – Right click on any query and select Move to Group:

[Screenshot: the Move to Group option]

End result folder organization… giant hand man approves this technique!

[Screenshot: queries organized into groups]

Some of you may have noticed the groupings I created. Arguably there’s no “best” technique when it comes to organizing your queries, but I find this one very straightforward. I like to group my queries into three main categories:

  1. E – Data Sources
  2. T – Staging Tables
  3. L – Load to Data Model

Think of these as checkpoints along a journey, starting from where the data is coming from…and ending in the data model.

Umm, ETL? What is that? ELI5 please!

Some of you know what those fancy “E.T.L.” letters mean. Cheers to you guys – you’ve scored in the top quartile of nerd! If you don’t know what those letters mean, here’s a quick overview. If you find yourself writing TL;DR often, then here’s the short version: your data lives somewhere, you have to go get it, clean it up a bit, and put it in your data model. That’s ETL! And that’s exactly what Power Query is designed to do. Here’s the longer version…

Extract Your Data

Our journey starts with gathering up our data. The Data Sources group is where I store ALL connections to the data sources, with each source getting its own unique query. Remember that beautiful feature I showcased in Power Query (M)agic Part 1: Always Have Good References? Well, those reference queries go here. Enough said.
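
As a rough sketch (the query name and file path here are invented for illustration), a Data Sources query is often nothing more than the raw connection:

    // Query: SalesWorkbook – connection only, kept in the "E – Data Sources" group
    let
        Source = Excel.Workbook(File.Contents("C:\Data\MonthlySales.xlsx"), null, true)
    in
        Source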

Transform Your Data

In the middle of our journey are the Staging Tables. They can also be called Transformation Tables or Mapping Tables – pick your terminology (dealer’s choice). This is where I put any tables that are used for transformations, joins, filters, etc., but aren’t actually LOADED into the Data Model. So to reiterate: transformations, custom mappings, and other miscellaneous COOL STUFF goes here. Some models have lots of these; many models don’t have any. Got it? Good, moving on.
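
To illustrate, a staging query might reference the hypothetical SalesWorkbook query above and shape the data without ever loading it to the model. A sketch – the sheet and column names are assumptions:

    // Query: StagingSales – load disabled, kept in the "T – Staging Tables" group
    let
        Source = SalesWorkbook,
        SalesSheet = Source{[Item="Sales", Kind="Sheet"]}[Data],
        PromotedHeaders = Table.PromoteHeaders(SalesSheet, [PromoteAllScalars=true]),
        RemovedBlanks = Table.SelectRows(PromotedHeaders, each [Region] <> null)
    in
        RemovedBlanks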

Load Your Data

The destination of our wonderful journey! All roads lead to the data model. Simply put, if the query is being loaded as a table into the data model, it goes into the Load to Data Model group. You’ll notice that I have two sub-folders within this group as well, Lookup Tables and Fact Tables. Lookup Tables (Dimension Tables for the SQL-minded) are tables that are typically semi-static or infrequently updated (E.g. Customers, Employees, Calendars, etc…). Fact Tables (we also call them Data Tables!) are where the bulk of the data is kept and they usually grow or get updated frequently (E.g. orders, data recordings, time tracking, etc…).

More information about data modeling concepts can be found in another great post written by Rob called Data Modeling in Power Pivot and PBI. Alright, that’s it for today’s post – hopefully this added a tool to your belt. Until next time, folks!


PowerPivotPro

Why Excel’s Power Query Refresh Speeds Suck

Back on May 18, Reid posted a great post on how you should set up your Power Queries using referencing. It’s a technique that I’ve been preaching on my blog for years, and is something that we teach in the Power Query workshop as well. I mean, let’s face it, it only makes sense, right? Connect to the data source once, and re-use that information over and over again as you spin off different shapes for the rest of your tables. So what if I were to tell you that – from a performance standpoint – this is only actually a good practice in Power BI Desktop, and probably the worst thing you could do to your Excel model? That doing this will drive Power Query refresh speeds down? How’s that for a click-bait paragraph? Let’s take a look…

If you’ve been working with Power Query for any amount of time, you’ll start building models that look like this:

[Image: query dependency structure for the model]

In fact, this is based on a real model that one of my clients built. Pretty normal, really; it pulls data from a SQL stored procedure, 4 Excel tables and a custom function. Many of us build models like this. Now, let me ask you a skill-testing question… when you click Data –> Refresh All, how many times is the stored procedure executed? The answer is 1, right?
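
(For context, a query wrapping a stored procedure typically looks something like the sketch below – the server, database, and procedure names are placeholders, not my client’s actual setup:)

    // Query: Historical Data – executes the stored procedure as a native query
    let
        Source = Sql.Database("SqlServer01", "SalesDB", [Query="EXEC dbo.uspGetHistoricalData"])
    in
        Source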

My client sat down with his IT department (he is lucky enough to have one that is super cooperative) and worked through some performance testing. The catalyst for this was that the 52,000-row model was taking ages to refresh. The time was sufficient to kill his ability to build and deploy Power Query solutions, as the users just wouldn’t accept that kind of latency. Here’s how they tested:

  • The model was built in 64-bit Excel 2013 using Power Query, feeding directly into the Power Pivot Data Model. The data was then landed from the Data Model into a Pivot Table.
  • That same workbook was imported into Power BI and an equivalent Matrix visualization was created with the same information as the pivot table in the Excel version of the model.
  • Each model was refreshed several times. They timed the total refresh time with a stopwatch (since there wasn’t really another way to do it) and used SQL Sentry to log the number of times the stored procedure was called.
  • To account for human latency in the stopwatch start/stop, this test was repeated several times with the results being averaged.

What they found was… disturbing to say the least.

Here are the average power query refresh times that were logged for this model:

Program             Refresh Time
Power BI Desktop    9 seconds
Excel               75 seconds!

That is absolutely insane! Excel takes over 8 times as long to refresh the same data!

Imke Feldman made mention of this in her blog post Performance difference between Excel and PowerBI with M is huge! While the angle of her post is very much how impressed she is with Power BI, I have to ask the opposite question: why does Excel Power Query performance suck so badly? These two technologies are supposed to be based on the same architecture, aren’t they? So why are we celebrating Power BI’s greatness and not screaming about Excel’s abysmal performance?

The more telling issue actually shows up in the SQL Sentry monitoring of the stored procedure calls:

Program             Stored Proc Called
Power BI Desktop    2 times
Excel               6 times!

Seriously? How does this even make sense?

To be honest, I was actually shocked to see that the stored procedure is called more than once in Power BI. I would not have expected that at all, and would have assumed that the SQL stored procedure would be executed once, since it is only referenced by the Historical Data table. The only things we can fathom are that it is called once for previews and once for the actual load… or possibly that privacy levels force something to be called twice. I’m not 100% certain, except that we now know it is called at least one more time than we think would be necessary.

But Excel? Why is Excel calling that stored procedure six times? That makes no sense at all… Why don’t we assign a number to each of the queries that relate to the SQL stored procedure:

[Image: the queries related to the stored procedure, numbered]

Adding one to the total on the theory of a preview, we get to six. Unlike Power BI, in Excel a model refresh re-executes each child query in full, never caching the results. No wonder the performance is horrendous!

To solve this issue, my client has had to start flattening his query structure… something that is painful, and becomes impossible when you run afoul of the formula firewall. And that’s the biggest rub here: in order to follow best practices and avoid formula firewall issues, you are SUPPOSED to build queries as shown above. And yet your performance will be crucified for doing it.

Worse, I have no reason to believe this doesn’t ALSO happen with all Excel tables in the model. Unfortunately, we don’t have a tool like SQL Sentry that works on Excel files…
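
(To illustrate what “flattening” means here: instead of a chain of referenced queries, everything is collapsed into one self-contained query, so there are no child queries to re-execute. A sketch, reusing the placeholder names from earlier – the column names are assumptions too:)

    // One flattened query: the source connection, navigation, and transforms
    // all live in a single chain, so no separate child query is re-evaluated
    let
        Source = Sql.Database("SqlServer01", "SalesDB", [Query="EXEC dbo.uspGetHistoricalData"]),
        ChangedTypes = Table.TransformColumnTypes(Source, {{"OrderDate", type datetime}}),
        Filtered = Table.SelectRows(ChangedTypes, each [Amount] <> null)
    in
        Filtered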

Despite the fact that we can’t trace internal Excel refreshes, we can draw some conclusions based on Power Query refresh times. Here’s another model to consider, built using a method similar to the one Reid posted about. In this model, I connected to an Access database using a Connection Only query. I then referenced the “Database” query and:

  • Loaded the Transactions table to the Data Model (61,166 records)
  • Loaded the COA table to the Data Model (334 rows)
  • Merged the two into a table called Flat and loaded this to the data model (61,166 rows)

Graphically, it looks like this:

[Image: diagram of the Database, Transactions, COA, and Flat queries]
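
In M terms, the referenced layout looks roughly like this – the database path, table names, and join columns are all assumptions for illustration:

    // Query "Database" – connection only:
    //     let Source = Access.Database(File.Contents("C:\Data\Finance.accdb")) in Source
    // Query "Transactions" – references Database:
    //     let Source = Database{[Schema="", Item="Transactions"]}[Data] in Source
    // Query "COA" – references Database the same way.
    // Query "Flat" – merges the two referenced queries:
    let
        Merged = Table.NestedJoin(Transactions, {"Account"}, COA, {"Account"}, "COA Detail", JoinKind.LeftOuter),
        Flat = Table.ExpandTableColumn(Merged, "COA Detail", {"AccountName"})
    in
        Flat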

I then used VBA to time the refresh time of each individual query, with these results:

[Image: refresh times for each individual query]

“Now wait a minute…” you might be thinking, “you are calling each query individually, so of COURSE it’s going to refresh the child queries!” And that is absolutely true. So then I compared it to how long it takes to run Excel’s Data –> Refresh All command. In an ideal world, that should re-use the Transactions and COA tables that were loaded, correct?

[Image: total time for Data –> Refresh All]

Um… oh. It actually takes a fraction longer, possibly because it needs to find the precedent queries? I’m not sure, but what I can say is that it is plainly re-calculating each table, as the total refresh takes longer than the sum of the tables refreshed individually.

This is by far the most frustrating thing here… my client reached out to Microsoft Premier Support. (They are big enough that they have a dedicated support rep and Premier support tickets.) It took a while, but they finally received this email back from the support team:

The performance difference is expected. Excel isn’t running all of the queries in the same cache context, so if two different data sets are based on the same source data then it can’t take advantage of local caching and this is by design. Power BI Desktop runs all of the queries for a “refresh all” inside a single cache context, which would explain why you see the stored procedure being called less often. At this time, there is no mechanism that the refresh time in Excel can be improved.

Honestly, I don’t believe that to be the case here. I’ve been discussing this issue with Microsoft program managers and engineers for some time. This was an issue in Power BI Desktop, and they fixed it. The fix just needs to be ported into Excel; they simply haven’t been able to get it done yet.

I don’t fault Microsoft for building new features. That is their job, and their company’s success is based on selling more licenses. I get that. Microsoft wants us to use Power Query, and constantly adding easier ways to clean up crappy data is a huge hook in the subscription model. The insider builds of Excel are already hiding the classic data connectors and making Power Query the default Get Data experience – something I think they should have done with the initial release of Excel 2016.

When I talk to users, they are blown away by the ability of Power Query to transform and clean up their data. It’s only recently that I’ve started getting more and more questions about Power Query refresh performance. I’m convinced that this is because the user base is still small, and many users just accept the awful performance because they can get cool stuff done. And that’s part of the reason Microsoft hasn’t been able to justify the fix. If it was your company, would you fix a performance issue that few people recognize, or develop new features to attract new users to your product?

But it’s time to change the priority here. Why build all these amazing features and push them if the audience just ends up abandoning them because the refresh time is too long? To me, the fact that child queries are not cached is the number one issue facing Power Query today. I love Power Query – I don’t think anyone who knows me can say otherwise – and I love the monthly update cycle that continues to deliver new features and improve the product. But believe me when I say that I would happily stop ALL development on Power Query for two years in order to get refresh performance fixed. It’s that big a problem. I’ve already seen clients abandon Power Query solutions due to the refresh times, and several others who avoid this great tool at all costs due to the refresh latency. Hanging out at the Ask the Experts booth at the Data Insights Summit, 75% of the people who came to talk to me were primarily there to figure out how to improve the performance of their queries.

Help us raise this as an issue with Microsoft. You can do that in two ways.

Most importantly: get everyone you know to vote for this item on Excel UserVoice. This is one of the primary vehicles for getting our voices heard by the development team. They are listening, and this idea has been marked as under review, BUT IT STILL NEEDS MORE VOTES. In Microsoft’s own words:

[Image: Microsoft’s “under review” response on the Excel UserVoice idea]

I’d like to see better Power Query refresh performance become the #1 requested item on Excel UserVoice, and we have a long way to go to get there. But wait, you say… it’s marked as “Under Review”! Why do we still need to vote when it’s under review? Let me answer that by pointing you to this idea to make working with Pivot Tables easier (which you may also want to vote up). It’s an idea that I KNOW they want to implement, but it’s been under review since Nov 2015 and still hasn’t met the bar:

[Image: the Pivot Table idea, under review since Nov 2015]

I talked to a couple of the team members at the Data Insights Summit, and they just haven’t had enough votes to signify that Power Query refresh in Excel is more important than some of the other things they are working on. Notice in the posts that they ask us to keep voting. The more votes we get, the easier it is for the teams to justify the cost of the fixes and new features to the ultimate approvers at Microsoft.

That’s the really important thing to do here: KEEP VOTING. And if you’re really passionate about these (or any) ideas, get your friends to vote. Get your family to vote. In fact, get anyone else who will listen to you to vote!


PowerPivotPro