Tag Archives: rows

Power Query (M)agic: Dynamically remove leading rows AND columns



Does Power Query sometimes seem too rigid? It’s great for defining a sequence of transformations to clean your data but requires the incoming data to be in a consistent format. Today, I want to talk about a technique for dynamically handling sheets with different structures.

You can download a sample .pbix to follow along with me here.

Let’s look at my sample data. Suppose my sales manager sends me a monthly sales report structured similarly to that below:

[Image: sample monthly sales report matrix]

This table looks pretty clean, right? Using Power Query, you can load this sheet, remove the first two rows and columns, unpivot the data, and you have the numbers you’re looking for in tabular format.

Patting yourself on the back, you file the report away until February rolls around. The sales manager sends you the new numbers, you hit refresh, and… an error! Peeking at the data, you see that he changed the layout of the file and also added new salespeople to the matrix!

[Images: February sales reports with a changed layout and new salespeople added]

As enjoyable as it would be just to tell the manager to stick to a consistent format, sometimes this isn’t realistic or feasible. So, let’s look at how we can use Power Query to handle this chaos:

  1.  Identify the header name for the column that demarcates the beginning of the matrix.
  2.  Automatically detect rows up to that header row.
  3.  Automatically detect columns up to that header column.
  4.  Remove the extra rows and columns, leaving you with a squeaky-clean dataset.
  5.  Unpivot the matrix for each sheet and append into a single dataset.

STEP 1: IDENTIFY THE TARGET HEADER ROW AND COLUMN:

Looking back at our sample data, you can see that there is a common cell in all our sheets that lies at the upper left of the data we want to extract. In this case, the target cell is the “Region” header. We want to remove all rows ABOVE that cell and all columns LEFT of that cell without hard-coding the location of that cell.

STEP 2: LOCATE THE HEADER ROW DYNAMICALLY:

We need some way to identify where the header row starts so that we can remove rows up to that point. This functionality is something I would have thought was built in by default, but surprisingly it is not! Luckily, our friend Ken Puls over at ExcelGuru.ca came up with a solution for this, which I’ve adapted slightly for our purposes.

Load your first worksheet into Power Query, add an Index column, and filter the table to the target value from step 1:

  1. Add Column > Index column > From 0.
  2. Add a custom step (click the fx button in the formula bar).
  3. Replace the formula with some custom M code: Table.FindText(#"Starting Table w/Index", "Region"). Name the step "Filtered to Header Row."

[Image: custom step calling Table.FindText in the formula bar]

Note that you’ll want to replace "Region" with the target header name from step 1. Table.FindText() scans the entire table for the target text and returns any row containing that value, so be careful that your dataset doesn’t have that exact target word in other places!

Now we have our header row isolated, along with the Index value for that row. Keep the "Filtered to Header Row" step in mind, as we’ll come back to it shortly.

[Image: the isolated header row with its Index value]

STEP 3: DETECT THE COLUMNS TO DELETE

Let’s move on to the more difficult part: dynamically removing leading columns. We have a bunch of columns, and we want to eliminate the first X, up to our target column. We’ll leverage the technique above and add some P3 secret sauce.

First, transpose the #"Filtered to Header Row" step and add an index. That will turn the single-row table into a single-column table that we can use to identify the columns to remove.

  1. Transform > Transpose.
  2. Add an index column: Add Column > Index column > From 1.
  3. To handle blank columns in the header row (always a possibility in dirty data), add a custom column that checks if the column list has any nulls: Add Column > Custom Column > if [Column1] = null then "Column" & Number.ToText([Index]) else [Column1].

Our goal is to delete all columns left of the “Region” column (or above “Region” in the transposed table) so let’s find the index of that row:

  1. Right-click Column1 > Text Filters > Equals > "Region".

[Image: filtering Column1 to rows where the value equals "Region"]

We’re building upon Ken’s technique of finding the Index that corresponds to a target cell but this time with a transposed table. Since we’ll reference this number a couple of times, let’s Drill Down to just the Index number so that we have an easily referenceable step:

  1. Right-Click on the Index value > Drill Down.
  2. Rename that step to “TargetColumnIndex”.

Now, jump around a bit and reference the original column list and filter it down to include ONLY the rows that have an index number less than the target column.

  1. Click the fx button to insert a New Step.
  2. Revise the M code to point to the full column list: = Table.SelectRows(#"Added ColumnName", each [Index] < #"TargetColumnIndex").

Let’s break down what we’re doing here: the outer Table.SelectRows filters the inner table #"Added ColumnName" down to all rows that have an Index less than the "TargetColumnIndex" value we isolated a couple of steps ago.

[Image: the filtered list of leading columns to remove]

Finally, remove the helper columns keeping only “ColumnName,” and you have a nice list of columns to exclude!

STEP 4: REMOVE THE ORIGINAL EXTRA COLUMNS AND ROWS

We now have all the pieces we need to eliminate our junk rows and columns! Let’s jump back to our original query and clean it up.

Create a new step and change its code to reference the Starting Table:

  1. Click fx > rename the step to "Back to Starting Table" > change the code to = #"Starting Table".
  2. Home > "Remove Top Rows." Enter any value for the parameter.
  3. Edit the M code directly and change the 2nd parameter so the step reads Table.Skip(#"Back to Starting Table", #"Filtered to Header Row"[Index]{0}); the second argument references the step where we isolated the header row number earlier.
  4. Home > Use First Row as Headers.

[Image: Table.Skip removing the leading rows]

Boom! We’ve dynamically eliminated the top rows!

Now for the final act, we’ll tweak Table.RemoveColumns (the function Power Query uses when you do "Remove Columns") to take a dynamic parameter! Remember that list of columns we generated earlier, the columns we want to remove? That’s what we’ll feed into Table.RemoveColumns.

First, select any one of the junk columns and right-click > “Remove Columns.” Take a look at the formula that Power Query generated.

  1.  Table.RemoveColumns(#"Promoted Headers",{"Column1"}).

We know that Table.RemoveColumns requires a list for its second argument, so we can reference the list of columns to remove from an earlier step:

  1. Table.RemoveColumns(#"Promoted Headers", Table.ToList(#"Columns to Remove")).

Now we’re left with a clean matrix which we can easily unpivot without any problems.

    1. Right-click the Region column > Unpivot Other Columns.
    2. Rename columns something useful.

[Image: the final unpivoted table]
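For reference, here is a hedged sketch of the whole query with steps 1 through 4 put together. The workbook path, sheet name, and the "Salesperson"/"Sales" column names are illustrative, and the step names may differ from what the UI generates for you:

let
    // Hypothetical source; point this at your own workbook and sheet
    Source = Excel.Workbook(File.Contents("C:\Data\SalesReport.xlsx")){[Item = "Jan", Kind = "Sheet"]}[Data],
    #"Starting Table" = Source,
    // Step 2: find the row containing the target header "Region"
    #"Starting Table w/Index" = Table.AddIndexColumn(#"Starting Table", "Index", 0, 1),
    #"Filtered to Header Row" = Table.FindText(#"Starting Table w/Index", "Region"),
    // Step 3: transpose the header row, index it, and name any blank columns
    Transposed = Table.Transpose(#"Filtered to Header Row"),
    #"Added Index" = Table.AddIndexColumn(Transposed, "Index", 1, 1),
    #"Added ColumnName" = Table.AddColumn(#"Added Index", "ColumnName",
        each if [Column1] = null then "Column" & Number.ToText([Index]) else [Column1]),
    // Find the position of the target column, then collect everything left of it
    TargetColumnIndex = Table.SelectRows(#"Added ColumnName", each [Column1] = "Region")[Index]{0},
    #"Columns to Remove" = Table.SelectColumns(
        Table.SelectRows(#"Added ColumnName", each [Index] < TargetColumnIndex), {"ColumnName"}),
    // Step 4: skip the leading rows, promote headers, and drop the leading columns
    #"Back to Starting Table" = #"Starting Table",
    #"Removed Top Rows" = Table.Skip(#"Back to Starting Table", #"Filtered to Header Row"[Index]{0}),
    #"Promoted Headers" = Table.PromoteHeaders(#"Removed Top Rows"),
    #"Removed Columns" = Table.RemoveColumns(#"Promoted Headers", Table.ToList(#"Columns to Remove")),
    // Unpivot everything except the Region column
    Unpivoted = Table.UnpivotOtherColumns(#"Removed Columns", {"Region"}, "Salesperson", "Sales")
in
    Unpivoted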

STEP 5: CONVERT THE MAGIC QUERY TO A FUNCTION

The final step is to convert your magic query into a function so that you can apply this dynamic transformation to each file/worksheet that needs to be unpivoted, as sketched below.
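As a sketch of what that invocation could look like, assuming you saved the query above as a function named fnCleanSheet (a hypothetical name) that accepts the raw sheet table:

let
    Source = Excel.Workbook(File.Contents("C:\Data\SalesReport.xlsx")),
    // Keep only actual worksheets
    Sheets = Table.SelectRows(Source, each [Kind] = "Sheet"),
    // Apply the cleaning function to each sheet's raw table
    Cleaned = Table.AddColumn(Sheets, "CleanData", each fnCleanSheet([Data])),
    // Append everything into a single dataset
    Combined = Table.Combine(Cleaned[CleanData])
in
    Combined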

Conclusion:

Using the technique of identifying a target row and column, you can create extremely powerful and dynamic queries that can handle input files that aren’t always perfect. Let’s face it: you can’t always rely on training or data validation to prevent end users from modifying your perfect templates. Our job as data modelers is to make the end-user experience as friendly as possible by foreseeing and handling exceptions.

If you have any ideas on how you might use this in your reports, please feel free to share in the comments section!

We get it:  you probably arrived here via Google, and now that you’ve got what you needed, you’re leaving. And we’re TOTALLY cool with that – we love what we do more than enough to keep writing free content.  And besides, what’s good for the community (and the soul) is good for the brand.

But before you leave, we DO want you to know that instead of struggling your way through a project on your own, you can hire us to accelerate you into the future. Like, 88 mph in a flux-capacitor-equipped DeLorean – THAT kind of acceleration. C’mon McFly. Where you’re going, you won’t need roads.


PowerPivotPro

Use calculated tables to show only used rows in dimensions

I got an interesting request this week: to improve the usability of a tabular model, they wanted to filter the dimension tables to contain only the rows that have data in the fact table. That way, slicers and other filters will never show items for which there are no sales or no facts. Often this is taken care of before loading the data, by creating views or writing queries, but in this case that would have been expensive and time-consuming to do for all tables in the model. There is, however, a way to do this in Power BI \ SSAS by using calculated tables.

Take the following model:

[Image: the model diagram]

I filtered the FactOnlineSales table in the query to exclude some products. I ended up with a sales table that only contains data for 2 products; the product table has 2517 products in total, though.

If I now want to use the product as a slicer or filter, Power BI will show all of them; in this case it looks like this:

[Image: slicer listing all 2517 products]

As you can see, that is a long list! I just want to filter it down to show only those products where I have sales. To do this I will create a new calculated table that contains only the products that have values in the sales table:

FilteredProduct = FILTER(DimProduct,
                                    COUNTROWS(RELATEDTABLE(FactOnlineSales)) > 0
                        )

In short: filter the rows in the DimProduct table to keep only those rows where there are more than zero rows in the related sales table for the current product.

Now this works; I have just 2 rows left:

[Image: slicer showing only the 2 remaining products]

But I would like to make it a bit cleaner by hiding the table and setting the relationship to inactive. The problem is that by doing that, the expression doesn’t work anymore, since I am using the RELATEDTABLE function. But there is a way to fix this by adding the USERELATIONSHIP function:

FilteredProduct = FILTER(DimProduct,
                                 CALCULATE(
                                            COUNTROWS(RELATEDTABLE(FactOnlineSales)) > 0
                                           ,USERELATIONSHIP(DimProduct[ProductKey],FactOnlineSales[ProductKey])
                                          )
                          )

Now this does the trick. The model now looks like this:

[Image: the model with the hidden DimProduct table and the new calculated table]

By hiding the DimProduct table, the end user will not be able to see the table in the field list, and my slicer now returns just the 2 items:

[Image: the slicer now returning just the 2 items]

Mission accomplished!
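As a footnote, a sketch of an equivalent formulation uses ISEMPTY instead of COUNTROWS (untested against this particular model; once the relationship is set to inactive it needs the same CALCULATE/USERELATIONSHIP treatment shown above):

FilteredProduct = FILTER(DimProduct,
                         NOT ISEMPTY(RELATEDTABLE(FactOnlineSales))
                  )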


Kasper On BI

Mighty Morphing Pivot Tables or: How I learned to automatically change hierarchy levels on rows

[Image: the dynamic pivot table]

The Pivot Pictured Above Acts as if We’ve Swapped Out Fields on Rows – in Response to a Slicer Click!

First off…my first post! Being one of the newest (and youngest) members of the PowerPivotPro family has been very exciting so far. As a way of introducing myself, I’d like to share a creative solution to a problem I’m sure many of you have encountered when building a report or dashboard.

Typical Client Report Requests:

  • Simple but complex
  • High-level yet detailed
  • Compact yet containing everything…

Quite understandably, clients love to channel their inner Marie Antoinette by basically asking to have their cake and eat it too. I actually relish those scenarios; they allow me to flex my “think outside the box” muscles!  And hey, it’s great to be working with a toolset that truly CAN accommodate the special demands posed by real-world situations.

Well, one such scenario had a customer wanting a summary Pivot Table with about five columns (fields) worth of details in it. No problem, done and done! The problem we were encountering, however, was that 90% of the real estate on our reporting page had already been used for other visuals, none of which this client wanted eliminated or reduced to make room. So I was left with the predicament of figuring out how to fit all this data onto this dashboard… Tetris mode engage!

Unfortunately, despite my best efforts to rearrange the dashboard (accompanied by my ’80s Rush mixtape), I simply could not find any way to display a wide Pivot Table on this dashboard. So I circled back to the drawing board and asked myself what variables I could manipulate to achieve the desired outcome.

I realized that I had been assuming the Pivot Table had to be fixed, meaning it always has to show all levels of the data. However, I LOVE to design visuals for clients that are dynamic, showing only the data relevant to them (often based on slicer selections). So I politely asked my previous assumption to leave and invited over my good friend paradigm shift. After some long conversations and extensive Google searches, I ran across a PowerPivotPro blog post written by Rob that inspired my eventual solution.

Discovering this post almost felt like a relay race in which I was being passed the baton to cross the finish line. Using the idea from this post that a slicer could change the axis of a chart, I realized the same would work in a Pivot Table. All five columns in my table were part of a hierarchy… so why not use this technique to display a single column that would DYNAMICALLY switch between any of the five levels of this hierarchy based on slicer selections? I would now be able to create a table that is both compact and displays all the data the client needed.

Time to break out the cake forks!

Now for the fun part, as I share this recipe for success (that was the last cake joke, I promise). The general idea is to create a hierarchy in the data model and then reference its levels in an Excel set to be used in my Pivot Table. I’ll be using tables from the publicly available Northwind DW data set for this example.

Download Completed Example Workbook


FIRST, create a Customer Geography Hierarchy in the data model on the DimCustomer Table.

Hierarchy in the Data Model table:

[Image: the Customer Geo hierarchy in the data model]

SECOND, create a new DAX Measure called “Distinct Count of Country.” This will be used in our set to indicate whether or not a selection was made on our country Slicer.

=DISTINCTCOUNT( DimCustomer[Country] )

Now some of you technically savvy readers may be thinking “why didn’t he use the DAX Function HASONEVALUE?” I’ll explain more on this later when I explain how to write the set using MDX.

THIRD, create a new set for your pivot table referencing our recently created Hierarchy and DAX Measure. Note that the only way to access your sets is through a contextual ribbon tab that is displayed only when a cell selection is on a Pivot Table.

Opening the Set Manager window:

[Image: the Set Manager ribbon option]

Creating a new set using MDX:

[Image: creating a new set using MDX in the Set Manager]

Writing the MDX Code:

[Image: the MDX set definition]

This MDX query works by utilizing an IIF statement, which operates the same way as in DAX or Excel. It checks whether our (Distinct Count of Country) DAX Measure is greater than 1 (indicating no slicer selection). If TRUE, it returns the Country column from our hierarchy; if FALSE (a slicer selection has been made), it returns the City column. It’s important to note that I must reference the columns through the hierarchy; if you were to put just the column names in this query, it would not run. It’s also important that the “Recalculate set with every update” box is checked; this makes sure the MDX statement is recalculated every time someone uses a slicer, otherwise it’ll appear as if the set isn’t working.
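Since the MDX itself only appears in the screenshot, here is a sketch of what the set definition could look like. The hierarchy name ([Customer Geo]) and the level names are illustrative and must match your own model:

IIF(
    [Measures].[Distinct Count of Country] > 1,
    [DimCustomer].[Customer Geo].[Country].MEMBERS,
    [DimCustomer].[Customer Geo].[City].MEMBERS
)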

Keen observers take note! Here’s where I explain WHY I used DISTINCTCOUNT rather than HASONEVALUE in my DAX Measure. Let’s say a client would like to multi-select countries in the slicer and still have it display the City column on rows in the Pivot Table. If I had used HASONEVALUE in my DAX Measure, it would only switch to the City column when a single value was selected.

The way I’ve designed it, we can change the value in the MDX query from 1 to any number we’d like (e.g., 2, 3, etc.), which gives us the flexibility to allow multiple slicer selections and still have it switch to the City column.

“Clever Girl…”


I’m not actually sure if I’m supposed to be the hunter or dinosaur in this analogy from Jurassic Park…but either way I felt clever for that last step.

FOURTH, we can now use our newly created set in a Pivot Table. You’ll notice that a new folder called Sets is now nested in our DimCustomer Table.

Placing the Customer Geo set on rows in our Pivot Table:

[Image: the Customer Geo set placed on rows in the Pivot Table]

Making a slicer selection to observe the dynamic switch from Country to City. Pretty cool!

[Image: the Pivot Table switching from Country to City after a slicer selection]

My client’s reaction could be summed up in a single word spoken by the immortal Keanu Reeves…

”Whoa…”

Now, some of you may have noticed that I have a Total Sales value at the top of my Pivot Table. This is my chance to point out one unfortunate drawback of using sets on rows: it eliminates the totals row at the bottom of the Pivot Table. All is not lost though, my friends, for every problem a solution can always be found! In this case I created an artificial “Totals row” at the top of the Pivot Table, using the tried and true CUBEVALUE function to call the measure I’m already using in my Pivot Table. NOTE that you need to make sure you connect all slicers (via the slicer_name) in the cube string for them to slice the CUBEVALUE as well. Finally, just a dash of formatting and some elbow grease, and we have ourselves a totals row!

CUBEVALUE Formula used in the cell for totals:

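As a sketch of what that cell formula could look like ("ThisWorkbookDataModel" is Excel’s default connection name for the data model; the measure and slicer names are illustrative):

=CUBEVALUE("ThisWorkbookDataModel", "[Measures].[Total Sales]", Slicer_Country)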

There you have it, your very own Mighty Morphing Pivot Table!


PowerPivotPro

.Net 4.6.2 Framework client driver for Always Encrypted resulting in intermittent failures to decrypt individual rows

The SQL Product team has identified an issue with the .NET Framework 4.6.2 client driver for Always Encrypted-enabled databases on SQL Server 2016 and Azure SQL Database. The issue can lead to intermittent failures while trying to decrypt records from an Always Encrypted-enabled database, with the following error message:

Decryption failed. The last 10 bytes of the encrypted column encryption key are: '7E-0B-E6-D3-39-CE-35-86-2F-AA'. The first 10 bytes of ciphertext are: '01-C3-D7-39-33-2F-E6-44-C3-B1'. Specified ciphertext has an invalid authentication tag.

The above failure to decrypt may lead to incorrect query results, which in turn may trigger incorrect behavior in the app: for example, attempts to insert missing values or to perform other updates that will either produce further errors or produce inconsistent data in the database.

The SQL Product team is aware of the issue and is actively working on the fix, which may be made available soon. In the interim, we have the following recommendations for users:

  • Users who are on .NET Framework 4.6.1 are not impacted by this issue and can ignore it. We recommend not upgrading to .NET Framework 4.6.2 if you are using the Always Encrypted feature until the fix for the issue is released.
  • Users who have the latest .NET Framework 4.6.2 installed should, if possible, roll back to .NET Framework 4.6.1. Note: in general, you should not uninstall any version of the .NET Framework without testing, because applications dependent on that version can potentially break and may not function as desired.
  • If you are unable to uninstall .NET Framework 4.6.2 due to application dependencies, you can work around the issue by turning off column encryption key (CEK) caching: set the SqlConnection.ColumnEncryptionKeyCacheTtl property to 0 in the .NET 4.6.2 driver, as sketched below.
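For those taking the caching workaround, here is a minimal C# sketch; the connection string is illustrative, and the property is process-wide, so set it once at application startup:

using System;
using System.Data.SqlClient;

class Program
{
    static void Main()
    {
        // Setting the cache TTL to zero disables column encryption key caching.
        SqlConnection.ColumnEncryptionKeyCacheTtl = TimeSpan.Zero;

        // Illustrative connection string; Always Encrypted must be enabled on it.
        var connStr = "Server=myServer;Database=myDb;Integrated Security=true;" +
                      "Column Encryption Setting=Enabled";
        using (var conn = new SqlConnection(connStr))
        {
            conn.Open();
            // Query Always Encrypted columns as usual from this point on.
        }
    }
}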

After uninstalling .NET Framework 4.6.2 or turning off column encryption key (CEK) caching, users can confirm that the error does not reappear during a table scan (e.g., SELECT * FROM <table with Always Encrypted columns>) executed from a query window in SSMS. Executing this scan will help validate that the error doesn’t occur.

Customers who encounter the above error during the validation scan and are unable to resolve the issue should contact sqlalwaysencrypted@microsoft.com. The team will be able to help access and recover all previously encrypted rows affected by this bug. There will be no permanent data loss as a result of this defect.

To determine which versions of the .NET Framework are installed on a system, see How to: Determine Which .NET Framework Versions Are Installed.

Parikshit Savjani
Senior Program Manager (@talktosavjani)


SQL Server Release Services

SQL Pattern Matching Deep Dive – Part 4, Empty matches and unmatched rows?

I have been asked a number of times, during and after presenting on this topic (SQL Pattern Matching Deep Dive), what the difference is between the various matching options, such as EMPTY MATCHES and UNMATCHED ROWS. This is the area I am going to cover in this blog post, which is No. 4 in this deep-dive series.

When determining the type of output you want MATCH_RECOGNIZE to return, most developers will opt for one of the following:

  • ONE ROW PER MATCH – each match produces one summary row. This is the default.
  • ALL ROWS PER MATCH – a match spanning multiple rows will produce one output row for each row in the match.

The default behaviour for MATCH_RECOGNIZE is to return one summary row for each match, and in the majority of use cases this is probably the ideal solution. However, there are also many use cases that require more detailed information to be returned. If you are debugging your MATCH_RECOGNIZE statement, a little more information can help show how the pattern is being matched to your data set. In some cases you may find it useful, or even necessary, to use the extended syntax of the ALL ROWS PER MATCH keywords. There are three sub-options:

  • ALL ROWS PER MATCH SHOW EMPTY MATCHES <- note that this is the default
  • ALL ROWS PER MATCH OMIT EMPTY MATCHES
  • ALL ROWS PER MATCH WITH UNMATCHED ROWS

Let’s look at these sub-options in more detail, but first a quick point of reference: all the examples shown below use the default AFTER MATCH SKIP PAST LAST ROW syntax. More on this later…

TICKER DATA

Here is part of the ticker data set that we are going to use in this example – if you want to take a look at the full data set then see the example on the LiveSQL site:

[Image: the ticker data set]

Empty matches

An obvious first question is: what’s the difference between an “empty match” and an “unmatched row”? This is largely determined by the type of quantifier used as part of the pattern definition; by changing the quantifier it is possible to produce a similar result using both sets of keywords. To help explore the subtleties of these keywords, I have simplified the pattern to just look for price decreases, and you should note that we are using the * quantifier to indicate that we are looking for zero or more matches of the DOWN pattern. Therefore, if we run the following code:

SELECT
  symbol,
  tstamp,
  price,
  start_tstamp,
  end_tstamp,
  match_num,
  classifier
FROM ticker
MATCH_RECOGNIZE (
  PARTITION BY symbol ORDER BY tstamp
  MEASURES FIRST(down.tstamp) AS start_tstamp,
           LAST(down.tstamp) AS end_tstamp,
           match_number() AS match_num,
           classifier() AS classifier
ALL ROWS PER MATCH SHOW EMPTY MATCHES
  PATTERN (DOWN*)
  DEFINE
    DOWN AS (price <= PREV(price))
)
WHERE symbol = 'GLOBEX';

We get the following output:

[Image: query output showing empty matches for GLOBEX]

You can see that the result set contains all 20 rows that make up the data for my symbol “GLOBEX”. Rows 1-3, 9, and 13-15 are identified as empty matches: the classifier returns null for them. These rows appear because we have defined the search requirement for pattern DOWN as zero or more occurrences.

Based on this we can state that an empty match is a row that does not map explicitly to a pattern variable (in this case DOWN). However, it is worth noting that an empty match does in fact have a starting row and it is assigned a sequential match number, based on the ordinal position of its starting row. The above situation is largely the result of the specific quantifier that we are using: * (asterisk). Given that the DOWN variable can be matched zero or more times there is the opportunity for an empty match to occur. As the complexity of the PATTERN increases, adding more variables and using different combinations of quantifiers, the probability of getting empty matches decreases but it is something that you need to consider. Why? Because the MATCH_NUMBER() function counts the empty matches and assigns a number to them – as you can see above. Therefore, if we omit the empty matches from the results the MATCH_NUMBER() column no longer contains a contiguous set of numbers:

So if we run the following code, where we have specified OMIT EMPTY MATCHES:

SELECT 
  symbol, 
  tstamp,
  price,
  start_tstamp,
  end_tstamp,
  match_num,
  classifier
FROM ticker 
MATCH_RECOGNIZE ( 
  PARTITION BY symbol ORDER BY tstamp 
  MEASURES FIRST(down.tstamp) AS start_tstamp,
           LAST(down.tstamp) AS end_tstamp,
           match_number() AS match_num,
           classifier() AS classifier
ALL ROWS PER MATCH OMIT EMPTY MATCHES
  PATTERN (DOWN*)
  DEFINE 
    DOWN AS (price <= PREV(price))
)
WHERE symbol = 'GLOBEX';

We get the following output:

[Image: query output with empty matches omitted]

As you can see, the MATCH_NUMBER() column starts with match number 4, followed by match 6, followed by match 10. Therefore, you need to be very careful if you decide to test for a specific match number within the MATCH_RECOGNIZE section and/or the result set, because you might get caught out if you are expecting a contiguous series of numbers.

Summary of EMPTY MATCHES

Some patterns permit empty matches, such as those using the asterisk quantifier, as shown above. Three main points to remember when your pattern permits this type of matching:

  1. The value of MATCH_NUMBER() is the sequential match number of the empty match.
  2. Any COUNT is 0.
  3. Any other aggregate, row pattern navigation operation, or ordinary row pattern column reference is null.

The default is always to return empty matches; therefore, it is a good idea to determine from the start whether your pattern is capable of returning an empty match and how you want to manage those rows: include them (SHOW EMPTY MATCHES) or exclude them (OMIT EMPTY MATCHES). Be careful if you are using MATCH_NUMBER() within the DEFINE section as part of a formula, because empty matches increment the MATCH_NUMBER() counter.

Reporting unmatched rows?

It is always useful to view the complete result set, at least when you are running your code against test data sets. Getting all the input rows into your output is relatively easy: you just need to include the phrase ALL ROWS PER MATCH WITH UNMATCHED ROWS. Other than for testing purposes I can’t think of a good use case for this in production, so make sure you check your code before you submit your production-ready code to your DBA.

What about skipping?

Note that if ALL ROWS PER MATCH WITH UNMATCHED ROWS is used with the default skipping behaviour (AFTER MATCH SKIP PAST LAST ROW), then there is exactly one row in the output for every row in the input. This statement leads us nicely into the next topic in this deep-dive series, where I will explore SKIPPING. Taking a quick peek at that next topic: there are many different types of skipping behaviour permitted when using WITH UNMATCHED ROWS. It does, in fact, become possible for a row to be mapped by more than one match and appear in the row pattern output table multiple times, although unmatched rows will appear in the output only once.

Can a query contain all three types of match?

Now the big question: Can I have a query where it is possible to have both UNMATCHED ROWS and EMPTY MATCHES? Short answer: Yes.

When the PATTERN clause allows empty matches, nothing in the DEFINE clause can stop the empty matches from happening. However, there are special PATTERN symbols called anchors. Anchors work in terms of positions rather than rows: they match a position either at the start or the end of a partition or, if used together, across the whole partition.

  • ^ matches the position before the first row in the partition
  • $ matches the position after the last row in the partition

Therefore, using these symbols it is possible to create a PATTERN where the keywords SHOW EMPTY MATCHES, OMIT EMPTY MATCHES, and WITH UNMATCHED ROWS all produce different results from the same result set. For example, let’s start with the following:

SELECT symbol, tstamp, price, mnm, nmr, cls
FROM ticker MATCH_RECOGNIZE(
 PARTITION BY symbol
 ORDER BY tstamp
 MEASURES match_number() AS mnm,
          count(*) AS nmr,
          classifier() AS cls
 ALL ROWS PER MATCH SHOW EMPTY MATCHES
 PATTERN ((^A*)|A+)
 DEFINE A AS price > 11)
WHERE symbol = 'GLOBEX'
ORDER BY 1, 2;

returns the following 5 rows:

[Image: the 5 rows returned with SHOW EMPTY MATCHES]

This shows row 1 as an empty match for the pattern A*, because we are matching from the start of the partition. This sets the MATCH_NUMBER() counter to 1. After the empty match, the state moves to the pattern A+ for the remainder of the rows. The first match for this pattern starts at row 2 and completes at row 4. The final match in our data set is found at the row containing 15-APR-11. Therefore, if we omit the empty match at row 1, we only get 4 rows returned, as shown here:

SELECT symbol, tstamp, price, mnm, nmr, cls
FROM ticker MATCH_RECOGNIZE(
 PARTITION BY symbol
 ORDER BY tstamp
 MEASURES match_number() AS mnm, 
          count(*) AS nmr, 
          classifier() AS cls
 ALL ROWS PER MATCH OMIT EMPTY MATCHES
 PATTERN ((^A*)|A+)
 DEFINE A AS price > 11)
WHERE symbol = 'GLOBEX'
ORDER BY 1, 2;

returns the following 4 rows:

[Image: the 4 rows returned with OMIT EMPTY MATCHES]

Now, if we use the last iteration of this example, the MATCH_RECOGNIZE statement returns all the rows from the input data. The actual “unmatched rows” are identified as having a NULL match number and NULL classifier. The “empty matches” are identified as having a NULL classifier, and in this example the COUNT(*) function returns zero for them.

SELECT symbol, tstamp, price, mnm, nmr, cls
FROM ticker MATCH_RECOGNIZE(
 PARTITION BY symbol
 ORDER BY tstamp
 MEASURES match_number() AS mnm, 
          count(*) AS nmr, 
          classifier() AS cls
 ALL ROWS PER MATCH WITH UNMATCHED ROWS
 PATTERN ((^A*)|A+)
 DEFINE A AS price > 11)
WHERE symbol = 'GLOBEX'
ORDER BY 1, 2;

returns all 20 rows from our data set:

[Image: all 20 rows returned with WITH UNMATCHED ROWS]

LiveSQL

I have taken all the code and the associated explanations and created a tutorial on LiveSQL so you can try out the code for yourself: https://livesql.oracle.com/apex/livesql/file/tutorial_DZO3CVNYA7IYFU1V8H0PWHPYN.html.

Summary

I hope this helps to explain how the various output keywords that are part of the ALL ROWS PER MATCH syntax can affect the results you get back. You should now understand why your results can contain MATCH_NUMBER() values that are not contiguous, and why CLASSIFIER() can return a NULL value, along with specific aggregate functions. I expect the hardest concept to grasp is the idea of empty matches. As I stated earlier, it is always a good idea to determine from the start whether your pattern is capable of returning an empty match: are you using an asterisk * within the PATTERN clause? Then you can decide how you want to manage those rows: include the empty matches (SHOW EMPTY MATCHES) or exclude them (OMIT EMPTY MATCHES). Be careful if you are using MATCH_NUMBER() within the DEFINE section as part of a formula, because empty matches increment the MATCH_NUMBER() counter.

What should be immediately obvious is that in all the examples I have used the default skip behaviour: AFTER MATCH SKIP PAST LAST ROW. In the next post I will explore the various skip keywords and how they can impact the results returned by your MATCH_RECOGNIZE statement.

What’s next?

In the next post in this series I am going to review the keywords that control where we restart searching once a pattern has been found: the SKIP TO keywords. Feel free to contact me if you have an interesting use case for SQL pattern matching or if you just want some more information. Always happy to help. My email address is keith.laker@oracle.com.

Looking for more Information

Use the tag search to see more information about pattern matching or SQL Analytics or Database 12c.


The Data Warehouse Insider