My wife says that if I wait long enough, some of my clothing will come back into fashion. In the data analytics world, apparently we have waited long enough because an old Analysis Services Multidimensional concept called aggregations has come back into fashion. You may have seen announcements that Power BI supports its own flavor of aggregations now. This blog post is not about Power BI aggs. However, I’ll explain why we didn’t use them and then I’ll describe the approach we did take.
The Business Scenario
To provide some context for the real client scenario where we created this exact match agg approach: the client is a retailer, and like many retailers they tend to measure themselves on a per-store, per-day scale. If one region has 50 stores and another region has 80 stores, it's difficult to compare those two regions head-to-head unless you normalize the sales amounts to the scale of a single store and day. Similarly, if new stores are being opened halfway through the year and other stores are being closed, then it's difficult to compare apples to apples without normalizing to a per-store, per-day scale. While this approach is very useful for analytics, it also has a downside: almost every metric must now be divided by a distinct count of StoreKey and DateKey. As you know, distinct counts are very expensive relative to sum or count aggregations.
I was presented with this very situation where a distinct count over an 18 billion row table was the denominator for most calculations and performance was terrible. They wanted dashboards to render in under 10 seconds even when looking at sales change year-over-year for the entire company. This blog post summarizes the approach we took to optimize their Analysis Services Tabular model distinct counts to meet those performance expectations.
Could Power BI Aggregations Help?
We chose not to use Power BI aggregations for a few reasons. First, the data was in an Analysis Services Tabular model which was hundreds of GB in size and running on a 128 v-core VM with 4TB of RAM. Unfortunately, Power BI aggs aren’t supported on top of a live connection to SSAS currently. Second, though Power BI does support distinct count aggregations, the agg table must contain the distinct count key and it must be marked as a GroupBy column. In the real client scenario above, the 18 billion row fact table had one row per ProductKey per StoreKey per DateKey. Aggregating it to a product subcategory grain but keeping StoreKey and DateKey in the table only shrunk the table to 4 billion rows. That’s too large to be imported into Power BI at the current time.
We actually did take the approach of creating a product subcategory agg table in the SSAS Tabular model and it did make reports 3x faster. The nice thing about this approach was that 90% of reporting is done at the product subcategory and above so this agg table was hit on most reports. The other nice thing about this approach was that because StoreKey and DateKey were in the agg table, we could slice by any rollup in the Store dimension and any rollup in the Date dimension and hit the agg as long as the report was at the product subcategory level or above. Unfortunately, 3x faster wasn’t nearly fast enough; a distinct count over a 4 billion row table was still much too slow.
How to Make Distinct Counts an Order of Magnitude Faster
I have written in the past about optimizing large distinct counts by separating out the portion of the data which isn’t a distinct count and turning it into a sum. That approach wasn’t applicable to the store-day distinct count in this scenario. But the solution we came up with did take a page out of that book by turning the entire distinct count into a sum. How do you change a distinct count into a sum? By pre-aggregating the data. When I say pre-aggregating, I don’t mean the approach described above where you create a 4 billion row table but still perform the DISTINCTCOUNT at query time. I mean that you perform the distinct count ahead of time so that at query time it is a sum.
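To make the idea concrete, here is a minimal Python sketch (with made-up fact rows, not the client's data) of what "turning a distinct count into a sum" means: the distinct count of store-days is resolved once, ahead of time, so at query time only a cheap sum remains.

```python
from collections import defaultdict

# Hypothetical fact rows at (product_key, store_key, date_key) grain.
fact_rows = [
    (1, 101, 20190201),
    (1, 101, 20190202),
    (2, 101, 20190201),  # same store-day as the first row
    (2, 102, 20190201),
]

def store_day_distinct_count(rows):
    """Slow path: distinct (store, date) pairs computed at query time."""
    return len({(store, day) for _, store, day in rows})

def preaggregate(rows):
    """Fast path, done ahead of time: resolve the distinct day count per
    store, so at query time the measure collapses into a simple SUM."""
    days_per_store = defaultdict(set)
    for _, store, day in rows:
        days_per_store[store].add(day)
    return {store: len(days) for store, days in days_per_store.items()}

agg = preaggregate(fact_rows)
assert sum(agg.values()) == store_day_distinct_count(fact_rows)
```

Note that the pre-counted values only sum to the correct answer at the grain they were pre-aggregated for, which is exactly the "exact match" constraint the rest of this post deals with.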
Why Call this an Exact Match Agg?
I can pre-aggregate an additive measure like sales amount to the product subcategory grain and store it in a FactProductSubCategorySalesAgg table. When the user queries by product category, I can simply summarize the FactProductSubCategorySalesAgg numbers further to answer the query. However, if I'm dealing with distinct counts, that doesn't work. You can't roll up a subcategory agg to the category level because a distinct count is nonadditive. So distinct count agg tables (which truly pre-aggregate the distinct count rather than store the distinct count key in the table) only help queries which are an exact match. This means two things. First, I have to anticipate ahead of time the common rollups that users will request. Second, I have to find a way to store thousands of different grain rollups, and I have to be able to determine at query time in DAX calculations whether the rollup the user is requesting has already been pre-aggregated. Since there are dozens of columns in the Product, Store, and Date dimensions that users put in reports, when you work out all the combinations of filter grains that are commonly seen, you quickly realize there are thousands of rollup grains.
As a side note, in Power BI aggs each column in the agg table must be either a GroupBy column you grouped by to create the aggregation or a pre-aggregated sum/min/max/count column. Distinct count isn’t in the list. (Currently, you have to mark the distinct count column as “GroupBy” to support distinct count over the agg table. But that leads to a large agg table.) Though it’s possible the Power BI team could add “Distinct Count” to the list in the future, it wouldn’t be very useful since each Power BI agg table represents one grain of pre-aggregated data which can’t be further aggregated up and since typical scenarios require thousands of grains, it immediately becomes untenable to create thousands of distinct count agg tables. So we can’t really use the Power BI agg approach for our exact match pre-aggregated distinct counts. We have to come up with a different approach.
How to Store Thousands of Aggregation Grains
For this agg table we decided to pre-aggregate up the Product dimension and pre-aggregate up the Date dimension but keep the StoreKey in the agg table. So any filter on the Store dimension can hit the agg table (since StoreKey is in the distinct count key) but only pre-aggregated rollups on the Product and Date dimensions can hit the agg table. How do we store thousands of different Product and Date rollups without making a mess of the model?
Dates are easiest, so we will start with those. Users commonly look at month-to-date and prior year month-to-date. Assuming today is February 4, 2019, we can represent the month-to-date aggregation as DateRangeKey=20190204004, where the first eight digits (20190204) are the ending date of the range and the last three digits (004) are the duration of the date range in days. Prior year month-to-date would be 20180204004. Year-to-date through February 4, 2019 would be 20190204035. So we import a hidden DimDateRange table into the Tabular model which has the following columns:
| DateRangeKey | MinDateKey | MaxDateKey |
| --- | --- | --- |
| 20190204004 | 20190201 | 20190204 |
| 20180204004 | 20180201 | 20180204 |
| 20190204035 | 20190101 | 20190204 |
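To illustrate the encoding, here is a small Python sketch of how rows of this hidden DimDateRange table could be generated (the real table was built upstream in SQL, not Python; the function names are mine):

```python
from datetime import date, timedelta

def date_range_key(end: date, duration_days: int) -> int:
    """Encode a date range as YYYYMMDD * 1000 + duration in days,
    e.g. MTD through Feb 4, 2019 -> 20190204004."""
    return int(end.strftime("%Y%m%d")) * 1000 + duration_days

def dim_date_range_row(end: date, duration_days: int):
    """One (DateRangeKey, MinDateKey, MaxDateKey) row of DimDateRange."""
    start = end - timedelta(days=duration_days - 1)
    return (
        date_range_key(end, duration_days),
        int(start.strftime("%Y%m%d")),
        int(end.strftime("%Y%m%d")),
    )

# MTD through Feb 4, 2019 (4 days)
print(dim_date_range_row(date(2019, 2, 4), 4))   # (20190204004, 20190201, 20190204)
# YTD through Feb 4, 2019 (35 days)
print(dim_date_range_row(date(2019, 2, 4), 35))  # (20190204035, 20190101, 20190204)
```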
How do we determine in DAX whether we have pre-aggregated the date range the user requested? I will show a full implementation later, but the following DAX snippet returns the DateRangeKey if we have it and returns BLANK() if we don’t have it pre-aggregated.
LOOKUPVALUE (
    DateRange[DateRangeKey],
    DateRange[MinDateKey], MIN ( 'Date'[DateKey] ),
    DateRange[MaxDateKey], MAX ( 'Date'[DateKey] ),
    DateRange[DateRangeKey], MAX ( 'Date'[DateKey] ) * 1000 + COUNTROWS ( 'Date' )
)
Products are quite a bit harder. We decided to define each Product dimension rollup as a set of the ProductKeys that are in it. We called this the ProductAggSignature. Next, we looked for a way to represent that signature with some inexpensive DAX functions like SUM, MIN, MAX, and COUNT. For example, the first row represents ProductKeys 1, 3, 5, 7, and 9. The second row represents ProductKeys 1, 4, 5, 6, and 9.
| ProductAggSignatureKey | CountProductKey | MinProductKey | MaxProductKey | SumProductKey | SumExponentModProductKey |
| --- | --- | --- | --- | --- | --- |
| 998 | 5 | 1 | 9 | 25 | 176 |
| 999 | 5 | 1 | 9 | 25 | 192 |
To define each of those columns:
ProductAggSignatureKey – an identity column
CountProductKey – the row count of DimProduct which is included in the agg
MinProductKey – the lowest ProductKey included in the agg
MaxProductKey – the highest ProductKey included in the agg
SumProductKey – the sum of the ProductKey values included in the agg
SumExponentModProductKey – the sum of a new column or calculation in the Product table which is: (ProductKey * ProductKey * 100) % 77
That last column is certainly odd, but as you can see in the example above, it was necessary to distinguish the first and second rows.
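The signature columns are cheap to compute. Here is a small Python sketch (function names are mine) that reproduces the two example rows above and shows why the extra modulo column earns its keep:

```python
def exponent_mod(product_key: int) -> int:
    """The extra discriminator column: (ProductKey^2 * 100) mod 77."""
    return (product_key * product_key * 100) % 77

def product_agg_signature(product_keys):
    """The five signature values (Count, Min, Max, Sum, SumExponentMod)
    for a set of ProductKeys in a rollup."""
    keys = set(product_keys)
    return (
        len(keys),
        min(keys),
        max(keys),
        sum(keys),
        sum(exponent_mod(k) for k in keys),
    )

# The two example rollups collide on Count/Min/Max/Sum...
print(product_agg_signature([1, 3, 5, 7, 9]))  # (5, 1, 9, 25, 176)
print(product_agg_signature([1, 4, 5, 6, 9]))  # (5, 1, 9, 25, 192)
# ...but SumExponentMod (176 vs. 192) tells them apart.
```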
After gathering business requirements and studying the product rollups and date rollups used in several key dashboards, we chose the rollups to pre-aggregate in Azure SQL DW.
How do we determine at query time in DAX whether we have pre-aggregated the product dimension rollup the user requested? I will show a full implementation later, but the following DAX snippet returns the ProductAggSignatureKey if we have it and returns BLANK() if we don’t have it pre-aggregated.
LOOKUPVALUE (
    ProductAggSignature[ProductAggSignatureKey],
    ProductAggSignature[CountProductKey], COUNTROWS ( 'Product' ),
    ProductAggSignature[MinProductKey], MIN ( 'Product'[ProductKey] ),
    ProductAggSignature[MaxProductKey], MAX ( 'Product'[ProductKey] ),
    ProductAggSignature[SumProductKey], SUM ( 'Product'[ProductKey] ),
    ProductAggSignature[SumExponentModProductKey], SUM ( 'Product'[ExponentModProductKey] )
)
Now that we’ve explained the DateRangeKey and the ProductAggSignatureKey, the new agg table is pretty straightforward.
| DateRangeKey | ProductAggSignatureKey | StoreKey | StoreDayCount |
| --- | --- | --- | --- |
| 20190204004 | 998 | 101 | 4 |
| 20190204004 | 998 | 102 | 4 |
In this scenario, we have pre-aggregated MTD February 4, 2019 for the product rollup described above (ProductAggSignatureKey 998). We store one row per StoreKey in the rollup. The StoreDayCount is the number of days that store had sales of a product in that product rollup. Therefore, I can sum 4 + 4 to get the correct "Store Day Count" number if I ask for this exact date range and product rollup.
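Here is a toy Python sketch of the query-time decision, using the two agg rows above. The `store_keys` parameter stands in for the Store filter context (which the real model applies through a relationship), and the fallback callable stands in for the slow DISTINCTCOUNT measure:

```python
# The exact match agg rows:
# (DateRangeKey, ProductAggSignatureKey, StoreKey) -> StoreDayCount
agg = {
    (20190204004, 998, 101): 4,
    (20190204004, 998, 102): 4,
}

def store_day_count(date_range_key, product_sig_key, store_keys, slow_fallback):
    """If both lookup keys resolved (i.e. the requested date range and
    product rollup were pre-aggregated), SUM the pre-counted rows for the
    stores in filter context; otherwise fall back to the slow path."""
    if date_range_key is None or product_sig_key is None:
        return slow_fallback()
    return sum(
        agg.get((date_range_key, product_sig_key, s), 0) for s in store_keys
    )

# Exact match: MTD Feb 4, 2019, rollup 998, both stores -> 4 + 4 = 8
print(store_day_count(20190204004, 998, [101, 102], lambda: -1))  # 8
# No matching date range pre-aggregated -> fallback fires
print(store_day_count(None, 998, [101], lambda: -1))  # -1
```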
The following diagram shows the tables and relationships in the Tabular model. The yellow highlighted tables and columns are the new ones added to implement this exact match agg optimization.
How do we do that in DAX? If both the DateRangeKey and ProductAggSignatureKey LOOKUPVALUE functions return a number, then we return SUM([StoreDayCount]) from rows matching those keys. Otherwise, we fall back to our regular slow DISTINCTCOUNT measure. Here's the full DAX measure:
Store Day Count (Optimized) :=
VAR DateRangeKey =
    LOOKUPVALUE (
        DateRange[DateRangeKey],
        DateRange[MinDateKey], MIN ( 'Date'[DateKey] ),
        DateRange[MaxDateKey], MAX ( 'Date'[DateKey] ),
        DateRange[DateRangeKey], MAX ( 'Date'[DateKey] ) * 1000 + COUNTROWS ( 'Date' )
    )
RETURN
    IF (
        NOT ( ISBLANK ( DateRangeKey ) ),
        VAR ProductAggSignatureKey =
            LOOKUPVALUE (
                ProductAggSignature[ProductAggSignatureKey],
                ProductAggSignature[CountProductKey], COUNTROWS ( 'Product' ),
                ProductAggSignature[MinProductKey], MIN ( 'Product'[ProductKey] ),
                ProductAggSignature[MaxProductKey], MAX ( 'Product'[ProductKey] ),
                ProductAggSignature[SumProductKey], SUM ( 'Product'[ProductKey] ),
                ProductAggSignature[SumExponentModProductKey], SUM ( 'Product'[ExponentModProductKey] )
            )
        RETURN
            IF (
                NOT ( ISBLANK ( ProductAggSignatureKey ) ),
                CALCULATE (
                    SUM ( StoreDayExactMatchAgg[StoreDayCount] ),
                    StoreDayExactMatchAgg[DateRangeKey] = DateRangeKey,
                    StoreDayExactMatchAgg[ProductAggSignatureKey] = ProductAggSignatureKey
                ),
                [Slow DISTINCTCOUNT Measure]
            ),
        [Slow DISTINCTCOUNT Measure]
    )
For this sample code, I created a new "Store Day Count (Optimized)" measure in order to easily compare the performance of the slow approach and the optimized approach. In a real model, I would have renamed the existing "Store Day Count" measure, hidden it, and replaced the "Store Day Count" measure with the DAX shown above.
Why Did I Not Use the Typical DAX Approach?
If I had been able to use Power BI agg tables, then there would have been no additional DAX changes necessary for Power BI to use the agg tables when possible. (The “typical DAX approach” in Power BI is no additional DAX.) However, we couldn’t use Power BI aggs for reasons stated above.
In SSAS Tabular, Marco Russo describes in this video (around the 1:02:00 mark) the typical DAX approach to detecting which rollups the user chose. Typically, you use ISFILTERED or ISCROSSFILTERED. The DAX query optimizer handles this very efficiently, and it works well when you are detecting one fairly simple filter context like "did I filter by product subcategory?" However, when you are detecting thousands of different grain rollups, that approach becomes impractical to code and maintain. That's why we chose to perform a LOOKUPVALUE against the DateRange and ProductAggSignature tables to make this determination.
How Much Faster Is It?
For the real client scenario, the distinct count got about 23x faster. Reports that looked at large date ranges like YTD got approximately 4x faster (realizing we were only optimizing the distinct count denominator of calculations, not the numerator) and we were able to get reports to render under 10 seconds. Reports that looked at one day or one week were approximately 1.4x faster.
For the sample I published to GitHub (see below), the core distinct count measure became 210x faster and reporting queries that contained sales amounts and year-over-year comparisons became approximately 24x faster. Obviously, your results will vary depending on your data and hardware, but clearly this is an order of magnitude faster performance.
I like that we can deploy this optimization such that it is completely transparent to the user since the DAX code automatically falls back to the slower distinct count if we don’t have the rollup they want pre-aggregated. It’s complex to implement, but on large data volumes where interactive distinct count query performance is needed, it’s definitely an option to consider.
Try It Out and Take the Challenge
Realizing this is a rather complex solution, I published the code here on GitHub with sample data. It's only 400 million rows, not 18 billion, but it should be enough to demonstrate a bit of a performance issue and show how the solution is an order of magnitude faster. Please give it a try and use the Issues tab for any questions you have.
I would also like to throw down the gauntlet and challenge you to come up with a better solution! If you can make the solution faster or less complex to implement, I would love to hear it. Start a thread on the Issues tab to share your approach. Certainly, further optimizations of the solution are welcome. Thinking out-of-the-box, like changing the technologies involved, is also welcome.