My wife says that if I wait long enough, some of my clothing will come back into fashion. In the data analytics world, apparently we have waited long enough because an old Analysis Services Multidimensional concept called aggregations has come back into fashion. You may have seen announcements that Power BI supports its own flavor of aggregations now. This blog post is not about Power BI aggs. However, I’ll explain why we didn’t use them and then I’ll describe the approach we did take.
The Business Scenario
To provide some context for the real client scenario where we created this exact match agg approach, the client is a retailer and they tend to measure themselves on a per store per day scale as many retailers do. If one region has 50 stores and another region has 80 stores, it’s difficult to compare those two regions head-to-head unless you normalize the sales amounts to the scale of a single store and day. Similarly, if new stores are being opened halfway through the year and other stores are being closed, then it’s difficult to compare apples to apples without normalizing to a per store and per day scale. While this approach is very useful for analytics, it also has a downside since now most every metric must be divided by a distinct count of StoreKey and DateKey. As you know, distinct counts are very expensive relative to sum or count aggregations.
I was presented with this very situation where a distinct count over an 18 billion row table was the denominator for most calculations and performance was terrible. They wanted dashboards to render in under 10 seconds even when looking at sales change year-over-year for the entire company. This blog post summarizes the approach we took to optimize their Analysis Services Tabular model distinct counts to meet those performance expectations.
Could Power BI Aggregations Help?
We chose not to use Power BI aggregations for a few reasons. First, the data was in an Analysis Services Tabular model which was hundreds of GB in size and running on a 128 v-core VM with 4TB of RAM. Unfortunately, Power BI aggs aren’t supported on top of a live connection to SSAS currently. Second, though Power BI does support distinct count aggregations, the agg table must contain the distinct count key and it must be marked as a GroupBy column. In the real client scenario above, the 18 billion row fact table had one row per ProductKey per StoreKey per DateKey. Aggregating it to a product subcategory grain but keeping StoreKey and DateKey in the table only shrunk the table to 4 billion rows. That’s too large to be imported into Power BI at the current time.
We actually did take the approach of creating a product subcategory agg table in the SSAS Tabular model and it did make reports 3x faster. The nice thing about this approach was that 90% of reporting is done at the product subcategory and above so this agg table was hit on most reports. The other nice thing about this approach was that because StoreKey and DateKey were in the agg table, we could slice by any rollup in the Store dimension and any rollup in the Date dimension and hit the agg as long as the report was at the product subcategory level or above. Unfortunately, 3x faster wasn’t nearly fast enough; a distinct count over a 4 billion row table was still much too slow.
How to Make Distinct Counts an Order of Magnitude Faster
I have written in the past about optimizing large distinct counts by separating out the portion of the data which isn’t a distinct count and turning it into a sum. That approach wasn’t applicable to the store-day distinct count in this scenario. But the solution we came up with did take a page out of that book by turning the entire distinct count into a sum. How do you change a distinct count into a sum? By pre-aggregating the data. When I say pre-aggregating, I don’t mean the approach described above where you create a 4 billion row table but still perform the DISTINCTCOUNT at query time. I mean that you perform the distinct count ahead of time so that at query time it is a sum.
Why Call this an Exact Match Agg?
I can pre-aggregate an additive measure like summing up sales amounts to the product subcategory grain and storing it into a FactProductSubCategorySalesAgg table. When the user queries by product category, I can simply further summarize the FactProductSubCategorySalesAgg numbers to answer the query. However, if I’m dealing with distinct counts then that doesn’t work. You can’t rollup a subcategory agg to the category level because a distinct count is nonadditive. So distinct count agg tables (which truly pre-aggregate the distinct count rather than store the distinct count key in the table) only help queries which are an exact match. This means two things. First, I have to anticipate ahead of time the common rollups that users will request. Second, I have to find a way to store thousands of different grain rollups and I have to be able to determine at query time in DAX calculations whether I have already pre-aggregated the rollup the user is requesting. Since there are dozens of columns in the Product, Store, and Date dimension which users use in reports, when you figure out all the combination of filter grains that are commonly seen, you quickly realize there are thousands of rollup grains.
As a side note, in Power BI aggs each column in the agg table must be either a GroupBy column you grouped by to create the aggregation or a pre-aggregated sum/min/max/count column. Distinct count isn’t in the list. (Currently, you have to mark the distinct count column as “GroupBy” to support distinct count over the agg table. But that leads to a large agg table.) Though it’s possible the Power BI team could add “Distinct Count” to the list in the future, it wouldn’t be very useful since each Power BI agg table represents one grain of pre-aggregated data which can’t be further aggregated up and since typical scenarios require thousands of grains, it immediately becomes untenable to create thousands of distinct count agg tables. So we can’t really use the Power BI agg approach for our exact match pre-aggregated distinct counts. We have to come up with a different approach.
How to Store Thousands of Aggregation Grains
For this agg table we decided to pre-aggregate up the Product dimension and pre-aggregate up the Date dimension but keep the StoreKey in the agg table. So any filter on the Store dimension can hit the agg table (since StoreKey is in the distinct count key) but only pre-aggregated rollups on the Product and Date dimensions can hit the agg table. How do we store thousands of different Product and Date rollups without making a mess of the model?