Optimizing Large Distinct Counts in Analysis Services Tabular

The most glaring hole in the Tabular Performance Guide is the lack of any information about tuning the performance of distinct counts. In fact, I don't believe distinct counts are mentioned even once in the whole whitepaper. (Since I helped author that whitepaper, visualize me hanging my head in shame and hoping this blog post will serve as my redemption.)

Thankfully, others have picked up the slack and published some great research on optimizing distinct count performance. Alex Whittles ran tests on the impact of DefaultSegmentRowCount on distinct counts. Alberto Ferrari wrote a nice whitepaper about different approaches to distinct counts and understanding distinct count query plans, and he also wrote a nice case study about InContact, which suggested that if you never need to report across business units, you can separate business units into multiple SSAS databases to dramatically improve distinct count query performance. Marco Russo reminded us of an important performance improvement for large distinct counts that shipped in SQL2012 SP1 CU9 and SQL2014 RTM CU1.


The Best Way to Optimize Distinct Counts

However, the best way to optimize a distinct count is to change it into a regular count. This blog post documents one way of doing this when many rows contain a value that appears only once in the table for the column you are distinct counting. For example, in the following table we are distinct counting CustomerKey, and customers 99 and 88 each have only one order:

| OrderKey | CustomerKey |
|----------|-------------|
| 1111     | 99          |
| 2222     | 88          |
| 3333     | 77          |
| 4444     | 77          |
| 5555     | 66          |
| 6666     | 66          |

So we could transform this table to the following:

| OrderKey | CustomerKey | CustomerKeyMulti | CustomerKeySingle |
|----------|-------------|------------------|-------------------|
| 1111     | 99          |                  | 1                 |
| 2222     | 88          |                  | 1                 |
| 3333     | 77          | 77               |                   |
| 4444     | 77          | 77               |                   |
| 5555     | 66          | 66               |                   |
| 6666     | 66          | 66               |                   |

Now, instead of DISTINCTCOUNT(FactOrder[CustomerKey]), we can improve performance by computing:

DISTINCTCOUNT(FactOrder[CustomerKeyMulti])
+ SUM(FactOrder[CustomerKeySingle])

That calculation will actually be one too high whenever any single-order rows are present, because a distinct count on a column with nulls (blanks) counts the blank as one of the distinct values. So to fix that bug we need to make the calculation slightly more complex:

DISTINCTCOUNT(FactOrder[CustomerKeyMulti])
+ IF(
   SUM(FactOrder[CustomerKeySingle])>=1,
   SUM(FactOrder[CustomerKeySingle])-1
  )
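
On the six sample rows above, this works out correctly: DISTINCTCOUNT(FactOrder[CustomerKeyMulti]) returns 3 (blank, 77, and 66), SUM(FactOrder[CustomerKeySingle]) returns 2, and 3 + (2 - 1) = 4, which matches the four distinct customers (99, 88, 77, 66). If you want to sanity-check the rewrite against your own data, a quick DAX query along the following lines (a sketch using the table and column names from the example; run it from SQL Server Management Studio or DAX Studio) compares the two expressions side by side:

-- both columns should return the same number
EVALUATE
ROW(
    "Plain", DISTINCTCOUNT(FactOrder[CustomerKey]),
    "Optimized",
        DISTINCTCOUNT(FactOrder[CustomerKeyMulti])
        + IF(
           SUM(FactOrder[CustomerKeySingle])>=1,
           SUM(FactOrder[CustomerKeySingle])-1
          )
)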

How much faster is this optimized version than a plain distinct count? The following tests were done on a 200 million row table on my laptop, which has an Intel i7-4702HQ processor. I ran similar tests on Azure D-series and G-series virtual machines and saw relatively similar numbers. As you can see, if 195 million rows are customers who only placed one order, an 18.5-second distinct count becomes half a second once optimized. As a smaller percentage of rows belong to single-order customers, the improvement after optimization shrinks to the point that the optimization is not worth the data preparation effort.

As with most things in Tabular, column cardinality largely determines performance. Drilling into these numbers in more detail, study the first row, where a distinct count across 200 million rows took 18.5 seconds. Optimized, it took 0.5 seconds, because the optimized measure only had to distinct count 5 million values (200M total customers - 195M single-order customers = 5M distinct values in the CustomerKeyMulti column).

| Rows with Unique Distinct Count Value | CustomerKey Cardinality | Regular Query Time (ms) | CustomerKeyMulti Cardinality | Optimized Query Time (ms) |
|---------------------------------------|-------------------------|-------------------------|------------------------------|---------------------------|
| 195M                                  | 200M                    | 18,480                  | 5M                           | 500                       |
| 160M                                  | 180M                    | 17,093                  | 20M                          | 1,858                     |
| 120M                                  | 160M                    | 13,003                  | 40M                          | 3,528                     |
| 80M                                   | 140M                    | 12,134                  | 60M                          | 5,620                     |
| 40M                                   | 120M                    | 11,632                  | 80M                          | 6,789                     |
| 10M                                   | 105M                    | 10,939                  | 95M                          | 9,523                     |


Data Preparation for this Optimization

The ETL to calculate these two new columns is fairly expensive, but depending on your data, the resulting Analysis Services query performance gain may be worth it. The most straightforward approach to building the Multi and Single columns is:

select OrderKey
,CustomerKey
-- populated only for customers appearing on more than one row:
,CustomerKeyMulti = case when CustomerKeyCount > 1 then CustomerKey end
-- 1 only for customers appearing on exactly one row:
,CustomerKeySingle = case when CustomerKeyCount = 1 then 1 end
from (
 select OrderKey
 ,CustomerKey
 ,CustomerKeyCount = count(*) over (partition by CustomerKey)
 from FactOrder
) f

You could also split your single fact table into two fact tables to potentially optimize even further. I briefly tested this, and at least on my particular dataset this optimization didn't make a significant difference, but your mileage may vary. If you don't have any degenerate dimension columns, you can split the one optimized table into the following two tables (a sketch of the adjusted measure follows them):

| OrderKey | CustomerKey | CustomerKeyMulti |
|----------|-------------|------------------|
| 3333     | 77          | 77               |
| 4444     | 77          | 77               |
| 5555     | 66          | 66               |
| 6666     | 66          | 66               |

| OrderKey | CustomerKey | CustomerKeySingle |
|----------|-------------|-------------------|
| 1111     | 99          | 1                 |
| 2222     | 88          | 1                 |
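With the split model, the measure changes shape slightly: the Multi table no longer contains any blank values of CustomerKeyMulti, so the -1 correction shown earlier is not needed. A minimal sketch, assuming the two tables are named FactOrderMulti and FactOrderSingle (hypothetical names) and that both relate to the same dimension tables so the two halves filter consistently:

-- FactOrderMulti and FactOrderSingle are hypothetical table names
DISTINCTCOUNT(FactOrderMulti[CustomerKeyMulti])
+ SUM(FactOrderSingle[CustomerKeySingle])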

Conclusion

Why perform an expensive distinct count if it’s not needed? If you can transform even a portion of your distinct count into a regular count, your query performance gain inside Analysis Services Tabular may be tremendous.