ProcessAdd on Large Dimensions

ProcessFull, ProcessUpdate, and ProcessAdd are the three processing options for keeping dimension data fresh in Analysis Services 2005. This post looks at how each of these options behaves when processing very large dimensions.


Background

When considering the different possibilities for a nightly Analysis Services processing strategy, we usually recommend the following evolution, which weighs maintenance complexity against processing performance. The first processing scheme is low complexity; the later processing schemes add complexity in order to improve performance.

Processing Scheme #1: ProcessFull entire database
If doing a ProcessFull on the entire database completes within the allowed time, then this is obviously the most straightforward method of keeping your cubes fresh. If ProcessFull takes too long, then...
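As a point of reference, a minimal sketch of that command in XMLA looks like the following (the DatabaseID is a placeholder for your own database; in practice you would run this from SQL Server Agent, SSIS, or an SSMS XMLA query window):

<Process xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <!-- Targeting the database object means every dimension, cube, and partition is fully reprocessed -->
  <Object>
    <DatabaseID>MyOlapDatabase</DatabaseID>
  </Object>
  <Type>ProcessFull</Type>
</Process>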

Processing Scheme #2: ProcessUpdate dimensions and process fewest partitions possible
Next we recommend partitioning your measure groups such that only a small subset of the partitions has to be processed every night. For instance, if you partition by month, you may be able to process just the current month's partition nightly. Since not all partitions will be reprocessed nightly, dimension processing must be done in a way that does not unprocess the existing partitions. The easiest way to accomplish this is to ProcessUpdate all dimensions, then ProcessData the partitions that have changed data (i.e. the current month), then ProcessDefault each measure group. (ProcessDefault will process aggs and indexes on the partitions you just processed. In addition, it will reprocess the flexible aggs which were dropped on old partitions during the ProcessUpdate of dimensions.) This strategy is of moderate complexity and requires custom coding to process your cube database. If this strategy isn't fast enough, there may be an opportunity to optimize the processing of huge dimensions...
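To make that sequence concrete, here is a rough XMLA sketch of scheme #2. All IDs are placeholders, the Process elements would be repeated for each dimension and each changed partition, and commands inside a Batch run sequentially in one transaction by default:

<Batch xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <!-- Step 1: ProcessUpdate every dimension (repeat per dimension) -->
  <Process>
    <Object>
      <DatabaseID>MyOlapDatabase</DatabaseID>
      <DimensionID>Large Dimension</DimensionID>
    </Object>
    <Type>ProcessUpdate</Type>
  </Process>
  <!-- Step 2: ProcessData only the partitions whose data changed (e.g. the current month) -->
  <Process>
    <Object>
      <DatabaseID>MyOlapDatabase</DatabaseID>
      <CubeID>My Cube</CubeID>
      <MeasureGroupID>My Measure Group</MeasureGroupID>
      <PartitionID>Current Month</PartitionID>
    </Object>
    <Type>ProcessData</Type>
  </Process>
  <!-- Step 3: ProcessDefault each measure group to build aggs and indexes on the
       partitions processed above and to rebuild any flexible aggs dropped by step 1 -->
  <Process>
    <Object>
      <DatabaseID>MyOlapDatabase</DatabaseID>
      <CubeID>My Cube</CubeID>
      <MeasureGroupID>My Measure Group</MeasureGroupID>
    </Object>
    <Type>ProcessDefault</Type>
  </Process>
</Batch>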

Processing Scheme #3: ProcessAdd dimensions instead of ProcessUpdate
Next we recommend looking into whether your huge dimensions allow updates or only inserts in your SQL data warehouse. If you can guarantee there will be only inserts and no updates, then you can speed up the processing of these huge dimensions by doing a ProcessAdd on them. This option adds more complexity on top of #2, so the purpose of this post is to discuss whether that extra complexity is worth it.
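In its simplest form, a dimension ProcessAdd is just another Process command, sketched below with placeholder IDs. Run this way it still reads the whole source table and adds only the previously unseen members; the more useful variations add an out-of-line binding so that only the new rows are read, and the companion post linked below walks through those full working examples.

<Process xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <Object>
    <DatabaseID>MyOlapDatabase</DatabaseID>
    <DimensionID>Large Dimension</DimensionID>
  </Object>
  <!-- ProcessAdd adds new members without touching existing members, so flexible
       aggregations on existing partitions are not dropped the way ProcessUpdate drops them -->
  <Type>ProcessAdd</Type>
</Process>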

We have previously encountered a massive dimension which adds 20 million rows per month. At the time, it was being maintained using processing scheme #2, but the nightly processing needed to be tuned further. Since you can't partition dimensions in Analysis Services 2005, ProcessAdd is the option you're left with. I did the following research to decide whether it was worth the added complexity to recommend processing scheme #3. We ended up deciding it was worth it, but in order to do it, we had to move the columns of the large dimension that could be updated into a smaller, separate dimension. This seemed a reasonable compromise.

This post goes hand-in-hand with another post which gives full working XMLA examples of various ProcessAdd commands.


Setup for the Performance Test

I created a simple table:

create table LargeDimension (
ID int not null primary key clustered
,NumberAttribute int not null
,StringAttribute varchar(10) not null
,Subcategory varchar(10) not null
,Category varchar(10) not null
)

And built an Analysis Services database with nothing but the following dimension:

[Attribute relationship lattice diagram of the dimension, created with BIDS Helper]

The only column that has high cardinality is the key (which is the ID column in the table and is called Large Dimension in the attribute lattice diagram above). Every other column has under 100 distinct values.


The Performance Test

The tests were done on an x64 server with 16GB of RAM (plenty for this test). I did two flavors of each test. On the first (the "Ordered" column below) I went with the defaults: all dimension attributes had AttributeHierarchyOptimizedState=FullyOptimized and AttributeHierarchyOrdered=true. On the second (the "Unordered" column below) I set AttributeHierarchyOrdered=false just on the key attribute. (I did try some other combinations, but they didn't make a significant difference to dimension processing time. The AttributeHierarchyOptimizedState setting will be more significant if there are high cardinality non-key attributes. It will also impact the performance of partition processing, but that's outside the scope of this post.)
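For reference, both properties live on each attribute in the dimension's ASSL definition (in practice you would set them in the BIDS properties pane rather than hand-editing the XML). A stripped-down sketch of the key attribute as used in the "Unordered" runs might look like this, with KeyColumns and other required elements omitted for brevity:

<Attribute>
  <ID>Large Dimension</ID>
  <Name>Large Dimension</Name>
  <Usage>Key</Usage>
  <!-- The defaults are FullyOptimized and true; the "Unordered" tests changed only the second one -->
  <AttributeHierarchyOptimizedState>FullyOptimized</AttributeHierarchyOptimizedState>
  <AttributeHierarchyOrdered>false</AttributeHierarchyOrdered>
</Attribute>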

Test Case Description                                               Ordered (min:sec)   Unordered (min:sec)
A. ProcessFull 10 million rows                                            1:30                1:06
B. ProcessUpdate 10 million rows with zero rows changed                   1:46                1:30
C. ProcessUpdate 10 million rows with 10% of rows changed                 2:01                1:28
D. ProcessAdd one new row to the existing 10 million rows                 0:54                0:21
E. ProcessAdd 1 million new rows to the existing 10 million rows          1:06                0:41
F. ProcessAdd 90 million new rows to the existing 10 million rows        17:36               12:09
G. ProcessFull 100 million rows                                          19:04               13:23
H. ProcessAdd one new row to the existing 100 million rows               11:32                5:04
I. ProcessAdd 10 million new rows to the existing 100 million rows       22:50               17:08
J. ProcessUpdate 100 million rows with no rows changed                   34:15               29:02
K. ProcessUpdate 100 million rows with 10% of rows changed               34:24               29:08


Conclusions

  • It is clear that ProcessAdd is worth the extra effort if you need to further optimize ProcessUpdate dimension processing.
  • It is clear that AttributeHierarchyOrdered=false on high cardinality attributes makes a significant difference to dimension processing; test case H shows it can more than double the performance of ProcessAdd (11:32 down to 5:04).
  • ProcessUpdate is by far the most expensive option available. It is so much more expensive than ProcessFull that you should definitely verify whether processing scheme #1 actually performs slower than processing scheme #2 on your dataset.

Other Notes

  • Test case H unordered took about 5 minutes, which is almost exactly the time it took to duplicate the entire SSAS dimension directory to the same drive via copy/paste. (Those SSAS dimension files for 100 million rows took up about 8GB on disk.) So a good way to estimate your best-case ProcessAdd performance is to manually duplicate the dimension files to the same drive. This makes sense because during processing, Analysis Services works on a duplicate set of files until it is finished, at which point it swaps in the new files and drops the old files. This is done so there is no cube downtime during processing.
  • Chris Webb discusses an alternative way of dealing with large dimensions by breaking them up. If your dimension ID were 123456789, you could break it into three dimensions with keys 123, 456, and 789, so one huge dimension becomes three dimensions with at most 1,000 members each. While this makes end user ad-hoc analysis more clumsy, it certainly gets around the problem of large dimensions being expensive to process.
  • I also tried processing the dimension tested above using the ProcessingGroup=ByTable setting, and the server ran out of memory by the time it had read in 60 million rows. Incidentally, page 68 of the performance guide has some helpful information on deciding when ByTable may perform better than ByAttribute. The bottom line: you won't know until you test it. (A stripped-down fragment showing where this property lives follows this list.)
  • Test ProcessAdd thoroughly to make sure it is stable on your hardware and dataset. I've had sporadic trouble with it getting into an infinite loop and running out of disk space. If I had to guess, it appears to happen when the dimension can't be loaded into memory. I've reported this bug here.
  • I also researched the possibility of making the large dimension a ROLAP dimension, but I was disappointed with query performance.
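For reference on the ByTable note above, ProcessingGroup is a property of the dimension itself in its ASSL definition (normally changed from the BIDS properties pane). A stripped-down, hypothetical fragment looks like this, with attributes and other required elements omitted:

<Dimension>
  <ID>Large Dimension</ID>
  <Name>Large Dimension</Name>
  <!-- ByAttribute (the default) issues a separate query per attribute; ByTable reads the
       entire source table in one pass and builds all attributes from it, which is what
       ran out of memory in the test above -->
  <ProcessingGroup>ByTable</ProcessingGroup>
</Dimension>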