Cube Processing

ProcessAdd Examples

The concept of incremental processing is well documented by T.K. Anand's fabulous whitepaper on processing (which you should probably digest before continuing), but there is a severe lack of full working example XMLA commands. The purpose of this post is to provide full working examples of ProcessAdd against Adventure Works.

1. ProcessAdd Partition (download ProcessAdd Partition.xmla)
This example uses an out-of-line query binding to run a ProcessAdd on the 2004 partition of the Internet Sales measure group. Real-life use for such an XMLA command would require that you specify an out-of-line SQL query which retrieves only rows that do not currently exist in the cube. No such rows were available in AdventureWorksDW, so for illustration purposes I specified a query which adds duplicate rows to the partition. This XMLA command can be created in Management Studio by right-clicking on a partition, choosing Process, changing the process option to Process Incremental, clicking Configure, setting a query, and then clicking the Script button.
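
The scripted command looks roughly like the sketch below. It is trimmed for orientation only: the object IDs assume the standard Adventure Works sample database and the query is a placeholder, so refer to the downloadable file for the exact working command.

<Batch xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <Parallel>
    <Process xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
      <Object>
        <DatabaseID>Adventure Works DW</DatabaseID>
        <CubeID>Adventure Works</CubeID>
        <MeasureGroupID>Fact Internet Sales 1</MeasureGroupID>
        <PartitionID>Internet_Sales_2004</PartitionID>
      </Object>
      <Type>ProcessAdd</Type>
      <Bindings>
        <Binding>
          <DatabaseID>Adventure Works DW</DatabaseID>
          <CubeID>Adventure Works</CubeID>
          <MeasureGroupID>Fact Internet Sales 1</MeasureGroupID>
          <PartitionID>Internet_Sales_2004</PartitionID>
          <Source xsi:type="QueryBinding">
            <DataSourceID>Adventure Works DW</DataSourceID>
            <QueryDefinition>SELECT ... FROM dbo.FactInternetSales WHERE ...</QueryDefinition>
          </Source>
        </Binding>
      </Bindings>
    </Process>
  </Parallel>
</Batch>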

2. ProcessAdd Dimension (download ProcessAdd Dimension.xmla and ProcessAdd Dimension 2008.xmla)
This example uses an out-of-line data source view to run a ProcessAdd on the Sales Territory dimension. Because dimensions can be built from multiple tables in the DSV, you cannot use an out-of-line query binding like the one in example #1; you have to use an out-of-line DSV instead. Real-life use for such an XMLA command would entail specifying a named query in the out-of-line DSV which returns only rows that do not yet exist in the Analysis Services dimension. Rows which already exist will be ignored, so it is wasteful to let your query return them. No such rows were available in AdventureWorksDW, so for illustration purposes I specified a query which returns rows that already exist in the dimension. With the query used by the example out-of-line DSV in “ProcessAdd Dimension.xmla”, the Sales Territory dimension will not actually get any extra members. The example in “ProcessAdd Dimension 2008.xmla”, however, uses a query that adds Texas to the Sales Territory dimension.

Update: Note that the default error configuration changed in AS2008 so that duplicate key errors are no longer ignored during processing. Because of this design change, if you created your dimension in the development tools for AS2008, you have to override the ErrorConfiguration in the Batch in which you’re doing ProcessAdd on dimensions, unless the rows you’re adding contain completely new values for every attribute in the dimension. In the “ProcessAdd Dimension 2008.xmla” example, we are adding a new region “Texas” to an existing country “United States”. Because “United States” already exists, we had to override the ErrorConfiguration and set KeyDuplicate to IgnoreError.
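
For illustration, the override is simply an ErrorConfiguration element added to the Batch that wraps the ProcessAdd command, along these lines (a trimmed fragment; the downloadable file contains the full command):

<ErrorConfiguration>
  <KeyDuplicate>IgnoreError</KeyDuplicate>
</ErrorConfiguration>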

There is no UI for creating this XMLA command in Management Studio, so I recommend the following steps if you wish to build your own ProcessAdd command for a dimension. A trimmed skeleton of the finished command follows the steps.

  1. Right-click on the dimension in Management Studio, choose Process, change to ProcessUpdate, and click the Script button. Change ProcessUpdate to ProcessAdd in the XMLA script.
  2. In Business Intelligence Development Studio (BIDS), right-click in Solution Explorer on the Data Source object that currently provides the connection for processing this dimension, and choose View Code. Copy that XML. Paste it into the <Process> tag. Remove all the properties of the DataSource except for ID, Name, and ConnectionString. Note that even if you're not changing the connection string, it is still necessary to include an out-of-line data source with the out-of-line DSV. This is a known issue that will hopefully be fixed sometime after SP2.
  3. In BIDS, open the data source view from which your dimension is currently processed. Right-click on the table(s) your dimension currently uses, and choose "Replace Table... With New Named Query..." Then edit the query to specify the appropriate WHERE clause. Do not save the DSV changes. Right-click on the .dsv file in Solution Explorer and choose View Code. Copy the entire XML and paste it into the <Process> tag. Remove the CreatedTimestamp and LastSchemaUpdate properties (to avoid a warning message). There are other large sections (such as the Annotations) you can remove if you like, but they will not hurt anything.
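
Putting the pieces together, the finished command has roughly the shape sketched below. This is heavily trimmed: the object IDs and connection string are illustrative, and the pasted DataSource and DataSourceView sections are abbreviated, so use the downloadable files for complete, working versions.

<Batch xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <Parallel>
    <Process xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
      <Object>
        <DatabaseID>Adventure Works DW</DatabaseID>
        <DimensionID>Dim Sales Territory</DimensionID>
      </Object>
      <Type>ProcessAdd</Type>
      <DataSource xsi:type="RelationalDataSource">
        <!-- out-of-line data source pasted from step 2: keep only ID, Name, and ConnectionString -->
        <ID>Adventure Works DW</ID>
        <Name>Adventure Works DW</Name>
        <ConnectionString>Provider=SQLNCLI.1;Data Source=localhost;Integrated Security=SSPI;Initial Catalog=AdventureWorksDW</ConnectionString>
      </DataSource>
      <DataSourceView>
        <!-- out-of-line DSV pasted from step 3, with the dimension table
             replaced by a named query that returns only the new rows -->
      </DataSourceView>
    </Process>
  </Parallel>
</Batch>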

Note that if you have more than one <Process> command, the out-of-line DSV must contain the tables required by every dimension or partition being processed that normally uses that DSV. Here is such an example: Process Multiple Objects With Out-of-Line DSV.xmla

Normal processing involves Analysis Services querying a SQL database for the necessary data. Analysis Services also supports push mode processing, where the data itself is provided in the XMLA command. Integration Services uses push mode processing whenever you use a Partition or Dimension destination.

3. Push Mode ProcessAdd for Partition (download Push Mode ProcessAdd for Partition.xmla)
This example shows how to process a partition using push mode by specifying an out-of-line rowset. Generally push mode processing is done using Integration Services. If you need to do it without the aid of SSIS, I still recommend you build a simple package in SSIS, then run Profiler while you run the SSIS package. The XMLA command you will see in Profiler will not include the rowset schema or the actual data, but it will get you started. Note that in SSIS, when you're using a Partition or Dimension destination, if a column is used twice (such as in the key and the name), you need to duplicate that column and put it in the pipeline twice. This is only a restriction of the UI in SSIS, not a restriction of push mode processing.

4. Push Mode ProcessAdd for Dimension (download Push Mode ProcessAdd for Dimension.xmla)
This example shows how to process a dimension using push mode by specifying an out-of-line rowset. See above for more information on push mode processing.

Miscellaneous Notes

  • If a dimension or partition is unprocessed, you can't do ProcessAdd on it. So make sure your processing procedures check this and do a ProcessFull the first time.
  • The MemoryUsage class of ASSP uses push mode processing to load a cube incrementally without reading from a SQL database.
  • This post goes hand-in-hand with another post which examines the performance of ProcessAdd.
  • Push mode processing is only allowed on partitions or dimensions which use one table from the DSV. So that eliminates snowflaked dimensions or measure groups with reference dimensions.

ProcessAdd on Large Dimensions

ProcessFull, ProcessUpdate, and ProcessAdd are the three processing options for keeping data fresh in dimensions in Analysis Services 2005. This post takes a look at these options as they relate to processing very large dimensions.


Background

When considering the different possibilities for a nightly Analysis Services processing strategy, we usually recommend the following evolution, which attempts to weigh maintenance complexity against processing performance. The first processing scheme is low complexity; the later schemes add complexity in order to improve performance.

Processing Scheme #1: ProcessFull entire database
If doing a ProcessFull on the entire database completes within the allowed time, then this is obviously the most straightforward method of keeping your cubes fresh. If ProcessFull takes too long, then...
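
In XMLA this is about as simple as processing gets; a minimal sketch (with an illustrative database ID) follows.

<Process xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <Object>
    <DatabaseID>MyWarehouse</DatabaseID>
  </Object>
  <Type>ProcessFull</Type>
</Process>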

Processing Scheme #2: ProcessUpdate dimensions and process fewest partitions possible
Next we recommend they look into partitioning their measure groups such that only a small subset of the partitions has to be processed every night. For instance, if you partition by month, you may be able to process just the current month's partition nightly. Since not all partitions will be reprocessed nightly, dimension processing must be done so as not to unprocess the existing partitions. The easiest way to accomplish this is to ProcessUpdate all dimensions, then ProcessData the partitions that have changed data (i.e. the current month), then ProcessDefault each measure group. (ProcessDefault will build aggregations and indexes on the partitions you just processed. In addition, it will rebuild the flexible aggregations which were dropped on old partitions during the ProcessUpdate of dimensions.) This strategy is of moderate complexity and requires custom coding to process your cube database. If this strategy isn't fast enough, there may be an opportunity to optimize the processing of huge dimensions...
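
As a rough sketch (all object IDs are made up for illustration, and you would list every dimension and every changed partition), a nightly batch for this scheme might look like the following; each Parallel group runs after the previous one completes.

<Batch xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <Parallel>
    <Process>
      <Object>
        <DatabaseID>MyWarehouse</DatabaseID>
        <DimensionID>Dim Customer</DimensionID>
      </Object>
      <Type>ProcessUpdate</Type>
    </Process>
    <!-- ...one Process command per dimension... -->
  </Parallel>
  <Parallel>
    <Process>
      <Object>
        <DatabaseID>MyWarehouse</DatabaseID>
        <CubeID>Sales</CubeID>
        <MeasureGroupID>Internet Sales</MeasureGroupID>
        <PartitionID>Internet Sales Current Month</PartitionID>
      </Object>
      <Type>ProcessData</Type>
    </Process>
  </Parallel>
  <Parallel>
    <Process>
      <Object>
        <DatabaseID>MyWarehouse</DatabaseID>
        <CubeID>Sales</CubeID>
        <MeasureGroupID>Internet Sales</MeasureGroupID>
      </Object>
      <Type>ProcessDefault</Type>
    </Process>
  </Parallel>
</Batch>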

Processing Scheme #3: ProcessAdd dimensions instead of ProcessUpdate
Next we recommend looking into whether your huge dimensions allow updates or just inserts in your SQL data warehouse. If you can guarantee there will be only inserts and no updates, then you can speed up the processing of these huge dimensions by doing a ProcessAdd on them. This option adds more complexity on top of #2, so the purpose of this post is to discuss whether this option is worth it.
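
The only change from scheme #2 is the Type on the Process command for the huge dimension, though in practice you pair it with an out-of-line query or DSV that returns only the newly inserted rows; full working examples are in the companion post above. A minimal sketch with illustrative IDs:

<Process xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <Object>
    <DatabaseID>MyWarehouse</DatabaseID>
    <DimensionID>Large Dimension</DimensionID>
  </Object>
  <Type>ProcessAdd</Type>
</Process>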

We have previously encountered a massive dimension which adds 20 million rows per month. At the time, it was being maintained using processing scheme #2, but the nightly processing performance needed to be tuned further. Since you can't partition dimensions in Analysis Services 2005, ProcessAdd is the option you're left with. I did the following research to decide whether it was worth the added complexity to recommend processing scheme #3. We ended up deciding it was worth it, but in order to do it, we had to move the columns that allowed updates out of the large dimension and into a smaller, separate dimension. This seemed a reasonable compromise.

This post goes hand-in-hand with another post which gives full working XMLA examples of various ProcessAdd commands.


Setup for the Performance Test

I created a simple table:

create table LargeDimension (
ID int not null primary key clustered
,NumberAttribute int not null
,StringAttribute varchar(10) not null
,Subcategory varchar(10) not null
,Category varchar(10) not null
)

And built an Analysis Services database with nothing but the following dimension:

[Attribute lattice picture of the dimension, created with BIDS Helper]

The only column that has high cardinality is the key (which is the ID column in the table and is called Large Dimension in the attribute lattice picture above). Every other column has under 100 distinct values.


The Performance Test

The tests were done on an x64 server with 16GB of RAM (plenty for this test). I did two flavors of each test. For the first (the "Ordered" column below) I went with the defaults: all dimension attributes had AttributeHierarchyOptimizedState=FullyOptimized and AttributeHierarchyOrdered=true. For the second (the "Unordered" column below) I set AttributeHierarchyOrdered=false just on the key attribute. (I did try some other combinations, but they didn't make a significant difference in terms of dimension processing time. The AttributeHierarchyOptimizedState setting will be more significant if there are non-key attributes which are high cardinality. It will also impact the performance of partition processing, but that's outside the scope of this post.)

Test Case Description                                              | Ordered (mm:ss) | Unordered (mm:ss)
A. ProcessFull 10 million rows                                     | 1:30            | 1:06
B. ProcessUpdate 10 million rows with zero rows changed            | 1:46            | 1:30
C. ProcessUpdate 10 million rows with 10% of rows changed          | 2:01            | 1:28
D. ProcessAdd one new row to the existing 10 million rows          | 0:54            | 0:21
E. ProcessAdd 1 million new rows to the existing 10 million rows   | 1:06            | 0:41
F. ProcessAdd 90 million new rows to the existing 10 million rows  | 17:36           | 12:09
G. ProcessFull 100 million rows                                    | 19:04           | 13:23
H. ProcessAdd one new row to the existing 100 million rows         | 11:32           | 5:04
I. ProcessAdd 10 million new rows to the existing 100 million rows | 22:50           | 17:08
J. ProcessUpdate 100 million rows with no rows changed             | 34:15           | 29:02
K. ProcessUpdate 100 million rows with 10% of rows changed         | 34:24           | 29:08


Conclusions

  • It is clear that ProcessAdd is worth the extra effort if you need to further optimize ProcessUpdate dimension processing.
  • It is clear that AttributeHierarchyOrdered=false on high cardinality attributes makes a significant difference to dimension processing time. Test case H shows that AttributeHierarchyOrdered=false on a high cardinality attribute can cut ProcessAdd time by more than half (from 11:32 to 5:04).
  • ProcessUpdate is by far the most expensive option available. It is so much more expensive than ProcessFull that you should definitely verify whether processing scheme #1 actually performs slower than processing scheme #2 on your dataset.

Other Notes

  • Test case H unordered took about 5 minutes, which is almost exactly the time it took to duplicate the entire SSAS dimension directory to the same drive via copy/paste. (Those SSAS dimension files for 100 million rows took up about 8GB on disk.) So a good way to estimate your best-case ProcessAdd performance is to manually duplicate the dimension files to the same drive. This makes sense because during processing, Analysis Services works on a duplicate set of files until it finishes, at which point it swaps in the new files and drops the old ones. This is done so there will be no cube downtime during processing.
  • Chris Webb discusses an alternate way of dealing with large dimensions by breaking them up. If your dimension ID was 123456789 then you could break that into three dimensions with keys 123, 456, and 789. While this makes end user ad-hoc analysis more clumsy, it certainly gets around the problem of large dimensions being expensive to process.
  • I also tried processing the dimension I tested above using the ProcessingGroup=ByTable setting, and the server ran out of memory by the time it had read in 60 million rows. Incidentally, page 68 of the performance guide has some helpful information on deciding when ByTable may perform better than ByAttribute. But bottom line, you won't know until you test it.
  • Test ProcessAdd thoroughly to make sure it is stable on your hardware and dataset. I've had sporadic trouble with it getting into an infinite loop and running out of disk space. If I had to guess, it appears to happen when your dimension can't be loaded into memory. I've reported this bug here.
  • I also researched the possibility of making the large dimension a ROLAP dimension, but I was disappointed with query performance.