Adventure Works bashing is quite a fun sport. It’s unfortunate Adventure Works makes such an easy target. In this post I want to pick on how the Adventure Works cube shows N/A for the Statistical Account subtotal.
Obviously it makes no sense to sum together Headcount and Units and Average Unit Price. Setting the Statistical Accounts subtotal to NA does make sense. Adventure Works uses the following code to do this:
// Assignment of constant for Statistical Accounts account member.
// This will allow parent member to be visible by default.
( [Account].[Accounts].&, [Measures].[Amount] ) = "NA" ;
Unfortunately, this is not the best way for several reasons.
Strings Break Aggregations
First, assigning a string will null out any subtotals and grand totals above where you assigned the string. This problem isn’t visible in Adventure Works since IsAggregatable is set to false on the [Account].[Accounts] hierarchy which removes the grand total. But if we do the following assignment on a hierarchy with a grand total, we see the problem:
( [Product].[Product Categories].[Category].[Accessories], [Measures].[Internet Sales Amount] ) = "NA";
You can see that when we assigned a string to the Accessories subtotal, it nulled out the grand total making it disappear. And null grand totals are a major usability problem, as I’ll explain in a second.
The second reason the Adventure Works NA approach is flawed is that it eliminates the sparsity in the cube for the Statistical Accounts subtotal. For example, if we put [Date].[Date] on columns, a bunch of dates with no data other than NA show up. This hurts performance tremendously and makes it more difficult to find meaningful data.
Typically, fixing this problem involves a conditional assignment which ensures empty cells stay empty:
( [Account].[Accounts].&, [Measures].[Amount] ) = IIf(IsEmpty([Measures].[Amount]),null,"NA");
But in Adventure Works, all the Statistical Accounts have a ~ unary operator which means that Statistical Accounts subtotal is always null. So to preserve sparsity and fill in the Statistical Accounts subtotal with NA only where any statistical accounts have data, we would have to do something like:
( [Account].[Accounts].&, [Measures].[Amount] ) =
IIf(IsEmpty(Sum([Account].[Accounts].&.Children, [Measures].[Amount])), null, "NA");
Strings Are Less Efficient
The third reason the Adventure Works NA approach is flawed is the degradation in performance due to the use of strings. In talking with the SSAS team, Marius Dumitru summarized this problem by saying, “strings are larger than other data types, harder to manipulate, introduce data structure indirections and memory copies/allocations.”
Mosha Pasumansky mentioned how strings and sparsity impact performance in the discussion of the Ratio calculation in this old blog post.
Null Grand Totals
Recently some co-workers were building a PivotTable and no matter what they did, they could never get anything more than a blank PivotTable. After a bit of head scratching, I realized what was happening was that the grand total was nulled out, so they were never even able to start the PivotTable properly.
To demonstrate this problem, let’s pretend we want to null out the grand total on the Product Categories hierarchy:
( [Product].[Product Categories].[All Products], [Measures].[Internet Sales Amount] ) = null;
Now if we start a PivotTable and put something other than the Product Categories hierarchy on rows, you’ll see this null grand total prevents us from starting the PivotTable since no rows even show up:
Incidentally, you can tell from the “this will allow parent member to be visible” comment in the code at the top of this blog post that the developers of Adventure Works had this very thing in mind. Since the ~ unary operator left the Statistical Accounts subtotal null at all times, it would never appear, making it impossible to expand.
Of course, your users could check the “Show items with no data on rows” option under PivotTable Options… Display. But you can’t expect your users to know to do that. PivotTables are designed for top down analysis. And if the top is null, your PivotTable is not user friendly.
So how do we avoid all three Adventure Works problems created by assigning strings? And how do we avoid the problem of null subtotals and null grand totals?
You will note Mosha Pasumansky’s approach in the blog post above was to use NULL and a four-part format string like '#.#;;;NA'. The problem with that approach is that it doesn’t work in Excel PivotTables. Excel retrieves the VALUE and the FORMAT_STRING cell properties then does the formatting itself. But the formatting code in Excel ignores the fourth part, unfortunately. (As a reminder, format strings are semicolon separated. The first part is for positive numbers, the second part is for negative numbers, the third part is for zero, and the fourth part is for nulls.) So our solution will have to limit itself to three-part format strings which work in Excel.
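To picture those four parts, here’s roughly what Mosha’s style of assignment looks like in an MDX script (illustrative only):
// four-part format string: positive; negative; zero; null
// the NA in the fourth part only shows up in clients that honor it, which Excel does not
Format_String(this) = "#.#;;;NA";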
My approach of choice looks like the following:
scope( [Account].[Accounts].&, [Measures].[Amount] );
this = IIf(IsEmpty(Sum([Account].[Accounts].&.Children, [Measures].[Amount])),null,0);
Format_String(this) = ";;"; //format zeros as blank
end scope;
It is assigning a zero to a cell if any statistical accounts have data in this slice, otherwise it’s assigning null. Then it’s using a custom format string just for the Statistical Accounts subtotal. The format string is just two semicolons. That three-part format string says to format zero as blank. Since the format string is inside the scope statement, we’re only customizing the format string for the Statistical Accounts subtotal.
If you prefer NA, then you would use the following format string code inside the scope statement:
Format_String(this) = ";;\N\A";
And as discussed above, typically IsEmpty(Sum([Account].[Accounts].&.Children, [Measures].[Amount])) would just be IsEmpty([Measures].[Amount]) but we had to use the more complex expression here because of the ~ unary operators in play.
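Putting it all together, the NA flavor of that scope statement would look something like this:
scope( [Account].[Accounts].&, [Measures].[Amount] );
this = IIf(IsEmpty(Sum([Account].[Accounts].&.Children, [Measures].[Amount])),null,0);
Format_String(this) = ";;\N\A"; //format zeros as NA
end scope;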
This approach solves all the problems we were trying to avoid. It doesn’t null out the grand total. It preserves sparsity in the cube. It avoids using expensive strings. It avoids null subtotals and null grand totals. And it is usable in Excel.
You may have noticed I never post business intelligence industry news on this blog. That was a conscious decision on my part when I started this blog nearly five years ago. Why? Because so many others cover that much better than I ever could dream of.
I never really post news about myself either. Why? Because I doubt you really care, and I’d rather blog about BI. (I have to say it was difficult not to mention how proud I am to have achieved the Analysis Services Maestro designation since, at the moment, there are only 11 in the world and 3 in the United States.)
All that is to say that I’m breaking with tradition to announce that I’m speaking at SQL Rally in Dallas this May. Why? Because I love meeting other SSAS geeks in the area (and hopefully one or two of you are actually reading this blog).
99 Tips for Tuning and Enhancing Analysis Services
All-day preconference session on Wednesday, May 09, 2012
Come and hear Greg Galloway share all his best tips for optimizing performance of Analysis Services and improving the user experience and reach of your cubes. This session is targeted to BI developers responsible for maintaining Analysis Services cubes and to DBAs who are tasked with monitoring and troubleshooting Analysis Services issues. Expect to leave with an actionable checklist of ideas to try on Monday when you return to work, with a deeper grasp of the internals of Analysis Services, with MDX optimization experience, and with best practices that ensure your projects succeed. This preconference session will focus on Multidimensional cubes and MDX but will briefly discuss the advantages and limitations of new Tabular models and DAX in Analysis Services 2012.
Making Heads or Tails of Analysis Services Storage
Regular 75 minute session during the conference May 10-11, 2012
If you are about to start a new Microsoft business intelligence project, you have more Analysis Services choices and flexibility than ever before. Come learn what’s new in Analysis Services 2012 and learn the advantages and limitations of the new Tabular models compared with classic Multidimensional models. In this session we will take a deep dive into storage modes in Analysis Services. We’ll discuss the two storage modes a Multidimensional model can use: MOLAP and ROLAP. We’ll discuss the storage modes in new Tabular models: VertiPaq and DirectQuery. We’ll preview the new columnstore index in SQL Server 2012 and discuss how it impacts Analysis Services. And we’ll take a deep dive into two successful Analysis Services deployments at two Fortune 500 companies and discuss why one chose ROLAP and the other chose MOLAP.
I will do my best to avoid much overlap between those sessions, so I look forward to seeing you in both of them. Register for SQL Rally soon since prices go up after March 14th. The pre-conference session is only $219, which I think is quite a bargain.
Artis is a silver sponsor and will definitely have a big presence at the conference. And my Artis co-worker and fellow MVP Tim Mitchell is pulling double duty in helping organize the conference and speaking on Data Quality Services. (As a side note, I have to applaud how the program committee excluded their own speaking sessions from selection and simply put them up for community vote. Tim and others on the committee are fantastic speakers, so the community rightly chose many of their sessions.)
Most of my blog posts chronicle some lesson learned in the heat of battle at a client site. This blog post is a bit different since the lesson learned came from an Artis internal project we started and completed last year. At Artis, our time entry system feeds into an Artis data warehouse and an Analysis Services cube for reporting. We also support writeback against the cube for project managers to enter forecasts for projects. That way, at the weekly staffing meeting, it’s pretty simple to see that Greg is forecasted to work 95 hours between four projects next week. At that point, preventative measures can be taken… like alerting local emergency rooms that I may be making a visit.
Seriously though, Artis is a great place to work since I get to work with people like Drew Jones and Cindy Liang (who actually implemented this resource planning cube), and I’ve never once gotten close to working 95 hours. I’ve got an MDX query to prove it! But I digress…
Writeback Cube Security
Since forecasting and resource planning is the goal, we use role-based security in the cube to prevent project managers from setting a forecast for a prior week (since it’s already too late to plan resources at that point). A typical scenario in industry where write security would be needed would be making sure each regional manager can only set the budget for his region. Setting up this security was a bit tricky, so I’ll describe it step-by-step.
The first step is to create a ReadAll role. In this case, the ReadAll role grants members access to read the whole cube. I’ve definitely seen a number of clients make mistakes here and grant extra permissions that are not required. Nothing needs to be checked on the General tab (not even Read definition which grants them access to see the metadata in tools like SSMS). On the Cubes tab you need Read and optionally Drillthrough access. If you want a user to be able to read the whole cube from a tool like Excel, that’s all you need to do.
Next, build a separate Writeback role. On the Cubes tab, grant Read/Write, check Process, and (optionally) Drillthrough permissions. I’m frankly not sure why you need Process permissions to perform ROLAP writeback, but it appears you need Process permissions for ROLAP or MOLAP writeback:
In the Data Sources tab, make sure to grant Read permissions on the data source used for writeback. For example, when I created my writeback partition, I used the Greg Test data source to direct that writeback table to live in another database (such as a database with Full recovery model):
Finally, if you want to only let them writeback to certain cells, control this on the Cell Data tab. Since I was prototyping against Adventure Works data, the expression below returns true when the current year is later than 2003. In your cube, the expression might look at a current week flag in the Date dimension or something like that.
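As a rough sketch (the attribute name and key format are assumptions here; in your cube you would reference your own Date dimension), the read/write permission expression might look something like:
// hypothetical write security expression: only allow writeback to years after 2003
[Date].[Calendar Year].CurrentMember.Properties("Key0", TYPED) > 2003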
Am I Seriously Recommending Cell Security?!?
Wait! Cell security! Won’t that kill performance? Analysis Services best practices definitely say to avoid cell security in favor of dimension data security because cell security is a performance killer. The reason I use it here is that I don’t know of a way to accomplish the writeback security without cell security. The question is whether this particular combination of cell security, designed to act as write security, impacts read performance. I verified this security only impacts the performance of writes with a simple trick. In the read/write permissions box above, I put in the following expression to prove this:
ASSP.FireTraceEventAndReturnValue(99) < 100
Let me explain what that’s doing. This FireTraceEventAndReturnValue function was designed for the very purpose of being able to detect via Profiler when Analysis Services evaluates it. So it simply fires a “User Defined” event in Profiler and then returns the integer you pass into the function. That way, you can watch Profiler and verify cell security isn’t being evaluated during read-only queries. It only appears in Profiler when you query the UPDATEABLE property or when you perform a write. So just building a PivotTable in Excel isn’t slowed down by the above cell security since Excel currently doesn’t query the UPDATEABLE cell property. But I’m getting ahead of myself as I’ll describe the UPDATEABLE cell property below.
Why didn’t cell security impact read performance here? Note that we used two roles. One role grants read permissions which let you read the whole cube. Another role grants write permissions but then says you can only write to recent years. If we had just created the Writeback role, users would not have been able to read 2003 data, and cell security would have impacted read performance, too. By creating two roles, the read and write permissions are managed separately.
As a side note, we have a support case open to fix a bug where attempting to write data to 2003 sometimes causes SSAS to crash. It appears this may happen when the user is in two roles, as described above. Hopefully that bug will be fixed soon. (Update: It looks like it will be fixed in SQL 2008 R2 SP1 CU6.) Coincidentally, the VBA described below provides a workaround to the bug.
Testing Out Writeback in Excel 2010
Now that we’ve setup our role-based security, let’s see what the user experience is in Excel 2010. We start a PivotTable as usual. But since I’m an administrator on my Analysis Services dev server, I need to tweak the connection string to pretend to be a lower privileged user. Go to the Data tab, click Connections, and click Properties for the connection. Then append either of the following snippets to the end of the connection string:
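For example, append something like the following (the role names and user shown are illustrative; use your own):
;Roles=ReadAll,Writeback
;EffectiveUserName=domain\username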
The first pretends to be a member of the ReadAll and the Writeback roles. This works fine as long as the security is hardcoded (i.e. not dynamic security which uses the UserName() function in MDX expressions in the role). The EffectiveUserName option is another option which impersonates a specific user, and I’ve blogged about it before.
Now I’ll build a simple PivotTable then enable What-If analysis. What happens when I edit a 2003 cell and publish my changes? I get this error message:
The error message is at least helpful in describing what went wrong. But it’s not helpful in describing which cell was secured. Also, it’s not helpful that it waited until I was finished entering my data before giving me an error during the Publish. Some immediate feedback would have been nice.
So how do you tell in advance which cells are secured and which cells are updateable? Fortunately, there’s a handy cell property called UPDATEABLE which does just that. Unfortunately, Excel 2010 what-if analysis doesn’t leverage it, so I have proposed this feature. For now we can use a little VBA to roll our own.
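The VBA below builds exactly this kind of query for each changed cell; for a single cell it looks something like the following (the Adventure Works coordinates are just an example):
select from [Adventure Works]
where ( [Date].[Calendar Year].&[2003], [Measures].[Internet Sales Amount] )
cell properties UPDATEABLE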
Using VBA to Query the UPDATEABLE Cell Property
The following VBA code catches the Worksheet’s PivotTableAfterValueChange event. I could have taken a number of approaches, and I’ll discuss the pros and cons of this approach below, but I chose this approach for this prototype because it was the most straightforward. When you type in a new value for a cell in a PivotTable, this event fires. This code borrows the ADO connection from the PivotTable and runs a simple query that retrieves the UPDATEABLE cell property for that cell’s coordinates in the cube. I definitely haven’t explored all the possible values of the UPDATEABLE cell property, so you’ll need to verify this code with your cube. But for the simple example in this blog post, I’ve seen two possible values. If UPDATEABLE is CELL_UPDATE_ENABLED then writeback is allowed and the cell is not secured. If UPDATEABLE is CELL_UPDATE_NOT_ENABLED_SECURE then the cell is secured and will produce an error upon writeback, so we discard that change and popup the error message:
In my mind, that’s a better user experience. You get immediate feedback that tells you which cell is the problem. So without further ado, here’s the code:
Private Sub Worksheet_PivotTableAfterValueChange(ByVal TargetPivotTable As PivotTable, ByVal TargetRange As Range)
On Error GoTo ErrHandler
Application.ScreenUpdating = False ' turns off screen updating

' possible values of the UPDATEABLE cell property
Const MD_MASK_ENABLED = &H0
Const MD_MASK_NOT_ENABLED = &H10000000
Const CELL_UPDATE_ENABLED = &H1
Const CELL_UPDATE_ENABLED_WITH_UPDATE = &H2
Const CELL_UPDATE_NOT_ENABLED_FORMULA = &H10000001
Const CELL_UPDATE_NOT_ENABLED_NONSUM_MEASURE = &H10000002
Const CELL_UPDATE_NOT_ENABLED_NACELL_VIRTUALCUBE = &H10000003
Const CELL_UPDATE_NOT_ENABLED_SECURE = &H10000005
Const CELL_UPDATE_NOT_ENABLED_CALCLEVEL = &H10000006
Const CELL_UPDATE_NOT_ENABLED_CANNOTUPDATE = &H10000007
Const CELL_UPDATE_NOT_ENABLED_INVALIDDIMENSIONTYPE = &H10000009

Dim pcache As PivotCache
Set pcache = TargetPivotTable.PivotCache
If Not pcache.IsConnected Then
    pcache.MakeConnection ' reconnect so we can borrow the ADO connection
End If

Dim oCellset As New ADOMD.Cellset
oCellset.ActiveConnection = pcache.ADOConnection

Dim sError As String
Dim oPivotTableCell As Range
For Each oPivotTableCell In TargetRange.Cells
    Dim oPivotCell As PivotCell
    Set oPivotCell = oPivotTableCell.PivotCell

    ' retrieve the UPDATEABLE cell property for this cell's coordinates in the cube
    Dim sMDX As String
    sMDX = "select from [" & pcache.CommandText & "] WHERE " _
     & oPivotCell.MDX & " CELL PROPERTIES UPDATEABLE"
    oCellset.Source = sMDX
    oCellset.Open

    Dim iUPDATEABLE As Long
    iUPDATEABLE = oCellset(0).Properties("UPDATEABLE")
    oCellset.Close

    If iUPDATEABLE = CELL_UPDATE_ENABLED Then
        ' writeback is allowed on this cell, so leave the change alone
    ElseIf iUPDATEABLE = CELL_UPDATE_NOT_ENABLED_SECURE Then
        ' the cell is secured, so discard the change and note it for the error message
        oPivotCell.DiscardChange
        sError = sError & "Cell " & Replace(oPivotTableCell.Address, "$", "") _
         & " is secured." & vbCrLf
    End If
Next

If sError <> "" Then
    sError = "The following cells do not allow writeback, " _
     & "so their values were discarded." & vbCrLf & vbCrLf & sError
    MsgBox sError, vbExclamation
End If

ErrHandler:
Application.ScreenUpdating = True ' turns on screen updating
End Sub
There is one main downside to this approach. It can slow down the data entry process since, after each cell is changed, we execute one MDX query per cell to retrieve the UPDATEABLE cell property. This is generally fairly snappy; running 100 of these MDX queries took only a second on the local network and several seconds over a slower VPN connection. But when you attempt to update a bunch of secured cells, discarding each change one cell at a time takes a couple of seconds, which adds up. Another downside to consider: if you enter a change on Monday and then save the workbook without publishing those changes, your writeback permissions in the cube may have changed by the time you open the workbook on Tuesday, yet this VBA code doesn’t rerun for changes you’ve already entered but not published.
Another approach would be to catch the PivotTableUpdate event and grab the coordinates of all the cells and run one MDX query to retrieve the UPDATEABLE property for all cells in the PivotTable, caching this information for later when the user updates cells. The reason this approach would be more complex is because the .MDX property of a subtotal PivotCell returns a tuple with one less member than a PivotCell that’s a detail level cell since the All member isn’t mentioned in the subtotal’s MDX. And you can’t easily combine two tuples with different dimensionality. So I went with the more straightforward approach for this prototype.
Note in the VBA above the use of the ADOMD.Cellset object. This requires a reference to any version of Microsoft ActiveX Data Objects (Multi-dimensional), and you can add that reference from within the VBA editor from the Tools… References menu:
What’s He Not Telling Me?
This post focused on role-based security for writeback and how to improve that user experience. I didn’t talk about setting up a writeback partition or whether to choose MOLAP or ROLAP writeback. I didn’t discuss how allocation method impacts writeback performance. I didn’t discuss how to writeback to non-leaf cells. I didn’t discuss the necessity of periodically moving data from the writeback table into the regular fact table. Maybe I’ll blog about those in the future.
I also didn’t discuss the pros and cons of the different architectures we could have chosen. This project lent itself to extending our existing time entry reporting cube with forecast writeback, but we could have also chosen to implement a good budgeting and planning tool like deFacto. That decision process is a blog post for another time.
To summarize, be sure to grant read permissions and write permissions in separate roles if the user should be able to read more of the cube than they can write to. The UPDATEABLE cell property can come in handy if you want to know in advance which cells a user has permissions to update. With a little VBA, it’s pretty straightforward to improve the writeback experience where writeback security is involved by querying the UPDATEABLE cell property.
1. Fixing some VBA formatting/syntax issues that happened when I pasted the code into my blog
2. Updating the status of a hotfix for the SSAS bug I mentioned which will be fixed in SQL 2008 R2 SP1 CU6.
As an homage to my two favorite detectives, Nero Wolfe and Gregory House, here’s an Analysis Services detective story for you. If you enjoy the thrill of the chase, read on. If you take yourself too seriously and just want the facts, skip to the Conclusions section at the end.
Our story begins on a busy summer day in which our hero (me) finds himself covering for a co-worker on vacation. I get a call that queries are running slow on the Analysis Services box, so I go investigate. The first thing I notice is that Task Manager shows zero free memory...
Image 1: Windows Server 2008 Task Manager showing zero free RAM
Chapter 1 (Or How I Discovered Analysis Services Didn’t Want More Memory)
Honestly, my first instinct was that Analysis Services thought it was in a low memory situation, and I suspected that might explain the performance issues. So I decided to investigate what was consuming that memory. You could look at msmdsrv.exe memory usage in Task Manager, but there are times when it doesn’t show the right number. So I checked the “MSOLAP$instance:Memory:Memory Usage KB” perfmon counter, and it told me SSAS was only using 1GB of memory. Considering this was a dedicated SSAS box, and considering LowMemoryLimit was set at 6GB and TotalMemoryLimit was set at 8GB, either SSAS didn’t need the memory, or worse, it did need it but didn’t think there was enough memory available.
Akshai Mirchandani on the SSAS dev team was kind enough to tell me that SSAS uses the GlobalMemoryStatusEx Windows API to determine how much available memory there is at the moment. I wrote a little EXE to pull that number, and I found that it told me there were 8GB available. That number matched the “Memory : Available MBytes” perfmon counter, as it should. I learned the Free number showing in Task Manager just means zeroed out RAM, so it’s nothing to be concerned with as long as there’s still enough Available memory. In short, ignore the “Free” number in Task Manager. The confusion caused by showing Free but not Available in Windows 2008 Task Manager is probably the reason they added Available back into Task Manager in Windows 2008 R2.
Akshai also suggested I double check it was 64-bit SSAS installed since 32-bit SSAS is limited to 2GB memory usage. So I ran the following from an XMLA window in Management Studio, looked for the <Edition> tag in the result, and verified it ended in 64 (i.e. <Edition>Enterprise64</Edition> means 64-bit enterprise edition).
My conclusion was that SSAS knew there was memory available to it and it could have grabbed more memory if it needed it, so it must not have needed the memory.
Chapter 2 (Or How I Discovered the Standby Cache)
Even though there was 8GB available RAM, I still wanted to track down what was consuming the memory such that Windows reported there was 0MB free. Knowing that when SSAS reads a file off disk, it uses a buffered read which is able to cache the file in the Windows System File Cache, I went to check the “Memory : Cache Bytes” perfmon counter. It said the Windows System File Cache was using 3GB currently. Since the cube in question was 30GB, and since I had set the LimitSystemFileCacheSizeMB setting to 3GB to make sure the system file cache didn’t get out of hand, I wasn’t surprised it was using exactly 3GB. (3GB was the value I had chosen on this server after studying usage patterns during querying and processing. This is just a dev server, so it doesn’t have as much memory as it really needs to perform optimally. For a 30GB cube, we would usually recommend at least 30GB of RAM. But for this dev server, we have to make do with what we have.)
Recapping, SSAS was using 1GB, and the system file cache was using 3GB. The server had 12GB total. Do the math. Something else is using 8GB of memory on this dedicated SSAS box. There weren’t any processes in Task Manager that were using significant chunks of RAM.
At this point, the Task Manager number saying Cached was about 10GB gave me my next hint. I began researching Windows 2008 caching and made one of my most significant discoveries of 2010. The discovery was this: there’s a standby cache. There were several articles that helped me figure this out, but one of the key articles was the Too Much Cache blog post which mentions the standby cache briefly. Let me explain how it works. Let’s say you let your active system file cache grow to 4GB, then you set the LimitSystemFileCacheSizeMB setting to 3GB. That extra 1GB doesn’t get zeroed; it just gets recategorized as “Standby Cache” rather than “Active Cache” and is now categorized as “available” so it can be quickly zeroed and given to other processes if needed. Or if the active system file cache needs it again, it can be quickly soft faulted back into active system file cache.
Why did I say this was such an important discovery? As Somak noted in the Too Much Cache blog post, grabbing a page off disk takes about 10 milliseconds, whereas grabbing a page out of RAM takes about 10 nanoseconds… that’s 1 million times faster. So obviously the system file cache is crucial to SSAS query performance. And accounting for the system file cache is key to repeatability in testing MDX query performance—you will never be able to compare two executions of an MDX query unless you can ensure the system file cache is clear before both executions. I’ll get to how you can clear the active and standby cache in a second.
At this point, I discovered a killer Sysinternals tool called RAMMap which gives you great insight into the system file cache. I ran a few queries against Adventure Works before the following screenshot. Note the Internet Sales 2004 partition is in the active system file cache, as shown on the File Summary tab:
Image 2: RAMMap showing active system file cache
Now I clear the active system file cache using ASSP (which I’ll mention later). Note the cached files just got moved into the standby cache:
Image 3: RAMMap showing standby system file cache
Now if I clear the standby cache using ASSP (which I’ll mention later) or using RAMMap’s “Empty… Empty Standby List” menu, you’ll note those Adventure Works files drop out of the standby cache.
Anyway, back to our detective story. Poor performance on a query had been reported, but I couldn’t reproduce it. I thought I was testing that query on a cold cache because I had cleared the active system file cache and the Analysis Services caches, but I hadn’t accounted for the standby cache. So the query was performing OK for me because many of the files were in the standby cache. Once I cleared the standby cache, I could reproduce the poor performing query, and I ended up building a new aggregation that solved the problem.
The lesson learned was to pay attention to the standby cache (the “Memory : Standby Cache Normal Priority Bytes” perfmon counter) in addition to the active cache (the “Memory : Cache Bytes” perfmon counter).
Chapter 3 (Or How ASSP Got Its Groove Back and Cleared the Standby Cache)
Being able to use the “Empty… Empty Standby List” menu item in RAMMap is nice, but wrapping that Windows API call into an Analysis Services stored procedure lets you clear the standby cache right from a Management Studio query window (or with an automated testing tool). So I added this functionality to the ASSP project. Shout out to Tim Mangan and Marco Russo who both helped track down that Windows API call.
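For example, once the updated ASSP is installed, a single call from an MDX window clears both the active and standby system file cache:
call ASSP.ClearFileSystemCache()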
About a year and a half ago I added some code into the ASSP project to clear the system file cache. I thought it would help MDX query test repeatability, but it strangely had little impact. Now I know why: it was only clearing the active system file cache (which simply shuffled those pages into the standby cache). Now ASSP has its groove back because it clears both the active and standby cache with the ClearFileSystemCache sproc. So if you want to test on a true cold cache without rebooting the server, the best practice is to run the ClearAllCaches sproc (which internally clears the Analysis Services cache with the ClearCache sproc and clears the active and standby system file cache with the ClearFileSystemCache sproc), then run your MDX query, all from an MDX window in Management Studio:
call ASSP.ClearAllCaches()
<your MDX query here>
Chapter 4 (Or How I Stand on my Soapbox and Condemn Paging)
During the discussion above, I have only been speaking of memory in terms of physical RAM (rather than virtual memory). I can hear Marco Russo chiding me and whispering in my ear, “Programs request virtual memory from Windows, and Windows allocates RAM or pages memory to disk as it sees fit.” While he is right, physical RAM is all that’s relevant in practice. If there’s much paging going on, performance will be terrible. And this brings me to a favorite soapbox.
I can’t count the number of times I’ve seen a new client’s server in the following state. The server has SQL and SSAS on the same box. Both are set to the default memory limits (which is no memory cap for SQL and 80% of server memory for SSAS). By the time I see the server, SQL and SSAS are fighting each other for memory, causing each other to page out, and performance degrades dramatically. (Remember what I said about disk being a million times slower than RAM?)
What’s the solution? Cap all the large consumers of memory such that they don’t step on each others’ toes. That’s an easy principle, but it gets very difficult when you have to go through the testing iterations to prove your server performs better overall with memory limits for SQL and SSAS. For example, I have observed one particular stored procedure take 12 hours to complete if SQL was capped at 6GB of memory, but that same stored procedure would complete in 3 hours if SQL was capped at 14GB of memory. The point is that you have to thoroughly test your memory limit changes. But in my experience, a server with a well-tested memory usage plan will perform better and more consistently because it largely avoids paging.
An example memory usage plan on a 24GB RAM server with SQL and SSAS might be similar to the following. SQL will be capped at 10GB using the “maximum server memory” setting. SSAS will be capped at 8GB using TotalMemoryLimit/HardMemoryLimit settings (in combination with LowMemoryLimit). The active system file cache will be capped at 4GB using the LimitSystemFileCacheSizeMB setting. And we’ll assume the OS will use about 1GB, and programs like SSMS or BIDS will use about 1GB. Note the standby system file cache doesn’t need to be figured into your memory usage plan because that memory is categorized by Windows as Available.
|Memory Consumer ||GB|
|SQL Server ||10|
|SSAS ||8|
|Active System File Cache ||4|
|Operating System ||1|
|Programs like SSMS and BIDS ||1|
|Total ||24|
Even better, if you determine that during the ETL, SQL needs more memory and SSAS needs less, it is easy to build a SQL script and an XMLA script to programmatically change those memory limits before and after the ETL.
Another best practice to consider is using the “lock pages in memory” setting to prevent paging. This KB article describes how to enable this setting for SQL Server. For Analysis Services, once you’ve set that group policy, use the PreAllocate setting to tell Analysis Services to grab a chunk of RAM that won’t be paged out. (Even without “lock pages in memory” PreAllocate helps SSAS memory allocation performance on Windows 2003. On Windows 2008, the main benefit of PreAllocate is to lock pages in memory.) Just be aware (and don’t panic) that Analysis Services will take quite a while to start and will pretty much lock up the server during that period.
Conclusion (Or How I Summarize Seven Pages In One Paragraph)
Ignore the “Free” number in Task Manager and focus on the “Memory : Available MBytes” perfmon counter. Be aware of the active system file cache (the “Memory : Cache Bytes” perfmon counter) and the standby system file cache (the “Memory : Standby Cache Normal Priority Bytes” perfmon counter) as they both impact MDX query performance drastically. Use RAMMap to see what’s in the system file cache. To test MDX query performance on a cold cache, use ASSP.ClearFileSystemCache. Avoid paging by capping the memory usage of large consumers of memory and by using the “lock pages in memory” group policy. Test before and after.
It's common to implement role-based security in your Analysis Services cube. That way Fred can only see the sales from the east region, and Frank can only see sales from the west region. Role-based security requires that connections to your cube identify the end user so the cube can secure itself. One way of tackling this is to implement Kerberos so that the credentials of the end user pass from the end user's laptop, to the report server, to the cube.
But what if your end user's laptop is on another company's domain? Or what if your end user's laptop is on a home workgroup? That laptop can't start a Kerberos ticket that will be valid on your domain. One option is to throw ISA Server 2006 into this mix, use it to present the user with a login form over HTTPS, then let ISA start the Kerberos ticket:
For either of the above diagrams, you would setup your data source in SSRS as follows:
We would definitely recommend the above approach, but we realize that ISA and Kerberos add a good deal of complexity to your infrastructure. So the point of this blog post is to mention a good trick for getting role-based security without requiring your end users' laptops to be able to start a Kerberos ticket.
Before I mention the main trick that's the focus of this post, I should mention that enabling basic auth over HTTPS in SSRS is one possibility to solve this problem. It's not as good of a user experience as the single sign-on login form you can customize with ISA Server, but it works.
A Checkbox in SSRS Many People Misunderstand
The following setup of a data source in SSRS will allow you to pass the end user's credentials to the cube without having a Kerberos ticket start at the end user's laptop, and without having ISA Server.
Because you stored credentials, SSRS will open the connection to SSAS using those domain\SSASAdminUser credentials. Because you checked the "Impersonate the authenticated user after a connection has been made to the data source" checkbox, behind the scenes SSRS is going to append the EffectiveUserName property into the connection string, specifying the end user who is running the report. The EffectiveUserName property tells SSAS to pretend to be the specified user so that the results of MDX queries use the role-based security setup for that end user.
Only SSAS administrators can use the EffectiveUserName property, so the domain\SSASAdminUser must be an SSAS admin. The other requirement is that you must setup Kerberos on the SSAS server. To be clear, you need to setup the SPN on the SSAS server and you need to setup the AD configurations for delegation, but you don't need to setup Kerberos on any other servers besides the SSAS server. So this is a fairly quick way to get up and running without having your whole network setup for Kerberos.
Since you're using a service account which is an SSAS admin, carefully validate that you've setup the data source correctly to use EffectiveUserName impersonation. You wouldn't want to show your users more data than they're allowed to see just because you forgot to check a box!
What happens when an end user tries to create a new subscription on a report with the above data source setup? You get this annoying error message: "Subscriptions cannot be created because the credentials used to run the report are not stored, or if a linked report, the link is no longer valid." It says the credentials have to be stored. Uh. They are. But it's just complaining that we've checked the "impersonate the authenticated user" checkbox. My personal opinion is that this is an oversight. Feel free to vote for this Connect suggestion.
So we have to go for a workaround. The best workaround I could find is to use an expression-based connection string. So edit your report in BIDS, edit the data source, click the fx (expression) button, and enter the following expression-based connection string:
="Data Source=SSAS;Initial Catalog=YourDB;EffectiveUserName=" & User!UserID
After you deploy it, check the data source has stored credentials. Don't check the "Impersonate the authenticated user" checkbox.
Now users can create their own subscriptions. When the subscriptions run, the user receives report data according to his role-based security.
Now with your expression-based connection string, you try to create a data-driven subscription. And you get this error message: "Data-driven subscriptions cannot be created because the credentials used to run the report are not stored, the report is using user-defined parameter values, or if a linked report, the link is no longer valid." I believe it's complaining about the use of User!UserID.
You need to create a separate report for data-driven subscriptions. The expression-based connection string in the report needs to be changed to:
="Data Source=SSAS;Initial Catalog=YourDB;EffectiveUserName=" & Parameters!User.Value
Then you need to create a hidden parameter called User. Then the data-driven subscription needs to pass in the appropriate domain\username into that parameter.
Make sure end users don't have access to this report designed for data-driven subscriptions because if they figure out how to set the User parameter, they could run the report and see someone else's data.
The Dark Side of Expression-Based Connection Strings
Note expression-based connection strings can't be used in shared data sources. So you have to manage this one report at a time. Plus you can't even set the expression-based connection string from Report Manager. So you have to set the expression-based connection string in the report at development time, and this makes it a pain to deploy the same report to dev and prod without changing the RDL. Also, once you have changed your report to use an expression-based connection string, you can't open the query designer. So expression-based connection strings really get in the way of the proper development, deployment, and manageability experience.
So you might consider developing your reports with regular shared data sources. Then after you deploy the reports to the server, you can run .NET code like the RSExpressionBasedConnectionString sample code I worked up. (You'll need to request a login to download that sample code.) Once you've got the sample code, edit the app.config pointing it at a folder in SSRS. It loops through all the reports in that folder, finds SSAS data sources, and changes them all to expression-based connection strings. You'll obviously need to tweak the code a good deal to have it do exactly what you want in your environment, but hopefully that's a start. I might know a few good consultants if you happen to need help customizing this for your needs. :-)
Teo Lachev also suggested to me that you could use an RDCE to accomplish the same thing as that .NET code above. But just like with the .NET code above which has to run after the report has been deployed, you'd have to set each report to use the RDCE after it is deployed. Neither option is ideal, but both would work.
Knowing about that "Impersonate the authenticated user" checkbox is a handy trick to have in your bag. Knowing that it's using EffectiveUserName under the covers is very helpful because it lets you use an expression-based connection string to enable subscriptions. But expression-based connection strings are almost more pain than they are worth. So you might consider automating the setup of those expression-based connection strings with some code like what I have mentioned above.
If I were the supreme leader of the Excel team (and I've had trouble convincing them of that), one of the first things I would do is add tooltips in the PivotTable field list so that when you mouse over a field, it would describe that field. I've asked for that feature, but so far, it doesn't look like it will make Excel 2010.
So... plan B. But let me first give some background on self-documenting cubes.
I'm big on self-documenting code because I've had too many experiences where separate documentation gets stale and irrelevant. Plus, I'm lazy. Thus, anything that makes it easier for me to keep documentation up-to-date and relevant is a good thing in my book.
Having descriptive names for measures and dimensions is one aspect of self-documenting cubes. You'll see descriptive calculation names like "Internet Sales Amount Variance to Annual Quota" in my cubes. But what I really want to focus on in this post is the Description property which appears on measures, dimensions, and all other objects in Analysis Services.
A few years ago at a user group, I was speaking with one prominent member of the Analysis Services team (who shall remain unnamed), and when I mentioned the Description property, he asked incredulously whether anybody ever filled those in. If you're wondering the same thing, the reason I fill in every single one is because it's a cinch to generate documentation that's relevant and up-to-date. You can use various tools that are out there. Or you can easily write a simple custom .NET app which loops through measures and dimension attributes and writes the descriptions to an HTML document. Or you can make use of descriptions with the following idea.
Since we don't get tooltips on the field list in PivotTables for now, what's plan B? I'll credit my co-worker Drew Jones with this idea. He suggested using actions in cubes to approximate this as much as possible. Here's what we came up with.
Right-click on any numeric cell in a PivotTable, and you can see the first 100 characters of the description property of that measure under Additional Actions...
Here's how you do that. First of all, you'll need the latest version of the ASSP assembly which includes the DiscoverSingleValue functions. Then you'll need to create an action in your cube with the following setup:
|Target type ||Cells|
|Target object ||All cells|
|Condition || |
, "<MEASURE_NAME>" + [Measures].CurrentMember.Name + "</MEASURE_NAME>"
+ "<CUBE_NAME>" + [Measures].CurrentMember.Properties("CUBE_NAME") + "</CUBE_NAME>"
|Action Content Type ||URL|
|Action Expression ||"http://intranet/Documentation/CubeDocumentation.html"|
|Caption || |
, "<MEASURE_NAME>" + [Measures].CurrentMember.Name + "</MEASURE_NAME>"
+ "<CUBE_NAME>" + [Measures].CurrentMember.Properties("CUBE_NAME") + "</CUBE_NAME>"
, "<MEASURE_NAME>" + [Measures].CurrentMember.Name + "</MEASURE_NAME>"
+ "<CUBE_NAME>" + [Measures].CurrentMember.Properties("CUBE_NAME") + "</CUBE_NAME>"
|Caption is MDX ||True|
The action looks like the following in BIDS when you're done:
The only downside is that Excel 2007 doesn't even ask for actions on cells any time there is a multi-select in a PivotTable filter. But other than that limitation, it's a pretty good workaround.
I have been collecting a few advanced tidbits about semi-additive measures over the last few months, and it's time to share them.
Role-Playing Time Dimensions
Semi-additive aggregation functions like LastNonEmpty work just like Sum on all dimensions except for the Time dimension. How does it determine which is "the Time dimension"? It looks for the dimension marked Type=Time:
But what happens when, like most cubes, you have multiple role-playing versions of that dimension in your cube? Which does it choose? Flip to the Dimension Usage tab in BIDS. For the column representing that measure group with the semi-additive measure, find the role-playing date dimensions. I've circled them in green below. Then find the topmost one that's connected to the measure group. I've circled that in orange below. That's the semi-additive dimension for this measure group. A LastNonEmpty measure in this measure group will look for the most recent day with data in that Date dimension.
If you discover that the wrong dimension is the semi-additive dimension, it's very easy to fix. You just need to make the correct dimension the topmost connected role-playing date dimension by reordering the cube dimensions. To reorder the cube dimensions, flip to the Cube Structure tab, look at the list of cube dimensions in the bottom left, then drag and drop them into the correct order, similar to the following:
I suggest you arrange the cube so that the semi-additive dimension is the same for every measure group. This is probably a topic for another post, but I'm a big proponent of having a dimension simply called Date which is connected to every single measure group as the most common meaning of date. This dimension should be the semi-additive dimension.
I was warned that there have been some code changes in the Analysis Services product in this area at some point. So I tested this behavior in AS2005 RTM, SP1, SP2, SP3, and AS2008 RTM. I found it worked as I described above in all those versions. But test your cube thoroughly.
On a related note, the BIDS Helper Validate Aggs feature understands semi-additive measures and role-playing dimensions correctly. If you’ve got a measure group with only semi-additive measures, and if there are any aggs that don’t include the granularity attribute of the correct semi-additive dimension, then it warns you that those aggs will not be used in most situations.
Performance of LastNonEmpty vs. LastChild
On my largest cube, we decided to go with LastChild over LastNonEmpty. The driving reason was that LastNonEmpty hit more partitions than it needed to. To be specific, if we asked for a LastNonEmpty measure for 2008, it would hit every partition in 2008, not just the latest. LastChild only hits the latest partition. I would like to credit Robert Skoglund of Microsoft Consulting Services with making this observation and suggesting the solution.
The only hurdle was that the current month/quarter/year weren't complete, so the LastChild didn't have any data. That was easy to fix with an addition to the MDX script like the following:
// the most recent day with data
CREATE HIDDEN SET [Yesterday]
AS Exists([Date].[Calendar].[Date].Members, [Date].[Yesterday Flag].&[True]);
// The Inventory measures are AggregateFunction=LastChild.
// When we are in a current period, use the most recent day to
// get the inventory counts.
SCOPE(Exists([Date].[Calendar].[Month].Members, [Date].[Yesterday Flag].&[True]));
this = [Yesterday].Item(0).Item(0);
END SCOPE;
SCOPE(Exists([Date].[Calendar].[Quarter].Members, [Date].[Yesterday Flag].&[True]));
this = [Yesterday].Item(0).Item(0);
END SCOPE;
SCOPE(Exists([Date].[Calendar].[Year].Members, [Date].[Yesterday Flag].&[True]));
this = [Yesterday].Item(0).Item(0);
END SCOPE;
[Date].[Calendar].[All] = [Yesterday].Item(0).Item(0);
Another reason for going with LastChild over LastNonEmpty is that the results are often more intuitive in situations where not every single item is snapshotted every single period. Teo Lachev explained this problem very well in his post Last Non Empty Affairs. Depending on the business requirements, you could workaround the problem as he did with MDX that looks at each Customer, or you could change the ETL to snapshot every single item every period, or you could use LastChild to ensure that your results come from a consistent time period. The LastChild trick won't work in every situation, but it is a handy trick to have in your bag.
We have chosen to use relative paths in Integration Services on several projects now, and since it has worked well, it is time to share this tip with the community. The whole purpose of using relative paths is to make packages more portable, as opposed to packages and dtsConfig files having to live in a specific directory structure or drive letter on servers and developers’ laptops. As evidenced by the volume of posts in the forums about problems with deploying SSIS packages, anything that makes packages more portable will make deployment less prone to error.
When I say “relative paths in SSIS” I mean the following:
- .dtsx files and .dtsConfig files are all inside one directory
- Execute Package tasks refer to the package to execute using a File connection manager, and that connection manager refers to the .dtsx file with a relative path (i.e. just the filename)
- Package Configurations are of type “XML configuration file” and the path to the .dtsConfig file is just a relative path (i.e. just the filename)
- dtexec executes packages from disk, not from SQL Server or from the SSIS package store
- When dtexec is run, the working directory has to be set to the directory with all your files. This is accomplished by running dtexec from a .bat file which uses the CD command to set the working directory.
- When designing packages, Visual Studio must be started with the appropriate working directory which is the directory with all the packages. This can be accomplished by double clicking the .sln file if it is in the same directory as the packages. Or you can write a .bat file that runs the CD command to change the working directory before it starts Visual Studio.
When setting up an Execute Package task, mimic the following screenshots to use a relative path:
When setting up a package configuration, use the relative path:
Unfortunately, in SSIS 2008, this Package Configuration Wizard screen does not allow you to enter a plain relative path due to a bug. However, there is a BIDS Helper feature which allows you to workaround this bug.
Setting Working Directory
An example .bat file which runs dtexec with the correct working directory would look like the following. The first line changes to the drive of the working directory. The second line actually changes to that directory. The third line runs dtexec.
D:
cd "D:\MyPackages"
dtexec /FILE "master.dtsx" /MAXCONCURRENT " -1 " /CHECKPOINTING OFF /REPORTING EW
When setting up a SQL Agent job step, you cannot use the “SQL Server Integration Services Package” job step type in a SQL Agent job since it does not let you set the working directory. But you can use the “Operating System (CmdExec)” job step type which lets you run the .bat file. (To go further, some will argue here and here that you should always use the “Operating System (CmdExec)” even when you’re not needing to set the working directory for relative paths.) To run this bat file, set the “command” property to the full path to the bat file in quotes:
An example .bat file which runs Visual Studio with the correct working directory would look like the following. The first line changes to the drive of the working directory. The second line actually changes to that directory. The third line starts Visual Studio and opens the solution.
D:
cd "D:\MyPackages"
"C:\Program Files\Microsoft Visual Studio 8\Common7\IDE\devenv.exe" "D:\MyPackages\test.sln"
If you forget and start Visual Studio from the Start Menu (such that the working directory is not set to your packages directory), the .dtsConfig file it creates for you from your relative path will end up in the “C:\Program Files\Microsoft Visual Studio 8\Common7\IDE” directory.
Adding .dtsConfig Files to Your Visual Studio Solution
You can include .dtsConfig files in your solution in Visual Studio. Move the .dtsConfig out of the directory with your .dtsx files, then in Windows Explorer copy the .dtsConfig file to your clipboard, then click the Miscellaneous folder in Solution Explorer in Visual Studio, then press Ctrl-V to paste it into your SSIS project. The .dtsConfig file is now back in the same directory as your .dtsx files (which is important for relative paths, as discussed above). Your .dtsConfig file can now be checked into source control with your .dtsx files, and it will be readily accessible in Visual Studio. Opening that .dtsConfig by double clicking it, you can pretty-print the XML by pressing Ctrl-K then Ctrl-D. (Or if you have BIDS Helper installed, it will automatically pretty-print it for you upon opening.)
As a side note, I should mention in this context that whenever I’m working with package configurations, I use the Expression and Configuration Highlighter feature of BIDS Helper which gives you a little teal triangle on anything controlled by a package configuration. I always look to see that every connection manager is controlled by an expression or a configuration.
There are several other blog posts I know of (here and here) around this topic, but they suggest solutions that are a bit more intrusive than the one we prefer which I’ve outlined in this post.
This relative paths strategy has saved us some headaches several times when moving the deployment destination for packages to different environments. So I would recommend you consider using the relative paths trick when deciding on the deployment and configuration strategy for your ETL project.
If you’ve regularly found yourself changing the target server property and the data source connection string before you deploy your cube to another environment, it’s simply because you’re not aware of a feature that hasn’t been publicized well. (I’m just judging based on my own ignorance of this feature until recently. However, now that I go look, both this article and Teo Lachev’s book mention Configuration Manager, though they don’t mention the exact list of settings that are configurable.)
Visual Studio Configurations
Visual Studio has a configurations feature so that you can specify settings for different build types. In .NET projects, it’s usually used to compile debug versus release builds. For business intelligence projects, it can be used to setup a configuration per deployment target environment:
What settings can be saved per configuration? Obviously, all the settings in the Project Properties dialog:
What I Didn’t Know…
What I didn’t know until recently was that there are other settings (like the data source connection strings) in the Analysis Services project which can be controlled per configuration. Basically, any setting which is environment specific is “configurable.” I’ll enumerate those settings below.
The other tidbit to discuss is where these settings are stored. Some are stored in the .dwproj file and others are stored in the .dwproj.user file. The settings in the dwproj file can be considered “shared” because this file is checked into source control and shared among team members. The settings in the dwproj.user file can be considered “private” as this file is often not checked into source control or shared. The counterargument is that the dwproj.user file holds the TargetServer setting, which is a crucial part of setting up separate Visual Studio configurations per environment, so you may want to consider checking in that file.
The following list enumerates the settings controlled per configuration and where to edit each one, so be sure to review each of them when you set up a new deployment environment.
Where To Edit Each Setting
- Data source connection string: double click the data source… edit the connection string. The connection string is saved with the SQL security username and password removed.
- Data source provider: double click the data source… change the provider.
- Whether the SQL security username and password were removed when the connection string was saved to disk: double click the data source. The usernames and passwords themselves are stored in the .dwproj.user file (see the UserIDs and UserPasswords settings below).
- Cube KeyErrorLogFile: cube properties… Advanced… ErrorConfiguration… (custom)… KeyErrorLogFile.
- Cube StorageLocation: cube properties… Configurable… StorageLocation.
- Dimension KeyErrorLogFile: dimension properties… Advanced… ErrorConfiguration… (custom)… KeyErrorLogFile.
- Measure group KeyErrorLogFile: measure group properties… Advanced… ErrorConfiguration… (custom)… KeyErrorLogFile.
- Mining structure KeyErrorLogFile: mining structure properties… Advanced… ErrorConfiguration… (custom)… KeyErrorLogFile.
- Partition KeyErrorLogFile: partition properties… Advanced… ErrorConfiguration… (custom)… KeyErrorLogFile.
- Remote partition settings: only applicable when using remote partitions.
- Partition StorageLocation: partition properties… Configurable… StorageLocation.
- Report action Path: cube Actions tab… properties on a report action… Path property.
- Report action ReportServer: cube Actions tab… properties on a report action… ReportServer property.
- SQL security usernames (UserIDs): double click the data source… set the SQL security username.
- SQL security passwords (UserPasswords): double click the data source… set the SQL security password.
- Everything else (for example, the target server, target database, and processing option): Project Properties dialog.
Visual Studio configurations are very helpful for SSAS. In fact, I would call using them a best practice.
Configurations vs. Deployment Wizard
When is it appropriate to use Visual Studio configurations, and when is it appropriate to use the Analysis Services Deployment Wizard? I would recommend you always use Visual Studio configurations, and if you’re doing a more advanced deployment where the Deployment Wizard would help, then you can use it, too. (The Deployment Wizard has some advanced deployment options like retaining partitions on the server, or retaining security roles on the server, etc.)
The Deployment Wizard works very well with Visual Studio configurations. When you build or deploy your cube from Visual Studio, it writes files to the bin directory. The .configsettings file in the bin directory contains all your Analysis Services project configurations, and that file drives the defaults on the “Specify Configuration Properties” screen in the Deployment Wizard. If you’ve set up your Visual Studio configurations properly, you’ll have less work to do in the Deployment Wizard.
Visual Studio Configurations for SSIS and SSRS
Integration Services has a project properties dialog that lets you configure settings for debugging and for building the deployment manifest used with the Package Installation Wizard. (And thanks to BIDS Helper, it also has settings for deploying packages directly from BIDS.) Those project properties can be stored per Visual Studio configuration. SSIS projects also have a Data Sources folder in Solution Explorer, and those connection strings are stored per Visual Studio configuration. However, the Data Sources folder is more trouble than it is worth. (See the comments from June 22, 2007 at the bottom of this post of Jamie Thomson’s.) Use package configurations to control connection strings instead; package configurations are unrelated to Visual Studio configurations.
Reporting Services has a project properties dialog that lets you configure settings for deploying reports. These settings can be stored per Visual Studio configuration. However, no other settings are stored per configuration. Unfortunately, the connection strings of the data sources in the Shared Data Sources folder are not controlled per configuration. (But the OverwriteDataSources setting in the project properties dialog keeps this from hurting you much. You just have to manually configure the data source in Report Manager the first time you deploy it, and then it will not be overwritten on subsequent deployments.)
I was privileged to work with Carl Rabeler from the SQLCAT team on a good whitepaper which performance tests several optimization techniques for many-to-many dimensions in Analysis Services. The conclusion was that a technique they call Matrix Relationship Optimization, combined with the appropriate aggs, results in the best MDX query performance. In a nutshell, the idea is that there are recurring patterns or signatures among m2m dimensions. For instance, in the Adventure Works database, every sales order shares one of twelve common sales reason signatures. Query performance can be improved by reusing these shared signatures.
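To make the signature idea concrete, here is a minimal T-SQL sketch against AdventureWorksDW. This is just my illustration, not code from the whitepaper or the sample packages, and names like ReasonSignature and OrderCount are mine; it derives the distinct sales reason signatures from FactInternetSalesReason:
WITH OrderSignature AS (
    -- Build one comma-separated signature of sales reason keys per sales order.
    SELECT f.SalesOrderNumber,
           STUFF((SELECT ',' + CAST(r.SalesReasonKey AS varchar(10))
                  FROM (SELECT DISTINCT SalesReasonKey
                        FROM dbo.FactInternetSalesReason
                        WHERE SalesOrderNumber = f.SalesOrderNumber) r
                  ORDER BY r.SalesReasonKey
                  FOR XML PATH('')), 1, 1, '') AS ReasonSignature
    FROM dbo.FactInternetSalesReason f
    GROUP BY f.SalesOrderNumber
)
-- The distinct signatures become the rows of the matrix relationship table.
-- In Adventure Works this returns the twelve shared signatures mentioned above.
SELECT ReasonSignature, COUNT(*) AS OrderCount
FROM OrderSignature
GROUP BY ReasonSignature
ORDER BY OrderCount DESC;
Each distinct signature then gets a surrogate key, each sales order is tagged with the key of its signature, and the intermediate measure group is keyed on the handful of signatures rather than on the individual sales orders, which is what shrinks the many-to-many join.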
This optimization technique pushes some of the work to the ETL layer. Along with the whitepaper, they distributed sample Integration Services packages to load the matrix relationship tables. Those packages were, I thought, rather ingenious in their use of script transforms to pivot and unpivot the sales reason data and accomplish everything in one data flow task. The packages they distributed are coded to do a full refresh on each run. I created a spinoff of their sample package that does an incremental load, so I thought I would post it in case it helps others get started. Download IncrementalSalesReasonMatrixLoad.zip and run it against a fresh copy of AdventureWorksDW.
I would recommend using the incremental load package if (a) you have the ability to easily detect which rows in the source system have changed since the last successful run and (b) the incremental load will not produce many unused signatures over time.
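Criterion (a) typically boils down to a watermark on a reliable modified date in the source system, after which the load appends only the signatures the matrix table hasn’t seen before. Here is a generic sketch of that pattern; the staging and matrix table names are hypothetical, and this is not how the sample package is implemented:
-- Hypothetical staging table holding signatures recomputed only for the orders
-- that changed since the last successful run (detected via a ModifiedDate watermark).
INSERT INTO dbo.DimSalesReasonMatrix (ReasonSignature)
SELECT DISTINCT s.ReasonSignature
FROM dbo.Staging_ChangedOrderSignatures s
WHERE NOT EXISTS (SELECT 1
                  FROM dbo.DimSalesReasonMatrix m
                  WHERE m.ReasonSignature = s.ReasonSignature);
-- Existing matrix keys stay stable, but signatures no longer referenced by any order
-- are never removed, which is how unused signatures accumulate over time (criterion (b)).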
The whitepaper also discusses aggregations for m2m dimensions in detail. Carl’s experience with real customers suggested that aggs on the intermediate measure group are usually very helpful (i.e. are considerably smaller than the intermediate measure group itself). Thankfully, these aggs can be built as normal since the Query Subcube event in Profiler accurately reports the subcube vector.
On the other hand, he found that in real life, aggs built on the data measure group to help queries against m2m dimensions often end up about the same size as the measure group itself, so they are not helpful. It all just depends on your data. An agg is smaller on disk not only because it has a lower rowcount, but also because it has fewer dimension keys. (It’s not as tall and not as wide.) Therefore, data measure groups with a large number of dimensions unused in queries can benefit from aggs even if the rowcount of the resulting agg alone isn’t much less than the rowcount of the measure group itself. So it is helpful to know how to build m2m aggs on the data measure group. Unfortunately, these aggs must be built by hand using Agg Manager, and they are rather tricky to build. Let me suggest the following steps for building them:
- Turn on Query Logging and set QueryLogSampling to 1.
- Install ASSP and run the following before each MDX query: CALL ASSP.ClearCache()
- Run your MDX queries which use the m2m dimension.
- Turn off Query Logging by setting QueryLogSampling to 0.
- Install BIDS Helper and launch Agg Manager. Right click on the agg design for your data measure group (e.g. Internet Sales) and choose “Add Aggregations from Query Log.” Click the “Execute SQL” button and then click OK after the query returns. Right click on the agg design and choose Edit.
- (Instead of using query logging, you can just run Profiler and manually cut and paste the subcube vector from the Query Subcube event if you prefer.)
- Click the “Eliminate Redundancy” and “Eliminate Duplicates” buttons as usual.
- Click the “Validate Aggregations” button. It should prompt you with a message similar to: “This aggregation contains many-to-many dimension [Sales Reason]. It will not be used unless it also contains [Internet Sales Order Details].[Internet Sales Order]. The many-to-many dimension [Sales Reason] itself should not be included in the aggregation to workaround a bug.” For each agg with such a warning, edit that agg, then double-click the attribute in the m2m dimension which is already selected. This unselects and then attempts to reselect that attribute. At that point, Agg Manager will prompt you with a message similar to: “This aggregation will not be used when querying many-to-many dimension [Sales Reason] unless it also contains [Internet Sales Order Details].[Internet Sales Order]. The many-to-many dimension [Sales Reason] itself should not be included in the aggregation to workaround a bug. Would you like BIDS Helper to fix this for you?” Click Yes.
The other idea around many-to-many dimension optimization that I would like to mention is adding extra dimensions (particularly the time dimension) to the intermediate measure group and partitioning by them. Chris Webb explained this idea, too. To put it in my terms, if there’s a good deal of turnover among m2m signatures over time (i.e. signatures are retired over time), then this is a strategy to consider. In the whitepaper there are only a few tests in which the Matrix Relationship Optimization technique doesn’t perform as well as the other design alternatives. (The best example is on page 33.) I suspect that adding the time dimension to the intermediate measure group will help in those cases (though possibly at the expense of performance on other queries). As for the whitepaper, we agreed this situation is a bit of a fringe scenario because retiring m2m signatures doesn’t happen that frequently in the real world, so we didn’t complicate the whitepaper by mentioning it. However, it is something to be aware of as you’re analyzing your m2m dimension data.
I’m glad to see that Erik Veerman’s m2m optimization ideas (#1 and #2) have now been fully fleshed out and publicized.