Analysis Services and the Case of the Standby Cache

As an homage to my two favorite detectives, Nero Wolfe and Gregory House, here’s an Analysis Services detective story for you. If you enjoy the thrill of the chase, read on. If you take yourself too seriously and just want the facts, skip to the Conclusions section at the end.

Our story begins on a busy summer day in which our hero (me) finds himself covering for a co-worker on vacation. I get a call that queries are running slow on the Analysis Services box, so I go investigate. The first thing I noticed was that Task Manager shows zero free memory...

Image 1: Windows Server 2008 Task Manager showing zero free RAM

Chapter 1 (Or How I Discovered Analysis Services Didn’t Want More Memory)

Honestly, my first instinct was that Analysis Services thought it was in a low memory situation, and I suspected that might explain the performance issues. So I decided to investigate what was consuming that memory. You could look at msmdsrv.exe memory usage in Task Manager, but there are times when it doesn’t show the right number. So I checked the “MSOLAP$instance:Memory:Memory Usage KB” perfmon counter, and it told me SSAS was only using 1GB of memory. Considering this was a dedicated SSAS box, and considering LowMemoryLimit was set at 6GB and TotalMemoryLimit was set at 8GB, either SSAS didn’t need the memory, or worse, it did need it but didn’t think there was enough memory available.

Akshai Mirchandani on the SSAS dev team was kind enough to tell me that SSAS uses the GlobalMemoryStatusEx Windows API to determine how much available memory there is at the moment. I wrote a little EXE to pull that number, and I found that it told me there were 8GB available. That number matched the “Memory : Available MBytes” perfmon counter, as it should. I learned the Free number showing in Task Manager just means zeroed out RAM, so it’s nothing to be concerned with as long as there’s still enough Available memory. In short, ignore the “Free” number in Task Manager. The confusion caused by showing Free but not Available in Windows 2008 Task Manager is probably the reason they added Available back into Task Manager in Windows 2008 R2.

Akshai also suggested I double check it was 64-bit SSAS installed since 32-bit SSAS is limited to 2GB memory usage. So I ran the following from an XMLA window in Management Studio, looked for the <Edition> tag in the result, and verified it ended in 64 (i.e. <Edition>Enterprise64</Edition> means 64-bit enterprise edition).

<Discover xmlns="urn:schemas-microsoft-com:xml-analysis">

My conclusion was that SSAS knew there was memory available to it and it could have grabbed more memory if it needed it, so it must not have needed the memory.

Chapter 2 (Or How I Discovered the Standby Cache)

Even though there was 8GB available RAM, I still wanted to track down what was consuming the memory such that Windows reported there was 0MB free. Knowing that when SSAS reads a file off disk, it uses a buffered read which is able to cache the file in the Windows System File Cache, I went to check the “Memory : Cache Bytes” perfmon counter. It said the Windows System File Cache was using 3GB currently. Since the cube in question was 30GB, and since I had set the LimitSystemFileCacheSizeMB setting to 3GB to make sure the system file cache didn’t get out of hand, I wasn’t surprised it was using exactly 3GB. (3GB was the value I had chosen on this server after studying usage patterns during querying and processing. This is just a dev server, so it doesn’t have as much memory as it really needed to perform optimally. For at 30GB cube, we would recommend at least 30GB of RAM usually. But for this dev server, we have to make due with what we have.)

Recapping, SSAS was using 1GB, and the system file cache was using 3GB. The server had 12GB total. Do the math. Something else is using 8GB of memory on this dedicated SSAS box. There weren’t any processes in Task Manager that were using significant chunks of RAM.

At this point, the Task Manager number saying Cached was about 10GB gave me my next hint. I began researching Windows 2008 caching and made one of my most significant discoveries of 2010. The discovery was this: there’s a standby cache. There were several articles that helped me figure this out, but one of the key articles was the Too Much Cache blog post which mentions the standby cache briefly. Let me explain how it works. Let’s say you let your active system file cache grow to 4GB, then you set the LimitSystemFileCacheSizeMB setting to 3GB. That extra 1GB doesn’t get zeroed; it just gets recategorized as “Standby Cache” rather than “Active Cache” and is now categorized as “available” so it can be quickly zeroed and given to other processes if needed. Or if the active system file cache needs it again, it can be quickly soft faulted back into active system file cache.

Why did I say this was such an important discovery? As Somak noted in the Too Much Cache blog post, grabbing a page off disk takes about 10 milliseconds, whereas grabbing a page out of RAM takes about 10 nanoseconds… that’s 1 million times faster. So obviously the system file cache is crucial to SSAS query performance. And accounting for the system file cache is key to repeatability in testing MDX query performance—you will never be able to compare two executions of an MDX query unless you can ensure the system file cache is clear before both executions. I’ll get to how you can clear the active and standby cache in a second.

At this point, I discovered a killer Sysinternals tool called RAMMap which gives you great insight into the system file cache. I ran a few queries against Adventure Works before the following screenshot. Note the Internet Sales 2004 partition is in the active system file cache, as shown on the File Summary tab:

Image 2: RAMMap showing active system file cache

Now I clear the active system file cache using ASSP (which I’ll mention later). Note the cached files just got moved into the standby cache:

Image 3: RAMMap showing standby system file cache

Now if I clear the standby cache using ASSP (which I’ll mention later) or using RAMMap’s “Empty… Empty Standby List” menu, you’ll note those Adventure Works files drop out of the standby cache.

Anyway, back to our detective story. Poor performance on a query had been reported, but I couldn’t reproduce it. I thought I was testing that query on a cold cache because I had cleared the active system file cache and the Analysis Services caches, but I hadn’t accounted for the standby cache. So the query was performing OK for me because many of the files were in the standby cache. Once I cleared the standby cache, I could reproduce the poor performing query, and I ended up building a new aggregation that solved the problem.

The lesson learned was to pay attention to the standby cache (the “Memory : Standby Cache Normal Priority Bytes” perfmon counter) in addition to the active cache (the “Memory : Cache Bytes” perfmon counter).

Chapter 3 (Or How ASSP Got Its Groove Back and Cleared the Standby Cache)

Being able to use the “Empty… Empty Standby List” menu item in RAMMap is nice, but wrapping that Windows API call into an Analysis Services stored procedure lets you clear the standby cache right from a Management Studio query window (or with an automated testing tool). So I added this functionality to the ASSP project. Shout out to Tim Mangan and Marco Russo who both helped track down that Windows API call.

About a year and a half ago I added some code into the ASSP project to clear the system file cache. I thought it would help MDX query test repeatability, but it strangely had little impact. Now I know why—it was only clearing the active system file cache (which simply shuffled those pages into the standby cache). Now ASSP has its groove back because it clears both the active and standby cache with the ClearFileSystemCache sproc. So if you want to test on a true cold cache without rebooting the server, the best practice is to run run the ClearAllCaches sproc (which internally clears both the Analysis Services cache with the ClearCache sproc and clears the active and standby system file cache with the ClearFileSystemCache sproc), then run your MDX query, all from an MDX window in Management Studio:

call ASSP.ClearAllCaches()
<your MDX query here>

Chapter 4 (Or How I Stand on my Soapbox and Condemn Paging)

During the discussion above, I have only been speaking of memory in terms of physical RAM (rather than virtual memory). I can hear Marco Russo chiding me and whispering in my ear, “Programs request virtual memory from Windows, and Windows allocates RAM or pages memory to disk as it sees fit.” While he is right, physical RAM is all that’s relevant in practice. If there’s much paging going on, performance will be terrible. And this brings me to a favorite soapbox.

I can’t count the number of times I’ve seen a new client’s server in the following state. The server has SQL and SSAS on the same box. Both are set to the default memory limits (which is no memory cap for SQL and 80% of server memory for SSAS). By the time I see the server, SQL and SSAS are fighting each other for memory, causing each other to page out, and performance degrades dramatically. (Remember what I said about disk being a million times slower than RAM?)

What’s the solution? Cap all the large consumers of memory such that they don’t step on each others’ toes. That’s an easy principle, but it gets very difficult when you have to go through the testing iterations to prove your server performs better overall with memory limits for SQL and SSAS. For example, I have observed one particular stored procedure take 12 hours to complete if SQL was capped at 6GB of memory, but that same stored procedure would complete in 3 hours if SQL was capped at 14GB of memory. The point is that you have to thoroughly test your memory limit changes. But in my experience, a server with a well-tested memory usage plan will perform better and more consistently because it largely avoids paging.

An example memory usage plan on a 24GB RAM server with SQL and SSAS might be similar to the following. SQL will be capped at 10GB using the “maximum server memory” setting. SSAS will be capped at 8GB using TotalMemoryLimit/HardMemoryLimit settings (in combination with LowMemoryLimit). The active system file cache will be capped at 4GB using the LimitSystemFileCacheSizeMB setting. And we’ll assume the OS will use about 1GB, and programs like SSMS or BIDS will use about 1GB. Note the standby system file cache doesn’t need to be figured into your memory usage plan because that memory is categorized by Windows as Available.

Memory Consumer GB
SQL 10
Active System File Cache 4
OS 1
Programs like SSMS 1

Even better, if you determine that during the ETL, SQL needs more memory and SSAS needs less, it is easy to build a SQL script and an XMLA script to programmatically change those memory limits before and after the ETL.

Another best practice to consider is using the “lock pages in memory” setting to prevent paging. This KB article describes how to enable this setting for SQL Server. For Analysis Services, once you’ve set that group policy, use the PreAllocate setting to tell Analysis Services to grab a chunk of RAM that won’t be paged out. (Even without “lock pages in memory” PreAllocate helps SSAS memory allocation performance on Windows 2003. On Windows 2008, the main benefit of PreAllocate is to lock pages in memory.) Just be aware (and don’t panic) that Analysis Services will take quite a while to start and will pretty much lock up the server during that period.

Conclusion (Or How I Summarize Seven Pages In One Paragraph)

Ignore the “Free” number in Task Manager and focus on the “Memory : Available MBytes” perfmon counter. Be aware of the active system file cache (the “Memory : Cache Bytes” perfmon counter) and the standby system file cache (the “Memory : Standby Cache Normal Priority Bytes” perfmon counter) as they both impact MDX query performance drastically. Use RAMMap to see what’s in the system file cache. To test MDX query performance on a cold cache, use ASSP.ClearFileSystemCache. Avoid paging by capping the memory usage of large consumers of memory and by using the “lock pages in memory” group policy. Test before and after.