Riffing on Raymond - When caches hurt

On Tuesday, Raymond posted "A cache with a bad policy is another name for a memory leak".

Of course that immediately reminded me of a war story from back in the days of Exchange 5.0.  One of the major features of Exchange 5.0 was an NNTP server. 

NNTP (and POP3) are rather interesting protocols.  In general, clients don't tend to stay connected to the server for a long time; instead they tend to connect, authenticate, perform a couple of operations, and terminate the connection.  Many protocols have similar usage patterns (HTTP comes to mind immediately).
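To make that pattern concrete, a typical short-lived NNTP session looks something like this (an illustrative transcript, not captured from a real server; the hostname, credentials, and article numbers are all made up):

    S: 200 news.example.com NNTP service ready
    C: AUTHINFO USER alice
    S: 381 Password required
    C: AUTHINFO PASS hunter2
    S: 281 Authentication accepted
    C: GROUP comp.lang.c
    S: 211 1234 3000234 3001467 comp.lang.c
    C: ARTICLE 3001467
    S: 220 3001467 <42@example.com> Article follows
    S: (...article headers and body, terminated by a ".")
    C: QUIT
    S: 205 Goodbye

The whole exchange takes a handful of round trips, and then the connection is gone.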

Of course, when you're building a server, you need to do performance analysis to determine how fast the server can go (or in the case of an email/news server, how many clients it can support).  For the purposes of performance analysis, we measure requests in terms of "transactions".  For the POP3 or NNTP server, each of these connect/authenticate/operations/terminate sequences is considered a "transaction".  By measuring the amount of time it takes for each of these transactions to execute, you can easily determine the scalability of your server.  If each transaction takes <n> milliseconds, then you can only perform 1000/<n> transactions per second.  Obviously if you can perform <m> transactions at the same time, then the number of transactions per second becomes (1000/<n>)*<m>.  But <m> is typically a relatively small number (if you're CPU bound, then <m> is typically the number of CPUs on your machine, for instance).  Calculating <m> can be quite complicated, and is way out of scope for this post.
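For example, if each transaction takes 50 milliseconds (<n> = 50) and the server can run four of them at a time (<m> = 4), it tops out at (1000/50)*4 = 80 transactions per second.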

Obviously if you increase <m> or decrease <n>, you'll improve your performance, and when doing perf analysis, you do both.  When we started breaking down <n>, we realized that there were a couple of huge items affecting it.  The first was the cost of authentication; the other was the cost of reading information about the newsgroup from the Exchange DS (which was required during the "operations" step).  The combined cost of these two items was about 500 milliseconds of the roughly 510 millisecond total cost per transaction.
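Run those numbers through the formula above: at 510 milliseconds per transaction, each concurrent stream of requests can handle about 1000/510, or roughly 2 transactions per second.  Assuming a cache hit eliminates essentially all of that 500 millisecond cost, the same stream can handle 1000/10 = 100 transactions per second, roughly a fiftyfold improvement.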

There were clear benefits to caching those results, so I built two separate caches, one for the directory lookups and the other for the authentications.  Both worked flawlessly, and our performance jumped dramatically.
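The original code is long gone, but the basic shape of such a cache is simple.  Here's a minimal sketch in modern C++ (the names are mine, not Exchange's, and note what's missing from it, because that matters later):

    // A minimal keyed lookup cache in the spirit of the DS and authentication
    // caches described above (illustrative only; not the Exchange 5.0 code).
    #include <optional>
    #include <string>
    #include <unordered_map>

    template <typename Value>
    class LookupCache {
    public:
        // Returns the cached value on a hit; on a miss the caller performs
        // the expensive operation (DS lookup, authentication) and Insert()s.
        std::optional<Value> Find(const std::string& key) const {
            auto it = m_entries.find(key);
            if (it == m_entries.end()) {
                return std::nullopt;  // miss
            }
            return it->second;        // hit: skip the expensive round trip
        }

        void Insert(const std::string& key, Value value) {
            m_entries.insert_or_assign(key, std::move(value));
        }

    private:
        // Note: no size cap and no eviction policy.  If entries are never
        // evicted (or, as it turned out, never found again), this only grows.
        std::unordered_map<std::string, Value> m_entries;
    };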

Things went great until our testers started running client stress tests against the NNTP server.  All of a sudden, they reported that the NNTP server was leaking memory like crazy.  It was consuming hundreds of megabytes (on a machine with 64M of physical memory), and the machine was essentially unusable.

Not a problem: the Exchange store had extensive memory leak tracking logic, so we just stopped the server to let the leak detection logic do its stuff and tell us where the leak was.

Of course you know what happened: the leak detection logic reported that there were no leaks.  So, being the typical arrogant developer that I am, I went back to the testers and said "There are no leaks, you must be dreaming".  Needless to say, that didn't go over too well, so I dug deeper.

I hooked a debugger up to the process and started poking randomly at allocated memory, and discovered that the DS cache was huge.  I dug a bit deeper and discovered that the problem was buried deep in the DS cache logic.  It turns out that there was an uninitialized variable in the hash function that I was using in my cache, which caused all the entries in the DS cache to land in their own buckets.  Since the hash of a key came out different every time it was computed, lookups never found the existing entries, so every single request added a new entry to the cache.  And because every one of those entries was still reachable from the cache, the leak detection logic was right: nothing had actually leaked.  The cache simply grew without bound.
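I don't have the original source, but the bug was of this general shape (a contrived reconstruction, not the actual Exchange code):

    // A contrived reconstruction of the class of bug described above.  Because
    // "hash" is never initialized, it starts with whatever garbage happens to
    // be on the stack, so the same key hashes differently on every call:
    // lookups never hit, and every insert lands in its own bucket.
    #include <string>

    unsigned HashKey(const std::string& key, unsigned bucketCount) {
        unsigned hash;  // BUG: should be "unsigned hash = 0;"
        for (unsigned char c : key) {
            hash = hash * 31 + c;   // folds each character into the garbage
        }
        return hash % bucketCount;
    }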

So yes, a cache with a bad policy (or a bad hash function) is indistinguishable from a memory leak.