A cache with a bad policy is another name for a memory leak

A common performance trick is to reduce time spent in the heap manager by caching the last item freed (or maybe the last few) so that a subsequent allocation can just re-use the item rather than having to go make a new one. But you need to be careful how you do this or you can end up making things worse rather than better. Here's an example motivated by an actual problem the Windows performance team researched.

Consider a cache of variable-sized buffers. I will use only a one-entry cache for simplicity. In real life, the cache would be more complicated: People tend to have a deeper cache of four to ten entries, and you would have to ensure that only one thread used the cache at a time; typically this is done by associating the cache with something that has thread affinity. Furthermore, you probably would keep the size of the cached buffer in a member variable instead of calling LocalSize all the time. I've left out all these complications to keep the presentation simple.

class BufferCache {
 BufferCache() : m_pCache(NULL) { }
 ~BufferCache() { LocalFree(m_pCache); }

 void *GetBuffer(SIZE_T cb);
 void ReturnBuffer(void *p);

 void *m_pCache;

If a request for a memory buffer arrives and it can be satisfied from the cache, then the cached buffer is returned. Otherwise, a brand new buffer is allocated.

void *BufferCache::GetBuffer(SIZE_T cb)
 // Satisfy from cache if possible
 if (m_pCache && LocalSize(m_pCache) >= cb) {
  void *p = m_pCache;
  m_pCache = NULL;
  return p;
 return LocalAlloc(LMEM_FIXED, cb);

When a buffer is returned to the cache, we compare it against the item already in the cache and keep the bigger one, since that is more likely to satisfy a GetBuffer in the future. (In the general case of a multiple-entry cache, we would free the smallest entry.)

// Flawed design - see discussion
void BufferCache::ReturnBuffer(void *p)
 SIZE_T cb = LocalSize(p);
 if (!m_pCache || cb > LocalSize(m_pCache)) {
  // Returned buffer is bigger than the cache:
  // Keep the returned buffer
  m_pCache = p;
 } else {
  // Returned buffer is smaller than the cache:
  // Keep the cache

Why is this a flawed design? I'll let you think about this for a while.

No really, I want you to think about it.

Are you thinking? Take your time; I'll be here when you're done.

Okay, since I know you haven't actually thought about it but are just sitting there waiting for me to tell you, I'll give you a bit of a nudge.

The distribution of buffer sizes is rarely uniform. The most common distribution is that small buffers are popular, with larger and larger buffers being required less and less often. Let's write a sample program that allocates and frees memory according to this pattern. To make the bad behavior easier to spot in a short run, I'm going to use a somewhat flat distribution and say that half of the buffers are small, with larger buffers becoming less popular according to exponential decay. In practice, the decay curve is usually much, much steeper.

#include <vector>
#include <iostream>

// Since this is just a quick test, we're going to be sloppy
using namespace std; //  sloppy
int __cdecl main(int argc, char **argv)
 BufferCache b;

 // seeding the random number generator is not important here

 vector<void *> v; // keeps track of allocated memory
 for (;;) {
  // randomly allocate and free
  if (v.size() == 0 || (rand() & 1)) { // allocate
   SIZE_T cb = 100;
   while (cb < 1024 * 1024 && (rand() & 1)) {
    cb *= 2; // exponential decay distribution up to 1MB
   void* p = b.GetBuffer(cb);
   if (p) {
    cout << " A" << LocalSize(p) << "/" << cb;
  } else { // free
   int victim = rand() % v.size(); // choose one at random
   cout << " F" << LocalSize(v[victim]);
   b.ReturnBuffer(v[victim]); // free it
   v[victim] = v.back();

This short program randomly allocates and frees memory from the buffer cache, printing (rather cryptically) the size of the blocks allocated and freed. When memory is allocated, it prints "A1/2" where "1" is the size of the block actually allocated and "2" is the size requested. When freeing memory, it prints "F3" where "3" is the size of the block allocated. Run this program, let it do its thing for maybe ten, fifteen seconds, then pause the output and study it. I'll wait. If you're too lazy to actually compile and run the program, I've included some sample output for you to study:

F102400 A102400/400 F800 F200 A800/100 A200/200 A400/400
A400/400 A200/200 F1600 A1600/100 F100 F800 F25600 A25600/200
F12800 A12800/200 F200 F400 A400/100 F200 A200/100 A200/200
A100/100 F200 F3200 A3200/400 A200/200 F51200 F800 F25600
F1600 F1600 A51200/100 F100 A100/100 F3200 F200 F409600 F100
A409600/400 A100/100 F200 F3200 A3200/800 A400/400 F800 F3200
F200 F12800 A12800/200 A100/100 F200 F25600 F400 F6400
A25600/100 F100 F200 F400 F200 F800 F400 A800/800 A100/100

Still waiting.

Okay, maybe you don't see it. Let's make the effect even more obvious by printing some statistics periodically. Of course, to generate the statistics, we need to keep track of them, so we'll have to remember how big the requested buffer was (which we'll do in the buffer itself):

int __cdecl main(int argc, char **argv)
 BufferCache b;

 // seeding the random number generator is not important here

 vector<void *> v; // keeps track of allocated memory
 SIZE_T cbAlloc = 0, cbNeeded = 0;
 for (int count = 0; ; count++) {
  // randomly allocate and free
  if (v.size() == 0 || (rand() & 1)) { // allocate
   SIZE_T cb = 100;
   while (cb < 1024 * 1024 && !(rand() % 4)) {
    cb *= 2; // exponential decay distribution up to 1MB
   void* p = b.GetBuffer(cb);
   if (p) {
    *(SIZE_T*)p = cb;
    cbAlloc += LocalSize(p);
    cbNeeded += cb;
  } else { // free
   int victim = rand() % v.size(); // choose one at random
   cbAlloc -= LocalSize(v[victim]);
   cbNeeded -= *(SIZE_T*)v[victim];
   b.ReturnBuffer(v[victim]); // free it
   v[victim] = v.back();
  if (count % 100 == 0) {
   cout << count << ": " << v.size() << " buffers, "
        << cbNeeded << "/" << cbAlloc << "="
        << cbNeeded * 100.0 / cbAlloc << "% used" << endl;

This new version keeps track of how many bytes were allocated as opposed to how many were actually needed, and prints a summary of those statistics every hundred allocations. Since I know you aren't actually going to run it yourself, I've run it for you. Here is some sample output:

0: 1 buffers, 400/400=100% used
100: 7 buffers, 4300/106600=4.03377% used
200: 5 buffers, 1800/103800=1.7341% used
300: 19 buffers, 9800/115800=8.46287% used
400: 13 buffers, 5100/114000=4.47368% used
500: 7 buffers, 2500/28100=8.8968% used
37200: 65 buffers, 129000/2097100=6.15135% used
37300: 55 buffers, 18100/2031400=0.891011% used
37400: 35 buffers, 10400/2015800=0.515924% used
37500: 43 buffers, 10700/1869100=0.572468% used
37600: 49 buffers, 17200/1874000=0.917823% used
37700: 75 buffers, 26000/1889900=1.37573% used
37800: 89 buffers, 30300/1903100=1.59214% used
37900: 91 buffers, 29600/1911900=1.5482% used

By this point, the problem should be obvious: We're wasting insane quantities of memory. For example, after step 37900, we've allocated 1.8MB of memory when we needed only 30KB, for a waste of over 98%.

How did we go horribly wrong?

Recall that most of the time, the buffer being allocated is a small buffer, and most of the time, a small buffer is freed. But it's the rare case of a large buffer that messes up everything. The first time a large buffer is requested, it can't come from the cache, since the cache has only small buffers, so it must be allocated. And when it is returned, it is kept, since the cache keeps the largest buffer.

The next allocation comes in, and it's probably one of the common-case small buffers, and it is given the cached buffer—which is big. You're wasting a big buffer on something that needs only 100 bytes. Some time later, another rare big buffer request comes in, and since that other big buffer got wasted on a small allocation, you have to allocate a new big buffer. You allocated two big buffers even though you need only one. Since big buffers are rare, it is unlikely that a big buffer will be given to a caller that actually needs a big buffer; it is much more likely to be given to a caller that needs a small buffer.

Bad effect 1: Big buffers get wasted on small callers.

Notice that once a big buffer enters the system, it is hard to get rid of, since a returned big buffer will be compared against what is likely to be a small buffer, and the small buffer will lose.

Bad effect 2: Big buffers rarely go away.

The only way a big buffer can get freed is if the buffer in the cache is itself already a big buffer. If instead of a one-entry cache like we have here, you keep, say, ten buffers in your buffer cache, then in order to free a big buffer, you have to have eleven consecutive ReturnBuffer calls, all of which pass a big buffer.

Bad effect 3: The more efficient you try to make your cache, the more wasteful it gets!

What's more, when that eleventh call to ReturnBuffer is made with a big buffer, it is only the smallest of the big buffers that gets freed. The biggest buffers stay.

Bad effect 4: When a big buffer does go away, it's only because you are keeping an even bigger buffer!

Corollary: The biggest buffer never gets freed.

What started out as an "obvious" decision in choosing which buffer to keep has turned into a performance disaster. By favoring big buffers, you allowed them to "poison" the cache, and the longer you let the system run, the more allocations end up being big "poisoned" buffers. It doesn't matter how rare those big blocks are; you will eventually end up in this state. It's just a matter of time.

When the performance team tries to explain this problem to people, many of them get the mistaken impression that the problem is merely that there is wasted space in the cache. But look at our example: Our cache has only one entry and we are still wasting over 90% of the memory. That's because the waste is not in the memory being held by the cache, but rather is in the memory that the cache hands out. (It's sort of like that scene in It's a Wonderful Life where George Bailey is explaining where all the money is. It's not in the bank; it's in all the places that got money from the bank.)

My recommendations:

  • Instrument your cache and understand what your program's memory allocation patterns are.

  • Use that information to pick a size cutoff point beyond which you simply will not use the cache at all. This ensures that big buffers never get into the cache in the first place. Choosing this cutoff point is usually extremely easy once you look at then allocation histogram.

  • Although you've taken the big buffers out of the picture, you will still have the problem that the small buffers will gradually grow up to your cutoff size. (I.e., you still have the same problem, just in miniature.) Therefore, if the cache is full, you should just free the most recently returned buffer regardless of its size.

  • Do not use the cached buffer if the waste is too great. You might decide to use multiple "buckets" of cached entries, say one for buffers below 100 bytes, another for buffers between 100 and 200 bytes, and so on. That way, the waste per allocation is never more than 100 bytes.

  • Finally, reinstrument your cache to ensure that you're not suffering from yet some other pathological behavior that I haven't taken into account.

Here's a new ReturnBuffer implementation that takes some of the above advice into account. Instrumentation shows that three quarters of the allocations are in the 100–200 byte range, so let's cap our cache at 200 bytes.

void BufferCache::ReturnBuffer(void *p)
 if (m_pCache == NULL && LocalSize(p) <= 200) {
  m_pCache = p;
 } else {

With this one seemingly-minor change, our efficiency stays above 90% and occasionally even gets close to 100%:

0: 1 buffers, 400/400=100% used
100: 7 buffers, 4300/4400=97.7273% used
200: 5 buffers, 1800/1800=100% used
300: 19 buffers, 9800/9800=100% used
400: 13 buffers, 5100/5100=100% used
500: 7 buffers, 2500/2600=96.1538% used
37200: 65 buffers, 129000/130100=99.1545% used
37300: 55 buffers, 18100/18700=96.7914% used
37400: 35 buffers, 10400/11000=94.5455% used
37500: 43 buffers, 10700/11000=97.2727% used
37600: 49 buffers, 17200/18000=95.5556% used
37700: 75 buffers, 26000/26800=97.0149% used
37800: 89 buffers, 30300/31900=94.9843% used
37900: 91 buffers, 29600/30600=96.732% used

Don't forget to check out performance guru Rico Mariani's reminder that Caching implies Policy. As he explained to me, "Cache policy is everything so you must be dead certain that your policy is working as you intended. A cache with a bad policy is another name for a memory leak."

Comments (37)
  1. Adrian says:

    Adam makes a good point.  Free lists for variably sized buffers doesn’t seem worth it.

    A simpler policy would be to always keep the most recently freed buffer regardless of its size.  The zillions of little allocations are the ones that kill performance, so those are the ones you want to satisfy.

    Trying to keep the largest buffer around for re-use seems pointless, since larger allocations are rare (especially once the program reaches its steady state), and larger allocations don’t tend to fragment the heap as badly.

  2. Andrew says:

    Therefore, if the cache is full, you should

    > just free the most recently returned buffer

    > regardless of its size.

    Why not free the oldest buffer in the cache? Presumably a cache with multiple elements would try to return the buffer that most closely matches the requested size so the oldest buffer is presumably the least desirable size for the current point in time? Or do requested memory allocation sizes typically not exhibit sufficient temporal locality to make the extra logic worthwhile?


    p.s. Typo in "Choosing this cutoff point is usually extremely easy once you look at then allocation histogram."

  3. Axe says:

    Andrew skipped Alex, Alexander, Alexi, etc.

  4. zipping ahead in the alphabet…

    Why not have the BufferCache keep a list of pointers to caches of varying size?  Serve the smallest cached memory that is at least as big as the amount requested.

  5. Nawak says:

    Still in the correct order….

    I never thought about making an ‘allocation’ cache and I am glad I did! (Or should I say ‘I am glad I did not’? Since… well… I did not)

    As Adam said, some problems are better left to the allocation manager…

  6. Not_Brian says:

    One reason for managing your own allocations is to make better use of CPU cache.  This is a very specialized problem, and it’s generally better to add it later.

  7. Orlphar says:

    Another possibility would be to return parts of a large buffered cache.  Ie, if you’ve got a 500 byte buffer cached and you get a request for a 100 byte buffer and a 200 byte buffer, return first the pointer to the start of the 500 byte buffer, and then a pointer 100 bytes into it for the second.

    Again, you’d be essentially writing your own heap manager at this point.

  8. Maurits: "Why not have BufferCache keep a list of pointers to caches of varying size?" Um, that was one of my recommendations.

    Orlphar: The whole point of these caches is to avoid (as much as possible) the overhead of a full heap manager in the first place.

  9. use multiple "buckets" of cached entries

    Ah, so it was.  Never mind then (sheepish grin)

  10. pHA HA HA says:

    im sory raymand but you oughta not comented aftar orlphar. it was al alphaabeticel until tthen. for shame!

    unles its a casesensitve sort? in wich case maurits is teh fly in theh onttment!

  11. pIA HA HA says:

    p.s. totly cool post btw.

  12. Nate says:

    You could also use a slab allocator, like available on FreeBSD and Linux.


    It’s a bad idea to do your own cache because it encourages memory fragmentation and you don’t coalesce or free the cache under memory pressure.  If your response to this is "ok, I’ll have a hook that is called in a low memory situation to free my cache", you don’t get it.

  13. Adam says:

    Hmph. In the end, you’ll just end up with your own heap manager which is not as widely tested as the one that comes with your system.

    If you’ve got a lot of same-sized items, I’d say that tracking free lists *might* be worth looking into if the heap manager on your system has *really* bad performance for that particular case. But caching variable sized free lists? That’s pretty much the definition of what the help manager does.

    If it’s *that* bad, I’d have thought a better idea would be to submit a bug report with test case and expected behavior/timings, and hope that the vendor releases a patch to fix the problem. Then *all* applications get to benefit.

  14. Ax says:

    I’d be interested in seeing where you see caching  algorithms screw up in the same-sized element case. Since the above issue is *very* specific to the variable-sized element case.

    (p.s. Named myself to keep the alphabetical order so far…)

  15. steveg says:

    Excellent post, Raymond, thanks. Anything that contains "exponential decay distribution" is always good.

    A question though, what sort of performance difference does BufferCache introduce vs a BufferCache-less version?

  16. steveg: It all depends on the scenario, but that wasn’t my point here. (For example, imagine that the buffers were really C++ objects with complicated constructors.)

  17. Cooney says:

    Since the memory in question is probably virtually mapped, will it really matter, so long as the address space isn’t exhausted?

  18. Dean Harding says:

    Cooney: It most definately will matter. The way it was originally written not only wastes memory, but you also end up with a pretty badly fragmented virtual address space (since you keep allocating bigger and bigger buffers, and only deallocating the small ones).

    Fragmentation of the virtual address space is possibly the biggest killer when writing long-running (i.e. server) applications.

  19. Neil says:

    If you’ve got a lot of same-sized items

    … then you should consider an arena allocator.

  20. josh says:

    If you’re gonna have a limit that the size is guaranteed to grow to eventually, why not just allocate that much to begin with and avoid some of the extra processing?

  21. Mr.Pedantic says:

    Why is this a flawed design?

    Well, the freeing of the NULL pointer was a bit of a giveaway:

    >if (!m_pCache || cb > LocalSize(m_pCache))   >LocalFree(m_pCache);

    Oh, you mean there were other flaws too? :-).

  22. josh: So you’re saying we should’ve just allocated 1MB buffers and handed them out, even if the caller asked for only 100 bytes?

    Pedantic: LocalFree is explicitly documented as accepting NULL (and treating it as a nop).

  23. richard says:

    Hmmm … the voice sounds decidedly unlike you (not that I would know your voice from anything other than your blogs).

    Personally, I think that sort of caching policy sounds stupid. People need to implement first, optimize later.

    Then when they get to the optimization part, they need to learn to actually test and measure rather than going by gut. If I had nickel every time someone pointed to the code and said “It’s the string handling that is killing the performance,” I’d probably be near a dollar. If I gave back a quarter for every time it was string handling / manipulation that was at fault … gee, I’d still be near a dollar.

    People have intuitive feelings about performance that are usually dead wrong. And then machismo kicks in and nobody wants to admit they were wrong, so a whole lot of effort goes into wringing out a modicum of performance.

    While we are no longer trying to shoehorn applications into a 1KB footprint, memory is still not free.

    If I had to implement that optimization, I would probably profile the memory request distribution; establish a threshold based on that analysis and only cache those memory blocks which fall below a certain threshold. Large memory blocks would be allocated and freed as needed (since they are rare occurrences).

  24. Eli says:

    Someone should show the Firefox team this.

  25. ChrisW says:

    Other negatives about private allocators even when they use the system heap manager include:

    1. Tools for finding heap overruns tend not to work.  Application Verifier (debug heap) becomes close to useless.

    2. The new mitigations for heap overruns put into XPSP2 etc are bypassed, so it will be easier to write exploits if there is a heap overrun.

    3. Leak detection tools will be confused, or not.  But if you tell a developer that these blocks looked leaked and they are on his special free list, he will be less interested in hearing from you the next time.

    4. I have seen a very famous internet facing component that had its own private heap manager on top of the Windows heap, with a policy to clear the cache when it got up to some limit.  Testers were told, and they tested that condition.  The problem was when it cleared it the first time, it set its state variables such that it never cleared the cache again.  That took a while to figure out.

    net net: I’m against private allocators.  Many of them in my experience have outlived their usefulness.

  26. josh says:

    "So you’re saying we should’ve just allocated 1MB buffers and handed them out, even if the caller asked for only 100 bytes?"

    No, of course not.  I’m not quite sure what I was thinking… now that I look at it again your buffering here seems much simpler than it did this morning.  :/

  27. Kal says:

    One big reason to use a cache (or otherwise similar memory management system that you design) is if you can get away with avoiding synchronization hits on memory allocation. That’s more to do with a custom memory allocator and less to do with a cache, but it is a reason one would do it – and it can result in shockingly impressive performance benefits in the right kind of system.

    The cache part…yeah, Rico’s notes about ‘DO NOT CACHE UNLESS YOU KNOW YOU NEED TO’ always stuck with me. Caching in such a way that is both correct and improves performance significantly is a hard problem and changes the behavior of your system. It’s not something to undertake haphazardly.

  28. rsclient says:

    I’ll tell my cache war story:

    It was a dark and stormy night.  The project made heavy use of STL (in the days when the ‘S’ in STL was a cruel joke), exceptions (in the days when exceptions were new) and threads (again, in the days when threads didn’t have very good support).  As a mere contractor working on a port, I had no say in the design of the project: I just had to port it.

    The compiler, alas, could support either exceptions or threads, but not both (there was even a comment in the compiler code about this).  My group decided to keep the exceptions, but turn all the threads into processes, and "just" share all the memory between the process using shared memory. This means that I had to write my own memory allocator.  Worse, once it allocated one block, it could never allocate another.

    As Raymond says: caches are hard.  You get a bunch of choices about picking the right block to give back, and choices about when to consolidate blocks.  In no case should you do this without lots of stress testing.  We had one particular pattern of allocate/release that just killed the allocator — the code "looked" OK, but the result was to always take a ‘big’ block, chop a bit off of it, and never be able to use it again…yuck!

    Recommendations: when in doubt, don’t cache.  If you do, cache similar-size objects together.  Plan to spend a week stress-testing the cache and verifying that it’s not causing Massive Headaches.  Add the cache-validation logic early; you’re going to need it.

  29. Ian Boyd says:

    So handing back a buffer that is larger than you asked for is a bad thing?

  30. Yuhong Bao says:

    unless it is flushed, of course.

  31. Amanjit Gill says:


    "the heap manager" is actually an interesting beast. Especially with the combo VC7 + Dinkumware Std Lib. C++ tends to stress the heap manager regarding allocation/deallocation of small values (there are best practices about how to avoid that). We can use custom Allocators to manage the memory business for our container, and also to see when things get created, destroyed, alloced or freed… see


    for an example. Then begin to scratch your head when you do a vector.resize(len*3) for a vector of small POD structs; The default impl directly calls malloc and bombards the heap manager with dozens of requests of small sizes, slowing down everything. This is not the Dinkumware lib bashing post, I am glad I have a very conforming implementation… REALLY… .) still I investigated the whole issue and found out

    1. you really want to use the low fragmenting heap if you are effectively trying to fragment it. Its off with Win XP Pro and later OSes!

    ULONG HeapFragValue = 2;

    HeapSetInformation((HANDLE)_get_heap_handle(), HeapCompatibilityInformation, &HeapFragValue, sizeof(HeapFragValue));    



    and watch the performance of your c++ app increase (for the use case of small allocs and deallocs, of course)

    2. you really want to use a custom allocator with c++ std containers to release some stress from the win32 heap manager

    the easiest way to achieve this is to use a different malloc in a custom allocator, for example doug lea’s malloc (dmalloc), which is also in glibc I think (as ptmalloc). I came up with this idea, because my P3 450/256MB RAM would resize my hashmap’s buckets faster than a P4 2.4GHz Notebook!!! I love vc6/7 but this was blowing me away. I simply called dmalloc and dfree in my custom allocator and the performance would at least double.

    3. you should evaluate something less conforming, but sometimes more performant

    STLPort. std::list, strings and stringstreams are a whole lot faster, or at least deserve a closer look during a performance evaluation

    Of course, there are a whole bunch of c++ techniques to avoid massive heap allocations, but I do not think resizing a vector of small POD objects should effectively fragement the heap.

    Weird stuff!

  32. Amanjit Gill says:

    The p3/450 was actually running linux (debian sarge). And the same c++ app (my own hasmap implementation) ran 3 times faster on linux than on a p4/2400 Win2003 Std Edition / VC7 / Dinkumware STL. It tested HashMap< std::string, long> which of courses means tons of allocations (worst corse test))

    But then again, massive hashmap insertions and deletions with constant bucket size ideally measure the bandwidth between the CPU and memory (bucket access is basically array access), so It could be that the p3 and the p4 do not differ too much in this aspect.

    bottom line: the combination of small heap allocations and c++ always deserve a closer look. Also Caching and Pooling can be are solution (depends)


  33. Ashod Nakashian says:

    Raymond, you go a long way to show some interesting facts. Great article, although sometimes I think you have too much free time!

    I learned about this case when I was asked to write a high-performance memory manager for a industrial software that would alloc/free 10s of GB’s of data over a period of 10-12 days in a single run with a final working set of about 20GBs (of course on 64-bit windows and linux systems). Profilling was by far the biggest lesson to learn although not the only important one. I must add that profilling a program like that is not an easy task, considering there is very little ‘small’ intput data to use, and reading/grepping through MB’s worth of logs is your only friend.


  34. Nathanael Nerode says:

    Why not just fix the heap manager?

    I mean, geez.  Avoiding the heap manager is silly: you end up writing your own heap manager.

    If the general-purpose heap manager doesn’t work for you, use a different, special-purpose implementation instead.  (In UNIX-land, this would be called a "replacement malloc".)  I believe there are essentially off-the-shelf implementations for all of the major memory use cases.

    This blog entry is great advice for the writers of heap managers, but not so much for everyone else!

  35. I got some interesting questions about how to build good middle-tier caches in my inbox last week. I

Comments are closed.