Locality of reference and performance

Eric Gunnerson told me he had some requests for information on how to get good locality of reference when using the CLR. This is a topic that's near and dear to me so that's all the provocation I needed to blog about it.

I think first it's important to understand just how significant locality of reference can be. Jan Gray recently posted an interesting entry in his blog on performance trivia. I actually think it's not trivial at all: the key message there is that on a modern processor, in round numbers and in the worst case, accessing memory from DRAM can be on the order of 1000 times slower than doing best-case math operations.

Maybe it would be helpful if I put that in terms of a tradeoff: if you had an algorithm that needed 500 CPU operations and 10 memory accesses, you could say it has a cost of 500+10*1000=10500. If you changed your algorithm so that it used twice the computing logic (1000 operations) but needed only half the memory accesses, the cost would be 1000+5*1000=6000. You can see that the extra computation has more than paid for itself in terms of saved memory accesses. In fact, even if you had saved only one memory access, you'd still come out ahead at 1000+9*1000=10000 vs. the original 10500. Again, these are just round numbers to illustrate the possible savings... naturally you have to measure your own cases. We can haggle over whether your application really gets 1000 operations in the time of one memory access, or more like 250; it's going to vary depending on how clever/lucky you are, but I think we can agree it's gonna be a big number.
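
To make that tradeoff concrete, here's a little illustrative sketch of my own (not anything from Jan's piece): the table version does almost no arithmetic but has to read from a big array that probably isn't sitting in cache, while the computed version does a bit more math and touches no extra memory. The table size and the names are made up for illustration; measure your own case.

    // Hypothetical example: trading a memory access for computation.
    // SquaresTable is big enough (~64MB) that random reads will usually miss the cache.
    static readonly int[] SquaresTable = BuildTable();

    static int[] BuildTable()
    {
        var table = new int[16 * 1024 * 1024];
        for (int i = 0; i < table.Length; i++) table[i] = i * i;
        return table;
    }

    // One memory access, almost no math.
    static int SquareFromTable(int x) => SquaresTable[x];

    // Pure computation, no table to miss on.
    static int SquareComputed(int x) => x * x;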

OK, so I think we can all agree that good locality of data can make a really big difference, so how do you get good locality in managed programming?

In some ways, when using managed code it's actually a whole lot easier to get your head around what's going on. In traditional C++ programming you might have done a lot of exotic things such as creating separate heaps for certain classes of objects, complex object pooling or chunking, and assorted operator-new-with-placement tricks. Some of these make my head hurt.

Now, being as I'm a wishy-washy performance guy who never gives absolute advice, the best I can say is that some of those techniques you still might want to apply in some cases: if I were writing a text editor in managed code I might very well want char arrays of a nice chunky size that contain my text lines, and then fill/split them in a b-tree-like fashion to keep things clustered when doing line processing. Reaching for data structures that have good locality built right in is rarely a bad idea, but you'll want to measure (see the video for more of that measuring stuff).
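
For what it's worth, here's roughly what I mean by "nice chunky size", as a sketch only: the 4K capacity and the split-on-overflow policy are assumptions I'm making for illustration, and the b-tree that would sit above these chunks is left out entirely.

    // Sketch: many lines of text share one contiguous char[] so that scanning
    // lines walks memory sequentially instead of chasing one object per line.
    class TextChunk
    {
        public const int Capacity = 4096;              // assumed chunk size
        public readonly char[] Buffer = new char[Capacity];
        public int Used;                               // chars currently in use

        // Append text to this chunk; when it won't fit, the caller splits
        // into a new chunk (b-tree style) to keep related lines clustered.
        public bool TryAppend(string line)
        {
            if (Used + line.Length > Capacity) return false;
            line.CopyTo(0, Buffer, Used, line.Length);
            Used += line.Length;
            return true;
        }
    }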

But what about some of those more advanced techniques, like custom heaps and so forth? Well, and again no advice is absolute, frequently they don't play well with the garbage collector (see resources below). But there are a couple of key things to keep in mind.

1. The Garbage Collector is not an object shuffler

When the collector decides it's time to collect, it doesn't shuffle the live objects around the way a disk defragmenter might. What it does is slide the surviving objects together, preserving their relative order, which creates free memory at the end of the memory segment.

2. The Garbage collector tends to allocate from the end of currently used memory

Objects on the garbage collected heap are packed together with no gaps, so consecutive allocations tend to land in increasing address order.

These two simple facts turn out to be very important because together they mean that allocations that are made together in time tend to also be close together in space, and furthermore they tend to stay that way. Even if you have some temporary objects that intervene they won't mess up your long term locality of reference because those objects will be squeezed away at the next collection. In unmanaged code you often need custom heap logic or pooling to arrange that a sequence of objects you allocate will be together (and stay that way), and certainly things allocated together in time won't tend to get "more together" as things proceed and objects are freed. You get all of those things for free in the garbage collected world.
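
Here's a tiny sketch of what "together in time, together in space" buys you; the Node type is mine and purely illustrative. Because the nodes are allocated back to back in one loop, they tend to land at consecutive addresses, and compaction tends to keep them that way even after any interleaved temporaries die.

    // Illustrative only: nodes allocated consecutively tend to end up (and stay)
    // adjacent on the GC heap, so traversing the list walks memory roughly in order.
    class Node
    {
        public int Value;
        public Node Next;
    }

    static Node BuildList(int count)
    {
        Node head = null, tail = null;
        for (int i = 0; i < count; i++)
        {
            var node = new Node { Value = i };   // allocations close together in time
            if (head == null) head = node;
            else tail.Next = node;
            tail = node;
        }
        return head;
    }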

The reverse tends to be true as well. Objects not allocated together in time will tend not to be together in space, so data structures that are built up over time will tend to be spread amongst objects of different ages. Under those circumstances you could imagine that it might be profitable to reallocate entire data structures, or pieces of data structures, from time to time to get things close together again. But please measure before (and after) doing anything that drastic! And don't say "Rico told me to do it!"
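
If you did decide such a re-clustering was worth it (measure!), it might look something like this sketch, which reuses the illustrative Node type from above: copying the structure makes a fresh run of consecutive allocations and lets the old scattered nodes get collected.

    // Sketch only: copy a list that was grown over a long time into freshly
    // allocated nodes so the copies are together in time, and hence in space.
    static Node Recluster(Node head)
    {
        Node newHead = null, newTail = null;
        for (var cur = head; cur != null; cur = cur.Next)
        {
            var copy = new Node { Value = cur.Value };   // consecutive allocations
            if (newHead == null) newHead = copy;
            else newTail.Next = copy;
            newTail = copy;
        }
        return newHead;   // the old nodes become garbage and get squeezed away
    }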

Now when given these facts some people think, "Hmmm, I should allocate my objects in chunks of 100 to get them all together". Well, that's dangerous advice... it may be better to do that, but it may not be. You have to consider that the objects you aren't using in those chunks are effectively wasted memory and they're keeping other related things from being close to the objects that are in use. If you do a lot of this, you get the same kinds of problems that your hard disk would suffer if it had cluster sizes that are too big (sometimes that's called internal fragmentation). If you were building an advanced data structure like a b-tree I wouldn't be shy about allocating objects for a whole node (which might be quite large) together, but many other data structures would see no net benefit.

How can you see locality? Well you really can "see" it by watching the allocation patterns for your various object types, and comparing those patterns to what you expect/hope for. CLRProfiler (see below) has excellent graphical views of the heap, color coded by object type. If you see patterns that remind you of a slab of granite (lots of different stuff mixed finely) that's probably not good. If you see patterns that are more like say layers of marble or, even better, nice big bricks, that's probably a good sign. You should be able to see the temporary objects getting squeezed out in the timeline views.

Jan just read this over my shoulder and he said "Don't forget to remind the nice people that the data that has best locality of reference is the data that's not there at all." Truer words were never spoken. In the locality of reference world, less is more.

(By the way, Jan doesn't buy that whole marble/brick/granite thing at all, but this is my blog, and I feel like I have to say something about seeing locality, and I refuse to give advice like "look at the CPU's perf counter for L2 cache misses and you'll see your locality".)

So, don't be too fancy; it probably won't help much. Don't allocate stuff you don't need; that certainly won't help. Be mindful when using data structures that are heavy in pointers: those pointers will cost you space, and even though the little arrow you draw for the pointer on your whiteboard might be short, that doesn't mean the distance between the objects is small, unless you made it so. Every pointer you put in your data structures is a chance to get the locality wrong.
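
To put a picture on that last point, here's a made-up contrast: an array of reference-type points is really an array of pointers to objects that could be anywhere on the heap, while an array of struct points keeps the coordinates themselves contiguous. Neither is automatically better; it depends on how you use them, so measure.

    // Hypothetical contrast: pointer-heavy vs. contiguous layout.
    class PointClass { public double X, Y; }     // array holds references to these
    struct PointStruct { public double X, Y; }   // array holds the data itself

    // Each element here is a pointer chase that may land anywhere on the heap.
    static double SumX(PointClass[] points)
    {
        double sum = 0;
        foreach (var p in points) sum += p.X;
        return sum;
    }

    // Here the reads are sequential over one contiguous block of memory.
    static double SumX(PointStruct[] points)
    {
        double sum = 0;
        foreach (var p in points) sum += p.X;
        return sum;
    }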

Resources:

There's a more detailed treatment of caching effects in Jan's paper "Writing Faster Managed Code: Know What Things Cost"; skip to the section called "Of Cache Misses, Page Faults, and Computer Architecture" for some excellent details.

I wrote a paper on Garbage Collector Basics that provides a mental model for how the collector works which can be very useful when planning.

Peter Sollich's documentation for the CLRProfiler is rich in information about how the GC works. Even if you never plan to run CLRProfiler you should read the documentation to help get insights into what goes on in managed code.