In most data structure and algorithms classes, the model used for basic analysis is the traditional RAM model: we assume that we have a large, random-access array of memory, and count the number of simple reads/writes needed to perform the algorithm. For example, selection sort takes about n(n-1)/2 reads and 2n writes to sort an array of n numbers. This model was relatively accurate up through the 1980s, when CPUs were slow enough and RAM small enough that reading and writing the RAM directly didn’t create a significant bottleneck. Nowadays, however, typical desktop CPUs possess deep cache hierarchies – at least a sizable L1 and L2 cache – to prevent runtime from being dominated by RAM accesses. Data structures and algorithms that efficiently exploit the cache can dramatically outperform those that don’t, and our analysis needs to take this into account.
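To make the RAM-model counting concrete, here is a small Python sketch of selection sort with explicit read/write counters (the instrumentation and the exact counts are illustrative; register operations are treated as free):

```python
def selection_sort_counted(a):
    """Selection sort with RAM-model counters: tally every array
    read and write, keeping the running minimum in a 'register'."""
    reads = writes = 0
    n = len(a)
    for i in range(n - 1):
        best_idx, best = i, a[i]
        reads += 1
        for j in range(i + 1, n):
            x = a[j]                      # one read per comparison
            reads += 1
            if x < best:
                best_idx, best = j, x
        a[i], a[best_idx] = best, a[i]    # swap: one read, two writes
        reads += 1
        writes += 2
    return reads, writes

a = list(range(100, 0, -1))
reads, writes = selection_sort_counted(a)
n = len(a)
assert a == sorted(a)
assert writes == 2 * (n - 1)                                  # ~2n writes
assert n * (n - 1) // 2 <= reads <= n * (n - 1) // 2 + 2 * n  # ~n(n-1)/2 reads
```

The inner loop performs the n(n-1)/2 comparison reads; the per-pass bookkeeping adds only a linear number of extra accesses.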
For example, suppose you have a list of bytes stored in both an array and a linked list. Provided the linked list nodes are allocated at random positions, iterating over the array will be dramatically faster, exhibiting optimal locality of reference, while iterating over the linked list will trigger a cache miss for every element. We can formalize this: assume a cache line – the size of the block read into the cache at one time – has size B bytes. Then traversing the linked list requires n cache misses, while traversing the array requires precisely ceiling(n/B) cache misses, a speedup of B times.
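We can check this miss count with a toy simulation in Python. The model is deliberately simplified – one-byte elements, an idealized cache that never evicts – and the scattered node addresses stand in for a real allocator:

```python
import math
import random

def cache_misses(addresses, block_size):
    """Misses under an idealized model: an access misses only the
    first time its size-B block is touched (no eviction)."""
    seen = set()
    for addr in addresses:
        seen.add(addr // block_size)
    return len(seen)

n, B = 1024, 64  # n one-byte elements, B-byte cache lines

array_addresses = range(n)  # contiguous storage
# Hypothetical linked list: nodes scattered at distinct random spots.
node_addresses = random.sample(range(0, 100 * n * B, B), n)

assert cache_misses(array_addresses, B) == math.ceil(n / B)  # 16 misses
assert cache_misses(node_addresses, B) == n                  # 1024 misses
```

The array touches only ceiling(n/B) distinct lines, while the scattered nodes each land in their own line – the factor-of-B gap described above.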
It is possible to design data structures that specifically take advantage of the cache line size B. For example, in a previous post I described an unrolled linked list data structure where each node fills up a cache line, achieving similar cache performance to arrays while still allowing efficient insertions and deletions. Such a data structure is termed “cache-aware”, because it explicitly takes knowledge about the cache model into account in its construction.
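A minimal sketch of the unrolled idea in Python (the capacity of 14 is an illustrative guess for one cache line of small elements, not a tuned value):

```python
class _Node:
    CAPACITY = 14  # illustrative: sized so one node roughly fills a line

    def __init__(self):
        self.items = []   # up to CAPACITY elements, stored contiguously
        self.next = None

class UnrolledList:
    """Unrolled linked list: each node packs many elements, so a
    traversal touches roughly one cache line per CAPACITY elements."""

    def __init__(self):
        self.head = self.tail = _Node()

    def append(self, x):
        if len(self.tail.items) == _Node.CAPACITY:
            self.tail.next = _Node()
            self.tail = self.tail.next
        self.tail.items.append(x)

    def __iter__(self):
        node = self.head
        while node is not None:
            yield from node.items
            node = node.next
```

Insertion in the middle (not shown) splits a full node in two, which is what preserves efficient updates relative to a plain array.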
Unfortunately, this approach has a serious disadvantage: one has to tune the data structure based on the cache line size B. This is a problem for many reasons: first of all, programs, even those compiled to native machine code, are often run on many different processors with different cache line sizes. A Pentium 4 can run the same machine code as a Pentium II, but has a very different cache model. But even when running on a fixed machine, these simple cache-tuned data structures can’t take advantage of multilevel caches, because each level has a different size and a different block size. One can attempt to build a data structure for a specific cache hierarchy, but this is even less robust against a change of machine.
In 1999, in his master’s thesis, Harald Prokop came up with an interesting solution to this problem: the idea of a cache-oblivious algorithm. This does not mean that the algorithm does not take advantage of the cache; to the contrary, it does so quite effectively. What it means is that the algorithm does not need to know the cache line size; it works effectively for all cache line sizes B simultaneously. This allows the algorithm to robustly exploit caches across many machines without machine-specific tuning. More importantly, it allows the algorithm to exploit multilevel caches effectively: because it works for all B, it applies between each two adjacent levels of cache, and so every level is well-utilized.
In fact, the multilevel advantage extends beyond simply taking advantage of both the L1 and L2 caches – such algorithms perform well even when the storage hierarchy is expanded to encompass much slower media such as slow memory, Flash, compressed memory, disks, or even network resources, without knowledge of block sizes or access patterns. For example, cache-oblivious B-trees can be used to implement huge database indexes that simultaneously minimize disk reads and cache misses.
Note that just because the algorithm doesn’t depend on B doesn’t mean the analysis of the algorithm doesn’t depend on B. Let’s take an example: iterating over a simple array. As noted earlier, this requires about n/B cache misses, which is optimal. But neither the array’s structure nor the iteration algorithm takes B into account, so the same guarantee holds for every B at once. In a multilevel cache, every block fetched into each level is fully used. This is the theoretical benefit of cache-oblivious algorithms: we can reason about the algorithm using a very simple two-level cache model, and it automatically generalizes to a complex, deep cache hierarchy with no additional work.
While cache-oblivious algorithms are clearly useful, at first it’s not clear that there even exist any other than simple array iteration. Thankfully, extensive recent research has revealed cache-oblivious data structures and algorithms for a multitude of practical problems: searching (binary trees), sorting, associative arrays, FFT computation, matrix multiplication and transposition, priority queues, shortest path, query processing, orthogonal range searching, with more emerging every year. Practical comparison studies have shown a significant performance gain on real processors compared to traditional algorithms, although carefully-tuned cache-aware algorithms still enjoy an advantage on the specific machines they are tuned for.
Many of the cache-oblivious data structures and algorithms that have been published are relatively complex, but here I’ll describe a simple one just to give you a feel for it. This discussion is based on parts of this paper.
Consider the following simple search problem: we have a static list of records, and we wish to find the record with a particular key. Traditionally, this problem is solved with either an array and binary search, or a binary search tree. Both of these approaches exhibit dismal cache behavior. Database applications rely on B-trees, which group several records into each block. This greatly improves performance, but requires knowledge of the block size, and does not work across all levels of the cache hierarchy.
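To see just how badly plain binary search behaves, we can count the distinct blocks it touches; the simulation below is a sketch, and the values of n and B are arbitrary examples:

```python
import math

def binary_search_blocks(a, target, B):
    """Run binary search, recording which size-B block each probe reads."""
    lo, hi = 0, len(a) - 1
    blocks = set()
    while lo <= hi:
        mid = (lo + hi) // 2
        blocks.add(mid // B)          # the block this probe lands in
        if a[mid] == target:
            break
        elif a[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return blocks

n, B = 2**20, 64
a = list(range(n))
touched = binary_search_blocks(a, 0, B)
# Early probes are far apart, so each one lands in a fresh block:
assert len(touched) == 14   # == log2(n/B) for n = 2^20, B = 64
# A B-tree would need only about log_B(n) = ceil(20/6) = 4 block reads.
```

Only the last few probes, once the search interval fits inside one block, stop generating fresh misses – hence roughly log2(n/B) misses versus a B-tree’s log_B(n).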
The key is the van Emde Boas layout, named after the van Emde Boas tree data structure conceived in 1977 by Peter van Emde Boas. Suppose for simplicity that the number of elements is a power of 2. We create a complete binary tree containing our records (where by “complete” we mean that all leaves are at the same level). This tree will have height h = log2n.
Take a look at the first h/2 levels of the tree. This subtree has 2h/2 leaves, each itself the root of a tree of height h/2. In the van Emde Boas layout, we first lay out the subtree of height h/2 rooted at the root, followed by the subtrees rooted at each leaf of this tree from left to right. Each subtree is recursively laid out using the van Emde Boas layout. An example diagram is shown in Figure 1 of the paper.
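Here is a sketch of the layout computation in Python, for a tree stored with the usual 1-based heap indexing (node i has children 2i and 2i+1); it returns the order in which nodes are written to memory:

```python
def veb_layout(root=1, height=4):
    """Van Emde Boas order of the heap indices of a complete binary
    tree with `height` levels rooted at `root`."""
    if height == 1:
        return [root]
    top = height // 2              # levels in the top subtree
    bottom = height - top          # levels in each bottom subtree
    order = veb_layout(root, top)  # lay out the top subtree first...
    # ...then the subtrees hanging below its leaves, left to right.
    for child in range(root << top, (root + 1) << top):
        order += veb_layout(child, bottom)
    return order

# 15-node tree (4 levels): top 2 levels, then four 3-node subtrees.
assert veb_layout(1, 4) == [1, 2, 3, 4, 8, 9, 5, 10, 11, 6, 12, 13, 7, 14, 15]
```

Note that both the top subtree and the bottom subtrees are themselves laid out by the same recursion, which is what makes the layout work at every scale at once.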
The analysis proceeds like this: at each step of the recursion, the size of the subtrees being laid out is the square root of the size of the subtrees laid out at the previous step. Consequently, as we recurse, at some point we will be laying out subtrees between sqrt(B) and B in size (call this size C). Each of these subtrees can be retrieved in a single memory transfer, and the layout partitions the tree into subtrees of this size.
The search procedure works like this: we access the root, which pulls in the first log2C levels of the tree. We have only cache hits until we access a leaf of the tree, which pulls in the next log2C levels of the tree rooted at that leaf. This continues until we reach the bottom of the tree. Because only 1 out of every log2C accesses causes a cache miss, the total number of misses is log2n/log2C = logCn. Because C is at least sqrt(B), log2C is at least (log2B)/2, and the total number of misses is at most 2logBn.
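The miss arithmetic above can be checked with a few lines of Python (n, B, and C here are illustrative example values):

```python
import math

# One root-to-leaf search crosses log2(n) levels; each cache miss
# pulls in a subtree covering log2(C) consecutive levels.
def search_misses(n, C):
    return math.ceil(math.log2(n) / math.log2(C))

n, B = 2**20, 64      # a million keys, 64-element blocks
assert search_misses(n, B) == 4                 # best case C = B: ceil(20/6)
# Worst case C = sqrt(B): still within the 2 * log_B(n) bound
# (allowing +1 for the ceiling's rounding).
assert search_misses(n, int(math.sqrt(B))) <= 2 * math.log(n, B) + 1
```

Even in the worst case the subtree fetched per miss covers at least (log2B)/2 levels, which is where the 2logBn bound comes from.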
We made one small assumption above: that each subtree is aligned with the cache block boundaries; if this is not true, each subtree can require two accesses, bringing the final maximum to 4logBn. Probabilistic analysis shows that with a random starting point in memory and sufficiently large block size, the real constant is closer to 2.
This is the same asymptotic performance as B-trees, and is faster than ordinary binary trees by a factor of about log2n/2logBn = (log2B)/2. For example, for a disk block size of 2048 elements, this would be a speedup of about 5.5 times. The advantage is more than theoretical: to quote the paper, “preliminary experiments have shown, surprisingly, that CO [cache-oblivious] B-trees can outperform traditional B-trees, sometimes by factors of more than 2”.
But as practical as research in cache-oblivious algorithms is, many applications and libraries have yet to take advantage of it. The data structures supplied with .NET, Java, Lisp, and so on are not cache-oblivious. We need to start putting this research into practice and reaping the benefits. I hope this gives you some insight into cache-oblivious algorithms, and that you will consider taking advantage of them the next time you develop a product whose performance critically depends on memory access patterns.