Cache-oblivious data structures


In most data structure and algorithms classes, the model used for basic analysis is the traditional RAM model: we assume that we have a large, random-access array of memory, and count the number of simple reads/writes needed to perform the algorithm. For example, selection sort takes about n(n-1)/2 reads and 2n writes to sort an array of n numbers. This model was relatively accurate up through the 1980s, when CPUs were slow enough and RAM small enough that reading and writing the RAM directly didn’t create a significant bottleneck. Nowadays, however, typical desktop CPUs possess deep cache hierarchies – at least a sizable L1 and L2 cache – to prevent runtime from being dominated by RAM accesses. Data structures and algorithms that efficiently exploit the cache can perform dramatically more quickly than those that don’t, and our analysis needs to take this into account.


For example, suppose you have a list of bytes stored in both an array and a linked list. Provided the linked list nodes are allocated at random positions, iterating over the array will be dramatically faster, exhibiting optimal locality of reference, while iterating over the linked list will trigger a cache miss for every element. We can formalize this: assume a cache line – the size of the block read into the cache at one time – has size B bytes. Then traversing the linked list requires n cache misses, while traversing the array requires precisely ceiling(n/B) cache misses, a speedup of B times.


It is possible to design data structures that specifically take advantage of the cache line size B. For example, in a previous post I described an unrolled linked list data structure where each node fills up a cache line, achieving similar cache performance to arrays while still allowing efficient insertions and deletions. Such a data structure is termed “cache-aware”, because it explicitly takes knowledge about the cache model into account in its construction.


Unfortunately, this approach has a compelling disadvantage: one has to tune the data structure based on the cache line size B. This is a problem for many reasons: first of all, programs, even those compiled to native machine code, are often run on many different processors with different cache line sizes. A Pentium 4 can run the same machine code as a Pentium II, but has a very different cache model. But even when running on a fixed machine, these simple cache-tuned data structures can’t take advantage of multilevel caches, because each one has a different size and a different block size. One can attempt to build a data structure for a specific cache hierarchy, but this is even less robust against a change of machine.


In 1999, in his masters thesis, Harold Prokop came up with an interesting solution to this problem: the idea of a cache-oblivious algorithm. This does not mean that the algorithm does not take advantage of the cache; to the contrary, it does so quite effectively. What it means is that the algorithm does not need to know the cache line size; it works effectively for all cache line sizes B simultaneously. This allows the algorithm to robustly exploit caches across many machines without machine-specific tuning. More importantly, it allows the algorithm to exploit multilevel caches effectively: because it works for all B, it applies between each two adjacent levels of cache, and so every level is well-utilized.


In fact, the multilevel advantage extends beyond simply taking advantage of both the L1 and L2 caches – such algorithms perform well even when the storage hierarchy is expanded to encompass much slower media such as slow memory, Flash, compressed memory, disks, or even network resources, without knowledge of block sizes or access patterns. For example, cache-oblivious B-trees can be used to implement huge database indexes that simultaneously minimize disk reads and cache misses.


Note that just because the algorithm doesn’t depend on B doesn’t mean the analysis of the algorithm doesn’t depend on B. Let’s take an example: iterating over a simple array. As noted earlier, this requires about n/B cache misses, which is optimal. But neither the array’s structure nor the iteration algorithm explicitly take B into account. Consequently, it works for all B. In a multilevel cache, all data prefetched into every cache is used. This is the theoretical benefit of cache-oblivious algorithms: we can reason about the algorithm using a very simple two-level cache model, and it automatically generalizes to a complex, deep cache hierarchy with no additional work.


While cache-oblivious algorithms are clearly useful, at first it’s not clear that there even exist any other than simple array iteration. Thankfully, extensive recent research has revealed cache-oblivious data structures and algorithms for a multitude of practical problems: searching (binary trees), sorting, associative arrays, FFT computation, matrix multiplication and transposition, priority queues, shortest path, query processing, orthogonal range searching, and more emerging every year. Practical comparison studies have shown a significant performance gain on real processors compared to traditional algorithms, although carefully-tuned cache-aware algorithms still enjoy an advantage on the specific machines they are tuned for (see e.g. [1][2]).


Many of the cache-oblivious data structures and algorithms that have been published are relatively complex, but here I’ll describe a simple one just to give you a feel for it. This discussion is based on parts of this paper.


Consider the following simple search problem: we have a static list of records, and we wish to find the record with a particular key. Traditionally, this problem is solved with either an array and binary search, or a binary search tree. Both of these approaches exhibit dismal cache behavior. Database applications rely on B-trees, which group several records into each block. This greatly improves performance, but requires knowledge of the block size, and does not work across all levels of the cache hierarchy.


The key is the van Emde Boas layout, named after the van Emde Boas tree data structure conceived in 1977 by Peter van Emde Boas. Suppose for simplicity that the number of elements in a power of 2. We create a complete binary tree containing our records (where by “complete” we mean that all leaves are at the same level). This tree will have height h = log2n.


Take a look at the first h/2 levels of the tree. This subtree has 2h/2 leaves, each itself the root of a tree of height h/2. In the van Emde Boas layout, we first lay out the subtree of height h/2 rooted at the root, followed by the subtrees rooted at each leaf of this tree from left to right. Each subtree is recursively laid out using the van Emde Boas layout. An example diagram is shown in Figure 1 of the paper.


The analysis proceeds like this: at each step of the recursion, the size of the subtrees being laid out is the square root of the size of the subtrees laid out at the previous step. Consequently, as we recurse, at some point we will be laying out subtrees between sqrt(B) and B in size (let’s call it C). Each of these subtrees can be retrieved in a single memory transfer, and this layout partitions the tree into subtrees of this size.


The search procedure works like this: we access the root, which pulls in the first log2C levels of the tree. We have only cache hits until we access a leaf of the tree, which pulls in the next log2C levels of the tree rooted at that leaf. This continues until we reach the bottom of the tree. Because only 1 out of every log2C accesses causes a cache miss, the total number of misses is log2n/log2C = logCn. Because C is at least sqrt(B), log2C is at least (log2B)/2, and the total number of misses is at most 2logBn.


We made one small assumption above: that each subtree is aligned with the cache block boundaries; if this is not true, each subtree can require two accesses, bringing the final maximum to 4logBn. Probabilistic analysis shows that with a random starting point in memory and sufficiently large block size, the real constant is closer to 2.


This is the same asymptotic performance as B-trees, and is faster than ordinary binary trees by a factor of about log2n/2logBn = (log2B)/2. For example, for a disk block size of 2048 elements, this would be a speedup of about 6.5 times. The advantage is more than theoretical: to quote the paper, “preliminary experiments have shown, surprisingly, that CO [cache-oblivious] B-trees can outperform traditional B-trees, sometimes by factors of more than 2”.


But as practical as the research is in cache-oblivious algorithms, many applications and libraries have yet to take advantage of them. The data structures supplied with .NET, Java, Lisp, and so on are not cache-oblivious. We need to start putting this research into practice and reaping the benefits. I hope this gives you some insight into cache-oblivious algorithms, and I hope you will consider taking advantage of it the next time you develop a product whose performance critically depends on memory access patterns.

Comments (10)

  1. Seun Osewa says:

    I haven’t yet grasped the algorithm you described, but thankfully if such algorithms are implemented in our favorite operating systems and runtime engines, we’ll be able to benefit from them for free.  I wonder if people can write cache-aware programs in dynamic languages like Python?

  2. Justin says:

    A couple of questions regarding your analysis:

    1) Is C calculated at some point? How does the search procedure pull in log(2)(C) records?

    2) Does building a balanced binary tree figure in the run time calculation?

    If you are interested in Haskell, have you seen the Bytestring library? They are taking advantage of caching technologies while coding at a very high level. Check out this thread:

    http://www.haskell.org/pipermail/haskell-cafe/2007-June/026654.html

    I ran the code presented on a 100 MB file and it was pretty damn fast.

  3. MSDNArchive says:

    Thanks for your comments Seun and Justin. To answer Justin’s question: C is not calculated and only occurs in the analysis. The idea is that the van Emde Boas layout, by virtue of its construction, contiguously lays out each subtree of size C. Because C < B, when we access this subtree, the cache system transfers in in the whole block, which contains the entire subtree (or if it crosses a block boundary, it might require two transfers to pull it in).

    To answer your other question, in this discussion I don’t mention how the tree is constructed. Because this is a static search tree, this is only done once, so you could do something as simple as sorting the data and then building the tree from left to right. If you check out some of the papers I referenced on search, they talk about how to construct *dynamic* cache-oblivious search trees, which allow insertions and deletions as well.

    ByteStrings look very interesting, actually – there’s a detailed paper on them here that won a "Most Practical Paper" award:

    http://www.cse.unsw.edu.au/~dons/papers/CSL06.html

    From the following description, I’d characterize ByteStrings as a cache-aware data structure: "Profiling was used to find an optimal chunk size: too small, and performance

    approaches that of a [Char] structure, too large (larger than the L2 cache) and

    performance also falls away. In practice, a chunk size that allows the working set

    to fit comfortably in the L2 cache has been found to be best." This is again the kind of tuning that cache-oblivious structures seek to avoid, although cache-aware structures do tend to have somewhat better performance when tuned well.

    To answer Seun’s question, you can construct cache-oblivious data structures in pretty much any language that allows arrays. Python arrays however can only contain primitive types, which makes it more difficult in some cases, since you have to represent things like references using array indexes instead of pointers. In the case of the static structure described above, the array index of each node can actually be computed, so this isn’t so much of an issue, but other cache-efficient algorithms may require this.

    Thanks again for your comments.

  4. Nathan Myers says:

    It should be noted that cache-oblivious algorithms were invented well before 1999.  The earliest I know of was implemented by (now U. Waterloo Prof.) Todd Veldhuizen in a C++ matrix multiplication library called Blitz++. (Veldhuizen, incidentally, pioneered several other important coding techniques.)  The library was demonstrated to beat soundly IBM’s own cache-aware Fortran compiler doing fluid dynamic simulations.  In this library, the matrix was traversed in an order dictated by a Hilbert curve, which exhibits excellent locality.  Veldhuizen knew exactly what he was doing, and why.  That his work is still not acknowledged by latecomers, long after they have been notified of it, can only be attributed to academic rudeness.

    Quote (published 1997): "One problem with blocking is selecting the block size: it has to be chosen carefully with the particular architecture and cache sizes in mind to get optimal performance. In the Blitz++ library, I decided to adopt a slightly different approach, based on the Hilbert space filling curve. In this approach, there are two loops: an outer loop which follows an approximation to the Hilbert space filling curve (1), and an inner loop (2) which traverses the corresponding column of the array. The space filling curve ensures good cache use, and the inner loop allows the CPU pipelines to pick up steam as the column is traversed.  I’ve found this traversal ordering has a very nice property: performance increases steadily for every extra bit of cache you have, resulting in good performance on all platforms. For most platforms, the L1 cache hit rate is 70% to 76% (compared to only 54% for the Fortran 77 program described above). This isn’t the case for blocked algorithms, where performance makes abrupt jumps at critical cache sizes, making tuning very platform-dependent."

    Source:

    http://ubiety.uwaterloo.ca/~tveldhui/papers/DrDobbs2/drdobbs2.html

    More:

    http://oonumerics.org/blitz/

    The method was later re-invented for use in FFTW, a numerical Fourier transform library coded, ultimately, in OCaml, but distributed in C.  FFTW’s authors have also failed to acknowledge Veldhuizen’s priority.

  5. Alex says:

    While it is important to credit Todd as well as other people who published similar observations on cache oblivious behaviour as far back as the 70s, the key contribution in Prokop’s thesis is in providing the cache oblivious design paradigm. This consists of (a) design an algorithm that is optimal for a single unknown cache size B (b) analyze it in the I/O model of Aggarwal and Vitter assuming perfect paging and last but certainly not least (c) a proof that the analysis holds for all cache levels (modulo a constant and the tall cache assumption).

    As far as I understand, none of the early observers of cache oblivious behaviour had such a general framework.

  6. Nathan Myers says:

    This is an exceedingly peculiar construction: "observers of cache oblivious behavior".  Cache-oblivious algorithms don’t occur in nature.  They must be deliberately designed — invented — with the explicit goal of maximizing cache locality.  The construction denies agency to the inventors.

    There is no evidence to suggest Veldhuizen followed Prokop’s "design paradigm", and no hint of evidence that he would have got better results if he had done so.  In other words, these steps (a) and (b) may be necessary in order to construct (c), but are manifestly not necessary to design an optimal algorithm.  Prokop deserves credit for determining that (a) and (b) would allow him to write proofs, and for writing the proofs.  He does not deserve credit for originating a family of algorithms which predated his work; at most, he deserves credit for proposing a catchy name for the family.

    The statement "it is important to credit (etc.)" would carry more weight if those so obliged ever bothered actually to do so.  An observer scanning references in published papers might reasonably conclude none of the prior work had been done, or (suspecting it had) get any idea of who did the work.

  7. Alex says:

    <i>They must be deliberately designed — invented — with the explicit goal of maximizing cache locality. </i>

    Not necessarily so. Here’s what happened: some researchers were performing experiments with sorting algorithms. They observed that certain well known sorting algorithms did not show a sharp drop in performance when data exceeds RAM size. Intrigued by this, they ran tests on different memory sizes and observed that these algorithms naturally adapted to all cache sizes.

    Prokop and his group stumbled upon cache oblivousness under very similar circumstances, that is as the result of experimental observations. The difference is that the 70s researches just made a note of it as an interesting fact, while Prokop et al. decided to first fully explain the observed instance  (which Todd also did) and then proceeded build a general theory around it.

    <i>There is no evidence to suggest Veldhuizen followed Prokop’s "design paradigm"</i>.

    I agree. That is precisely why Prokop gets the claim and Veldhuizen doesn’t. Prokop developed the entire model while Veldhuizen (and others) found single data points.

    The chain of events is something like "Tycho Brae, Johannes Kepler, Isaac Newton". Tycho made the initial planetary observations, Kepler was the first to obtain a formal statement for them (Kepler’s laws), and Newton developed the principles that govern the entire thing (classical mechanics). Just substitute "70s researchers" for Tycho, Todd Veldhuizen for Kepler and Prokop for Newton. Attribution should be thus apportioned.

    <i>The statement "it is important to credit (etc.)" would carry more weight if those so obliged ever bothered actually to do so.</i>

    Again, no objection there. I’ve heard similar grunting from the 70s researchers. They also gripe that their role in this discovery is ignored.

  8. Nathan Myers says:

    Again, Veldhuizen did not "observe" an "instance".  He also did not "stumble upon" anything.  He reasoned from first principles, and deliberately set out to create a family of cache-oblivious algorithms.  He completely succeeded, and incorporated the the algorithms into a practical, generally useful matrix library, released freely for anyone to use or to study.  He measured the results and published everything, so that anyone else could obtain the same results.

    He did not follow Prokop’s "design paradigm" because (a) no such thing existed yet, and (b) it was manifestly unnecessary.  Its demonstrated superfluity would lead an objective observer to doubt claims to its generality. Indeed, Prokop’s "paradigm" is just one design approach among many, and not an especially efficient one, however amenable it may be to subsequent analysis.  Veldhuizen’s approach, however "general", has directly inspired every subsequent practical matrix library, from POOMA to VSIPL++.

    "Griping" and "grunting" are prejudicially disparaging descriptions of legitimate complaints.  Analogies to astronomical luminaries are similarly prejudicial, and similarly without foundation.  Re-making Veldhuizen’s contribution from a deliberate undertaking to an accidental "observation" is also prejudicial.

    It’s always easy to make up reasons why a latecomer’s work is more deserving of acknowledgement than an originator’s.  Such ease leads the scrupulous to suspicion of those reasons.  Regardless of the (probably unworthy) true reasons, the scrupulous action would be to set about correcting the injustice.  Facile apologetics only compound it.

  9. Kuszmaul says:

    Also, I’ve heard many people state that carefully tuned cache-aware algorithms can be cache-oblivious algorithms.  I’ve seen cases where that’s not true. Surprisingly, for

    some situations, cache-oblivious codes are faster than any known cache-aware code.  It may be true that cache-aware can beat cache-oblivious algorithms.  Or maybe not.

    -Bradley

  10. Matteo Frigo says:

    Nate,

    as one of the authors of the 1999 cache-oblivious FOCS  paper, allow me to note that you are barking up the wrong tree.  Charles Leiserson is one of the

    most careful authors I know, especially in matters

    of scholarship, and your implication of negligence

    and/or malice is completely unwarranted.

    In our paper, we dutifully reference the 1987 paper

    [2] by Aggarwal et al., which contains many similar

    ideas in a different context; we acknowledge the 1969 paper by Singleton [29]

    which recognizes the locality benefits of recursive FFTs; we cite our own 1996 work [9] (predating Todd) with the analysis of the recursive matrix multiplication algorithm and a related algorithm for LU decomposition, later perfected by Toledo [32].

    It is unfortunate that Todd’s Dr. Dobbs article that

    you link to does not

    exhibit so much respect for prior research.  (E.g.,

    Todd’s stencil algorithm is a suboptimal variant

    of an 1981 algorithm by Hong and Kung, ref. [21] in our paper.)

    Let me also point out

    that while we used matrix multiplication as a simple example, most of our 1999 paper is devoted to two cache oblivious sorting algorithms, which were indeed novel at the time of publication (or at least,

    nobody has claimed priority in the past ten years).

    Regards,

    Matteo Frigo