Unrolled linked lists

Today I'll be discussing unrolled linked lists, a simple variant of the linked list that retains many of its desirable properties but exploits the CPU cache to yield considerably better performance on modern PCs.

In elementary data structures classes, students are often taught about arrays and linked lists and the algorithmic tradeoffs between the two. In an array, you can get the kth element quickly, but adding elements to the middle of the array is slow; linked lists are the opposite. However, arrays have a number of practical benefits that this simple treatment doesn't mention:

  • Arrays are compact; an array stores almost entirely data, while linked lists (especially doubly linked lists) can require many times as much space for bookkeeping (pointers, allocation metadata, alignment) as for actual data, particularly for small elements.
  • Modern PCs have multi-level cache hierarchies that make traversing an array (visiting its elements in order) very fast. Cache hits are so fast that in cache-sensitive analysis they are considered "free"; we count only cache misses. If a cache line holds B elements, then traversing an array of n elements incurs only about n/B cache misses. A linked list, on the other hand, can require a cache miss for every node it visits in the worst case. Even in the best case, when the nodes are allocated consecutively in order, linked list nodes are larger, so traversing the list can incur several times as many cache misses.

Let's make this concrete: using Gentoo Linux and gcc on a 3.5 GHz Pentium 4 machine, I constructed a linked list of 60 million integers and an array of the same 60 million integers, compiling with full optimization. Traversing the linked list required 0.48 seconds, while traversing the array required 0.04 seconds, 12 times faster. Moreover, when I introduced code to fragment the memory pool, the linked list's disadvantage grew dramatically, to 50 times or more. The linked list also required 4 times as much memory: the next pointers doubled the space, and allocation metadata doubled it again.
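
If you'd like to try something along these lines yourself, here is a minimal sketch of such a benchmark in C (illustrative only, not the exact code behind the numbers above; timings will vary with your hardware, compiler, and allocator):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define COUNT 60000000

    struct node { int value; struct node *next; };

    int main(void) {
        /* Build an array and a linked list holding the same values
           (error checking omitted for brevity). */
        int *array = malloc(COUNT * sizeof *array);
        struct node *head = NULL;
        for (int i = COUNT - 1; i >= 0; i--) {
            struct node *n = malloc(sizeof *n);
            array[i] = i;
            n->value = i;
            n->next = head;
            head = n;
        }

        /* Time one pass over each; printing the sums keeps the
           compiler from optimizing the loops away entirely. */
        long asum = 0, lsum = 0;
        clock_t t0 = clock();
        for (int i = 0; i < COUNT; i++) asum += array[i];
        clock_t t1 = clock();
        for (struct node *p = head; p != NULL; p = p->next) lsum += p->value;
        clock_t t2 = clock();

        printf("array: %.2fs  list: %.2fs  (sums: %ld %ld)\n",
               (double)(t1 - t0) / CLOCKS_PER_SEC,
               (double)(t2 - t1) / CLOCKS_PER_SEC, asum, lsum);
        return 0;
    }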

Don't throw out your linked lists just yet, though. If we had built the 60-million-element array above by inserting each element into the middle, it would have taken weeks, despite the cache advantage. In applications where insertions and deletions in the middle are frequent, arrays just don't work. Arrays also have the distinct disadvantage that they are slow to grow, and in a fragmented memory pool, it may not be possible to allocate a large array at all. We need something with the cache advantages of an array but the quick insertions and incremental growth of a linked list.

Enter the unrolled linked list. In simple terms, an unrolled linked list is a linked list of small arrays, each with the same fixed capacity N. Each array is small enough that inserting into or deleting from it is quick, but large enough to fill at least a full cache line. An iterator pointing into the list consists of both a pointer to a node and an index into that node's array.
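
Concretely, a node and an iterator might look like this sketch in C (the names are mine, and the capacity of 124 is just an example, matching the experiment later in this post):

    #define NODE_CAPACITY 124   /* example value; tune so a node's
                                   size is roughly a multiple of the
                                   cache line size */

    struct unode {
        struct unode *next;        /* following node, or NULL */
        int count;                 /* elements currently stored */
        int values[NODE_CAPACITY]; /* the small array itself */
    };

    /* An iterator is a (node, index) pair. */
    struct uiterator {
        struct unode *node;
        int index;
    };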

Let's consider insertion first. If there is space left in the array of the node where we wish to insert the value, we simply insert it, requiring only O(N) time to shift later elements over. If the array already contains N values, we create a new node, insert it after the current one, and move half the elements into it, creating room for the new value. Again, the total time is O(N). Deletion is similar: we simply remove the value from the array. If the number of elements in the array drops below N/2, we take elements from a neighboring array to fill it back up; if the neighboring array is also only half full, we merge the two nodes instead. Those familiar with B-trees may notice some similarities in these operations.
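
As a sketch, here is what insertion might look like in C, assuming the structures from the sketch above; deletion, with its borrowing and merging, follows the same pattern in reverse:

    #include <stdlib.h>
    #include <string.h>

    /* Insert value at position index within node n, first splitting
       n into two half-full nodes if it is already full. */
    void unode_insert(struct unode *n, int index, int value) {
        if (n->count == NODE_CAPACITY) {
            /* Split: move the upper half of n's values into a
               fresh node linked just after n. */
            struct unode *half = malloc(sizeof *half);
            int moved = NODE_CAPACITY / 2;
            memcpy(half->values, n->values + NODE_CAPACITY - moved,
                   moved * sizeof(int));
            half->count = moved;
            half->next = n->next;
            n->next = half;
            n->count -= moved;
            /* If the insertion point fell in the moved half,
               continue there instead. */
            if (index > n->count) {
                unode_insert(half, index - n->count, value);
                return;
            }
        }
        /* Shift the tail of the array up one slot and drop the
           value in: O(N) work within a single small array. */
        memmove(n->values + index + 1, n->values + index,
                (n->count - index) * sizeof(int));
        n->values[index] = value;
        n->count++;
    }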

Your first reaction may be that it seems wasteful for every array to be potentially half empty. On average, though, each array will be about 70% full. Even in the worst case, this is competitive with ordinary linked lists, which require at least one and as many as four or five words of overhead per element. By amortizing space overhead over several elements, an unrolled linked list can come very close to the compactness of an ordinary array. The per-element overhead of the node's pointer and count is proportional to 1/N, and in practice amounts to at most a few bits per element (for example, 16 bytes of pointer and count overhead spread over 128 elements comes to one bit per element). We can adjust N to trade off compactness against operation time. The space advantage becomes much greater for lists of small values like characters or bits. Finally, we can set a higher low-water mark than 50% if necessary, at the cost of more frequent node splits and merges.

Although space is critical in many applications, the primary advantage of unrolled linked lists is their cache behavior. Suppose each node's size is roughly a multiple of the cache line size, each cache line holds B elements, and all nodes are full. Then we can traverse each node optimally, with N/B cache misses, and traverse the whole list with about (n/N + 1)(N/B) = n/B + N/B cache misses, very close to the optimal n/B misses for realistic values of N. In the average case, with nodes about 70% full, we need about 40% more cache misses than traversing an array, and in the worst case, with nodes half full, about twice as many - which sounds a lot better than the 12 to 50 times as many we measured for ordinary linked lists.
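
Traversal code makes the claim concrete: the inner loop streams through one contiguous array, and only the hop to the next node risks an extra cache miss. A sketch, again assuming the structures above:

    /* Sum every value in the list. The inner loop walks one
       contiguous array (cache-friendly, sequential accesses);
       the outer loop follows only one pointer per N elements. */
    long ulist_sum(const struct unode *head) {
        long sum = 0;
        for (const struct unode *n = head; n != NULL; n = n->next)
            for (int i = 0; i < n->count; i++)
                sum += n->values[i];
        return sum;
    }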

In practice there's a bit of extra overhead for dealing with the added complexity. I tried out a simple unrolled linked list in our original experiment with a list of 60 million integers, placing 124 integers in each node. With all nodes full and a fragmented memory pool, it created the list in 0.64 seconds and traversed it in 0.10 seconds, about 2.5 times slower than the array (we may be observing some L2 cache effect here). With the nodes 50% full, traversal required about 0.15 seconds. Total space usage with all nodes full was 17% more than the array; with nodes 50% full it was about twice that, of course. Memory pooling could narrow the space gap.

One small issue with unrolled linked lists is keeping track of the number of elements in each array. One easy and space-efficient way to deal with this, where applicable, is to fill unused array slots with some reserved "null" value. Alternatively, if the nodes are aligned, the count can sometimes be stored in the low bits of the pointers. Otherwise, it adds a small amount of overhead per node.
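
As an illustrative sketch of that pointer trick (my own packing scheme, not a standard layout): if nodes are allocated with 128-byte alignment, the low 7 bits of every node address are zero, which is enough room to carry a count of up to 124 in a node's own next field:

    #include <stdint.h>
    #include <stdlib.h>

    /* With 128-byte-aligned nodes, the low 7 bits of a node address
       are always zero, so a node's own element count (0..124) can
       ride in the low bits of its next field. */
    #define COUNT_MASK ((uintptr_t)127)

    struct tnode {
        uintptr_t next_and_count;  /* tagged pointer: next | count */
        int values[124];
    };

    static struct tnode *tnode_next(const struct tnode *n) {
        return (struct tnode *)(n->next_and_count & ~COUNT_MASK);
    }

    static int tnode_count(const struct tnode *n) {
        return (int)(n->next_and_count & COUNT_MASK);
    }

    static void tnode_link(struct tnode *n, struct tnode *next, int count) {
        n->next_and_count = (uintptr_t)next | (uintptr_t)count;
    }

    static struct tnode *tnode_alloc(void) {
        /* C11 aligned_alloc requires the size to be a multiple of
           the alignment, so round it up. */
        size_t size = (sizeof(struct tnode) + 127) & ~(size_t)127;
        return aligned_alloc(128, size);
    }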

I hope some of you found this illuminating. In the future I hope to cover more cache-sensitive data structures, as well as cache-oblivious data structures. Please ask any questions you have. Thanks for reading.

Derrick Coetzee
This posting is provided "AS IS" with no warranties, and confers no rights.