A concrete illustration of practical running time vs big-O notation

One of the five things every Win32 programmer needs to know is that memory latency can throw your big-O computations out the window. Back in 2003, I ran into a concrete example of this.

Somebody started with the algorithm presented in Fast Algorithms for Sorting and Searching Strings by Jon L. Bentley and Robert Sedgewick, implemented it in C#, and compared the performance against a HashTable, TernaryTree and SortedList. Surprisingly, the hash table won on insertion and retrieval of tens of thousands of randomly-generated strings. Why?

Remember, big-O notation hides the constants, and those constants can get pretty big. What's more, the impact of those constants is critical for normal-sized workloads. The big-O notation allows you to compare algorithms when the data sets become extremely large, but you have to keep the constants in mind to see when the balance tips.

The central point of my presentation at the PDC was that complexity analysis typically ignores memory bandwidth effects and assumes that all memory accesses perform equally. This is rarely true in practice. As we saw, leaving L2 is a big hit on performance, and accessing the disk is an even greater hit.

The tree doesn't rebalance, so inserting strings in alphabetical order will result in a bad search tree. (They do call this out in their paper.) To locate a k-character string, Bentley-Sedgewick traverses at least k pointers, usually more. ("How much more" depends on how many prefixes are shared. More shared prefixes = more pointers traversed.) It also requires k(4p) bytes of memory to store that string, where p is the size of a pointer. Remember those pesky constants: with 4-byte pointers that's 16 bytes of tree overhead for every character stored, and that sort of constant factor is what starts to kill you as the data set grows.
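
To make that constant concrete, here is a rough sketch of what a ternary-search-tree node looks like as a C# class (illustrative only, not the implementation that was tested); the arithmetic in the comments assumes a 32-bit CLR with 4-byte references.

    // Hypothetical node layout for a Bentley-Sedgewick ternary search tree.
    // (Illustrative names; not the code that was benchmarked.)
    class TstNode
    {
        public char SplitChar;   // 2 bytes, padded out by the field layout
        public TstNode LoKid;    // 4-byte reference on a 32-bit CLR
        public TstNode EqKid;    // 4-byte reference
        public TstNode HiKid;    // 4-byte reference

        // Rough cost per node on a 32-bit CLR: about 8 bytes of object
        // header plus the fields above, roughly 24 bytes per node. That is
        // even more than the 4p = 16 bytes of the paper's C struct. A
        // k-character string needs at least k of these nodes, so you pay
        // on the order of k * 24 bytes of tree to store k * 2 bytes of
        // actual characters.
    }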

More on those constants: Complexity analysis assumes that an add instruction executes in the same amount of time as a memory access. This is not true in practice, but the difference is a bounded constant factor, so it can be ignored for big-O purposes. Note, however, that that constant often exceeds one million if you take a page fault. One million is a big constant.

Going back to memory bandwidth effects: At each node, you access one character and one pointer, so you use only 6 bytes (a two-byte character plus a four-byte pointer) out of a 64-byte cache line. You're wasting 90% of your bus bandwidth and you will certainly fall out of L2.

Bentley-Sedgewick says that this is beaten out by not traversing the entire string being searched for in the case of a miss. I.e., their algorithm is tuned for misses. If you expect most of your probes to be misses, this can be a win. (The entire string is traversed on a hit, of course, so there is no gain for hits.)

Note also that this "cost" for traversing the string on a miss is overstated due to memory bandwidth effects. The characters in a string are contiguous, so traversing the string costs you only L/64 cache lines, where L is the length of the string, and one potential page fault, assuming your string is less than 4KB. Traversing the tree structure costs you at least L cache lines and probably more depending on your branching factor, as well as L potential page faults.

Let's look at the testing scenario again. The testing was only on hits, so the improved performance on misses was overlooked entirely. What's more, the algorithm takes advantage of strings with common prefixes, but the testing scenario used randomly-generated strings, which is the opposite of the data set the algorithm was designed for: randomly-generated strings are spread out across the problem space instead of being clustered around common prefixes.
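
For reference, the shape of such a test is roughly the following (a minimal sketch, not the actual benchmark code): every key that gets looked up was just inserted, so every probe is a hit, and random keys share essentially no prefixes.

    using System;
    using System.Collections;

    class Benchmark
    {
        static void Main()
        {
            Random rand = new Random(12345);
            string[] keys = new string[50000];
            for (int i = 0; i < keys.Length; i++)
                keys[i] = RandomString(rand, 8);

            // Insertion: every key is new.
            Hashtable table = new Hashtable();
            foreach (string key in keys)
                table[key] = key;

            // Retrieval: every probe is a hit, which is exactly the case
            // the ternary tree is not optimized for.
            int found = 0;
            foreach (string key in keys)
                if (table.ContainsKey(key)) found++;

            Console.WriteLine("found {0} of {1}", found, keys.Length);
        }

        // Random strings are spread across the key space and share few
        // prefixes -- the opposite of the data the tree is designed for.
        static string RandomString(Random rand, int length)
        {
            char[] chars = new char[length];
            for (int i = 0; i < length; i++)
                chars[i] = (char)('a' + rand.Next(26));
            return new string(chars);
        }
    }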

Those are some general remarks; here are some performance notes specific to the CLR.

I don't know whether it does or not, but I would not be surprised if System.String.GetHashCode caches the hash value in the string, which would mean that the cost of computing the hash is shared by everybody who uses it in a hashing operation. (Therefore, if you count the cost incurred only by the algorithm, hashing is free.)
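
If you want that amortization for certain, you can do it yourself at the call site. Here is a minimal sketch of a hypothetical caching wrapper (an illustration of the idea, not anything System.String actually provides):

    // Hypothetical wrapper that computes a string's hash code once and
    // reuses it, so repeated hashing operations share the cost.
    sealed class HashedString
    {
        public readonly string Value;
        private readonly int hash;

        public HashedString(string value)
        {
            Value = value;
            hash = value.GetHashCode();   // walk the string once, up front
        }

        public override int GetHashCode() { return hash; }

        public override bool Equals(object obj)
        {
            HashedString other = obj as HashedString;
            return other != null && Value == other.Value;
        }
    }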

Note also that Bentley-Sedgewick's insert() function stores the object back into the tree in the recursive case. Most of the time, the value being stored is the same as the value that was already there. This dirties the cache line for no reason (forcing memory bandwidth to be wasted on a flush) and—particularly painful for the CLR—you hit the write barrier and end up dirtying a whole boatload of cards. A very small change avoids this problem: Change

    p->eqkid = insert(p->eqkid, ++s);

to

    Tptr newkid = insert(p->eqkid, ++s);
    if (p->eqkid != newkid) p->eqkid = newkid;

(and similarly in the other branches). "This code is short but subtle, and worth careful study." How very true.

Note also that if you use their "sleazy hack" of coercing a string to a Tptr, you have to change the type of eqkid from Tptr to object. This introduces a CLR type-check into the inner loop. Congratulations, you just tubed the inner loop performance.
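
Here is a rough C# sketch of what that inner loop ends up looking like (an illustration of the pattern, not the code that was benchmarked): because eqkid can hold either a child node or, at a leaf, the rest of the stored string, every trip through the equal branch has to ask the CLR what it is looking at.

    // Illustrative sketch of the "sleazy hack" in managed code: eqkid is
    // declared as object because it holds either a child node or, at a
    // leaf, the remaining characters of the stored key.
    class SleazyTst
    {
        class Node
        {
            public char SplitChar;
            public Node LoKid, HiKid;
            public object EqKid;   // Node, or string at a leaf
        }

        static bool Contains(Node root, string s)
        {
            object p = root;
            int i = 0;
            while (p != null)
            {
                // The type test that lands in the inner loop: is this a
                // node or a coerced string?
                string leaf = p as string;
                if (leaf != null)
                    return leaf == s.Substring(i);

                Node node = (Node)p;
                char c = i < s.Length ? s[i] : '\0';
                if (c < node.SplitChar)      p = node.LoKid;
                else if (c > node.SplitChar) p = node.HiKid;
                else
                {
                    if (c == '\0') return true;   // matched the terminator node
                    i++;
                    p = node.EqKid;
                }
            }
            return false;
        }
    }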

Now go to the summary at the end of the article.

  1. "Ternary trees do not incur extra overhead for insertion or successful searches." I'm not sure what "extra" means here, but hash tables have the same behavior.

  2. "Ternary trees are usually substantially faster than hashing for unsuccessful searches." Notice that they are optimized for misses.

  3. "Ternary trees gracefully grow and shrink; hash tables need to be rebuilt after large size changes." True, but the CLR hashtable does this so you don't have to. Somebody wrote it for you.

  4. "Ternary trees support advanced searches such as partial-match and near-neighbor search." These operations weren't tested.

  5. "Ternary trees support many other operations, such as traversal to report items in sorted order." These operations weren't tested either.

Notice that the article doesn't claim that ternary trees are faster than hashing for successful searches. So if that's all you're testing, you're testing the wrong thing. One of the big benefits of ternary trees is the new operations available (4 and 5), but if you're not going to perform those operations, then you're paying for something you don't use.

There are several morals of the story.

  1. Constants are important.
  2. Memory bandwidth is important.
  3. Performance optimizations for unmanaged code do not necessarily translate to managed code.
  4. What are you really testing?

Mind you, Bentley and Sedgewick are not morons. They know all this.

[Typo fixed 11am, thanks Nathan_works and Jachym Kouba.]

Comments (29)
  1. nathan_works says:

    It also requires k(4k) bytes of memory to store that string, where p is the size of a pointer.

    Assuming that should be P(4K) ? (end result: 4PK and not 4K^2 ?)

    I’m curious how many times you, Raymond, have needed to implement any of the shiny algorithms presented in say, the *other* CLR (Cormen/Leiserson/Rivest, now with Stein I see) Introduction to Algorithms book..

    As an apps programmer, you do pay attention to not purposefully write O(N^2) or O(N!) type algorithms, and careful use of STL or CLR containers can help.. But too many fancy algorithms aren’t easily maintained, given the time to understand what the algorithm is doing versus the pressure to fix the !@#$@ bug. KISS works for apps, but perhaps not for the OS..

  2. Colin Blair says:

    Anyone who finds this post interesting should also check Delay’s recent blog entry on a Sedgewick design: http://blogs.msdn.com/delay/archive/2009/06/02/maintaining-balance-a-versatile-red-black-tree-implementation-for-net-via-silverlight-wpf-charting.aspx

  3. Tom says:

    Software development is about maintaining an acceptable balance between speed of development, code maintenance and customer satisfaction.  Most people focus on the first two to the detriment of the latter, leading to the "software slow? — throw more hardware at it" solution to performance problems.  Indeed, the faster that software and features are developed, the better things seem — at least to those selling the software!  But if you think about it, the cost of one software developer squeezing out a few seconds from his program is insignificant compared to the amortized cost of the users (assuming you have more than a handful) waiting for the program to run over the program’s expected lifetime.

    As at least three notable people have mentioned, software gets twice as slow every twelve months while processor speeds double every eighteen months.  It’s a losing battle!  And it certainly explains why the software I ran on my 8086 PC seems just as fast as today’s software.  Why isn’t today’s software faster?

    Raymond points out a very good reason: people assume memory accesses are uniform when they are not.  This mistaken assumption leads to the selection of algorithms that look good in theory but perform terribly in practice.  The non-uniformity of memory access means that developers looking for performance must choose algorithms and data structures that take advantage of the non-uniformity to squeeze the maximum out of the hardware.  For something like the Windows kernel, GDI or the Windows shell, performance is critical.  Not only does it save time for millions of users around the world, but it also helps sell more units if the OS is snappy and people don’t have to wait.

    Now, I’m not saying that a developer needs to eke out a few microseconds from something like MSN Messenger (or is it Office Communicator?).  There’s no point as it is I/O bound, especially from the user’s side.  But anything that is CPU bound and takes more than two seconds to complete should probably be a good candidate for optimization.  Yes, I said two seconds!  Think back to those studies showing that if a web page takes more than 1.5 seconds to load, people switch to something else.  That’s the attention span of today’s computer users.

  4. Wyatt says:

    I don’t know whether it does or not, but I would not be surprised if System.String.GetHashCode caches the hash value in the string

    Not according to reflector.  If you think about it, it would require increasing the size of every string by 4 bytes, most of which will never be hashed.  Not good for overall system performance.

  5. George says:

    Moral 1 reminds me of one of Rob Pike’s notes on C programming.

    "Fancy algorithms tend to run more slowly on small data sets than simple algorithms. They tend to have a large constant factor in O(n) analysis, and n is usually small. So don’t get fancy unless Rule 2 indicates that n is big enough."

    (From http://en.wikipedia.org/wiki/Unix_philosophy#Pike:_Notes_on_Programming_in_C)

  6. DWalker says:

    Different people have different interests, but … THIS is the kind of article that I find fascinating.  Thanks!

  7. Jachym Kouba says:

    [[ It also requires k(4k) bytes of memory to store that string, where p is the size of a pointer. ]]

    Hi Raymond, I believe you made a little typo there :)

  8. ERock says:

    Wouldn’t this just further illustrate how Computer Science isn’t interchangeable with Software Engineering?

  9. Joshua Muskovitz says:

    Is there a functioning link to the "5 things" video? The page referenced doesn’t still have a functioning video stream. :-(

  10. Tomas says:

    @Tom: You’re way overanalyzing the situation.  Assuming that memory access is uniform is about 499th on a list of the top 500 performance issues in software today.  And programmer productivity is enormously important.  What are all those users going to do in those accumulated two seconds?  Spend two extra seconds browsing the web, probably.  Or take an extra coffee break.  Whereas if the programmer spends five weeks prematurely optimizing, that’s five weeks of features that will be missing from the product.

    @naaa: Watch some of today’s effects extravaganzas with the sound off sometime.  You’ll be shocked at how unrealistic the special effects look (this also says something about how important a contribution sound makes to a movie).  The current fetish for sweeping camera movements gives the effects shots a completely different tempo and style to the live action.

    If done right, photochemical effects can look absolutely convincing.  See 2001: A Space Odyssey, or the drydock scenes of Star Trek: The Motion Picture — they’re carefully crafted to look like someone is really out in space with a camera, shooting the footage of these giant spaceships.  Compare to Star Treks 2-4, which were done by a different company (ILM) and really look cheesy in comparison.

    And yes, ILM did the Indiana Jones trilogy.  And the effects looked cartoonish.  But for those movies, perhaps that’s actually a good thing.

  11. naaa says:

    >> Why isn’t today’s software faster?

    It does things you couldn’t even imagine on the 8086. Really. And it was soooo slower. Really.

    It’s just that at the time it was "just fast enough", and now is still "just fast enough".

    It’s like graphics.. you ask someone about the original Prince of Persia graphics and they would say it is fantastic. Lemmings graphics was fantastic. Heck, Eye of the beholder gfx was fantastic and probably even Monkey Island I gfx was. Really, they *were* at the time.. now they suck.

    The same for movies. Watch 70-80s sci-fi or Indiana Jones or whatever.. suck.

    The past is always remembered better than it was.

  12. BC says:

    Our typical use-case is 1 context struct, and 1 is always required, and there is no upper limit.  I suggested we just define one instance in the data segment, and allocate the others as necessary.  Locate the context nearby other frequently-accessed data to potentially realize gains like cache-hits, eliminate a node in the list of allocations…

    The other person disliked this because it was "cleaner" and "safer" to just always allocate the contexts, instead of special casing on the first instance, and I had no numbers to prove there was a performance benefit, so we always allocate contexts.

    Do you think it would be worth the effort to try and measure the differences between the two approaches?  

    Is it good engineering to think along the lines of reducing memory allocations and where data is located, or is it an example of over-engineering or needless optimization?


  13. Josh says:

    Wouldn’t  ‘newer’ data structures optimized for cache hits be ideal in this case? Why optimize for cache misses?

  14. Randall says:

    Makes me wonder how B+trees and other "paged" structures perform.   There are challenges to doing them efficiently in managed code (c# has no placement new, right?), and I guess the practical question is whether your working set fits in RAM — if very little fits in RAM, B+trees FTW (bigger pages -> fewer seeks), but if it does fit, there are probably lots of log(n) structures that are faster.  

    It surprises me a little that collections seem to be written in regular managed code, rather than C or C++ that knows how to talk to the garbage collector.  Maybe there’s just no performance gain, or maybe the latter approach is pretty much un-implementable.

  15. Ben Voigt [C++ MVP] says:


    C++/CLI programs can opt out of the garbage collector by allocating from native heaps (actually you can do that from C# or VB.NET, but it’s far more work).  But there’s no control over the garbage collector: if you allocate from the managed memory pool then you get the exact same behavior as every other managed language.  There is no placement new with managed instances, as a consequence of enforced strong typing of objects (which is used to provide type safety guarantees, but is also enforced on the managed pool even when accessed type-unsafely).

  16. Maybe remove this post says:

    If you care about memory you can model it with big Oh as well. You and t-a-w show a fundamental misunderstanding of the purpose of big Oh notation.

    Big Oh is a generalization so you can understand how the algorithm scales. Everyone knows you throw away the constant so you can easily label the COMPLEXITY (NOT THE RUNTIME) of the algorithm.

    I suggest you go read wikipedia or some other source you trust about big Oh so you can actually understand how it is used. It is obvious to me from this quote " HashTable, TernaryTree and SortedList. Surprisingly, the hash table won on insertion and retrieval of tens of thousands of randomly-generated strings. Why?" that you have no idea what you’re talking about. Why is it surprising the hashtable won? It has near O(1) insertion and access big Ohs, while sortedlist and ternarytree have at least O(log(N)) for each access.

    This simple example pulled from your own text demonstrates your fundamental misunderstanding of big Oh. Also the fact that you didn’t bother to model memory either in big Oh shows how unaware you are of it.

  17. I’m confused about the surprise about this result — hashing is O(1), and search/sorting in any tree is O(logN). Why wouldn’t hashing be faster? The surprise would be if the tree was faster, even for small datasets. I can see linear search in an array being faster for small datasets, but hashing, done correctly (good key generation, sufficient size hash table, good re-mapping functions) should be faster than trees across a wide variety of data set sizes.

    Am I missing something?

  18. yme says:

    Hashing strings isn’t O(1); strings can be arbitrarily long.

    The time required to find a string in a hash table doesn’t increase with the number of items in the table, but it does increase with the length of the string you’re looking for, because first you have to compute its hash, and to do that you have to process the entire string.

  19. configurator says:

    @yme: I was just about to say that.

    Hashing strings can never be free; since the hash code treats every byte in the string (it must, to be effective), the entire string is processed. And the hash is never cached with CLR strings.

  20. Mario says:

    @yme: nobody ever claimed that hashing is free… but when you analyze a sorting algorithm the size of the instance is given by the number of elements you must sort, not their respective lengths. In this context, the hashing step becomes a constant (it takes the same time to hash a particular string when you are sorting a string array of 10 or 10000 elements).

    As I see it, this is exactly what Raymond’s moral #1 says: Constants are important.

  21. Jolyon Smith says:


    I was going to say more or less the same thing, which boils down to….

    Sure if you reference movies from a certain period with crap/cheesy effects you may think you have proven that movies from that period ALL had crap/cheesy effects.

    But you haven’t.  You’ve simply proven that EVERY period of history is filled with products (of any/all types, not just movies) that are produced by people happy to accept "just good enough" as "good enough".

    The human capacity to not only accept but reward the barely adequate is boundless.

    In movies specifically, seamless/undetectable FX, even CGI, is not always the goal – style plays a part too.

    I refer you to Dark City and The Matrix, both produced at around the same time and both taking a very different approach to their FX.

    Some might naively say that Dark City was "cheap", but they are missing the stylistic points.

    But there is nothing stylish about software running slowly on modern hardware.

    Windows 7 vs Vista proves that it’s not some immutable law of the universe that software HAS to get slower in order to progress, or that "prettier" has to come at the cost of performance.

  22. Mike Dimmick says:

    Strings in .NET are immutable. This is so that you can safely pass them around between threads without fear that they’re going to change – and without requiring the extensive locking or ‘lock-free’ synchronization (still requires memory barriers, so you’re stalling a core somewhere) to manage mutable strings in multiple threads. To keep this behaviour, .NET would have to compute the hash code when the string was constructed, an unnecessary overhead considering most strings aren’t hashed.

    I guess the plan of checking the value and, if not set, recomputing it, then writing the hash out, is also possible, but in effect you’ve not saved anything much since parallel rehashing then recomputes the hash multiple times. How often is the same string object rehashed, anyway? Strings compiled into the code are interned, true, but it doesn’t search for an interned string object with the same character sequence when you construct an arbitrary string, it just creates a new one.

    All in all, sounds like an ‘optimisation’ that’s really a pessimisation for the 99% case.

  23. a lurker says:

    Good article; I like to read about how Big-O isn’t really the end of the story.

    Btw, typo (I think): “So if that’s all your testing”

    [Fixed, thanks. -Raymond]

  24. Several people here claimed it’s obvious that Strings do not cache the hash.

    It’s not obvious why this is not done. Java, for example, caches the hash; the hash is computed lazily, the first time it is needed.

  25. Note also that Tries/Ternary tries can consume much less memory than HashTables, depending on how much the Strings have in common.

  26. DWalker says:

    Dark City!  Great, great movie.

  27. Joseph Koss says:

    Note that it is also possible to use a hash table to implement a tree. The early-out argument in favor of trees is moot because of this.

    One general methodology leverages Zobrist-style hashing to generate a node's hashed address within the table, while another directly computes a heap-like numerical address and then hashes that value.

    This does not change the fact that hash tables cause lots of cache misses if the hash function is considered "good."

  28. BC says:

    "Wouldn’t  ‘newer’ data structures optimized for cache hits be ideal in this case? Why optimize for cache misses?"

    I misspoke, I meant page-swaps, not cache-hits.

    If the "context-we-always-need" is on the same physical page as "tables-we-always-use", then we would swap in just one page (table+context) vs. 2 pages (table+pointer and pointed-to-context).  

    Raymond’s article jogged my memory on this conversation.  If "constants are important" then it seems we would be doubling the number of page-swaps.  If "memory bandwidth is important", then it seems like it would be a good idea to define the "context-we-always-need" next to the "tables-we-always-use" – just in principle, no measuring required.

Comments are closed.