More Performance Tidbits for library writers


However careful you are with your performance culture when creating everyday professional code, you need to be doubly careful when creating libraries for other developers.  This is of course because when writing a library you can’t know in advance exactly which functions will be used in which ways and how often.  Even esoteric-looking functions might become vitally important in some customer scenario — so more than the usual caution is warranted.

Below is an edited bit of advice I gave my own team a while ago, presented for your benefit.  I present these as guidelines, but like all performance-related advice you have to take it with a grain of salt… there are times to break the rules. These experiences are based on what I’ve seen in our own Base Class Libraries (BCL), but they’re often applicable to other libraries.

The motivation for many of these rules comes from the fact that simple functions in the BCL frequently end up getting called O(n) times (or even O(n*lg(n)) times) in the course of larger scale operations.  It is because so many things are built on top of what we’ve done that we must be extra careful.

Principle #1:  Hashing anything must be a linear time operation and must not allocate any memory

If we spend more than O(n) time (n representing the size of the object to be hashed) doing the hashing, then we risk impacting the effectiveness of the hash collection in a major way.  We actually must do better than just O(n) — we want to make exactly one pass over what needs to be hashed.  Similarly, if we’re allocating any memory we will likely churn the heap like crazy when it comes time to do those high-speed lookups.
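To make that concrete, here is a minimal sketch (the NameRange type and its fields are hypothetical, not BCL code) contrasting an allocating hash with a single-pass, allocation-free one:

    struct NameRange
    {
        private readonly string _source;   // the original input string
        private readonly int _start;       // offset of this range within _source
        private readonly int _length;

        public NameRange(string source, int start, int length)
        {
            _source = source; _start = start; _length = length;
        }

        // Bad: allocates a temporary string on every hash:
        //   return _source.Substring(_start, _length).GetHashCode();

        // Better: exactly one pass over the characters, zero allocations.
        public override int GetHashCode()
        {
            int hash = 17;
            for (int i = _start; i < _start + _length; i++)
                hash = hash * 31 + _source[i];
            return hash;
        }
    }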

Principle #2:  Comparing any two objects must be a linear time operation and must not allocate any memory

This rule is perhaps even more important than #1 — comparison operations are frequently used in the context of sorting or in hash tables, so the operation might be repeated many, many times: O(n*log(n)) times for a sort, for instance.  An example of a bad way to compare two tree-type structures is to call ToString() on both and then compare the serialized form…  Again, we cannot afford to make more than one pass over each of the two input objects to do the comparison, and of course comparisons that can early-out are better than comparisons that cannot.
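A comparer along these lines illustrates the idea (ByteSequenceComparer is a hypothetical helper, and the inputs are assumed non-null): it makes at most one pass over each input and bails out at the first difference:

    using System;

    static class ByteSequenceComparer
    {
        // Returns <0, 0, or >0; shorter sequences sort first on ties.
        public static int Compare(byte[] a, byte[] b)
        {
            if (ReferenceEquals(a, b)) return 0;    // cheapest possible early-out

            int min = Math.Min(a.Length, b.Length);
            for (int i = 0; i < min; i++)
            {
                int diff = a[i] - b[i];
                if (diff != 0) return diff;         // early-out at first difference
            }
            return a.Length - b.Length;             // one pass at most, no allocations
        }
    }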

Principle #3:  When Parsing anything, know the input language and use a suitable technique for that input

A sure sign that you are parsing something is that the name of the function you are writing includes the word Parse in it. :)

When parsing, it is vitally important that you fully describe the input language — not only for your own benefit but for the benefit of those who come after you.  Good ways to do this include regular expressions, grammars, and so forth.  In addition to describing the input language you must describe what the semantics of that language are, i.e. what do the various forms mean?   This is also vitally important because others looking at the code later will otherwise have to reverse engineer the language you parse, and will be faced with the dubious prospect of wondering whether exotic input forms accepted by your parser (forms unlike the canonical inputs) were really intended or are bugs.

Having specified the language, apply a suitable technique.  Good parsers make exactly one pass over their inputs and create no, or virtually no, temporary objects in the course of parsing; i.e. they create the output data structure and nothing else.  Bad parsers do things like make one pass over the input to convert all whitespace to blanks, followed by another pass to convert backslashes to forward slashes, followed by assorted Substring calls to create string pieces which can then be passed to helper functions.   Generally speaking, the sorts of things you need to parse in a low-level API can be described by a regular expression and are therefore parseable, in place, with only a tiny amount of state to directly generate the output.
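As a sketch of the good pattern, consider a deliberately tiny, made-up input language: comma-separated decimal integers such as "10,42,7" (the IntListParser name and the language itself are assumptions for illustration).  One pass, no Split, no Substring, no temporary strings; the output is built directly from the characters:

    using System;
    using System.Collections.Generic;

    static class IntListParser
    {
        // Input language (regular): digits+ (',' digits+)*
        // Overflow handling is omitted to keep the sketch short.
        public static List<int> Parse(string s)
        {
            List<int> result = new List<int>();
            int value = 0;
            bool haveDigit = false;

            for (int i = 0; i < s.Length; i++)
            {
                char c = s[i];
                if (c >= '0' && c <= '9')
                {
                    value = value * 10 + (c - '0');   // accumulate in place
                    haveDigit = true;
                }
                else if (c == ',' && haveDigit)
                {
                    result.Add(value);                // emit directly to the output
                    value = 0;
                    haveDigit = false;
                }
                else
                {
                    throw new FormatException("unexpected character at offset " + i);
                }
            }

            if (!haveDigit) throw new FormatException("input must end with a digit");
            result.Add(value);
            return result;
        }
    }

The bad version of this same parser would Split on ',' and then Parse each piece — one temporary string per element, plus the array to hold them.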

To underscore these first three principles: very modest transformations in BCL logic netted a savings of 1.4M (yes, that’s megabytes) of temporary strings in a test case that creates 5 basic app domains.  That’s 1.4M of 2.2M total in that scenario.

Principle #4:  Temporary string objects are not a good thing

Naturally there are exceptions, but by and large if you are using String.Substring or String.Split in low-level APIs you’ve probably made a mistake.  Consider using a technique in which you keep track of offsets within your original input string and take advantage of the general String.Compare forms for comparing in the middle of the string.  This is especially important in the context of #1 and #2 above.   Note: this advice is really unique to libraries, and even then only certain low-level libraries like mscorlib.
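Here is a minimal sketch of the offset technique (TokenAt is a made-up helper name): String.CompareOrdinal has an overload that compares a region of one string against a region of another in place, so no Substring is needed:

    static class Scanner
    {
        // True if 'token' appears in 'input' at 'offset', without allocating.
        public static bool TokenAt(string input, int offset, string token)
        {
            return offset + token.Length <= input.Length
                && string.CompareOrdinal(input, offset, token, 0, token.Length) == 0;
        }
    }

    // Usage: instead of  input.Substring(pos, 5) == "http:"  (one temporary string)
    //   if (Scanner.TokenAt(input, pos, "http:")) { ... }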

Principle #5:  Don’t create abstractions to enumerate things which can be trivially enumerated without an abstraction

For instance — if you have an ArrayList, don’t use the Enumerator just to walk the elements — use an index; it’s much faster and requires no extra memory.  In one scenario I found that 12% of the total allocations were going to simple ArrayList enumerators — those could have been integers with no loss of clarity.
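A sketch of the indexed form (CountNonNull is just a made-up example):

    using System.Collections;

    static int CountNonNull(ArrayList list)
    {
        int count = 0;
        // An index instead of GetEnumerator(): no enumerator allocation,
        // no virtual MoveNext/Current call per element.
        for (int i = 0; i < list.Count; i++)
        {
            if (list[i] != null) count++;
        }
        return count;
    }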

Similarly, composing BinaryReaders on top of MemoryStreams so that you can read bytes out of a byte array is not such a good thing to do.  You could perhaps index the bytes with an int.
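For example, a little-endian Int32 can be read straight out of the array; this sketch gives the same result as BinaryReader.ReadInt32 over a MemoryStream, minus the stream and reader objects:

    static int ReadInt32LittleEndian(byte[] data, int offset)
    {
        // No temporary objects at all — just index arithmetic.
        return data[offset]
             | data[offset + 1] << 8
             | data[offset + 2] << 16
             | data[offset + 3] << 24;
    }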

Principle #6:  When creating data structures, pretend that following a pointer is a very costly operation

Actually you don’t have to pretend, because it really is costly compared to, say, going to the next element of an array.

Modern processors can issue multiple instructions per clock when things are going great, but following pointers means more cache misses, which means more CPU time spent doing nothing.  You especially have to beware of data structures which are built up over a long period of time — if the allocations did not happen near each other in time, then the extra benefits of locality that the GC provides can be lost.  That means cache misses, which brings your average performance way down, and an otherwise nice-looking data structure ends up killing you.
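To illustrate (Point and Node here are made-up types, not library ones): summing over a contiguous array walks memory linearly, while summing over a linked list follows a pointer per element, each one a potential cache miss:

    struct Point { public int X, Y; }

    class Node { public int Value; public Node Next; }

    static long SumXs(Point[] points)
    {
        long sum = 0;
        for (int i = 0; i < points.Length; i++)    // contiguous, prefetch-friendly
            sum += points[i].X;
        return sum;
    }

    static long SumValues(Node head)
    {
        long sum = 0;
        for (Node n = head; n != null; n = n.Next) // one pointer chase per element
            sum += n.Value;
        return sum;
    }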

Principle #7:  You’re not done with your implementation until you’ve measured its performance

If you remember nothing else, remember this:  it’s terribly easy to hook up something like CLRProfiler to your new implementation (you don’t even need to rebuild!).  Please do look at what your code did and study what got allocated.  Make sure that it is satisfactory and as you expected — if you don’t do this, the job’s not finished yet.

In a case I was just looking at the other day, innocuous-looking changes for readability and better factoring of the code (laudable goals) inadvertently caused a 2x space regression (i.e. the new way used twice the memory).   Those sorts of things jump right out at you if you use an allocation profiler — and they’re almost always drop-dead simple to fix.

I’d like to close with what I began with — these are only guidelines, and the most frustrating thing about performance work is that there are powerful secondary and tertiary effects, so “even the wisest cannot tell” what any given change might do without proper experimentation and measurement.  So take these thoughts under advisement only, then remember #7.

Thanks for listening,

Comments (22)

  1. Rico – thank you for that. Glad i was listening :)

  2. Rico Mariani says:

    Doh, it’s Principle not Principal — I noticed that when I first sent this out as email and then I didn’t fix it when I posted it as a blog…

    Thanks readers :)

  3. Stu Smith says:

    Regarding point 5, it’s a shame the compiler and BCL couldn’t work together to handle this. Using foreach seems to me to be the natural way of iterating any array or collection, but it has this performance hit. ArrayList (for example) has virtual methods, yet I would have thought most people use it as-is. Perhaps if ArrayList had been ArrayListBase (abstract but with no abstract methods), with a sealed ArrayList derived from it, then the various compilers could have recognised that, and emitted an IL for-loop instead of a foreach. And surely arrays could be compiled in this way as the framework stands?

  4. Rico Mariani says:

    List<T> doesn’t have this problem so hopefully people will be happy with List<Object> anywhere they would have used ArrayList going forward.

  5. I’m with Stu on this. I actually wanted to say it myself, but Stu put it very well.

    I think that the compiler could do a very nice job recognizing the BCL’s collections. Since these collections have the overhead during foreach, but not during for, it could compile foreaches to fors.

    You’re saying "it’ll be fixed for version 2.0", but we need it inside systems that are already in production with v1.1. Can something be done about this?

    Also, could you elaborate on the "List<T> doesn’t have this problem" bit?

  6. Rico Mariani says:

    Quoting myself from the Performance and Scalability PAG

    Enumeration Overhead

    The .NET Framework version 1.1 collections provide an enumerator by overriding IEnumerable.GetEnumerator. This turns out to be less than optimal for a number of reasons:

    - The GetEnumerator method is virtual, so the call cannot be inlined.

    - The return value is an IEnumerator interface instead of an exact type; as a result, the exact enumerator cannot be known at compile time.

    - The MoveNext method and the Current property are again virtual and so cannot be inlined.

    - IEnumerator.Current has a return type of System.Object rather than a more specific data type, which may require boxing and unboxing depending on the data types stored in the collection.

    As a result of these factors, there is both managed heap and virtual function overhead associated with foreach on simple collection types. This can be a significant factor in performance-sensitive regions of your application.

    For information about how to minimize the overhead, see "Consider Enumerating Overhead"

    The relevant section is:

    http://msdn.microsoft.com/library/en-us/dnpag/html/scalenetchapt05.asp?frame=true#scalenetchapt05_topic25

    These problems were all fixed in List<T> — sadly we’re stuck with it in ArrayList because we’d have to break existing applications to fix it.

  7. Anonymous says:

    Weblogs.asp.net had a cool link to Rico Mariani’s weblog that has a lot of good posts about tuning the most out of .NET applications. Some of the tips are pretty generic and can be applied to other programming languages and platforms as well. Specif

  8. Joe says:

    Isn’t foreach on an array (not an ArrayList, a real array) optimized already by the compiler? In fact it should be faster than using indexes because there won’t be a bounds check (which can happen with anything but the simplest index-based iteration).

  9. Rico Mariani says:

    Foreach on regular arrays gives you the best code. "For" on arrays often has no bounds check either if it’s coded in one of the obvious safe ways.

  10. Dmitriy Zaslavskiy says:

    Rico, could you explain why LinkedListNode stores a back reference to its LinkedList?

    I am not asking why in terms of implementation, but rather in terms of library design.

    Why pay an extra pointer for each element?

    In fact I would argue that a lot of the time people know that the objects they want to keep will go into this or that collection. Therefore they might be willing to derive from LinkedListNode to save the extra 12 bytes (+ allocations). Or maybe just to implement an interface ILinkedNode?

  11. Rico Mariani says:

    I’d like to remind folks of an important truth…

    I only have two rules

    #1 Measure

    #2 Do your homework

    "Don’t use foreach" isn’t in the list.

    Better advice would be "foreach is good for everything but ArrayList" but even that isn’t quite right…

    #1 and #2 imply an important corollary: "Don’t use pithy advice from so-called experts (such as me) instead of doing your own measurements"

  12. Ovidiu says:

    As usual, a great post :) As for the "pretend that following a pointer is a very costly operation" part, I think the best picture one can get is by the following comparison:

    <<Now 2GHz is a difficult thing to imagine for a human. Put simply, that is 2 billion (Dr Evil pose) instructions per second at maximum throughput. So let’s put this on our terms. Let’s say one processor “clock cycle” is not 1/2,000,000,000 of a second but rather 1 second. On that scale, accessing the nearby L1 on-chip cache takes 6 seconds, the off-chip (L2) cache 2-3 minutes, and accessing main memory takes 3-4 weeks. Accessing the disk (just one disk access) by comparison takes a whopping 1 year on this timescale.>>

    (From http://weblogs.asp.net/simonme/archive/2004/05/31/145024.aspx)