Performance Quiz #7 — Generics Improvements and Costs — Solution


Believe it or not, I actually spend a good amount of time thinking about my little quizzes, hoping to come up with a small piece of code that illustrates a bunch of different things in a simple enough way that many people feel they can jump right in and comment.  And there should also be a nice curveball in there somewhere.  Performance Quiz #7 is no different.  :)


Now before I jump into the analysis I want to remind you that this is just a micro-benchmark.  The job of these things is to magnify one or more phenomena so that it is easily visible for tracking, analysis, or teaching.  We can’t really conclude a whole lot about best practices overall from a micro-benchmark.  But with that in mind, let’s have a look at what’s going on.


First let me answer the questions directly:


Q1: Which is faster?


The Generics version is considerably faster, see below.


Q2: How much faster?


On my development machine (it’s running a recent post-Beta2 CLR build) I get these numbers:


Generic foreach time: 2.42414s 
Arraylist foreach time: 6.04955s 


So the generic version is about 2.5 times faster on this machine.
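For readers who didn’t see the quiz, the benchmark loops look roughly like this.  This is a sketch, not the quiz code verbatim: the method and parameter names are mine, and the counts (a small list iterated many times) are taken from the discussion below.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

static class Quiz
{
    // foreach over List<int>: the enumerator is a struct, so no allocation per pass
    public static long GenericTest(List<int> list, int iterations)
    {
        long sum = 0;
        for (int i = 0; i < iterations; i++)
            foreach (int x in list)
                sum += x;
        return sum;
    }

    // foreach over ArrayList: allocates an enumerator object each pass
    // and unboxes every element it reads
    public static long ArrayListTest(ArrayList list, int iterations)
    {
        long sum = 0;
        for (int i = 0; i < iterations; i++)
            foreach (int x in list)
                sum += x;
        return sum;
    }
}
```

Both loops do the same logical work; the difference measured below is entirely in how the enumeration pattern is implemented.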


Q3: Why is it faster?


I think the data is better at showing why the ArrayList is slower, but it’s maybe just as important to discuss the things that are not why it is slower.  Let’s look at this table of inclusive times.  Here I’m showing only functions that account for 1.5% of the time or more, a cutoff I picked somewhat arbitrarily because it seemed to remove enough of the low-level stuff while leaving the bits that were useful for analysis.  This is a single run of the executable that does both tests, so you can see them compared to each other in context (you could do the same looking at two runs side-by-side but I found this easier).


  Exclusive    Inclusive
    Percent      Percent    Function Name (some names removed to make it easier to read)
         –     99.64 __CorExeMain
         –     99.63    …
         –     99.21      ExecuteEXE(struct HINSTANCE__ *)
         –     99.21        SystemDomain::ExecuteMainMethod(struct HINSTANCE__ *,unsigned short *)
         –       1.60          SystemDomain::InitializeDefaultDomain(int)
         –     97.59          Assembly::ExecuteMainMethod(class PtrArray * *)
         –     97.59            ClassLoader::RunMain(…)
         –     97.59              …
        –     97.59                …
         –     97.59                  …
         –     97.59                    …
         –     97.59                      …
         –     97.56                        NS.Test.Main(string[])
      4.93    26.72 ***                          NS.Test.GenericTest()
    18.86    18.86                            System.Collections.Generic.List`1.Enumerator.MoveNext()
      1.78      2.85                            System.Collections.Generic.List`1.GetEnumerator()
      8.99    69.23 ***                          NS.Test.ArrayListTest()
    11.23    11.26                            Virtual Dispatch Stub (unusual nesting due to tail jmp)
      9.39    23.29                            System.Collections.ArrayList.ArrayListEnumeratorSimple.get_Current()
    13.90    13.90                              Virtual Dispatch Stub (unusual nesting due to tail jmp)
    17.06    20.00                            System.Collections.ArrayList.ArrayListEnumeratorSimple.MoveNext()
      2.11      2.11                              An indirection buffer in mscorwks that I don’t understand :)
      0.52      4.56                            System.Collections.ArrayList.GetEnumerator()
      0.94      1.64                              System.Collections.ArrayList.ArrayListEnumeratorSimple..ctor(…)
      0.10      1.87                              JIT_NewFast(…)
      0.03      1.64                                FastAllocateObject(class MethodTable *)
      0.04      1.61                                  Alloc(unsigned int,int,int)
      0.05      1.54                                    WKS::GCHeap::Alloc(…)

OK, looking at this we can see some interesting stuff going on.  First off you can see that the generic solution is only 26.72% of the run time while the ArrayList version is 69.23% of the run time.  Together that’s 95.95% of the time; the rest we can attribute to the startup process and some taxes, notably 1.6% initializing the one and only appdomain.  If we divide those relative fractions we’ll again see that the generic version is about 2.5 times faster.  So far so good (when the wall clock time is similar to the profiler time I’m a happy guy).


OK but now why?


Well let’s eliminate some things.


Is it boxing?  No.  It should be no surprise that boxing isn’t the dominant factor here.  After all, the ArrayList contains only 20 elements, so we boxed exactly 20 times; compared to the 200,000,000 read operations we did, the boxing is just noise.


Now you say: “OK Rico, you’re splitting hairs, I didn’t mean boxing, I meant unboxing of course.”  Well, there’s plenty of that in this example, but remember unboxing an integer is basically a test to see if the boxed object is in fact an integer type and then copying out the integer.  That’s not something to be terribly afraid of; there are no heap allocations for that.  Is unboxing the key factor here?  I’d say not.
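To make the asymmetry concrete, here is a small sketch of what a box and an unbox actually do in C#.  The variable names are mine:

```csharp
using System;

object boxed = 42;        // box: one heap allocation, copies the int into an object
int value = (int)boxed;   // unbox: a runtime type test plus a copy, no allocation

// A wrong-type unbox fails the type test and throws; it never silently converts:
object boxedLong = 42L;
bool threw = false;
try { int bad = (int)boxedLong; }
catch (InvalidCastException) { threw = true; }

Console.WriteLine(value);  // 42
Console.WriteLine(threw);  // True
```

So the box is the allocation; the unbox is just a check and a copy, which is why doing it 200,000,000 times still doesn’t dominate.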


What about all those allocations?  Many people noticed that the ArrayList version is allocating enumerators like crazy, so that must be the problem.  10,000,000 enumerator allocations can’t be good, right?  Surely that’s the dominant cost?  Well, it turns out it isn’t.  JIT_NewFast tells the tale: only 1.87% of the run time was spent in allocations.  Since this program is single threaded, any collections that happened would be triggered by that call, so the collections are part of that 1.87%.  In fact, if you look at the performance counters while this program is running you’ll see that the average time in the garbage collector is only about one half of one percent (0.5%), so most of that time is spent doing the allocations, not collecting.


Why is that?  Well, when the collector runs on this program it finds that exactly one object in generation zero is reachable (the now-current enumerator).  There are so few roots that it is triviality itself to walk them all (there are probably about ten of them total), and since dead objects don’t have to be visited, all we have to do to collect is move the one live enumerator to the start of generation zero and we’re done.  So we move something like 16 bytes on every collection.  That’s pretty optimal!


So what *is* killing us?


Well, it’s a combination of things, but the short answer is that there are many virtual function calls in the ArrayList version and List<T> doesn’t have those.  When we redid our collection classes for generics in Whidbey we drastically simplified the enumeration pattern.  It doesn’t need virtual dispatches (or allocations) now and it’s much more friendly to the inliner.  The main savings come from a streamlined execution path: it’s simply fewer instructions.  So the main factor is this: the generics way is less code.
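The pattern can be sketched with a toy collection.  This is not the actual BCL source, just a minimal type written in the same style List&lt;T&gt; uses: GetEnumerator returns a struct whose MoveNext and Current are non-virtual, so foreach (which is pattern-based in C#, not tied to IEnumerable) binds to them directly, the calls can be inlined, and no enumerator object is ever allocated.

```csharp
using System;

public class MyList<T>
{
    private readonly T[] items;
    public MyList(params T[] items) { this.items = items; }

    // Returns the struct directly, not IEnumerator<T>: no box, no virtual dispatch
    public Enumerator GetEnumerator() { return new Enumerator(items); }

    public struct Enumerator
    {
        private readonly T[] items;
        private int index;
        internal Enumerator(T[] items) { this.items = items; index = -1; }
        public bool MoveNext() { return ++index < items.Length; }
        public T Current { get { return items[index]; } }
    }
}
```

A `foreach (var x in new MyList<int>(1, 2, 3))` loop over this type compiles straight to the struct’s methods; the old ArrayList pattern instead goes through an interface, which is where the dispatch stubs in the profile come from.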


Note: The C# foreach construct did the right thing in both cases, but in the generics case the implementation of the enumeration pattern is simply better, so when C# does the right thing the resulting code is better.


Now don’t use this as an argument to avoid foreach (some people like to do that).  The fact is this is a micro-benchmark, and if we were doing any real work in that loop the overhead would be a considerably smaller fraction of the total execution.  Really I only caution against using foreach on ArrayLists, and then only if the iteration is very common.  I forced that situation here with 10,000,000 loops over a small list.
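If you do have an ArrayList in a hot path, plain indexing sidesteps the enumerator entirely.  A sketch (the helper name is mine), showing the tradeoff: no allocation per pass, but each read is still a virtual indexer call plus an unbox.

```csharp
using System;
using System.Collections;

static class ArrayListLoops
{
    public static int Sum(ArrayList list)
    {
        int sum = 0;
        // No enumerator allocation; list[i] is still a virtual call and the
        // cast still unboxes each element
        for (int i = 0; i < list.Count; i++)
            sum += (int)list[i];
        return sum;
    }
}
```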


Why didn’t we fix ArrayList while we were at it?  Ugh… we can’t.  If we “fixed it” we would break existing applications.  Because the class isn’t sealed it’s not possible for us to change the virtualness of the key functions involved so we have to live with that mistake.  But I think you’ll find that generally List<Object> offers better performance than ArrayList across the board (but measure!).


Are those allocations a problem at all?  Well, yes they are, but mainly they induce a processor tax: touching all that memory causes more cache misses.  The cost manifests itself as a higher average cycles-per-instruction for the same code, but I wouldn’t say that’s the dominant factor.  The cost associated with the virtual dispatch stubs really showcases the true source of the problem here.


Sadly our profiler didn’t automatically recognize the virtual dispatch stubs… I annotated those manually.  Oh well, there’s always room for improvement.


Q4: What price do we pay for this speed?  Quantify it if you can.


Well in this case the cost is very low because List<int> was built into mscorlib anyway.  But suppose you were using a generic list in your own system, how much would that cost?


To measure this I built a version of a test application that created 200 different List<T> instances over 200 different enum types.  Then I did the same with 400 instances.  I took the working set of those two runs and subtracted to get the overhead of adding 200 List<T> to my test harness.  I measured working set with a simple vadump command.  The key bits of code look like this:


  enum e1 { a };
  static List<e1> l1 = new List<e1>();
  enum e2 { a };
  static List<e2> l2 = new List<e2>();
  …



The difference for 200 instances was 1676k, of which 28k were private pages.  So dividing that by 200 we get 8581 bytes for a List<T> instance (and the enum type; I didn’t separate out that cost, although that could be done too), of which 143 are private bytes.
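The per-instance arithmetic behind those numbers is just this:

```csharp
using System;

long totalBytes = 1676L * 1024;   // working-set delta for 200 instances
long privateBytes = 28L * 1024;   // private pages within that delta

Console.WriteLine(totalBytes / 200);    // 8581 bytes per List<T> + enum pair
Console.WriteLine(privateBytes / 200);  // 143 private bytes of those
```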


To pay for this overhead in terms of saved boxes, figure an average of about 16 bytes saved per boxed item (this accounts for internal pointers in the boxed object, some allocation overhead, and the pointer to the box; we could haggle over whether the overhead is more like 12 bytes or more like 16 bytes but it’s my article *grin*).  So 536 instances stored in the collection would cover that; let’s round it to 500 because that’s easier to remember.


So if you are going to store more than about 500 items at once, you will likely come out ahead space-wise if you use a generic; if not, you can still get better performance by using List<Object> instead of ArrayList in most cases.


Of course you’ll still have to measure your own scenarios. :)

Comments (8)

  1. Brien says:

    It’s disappointing that when discussing performance of the CLR the answer always seems to be don’t use virtual functions (and interfaces?).

    It would be nice if high performance and polymorphism were not mutually exclusive.

    Especially since the Java guys have seemed to figure out how to make this work.

  2. Jeffrey Sax says:

    I guess I considerably underestimated the impact of the virtual function calls.

    To distinguish this effect from the memory usage, I tested the ArrayList option again, but this time took the enumerator construction out of the loop, and did the enumerating manually using MoveNext and Current.

    The result: this code is roughly 10% faster than the foreach version. Relative to the generic version, the allocations cause a slow-down of roughly 30%, much more than the 5% or so from the raw allocations.

    In other words, about 1/8 of the difference is caused by the allocations and their side effects – significant, but not dominant.

    The rest is due to virtual call overhead, divided roughly 55/45 between dispatch stubs and Current not being inlined.

  3. ricom says:

    >>It’s disappointing that when discussing performance of the CLR the answer always seems to be don’t use virtual functions (and interfaces?).

    Well that’s a bit of a leap. I am the first to say don’t use features you don’t need — that’s just being smart but don’t use virtuals? No I wouldn’t go there.

    All we can really conclude here is that the implementation choices in the original ArrayList were pretty unfortunate for raw performance and the advice we now give for how to implement enumerators (which we followed in the List<T> class) is helping a great deal.

I wouldn’t conclude anything broad from this micro-benchmark. Comparisons between platforms have been done and notoriously give mixed results.

  4. Michael says:

    Great work, Rico!

    Just another reminder that I’ll be studying every day for the rest of my life in an attempt to learn a fraction of what I think I need to know (and that’s not including what I *want* to know… sigh… ).

  5. What tool did you use to obtain the trace you presented in answer to Q3?

  6. Alex says:

    Here it comes :)

    foreach loop:

    Generic List: 1.80703441359167 sec

    ArrayList: 6.64174133863382 sec

    for loop:

    Generic List: 0.603019657526306 sec

    ArrayList: 2.61797574831438 sec

    While List<> enumerator is better than ArrayList it’s still non-optimal for enumerating a collection.

  7. Alex says:

    > What about all those allocations, many people noticed that the arraylist version is allocating enumerators like crazy. That must be the problem. 10,000,000 enumerator allocations can’t be good right? Surely that’s the dominant cost? Well, it turns out it isn’t. This guy here JIT_NewFast tells the tale, only 1.87% of the run time was spent in allocations. Now since this guy is single threaded, if any collections happened they would be triggered by that call so the collections are part of that 1.87%. In fact if you look at the performance counters while this program is running you’ll see that the average time in the garbage collector is only really about one half of one percent (0.5%) so most of that time is just doing the allocations and not collecting.

    This argument is applicable only to microbenchmarks. In real life application you have allocations of different kinds of objects not only enumerators. The huge number of allocated enumerators will increase the pressure on the GC. And GC will spend considerably more time doing collections, and collections will happen more frequently.

    The worst thing is that because enumerators are temporary objects the cost of their allocation and collection is very difficult to find in a real life application. You won’t see them in the CLR Profiler, because they are short-lived. Yet they contribute to the time spent in GC.

  8. ricom says:

    >>While List<> enumerator is better than ArrayList it’s still non-optimal for enumerating a collection.

    I actually measured times that were very similar to the above, I was planning to write something about that when I got back from vacation. I think maybe I still will.

However I still urge caution in interpreting the results of micro-benchmarks in any kind of broad way. This particular one was designed to magnify enumeration overheads, which is why I wouldn’t find it fair to conclude something like “foreach is so-and-so % slower than for.”

    If you were to dig deeper in this you’d find that the reasons it being slower break down into two categories:

1) It has safety checks to help you if the collection is modified out from under you (generally a good tradeoff)

    2) It doesn’t get fully inlined (partly due to the above)

    With ArrayList the problem, as I’ve mentioned, is that the implementation choice was not especially good.

In neither case is there any kind of "foreach" issue; it’s a question of how the enumerator is implemented and what choices were made there. You could even write your own arraylist wrapper value type (no storage overhead) that makes a different choice and get different behavior. You could have the for behavior if you want it.

Interestingly, if you do the same experiment for arrays you’ll find (well, I found) that the foreach pattern is FASTER with /o+. Why? The C# compiler generates better IL for that pattern and the jit does a better job stripping bounds checks. I tried to rewrite my for loop in such a way as to get rid of every bit of overhead but I couldn’t figure out the secret sauce. With foreach I got it without even trying.

This data doesn’t support a general conclusion like "foreach is always worse on collections." Sometimes foreach is the only choice as far as that goes. High performance collections can and do make different enumeration design choices that can give great performance. Wrapper classes on our generics can easily make different choices — ones you like better.

    That said it’s important to know what choice was made in our own basic collection types and it doesn’t happen to be the very fastest one. Although it is much better. Remember the enumeration overhead is very much magnified in this example — by design.

    >>In real life application you have allocations of different kinds of objects not only enumerators. The huge number of allocated enumerators will increase the pressure on the GC.

As you indicated (in the context of the quote above), my comments are specific to this benchmark, but of course I’m the one who’s cautioning not to interpret the results of this analysis too broadly. It is interesting to note that the raw cost of the allocations isn’t really the problem, because if it was you would see it even in this benchmark.

The true cost is just as you indicated: if there were other operations you would have side-effects on them (to some extent; you would have to measure to see just how much impact). You would affect the overall locality and age the objects faster than they otherwise would, potentially increasing the overall cost of collection.

    While allocations are cheap, not allocating at all is still cheaper :)

    Locality issues can be, as you say, difficult to find because the tax can be spread about your code. However you would see the enumerators in this case in the CLRProfiler, even though they are short lived. I often go fishing for locality issues by looking at overall allocation volume using CLRProfiler. My LogDump tool (recently posted) is also helpful for this because sometimes the logs get very large.

    Good comments, thank you!