Performance Quiz #7 — Generics Improvements and Costs


Time for another quiz for you all. This is just a micro-benchmark, so we want to be careful not to conclude too much, but it's useful for understanding the costs and where they come from. Consider the two snippets shown below:


            // choice #1, using Generics


            int total = 0;


            List<int> li = new List<int>();


            for (int i = 0; i < 20; i++)
                li.Add(i);


            // start timer


            for (int i = 0; i < 10000000; i++)
                foreach (int el in li)
                    total += el;


            // end timer



            // choice #2, using ArrayList


            ArrayList al = new ArrayList();


            for (int i = 0; i < 20; i++)
                al.Add(i);


            // start timer


            for (int i = 0; i < 10000000; i++)
                foreach (int el in al)
                    total += el;


            // end timer
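
The `// start timer` / `// end timer` markers can be wired up with `System.Diagnostics.Stopwatch`. A minimal sketch of a harness for the first snippet (the variable names and the console output are illustrative, not from the original post; a real benchmark would also warm up the JIT before measuring):

```csharp
// Minimal timing harness for snippet #1 (illustrative only).
using System;
using System.Collections.Generic;
using System.Diagnostics;

class Harness
{
    static void Main()
    {
        int total = 0;
        List<int> li = new List<int>();
        for (int i = 0; i < 20; i++)
            li.Add(i);

        Stopwatch sw = Stopwatch.StartNew();   // start timer
        for (int i = 0; i < 10000000; i++)
            foreach (int el in li)
                total += el;
        sw.Stop();                             // end timer

        Console.WriteLine("elapsed: {0} ms, total = {1}",
                          sw.ElapsedMilliseconds, total);
    }
}
```

The same harness works for the ArrayList version by swapping in snippet #2 between the `StartNew` and `Stop` calls.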


Now try to answer these questions (no peeking at the comments until you’ve done your work :))


Q1: Which is faster?


Q2: How much faster?


Q3: Why is it faster?


Q4: What price do we pay for this speed?  Quantify it if you can.


Answers and discussion soon 🙂

Comments (27)

  1. tzagotta says:

    Hi, I’m new to C# and .NET, but let me take a stab…

    The List<int> version is faster because there is no boxing of the ints as there is with ArrayList.

    Price we pay is that List<int> is generated at runtime and takes some memory for the code. But this also speeds up List<int> since the generated code is specialized for accessing an array of ints.
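
tzagotta's point about boxing can be seen directly in a tiny sketch: `ArrayList` stores `object` references, so each `Add` of an `int` boxes it onto the heap and each read unboxes it, while `List<int>` stores the ints inline in its underlying array (hypothetical demo program, not from the post):

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

class BoxingDemo
{
    static void Main()
    {
        ArrayList al = new ArrayList();
        al.Add(42);                  // the int is boxed into a new heap object
        int fromAl = (int)al[0];     // unbox: null-check + type-check + copy

        List<int> li = new List<int>();
        li.Add(42);                  // stored directly in the underlying int[]
        int fromLi = li[0];          // plain array read, no cast needed

        Console.WriteLine(fromAl + fromLi);  // prints 84
    }
}
```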

  2. Cory says:

    1) #1

    2) too lazy to test. will take a guess: an order of magnitude?

    3) not having to box is good

    4) memory usage will be higher and the JIT will take longer if you instantiate a generic class with many different types.

    right?

  3. Rico just posted a new performance quiz where he compares performance of List&lt;T&gt; and ArrayList…

  4. Wesner Moise says:

    It appears that you are comparing a number of things:

    1) IEnumerable<int> struct implementation of List<int> versus IEnumerable class implementation of ArrayList. The first is a direct and fast pattern invocation while the second makes slower interface calls.

    2) Allocated objects versus unallocated objects. The generic version allocates no additional memory in the loop while the second allocates a new enumerator object one million times.

    3) Int copying of List<int> versus Int unboxing and typechecking of ArrayList. Int unboxing is not as costly as boxing, but there is still the null-check and the type-check that needs to be performed.

    There may also be additional virtual calls in the ArrayList version not present in the generic List version–such as calls to GetEnumerator.

    I thought that this might be a trick question… especially since object references are just as efficient for computers to manipulate as integers are, but the disadvantages that I pointed out earlier lead me to believe the List<int> implementation performs faster.
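
Wesner's first two points come down to what the two `foreach` loops compile into. A hand-expanded sketch of roughly what the compiler emits (illustrative, not exact compiler output):

```csharp
// Roughly what the compiler generates for the two foreach loops.
using System;
using System.Collections;
using System.Collections.Generic;

class ForeachExpansion
{
    static void Main()
    {
        List<int> li = new List<int>();
        ArrayList al = new ArrayList();
        for (int i = 0; i < 20; i++) { li.Add(i); al.Add(i); }
        int total = 0;

        // List<int>: GetEnumerator() returns the struct
        // List<int>.Enumerator, so no heap allocation occurs and the
        // MoveNext/Current calls are non-virtual (and inlinable).
        List<int>.Enumerator e1 = li.GetEnumerator();
        while (e1.MoveNext())
            total += e1.Current;

        // ArrayList: GetEnumerator() returns an IEnumerator, so each
        // foreach allocates an enumerator object, MoveNext/Current are
        // interface calls, and Current hands back a boxed object that
        // must be unboxed with a cast.
        IEnumerator e2 = al.GetEnumerator();
        while (e2.MoveNext())
            total += (int)e2.Current;

        Console.WriteLine(total);  // 190 + 190 = 380
    }
}
```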

  5. Greg Merideth says:

    Well, at least on my machine the generics implementation performed a little over 105% faster than an array list implementation.

  6. ricom says:

    Wesner Moise seems to think that I have many things on my mind with this example. Could it be true? 🙂

    Kudos for spotting some of the hidden details!

  7. C# performance quiz time.


  9. The Quiz:


    Recently Ricom posted a performance quiz on his blog asking about the performance…

  10. ianhu says:

    Hey Rico,

    As a dev on the F1 Profiler team, I figured that I would take a whack at using our tool to profile this performance quiz. I posted my findings to my blog in an entry here. I ended up finding some of the issues that have already been mentioned above.

    http://blogs.msdn.com/ianhu/archive/2005/08/25/456531.aspx

    I also linked it in my comment URL.

  11. ricom says:

    Ian’s chimed in with a very detailed post (thank you Ian). Lots of good numbers there, but have we come to the right conclusion? Do we know why this is happening?


    Is it really all those allocations? And if so, why? Is foreach inherently bad? Will I ever make a quiz with a straightforward answer? 🙂 :)

  12. Jeffrey Sax says:

    Q1) The generic version is faster.

    Q2) I’d say #2 takes about 50% longer.

    Q3) Unboxing imposes a small cost, but it is large compared to the work being done with the result. Boxing is not an issue.

    Q4) Larger start-up cost: instantiating the generic type, possibly JITting (vs. NGEN’ed base classes?) Some memory overhead.

    The extra start-up cost is large (by orders of magnitude) compared to the gain for a single inner loop, but still only of the order of milliseconds.

    On reading the other comments: creating a new IEnumerator *10* million times will have a significant impact as well… probably the biggest impact. Add another 50%. (Good find, Wes!)

    Hidden allocations inside loops are a major source of performance degradation. Some languages do a really good job of hiding them. :o)

    I don’t think there are virtual calls on a struct implementing an interface. All Current and MoveNext() calls on the ArrayList will be virtual, however. (Add another few %)

    Both versions would still benefit from using a simple for loop and indexing. For #1 and #2, this removes the bookkeeping overhead of the enumerator. For #2, this removes allocations.
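
Jeffrey's indexed variant would look something like this sketch (iteration counts shortened for clarity; the ArrayList version still pays for the unboxing cast, but no longer allocates an enumerator):

```csharp
// Indexed versions of the two inner loops: no enumerator is created.
using System;
using System.Collections;
using System.Collections.Generic;

class IndexedLoops
{
    static void Main()
    {
        List<int> li = new List<int>();
        ArrayList al = new ArrayList();
        for (int i = 0; i < 20; i++) { li.Add(i); al.Add(i); }
        int total = 0;

        for (int j = 0; j < li.Count; j++)
            total += li[j];          // List<int>: direct int read

        for (int j = 0; j < al.Count; j++)
            total += (int)al[j];     // ArrayList: still unboxes each element

        Console.WriteLine(total);    // prints 380
    }
}
```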

  13. tzagotta says:

    Hidden allocations and memory leaks in loops in languages like C++ are a real problem. I wouldn’t want my app to allocate and leak 280MB while it runs.

    But in .NET, is this really an issue? Is the 280MB really all allocated and "held," or was the same small memory block just reused a million times? Or was the memory allocated and set aside for the GC to come along later?

  14. Wesner Moise says:

    One other comment: List<int>.Enumerator may be instantiated at the beginning of the foreach loop in which case there would be additional start-up cost for the JIT compilation to occur at the beginning of the test.

    In this case, the performance of the generic list will probably be significantly improved if we have an additional foreach loop before performing the timing tests, thereby moving the instantiation costs out of the tests.

    I doubt that this occurs, because I believe that the mscorlib already contains an instantiation of List<int>.

  15. stefang says:

    I suspect that all the samples in CLRStubOrUnknownAddress really indicate time when execution is suspended because the garbage collector is running.

    If this is the case, the profiler is not doing a very good job of indicating the amount of time spent in GC

  16. Presumably all the code for ArrayList is ngened, so if you’re not the only process on the machine using ArrayList, there’s no impact on your private memory usage for using that code. (Although I think you’ll still be paying a per-process hit for having used ArrayList type at all – ngen just means you get to share the code. If I’ve understood correctly, you don’t get to share the data structures the CLR creates for managing the type.)

    But what about List<T>? I have to admit that I don’t actually know whether NGEN has some mechanism for allowing the code generated for specific instances of a generic type to be shared. As I understand it, while generics can share one set of generated methods across all reference types, the code has to be generated anew for each value type you instantiate the generic for.

    So taking a wild stab in the dark, I’d guess that you’re allocating private bytes to hold JITed code specific to the List<int> instantiation of List<T>, a cost you won’t be paying with ArrayList. (And of course there may well be startup time issues here too – if other processes are using ArrayList, the ngened code in question will already be resident.)

    Of course it’s possible that the calls into the collection classes might be inlined, in which case the code-sharing-thanks-to-NGEN argument doesn’t hold. If I had to guess I’d say that the call to GetCurrent in the generic case will be, but MoveNext looks big and hairy enough in ILDASM that it probably won’t be inlined.

    Given where you’ve placed the "start timer" / "end timer" comments, some or possibly all of this discussion is completely irrelevant. Unless of course one of the points of this post is to illustrate how you can get misleading results if you start your stopwatch in the wrong place… You’re starting the stopwatch after creating your collection objects, so there are a whole load of costs you’re not even measuring!

  17. Michael says:

    Q1: List<>

    Q2: 3:1 advantage when I tested

    Q3: Boxing

    Q4: Not sure. I posted before reading the comments, so hopefully someone else answered this for me 😉

  18. ricom says:

    Stefang writes:

    >>I suspect that all the samples in CLRStubOrUnknownAddress really indicate time when execution is suspended because the garbage collector is running.

    >>If this is the case, the profiler is not doing a very good job of indicating the amount of time spent in GC

    That is not correct. I haven’t confirmed this yet but I’m 99% sure that the stubs in question in this case have to do with virtual dispatch over an interface. We use a sparse representation for these guys in Whidbey because of the space savings we get.

    Time in GC would be charged to the GC in mscorwks — even with no symbols at all it would not appear to be a stub as there is a module to charge the time to at the very least. With the publicly available symbols you’ll get a strong indication that it is the collector.

    But as we’ll see in the analysis it is not the collector.

  19. ricom says:

    Ian Griffiths writes:

    >>You’re starting the stopwatch after creating your collection objects, so there are a whole load of costs you’re not even measuring!

    I don’t think I could measure the difference even if I moved the stopwatch. We do twenty adds to the collection then we do two hundred million read operations. Those adds would have to be pretty hefty indeed to make a difference 🙂

  20. Jeffrey Sax says:

    Rico> it is not the collector.

    With only a few objects allocated, the GC wouldn’t have much work, even if it’s invoked hundreds of times per second.

    However, you’re running through a large chunk of memory a few hundred times. My money for the main culprit would be on loss of locality, with tons of cache misses and expensive RAM accesses as a result.

  21. Believe it or not I actually spend a good amount of time thinking about my little quizzes hoping to come…

  22. ricom says:

    >>My money for the main culprit would be on loss of locality, with tons of cache misses and expensive RAM accesses as a result

    That was a very good thought. When I started doing the quiz I expected the costs to be more evenly divided between the extra virtuals and the memory allocation. The measurement seems to indicate otherwise though.

  23. > We do twenty adds to the collection then we

    > do two hundred million read operations.

    > Those adds would have to be pretty hefty

    > indeed to make a difference 🙂

    First of all, it wasn’t the adds I was thinking of. I was thinking more about the price paid before you even get to start to run the method – all the stuff that’ll happen at JIT time.

    And to be fair you did ask "What price do we pay… ?" Where I was going with this is that there might be startup costs which are amortized away in this particular example, but which may still be relevant in a Windows Forms application where startup time is a big deal…

    There are very few Windows apps (managed or unmanaged) that I use on a regular basis where the load time is fast enough not to be annoying. So pleading that you’re iterating 20,000,000 times seems like cheating – the startup costs can still matter a great deal.

    So the kind of thing I had in mind was "How much extra stuff do I have to wait for the disk to load before the code executing here even starts." In the world I happen to occupy, 10-20 second application startup time is a problem that plagues me far more than burning too many CPU cycles once I’m up and running.

  24. Due to my upbringing in C/C++ somehow I feel uneasy whenever I see something like var a = 5;

    I guess this…

