Performance Quiz #9 : IList<T> List and array speed — solution

Last week I posted Performance Quiz #9 for discussion. Well the comments have been just awesome and many people went right ahead and measured the results.  Some of you have discovered the key point in this particular test:

Arrays are magic.

In the test case as given there’s a great deal of repeated work extracting the length of the array and accessing the items.  This is because of the unusual property that arrays have — they implement IList<T> for potetially more than one T even due to inheritance.  In the interest of economy alone then it is worthwhile to consolidate the IList<T> implementations into some kind of secret helper but this has some consequences.

Luckily there are also alternatives — the example that I provided is perhaps the worst performing choice.  As one reader pointed out you could make it much better by extracting the count from the IList once and then using that in the loop.  Note that the compiler cannot do this because IList<T>.Count is a virtual method and so we cannot assume anything about the implementation (it could sent a letter to grandma every time it is called and grammy wants her email!)

Here are the results from my program (source code attached, see link at the very bottom)

Test Case Milliseconds
Test1: Array 54
Test2: List<> 8
Test3: ArrayWrapper 14
Test4: Array via foreach 9
Test5: List<> via foreach            11
Test6: Array via special 6
Test7: List<> via special 8

The first two rows are the problem as given and we can easily see that the array is performing much more slowly than the List.  Let’s see where the cost is:

Excl% Incl% Name
0.00 51.27 ArrayTest.Program.Test1(uint16[])
1.64 51.27   ArrayTest.Program.Test.Sum(System.Collections.Generic.IList`1)
2.38 25.64     System.SZArrayHelper.get_Item(int32)
5.47 23.26         JIT_IsInstanceOfArray(…)
4.72 7.26           ArrayIsInstanceOfNoGC(…)
2.63 10.02           SigTypeContext::InitTypeContext(…)
0.49 22.40     System.SZArrayHelper.get_Count()
5.36 21.91         JIT_IsInstanceOfArray(…)
4.30 6.99           ArrayIsInstanceOfNoGC(…)
2.40 9.07           SigTypeContext::InitTypeContext(…)

Well it pretty much leaps off the page, the cost is in those helpers and you can see they’re doing a good bit of internal work to verify that we are talking about a simple array of a non-collectable type (ushort) and then more work happens to get to the real data.  The tragedy here is that the work done to validate what we need to do far exceeds the actual job at hand — which in both cases is just extracting one integer from a well known location.  The price of abstraction….  Though I’m pretty sure there is room for improvement in there.

Let’s look at some of the other alternatives.  First how did the List<T> fare:

Excl% Incl% Name
0.00 8.62 ArrayTest.Program.Test2(System.Collections.Generic.List`1)
1.06 8.62     ArrayTest.Program.Test.Sum(System.Collections.Generic.IList`1)
5.82 5.82         CLRStubOrUnknownAddress
1.54 1.54         System.Collections.Generic.List`1.get_Item(int32)

In the list case there’s still a good bit of dispatch code but you can see it’s much cheaper.  We’re on the sweet path for regular interface dispatch.  The actual count function calls are stil there if we were to look at lower percentages but I pruned anything with inclusive time below 1% in this example so it’s not showing up.  We’re doing a lot better.

Now what about some of the other results?  Well, now we’re in the bonus marks zone and I don’t want to drown you in perf results you can get yourself so I’ll summarize a bit.

In Test 3 I made a generic wrapper class to hold the array.  This adds a level of indirection but it avoids all the magic array processing.  My wrapper is highly lame (it only implements count and get item) but you can see that it’s doing fairly well performance wise.  Much better than the raw array and it’s a cheap object.

But the fun continues:  What if we replace that for loop in the original test with a foreach?  We can’t get the ultra-special array foreach construct but we can get a nice standard foreach over an IList in both cases.  Both Test4 and Test5 do introduce one allocation for the enumerator (which could be an issue if there were gazillions of these lists) but for our test case that’s not even measurable (indeed I get no samples in the allocator).

And the result?  It’s pretty sweet:  The array problem goes away entirely because we only have to do the array testing one time.  After that we’ve got a nice bound up enumerator that knows all the details of the array — no more redundant computations.   The array works great.. but the list is a bit slower?  Why would that be?  Well there’s some extra safety checks in the enumerator access code for List<T> to make sure the enumerator isn’t being used on an List that is being modified for instance.  Those checks slow down the list access a bit.

And the last two tests:  What if we special case the array in the sum function; assuming that was a very common/important case.  You can see that test6 gives the best result of all — now there isn’t any enumerator and we’re on the most optimized code path of any — foreach over a plain array.

But why is the List case faster in test7?  Shouldn’t it be pretty much the same as test5?  After all there’s an extra test and otherwise it’s the same.  Now there’s a mystery…  I actually ran a variety of control experiments, doing the tests in different order and with different array members to make sure this wasn’t an anomolous result — you can imagine with all these iterations cache effects would be important.  But no matter what I did the code in test7 always went faster than the code in test5.  Adding more iterations actually magnifies the effect.

The profile isn’t especially helpful either — both Test5 and Test7 have exactly the same call shape but Test7 spends less time in CLR stubs apparently.  At the moment my best theory is that we happen to get better cache alignment in Test7 than Test5 — less instructions split across cache lines or fewer unfortunate cache evictions because of happenstance.

Such is the life of the performance engineer — we sometimes get bit by a secondary effect.

So shall we continue?  What is happening to make Test7 faster than Test5?

And this seems like a great place to thank my colleague Vance Morrison for chiming in with the foreach and special case approaches when I first was bouncing this analysis around.  Thanks a ton Vance 🙂


Comments (8)

  1. kaalikas says:

    i think the differnce is caused by the boxing of the List<T>.Enumerator (which is a struct) by IList<T>.GetEnumerator() which returns an interface

    foreach over an List<T> can be potentially inlined and otherwise optimized while foreach over IList<T> will have to use interface calls

  2. A short and sweet quiz with lots of juicy discussion possibilities:


  3. Jeff says:

    This is completely off topic but what do you use to format the code you post Rico?

  4. Ivan Stoev says:

    Well, obviously the difference is caused by different interface call stubs (as explained by Vance Morrison) in Test5 and Test7 since foreach loop calls (IEnumerator<ushort> MoveNext and Current) in SumSpecial are only on List<ushort> instance while same calls in SumForeach are on both ushort[] (in Test4) and List<ushort> (in Test5). If we comment Test4 call, they would have similar speed (Test5 would be a little bit faster as expected).

    Very interesting, thanks Rico and Vance!

  5. Well you’ll recall that in the Performance Quiz #9 solution&amp;nbsp;there was a surprising result where…

  6. Sherif says:

    I ran the test and I got similar results, but I made a small modification (see below), and I really can’t explain the results. I just ran the 7 tests twice. Test1 was dramatically slower in the second time.

    for (int i = 0; i < 2; i++)










    Below are the results on my Desktop PC (Windows Server 2003, P4 2.4GHz HT, 1GB RAM, 512KB Cash)

    Elapsed Milliseconds Array: 98

    Elapsed Milliseconds List<>: 19

    Elapsed Milliseconds ArrayWrapper: 18

    Elapsed Milliseconds Array via foreach: 16

    Elapsed Milliseconds List<> via foreach: 24

    Elapsed Milliseconds Array via special: 2

    Elapsed Milliseconds List<> via special: 19

    Elapsed Milliseconds Array: 4616

    Elapsed Milliseconds List<>: 19

    Elapsed Milliseconds ArrayWrapper: 15

    Elapsed Milliseconds Array via foreach: 21

    Elapsed Milliseconds List<> via foreach: 23

    Elapsed Milliseconds Array via special: 2

    Elapsed Milliseconds List<> via special: 16

    Below are the results on my Laptop (Windows Server 2003, Pentium M 1.6GHz, 1GB RAM, 2MB Cache)

    Elapsed Milliseconds Array: 119

    Elapsed Milliseconds List<>: 23

    Elapsed Milliseconds ArrayWrapper: 20

    Elapsed Milliseconds Array via foreach: 17

    Elapsed Milliseconds List<> via foreach: 31

    Elapsed Milliseconds Array via special: 5

    Elapsed Milliseconds List<> via special: 22

    Elapsed Milliseconds Array: 3046

    Elapsed Milliseconds List<>: 22

    Elapsed Milliseconds ArrayWrapper: 19

    Elapsed Milliseconds Array via foreach: 27

    Elapsed Milliseconds List<> via foreach: 30

    Elapsed Milliseconds Array via special: 4

    Elapsed Milliseconds List<> via special: 23

  7. Well you’ll recall that in the Performance Quiz #9 solution there was a surprising result where Test7