Performance Quiz #5: The performance cost of the garbage collector : Solution


Solution and Discussion

On Wednesday I posted this leading question about the cost of using the garbage collector for a certain tree management algorithm and today I offer you my own analysis of this problem.  But before I get into the details I’d like to start by reminding you that, as always, this analysis is hardly complete. I’m going to try to generalize a little bit from this example to get some educational value, but please don’t take that to mean that I think what I’m going to say here is always true.  If anything I think this example reinforces the difficulty of truly understanding where your performance costs are.  And also by way of answer to some concerned readers who thought perhaps I’d had a wee bit too much fruitcake this Christmas — don’t worry I’m feeling much better now 🙂

To gather the numbers below I built the sources with csc /o+ and I added a few calls to QueryPerformanceCounter in strategic places.  I did only a very few repetitions so I expect there is a certain amount of noise in my results.  As a result I only report the numbers to within 1ms.  I was using v1.1 of the CLR for this particular test, though I expect the results would be similar with all versions.
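
In case you're wondering, the timing harness was nothing fancy.  Something like the little wrapper below does the job.  This is just a sketch for illustration (the PerfTimer class is not my actual harness code), but QueryPerformanceCounter and QueryPerformanceFrequency really are the Win32 entry points involved.

    using System;
    using System.Runtime.InteropServices;

    // Illustrative high-resolution timer; on later runtimes
    // System.Diagnostics.Stopwatch does the same job.
    class PerfTimer
    {
        [DllImport("kernel32.dll")]
        static extern bool QueryPerformanceCounter(out long count);

        [DllImport("kernel32.dll")]
        static extern bool QueryPerformanceFrequency(out long frequency);

        private long start;

        public void Start()
        {
            QueryPerformanceCounter(out start);
        }

        // Elapsed time in seconds since the last Start()
        public double ElapsedSeconds()
        {
            long now, freq;
            QueryPerformanceCounter(out now);
            QueryPerformanceFrequency(out freq);
            return (now - start) / (double)freq;
        }
    }

Surround each TestN call with Start() and ElapsedSeconds() and print the result.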

OK, let’s first answer the question that was directly asked: How much faster is Test2 than Test1?

Well, the answer is — it isn’t faster… it’s slower.  As written I get 4.356s for Test1 and 6.306s for Test2.  So on my machine Test2 takes about 45% longer.

Is that surprising?  Well, maybe not because I’m sure you all knew it was a loaded question.  But why?  Surely all that object tracing and memory compaction that Test1 has to do should be much slower than a little tree walking and linked list building in Test2.  What gives?

Well you can get your first clue by trying this example out again with different tree sizes.  In the table below I show iterations and the run times for Test1 and Test2 (in seconds) in the first three columns.  Looking at that table you can see that the two are really pretty close until about 50000 iterations.  There are some reported differences but remember I didn’t repeat the runs very many times nor did I run them on a quiet machine, so I’d expect a good bit of noise.  The times for low iteration counts really are there just for completeness — they’re so fast it’s basically all noise.

So we can see that for smaller tree sizes the linked list idea is actually not doing so badly.  At 25000 nodes for instance you could believe it really is faster by a few milliseconds.  But by 50000 the list has lost any edge that it had.  So where is the extra time going?

Well, there are a couple of major differences between Test1 and Test2:  Test2 has to do an explicit walk of the tree nodes to build a free list, and then it has to actually use that free list.  So to isolate these costs a little bit I made Test3 and Test4.

In Test3 we visit the nodes and build the free list the same as Test2 but we don’t actually use the free list for anything.  To do this I merely added “freelist = null” to the end of Tree2.Reset().  So we build it up but then don’t use it.
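
In other words, the cleanup for Test3 ends up with roughly this shape.  This is just a sketch (the real Tree2 source is in the quiz posting, and "root" here is my shorthand for whatever the tree's root field is called), but Reset, FreeTree and freelist are the actual names involved.

    // Test3: build the freelist exactly as Test2 does, then throw it away.
    public void Reset()
    {
        FreeTree(root);    // walk the whole tree, linking every node onto freelist
        root = null;       // drop the tree itself

        freelist = null;   // the added line: discard the list we just built,
                           // so the garbage collector reclaims the nodes anyway
    }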

Now when we measure this at 100000 nodes an amazing thing happens.  It’s faster to make the list and throw it away than it is to actually use the freelist that we carefully created.  Yowsa!  This is extraordinary because it’s telling us that even though we’re now forcing the garbage collector to do the same work as it did in Test1 it’s still better to do that than it is to try to use the list we made.

Trying to further isolate the costs, I made Test4 in which we visit the nodes but don’t make the freelist as we go along.  So basically I commented out the last three lines of Tree2.FreeTree().  This gives us a rough idea of the cost of just visiting the nodes without building the list.  No surprises here, it’s cheaper to just visit the nodes than it is to both visit the nodes and build the list.
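
Again, just to show the shape of it, the Test4 version of the walk looks something like this.  The three commented-out lines are stand-ins for whatever the real FreeTree does to link each node onto the list, and the Node fields here are an approximation of the quiz code.

    // Test4: visit every node but don't build the freelist.
    void FreeTree(Node node)
    {
        if (node == null)
            return;

        FreeTree(node.Left);
        FreeTree(node.Right);

        // the list-building lines, commented out for Test4:
        // node.Left = null;
        // node.Right = freelist;
        // freelist = node;
    }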

With these numbers we can do a breakdown (an imperfect one but kinda in the zone) of the cost difference between Test2 and Test1.

The Cost of the Visit to the nodes is Test4 – Test1 — because in Test4 we visit the nodes but don’t build the list

The Cost of building the list is Test3 – Test4 — because in Test3 we built the list but didn’t use it

The Cost of actually using the list is Test2 – Test3 — sometimes this is a negative number because the freelist actually helps

These numbers are summarized below.

All times are in seconds.  Test3 makes the freelist but doesn't use it; Test4 visits the nodes but doesn't make the list.  The last three columns are the cost of the visit (Test4 - Test1), the cost of making the list (Test3 - Test4), and the cost of using the list (Test2 - Test3).

Iterations    Test1     Test2     Test3     Test4     Visit     Making    Using
                                                     (T4-T1)   (T3-T4)   (T2-T3)
        10    0.000     0.000     0.000     0.000     0.000     0.000     0.000
        25    0.000     0.000     0.000     0.000     0.000     0.000     0.000
        50    0.001     0.001     0.001     0.001     0.000     0.000     0.000
       100    0.001     0.001     0.002     0.001     0.000     0.000     0.000
       250    0.003     0.003     0.003     0.003     0.000     0.000     0.000
       500    0.007     0.006     0.007     0.007     0.001     0.000    -0.001
       750    0.011     0.010     0.012     0.011     0.001     0.000    -0.002
      1500    0.024     0.025     0.028     0.025     0.001     0.003    -0.003
      3125    0.057     0.052     0.067     0.065     0.007     0.003    -0.015
      6250    0.136     0.112     0.135     0.134    -0.002     0.001    -0.024
     12500    0.267     0.244     0.307     0.303     0.036     0.004    -0.063
     25000    0.604     0.599     0.725     0.650     0.046     0.076    -0.126
     50000    1.510     2.152     2.004     1.804     0.294     0.199     0.148
    100000    4.356     6.306     5.616     5.140     0.784     0.476     0.690
    200000   12.292    16.255    14.461    13.694     1.402     0.767     1.794

Again, the most astonishing number is the last column.  For 100000 nodes, using the already-made freelist has a marginal cost of 0.69s.

Let me say that again:  Even if you could magically make that same freelist at no cost it would still be better to not use it!  And of course the situation is much worse than that because it is costing us time to make it (even ignoring the cost of visiting the nodes).

Why is that freelist so bad?  And why does it start getting bad around 50000 nodes?

The answer is that the nodes in the freelist are in a lousy order. 

The keys in the tree were random numbers — a good situation for tree keys — but when we walk the tree the list gets built in an order that is related to the shape of the tree.  This shape has almost nothing to do with the location in memory of the nodes because of course the tree was built up over time in a (hopefully) fairly well distributed fashion.  As a result there’s no expectation that the children of any tree node are anywhere near their parents in memory.  If we try to recycle objects from a freelist like that then we’ll find that each new object we use is nowhere near the previous object.  That’s not a problem until we have enough objects that they mostly don’t fit in the processor’s cache — when that happens (around 50000 nodes) we start taking cache misses on most allocations.  Those cache misses slow the processor down so much that the technique is doomed.  Bad locality in the free list has destroyed its value.
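
To make that concrete, here is roughly what the two allocation paths look like.  The names are approximations of the quiz code (in particular, exactly which field links the freelist together is a detail I'm glossing over), but the pattern is the important part.

    // Where new nodes come from, approximately.
    Node AllocateNode(int key)
    {
        if (freelist != null)          // Test2: recycle from the freelist
        {
            Node n = freelist;         // this node is wherever the tree walk happened to find it...
            freelist = n.Right;        // ...and the next one is somewhere else entirely, so
            n.Left = null;             // consecutive allocations touch scattered memory and
            n.Right = null;            // miss the cache once the working set is big enough
            n.Key = key;
            return n;
        }

        return new Node(key);          // Test1: the collector hands out consecutive addresses
                                       // from the allocation area, so nodes allocated together
                                       // sit together and are cache friendly
    }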

About This Test Case

Of course you don’t ever want to read too much into the results of one test case.  And you can always point to the benchmark in question and say it’s bogus and not representative of anything.  But in this case I think there’s a lot of value in the benchmark. 

At first it may seem unfair to give the simple version (Test1) the advantage of bulk deletes but actually this isn’t as unrealistic as it first seems.  A more complete benchmark would put both Test1 and Test2 in a situation where they have to delete subtrees occasionally — this is no problem for Test1, it can delete a subtree with the same ease as it deletes everything.  And Test2 would again have to do work that’s kinda like what it’s already doing.  We’ve already shown that adding some tree visits to Test1 doesn’t make it an instant loser — the real problem with Test2 was the memory management.

A second point you could raise is that Test2 is trying to remember *all* the allocations and not just some.  That turns out to not be crazy in this benchmark because we always allocate the same number of objects but in general you would have to be a lot more clever about how much you choose to remember.  But even if you remembered less what you did remember would tend to have the same problems as shown in the existing benchmark unless you went to great pains to avoid those problems.

Finally, increasing the size of the tree objects involved in either case really shouldn’t make a whole lot of difference to the locality situation other than to magnify the same problem.  However if you do something like adding an array to the tree node then you must remember that the collector automatically gives you a nice initialized ready-to-use array.  For a fair comparison the recycling version (Test2) would also have to reset the array contents (thereby touching the memory).
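
For instance, if each node carried an int[] payload (hypothetical here, the quiz nodes have no such field), the recycling path would need to do something like this.

    // Recycling a node that owns an array: the pooled version has to zero the old
    // contents itself, because a freshly allocated array comes back already zeroed.
    void Recycle(Node n)
    {
        Array.Clear(n.Data, 0, n.Data.Length);   // touch (and zero) every element
        n.Left = null;
        n.Right = freelist;                      // link onto the freelist as before
        freelist = n;
    }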

So overall this benchmark really isn’t totally crazy or anything.  Although different usages would show the issues in different degrees the same fundamental problems would appear in a wide variety of real tree implementations.  As usual, it’s just a question of how much the benchmark magnifies the underlying problems — and that’s the function of a benchmark after all, to magnify fundamental problems so we can see them better, not to provide real measurements we can use for actual applications.

So What Is The Point

Usually I write these quizzes with a specific point in mind.  This time I have two; we’ll do the less important one first:

I often encounter people who just assume garbage collection as a memory management discipline is inherently bad and “can’t possibly lead to good results.”  I think that’s not at all accurate.  There are plenty of situations where a garbage collected discipline is going to give you great results.  In contrast it’s hard to think of any applications with significant memory usage that can actually just use malloc/free right out of the box.  Invariably you end up writing some front end to the underlying allocator that wraps malloc in some interesting kind of way so as to give you the performance you need.  These wrappers can be quite tricky to get right and they often suffer from tough issues like long term fragmentation, long term loss of locality and so forth.  You can get it right but it’s not going to just work easily.  In contrast the garbage collector often doesn’t suffer from some of the issues mentioned above, and in many cases can be used directly with no wrapper, but you may still get tricky object lifetime issues as I’ve discussed in previous blogs.  So is it perfect?  Hardly.  But is it junk?  Definitely not.  Memory management is, after all, a very difficult problem.

That point is somewhat philosophical, so let me make a more concrete one.  Many developers choose to recycle their own objects in certain subsystems.  This can be a great technique to avoid mid-life-crisis but a great deal of care is required.  Even comparatively simple seeming systems can suffer massive, and nearly invisible, performance penalties by having bad locality.  Locality always needs to be in your mind when choosing a caching policy (or recycling policy, because after all, they’re kinda the same).  Keep factors like locality in mind but make your policy choices based on measurements.

Thank you all for the great comments!

Comments (14)

  1. G. Man says:

    Heheh, pretty good! I never would have taken it down to the processor cache level… very interesting!

  2. I would be very happy if you also tested the pooled struct approach I emailed. I think this is a more realistic pool approach for .net.

    To replicate the exact same tree you need to use the same random number sequence, by starting with the same seed.

  3. Rico Mariani says:

    Frank gave me an interesting tree-in-an-array implementation. It grows the array of nodes dynamically, doubling in size until it’s big enough. From there it never grows. Pointers are replaced by array indexes. The array is cleared out between runs but is otherwise recycled without pointer chasing.

    I instrumented it and ran it through the same battery on the same machine. Test1 is again shown for comparison.

    Iterations     Test1        Array
            10     0.00010      0.00010
            25     0.00026      0.00057
            50     0.00053      0.00094
           100     0.00111      0.00178
           250     0.00314      0.00462
           500     0.00663      0.00953
           750     0.01050      0.01565
          1500     0.02402      0.03495
          3125     0.05950      0.07540
          6250     0.12592      0.16078
         12500     0.26757      0.34816
         25000     0.58688      0.76582
         50000     1.52845      1.80631
        100000     4.42142      4.79157
        200000    12.54707     12.54535

    The results are pretty close. The GC has the edge for most of the run and they even up as the array grows in size.

    Lesson: The GC cost of compacting the memory is often not terribly different than zeroing your own. So simple memory-only objects are not likely to make especially good candidates for a pool.

  4. Rico Mariani says:

    Note the times for test1 were also slower on that run. Probably more network interrupts or something…

  5. Thanks! I used to physically disconnect the network cable for perf testing. Note that the pooled struct approach eliminates the memory overhead for an object allocation. Since the Node struct or Tree class contains only 3 32-bit fields, elimination of the heap overhead should reduce memory consumption by at least 25%. That should translate into better locality of reference and performance, unless I did not optimize well.

    Another interesting thing about the pooled struct is fragmentation. Your test does not deliberately fragment the heap by allocating lots of objects in between tree node allocations. In a real world app that might happen.

  6. Sergey Korytnik says:

    The idea that the simple free item list technique doesn’t always scale well is interesting. But it may also be worth noting that when the dictionary size gets significantly larger than the processor cache size, it is more efficient to use a B-tree rather than a binary search tree.

  7. forient says:

    The result is really amazing to me! Trying to understand it, I beat my head against everything in my cubicle.

    I greatly appreciate this excellent sample and detailed analysis! I got a lot out of it.

  8. For others to examine, I am posting the code for a slightly faster pooled struct approach that does not maintain a free list.

    using System;

    /// <summary>
    /// Creates a binary tree from a linked list of pooled free nodes, using array
    /// indices as node references. Nodes cannot be removed; the tree can only be reset
    /// as a unit. Capacity can be initially set.
    /// </summary>
    class Tree6
    {
        private const int startSize = 16;
        private static Node[] pool = new Node[startSize];
        private static int count = 0;

        // NOTE: in this tree, 0 signals a null node reference

        public static void SetCapacity(int capacity)
        {
            pool = new Node[capacity];
            count = 0;
        }

        public static void Insert(int key)
        {
            if (count == 0)
            {
                NewNode(key);
                return;
            }

            int t = 0; // root node always 0
            for (;;)
            {
                Node n = pool[t];
                if (n.Key == key)
                    return;

                if (key < n.Key)
                {
                    if (n.Left == 0)
                    {
                        // allocate first: NewNode may replace the pool array when it grows
                        int child = NewNode(key);
                        pool[t].Left = child;
                        return;
                    }
                    t = n.Left;
                }
                else
                {
                    if (n.Right == 0)
                    {
                        int child = NewNode(key);
                        pool[t].Right = child;
                        return;
                    }
                    t = n.Right;
                }
            }
        }

        private static int NewNode(int key)
        {
            if (count == pool.Length)
                IncreaseCapacity();

            int t = count++;
            pool[t] = new Node(key);
            return t;
        }

        private static void IncreaseCapacity()
        {
            int size = pool.Length;
            Node[] nodes = new Node[size * 2];
            Array.Copy(pool, nodes, size);
            pool = nodes;
        }

        public static void Reset()
        {
            count = 0;
        }

        private struct Node
        {
            public int Key;
            public int Left;
            public int Right;

            public Node(int key)
            {
                Key = key;
                Left = 0;
                Right = 0;
            }
        }
    }

  9. The "linked list of free nodes" comment in that code I posted is incorrect; there is no list of free nodes. Everything at index count and above is free.

    That code is consistently faster than the GC version by a small margin in my tests, but I only tested large trees.

  10. Rico Mariani says:

    I haven’t tested the Tree6 that Frank posted but it looks a little faster than the one he mailed me. This one is cashing in on the batch delete semantics big time. Tree1 and Tree2 both had the interesting property that they handled batch deletes basically the same way that they would have handled incremental deletes.

    Still it’s interesting what can be done.