Solution and Discussion
On Wednesday I posted this leading question about the cost of using the garbage collector for a certain tree management algorithm and today I offer you my own analysis of this problem. But before I get into the details I'd like to start by reminding you that, as always, this analysis is hardly complete. I'm going to try to generalize a little bit from this example to get some educational value, but please don't take that to mean that I think what I'm going to say here is always true. If anything I think this example reinforces the difficulty of truly understanding where your performance costs are. And also by way of answer to some concerned readers who thought perhaps I'd had a wee bit too much fruitcake this Christmas -- don't worry I'm feeling much better now 🙂
To gather the numbers below I built the sources with csc /o+ and I added a few calls to QueryPerformanceCounter in strategic places. I did only very few repititions so I expect there is a certain amount of noise in my results. As a result I only report the numbers to within 1ms. I was using v1.1 of the CLR for this particular test, though I expect the results would be similar with all versions.
OK, let's first answer the question that was directly asked: How much faster is Test2 than Test1?
Well, the answer is -- it isn't faster... it's slower. As written I get 4.356s for Test1 and 6.306s for Test2. So on my machine Test2 takes about 45% longer.
Is that surprising? Well, maybe not because I'm sure you all knew it was a loaded question. But why? Surely all that object tracing and memory compaction that Test1 has to do should be much slower than a little tree walking and linked list building in Test2. What gives?
Well you can get your first clue by trying this example out again with different tree sizes. In the table below I show interations and the run times for Test1 and Test2 (in seconds) in the first three columns. Looking at that table you can see that the two are really pretty close until about 50000 iterations. There are some reported differences but remember I didn't repeat the results very many times nor did I run it on a quiet machine so I'd expect a good bit of noise. The times for low iteration counts really are there just for completeness -- they're so fast it's basically all noise.
So we can see for smaller tree sizes that linked list idea is actually doing not so bad. At 25000 nodes for instance you could believe it really is faster by a few milliseconds. But by 50000 the list has lost any edge that it had. So where is the extra time going?
Well, there are few major differences between Test1 and Test2: Test2 has to do an explicit walk of the tree nodes and build a free list, and then it uses the free list. So to isolate these costs a little bit I made Test3 and Test4.
In Test3 we visit the nodes and build the free list the same as Test2 but we don't actually use the free list for anything. To do this I merely added "freelist = null" to the end Tree2.Reset(). So we build it up but then don't use it.
Now when we measure this at 100000 nodes an amazing thing happens. It's faster to make the list and throw it away than it is to actually use the freelist that we carefully created. Yowsa! This is extraordinary because it's telling us that even though we're now forcing the garbage collector to do the same work as it did in Test1 it's still better to do that than it is to try to use the list we made.
Trying to further isolate the costs, I made Test4 in which we visit the nodes but don't make the freelist as we go along. So basically I commented out the last three lines of Tree2.FreeTree(). This gives us a rough idea what the cost is of just visiting the nodes but not building the list. No surprises here, it's cheaper to just visit the nodes than it is to both visit the nodes and build the list.
With these numbers we can do a breakdown (an imperfect one but kinda in the zone) of the cost difference between Test2 and Test1.
The Cost of the Visit to the nodes is Test4 - Test1 -- because in Test4 we visit the nodes but don't build the list
The Cost of building the list is Test3 - Test4 -- because in Test3 we built the list but didn't use it
The Cost of actually using the list is Test2 - Test3 -- sometimes this is a negative number because the freelist actually helps
These numbers are summarized below.
don't use it
don't make list
|Cost of Visit
|Cost of Making List
|Cost of Using List
Again, the most astonishing number is the last column. For 100000 nodes, using the already-made freelist has a marginal cost of 0.69s.
Let me say that again: Even if you could magically make that same freelist at no cost it would still be better to not use it! And of course the situation is much worse than that because it is costing us time to make it (even ignoring the cost of visiting the nodes).
Why is that freelist so bad? And why does it start getting bad around 50000 nodes?
The answer is that the nodes in the freelist are in a lousy order.
The keys in the tree were random numbers -- a good situation for tree keys -- but when we walk the tree the list gets built in an order that is related to the shape of the tree. This shape has almost nothing to do with the location in memory of the nodes because of course the tree was built up over time in a (hopefully) fairly well distributed fashion. As a result there's no expectation that the children of any tree node are anywhere near the parents in memory. If we try to recycle objects from a freelist like that then we'll find that each new object we use is nowhere near the previous object. That's not a problem until we have enough objects that they mostly don't fit in the processors cache -- when that happens (around 50000 nodes) we start taking cache misses on most allocations. Those cache misses slow the processor down so much that the technique is doomed. Bad locality in the free list has destroyed its value.
About This Test Case
Of course you don't ever want to read too much into the results of one test case. And you can always point to the benchmark in question and say it's bogus and not representative of anything. But in this case I think there's a lot of value in the benchmark.
At first it may seem unfair to give the simple version (Test1) the advantage of bulk deletes but actually this isn't so unreliastic as it first seems. A more complete benchmark would put both Test1 and Test2 in a situation where they have to delete subtrees occassionally -- this is no problem for Test1, it can delete a subtree with the same ease as it deletes everything. And Test2 would again have to do work that's kinda like what its already doing. We've already shown that adding some tree visits to Test1 doesn't make it an instant loser -- the real problem with Test2 was the memory management.
A second point you could raise is that Test2 is trying to remember *all* the allocations and not just some. That turns out to not be crazy in this benchmark because we always allocate the same number of objects but in general you would have to be a lot more clever about how much you choose to remember. But even if you remembered less what you did remember would tend to have the same problems as shown in the existing benchmark unless you went to great pains to avoid those problems.
Finally, increasing the size of the tree objects involved in either cases really shouldn't make a whole lot of difference to the locality situation other than magnify the same problem. However if you do something like adding an array to the tree node then you must remember that the collector automatically gives you a nice initialized ready-to-use array. For a fair comparison the recycling version Test2 also would have to reset the array contents (thereby touching the memory).
So overall this benchmark really isn't totally crazy or anything. Although different usages would show the issues in different degrees the same fundamental problems would appear in a wide variety of real tree implementations. As usual, it's just a question of how much the benchmark magnifies the underlying problems -- and that's the function of a benchmark after all, to magnify fundamental problems so we can see them better, not to provide real measurments we can use for actual applications.
So What Is The Point
Usually I write these quizzes with a specific point in mind. This time I have two, we'll do the least important first:
I often encounter people who just assume garbage collection as a memory management discipline is inherently bad and "can't possibly lead to good results." I think that's not at all accurate. There are plenty of situations where a garbage collected discipline is going to give you great results. In contrast it's hard to think of any applications with significant memory usage that can actually just use malloc/free right out of the box. Invariably you end up writing some front end to the underlying allocator that wraps malloc in some interesting kind of way so as to give you the performance you need. These wrappers can be quite tricky to get right and they often suffer from tough issues like long term fragmentation, long term loss of locatity and so forth. You can get it right but it's not going to just work easily. In contrast the garbage collector often doesn't suffer from some of the issues mentioned above, and in many cases can be used directly with no wrapper but you still may get tricky object lifetime issues as I've discussed in previous blogs. So is it perfect? Hardly. But is it junk? Definately not. Memory management is, after all, a very difficult problem.
That point is somewhat philosophical, so let me make a more concrete one. Many developers choose to recycle their own objects in certain subsystems. This can be a great technique to avoid mid-life-crisis but a great deal of care is required. Even comparatively simple seeming systems can suffer massive, and nearly invisible, performance penalties by having bad locality. Locality always needs to be in your mind when choosing a caching policy (or recycling policy, because after all, they're kinda the same). Keep factors like locality in mind but make your policy choices based on measurements.
Thank you all for the great comments!