Performance Quiz #5: The performance cost of the garbage collector

It's been a while, so it's time for another exciting Performance Quiz!

Here are two implementations of a management algorithm that creates and destroys trees.  The first implementation uses the garbage collector to do all of the memory management (even freeing is done via periodic collections; we just drop the nodes on the floor).  The second implementation is more like what you would write if you had to do a malloc/free-style solution that managed its own memory.  However, after the first iteration there will be no collections at all -- allocations come out of a high-speed free list, and the free list is created via a fast recursive algorithm.

The Question:  How much faster is Test2() than Test1()?

Warning: There are already spoilers in the feedback

------------------------------------------

using System;

namespace Test
{
    public class Quiz5
    {
        public static void Main()
        {
            // try running the tests in the other order too, or just one per run
            Test1();
            Test2();
        }

        public static void Test1()
        {
            Random r = new Random();

            Tree1.Reset();

            for (int i = 0; i < 50; i++)
            {
                for (int j = 0; j < 100000; j++)
                {
                    int k = (int)r.Next();
                    Tree1.Insert(k);
                }
                Tree1.Reset();
            }
        }

        public static void Test2()
        {
            Random r = new Random();

            Tree2.Reset();

            for (int i = 0; i < 50; i++)
            {
                for (int j = 0; j < 100000; j++)
                {
                    int k = (int)r.Next();
                    Tree2.Insert(k);
                }
                Tree2.Reset();
            }
        }

        class Tree1
        {
            private int key;
            private Tree1 left;
            private Tree1 right;

            private static Tree1 root;

            private Tree1(int _key)
            {
                key = _key;
            }

            static public void Reset()
            {
                root = null; // much garbage is created
            }

            static public void Insert(int key)
            {
                Tree1 t = root;
                if (t == null)
                {
                    root = new Tree1(key); // asking for new memory
                    return;
                }

                for (;;)
                {
                    if (t.key == key)
                        return;

                    if (key < t.key)
                    {
                        if (t.left == null)
                        {
                            t.left = new Tree1(key);  // new memory again
                            return;
                        }
                        else
                        {
                            t = t.left;
                        }
                    }
                    else
                    {
                        if (t.right == null)
                        {
                            t.right = new Tree1(key); // new memory yet again
                            return;
                        }
                        else
                        {
                            t = t.right;
                        }
                    }
                }
            }
        }

        class Tree2
        {
            private int key;
            private Tree2 left;
            private Tree2 right;

            private static Tree2 root;
            private static Tree2 freelist;

            static public Tree2 NewTree2(int _key)
            {
                Tree2 t = freelist;

                if (t != null)
                {
                    freelist = t.right;
                    t.right = null;
                }
                else
                {
                    t = new Tree2();  // on the 2nd time through this never happens
                }

                t.key = _key;
                return t;
            }

            static public void Reset()
            {
                FreeTree(root);
                root = null;
            }

            static private void FreeTree(Tree2 t)
            {
                // put the tree on the free list... no collections

                if (t == null)
                    return;

                FreeTree(t.right);
                FreeTree(t.left);

                t.right = freelist;
                t.left = null;
                freelist = t;
            }

            static public void Insert(int key)
            {
                Tree2 t = root;
                if (t == null)
                {
                    root = NewTree2(key);
                    return;
                }

                for (;;)
                {
                    if (t.key == key)
                        return;

                    if (key < t.key)
                    {
                        if (t.left == null)
                        {
                            t.left = NewTree2(key);
                            return;
                        }
                        else
                        {
                            t = t.left;
                        }
                    }
                    else
                    {
                        if (t.right == null)
                        {
                            t.right = NewTree2(key);
                            return;
                        }
                        else
                        {
                            t = t.right;
                        }
                    }
                }
            }
        }
    }
}


1. forient says:

The Test2 method will take about double the time of Test1, because Tree2 frees its garbage one node at a time, which takes a bundle of time; Tree1 is happy to kick the whole tree over to the GC, which does its best to reclaim the memory.

In my test, the Test2 method took 6.48 seconds and Test1 only 3 seconds.

Very impressive quiz! Thanks! :0)

2. Michael says:

> How much faster is Test2() than Test1()?

It isn’t faster; Test1 is faster, since allocation is so fast on the compacted heap.

Test2 executes more code while it is running, so it is slower.

But I don’t think this is a proper basis for concluding that garbage collection is always more performant than unmanaged malloc/free-type mechanisms.

First of all, when you talk about the GC version’s performance, are you asking about overall system performance, or just the immediate function call?

It seems like deferring some work to the garbage collector will eventually cause the garbage collector to do more work at some point in the future, while a malloc/free type implementation pays for it right now.

But just because it defers it doesn’t mean that the deferred work doesn’t exist at all…

Also, it’s certainly possible to use a pool-based malloc/free type mechanism, where you just free the pool once instead of for each object. An approach like that would probably change the performance characteristics of your test pretty considerably.

3. If you use an array of structs instead of classes, and int indexes instead of pointers for the references, the pool might actually serve some purpose (save memory and possibly improve locality of reference). As is, it has no advantage.
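[Ed. note: the array-of-structs idea in this comment might be sketched roughly as follows. This is an illustrative reader sketch, not code from the quiz or its solution; the names Tree3, Capacity, and Contains are invented, the fixed pool size is an assumed simplification, and overflow of the pool is left unchecked for brevity.]

```csharp
class Tree3
{
    // Nodes live in one pre-allocated array; "references" are int
    // indexes, with -1 meaning null. The GC sees a single array object
    // containing no object references to trace, and Reset() discards
    // the whole tree with two plain stores.
    struct Node
    {
        public int key;
        public int left;   // index into nodes, -1 == null
        public int right;
    }

    const int Capacity = 100000;          // assumed fixed pool size
    static Node[] nodes = new Node[Capacity];
    static int count = 0;                 // next unused slot
    static int root = -1;

    public static void Reset()
    {
        count = 0;   // no per-node freeing, no garbage, no write barriers
        root = -1;
    }

    static int NewNode(int key)
    {
        nodes[count].key = key;
        nodes[count].left = -1;
        nodes[count].right = -1;
        return count++;
    }

    public static void Insert(int key)
    {
        if (root == -1)
        {
            root = NewNode(key);
            return;
        }

        int t = root;
        for (;;)
        {
            if (nodes[t].key == key)
                return;

            if (key < nodes[t].key)
            {
                if (nodes[t].left == -1) { nodes[t].left = NewNode(key); return; }
                t = nodes[t].left;
            }
            else
            {
                if (nodes[t].right == -1) { nodes[t].right = NewNode(key); return; }
                t = nodes[t].right;
            }
        }
    }

    public static bool Contains(int key)
    {
        int t = root;
        while (t != -1)
        {
            if (nodes[t].key == key)
                return true;
            t = key < nodes[t].key ? nodes[t].left : nodes[t].right;
        }
        return false;
    }
}
```

Because the indexes are plain ints stored into a struct array, these stores involve no reference bookkeeping at all, which also bears on the write-barrier discussion later in the thread.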

4. Dean Harding says:

With the exact code you posted, I got about the same results but I assume that’s because test1 didn’t actually do any garbage collects. So I modified it a bit and added a member to each class:

private byte[] val = new byte[1000];

Which makes each node 1000 bytes big. In that case I get 22s for method 2 and 3 minutes 20 seconds for method 1.

At the same time, however, you probably shouldn’t just cache every node that’s ever created. What happens if the first request requires 1 million nodes, but then you only ever need at most 10 at a time after that? Of course, in a server-based application you’re probably getting a fixed number of requests per second, so you can probably assume it’s OK to just naively cache everything indefinitely…

5. Rico Mariani says:

I’ll have some more comments when I post my "solution" but I’m very sure that I won’t be jumping to any conclusions about which is faster generally 🙂

I don’t think this freelist is a dead loser at all. Try cutting the number of entries to 50000 and see how it does. Then try 10000.

Anybody interested in coding up an array of structures approach for comparison?

I don’t want to give away too much of my analysis until I post the solution, I’d rather encourage you guys to think about it 🙂

As usual, it’s more slippery than it seems on the surface. 🙂 🙂

6. Rico Mariani says:

As written there are plenty of garbage collects required for Test1 — it’s not cheating there. The true cost is directly measurable. In contrast, Test2 will have no collections at all.

You don’t *have* to add additional members to force collections but such additions do of course make a difference in the results. But which way and for which sizes of members? 🙂

7. Luc Cluitmans says:

First of all, I didn’t play around with the code. But the phrasing of your question, ‘How much faster is Test2() than Test1()?’, already set off alarm bells of the ‘is that a trick question?’ kind.

But one thought that came up in my mind, and that hasn’t been mentioned above, is what would happen if some ‘other’ part of the application induced garbage collections simultaneously with this code. Isn’t it so that sooner or later you may see differences between the trees’ behaviours related to which GC generation the nodes are placed in? Obviously, Tree2 has a larger chance of getting its data into generation 2, because it doesn’t free anything. But then again, the same may be true for Tree1, if it doesn’t free any memory during the build-up. Any thoughts about how GC generations affect this scenario?

8. Rico Mariani says:

What? Me ask a trick question? What makes you think I would ever do such a thing? *evil grin* 🙂 🙂

9. Mike Swaim says:

Don’t *both* versions garbage collect, with the GC in the second one just wasting cycles?

10. Rico Mariani says:

The GC collects in response to memory pressure — i.e. allocations. In the absence of allocations there are no collections. The second one does not collect.

11. G. Man says:

If you are trying to prove that managed memory is faster than unmanaged, I think you are just wasting time (pun intended). You need to run Test2 in an unmanaged environment. You can’t just claim that because no collections happen, it’s just like an unmanaged environment.

No collections != Unmanaged code

12. Rico Mariani says:

I’d be foolish indeed to try to prove anything so profound as that with such a trivial benchmark. It’s merely an interesting problem with surprising and educational properties.

13. Joe Laughlin says:

A key part of GC is determining what is live and what is not. Updating the references to add the nodes back to the free list causes a lot of bookkeeping code associated with tracking the live/dead state of the heap to be executed (write barriers), making Test2 slower. For this reason, reference/pointer-rich data structures like trees and linked lists can be especially hard on the GC, and which generation they are in can add additional cost. For example, modifying references contained within objects in gen 2 will likely be more costly than modifying those in gen 1 or gen 0.

14. RJ says:

The perf counter ‘% Time in GC’ had a somewhat surprising result on this code. Test1 had a normal erratic graph with a single-digit average, but Test2 held a constant 30%. Is that counter always accurate, or is it possible the value got ‘stuck’?

The result makes more sense if Joe is correct that assigning a GC pointer is not as simple as copying a number to a memory location.

15. Joe Laughlin says:

It’s more than just the cost of modifying the references. When the GC goes to check which objects are alive, it must trace through all of the modified references. Test2 has a lot more "live" modified references, and so much more of the heap needs to be traced. Modifying references in higher generations can result in a lot of the heap having to be traced.

17. Rico Mariani says:

>>The perf counter ‘% time in GC’ had a somewhat surprising result on this code. Test1 had a normal erratic graph with a single digit average. But Test2 held a constant 30%. Is that counter always accurate, or is is possible the value got ‘stuck’?

Those performance counters are updated at every collection (it would be too expensive to update them in real time). So if there are no collections then of course it shows the % as of the last collection and stays there. Test2 has no collections so it stays "stuck" as you say.

Maoni wrote an excellent treatment of this here: http://blogs.msdn.com/maoni/archive/2004/06/03.aspx

And btw her blog generally has great GC advice.

>>It’s more than just the cost of modifying the references. When GC goes to check which objects are alive it must trace through all of the modified references.

In this particular benchmark Test2 has no collections at all. Therefore there is no cost associated with tracing. If Test2 were being used side-by-side with other code it would incur that cost so you would likely not want to keep such a large freelist around. I talk about this some in the solution (link above).

To use Test2 in a real system you’d want to be a lot more clever about your recycling policy.
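[Ed. note: one possible "more clever" recycling policy is simply to cap the free list and let the collector have anything beyond the cap. The sketch below is an illustrative reader variation on Tree2, not the quiz's code or Rico's solution; Tree2Capped, MaxFree, and freeCount are invented names, and the cap of 10000 is arbitrary.]

```csharp
class Tree2Capped
{
    public int key;
    public Tree2Capped left;
    public Tree2Capped right;       // doubles as the free-list link

    const int MaxFree = 10000;      // arbitrary cap; tune to the expected live set
    static Tree2Capped freelist;
    static int freeCount;

    public static Tree2Capped New(int key)
    {
        Tree2Capped t = freelist;
        if (t != null)
        {
            // reuse a recycled node from the free list
            freelist = t.right;
            freeCount--;
            t.right = null;
        }
        else
        {
            t = new Tree2Capped();  // free list empty: ask the GC heap
        }
        t.key = key;
        return t;
    }

    public static void Free(Tree2Capped t)
    {
        if (t == null)
            return;

        Free(t.left);
        Free(t.right);

        t.left = null;
        if (freeCount < MaxFree)
        {
            t.right = freelist;     // recycle up to the cap...
            freelist = t;
            freeCount++;
        }
        else
        {
            t.right = null;         // ...and drop the rest for the collector
        }
    }
}
```

The trade-off: nodes beyond the cap become ordinary garbage (bounding the memory the free list pins), while nodes under the cap still avoid fresh allocations on the next round.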

18. I emailed you the pooled struct code. You did not provide a correctness verification function and I did not write one, so it is unverified. But it does compile and run.

Note that the memory consumption could be cut by 33% by using 2 shorts instead of ints for the right and left references; the maximum capacity would then be limited to the maximum value of a short.

19. 菊池 Blog says:

Performance Quiz:Memory management