Mid-life crisis

This particular problem (I call it mid-life-crisis) seems to come up fairly often so I thought I'd write up some general advice on it.  The symptoms go something like this:  There is a server process (usually a web server) and that process spends a high percentage of its time in the garbage collector, say 30% or even more.  Simple enough, but why would the time be so high?  Why isn’t it more like the 1% or so that we’d like it to be?

Often (not always) the answer is mid-life-crisis.  By this I mean that something has happened that is causing objects of normally middling lifetime (which we would very much like to die in generation 1) to live longer and end up dying in generation 2.

This is very bad.

Generation 2 garbage collections are the full ones.  That means every live object on the heap must be visited and the process is largely stopped while this is going on.  If you are getting a lot of objects promoted all the way from generation 0 to generation 2 and then having them die shortly thereafter, you are paying a huge price to clean up those objects.

Why does this happen?

Well in the server case there’s one very common reason.  Let’s say it’s a web server, a request comes in, a bunch of setup work is done to get the result for that request, and at some point the code then accesses a database or a web-service to get the necessary data.  At that point the thread accessing the data is blocked, but all the objects that were pending for that request are still live.  Meanwhile, other threads on the server are still running, still doing allocations, and those might end up requiring a garbage collection.  When that collection happens, all the temporary objects on the blocked threads are still live, probably in local variables or objects that represent the transaction in flight.  They survive the collection and are promoted. 

Now since transactions are often longish things and collections are going to happen at some point, it’s normal for some objects associated with the transaction in flight to survive the generation 0 collections that are hopefully happening every second or so.  Those objects are going to get promoted to generation 1 just as they should.  In fact, the main purpose of generation 1 is to give transaction-related objects a place to stick around for a while and then die cheaply.

But here’s where things go wrong.  If there are fairly long delays waiting for, say, database results, and a fairly large number of objects representing the state of the transaction in flight, there will be enough buildup in generation 1 that a generation 1 collection gets triggered.  At that point the survivors will be promoted to generation 2.  If there are a lot of survivors we are now in trouble because in order to clean them up we will have to do a full collection.  If those are happening regularly, the percentage of time spent in the collector will shoot up from a healthy 1% to something very bad, like 30%, 50%, even more sometimes.
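The promotion chain described above can be sketched with a toy model.  This is only an illustration under made-up assumptions, not how the CLR collector is implemented: the names (ToyHeap, simulate), the fixed collection schedule (generation 0 every tick, generation 1 every 5th tick, generation 2 every 50th tick) and the idea of storing each object as its "death tick" are all invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ToyHeap:
    """Toy three-generation heap (hypothetical model, not the CLR's).

    Collecting generation N frees dead objects in generations 0..N and
    promotes every survivor by one generation (generation 2 is the top).
    """
    gens: list = field(default_factory=lambda: [[], [], []])
    died_in: list = field(default_factory=lambda: [0, 0, 0])

    def allocate(self, obj):
        self.gens[0].append(obj)

    def collect(self, max_gen, is_live):
        # Oldest generation first, so an object is promoted at most once
        # per collection.
        for g in range(max_gen, -1, -1):
            survivors = [o for o in self.gens[g] if is_live(o)]
            self.died_in[g] += len(self.gens[g]) - len(survivors)
            self.gens[g] = []
            self.gens[min(g + 1, 2)].extend(survivors)


def simulate(block_ticks, ticks=100, objs_per_request=10,
             gen1_every=5, gen2_every=50):
    """Start one request per tick whose objects stay live for block_ticks.

    Returns how many objects were found dead in each generation.
    """
    heap = ToyHeap()
    for now in range(1, ticks + 1):
        for _ in range(objs_per_request):
            # Each object is represented only by the tick at which it dies.
            heap.allocate(now + block_ticks)
        if now % gen2_every == 0:
            max_gen = 2
        elif now % gen1_every == 0:
            max_gen = 1
        else:
            max_gen = 0
        heap.collect(max_gen, is_live=lambda death: death > now)
    return heap.died_in
```

In this model, simulate(1) (a short wait) finds every dead object in generation 1, while simulate(8) (a wait longer than the gap between generation 1 collections) finds every dead object in generation 2, which is the mid-life crisis in miniature.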

So what to do about this?

Well, the good news is there’s a fairly straightforward line of defense.  The trick is that you must clean up (i.e. set references to null) as much of your state as possible before you block on something like a database, or really before you block on anything that might take a long time.  It’s often the case that a lot of the temporary data won’t be needed after the database results come back, or could be cheaply recreated.  Before you call your web-service or database backend, get rid of as much as you can so that the objects that will survive collections while you are blocked are minimized.  This will let more things die in generation 0, minimize additions to generation 1, and avoid the crisis your mid-life generation 1 objects will cause should they start surviving into generation 2.
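Here is a minimal sketch of that pattern, written in Python only because it is easy to run.  Everything in it is hypothetical: RequestState, fake_database_call and handle_request are made-up names, and the "database call" is a stand-in that forces a collection and uses a weak reference to report whether the bulky request state was still reachable while the handler was "blocked".  CPython's reference counting is a different mechanism from the CLR's generational collector, so treat this as an illustration of the reachability idea rather than of .NET behavior.

```python
import gc
import weakref


class RequestState:
    """Hypothetical bulky per-request state, needed only to build the query."""

    def __init__(self, raw):
        self.tokens = raw.split()


def fake_database_call(query, probe):
    """Stand-in for a blocking call (assumption: no real I/O happens here).

    A collection runs "while we are blocked"; the weak reference tells us
    whether the request state was still reachable at that moment.
    """
    gc.collect()
    return {"rows": [query.upper()], "state_survived": probe() is not None}


def handle_request(raw, release_before_blocking):
    state = RequestState(raw)
    probe = weakref.ref(state)   # observes the state without keeping it alive
    query = state.tokens[0]      # keep only the small piece we still need
    if release_before_blocking:
        state = None             # drop the bulky state before we block
    return fake_database_call(query, probe)
```

Calling handle_request("select users now", True) reports that the state did not survive the collection, whereas passing False leaves it reachable through the local variable for the whole duration of the call.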

Remember, the “age” of objects is a relative thing. Collections cause things to age, and allocations are what cause collections, so reducing the total number of allocations causes things to age more slowly.  Having your objects die as quickly as possible again reduces the pressure to grow the generations and hence keeps things younger.
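To make the "age is relative" point concrete, here is a small toy calculation.  The function name, the allocation budget of 100, and the rule that every 4th collection also collects generation 1 are assumptions made up for the sketch; the real collector tunes its budgets dynamically.

```python
def generation_reached(lifetime_ticks, allocs_per_tick,
                       gen0_budget=100, gen1_every=4):
    """Generation one object has reached after staying live for lifetime_ticks.

    Toy rules (assumptions, not the CLR's): a collection runs each time
    gen0_budget allocations have accumulated, and every gen1_every-th
    collection also collects generation 1.
    """
    generation = 0
    pending = 0       # allocations since the last collection
    collections = 0
    for _ in range(lifetime_ticks):
        pending += allocs_per_tick
        while pending >= gen0_budget:
            pending -= gen0_budget
            collections += 1
            if generation == 0:
                generation = 1   # survived a generation 0 collection
            elif generation == 1 and collections % gen1_every == 0:
                generation = 2   # survived a generation 1 collection
    return generation
```

With the same 10-tick lifetime, the object stays in generation 0 when the rest of the server allocates 5 objects per tick, reaches generation 1 at 20 per tick, and ends up in generation 2 at 50 per tick.  The object's wall-clock lifetime never changed; only the allocation rate around it did.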

To see if this sort of thing is happening to you, you can look at the Relocated Types view in CLR Profiler to see what’s getting moved around (remember that things are normally moved when they are promoted, so moved objects are a good proxy for promoted objects).  To get overall promotion rates, use the GC performance counters; there are counters that will tell you how much stuff is getting promoted into generation 2.  You want that number to be as small as possible – zero is ideal and even achievable in steady state, but as long as the rate of generation 2 collections is staying low, you’ll be fine.

Summary:  Don’t have a mid-life-crisis.  When there are many threads, be sure to release as many of your objects as possible before you block any thread.