I thought I’d share a support story with you from a very interesting case I have. My customer is running a pretty busy ASP.NET application on an 8-way box. Actually they are running on several 8-way boxes but that doesn’t really make a difference for the story…
Intermittently requests are running so slow that the ASP.NET application has to be restarted.
The customer has been extremely dilligent in taking memory dumps of the process when it is slow, and this turned out to be a key to solving this situation because what they were running into was really messy and hard to find.
In pretty much all the memory dumps they were in the middle of a garbage collection, so the problem looked like a classic high CPU in GC… but before we continue the story I have to give some background information about the garbage collector so that you can follow along.
I have written tonnes about the GC so in order to not bore you too much here are the points that are interesting for our story
1. Garbage Collections occur either when we make an allocation that makes the GC exceed the limit for one of its generations, alternatively when someone calls GC.Collect
2. Collecting higher generations is more expensive than collecting younger ones. When you have high CPU in GC you typically see a high ratio of Gen 2 collections.
3. On a multiproc box running ASP.NET you will use the server GC which means that you have one GC thread per CPU and one GC heap for each GC thread. These threads work on their respective GC heaps at the same time and join betweeen the different phases (eg. plan, mark, sweep, compact etc.). The join is typically really quick since the GC works hard on balancing the heaps making sure that each GC thread gets approximately the same amount of work.
4. If the application uses the large object heap extensively, the application will have a higher Gen 2 ratio and spend more time in GC.
For some related stories that will give you more background info, see http://blogs.msdn.com/tess/archive/2006/11/24/asp-net-case-study-bad-perf-high-memory-usage-and-high-cpu-in-gc-death-by-viewstate.aspx and http://blogs.msdn.com/tess/archive/2006/06/22/643309.aspx.
Debugging the problem:
It doesn’t surprise me when a process is spending a lot of time in garbage collection, we get those cases all the time and it usually turns out to be a bad allocation pattern causing the problem. Naturally I looked for all the common causes such as large allocation rates, lots of large objects, calls to GC.Collect, unneccessary finalizers etc. and we found a little of each but didn’t find that one thing that makes you go hmmm… yeah that’s it, anywhere.
I racked it up to a lot of small things, like some GC.Collects, some large object usage, fairly high allocation rates etc. all working together to make the process go bananas with the garbage collection, but it bugged me because i like black and white, i don’t like a lot of maybes, especially since I know that any one of these little changes that are needed will take a long time to implement and I can’t promise that they will make a lot of difference.
To add insult to injury each GC was triggered by something different, some by small object allocations, some by large object allocations, some by GC.Collect etc. so the proverbial murder weapon was missing.
Something else that bugged the heck out of me was that in pretty much all the dumps we had 7, sometimes all 8 GC threads stuck in a join. In other words, we hardly every caught it doing any real work in the GC, we were just waiting for all the GC threads to hook up all the time.
ChildEBP RetAddr Args to Child
0218f9ac 7c822124 77e6bad8 000002e4 00000000 ntdll!KiFastSystemCallRet
0218f9b0 77e6bad8 000002e4 00000000 00000000 ntdll!NtWaitForSingleObject+0xc
0218fa20 77e6ba42 000002e4 ffffffff 00000000 kernel32!WaitForSingleObjectEx+0xac
0218fa34 791f3462 000002e4 ffffffff 00000000 kernel32!WaitForSingleObject+0x12
0218fa48 7924eaca 000de140 00000000 000de140 mscorsvr!t_join::join+0x54
0218fad0 791f3d71 00000000 000de140 00000008 mscorsvr!gc_heap::plan_phase+0x986
0218fb24 791f3ce4 00000000 000de140 77e670da mscorsvr!gc_heap::gc1+0x78
0218fb88 791f3219 00000000 00000000 000de140 mscorsvr!gc_heap::garbage_collect+0x22f
0218fbac 79224ac2 79172270 79172277 ca1386b6 mscorsvr!gc_heap::gc_thread_function+0x42
0218ffb8 77e6608b 000de140 00000000 00000000 mscorsvr!gc_heap::gc_thread_stub+0x1e
0218ffec 00000000 79224aa4 000de140 00000000 kernel32!BaseThreadStart+0x34
I gave them the fix for the timers bug, and the fix for a situation where we do a lot of Gen 2 collections if you have more than 2 GB of RAM but nothing seemed to help, and this issue just kept haunting me, finally I had more dumps than I could muster:)
We got a break in troubleshooting when I got some dumps where they had the loader lock-GC deadlock caused by a 3rd party component they were using. I asked them to separate out the part of the application using this component and suddenly all their problems went away. Now, one could argue that separating this piece of the application out separated out a lot of code that could have caused the high allocation rates, large object usage etc. too, and I agree. Not only that, but if the deadlock was the issue, we would see the deadlock in all the dumps that were hanging and we didn’t. So something else in that piece of the code caused the massive amount of time spent in GC.
What was even more troublesome was that this problem would occur sometimes after a few hours, but sometimes after such a short time as a few minutes. Taking multiple dumps after eachother showed that we were not actually stuck in join, the process progressed and dumps taken a few minutes apart would both be stuck in GC, but in different garbage collections, however knowing that the problem occurred when this part of the app was enabled allowed them to repro more readily in a test environment. Finally one of the dumps that they sent in contained the smoking gun.
ChildEBP RetAddr Args to Child
08f0ed00 7c821e64 77e952ac ffffffff 00000015 ntdll!KiFastSystemCallRet
08f0ed04 77e952ac ffffffff 00000015 08f0ed28 ntdll!ZwSetInformationProcess+0xc
08f0ed1c 02e8ba0d ffffffff 00000008 00000008 kernel32!SetProcessAffinityMask+0x16
08f0ee84 77e60eb5 08f0ee9c 08f0eec4 00000000 0x2e8ba0d
08f0eec4 06886841 05dbc030 08f0f9f4 792e6ab9 kernel32!GetSystemInfo+0x3f
08f0eed8 09ffd76e 0bd652b4 08f0ef9c 00000000 0x6886841
08f0f1f0 7c34218f 08f0f2e4 0a41af6a 0bd226d0 TheSmokingGun+0xd76e
791c9410 791cfd8b 79243b66 792a1d1a 79205613 msvcr71!free+0xc8
791c9410 00000000 79243b66 792a1d1a 79205613 mscorsvr!Thread::StackWalkFramesEx+0x3e4
And low and behold, it was their 3rd party component, the one that had caused the loader lock issue… Now, of course, the dll isn’t called TheSmokingGun.dll:) that would be too easy, but I’m replacing the real name of the component since the actual name is unimportant for the story.
So what is so smoking about this? Well, this component is calling into the API SetProcessAffinityMask. And guess what this does?
It limits the process to run on only one CPU… Niiiiiiice….
Here is a little note from Raymond Chen on the topic, and if that didn’t convince you that calling this from a library component is a bad thing, check out the API docs http://msdn2.microsoft.com/en-us/library/ms686223.aspx where it even states
“Do not call SetProcessAffinityMask in a DLL that may be called by processes other than your own”.
So what does this mean for the GC? Well, first off, it’s majorly bad for the whole process, we think we are running on 8 CPUs and we are only running on one, but it’s even worse for the garbage collector. The garbage collector as I mentioned before relies heavily on running concurrently on all processors, that is the advantage of the server GC. The joins are just overhead but in a normal situation the overhead is supersmall compared to the gain of being able to run on all processors at the same time.
In this case, where the GC thinks that it has 8 processors and can only run on 1, we have 8 threads doing work, but instead of doing it all at the same time each of them do it one after the other, and the other ones are stomping their feet asking, “are you done yet”, stealing CPU time by stomping their feet.
Just for comparison purposes, I created a little sample that allocated a fair amount of memory/sec and called SetProcessAffinityMask to limit it to one processor. My CPU in GC went from 20%-30% avg to 70% avg, and it took me almost 3 times as long to execute my requests.
Here is another support story on the topic if mine isn’t enough to give you a headache:)
The moral of the story
If you think that SetProcessAffinityMask is the solution to all your multi threading problems, think again:) If you don’t own the process you shouldn’t be messin with it.
Until next time,