My application seems to hang. What do I do? – Part 2


Last time I talked about the hang scenario where your process is taking 0 CPU and the CPU is taking by other process(es) on the same machine.


 


The next scenario is your process is taking 0 CPU and the CPU is barely used by other processes.


 


As one of the readers correctly pointed out, this is very likely because you have a deadlock. Usually debugging deadlocks is relatively straightforward – you look at what the threads are waiting on and figure out which other threads are holding the lock(s). And there are plenty of online resources that talk about debugging deadlocks. If you use the Windows Debugger package there are built in debugger extension dlls that help you with this like !locks and etc. If you are debugging a managed app the SoS debugger extension has commands that will aid you – !SyncBlk shows you managed locks (for CLR 2.0 there’s also !Dumpheap –thinlock for objects locked with ThinLocks instead of SyncBlk’s).


 


Another possibility is your process is not doing any CPU related activities. A common activity is IO – for example if the process is heavily paging you will see almost 0 CPU usage but it appears hang because the memory it needs is getting loaded from the disk which is really slow. A very useful tool that shows you what processes are doing is Process Monitor. Yesterday a program on my machine paused periodically – very annoying. So I used process monitor which showed me that this program periodically checks if I am logged onto my account in another program and since I am not, it would log me on, does a little bit stuff then log me off. And the hang was due to waiting on network IO. So to make it happy I logged myself on then the annoying periodic hang disappeared.


 


Now if your process is indeed taking CPU it can also appear to hang – as I mentioned last time this means different things for different people. If you have a UI app this can mean the UI is not getting drawn; if you have a server app this can mean your app is not processing requests. So you’ll have to define what hang means to you. I will use server app not processing requests as an example. Usually server applications run on dedicated machines. So let’s assume that’s the case here – you run a server on a machine and the server could consist of multiple processes. You measure the server performance by throughput. One scenario is the CPU usage is high (perhaps even higher than usual) but the throughput is lower than usual.


 


The easiest case, as one of the readers pointed out, is an infinite loop – very easy to debug. You break into the debugger a few times and see a thread is taking all the CPU and that thread can not exit some function – so there goes your infinite loop. And if your process is pretty much the only process that uses CPU at the time this is super obvious. It gets a bit more complicated if you have multiple CPUs and other processes are also using CPUs. But still since it’s an infinite loop the nice thing is it will always be executing if you don’t interfere so it’s always available to you to investigate as soon as it happens.


 


It becomes hard if the hang only reproes sporadically and when it reproes it only lasts for a little while. Time to whip out a CPU profiler. As another reader pointed out, Process Explorer is a useful tool to get you started. It shows you which processes are “active” – meaning it’s using CPUs. Personally I start with collecting appropriate performance counters, partially because pretty much all test teams in the product groups at Microsoft have some sort of automated testing procedure that collects perf counters so requesting them is easy. And because of the low overhead you can collect them for a long period of time so you have a histogram.


 


These are the counters I usually request (comments in []’s):


 


Processor\% Processor Time for _Total and all processors


[This is so I have an idea what kind of CPU usage I am looking at and if there are paticular processors that get used more than the rest]


 


Process\% Processor Time for all processes or less the ones that you already know can not be the problem



 


Thread\% Processor Time for all processes or less the ones that you already know can not be the problem



 


[The above counters will tell you which threads are using the CPU so you know which threads to look at]


 


[Since I usually look at GC related issues I request all counters under .NET CLR Memory]


.NET CLR Memory counters for all managed processes or less the ones that you already know can not be the problem


[If you are looking at other things you should add appropriate counters – for example,  ASP.NET counters for apps that use ASP.NET]


 


[If you know the kind of activities your processes do you can add appropriate counters for them. For me I often request memory related counters like:]


Memory\% Committed Bytes In Use


Memory\Available Bytes


Memory\Pages/sec


Process\Private Bytes for processes I am interested in



 


At this point I can look at the results and concentrate on the interesting parts – for example when the CPU is usually high. I will have an idea which threads in what processes are consuming the CPU and the aspects of them that interest me (usually GC and other memory activities). Then I can request more detailed data on those processes/threads. For example I can ask the person to use a sampling profiler so I can see what functions are executing in the part I am interested in (along with other info – this depends on what the profiler you are using is capable of).


 


Some people prefer to take memory dumps when the process hangs, sometimes this doesn’t necessarily work (when it works it’s great) because if the hang is related to timing/how threads are scheduled the threads can easily behave differently when you interrupt it in order to take memory dumps so the hang may not repro anymore. If you do have consecutive dumps from one hang then you can use the !runaway command to see which threads have been consuming CPU. One dump is hardly useful for debugging hangs because it only gives you info at one point in time how the process behaves.

Comments (8)

  1. Rudi says:

    Hi,

    I have been reading a lot on your blog (and others),  but still there is something I cannot explain:

    We have an ASP.NET application that can generate reports. However, some users will request soo much data that the server will throw an OutOfMemoryException.

    We will fix the application so that users can not attempt to get that much data back.

    But I still have the question why ASP.NET does not recover from the OutOfMemoryException. Initially we get the exception and after some more failed requests we will get the Server Application Unavailable error. My understanding is that since the request has been aborted, the memory used by the request should become ‘unreachable’ and the Garbage Collector should be capable of reclaiming the memory. But that does not seem to happen.

    I have reviewed the code and none of the objects are set in Session/Application/Cache.

    Would you have an explanation of why the memory is not being reclaimed?

  2. Sam Gentile says:

    I thought Thanksgiving morning is for adults to be sleeping in. Instead, not only are the kids up at

  3. maoni says:

    Rudi, GC will attemp a collection before it returns OOM. When the server gives you unavailable error was it still getting OOM exceptions or due to something else?

  4. Nick Batell says:

    Great post. Here is another article about the top 3 reasons <a href="http://www.correlsense.com/cto-blog/top-3-reasons-an-application-hangs/">applications hang</a> for no reason

  5. Ron Inbar says:

    Hi Maoni,

    My application seems to hang in the middle of GC.

    The stack looks like this:

    ntdll!NtWaitForSingleObject+0xa

    kernel32!WaitForSingleObjectEx+0x130

    mscorwks!Thread::LeaveRuntimeNoThrow+0xba

    mscorwks!CLREvent::WaitEx+0x57

    mscorwks!WKS::gc_heap::start_c_gc+0x1b

    mscorwks!`string’+0x56f3a

    mscorwks!WKS::GCHeap::GarbageCollectGeneration+0x17d

    mscorwks!WKS::gc_heap::try_allocate_more_space+0x2f1

    mscorwks!WKS::GCHeap::Alloc+0x6d

    mscorwks!Alloc+0x7a

    mscorwks!FastAllocatePrimitiveArray+0x71

    mscorwks!JIT_NewArr1+0x43d

    Base_ni!Philips.PmsCT.Base.MathLib.Vector3..ctor()+0x1c

    The other threads all look more or less like this:

    ntdll!NtWaitForSingleObject+0xa

    kernel32!WaitForSingleObjectEx+0x130

    mscorwks!Thread::LeaveRuntimeNoThrow+0xba

    mscorwks!CLREvent::WaitEx+0x57

    mscorwks!WKS::gc_heap::wait_for_gc_done+0x5b

    mscorwks!WKS::GCHeap::WaitUntilGCComplete+0x4b

    So the question is what is the GC waiting for?

    Thanks,

    Ron

  6. Ron Inbar says:

    After submitting the previous post, I found out that there is one thread that doesn’t have WKS::GCHeap::WaitUntilGCComplete on its stack.  Its stack looks as follows:

    ntdll!ZwWaitForMultipleObjects+0xa

    kernel32!WaitForMultipleObjectsEx+0x1cf

    mscorwks!WaitForMultipleObjectsEx_SO_TOLERANT+0xc1

    mscorwks!Thread::DoAppropriateAptStateWait+0x41

    mscorwks!Thread::DoAppropriateWaitWorker+0x195

    mscorwks!Thread::DoAppropriateWait+0x5c

    mscorwks!WaitHandleNative::CorWaitOneNative+0x19f

    mscorlib_ni!System.Threading.WaitHandle.WaitOne(Int64, Boolean)+0x1f

    Could it be that the GC is waiting for this thread to suspend, thereby creating a deadlock?

  7. maoni says:

    Ron, both callstacks you showed are doing normal operations. The first one is waiting for the concurrent GC thread to wake it up and the second one is detecting that a GC is in progress and the threadpool suspended the thread.

  8. I thought Thanksgiving morning is for adults to be sleeping in. Instead, not only are the kids up at the crack at dawn but one (the eight year old) is playing games on Vista and asking for Starbucks, while the nearly 4 year old is clamoring for princess