We get a lot of calls from customers about how to debug a misbehaving process, and we end up asking for the same information most of the time. I hope this post helps a few people out there solve their problem, or at least cut the time it takes to resolve an issue by having the right information ready for MS Support. I am thinking of making this into a series, but let's see what I can cover. The information is scattered out there, but this is what we normally do.
A crashing scenario is when the application terminates unexpectedly. This can happen in a variety of ways, but it is usually the result of an unhandled exception. Most often, if the error reporting settings of Windows have not been changed, an error dialog similar to "Send Error Report to Microsoft" is displayed and an application event log entry names the module that failed, but the actual cause of the crash is usually in a different module. I have also seen applications just disappear, with no dialog or anything, mostly in managed code; this is usually the result of a managed heap corruption causing the crash. If the app handles the exception and gracefully closes or logs it, that is not a crash. The tools listed below will help in taking crash dumps for postmortem debugging.
A hang is when the application is unresponsive and needs to be killed: either a dead hang (0% CPU usage) or a busy hang (100% CPU usage). A dead hang can happen when the application is blocked from executing due to resource contention, blocked or orphaned critical sections, or an indefinite wait for a resource or a piece of data to be processed. A busy hang mostly comes from unintended infinite loops or recursion. The question to ask here is how long the hang lasts. If it is an indefinite hang, there is an obvious problem, but usually the hang lasts some amount of time: 30 seconds, 5-10 minutes, or any random duration. With a hang scenario, a few hang dumps need to be captured in succession to get a good picture of what is going on. Depending on how long the application hangs, divide that time into 3, 4, or 5 intervals, and at each interval take a hang dump of the process using adplus -hang or DebugDiag; together the dumps give a good snapshot of the process over time. This can help determine which thread is hung when doing postmortem debugging.
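The interval dumps above can be scripted from a command prompt. A minimal sketch, assuming the Debugging Tools for Windows directory is on the PATH and a hypothetical process name MyApp.exe, with the hang lasting roughly five minutes:

```bat
rem Take a hang dump every 60 seconds, five times, so the dumps show
rem how (or whether) the threads move over time.
for /L %%i in (1,1,5) do (
    adplus -hang -pn MyApp.exe -o C:\dumps -quiet
    rem Crude 60-second pause that works on older Windows versions
    ping -n 61 127.0.0.1 > nul
)
```

Comparing the call stacks across the dumps shows which threads are stuck in the same place the whole time.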
Other special scenarios will be covered later.
Tools to tackle the little bugger
If the problem is happening during development, then I am sure most will debug it as usual and resolve it. The following information is for those who are stuck: say the application is crashing in a production environment or at a customer's customer site, where there is limited control or access to collect the right debug information to resolve the issue, or the issue happens once in a blue moon.
The first thing all developers need to be aware of when building their application is to always have symbol files (*.PDB) generated. These files should be saved for each build so they can be used for debugging in the future if there are problems. A symbol store can be set up, or the files can be saved anywhere and pulled in later by a debugging tool like Visual Studio or windbg. By default, VS 2002 and later generate symbols for release builds, unlike VS 6.0, where that was not the default. Symbols for native and mixed-mode modules are required; symbols for managed modules are not really needed when debugging. It is always painful asking a customer to rebuild the application with symbols, re-deploy, and reproduce the issue again, but not having symbols makes the issue very hard to debug. Old symbols will not work with newer builds. There is always the option to force-load symbols, but the call stack may be incorrect and unreliable, so always make sure the symbols are from the same build as the application having the problem.
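One way to archive each build's symbols is the symstore tool from the Debugging Tools for Windows. A sketch, with hypothetical paths, share names, and version string:

```bat
rem Add this build's PDBs to a symbol store on a file share
symstore add /f C:\build\MyApp\*.pdb /s \\server\symbols /t "MyApp" /v "1.0.21"

rem Later, point the debugger at that store plus the public Microsoft
rem symbol server, with a local downstream cache
set _NT_SYMBOL_PATH=srv*C:\symcache*\\server\symbols;srv*C:\symcache*http://msdl.microsoft.com/download/symbols
```

With _NT_SYMBOL_PATH set, windbg and cdb resolve symbols automatically, and the store keeps the PDBs matched to each build.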
Once the symbol issue is resolved, it is time to start collecting the data needed to debug the problem, using tools that either live-debug the process or collect memory dumps for postmortem debugging.
DebugDiag.exe: This tool mainly targets IIS processes, but it can be used with any process. It is very user friendly and flexible. Rules can be set up to monitor a process and generate a memory dump when it crashes, or manual dumps can be taken while the process is hung. It also has the ability to inject leaktracker.dll into a process to monitor for memory leaks. The best feature is Advanced Analysis: it gives a pretty good overview of what is going on with the process, though many times it will not point to the exact cause of the problem. It can be used for both managed and unmanaged processes, and it needs to be installed on the client machine.
adplus.vbs: This is a script that attaches the CDB debugger to a process to monitor it. It is a robust tool, but complex, as config files need to be created to handle custom events. It comes with the Debugging Tools for Windows installation and can be used for simple crash scenarios or to take quick hang dumps. The documentation installed with the tool is very helpful. It works for both managed and unmanaged processes and does not need an install; the Debugging Tools for Windows directory can simply be xcopied to the client machine:
adplus -crash -pn <ProcessName.EXE> -quiet
adplus -hang -pn <ProcessName.EXE> -quiet
windbg.exe: There are many articles on using this debugger. It is pretty powerful and can be used as a live debugger or a postmortem debugger. It works for both managed and native processes and can be xcopied over without having to install or register any components on the client machine.
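A quick sketch of attaching windbg to a live process and grabbing a dump, using a hypothetical process name and dump path:

```bat
rem Attach to a running process by name
windbg -pn MyApp.exe

rem Once attached, from the debugger command window:
rem   .dump /ma C:\dumps\myapp.dmp    (write a full memory dump)
rem   !analyze -v                     (automated analysis of the stop)
rem   ~* kb                           (call stacks for every thread)
```

The same commands work in cdb if a console debugger is preferred on the client machine.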
drwtsn32.exe: Comes standard on all Windows OS versions (it may be optional on embedded Windows), so no install is needed. Its only constraint is that it cannot collect useful dumps for managed processes. Use this tool in the most restrictive environments. It defaults to taking minidumps, so it needs to be configured to take full dumps.
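Setting up Dr. Watson takes two steps, sketched below; the crash dump type is changed in its configuration dialog:

```bat
rem Register Dr. Watson as the default postmortem debugger
drwtsn32 -i

rem Run with no arguments to open the configuration dialog, then change
rem the "Crash Dump Type" setting from Mini to Full so the dump is usable
drwtsn32
```

After that, any unhandled crash in a native process produces a dump in the log file path shown in the same dialog.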
If you are stuck, dumps and symbols are good data to have ready to pass on to support to get the help you need. I listed some links below for those willing to read on. I will continue this discussion in a later post, talking more about DebugDiag, adplus, and windbg and how to use them to collect the information needed to resolve an issue. There are lots of articles out there already on how to resolve problems, but collecting the correct information is the most important part.