It may look like a contradiction after my previous post, but the first thing to do before you start analyzing a memory dump is to ask yourself: do I really need a dump?
Let me explain: when you need to troubleshoot an error there are a number of things to do before really going down the dump path, simply because not all problems can be resolved that way… Keep in mind that a dump is nothing more than a snapshot of a process at a certain point in time, exactly like a picture you take with your digital camera: we can try to understand what happened in the past, but only as long as the details we are looking for are still in memory, and of course we can’t know what happened to the process after the dump was taken. For example, if you are having problems remotely debugging your application, I hardly think a dump can add any value to your troubleshooting… while in the case of a memory leak a dump is one of the first things I ask the customer to provide (but again, there are some things to do before this step).
What’s first, then? Well, as you can guess, the first step is to understand the problem and know the scenario where it reproduces. If you are troubleshooting your own application you should already know most of the details, but if you are a consultant helping one of your customers with a weird exception thrown once in a while, you must have an open and ongoing discussion with the people who developed the application, and maybe with the IT pros who maintain the application and the environment day by day.
Let’s assume we have the information we need to start, and we have decided we need to capture a dump. But which kind? How? When? Moreover, are you sure you and your customer are speaking the same language and using the same terms to name things? I ask because I learnt this lesson the hard way… the customer was describing a crash in his application so we configured adplus to run in crash mode, but for some reason we were unable to get a dump when the crash reproduced; we kept trying, but after 3-4 runs we gave up. Finally it turned out that the crash the customer was reporting was "just" an exception not handled in a try…catch block and shown to the end user (do you know the yellow/orange ASP.NET error page?), while the worker process was still happily serving requests for other users…
So, here is some basic terminology: if your customer is not an expert in this area, make sure they understand these terms and stick to them to avoid confusion. This still applies if you ever need to raise a support call with Microsoft CSS: this is the terminology you can expect to be used.
- crash: this refers to a process which for some reason (usually an unhandled exception) is terminated by the operating system. How to be sure? Check Task Manager when the problem occurs: if the process gets a new PID, it has been recycled. And check your event log: usually you’ll have a message like "process xyz terminated unexpectedly"
- hang: the application reaches a state where it’s unable to continue serving incoming requests (and maybe the users are getting a "server too busy" error) but the process does not crash. In such a situation the target process could simply sit there in memory doing nothing, and you have to restart it manually to restore normal application activity. Note that IIS has a mechanism to automatically detect the status of its worker processes: if one of them for some reason does not respond to regular pings, after a certain timeout elapses IIS assumes the process is hanging and recycles it. In this latter case the symptom may look like a crash, but it really isn’t, and you can tell because you’ll not have the "process xyz terminated unexpectedly" message, but rather something like "a process serving application pool xyz has failed to respond to a ping"
- deadlock: imagine thread 1 in your application has acquired a lock on resource A (a handle, a socket etc…) but to complete its work it must also access resource B at the same time; now imagine this resource B is locked by thread 2, which in turn is waiting for resource A (remember it’s locked by thread 1?)… we have a deadlock (it’s the same concept as a circular reference in an Excel sheet) because this situation does not have a solution, unless one of the two threads finally times out and releases the resource it was locking
- leak: we have a memory leak when a process keeps growing over time and never releases the memory back to the operating system, until it eventually throws an OutOfMemoryException and finally crashes. In this case capturing a dump when the process is being terminated is almost certainly too late, so it’s better to capture a manual dump when the process is approaching its size limit, but before the actual crash. By the way, a leak can take only 5 minutes to cause the process to crash, or it might take some days; but the pattern is always the same, with the OOM exception and the crash at the end. The smaller (and slower) the leak, the more difficult it will be to find the culprit(s) of the problem…
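The deadlock described above, and the way a timeout breaks it, can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function name `run_deadlock_demo`, the two `threading.Lock` objects standing in for resources A and B, and the timeout values are all hypothetical and have nothing to do with any real application.

```python
import threading


def run_deadlock_demo(short_timeout=0.2, long_timeout=2.0):
    """Two threads take two locks in opposite order; a timeout breaks the cycle."""
    resource_a = threading.Lock()
    resource_b = threading.Lock()
    # The barrier guarantees both threads hold their first lock
    # before either one asks for the second (the deadlock condition).
    both_hold_first = threading.Barrier(2)
    got_second = {}

    def worker(name, first, second, timeout):
        with first:
            both_hold_first.wait()
            # Without a timeout this call would block forever.
            acquired = second.acquire(timeout=timeout)
            got_second[name] = acquired
            if acquired:
                second.release()
        # Leaving the "with" block releases the first lock.

    t1 = threading.Thread(
        target=worker, args=("thread 1", resource_a, resource_b, short_timeout))
    t2 = threading.Thread(
        target=worker, args=("thread 2", resource_b, resource_a, long_timeout))
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    return got_second


if __name__ == "__main__":
    print(run_deadlock_demo())
```

Thread 1 gives up first and releases resource A, which lets thread 2 finish. If you replace both timeouts with plain blocking `acquire()` calls, the two threads wait on each other forever, which is exactly the state a hang dump would show you.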
Crash or Hang dump?
So, now that we have gathered all those details about the problem, which is the right approach to capture the dump we need? It depends on the problem, of course. There may be some variations depending on the circumstances, but basically we can capture either a crash dump or a hang dump. What’s the difference?
We’ll need a crash dump when we can’t determine when the problem (typically a crash, as the name implies, but that’s not the only case) will happen, so we configure the debugger in advance to monitor our target process and capture a dump when the process is terminated, or when we need to capture a dump on a specific exception (as I’ll discuss in another post). On the other hand, we can capture a hang dump after the problem has occurred but while the process is still in memory, for example in a memory leak scenario but also when a process is burning your CPU.
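To make the rule of thumb concrete, here is a minimal sketch that maps a symptom to the kind of dump discussed above. The symptom names, the `DUMP_MODE_BY_SYMPTOM` table and the `choose_dump_mode` function are my own hypothetical shorthand (mapping a plain "hang" to a hang dump is my assumption), not something a debugger tool provides.

```python
# Hypothetical lookup table: symptom -> dump type, following the text above.
DUMP_MODE_BY_SYMPTOM = {
    "crash": "crash",               # we cannot predict when it will happen
    "specific exception": "crash",  # debugger must be attached in advance
    "hang": "hang",                 # assumption: process still in memory
    "memory leak": "hang",          # capture manually before the OOM crash
    "high cpu": "hang",             # process is alive and burning CPU
}


def choose_dump_mode(symptom):
    """Return "crash" or "hang" for a known symptom, else raise ValueError."""
    key = symptom.strip().lower()
    if key not in DUMP_MODE_BY_SYMPTOM:
        raise ValueError(f"unknown symptom: {symptom!r}")
    return DUMP_MODE_BY_SYMPTOM[key]


if __name__ == "__main__":
    for symptom in ("crash", "memory leak", "high cpu"):
        print(symptom, "->", choose_dump_mode(symptom))
```

The point of raising on unknown symptoms is the same as the terminology discussion earlier: if you cannot name the problem precisely, you are not ready to pick a dump type yet.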
First or second chance?
As the name suggests, exceptions should be the exception rather than the rule; so for example it’s always a good idea to check if an object is valid before trying to use it, rather than letting the runtime throw an exception and trapping it in a try…catch block. Anyway, what happens when you have a debugger attached to a process? The debugger gets the first chance to handle the exception; if it allows the execution to continue and does not handle the exception, the application will see the exception as usual. If the application does not handle the exception, the debugger gets a second chance to see the exception; in this case the application would normally crash if the debugger was not present.
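The order of events can be sketched as a tiny simulation. This is not how a real debugger is implemented: `dispatch_exception` and its parameters are hypothetical names I made up purely to show who gets to see the exception, and when.

```python
def dispatch_exception(debugger_handles_first_chance,
                       app_has_handler,
                       debugger_attached=True):
    """Return the ordered list of events for one thrown exception."""
    events = []
    if debugger_attached:
        # The debugger always sees the exception before the application.
        events.append("debugger: first-chance notification")
        if debugger_handles_first_chance:
            events.append("debugger: handled, execution continues")
            return events
    if app_has_handler:
        # The application sees the exception as usual (try...catch).
        events.append("application: exception caught by handler")
        return events
    if debugger_attached:
        # Nobody handled it: the debugger gets its second chance.
        events.append("debugger: second-chance notification")
        return events
    # No debugger and no handler: the process goes down.
    events.append("process: terminated (crash)")
    return events


if __name__ == "__main__":
    for event in dispatch_exception(False, False):
        print(event)
```

A first-chance notification followed by "caught by handler" is the harmless case; a second-chance notification is the one that corresponds to a real crash.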
This is clearer when you use adplus in crash mode: by default the debugger attaches to the target process and logs every exception thrown; if you try it you’ll very likely end up with quite a few minidumps (a few megabytes each) corresponding to every exception thrown and trapped in try…catch blocks in your code, and a second chance full dump (the same size as the process, the private bytes value you can see in Task Manager) when the process crashes.
How much does this cost?
I mean in terms of performance for your server, which could potentially be a highly stressed production environment. Of course there is a cost, especially for a crash dump, because you’ll have a debugger attached to your worker process for the time needed to reproduce the problem, but it’s hard to tell exactly how much; in my experience I have had just one server where the debugger was really affecting the site and forced us to stop it. But it’s worth mentioning that the server was already beyond its capacity limit and was already performing badly; the debugger was just the last straw…
Having a repro in a test environment is the ideal situation, since we’ll be able to capture dumps, run tests and do whatever is needed to resolve the problem without causing additional pain to your poor users.
Quote of the Day:
keep it real/keep it clean/keep it simple