Anatomy of an Application Pool Crash

Question:

Ok, I read through David Wang's Troubleshooting crashes thing and got the DebugDiag and I am able to reproduce the problem.

If I select Hang, and type in one of the website addresses that we host on this server, the moment the Hang Test starts it brings the entire Application Pool to a standstill. All other sites in other application pools respond perfectly, just not in this application pool.

The moment I terminate the w3wp.exe process for this application pool, IIS respawns a new process and all of a sudden all the sites that were hanging respond instantly.

Any idea of what may be causing this?

Answer:

Actually, what you describe as "causing this" is by-design. Things are working exactly the way they should when you catch a crash with a debugger like DebugDiag. Hmm... this reminds me that I should hurry up and complete that blog entry describing native code debugging of IIS6... but let's get back to the question at hand.

Handling the Unhandled Exception

What is happening is that when you select Hang, DebugDiag attaches a debugger onto the w3wp.exe process and waits for an unhandled exception to happen. An unhandled exception is an exception that is not expected, usually caused by human logical error (i.e. "bug"), and it triggers an immediate execution halt of the process. The raising of the exception basically indicates a crash is ABOUT to happen, which is why a debugger waits on it. You want to debug the process and its state right as the exception happens so that you can figure out the human logical error and fix it.

So, as soon as the unhandled exception happens, the attached debugger seizes control of the process, and NOTHING runs in the process because the debugger preserves the failing state of the process for investigation.

Debugging Application Pools

Depending on the configuration of the Application Pool, when a debugger seizes control of its w3wp.exe, ALL sites/applications using this Application Pool may simply grind to a standstill and not run.

However, options exist for having the Application Pool function while a worker process is debugged, and they include:

  • Web Garden - other worker processes take up the slack while this worker process gets debugged.
  • Orphaning - WAS will orphan and "forget" about this w3wp.exe, so a new w3wp.exe is spawned to handle future requests.

Of course, these options come with their own unique set of caveats. You will have to evaluate them to determine if your situation benefits. They are not defaults for good reasons.
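
To make the difference between these configurations concrete, here is a toy model in Python. It is only a sketch of the behavior described above, not how WAS or HTTP.SYS are actually implemented. The names AppPool, Worker, max_processes, and orphaning are invented for this illustration and are not real IIS settings or APIs.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Worker:
    pid: int
    halted: bool = False     # True while a debugger has frozen the process
    orphaned: bool = False   # True once the pool has "forgotten" the process


@dataclass
class AppPool:
    max_processes: int = 1   # hypothetical knob: >1 models a Web Garden
    orphaning: bool = False  # hypothetical knob: models the Orphaning option
    workers: List[Worker] = field(default_factory=list)
    next_pid: int = 1000

    def spawn(self) -> Worker:
        worker = Worker(pid=self.next_pid)
        self.next_pid += 1
        self.workers.append(worker)
        return worker

    def debugger_halts(self, worker: Worker) -> None:
        worker.halted = True
        if self.orphaning:
            # The halted process no longer counts against the pool.
            worker.orphaned = True

    def dispatch(self) -> str:
        owned = [w for w in self.workers if not w.orphaned]
        for w in owned:
            if not w.halted:
                return f"served by pid {w.pid}"
        if len(owned) < self.max_processes:
            return f"served by pid {self.spawn().pid}"
        return "queued"


def demo(pool: AppPool) -> str:
    first = pool.spawn()        # the worker that is about to crash
    pool.debugger_halts(first)  # debugger seizes control of it
    return pool.dispatch()      # what happens to the next request?


print(demo(AppPool()))                  # queued
print(demo(AppPool(max_processes=2)))   # served by pid 1001
print(demo(AppPool(orphaning=True)))    # served by pid 1001
```

In the default configuration the next request has nowhere to go, which is the "hang" from the question. In the other two, a second process picks it up while the first stays frozen under the debugger.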

As soon as you terminate the w3wp.exe (or exit the debugger), the attached debugger also terminates. At this point, WAS detects this as an unexpected crash because WAS waits on all worker processes' handles, and this one just went away unexpectedly (i.e. not due to a triggered process recycle), so it logs it in the Event Log.

Then, depending on health-monitoring metrics (i.e. the worker process has not unexpectedly crashed too many times recently), WAS keeps the Application Pool active for HTTP.SYS, meaning that requests continue to queue and a new w3wp.exe process is spawned as appropriate to handle them.
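
The "too many crashes recently" check can be pictured as a sliding-window counter. The sketch below is a simplified illustration of that idea, not the actual WAS logic. The class name HealthMonitor and the threshold values used here are assumptions chosen for the example.

```python
from collections import deque


class HealthMonitor:
    """Toy sliding-window crash counter (illustrative, not the WAS algorithm)."""

    def __init__(self, max_crashes: int, window_seconds: float) -> None:
        self.max_crashes = max_crashes        # assumed threshold
        self.window_seconds = window_seconds  # assumed look-back window
        self.crashes = deque()

    def record_crash(self, now: float) -> bool:
        """Record a crash at time `now`; return True if the pool stays active."""
        self.crashes.append(now)
        # Forget crashes that fall outside the look-back window.
        while now - self.crashes[0] > self.window_seconds:
            self.crashes.popleft()
        return len(self.crashes) < self.max_crashes


monitor = HealthMonitor(max_crashes=3, window_seconds=60.0)
print(monitor.record_crash(0.0))    # True: pool stays active
print(monitor.record_crash(10.0))   # True
print(monitor.record_crash(20.0))   # False: third crash inside the window
```

As long as record_crash returns True, the pool stays active and the next request can trigger a fresh worker process. Once it returns False, this toy model would stop serving the pool.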

Conclusion

This is why as soon as you terminate the w3wp.exe, the other sites become responsive. The requests were all queued and placed on hold because a debugger halted the w3wp.exe that would handle them and the Application Pool is not configured to allow another w3wp.exe to handle them. As soon as the old w3wp.exe and associated debugger are out of the picture, a new w3wp.exe gets spawned to immediately handle the queued requests.
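
The timeline in the question (requests pile up, then all complete at once) can be replayed with a small queue simulation. This is a rough sketch under the assumption of a single-worker pool. SingleWorkerPool and its methods are hypothetical names, and the real request queue lives in HTTP.SYS rather than in anything resembling this class.

```python
from collections import deque


class SingleWorkerPool:
    """Toy model: one worker, one request queue held outside the worker."""

    def __init__(self) -> None:
        self.queue = deque()
        self.worker_halted = False
        self.served = []

    def request(self, url: str) -> None:
        self.queue.append(url)
        self._drain()

    def debugger_catches_exception(self) -> None:
        # The debugger freezes the only worker; nothing in it runs.
        self.worker_halted = True

    def terminate_worker(self) -> None:
        # The old worker and debugger go away; a replacement worker
        # starts and immediately picks up whatever is waiting.
        self.worker_halted = False
        self._drain()

    def _drain(self) -> None:
        while self.queue and not self.worker_halted:
            self.served.append(self.queue.popleft())


pool = SingleWorkerPool()
pool.request("/site-a")                 # served right away
pool.debugger_catches_exception()
pool.request("/site-b")                 # held in the queue
pool.request("/site-c")                 # held in the queue
print(pool.served, list(pool.queue))    # ['/site-a'] ['/site-b', '/site-c']
pool.terminate_worker()
print(pool.served, list(pool.queue))    # ['/site-a', '/site-b', '/site-c'] []
```

The requests are never lost while the worker is frozen. They wait outside the process, which is why they complete the instant a new worker exists.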

Thus, the "hang" you observe only happens when a debugger is attached to the crashing worker process and the Application Pool is configured in particular ways, and that is by-design. In the "normal" case where a debugger is not attached, the unhandled exception simply bubbles up to either the configured JIT Debugger or Windows itself, which usually just identifies it as a crash and immediately terminates the process... and the subsequent requests simply spawn up a new w3wp.exe and continue onwards. You never see a "hang".

So, the IIS6 sequence is pretty darn optimal when it comes to handling code crashing at runtime and then gracefully recovering. It just may not appear that way at first glance.

//David