Anatomy of an Application Pool Crash


Question:


OK, I read through David Wang’s post on troubleshooting crashes, got DebugDiag, and I am able to reproduce the problem.


If I select Hang, and type in one of the website addresses that we host on this server, the moment the Hang Test starts it brings the entire Application Pool to a standstill. All other sites in other application pools respond perfectly, just not the ones in this application pool.


The moment I terminate the w3wp.exe process for this application pool, IIS respawns a new process and all of a sudden all the sites that were hanging respond instantly.


Any idea of what may be causing this?


Answer:


Actually, what you describe as “causing this” is by-design. Things are working exactly the way they should when you catch a crash with a debugger like DebugDiag. Hmm… this reminds me that I should hurry up and complete that blog entry describing native code debugging of IIS6… but let’s get back to the question at hand.


Handling the Unhandled Exception


What is happening is that when you select Hang, DebugDiag attaches a debugger onto the w3wp.exe process and waits for an unhandled exception to happen. An unhandled exception is an exception that is not expected, usually caused by human logical error (i.e. “bug”), and it triggers an immediate execution halt of the process. The raising of the exception basically indicates a crash is ABOUT to happen, which is why a debugger waits on it. You want to debug the process and its state right as the exception happens so that you can figure out the human logical error and fix it.


So, as soon as the unhandled exception happens, the attached debugger seizes control of the process, and NOTHING runs in the process because the debugger preserves the failing state of the process for investigation.


Debugging Application Pools


Depending on the configuration of the Application Pool, when a debugger seizes control of its w3wp.exe, ALL sites/applications using this Application Pool may simply grind to a standstill and not run.


However, options exist for having the Application Pool function while a worker process is debugged, and they include:



  • Web Garden – other worker processes take up the slack while this worker process gets debugged.
  • Orphaning – WAS will orphan and “forget” about this w3wp.exe, so a new w3wp.exe will be spawned to handle future requests.

Of course, these options come with their own unique set of caveats. You will have to evaluate them to determine if your situation benefits. They are not defaults for good reasons.


As soon as you terminate the w3wp.exe (or exit the debugger), the attached debugger also terminates. At this point, WAS detects this as an unexpected crash because WAS waits on all worker processes’ handles, and this one just went away unexpectedly (i.e. not due to a triggered process recycle), so it logs it in the Event Log.


Then, depending on health-monitoring metrics (i.e. the Application Pool has not unexpectedly crashed too many times recently), WAS will keep the Application Pool active for HTTP.SYS, meaning that requests continue to queue and WAS spawns a new w3wp.exe process as appropriate to handle them.
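
The "too many crashes recently" check can be pictured as a sliding-window counter. The following is only a minimal Python sketch of that idea, not how WAS is actually implemented: the class name `CrashMonitor`, its methods, and the thresholds (5 crashes within 300 seconds) are all illustrative assumptions and do not come from this post.

```python
from collections import deque


class CrashMonitor:
    """Toy model of a crash-count health check for one Application Pool."""

    def __init__(self, max_crashes=5, window_seconds=300):
        # Both thresholds are illustrative assumptions, not values from the post.
        self.max_crashes = max_crashes
        self.window_seconds = window_seconds
        self.crash_times = deque()
        self.pool_active = True

    def record_crash(self, now):
        """Record an unexpected worker exit at time `now` (in seconds)."""
        self.crash_times.append(now)
        # Forget crashes that have fallen out of the sliding window.
        while self.crash_times and now - self.crash_times[0] > self.window_seconds:
            self.crash_times.popleft()
        if len(self.crash_times) >= self.max_crashes:
            # Too many recent crashes: stop spawning new workers for this pool.
            self.pool_active = False
        return self.pool_active


monitor = CrashMonitor()
# A single crash (for example, terminating the debugged w3wp.exe) keeps the pool active.
one_crash_ok = monitor.record_crash(now=0)
# Several more crashes in quick succession eventually trip the limit.
results = [monitor.record_crash(now=t) for t in (10, 20, 30, 40)]
```

In this model, killing the debugged worker once counts as one crash and leaves the pool active, which matches the behavior described in the question.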


Conclusion


This is why as soon as you terminate the w3wp.exe, the other sites become responsive. The requests were all queued and placed on hold because a debugger halted the w3wp.exe that would handle them and the Application Pool is not configured to allow another w3wp.exe to handle them. As soon as the old w3wp.exe and associated debugger are out of the picture, a new w3wp.exe gets spawned to immediately handle the queued requests.
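
To make the sequence concrete, here is a small Python simulation of a request queue sitting in front of worker processes. It is a toy model under loud assumptions: the `AppPool` class, its method names, and the `max_processes` and `orphaning` parameters are hypothetical stand-ins for the Web Garden and Orphaning options, and workers are spawned on demand purely to keep the model short.

```python
from collections import deque


class AppPool:
    """Toy model: a request queue in front of worker processes."""

    def __init__(self, max_processes=1, orphaning=False):
        self.max_processes = max_processes  # > 1 stands in for a Web Garden
        self.orphaning = orphaning          # stands in for the Orphaning option
        self.workers = []                   # each entry is "running" or "halted"
        self.orphans = 0                    # halted workers the pool has forgotten
        self.queue = deque()
        self.served = []

    def submit(self, request):
        self.queue.append(request)
        self._dispatch()

    def debugger_halts(self, index=0):
        """A debugger seizes control of one worker."""
        if self.orphaning:
            # The pool forgets the worker, which frees its slot.
            self.workers.pop(index)
            self.orphans += 1
        else:
            self.workers[index] = "halted"
        self._dispatch()

    def terminate_halted(self):
        """Terminate the halted worker (or exit the debugger)."""
        self.workers = [w for w in self.workers if w != "halted"]
        self._dispatch()

    def _dispatch(self):
        # Spawn a worker on demand if requests wait, none runs, and a slot is free.
        if (self.queue and "running" not in self.workers
                and len(self.workers) < self.max_processes):
            self.workers.append("running")
        # Drain the queue only when some worker is actually running.
        if "running" in self.workers:
            while self.queue:
                self.served.append(self.queue.popleft())


# Default configuration: requests pile up while the only worker is halted.
pool = AppPool()
pool.submit("a")
pool.debugger_halts()
pool.submit("b")
pool.submit("c")
pending_while_halted = list(pool.queue)
pool.terminate_halted()  # a new worker spawns and drains the queue

# Web Garden: a second worker picks up the slack.
garden = AppPool(max_processes=2)
garden.submit("a")
garden.debugger_halts()
garden.submit("b")

# Orphaning: the halted worker is forgotten and a replacement is spawned.
orphaned = AppPool(orphaning=True)
orphaned.submit("a")
orphaned.debugger_halts()
orphaned.submit("b")
```

In the default case, requests "b" and "c" sit in the queue until `terminate_halted()` runs, which is the "hang" followed by the instant recovery described in the question. In the other two configurations the queue never backs up.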


Thus, the “hang” you observe only happens when a debugger is attached to the crashing worker process and the Application Pool is configured in particular ways, and that is by-design. In the “normal” case where a debugger is not attached, the unhandled exception simply bubbles up to either the configured JIT Debugger or Windows itself, which usually just identifies it as a crash and immediately terminates the process… and the subsequent requests simply spawn a new w3wp.exe and continue onwards. You never see a “hang”.


So, the IIS6 sequence is pretty darn optimal when it comes to handling code crashing at runtime and then gracefully recovering. It just may not appear that way at first glance.


//David

Comments (7)

  1. Vlad says:

    Wait a min – aren’t you describing behaviour of a Crash rule above?

    "when you select Hang, DebugDiag attaches a debugger onto the w3wp.exe process and waits for an unhandled exception to happen"

    I thought when you used a CRASH rule, DD attached a debugger; HANG rules only attach and dump processes when the hang condition fails? Or did I misunderstand it?

    If that’s the case, the guy is asking a different question from the one you answered, which is why the process isn’t being dumped and resumed with a HANG rule, I think.

  2. David.Wang says:

    Vlad – I wouldn’t get “hung” up with the names of the rules. The names are simply how we describe actions to users.

    Behind the scenes, a debugger has to attach to the process(es) no matter what and wait for specified conditions to trigger and then take action.

    When the condition triggers, whether it is an unhandled exception causing a crash, or a hang test, or a memory leak, the debugger has to halt the process to get an accurate gauge and take action.

    It is this interaction with the debugger that the user observes, and that interaction stays consistent by-design.

    //David

  3. Vlad says:

    Thanks for correcting me!

    So Hang and Crash rules both attach the debug host to the process as soon as the rule is activated, and then it’s just the timing of when the dbghost dumps the process that varies.

    My mistaken belief was that the Hang rule only triggered an attach at the point at which the Hang test failed.

    Tks!

  4. Michael says:

    I have been dealing with an application pool that has been causing IIS to crash. Back on 6/5/07 at 9:22 a virus entered the network aimed specifically at IIS 6.0. I run a medical imaging center network, and because of this IIS went down and I have 3 companies trying to figure out what is going on. I got the virus/grayware/malware out but the problem is still there. I have been working for 4 days straight with the main application vendor and another company trying to get this back up, and it has all of us looking stupid. We have it down to a process that pings the app pool; the app pool doesn’t respond and IIS crashes. It gives memory read errors when it crashes. We have removed IIS and reinstalled it, and the problem is still there. So far, even with the SSL cert in, it still crashed one app pool; now it’s crashing the whole IIS. Do you have any ideas you could throw my way? Any help would be greatly appreciated.

  5. Aryan Nava says:

    Our company SharePoint site kept crashing. When I looked at IIS, the application pool kept crashing whenever anyone opened the site. If we restarted the site and tried going to the page again, it crashed again.

    This is what I did to solve the problem…

    – I went to IIS

    – Application Pool

    – SharePoint Site Properties | Identity | Configurable

    – I removed the user name and password

    – Entered the user name and password again

    I did this for all the SharePoint sites in the application pool, and it seems to have solved the problem.

  6. Beau says:

    @Aryan Nava

    Well that solution should keep you busy.

    We have frequent IIS app pool crashes at our hosting company. Couple thousand sites on about 5 servers. We reset the app pool and problem solved. However, this becomes a real issue when you multiply by 2,000. Anyone know what the most common cause of these crashes is?

    I'm just a web developer, not a sys admin, but it's really getting to me that IIS seems to have a serious fault. I don't tend to see these kinds of issues with Apache and PHP, only with IIS and ASP.
