While I was working on this issue today, I kept thinking of the old adage: "When the going gets tough, the tough get going!"
It looked pretty simple initially. The issue was that the IIS worker process, w3wp.exe, was crashing whenever users tried to browse specific pages. I have seen a lot of crash issues so far, and we all know that post-production debugging can be quite tricky at times.
Basically, w3wp.exe was always doing the same thing before it crashed. I was happy, since there was a pattern to the crash and we could reproduce the error on demand, so the solution couldn't be very far away. The w3wp.exe process showed high CPU for about a minute and then just crashed. The first thought that flashed through my mind was: is it a stack overflow? High CPU can have quite a few causes, and one of them is a loop that never ends. But a never-ending loop normally just pushes the CPU usage up and keeps it there. It doesn't crash the process. A stack overflow, on the other hand, can!
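To make the difference concrete, here is a minimal Python sketch. It is not the code from the application we were debugging, and the function names are made up for illustration. A loop reuses the same stack frame however long it runs, while recursion without a base case adds a frame on every call until the stack runs out. One assumption to keep in mind: Python guards the stack with a recursion limit and raises a catchable RecursionError, whereas in .NET 2.0 a stack overflow cannot be caught and the process is terminated, which is what a crashing w3wp.exe would look like.

```python
def bounded_busy_loop(iterations):
    """Burn CPU in a loop. Stack usage stays constant however long it runs."""
    count = 0
    for _ in range(iterations):
        count += 1
    return count


def runaway_recursion(depth=0):
    """Recurse with no base case. Every call adds one more stack frame."""
    return runaway_recursion(depth + 1)


def recursion_overflows():
    """Return True if the runaway recursion exhausts Python's stack budget."""
    try:
        runaway_recursion()
    except RecursionError:
        # Python stops the recursion for us; a .NET 2.0 process would die here.
        return True
    return False


if __name__ == "__main__":
    print("loop finished after", bounded_busy_loop(1_000_000), "iterations")
    print("recursion overflowed:", recursion_overflows())
```

The loop can run for as long as you like without touching the stack, which matches the "high CPU that just stays there" pattern. The recursion is also busy for a while, and then it fails abruptly, which matches "high CPU for a minute, then crash".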
We had a theory, and we needed to prove or disprove it. So we decided to take a crash dump using the Debug Diagnostic Tool. Normally this is pretty easy: you set up a crash rule, and once the crash happens, the tool generates a dump by itself. Here is where the problem kicked in... The crash rule was set up and w3wp.exe was crashing, but no dumps were generated. We checked the Event Logs and found that it was indeed a crash, and that an entry like the following was being logged each time the process crashed.
Event Type: Error
Event Source: .NET Runtime 2.0 Error Reporting
Event Category: None
Event ID: 1000 -> Signifies Crash!
Faulting application w3wp.exe, version 6.0.3790.1830, stamp 42435be1, faulting module kernel32.dll, version 5.2.3790.2756, stamp 44c60f39, debug? 0, fault address 0x00015e02.
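If you have many of these entries to go through, it can help to pull the fields out of the message text. The following is only a sketch in Python, assuming the message always follows the exact layout shown above. The pattern and the parse_fault helper are my own names for illustration, not part of any Microsoft tool, and the sketch works on a plain string rather than reading the Event Log itself.

```python
import re

# Pattern for the "Faulting application ..." message layout shown above.
# Assumption: every field is present and in this order.
FAULT_PATTERN = re.compile(
    r"Faulting application (?P<app>[^,]+), version (?P<app_version>[\d.]+), "
    r"stamp (?P<app_stamp>[0-9a-fA-F]+), "
    r"faulting module (?P<module>[^,]+), version (?P<module_version>[\d.]+), "
    r"stamp (?P<module_stamp>[0-9a-fA-F]+), "
    r"debug\? (?P<debug>\d+), fault address (?P<address>0x[0-9a-fA-F]+)"
)


def parse_fault(message):
    """Return the fields of a crash message as a dict, or None if it does not match."""
    match = FAULT_PATTERN.search(message)
    return match.groupdict() if match else None


if __name__ == "__main__":
    sample = (
        "Faulting application w3wp.exe, version 6.0.3790.1830, stamp 42435be1, "
        "faulting module kernel32.dll, version 5.2.3790.2756, stamp 44c60f39, "
        "debug? 0, fault address 0x00015e02."
    )
    fault = parse_fault(sample)
    print(fault["app"], "crashed in", fault["module"], "at", fault["address"])
```

Grouping the parsed entries by faulting module and fault address is a quick way to confirm that every crash is the same crash, as it was in our case.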
Looking at Task Manager, we found something interesting. Whenever there was a crash, we could see a process called DW20.exe kick in, spike the CPU, and then die. We wanted to get it out of the picture, but boy oh boy... I will remember this day :o) Hence the adage I mentioned at the start! I assumed that DW stood for Doctor Watson, and I started changing settings that turned out to be of no importance. Silly me!
In fact, this guy DW20.exe is the Windows Error Reporting tool, and it was trying to do something whenever the error happened on the server. We tried hard to get rid of it and even thought of renaming it, but I don't much like workarounds. So the search continued, and I landed on the KB article http://support.microsoft.com/kb/841477. We tried creating all the registry entries it describes, but nothing changed. Now what?
I went through KB after KB, but unfortunately I couldn't find the answer I was looking for. We tried at least 6 different things and nothing worked. I don't want to bore you by telling you what didn't work, so let me take you to the solution now!
1. Go to Control Panel.
2. Click System.
3. Go to the Advanced tab.
4. Click Error Reporting.
Once we were done with that, DW20.exe stopped kicking in and we were able to collect the crash dumps. After that, it was pretty easy. We checked the dumps and found that the crash was caused by a stack overflow, just as we had guessed!! Wow :o) Another issue resolved!!
Honestly, after fixing it, I felt like a tough guy for having persisted with this issue. But I also know that if I had been aware of this button, I would have fixed it much earlier. Sometimes small things like this cause big problems, and today I realized once again that no matter how much you learn, you just can't get enough.
Hope this helps!