My name is Brad and I’ve been on the PFE team here at Microsoft for many years. Suffice it to say, I’m overdue for contributing to the team blog. I’ve seen lots of interesting (and not-so-interesting) issues with customers all over the world in my time at PFE. What follows is an issue I worked earlier this year. For me, the most interesting part of this issue was not so much finding root cause as it was the process of discovering who was behind root cause.
It all started with a customer who reported their ASP.NET application had an OutOfMemory issue. These kinds of issues are not at all uncommon in the .NET world, and the trick usually comes down to finding what object(s) are rooted so that the .NET Garbage Collector can’t reclaim the memory associated with said object(s).
Getting data from the problem
They sent me a dump of the problematic application pool, mentioning that they dumped it two hours after they received notice about the OutOfMemoryException (OOM). My initial thought on this was that they had recycled the process and then obtained a dump of the fresh new process. This would obviously be no good since the w3wp instance exhibiting the OOM was gone, and the dump instead represented a new process instance with no memory pressure. Unfortunately, this wouldn’t be the first time I’d had this problem.
However, when I received the dump, I was pleasantly surprised to see that they had dumped the correct process instance. It was over 1GB in size, representing a 32-bit process with significant memory pressure. And when I looked at the length of time the process had been alive, that verified this was the same instance that threw the OOM.
Windows Server 2003 Version 3790 (Service Pack 2) MP (4 procs) Free x86 compatible
Product: Server, suite: TerminalServer SingleUserTS
Debug session time: Thu Feb 25 15:07:20.000 2010 (UTC - 5:00)
System Uptime: 194 days 18:11:06.875 Process Uptime: 0 days 15:03:28.000
Analyzing the data, Part 1
After searching through the dump I found the problematic object that led to the OOM. And after talking with the right folks at the customer, we pieced together how the problem arose, as well as a resolution for it.
So one question was answered, but another one remained: How did this process manage to stay alive for a full two hours after getting the OOM? I’ve been debugging issues like this since .NET 1.0 was released, and I can tell you that this wasn’t a normal set of circumstances.
The thread that threw the OOM turned out to be the same thread that answered the question.
# ChildEBP RetAddr
00 0c3aedb8 7c82775b ntdll!KiFastSystemCallRet
01 0c3aedbc 773d7a4b ntdll!NtRaiseHardError+0xc
02 0c3aee18 773b8377 user32!ServiceMessageBox+0x145
03 0c3aef74 7739eec9 user32!MessageBoxWorker+0x13e
04 0c3aefcc 773d7d0d user32!MessageBoxTimeoutW+0x7a
05 0c3af000 773c42c8 user32!MessageBoxTimeoutA+0x9c
06 0c3af020 773c42a4 user32!MessageBoxExA+0x1b
07 0c3af03c 7c34c224 user32!MessageBoxA+0x45
08 0c3af070 7c348e6c msvcr71!__crtMessageBoxA+0xf4
09 0c3af294 7c34cf83 msvcr71!_NMSG_WRITE+0x12e
0a 0c3af2cc 7a09bea7 msvcr71!abort+0x7
0b 0c3af2d8 77e761b7 mscorwks!InternalUnhandledExceptionFilter+0x16
0c 0c3af530 77e792a3 kernel32!UnhandledExceptionFilter+0x12a
0d 0c3af538 77e61ac1 kernel32!BaseThreadStart+0x4a
0e 0c3af560 7c828752 kernel32!_except_handler3+0x61
0f 0c3af584 7c828723 ntdll!ExecuteHandler2+0x26
10 0c3af62c 7c82863c ntdll!ExecuteHandler+0x24
11 0c3af90c 77e4bee7 ntdll!RtlRaiseException+0x3d
12 0c3af96c 78158e89 kernel32!RaiseException+0x53
13 0c3af9a4 7a14fa70 msvcr80!_CxxThrowException+0x46
14 0c3af9b8 7a108013 mscorwks!ThrowOutOfMemory+0x24
15 0c3afae4 7a109f7d mscorwks!Thread::RaiseCrossContextException+0x408
16 0c3afb98 79fd878b mscorwks!Thread::DoADCallBack+0x2a2
17 0c3afbb4 79e983fb mscorwks!ManagedThreadBase_DispatchInner+0x35
18 0c3afc48 79e98321 mscorwks!ManagedThreadBase_DispatchMiddle+0xb1
19 0c3afc84 79e984ad mscorwks!ManagedThreadBase_DispatchOuter+0x6d
<clipped for brevity>
As seen in the stack above, after the OOM was thrown in frame 0x14, a message box was thrown in frame 0x8. A message box will keep the process alive until someone clicks OK, Cancel, etc. and the message box goes away. In short, message boxes in server-side processes are never a good thing since they will hang your application!
Once again, we have answered one question, but another question remains: the process stayed alive for two hours after it threw the OOM because a message box was thrown, but who threw the message box, and why?
Analyzing the data, Part 2
As is common when finding a message box that’s popped up in a process, I wanted to see what the message box said. From MSDN, we learn that the second parameter to user32!MessageBoxExA is the text of the message, and the third parameter is the text of the caption in the message box. Using the kb command, we can retrieve these parameters, and then dump them to find the values:
# ChildEBP RetAddr Args to Child
00 0c3aedb8 7c82775b 773d7a4b 50000018 00000004 ntdll!KiFastSystemCallRet
01 0c3aedbc 773d7a4b 50000018 00000004 00000003 ntdll!NtRaiseHardError+0xc
02 0c3aee18 773b8377 0dbec7a8 0bcda228 00012010 user32!ServiceMessageBox+0x145
03 0c3aef74 7739eec9 0c3aef80 00000028 00000000 user32!MessageBoxWorker+0x13e
04 0c3aefcc 773d7d0d 00000000 0dbec7a8 0bcda228 user32!MessageBoxTimeoutW+0x7a
05 0c3af000 773c42c8 00000000 0c3af0a4 7c37f480 user32!MessageBoxTimeoutA+0x9c
06 0c3af020 773c42a4 00000000 0c3af0a4 7c37f480 user32!MessageBoxExA+0x1b
0:023> da 7c37f480
7c37f480 "Microsoft Visual C++ Runtime Lib"
0:023> da 0c3af0a4
0c3af0a4 "Runtime Error!..Program: c:\wind"
0c3af0e4 ".This application has requested "
0c3af104 "the Runtime to terminate it in a"
0c3af124 "n unusual way..Please contact th"
0c3af144 "e application's support team for"
0c3af164 " more information..."
I’ve seen this message before, and it can have different underlying causes. The bottom line – at this point in the troubleshooting phase – is that it isn’t a custom message box thrown carelessly by application code. The trick now is to determine how the message box was thrown.
I made a few unusual observations about the call stack in question.
- First, there are two versions of the C++ runtime on the stack (7.1 and 8.0). This isn’t common.
- Secondly, the sequence of events as told by the stack seems very unusual. It appears the .NET Framework throws an OOM (frame 0x14). Then eventually when the underlying OS handles the exception in frame 0xe, it somehow goes back to BaseThreadStart in frame 0xd, where the default unhandled exception filter (UEF) is called in frame 0xc. From there we wind up back in the .NET Framework’s UEF in frame 0xb, which appears to call the 7.1 CRT’s abort() function in frame 0xa. Finally, a message box results.
What a wild ride!!
- Finally… wait, the .NET Framework appears to throw a message box (if you observe frames 0xb through 0x8 of the stack)??? This is something I never would have predicted!
Fortunately, I’d been fooled by such an appearance before, so I knew to take a look at the raw stack first, before I went reading through the .NET source to see if mscorwks!InternalUnhandledExceptionFilter really does call for a message box to be thrown. When I looked at the raw stack around the frames 0xb through 0x8, this is what I found in between all the frames displayed from a kb command:
0:023> dds 0c3af070 L90
0c3af074 7c348e6c msvcr71!_NMSG_WRITE+0x12e
0c3af07c 7c37f480 msvcr71!`string'
0c3af26c 77e66ebd kernel32!VirtualQueryEx+0x1d
0c3af284 013af29c*** WARNING: Unable to verify checksum for ThirdParty.dll
*** ERROR: Symbol file could not be found. Defaulted to export symbols for ThirdParty.dll -
0c3af28c 7c349600 msvcr71!_beginthreadex+0x7
0c3af298 7c34cf83 msvcr71!abort+0x7 <—frame 0xa of the kn callstack above
0c3af2a8 7a3b2000 mscorwks!COMUnhandledExceptionFilter
0c3af2b8 7c34240d msvcr71!_except_handler3
0c3af2bc 7c380ea0 msvcr71!type_info::`vftable'+0x94
0c3af2c8 7c35f0a8 msvcr71!__CxxUnhandledExceptionFilter+0x2b
0c3af2cc 77ecb7c0 kernel32!gcsUEFInfo
0c3af2d0 7a09bea7 mscorwks!InternalUnhandledExceptionFilter+0x16 <—frame 0xb of the kn callstack above
Aha! So there is, in fact, someone else in between the .NET Framework and the call to throw the message box.
Now, let’s say you still want some kind of proof that the .NET Framework doesn’t make the call to msvcr71!abort(), which in turn shows the message box. I had my doubts that mscorwks.dll had the 7.1 CRT as a dependency. On a tip from fellow PFE Zach Kramer, you can prove this by running dumpbin /imports on the binary (dumpbin is an old but very handy utility that still ships with Visual Studio). I could always obtain the binary by asking my customer for their mscorwks.dll, but it’s much easier to just use psscor2!savemodule.
start end module name
79e70000 7a400000 mscorwks
Loaded symbol image file: mscorwks.dll
0:023> !savemodule 79e70000 F:\modules\mscorwks.dll
5 sections in file at 79e70000
Successfully saved file: F:\Modules\mscorwks.dll
Then from a command window:
f:\Modules>dumpbin /imports mscorwks.dll >Imports.txt
Searching Imports.txt for the string ‘msvcr7’ comes up empty. In fact, the CRT that the build of mscorwks used by my customer’s application depends on is msvcr80.dll. This makes sense when you look at frames 0x14-0x13 in our call stack.
What if we employ the same technique on this ThirdParty.dll – will it show the assembly depends on the 7.1 CRT? Scanning the output of dumpbin /imports on ThirdParty.dll shows the following:
Microsoft (R) COFF/PE Dumper Version 9.00.30729.01
Copyright (C) Microsoft Corporation. All rights reserved.
Dump of file ThirdParty.dll
File Type: DLL
Section contains the following imports:

    msvcr71.dll
              1000E154 Import Address Table
              10010554 Import Name Table
                     0 time date stamp
                     0 Index of first forwarder reference
So we know this ThirdParty.dll has the 7.1 CRT as a dependency, but that alone doesn’t prove this component throws the message box. And according to my customer, the vendor of ThirdParty.dll had been approached many times in the past regarding these “phantom message boxes” hanging production applications. But the vendor had denied any involvement without explicit proof. So the presence of their DLL on the stack next to a call to msvcr71!abort() might not suffice when I confronted them with this issue. I felt I still had some work ahead of me.
Before continuing, let’s do a quick review of what we know and don’t know:
1. The process hung because a message box was thrown. Ironically, this kept the process alive so that a dump could be taken, and from this dump we learned the root cause of the OOM.
2. The message box text and caption indicate it was thrown as a result of some systematic process – not due to some “rogue code” in the customer’s application.
3. Contrary to the appearance of the kb command, the message box was not thrown by the .NET Framework.
4. Based on the placement of ThirdParty.dll in the raw stack, the fact that ThirdParty.dll has the 7.1 CRT as a dependency, and the fact that the stack shows the message box was thrown directly by the CRT, we’re likely to make the most progress by trying to rule out ThirdParty.dll as the culprit (or alternatively, prove that it is the culprit).
5. From the call stack, it appears the .NET Framework had already instructed the OS to handle the exception. Why does it appear that the .NET Framework got a second go-around at handling the exception, and how do we connect the dots from the .NET Framework, to this ThirdParty.dll, to the 7.1 CRT’s call to abort()?
6. Why does msvcr71!abort() throw a message box – was it explicitly instructed to do this by someone?
First, let’s tackle the question about connecting the dots seen in the call stack frames. There’s a pretty clear and concise explanation for this in an MSDN Magazine article from 2008.
When an exception goes unhandled and the OS invokes the topmost [Unhandled Exception Filter], it will end up invoking the CLR's UEF callback. When this happens, the CLR will behave like a good citizen and will first chain back to the UEF callback that was registered prior to it. Again, if the original UEF callback returns indicating that it has handled the exception, then the CLR won't trigger its unhandled exception processing.
In other words, ThirdParty.dll had registered its UEF before the .NET Framework. And its UEF took the default road of calling abort() and throwing a message box. After ThirdParty.dll registered its UEF, the .NET Framework then registered its UEF callback. But it didn’t want to be rude and step over the UEF that ThirdParty.dll had registered first, so it chains back to it. Therefore, the result of msvcr71!abort() being called is due to the UEF registered by ThirdParty.dll.
Next, let’s tackle the question about msvcr71!abort() throwing a message box. Since this ThirdParty.dll was using the 7.1 build of the CRT, let’s look in MSDN for the information on msvcr71!abort():
abort determines the destination of the message based on the type of application that called the routine. Console applications always receive the message through stderr. In a single or multithreaded Windows application, abort calls the Windows MessageBox API to create a message box to display the message with an OK button. When the user clicks OK, the program aborts immediately.
To influence the behavior of abort(), call _set_error_mode early on (for example, in DllMain) so that abort() doesn’t exercise its default behavior of throwing a message box. MSDN’s documentation on _set_error_mode states you can pass _OUT_TO_STDERR as the lone parameter, and this will avoid the message box when abort() is called.
A colleague of mine, Senior Escalation Engineer Bret Bentzinger, offered to write some sample code that would load ThirdParty.dll and test this proposed resolution of passing _OUT_TO_STDERR to _set_error_mode. Doing this confirmed that no message box was thrown, and the thread exits without hanging the process.
In the end, this problem of throwing a message box from the CRT’s call to abort() isn’t a new one. This issue has been around for ages. But it was my first opportunity to drive such an issue and see it through to a resolution. My customer got a fix from the vendor, and we were able to help the vendor write better, more stable code. It was a win-win situation!