Hello again. It has been more than a month since the last post. I guess the Christmas vacation and the local Christmas sweets played a role in this delay.
In the last few days I was faced with an interesting issue: a Windows Forms application abruptly terminates despite the fact that there are exception handlers wrapping the code. Needless to say, the problem occurs in a production environment, rather seldom and at random times. Nonetheless, its impact is huge and so the problem needs to be sorted out.
When a process having exception handlers terminates, the most obvious reason is that exceptions occur within exception handlers themselves. It's not the only possibility, but certainly the first to check.
The first step is to attach the debugger to the process (through ADPlus) and look at the log file it produces. This confirms the first theory: from some point on, an access violation occurs every time the process calls into the windowing subsystem (user32.dll and then win32k.sys in the kernel). The call into the kernel does not crash (otherwise the entire system would crash, not only the process), but the kernel makes a callback to the user-mode code with an invalid address, which causes a user-mode access violation. And since the exception handlers display a message box to notify the user of the exception, they raise the same exception themselves, because displaying a message box is also a call into the windowing subsystem.
A user-mode dump, by itself, is not enough to identify the problem, because the bad address originates from the kernel. In order to understand where this invalid address comes from, we need a kernel dump as well, taken approximately at the same time when the problem happens.
A kernel dump crashes the system, so we need to make sure we are taking the dump only when we run into the issue. Clear enough, but how can this be done? The idea is that we attach a debugger to the user-mode process and, when the "right condition" occurs, we cause the system to crash somehow, so as to take the dump. As usual when something seems complex, let's break it down into pieces.
Analyzing the user-mode exception
By looking at the ADPlus log of a previous run of the faulty application, we know that benign access violations occur from time to time in the application. They have nothing to do with the problem and are successfully handled by the process.
You may wonder how an Access Violation exception can be benign and therefore successfully handled by the code. Typically, benign access violations are those that dereference a NULL pointer. There are pieces of code that, instead of checking whether a pointer is NULL, try to dereference it and handle the exception. The Just-In-Time compiler in .NET does that as well: it sets up a handler for a native Access Violation exception, the handler creates a NullReferenceException managed object and throws it.
Obviously, we should not crash the system and take a kernel dump in case of a benign access violation. But this poses the problem of distinguishing the benign ones, which the debugger can ignore so that the application handles them, from those that indicate we ran into the problem and should trigger the system crash.
One easy way to achieve this could be to rely on the debugging architecture, whereby debuggers are notified of exceptions at 2 stages:
- first-chance, before the exception handlers in the process are given a chance to handle the exception
- second-chance, only if the exception handlers in the debuggee did not handle the exception
The idea here is pretty simple: if we assume the application handles benign exceptions, we can simply configure the debugger to handle the exception second-chance: at that time we know that the application did not handle the exception, therefore the program is bound to terminate.
In our case, however, there is a complication: the code that does the exception handling is third-party, outside our control and our customer's, so we cannot change it. Unfortunately, this means that we cannot configure the debugger to catch the exception second-chance, because the debugger won't get notified.
So we need to handle the exception first-chance and, by analyzing it, decide whether it is one for which we should crash the system or one which we should ignore and let the application code handle. The exception is the last exception event that woke up the debugger, and it can be displayed with the .exr -1 command:
0:000> .exr -1
ExceptionAddress: 773d7dfe (ntdll!DbgBreakPoint)
   ExceptionCode: 80000003 (Break instruction exception)
  ExceptionFlags: 00000000
NumberParameters: 3
   Parameter[0]: 00000000
   Parameter[1]: 918d63c0
   Parameter[2]: 7741d094
The output of this command is essentially the content of the EXCEPTION_RECORD struct declared in the WinNT.h header file.
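To make the mapping between the .exr output and the struct concrete, here is a minimal, portable sketch of the EXCEPTION_RECORD layout. The names ExceptionRecordMirror, kExceptionMaximumParameters and makeBreakpointRecord are assumptions made for this illustration only. On Windows you would include <windows.h> and use the real EXCEPTION_RECORD instead.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Mirrors EXCEPTION_MAXIMUM_PARAMETERS from WinNT.h.
constexpr std::size_t kExceptionMaximumParameters = 15;

// Illustration-only mirror of EXCEPTION_RECORD (WinNT.h), written with
// fixed-width standard types so that it compiles on any platform.
struct ExceptionRecordMirror {
    std::uint32_t          ExceptionCode;     // e.g. 0xC0000005 for an access violation
    std::uint32_t          ExceptionFlags;    // 0 means the exception is continuable
    ExceptionRecordMirror* ExceptionRecord;   // chained (nested) exception, or nullptr
    void*                  ExceptionAddress;  // address of the faulting instruction
    std::uint32_t          NumberParameters;  // how many entries below are meaningful
    std::uintptr_t         ExceptionInformation[kExceptionMaximumParameters];
};

// Fills a record with the values of the breakpoint event printed by ".exr -1" above.
inline ExceptionRecordMirror makeBreakpointRecord() {
    ExceptionRecordMirror r{};  // value-initialization zeroes every field
    r.ExceptionCode    = 0x80000003u;  // break instruction exception
    r.ExceptionFlags   = 0;
    r.ExceptionRecord  = nullptr;
    r.ExceptionAddress = reinterpret_cast<void*>(static_cast<std::uintptr_t>(0x773d7dfeu));
    r.NumberParameters = 3;
    r.ExceptionInformation[0] = 0x00000000u;
    r.ExceptionInformation[1] = 0x918d63c0u;
    r.ExceptionInformation[2] = 0x7741d094u;
    assert(r.NumberParameters <= kExceptionMaximumParameters);
    return r;
}
```

Each "Parameter" line of the debugger output corresponds to one entry of ExceptionInformation, and NumberParameters tells how many of the 15 slots are in use.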
We will distinguish whether an access violation is benign or not based on the address of the data that is being dereferenced. If this address is 0, we consider it a benign AV. In order to accommodate the dereferencing of a member of a structure, we also consider benign an exception dereferencing data at addresses up to 4096. The choice of this particular value is certainly arbitrary, but probably reasonable (structs larger than 4KB are unusual).
In an access violation, the dereferenced address is the second parameter:
0:027> .exr -1
ExceptionAddress: 74a5849f (msdart!XxMpHeapFree+0x00000037)
   ExceptionCode: c0000005 (Access violation)
  ExceptionFlags: 00000000
NumberParameters: 2
   Parameter[0]: 00000000
   Parameter[1]: a0068421
Attempt to read from address a0068421
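The rule described above can be sketched as a small function. This is only an illustration of the decision and not the script used in the actual debug session: the names Verdict, classify, kStatusAccessViolation and kBenignLimit are made up for the example, and treating exactly 4096 as still benign is my reading of "up to 4096".

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kStatusAccessViolation = 0xC0000005u;  // access violation code
constexpr std::uint64_t kBenignLimit = 4096;  // arbitrary threshold chosen in the text

enum class Verdict { NotAccessViolation, Benign, Suspicious };

// For an access violation, params[0] is the access type (0 = read, 1 = write)
// and params[1] is the address that the code tried to access.
inline Verdict classify(std::uint32_t code,
                        const std::uint64_t* params,
                        std::uint32_t count) {
    assert(count == 0 || params != nullptr);
    if (code != kStatusAccessViolation || count < 2) {
        return Verdict::NotAccessViolation;
    }
    // NULL pointers and small offsets from NULL (struct members) are benign.
    return params[1] <= kBenignLimit ? Verdict::Benign : Verdict::Suspicious;
}
```

With the values of the access violation shown above (parameters 00000000 and a0068421), classify returns Verdict::Suspicious, which is the case where we want to take the dumps.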
So we need to dynamically check the value of the second parameter. The ".exr -1" command is not suitable for that, since its output is fairly complex and its format is not documented: we need a debugger command that returns just the arguments of the exception.
When there isn't a debugger command that does what we need, it is time to write a debugger extension.
The Debugger Extension
For generality, we will write a debugger extension, let's call it ExParams, that outputs the parameters of the exception event that caused the debugger to wake up. If the last event is not an exception (for example, if it is a breakpoint), the extension writes out an error message.
If the debugging tools are installed with the optional SDK, the installer creates an sdk subfolder with folders for libraries (lib subfolder), include files (inc subfolder) and samples (samples subfolder). If you need to quickly start writing a simple extension, I suggest starting from the "extcpp" sample. It has a sample command, ummods, that you can customize to fit your needs.
The logical steps we need to take in our extension are as follows:
- Get the last event, the event that caused the debugger to wake up, by calling IDebugControl::GetLastEventInformation()
- Check the type of the event: if it is not DEBUG_EVENT_EXCEPTION, return an error message
- Cast the ExtraInformation output argument to an EXCEPTION_RECORD64 data structure and write out the parameters
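The last two steps can be sketched in portable C++ as follows. This is not the extension itself: it assumes the event type and the extra-information buffer have already been obtained from IDebugControl::GetLastEventInformation(), and it returns a string instead of writing to the debugger. The names formatExParams and ExceptionRecord64Mirror, the error messages and the space-separated output format are all assumptions made for the example. The value of kDebugEventException is meant to match DEBUG_EVENT_EXCEPTION in dbgeng.h and should be checked against the header.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <string>

// Assumed to match DEBUG_EVENT_EXCEPTION in dbgeng.h; verify against the header.
constexpr std::uint32_t kDebugEventException = 0x00000002;

// Illustration-only mirror of EXCEPTION_RECORD64 from WinNT.h.
struct ExceptionRecord64Mirror {
    std::uint32_t ExceptionCode;
    std::uint32_t ExceptionFlags;
    std::uint64_t ExceptionRecord;
    std::uint64_t ExceptionAddress;
    std::uint32_t NumberParameters;
    std::uint32_t UnusedAlignment;
    std::uint64_t ExceptionInformation[15];
};
static_assert(sizeof(ExceptionRecord64Mirror) == 152, "unexpected struct layout");

// Hypothetical helper: returns the text that an ExParams-like command could print.
inline std::string formatExParams(std::uint32_t eventType,
                                  const void* extraInfo,
                                  std::size_t extraInfoUsed) {
    // Step 2: only exception events carry an exception record.
    if (eventType != kDebugEventException) {
        return "ExParams: the last event is not an exception\n";
    }
    if (extraInfo == nullptr || extraInfoUsed < sizeof(ExceptionRecord64Mirror)) {
        return "ExParams: no exception record available\n";
    }
    // Step 3: the buffer starts with the exception record; copy it out.
    ExceptionRecord64Mirror rec;
    std::memcpy(&rec, extraInfo, sizeof rec);
    assert(extraInfoUsed >= sizeof rec);

    std::uint32_t count = rec.NumberParameters;
    if (count > 15) count = 15;  // never read past the array

    std::string out;
    for (std::uint32_t i = 0; i < count; ++i) {
        char buf[17];  // up to 16 hex digits plus the terminator
        std::snprintf(buf, sizeof buf, "%08llx",
                      static_cast<unsigned long long>(rec.ExceptionInformation[i]));
        if (i != 0) out += ' ';
        out += buf;
    }
    out += '\n';
    return out;
}
```

For the access violation shown earlier (two parameters, 00000000 and a0068421), this sketch produces the line "00000000 a0068421". Keeping the output to bare hexadecimal tokens makes it easy to parse from a debugger script.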
A simplified version with fixed-size arguments and with no error handling is reported here:
Since the extension should execute quickly and does not use symbols, I did not implement command cancellation.
We compile and build the extension with either the WDK tools or with Visual Studio. If you use the latter, please pay attention to the dependency on the C++ runtime library: better to do a static link (/MT compiler option) so as to avoid having to deploy the C++ redistributable on the target machine.
We are now ready to use our brand new extension.
Creating the ADPlus Script
The debugger extension, when run, will output 2 parameters. We need to check the second one. To achieve that, the control flow tokens of the debugging tools come to the rescue. The .foreach and .if tokens can be used to parse the output of ExParams and get the 2nd argument.
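To illustrate what the .foreach and .if tokens accomplish, here is the same logic written in C++: split the output into whitespace-separated tokens, pick the second one, read it as a hexadecimal number and compare it with the threshold. The function name shouldCrashSystem is hypothetical, and the sketch assumes that ExParams prints its parameters as bare hexadecimal tokens on one line, which is an assumption about the output format.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <sstream>
#include <string>

// Returns true when the second token of the (assumed) ExParams output is a
// hexadecimal address above the benign limit, i.e. when a dump should be taken.
inline bool shouldCrashSystem(const std::string& exParamsOutput,
                              std::uint64_t benignLimit = 4096) {
    std::istringstream in(exParamsOutput);
    std::string token;
    // Like .foreach, visit the tokens one by one and skip the first.
    for (int index = 0; in >> token; ++index) {
        assert(!token.empty());
        if (index != 1) continue;
        char* end = nullptr;
        const unsigned long long address = std::strtoull(token.c_str(), &end, 16);
        if (end == token.c_str() || *end != '\0') {
            return false;  // not a number (e.g. an error message): do nothing
        }
        // Like .if, act only when the address is outside the benign range.
        return address > benignLimit;
    }
    return false;  // fewer than two tokens
}
```

For example, the output "00000000 a0068421" yields true, while "00000000 00000010" yields false and the application is left to handle the exception.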
We also need a way to crash the system. The NotMyFault.exe utility can do the trick: it dynamically loads the driver MyFault.sys, which then causes the crash. NotMyFault can be downloaded at http://download.sysinternals.com/Files/Notmyfault.zip.
Furthermore, we use the facilities of AdPlus in order to have a log file where the events are reported and where we can declaratively specify what we need the debug session to do. Without further comments (see the documentation), here is the config file:
- Since the kernel dump has to be the last thing we do on the machine (at that point, the machine crashes), we obviously need to take the user dump before crashing the machine. That's automatic in the script because the <CustomActions1> block is executed after the <Actions1> block.
- Only administrators, by default, have the privilege of loading device drivers. This means that the debugger must be run with administrative privileges, otherwise NotMyFault.exe will not be able to crash the system.
- You may have noticed the &gt; entity in the CustomActions1 element. This is the ">" character. The direct use of ">" in the XML would cause ADPlus to interpret it as an element bracket and therefore to display an error message. By using &gt; we make sure that the ">" sign will not be interpreted by ADPlus and, instead, will be delivered to the debugger.
- The command .expr /s MASM at the beginning of the debug session sets the default expression evaluator for the debugger. The debugging tools documentation has more details.
Sorting out the details
The very last thing we need to take care of is making sure that the machine is properly configured to take a full memory dump when the system crashes. I think you'll agree with me that it would really be a shame to do all this work and then not have a dump just because of a machine configuration detail.
The details are reported in the article Overview of memory dump file options for Windows Server 2003, Windows XP, and Windows 2000. In our case, since the target system is a Windows XP machine, the things to check are the following:
- Make sure that the paging file is on the boot/system disk (usually c:)
- Check that the paging file is at least 10MB larger than the physical memory of the machine
- Ensure that the required type of memory dump, "Complete Memory Dump", is selected in the Computer's properties, "Advanced" tab, "Startup and Recovery"
- Make sure you have enough space for the dump to be saved in the specified location (equal to the size of physical memory plus 10MB).
In today's topic we have touched on several areas in order to achieve an advanced and not-so-common goal: automatically analyze, in the debugger, the details of an event (an exception, in our case) and, based on that, take a specific action (crash the system and take a kernel dump, in our case). This requirement is not so uncommon in a production or test environment where the problem happens very seldom. In these circumstances the debugging needs to be automated as well.
We leveraged several tools and technologies in order to achieve that:
- A debugger extension which dumps out the data we need
- The control flow tokens in the debuggers of the "Debugging Tools for Windows" which parse and filter the output of a debugger extension
- The NotMyFault.exe tool (or equivalent) which can be run to crash the system at a given point in time
- The AdPlus tool to simplify the set of actions that we want the debugger to take in reaction to specific events
- The configuration of the dump option on the machine
While the set of steps may seem complex at first sight, this is a one-time effort: reapplying the same principles and techniques, or adapting them to a similar need, is straightforward. And the ability to track down difficult and complex problems which seldom occur and have an impact on the production environment is a huge added value.