Understanding the Crashdump Test and debugging failures

The crashdump test is part of every Storage > Adapter or Controller program and verifies that the storage adapter driver is able to produce a crashdump.  The purpose of a crashdump file is to allow analysis of the state of the system after a non-recoverable error, also known as a crash, stop error, kernel error, or bugcheck.  To make the crash diagnosable, the operating system creates a dump file.  Microsoft Windows supports three different levels of dumps: a complete dump, a kernel dump, and a minidump.  All dumps contain header information indicating the type of dump, the nature of the error, and other data added by drivers in the system (bugcheck callback data).  The complete dump contains the entire contents of physical memory on the system.  The kernel dump contains only the contents of physical memory corresponding to the operating system and omits user mode applications and unallocated memory.  The minidump contains only a list of loaded modules and information about the active process and thread that caused the crash.

In the case of a crash, there is no certainty about which parts of the operating system will still be functional.  The network or file system drivers may have caused the crash, for example, which would prevent the OS from using file system structures to create a dump file or from using the network to store the file remotely.  The OS handles this by using a file that it already knows exists (the pagefile) and writing directly to that file's logical block extents on disk.  The dump process writes the contents of physical memory into the pagefile on the system disk (usually c:\pagefile.sys).  The pagefile must be large enough to contain the dump.  The largest dump is a complete dump, which requires the size of physical memory (ex: 4096 MB) plus one extra megabyte to contain the header information.  The test requires that the user configure the page file to an appropriate size before execution (see https://support.microsoft.com/kb/314482).  If the page file size is insufficient, the test will log the following error during the initialization phase:

(i) Verifying paging file size.
(x) Paging file size is too small for full dump purposes.
(i) Paging file size: 330989568
(i) Physical memory size: 1073094656
(i) Please configure minimum paging file size to physical RAM size + 1MB.
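
The size requirement in the log above is simple arithmetic.  As a minimal sketch in Python (the function names are illustrative and are not part of the test), the check amounts to comparing the paging file size against physical memory plus one megabyte:

```python
ONE_MB = 1024 * 1024


def required_pagefile_bytes(physical_memory_bytes: int) -> int:
    """Minimum paging file size for a complete dump: RAM plus 1 MB for the header."""
    return physical_memory_bytes + ONE_MB


def pagefile_is_sufficient(pagefile_bytes: int, physical_memory_bytes: int) -> bool:
    """True when the paging file can hold a complete dump."""
    return pagefile_bytes >= required_pagefile_bytes(physical_memory_bytes)


if __name__ == "__main__":
    # Values taken from the sample log above.
    pagefile = 330989568
    ram = 1073094656
    needed = required_pagefile_bytes(ram)
    print(f"required: {needed} bytes, shortfall: {max(0, needed - pagefile)} bytes")
    print("sufficient" if pagefile_is_sufficient(pagefile, ram) else "too small")
```

For the logged values the required size works out to 1074143232 bytes, so the 330989568-byte paging file falls short by 743153664 bytes.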

After some basic settings validation, the test installs a driver that is used to crash the system, then reboots the system.  After the reboot the test changes the crash control settings (for full memory dump), deletes any old dump files, and crashes the system.  At the point of the crash the system will display a bugcheck screen (blue screen) with details of the nature of the crash.  The type of bugcheck should be MANUALLY_INITIATED_CRASH (e2).  If anything else appears here, it means a second bugcheck occurred during the process of writing the dump file.  This should be investigated by connecting a kernel debugger to the test client and debugging the storage adapter driver.  After the dump file has been written the test machine should automatically reboot.
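
To confirm by hand which dump type a machine is configured for, the crash control settings can be read from the registry.  The sketch below is an illustration only and does not show how the test itself changes the settings.  It assumes the CrashDumpEnabled value under the CrashControl key, which is not named elsewhere in this article, and the helper names are made up for the example:

```python
# Documented meanings of the CrashDumpEnabled registry value (an assumption
# of this sketch; the article itself does not name the value).
DUMP_TYPES = {
    0: "none",
    1: "complete memory dump",
    2: "kernel memory dump",
    3: "small memory dump (minidump)",
    7: "automatic memory dump",
}

CRASH_CONTROL_KEY = r"SYSTEM\CurrentControlSet\Control\CrashControl"


def describe_dump_type(value: int) -> str:
    """Translate a CrashDumpEnabled value into a readable name."""
    return DUMP_TYPES.get(value, f"unknown ({value})")


def read_crash_dump_enabled() -> int:
    """Read CrashDumpEnabled from the registry (Windows only)."""
    import winreg  # imported lazily so the helpers above work on any platform

    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, CRASH_CONTROL_KEY) as key:
        value, _value_type = winreg.QueryValueEx(key, "CrashDumpEnabled")
    return value


if __name__ == "__main__":
    import sys

    if sys.platform == "win32":
        print(describe_dump_type(read_crash_dump_enabled()))
    else:
        print("Registry access requires Windows")
```

A value of 1 corresponds to the complete (full) memory dump that the test expects.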

Upon boot after a crash the operating system will detect the presence of dump information in the page file and begin the process of writing a dump file.  This process occurs asynchronously while the machine is booting and even after the user has logged in (see https://support.microsoft.com/kb/886429 for details).  During this process you can view its progress by checking the size of the dump file (C:\windows\memory.dmp) or viewing the process in Task Manager (werfault.exe).  The test will often be running at the same time as this process, trying to access the same dump file.  If this occurs the following messages will appear in the log:

(i) Connecting to DumpFile: C:\Windows\MEMORY.DMP
(i) Dump file is being used by another process. HRESULT: 0x80070020
(i) Usually memory.dmp is still being written due to large RAM, retry after 5 minutes.

The test should then retry accessing the file.  If the error code is different, or if it changes (ex: 0x80070002 : ERROR_FILE_NOT_FOUND), then it means the file could not be written to disk.  The first place to check for valuable debug information is the system event log.  To view the event log click Start > Run and enter Compmgmt.msc.  In the computer management window, select Computer Management\System Tools\Event Viewer\System.  Browse through the list of events for an event with the source BugCheck.  The most common cause for a missing dump file is insufficient free space on the disk.  As noted in the KB article referenced above, the registry key HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\CrashControl\MachineCrash contains information about the last crash (before a reboot), including the partial dump file if one was created.  This can be useful when trying to debug other missing dump issues.
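
The HRESULT values in the log wrap ordinary Win32 error codes, which makes them easier to look up.  The following Python sketch is not part of the test, and its names are illustrative.  It splits an HRESULT into its failure bit, facility, and code, and names the two Win32 errors discussed above:

```python
FACILITY_WIN32 = 7

# Only the two Win32 error codes discussed in the text above.
KNOWN_WIN32_ERRORS = {
    2: "ERROR_FILE_NOT_FOUND",
    32: "ERROR_SHARING_VIOLATION",
}


def decode_hresult(hresult: int):
    """Split an HRESULT into (is_failure, facility, code)."""
    hresult &= 0xFFFFFFFF
    is_failure = bool(hresult & 0x80000000)  # top bit set means failure
    facility = (hresult >> 16) & 0x7FF       # 11-bit facility field
    code = hresult & 0xFFFF                  # low 16 bits carry the error code
    return is_failure, facility, code


def explain(hresult: int) -> str:
    """Describe an HRESULT, naming the Win32 error when it is one we know."""
    hresult &= 0xFFFFFFFF
    is_failure, facility, code = decode_hresult(hresult)
    if is_failure and facility == FACILITY_WIN32:
        name = KNOWN_WIN32_ERRORS.get(code, "unrecognized Win32 error")
        return f"0x{hresult:08X}: Win32 error {code} ({name})"
    return f"0x{hresult:08X}: not a failing Win32-facility HRESULT"


if __name__ == "__main__":
    print(explain(0x80070020))  # the "still being written" case from the log
    print(explain(0x80070002))  # the "file could not be written" case
```

Here 0x80070020 decodes to Win32 error 32, a sharing violation, which matches the "used by another process" message, while 0x80070002 decodes to error 2, file not found.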

Upon reboot the test examines the dump file for correctness.  The test does this, just as a developer would, using the kernel debugger kd (from the "Debugging Tools for Windows" package).  In order to analyze a dump file the debugger needs access to symbols.  I won't go into details on why these are needed (more information is available here), but think of these as a dictionary for the dump.  They allow the debugger to break down the contents of memory (or a crashdump file) into individual modules (executables, libraries, drivers, etc), functions within those modules, and data structures.
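
To make the dictionary analogy concrete, here is a deliberately simplified Python sketch of what a symbol lookup does: given a raw address, find the nearest preceding symbol and report it as module!function+offset.  The symbol table, module name, and addresses are entirely hypothetical, and real symbol files carry far more information (types, source lines, and so on):

```python
import bisect

# Hypothetical symbol table: (start address, "module!function"), sorted by address.
SYMBOLS = [
    (0x1000, "exampledrv!DriverEntry"),
    (0x1400, "exampledrv!StartIo"),
    (0x2000, "exampledrv!HandleInterrupt"),
]
_STARTS = [start for start, _name in SYMBOLS]


def resolve(address: int) -> str:
    """Map a raw address to the nearest preceding symbol plus an offset."""
    index = bisect.bisect_right(_STARTS, address) - 1
    if index < 0:
        # The address lies below every known symbol, so nothing can be said.
        return f"0x{address:x} (no symbol)"
    start, name = SYMBOLS[index]
    return f"{name}+0x{address - start:x}"


if __name__ == "__main__":
    for addr in (0x1432, 0x2008, 0x500):
        print(resolve(addr))
```

Without the table, the debugger would be left with bare numbers like 0x1432.  With it, the same address reads as exampledrv!StartIo+0x32.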

For the test to work properly, it needs to provide the debugger with symbols.  The current mechanism for doing so is to have the user download public symbol packages (here) and install them on the test machine before running the test.  When the test does not have proper symbols it will log warnings or failures during the first analysis phase of testing.  When symbols are not installed or the symbols do not match the operating system under test, the following message may appear in the test log:

(x) Failed to load the correct symbols.
(i) Please refer to WDK documentation on how to install OS symbols.
(i) Your symbols might also be out of date, please update symbols via the symbol server: https://support.microsoft.com/kb/311503

This won't actually cause the test to fail, because in some cases the dump can still be analyzed with partially matching symbols.  If the test continues and more test cases fail with messages like "Error retrieving addresses of ..." or "Unable to get...", it means that the debugger cannot analyze the dump due to the missing symbols.  We understand the difficulties associated with downloading the public symbol packages and ensuring they are up to date with all of the latest hotfixes on the system.  One way we have found to work around symbol problems is to supplement the locally installed symbol packages with symbols cached from an internet symbol server.  The test could use the symbol server directly, but it is not advised to have the test machines connected to the internet when under test.  The steps for caching the symbols locally are as follows:

  1. Ensure that you have already created a crashdump.  The easiest way to do this is to run the test once and let it fail.
  2. Ensure that you have the debugging tools installed.  Again the easiest way to do this is to run the test once.  It will install the tools to C:\Debuggers.
  3. Open a command prompt (on Vista, right-click and choose Run as administrator).
  4. Type in the following command, without the quotes: "c:\Debuggers\kd -z c:\Windows\MEMORY.DMP -y SRV*C:\Symbols*https://msdl.microsoft.com/download/symbols"
  5. This will load the dump in the kernel debugger using the remote symbol store at Microsoft and the local directory C:\Symbols as the downstream store to cache the symbols.
  6. Make sure that symbols can be found for OS files like NTOSKRNL and NTDLL.  These are necessary to analyze the dump.  It is ok if errors appear loading symbols for other modules like 3rd party drivers.
  7. You should now have a prompt "0: kd>".  At this prompt type the command ".reload /f" without the quotes.  This command forces the debugger to load and cache all symbols for modules loaded in the dump.
  8. You can now exit the debugger by pressing CTRL+B and then Enter.
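
If you repeat these steps on several machines, the command from step 4 can be assembled programmatically.  The Python sketch below only builds and prints the command line, using the paths from the steps above.  The helper names are illustrative, and the optional use of kd's -c option to run ".reload /f" and then quit automatically is an assumption that goes beyond the manual steps, so verify it against your debugger version before relying on it:

```python
def symbol_path(cache_dir: str, server_url: str) -> str:
    """Build an SRV* symbol path: local downstream cache first, then the remote store."""
    return f"SRV*{cache_dir}*{server_url}"


def kd_command(kd_exe, dump_file, sym_path, initial_commands=None):
    """Return the kd argument list; -z opens a dump file, -y sets the symbol path."""
    command = [kd_exe, "-z", dump_file, "-y", sym_path]
    if initial_commands:
        # Assumption: -c runs the given debugger commands at startup.
        command += ["-c", initial_commands]
    return command


if __name__ == "__main__":
    sym = symbol_path(r"C:\Symbols", "https://msdl.microsoft.com/download/symbols")
    # Interactive form, equivalent to step 4.
    print(" ".join(kd_command(r"c:\Debuggers\kd", r"c:\Windows\MEMORY.DMP", sym)))
    # Unattended form (assumption): force-load all symbols, then quit.
    print(kd_command(r"c:\Debuggers\kd", r"c:\Windows\MEMORY.DMP", sym, ".reload /f; q"))
```

The returned list can be handed to subprocess.run on the test machine.  Keeping the symbol path in its own helper makes it easy to point the cache at a different local directory.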