How a Bluescreen Button (NMI) Can Save Your Bacon

I know, another title that seems ridiculous.  Why in the world would anyone want a button that intentionally bluescreens your system?!  When you’re confronted with a hard hang, though (no response from the mouse or keyboard), you’re in for a heck of a time trying to figure out what’s wrong without one.  That’s where the NMI button can come in handy.

Many people are already familiar with the mechanism introduced in Windows 2000 for these kinds of issues.  The gist is that by setting a registry key, you can enable a key sequence (at the local keyboard only) that will bluescreen the machine.  Thus if you’re having problems with hangs, you can get a memory.dmp and send it to your OEM or Microsoft for analysis.
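For reference, here is a sketch of the registry setting behind that key sequence.  It assumes a PS/2 keyboard handled by the i8042prt driver; USB keyboards go through a different driver key (kbdhid), and support for them depends on the Windows version, so check Microsoft's documentation for your release before relying on it.

```reg
Windows Registry Editor Version 5.00

; Assumes a PS/2 keyboard (i8042prt driver). A reboot is needed for the change to take effect.
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\i8042prt\Parameters]
"CrashOnCtrlScroll"=dword:00000001
```

With the value set and the machine restarted, holding the right Ctrl key and pressing Scroll Lock twice should bugcheck the box.  You only get a memory.dmp out of it if crash dumps are also configured on the system.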

However, this mechanism can’t cover every scenario that will result in a hang.  The keyboard interrupt is typically a fairly low priority on the system in relation to the rest of the devices.  Unless your hang is the result of a deadlock in the kernel itself, the key sequence may never get through to initiate the crash.  It’s simply too easy for other drivers to mask that interrupt while doing their own I/O.

This is where the Non-Maskable Interrupt (NMI) comes in to save the day.  As the name implies, this is an interrupt that cannot be hidden by software.  When the interrupt is generated, the CPU will always get it, and the interrupt handler (which you also must explicitly enable in the registry) will start the process of bluescreening the box.  The handler breaks into the kernel debugger if one is attached, or generates a STOP 0x00000080 (NMI_HARDWARE_FAILURE) blue screen if not.
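The registry setting for the NMI handler looks roughly like this.  Treat it as a sketch rather than a definitive recipe: the value name and location below are the ones documented for the Windows 2000 and Server 2003 era, and I believe later releases handle NMI crash dumps without it, so verify against the documentation for your version.

```reg
Windows Registry Editor Version 5.00

; Enables the bugcheck-on-NMI behavior. A reboot is needed for the change to take effect.
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\CrashControl]
"NMICrashDump"=dword:00000001
```

As with the keyboard method, a usable memory.dmp also depends on crash dumps being configured and the page file being large enough to hold one.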

Now if the NMI doesn’t work, you can be confident that something is seriously wrong with your system, and it’s probably hardware.  The CPU typically has to move into an unknown state for this feature to fail.  It’s time to contact your hardware vendor, and quick.  If you think no one uses this feature, you’d be surprised.  A number of major server vendors do in fact ship systems with this button, but they keep it hidden (for good reason) and don’t really use it as a feature to sell the box.  They consider it purely diagnostic.

Personally, I’d want every system in my server room to have this mechanism.  I don’t want 2 or 3 hangs before I can even begin to troubleshoot.  I want it done the first time, every time.

Comments (3)

  1. I seem to remember a ZX Spectrum magazine having an article on how to build one of these things back in the 80s. The motivation was much the same – to be able to "debug" (or generally mess with) the current memory contents "on demand".

  2. Dr Pizza says:

    "This is where the Non-Maskable Interrupt (NMI) comes in to save the day. As the name implies, this is an interrupt that cannot be hidden by software. "

    But the software can tell the firmware to tell the hardware to hide it (the CMOS NMI AND gate), so really, what’s the difference?

    in al,70h     ; read the CMOS index port
    or al,80h     ; set bit 7, the NMI-disable bit
    out 70h,al    ; write it back, masking NMIs
    in al,71h     ; necessary to read 71h after writing 70h

  3. That’s true, and in fact it is the default behavior of Windows to ignore NMIs, even if they’re not hidden. (Hence the required registry key.)

    You can’t stop a driver or other kernel-mode code from intentionally hiding it, but you’d better have a good reason to do it. As a rule, this almost never happens. Usually if NMI break-in fails, the problem is that the machine was designed to not pass NMIs from devices. (Which is maddening for someone in my line of work.)