What does an NMI error mean? (The infamous "Hardware Malfunction")


I promised to talk more about NMI, so here it is.

What generates an NMI? What does it mean?

The first question is easy to answer but doesn't actually shed much light: Any device can pull the NMI line, and that will generate a non-maskable interrupt. Back in the Windows 95 days, a few really cool people had taken the ball-point pen trick one step further: They had a special expansion card in their computer with a cord coming out the back. At the end of the cord was a momentary switch like the one you might see on a quiz show. If you pressed it, the card generated an NMI. No fumbling around with ball-point pens for these folks, no-ho! (To be honest, I had two of these. One of them was a simple NMI card, triggered by a foot pedal! The other was really a card with a high-resolution real-time clock that could be used for performance analysis. I used the NMI button far more often than the timer...)

In practice, the only device that generates an NMI (on purpose) is the memory controller, which raises it when a parity error is detected. The non-geek explanation of a parity error: Your memory chips are acting flakey.
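To make the parity idea concrete, here is a minimal sketch in Python. It assumes even parity over a single byte with one extra stored bit, which a given machine may or may not use. The function names are hypothetical, and this models only the check, not any real memory controller.

```python
def parity_bit(byte: int) -> int:
    """Return the even-parity bit for an 8-bit value (hypothetical helper)."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected an 8-bit value")
    # The parity bit makes the total number of 1 bits (data + parity) even.
    return bin(byte).count("1") % 2


def has_parity_error(byte: int, stored_parity: int) -> bool:
    """True if the byte read back no longer matches its stored parity bit."""
    return parity_bit(byte) != stored_parity


# Write: remember the parity bit alongside the data.
data = 0b1011_0010
stored = parity_bit(data)

# Read back intact: no error.
clean_read_fails = has_parity_error(data, stored)

# Read back with one flipped bit: detected.
flipped = data ^ 0b0000_1000
single_flip_fails = has_parity_error(flipped, stored)

# Two flipped bits cancel out and slip through undetected.
double_flipped = data ^ 0b0001_1000
double_flip_fails = has_parity_error(double_flipped, stored)
```

A single check bit can say that something went wrong but not which bit, so there is nothing to correct. Raising an alarm is about all the hardware can do.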

Here's what a parity error looks like. It shows up as a mysterious "Hardware Malfunction" error.

Now, it's possible that a device may be generating an NMI by mistake. For example, in Wendy's case, it may have been due to damage caused by overheating.

If you suspect your memory chips, you can run a memory diagnostic tool to see if it can find the bad memory.

My colleague Keith Moore reminded me that paradoxically, on the IBM PC-AT, you could mask the non-maskable interrupt! This definitely falls into the category of "Unclear on the concept." The masking was done in hardware that could be configured via some magic port I/O. It prevented the NMI from reaching the CPU in the first place. (NMI is still not maskable in the CPU.)
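For the curious, the "magic port I/O" can be sketched roughly as follows. This assumes the commonly documented PC/AT arrangement in which bit 7 of I/O port 0x70 (shared with the CMOS/RTC index register) gates NMI, with 1 meaning masked. The post itself does not name the port, so treat that detail and the helper names as assumptions. The sketch only computes the byte to write; it performs no port access.

```python
NMI_GATE_PORT = 0x70      # assumption: PC/AT CMOS index port, bit 7 gates NMI
NMI_DISABLE_BIT = 0x80    # assumption: 1 = NMI blocked before it reaches the CPU


def nmi_gate_value(cmos_index: int, mask_nmi: bool) -> int:
    """Build the byte to write to the gate port (hypothetical helper).

    The low 7 bits select a CMOS register, so they are preserved; only
    bit 7 changes depending on whether NMI should be masked.
    """
    if not 0 <= cmos_index <= 0x7F:
        raise ValueError("CMOS index must fit in 7 bits")
    return cmos_index | (NMI_DISABLE_BIT if mask_nmi else 0)


masked = nmi_gate_value(0x0F, mask_nmi=True)
unmasked = nmi_gate_value(0x0F, mask_nmi=False)
```

A driver would hand the result to a privileged port write aimed at `NMI_GATE_PORT`; that call is left out here because it is platform-specific.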

Comments (28)
  1. vince says:

    At least on Linux, watchdog timers and performance counters also trigger NMIs.

  2. Tomer Chachamu says:

    And if you prefer, memtest86 does pretty much the same thing and can probably boot off the network. It is also found on almost every Linux boot disc, Live CD or install CD – just type "memtest" or "memtest86" at the prompt.

    You should also be careful – both memtest and windiag can repeat their tests forever if you just leave them to do whatever they want.

  3. Matt Pietrek says:

    Yup. Back in my NuMega days the company sold boards with an NMI switch like Raymond describes.

    I also recall Purart (who did Turbo Debugger for Borland) had an NMI board.

    Of course, my memory may be bad. :-)

  4. David says:

    The PC Jr. used NMI for the keyboard.  All current chipsets allow NMI to be blocked.  The PC will not start if you cannot block it.  The BIOS must write to all of memory several times before the memory becomes stable.  I guess those chipsets that don’t support parity or ECC memory don’t need the capability.  Periscope had several cards that provided NMI switches from a simple one to their more complete ICE cards.

  5. bramster says:

    @David

    "The BIOS must write to all of memory several times before the memory becomes stable."

    Could you expand on that? Intriguing!

  6. DriverDude says:

    In the really old days, when a parity error occurred, the BIOS would print the address (or something resembling an address) on the screen before halting. You could decode the address and figure out which RAM chip to replace.

    Of course there were 36+ RAM chips in a PC back then. Nowadays it is just one out of four or eight DIMMs.

  7. Jeremy Croy says:

    Back in the day, I had this bluescreen, and it took me the longest time to figure out what caused it. The motherboard had blown a capacitor. It was a dual proc PIII 1 GHz rig. When I’d run memory tests, they’d all pass, since they ran on CPU0. I’d boot Windows and in about 2 minutes I’d have an NMI bluescreen. Took me about 2 days to notice a little oozing capacitor in the case, that was on the power line going to CPU1.

  8. J. Edward Sanchez says:

    I have a somewhat related question (not necessarily directed at Raymond, but to anyone reading who might know the answer).

    I have a machine that is configured with ECC memory, and has ECC enabled via the BIOS’s "ECC Scrub" setting. I’m running Windows Vista. What happens if the ECC encounters an uncorrectable error (i.e., two or more flipped bits)? Does an NMI still get generated, like in the old days? Do I get a system-modal error message? An entry in the Windows System log? A bugcheck (blue screen)? Or — shudder — nothing at all?

    An ECC failure has never occurred with this machine, to my knowledge. But I’ve always wondered what would happen if one did.

  9. Richard says:

    I just found a very dusty book on my shelf titled "Professional Debug Facility" from IBM.  In this book I found a small ISA card that provides the NMI function described by Raymond, complete with a little black button to force the NMI.  This brings back fond memories of debugging assembly code on the original PC.

    Now if only I could find a computer with an available ISA slot.

  10. random@undergrad.math.uwaterloo.ca says:

    At UW (Waterloo of course; accept no substitutes) there was an NMI button on the CS 452 (realtime) course machines.  The course objective was to write an OS to control a Marklin train system; it was a lot of fun, but sleep was hard to come by; at least there was plenty of free food from employer presentations.  Naturally these fledgling OSes would get wedged hard fairly frequently.  Pushing the NMI would boot back up to the loader program, which could run submitted OS images using TFTP.

  11. Tom says:

    "The BIOS must write to all of memory several times before the memory becomes stable."

    Could you expand on that? Intriguing!

    I once saw an embedded system that did the assembly equivalent of a memset(sdram_base, 0, sdram_size) to clear parity DRAM early in initialization. The comment was that on some DRAM, the chips power up with random data. Since it’s random and reading checks the parity which has a 50% chance of being right, it’s possible that if you read before writing, you’ll get a fatal ECC error.

    Not sure why ‘several times’ though, maybe it’s something to do with refreshing.

  12. Norman Diamond says:

    In practice, the only device that generates an NMI (on purpose) is the memory controller, which raises it when a parity error is detected.

    The only device that SHOULD generate an NMI (on purpose) is the power failure detector.  A while earlier there was a link, either in this blog or one of your famous colleagues’, to someone else’s article about unbelievable abuses of NMI.  Anything other than power failure can be handled normally in accordance with an OS’s priorities and thread management.

    Tuesday, February 27, 2007 10:24 AM by vince

    At least on Linux, watchdog timers and performance counters also trigger NMIs.

    OK, I guess there’s no limit to this unbelievability.  Though they do have competition — where’s that MSDN page about Windows giving performance counters a higher priority than power failure (but that’s software priorities not NMI).

  13. Chris Nahr says:

    There was a commercial debugger for IBM PCs that came with an expansion card and an NMI trigger button.  You’d load the debugger as a resident program, and when you pressed the button the debugger would start up, showing you the exact location and memory state of the program you just interrupted.

    That was a really cool program… sadly I can’t recall the name at the moment.

  14. Chris Nahr says:

    Hey, David mentioned the name in a comment above — it was the Periscope debugger!

  15. BryanK says:

    Norman — I’m not sure about the performance counters, but there’s a very good reason the watchdog uses the NMI.  (It’s a watchdog that’s implemented in the I/O APIC; IIRC the APIC can schedule interrupt delivery for a later time, and the NMI watchdog driver just schedules an NMI for some time in the future.  Then it wakes up before that time and reschedules the NMI.)

    Anyway, the reason it’s an NMI is so it will still act as a watchdog even if your kernel is locked up in a state where interrupts are disabled (via cli or the SMP equivalent).  As long as the interrupt handler for the NMI interrupt is still there, the machine will at least print out a stack trace, and you can see where in the kernel it got locked up.

    (True, it’s not a real watchdog: it won’t reset the machine.  But it’s useful for debugging many types of in-kernel lockups, where real watchdogs aren’t.)

  16. vince says:

    OK, I guess there’s no limit to this unbelievability.  Though they do have competition — where’s that MSDN page about Windows giving performance counters a higher priority than power failure (but that’s software priorities not NMI).

    What good are your performance counters if they lose counts if you happen to trigger while the processor is servicing an interrupt?

    In any case all the performance counter NMI does is update an OS counter, so it is unlikely it’s going to interfere with the system that much.  And of course, performance counting is turned off by default and only enabled if you decide you want to profile something.

  17. Mark Hampton says:

    I found another way to generate NMI’s by accident… In college, a roommate and I built a plugin card (etched the card ourselves), and one day we plugged it in backwards. After that, the machine wouldn’t boot, just gave NMI errors. After looking at the schematics, we realized we blew up one NOR gate on a 7402. Luckily, I had one, and we piggy backed it on the bad one and replaced the blown one. Computer ran fine. Of course, this was back in the 8088 days when mere mortals could understand timings and such.

  18. Xepol says:

    The old IBM ATs were strange beasts.

    I once had one clear the screen and inform me that the system bus had failed.

    Ultimately, it turned out to be a rogue procedural pointer that drove the machine down the path to insanity rather than the next logical operation, but to this day, I wonder who it was that decided that they could actually get a message out across the bus to the video card if the bus had ACTUALLY failed…

  19. Nick Lamb says:

    BryanK / Norman

    Linux provides support for watchdog hardware, which can autonomously power cycle a machine when it stops doing whatever it was supposed to do, no matter the reason. It also provides a trivial software implementation of the same API. Neither of these use an NMI.

    The NMI watchdog, though it has a similar name, is merely a debugging tool, specifically it converts hangs (which are notoriously hard to diagnose) into crashes (which provide tracebacks and other useful diagnostics) in exchange for reduced performance. It’s a software replacement for Raymond’s tried and tested method with the ballpoint pen.

    So in terms of production use you might have a hardware watchdog because you intend to bury the system in a glacier for two years and you can’t have someone dig it out every time a software error crashes it, whereas you might use the NMI watchdog because your customer’s database server mysteriously freezes every 2-3 days for no apparent reason and you can’t reproduce it on your test hardware.

  20. Raymond – oh what memories. In the Windows 95 days I had to acquire and install NMI cards in all of my game testing machines so that (if my memory is not failing me) we could break into the debugger so that you guys (I think it was usually Ralph Lipe who drew the short straw) could debug Doom or other less important DOS-based game.

    Yes everyone – getting Doom to run on Windows 95 was a fairly high priority. ;-)

  21. Joe Old-timer says:

    The old Gravis Ultrasound card used NMI to implement software-assisted Sound Blaster emulation. See for example:

    http://www.tldp.org/HOWTO/PCI-HOWTO-3.html

  22. Norman Diamond says:

    Wednesday, February 28, 2007 8:08 AM by BryanK

    there’s a very good reason the watchdog uses the NMI.

    OK, I won’t complain too loudly about watchdogs.  But I’ll still complain a little bit, because I sure don’t want a watchdog to interrupt a power failure handler.

    Wednesday, February 28, 2007 10:30 AM by vince

    What good are your performance counters if they lose counts if you happen to trigger while the processor is servicing an interrupt?

    What good are your performance counters if they interrupt an ISR that really knew it couldn’t be interrupted at that point?  Will you really be able to count that tick after recovering from a BSOD?  For comparison, if a power failure handler executes while some other ISR is interrupted, causing a BSOD when the power failure handler returns, well so what, the power’s going away in about 10 milliseconds anyway.

  23. David Moisan says:

    One peripheral card that has caused NMI’s is a USB port card.  I put an Adaptec USB/Firewire card in a server to hook up an external HD that was bus powered (off the USB).

    The first time I spun up the drive, Windows disabled the USB port due to overcurrent.  The 2nd time gave me the bluescreen just as described.

    The real WTF is why the USB card didn’t have external power connectors so that it would not have to draw all its power off the PCI bus.  Honorable mention to the disk vendor who thought there was nothing wrong in having a USB powered device suck up all the power on a port.

  24. vince says:

    What good are your performance counters if they interrupt an ISR that really knew it couldn’t be interrupted at that point?

    Any ISR can be interrupted by a NMI at any time.  That’s the definition of a NMI, and your OS better handle it.

    Sure, you probably don’t want to enable NMI interrupts if you are trying to run some sort of real-time operating system, but we are talking general purpose OS right now.

    In any case, on x86 NMI interrupts themselves can be disabled by SMI interrupts, which have higher priority than NMI.  So if you really are worried about your ISRs you better hope your BIOS isn’t ever using SMM routines.

  25. ::Wendy:: says:

    slightly off topic:  once I said "blue screen of death" while in a call to IT support,  the support person got me to describe it,  then giggled and said "Blue Screen of Death?!  cute!"  I had to laugh.  When tech support knows less tech jargon than me, I know I’m in for a rough ride….

  26. BryanK says:

    because I sure don’t want a watchdog to interrupt a power failure handler.

    The NMI only triggers when the watchdog code fails to reset the APIC’s timer.  You don’t get periodic NMIs when using the NMI watchdog; you only get one when the kernel’s locked up.  And at that point, the power’s-going-to-die NMI may not matter much.

    (But I’m not sure whether it matters or not: when power fails and the NMI is generated, what should the OS do?  I can see hardware having to deal with imminent power loss, but not software.  I think pretty much anything that the OS should do to prepare for power failure (cancel disk activity, suspend itself to disk, etc.) will likely fail if the NMI watchdog is about to trigger.)

  27. Norman Diamond says:

    Wednesday, February 28, 2007 11:38 PM by vince

    [Norman Diamond:]

    > What good are your performance counters if they interrupt an ISR that really knew it couldn’t be interrupted at that point?

    Any ISR can be interrupted by a NMI at any time.

    No shi*.  Guess why I expressed outrage at the use of NMI for performance counters?  Guess why NMI should be used only for power failures?

  28. Rich says:

    I got a "Hardware Malfunction" on a Vista 6000 platform with NMI: Parity check / Memory Parity Error. Call stack as below:

    nt!RtlpBreakWithStatusInstruction

    nt!KiBugCheckDebugBreak+0x1c

    nt!KeEnterKernelDebugger+0x45

    hal!HalpNMIHalt+0xe2

    hal!HalBugCheckSystem+0x3d

    nt!WheaReportHwError+0x10c

    hal!HalHandleNMI+0x93

    nt!KiTrap02+0x136

    nt!READ_REGISTER_ULONG+0x6

    Any good suggestion or idea?
