Self-Monitoring and Diagnosing Hardware

This is something that most people in the mainframe business have taken for granted for decades now. To the PC world, it’s relatively new…and to the PC OS world, even newer.

Starting with the Pentium and Pentium Pro, Intel introduced the Machine Check Architecture (MCA), a way for the CPU and other components of the system to report internal inconsistencies to software, so that the operating system can decide how best to protect the user and data and/or report the problem. For full information on how this works, see the IA-32 Architecture Software Developer’s Manual Volume 3: System Programming Guide, Chapter 14.

Now, that’s all well and good, but unfortunately Windows didn’t support anything but the most basic level of reporting until Server 2003. Before that release, we would stop the system if a fatal error occurred, but not much else. With Server 2003, however, the reporting mechanism became more sophisticated.

If your processor and platform support it, we can read and log events into the event log to tell you more clearly what happened. This might seem redundant, but not all Machine Check Exceptions (MCE) are fatal. Some are just informative. For example, you could have one particular region of memory that keeps returning corrected parity errors. Corrected is great, no problems with your data. The fact that they keep happening? Usually bad news. Go get it replaced!
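To make the corrected-versus-fatal distinction a little more concrete, here is a minimal Python sketch that classifies a raw 64-bit status value from a machine check bank. The flag bit positions follow the IA32_MCi_STATUS layout described in the Intel manual mentioned above. The function names and the severity labels are made up for illustration, and this is not how Windows itself processes these events.

```python
# Flag bits in IA32_MCi_STATUS, per the Intel SDM Volume 3 register layout.
VAL = 1 << 63   # the register holds a valid logged error
OVER = 1 << 62  # an earlier error was overwritten before it was read
UC = 1 << 61    # hardware could NOT correct the error
EN = 1 << 60    # error reporting was enabled for this bank
PCC = 1 << 57   # processor context may be corrupt


def classify_mci_status(status):
    """Return a rough severity label for a raw status value (hypothetical helper)."""
    if not status & VAL:
        return "no error logged"
    if status & UC:
        # Uncorrected errors with corrupt context are the ones that stop the system.
        return "uncorrected, context corrupt" if status & PCC else "uncorrected"
    return "corrected"


def mca_error_code(status):
    """Extract the architectural MCA error code from bits 15:0."""
    return status & 0xFFFF
```

Under these assumptions, a value with only VAL and EN set classifies as "corrected", which is the informative kind worth watching for repeats, while a value with VAL, UC and PCC set classifies as "uncorrected, context corrupt".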

The worst-case scenario, of course, is an unrecoverable error. Those are reported with a STOP 0x0000009C. If you encounter one of these, it’s best to contact your OEM instead of Microsoft. There’s really nothing we can do. This is a hardware problem, always. We might be able to help interpret, but it’s not likely. If the system is critical, get to hardware swapping.