Bad RAM is Not Good

I know that title must seem silly, but it’s the title of a mock KB article that floated around inside of product support here for a number of years. You’d be amazed by the number of blue screens we see where the culprit is bad or failing RAM.

 

It can be very hard to convince people that their RAM is failing, and understandably so. The fact that the system was working fine until recently, or only fails intermittently, makes the diagnosis very hard to swallow.

 

When you consider the dramatically improving speed and density of DRAM nowadays, it’s amazing to me that we have any reliability at all. It’s not uncommon for a mid-range desktop system to ship with 1GB of RAM, which translates to over 8.5 BILLION discrete bits. Having all of them work at the same time, ever, almost seems like a miracle. As it is, RAM fails all too often.
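
To make that number concrete, here is the arithmetic as a quick sketch. It assumes 1GB means 2**30 bytes, which is how RAM is normally sized; the variable names are just for illustration.

```python
# Count the individual bits in 1 GB of RAM (taking 1 GB as 2**30 bytes).
BYTES_PER_GB = 2**30
BITS_PER_BYTE = 8

total_bits = 1 * BYTES_PER_GB * BITS_PER_BYTE
print(f"{total_bits:,} bits")  # 8,589,934,592 bits
```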

 

A single bit error can bring the whole house of cards crashing down. Either your application will fail with an access violation of some sort, or your system will blue screen with any one of a half dozen or so error codes, depending on what the kernel was doing at the time.
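
As a rough illustration of why one bit matters so much, the sketch below flips a single bit in a pointer-sized value. The address is made up for the example and `flip_bit` is a hypothetical helper, not something from any debugger or tool.

```python
def flip_bit(value: int, bit: int) -> int:
    """Return value with the given bit position inverted."""
    return value ^ (1 << bit)

# Hypothetical 32-bit pointer value, chosen only for illustration.
good_pointer = 0x8054A3C0
bad_pointer = flip_bit(good_pointer, 28)

print(hex(good_pointer))  # 0x8054a3c0
print(hex(bad_pointer))   # 0x9054a3c0
print((bad_pointer - good_pointer) // 2**20, "MB away")  # 256 MB away
```

A pointer that is suddenly 256 MB off will very likely land on memory that isn’t mapped, or on somebody else’s data, and that is where the access violation or blue screen comes from.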

 

Finding and proving these problems in the past has been a nightmare, since you can’t really test RAM in a protected mode operating system without compromising its general-purpose nature and risking stability. Thankfully, our Online Crash Analysis group has a Memory Diagnostic tool designed to help ferret out these kinds of RAM problems.

 

The tool has a few key features that make it invaluable and perfect for this kind of testing. First is that it doesn’t run in Windows or any other general-purpose OS. It has its own loader and OS services, which allows it to take up the tiniest memory footprint possible. Second is that it has full control of all memory and processors on the system, and can run exhaustive tests on the RAM, looking for scenarios that can cause iffy RAM to return bad results.
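
To give a feel for what a pattern-based RAM test looks like, here is a toy sketch of a classic "walking ones" pass run against simulated memory. This is not how the Memory Diagnostic is implemented, and its actual test suite isn’t described here. `SimulatedRam` and its stuck-bit fault are assumptions invented for the example; a real test has to run against physical memory, outside a general-purpose OS.

```python
class SimulatedRam:
    """Toy byte-addressable memory; one cell can have a bit stuck at 0."""

    def __init__(self, size, stuck_addr=None, stuck_bit=0):
        self.cells = bytearray(size)
        self.stuck_addr = stuck_addr
        self.stuck_mask = 1 << stuck_bit

    def write(self, addr, value):
        if addr == self.stuck_addr:
            value &= ~self.stuck_mask & 0xFF  # the faulty bit never stores a 1
        self.cells[addr] = value

    def read(self, addr):
        return self.cells[addr]


def walking_ones_test(ram, size):
    """Write each single-bit pattern to every cell and verify it reads back."""
    failures = []
    for bit in range(8):
        pattern = 1 << bit
        for addr in range(size):
            ram.write(addr, pattern)
        for addr in range(size):
            if ram.read(addr) != pattern:
                failures.append((addr, bit))
    return failures


good = SimulatedRam(1024)
bad = SimulatedRam(1024, stuck_addr=513, stuck_bit=5)
print(walking_ones_test(good, 1024))  # []
print(walking_ones_test(bad, 1024))   # [(513, 5)]
```

Note that the bad cell passes seven of the eight patterns. Only the pattern that exercises the faulty bit exposes it, which is why exhaustive testing matters.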

 

The next time you have intermittent errors that just won’t seem to go away, give the memory diagnostic a whirl. One thing to note, though, is that you MUST run the extended test suite (hit ‘T’ after the diagnostic loads) to have confidence in the stability of your RAM. The extended tests turn off memory caching in the CPU, because the cache can hide problems in RAM. If you write a value into cache and read it back immediately, you have no idea if the RAM really stored and retrieved the right info. I can’t emphasize enough how often bad RAM is hidden by a well-meaning cache... use the extended tests!
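
Here is a small simulation of that masking effect. Everything in it is a made-up model for illustration: `FaultyRam` drops one bit on every write, and `CachedMemory` is a simple write-through cache that answers reads from its own copy when enabled. A real CPU cache is far more complicated, but the failure mode is the same.

```python
class FaultyRam:
    """Toy memory where bit 3 of every stored value is stuck at 0."""

    def __init__(self):
        self.cells = {}

    def write(self, addr, value):
        self.cells[addr] = value & ~0x08  # bit 3 is lost on the way in

    def read(self, addr):
        return self.cells.get(addr, 0)


class CachedMemory:
    """Write-through cache in front of RAM; reads are served from cache first."""

    def __init__(self, ram, enabled=True):
        self.ram = ram
        self.enabled = enabled
        self.lines = {}

    def write(self, addr, value):
        if self.enabled:
            self.lines[addr] = value  # the cache keeps the correct value
        self.ram.write(addr, value)

    def read(self, addr):
        if self.enabled and addr in self.lines:
            return self.lines[addr]  # RAM is never consulted
        return self.ram.read(addr)


def passes(memory, addr=0x1000, pattern=0xFF):
    """Write a pattern and report whether it reads back unchanged."""
    memory.write(addr, pattern)
    return memory.read(addr) == pattern


print(passes(CachedMemory(FaultyRam(), enabled=True)))   # True
print(passes(CachedMemory(FaultyRam(), enabled=False)))  # False
```

With the cache on, the write-then-read check passes even though the RAM behind it stored the wrong value. With the cache off, the same check catches the fault.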

 

In a later entry I’ll discuss how you can trace a possible bit flip in the kernel debugger, with some basic disassembly and examination of the registers.