If you can’t stand the heat…

You should get out of the PC kitchen.  This is another silent system killer that most people don’t want to acknowledge.  (Though I will admit it’s gotten easier over the last 2-3 years, as Intel, AMD, nVidia, and ATI have cranked up the wattage to the point where even the most stubborn have to recognize heat as a design issue.)  While not often a problem for brand-name computers (which are built with high tolerances and an eye towards good heat dissipation characteristics), it can kill a homemade desktop or server.  I won’t go into solutions, just explain why it’s hard to work on these problems when you’re on the other end of a phone line or e-mail thread.

This is another one that can be a nightmare to work on, at least from the perspective of someone troubleshooting the operating system.  The way it manifests itself is very similar to random memory problems: blue screens and access violations with no discernible pattern.  The way I usually go at it?  Open the case up and stick a big ol’ box fan pointing into the case.  Low tech?  Sure.  Effective?  Heck yeah.

One of the axioms we live by is that a software problem should be consistently reproducible.  Sometimes figuring out the parameters for reproducing a problem can be tricky, but if we see a closely related set of behaviors around multiple failures, we can feel good that it is something we can fix in software.  Bad hardware, on the other hand, plays by no one’s rules.
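To make the "closely related set of behaviors" idea concrete, here is a minimal sketch that measures how much a set of crashes clusters around one signature.  Everything in it is an assumption for illustration: the function name, the choice of (stop code, faulting module) as the signature, and the sample data are all hypothetical, and a real triage would look at far more than two fields.

```python
from collections import Counter


def dominant_signature_share(crashes):
    """Return (most common signature, fraction of crashes that share it).

    `crashes` is a list of hashable signatures; here each one is a
    hypothetical (stop_code, faulting_module) tuple.
    """
    if not crashes:
        raise ValueError("need at least one crash record")
    counts = Counter(crashes)
    signature, hits = counts.most_common(1)[0]
    return signature, hits / len(crashes)


# Hypothetical data: four of five crashes point at the same driver.
software_like = [("0x0A", "baddrv.sys")] * 4 + [("0x50", "ntoskrnl.exe")]

# Hypothetical data: every crash looks different from the last.
hardware_like = [
    ("0x0A", "ntoskrnl.exe"),
    ("0x50", "win32k.sys"),
    ("0x1E", "ntfs.sys"),
    ("0x7F", "tcpip.sys"),
    ("0x24", "ntfs.sys"),
]

print(dominant_signature_share(software_like))
print(dominant_signature_share(hardware_like)[1])
```

A high share suggests one repeatable software fault; a low share across many crashes is the "no discernible pattern" picture, which is a hint to look at hardware and nothing more than a hint.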

Someone taking a 30,000-foot view of the problem might say: “It is consistent.  I run for this long, and it always blue screens!”  When we dig into the details though, a different picture emerges.  What the CPU was doing at the moment of each crash, in terms of software, could be drastically different.  Running Notepad, SQL, Minesweeper, core OS functions, it doesn’t matter.  You have to look at the state of the system itself, and see if the CPU is doing exactly what it should be, or if RAM has conspicuous patterns that don’t match anything software would likely create, or a device is returning noise instead of data.  Getting to the root cause of a problem like this can be terribly difficult without the right tools, especially when you only have a snapshot of the system provided by a memory.dmp file, instead of the live (or more appropriately, freshly dead) system sitting in front of you.
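One commonly cited example of a RAM pattern that software is unlikely to create is a value that differs from what it should be by exactly one bit.  The post doesn’t name that pattern specifically, so treat this as an assumed illustration: the function names and the sample values below are hypothetical, and a single flipped bit is suggestive rather than proof.

```python
def flipped_bits(expected, observed):
    """Return the bit positions (0 = least significant) where two integers differ."""
    diff = expected ^ observed
    return [i for i in range(diff.bit_length()) if (diff >> i) & 1]


def looks_like_single_bit_flip(expected, observed):
    """True when the two values differ in exactly one bit."""
    return len(flipped_bits(expected, observed)) == 1


# Hypothetical values: what a pointer should have been vs. what the dump shows.
expected_ptr = 0x8054A3C0
observed_ptr = 0x8054E3C0

print(flipped_bits(expected_ptr, observed_ptr))
print(looks_like_single_bit_flip(expected_ptr, observed_ptr))
```

The hard part in practice is knowing what the expected value was in the first place, which is exactly why a lone memory.dmp makes this kind of call so difficult.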

Comments (11)

  1. Simon says:

    At my ex workplace we had a Compaq server.

    When the aircon failed, on two occasions, it stopped responding to SQL queries.

    SMB was fine, so was everything else.

    I spent ages trying to work out why it would stop working since my manager at the time wouldn’t believe it could be "just because the temperature had gone up". They expected it to crash completely, or turn off.

    Nothing else in the room failed, but as soon as we got the aircon fixed each time, the problem went away.

    It was 100% consistent, from a hardware perspective and from a user perspective.

    From a software perspective, it was like the memory had just… spontaneously corrupted on SQL Server.

    I told them to keep it cool, they’d never have replaced the memory, and I doubt it would have made much difference since it was a 1u server and there was no cooling around the memory itself.

  2. asdf says:

    My CPU temperature rises if I take my case off and blow a huge fan inside, but then again my couple of case fans are loud as hell. You may want to try CPU idle (www.cpuidle.de), it consistently lowers my temperature 10 degrees Celsius.

  3. JD on MX says:

    Intermittent failure: Carmen Crincoli of Microsoft points out a frequent cause of intermittent failures which cannot be produced on demand: heat buildup. Things can change if your cooling vents or fan get too much dust. It’s also a good essay…