A bunch of us were going through some Windows crashes that people sent in by clicking the “Send Error Report” button in the crash dialog. And there were huge numbers of them that made no sense whatsoever. For example, there would be code sequences like this:
mov ecx, dword ptr [someValue] mov eax, dword ptr [otherValue] cmp ecx, eax jnz generateErrorReport
Yet when we looked at the error report, the
eax registers were equal!
There were other crashes of a similar nature,
where the CPU simply lots its marbles and
did something “impossible”.
We had to mark these crashes as “possibly hardware failure”. Since the crash reports are sent anonymously, we have no way of contacting the submitter to ask them follow-up questions. (The ones that the group I was in was investigating were failures that were hit only once or twice, but were of the type that were deemed worthy of close investigation because the types of errors they uncovered—if valid—were serious.)
One of my colleagues had a large collection of failures where the program crashed at the instruction
xor eax, eax
How can you crash on an instruction that simply sets a register to zero? And yet there were hundreds of people crashing in precisely this way.
He went through all the published errata to see whether any of them would affect an “xor eax, eax” instruction. Nothing.
He sent email to some Intel people he knew to see if they could think of anything. [Aside from overclocking, of course. – Added because people apparently take my stories hyperliterally and require me to spell out the tiniest detail, even the stuff that is so obvious that it should go without saying. I didn’t want to give away the story’s punch line too soon!] They said that the only [other] thing they could think of was that perhaps somebody had mis-paired RAM on their motherboard, but their description of what sorts of things go wrong when you mis-pair didn’t match this scenario.
Since the failure rate for this particular error was comparatively high (certainly higher than the one or two I was getting for the failures I was looking at), he requested that the next ten people to encounter this error be given the opportunity to leave their email address and telephone number so that he could call them and ask follow-up questions. Some time later, he got word that ten people took him up on this offer, and he sent each of them e-mail asking them various questions about their hardware configurations, including whether they were overclocking. [- Continuing from above aside: See? Obviously overclocking was considered as a possibility.]
Five people responded saying, “Oh, yes, I’m overclocking. Is that a problem?”
The other half said, “What’s overclocking?” He called them and walked them through some configuration information and was able to conclude that they were indeed all overclocked. But these people were not overclocking on purpose. The computer was already overclocked when they bought it. These “stealth overclocked” computers came from small, independent “Bob’s Computer Store”-type shops, not from one of the major computer manufacturers or retailers.
For both groups, he suggested that they stop overclocking or at least not overclock as aggressively. And in all cases, the people reported that their computer that used to crash regularly now runs smoothly.
Moral of the story: There’s a lot of overclocking out there, and it makes Windows look bad.
I wonder if it’d be possible to detect overclocking from software and put up a warning in the crash dialog, “It appears that your computer is overclocked. This may cause random crashes. Try running the CPU at its rated speed to improve stability.” But it takes only one false positive to get people saying, “Oh, there goes Microsoft blaming other people for its buggy software again.”