So why didn't the Windows guys just remove the AARD code from the system?

In yesterday’s post I talked about the AARD code. One of the questions that perennially comes up is “Why on earth didn’t the Windows guys just remove that code?”

Well, the answer is that removing the AARD code from the product would likely have broken far more than jumping around it. The Windows guys could render the code ineffective by modifying a single JMP instruction to skip the AARD code, while removing the code would have meant that the Windows executable would change – every routine that followed it would be at a different offset, because the AARD code wasn’t there any more.
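To make the offset argument concrete, here is a toy sketch in Python. It is not the actual Windows patch: the image layout, the `REST` marker, and the helper names `build_image`, `patch_jump` and `remove_range` are all made up for illustration. The only real-world detail is that `0xEB` is the x86 opcode for a short relative JMP. The sketch compares rewriting one byte of a jump with deleting the code it guards.

```python
JMP_SHORT = 0xEB  # x86 opcode for a short relative jump (JMP rel8)


def build_image() -> bytes:
    """Build a toy 'executable': startup code, a check, then everything else."""
    # Hypothetical layout: 6 filler bytes, then a 2-byte JMP with displacement 0,
    # which simply falls through into the check that follows it.
    startup = bytes([0x90] * 6) + bytes([JMP_SHORT, 0x00])
    check = bytes([0xCC] * 16)            # stand-in for the code being disabled
    rest = b"REST" + bytes([0x90] * 28)   # marker so we can find this routine later
    return startup + check + rest


def patch_jump(image: bytes, jmp_offset: int, displacement: int) -> bytes:
    """Rewrite only the displacement byte of the short JMP at jmp_offset."""
    patched = bytearray(image)
    assert patched[jmp_offset] == JMP_SHORT, "expected a short JMP here"
    patched[jmp_offset + 1] = displacement
    return bytes(patched)


def remove_range(image: bytes, start: int, end: int) -> bytes:
    """Delete bytes [start, end) outright, shifting everything after them."""
    return image[:start] + image[end:]


original = build_image()
patched = patch_jump(original, 6, 16)      # jump over the 16-byte check
stripped = remove_range(original, 8, 24)   # cut the check out entirely

changed = sum(a != b for a, b in zip(original, patched))
print("bytes changed by the patch:", changed)
print("offset of REST:", original.index(b"REST"),
      patched.index(b"REST"), stripped.index(b"REST"))
```

In this toy image the patched build differs from the original in a single byte and the `REST` routine stays where it was, while the stripped build is 16 bytes shorter and `REST` has moved. Everything that was tested at its old address is now somewhere else.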

What’s the big deal with the Windows executable changing? Well, Chris Pratley explained it really well in his “More on Quality” post:

A last anecdote to leave you with. Even re-linking your code (not even recompile) can introduce a crashing bug. A few years ago, we were working on the release candidate for an Asian-language version of Word97. We thought we were done, and ran our last optimization on the build. We have some technology at Microsoft that profiles code usage and arranges the code modules so that they are in the executable in the optimal order to produce the best possible boot speed. After running this process which involves mainly the linker and is considered very safe, we put the code in escrow while the testers tried to break it. And they did - they found that it would crash on some of the machines they had when a certain feature was used. But the unoptimized build did not crash with the same steps and machine.

So we ran the "debug" build (a version of the build that has the same code as the "ship" build you all use, but includes extra information to allow debugging) and it also did not crash. We then tried to debug the ship build - but just running the ship build in the debugger caused the crash to go away. Some developers love this sort of mystery. One of the future inventors of Watson stopped by to get involved. No matter what they did, as soon as they tried to find out what was causing the crash, it went away. We went to get an ICE - a hardware debugger that might give us some clues.

Then we noticed that there was a pattern. The only machines that showed this crash had Pentium processors of 150MHz or less, and some of the 150MHz machines did not show the problem. We had a hunch, and searched the Intel web site for "errata" (their word for bugs - we prefer "issues"). Sure enough, there was a flaw in the Pentium chip which under certain obscure circumstances could cause a fault - there needed to be a JMP instruction followed exactly 33 bytes later by a branch, and the JMP instruction had to be aligned with a "page" (4096 byte block) boundary in memory. Talk about specific. The flaw had been fixed in later versions of the 150MHz generation, and all chips produced later.

Now, there are actually flaws in chips quite often (one or two are even famous). Those chips include software after all. But as they are discovered, chip manufacturers tell the people who write compilers about the bugs, and the compiler people modify their compilers to produce code that cannot hit the bug. So the flaws seem to just disappear (thank goodness for software!) It turned out we were using a slightly older version of the compiler which did not know about this flaw. An Intel rep at Microsoft confirmed the details of the problem, and rather than relinking or taking any chances whatsoever, we ran a check for this byte sequence, and manually moved three bytes in our 5MB executable to make sure those instructions were 34 bytes apart. Problem solved. So now when someone tells me a fix is "safe", I can tell them that no fix is truly safe - you really can never know.
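The kind of check Chris describes might look roughly like the Python sketch below. This is only an illustration under stated assumptions, not the tool the Word team used. It assumes the executable has already been disassembled into a list of `(address, kind)` pairs, it reads "aligned with a page boundary" as "address is a multiple of 4096", and the names `find_risky_pairs`, `PAGE_SIZE` and `GAP` are hypothetical.

```python
PAGE_SIZE = 4096  # page size mentioned in the anecdote
GAP = 33          # distance in bytes between the JMP and the later branch


def find_risky_pairs(instructions):
    """Return (jmp_address, branch_address) pairs matching the described pattern.

    `instructions` is a list of (address, kind) tuples, where kind is one of
    "jmp", "branch" or "other". Producing that list needs a real disassembler,
    which is outside the scope of this sketch.
    """
    branch_addresses = {addr for addr, kind in instructions if kind == "branch"}
    return [
        (addr, addr + GAP)
        for addr, kind in instructions
        if kind == "jmp"
        and addr % PAGE_SIZE == 0            # JMP sits on a page boundary
        and (addr + GAP) in branch_addresses  # branch exactly 33 bytes later
    ]


sample = [
    (100, "jmp"), (133, "branch"),     # 33 bytes apart, but not page aligned
    (4096, "jmp"), (4129, "branch"),   # page aligned and 33 bytes apart
    (8192, "jmp"), (8226, "branch"),   # page aligned, but 34 bytes apart
]
print(find_risky_pairs(sample))
```

Only the middle pair in `sample` is flagged. The last pair shows why moving the instructions to 34 bytes apart, as in the anecdote, is enough to get out of the pattern.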

I can personally attest to this problem. We hit exactly this bug in Exchange 5.5 – we had a build which consistently crashed on one test machine (and only on that test machine). We spent days trying to figure out the problem, but no amount of instrumentation would let us find the cause. Everything we changed made the problem go away, yet with the build left untouched the crash was absolutely reproducible. Eventually we stumbled on the erratum Chris mentioned, and it turned out that our code was hitting it. In our case, we were far enough from shipping that we fixed the problem by globally enabling the compiler flag that works around the erratum.

Anyway, given this history, it’s not at all surprising that the Windows team chose to disable the AARD check by simply jumping around it – they KNEW that Windows worked with the AARD check code in place, because they’d been testing it for weeks (or months). When you’re close to shipping a product, you don’t want to take a fix that could invalidate your entire test pass (which can literally take months to complete). Instead you find the lowest-impact fix that works around the problem.