AARDvarks in your code.

If there was ever a question that I’m a glutton for punishment, this post should prove it.

We were having an email discussion the other day, and someone asked:

Isn't there a similar story about how DOS would crash when used with [some non-MS thing] and only worked with [some MS thing]? I don't remember what the "thing" was though =)

Well, the only case I could think of where that was the case was the old AARD code in Windows.  Andrew Schulman wrote a great article on it back in the early 1990’s, which dissected the code pretty thoroughly.

The AARD code in Windows was code to detect when Windows was running on a cloned version of MS-DOS, and to disable Windows on that cloned operating system.  By the time that Windows 3.1 shipped, it had been pulled from Windows, but the vestiges of the code were left behind.  As Andrew points out, the code was obfuscated, and had debugger-hiding logic, but it could be reverse engineered, and Andrew did a great job of doing it.

I can’t speak as to why the AARD code was obfuscated, I have no explanation for that, it seems totally stupid to me.  But I’ve got to say that I totally agree with the basic concept of Windows checking for an alternative version of MS-DOS and refusing to run on it.

The thing is that the Windows team had a problem to solve, and they didn’t care how they solved it.  Windows decided that it owned every part of the system, including the internal data structures of the operating system.  It knew where those structures were located, it knew what the size of those data structures was, and it had no compunction against replacing those internal structures with its own version.  Needless to say, from a DOS developer’s standpoint, keeping Windows working was an absolute nightmare.

As a simple example, when Windows started up, it increased the size of MS-DOS’s internal file table (the SFT, that’s the table that was created by the FILES= line in config.sys).  It did that to allow more than 20 files to be opened on the windows system (a highly desirable goal for a multi-tasking operating system).  But it did that by using an undocumented API call, which returned a pointer to a set of “interesting” pointers in MS-DOS. It then indexed a known offset relative to that pointer, and replaced the value of the master SFT table with its own version of the SFT.  When I was working on MS-DOS 4.0, we needed to support Windows.  Well, it was relatively easy to guarantee that our SFT was at the location that Windows was expecting.  But the problem was that the MS-DOS 4.0 SFT was 2 bytes larger than the MS-DOS 3.1 SFT.   In order to get Windows to work, I had to change the DOS loader to detect when win.com was being loaded, and if it was being loaded, I looked at the code at an offset relative to the base code segment, and if it was a “MOV” instruction, and the amount being moved was the old size of the SFT, I patched the instruction in memory to reflect the new size of the SFT!  Yup, MS-DOS 4.0 patched the running windows binary to make sure Windows would still continue to work.

Now then, considering how sleazy Windows was about MS-DOS, think about what would happen if Windows ran on a clone of MS-DOS.  It’s already groveling internal MS-DOS data structures.  It’s making assumptions about how our internal functions work, when it’s safe to call them (and which ones are reentrant and which are not).  It’s assuming all SORTS of things about the way that MS-DOS’s code works.

And now we’re going to run it on a clone operating system.  Which is different code.  It’s a totally unrelated code base.

If the clone operating system isn’t a PERFECT clone of MS-DOS (not a good clone, a perfect clone), then Windows is going to fail in mysterious and magical ways.  Your app might lose data.  Windows might corrupt the hard disk.   

Given the degree with which Windows performed extreme brain surgery on the innards of MS-DOS, it’s not unreasonable for Windows to check that it was operating on the correct patient.

 

Edit: Given that most people aren't going to click on the link to the Schulman article, it makes sense to describe what the AARD check was :)

Edit: Fixed typo, thanks KC