AARDvarks in your code.

If there was ever a question that I’m a glutton for punishment, this post should prove it.

We were having an email discussion the other day, and someone asked:

Isn't there a similar story about how DOS would crash when used with [some non-MS thing] and only worked with [some MS thing]? I don't remember what the "thing" was though =)

Well, the only case I could think of that fit was the old AARD code in Windows.  Andrew Schulman wrote a great article on it back in the early 1990’s, which dissected the code pretty thoroughly.

The AARD code in Windows was code to detect when Windows was running on a cloned version of MS-DOS, and to disable Windows on that cloned operating system.  By the time that Windows 3.1 shipped, it had been pulled from Windows, but the vestiges of the code were left behind.  As Andrew points out, the code was obfuscated, and had debugger-hiding logic, but it could be reverse engineered, and Andrew did a great job of doing it.

I can’t speak as to why the AARD code was obfuscated.  I have no explanation for that; it seems totally stupid to me.  But I’ve got to say that I totally agree with the basic concept of Windows checking for an alternative version of MS-DOS and refusing to run on it.

The thing is that the Windows team had a problem to solve, and they didn’t care how they solved it.  Windows decided that it owned every part of the system, including the internal data structures of the operating system.  It knew where those structures were located, it knew what the size of those data structures was, and it had no compunction against replacing those internal structures with its own version.  Needless to say, from a DOS developer’s standpoint, keeping Windows working was an absolute nightmare.

As a simple example, when Windows started up, it increased the size of MS-DOS’s internal file table (the SFT, the table created by the FILES= line in config.sys).  It did that to allow more than 20 files to be opened on the Windows system (a highly desirable goal for a multi-tasking operating system).  But it did that by using an undocumented API call, which returned a pointer to a set of “interesting” pointers in MS-DOS.  It then indexed a known offset relative to that pointer, and replaced the value of the master SFT table with its own version of the SFT.

When I was working on MS-DOS 4.0, we needed to support Windows.  Well, it was relatively easy to guarantee that our SFT was at the location that Windows was expecting.  But the problem was that the MS-DOS 4.0 SFT was 2 bytes larger than the MS-DOS 3.1 SFT.  In order to get Windows to work, I had to change the DOS loader to detect when win.com was being loaded.  If it was, I looked at the code at an offset relative to the base code segment, and if it was a “MOV” instruction, and the amount being moved was the old size of the SFT, I patched the instruction in memory to reflect the new size of the SFT!  Yup, MS-DOS 4.0 patched the running Windows binary to make sure Windows would still continue to work.
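To make the loader trick a little more concrete, here is a minimal sketch of the check-then-patch logic in Python.  It is an illustration only, not the actual MS-DOS 4.0 code (which was assembly inside the loader).  The offset, the two size constants, and the function name are made-up placeholders, and the sketch assumes the instruction was the register-immediate form of MOV (one opcode byte in the range B8h to BFh followed by a 16-bit little-endian value), which the story above doesn't actually specify.

```python
# Illustrative sketch only: the offset and sizes below are placeholders,
# not the values MS-DOS 4.0 or win.com actually used.
OLD_SFT_SIZE = 0x0035             # hypothetical "old" size baked into the program
NEW_SFT_SIZE = OLD_SFT_SIZE + 2   # the new SFT is described as 2 bytes larger
PATCH_OFFSET = 0x10               # hypothetical offset from the base code segment

# Assumption: the instruction is 8086 "MOV r16, imm16", encoded as one
# opcode byte in 0xB8..0xBF followed by the immediate, little-endian.
MOV_R16_IMM16 = range(0xB8, 0xC0)


def patch_sft_size(image: bytearray, offset: int = PATCH_OFFSET) -> bool:
    """Patch the immediate in place if the expected instruction is there.

    Returns True if the image was patched, False if it was left untouched.
    """
    if offset + 3 > len(image):
        return False
    opcode = image[offset]
    immediate = int.from_bytes(image[offset + 1:offset + 3], "little")
    # Only touch the bytes when both the opcode and the operand match;
    # anything else means this is not the binary we expected.
    if opcode not in MOV_R16_IMM16 or immediate != OLD_SFT_SIZE:
        return False
    image[offset + 1:offset + 3] = NEW_SFT_SIZE.to_bytes(2, "little")
    return True
```

The point of the double check (opcode and operand) is that the patch is a no-op on any binary that doesn't look exactly like the one it was written for, which is about the only thing that makes a hack like this tolerable.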

Now then, considering how sleazy Windows was about MS-DOS, think about what would happen if Windows ran on a clone of MS-DOS.  It’s already groveling internal MS-DOS data structures.  It’s making assumptions about how our internal functions work, when it’s safe to call them (and which ones are reentrant and which are not).  It’s assuming all SORTS of things about the way that MS-DOS’s code works.

And now we’re going to run it on a clone operating system.  Which is different code.  It’s a totally unrelated code base.

If the clone operating system isn’t a PERFECT clone of MS-DOS (not a good clone, a perfect clone), then Windows is going to fail in mysterious and magical ways.  Your app might lose data.  Windows might corrupt the hard disk.   

Given the degree with which Windows performed extreme brain surgery on the innards of MS-DOS, it’s not unreasonable for Windows to check that it was operating on the correct patient.

 

Edit: Given that most people aren't going to click on the link to the Schulman article, it makes sense to describe what the AARD check was :)

Edit: Fixed typo, thanks KC

Comments

  • Anonymous
    August 12, 2004
    Windows worked on my copy of DRDOS for starters.

    But anyway, you have this all wrong. It's not your job to ensure that you are running on the right version of DOS, it's the responsibility of the clone OS (DR DOS) to make sure that they behave the same way that you do.

    Same way that the onus is on AMD to keep compatibility with Intel, and if you want to code something that's highly specific to Intel's architecture, that's wholly up to you, but please don't then police the architecture you run on.

    The most you have a right to do is say 'this program might not work correctly'.
  • Anonymous
    August 12, 2004
    matthew, you're right, I forgot to mention that it was disabled in the final release.

    And in principle, you're right, it would be dr-dos's responsibility.

    But there's a philosophical shift that has to occur.

    The original philosophy: "If we trash the user's machine because we're running on a non-MSDOS system, it's our fault, because we're sleazy."

    The new philosophy: "If we trash the user's system because we're running on a non-MSDOS system, it's the non-MSDOS system's fault, because they're not compatible enough".

    Both are valid, one puts the onus on Windows, the other on the clone manufacturer. Eventually they decided to put the onus on the clone instead of checking for the clone. The right decision IMHO. But I still understand (and agree with) the reasons behind the original decision.
  • Anonymous
    August 12, 2004
    Are there known cases where a customer had a problem (say corrupted data) because of running with DR-DOS? If MS didn't make it clear that only MS-DOS was supported, and a customer got burned because of it... well, then the original idea might have been better.
  • Anonymous
    August 12, 2004
    The comment has been removed
  • Anonymous
    August 13, 2004
    My how times have changed. Ten years ago (or whenever it was) this was the central core of one of the Microsoft anti-trust investigations. Page after page was written about how this proved that Microsoft was anti-competitive.

    Today? Someone defending what Microsoft did at the time barely gets any comments at all.

    Can I apply this same principle to the current set of anti-trust investigations? Will it come out in 10 years that what Microsoft is accused of doing actually made sense?

    Seems to me that fame (or infamy in this case) is transient. And that's a good thing...



  • Anonymous
    August 13, 2004
    If it's reasonable for Windows to refuse to run on DR-DOS because it isn't MS-DOS, then it's also reasonable for Windows to:
    i) Say so, clearly. "Non-fatal error detected: error #2726" is hardly a model of straightforwardness.
    ii) Check for DR-DOS the documented way, using INT 21h AX=4452h.

    If, on the other hand, the AARD code is actually checking for behaviour that is known to cause problems and/or data corruption (rather than just a non-MS DOS which might or might not work) then presumably someone at Microsoft knows what those problems are, and could divulge them.
  • Anonymous
    August 13, 2004
    John,
    Checking INT 21/4452h works for DR-DOS. Does it work for other MS-DOS clones (there were a couple at the time, DR-DOS was the most popular)? It's easier to check for things that you know are true about MS-DOS than to look for a specific clone operating system.


  • Anonymous
    August 13, 2004
    The comment has been removed
  • Anonymous
    August 13, 2004
    mschaef, my explanation wasn't made with 15 years of hindsight, I knew why the AARD code was a good idea at the time.

    I do NOT understand why it was encrypted. That makes absolutely no sense whatsoever, IMHO, and only feeds the suspicion that it was anti-competitive (if it wasn't anti-competitive, there would be no reason to hide the check).

    And Microsoft DID say that it was to protect the customer at the time. The public just didn't buy it. It didn't explain that the AARD code existed because Windows was a sleazy application that trampled on MS-DOS's data structures, all it said was that Windows was tightly coupled with MS-DOS.

    It was pulled because of the firestorm of criticism about it (not the first time that's happened, remember Intel's CPU ID?). But that doesn't mean that the idea wasn't a good one.

    I actually wrote this article because AARD is (in my opinion) a shining example of Microsoft going out of its way to protect the customer.

    Think about it: The customer has a system that worked perfectly running DR-DOS. Then they installed Windows. And all of a sudden, their hard disk got corrupted. Whose fault is that? It's WINDOWS' fault - it was the last application running on the machine.

    The Windows team put the AARD check in to PROTECT customers. Not to hurt them. They pulled it out because people said it was anti-competitive (and a strong argument could be made to say that it was), but pulling it out left customers at risk of data corruption.

    The Windows 3.1 team COULD have worked to ensure that Windows ran on all MS-DOS clones, you're right. But we're talking about an OS designed to run on machines with significantly less than 1M of RAM, it was far easier to just test with MS-DOS and just say that MS-DOS was required to run Windows. The test effort to get the OS working on cloned platforms would be significant.

    Let's give a modern example. Is it the responsibility of the Microsoft Word team to ensure that their application works under WINE?

    After all, WINE is a clone of the Win32 environment, so if Word doesn't work in that cloned environment, it's Word's fault, not the fault of the cloned environment. Right?

    Nope. It's the WINE team's responsibility to ensure that their platform is compatible enough with the Windows platform to ensure that Word works.

    Similarly, it was DR-DOS's responsibility to ensure that DR-DOS was compatible enough to guarantee that Windows (don't forget: Windows was "just" another DOS application) ran on DR-DOS. The AARD check was a safety belt to attempt to ensure that Windows only ran on operating systems that were tested with it. When the Windows team pulled it out, they moved the onus of supporting cloned versions of MS-DOS from Windows to the clone OS vendor.
  • Anonymous
    August 13, 2004
    The comment has been removed
  • Anonymous
    August 13, 2004
    I completely agree that the XOR business and debugger-jimmying (good turn of phrase btw) was unconscionable. Btw, you're quoting an allegation of DRI in the comment "Microsoft had excluded DRI" - of course DRI claimed that Microsoft was being malicious, that's a court document where DRI is the plaintiff and Microsoft's the defendant. Microsoft DID prevent ANY 3rd party OS vendors from seeing prereleases of Win 3.1, the judge indicated that if Microsoft had a version of MS-DOS that was designed to run with Windows it would be ok.

    And the conclusion of the discussion was that Microsoft DID have technical issues with DR-DOS. From the same document you cite:

    "I tracked down a serious incompatibility with DR-DOS 6 - They don't use the 'normal' device driver interface for >32M partitions. Instead of setting the regular START SECTOR field to 0ffffh and then using a brand new 32-bit field the way MS-DOS has always done, they simply extended the start sector field by 16 bits.

    This seems like a foolish oversight on their part and will likely result in extensive incompatibilities when they try to run with 3rd party device drivers"

    Microsoft's request for a summary judgement was denied not because Microsoft had technical reasons for detecting DR-DOS, but because of other non technical issues (which were significant, don't get me wrong).

    The other thing to keep in mind is that the ruling cited was a motion denying a request for summary judgement - in other words, the judge was denying Microsoft's requests to throw the claims out, not ruling that Microsoft's defenses were without merit.
  • Anonymous
    August 13, 2004
    "Btw, you're quoting an allegation of DRI"

    The statement that Microsoft didn't give betas to competing OS vendors isn't preceded by "it is alleged" or "Caldera asserts" like the rest of the paragraph, and it gets repeated at the start of the next section; so I think we can believe it.

    "And the conclusion of the discussion was that Microsoft DID have technical issues with DR-DOS"

    - and had come up with a workaround.

    'It is possible to make Bambi work, assuming we can come up with a reasonably safe method for detecting DRD6.'
  • Anonymous
    August 13, 2004
    The comment has been removed
  • Anonymous
    August 13, 2004
    "It's possible to make Microsoft's MS-DOS drivers work on a non MS-DOS system. But is it necessary to make them work?"

    That depends if it could have been fixed sufficiently reliably that it didn't fall over if confronted with a subsequent version of DR DOS without this feature. Unlike, say, those weird workarounds that MS-DOS 6 does if it doesn't like the OEM ID on a hard drive partition, which go funny if the last-but-one character of the OEM ID isn't a dot.
    This is digressing rather; but I suspect that the reason DR DOS did the >32M partition support differently from MS-DOS is that it was copying Compaq DOS 3.31, which does the same thing and (of course) predates MS-DOS 4.

    "The good news is that this was 15 years ago, the world and this company are very different places now. "

    The sheer number of jokey replies I could make to that...

    I think I'll settle for my 4th choice - "So AARD is the next CPLed project to go on Sourceforge then?"
  • Anonymous
    August 14, 2004
    The comment has been removed
  • Anonymous
    August 15, 2004
    In the base note:

    > If the clone operating system isn’t a
    > PERFECT clone of MS-DOS [...]
    > Windows might corrupt the hard disk.

    Interesting observation. Now can you say why Windows 95 and Windows 2000 corrupt the hard disk when there's no worry about MS-DOS clones?

    For IDE disks for Windows 95A there was the patch REMIDEUP. Windows 95B didn't need this patch.

    For SCSI disks for Windows 95A and Windows 95B there was no patch. It took months and 11,000 yen just for train tickets to vendors in order to track this down, but now it takes just 10 minutes to reproduce. Why do all the fdisk commands of various Windows 95 versions create overlapping partitions on SCSI disks?
    (Exception: if the SCSI adapter has a BIOS, i.e. if it's a desktop machine and the SCSI adapter has a BIOS, then fdisk doesn't always corrupt the drive.)

    With Windows 2000, it's external IDE disks connected through PCMCIA IDE adapters. If the PCMCIA card and the drive are connected before Windows 2000 boots, then Windows 2000 detects phantom corruption during booting, runs CHKDSK, and proceeds to create real corruption. Why?
    (This doesn't happen if booting is completed before connecting the devices.)

    In one case there's some interaction with 16-bit code but it's not a DOS clone, it's an integral part of Windows 95. In the other case there's no 16-bit code. Why weren't measures taken to prevent disk corruption?
  • Anonymous
    August 15, 2004
    Norman,
    Neither Win95 or Win2000 mess with the internals of the underlying operating system.

    If there's a corruption issue in the operating system that's caused by a bug in the OS, then that happens. It's not good, but it happens.

    If there's a corruption issue in the operating system that's caused because a user application messed up an OS internal data structure, that's significantly different.

    In the Win95/Win2000 corruption, was it an end-user application that was corrupting the system, or was it a hardware combination that wasn't tested? It sounds like the latter, not the former.
  • Anonymous
    August 16, 2004
    "The Windows 3.1 team COULD have worked to ensure that Windows ran on all MS-DOS clones, you're right. But we're talking about an OS designed to run on machines with significantly less than 1M of RAM, it was far easier to just test with MS-DOS and just say that MS-DOS was required to run Windows."

    Don't get me wrong, I do understand what you're saying, about why it might technically make sense.

    What seems inconsistent to me (aside from the encryption, etc.) is that this doesn't match Microsoft's long established practices with respect to supporting 3rd party software. Despite targeting relatively small machines, even Windows 3.1 had compatibility flags to turn on buggy behavior on a per module basis. ( http://support.microsoft.com/default.aspx?scid=kb;en-us;82860&Product=win ) Despite the fact that it's lower level, adding/testing special case code for DR-DOS doesn't seem that much more significant than the compatibility flags, particularly given that the problems with DR-DOS were understood. Thus, we're left with the fact that DR-DOS, unlike all the software listed on the web page above, is unique in that it had a Microsoft engineer spend time specifically developing tricky, hidden code to display a strange error, rather than spending time on actually working around the problem.

    If the distinction is that DR-DOS is an "Operating System" and Lotus Notes was an "Application" then it's just another example of using market leverage to control competition in particular ways.
  • Anonymous
    August 16, 2004
    8/16/2004 7:53 AM Larry Osterman

    > Neither Win95 or Win2000 mess with the
    > internals of the underlying operating
    > system.

    Huh?

    In the W95 case, maybe you mean that the protected mode part doesn't mess with the real mode part, but that is still wrong. Microsoft released patches for IDE disks for W95A because the protected mode part DOES mess with the real mode part. Microsoft didn't release patches for SCSI disks for any W95 version because, well, we'll get to that.

    In the W2000 case, what can you possibly mean?

    > In the Win95/Win2000 corruption, was it an
    > end-user application that was corrupting the
    > system,

    No end-user application is involved at all. A fresh installation of W95 on any vendor's PC, attach a PCMCIA SCSI adapter made by any vendor, and attach a SCSI hard disk made by any vendor. If the PCMCIA SCSI adapter is made by anyone other than Adaptec then the vendor's SCSI driver has to be installed; Adaptec's driver was built into W95. The fdisk command is broken. The solution is to use the vendor's partitioner utility that the vendor intended for use under Windows 3.1. Too many vendors didn't know that W95's fdisk was broken so they told customers to use fdisk instead of the vendors' own utilities under W95.

    With a fresh installation of W2000, it isn't even necessary to use a vendor's driver. Windows 2000's built-in IDE driver handles PCMCIA IDE adapters. But something in W2000 is badly broken during boot time. Interesting that for Windows XP Microsoft provided a downloadable patch containing the characters W2K in its name, but for Windows 2000 Microsoft did not provide one. But I didn't test the thing under Windows XP because fortunately I was able to return the device during its warranty period, which was before XP came out.

    > or was it a hardware combination that wasn't
    > tested?

    No kidding. Microsoft built some drivers (and fdisk command) into its OSes, didn't test them, and produced disastrous results.

    Here in your blog entry you point out the possible disastrous effects of mixing earlier Windows systems with other vendors' DOS clones, so I ask why Microsoft didn't test Microsoft's own OSes in order to avoid the exact same disastrous effects?
  • Anonymous
    August 17, 2004
    Norman, there IS no underlying operating system when running Win2K and Win95. They are complete solutions, from the disk drivers to the user interface.

    Windows 3.1 was not a complete solution. It relied on the OS for file services and other stuff (memory management, etc).

  • Anonymous
    August 17, 2004
    8/16/2004 7:53 AM Larry Osterman

    > Neither Win95 or Win2000 mess with the
    > internals of the underlying operating
    > system.

    8/17/2004 8:07 AM Larry Osterman

    > Norman, there IS no underlying operating
    > system when running Win2K and Win95.

    Right. Now that this is out of the way, can you say why Windows 95 and Windows 2000 were not tested to prevent corruption of hard disks? Your base note explained how bad things could be if Windows 3.1 were combined with another vendor's DOS instead of yours. Things could have been exactly as bad as things actually were (and are) with your company's Windows 95 or Windows 2000 all by themselves. You KNOW how bad disk corruption is. One question is why weren't these tested, but a bigger question is why weren't patches released after customers got to do your testing for you?
  • Anonymous
    August 17, 2004
    Norman, that's a truely silly question, IMHO.

    Of course the OS was tested for disk corruption. Not every possible scenario was tested, and obviously a bug was missed. But the testing was done.



  • Anonymous
    August 17, 2004
    The comment has been removed
  • Anonymous
    August 18, 2004
    That's true only because Win95 booted with DOS 7.0 - once it came up, all the DOS code was thrown away (unless there were 3rd party drivers like the CD-ROM driver mentioned) present.
  • Anonymous
    August 18, 2004
    8/17/2004 5:26 PM Larry Osterman

    > Norman, that's a truely silly question,
    > IMHO.

    Compare it to the silliness of this. A retrospective essay on the AARD code asserted that there were valid technical reasons for the AARD code. One of the asserted reasons is that a mixture of code from Microsoft and other vendors might have been equally disastrous as code from Microsoft actually was all by Microsoft's self.

    I agree that the possibility was there, the effects would have been disastrous, and Microsoft's code was (and likely still is in Windows 2000) disastrous.

    But I think I'm not going to believe that this was one of the reasons for the AARD code.

    > Of course the OS was tested for disk
    > corruption. Not every possible scenario was
    > tested, and obviously a bug was missed.

    In Windows 95 obviously several bugs were missed, because there was more than one downloadable patch for W95A for bugs causing IDE disk corruption (in fact I probably named the wrong one a few days ago). Why was there no downloadable patch for the bug causing SCSI disk corruption? Microsoft did not remain permanently ignorant about this bug.

    In Windows 2000 maybe there were only one or two missed bugs, maybe. They are more subtle than the Windows 95 case.

    > But the testing was done.

    Indeed there are additional clues besides your statement, demonstrating that testing was done. One SCSI card vendor had performed testing before I got hit by it, but unfortunately the first four SCSI cards I experimented with were from vendors who didn't know about it. The fact became more widely known later, other SCSI card vendors tested, AND Microsoft almost surely tested. Why didn't Microsoft provide a downloadable patch?

    (Actually I have seen clues about an answer to this too, your company knowingly and willfully cared not a fig for the amount of damage your company did to end users outside of the US. But if you have any more useful answers, please let's hear them.)

    For Windows 2000 and PCMCIA IDE hard disks, observe that testing was done for Windows XP while Windows 2000 was still on the market. Why was a fix released for XP and not for 2000?

    There is one case that might or might not be a Windows 2000 bug. If the user attaches a SCSI disk that had been fdisk'ed by Windows 95, and opens Windows 2000 disk manager, Windows 2000 accurately determines that one of the logical partitions is corrupt but fails to determine that a second one is corrupt. But if the user tells Windows 2000 disk manager to delete the corrupt logical partition, it actually deletes both, without warning. It is understandable how this might have been missed in testing. I don't know if this testing was actually done or not. I almost want to ask if you know, but then you'll answer this instead of the much higher priority questions that I asked above. Please, the others are far more important.
  • Anonymous
    August 18, 2004
    Sorry for two in a row. This appeared while I was editing my previous posting.

    8/18/2004 9:29 AM Larry Osterman replied to mschaef

    > That's true only because Win95 booted with
    > DOS 7.0 - once it came up, all the DOS code
    > was thrown away (unless there were 3rd party

    Some of the DOS code was not thrown away. The protected mode and real mode code interacted with each other, resulting in at least two bugs for which Microsoft provided downloadable fixes and at least one for which Microsoft didn't. In some cases 3rd party drivers interacted too but they were not at fault. The same bugs were manifest when the user tried devices whose drivers were all built into Windows 95.
  • Anonymous
    January 11, 2008
    I read Raymond Chen's blog from time to time, somewhat because he's a really conversational writer, and somewhat because he's got lots of interesting things to say about the history of Windows. I was amused by this post about MS-DOS...