My, what strange NOPs you have!


While cleaning up my office, I ran across some old documents which reminded me that there are a lot of weird NOP instructions in Windows 95.

Certain early versions of the 80386 processor (manufactured prior to 1987) are known as B1 stepping chips. These early versions of the 80386 had some obscure bugs that affected Windows. For example, if the instruction following a string operation (such as movs) used addresses of the opposite size from those in the string instruction (for example, if you performed a movs es:[edi], ds:[esi] followed by a mov ax, [bx]), or if the following instruction accessed an opposite-sized stack (for example, if you performed a movs es:[edi], ds:[esi] on a 16-bit stack, and the next instruction was a push), then the movs instruction would not operate correctly. There were quite a few of these tiny little "if all the stars line up exactly right" chip bugs.

Most of the chip bugs only affected mixed 32-bit and 16-bit operations, so if you were running pure 16-bit code or pure 32-bit code, you were unlikely to encounter any of them. And since Windows 3.1 did very little mixed-bitness programming (user-mode code was all-16-bit and kernel-mode code was all-32-bit), these defects didn't really affect Windows 3.1.

Windows 95, on the other hand, contained a lot of mixed-bitness code since it was the transitional operating system that brought people using Windows out of the 16-bit world into the 32-bit world. As a result, code sequences that tripped over these little chip bugs turned up not infrequently.

An executive decision had to be made whether to continue supporting these old chips or whether to abandon them. A preliminary market analysis of potential customers showed that there were enough computers running old 80386 chips to be worth making the extra effort to support them.

Everybody who wrote assembly language code was alerted to the various code sequences that would cause problems on a B1 stepping, so that they wouldn't generate those code sequences themselves, and so they could be on the lookout for existing code that might have problems. To supplement the manual scan, I wrote a program that studied all the Windows 95 binaries trying to find these troublesome code sequences. When it brought one to my attention, I studied the offending code, and if I agreed with the program's assessment, I notified the developer who was responsible for the component in question.

In nearly all cases, the troublesome code sequences could be fixed by judicious insertion of NOP instructions. If the problem was caused by "instruction of type X followed by instruction of type Y", then you could just insert a NOP between the two instructions to "break up the party" and sidestep the problem. Sometimes, the standard NOP would end up classified as an instruction of type Y, so you had to insert a special kind of NOP, one that was not of type Y.

For example, here's one code sequence from a function which does color format conversion:

        push    si          ; borrow si temporarily

        ; build second 4 pixels
        movzx   si, bl
        mov     ax, redTable[si]
        movzx   si, cl
        or      ax, blueTable[si]
        movzx   si, dl
        or      ax, greenTable[si]

        shl     eax, 16     ; move pixels to high word

        ; build first 4 pixels
        movzx   si, bh
        mov     ax, redTable[si]
        movzx   si, ch
        or      ax, blueTable[si]
        movzx   si, dh
        or      ax, greenTable[si]

        pop     si

        stosd   es:[edi]    ; store 8 pixels
        db      67h, 90h    ; 32-bit NOP fixes stos (B1 stepping)

        dec     wXE

Note that we couldn't use just any old NOP; we had to use a NOP with a 32-bit address override prefix. That's right, this isn't just a regular NOP; this is a 32-bit NOP.
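To make the pattern concrete, here is a minimal sketch of the kind of check a scanning tool could perform. It is not the program described earlier, whose rules are not given here. It assumes a naive byte-level scan rather than real instruction decoding, so it will report false positives wherever a string-op opcode value happens to appear as data or as an operand byte. The names find_unguarded_string_ops, STRING_OPS, and NOP32 are made up for illustration.

```python
# Naive illustration only: matches raw byte values, does not decode instructions.
# All names here are hypothetical, not taken from any real tool.

STRING_OPS = {0xA4, 0xA5, 0xAA, 0xAB}  # movsb, movsw/movsd, stosb, stosw/stosd
NOP32 = bytes([0x67, 0x90])            # address-size prefix + NOP, as in the listing above


def find_unguarded_string_ops(code):
    """Return offsets of string-op opcode bytes not directly followed by 67h 90h."""
    hits = []
    for offset, byte in enumerate(code):
        if byte in STRING_OPS and code[offset + 1:offset + 3] != NOP32:
            hits.append(offset)
    return hits


# One possible 16-bit encoding of "stosd es:[edi]" is 66h 67h ABh (assumed here).
guarded = bytes([0x66, 0x67, 0xAB, 0x67, 0x90])
unguarded = bytes([0x66, 0x67, 0xAB, 0xFF, 0x0E, 0x00, 0x00])

print(find_unguarded_string_ops(guarded))
print(find_unguarded_string_ops(unguarded))
```

A real tool would need to disassemble the code and compare the address sizes of neighboring instructions, and a human would still have to review each hit, as described above.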

From a B1 stepping-readiness standpoint, the folks who wrote in C had a little of the good news/bad news thing going. The good news is that the compiler did the code generation and you didn't need to worry about it. The bad news is that you also were dependent on the compiler writers to have taught their code generator how to avoid these B1 stepping pitfalls, and some of them were quite subtle. (For example, there was one bug that manifested itself in incorrect instruction decoding if a conditional branch instruction had just the right sequence of taken/not-taken history, and the branch instruction was followed immediately by a selector load, and one of the first two instructions at the destination of the branch was itself a jump, call, or return. The easy workaround: Insert a NOP between the branch and the selector load.)

On the other hand, some quirks of the B1 stepping were easy to sidestep. For example, the B1 stepping did not support virtual memory in the first 64KB of memory. Fine, don't use virtual memory there. If virtual memory was enabled, if a certain race condition was encountered inside the hardware prefetch, and if you executed a floating point coprocessor instruction that accessed memory at an address in the range 0x800000F8 through 0x800000FF, then the CPU would end up reading from addresses 0x000000F8 through 0x000000FF instead. This one was easy to work around: Never allocate valid memory at 0x80000xxx. Another reason for the no man's land in the address space near the 2GB boundary.
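The address arithmetic in that last erratum amounts to losing the top bit of the address. The sketch below models only that arithmetic, under the assumption that the misread address is the intended one with bit 31 cleared, which matches the two ranges quoted above. It does not model the paging, prefetch race, or coprocessor conditions, and the function names are hypothetical.

```python
# Model of the address arithmetic only; the triggering conditions are not simulated.
# Function and constant names are hypothetical.

AFFECTED_FIRST = 0x800000F8
AFFECTED_LAST = 0x800000FF


def b1_effective_address(intended):
    """Address that could be read instead of `intended` when the erratum fires."""
    if AFFECTED_FIRST <= intended <= AFFECTED_LAST:
        return intended & 0x7FFFFFFF  # bit 31 is lost
    return intended


def in_reserved_page(address):
    """True for 0x80000xxx, the page the workaround never allocates."""
    return (address & 0xFFFFF000) == 0x80000000


print(hex(b1_effective_address(0x800000F8)))
print(in_reserved_page(0x800000F8))
```

Because every affected address lies inside the 0x80000xxx page, keeping that one page unallocated means no valid access can ever land in the affected range.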

I happened to have an old computer with a B1 stepping in my office. It ran slowly, but it did run. I think the test team "re-appropriated" the computer for their labs so they could verify that Windows 95 still ran correctly on a computer with a B1 stepping CPU.

Late in the product cycle (after Final Beta), upper management reversed their earlier decision and decided not to support the B1 chip after all. Maybe the testers were finding too many bugs where other subtle B1 stepping bugs were being triggered. Maybe the cost of having to keep an eye on all the source code (and training/retraining all the developers to be aware of B1 issues) exceeded the benefit of supporting a shrinking customer base. For whatever reason, B1 stepping support was pulled, and customers with one of these older chips got an error message when they tried to install Windows 95. And just to make it easier for the product support people to recognize this failure, the error code for the error message was Error B1.

Comments (34)
  1. Falcon says:

    Interesting fact: the common x86 NOP, 90h, is actually an alias for XCHG AX, AX or XCHG EAX, EAX (depending on the CS default operand and address size).

  2. Sunil Joshi says:

    I live in perpetual fear (paranoia) of processor errata. The whole idea just makes ice run down my spine. I'm glad to see that MS was on the case even in the early 1990s.

  3. Mark Jonson says:

    So after B1 support was pulled, why weren't the binaries recompiled to remove the now-unnecessary NOPs?

    [They're so cute when they're young. -Raymond]
  4. AIDS says:

    Maybe because the cost of re-testing everything was too high, and the delay with the Windows 95 release caused by this?

  5. Yuhong Bao says:

    "Maybe the testers were finding too many bugs where other subtle B1 stepping bugs were being triggered. "

    Well, for one thing, some of these CPUs also had a 32-bit multiply bug, with no way to work around it. There was a double sigma stamped on CPUs that did not have it, and a "For 16-bit operations only" stamped on CPUs that did.

  6. Joel Franklin says:

    Raymond – if you ever compile all your DOS / Win 3.1 / Win 95 anecdotes into a book, I'll buy it.

  7. Ken Hagan says:

    Loved the idea of a 32-bit NOP, especially the idea that a 16-bit NOP wouldn't have not done enough.

    Whilst we're collecting, Intel have played this trick too. The PAUSE instruction is encoded as "REP: NOP".

  8. lefty says:

    I just want to say that I love this kind of article.  Thanks.

  9. NB says:

    This kind of post makes me realize I was probably too hard on Windows 95 back in the day.

  10. Evan says:

    @Joel Franklin: Just in case you didn't know, he has compiled (and occasionally expanded) many of the posts from here into a book, and it's awesome. It's got a bunch of anecdotes, but it's not just that; substantial portions of the book are spent on the more "how-to" coding entries. There's a link to the right (or search in-page for "holy cow"). I like the "anecdotes" parts more, and I think they are awesome enough to get the book. Especially if you haven't read back into the archives; there are some real gems. (See the bonus chapters and sample chapters at http://www.informit.com/…/product.aspx)

    @lefty: And no kidding, best article in a while.

  11. Anonymous Coward says:

    This article is awesome in the same way as urban exploration.

  12. Gabe says:

    How would you even know if you had a B1 stepping? Could you read the ceramic package (assuming there's no heatsink), or did you have to run a bunch of regular instructions to see which ones failed?

  13. GrumpyYoungMan says:

    @Gabe

    Could you read the ceramic package?

    Yes, the stepping is marked on the package for all processors.

    did you have to run a bunch of regular instructions to see which ones failed?

    No, there is a CPUID instruction that provides that information.

  14. Ben Voigt [Visual C++ MVP] says:

    Bonus topic: errata concerning the CPUID instruction.

  15. vince says:

    No, there is a CPUID instruction that provides that information.

    Not on 386s.  cpuid didn't happen until some later 486s if I recall correctly.

    A quick google search will turn up code that will let you differentiate B1 386s from others.  I'm guessing that code depends on the very errata that make you want to be able to tell.

  16. Joshua says:

    The funny form of these NOPs reminds me of ES:RET. It's prime, and it executes.

  17. booie says:

    Wow, you should clean up your office more often, that's 1987! :3

    Good post though, very informative!

  18. Yuhong Bao says:

    "A quick google search will turn up code that will let you differentiate B1 386s from others.  I'm guessing that code depends on the very errata that make you want to be able to tell."

    Yep, the 386 and later did put a family-model-stepping value in EDX on reset but not all BIOSes even stored the values. One trick BTW was to determine the existence of the IBTS/XBTS instructions. Unfortunately, Intel reused the opcodes for these instructions for the CMPXCHG instruction in the early steppings of the 486, which broke this trick. Ultimately, Intel was forced to change it to a different opcode.

  19. Myria says:

    I once got a crash report from our tech support department that hundreds of customers were running into.  The crash report made no sense to most of our developers, so they sent it to me, since I'm known for assembly language prowess.  The thing that struck me as odd was that the crash was an illegal opcode exception, yet the log had the bytes at EIP and the opcodes there were definitely legal (a "mov edi, edi" used as a "nop" in fact).

    The crash reports all named the same processor model: Pentium 4.  My spidey-sense was tingling, and I brought up the errata sheet.  Sure enough, there was an errata where under certain conditions, opcodes whose memory address is divisible by 128 would get decoded incorrectly, potentially causing an illegal opcode exception.  The address of the crash was 0 mod 128 all right.

    The next build of our product, the problem "mysteriously" disappeared.  I ran it through IDA again, and that opcode was no longer at an address divisible by 128.

  20. Cheong says:

    Wow, this article is very educational.

    I remember that one magazine apparently noticed these NOP instructions in system code, and went on to explain that it was written like this to support some advanced form of CPU cooling mechanism.

    The "no man's land" at the 0x80000xxx addresses has puzzled me for a long time too. Books describing memory layout often leave the hole there, and I could see the hole myself using MSD.exe too, but no book I read actually said anything about it.

  21. Yuhong Bao says:

    In fact, I remember reading that the 32-bit multiply bug was impossible to even reliably *detect* from software due to it only triggering under certain conditions.

  22. Some of these must have been fun bugs to track down if you found your code triggering it … "OK, I stored the values 16 and 256 in the structure at 0x800000F8, then branched to my division code … so why on earth am I getting division by zero? It's not zero, it's 16! @$%!" I lost enough hair debugging a BugCheck (BSOD) which happened every time I called ZwClose before eventually deducing the anti-virus software's filter driver was breaking because it hadn't done something important during that particular ZwCreateFile path; when you have the processor itself *sometimes* decoding individual instructions wrongly or loading the wrong address … scary.

    With at least one of the B1 bugs being triggered across a branch boundary, scanning all the generated code for instances must have been virtually impossible, even without the linker inlining Win95 introduced (for example, if you had a call to a function which just returned a value from a known location, in Win95 the call would actually be replaced with an equivalent load instruction).

    I do love these anecdotes. Mark Jonson: Just because these NOPs *were* needed on B1 stepping cores does not mean they were *only* needed there … no doubt some of these errata, or related ones close enough to be fixed by the same workaround, persisted in C1 steppings as well. Trying to strip out all these workarounds after the final beta would have been a huge risk, all to shave off literally individual clock cycles. I imagine the surplus NOPs disappeared over time as later releases (OSR1 and OSR2 in particular) re-built and re-did QA fully, without considering B1 stepping errata.

  23. Bonus Chapter says:

    (See the bonus chapters and sample chapters at http://www.informit.com/…/product.aspx)

    Where are the bonus chapters? I registered (the book is great, but for my money more war stories/less Windows programming advice would be better) and see nothing.

  24. Worf says:

    I've always wondered how Intel a) gets these errata, b) finds the instruction sequence that triggers it, and c) fixes them.

    Are they bugs that occur due to the chip layout? Actual transistor-level bugs? Microcode issues?

    Larry Osterman also detailed an odd one blogs.msdn.com/…/214338.aspx

  25. Falcon says:

    Here's a good one:

    groups.google.com/…/6c8ed874d60beb83

    I sincerely hope that link works. If it doesn't, do a Google search for "6 megabyte lisp", including the quotation marks.

  26. icabod says:

    I seem to recall when Win95 came out that it seemed to take up "a lot of space" (not much really, just more than we were used to with Win3.x). It all makes sense now I know that the code included bonus NOPs :)

    @ Bonus Chapter: The bonus chapters are at the bottom of the "Sample Content" tab on that page.

    Incidentally, regarding the book, will there be a second edition with the bonus chapters/errata, and maybe more?  A lot has happened in the last four years.

    [There can't be a second edition until the first edition sells out. Odds of that are awfully slim. -Raymond]
  27. Bob says:

    @Worf:  I'm not an Intel person, but I do chip design[1]. We pull out tens-of-thousands of errata during our design process. Most are prior to fabrication. Chip functional design is superficially similar to software design. (but with single-assignment & very hard real-time constraints on every code path)  The major benefit we have is a golden model that perfectly predicts the (range of) correct behavior for every case. So, we do lots of targeted random code generation & comparison.

    On fabricated chips, the same is possible. Run code designed to punish specific parts of the architecture which checks the results. Once a mistake is found, you can re-run in your software model to find the exact problem.

    Bugs are primarily logic design errors. Mask layout & transistor level design can be formally proved to match the logic design. This leaves you with hard-to-model physics effects, which generally fix themselves by slowing the clock speed. In the lab, slowly increasing clock speed can help locate the problem by looking for the first fail.

    [1] My credits include for what is probably the non-x86 CPU that gets the most SW support from Microsoft. It has its own custom OS and large teams of Microsoft programmers (and Microsoft subsidiaries) who release dozens of end-user SW titles yearly.

  28. ERock says:

    The KB article indicates Intel actually sold 386s with a sticker that read "for 16-bit use only". Anyone have a picture of those 386s or their label? I think it would make a great picture to put up on my wall. :)

  29. Evan says:

    "Where are the bonus chapters? I registered (the book is great, but for my money more war stories/less Windows programming advice would be better) and see nothing."

    They're not the easiest things to find, admittedly. And no need to register.

    Just click on the "sample content" tab and scroll to the very bottom. There are two sample chapters from the book and the two bonus chapters linked as PDFs.

  30. Worf says:

    @Bob: Ah, cool. Too bad Windows CE doesn't get as much love. IIRC, you would be working in one of the Studio buildings… B or C was it…

    My IC design was limited to VLSI courses in university 10 years ago… ah yes, layout and all that FPGA goodness. Good to know how it's done – I've read many an errata sheet in my work (I work with embedded hardware. Errata sheets are the first thing I look at when things behave oddly). But reading through some errata always made me wonder since the oddest ones were so utterly specific I couldn't help but wonder how they were discovered and the real causes.

  31. DWalker says:

    I'll bet those 80386s aren't as large or heavy as the 80521 Pentium Pro (200 MHz) that I have sitting on my desk.  It seems to weigh a pound (actually, it's just over 3 ounces).

    According to CPU-world.com, the 80386 marked with "double sigma" is a "bug-free version".

  32. peterchen says:

    "They're so cute when they're young" – My mind's now stuck on the image of a puppy developer with huge brown eyes making chirpy noises meaning "let's play 'recompile'".

    (You can give yourself a star for that)

Comments are closed.
