What’s wrong with this code, part 12 – Retro Bad Code Answers

In the last article, I looked at a prototype code snippet to enter a system call.

But the code had a bug in it (no, really? Why would I be asking what was wrong with it otherwise?).

Not surprisingly, it wasn’t that hard to find: Peter Ibbotson found it in the first comment. If you set SP before you set SS, you introduce a window in which a hardware interrupt could occur, pre-empt your code, and trash random pieces of user memory, because the interrupt frame gets pushed using the old SS combined with the new SP.

Several people quite correctly pointed out that writing to the SS register locks out interrupts for the next instruction, which inherently protects the MOV SP instruction as long as it immediately follows the MOV SS.

But in reality, the answer is a bit subtler than that.

You see, you can prevent hardware interrupts from occurring simply by clearing the “allow interrupts” flag with a CLI instruction. That disables all maskable hardware interrupts (software interrupts don’t matter, since you own the code).

And the x86 architecture mandates that after a software or hardware interrupt occurs, the interrupt flag is cleared.  So the code in question is ALREADY called with interrupts disabled.

So why is all this important?

Because there’s one interrupt that is NOT disabled by the CLI instruction: the NMI (or Non-Maskable Interrupt).  You can’t disable NMIs, under any circumstances.
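To make the distinction concrete, here is a minimal sketch of the masking rule described above. It is a toy model, not a CPU emulator: the class name `InterruptGate` and the labels `"maskable"` and `"nmi"` are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class InterruptGate:
    """Toy model of the interrupt-enable flag (IF). Hypothetical names."""
    if_flag: bool = True

    def cli(self):
        # CLI clears IF, masking ordinary hardware interrupts.
        self.if_flag = False

    def sti(self):
        # STI sets IF again.
        self.if_flag = True

    def enter_handler(self):
        # As described in the article: entering an interrupt handler
        # leaves the interrupt flag cleared.
        self.if_flag = False

    def accepts(self, kind):
        # An NMI ignores IF entirely; everything else needs IF set.
        if kind == "nmi":
            return True
        return self.if_flag
```

After `enter_handler()`, `accepts("maskable")` is false while `accepts("nmi")` is still true, which is exactly the gap that CLI cannot close.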

So how can you switch stacks if an NMI could come along and interrupt your code?  Well, that’s where the MOV SS behavior comes into play.  While the NMI can’t be disabled, it CAN be deferred: a MOV SS instruction defers the NMI until after the NEXT instruction has finished executing.
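The effect of instruction order can be sketched with a small simulation. This is a toy model under stated assumptions, not the original system call code: it tracks only SS, SP and a single pending NMI, and the names `ToyCpu` and `switch_stack` and the stack addresses are made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCpu:
    """Toy model: only SS, SP and one pending NMI. Not a real emulator."""
    ss: int
    sp: int
    nmi_pending: bool = False
    frames: list = field(default_factory=list)  # (ss, sp) used for each frame

    def _boundary(self, wrote_ss):
        # Interrupts are recognized between instructions, except right
        # after a write to SS, where recognition is deferred by one
        # instruction.
        if wrote_ss or not self.nmi_pending:
            return
        self.nmi_pending = False
        self.frames.append((self.ss, self.sp))  # where the frame is pushed

    def mov_ss(self, value):
        self.ss = value
        self._boundary(wrote_ss=True)

    def mov_sp(self, value):
        self.sp = value
        self._boundary(wrote_ss=False)


def switch_stack(ss_first):
    """Switch from stack 0x1000:0x0100 to 0x2000:0x0800 with an NMI pending."""
    cpu = ToyCpu(ss=0x1000, sp=0x0100)
    cpu.nmi_pending = True  # the NMI arrives during the first instruction
    if ss_first:
        cpu.mov_ss(0x2000)
        cpu.mov_sp(0x0800)
    else:
        cpu.mov_sp(0x0800)
        cpu.mov_ss(0x2000)
    return cpu.frames
```

With `switch_stack(ss_first=True)` the NMI frame is pushed at 0x2000:0x0800, the new stack. With `switch_stack(ss_first=False)` it is pushed at 0x1000:0x0800, the old segment with the new offset, which is memory that belongs to neither stack.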

Btw, Universalis mentioned in the comments of the last post that this behavior wasn’t present on the 8088.  My version of the 8088 hardware reference manual states differently:

 “A MOV (move) to segment register instruction and a POP segment register instruction are treated similarly: No interrupt is recognized until after the following instruction.  This mechanism protects a program that is changing to a new stack (by updating SS and SP).  If an interrupt were recognized after SS had been changed, but before SP had been altered, the processor would push the flags, CS and IP onto the wrong area of memory.  It follows from this that whenever a segment register and another value must be updated together, the segment register should be changed first followed immediately by the instruction that changes the other value.”

So that’s why the MOV SS needs to come first.  But why did people care, given that NMIs weren’t that common anyway?

Well, it turns out that one very well-known OEM produced a product with a wireless keyboard (and little square keys) that tied the keyboard interrupt to the NMI line on the processor.  So every time the user hit a key on the keyboard, they would generate an NMI.

Another clever issue with interrupts had to do with a bug in (I believe) the first steppings of the 286 processor (it might have been the 8088 though).  As I’d mentioned before, when you executed an interrupt, the interrupt handler was called with interrupts disabled.  But this processor had a bug in it – if an interrupt (software, of course) occurred with interrupts disabled, then the processor would enable interrupts briefly during the interrupt translation.

So you had a situation where you could get a hardware interrupt executed even though you’d disabled interrupts.  Not pretty at all.  And before people ask, no, I don’t remember how one worked around it 🙁

Kudos: Peter Ibbotson for being the first, but everyone else commenting pretty much agreed with him. 

Comments (8)

  1. Anonymous says:

    Don’t I feel old that I figured out the OEM product to which you refer — and that the only person that I knew who ever owned one was one of your father’s partners — and that the only reason I remember this is from playing Kings Quest (I) on it with his kids.

  2. Let me guess, John Hanna? I wasn’t trying to be TOO obscure on which OEM it was, but it was a rather "silly" architecture.

    I hadn’t realized you’d played KQ1 🙂

  3. POP SS is actually a little buggy in this respect. I believe there’s an errata on it for current Intel processors although I think the problem is primarily theoretical.

  4. Anonymous says:

    > But why did people care, given that NMI’s

    > weren’t’ that common anyway?

    Some less famous makers would have cared because of this reason: One in a million was next Tuesday. (Of course that was ancient history. In modern times one in a million is the next minute, or somewhere around there.)

  5. Anonymous says:

    Actually, it was Michael Whiteman. And it was a very long time ago. And yes, it was a silly architecture which I suppose is why that is the only one I ever saw outside a computer store. Although as I recall, the Radio Shack Color Computer was still on sale at that time.

  6. Norman, why did I know you’d mention that 🙂

    The thing is that on those machines (and on most current machines, I believe) the NMI line is tied to ground – it’s never generated.

    So coding to handle it was unusual.

  7. Anonymous says:

    You’re both right about the 8088. The manual says how it was supposed to work, but the first rev of the 8088 did indeed have a nasty bug where MOV SS failed to defer any interrupts (including NMI). This was fixed very early on, but my original IBM PC had an 8088 with this bug. It would really bite you if you figured you could just do MOV SS/MOV SP without wrapping it inside a CLI/STI. The CLI/STI fixed it for interrupts other than NMI, and a later rev of the 8088 fixed it for good. I remember replacing the 8088 in my PC after the fix came out.