What's wrong with this code, part 12 - Retro Bad Code Answers

In the last article, I looked at a prototype code snippet to enter a system call.

But the code had a bug in it (no, really?  Why would I be asking what was wrong with it otherwise?).

Not surprisingly, it wasn't that hard to find: Peter Ibbotson found it in the first comment.  If you set SP before you set SS, then you introduce a window where a hardware interrupt could occur, pre-empt your code, and trash random pieces of user memory.
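To make the "trash random pieces of user memory" part concrete, here is a minimal sketch of the address arithmetic in Python.  The segment and offset values are made up for illustration (they are not from the original snippet); the only things assumed about the hardware are the 8086 real-mode rule that a physical address is segment * 16 + offset, and that an interrupt pushes three words (FLAGS, CS and IP) at SS:SP.

```python
def physical(segment, offset):
    """8086 real-mode address: segment * 16 + offset, wrapped to 20 bits."""
    return ((segment << 4) + offset) & 0xFFFFF


def interrupt_frame(ss, sp):
    """Addresses of the three words (FLAGS, CS, IP) pushed on interrupt entry."""
    addresses = []
    for _ in range(3):
        sp = (sp - 2) & 0xFFFF          # the stack grows down, one word at a time
        addresses.append(physical(ss, sp))
    return addresses


# Hypothetical values, chosen only for illustration.
USER_SS, USER_SP = 0x2000, 0x0100
KERNEL_SS, KERNEL_SP = 0x0800, 0x0400

# Where the frame should land once both registers point at the new stack.
intended = interrupt_frame(KERNEL_SS, KERNEL_SP)

# Where it lands if the interrupt hits after SP is loaded but before SS is:
# the new offset is combined with the old (user) segment.
in_window = interrupt_frame(USER_SS, KERNEL_SP)

print([hex(a) for a in intended])
print([hex(a) for a in in_window])
```

With these made-up numbers the frame written inside the window ends up in the user's segment, at an offset the user never set up as a stack, which is exactly the kind of corruption described above.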

Several people quite correctly pointed out that writing to the SS register would lock out interrupts for the next instruction, which would inherently protect the MOV SP instruction.

But in reality, the answer is a bit subtler than that.

You see, you can prevent hardware interrupts from occurring simply by turning off the "allow interrupts" flag by issuing a CLI instruction - that will disable all hardware interrupts (software interrupts don't matter, since you own the code).

And the x86 architecture mandates that after a software or hardware interrupt occurs the interrupt flag is disabled.  So the code in question is ALREADY called with interrupts disabled.

So why is all this important?

Because there's one interrupt that is NOT disabled by the CLI instruction: the NMI (or Non-Maskable Interrupt).  You can't disable NMIs under any circumstances.

So how can you switch stacks if an NMI could come along and interrupt your code?  Well, that's where the MOV SS behavior comes into play.  While the NMI can't be disabled, it CAN be deferred - and the MOV SS instruction defers the NMI until after the NEXT instruction has finished executing.
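Here is a toy model of that deferral, again in Python and again with made-up stack values.  It is a sketch of the rule as described above, not an emulator: it walks the two loads in a given order, treats the boundary right after the SS load as closed to interrupts, and reports any boundary where an NMI could arrive while SS:SP is half old and half new.  The nmi_windows name and the OLD/NEW dictionaries are my own inventions for the example.

```python
# Hypothetical stack values, chosen only for illustration.
OLD = {"ss": 0x2000, "sp": 0x0100}   # the caller's stack
NEW = {"ss": 0x0800, "sp": 0x0400}   # the stack being switched to


def nmi_windows(order):
    """Return the (SS, SP) pairs an NMI could observe while the pair is mixed.

    `order` lists the registers in the order they are loaded, for example
    ["ss", "sp"].  An NMI is assumed to be possible at every instruction
    boundary except the one immediately after the SS load.
    """
    regs = dict(OLD)
    windows = []
    for reg in order:
        regs[reg] = NEW[reg]
        inhibited = (reg == "ss")            # loading SS defers the NMI by one instruction
        consistent = regs in (OLD, NEW)      # both halves belong to the same stack
        if not inhibited and not consistent:
            windows.append((regs["ss"], regs["sp"]))
    return windows


print(nmi_windows(["sp", "ss"]))   # SP first: one unprotected boundary
print(nmi_windows(["ss", "sp"]))   # SS first: no unprotected boundary
```

Loading SP first leaves one boundary where the new SP is paired with the old SS; loading SS first leaves none, because the only boundary with a mixed pair is the one the processor protects.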

Btw, Universalis mentioned in the comments of the last post that this behavior wasn't present on the 8088.  My version of the 8088 hardware reference manual states differently:

 "A MOV (move) to segment register instruction and a POP segment register instruction are treated similarly: No interrupt is recognized until after the following instruction.  This mechanism protects a program that is changing to a new stack (by updating SS and SP).  If an interrupt were recognized after SS had been changed, but before SP had been altered, the processor would push the flags, CS and IP onto the wrong area of memory.  It follows from this that whenever a segment register and another value must be updated together, the segment register should be changed first followed immediately by the instruction that changes the other value."

So that's why the MOV SS needs to come first.  But why did people care, given that NMIs weren't that common anyway?

Well, it turns out that one very well-known OEM produced a product with a wireless keyboard (and little square keys) that tied the keyboard interrupt to the NMI line on the processor.  So every time the user hit a key on the keyboard they would be generating an NMI.

Another clever issue with interrupts had to do with a bug in (I believe) the first steppings of the 286 processor (it might have been the 8088 though).  As I'd mentioned before, when you executed an interrupt, the interrupt handler was called with interrupts disabled.  But this processor had a bug in it - if an interrupt (software, of course) occurred with interrupts disabled, then the processor would enable interrupts briefly during the interrupt translation.

So you had a situation where you could get a hardware interrupt executed even though you'd disabled interrupts.  Not pretty at all.  And before people ask, no, I don't remember how one worked around it :(

Kudos: Peter Ibbotson for being the first, but everyone else commenting pretty much agreed with him.