Why did InterlockedIncrement/Decrement only return the sign of the result?


If you read the fine print of the InterlockedIncrement and InterlockedDecrement functions, you'll see that on Windows NT 3.51 and earlier and on Windows 95, the return value only matches the sign of the result of the increment or decrement. Why is that?

The 80386 instruction set supports interlocked increment and decrement, but the result of the increment/decrement operation is not returned. Only the flags are updated by the operation. As a result, the only information you get back from the CPU about the result of the operation is whether it was zero, positive, or negative. (Okay, you also get some obscure information like whether there were an even or odd number of 1 bits in the result, but that's hardly useful nowadays.)

Since those operating systems supported the 80386 processor, their implementations of the InterlockedIncrement and InterlockedDecrement functions were limited by the capabilities of the processor.

The 80486 introduced the XADD instruction which returns the original value of the operand. With this additional information, it now becomes possible to return the result of the operation exactly.

Windows NT 4 dropped support for the 80386 processor, requiring a minimum of an 80486, so it could take advantage of this instruction. Windows 98 still had to support the 80386, so it couldn't.

So how did Windows 98 manage to implement an operation that was not supported by the CPU?

Windows 98 detected whether you had a CPU that supported the new XADD instruction. If not, then it used an alternate mechanism which was mind-bogglingly slow: It called a driver whenever you wanted to increment or decrement a variable. The driver would then emulate the XADD instruction by disabling interrupts and performing the operation in locked memory. Since Windows 98 was a uniprocessor operating system, it didn't have to worry about a second processor changing the memory at the same time; all it needed to ensure was that the single processor didn't get interrupted while it was performing the "atomic" operation.

Comments (16)
  1. Serge Wautier says:

    Raymond, XADD link broken !

  2. Raymond Chen says:

    Hm, works for me. It’s not important; you can go to any old XADD page.

  3. Peter Montgomery says:

    Raymond,

    The link is broken for me as well. For anyone who cares, here’s a link that does work:

    http://courses.ece.uiuc.edu/ece291/archive/fall2001/books/labmanual/inst-ref-xadd.html

  4. Marc Wallace says:

    Eeeew. I really did not want to know that.

    Sounds exactly like what was done back in the x86/x87 days, when you weren’t sure you had an FPU. I still have memories of tracing way down into compilers, only to find "#ifdef IS_FPU" blocks where one side was one line, the other was inline ASM.

  5. James Day says:

    Hmmm… I replaced most of Microsoft’s Z-80 FORTRAN library, to use an AM9511 math coprocessor. Even converting between Microsoft and IEEE floating point format for each operation it was faster overall.

    Does this qualify for an eeeew?:)

    (This wasn’t a Microsoft project – Microsoft didn’t even know about it as far as I know.)

  6. Keith Moore [exmsft] says:

    Geesz, calling into kernel-mode and disabling interrupts just to manipulate a LONG seems a bit extreme.

    Hmm… I was just about to ask why the legacy 80386 support couldn’t just protect all InterlockedXxx() functions with a single global critical section. Then I realized that, under NT at least, critical sections are implemented using the InterlockedXxx() functions.

    GRRRRR…

  7. Raymond Chen says:

    Consider:

    initial value: L = 9

    Thread A: InterlockedIncrement(&L);

    Thread B: L = 3

    If A executes before B, then the result is L = 3; if B executes before A then the result is L = 4. Since both operations are atomic, these are the only possible results.

    But what if you used a critical section instead of a kernel trap? Then this would be possible:

    A: EnterCriticalSection

    A: load eax = L (loads 9)

    A: eax++ (eax = 10 now)

    B: L = 3

    A: L = eax (L = 10, overwriting 3 [not 9])

    A: LeaveCriticalSection

    A: return eax (returns 10)

    L now has the impossible value 10. Atomicity has been violated.

  8. Catatonic says:

    It must be painful to run Windows 98 on a 386, but I did used to make do with a 386sx running Windows 95.

  9. Antonio Tejada says:

    As Keith Moore suggests, for the Win9x case they could have implemented InterlockedXXXXXX using a simple xchg semaphore common to all InterlockedXXXXXX functions:

    InterlockedIncrement(DWORD* addend)

    mov ebx, [addend]

    loop:

    mov eax, 1

    ; lock prefix makes it also work in multiproc (yeah, not very useful in 9x)

    ; ilocksem is a variable which guarantees serialization to Interlocked funcs

    lock xchg [ilocksem], eax

    cmp eax, 0

    je loop

    ; Now it’s safe to increment addend

    mov eax, [ebx]

    inc eax

    mov [ebx], eax

    ; Free the semaphore

    mov [ilocksem], 0

    ; result already in eax, so we are ok

    }

    That would be so much faster than a ring switch.

    The problem with this approach, is this sentence of the MSDN:

    "The threads of different processes can use this mechanism if the variable is in shared memory."

    This would mean that [ilocksem] needs to be in memory area shared among all processes (in case two processes use InterlockedXXXXX over a shared "addend"), and also means that a process could manually set it to 1 and starve all processes trying to use InterlockedXXXXX functions, which is bad.

    Other approaches would be possible, like only share [ilocksem] among processes which already share some memory; for the rest of the cases [ilocksem] would be a per process variable.

    This would mean that a process can only hang (by setting [ilocksem] to 1) another process it shares memory with (which is not that bad).

  10. Antonio Tejada says:

    Oh, yes, and as Raymond says, serializing InterlockedXXXX functions won’t cover the case where a process accesses addend directly.

  11. Keith Moore [exmsft] says:

    Interesting, Raymond. I was focused on just the interlocked case.

    So I must ask: Does the system make *any* atomicity guarantees when mixing interlocked non-interlocked operations?

  12. Keith Moore [exmsft] says:

    Interesting, Raymond. I was focused on just the interlocked case.

    So I must ask: Does the system make *any* atomicity guarantees when mixing interlocked non-interlocked operations?

  13. Jordan Russell says:

    That page doesn’t seem to be fully accurate. It discusses the atomicity of 32-bit and 64-bit read/writes, but then states: "Reads and writes to variables of other sizes are not guaranteed to be atomic on any platform." On x86, reads/writes to 8-bit and properly-aligned 16-bit variables are also atomic, at least according to the Intel manual. Quote here: http://lists.freebsd.org/pipermail/freebsd-smp/2003-September/000334.html

  14. Raymond Chen says:

    A particular CPU may guarantee it but Win32 doesn’t.

  15. Jordan Russell says:

    I took "not guaranteed to be atomic on any platform" to mean "not guaranteed to be atomic on any CPU". I guess by "any platform" they really mean "32-bit or 64-bit Windows in general".

Comments are closed.