How does InterlockedIncrement work internally?


The functions in the Interlocked family perform atomic operations on memory. How do they do it?

It depends on the underlying CPU architecture. For some CPUs, it's easy: The x86, for example, has direct support for many interlocked operations by means of the LOCK prefix (with the bonus feature that LOCK is implied for the XCHG opcode). The ia64 and x64 also have direct support for atomic load-modify-store operations.
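
For example, here is a minimal sketch of an interlocked increment built directly on the LOCK prefix, written as GCC-style inline assembly for 32-bit x86 (int32_t stands in for the Win32 LONG; this is an illustration, not the actual Windows implementation):

#include <stdint.h>

// Atomically add 1 to *value and return the new value. The LOCK
// prefix makes the XADD read-modify-write atomic with respect to
// other processors.
int32_t AtomicIncrementSketch(int32_t volatile *value)
{
  int32_t original = 1;  // amount to add; XADD leaves the old value here
  __asm__ __volatile__("lock xaddl %0, %1"
                       : "+r"(original), "+m"(*value)
                       : : "memory");
  return original + 1;   // old value plus one = the value we stored
}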

Most other architectures break the operation into two parts, known as load-link/store-conditional. The first part (load-link) reads a value from memory and instructs the processor to monitor the memory address to see if any other processors modify that same memory. The second part (store-conditional) stores a value to memory provided that no other processors have written to the memory in the meantime. An atomic load-modify-store operation is therefore performed by reading the value via load-link, performing the desired computation, then attempting a store-conditional. If the store-conditional fails, then start all over again.

LONG InterlockedIncrement(LONG volatile *value)
{
  LONG lOriginal, lNewValue;
  do {
    // Read the current value via load-link so we will know if
    // somebody has modified it while we weren't looking.
    lOriginal = load_link(value);

    // Calculate the new value
    lNewValue = lOriginal + 1;

    // Store the value conditionally. This will fail if somebody
    // has updated the value in the meantime.
  } while (!store_conditional(value, lNewValue));
  return lNewValue;
}

(If this looks familiar, it should. You've seen this pattern before.)
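
Here is the same pattern expressed in terms of InterlockedCompareExchange rather than load-link/store-conditional; compute_new_value is a hypothetical helper standing in for whatever calculation you want to perform atomically:

LONG InterlockedSomething(LONG volatile *value)
{
  LONG lOriginal, lNewValue;
  do {
    // Remember the value we started from.
    lOriginal = *value;

    // Calculate the new value (hypothetical helper).
    lNewValue = compute_new_value(lOriginal);

    // Publish the new value only if the variable still holds the
    // value we started from. If somebody changed it in the meantime,
    // try again.
  } while (InterlockedCompareExchange(value, lNewValue, lOriginal) != lOriginal);
  return lNewValue;
}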

Now, asking the CPU to monitor a memory address comes with its own gotchas. For one thing, the CPU can monitor only one memory address at a time, and its memory is very short-term. If your code gets pre-empted or if a hardware interrupt comes in after your load_link, then your store_conditional will fail because the CPU got distracted by the shiny object known as a hardware interrupt and totally forgot about that memory address it was supposed to be monitoring. (Even if it managed to remember it, it won't remember it for long, because the interrupt handler will almost certainly execute its own load_link instruction, thereby replacing the monitored address with its own.)

Furthermore, the CPU might be a little sloppy in its monitoring and monitor not the address itself but the cache line. If somebody modifies a different memory location which happens to reside in the same cache line, the store_conditional might fail even though you would expect it to succeed. The ARM architecture allows a processor to be so sloppy that any write in the same block of 2048 bytes can cause a store_conditional to fail.
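
One practical consequence: if you control the data layout, you can give a heavily-updated shared variable its own cache line so that unrelated writes don't land in the monitored region. A minimal sketch, assuming a 64-byte granule and C11's _Alignas (actual reservation granules vary by processor, up to those 2048 bytes on ARM):

// Sketch only: isolate the counter on its own (assumed) 64-byte line.
typedef struct {
  _Alignas(64) LONG volatile value;  // start the counter on a 64-byte boundary
  char padding[64 - sizeof(LONG)];   // and keep neighbors out of the rest of it
} IsolatedCounter;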

What this means for you, the assembly-language coder who is implementing an interlocked operation, is that you need to minimize the number of instructions between the load_link and store_conditional. For example, InterlockedIncrement merely adds 1 to the value. The more instructions you insert between the load_link and store_conditional, the greater the chance that your store_conditional will fail and you will have to retry. And if you put too much code in between, your store_conditional will never succeed. As an extreme example, if you put code that takes five seconds to calculate the new value, you will certainly receive several hardware interrupts during those five seconds, and your store_conditional will always fail.

Bonus reading: Why did InterlockedIncrement/Decrement only return the sign of the result?

Comments (23)
  1. Axel says:

    ARM is always hyped as being so much better than x86 but the relaxed memory semantics of it just make everything so much more difficult to code for in the age of multicores.

    Intel made the right choice of being very strict about that even if it means you lose a bit of performance.

  2. Joshua says:

    What's really annoying is the old Dekker's algorithm and friends don't work on ARM.

  3. laonianren says:

    @Joshua.  Dekker's algorithm doesn't work on Intel either, but it fails less frequently than on ARM so you might not notice:

    jakob.engbloms.se/…/65

  4. j b says:

    My favorite hardware solution in terms of low conceptual complexity (and I would think implementation was fairly simple, too): The Norsk Data ND-5000 superminis ("VAX class" machines) had a SOLO instruction that turned off all interrupts and started an 8-bit counter incremented every clock cycle. A TUTTI instruction reenabled interrupts and stopped the counter.  A counter overflow caused a fatal interrupt to the current process, so it had at most 255 clock cycles to complete its protected operations. The overhead was one clock cycle each for the SOLO and TUTTI instructions; all instructions executed at full speed. If the protected operations were too complex to complete in 255 cycles, you would have to build some sort of critical region or monitor, protected by a semaphore read/update in SOLO-TUTTI sequences, and the overhead would double to 4 clock cycles, 2 for entering the region, 2 for leaving. Still quite moderate.

    Interrupts signaled between SOLO and TUTTI were queued up. In theory the worst case extra latency was 256 cycles, but in practice, SOLO-TUTTI sequences were more like a dozen clock cycles or less (and usually, external interrupts were processed by a 16-bit front end with super-low interrupt latency, down to 900 ns for a complete context switch).

    The best thing was that even though SOLO disabled interrupts (for a few clock cycles), these instructions were unprivileged and could be used by any process in user space, without incurring any context switch – not even a function call, if you didn't want to. SOLO and TUTTI were available as keywords in the systems programming language to any application programmer (other languages usually had a macro facility and could use inline assembly, looking very much the same), making the cost of protecting updates of shared data structures very low.

    Even the Intel LOCK prefix is unprivileged, but you need help from your compiler to generate it; you can use it only with specific instructions, and you can't extend the lock to an instruction sequence, e.g. relinking a linked list, which requires several instructions. And it complicates instruction decoding by making the instruction format even more complex in structure.

    In principle, SOLO-TUTTI could be abused by "terrorists" creating a loop of a SOLO instruction, doing dummy work for 230-240 cycles, before a TUTTI. They then could keep interrupts disabled for a significant fraction of their time slice. However, the SOLO would allow handling of all interrupts arriving during this loop iteration, before the next loop iteration, so the effect would probably be far less than expected by the students trying to show their cleverness. (Who else would care to do anything of that sort?)

  5. The IA32 implements interlocked operations for multiple nodes/CPUs, using the cache coherency protocol. The data being modified is in the local cache only. The cache coherency protocol makes sure no other processor contains a copy of the stale cache line.

    This makes interlocked operations no more expensive than a simple non-interlocked load-modify-store operation. There is a bit of cost, though, because interlocked operations are serializing.

  6. Myria says:

    InterlockedIncrement on x86-32 is lock xadd, then inc to give the caller the result it expects.

    Somewhat relevant: On x86-32, InterlockedIncrement64 is implemented using a loop that reads the variable into two registers, increments the 64-bit value, then attempts to do a lock cmpxchg8b (i.e., InterlockedCompareExchange64) to write the new value if the old value has not changed since it was read.  If the cmpxchg8b fails, it retries the loop (see the sketch at the end of this comment).  This works because livelock is essentially impossible (probability tends toward zero) in modern environments.  The ABA problem is there, but is irrelevant because you're merely incrementing an integer.  You generally aren't going to mess up if another thread manages to wrap the counter between you reading the value and writing the incremented version (and with a 64-bit counter, good luck with that anyway).

    Intel has new processors coming (or out?) with transactional memory features that look a lot like load-linked/store-conditional but for multiple memory locations.  It's hard to say what's going to happen with that.  It's certainly not going to be widely used any time soon.
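
    In pseudocode, that loop looks roughly like this (a sketch, not the actual Windows source):

    LONGLONG MyInterlockedIncrement64(LONGLONG volatile *value)
    {
      LONGLONG lOriginal, lNewValue;
      do {
        lOriginal = *value;        // read the current 64-bit value (two registers on x86-32)
        lNewValue = lOriginal + 1; // compute the incremented value
        // lock cmpxchg8b under the covers: store the new value only if
        // the variable still holds what we read; otherwise go around again.
      } while (InterlockedCompareExchange64(value, lNewValue, lOriginal) != lOriginal);
      return lNewValue;
    }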

  7. Myria says:

    @Axel: Yes, strict memory ordering is nice, but it is really only an x86 thing these days.  I'd much rather have my code able to run on many different systems.

    @j b: "TUTTI"?  That's just as bad as PowerPC's "EIEIO" instruction =^_^=

  8. SimonRev says:

    Wow, this article is particularly relevant to me today, as we are reviewing customer code that is misusing InterlockedIncrement on an ARM Windows CE platform.

  9. @Myria:

    I don't see the ABA problem in that implementation of II64. Where is it?

  10. j b says:

    @Myria,

    You never played in an orchestra, I presume.

  11. >They then could keep interrupts disabled for a significant fraction of their time slice.

    I suppose TUTTI will handle all pending interrupt requests, so the only effect will be 240 clocks of interrupt latency (and timeslice occasionally extended by up to 240 clocks).

  12. WndSks says:

    @Myria On Win95 it is LOCK INC and not XADD; this is also why the COM rules for the AddRef/Release return value are the way they are…

  13. j b says:

    @alegr,

    I meant to write exactly what you point out, but wrote SOLO instead of TUTTI, which messed the thing up. Sorry. You are right; the only "damage" would be as you say. But when this machine came out, me and my fellow students were hoping for much more…

    One point that came to mind: Each CPU had a private cache and usually private RAM. "Multiport" RAM could also be shared between several CPUs, managed by dedicated arbitrating hardware. I take for granted that during a SOLO-TUTTI sequence, an access to the shared memory would lock it for access through other ports until TUTTI; that could actually slow down other CPUs competing for access to the same shared memory. I never heard this being raised as an issue, so programmers probably avoided it by not accessing the shared RAM in SOLO-TUTTI sequences. (I think all customers running multiport RAM modules were highly specialized single-application environments with expert programmers who knew what they were doing :-).)

  14. @skSdnW:

    Because as we both know, XADD was added with the 486, COM predates that, and Windows 95 could run on a 386.

  15. voo says:

    @Axel If you care about the underlying platform memory model you're most likely doing it wrong anyhow. Most modern languages have their own memory model by now that dictates what you can and can't expect. By the same definition, no, Dekker's algorithm is not broken; the author of the linked article just doesn't understand what volatile does and does not do in C/C++. If your language tells you that volatile reads/writes also have a uni-directional memory barrier, the algorithm will work just fine. C++ doesn't, Java5+ and .NET do.

    Really the only people who have to care about memory models these days are compiler writers, and now let me go back to reading the AArch64 specification, because some people actually have to write those backends for the rest ;)

  16. DWalker says:

    I always thought that the silicon itself could be engineered to be able to increment a value at a storage location.  But I suppose that involves actually changing some bits in memory (RAM) with the appropriate semantics (carries, etc).  Is it true that the only things that RAM chips can do are read and write?

    Still, if this could be implemented directly in RAM hardware, wouldn't that be useful?

  17. @DWalker: If you mean the actual RAM attached to the processor (whether internal or external RAM) and not processor registers, then no, the RAM cannot do anything but store and retrieve the bit value.

    Now, theoretically, could a memory chip be designed to increment and decrement a value in place? Sort of. The chip could internally do a read/modify/write without exposing the value on the bus to run it through some add/subtract block. But there are a couple of issues:

     1) how do you instruct the memory to do this? There would need to be some standard that specified how to make this work. That might be ok for DRAM-based memory systems such as DDRx but is practically impossible if your memory is SRAM (as some embedded systems use)

     2) how does the memory know the correct width of the element you want to increment?

    I am sure there are more issues that I am too lazy to think of at the moment.

    [The biggest issue is that the increment doesn't happen in RAM. It happens on the CPU itself inside the L1 cache. So make your RAM as fancy as you like. It doesn't matter. The value will be incremented inside the CPU's L1, and then it will get flushed out some unspecified time later. I guess you could design a CPU that, when it sees an increment instruction, flushes out the cache line, instructs other CPUs to flush out the cache line, then issues the "increment directly in RAM" instruction. But I think it'll be much slower than what they do now. -Raymond]
  18. DWalker says:

    @Brian: Right, I realize that the RAM would have to know the expected width of the element, and there would need to be a new RAM or memory bus "instruction" defined in addition to "read" and "write", and so on.  The increment (or decrement) instruction would have to specify the width.

    I was just speculating that it might be an interesting thing to do.

  19. DWalker says:

    @Raymond.  Yes, there's that too.  :-)  Oh, the good old days of uniprocessors with no shared memory.

  20. DWalker says:

    Of course, the compare-exchange is also comparing what's in the CPU's L1 cache with … what's in every other CPU's L1 cache?  Isn't a flush or a memory barrier required anyway, for a compare-and-swap (or for any compare with memory)?

    [A cache invalidation occurs on all CPUs other than the one performing the operation. How this is accomplished varies from processor to processor. -Raymond]
  21. @Dwalker:

    >Of course, the compare-exchange is also comparing what's in the CPU's L1 cache with … what's in every other CPU's L1 cache?

    As I mentioned above, interlocked operations in IA32 processors are based on the cache coherency protocol. The cache coherency protocol guarantees that for a dirty cache line, there is no other copy in other nodes. The processor only needs to read and modify its local cache line atomically.

  22. >but … couldn't the CPU doing an "Atomic increment" use the cache coherency protocol to make sure that there is no other copy in other nodes, and then increment the value in its own cache?

    This is exactly what happens. If the cache line is marked as "shared", any modification on it will cause invalidation.

    Flushing it to RAM every time would be slow. Because RAM is slow.

  23. DWalker says:

    @Alegr1: I am not a chip engineer, but … couldn't the CPU doing an "Atomic increment" use the cache coherency protocol to make sure that there is no other copy in other nodes, and then increment the value in its own cache?  Then, when that cache line gets flushed to RAM, the value will be incremented.  

    Raymond said that would likely be slower than what's done now, but I don't see how.  Does the Compare-exchange function actually compare a register with what's in its own cache line, rather than what's actually in RAM?

    [No, what I said would probably be slower is for *all* CPUs to invalidate their cache lines (even the one doing the incrementing) so that the RAM could do a hardware increment. -Raymond]

Comments are closed.