What’s up with compare_exchange_weak anyway?


Last time, I left you with a homework assignment: Watch this video on std::atomic.

At time code 33:03, the presenter notes the weak version of compare-exchange (which is permitted to fail even if the value matches the expected value) and tries to reverse-engineer what kind of hardware would require this operation, eventually settling on a NUMA architecture where cross-node memory accesses can time out.

But there's no need to speculate about something that exotic, because the answer is all around us. In fact, it's probably happening right now on a computer in the presenter's pocket.

Most RISC processors do not have a compare-exchange instruction. Instead, they use a load locked/store conditional pattern. This pattern is employed by the ARM architecture; we also saw it for the Alpha AXP, and we'll see it later for MIPS and PowerPC.

The load locked/store conditional pattern goes like this:

  • Issue a load locked instruction which reads a value from memory and instructs the processor to monitor that location for writes from other processors.
  • Perform some computations.
  • Issue a store conditional instruction which writes a value to the same memory location that was locked, provided the processor can prove that the memory has not been written to in the interim.

The conditional store can fail if another processor has written to the memory (or to memory on the same cache line, or within whatever the unit of monitoring granularity happens to be), or if the processor took an interrupt.

On ARM, a strong compare-exchange requires a loop, because the only way compare_exchange_strong is permitted to fail is if the current value of the atomic variable does not match the expected value. If the store conditional fails merely due to contention, the strong version must retry internally until the operation either succeeds or fails because the value genuinely doesn't match.

    ; r0 is the proposed new value
    ; r1 is the expected old value
    ; r2 is the address of the atomic variable

retry:
    DMB                     ; data memory barrier
    LDREX   r3, [r2]        ; load current value and lock it
    CMP     r3, r1          ; is it what we expected?
    BNE     fail            ; N: operation failed
                            ; actual current value is in r3

    STREX   r4, r0, [r2]    ; try to store new value
    CBNZ    r4, retry       ; lost the lock, try again
    DMB                     ; data memory barrier

Consider the compare-exchange loop in the code sample in the presentation:

    do { new_n->next = old_h; }
    while (!head.compare_exchange_strong(old_h, new_n));
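
For context, here is a self-contained sketch of the push this loop comes from; the surrounding declarations are assumed, since the presentation shows only the loop:

    #include <atomic>

    struct node
    {
        node* next;
        int value;
    };

    std::atomic<node*> head{ nullptr };

    void push(node* new_n)
    {
        node* old_h = head.load(std::memory_order_relaxed);
        do { new_n->next = old_h; }
        while (!head.compare_exchange_strong(old_h, new_n));
    }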

The compare_exchange_strong has an embedded loop, and it is itself part of another loop, so we have to generate two nested loops:

    ; r0 is new_n
    ; r1 is old_h
    ; r2 is the address of the atomic variable "head"

outer_loop:
    STR     r1, [r0]        ; new_n->next = old_h

retry:
    DMB                     ; data memory barrier
    LDREX   r3, [r2]        ; locked load of head
    CMP     r3, r1          ; is it what we expected?
    BNE     fail            ; N: operation failed

    STREX   r4, r0, [r2]    ; try to store new value
    CBNZ    r4, retry       ; lost the lock, try again

    DMB                     ; data memory barrier

    ; succeeded - continue with code that comes after

    ...

    ; This code goes at the end of the function because ARM
    ; statically predicts forward-jumps as not-taken.
fail:
    DMB                     ; data memory barrier
    MOV     r1, r3          ; old_h = current value of head
    B       outer_loop      ; restart the outer loop

The outer loop drives the loop written by the C++ programmer. The inner loop is the one required by compare_exchange_strong.

The weak version avoids this nested loop:

    do { new_n->next = old_h; }
    while (!head.compare_exchange_weak(old_h, new_n));

With this version, the compiler can simply bail out at the first sign of trouble. It avoids having to create a separate fail label and reduces register pressure because it doesn't need to carry the expected and actual values through the (no-longer present) inner loop.

    ; r0 is new_n
    ; r1 is old_h
    ; r2 is the address of the atomic variable "head"

outer_loop:
    STR     r1, [r0]        ; new_n->next = old_h

    MOV     r3, r1          ; save old_h before we overwrite it
    DMB                     ; data memory barrier
    LDREX   r1, [r2]        ; locked load of head into old_h
    CMP     r3, r1          ; is it what we expected?
    BNE     outer_loop      ; N: retry with revised old_h

    STREX   r3, r0, [r2]    ; try to store new value
    CBNZ    r3, outer_loop  ; lost the lock - retry the outer loop

    DMB                     ; data memory barrier

    ; succeeded - continue with code that comes after
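
The same loop shape applies to any read-modify-write that doesn't have a dedicated member function. As an illustration (not from the presentation), here's an atomic "store the maximum" built on compare_exchange_weak; a spurious failure just costs one extra trip around the loop:

    #include <atomic>

    void atomic_store_max(std::atomic<int>& target, int value)
    {
        int observed = target.load(std::memory_order_relaxed);
        while (observed < value &&
               !target.compare_exchange_weak(observed, value)) {
            // compare_exchange_weak refreshed "observed" with the
            // current value; go around and decide again.
        }
    }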

When should you prefer the strong version of compare-exchange as opposed to the weak version? We'll take up that question next time.

Comments (15)
  1. ZLB says:

    My guess for the use of the weak version is building locks with TryAcquireXXX type calls. Slim locks at a guess.

    The performance should be pretty fast.

    1. Peter says:

      That’s it, more-or-less. Basically, you use the weak versions (in a loop) whenever you must update the shared state and the strong version when the update is optional. The most common case I’ve come across is “if I win the race, do X; otherwise, bail.”
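
      For example, a one-shot "publish if nobody else got there first" has that shape. A sketch (the names are made up):

          #include <atomic>

          struct Widget { /* ... */ };

          std::atomic<Widget*> g_widget{ nullptr };

          Widget* get_widget()
          {
              Widget* existing = g_widget.load(std::memory_order_acquire);
              if (existing) return existing;

              Widget* candidate = new Widget();
              if (g_widget.compare_exchange_strong(existing, candidate)) {
                  return candidate;   // we won the race; ours is published
              }
              delete candidate;       // we lost; use the winner's object
              return existing;
          }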

  2. Grant R says:

    Please be careful using compare_exchange for lock-free linked lists. It suffers from the A-B-A problem, and can lead to some horribly corrupted lists. The ‘push’ algorithm you show here is safe, but there is no corresponding safe ‘pop’ without changing the ‘push’.
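
    The obvious pop looks something like this (a sketch, reusing the head and node from the push above), and that's where it bites:

        node* pop()
        {
            node* old_h = head.load();
            while (old_h != nullptr &&
                   !head.compare_exchange_strong(old_h, old_h->next)) {
                // old_h was refreshed with the current head; try again
            }
            return old_h;
        }

    Between reading old_h->next and the exchange, another thread can pop old_h, pop further nodes, and push old_h back. The head still equals old_h, so the exchange succeeds, but the next pointer it just installed refers to a node that is no longer in the list.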

    1. Interestingly, the RISC load locked/store conditional pattern successfully detects A-B-A issues: the conditional store fails because the memory was indeed written to (twice).

      1. Grant R says:

        While that is true for many RISC processors, it is not true for x86, and thus the C++ code snippet is not very portable. It will work fine on ARM but will fail in strange and bizarre ways, and only under stress, on x86. Hence my warning to be careful.

        1. Right. I was talking about the machine language, not C++. If you use the LL/SC pattern on a RISC processor, it will detect A-B-A. But x86 doesn’t have LL/SC, so no help there.

  3. DWalker07 says:

    I have often wondered why a simple increment seems to require a compare and retry by the calling assembler code. Isn’t it possible to create a hardware “increment” instruction, at the chip level, which doesn’t require retries by the calling assembler code? The same for decrement.

    Maybe there are such hardware instructions and I’m just ignorant of them.

    1. You could do it at the chip level, but it requires two memory accesses in a single instruction, which is very non-RISCy. (No other RISC instruction performs two memory accesses in a single instruction, so you’re increasing chip space, power consumption, and instruction time, all just for one instruction.)

      1. pc says:

        One could imagine putting such logic in the RAM chips, where the CPU could issue an “increment” instruction to the RAM much like it currently issues “read” and “write” commands. I assume it only hasn’t been done because the benefits wouldn’t be worth the costs. Probably it’d increase the cost and decrease the speed of the RAM (as the RAM would have to make sure to do the increment all within one “cycle”), and make caching a lot harder, but it’s surely possible if one really wanted to build hardware that did it.

        1. DWalker07 says:

          Well, it SEEMS like it’s a frequent enough thing that it would be worth building into RAM or at the CPU chip level, since there are well-known patterns for interlocked increment, and spin locks, and so on. However, I’m not sure if anyone knows how often an increment actually has to retry, or not. If I were designing a chip, I would build an increment and a decrement at the hardware level….

          1. “Sorry, you bought the wrong kind of RAM. You got big-endian 32-bit RAM, but you need little-endian 64-bit RAM with support for 32-bit aligned operations.”

          2. Yuhong Bao says:

            AFAIK modern DDR memory does reads/writes in bursts and assumes that you have a cache anyway.

  4. Joshua says:

    When the loop body is big.

  5. Jim says:

    Reminds me of Redis’ WATCH command and how it works with transactions for optimistic locking: https://redis.io/topics/transactions#optimistic-locking-using-check-and-set

  6. Yuhong Bao says:

    I wonder if the ARM64 version uses LL/SC or the new CAS instructions introduced in ARMv8.1.

Comments are closed.
