How do I choose between the strong and weak versions of compare-exchange?

Last time, we left with the question of when you should prefer the strong version of compare-exchange as opposed to the weak version.

It comes down to whether spurious failures are acceptable and how expensive they are.

In the example given in the presentation, the cost of a spurious failure is very low:

    do { new_n->next = old_h; }
    while (!head.compare_exchange_strong(old_h, new_n));

Recovering from a spurious failure is just updating a single variable and retrying the operation. Removing the nested loop embedded in the strong compare-exchange simplifies the outer loop.

On the other hand, if recovering from the failure requires a lot of work, such as throwing away an object and constructing a new one, then you probably want to pay for the extra retries inside the strong compare-exchange operation in order to avoid an expensive recovery iteration.

And of course if there is no iteration at all, then a spurious failure could be fatal. Consider the lock-free singleton construction pattern:

std::atomic<Widget*> cachedWidget;

Widget* GetSingletonWidget()
 Widget* widget = cachedWidget;
 if (!widget) {
  widget = new(std::nothrow) Widget();
  if (widget) {
   Widget* previousWidget = nullptr;
   if (!cachedWidget.compare_exchange_strong(previousWidget, widget)) {
    // lost the race - destroy the redundant widget
    delete widget;
    widget = previousWidget;
 return widget;

If we were to switch to compare_exchange_weak, then a spurious failure would mean that the value of cachedWidget was nullptr, but we failed to exchange anyway. This means that we would think that we lost the race against another thread and return the previousWidget as the singleton. But in the case of a spurious failure, the previousWidget will still be nullptr, causing the code to create a Widget, think it was redundant, throw away the created Widget, and then return nullptr. This is bad news for the Get­Singleton­Widget function.

Choosing between the strong and weak versions of compare-exchange requires you to understand what your algorithm does in the case of a spurious failure.

Comments (10)
  1. Giedrius says:

    I fail to understand what would the problem be if you used weak version in a loop ? You would still create and set widget.

  2. Peter Doubleday says:

    Stupid question: what’s wrong with just implementing the strong (ie, easy to reason about) version in hardware, and hang the weak version?

    I genuinely don’t see what utility the weak version has. (I’ve read both posts.) No doubt, at the JVM/.NET/virtual machine level, this is all transparent to the programmer, but it still smells wrong to me.

    1. The strong version requires a read-modify-write (two memory accesses in one instruction) which a load-store architecture cannot do. So basically you’re saying “hang RISC-style processors. All processors should be VAX or Intel.”

      1. Joshua says:

        I see no trouble designing a RISC instruction for solving the root problem. We take a trick from the ancient days; the instruction locks the bus for a few more instructions. Assuming I remember my college classes well, this sequence would be typical and says that 5 is enough (4 if the RISC processor has conditional jmp hints).

        LDA r2, [global_variable] ; LDA assembles to the correct number of MOV instructions …
        LOD r4, r2
        ; … compute new value in r5 preserving r2 and r4
        LOD r1, r2
        CMP r1, r4
        JNE again
        STO r5, r2 ;STO’s backwards from everybody else

        Of course, issuing LOCK while you have a pending LOCK needs to fault.

        1. Wenbin says:

          There is a reason why it is not implemented this way nowadays.
          On modern multi-core CPUs, “locks the bus” may mean stop all other CPU cores (and other hardware) from access memory, which is inefficient.

          1. Joshua says:

            Well that’s how x86 and x64 are actually doing it. lock inc [memory], lock dec [memory], lock cmxchg register, [memory] …

        2. Imagine if an interrupt occurs while you hold a lock. You would still need a conditional store instruction, and you’re back where you started.

          1. Peter Doubleday says:

            Again, I’m probably being dense. Or at least I haven’t looked back at my CS notes (which, being Cambridge in the 1980s, were heavily biased towards RISC. You wouldn’t believe how anti-CISC they were, and of course Intel hybrids didn’t exist at the time).
            It’s (evidentially) clear to me that the PARISC architecture cannot mandate a “strong compare-and-swap.” But my understanding of, say, SPARC v9 (a poster child for RISC) is that it has what I, naively, would call a strong “cas” instruction, involving two registers and one storage location.

            Now, precisely how SPARC v9 manages to co-ordinate the two registers involved, which obviously requires some form of lock at whatever level, I don’t really know. And precisely how much credence I should place in assertions that “this is necessary, but not sufficient, for lock-free algorithms,” I don’t know either. I worry about these things sometimes. I worry about my ignorance, in fact.

            But I still wonder whether PARISC avoids a (possibly expensive) hardware implementation of a strong compare-and-swap in favour of just passing the buck to the programmer. Or at the present day the compiler writer.

            Because it seems to me that 99.9999% of software doesn’t actually need a compare-and-swap at all. (Easy thought experiment: single threading! Outside the kernel, of course.) The thing is, for that vanishingly small percentage … it’s quite important to reason about these things. And whilst you can’t really make mistakes in hardware that aren’t already there … you can guarantee that somebody somewhere is going to make mistakes in software.

          2. Joshua says:

            They actually built this thing on one ancient processor so I know the answer. The interrupt isn’t handled until the lock expires.

        3. alegr1 says:

          >Well that’s how x86 and x64 are actually doing it

          Except that it’s not how x86/x64 are doing it. Inter/locked instructions use cache coherency protocol to perform atomic modifications. They only need to atomically modify an L1 cache line which doesn’t require to lock the whole QPI.

Comments are closed.

Skip to main content