# Interlocked operations don’t solve everything

Interlocked operations are a high-performance way of updating DWORD-sized or pointer-sized values in an atomic manner. Note, however, that this doesn't mean that you can avoid the critical section.

For example, suppose you have a critical section that protects a variable, and in some other part of the code, you want to update the variable atomically. "Well," you say, "this is a simple increment, so I can skip the critical section and just do a direct InterlockedIncrement. Woo-hoo, I avoided the critical section bottleneck."

Well, except that the purpose of that critical section was to ensure that nobody changed the value of the variable while the protected section of code was running. You just ran in and changed the value behind that code's back.

Conversely, some people have suggested emulating complex interlocked operations with a critical section whose job is to protect the variable. For example, you might have an InterlockedMultiply that goes like this:

```cpp
// Wrong!
LONG InterlockedMultiply(volatile LONG *plMultiplicand, LONG lMultiplier)
{
  EnterCriticalSection(&SomeCriticalSection);
  LONG lResult = *plMultiplicand *= lMultiplier;
  LeaveCriticalSection(&SomeCriticalSection);
  return lResult;
}
```

While this code does protect against two threads performing an InterlockedMultiply against the same variable simultaneously, it fails to protect against other code performing a simple atomic write to the variable. Consider the following:

```cpp
LONG x = 2;

// Thread 1
InterlockedIncrement(&x);

// Thread 2
InterlockedMultiply(&x, 5);
```

If the InterlockedMultiply were truly interlocked, the only valid results would be x=15 (if the interlocked increment beat the interlocked multiply) or x=11 (if the interlocked multiply beat the interlocked increment). But since it isn't truly interlocked, you can get other weird values:

```
x = 2 at start
InterlockedMultiply(&x, 5)
  EnterCriticalSection
  load x (loads 2)
      InterlockedIncrement(&x);
      x is now 3
  multiply by 5 (result: 10)
  store x (stores 10)
  LeaveCriticalSection
x = 10 at end
```
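The lost update in that trace can be replayed deterministically in a single thread. Here is a sketch in plain C (the `broken_trace` helper and the use of an ordinary `long` in place of the Win32 types are illustrative assumptions):

```c
/* Replays the broken interleaving step by step: the multiply reads x,
   the increment slips in, and the multiply's store clobbers it. */
long broken_trace(void)
{
    long x = 2;            /* x = 2 at start */

    /* InterlockedMultiply enters its critical section and
       loads the multiplicand... */
    long captured = x;     /* loads 2 */

    /* ...but the critical section doesn't stop a direct
       InterlockedIncrement from another thread: */
    x = x + 1;             /* x is now 3 */

    /* The multiply finishes with its stale captured value: */
    x = captured * 5;      /* stores 10; the increment is lost */

    return x;
}
```

Running it yields 10, the "weird value" from the trace, rather than either of the two legal results (15 or 11).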

Oh no, our interlocked multiply isn't very interlocked after all! How can we fix it?

If the operation you want to perform is a function solely of the starting numerical value and the other function parameters (with no dependencies on any other memory locations), you can write your own interlocked-style operation with the help of InterlockedCompareExchange.

```cpp
LONG InterlockedMultiply(volatile LONG *plMultiplicand, LONG lMultiplier)
{
  LONG lOriginal, lResult;
  do {
    lOriginal = *plMultiplicand;
    lResult = lOriginal * lMultiplier;
  } while (InterlockedCompareExchange(plMultiplicand,
                                      lResult, lOriginal) != lOriginal);
  return lResult;
}
```

[Typo in algorithm fixed 9:00am.]

To perform a complicated function on the multiplicand, we perform three steps.

First, capture the value from memory: `lOriginal = *plMultiplicand;`

Second, compute the desired result from the captured value: `lResult = lOriginal * lMultiplier;`

Third, store the result provided the value in memory has not changed: `InterlockedCompareExchange(plMultiplicand, lResult, lOriginal)`

If the value did change, then this means that the interlocked operation was unsuccessful because somebody else changed the value while we were busy doing our computation. In that case, loop back and try again.

If you walk through the scenario above with this new InterlockedMultiply function, you will see that after the interloping InterlockedIncrement, the loop will detect that the value of "x" has changed and restart. Since the final update of "x" is performed by an InterlockedCompareExchange operation, the result of the computation is trusted only if "x" did not change value.
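InterlockedCompareExchange is Win32-specific, but the same capture/compute/publish loop can be written with standard C11 atomics. This is a sketch under that assumption (the function name is invented; `atomic_compare_exchange_strong` plays the role of InterlockedCompareExchange, with the wrinkle that it updates its `expected` argument on failure):

```c
#include <stdatomic.h>

/* Portable sketch of the corrected InterlockedMultiply. */
long InterlockedMultiplyPortable(atomic_long *plMultiplicand, long lMultiplier)
{
    long lOriginal, lResult;
    do {
        /* Step 1: capture the value from memory. */
        lOriginal = atomic_load(plMultiplicand);
        /* Step 2: compute the desired result from the captured value. */
        lResult = lOriginal * lMultiplier;
        /* Step 3: publish the result only if memory still holds
           lOriginal; otherwise loop back and try again. */
    } while (!atomic_compare_exchange_strong(plMultiplicand,
                                             &lOriginal, lResult));
    return lResult;
}
```

As in the Win32 version, the loop retries until the compare-exchange observes the same value that was captured in step 1.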

Note that this technique works only if the operation being performed is a pure function of the memory value and the function parameters. If you have to access other memory as part of the computation, then this technique will not work! That's because those other memory locations might have changed during the computation and you would have no way of knowing, since InterlockedCompareExchange checks only the memory value being updated.

Failure to heed the above note results in problems such as the so-called "ABA Problem". I'll leave you to google on that term and read about it. Fortunately, everybody who talks about it also talks about how to solve the ABA Problem, so I'll leave you to read that, too.

Once you've read about the ABA Problem and its solution, you should be aware that the solution has already been implemented for you, via the Interlocked SList functions.


1. Chris says:

Shouldn’t that while condition be "!= lOriginal"? I thought InterlockedCompareExchange returned the initial value of the destination, and we want to keep looping if the result wasn’t what we expected.

2. Raymond Chen says:

D’oh! You’re right.

3. anon says:

Spelled Interlocked wrong in title:

>Inerlocked< operations don’t solve everything

4. Seth McCarus says:

Okay, two simple and probably stupid questions:

Why is there no InterlockedAssign function? (I guess InterlockedExchange could be used there, but still…)

How can one iterate a list created with InitializeSListHead?

5. Raymond Chen says:

2. You can’t walk a linked list atomically.
6. Seth McCarus says:

What about on a multiprocessor system? If two CPUs try to write to the same memory location "simultaneously," would there not be problems?

Maybe synchronization is performed on a hardware level??

7. Jordan Russell says:

An "InterlockedAssign" function would be handy because while writes to aligned 32-bit variables are always atomic, they’re not always ordered on non-x86 platforms as explained at <http://msdn.microsoft.com/library/en-us/dllproc/base/synchronization_and_multiprocessor_issues.asp&gt;. That article suggests using InterlockedExchange instead of a direct assignment, but that seems like overkill on platforms where writes are always ordered, e.g. x86.

8. Pavel Lebedinsky says:

The MSDN article that Jordan referred to is not quite correct, by the way:

Win32 should have provided a memory barrier macro/intrinsic that is guaranteed to be respected by both the compiler and hardware. As far as I know, there’s still no such thing (there are various bits and pieces like the MemoryBarrier macro but nothing that prevents both CPU and compiler reordering).
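[The barrier Pavel asks for does exist in later standards: C11's `atomic_thread_fence` constrains both the compiler and the hardware. A sketch of the classic publish/consume pattern under that assumption (C11 postdates this discussion; the function names are invented for illustration):]

```c
#include <stdatomic.h>

static atomic_int g_data;   /* zero-initialized */
static atomic_int g_ready;

void publish(int value)
{
    atomic_store_explicit(&g_data, value, memory_order_relaxed);
    /* Neither the compiler nor the CPU may sink the data store below
       this fence, or hoist the ready store above it. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&g_ready, 1, memory_order_relaxed);
}

int try_consume(int *out)
{
    if (atomic_load_explicit(&g_ready, memory_order_relaxed)) {
        /* Matching acquire fence: reads after it cannot be hoisted
           above the ready check. */
        atomic_thread_fence(memory_order_acquire);
        *out = atomic_load_explicit(&g_data, memory_order_relaxed);
        return 1;
    }
    return 0;
}
```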

Interestingly, one of the best explanations of ABA that Google came up with was in this blog.

http://weblogs.asp.net/oldnewthing/archive/2004/08/17/215682.aspx#217600

10. jeffdav says:

I came across someone who really did not understand InterlockedIncrement() once. I had to remove the following:

```cpp
/* Synchronization */
#define InterlockedIncrement(plong) (++(*(plong)))
#define InterlockedDecrement(plong) (--(*(plong)))
```

11. Johan says:

It’s not clear to me what the intended memory access/ordering semantics of the plain (non-*Acquire/*Release) Interlocked* operations are. The documentation says nothing about it, but there’s an IA64-specific note indicating an associated *full* memory barrier. Is that the intended semantics? Or was that fence just added to avoid breaking all the existing code which assumes an x86-like memory model?

12. lowercase josh says:

If I’m not mistaken, what you have to lock for the critical section to work is the data, not the code. If all access to that data goes through the same critical section, that first InterlockedMultiply would work correctly. To me, "critical section" seems to imply a "section" of code, but you nearly always want to lock the data. :/

13. Mike says:

Why are they called the interlocked functions rather than the atomic functions? ie AtomicIncrement etc. That would be a slightly more intuitive name (albeit, doesn’t sound as cool :)

Seth: yes the sync is done with hardware.

InterlockedCompareExchange() is some assembly like this:

lock cmpxchg a, b

The lock prefix causes the CPU to perform a "bus assert" which basically locks the bus. It prevents other processors from accessing memory while the bus is held. At least this is my understanding.

After a bus assert I suppose the CPU clears the dcache but now I’m just guessing :)
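[The `lock cmpxchg` sequence Mike describes is what compilers emit for a compare-and-swap intrinsic. A sketch using the GCC/Clang builtin `__sync_val_compare_and_swap` (an assumption; not part of the Win32 API) shows the same shape as InterlockedCompareExchange, including the property Chris relied on above: it returns the value the destination held before the operation.]

```c
/* Same parameter meaning as InterlockedCompareExchange(dest,
   exchange, comperand): store exchange into *dest only if *dest
   equals comperand; either way, return the prior value of *dest.
   Note the builtin takes its arguments in (dest, old, new) order. */
long cas(volatile long *dest, long exchange, long comperand)
{
    return __sync_val_compare_and_swap(dest, comperand, exchange);
}
```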

14. MikeF says:

Possibly silly question: What is the "critical section bottleneck", and why is looping over a cmpxchg better?

My understanding is that a pure critical section approach (a) lets the OS sleep the blocked thread, and (b) wastes less time than the cmpxchg loop anyway.

I’m almost certainly missing something here but I can’t see what. Is the overhead on critical sections particularly big?

15. Seth McCarus says:

16. Ben Hutchings says:

Mike wrote: "After a bus assert I suppose the CPU clears the dcache but now I’m just guessing :)"

That would take far too long. Actually processors in SMP systems obey a cache coherency protocol which ensures that memory that’s been modified in cache but not yet written back is forwarded from processor to processor (possibly by writing it back to memory first) and memory that’s about to be modified by one processor is invalidated in the caches of the others.

The problems of memory coherence are not due to caching but to reordering of instructions in the processor core and of memory operations issued by the processor core to its cache(s).

17. Trent Glascock says:

MikeF: Waiting on a critical section is an expensive operation. So expensive that on multi-processor systems you can elect to "spin" before waiting on the critical section. The code in the original post could loop through a hundred iterations and still come out ahead of waiting on a busy critical section.

See InitializeCriticalSectionAndSpinCount

http://msdn.microsoft.com/library/en-us/dllproc/base/initializecriticalsectionandspincount.asp
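[The spin-then-wait idea behind InitializeCriticalSectionAndSpinCount can be sketched in portable C11: try the cheap atomic path a bounded number of times before falling back to a real wait. Everything here is an illustrative assumption — the type and function names are invented, the spin count is arbitrary, and a real critical section would block on a kernel event rather than yield.]

```c
#include <stdatomic.h>
#include <sched.h>

typedef struct {
    atomic_flag locked;
} spin_then_yield_lock;

void lock_acquire(spin_then_yield_lock *l)
{
    /* Spin briefly in the hope the owner releases soon; on a busy
       multiprocessor this often wins before we ever have to wait. */
    for (int i = 0; i < 4000; i++) {
        if (!atomic_flag_test_and_set_explicit(&l->locked,
                                               memory_order_acquire))
            return;
    }
    /* Spin budget exhausted: fall back to yielding the CPU (a real
       critical section would wait on a kernel object here). */
    while (atomic_flag_test_and_set_explicit(&l->locked,
                                             memory_order_acquire))
        sched_yield();
}

void lock_release(spin_then_yield_lock *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```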