The case of amazingly morphing intrinsic function

I was in a code review earlier this week and someone pointed out the InterlockedOr, unlike the other InterlockedXxx operations, does not return the previous value. I found that hard to believe. I pulled up the MSDN docs and pointed this out

Return Value

InterlockedOr returns the original value stored in the variable pointed to by Destination.

So, we set out to find out which one of us was right. It turns out that both of us are right, it just depends on how you call InterlockedOr! We created the following test application to see what was happening

    #include <windows.h>
#include <stdio.h>

extern “C”
_InterlockedOr (
IN OUT LONG volatile *Target,
#pragma intrinsic(_InterlockedOr)

int _cdecl main(int argc, char *argv[])
LONG old, cur = 0xF;

old = _InterlockedOr(&cur, 0xF0);
_InterlockedOr(&cur, 0xF0);

return 0;

And then looked at the assembly that was generated

    >   16:     old = _InterlockedOr(&cur, 0xF0);
0:000> u
test!main+0x10 [d:\work\size\main.cpp @ 16]:
00e611d0 b9f0000000 mov ecx,0F0h
00e611d5 8d55f8 lea edx,[ebp-8]
00e611d8 8b02 mov eax,dword ptr [edx]
00e611da 8bf0 mov esi,eax
00e611dc 0bf1 or esi,ecx
00e611de f00fb132 lock cmpxchg dword ptr [edx],esi
00e611e2 75f6 jne test!main+0x1a (00e611da)
00e611e4 8945fc mov dword ptr [ebp-4],eax
0:000> p

> 17: _InterlockedOr(&cur, 0xF0);
0:000> u
test!main+0x27 [d:\work\size\main.cpp @ 17]:
00e611e7 b9f0000000 mov ecx,0F0h
00e611ec 8d55f8 lea edx,[ebp-8]
00e611ef f0090a lock or dword ptr [edx],ecx

Completely different code is generated if capture the return value vs ignoring it! Wow, I didn’t expect the compiler to do that.

When you capture the return value, the InterlockedOr intrinsic generates the standard pattern used to generate an Interlocked operation that is not supported by hardware (capture the old value, interlock compare and exchange for the new value, repeat if the old value from compare and exchange is not the captured old value). When you ignore the return value, a lock or (which does not set eax to the previous value) generated without a loop. If you compare the two implementations, the first implementation (the loop) can be a lot slower than the second both in terms of actually looping and in terms of prediction misses since there is a branch which could be mispredicted. If this API in your hot path, this might be very good information to know and change your usage of it to squeeze a little more cycles out of it. After testing this out, InterlockedAnd also exhibits the same behavior and, going out on a limb, I would assume the other Interlocked intrinsic also generate code the same way depending on if the return value is captured or not.

Comments (6)

  1. aionescu says:

    They do! I often have to use GCC and I’m trying to make the code as compatible/portable between compilers as possible, so we’ve had to build our own header with the intrinsics implemented in gcc’s own atomic language (which is based of Intel’s official specs). Unfortunately there’s no way to make them morphing like this, so the slower loop path must always be implemented.

  2. dispensa says:

    This was actually the subject of a cl 13.x compiler bug:

    Checking the return value of these functions was broken. Seems to be fixed in 14.x, but a lot of people still build with 3790.1830 or previous.

  3. aionescu says:

    There’s also this one 😉

    *subtle nudge to d*

  4. alegr says:

    Did you notice that the assembly code was incorrect? The WDK 6001 compiler still generates wrong code. And the error matches erroneous InterlockedOr_Inline, InterlockedAnd_Inline, InterlockedXor_Inline functions. The loop should return _before_ mov eax,[edx], otherwise, if [edx] changes before cmpxchg (causing the branch), the loop will never end. Such mistakes (as in *_Inline functions in WINBASE.H) are easily caught by a code review, which obviously was not done for such an important function.

    Note that InterlockedAnd64_Inline function was obviously written by a more experienced programmer. It’s cleaner and it’s _correct_!

  5. alegr says:

    Even though InterlockedAnd64_Inline is correct, the code generated by x64 compiler for _InterlockedAnd64 is incorrect (same error).

  6. forrestf says:

    alegr, when you get a chance, you should take a look at the documentation for cmpxchg, which states (from the AMD manual):

    "If the two values are equal the instruction copies the value in the second operand to the first operand and sets the ZF flag in the rFLAGS register to 1.  Otherwise, it copies the value in the first operand to the AL, AX, EAX, or RAX register and clears the ZF flag to 0."

    In particular, consider how the second sentence in the preceding paragraph applies to the code in question.