The case of the amazingly morphing intrinsic function

I was in a code review earlier this week when someone pointed out that InterlockedOr, unlike the other InterlockedXxx operations, does not return the previous value. I found that hard to believe, so I pulled up the MSDN docs and pointed to this:

Return Value

InterlockedOr returns the original value stored in the variable pointed to by Destination.

So we set out to find out which one of us was right. It turns out that both of us were right; it just depends on how you call InterlockedOr! We created the following test application to see what was happening:

    #include <windows.h>
    #include <stdio.h>
    
    extern "C"
    LONG
    _InterlockedOr (
        IN OUT LONG volatile *Target,
        IN LONG Set
        );
    #pragma intrinsic(_InterlockedOr)
    
    int _cdecl main(int argc, char *argv[])
    {
        LONG old, cur = 0xF;
    
        old = _InterlockedOr(&cur, 0xF0);
        _InterlockedOr(&cur, 0xF0);
    
        return 0;
    }

We then looked at the assembly that was generated:

    >   16:     old = _InterlockedOr(&cur, 0xF0);
    0:000> u
    test!main+0x10 [d:\work\size\main.cpp @ 16]:
    00e611d0 b9f0000000      mov     ecx,0F0h
    00e611d5 8d55f8          lea     edx,[ebp-8]
    00e611d8 8b02            mov     eax,dword ptr [edx]
    00e611da 8bf0            mov     esi,eax
    00e611dc 0bf1            or      esi,ecx
    00e611de f00fb132        lock cmpxchg dword ptr [edx],esi
    00e611e2 75f6            jne     test!main+0x1a (00e611da)
    00e611e4 8945fc          mov     dword ptr [ebp-4],eax
    0:000> p

    >   17:     _InterlockedOr(&cur, 0xF0);
    0:000> u
    test!main+0x27 [d:\work\size\main.cpp @ 17]:
    00e611e7 b9f0000000      mov     ecx,0F0h
    00e611ec 8d55f8          lea     edx,[ebp-8]
    00e611ef f0090a          lock or dword ptr [edx],ecx  

Completely different code is generated depending on whether you capture the return value or ignore it! Wow, I didn't expect the compiler to do that.

When you capture the return value, the InterlockedOr intrinsic emits the standard pattern for an interlocked operation that the hardware does not support directly: capture the old value, do an interlocked compare and exchange to swap in the new value, and repeat if the old value returned by the compare and exchange is not the captured old value. When you ignore the return value, a single lock or (which does not set eax to the previous value) is generated, with no loop.

If you compare the two implementations, the first (the loop) can be a lot slower than the second, both because it may actually loop and because it contains a branch that can be mispredicted. If this API is in your hot path, this is very good to know: you may be able to change how you call it and squeeze a few more cycles out of it.

I also tested InterlockedAnd and it exhibits the same behavior. Going out on a limb, I would assume the other Interlocked intrinsics also generate different code depending on whether the return value is captured.
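The two shapes described above can be sketched in portable C++. This is only an illustration, not the code from the test application: it uses std::atomic as a stand-in for the Windows intrinsic, and the helper names fetch_or_via_cas, or_discard, and self_check are made up for this example. The first helper spells out the capture / compare-and-exchange / retry loop by hand; the second simply discards the result. It makes no claim about which instructions a particular compiler emits for either one.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper: OR `bits` into `target` and return the previous value,
// written out as the capture / compare-exchange / retry pattern.
long fetch_or_via_cas(std::atomic<long>& target, long bits)
{
    long expected = target.load();  // capture the old value

    // Try to swap in (expected | bits). If another thread changed `target`
    // in the meantime, compare_exchange_weak fails, reloads the current
    // value into `expected`, and we try again with the fresh value.
    while (!target.compare_exchange_weak(expected, expected | bits)) {
        // nothing to do here: `expected` was refreshed by the failed exchange
    }

    return expected;                // the value that was there before our OR
}

// Hypothetical helper: same update, but the caller does not need the old value,
// so the result of fetch_or is deliberately ignored.
void or_discard(std::atomic<long>& target, long bits)
{
    target.fetch_or(bits);
}

// Small self-check mirroring the test application: start at 0xF, OR in 0xF0.
void self_check()
{
    std::atomic<long> a{0xF};
    std::atomic<long> b{0xF};

    long old = fetch_or_via_cas(a, 0xF0);
    or_discard(b, 0xF0);

    assert(old == 0xF);        // previous value is returned by the loop form
    assert(a.load() == 0xFF);  // both forms leave the same bits set
    assert(b.load() == 0xFF);
}
```

Both helpers leave the same value in memory; the only difference is whether the caller gets the previous value back. If you only need the bits set, calling the operation without consuming its result is the form to prefer in a hot path.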