The case of amazingly morphing intrinsic function
I was in a code review earlier this week and someone pointed out the InterlockedOr, unlike the other InterlockedXxx operations, does not return the previous value. I found that hard to believe. I pulled up the MSDN docs and pointed this out
Return Value
InterlockedOr returns the original value stored in the variable pointed to by Destination.
So, we set out to find out which one of us was right. It turns out that both of us are right, it just depends on how you call InterlockedOr! We created the following test application to see what was happening
#include <windows.h>
#include <stdio.h>
extern "C"
LONG
_InterlockedOr (
IN OUT LONG volatile *Target,
IN LONG Set
);
#pragma intrinsic(_InterlockedOr)
int _cdecl main(int argc, char *argv[])
{
LONG old, cur = 0xF;
old = _InterlockedOr(&cur, 0xF0);
_InterlockedOr(&cur, 0xF0);
return 0;
}
And then looked at the assembly that was generated
> 16: old = _InterlockedOr(&cur, 0xF0);
0:000> u
test!main+0x10 [d:\work\size\main.cpp @ 16]:
00e611d0 b9f0000000 mov ecx,0F0h
00e611d5 8d55f8 lea edx,[ebp-8]
00e611d8 8b02 mov eax,dword ptr [edx]
00e611da 8bf0 mov esi,eax
00e611dc 0bf1 or esi,ecx
00e611de f00fb132 lock cmpxchg dword ptr [edx],esi
00e611e2 75f6 jne test!main+0x1a (00e611da)
00e611e4 8945fc mov dword ptr [ebp-4],eax
0:000> p
> 17: _InterlockedOr(&cur, 0xF0);
0:000> u
test!main+0x27 [d:\work\size\main.cpp @ 17]:
00e611e7 b9f0000000 mov ecx,0F0h
00e611ec 8d55f8 lea edx,[ebp-8]
00e611ef f0090a lock or dword ptr [edx],ecx
Completely different code is generated if capture the return value vs ignoring it! Wow, I didn't expect the compiler to do that.
When you capture the return value, the InterlockedOr intrinsic generates the standard pattern used to generate an Interlocked operation that is not supported by hardware (capture the old value, interlock compare and exchange for the new value, repeat if the old value from compare and exchange is not the captured old value). When you ignore the return value, a lock or (which does not set eax to the previous value) generated without a loop. If you compare the two implementations, the first implementation (the loop) can be a lot slower than the second both in terms of actually looping and in terms of prediction misses since there is a branch which could be mispredicted. If this API in your hot path, this might be very good information to know and change your usage of it to squeeze a little more cycles out of it. After testing this out, InterlockedAnd also exhibits the same behavior and, going out on a limb, I would assume the other Interlocked intrinsic also generate code the same way depending on if the return value is captured or not.