Lock-free algorithms: Choosing a unique value (warm-up)


Here's a snippet of code whose job is to generate a unique number within the process. Here's some reference reading to get yourself in the mood. Caution: It may or may not be useful.

dwUniqueId = InterlockedCompareExchange(&g_dwUniqueId, 
                                        g_dwUniqueId+1, 
                                        g_dwUniqueId);

Criticize this code fragment.

Comments (34)
  1. laonianren says:

    The apparent invariant "the second parameter equals the third parameter plus one" doesn't necessarily hold.  The second parameter may be less than, equal to, or greater than the third parameter, so the call might actually reduce the value of g_dwUniqueId.  In other words, this code is useless.

  2. Wizou says:

    It may not proceed to the increment if another thread updated the value of g_dwUniqueId in the meantime.

    Moreover, dwUniqueId receives the initial value of g_dwUniqueId, so we have no idea whether the increment occurred.

    The compiler might also decide to read the value of g_dwUniqueId twice while building the parameters on the stack, so g_dwUniqueId and g_dwUniqueId+1 might have unexpected values if it is changed from other threads.

  3. Damien says:

    Hmmm. So it's "let's ignore InterlockedIncrement, let's build a broken version using a more complex function"

  4. Tom says:

    It's far more complicated than using InterlockedIncrement()?

  5. Jim Lyon says:

    That line of code requires the reader to think too hard (because the writer didn't think enough). Replace it with the obviously bug-free:

       dwUniqueId = InterlockedIncrement(&g_dwUniqueId);

  6. Dan Bugglin says:

    The call would seem to be useless since you don't make a copy of g_dwUniqueId like in the linked example… so it would succeed even if the value changes due to another usage of this piece of code.  Example, two threads:

    g_dwUniqueId = 0

    Thread 1: Compute g_dwUniqueId+1, result: 1

    Thread 2: Compute g_dwUniqueId+1, result: 1

    Thread 1: Call InterlockedCompareExchange(Pointer to 0, 1, 0).  First argument dereferenced is == third argument, so set to second argument.

    Thread 2: Call InterlockedCompareExchange(Pointer to 1, 1, 1).  First argument dereferenced is == third argument, so set to second argument.

    And of course even if it did catch the problem in thread 2 (maybe it does?  I've never used this function before) the return value is never checked, and there is no loop to try again.

  7. Frederik Slijkerman says:

    You need to call InterlockedCompareExchange in a loop until it returns the old unique ID that you passed as the last argument: only then is the new value successfully stored.

  8. Austin Donnelly says:

    InterlockedIncrement() is what's needed here, as others have commented.

    But also, this is a global, so will quickly become a scalability bottleneck.

    Ideally, there'd be a separate counter for each CPU, each allocating a sequence of non-overlapping integers.  And those counters should be on separate cache lines, to avoid cache conflicts between the CPUs.

  9. Leo Davidson says:

    In addition to the other valid criticisms, I'd usually (not always, but usually) criticise the act of trying to write such code in the first place. :)

    It's one of those things, like implementing your own containers, that's extremely difficult to get completely right, even for top programmers. It's better, if possible, to use building blocks provided by a library/framework/platform that was written by experts (who will have got things wrong, too), subjected to serious review & testing (and maybe formal proof) and then used by thousands of people (and thus less likely to have any lingering bugs, or at least have buggy behaviour well documented). Rolling your own, for one project, is often a recipe for "impossible" bugs and phantom, unreproducible crashes.

    I mean, a lot of code that uses basic threads and locks is only one unexpectedly long page-fault away from crashing; doing away with the locks makes things even trickier. And how many threading/synchronisation libraries actually got it right first time?

    I'm not criticising what Raymond is doing, which is helping people understand this kind of stuff and writing about an interesting topic. I'm just saying that writing a lock-free algorithm is not to be done on a whim. Don't catch yourself doing "premature optimisation" or optimisation of the wrong thing.

    Sometimes it is the right thing to do, of course.

  10. Joshua says:

    I write lock free algorithms.

    I write proofs of correctness for lock free algorithms.

  11. Joshua says:

    The code only works on certain optimization settings.

    Try this fragment:

    volatile LONG scv;

    do {
      scv = g_dwUniqueId;
    } while (InterlockedCompareExchange(&g_dwUniqueId,
                                        scv+1,
                                        scv) != scv);

    dwUniqueId = scv;

  12. Michael Mol says:

    @Leo — Be careful of placing too much trust in 'Experts'. Appeal to authority is, er, appealing, but experts are fallible, too. Most times we've tried to do something necessarily complicated, we've slammed into bug after bug in libraries and third-party code, and wound up having to do it ourselves.

    Of course, some things we've tried to do ourselves, we eventually found it more effective to use a library such as Boost. If it's a reasonable library with a reasonable license, we've sometimes been able to clean it up to our needs. (I think the Boost code wound up ultimately being stripped out, though, with our own code used instead)

    So, I guess, the moral is that the mirror view is valid, too. In some domains, rolling your own will be safer or more effective than using a third-party library.

    Sometimes it is the right thing to do, of course. ;)

  13. Barbie says:

    Ah, "lock-free" algorithms. The very ones that use the LOCK prefix on x86. There's no free lunch. Something, somewhere, has to play the atomic game, and that will always require some form of non-scalable feature. If you want a cross-thread unique id, use the thread-id as part of your id. No lock, no sync, no atomic… (well, as long as your threads are long-lived).

  14. Michael Mol says:

    @David Walker

    There is a suite of instructions, and they've been available since the 486, but there's no 'one instruction for all register widths', AFAIK. Also, I presume InterlockedIncrement is guaranteed to work on each of the platforms the NT kernel line has ever run on. (Meaning Alpha, x86, x86-64, Itanium…)

  15. Alex Grigoriev says:

    Aside from the obvious race condition caused by the multiple reads of the original variable in RC's example, keep in mind that the InterlockedXxx functions operate on the LONG type: not DWORD, not 'int', not anything else. If your g_dwUniqueId is a DWORD, it's wrong. LONG. Get used to using proper types; it will save you in the long run. We've seen enough function pointer casts that are a bomb waiting to explode. The fewer casts, the fewer bugs.

  16. KJK::Hyperion says:

    Historically, InterlockedIncrement is allowed to return just the sign of the result, not the actual result, at least on x86. Before lock xadd, InterlockedIncrement used to be implemented with lock inc, which can only atomically return the sign of the result

    [Yuhong, is that you? -Raymond]
  17. Joshua says:

    InterlockedIncrement first appeared on NT4. NT4 required a 486. I rest my case.

  18. Jeff says:

    Joshua: Not so. The history of InterlockedIncrement/Decrement: blogs.msdn.com/…/127141.aspx

  19. David Walker says:

    Back when I used to write code in assembly language, on IBM mainframes, on the VM platform, there were macros that performed interlocked increment and decrement using the hardware Compare and Swap instruction (HCPCOUNT and HCPMINUS with an optional parm to ask for the interlocked version).  That is, the macro would expand to a few instructions that would load a value from storage into a register, increment the value in the register, perform a hardware Compare and Swap instruction between the register and the original storage location, test the return code from the Compare and Swap, and optionally try again in a loop.

    It seemed to me that a lot of headache and performance issues could have been solved by getting the IBM 370/390/ESA hardware designers to implement a hardware "Interlocked Increment" instruction that would increment the value at a storage address and return its new value in the specified register that would go back to the caller.  (I would probably treat the value as a signed fullword, and maybe have another instruction to increment or decrement a doubleword.)

    I don't do assembly programming any more, but it still seems that, no matter what the platform, *in the hardware* is the place to implement things like this.  That's even better than having experts write the macros for you.  Hacking around with global spin-locks, or even worse, trying to roll your own code to do this kind of thing, and worrying about other processors or threads affecting what you do, seems like a mess.  

    In the x86/x64 world, I see that the InterlockedIncrement function (in C?) is documented with "this function generates a full memory barrier (or fence) to ensure that memory operations are completed in order".  Is there no atomic, single x86 or x64 instruction to increment the value at a real or virtual storage address?  

  20. Michael Mol says:

    InterlockedIncrement is obviously more desirable than InterlockedCompareExchange, as already noted.

    If you want a unique number within a process, a 32-bit value may be insufficient in long-running scenarios. That depends on the application.

    As noted, Interlocked methods risk scalability problems. That depends on the application.

    As noted, writing your own code is probably an error. That depends on the needs of the application.

    Someone noted per-CPU globals would be useful, too.

    Another approach might be to use the 64-bit variant of InterlockedIncrement available in Vista kernels and later, if you're running on a 64-bit OS. That increases the pool of available identifiers, but also increases the scalability cost by forcing coherency enforcement and copying of 64-bit values instead of 32-bit values.

    A better approach might be to use a per-use-case counter for the unique value. This increases the scalability concerns as far as cache coherency, but confines the scope (and thus rate) of usage, mitigating that somewhat, and also mitigating the limited number of available identifier values. Combine with thread-local storage to reduce the cache coherency problems (the kernel's scheduler will try to keep threads in localized places as much as possible).

    Another approach may be to use CoCreateGuid instead of the approach shown here. That meets the requirement of a unique number within the process as a subset of producing a *globally* unique number. The drawback, of course, is that 128 bits instead of 32 bits will result in additional overhead. You're already making a system call with Interlocked*, so CoCreateGuid isn't necessarily inherently a slower operation on that particular boundary. (Though the implementation will probably be slower than that of InterlockedIncrement.)

  21. Adam Rosenfield says:

    @Michael Mol: The Interlocked* functions are most definitely *not* system calls.  The whole point of the Interlocked* functions is to be able to do certain operations very quickly, even when there's contention among threads.  System calls are sloooooow.

    On x86, my system implements InterlockedIncrement/Decrement/ExchangeAdd as "lock xadd" and InterlockedExchange/CompareExchange as "lock cmpxchg" (Exchange loops, CompareExchange does not).

    More information from the Intel® 64 and IA-32 Architectures Software Developer's Manuals (http://www.intel.com/…/manuals):

    lock prefix: "LOCK—Assert LOCK# Signal Prefix", Vol 2A, p. 3-606

    cmpxchg instruction: Vol 2A, p. 3-192

    xadd instruction: Vol 2B, p. 4-489

  22. Maurits says:

    That code is perfectly valid…

    … as long as it's protected by a critical section.  But if you're already protected you might as well go with dwUniqueId = g_dwNextUniqueId++;

  23. Leo Davidson says:

    @Michael Mol: It's not that I'm placing lots of trust in experts to get this stuff right. As I said, they make mistakes with this kind of code as well. Lots of mistakes, as the errata for any (maintained) book or library that touches on threading will show.

    My point is that I place *even less* trust in the non-experts. An expert should have a really good reason to write this kind of code (and if they don't have a good reason, I'd question whether they are really an expert). A non-expert shouldn't attempt to write it at all. (Unless they are just experimenting with concepts for their own enjoyment/curiosity and not inflicting the results on others. Nothing wrong with that, and it's one way to become more of an expert.)

  24. bcs says:

    On the topic of atomic ASM ops not being "lock free" (in that, IIRC, they can halt all memory accesses across all cores): that's only relevant if you do more than one of them because any mutex will include an atomic op. You might as well make that op do the work you need rather than make it part of your overhead.

  25. Ferose Khan J says:

    If 2 threads call this statement at the same time:

    g_dwUniqueId = 0

    Thread 1 => increments g_dwUniqueId to 1 and the call returns 0

    Thread 2 => tries to increment, but the compare fails since g_dwUniqueId was already incremented to 1

    Both calls can end up with the same value, which is not unique. I think it may not work.

  26. sarat says:

    The arithmetic enclosed in the parentheses is not protected: g_dwUniqueId can be modified by another thread between the reads, so the values passed as arguments may be inconsistent with each other.

  27. Phil says:

    I'm surprised no one has mentioned yet that g_dwUniqueId must be volatile if its address is passed as the first parameter. I think this means that the reads for the second and third arguments must happen separately. So g_dwUniqueId might be different for the second argument than for the third. This function could succeed in doing the exchange, but place the wrong value because the compiler read the variable for the second argument back when the value was smaller than when it read the variable for the third argument.

  28. Maurits says:

    That said, neither a critical section nor Interlocked* are what I would call "lock-free".  Perhaps a better way to meet the requirement of "generate a unique number" is to combine a thread identifier with an incrementing variable in thread-local storage.

    [Interlocked* is "lock-free" in the sense that a thread that is using Interlocked* can be suspended without affecting other clients. (Wikipedia calls this "obstruction-free".) -Raymond]
  29. Michael Mol says:

    @Phil The definition of g_dwUniqueId was never shown, so we don't technically know its type from the code snippet. g_dw leads us to infer that it's a DWORD, but doesn't tell us about any additional type modifiers. It could as easily have been volatile as not. I assumed it was, because I assumed the code would compile in the context in which it was intended.

  30. Brad Bellomo says:

    Raymond is not a college Freshman trying to write a lock free unique number generator.  He wrote this code on purpose to teach us something.

    This compiles to 2 instructions in assembly:

    Store g_dwUniqueId + 1 in a register

    Compare g_dwUniqueId to itself and assign it to the register value if it is equal.

    g_dwUniqueId will always be equal to itself, so the compare will always assign the register value.

    g_dwUniqueId might change between these instructions, in which case you'd repeat a number 1 2 3 4 4 5 6 etc, making the result not unique.

  31. KJK::Hyperion says:

    Raymond, offense taken! I was spilling the beans on Windows internals long before Yuhong Bao! Just be grateful that I'm just discussing a documented (albeit obsolete) API, instead of, say, telling everyone how the Windows CE guys implemented InterlockedXxx functions on CPUs without atomic instructions

  32. Worf says:

    Most architectures implement an interlocked instruction for the pure purpose of atomic operation implementation in modern OSes.

    ARM had to add a register-memory exchange instruction strictly for cases like this, even though it violates RISC principles (registers may only be loaded and stored by load-store instructions, and an instruction can only do one thing). The exchange is the only instruction needed – from there you can implement test-and-set (foundation for ye olde spinlock) and the like.

    Otherwise one had to resort to the ultimate lock – disabling interrupts.

  33. Jules says:

    @Worf: "ARM had to add a register-memory exchange instruction strictly for cases like this, even though it violates RISC principles"

    Which is an interesting choice as LL/SC provides equal capability and is compliant with the RISC worldview.  But then, ARM has never been a pure RISC system anyway: the LDM/STM multiple-register load and store instructions were included from the very beginning.

  34. Anonymous Coward says:

    Wouldn't it be easier to build up your numbers from two parts? The first part you'd get from a global counter with a proper lock, and the second would be a counter local to the thread. That way you have to pay the locking bill only once at thread start-up, and from then on you can generate unique numbers till the cows come home, without having to worry about whether you implemented your lock-free increment correctly.

    [That would be helpful if the unique number generator was high-frequency. It so happens that in this example, the function was called pretty rarely! -Raymond]

Comments are closed.