Lock-free algorithms: The singleton constructor


The first half may be familiar to many (most?) readers, but there’s an interesting exercise at the bottom.

A very useful pattern for the Interlocked* functions is lock-free lazy initialization via Interlocked­Compare­Exchange­Pointer­Release. Yes, that’s a really long function name, but it turns out every part of it important.

Widget *g_pwidCached;

Widget *GetSingletonWidget()
{
 Widget *pwid = g_pwidCached;
 if (!pwid) {
  pwid = new(nothrow) Widget();
  if (pwid) {
   Widget *pwidOld = reinterpret_cast<Widget*>
       (InterlockedCompareExchangePointerRelease(
          &reinterpret_cast<PVOID&>(g_pwidCached),
          pwid, NULL));
   if (pwidOld) {
    delete pwid; // lost the race - destroy the redundant copy
    pwid = pwidOld; // use the old one
   }
  }
 }
 return pwid;
}

This is a double-check lock, but without the locking. Instead of taking lock when doing the initial construction, we just let it be a free-for-all over who gets to create the object. If five threads all reach this code at the same time, sure, let’s create five objects. After everybody creates what they think is the winning object, they called Interlocked­Compare­Exchange­Pointer­Release to attempt to update the global pointer.

The parts of the name of the Interlocked­Compare­Exchange­Pointer­Release function work like this:

  • Interlocked: The operation is atomic. This is important to avoid two threads successfully updating the value of g_pwidCached.
  • Compare: The value in g_pwidCached is compared against NULL.
  • Exchange: If the values are equal, then g_pwidCached is set to pwid. This, combined with the comparison, ensures that only one thread gets to set the value of g_pwidCached.
  • Pointer: The operations are on pointer-sized data.
  • Release: The operation takes place with release semantics. This is important to ensure that the pwid we created is fully-constructed before we publish its pointer to other processors.

This technique is suitable when it’s okay to let multiple threads try to create the singleton (and have all the losers destroy their copy). If creating the singleton is expensive or has unwanted side-effects, then you don’t want to use the free-for-all algorithm.

Bonus reading:

  • One-Time Initialization helper functions save you from having to write all this code yourself. They deal with all the synchronization and memory barrier issues, and support both the one-person-gets-to-initialize and the free-for-all-initialization models.
  • A lazy initialization primitive for .NET provides a C# version of the same.

Okay, now here’s the interesting exercise. This is an actual problem I helped out with, although details have been changed for expository purposes.

We have a data structure which manages a bunch of singleton objects, let’s say that they are instances of a structure called ITEMCONTROLLER and they are keyed by a 32-bit ID. We’re looking for design suggestions on making it thread-safe. The existing code goes like this (pseudocode):

struct ITEMCONTROLLER;
struct SINGLETONINFO {
 DWORD dwId;
 ITEMCONTROLLER *(*pfnCreateController)();
};

class SingletonManager {
public:
 // rgsi is an array that describes how to create the objects.
 // It's a static array, with csi in the range 20 to 50.
 SingletonManager(const SINGLETONINFO *rgsi, UINT csi)
               : m_rgsi(rgsi), m_csi(csi),
                 m_rgcs(NULL), m_ccs(0), m_ccsAlloc(0) { }
 ~SingletonManager() { ... }
 ITEMCONTROLLER *Lookup(DWORD dwId);

private:
 struct CREATEDSINGLETON {
  DWORD dwId;
  ITEMCONTROLLER *pic;
 };

private:
 const SINGLETONINFO *m_rgsi;
 int m_csi;

 // Array that describes objects we've created
 CREATEDSINGLETON *m_rgcs;
 int m_ccs;
};

ITEMCONTROLLER *SingletonManager::Lookup(DWORD dwId)
{
 int i;

 // See if we already created one
 for (i = 0; i < m_ccs; i++) {
  if (m_rgcs[i].dwId == dwId)
   return m_rgcs[i].pic;
 }

 // Not yet created - time to create one
 ITEMCONTROLLER *pic;
 for (i = 0; i < m_rgsi; i++) {
  if (m_rgsi[i].dwId == dwId) {
   pic = m_rgsi[i].pfnCreateController();
   break;
  }
 }
 if (pic == NULL) return;

 ... if m_rgcs == NULL then allocate it and update m_ccsAlloc
 ... else realloc it bigger and update m_ccsAlloc

 // append to our array so we can find it next time
 m_rgcs[m_ccs].dwId = dwId;
 m_rgcs[m_ccs].pic  = pic;
 m_ccs++;

 return pic;
}

In words, the SingletonManager takes an array of SINGLETONINFO structures, each of which contains an ID and a function to call to create the object with that ID. To look up an entry, we first check if we already created one; if so, then we just return the existing one. Otherwise, we create the object (using pfnCreateController) and add it to our array of created objects.

Our initial inclination is to put a critical section around the entire Lookup function, but maybe there's something more clever we can do here. Maybe a slim reader-writer lock?

Bonus chatter: Although it's the case on Windows that the plain versions of the interlocked functions impose both acquire and release semantics, other platforms may not follow Windows' lead. In particular, on the XBOX360 platform, the plain versions of the interlocked functions impose neither acquire nor release semantics. I don't know what the rules are for Windows CE.

Erratum: I once knew but subsequently forgot that the singleton pattern described in this article (with the InterlockedCompareExchangePointer) is not safe on some CPU architectures. An additional MemoryBarrier() needs to be inserted after the fetch of the single pointer to ensure that indirections through it will retrieve the new values and not any cached old values:

Widget *GetSingletonWidget()
{
 Widget *pwid = g_pwidCached;
 if (!pwid) {
  ...
 } else {
  // Ensure that dereferences of pwid access new values and not old
  // cached values.
  MemoryBarrier();
 }
 return pwid;
}

The discussion of lock-free algorithms continues (with probably more errors!) next time.

Comments (25)
  1. David Walker says:

    Re the name of Interlocked­Compare­Exchange­Pointer­Release: I have nothing at all against long names, if every part is descriptive.  Some people who don't like long names tell me that the names take too long to type.  As if their coding speed is limited by how fast they can type, which is a disturbing thought.  These are some of the same people who love cryptic table aliases in T-SQL, and can't write a JOIN statement without them.  Sorry, I am digressing…

  2. Adam Rosenfield says:

    Whoa, good call on noting that the plain Interlocked* functions have neither acquire nor release semantics on the Xbox 360, I was not aware of that.

  3. Jim Lyon says:

    The erratum, the note about XBox360 semantics, and the general chatter should be enough to scare people away from this style of programming, unless absolutely necessary.

    When asked whether one could do better than a critical section around the above code, my first question would be "How many times per second is this function being called?" If fewer than hundreds, a critical section is certainly fine. It may be fine up into the thousands of calls per second. Above that, I would completely refactor the function in question. (Any bets on whether absolutely every caller passes a constant dwId?).

  4. tobi says:

    I would be interested in what is the point of nothrow. Surely we would not want a NULL widget returned and instead fail the entire algorithm?

    [I find it interesting how everybody is quibbling about unimportant implementation details and not the point of the article. This usually means that the topic is too advanced. -Raymond]
  5. Joshua says:

    I wrote this once, without the interlock. The object was a singleton only to reduce memory and load time (object takes ~100ms to load) and there was a garbage collector, so the worst case was a few were loaded when only one load was necessary.

  6. Gabe says:

    The use of realloc pretty much requires that you need to protect the whole thing with a lock, and in this case I think a read/write lock would work fine. Acquire a shared lock for the "See if we already created one" part, then release it. If you make it to "Not yet created – time to create one", you need to acquire an exclusive lock.

    Note that since the lock is released between the two parts of the algorithm, the second part will have to make sure that the item it's trying to create wasn't created already. Ideally you would have an upgradable R/W lock for this, but Slim Reader/Writer (SRW) Locks don't support that feature.

    Also, no mention of .NET lazy initialization would be complete without a mention of the [new for 4.0] Lazy<T>: msdn.microsoft.com/…/dd642331.aspx

  7. Ray Trent says:

    Oh, and I forgot to mention: your example code has a serious bug — you forgot to initialize g_pwidCached to 0. It may contain random trash, and therefore pass the if (!pwid) test. Worse, it might only do that in release builds.

    [The C++ language specification disagrees with you. "3.6.2: Objects with static storage duration shall be zero-initialized." -Raymond]
  8. Joshua says:

    Sadly, I used to use one compiler that didn't guarantee implicit static initialization to 0.

  9. Alex Grigoriev says:

    You seem to love reinterpret_cast too much. A hint for you: it's a cast of last resort, for cases where static_cast would not work. It's the most unsafest cast, and you don't give the compiler any chance to detect your mistakes. Even C-style cast is safer sometimes.

    [Was not aware that you could static_cast a void* to a T*. I use reinterpret_cast to emphasize "no bits changing – I'm just reinterpreting them." -Raymond]
  10. scott says:

    My assumption is that it is not too expensive to create the singleton.

    int ccs = m_ccs;

    ITEMCONTROLLER *oldpic = InterlockedCompareExchangeRelease(

                                     &reinterpret_cast<PVOID &>(m_rgcs[ccs].pic),

                                     pic, NULL);

    if (oldpic) {

    delete pic;

    pic = m_rgcs[ccs].pic;

    } else {

    m_rgcs[ccs].dwId = dwId;

    InterlockedIncrementRelease(&m_ccs);

    }

    return pic;

    [What if thread A asks for item dwId=1 and thread B asks for item dwId=2. Thread A's exchange succeeds. Thread B tries to exchange and fails, so it grabs the item from the array (which is for dwId=1) and returns it. Oops, we gave the wrong object to thread B. -Raymond]
  11. Mihailik says:

    Surely in .NET one wouldn't want a luxury of built-in Lazy<T> class, neither indeed should one lazily rely on the guarantees of the runtime in 'static readonly' construct.

    One should look for something vintage instead. At least 4 years old I mean.

  12. Alex Grigoriev says:

    This exercise contains two tasks:

    1. Finding/allocating a slot for the dwId. Iterate over the array until you find dwId or zero (or a reserved no-id value, if zero is a valid ID). If found dwId, go to step 2. If zero, try to do InterlockedCompareExchange to set it to dwId. If succeeded, step 2. Else, if the slot is taken, continue iteration.

    2. Fetching or creating the object, now that we found/claimed the slot. See the solution in the first half of this article.

  13. nksingh says:

    It seems like if the number of possible singleton ID's is small (in the 20s), the m_rgcs array should be statically sized in the constructor and pre-mapped to the singleton IDs. That is, you can pre-fill the m_rgcs[i].dwId field.  You can then use the m_rgcs[i].pic field pointer to decide whether to create the object or not using the Windows InitOnce API or Raymond's code snippet above (which I believe is correct).

  14. Billy O'Neal says:

    Alex,

    You need to use reinterpret_cast in order to do the reference conversions Raymond is doing here. You can replace the one *around* the call to InterlockedCompareExchangePointerRelease with a static cast, but it really doesn't matter. The largest reason for reinterpret_cast being dangerous is that it allows platform specific conversions; but InterlockedCompareExchangePointerRelease is only going to compile on one platform (Windows) anyway, so there's really no large reason to avoid using it here. Considering both static_cast *and* reinterpret_cast will allow conversions from that void * into any other pointer type without difficulty.

    Billy3

  15. Ray Trent says:

    I am opposed to this style of programming in general, largely because it leads to maintainability headaches. Developers of singleton classes tend to expect that only one instance of the class will ever exist.

    What's wrong with this much simpler variation?:

    Widget *GetSingletonWidget()

    {

     static Widget *pwid = new Widget();

     return pwid;

    }

    Any compiler worth its salt will a) ensure that this is atomic, and b) that it only incurs the overhead of a lock once when the object is first created (modulo any concurrent calls to the function during that first initialization).

    [Prior to C++0x, the compiler is not obligated to ensure atomicity. (In fact, earlier versions of the C++ language required the code to *not* be threadsafe.) Also, the technique does not generalize: It doesn't help you with the exercise, for example. -Raymond]
  16. Maurits says:

    &reinterpret_cast<PVOID&>(g_pwidCached),

    Um, ew.  Can that be right?

    (1) can you cast a foo* to a void*& ?

    (2) can you take the address of the result of cast?

    (3) surely &g_pwidCached suffices (since the cast to void* is implicit?)

  17. Maurits says:

    > surely &g_pwidCached suffices (since the cast to void* is implicit?)

    No… casting to void* is implicit but casting to void** is not. &reinterpret_cast<PVOID&>(g_pwidCached) is by no means more ugly than (void**)&g_pwidCached.

    [I prefer to cast through void*& because the cast is more obviously valid: You're taking a pointer to a widget and reinterpreting it as a generic pointer. Whereas it's not as obviously valid (to me) that you can take a pointer to a pointer to a widget and cast it to a pointer to a generic pointer. (Especially once you start throwing consts into the mix.) But that's probably just me. Then again, it's my blog. On your blog, you can use (void**). -Raymond]
  18. Thorsten says:

    Sorry if this turns out to be a double post, but I thought I had posted it a few hours ago and it still hasn't shown up:

    The singleton manager can easily be implemented in a thread-safe, lock-free way by turning the array into a single linked list. Because items are only added and never removed during the lifetime of the singleton manager we don't even need to bother about defending against the ABA problem.

    In fact, I just recently implemented something similar (excuse the pascal, but I guess most developers should be able to read it):

    type

     TnxObjectExtender = class(TObject)

     private

       oeExtendedObject : TObject;

       oeNextExtender   : TnxObjectExtender;

       constructor Create(aExtendedObject: TObject; aNextExtender: TnxObjectExtender);

     end;

     TnxObjectExtenderFactory = class(TObject)

     private

     public

       class function Get<T: TnxObjectExtender>(aObject: TObject; out aExtender: T): Boolean; inline; static;

     end;

    class function TnxObjectExtenderFactory.InternalGetOrCreate(aObject: TObject; aClassInfo: Pointer): TnxObjectExtender;

    var

     HiddenField : PPointer;

     Head        : TnxObjectExtender;

     PrevHead    : TnxObjectExtender;

     Current     : TnxObjectExtender;

     NewExtender : TnxObjectExtender;

    begin

     If not Assigned(aObject) then

       Exit(nil);

     HiddenField := GetHiddenField(aObject);

     NewExtender := nil;

     try

       Head := nil;

       repeat

         {remember the the previous head, to shortcut the loop below}

         PrevHead := Head;

         Head := HiddenField^;

         {catch if the object is already in the process of being destroyed}

         if Head = Pointer($FFFFFFFF) then

           Error(reInvalidPtr);

         {check if we already got an extender of that type}  

         Current := Head;

         while Assigned(Current) do begin

           if Current.ClassInfo = aClassInfo then

             {found it, we are done}

             Exit(Current);

           Current := Current.oeNextExtender;

           if Current = PriorHead then

             {we already checked these the last time through, stop here…}

             Current := nil;

         end;

         {didn't find it}

         if not Assigned(NewExtender) then

           {first time, create the new extender}

           NewExtender := TnxObjectExtenderClass(ClassInfoToClass(aClassInfo)).Create(aObject, Head)

         else

           {we already created a new extender but weren't successful in adding it,

            just fix up the next pointer and try again}

           NewExtender.oeNextExtender := Head;

         Current := NewExtender;

         if LockedCompareExchange(HiddenField^, Pointer(Current), Pointer(Head)) <> Head then

           {somone else was faster, loop}

           Current := nil

         else

           {success, make sure the new extender that has been inserted doesn't get freed}

           NewExtender := nil;    

       until Assigned(Current);

       Result := Current;

     finally

       if Assigned(NewExtender) then begin

         NewExtender.oeNextExtender := nil;

         NewExtender.oeExtendedObject := nil;

         NewExtender.Free;

       end;

     end;

    end;

    Given that this blog software is bound to mangle the code above, here is a more readable version: http://pastebin.com/ymNWZu2Z

  19. tobi says:

    "I find it interesting how everybody is quibbling about unimportant implementation details and not the point of the article." Yet you are discussing the correct way to cast a pointer! I think there is nothing wrong with going a bit off topic and following what is interesting. However I'm guest at your place and will respect your rules. If the blog software supports making comments invisible you could punish off topic messages by hiding them and inserting a message for the author. I would like that.

  20. Neil says:

    It's possible to avoid reinterpret_cast by making g_pwidCached a void pointer, and then you only have to deal with a few static casts. But that's just me quibbling about unimportant implementation details.

    Widget *pwid = static_cast<Widget*>(g_pwidCached);

    if (!pwid) {

     pwid = new(nothrow) Widget();

     if (pwid) {

      Widget *pwidOld = static_cast<Widget*>

          (InterlockedCompareExchangePointerRelease(

             &g_pwidCached, pwid, NULL));

    [Also delete static_cast<Widget*>(g_pwidCached); Things get messier if g_pwidCached is a smart pointer with an overloaded operator&. It boils down to whether it's better to cast each time you use it or to cast each time you initialize it. I prefer the latter because you know you're doing something funky when you're initializing it, whereas you're not accustomed to having to take special action when just using it. Perhaps union { void *pv; Widget *pwid; } would be cleaner. -Raymond]
  21. Adam Rosenfield says:

    [I prefer to cast through void*& because the cast is more obviously valid: You're taking a pointer to a widget and reinterpreting it as a generic pointer. Whereas it's not as obviously valid (to me) that you can take a pointer to a pointer to a widget and cast it to a pointer to a generic pointer. (Especially once you start throwing consts into the mix.) But that's probably just me. Then again, it's my blog. On your blog, you can use (void**). -Raymond]

    Yep, it's your blog, and you can use whatever style you wish.  Personally, I think the cast to (void**) is clearer, because on all of the platforms that I write code for, all pointers are the same size.  It's guaranteed by the C/C++ standards that sizeof(void*) >= sizeof(T*) for all T, but there are some less common systems where equality does not hold for all types T.

    For example, if you were writing code for Win16, then near pointers are 16 bits whereas void* is a far pointer and is 32 bits.  So don't try to use this code on Win16, since the Interlocked functions require DWORDs, and sizeof(near pointer) != sizeof(DWORD).

  22. Ray Trent says:

    Hmm… learn something new every day about the zero-initialization thing…

    Anyway, the reason I said "any compiler worth its salt" is that if static initialization is not guaranteed to be atomic by every compiler you use (when compiled with multithreaded options and linked with multithreaded libraries) then the *correct* solution is to throw away the compiler and get a reasonable one. Massively complicating your code to solve problems with inferior tools is a huge waste of time and a giant barrier to maintainability.

    VC++ does this right (any reasonable modern version).

    And it does generalize, you just have to put the static inside the controller function for creating the singleton class. Please tell me that creation functions for singleton classes can reasonably assume only one object of the class needs to be created. It does mean you need a different factory function for each class… but we're talking about C++ here, right?

    [It doesn't generalize to the controller, because the objects are singletons *with respect to a specific SingletonManager*. There can be more than one SingletonManager. After all, if this were for process-global singletons, then you wouldn't need a manager at all. Just make each object implement its own static singleton. (I guess I will have to elaborate on this in a separate article since one-sentence explanations aren't cutting it.) -Raymond]
  23. KT says:

    "An additional MemoryBarrier() needs to be inserted after the fetch of the single pointer to ensure that indirections through it will retrieve the new values and not any cached old values"

    With the singleton pattern shown here, how can there be "old values" and "new values"? g_pwidCached is NULL initially, and once assigned a non-NULL value, it never changes.

    AFAICS, the barrier should only be required if the pointer can change from one non-NULL value to another non-NULL value, as in read-copy-update (RCU) for example.

    [Did you read the linked article?

    Widget::Widget() : m_Burgle(2011) { }
    int *pi = new int(42); // pi = 0x10000, say
    delete pi; // memory freed
    Widget *pwid = GetSingletonWidget(); // pwid = 0x10000 (memory reused)
    print(pwid->m_Burgle);

    This might print 42 because the CPU had cached that the value 42 was stored at 0x10000. -Raymond]

  24. KT says:

    Ah, thanks for the clarification. (On first read I thought you meant old values obtained through an old g_pwidCached pointer value.)

  25. js says:

    The Alpha example is a fascinating one. It's amazing that they sold a processor that can reorder dependent reads. Unlike every other memory reordering I can think of, this is one where you can't really understand it as a reordering of the instruction stream, since doing so violates causality! You really just have to think about it in terms of how a cache is implemented, and then it doesn't seem so strange, just annoying.

    Thankfully, modern explicitly don't reorder dependent reads. The bizarro Alpha days are over.

Comments are closed.