Lock-free algorithms: The one-time initialization


A special case of the singleton constructor is simply lazy-initializing a bunch of variables. In a single-threaded application you can do something like this:

// suppose that any valid values for a and b stipulate that
// a ≥ 0 and b ≥ a. Therefore, -1 is never a valid value,
// and we use it to mean "not yet initialized".
int a = -1, b = -1;

void LazyInitialize()
{
 if (a != -1) return; // initialized already

 a = calculate_nominal_a();
 b = calculate_nominal_b();

 // Adjust the values to conform to our constraint.
 a = max(0, a);
 b = max(a, b);
}

This works fine in a single-threaded program, but if the program is multi-threaded, then two threads might end up trying to lazy-initialize the variables, and there are race conditions which can result in one thread using values before they have been initialized:

Thread 1                                        Thread 2
if (a != -1) [not taken]
a = calculate_nominal_a(); // returns 2
                                                if (a != -1) return; // premature return!

Observe that if the first thread is pre-empted after the value for a is initially set, the second thread will think that everything is initialized and may end up using an uninitialized b.

"Aha," you say, "that's easy to fix. Instead of a, I'll just use b to tell if initialization is complete."

void LazyInitialize()
{
 if (b != -1) return; // initialized already (test b, not a)

 a = calculate_nominal_a();
 b = calculate_nominal_b();

 // Adjust the values to conform to our constraint.
 a = max(0, a);
 b = max(a, b);
}

This still suffers from a race condition:

Thread 1                                        Thread 2
if (b != -1) [not taken]
a = calculate_nominal_a(); // returns 2
b = calculate_nominal_b(); // returns 1
                                                if (b != -1) return; // premature return!

"I can fix that too. I'll just compute the values of a and b in local variables, and update the globals only after the final values have been computed. That way, the second thread won't see partially-calculated values."

void LazyInitialize()
{
 if (b != -1) return; // initialized already

 // perform all calculations in temporary variables first
 int temp_a = calculate_nominal_a();
 int temp_b = calculate_nominal_b();

 // Adjust the values to conform to our constraint.
 temp_a = max(0, temp_a);
 temp_b = max(temp_a, temp_b);

 // make the temporary values permanent
 a = temp_a;
 b = temp_b;
}

Nearly there, but there is still a race condition:

Thread 1                                        Thread 2
if (b != -1) [not taken]
temp_a = calculate_nominal_a(); // returns 2
temp_b = calculate_nominal_b(); // returns 1
temp_a = max(0, temp_a); // temp_a = 2
temp_b = max(temp_a, temp_b); // temp_b = 2
a = temp_a; // store issued to memory
b = temp_b; // store issued to memory
store of b completes to memory
                                                if (b != -1) return; // premature return!
store of a completes to memory

There is no guarantee that the assignment b = 2 will become visible to other processors after the assignment a = 2. Even though the store operations are issued in that order, the memory management circuitry might complete the memory operations in the opposite order, resulting in Thread 2 seeing a = -1 and b = 2.

To prevent this from happening, the store to b must be performed with Release semantics, indicating that all prior memory stores must complete before the store to b can be made visible to other processors.

void LazyInitialize()
{
 if (b != -1) return; // initialized already

 // perform all calculations in temporary variables first
 int temp_a = calculate_nominal_a();
 int temp_b = calculate_nominal_b();

 // Adjust the values to conform to our constraint.
 temp_a = max(0, temp_a);
 temp_b = max(temp_a, temp_b);

 // make the temporary values permanent
 a = temp_a;
 // since we use "b" as our indication that
 // initialization is complete, we must store it last,
 // and we must use release semantics.
 InterlockedCompareExchangeRelease(
    reinterpret_cast<LONG*>(&b), temp_b, -1);
}

If you look at the final result, you see that this is pretty much a re-derivation of the singleton initialization pattern: Do a bunch of calculations off to the side, then publish the result with a single InterlockedCompareExchangeRelease operation.

The general pattern is therefore

void LazyInitializePattern()
{
 if (global_signal_variable != sentinel_value) return;

 ... calculate values into local variables ...

 globalvariable1 = temp_variable1;
 globalvariable2 = temp_variable2;
 ...
 globalvariableN = temp_variableN;

 // publish the signal variable last, and with release
 // semantics to ensure earlier values are visible as well
 InterlockedCompareExchangeRelease(
    reinterpret_cast<LONG*>(&global_signal_variable),
    temp_signal_variable, sentinel_value);
}
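
For comparison, here is a minimal sketch of the same publish-with-release pattern expressed with C++11 atomics. This is an illustration rather than part of the original pattern: it reuses the a/b example and the calculate_nominal_* functions from above, and it makes both variables std::atomic so that two threads racing to write a do not constitute a formal data race. Note that the early-out test is an acquire load, which also covers the point raised in the update at the end of this article.

#include <algorithm>
#include <atomic>

int calculate_nominal_a(); // assumed to be defined elsewhere
int calculate_nominal_b();

std::atomic<int> a{-1}; // payload; relaxed accesses suffice (see below)
std::atomic<int> b{-1}; // signal variable; -1 means "not yet initialized"

void LazyInitializeWithAtomics()
{
 // Acquire load: if another thread already published, its earlier
 // store to a is guaranteed to be visible here as well.
 if (b.load(std::memory_order_acquire) != -1) return;

 // perform all calculations in temporary variables first
 int temp_a = calculate_nominal_a();
 int temp_b = calculate_nominal_b();

 // Adjust the values to conform to our constraint.
 temp_a = std::max(0, temp_a);
 temp_b = std::max(temp_a, temp_b);

 // Publish the payload first, then the signal variable with release
 // semantics; the release/acquire pair guarantees that a thread which
 // sees the new value of b also sees the store to a.
 a.store(temp_a, std::memory_order_relaxed);
 b.store(temp_b, std::memory_order_release);
}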

If this all is too much for you (and given some of the subtlety of double-checked locking that I messed up the first time through, it's clearly too much for me), you can let the Windows kernel team do the thinking and use the one-time initialization functions, which encapsulate all of this logic. (My pal Doron called out the one-time initialization functions a while back.) Version 4 of the .NET Framework has corresponding functionality in the Lazy<T> class.
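
For reference, here is a rough sketch of what the a/b example might look like on top of the one-time initialization functions, assuming Windows Vista or later. It uses the simple synchronous InitOnceExecuteOnce flavor, which makes losing threads wait for the winner; the asynchronous flavor (InitOnceBeginInitialize with INIT_ONCE_ASYNC) avoids the wait if you really need lock-free behavior.

#include <windows.h>

int calculate_nominal_a(); // assumed to be defined elsewhere
int calculate_nominal_b();

int a = -1, b = -1;
INIT_ONCE g_initOnce = INIT_ONCE_STATIC_INIT;

// Runs at most once, no matter how many threads race into
// InitOnceExecuteOnce; the other threads wait until it returns.
BOOL CALLBACK InitializeAB(PINIT_ONCE, PVOID, PVOID*)
{
 a = calculate_nominal_a();
 b = calculate_nominal_b();

 // Adjust the values to conform to our constraint.
 a = max(0, a);
 b = max(a, b);
 return TRUE; // report success
}

void LazyInitializeViaInitOnce()
{
 InitOnceExecuteOnce(&g_initOnce, InitializeAB, NULL, NULL);
}

Because the callback is serialized and the one-time initialization functions supply the necessary memory barriers, the temporary variables and the explicit release store are no longer needed.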

Exercise: What hidden assumptions are being made about the functions calculate_nominal_a and calculate_nominal_b?

Exercise: What are the consequences if we use InterlockedExchange instead of InterlockedCompareExchangeRelease?

Exercise: In the final version of LazyInitialize, are the variables temp_a and temp_b really necessary, or are they just leftovers from previous attempts at fixing the race condition?

Exercise: What changes (if any) are necessary to the above pattern if the global variables are pointers? Floating point variables?

Update: See discussion below between Niall and Anon regarding the need for acquire semantics on the initial read.
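
Here is a minimal sketch of one way to apply that fix, assuming the Visual C++ 2005+ behavior cited in that discussion (reads of volatile variables carry acquire semantics): declare the signal variable volatile so the early-out read cannot be reordered with later reads of the payload. Everything else is the same as the final version above.

int a = -1;
volatile LONG b = -1; // volatile: reads carry acquire semantics on VC2005+

void LazyInitialize()
{
 // Acquire read: a thread that sees b published also sees the store
 // to a, and the compiler cannot hoist reads of a above this test.
 if (b != -1) return;

 // perform all calculations in temporary variables first
 int temp_a = calculate_nominal_a();
 int temp_b = calculate_nominal_b();

 // Adjust the values to conform to our constraint.
 temp_a = max(0, temp_a);
 temp_b = max(temp_a, temp_b);

 // make the temporary values permanent
 a = temp_a;
 // publish the signal variable last, with release semantics
 InterlockedCompareExchangeRelease(&b, temp_b, -1);
}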

Comments (21)
  1. Mordachai says:

    Seems like you still have a logic error: if two threads both see b == -1 and begin initializing their temp copies of a & b, one of them must get to the Interlocked… call first and update b, but the other thread may still overwrite a afterward (having already begun the process of computing a & b).  There's no guarantee that a & b will both be set by the same thread…

  2. Mordachai says:

    Unless you're making the assumption (in which case it should be well documented) that calculate_nominal_?() always produce the exact same values, so it doesn't matter that a and b are actually assigned by different threads.

    [Congratulations, you solved Exercise 1. -Raymond]
  3. Eric Wilson says:

    Exercise #1: We are also assuming that all callers to these functions get the same values (i.e., they don't depend on any global state that could change between calls to LazyInitialize)

    Exercise #2: Reduced performance?  InterlockedExchange uses both acquire and release semantics, which we don't really need here.

    Exercise #3: temp_b is obviously needed since we are using b as the sentinel value.  temp_a can be removed PROVIDED that calculate_nominal_a() always returns the same value.

    Exercise #4: For pointers, we need InterlockedCompareExchangePointerRelease.  For floating point values, you may need a full lock because on some architectures a double is larger than the pointer size and cannot be updated atomically.

  4. Tom says:

    I can answer a few of your exercises.

    1) What hidden assumptions are being made about the functions calculate_nominal_a and calculate_nominal_b?

    There are no side-effects, e.g. manipulation of the a and b variables.

    2) What changes (if any) are necessary to the above pattern if the global variables are pointers? Floating point variables?

    If they are pointers, you'll need to use InterlockedCompareExchangePointerRelease() since pointers can be 32- or 64-bits, depending on the platform.  While the Interlocked functions work on DWORD and QWORD values (with suitable alignment) they typically don't work on floats, at least using strict type checking.  You can interpret a float as a DWORD and a double as a QWORD (at least on x86 hardware) and use the appropriate Interlocked function as long as you're only concerned with comparing them in bit-wise fashion.  Note that doing so will do the exchange without using the FPU registers, so you could possibly jam NaNs into the target float/double.  I don't think there is a 128-bit Interlocked function, so long doubles (80-bits on x86 hardware) are right out.

    3) In the final version of Lazy­Initialize, are the variables temp_a and temp_b really necessary, or are they just leftovers from previous attempts at fixing the race condition?

    I think temp_a could be removed (even by compiler optimization), but temp_b cannot because of the need for the Interlocked function call.

  5. Gabe says:

    Is it safe to assume that you'd need to use InterlockedCompare64Exchange128 for extended-precision floats?

  6. NB says:

    I hope you get back to posting about non-headache inducing subjects soon!

    [You might want to go read a book. It's going to get a lot worse before it gets better. -Raymond]
  7. Mashmagar says:

    Keep up the headache inducing subjects. Once I figure out what's going on, I learn so much more.

  8. +1 on keeping up the geek stuff; this is what I come here for.

  9. anonymous says:

    > What are the consequences if we use InterlockedExchange instead of InterlockedCompareExchangeRelease?

    On x86 / x64 nothing at all because #define InterlockedCompareExchangeRelease InterlockedCompareExchange

    On other platforms maybe a few more cycles, but since this is a one-time initialization (maybe a few times if there is a race) this would not be measurable.

    > What changes (if any) are necessary to the above pattern if the global variables are pointers? Floating point variables?

    If you use a dedicated signal variable then nothing changes.  

  10. Derek says:

    Exercise: What hidden assumptions are being made about the functions calculate_nominal_a and calculate_nominal_b?

    As pointed out, these must be idempotent.  a and b themselves must also not be changed outside LazyInitialize().

    Exercise: What are the consequences if we use InterlockedExchange instead of InterlockedCompareExchangeRelease?

    Since we're assuming that calculate_nominal_a() and calculate_nominal_b() are idempotent, there are no concerns with respect to correctness.  It doesn't really matter whether we get the first thread's or the last thread's values.  The performance difference for this lazy-instantiation case is probably not measurable, either.

    Exercise: What changes (if any) are necessary to the above pattern if the global variables are pointers? Floating point variables?

    If the globals are pointers or floating-point variables, we'll need to use the appropriate Compare/Exchange function.  Atomicity of setting 'a' should probably not be a concern, since we're setting the same value, so even if it's not atomic, there shouldn't be an inconsistent view.  If we're using pointers, we also need to make sure that calculate_nominal_a() and calculate_nominal_b() aren't allocating memory (if they are, they aren't really idempotent anyway).

  11. pinwing says:

    When using floating points with InterlockedCompareExchangeRelease, be sure to use the correct bit-pattern to compare against (and not -1.0f converted to a LONG, if that is chosen as the sentinel value).

    Also, care should be taken when choosing a value for the global_signal_variable and the sentinel value: +0.0 compares equal to -0.0, but bit-patterns differ, and NaN != NaN.

    Also, I believe you have the Exchange (2nd parameter) and Comparand (3rd) mixed up in both uses of InterlockedCompareExchangeRelease.

  12. more boosts says:

    > you can let the Windows kernel team do the thinking and use the one-time

    > initialization functions

    Unless you have to support such archaic operating systems as XP or Server 2003:

    Windows Server 2003 and Windows XP/2000:  Applications must provide their own synchronization for one-time initialization by using the interlocked functions or other synchronization mechanism. The one-time initialization functions are available starting with Windows Vista and Windows Server 2008.

    Personally, I have been using boost::call_once for this.

    http://www.boost.org/…/synchronization.html

    Works on many more platforms than just versions of Windows starting with Vista and Server 2008…

    [On the other hand, call_once is not lock-free. -Raymond]
  13. Stephen Oberholtzer says:

    > What hidden assumptions are being made about the functions calculate_nominal_a and calculate_nominal_b?

    calculate_nominal_a and calculate_nominal_b should be idempotent with regard to side effects (calling them multiple times is the same as calling them once) and reentrant (since multiple threads may be calling them at the same time).

    Depending on the meaning of 'a', interesting things may happen if calculate_nominal_a() can return a different value each time it is called, but I don't believe that will necessarily cause incorrect behavior.

    There's also an implicit assumption that calculate_nominal_b() does not depend on a; if it does, the original code would have worked most of the time. The new code, however, always executes calculate_nominal_b() with a=-1.

    > What are the consequences if we use InterlockedExchange instead of InterlockedCompareExchangeRelease?

    MSDN says that InterlockedExchange does a full memory barrier, which I take to mean "both acquire and release semantics".

    Which as far as I understand it, means that memory-accessing operations found in the stream *after* IE will not be reordered before the IE (acquire), and memory-accessing operations found in the stream *before* IE will not be reordered after the IE (release).

    This could cause a slight drop in performance, which doesn't really matter too much because this code path won't be executed again after the first successful initialization ;)

    Possibly more significant is that this changes the semantics of our function from "first to succeed wins" to "last to succeed wins". Depending upon what calculate_nominal_[ab]() do, this might be acceptable (if they return the same a/b each time) or even preferable (if they return some sort of statistical information, where newer information is better, but the calculation requires the same values be used throughout the execution of the program).

    > In the final version of LazyInitialize, are the variables temp_a and temp_b really necessary, or are they just leftovers from previous attempts at fixing the race condition?

    They're necessary. Otherwise, thread A could be about to execute

    temp_b = max(temp_a, temp_b)

    when thread B executes

    temp_a = calculate_nominal_a()

    If temp_a and temp_b are both negative, and temp_b is less than temp_a, this race could cause the real 'b' to be equal to the result of calculate_nominal_a(), instead of 0 as it should be.

    > What changes (if any) are necessary to the above pattern if the global variables are pointers? Floating point variables?

    Pointers: On a 32-bit system, no worries. On a 64-bit system, you need a cmpxchg8 (I expect that's what InterlockedCompareExchangePointer is for).

    Floats: Floats are 32-bits, so you can use the same code as for 32-bit longs.  

  14. acq says:

    I also like keeping up. I haven't had to develop for Itanium, where Release semantics matter; up to now my target has still been x86 and XP, but I'm still interested in the topics.

    Now that I mention it, does anybody know of another platform, besides Itanium, where Release semantics matter? In today's example, where the given Interlocked* happens at most once per thread, I'd believe there's no need to prefer them over the plain ones, except to be more explicit. I'd also rather not write something I can't test. If I'm never going to see Itanium, should I use the *Release variants at all in my own code?

  15. ryouj says:

    #define InterlockedCompareExchangeRelease InterlockedCompareExchange

    lol

  16. nksingh says:

    @acq:

    You should probably *not* use the Acquire/Release variants of the operations unless it's performance-critical code actually running on an architecture where it matters (Itanium and potential future architectures). If you can't test it, chances are it will be wrong.  Even if you can test it, there's a good chance it will be wrong in rare circumstances that only hit on customer machines under load.

  17. Medinoc says:

    @ryouj: On x86.

    Wow, I didn't know such functions existed. I used to do it with InterlockedCompareExchange and two LONG variables with boolean semantics, one for "started initializing", the other for "initialization done"

  18. John says:

    aren't the temp_signal_variable and the sentinel_value inverted?

    InterlockedCompareExchangeRelease(
       reinterpret_cast<LONG*>&global_signal_variable,
       sentinel_value, temp_signal_variable);
    }

    it should be:

    InterlockedCompareExchangeRelease(
       reinterpret_cast<LONG*>&global_signal_variable,
       temp_signal_variable, sentinel_value);
    }

    ?

    [Indeed it should. Fixed, thanks. -Raymond]
  19. Niall says:

    I believe there is still one potential bug in the final version of LazyInitializePattern, at least in theory. The initial read of global_signal_variable should have acquire semantics. Consider someone using globalvariable1:

    LazyInitializePattern();
    DoSomethingWith(globalvariable1);

    If LazyInitializePattern is then inlined, the resulting code is equivalent to:

    if (global_signal_variable != sentinel_value)
    {
      DoSomethingWith(globalvariable1);
    }
    else
    {
      globalvariable1 = temp_variable1;
      InterlockedCompareRelease(&global_signal_variable, sentinel_value);
      DoSomethingWith(globalvariable1);
    }

    There is then nothing preventing the compiler from reordering the read of global_signal_variable and globalvariable1, to the following:

    register reg1 = globalvariable1;
    if (global_signal_variable != sentinel_value)
    {
      DoSomethingWith(reg1);
    }
    else
    {
      globalvariable1 = temp_variable1;
      InterlockedCompareRelease(&global_signal_variable, sentinel_value);
      DoSomethingWith(globalvariable1);
    }

    To fix it a compiler barrier, _ReadBarrier(), would have to be inserted after the read of global_signal_variable. Only a compiler barrier is needed, not a full memory barrier, as the memory barrier included in InterlockedCompareRelease is sufficient to prevent memory reordering issues.

    [Agreed. -Raymond]
  20. Anon says:

    Niall wrote:

    "Only a compiler barrier is needed, not a full memory barrier, as the memory barrier included in InterlockedCompareRelease is sufficient to prevent memory reordering issues."

    Doesn't that assume that the CPU can only reorder writes?

    I see nothing stopping a CPU that allows visible out-of-order reads (Itanium, Alpha) from speculatively reading/caching globalvariable1 before global_signal_variable.

    Based on the information at msdn.microsoft.com/…/ms686355%28v=vs.85%29.aspx , it appears that the code should either use InterlockedCompareExchange when reading global_signal_variable, or global_signal_variable should be marked "volatile" (if building on VC2005+). Both of these enforce acquire barriers, and also prevent reordering by the compiler.

    Another option might be to insert a MemoryBarrier() before the "return".

  21. Niall says:

    @Anon: Yes, you're quite right, it does need a real acquire barrier when reading global_signal_variable; I had x86/x64 in the back of my mind when I wrote that. Using InterlockedCompareExchange or MemoryBarrier is overkill, though; the full fence they use would be a bit expensive on what should be the fast code path (already initialized). So volatile is probably the best, most portable choice available.

    This is all just a great reason to use the one-time initialization functions instead really.

Comments are closed.