Lock-free algorithms: The opportunistic cache


Suppose profiling reveals that a specific calculation is responsible for a significant portion of your CPU time, and instrumentation says that most of the time, it's just being asked to calculate the same thing over and over. A simple one-level cache would do the trick here.

BOOL IsPrime(int n)
{
 static int nLast = 1;
 static BOOL fLastIsPrime = FALSE;

 // if it's the same value as last time, then
 // use the answer we cached from last time
 if (n == nLast) return fLastIsPrime;

 // calculate and cache the new result
 nLast = n;
 fLastIsPrime = slow_IsPrime(n);
 return fLastIsPrime;
}

Of course, this isn't thread-safe, because if one thread is pre-empted inside the call to slow_IsPrime, then another thread will see values for nLast and fLastIsPrime that do not correspond to each other.

One solution would be to put a critical section around this code, but this introduces an artificial bottleneck: If the most recent cached result is nLast = 5, fLastIsPrime = TRUE, and if two threads both try to see whether 5 is prime, you don't want those two threads to serialize against each other.

Another solution is to use slim reader-writer locks and acquire in shared mode when checking the cache and in exclusive mode when updating the cache. But let's try a lock-free solution.

We're going to combine two different techniques: the change counter technique we saw last time when we investigated try/commit/(try again), and a lock that is manipulated with a try/commit/abandon pattern.

#define IsLocked(l) ((l) & 1)

BOOL IsPrime(int n)
{
 static int nLast = 1;
 static BOOL fLastIsPrime = FALSE;
 static LONG lCounter = 0;

 // see if we can get it from the cache
 LONG lCounterStart = InterlockedReadAcquire(&lCounter, -1);
 if (!IsLocked(lCounterStart) && n == nLast) {
  BOOL fResult = fLastIsPrime;
  if (InterlockedReadRelease(&lCounter, -1) == lCounterStart)
   return fResult;
 }

 // calculate it the slow way
 BOOL fIsPrime = slow_IsPrime(n);

 // update the cache if we can
 lCounterStart = lCounter;
 if (!IsLocked(lCounterStart) &&
     InterlockedCompareExchangeAcquire(&lCounter,
              lCounterStart+1, lCounterStart) == lCounterStart) {
  nLast = n;
  fLastIsPrime = fIsPrime;
  InterlockedCompareExchangeRelease(&lCounter,
              lCounterStart+2, lCounterStart+1);
 }
 return fIsPrime;
}
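
Here, InterlockedReadAcquire and InterlockedReadRelease are the helper functions introduced earlier in this series, not Win32 APIs. As a minimal sketch, assuming the pattern from the earlier article (read a variable with the desired memory semantics by performing a compare-exchange against a value the variable is unlikely to hold, here -1, so the exchange fails and simply returns the current value):

LONG InterlockedReadAcquire(LONG volatile *pl, LONG lUnlikely)
{
 // The compare is expected to fail, so this is in effect a read
 // of *pl with acquire semantics.
 return InterlockedCompareExchangeAcquire(pl, lUnlikely, lUnlikely);
}

LONG InterlockedReadRelease(LONG volatile *pl, LONG lUnlikely)
{
 // Same trick, but with release semantics.
 return InterlockedCompareExchangeRelease(pl, lUnlikely, lUnlikely);
}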

The lCounter consists of a LOCK bit as the bottom bit and a change counter in the remaining bits. (Choosing the bottom bit as the LOCK bit makes the operations of clearing the lock and incrementing the counter very simple.)

There are two parts to this function, the part that reads the cache and the part that updates the cache.

To read the cache, we first read the counter with acquire semantics, so that the reads of nLast and fLastIsPrime will not take place until after we get the counter. If the counter says that the cache is not locked, then we go ahead and fetch the last value and the last result. If the last value in the cache matches the value we're calculating, then we go ahead and use the last result. But as a final check, we make sure that the counter hasn't changed while we were busy looking at the protected variables. If it has, then it means that we may have read inconsistent values and cannot trust the cached result.

If we have a cache miss or we couldn't access the cache, we go ahead and calculate the result the slow way.

Next, we try to update the cache. This time, instead of just looking to see whether the cache is locked, we try to lock it ourselves by setting the bottom bit. (If the lock fails, then we skip the cache update and just return the value we calculated.) Once the lock is taken, we update the protected variables, then atomically release the lock and increment the counter. (This is where putting the lock in the bottom bit comes in handy: You can increment the counter by adding 2 and not worry about a carry out of the counter bits turning into an accidental lock bit.) We use release semantics so that the values of the protected variables are committed to memory before the lock bit clears.
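
To make the counter arithmetic concrete, here is the life cycle of lCounter across one successful update (a worked illustration; the specific values are hypothetical):

// lCounterStart     == 4: binary ...100, unlocked, version 2
// lCounterStart + 1 == 5: binary ...101, locked while we update
// lCounterStart + 2 == 6: binary ...110, unlocked, version 3
//
// A reader that captured the counter as 4 before our update reads back
// 5 or 6 afterward, sees the mismatch, and discards what it read.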

Note that in both halves of the function, if the cache is locked, we just proceed as if there were no cache at all. The theory here is that it's better just to say "Oh, the heck with it, I'll just do it myself" than to line up and wait to access the cache. Continuing instead of waiting avoids problems like priority inversion, but it also means that you get some spurious cache misses. Fortunately, since it's just a cache, an occasional spurious miss is not the end of the world.

You could do something similar with the TryEnterCriticalSection function, provided you're running Windows NT 4.0 or higher:

BOOL IsPrime(int n)
{
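 // (g_cs is assumed to be a global CRITICAL_SECTION, initialized
 // elsewhere, e.g. via InitializeCriticalSection(&g_cs).)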
 static int nLast = 1;
 static BOOL fLastIsPrime = FALSE;
 BOOL fHaveAnswer = FALSE;
 BOOL fIsPrime;

 // see if we can get it from the cache
 if (TryEnterCriticalSection(&g_cs)) {
  if (n == nLast) {
   fHaveAnswer = TRUE;
   fIsPrime = fLastIsPrime;
  }
  LeaveCriticalSection(&g_cs);
 }
 if (fHaveAnswer) return fIsPrime;

 // calculate it the slow way
 fIsPrime = slow_IsPrime(n);

 // update the cache if we can
 if (TryEnterCriticalSection(&g_cs)) {
  nLast = n;
  fLastIsPrime = fIsPrime;
  LeaveCriticalSection(&g_cs);
 }
 return fIsPrime;
}

This does have the disadvantage that multiple readers will lock each other out, so we can switch to a slim reader/writer lock, provided we're running on Windows 7 or higher:

BOOL IsPrime(int n)
{
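 // (g_lock is assumed to be a global SRWLOCK initialized to SRWLOCK_INIT.)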
 static int nLast = 1;
 static BOOL fLastIsPrime = FALSE;
 BOOL fHaveAnswer = FALSE;
 BOOL fIsPrime;

 // see if we can get it from the cache
 if (TryAcquireSRWLockShared(&g_lock)) {
  if (n == nLast) {
   fHaveAnswer = TRUE;
   fIsPrime = fLastIsPrime;
  }
  ReleaseSRWLockShared(&g_lock);
 }
 if (fHaveAnswer) return fIsPrime;

 // calculate it the slow way
 fIsPrime = slow_IsPrime(n);

 // update the cache if we can
 if (TryAcquireSRWLockExclusive(&g_lock)) {
  nLast = n;
  fLastIsPrime = fIsPrime;
  ReleaseSRWLockExclusive(&g_lock);
 }
 return fIsPrime;
}

This still has the problem that readers can lock out a cache update. If the function is hot (and if it weren't, why would you switch to a lock-free algorithm?), and the usage pattern shifts (say, instead of checking whether 13 is prime over and over, it starts checking whether 17 is prime over and over), everybody will be so busy reading the cache to see if the cached value is 17 that nobody will get a chance to update the cache to actually be 17!

Exercise: What constraints must be imposed on the protected variables for this technique to be successful?

Comments (21)
  1. Joshua says:

    I believe the constraint is no pointers to memory objects that you cannot afford to leak.

  2. Maurits says:

    Pet peeve: please don't use "l" as an identifier.

    ((l) & 1)

  3. Arseny Kapoulkine says:

    In this exact case, I'd ensure that the cache access is atomic (we can use a sign bit for the 'prime' result, or use an int64 with suitable atomic read/write functions); the code then becomes something like

    uint32_t v = g_cache;
    if ((v & 0x7fffffff) == n) return v >> 31;
    uint32_t r = n | (slow_prime_check(n) << 31);
    g_cache = r;

    This is both easier and faster.

    P.S. The last assignment can be changed to CAS(&g_cache, v, r) to improve perf. in rare cases (lots of concurrent checks with different values – should decrease cache eviction rate) and to make it worse in all other cases.

    [True, that works in this specific case, but let's generalize to the case where the information you need to cache doesn't fit into an atomic data size. -Raymond]
  4. Michael J says:

    bool IsPrime(int n)
    {
       static std::map<int, int> cache;
       if (!cache[n])
           cache[n] = 1 + SlowIsPrime(n);
       return cache[n] - 1;
    }

    // Note we store "+1" to distinguish "false" from "not set".
    // If the range of 'n' is small, we can use an array instead of a map.

    bool IsPrime(int n)
    {
       const int MAX(1024);
       static int cache[MAX] = { 0 };
       assert(n >= 0 && n < MAX);
       if (!cache[n])
           cache[n] = 1 + SlowIsPrime(n);
       return cache[n] - 1;
    }

  5. Adrian says:

    Fascinating series.  Shouldn't all those statics that are used with the Interlocked… functions be declared volatile?

  6. Pierre B. says:

    Ahahahah!

    A multiple-part series covering the ins and outs of locking, lock-free access, and memory barriers, and we get two back-to-back posters who write code ignoring everything that was taught. Priceless.

  7. Arseny Kapoulkine says:

    Pierre, care to back up your words and tell me why memory barriers/locking/lock-free access is required in my example?

  8. sukru says:

    Wouldn't it be easier if you changed the architecture a little bit:

    static int lastPrime = 2;
    static int lastNonPrime = 1;

    if (n == lastPrime)
     return true;
    if (n == lastNonPrime)
     return false;

    bool retval = is_prime_slow(n);
    if (retval)
     lastPrime = n;
    else
     lastNonPrime = n;
    return retval;

    No synchronization, and since int writes are atomic (as long as they are aligned), no worries.

    [Yes, that works for this specific example, but let's consider the more general case. (In the more general case that inspired this example, there were about four inputs to a complex calculation which resulted in an integer output.) -Raymond]
  9. bcs says:

    @sukru: I've always been a fan of that class of solutions: build stuff so that the state is only modified by atomic writes, and so that after any write you make, the state will be correct regardless of any other intervening accesses. I was going to suggest storing LastIsPrime as the low-order bit of Last and then special-casing 2 and all other even numbers.

  10. Thorsten says:

    For this particular case, where Last and LastIsPrime together are smaller than two pointers, and assuming that for now you are happy to target just x86/x64, I would simply go with a "lock cmpxchg8b" instruction for both reading and writing, which makes the implementation trivial and much more straightforward.

  11. John says:

    Yikes.  That's a lot of code.  Wouldn't you be better off just using a thread local variable for the cached value?

  12. vox says:

    What is the advantage of using the Acquire/Release functions, since the standard ones simply "impose both acquire and release semantics"? (msdn.microsoft.com/…/ff540496.aspx)

  13. acq says:

    The responses appear to be a good proof that oversimplified examples lead people in the wrong direction. (I've always had a problem with books doing this, because somewhere there's always somebody who retypes the example and uses it in production since "that's how it should be done"; such people probably shouldn't stay in those positions, but it's hard to influence that.) Anyway, if we imagine examples that start, instead of:

    BOOL IsPrime(int n)
    {
       static int nLast = 1;
       static BOOL fLastIsPrime = FALSE;

    with

    struct Something { int a, b, c, d; };

    BOOL IsGood( Something x )
    {
       static Something lastX;
       static BOOL lastXIsGood;

    there would be much less chance for misunderstanding and also less to explain in text.

    On the other hand, I can imagine that some lawyers somewhere could then claim the code is "too similar to ours." Yes, I know it's hard to find the right balance.

  14. Gabe says:

    acq: The "IsPrime" function is something that everybody can relate to. We an all understand how it would take some time to compute and that you might want to cache it. The "IsGood(Something)" function is much more abstract, which would make the example harder to understand.

  15. acq says:

    @Gabe thanks. Incorporating your insight, the "clear" example could be, for example:

    struct MyBigNum { __int64 a, b, c, d, e; };

    BOOL IsPrime( MyBigNum& x )
    {
       static MyBigNum lastX;
       static BOOL lastXIsPrime;

  16. JT says:

    What are InterlockedReadRelease and InterlockedReadAcquire?  MSDN seems to have failed to document them.

    [These functions were introduced earlier in the week. -Raymond]
  17. Daniel D says:

    I know this is a contrived example to demonstrate various thread techniques, but personally I would go with __declspec(thread)/TLS on that state.  It can be much better for individual threads to use their own state rather than to share state.

    This might also be useful if analysis reveals that individual threads are likely to repeat the same request, but distinct threads are likely to make different requests (that is, thread 1 tends to ask for the results of one value and thread 2 tends to ask for the results of a different value). You'd see more thrashing of your one-value cache if that were true. Again, __declspec(thread) on your one-value cache would help in this instance.

  18. Worf says:

    To answer a few questions: the *Release and *Acquire functions are used to impose a memory barrier, because a modern processor does not always post or complete memory transactions in program order. If you read or write two memory addresses, the accesses can occur in either order, which can mess up the interlocked functions if you're using them to signal completion of an update.

    So you can impose read memory barriers (at this point, all previous reads have completed), aka Acquire, and write memory barriers (at this point, all writes have been posted), aka Release.

    This is a modern processor thing; it doesn't matter if it's single core or multi core (depends on architecture).

  19. Worf says:

    Forgot to add: the plain functions impose both barriers, so all memory accesses are done at that stage. On x86, I believe this is the case, as it does not support read or write barriers individually.

  20. SL says:

    x86 (in sufficiently modern versions) does provide: mfence=serialize both loads and stores, lfence=serialize loads, sfence=serialize stores.

  21. Kayru says:

    Agree with John's comment: TLS should be enough for this.

    [TLS works if your cache works with thread scope. But if you want each object to have its own cache, then TLS doesn't help. (Giving each object its own cache is not unreasonable. For example, "Give each transaction its own SID cache since a transaction rarely changes its identity, but different transactions will probably have different SIDs.") -Raymond]

Comments are closed.