Spurious wakes, race conditions, and bogus FIFO claims: A peek behind the curtain of WaitOnAddress


We spent the past few days looking at the Wait­On­Address function. Today we're going to peek behind the curtain.

Reminder: These are undocumented implementation details and are subject to change at any time. The information is presented here for educational purposes. Do not take dependencies on the implementation.¹

Okay, let's go.

The general idea behind the implementation of the Wait­On­Address function is that all of the threads that have an outstanding Wait­On­Address call are kept on a lock-free hash table, keyed by the address. Once the thread adds itself to the hash table, the thread calls an internal kernel function to go to sleep.

When one of the Wake­By­Address functions is called, it searches the hash table looking for any threads that are waiting for the address, and either wakes one or all of them by removing the corresponding hash table entry and calling an internal kernel function to snap the thread out of its stupor.
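
To make the shape of this concrete, here is a deliberately naive sketch in C. None of these names exist in Windows; `struct wait_entry`, `find_bucket`, `push_entry`, `pop_matching`, `kernel_block`, and `kernel_unblock` are invented stand-ins for the internal hash table and kernel calls.

```
// Deliberately naive sketch; every name here is an invented stand-in
// for the internal ntdll data structures and kernel calls.
struct wait_entry {
    struct wait_entry *next;
    void const *address;   // the address being waited on
    void *thread;          // which thread to wake
};

void naive_wait(void const *address)
{
    struct wait_entry entry = { 0 };
    entry.address = address;
    entry.thread = current_thread();           // hypothetical
    push_entry(find_bucket(address), &entry);  // hypothetical: publish the wait
    kernel_block(entry.thread);                // hypothetical: go to sleep
}

void naive_wake_single(void const *address)
{
    // Find (and remove) one entry whose address matches, then wake it.
    struct wait_entry *e = pop_matching(find_bucket(address), address);
    if (e) kernel_unblock(e->thread);          // hypothetical: snap it awake
}
```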

From this rough description, you can see a few race conditions. For example, suppose the monitored address contains the undesirable value at entry to the WaitOnAddress function, but while the thread is adding itself to the hash table, another thread changes the value and calls WakeByAddress. Since the waking thread doesn't see the hash table entry, it finds nobody to wake. Meanwhile, the first thread finishes adding its entry to the hash table and goes to sleep, but it's too late: the wake has already come and gone, and the waiting thread waits forever.

To solve this problem, the WaitOnAddress function adds the thread to the hash table, and then re-checks whether the value at the address is equal to the undesirable value. If not, then WaitOnAddress skips the wait entirely, thereby avoiding the problem of entering a wait that will never be woken.
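
Continuing the naive sketch from above (again, every name is invented), the fix adds a re-check between publishing the wait entry and going to sleep:

```
// Continuation of the naive sketch; all names remain invented.
void less_naive_wait(void const *address,
                     void const *undesirable, size_t size)
{
    struct wait_entry entry = { 0 };
    entry.address = address;
    entry.thread = current_thread();           // hypothetical
    push_entry(find_bucket(address), &entry);  // hypothetical: publish the wait

    // Re-check: if the value already changed, a wake may have raced past
    // before our entry was visible, so don't go to sleep.
    if (memcmp(address, undesirable, size) != 0) {
        remove_entry(find_bucket(address), &entry);  // hypothetical
        return;
    }

    kernel_block(entry.thread);                // hypothetical: go to sleep
}
```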

Even with this fix, there's still a race condition: What if the re-check says that the value at the address is equal to the undesirable value, but just before the thread is about to call the internal kernel function, it gets pre-empted, and another thread issues the corresponding wake? The waiting thread is in the hash table, so the waking thread will ask the kernel to wake the waiting thread, but the waiting thread isn't waiting yet, so the wake request has no effect. The waiting thread then enters the kernel and goes to sleep, never to be woken.

To solve this second race condition, the kernel steps in to assist. When the waiting thread goes to sleep, it also gives the kernel the address that it's waiting for. Similarly, when the waking thread wants to wake up the waiting thread, it gives the kernel the identity of the thread to wake up as well as the address that is being woken. If the wake call arrives but the thread is not waiting on an address, then the kernel remembers the address for that thread. When the thread finally gets around to waiting on the address, the kernel says, "Oh, hey, there's a wake waiting for you."
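
As a sketch (invented names again, and ignoring all of the synchronization a real kernel would need), you can model the kernel's side of the bargain like this:

```
// Conceptual model of the kernel-side assist; names are invented and
// the locking a real kernel would need is omitted.
struct thread_state {
    void const *pending_wake_address;  // the buffered wake, if any
};

void kernel_wake(struct thread_state *t, void const *address)
{
    if (thread_is_waiting_on(t, address)) {    // hypothetical
        unblock(t);                            // hypothetical
    } else {
        // The thread isn't waiting yet: remember the wake for later.
        t->pending_wake_address = address;
    }
}

void kernel_wait(struct thread_state *t, void const *address)
{
    if (t->pending_wake_address == address) {
        t->pending_wake_address = NULL;        // consume the buffered wake
        return;                                // "somebody already woke you"
    }
    block_until_woken(t, address);             // hypothetical
}
```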

This wake buffer is only one entry deep. If you try to wake a non-waiting thread twice, only the most recent wake request will be remembered. But that's okay, because the way that WaitOnAddress works, it never needs anything but the most recent wake request to be buffered.

This last race condition does explain one of the cases of a spurious wake: Suppose a thread calls Wait­On­Address, and after it adds the node to the hash table, another thread wakes the thread by address. Since the thread hasn't entered the kernel yet, the wake request is buffered. But now the thread re-checks the address and sees that the value is not the undesirable value, so the wait is skipped entirely, and control returns to the caller.

But the wake is still buffered.

That same thread then calls Wait­On­Address for the same address. This time, it adds the node to the hash table, no race condition occurs, and the thread enters the kernel to wait. The kernel thinks that this second wait is the race condition from the first call to Wait­On­Address, so it completes the wait immediately. "Hey, while you were busy setting up the wait, somebody already woke you."

Result: Spurious wake.

Another note about spurious wakes: There's nothing preventing an unrelated component from calling WakeByAddress on the same address. And as we noted earlier, there's also the possibility that the value changed back to the undesirable value after the thread was woken. Therefore, spurious wakes are simply a fact of life, and your code needs to be able to handle them. Even if there were a way to clear the buffered wake from the kernel, applications would still have to deal with spurious wakes for these other reasons.
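
In practice, this is why code that consumes WaitOnAddress always re-checks the value in a loop, so a spurious wake merely costs one extra pass. A minimal sketch (the function and variable names here are mine, not part of the API):

```
#include <windows.h>

#pragma comment(lib, "synchronization.lib")

// Block until *value is something other than 'undesirable'.
// A spurious wake simply falls out of WaitOnAddress, fails the
// re-check at the top of the loop, and waits again.
void WaitUntilValueChanges(volatile LONG *value, LONG undesirable)
{
    while (*value == undesirable) {
        WaitOnAddress(value, &undesirable, sizeof(*value), INFINITE);
    }
}

// The corresponding publisher: change the value first, then wake waiters.
void PublishNewValue(volatile LONG *value, LONG newValue)
{
    InterlockedExchange(value, newValue);
    WakeByAddressAll((PVOID)value);
}
```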

My final note for today is a response to the documentation for Wake­By­Address­Single which says

If multiple threads are waiting for this address, the system wakes the first thread to wait.

This is one of those times where the documentation starts documenting the implementation rather than the contract. My guess is that the current implementation wakes waiters FIFO, and somehow that got into the documentation without a clear indication that this is a descriptive statement rather than a normative one.

What's more, it's not even accurate. The presence of spurious wakes means that the order in which threads call WaitOnAddress may not match the order in which they sit in the queue, because a spurious wake takes a thread out of the queue, and it then re-enters the queue at the end. (Sound familiar?) And who knows, a future version of WaitOnAddress may choose to dispense with FIFO waking in order to add convoy resistance.

Okay, that's enough discussion of Wait­On­Address for now.

¹ Mind you, this warning didn't stop people from snooping in on the internals of the CRITICAL_SECTION structure. As a result, the CRITICAL_SECTION structure must continue to use an automatic-reset event and to use -1 to indicate an unowned critical section, which prevented the kernel team from switching to a keyed event for critical sections. And even though the internal bookkeeping has changed, and the LockCount is no longer the lock count, the implementation must nevertheless go through contortions to ensure that the value of LockCount is -1 precisely when the critical section is unowned.

Comments (15)
  1. Pierre B. says:

    I suppose the kernel could create a new critical section with a new implementation and new macros to declare them?

    (Plus, with the advent of IoT, eventually, the kernel folks will be able to send zaps to the offending coders.)

  2. jnm236 says:

    Just curious- what’s the main reason Microsoft tries to shield users of software from the intentional bad decisions of the software writers like depending on undocumented features with no fallback?

    1. Joshua says:

      Many times on this blog: because you tend to depend on third party libraries that do it wrong.

    2. Kemp says:

      Because when MS change the implementation in a non-backward-compatible way and the software no longer works the users (at big companies with expensive licenses) aren’t going to think “oh, the software must have been breaking the rules, we’ll find a new vendor now”, they’ll think “Windows sucks, the upgrade broke all our software, we’d better tell everyone not to upgrade”. Even if you could convince them it’s not Windows’ fault they *still* can’t upgrade because their software is still broken. The truth will not set them free.

      1. DWalker says:

        “The truth will not set them free.” That’s one of the best explanations of this phenomenon I have ever heard!

        Even if (or when) the users fully understand that they can’t upgrade to a newer version of Windows *because* of bad decisions made by the ISVs, if their line-of-business software can’t handle an upgrade, then the users STILL cannot upgrade.

    3. Yukkuri says:

      Not to mention the flood of lawsuits that would be generated accusing Microsoft of deliberately breaking compatibility for anticompetitive reasons.

      1. Yukkuri says:

        And before you say “But it is the ISVs that screwed up!” – yes, we know that. But will a jury understand that? Is it worth the costs of litigating and bad publicity?

        Smarter for Microsoft and better for end users to just deal with the compat hacks.

    4. Voo says:

      Even ignoring large corporations or assigning blame, if everything works fine in Windows N and half your programs break with Windows N+1, are you going to upgrade to the newest Windows release?

      I certainly wouldn't without good reason, and as Vista demonstrated, there are many people out there who are fine with an older OS version.

    5. Jonathan Gilbert says:

      One possible solution: When introducing a new feature, implement its underlying data storage two different ways from the very beginning. Each time e.g. InitializeCriticalSection is called, it picks one at random. With each Windows upgrade, replace one of the two possible data storage layouts. Document that you will do this. Do not document the layouts themselves. If this were done across the board, it would become common knowledge that you can’t depend on the underlying storage of system structures documented as opaque. The only question is, is that extra work (and extra bit/byte of storage in the data structures) worth the gain? I personally have to lean toward, “Probably, yes” :-) Having developers that don’t respect boundaries clearly makes a *big* difference to what you can and can’t do as the OS evolves.

      1. Back in the early days, developers were keenly sensitive to the number of cycles it took to perform an EnterCriticalSection. Articles were written about how few cycles it took to enter an uncontended lock. Adding an additional test and branch would have caused people to freak out.

  3. Richard Barrell says:

    Thank you for the explanation. The POSIX threads standard has a similar caveat that implementations are permitted spurious wakeups when sleeping on condition variables. I previously had no idea why. Your write up here illuminates why the pthreads standard’s authors too would have felt the need to allow spurious wakeups from condvars.

  4. Ariel Ben-Yehuda says:

    So WaitOnAddress is implemented half in userspace and half in kernelspace, unlike Linux’s similar futex(2) API?

    The futex(2) implementation inserts itself into the hash table and checks for waiters within the kernel, so this kind of spurious wakeup should not happen.

    However:
    1) typically, you have to check your condition again under a lock, so spurious wakeups do not matter.
    2) signals still “spuriously” wake you up.
    3) there is the self-synchronized destruction problem.

    The self-synchronized destruction problem has several variants; it appears most classically when a refcount is protected by a lock. Suppose that `obj` is owned by 2 threads, with a refcount of 2, and both execute the code

    ```
    lock(&obj->lock);
    if (--obj->refcount) {
        unlock(&obj->lock);
    } else {
        unlock(&obj->lock);
        free(obj);
    }
    ```

    If `unlock` is inlined, this becomes

    ```
    lock(&obj->lock);
    if (--obj->refcount) {
        obj->lock = LOCK_UNLOCKED;
        WakeByAddressSingle(&obj->lock);
    } else {
        obj->lock = LOCK_UNLOCKED;
        WakeByAddressSingle(&obj->lock);
        free(obj);
    }
    ```

    And execution can progress as
    ```
    thread 0:
        lock(&obj->lock);
        if (--obj->refcount) {
            obj->lock = LOCK_UNLOCKED;
    thread 1:
        lock(&obj->lock); // finishes because obj->lock is unlocked.
        if (--obj->refcount) {
            obj->lock = LOCK_UNLOCKED;
            WakeByAddressSingle(&obj->lock);
            free(obj);
    thread 0:
            WakeByAddressSingle(&obj->lock); // use after free!
    ```

    The problem is that a thread must wake a lock *after* it allows for the lock to be taken, which allows another thread to free it. The “use-after-free” occurs in the *kernel*, so it can’t crash the user program. However, if the lock is reused, the next lock’s owner will see a spurious wakeup.

    1. Um, futex is also implemented half in userspace and half in kernelspace. The “u” stands for “userspace” after all. (Userspace manipulates the integer, and then calls into kernelspace if waiting or waking is necessary.) Confusingly, the syscall is also called “futex” even though it really should be called “futexk”. Another solution to the problem you describe above is to call WakeByAddress before releasing the lock. It is explicitly permitted to call Wake while holding the lock.

    2. Joshua says:

      I tried to figure out what was going on here; unfortunately I discovered the documentation is somehow wrong: WaitOnAddressSingle is somehow not in kernel32.dll in Windows 10 (in desperation I tried GetProcAddress and got NULL back). Maybe it's only in synchronization.lib, in which case the docs need some real improvement.

      Either way this code is wrong and the solution is as follows:
      lock(&obj->lock);
      unlock(&obj->lock);
      if (InterlockedAdd(&obj->refcount, -1) == 0) free(obj);

      1. Joshua says:

        On readback I typed it wrong here. My code does GetProcAddress(hKernel32, “WakeByAddressSingle”);

Comments are closed.
