Concurrency, Part 9 - APIs that enable scalable programming

Sorry, I just noticed that this didn't get posted yesterday - I don't know what happened, but...

As I mentioned the other day, Win32 provides a number of tools for writing scalable applications.

Today I want to catalog some of them.  Remember, that the #1 goal you have when attempting to design MP scalable applications is to avoid blocking your thread.  You absolutely never, ever want to let your thread go through the CPU scheduler, because that means that your processor is spending time switching threads when it could be doing work.  So what are some of the tools that Win32 provides for this?

First off, the two I mentioned yesterday, Fibers and Interlocked operations.  Fibers allow an application to switch between tasks without exhausting the currently running threads' quantum, but their cost can be quite high in terms of complexity - you essentially have to write your own scheduler to effectively use them.

The Interlocked functions, however are really, really neat.  You can't really appreciate the power of the APIs from their description.

Interlocked Increment and Decrement are straightforward - they increment and decrement a long, returning the new value of the variable.

InterlockedExchangeAdd is again straightforward - it adds (or subtracts) from a 32bit value.  Again, straightforward.  So is InterlockedExchangePointer - it swaps two pointer values.

The magic of the interlocked functions starts with InterlockedCompareExchangePointer - It turns out that if you're clever, you can use InterlockedCompareExchangePointer to implement a LIFO queue that doesn't require any external synchronization structures.  I'd write down how to make this work, but the good news is that I don't have to, because the NT base team formalized it with the Interlocked Singly Linked List primitives.  These allow you to add and remove entries to a list, again without requiring a critical section.

The next tools that are available for scalability are, somewhat paradoxically, the Win32 critical section structures.  It turns out that there are two big issues with critical sections.  One is that critical sections can block on contention.  The other is that if you use a critical section to protect a commonly accessed variable (like the head of your work queue), a phenomenon known as a lock convoy.  Lock convoys are horribly bad when you're designing an application to be scalable, because all your threads eventually wind up being serialized on a single critical section.  So a great deal of effort has been put into reducing the likelihood of lock convoys in critical sections.  The first of the changes made (for NT4 SP3 (and Windows 98+)) was to add the InitializeCriticalSectionAndSpinCount API.  This API allows the user to specify a "spin count" for critical sections - essentially, when you attempt to enter the critical section, if the critical section is owned by another thread, instead of simply blocking (and throwing away your quantum), the EnterCriticalSection API will "spin" trying repeatedly to acquire the critical section.  So if your critical section is "hot" (meaning it's a high contention critical section, and the critical section is held for a very short period of time), then the odds are very good that the critical section will be released while the other thread is spinning.  And that means that the thread that's trying to grab the critical section won't have to give up its quantum.

Now ICSASC isn't a panacea - if you own the critical section for any length of time, then spinning on contention just wastes time.  But for some critical sections, it can be a huge win.

In addition, for Windows 2003, SP1, the EnterCriticalSection API has a subtle change that's intended tor resolve many of the lock convoy issues.  Before Win2003 SP1, if 10 threads were blocked on EnterCriticalSection and all 10 threads had the same priority, then EnterCriticalSection would service those threads in a FIFO (first -in, first-out) basis.  Starting in Windows 2003 SP1, the EnterCriticalSection will wake up a random thread from the waiting threads.  If all the threads are doing the same thing (like a thread pool) this won't make much of a difference, but if the different threads are doing different work (like the critical section protecting a widely accessed object), this will go a long way towards removing lock convoy semantics.

As I mentioned in my "So you want a worker thread pool" article, another incredibly useful way of achieving scalability is to use an I/O completion port for a work queue - because the completion port queue is protected by the same kernel structure that protects the scheduler in NT, if there are any work items on the queue, your thread won't block when it calls GetQueuedCompletionStatus.

And, in addition to all these, there's one other facility - Process Affinity.  The SetProcessAffinityMask, SetThreadAffinityMask and SetThreadIdealProcessor APIs can be used to "pin" a thread to a CPU.  Otherwise the thread will be run on whichever processor is available.  Switching a thread from one processor to another can have serious negative consequences, since there is internal per-CPU state that gets tossed when a thread is moved.

There may be other APIs and facilities for improving the scalability of your application in Win32, but (honestly) they're not coming to mind immediately :)

Tomorrow, I want to talk about a couple of the real-world problems that you start encountering when you build highly scalable applications.

Edit: Added a paragraph about the Win2003 SP1 EnterCriticalSection behavior changes (I just got approval to post that behavior change).

Edit2: Fixed InterlockedIncrement/Decrement description, thanks Dave