Concurrency, Part 8 - Concurrency for scalability

Last time, I talked a bit about why you'd use concurrency in your application.  Today I'd like to talk a smidge about using concurrency to achieve scalability.
The first and most important thing that you have to understand when you decide to start improving the MP scalability of your application is that you are about to start on a long and arduous journey.  MP scalability is not a task for the faint of heart - as I mentioned in the
first of these articles, the reality is that high end scalability is very similar to quantum physics.  Once you start trying to introduce MP scalability into your application, most the rules you thought you understood about programming  change.  As I mentioned earlier, there are only a relatively few people at Microsoft who really understand this problem space (I'm not one of them), so I can only talk in general principles about this (yay, I used the right principle! :)).

Another thing to realize is that the vast majority of applications don't need to worry about scalability.  If your application is interacting with the user, for example, then worrying about scalability is almost always a total waste of time - the reality is that your application spends most of its time interacting with a user - and users are SLOW relative to computers.  Your scalability is generally limited to the speed of the user, and they are really slow - even the fastest users can only type 9-10 characters per second, and mice only generate hundreds of inputs per second.  This is not to say that concurrency isn't an issue for your application - you might very well chose to use multiple threads, but those threads are typically used for offloading work from your main user interface processing thread.

When would you want to worry about scalability?  Well, if you're writing a server of some kind, it's going to be really important.  The number of users that is supported by your server is limited by the bottlenecks of your server.  And you need to understand where those bottlenecks exist before you can start worrying about scalability.  For example, an email server is typically disk bound (most people don't realize this, but email is inherently a database processing problem, and as such it's usually disk bound).  But as a disk bound application, you can solve that problem by throwing more disks at the problem.

I'm only going to talk about CPU bound scalability issues, since the other types of scalability bottlenecks are offtopic.  But in general, depending on the design of your application, you can have bottlenecks in almost every type of resource you consume - you can have bottlenecks in disk bandwidth, in virtual address space, in cpu resources, etc.

But if your application is CPU bound (for example, a database server that spends most of its time traversing indexes in database), then the obvious solution is to throw more CPUs at the problem.

And that's when the problem gets strange.

The key to MP scalability the realization that a single processor core can only do one thing at a time.  And if your CPU is spending time doing anything that's not directly related to your application, it's hurting your scalability.

Why is this important?  Well, for two reasons.  The first is that it means that optimally, your application would only have one runnable thread per CPU core at anytime.  Any more threads than that means that the operating system is going to context switch between the threads.  And if the operating system's context switching, then that means that it's spending CPU cycles saving your state, finding the next runnable thread, restoring that threads state.  This is time that could be spent in your application performing its calculations, and instead, it's wasted switching to another thread in your application.

The other reason has to do with your thread quanta.  The NT scheduler schedules threads based on a concept called a quanta - the minimal amount of time that a thread will run.  Typically a quantum ranges from 20ms to 120ms in length (it varies depending on the OS configuration).  The NT scheduler is fairly deterministic, the OS won't context switch away from a thread unless one of two events occurs - first, if a higher priority thread becomes runnable, and second, if the thread relinquishes the CPU (by calling a system call that would block).  So, in general, if your thread doesn't block, your thread is somewhat immune to context switches for the length of its quantum (this is a rough approximation, but works for this discussion).

So it's critical that if you're trying to manage your scalability, you need to ensure that you get the most of your CPU.  And that means that you don't ever want to let the NT scheduler context switch away from your thread for any other reason than your quantum expiring.

Now it turns out that that's a heck of a lot harder than it sounds.  For example, it means that you can never do any synchronous file I/O (since that blocks your thread). It means that you can never ever use Win32 critical sections, or call WaitForSingleObject, since any of them might block.

It means that you need to take all the things I talked about in the last 7 posts on the series and throw most of them away, except for the first principle: If your data is never accessed on more than one thread, then you don't have to worry about concurrency.  The challenge is to get that to happen.   Because in most non trivial servers, you WILL have shared data structures - for instance, the index in your table in the database is a shared structure. 

Also, there are times that you ARE going to block before your quantum has expired (for example, when you need to bring a page from the database into cache) - but you don't want to let the OS pick the next thread to run, YOU want to pick the next thread to run.

I talked about one of the solutions in an earlier post: Fibers.  Fibers provide a lightweight mechanism for an application to write their own context switching API - that allows you to keep your thread "hot" at all times, avoiding the OS context switch mechanism.

One other major trick in the "don't block" toolkit are the Interlocked Win32 primitives.  The interlocked primitives use the CPUs locking mechanisms to guarantee cross-CPU core synchronization of data, which allows you to avoid acquiring a critical section.

Tomorrow, I'll spend a bit of time cataloging the different mechanisms for achieving scalability that Win32 provides.

Edit: Corrected typo in subject

Edit2: quanta->quantum, where appropriate.  Thanks Drew