Concurrency, Part 8 – Concurrency for scalability

Last time, I talked a bit about why you’d use concurrency in your application.  Today I’d like to talk a smidge about using concurrency to achieve scalability.

The first and most important thing that you have to understand when you decide to start improving the MP scalability of your application is that you are about to start on a long and arduous journey.  MP scalability is not a task for the faint of heart – as I mentioned in the first of these articles, the reality is that high end scalability is very similar to quantum physics.  Once you start trying to introduce MP scalability into your application, most the rules you thought you understood about programming  change.  As I mentioned earlier, there are only a relatively few people at Microsoft who really understand this problem space (I’m not one of them), so I can only talk in general principles about this (yay, I used the right principle! :)).

Another thing to realize is that the vast majority of applications don’t need to worry about scalability.  If your application is interacting with the user, for example, then worrying about scalability is almost always a total waste of time – the reality is that your application spends most of its time interacting with a user – and users are SLOW relative to computers.  Your scalability is generally limited to the speed of the user, and they are really slow – even the fastest users can only type 9-10 characters per second, and mice only generate hundreds of inputs per second.  This is not to say that concurrency isn’t an issue for your application – you might very well chose to use multiple threads, but those threads are typically used for offloading work from your main user interface processing thread.

When would you want to worry about scalability?  Well, if you’re writing a server of some kind, it’s going to be really important.  The number of users that is supported by your server is limited by the bottlenecks of your server.  And you need to understand where those bottlenecks exist before you can start worrying about scalability.  For example, an email server is typically disk bound (most people don’t realize this, but email is inherently a database processing problem, and as such it’s usually disk bound).  But as a disk bound application, you can solve that problem by throwing more disks at the problem.

I’m only going to talk about CPU bound scalability issues, since the other types of scalability bottlenecks are offtopic.  But in general, depending on the design of your application, you can have bottlenecks in almost every type of resource you consume – you can have bottlenecks in disk bandwidth, in virtual address space, in cpu resources, etc.

But if your application is CPU bound (for example, a database server that spends most of its time traversing indexes in database), then the obvious solution is to throw more CPUs at the problem.

And that’s when the problem gets strange.

The key to MP scalability the realization that a single processor core can only do one thing at a time.  And if your CPU is spending time doing anything that’s not directly related to your application, it’s hurting your scalability.

Why is this important?  Well, for two reasons.  The first is that it means that optimally, your application would only have one runnable thread per CPU core at anytime.  Any more threads than that means that the operating system is going to context switch between the threads.  And if the operating system’s context switching, then that means that it’s spending CPU cycles saving your state, finding the next runnable thread, restoring that threads state.  This is time that could be spent in your application performing its calculations, and instead, it’s wasted switching to another thread in your application.

The other reason has to do with your thread quanta.  The NT scheduler schedules threads based on a concept called a quanta – the minimal amount of time that a thread will run.  Typically a quantum ranges from 20ms to 120ms in length (it varies depending on the OS configuration).  The NT scheduler is fairly deterministic, the OS won’t context switch away from a thread unless one of two events occurs – first, if a higher priority thread becomes runnable, and second, if the thread relinquishes the CPU (by calling a system call that would block).  So, in general, if your thread doesn’t block, your thread is somewhat immune to context switches for the length of its quantum (this is a rough approximation, but works for this discussion).

So it’s critical that if you’re trying to manage your scalability, you need to ensure that you get the most of your CPU.  And that means that you don’t ever want to let the NT scheduler context switch away from your thread for any other reason than your quantum expiring.

Now it turns out that that’s a heck of a lot harder than it sounds.  For example, it means that you can never do any synchronous file I/O (since that blocks your thread). It means that you can never ever use Win32 critical sections, or call WaitForSingleObject, since any of them might block.

It means that you need to take all the things I talked about in the last 7 posts on the series and throw most of them away, except for the first principle: If your data is never accessed on more than one thread, then you don’t have to worry about concurrency.  The challenge is to get that to happen.   Because in most non trivial servers, you WILL have shared data structures – for instance, the index in your table in the database is a shared structure. 

Also, there are times that you ARE going to block before your quantum has expired (for example, when you need to bring a page from the database into cache) – but you don’t want to let the OS pick the next thread to run, YOU want to pick the next thread to run.

I talked about one of the solutions in an earlier post: Fibers.  Fibers provide a lightweight mechanism for an application to write their own context switching API – that allows you to keep your thread “hot” at all times, avoiding the OS context switch mechanism.

One other major trick in the “don’t block” toolkit are the Interlocked Win32 primitives.  The interlocked primitives use the CPUs locking mechanisms to guarantee cross-CPU core synchronization of data, which allows you to avoid acquiring a critical section.

Tomorrow, I’ll spend a bit of time cataloging the different mechanisms for achieving scalability that Win32 provides.

Edit: Corrected typo in subject

Edit2: quanta->quantum, where appropriate.  Thanks Drew

Comments (16)

  1. Ricky Datta says:

    If you give us an example of IO competion usage (File, sockets) using C# managed code – that will be great.


  2. Ricky, are the MSDN overlapped I/O and I/O completion port examples not sufficient? I could write something up but I’d think that there was enough sample code for that already on the web.

    Or do you have something more specific in mind?

  3. Ricky Datta says:

    IO Completion ports are kind mystery..exotic to ordinary Win32 programmer. On the other hand, developer education of completion ports are essential for scalable windows server development.

    CLR FX 2.0 implements IO completion ports, which is a high productivity devlopment environment. There is very little doc/samples/established examples of how to use the following :

    AcceptEx (CLR version)

    ConnectEx (CLR version)

    TransmitFile (CLR version)

    TransmitPackets (CLR version)

    Asynchronus File/Socket operations

    In my opinion asynchronous server techniques using IO completion are usually found in MS products, never in 3rd party apps.

    If you could write about these, that would be great.


  4. Jerry Pisk says:

    You don’t need to worry about synchronization if the data you share is read only. But it has to be read only for the lifetime of your threads. An example would be a B-tree index of your static data (such as resource strings) that is only built once, at your application startup, before you fire off your worker threads.

  5. Jerry, that’s true. But that relies on the fact that the data is static. While this is likely to be true for some data sets (census data, for example) for others (Amazon.Com’s book list) it’s not.

  6. Mike Dimmick says:

    Ricky, I seem to recall there’s a pretty good treatment of I/O Completion Ports – and sample code – in "Programming Server-Side Applications for Microsoft Windows" by Jeffrey Richter and Jason Clark. Unfortunately it’s out of print – why, MS Press? This is an essential book! I note that Jeff’s other Win32 book "Programming Applications for Microsoft Windows" is also out of print.

    As for the .NET Framework 2.0, it’s still in beta and a lot of the documentation is unfinished. Hopefully Beta 2 will be a lot better in that regard.

  7. Eric Lippert says:

    Good series Larry!

    > But if your application is CPU bound then the obvious solution is to throw more CPUs at the problem.

    That’s one obvious "solution" that turns out to not always work. Another obvious "solution" is to performance tune the application so that each operation is faster, AND throw more CPUs at the problem.

    Counter-intuitively, making the processor do less work on a given thread can actually DECREASE your performance, or at least wreck scalability. Scalability is really, really hard to wrap your mind around.

    I wrote up an analogy that shows how speeding can slow you down here:

  8. Drew says:


    The singular is quantum. Quanta is plural. Because a slice of CPU time is neuter although I have no idea why.


    I’m also really enjoying this series. And not just because of the irony inherent in a series on parallelism. I’m eagerly awaiting tomorrow’s installment.

  9. SeanH says:

    Jerry, Larry mentioned writing scalable concurrent applications need a paradigm-shift in terms of how software developers think about writing applications, and I think one of these shifts is the concept of shared data and how we access it.

    I think writing scalable concurrent applications will mean having minimal access to the shared data. My guess is, you will have one huge container with a well-staged way of modify contents of that container through some sort of message passing mechanism where data to be modified is deep copied and wrapped around a message that is sent to a worker thread and the modifications sent back through another message. The container will then be in charge of determining which modifications will be committed.

  10. Drew, you’re 100% right. But every discussion I’ve ever had with the kernel group refers to the unit of scheduling as a quanta, not quantum. I don’t get it, but…

  11. Mike Flasko says:

    Larry, I agree with Ricky, a series on IO Completion ports would be great.

    As for multithreading I typically use the following train of thought when developing:

    1. identify tasks suitable to a secondary thread

    2. identify shared data (among threads)

    3. of that shared data which of the data is writable (aka. find the race conditions)

    4. provide mutual exclusion for the variables in step 3.

    ** 5. If necessary, ensure my critical sections do not lead to thread starvation.

  12. MartinK says:

    This series gets better and better. Two things that I’d welcome a bit of expansion of:

    1. "Avoid synchronous I/O because it blocks". Out of interest, is this often true of WriteFile? Reading Inside Windows 2000, it looks as if WriteFile practically never writes a file, just schedules a lazy write and returns at once… in which case safe to use, surely?

    2. Does the deep scheduling stuff inside Windows work? I’m not being offensive, it’s just that we all know that concurrent code is virtually impossible to debug, and I don’t want to spend 24 hours chasing an evanescent bug in my own code and then discover that it is one of those "edge cases in ntdll.dll that were never thought through properly" [Rob Earhart]. This may be why it’s mostly MS that use their own completion ports and so on… the internal company "lore" will tell them which documented features have been well used and which ones haven’t been used enough to be safe to rely on.

    3. [ok, so I can’t count!] A server app can balance the work it does for its clients in all sorts of ways, intelligently allocating CPU, memory, disk I/O… but the one thing it seems impossible to balance is bandwidth back to the clients. Example: my protocol splits big messages into a stream of smaller packets, the idea being that a small message can insert itself into the stream between those packets – helps responsiveness at the client end. Also there is then some hope of bandwidth allocation so that a client who’s just requested 10MB won’t lock out lots of other clients who only want 1KB each to get on with their work.

    But – Windows provides massive buffers for socket output and happily hoovers up huge quantities of data that it then queues. Good for raw performance, but it makes for hellish latency [the insert-a-quick-little-packet solution won’t work when there are already megabytes in the queue "owned" by Windows] and for no hope of balancing the bandwidth requirements of multiple clients.

    I’m probably missing something here – one little API or callback or port or something – so any pointers would be welcome.

  13. Martin, for #1: If you’re doing a synchronous write, the I/O subsystem takes a lock around each file object – which might block your thread if someone else is accessing the file object. The fast I/O path won’t typically take many locks, you’re right. But you’re not guaranteed you’ll take the fast I/O path.

    #2: The NT scheduler’s not bad, w.r.t. scalability, and it’s getting better all the time. Earhart reads this blog (at least he comments on here :)), he may be able to answer better than I.

    #3: Let me look into that one, I’m not sure that there are solutions for rate metered delivery on the server side (other than setting the socket buffer size to something small)

  14. Skywing says:

    Synchronous WriteFile will block after you’ve written more data than the cache manager will allocate for a particular file. You should not rely on it completing instantly, particularly when the system is under load.

    I use I/O completion ports in my network server apps and I’ve not yet run into any problems with them.

  15. MartinK, think I can help you with #3:

    Winsock uses an algorithm named Nagel which causes this buffering. If you enable TCP_NODELAY (through setsockopt or any equivalent) then Winsock will send packages straight away.

    Also you can change the send and receive buffers through setsockopt.

    For more/better information see

  16. MartinK says:

    Andreas, thanks. I’m not worried about small-packet latency. It’s the opposite problem: if Windows lets me output 1MB of data to a socket that’s connected to a dial-up client, then anything I send immediately afterwards will take at least 150 seconds to arrive. Having smaller buffers would indeed be the answer, so that my application would be able to make intelligent decisions about which packets of data to send when.

    Unfortunately there’s more buffering inside Windows than just the socket buffers. I’ve run a test: getsockopt() reports a send buffer size of 8192 bytes, but a call to send() that asks to send 1000000 bytes returns immediately, reporting that it’s sent 1000000 bytes, not 8192.