Concurrency, part 11 - Hidden scalability issues

So you're writing a server.  You've done your research, and you've designed your system to be as scalable as you possibly can.

All your linked lists are interlocked lists, your app uses only one thread per CPU core, you're using fibers to manage your scheduling so that you make full use of your quanta, you've set each thread's processor affinity so that it's locked to a single CPU core, etc.

So you're done, right?

Well, no.  The odds are pretty good that you've STILL got concurrency issues.  But they were hidden from you because the concurrency issues aren't in your application, they're elsewhere in the system.

This is what makes programming for scalability SO darned hard.

So here are some of the common places where scalability issues hide.

The biggest one (from my standpoint, although the relevant people on the base team get on my case whenever I mention it) is the NT heap manager.  When you create a heap with HeapCreate, unless you specify the HEAP_NO_SERIALIZE flag, the heap will have a critical section associated with it (and the process heap is a serialized heap).
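
As a rough sketch (error handling kept to a minimum), here's what creating an unserialized private heap looks like; the catch, of course, is that without the critical section you have to guarantee that only one thread ever touches that heap:

```c
#include <windows.h>

// Sketch: a heap created with HEAP_NO_SERIALIZE has no critical section,
// so HeapAlloc/HeapFree on it don't serialize - but the caller must
// guarantee that only one thread at a time ever uses the heap.
void UnserializedHeapExample(void)
{
    HANDLE heap = HeapCreate(HEAP_NO_SERIALIZE,
                             0,     // default initial size
                             0);    // 0 => heap can grow
    if (heap == NULL) {
        return;
    }

    void *block = HeapAlloc(heap, 0, 256);
    if (block != NULL) {
        // ... use the block on this thread only ...
        HeapFree(heap, 0, block);
    }
    HeapDestroy(heap);
}
```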

What this means is that every time you call LocalAlloc() (or HeapAlloc, or HeapFree, or any other heap APIs), you're entering a critical section.  If your application performs a large number of allocations, then you're going to be acquiring and releasing this critical section a LOT.  It turns out that this single critical section can quickly become the hottest critical section in your process.   And the consequences of this can be absolutely huge.  When I accidentally checked in a change to the Exchange store's heap manager that reduced the number of heaps used by the Exchange store from 5 to 1, the overall performance of the store dropped by 15%.  That 15% reduction in performance was directly caused by serialization on the heap critical section.
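
For what it's worth, here's a hypothetical sketch of the kind of heap partitioning the store was doing. The thread-id hash and the function names are mine, not the store's actual scheme, but the idea is the same: spread allocations across several heaps so that concurrent allocators aren't all fighting over a single critical section.

```c
#include <windows.h>

#define HEAP_COUNT 5
static HANDLE g_heaps[HEAP_COUNT];

// Create the partitioned heaps once at startup.
BOOL InitPartitionedHeaps(void)
{
    for (int i = 0; i < HEAP_COUNT; i++) {
        // Each heap is still serialized, but the contention is split 5 ways.
        g_heaps[i] = HeapCreate(0, 0, 0);
        if (g_heaps[i] == NULL) {
            return FALSE;
        }
    }
    return TRUE;
}

// Pick a heap based on the calling thread so that concurrent allocators
// usually hit different critical sections.  Note that a block must be freed
// back to the heap it came from, so real code has to remember which heap
// that was (or always allocate and free on the same thread).
void *PartitionedAlloc(SIZE_T size)
{
    HANDLE heap = g_heaps[GetCurrentThreadId() % HEAP_COUNT];
    return HeapAlloc(heap, 0, size);
}
```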

The good news is that the base team knows that this is a big deal, and they've done a huge amount of work to reduce contention on the heap.  For Windows Server 2003, the base team added support for the "low fragmentation heap", which can be enabled with a call to HeapSetInformation.  One of the benefits of switching to the low fragmentation heap (along with the obvious benefit of reducing heap fragmentation) is that the LFH is significantly more scalable than the base heap.
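
Enabling it is essentially a one-liner; this sketch opts the process heap into the LFH (the value 2 is the documented HeapCompatibilityInformation value for the low fragmentation heap):

```c
#define _WIN32_WINNT 0x0501   // HeapSetInformation requires XP/Server 2003 or later
#include <windows.h>

// Sketch: opt the process heap into the low fragmentation heap.
void EnableLowFragmentationHeap(void)
{
    ULONG heapCompatibility = 2;    // 2 == low fragmentation heap
    HeapSetInformation(GetProcessHeap(),
                       HeapCompatibilityInformation,
                       &heapCompatibility,
                       sizeof(heapCompatibility));
}
```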

And there are other sources of contention that can occur below your application.  In fact, many of the base system services have internal locks and synchronization structures that could cause your application to block - for instance, if you didn't open your file handles for overlapped I/O, then the I/O subsystem acquires an auto-reset event across all file operations on the file.  This is done entirely under the covers, but can potentially cause scalability issues.
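
To illustrate, here's roughly what opening a handle for overlapped I/O looks like (the filename and buffer size are made up for the example). With FILE_FLAG_OVERLAPPED, the I/O subsystem doesn't synchronize operations on the file handle for you; instead, you supply your own OVERLAPPED structure (and event) for each operation:

```c
#include <windows.h>

// Sketch: open a file for overlapped I/O and issue an asynchronous read.
void OverlappedReadExample(void)
{
    HANDLE file = CreateFileW(L"data.bin",          // hypothetical filename
                              GENERIC_READ,
                              FILE_SHARE_READ,
                              NULL,
                              OPEN_EXISTING,
                              FILE_FLAG_OVERLAPPED, // no implicit per-file synchronization
                              NULL);
    if (file == INVALID_HANDLE_VALUE) {
        return;
    }

    OVERLAPPED ov = {0};
    ov.hEvent = CreateEventW(NULL, TRUE, FALSE, NULL);  // manual-reset event for completion

    BYTE buffer[4096];
    if (!ReadFile(file, buffer, sizeof(buffer), NULL, &ov) &&
        GetLastError() == ERROR_IO_PENDING) {
        DWORD bytesRead;
        // Wait for the read to complete (TRUE => block until done).
        GetOverlappedResult(file, &ov, &bytesRead, TRUE);
    }

    CloseHandle(ov.hEvent);
    CloseHandle(file);
}
```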

And there are scalability issues that come from physics as well.  For example, yesterday, Jeff Parker asked about ripping CDs in Windows Media Player.  It turns out that there's no point in dedicating more than one thread to reading data from the CD, because the CDROM drive has only one head - it can't read from two locations simultaneously (and on CDROM drives, head motion is particularly expensive).  The same laws of physics hold true for all physical media - I touched on this in the answers to the "What's wrong with this code, part 9" post - you can't speed up hard disk copies by throwing more threads or overlapped I/O at the problem, because file copy speed is ultimately limited by the physical speed of the underlying media, and with only one spindle, the drive can only service one operation at a time.

But even if you've identified all the bottlenecks in your application, and added disks to ensure that your I/O is as fast as possible, there STILL may be bottlenecks that you've not yet seen.

Next time, I'll talk about those bottlenecks...