Last time, I mentioned that even when you get everything right, you can STILL have scalability issues... To explain what's going on, a digression.
One of the major features of Exchange 2000 was the ability for the store to access multiple databases (before, the store could only access one database at a time).
My dev lead and I had a long series of discussions about whether it would be better to implement multiple databases by running multiple store.exe processes with one database loaded into each process, or to do the work to load multiple databases into a single store.exe process.
Our concern with the multiple process solution was that the context switch overhead of switching from one process to another would overwhelm the performance of the system (since more work is involved in switching from one process to another than is involved in switching from one thread to another).
To resolve the discussion, I wrote a small test application. It would spawn 100 threads doing some "work" (where "work" was defined as incrementing a counter as many times as was possible during a 5 minute period). It would either do the work by spawning 100 threads in one process or by spawning 10 processes with 10 threads in each process.
So in both cases, 100 threads would be created, the only difference would be how many processes would be used. Our belief was that the 10 process scenario would be slower than the 1 process (due to the address space switch issue).
Well, I ran the test, and discovered, much to my surprise the multithreaded version was significantly slower than the multiprocess version.
This was utterly counter-intuitive, so I asked the base team what was going on.
They asked to look at my code, and immediately spotted the problem.
You see, in order to avoid the CPU lock issues associated with InterlockedIncrement (because I knew it would mess the numbers up), I allocated an array with one DWORD for each thread. Each thread incremented its value (once again, using my first principle (hah, I got it right :))), and after the 5 minute period had expired, the main thread of the application would add up all the values in the array. For the multiprocess version, each child process wrote to a shared memory region which was again summed by the main thread.
At this point, the people who have worked on scalability are all smirking at my utterly boneheaded mistake.
You see, my mistake was to assume that there would be no side-effects associated with dedicating an entry in the array for each thread. I assumed that each thread could read and write to its dedicated slot with impunity.
And on modern processors, this simply isn't true.
You see, on modern computers, the system memory is significantly slower than the processor - while the CPU might be running at 3GHz, the memory bus might only be running at 1GHz or so... But computers spend a LOT of time accessing memory. If they had to spend all their time accessing system RAM, then the performance of the processor would be massively throttled.
To resolve this, the processors have several levels of cache. The first level (cleverly named L1) of cache physically sits on the chip. The level 1 cache is typically fairly small (for the Pentium Pro, it was 8K, for the Pentium M, it's 32K (for the P4, they've changed the architecture and replaced the L1 cache with a 16K trace cache)). The thing about L1 cache is that it's REALLY, REALLY fast - the L1 cache operates at full CPU speed. Usually, the L1 cache is per-cpu core, there are no cache coherency issues to deal with (principle 1 - if it's your data, you don't have to protect it from anyone else)
But the L1 cache is really intended for predictive branching, etc - it's really small and blindingly fast because it's about pulling instructions into the processors core. There's a need for a second, higher level cache. This higher level cache is cleverly named the level 2 cache, or L2.
Now, the L2 cache is MUCH bigger than the L1 cache - it's typically hundreds of kilobytes in size (the P4 extreme edition has 2M of L2 cache, for example). But the Level2 cache may be shared between multiple CPU cores (this depends on the processor configuration). As I mentioned before, if you have a resource that's shared, you need to have some kind of locks to protect that resource. And the CPU is no different from any other system.
Internally, the x86 cpu cache divides memory up into "cache lines", or cache blocks (the cache line size varies by CPU). When you perform an interlocked operation, it locks the cache line, writes the memory, and releases the cache line.
So in my little test application, what was really happening was that my application was thrashing the CPU cache - on the machine I was testing, the cache line was 32 bytes in size, so each block of 8 threads was contending for the same cache line, so they were effectively serialized by the processor. On the multi-process version, instead of having one 400 byte array (100 threads, 4 bytes/thread) occupying 13 cache lines, each process had a 40 byte array (10 threads, 4 bytes/thread) occupying two. In the single process version, there were 96 threads, each contending for 12 cache lines and 4 threads contending for the 13th. And in the multi-process version, there were 8 threads contending for one cache line, but only two threads contending for the other cache line (aggregated over the 10 processes, there were 80 threads contending for 10 cache lines, and 20 threads contending for 10 cache lines).
So if you look at this, the cache line thrashing was less on the multi-process version of the test (although it still existed).
When I rewrote the application to remove the cache line thrashing, the differences between the multi-thread and multi-process version of the test application totally disappeared - the multi-process version had exactly the same throughput as the multi-thread version.
Of course, this was a stupid example, with an unreasonable work load, it was totally context switch bounded (because I had way more threads than CPUs), etc. But fundamentally, the measurement paradigm was sound - since the aggregate number of threads in both cases was the same, the only difference would be the cost of switching address spaces, and that turned out to be negligible.
So, the next principle of concurrency: "If you're looking to concurrency to make your application scalable, you need to be really, really smart - there are bottlenecks in places you didn't think about".
Edit: megabytes->kilobytes, multithreaded->multiprocess in example.