Concurrency, part 12 – Hidden scalability issues, part 2


Last time, I mentioned that even when you get everything right, you can STILL have scalability issues…  To explain what’s going on, a digression.

One of the major features of Exchange 2000 was the ability for the store to access multiple databases (before, the store could only access one database at a time).

My dev lead and I had a long series of discussions about whether it would be better to implement multiple databases by running multiple store.exe processes with one database loaded into each process, or to do the work to load multiple databases into a single store.exe process.

Our concern with the multiple process solution was that the context switch overhead of switching from one process to another would overwhelm the performance of the system (since more work is involved in switching from one process to another than is involved in switching from one thread to another).

To resolve the discussion, I wrote a small test application.  It would run 100 threads doing some “work” (where “work” was defined as incrementing a counter as many times as possible during a 5 minute period).  It would either do the work by spawning 100 threads in one process or by spawning 10 processes with 10 threads in each process.

So in both cases, 100 threads would be created, the only difference would be how many processes would be used.  Our belief was that the 10 process scenario would be slower than the 1 process (due to the address space switch issue).

Well, I ran the test, and discovered, much to my surprise, that the multithreaded version was significantly slower than the multiprocess version.

This was utterly counter-intuitive, so I asked the base team what was going on.

They asked to look at my code, and immediately spotted the problem.

You see, in order to avoid the CPU lock issues associated with InterlockedIncrement (because I knew it would mess the numbers up), I allocated an array with one DWORD for each thread.  Each thread incremented its own value (once again, using my first principle (hah, I got it right :))), and after the 5 minute period had expired, the main thread of the application would add up all the values in the array.  For the multiprocess version, each child process wrote to a shared memory region, which was again summed by the main thread.

At this point, the people who have worked on scalability are all smirking at my utterly boneheaded mistake.

You see, my mistake was to assume that there would be no side-effects associated with dedicating an entry in the array for each thread.  I assumed that each thread could read and write to its dedicated slot with impunity.

And on modern processors, this simply isn’t true. 

You see, on modern computers, the system memory is significantly slower than the processor – while the CPU might be running at 3GHz, the memory bus might only be running at 1GHz or so… But computers spend a LOT of time accessing memory.  If they had to spend all their time accessing system RAM, then the performance of the processor would be massively throttled. 

To resolve this, the processors have several levels of cache.  The first level (cleverly named L1) of cache physically sits on the chip.  The L1 cache is typically fairly small (for the Pentium Pro, it was 8K; for the Pentium M, it’s 32K; for the P4, they’ve changed the architecture and replaced the L1 instruction cache with a 16K trace cache).  The thing about the L1 cache is that it’s REALLY, REALLY fast – it operates at full CPU speed.  And since the L1 cache is usually per CPU core, there are no cache coherency issues to deal with (principle 1 – if it’s your data, you don’t have to protect it from anyone else).

But the L1 cache is really intended for predictive branching, etc. – it’s small and blindingly fast because its job is pulling instructions into the processor’s core.  There’s a need for a second, higher level cache.  This higher level cache is cleverly named the level 2 cache, or L2.

Now, the L2 cache is MUCH bigger than the L1 cache – it’s typically hundreds of kilobytes in size (the P4 Extreme Edition has 2M of L2 cache, for example).  But the L2 cache may be shared between multiple CPU cores (this depends on the processor configuration).  As I mentioned before, if you have a resource that’s shared, you need some kind of lock to protect that resource.  And the CPU is no different from any other system.

Internally, the x86 CPU cache divides memory up into “cache lines”, or cache blocks (the cache line size varies by CPU).  When you perform an interlocked operation, the processor locks the cache line, writes the memory, and releases the cache line.

So in my little test application, what was really happening was that my application was thrashing the CPU cache.  On the machine I was testing on, the cache line was 32 bytes in size, so each block of 8 threads (8 threads × 4 bytes/counter = 32 bytes) was contending for the same cache line, and those threads were effectively serialized by the processor.  In the single process version, one 400 byte array (100 threads, 4 bytes/thread) occupied 13 cache lines: 96 threads contended in groups of 8 for the first 12 cache lines, and 4 threads contended for the 13th.  In the multi-process version, each process had a 40 byte array (10 threads, 4 bytes/thread) occupying two cache lines: 8 threads contended for the first cache line, but only 2 threads contended for the second (aggregated over the 10 processes, 80 threads contended for 10 cache lines, and 20 threads contended for the other 10).

So if you look at this, the cache line thrashing was less on the multi-process version  of the test (although it still existed).

When I rewrote the application to remove the cache line thrashing, the differences between the multi-thread and multi-process version of the test application totally disappeared – the multi-process version had exactly the same throughput as the multi-thread version.

Of course, this was a stupid example with an unreasonable workload; it was totally context-switch bound (because I had way more threads than CPUs), etc.  But fundamentally, the measurement paradigm was sound – since the aggregate number of threads was the same in both cases, the only difference would be the cost of switching address spaces, and that turned out to be negligible.

So, the next principle of concurrency: “If you’re looking to concurrency to make your application scalable, you need to be really, really smart – there are bottlenecks in places you didn’t think about.”

Edit: megabytes->kilobytes, multithreaded->multiprocess in example.


Comments (26)

  1. Anonymous says:

    I believe the trick in concurrency is to design your application (or problem =) to not have more threads in a runnable state than you have CPUs to schedule them on. This doesn’t mean you can only have 2 threads on a 2 CPU machine; it merely means you should try to scale the application so that no more than 2 threads will contend for a CPU (on a 2 CPU machine). You can have an application with 100 threads, but during a given period these threads will only want to run 2/100 of the period time (on a 2 CPU machine).

    But there really is no "safe" way to avoid all scalability problems. Often a better way is to try to be less smart and use profilers to pinpoint the problems and redesign by a need basis instead of a coolness or a theoretical sound basis.

    In the end it actually has to work on some kind of hardware after all =)

  2. Anonymous says:

    Nice story 🙂

    But this made me wonder:

    "Now, the L2 cache is MUCH bigger than the L1 cache – it’s typically hundreds of megabytes in size (the P4 extreme edition has 2M of L2 cache, for example)."

    Hundreds of megabytes of L2? I don’t think we’re there yet. Even the Itanium 2 has only 256KB of L2, and 9MB of L3 🙂 I read the next version, codenamed Montecito, will have 2MB of L2 cache and 24MB of L3 cache. But this is split between two cores, so effectively each core gets 1MB/12MB. There’s even some technology there, "Pellton", for handling cache errors. See http://www.theinquirer.net/?article=20059 for some more details, in The Inquirer’s always nice comments 😛 (I think I’ve never read an article by The Inq about the Itanium so nicely worded. They didn’t even call it Itanic here.)

    One of the mantras they taught us in an OS course last session was: Cache is IMPORTANT, be careful and don’t mess it up!

  3. Anonymous says:

    Interesting story, but it was confusing because I think you stated the actual results backwards. You said you expected the multiprocess version to be slower, and then you said you were surprised that it was slower. I think you meant you were surprised that the *single* process version was slower.

  4. Anonymous says:

    Nice story, Mr. Osterman. I’ve been fooled by the compiler or by CPU architecture a number of times as well. Btw, I think you meant to say that the multiprocess version initially ran faster, not slower. Also, L2 cache is not typically hundreds of megabytes in size, at least not yet 😉

  5. Anonymous says:

    Yay, I get to be an editor ;):

    >>Well, I ran the test, and discovered, much to my surprise the multiprocess version was significantly slower than the multithreaded version.

    I think you mean the multithreaded version was significantly slower, right?

    >>Now, the L2 cache is MUCH bigger than the L1 cache – it’s typically hundreds of megabytes in size (the P4 extreme edition has 2M of L2 cache, for example)

    I think there you may have meant hundreds of kilobytes?

  6. Anonymous says:

    Larry says:

    Well, I ran the test, and discovered, much to my surprise the multiprocess

    version was significantly slower than the multithreaded version.

    —-

    Is that what you meant to say? Does that contradict:

    So if you look at this, the cache line thrashing was less on the multi-process version of the test (although it still existed).

  7. Anonymous says:

    So maybe I misread this but the two following statements seem wrong:

    > Well, I ran the test, and discovered, much

    > to my surprise the multiprocess version was

    > significantly slower than the multithreaded

    > version.

    This seems to be exactly what you expected. Did you mean the opposite?

    > Now, the L2 cache is MUCH bigger than the L1

    > cache – it’s typically hundreds of megabytes

    > in size (the P4 extreme edition has 2M of L2

    > cache, for example).

    Did you mean hundreds of kilobytes?

  8. Anonymous says:

    I don’t quite get everything here:

    > Our belief was that the 10 process scenario would be slower than the 1 process (due to the address space switch issue).

    > Well, I ran the test, and discovered, much to my surprise the multiprocess version was significantly slower than the multithreaded version.

    So your suspicions were confirmed?

    > Now, the L2 cache is MUCH bigger than the L1 cache – it’s typically hundreds of megabytes in size (the P4 extreme edition has 2M of L2 cache, for example).

    Hundreds? Really?

  9. Anonymous says:

    Anonymous: You’re right, I got it backwards, fixing it.

  10. Anonymous says:

    Haha, caught by the comment latency. Anyway, please keep up the good work, mr. Osterman. I enjoy reading your blog.

  11. Anonymous says:

    > I believe the trick in concurrency

    > is to design your application (or

    > problem =) to not have more threads

    > in runnable state than you have

    > CPUs to schedule on. This doesn’t

    > mean you can only have 2 threads on

    > 2 CPU machine, it merely means you

    > should try to scale the application

    > so that no more than 2 threads will

    > contend for a CPU

    That’s not really possible with most server applications. Usually you have some sort of infrastructure running that can handle requests and pass them off to the business logic.

    Typically, user experience when hitting a server is measured in response time (stability is equally important, but we can ignore it for our purposes). If you limit yourself to two threads in a runnable state, you’re effectively limiting yourself to two incoming connections at a time. If one user is on a slow connection and one is on a fast connection, or if one request takes a long time, the other users will not get a low response time.

    That’s not to say every connection should be a thread. I’ve worked on architectures that made that decision and it wasn’t pretty. As soon as you start seeing 1500+ concurrent connections, your machine is grinding to a halt.

    What people are doing now is combining a thread pool of say, ten or twenty threads that can accept connections, unpack the request and add the request to a queue. Then, you can have another five or ten threads that are processing these requests and sending the response back to the client.

    This way, you can effectively control how much work your hardware is attempting to do at any one point in time. Sure, the queue might get very very large, but there’s a better chance it will all work.

    I haven’t touched the ACE framework in a few years, but it’s pretty good:

    http://www.cs.wustl.edu/~schmidt/ACE.html

    The authors wrote some books on patterns for scalable systems that you can use.

  12. Anonymous says:

    Hey, how come all those concurrently submitted edits got serialized?

    Meanwhile,

    3/7/2005 2:09 PM Anonymous Coward

    > If you limit yourself to two threads in a

    > runnable state, you’re effectively limiting

    > yourself to two incoming connections at a

    > time.

    No, you’re effectively limiting yourself to actively serving two incoming action requests at the same time. A bunch of idle threads waiting for input from their connected users are no problem.

    Since two CPU cores essentially limit you to actively serving two action requests at the same time anyway, scheduling is the issue of figuring out how many threads you want to rotate your CPUs among, and what kind of rotation.

  13. Anonymous says:

    Anonymous Coward :

    As Norman said, you are not limiting yourself to only two connections at a time. However, I do agree that if your server doesn’t have to be geared towards top performance, then going down the thread pool road will simplify the application considerably, and that in itself is worth the trade off.

    A performance geared server can be next to impossible to understand, but it will be more effective if you have a huge amount of work tasks.

    I myself prefer the thread pool version, as usually the server isn’t CPU bound; it is many times stuck waiting/contending for a resource (disk, NIC, etc.). By using a pool it is easy to understand resource contention, and it is also easy to configure/tweak the pool so that minimal context switching will occur. Also, a thread pool version will often keep the code in the ICache.

    Chances are that the thread pool version will be faster because you can easily understand what you are doing and thus get it right.

    I myself am not much for performance; I generally write the code to work and then simply optimize the crucial parts. (That doesn’t mean you shouldn’t be using the correct data structures and algorithms at all times.) If you can keep most parts of your code simple, maintenance will get easier, and chances are that the parts you optimized heavily will be in the framework part and thus not likely to change or have bugs.

  14. Anonymous says:

    One big architectural question that lies behind a lot of this concurrency stuff is the choice between aristocracy and democracy: I wonder if you could comment on that?

    An aristocratic system is one where one thread is in charge and hands off subordinate tasks to very stupid subordinate threads – flushing out LRU buffers, buffering output, managing connections, lighting the fires – so that the top thread only has to concentrate on Real Worthwhile Work.

    The democratic model is where each thread has to do its own washing up and hoovering, thus reducing the amount of time that it can use for real work. You cope just by having more of those threads.

    I used to be democratic – just write your app, then make it scalable by bunging in a few locks – but I’m beginning to think that the aristocratic model makes more sense: start with moderate doses of asynchronous I/O and work upwards. Partly because you are thinking clearly about what needs to be offloaded and how its timing works and whether it’s worth the effort and the risk… and partly because, since you’re introducing concurrency in clear, defined doses, you have something that is much more clearly debuggable on paper and requires less holding of breath and crossing of fingers when you deploy it.

    The only remaining case against the aristocratic model is that there might not be enough for the servants to do – in other words, the main thread might be doing so much more than 50% of the work that the other CPU core will spend most of its time playing cards in the kitchen and drinking the port.

    I’m wondering if this is a useful dichotomy in practice and which side of the argument you’re likely to come down on.

  15. Anonymous says:

    > No, you’re effectively limiting

    > yourself to actively serving two

    > incoming action requests at the same

    > time.

    What if one of the incoming requests takes a long time to read (connection problems, slow connection, etc.)? Would that create a bottleneck?

  16. Anonymous says:

    I don’t understand – when you rewrote your app to remove the cache thrash, your multithreaded app must have performed faster due to the address space switching between processes. How come they had the same perf?

  17. Anonymous says:

    Sriram,

    I’m SO happy you asked that question :).

    You’re right – we were totally surprised by the result too, because we’d assumed that the time to context switch between processes was significant.

    But we were wrong. We’d made the exact same mistake that the MCIS dev lead made in my previous post (http://blogs.msdn.com/larryosterman/archive/2004/05/03/125198.aspx) – we’d assumed that the overhead of switching between processes would be measurable. And it’s not.

  18. Anonymous says:

    Even if we have 2, 4 or more processors, the number of threads running will always be greater than that, and the OS must give the impression of running various processes at the same time. I assume that the OS schedules on a thread basis and not on a process basis; even if it tries to schedule various threads of the same process in succession, that cannot be guaranteed. This becomes especially true when a process has a lot of threads.

    Maybe this is the reason there is no measurable difference on address space switch, and we cannot forget that an address space switch is a big penalty because not only must the registers be switched, but the processor’s TLB buffers must also be flushed (invalidated).

    So when designing a multithreaded app, having a lot of threads or very few depends on the situation: a web server would require a thread for each client, so having lots of threads makes sense performance-wise; but if it’s not a web style app, then having few threads and keeping them occupied as much as possible will be much better performance-wise.

    About the processor cache, check out this web page: http://www.agner.org/assem/

    It’s a must read for any C/C++ dev; it has tons of info about cache, pipelines, branch misprediction, etc… and it’s up to date – it covers the P1 through the P4.

  19. Anonymous says:

    In general, you’re right, it’s also possible that my experiment was sufficiently flawed (spawning 100 CPU bound threads on a 2 CPU machine means that the context switches overwhelm most everything) that the TLB flushing was unmeasurable.

    But I believe it’s also the fact that on modern machines, the TLB flush isn’t as significant as it appears – I believe that it’s another example of assuming a perf hit before measuring one.

    Thanks for the http://www.agner.org link, though.

  20. Anonymous says:

    Did you run the test program on a Pentium 4? That should be the worst case scenario due to its "amazingly" deep pipeline 🙂

  21. Anonymous says:

    Andreas, this was in 1998.

    The principles are still valid (cache line thrashing can still bite you badly (actually, worse than before)) but the situation’s very different nowadays.

  22. Anonymous says:

    > the P4 extreme edition has 2M of L2 cache

    Yes, we benefit from Intel caching in their chips.

    3/8/2005 7:43 AM Anonymous Coward

    >> No, you’re effectively limiting yourself to

    >> actively serving two incoming action

    >> requests at the same time.

    >

    > What if one of the incoming requests takes a

    > long time to read (connection problems, slow

    > connection, etc.)? Would that create a

    > bottleneck?

    A bottleneck for that user, yes, but for the rest of the system and users, no (usually). The CPUs can still be busy serving other users.

    I had to say "usually" because there are cases such as hardware problems (an entire system can be slowed down when trying to read a scratched CD for example) and software problems (even if just one CPU serving one thread hits a blue-screening bug).

    3/8/2005 2:06 PM nugget

    > the OS must give the impression of running

    > various processes at the same time

    Only in socialistic designs ^_^ In aristocratic designs it is possible for programmers or administrators to specify otherwise, but NT-based systems and their predecessor don’t have enough priority levels to do a really good job of that. In democratic designs the threads would all vote on it and they’d spend all their time waiting for each others’ locks to get released ^_^

  23. Anonymous says:

    Along the lines of hidden scalability issues, I’ve often wondered how kernel mode events affect user mode threads.

    For example, suppose you have carefully constructed your code to avoid blocking etc., but someone comes along and moves the mouse vigorously. Is the kernel mode time that is dedicated to handling the mouse interrupts counted against your timeslice, thereby unfairly limiting your thread’s timeslice? Or is this time accounted for separately?

    Granted the mouse is a very simplistic example. Probably only generating a few hundred events per second. But, other devices can generate far more interrupts. And perhaps in some cases they may be cyclic in nature.

    From a systems point of view, I guess the OS could take the position that on average the interrupts will be distributed fairly among all threads/processes. But I’ve often wondered whether this was something that needed to be considered in high performance applications.

  24. Anonymous says:

    Joseph, that’s actually a huge issue, and one of ongoing work on the windows team.

    The problem isn’t so much interrupts but DPCs (which are the rough equivalent of interrupts). They can take a long time, and they totally suspend the running thread while they’re executing.
