Concurrency, part one.

Last week, I mentioned I was going to start a series of articles on concurrency; here goes.  I do feel a need to give a caveat: I’m sort-of winging it here – I know where I want to go on this one, but I’m not sure how I’m going to get there 🙂  Please bear with me, because this is a relatively complicated topic.  Also remember that I’m trying to start this from the ground up – the reality is that a huge number of the issues that show up can’t be addressed unless you’ve covered the fundamentals.

The interesting thing about concurrent programming is that it can be as nightmarishly difficult as you want to make it, or as easy as you want to make it.  The reality is that the further you push the envelope of concurrent programming, the harder it gets.

On the other hand, with a few relatively simple steps, you can achieve a “reasonable” level of concurrency in your application without too much pain.

At this point, the people who really get high performance, highly scalable computing are cringing.  And they’re right to cringe: high performance, scalable concurrent computing is REALLY hard.  There are only a couple of dozen people at Microsoft who know enough to do it and get it right consistently (and I’m absolutely NOT among them).  Achieving high performance concurrent computing (the kind of thing that is needed to get your server application scaling evenly to more than 10 or so CPU cores) is HARD.  It literally ranks among the hardest things we know how to do in computing.  There are reams and reams of PhD theses that have been written about how to do it by people way smarter than I am.

So you can think of this series as being a series about “good enough” concurrent programming – it’s not designed for high-end systems, it’s about making the stuff that runs day-to-day work well enough.

So, with all that said, on with the show.


First off, what IS concurrent programming?  It’s a big phrase, and it’s not clear what’s meant by it.  In a nutshell, concurrent programming is about trying to make your application more scalable by making your application do more things at once.  As I mentioned in my post about Herb Sutter’s Free Lunch article on CPU speed limitations in Dr. Dobbs, concurrency is going to be more and more important in computing in the future.

How can your application do more than one thing at once?  Well, the reality is that if your computer has only one CPU core, then it can’t.  But most machines sold these days come with more than one CPU core (heck, the $700 machine I bought my daughter for Christmas came with an HT P4 CPU).  So the reality is that MP machines are more mainstream than you might think.  And they’re going to become more and more mainstream as time goes on.

But even with that, your machine can only do as many things at once as it has CPU cores.  The key thing to realize is that even though your machine can only do one thing at a time, that doesn’t mean it’s actually DOING something most of the time – in fact, the vast majority of the time, your CPU is sitting idle, halted by the operating system until there’s work for it to do.  Even if your application does a lot of stuff, it’s STILL idle much of the time – every time you read from a file, you block your process waiting on the disk to return your file’s data; every time you allocate virtual memory, you potentially wait until the memory management system flushes dirty pages to disk and maps those pages into your address space.  And all the time that your application is idle, you could potentially be using your CPU for other things.  And if you do have a machine with more than one core, then your machine CAN do more than one thing at once.

Since your machine is idle most of the time, it’s useful to identify ways of taking advantage of that idle time – if you can do stuff in the time your app is blocked, that’s stuff that doesn’t have to slow down the main thread of your application.  Simple examples of this are the spelling check (or background printing) in a word processing application – checking the spelling of words in a document (or formatting the document for printing) doesn’t need to be done in the main thread, so it can be offloaded onto a worker thread that performs the actual work.

In Win32, there are two major systems for implementing concurrent programming, Threads and Fibers.  I won’t touch on Fibers, because I don’t have enough experience with Fiber programming to write authoritatively on the issues of programming with Fibers, but I can talk a bit about concurrent programming with threads.  And there’s a wealth of content related to concurrent programming with threads so…

Ok, so that’s what concurrent programming is, why does everyone keep saying that it’s so hard?

The big deal with concurrent programming is that it introduces a huge set of issues that you don’t normally see.  First off, you need to ensure that your application can actually take advantage of concurrency – I can’t count the number of times that someone’s naively added multiple threads to their application and made their application (or system) slower.  The other problem is that multithreaded applications introduce a class of failures that aren’t seen in single threaded applications.

The basic problem is that many (most) operations are so-called read/alter/write operations.  For example, most (if not all) arithmetic operations in C are R/A/W operations.  In C, b = b + c reads the value of b from memory, reads the value of c from memory, adds those two values, and then writes the result back to memory at location b.  If multiple threads can access any of those memory locations while that sequence is in flight, the results are, as they say, “undetermined”.  The language doesn’t help you, because neither the C nor the C++ language makes any guarantees about the behavior of operations in the presence of multiple threads.

So if you’re not really careful, you can get truly strange and interesting data corruption issues – even simple operations like incrementing a variable can misbehave in strange and mysterious ways.


Ok, having said all that, tomorrow I’ll start discussing some relatively simple rules for writing concurrent applications.

Comments (16)

  1. Jeff Parker says:

    Ok, just so I am clear on the topic, this is a whole different type of threading? Yes, still using threads, but doing something different with them. Not like spawning a new thread to go do something while leaving your main UI thread free to show progress bars and update the UI, which is how I most commonly use threads? Or I might use threads in a thread pool to go get updates from multiple locations. That kind of thing is what I normally do.

  2. It’s not a different type of threading. What I’m trying to talk about is what happens once you decide that you want to have more than just your UI thread in your application, and the gotchas that come from that decision.

    The instant you have more than one thread in your application, there are a set of issues that occur that can bite you quite hard if you’re not aware of them. Oftentimes you don’t see them. But every once in a while they surface to cause massive problems.

  3. Jeff Parker says:

    Ahhh, ok – the volatile stuff, like some thread changing something that another thread isn’t expecting to be changed by anything other than itself, amongst a lot of other problems. The things that make your app go BOOM, or make the operations guys want to hunt you down for hanging up their server in the middle of the night.

  4. Brian says:

    Maybe I can get a request in for you to touch on a (relevant) subject I’ve wondered about but never seen an answer to (I’m assuming of course that I know where this series of articles is headed):

    Windows provides MsgWaitForMultipleObjects so that you can wait on e.g. a mutex and still process Windows Messages (which is critical for COM). Since a mutex is overkill for synchronizing access by threads within a process, I’d prefer to use a CRITICAL_SECTION. Here’s the question: Do I have to worry about incoming messages while I’m blocked on EnterCriticalSection? If so, doesn’t that severely limit the number of cases where I can use a CRITICAL_SECTION?

    Also, if anyone is interested in Fibers, Raymond wrote an interesting series a few weeks ago. He also explained why you probably don’t want to use them.

  5. I love thread based concurrency. I do it all in .NET because it’s a little easier with the abstractions. FWIW, if you are writing managed code this is really not that hard.

    Let me put that into perspective. If you can understand the fundamentals of WaitHandles and Locks, it’s really not that hard.

    I try to explain to people that you have to think of a bunch of threads running as streams of water and dams. On a high level it’s your job when programming concurrently to a: make sure the streams don’t intersect and b: make sure the streams don’t overflow. You can also think of it as routing traffic.

    This is a good topic. I hope you won’t mind me chiming in every so often. From reading your blog, Larry, it looks like you don’t write a lot of managed code, and I can honestly say I don’t write a lot of unmanaged code. Can you spend at least one post on the differences between the two environments?



  6. Paul,

    You’re right – I don’t do a huge amount of coding in managed code, but I’ll try to work something in. Most of the principles I’m describing still apply, so…

  7. > or as easy as you want to make it

    Yes if you lock everything. Even consider two people using Notepad on the same file concurrently. You don’t even need two CPU cores to do it, Terminal Services can deliver it to two users concurrently. Whoever saves it last wins.

    Database systems (which also don’t need two cores) have a certain amount of locking built in. If you want to be as simple as possible then you lock the entire database to prohibit all concurrent access. Complexity comes about by finding how to reduce the amount of locking.

  8. Ryan Myers says:

    One thing to keep in mind is that HT is only two cores in the loosest sense. With HT, you have two blocks of minimal processor state (i.e. registers, etc.) that share a set of ALUs, a single fetch/decode engine, and a single cache. The HT engine tries to divvy up the ALUs between the two HTs at each "step" in a way that presumably makes sense.

    Unfortunately, this causes problems when you try to use conventional MP techniques on HT processors. The famous problem with this is spinlocking — if you attempt to use a spinlock on an HT processor, the non-locked core will be starved for ALUs. You have to insert instructions to hint that the core is sleeping in order to pull it off, which XP does. There are also limitations on how one schedules ISR/DPC processing on an HT system.

    In this sense, HT is more of a way to amortize the costs of pipeline stalls than true multiprocessing — and that’s why sometimes turning off HT gives you a total perf gain for tasks. Still, it’s "close enough" for many purposes, if anticipated, and we should make use of what we’ve got.

  9. Have you thought of looking at the prospect of non shared-state concurrency? Ie, using messages to communicate between threads with immutable state. (Thread A creates an object and writes it to a message queue, Thread B reads it, creates a new object rather than mutating the first object and writes that to another message queue), etc.

  10. > > or as easy as you want to make it

    > Yes if you lock everything.

    It is amazing how easy it is to introduce deadlocks if you lock everything. I have been bitten by this more than a couple of times.

  11. A good series of articles about HT, TLP (Thread Level Parallelism) and/or multi-core CPUs are [1], [2] and [3] from Ace’s Hardware. Note that this is from a hardware architecture point of view, and not much about software.

    Looking forward to reading this series, Larry 🙂




  12. Mike Dimmick says:

    Brian: a CRITICAL_SECTION is a combination of a small number of variables accessed using interlocked operations to see whether it can be acquired immediately/with a given amount of spinning, and a kernel object to wait on with WaitForSingleObject if the variables indicate that it cannot.

    As such, if it doesn’t enter the critical section immediately, yes, you will be blocked and unable to handle window messages. You should try to keep the amount of time spent in the critical section to a minimum – perhaps perform all computations before entering the critical section and updating the protected variables, or enter the critical section, copy the values to local variables, then leave.

    Obviously if you need to read the protected values, process them, then write back you’ll need to hold the section for longer periods, unless you’re willing to abandon the results if you discover the source values have changed. It would be better to perform this processing on another thread so that the UI thread remains responsive – post a message back to the UI thread to indicate completion.

  13. Skywing says:

    EnterCriticalSection will not dispatch messages while blocking.

  14. Mike, you’re ahead of me 🙂 That’s article 3 🙂

  15. Hey Larry, I’ve written some topics on this in my blog as well. Check them out … start with the Dec. 30 2004 entry and go onward.