The Cup-Stacker Prince (Multi-threading Made Easy, and a little bit strange)

Computers are not intuitive; they can predict the future perhaps, but they cannot paint magic doors through the boundaries of traditional logic - at least not yet. Imagine, if you will, a cup-stacking champion during his or her early life. How does one begin stacking cups? Very carefully, as the saying goes. At first, a yet-to-be champion cup-stacker works slowly and methodically, learning the parameters of the physical reality in which they must operate. In computing parlance, they have no synchronization primitives in place to manage the concurrent access of two hands to a single collection of cups. Access to the cup collection is serialized. As they practice, they begin to get a feeling for how to do concurrent access to the cup collection without crashing (causing the cup structure to collapse). As they learn, the stacking tasks complete faster because they become more efficiently executed - the brain is able to calculate the least possible synchronization that will still maintain the integrity of the resource (the cup structure).

To complete the full analogy, the cup stacker is the multithreaded application, the hands are the CPUs or threads and the cups are the resource. In this case, I'm thinking of the cups as a heap resource and the table as virtual memory because eventually, I'm going to talk about the difference between the garbage-collected heap of the Common Language Runtime and a traditional C Runtime native heap, using this analogy.

Many of the support cases we get involve questions about performance at some level, and performance problems often involve some kind of contention for a resource, be it the disk drive, the network, memory, CPU time or other. These days, most machines have, at least logically speaking, more than one processor. This means, for example, that for a dual-proc, dual core or hyper-threaded machine, two threads of execution can run simultaneously at any given time. However, if the work each does is dependent on the outcome or result of work the other does, the performance benefit can be compromised because the work must be serialized to some degree. It's helpful to keep this in mind when thinking about multi-threaded performance and how, in some cases, adding more threads to the mix doesn't help.

So, back to the story, as we practice our cup-stacking abilities, we begin to optimize our code synchronization logic such that tasks can execute as much as possible independently of one another until we reach the sweet spot with regard to our particular application dynamic. In other words, we have learned how to become ambidextrous when it comes to grabbing for cups and we never make the mistake of grabbing for the same cup at the same time, causing at best an execution block and at worst, a crash (cups all come tumbling down).

We have been working hard, practicing every day and now, it is time to enter the competition. However, the rules are a little vague. In order to determine the true champion, by-stander agents are allowed to throw obstacles in your way as they see fit, but this added challenge is supposed to be tightly regulated, so the organizers say. As you work your way through the competition, winning each match handily, you realize that your experience as a juggler has become quite valuable in this particular competition. The by-stander agents are throwing cups at you in irregular ways and you must work them into the structure as you stack.

These by-stander agents are the memory allocators in the heap analogy.

However, let's start with the many versus few threads problem because it's simpler and doesn't require the cup structure to be a heap; it can be any given resource.  When you are an application in a cup-stacker competition, each of your hands a thread of execution, and you have your synchronization logic in place from practice such that you can handle more threads efficiently, at what point does adding more threads stop becoming beneficial? Imagine a small number of cups. You can almost always perform better with two hands. In fact, you realize that the only case where you can't perform better with two hands is when there is only one cup. So, with two cups, two hands are better, but four are not. With 32 cups, 32 hands might be better, but probably not due to the overhead (OS resources, stack space, etc) required to operate each hand (thread).  This is the extreme case. Generally speaking, the smaller the cup structure in this analogy, the less likely it will be that more threads will help with the stacking as your extra hands will mostly be getting in the way of one another or sitting idle, not able to get into the fray without causing a lot of contention for single cups. This problem is sometimes called churn.

As a side note, another contention problem can occur if you have a partner trying to help you on the same stack of cups. Since you can't read each other's minds, you must take turns accessing the cup structure. You need a third-party to manage access, which is where the mutex comes in. The mutex is a cross-process primitive, versus an event, which is a cross-thread primitive. In our analogy, the stacker is a process and the hands are threads. Another stacker at the same table requires a mutex to govern when you or your partner can access the cups.

The secondary problem of two many threads is when the tasks are not independent of one another.  In that case, since other threads can't continue until the one in progress returns a result, more threads waiting on that result isn't going to help.

Now, for the heap analogy and the garbage collected heap of .NET applications versus the traditional native heap of C/C++ applications.

As you reach the final rounds of the competition, a barrier is added to the amount of space the cup structure may occupy. You are forced to re-organize the structure on the fly to accommodate the new space limitation. In addition to the space limitation, the agents are now throwing you cups of different sizes - they have the same basic shape, but the sizes are different such that they no longer stack evenly or uniformly. 

For a traditional native heap application, fragmentation is virtually inevitable over time in the general case.  There are remedies and strategies to manage fragmentation, but I won't go into that here.  So, generally speaking, the allocation sizes must be reusable to a high degree. As you reach the space limitation and the agent throws you another cup of a varying size, you must be able to replace an existing cup in your structure with the new cup, or everything will come tumbling down. (The analogy sort of breaks down here due to gravity, but assuming you are fast enough to swap the cups in and out.) What if you try to swap one out and it doesn't budge? This is a memory leak in the abstraction I'm using.  What if you reach for a cup and it's no longer there?  This represents a double-free or memory corruption.

The garbage-collected heap:  What if you were allowed to stack the cups inside one another on the side of the table to make room for the new cups?  This is to some degree what you get with the CLR heap, but imagine how expensive an operation like that might be as opposed to just swapping cups in and out. This stacking on the side to make room is called garbage collection in a strictly abstract sense. As you continue through the competition, you realize that stacking all the cups on the side is too time consuming when you need only stack a certain limited amount to make room.

But, you have adjusted well to the new obstacle and things are going well; you are still winning all of your matches. Then to your surprise and chagrin, the agent starts throwing you multiple cups at once. What do you do?  At this point, for the analogy to work, the agents become the application threads and your hands are now the CLR heap threads, the cup structure still being the CLR heap resource.

This is the heap contention problem you can see in .NET applications (and native as well), when you have multiple threads making allocations concurrently.  At some point, serial access to any heap resource has to be strictly enforced. This is the heap lock that you may have heard or read about. You can never have more than one thread inside a heap at any given time because a change in the structure cannot be communicated to other threads in a timely or reliable way. So, at some point, you have to say, everybody wait until I'm done.  Sure, you might manage concurrent access to some degree, but even then, the micro changes must be serialized at some level, causing contention for the lock that's required.

When you have multiple processors running multiple threads simultaneously in a back-end server application, you can have this contention problem, where you have too many hands working on a single cup structure.  What if you could expand your table size, get more hands and use multiple cup structures?  That is what the server GC setting does in the CLR. It gives each processor its own heap, uses a separate background thread for each heap, making the application more scalable and performant for the kinds of back-end applications that have many threads servicing a work queue where the work items are similar in nature and largely independent of one another.

The concurrency setting on the workstation GC determines whether or not you will continue stacking and swapping with one hand while the other piles cups inside one another on the side of the table, making room.  The concurrency setting is not relevant with the server GC setting. 

A great explanation of these differences in the CLR's garbage collected heap can be found in Maoni's second blog on the subject posted here: 

blogs.msdn.com/maoni/archive/2004/09/25/234273.aspx

. . .

Bob LaCasse - Developer Support (languages)

"I was gratified to be able to answer promptly. I said I don't know." - Mark Twain