One of the most important things to pay attention to when tuning a multithreaded application is its performance pattern. There is a set of common poor performance patterns that most developers of multithreaded applications will encounter. These include, among other things, patterns such as oversubscription, serialization, lock convoys, and uneven workload distribution. We have documented some of these poor parallel performance patterns in what we call the Rogues Gallery. We plan to extend this list based on input from the developer community as customers identify and share new patterns.
In this post, I will discuss oversubscription. In the case of oversubscription, the number of threads trying to run exceeds the number of available logical cores. This over-decomposition of the workload leads to additional unnecessary context switches. Context switches have a nonzero cost and are especially costly when they cross cores.
To make this more clear, I will present a simple example. In this case, I solve the same workload using four threads and then again with ten threads. My machine has four cores so I know that creating ten threads is unnecessary. The first two images depict the activity of four threads executing along with stats in the Visible Timeline Profile. The second two images show the behavior when ten threads are used to solve the same problem as well as the associated Visible Timeline Profile.
Using four threads (3.41 sec):
Using ten threads (4.84 sec):
After running several trials for both numbers of threads, I timed both scenarios and found that on average, using four threads solved the problem in roughly 3.41 seconds while using ten threads took roughly 4.84 seconds. In this example, we can see from the threads view that there is a significant amount of additional preemption caused by introducing more threads. A substantially larger portion of thread time is spent waiting for resources to be made available.
The active legend shows 4% of preemption when using four threads and 49% of preemption when using ten threads. There is a bit of preemption even with four threads due to the CPU contention of other processes running on the system. Due to the increased context switching caused by additional threads, the ten-thread implementation ran more slowly. If your algorithm lends itself to parallelization and it isn’t necessary to create more threads than there are cores, it is best to avoid oversubscription.
This is just one of the many possible poor performance patterns that one must keep in mind when writing multithreaded applications. It is important to be aware of whether your application is affected by such a poor performance pattern and the Concurrency Visualizer is a great way to quickly identify if your application is afflicted by such a pattern.
James Rapp - Parallel Computing Platform