Measuring how difficult a threading bug is

Some bugs are easy and some are hard. Bugs can also be hard for a variety of reasons. For example, some bugs are difficult because they involve an extremely complex algorithm. Others are difficult because it's not clear what the correct behavior actually is. Still others may require complex coordination across many teams, or may have to deal with crippling requirements like versioning or globalization.
One set of troubles I'll focus on here is threading bugs and race conditions. I was talking to a coworker about difficult threading bugs in the CLR, and I came up with the following rough heuristics to distinguish easy threading bugs from hard ones.

The order here isn't exact and you could probably swap any item up or down by one. That disclaimer aside, here's how I'd bucket threading bugs from easiest to hardest:
0) Bugs which can be consistently reproed. If you can just rerun the program under a debugger and reproduce the problem, then it's not a hard threading bug (though maybe the bug is hard for other reasons). If you can repro it on your local machine with your own private builds, then you're even luckier.
1) Truly single-threaded stuff. If you've only got a single thread in a single process, then you're not going to run into race conditions.
2) Stuff that can be solved exclusively with locks (critical sections or reader/writer locks). This is a large bucket. Most data races can be solved by slapping a lock around the shared data. If a component can be made thread safe by just taking one giant lock, it's effectively no more complex than a single-threaded component. Maybe for perf reasons you can use interlocked functions like InterlockedIncrement instead of a dedicated lock, though that's effectively like taking a very fast lock for a short window and so has the same algorithmic complexity. Even when a component needs multiple locks, I'd still place it in this bucket. That's because the most common kind of bug with locks is the deadlock, which can be diagnosed at the time of hang with just a list of threads and locks. Certain design patterns, like balancing the enters + leaves and using a lock hierarchy, greatly reduce locking bugs.
3) Threads having very different roles or thread-local state. These are issues that start to creep up in more complex systems. For example, having a producer thread and a consumer thread is more complex than having all the threads do the same thing.
4) Stuff that requires causality, such as explicit calls to WaitForSingleObject. Now things can really start to get ugly. Threading bugs from the previous buckets can all be diagnosed from full dumps containing callstacks and lock state. Being able to diagnose race conditions statically from a full dump is a gift. That is no longer true starting in this bucket. Now the program hangs and there's no thread caught red-handed anymore. WaitForSingle/MultipleObject may block on any waitable object, such as waiting for another thread to exit or waiting for an event to be signalled.
5) Causality, part 2: requiring explicit WaitForMultipleObjects. Waiting on multiple objects is more complex than just waiting on a single object because there are more things that can go wrong. For example, you may have to deal with races between multiple objects in the wait set getting signaled. Things get even more intense when the synchronization is across multiple processes.
6) Stuff requiring SetEvent and complex compound wait primitives. Synchronization with Win32 events (aka conditional variables) becomes even more complex. Bugs around events don't have the strong invariants that locks do and so can't be statically diagnosed as easily. For example, if the process is deadlocked because a thread is blocked at a WaitForSingleObject expecting an event to be signalled, it may not be clear which thread was supposed to signal that event.
7) Live lock, starvation, and thread-prioritization issues. These bugs are evil because it's not clear that the program is even misbehaving. Is it really a live lock, or is the program just slowly chugging through a complex operation? These are cases where even algorithmically correct behavior is still considered buggy.
8) Asynchronous suspension and messing with other threads' contexts. Asynchronously suspending another thread is evil, which is why Thread.Suspend was deprecated in the BCL for v2.0. You don't know what state the suspended thread is in, and if it's in kernel mode things are even wackier. It's also very dangerous to call SetThreadContext on a thread in your own process. First you need the thread suspended (and thus hit the problems I just alluded to), and then you still have to worry about somebody else also modifying the context (whether it be another thread or the kernel). For example, if you suspend a thread right as it takes an exception, things can get wacky.
9) Stuff involving OS synchronization bugs, races with OS user/kernel mode transitions, and exploiting undefined OS synchronization behavior. If you're working on bugs that are forced to confront crazy OS synchronization behavior, that's just hard stuff. Debugging the OS is not an option for most people, and so these bugs mean a critical part of your state cannot be examined. Most people can't change their OS either, so you also need to work around whatever issues you find.
10) Stuff that's provably unsolvable, but for which customers demand a solution anyways.

What other buckets would you add to the list? Would you change the ordering?


Comments (5)
  1. Code running in an OS/platform callback. You can be running with undocumented internal locks held, allowing you to unexpectedly deadlock or silently corrupt OS data structures in a way that causes a bug in the far future. It may not even be possible to debug your code normally, because debugging no longer functions. Examples: loader lock, audio subsystem.

    6-8 depending on how well documented the callback function is.

    Race condition/locks taken against hardware. Debugging the other side is impossible, little chance of getting the hardware changed, and the problem might only be visible below any software (processor, system bus, an expansion card). Examples: video cards, sound cards, anything that electrically resets on fault.

    8-10 depending on the hardware quality.

  2. > Win32 events (aka conditional variables)

    Condition variables are not the same as win32 events. If anything, they are closer to System.Threading.Monitor.

    In general I think that monitor-style code is easier to write and debug than equivalent code using low-level primitives like events.

  3. Dmitriy Zaslavskiy says:

    Somewhere around #9 are the memory-model-related issues (that people seem to ignore). CPU reordering is a real issue (for algorithms that try to avoid even critical sections, or are just trying to be cute). What's very commonly overlooked is that reordering happens not only on IA64 but also on x86 (IA32).

  4. For a compiler, the common things are code size vs. speed . The conventional wisdom is to optimize for

  5. Anybody who says "I can write correct multi threaded code" probably should be saying "I

Comments are closed.
