Now for a blog post on that all-time favorite developer activity: debugging. The detective work involved in debugging takes on a new challenge when multiple threads or multiple processes are involved. You also have to learn some new terminology - words and phrases like "atomicity", "race conditions", "locking and unlocking", and so on. However, the word used most often when investigating multi-threaded bugs is simply "timing" (or "concurrency" if you must use the technical term)!
This is the scenario. You develop a multi-threaded application that appears to work just fine when you test it once, or twice, or ten times. But then, out of the blue on another test, the program crashes dramatically. The same code, the same computers, the same inputs, but the result is a crash. The problem is often a concurrency bug: something was different about the exact timing of the two threads, and that difference exposed the faulty code and caused the crash. OK, so now we know the application has a bug, but how do we find it? In a large application with more than two threads this can be a daunting task. In simple terms the problem is "how can I reproduce the same timing conditions so as to reproduce the bug?". There is of course the simple answer, which is to keep testing and testing, but there is another and better answer: use a concurrency debugger.
Let's assume you have a really simple application with two threads. The first thread calculates the area of a circle (given a radius), and the second thread reads both the radius and area and simply checks that they make sense. The following code snippets just show the pertinent parts:
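The original listing is not available here, so what follows is only a minimal sketch of what such code might look like. The function name changeCircle and the Line 1, Line 2 and Line 3 labels come from the discussion in this post; the Circle struct, the kPi constant and the circleIsConsistent helper are names assumed purely for illustration.

```cpp
#include <cmath>

const double kPi = 3.141592653589793;

// Shared data: in the buggy program both threads touch this with no locking.
struct Circle {
    double radius = 1.0;
    double area = kPi;  // area for radius 1.0
};

// Thread A: change the radius, then recalculate the area.
void changeCircle(Circle& c, double newRadius) {
    c.radius = newRadius;                 // Line 1
    c.area = kPi * c.radius * c.radius;   // Line 2
}

// Thread B: check that the radius and area make sense together.
bool circleIsConsistent(const Circle& c) {
    return std::fabs(c.area - kPi * c.radius * c.radius) < 1e-9;  // Line 3
}
```

Called one after the other on a single thread, these functions always agree. The trouble only starts when changeCircle and circleIsConsistent run on separate threads against the same Circle with no synchronization, which in C++ is a data race.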
Most of the time the application works just fine. However, consider the situation when Thread A has executed Line 1 of the changeCircle function, but has not yet executed Line 2. At this exact moment Thread B executes Line 3. The result is a crash: Thread B sees a radius and an area that are out of sync.
This example is what is known as an atomicity bug. Both Line 1 and Line 2 of Thread A should be performed together (as an atom, so to speak) before Thread B is allowed to read anything. The fix for this bug is to lock these two statements together:
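As a hedged sketch of that fix, the version below guards the shared data with a standard C++ std::mutex. The original code may well have used a different locking primitive (a Windows critical section, for example); the Circle struct, kPi, circleIsConsistent and the countInconsistentReads test harness are all names assumed for illustration.

```cpp
#include <cmath>
#include <mutex>
#include <thread>

const double kPi = 3.141592653589793;

struct Circle {
    double radius = 1.0;
    double area = kPi;
    std::mutex lock;  // guards radius and area as a single unit
};

// Thread A: Line 1 and Line 2 now execute as one atomic step.
void changeCircle(Circle& c, double newRadius) {
    std::lock_guard<std::mutex> guard(c.lock);
    c.radius = newRadius;                 // Line 1
    c.area = kPi * c.radius * c.radius;   // Line 2
}

// Thread B: takes the same lock, so it can never read between Line 1 and Line 2.
bool circleIsConsistent(Circle& c) {
    std::lock_guard<std::mutex> guard(c.lock);
    return std::fabs(c.area - kPi * c.radius * c.radius) < 1e-9;  // Line 3
}

// Runs a writer and a checker side by side and counts inconsistent reads.
int countInconsistentReads(int iterations) {
    Circle c;
    int failures = 0;  // written only by the checker thread, read after join
    std::thread writer([&] {
        for (int i = 1; i <= iterations; ++i) {
            changeCircle(c, static_cast<double>(i));
        }
    });
    std::thread checker([&] {
        for (int i = 0; i < iterations; ++i) {
            if (!circleIsConsistent(c)) ++failures;
        }
    });
    writer.join();
    checker.join();
    return failures;
}
```

Note that the reader has to take the lock as well as the writer. Locking only Thread A would still let Thread B peek at the data halfway through an update.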
A new company, PetraVM, has created a novel concurrency debugging tool, called Jinx, that can be used along with Visual Studio 2010 (or 2008) to help nail bugs such as these. There are two big issues: first, how to expose concurrency bugs that under normal timing may occur only very infrequently, and second, how to understand the cause of the bug.
Both of these problems are handled by Jinx in a pretty impressive way. To reproduce the bugs Jinx makes a copy of the current application, and runs multiple "simulations" of it behind the scenes, trying to force concurrency bugs to occur. Jinx can examine code and force delays in the threads that are running so that clashes involving shared data occur much more frequently. When one of these "simulations" results in a crash, that simulation becomes the reality for the programmer testing their application. The simulations occur at the machine-language level, and of course cannot pass a point in the program where input or output is required (either from a user, or to a device, including the computer's screen); instead they focus on the accesses to shared data.
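To get a feel for why forcing particular timings is so effective, here is a toy illustration. It is emphatically not how Jinx works internally (Jinx operates at the machine-language level on the real program); it is just a single-threaded replay of the circle example that tries every point at which Thread B's check could land relative to Thread A's two statements. All the names in it are assumptions made for this sketch.

```cpp
#include <cmath>
#include <vector>

const double kPi = 3.141592653589793;

struct Circle {
    double radius = 1.0;
    double area = kPi;
};

// Replays one schedule on a single thread. stepsOfABeforeCheck is how many of
// Thread A's statements (0, 1 or 2) run before Thread B performs its check.
bool checkPasses(int stepsOfABeforeCheck, double newRadius) {
    Circle c;
    bool ok = true;
    for (int step = 0; step <= 2; ++step) {
        if (step == stepsOfABeforeCheck) {
            // Thread B, Line 3: do radius and area agree?
            ok = std::fabs(c.area - kPi * c.radius * c.radius) < 1e-9;
        }
        if (step == 0) c.radius = newRadius;                // Thread A, Line 1
        if (step == 1) c.area = kPi * c.radius * c.radius;  // Thread A, Line 2
    }
    return ok;
}

// Tries every schedule and returns the ones in which the check fails.
std::vector<int> failingSchedules(double newRadius) {
    std::vector<int> failing;
    for (int k = 0; k <= 2; ++k) {
        if (!checkPasses(k, newRadius)) failing.push_back(k);
    }
    return failing;
}
```

Of the three possible schedules only one fails: the one where the check lands between Line 1 and Line 2. In ordinary testing you have to wait for the scheduler to stumble onto that window by chance, whereas a tool that deliberately steers threads into such windows finds it almost immediately.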
In some simple tests, Jinx can find a concurrency bug in seconds that otherwise can take many minutes of repeated testing. The company likes to say that Jinx makes your code "unlucky".
The image below shows the Visual Studio Jinx plug-in being used to track down the concurrency bug in our simple example.
Finding out that a bug exists is half the battle. The second problem is locating the faulty code. One of the problems with multi-threaded application debugging is called overshoot. One of the threads causes a problem, another thread crashes, but the problem thread will run on for a short amount of time (so a lot of statements are executed) before the processor brings all threads to a halt. This overshoot by the problem thread aids the bug in remaining undetected. The overshoot can even result in the problem data being repaired, or perhaps partially repaired, making discovery of the faulty code even harder. Jinx has a clever feature called SmartStop, which, when a bug is detected, prevents the overshoot and holds the problem thread on a line of code that was the last point of communication with the shared data. In our example above, SmartStop would stop Thread A on Line 1 of the changeCircle function - as this was the last point of communication before the crash. Clearly a developer should examine the lines of code around the stopped point of the non-crashing thread very carefully!
It is often the non-crashing thread that is at fault; the crashing thread was just doing its job!
Jinx is simple to install, and requires very few changes to existing code: just the inclusion of a header file, a call to a function named jinx_register_application, and optionally replacing some or all assert calls with jinx_assert calls. In other words, it takes only a few minutes to get Jinx working on your behalf. The Jinx Control Window gives the option of debugging just the most recently registered application, or all registered applications, or even the full system including the operating system.
Jinx runs only on Windows. To try it out, download the beta and run the Jinx installer.