Practical debugging: Apply some science to the problem

Untested software has bugs. As we test software we find many different kinds of bugs, and we want to identify the root cause of as many of them as we can. Finding root causes can be fiendishly difficult and time-consuming. Using a systematic, scientific approach can help you with the hard problems.

Say you have a program that used to work, but now it crashes. You look into it and realize it has been broken for quite some time and fell through the holes in your testing. Now you have several related check-ins, plus dozens of others that have at least a remote chance of being the culprit.

Stop poking at problems with that stick

When we find problems in software we often have a good idea what the problem might be. We do some informal checking and poking around. Most of the time, this pays off: our experience and intuition get us to the heart of the problem without a lot of thrashing. The trouble comes when this approach fails us. All too often we keep blindly poking with the stick and end up thrashing around without finding the answer.

A good rule of thumb is to switch to a more rigorous approach when your guesswork has failed three or more times. In our example you might guess it was the latest check-in and back that out. That turns out to be a dead end: the program still crashes. You check the dependencies and they don’t reveal anything. Next you run the test on some other machines and find it’s a universal problem. At this point you should make a firm plan for when you will stop guessing and switch to a more controlled approach.
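
For the pile of suspect check-ins, one controlled way to find the culprit is a binary search over the revision history. As a sketch, assuming the code lives in Git and using a hypothetical crash_repro.py script that exits 0 when the program works and non-zero when it crashes:

    #!/usr/bin/env python3
    # crash_repro.py: hypothetical repro script for "git bisect run".
    # Exit 0 if the program works, non-zero if it crashes, so Git can
    # mark each revision good or bad automatically.
    import subprocess
    import sys

    # Build the revision under test; exit code 125 tells bisect to
    # skip revisions that cannot be built at all.
    if subprocess.run(["make"]).returncode != 0:
        sys.exit(125)

    # Run the program the exact same way every time, so the revision
    # is the only variable that changes between bisect steps.
    try:
        result = subprocess.run(["./myprogram", "--run-once"], timeout=60)
    except subprocess.TimeoutExpired:
        sys.exit(1)  # treat a hang as a crash for bisect purposes
    sys.exit(0 if result.returncode == 0 else 1)

With that script in place, git bisect start, git bisect bad HEAD, git bisect good <last-known-good>, and git bisect run ./crash_repro.py narrow dozens of suspect check-ins to one in a logarithmic number of builds. The program name, flag, and build command here are placeholders.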

Use the scientific approach

At its heart, science is testing your beliefs. When you were guessing about the source of the problem, you were testing beliefs in a relaxed way. Now you realize that nothing obvious is wrong and you will need to do some deeper technical work. You need the scientific method. For software testing there are really just three key things to keep in mind.

Control the variables

You need to make sure you are changing things one at a time while you run your tests. You might get the program to work after re-installing the operating system on the host, changing some code, and rebuilding the database. What you won’t know is what fixed the problem, or whether it’s gone for good. When you switch to a rigorous method, you have to be careful and control the variables one at a time. It’s easy to fall into the trap of changing lots of stuff and trying again. You might get the program working again sooner, but you lose too much valuable diagnostic data. In lab environments this is especially tempting, since you just need the environment “up” to complete your work. The problem is you don’t know if it was a trivial error you will never see again, or if you just put a tarp over a deep pit.
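
To make “one at a time” concrete, here is a minimal sketch: every experiment starts from the same known baseline, changes exactly one factor, and records the result before anything else moves. The factor names and the run_test stand-in are hypothetical:

    # One-variable-at-a-time sketch: every experiment starts from the
    # same baseline, changes exactly one factor, records the outcome.
    BASELINE = {"os_patch": "current", "code_rev": "HEAD", "db_schema": "v42"}

    def run_test(config):
        # Hypothetical stand-in: rebuild/redeploy with `config`, run the
        # failing scenario, and return True if the crash still reproduces.
        ...

    results = {}
    for factor, trial_value in [("os_patch", "rolled_back"),
                                ("code_rev", "HEAD~1"),
                                ("db_schema", "v41")]:
        config = dict(BASELINE)        # fresh copy, nothing leaks between runs
        config[factor] = trial_value   # exactly one change from the baseline
        results[factor] = run_test(config)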

You should be taking notes at this point on the variables and how you change them. You don’t have to exhaustively note every little thing, but you should be able to back out any change you make. In code this means having a checkpoint. In an installation it may mean a snapshot image. Maybe it’s just a list of values you change in the database. Just be sure you can unwind the stack if you have to.
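
The “unwind the stack” idea can be taken literally: every change you apply pushes the action that reverses it. A minimal sketch, where the do and undo callables stand in for whatever the real change is (a config edit, a schema tweak, a code patch):

    # Minimal change log with undo: each applied change pushes the
    # action that reverses it, so experiments back out in LIFO order.
    undo_stack = []

    def apply_change(description, do, undo):
        print("applying:", description)
        do()
        undo_stack.append((description, undo))

    def unwind():
        # Newest change first, until we are back at the checkpoint.
        while undo_stack:
            description, undo = undo_stack.pop()
            print("reverting:", description)
            undo()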

Create a hypothesis

Once you have control of the variables you need to guess again about what is wrong, but this time in a controlled way. State your idea in a form you can prove wrong, and write your hypothesis down.

In the example of the crashing program, we might hypothesize anything from a corrupt pointer assignment to something outrageous like “it won’t work on Tuesdays when the moon is full.” The most important thing is that your hypothesis is narrow enough to be tested. Often it’s a really good idea to make a list of things you don’t think are wrong and test them first. A good example might be “The file was corrupted at install time.” We don’t really think this is causing the program to crash, since we have several versions of the installer and it’s unlikely they would all be corrupt. However, it’s possible, and until we clear it as a culprit it could be masking other problems.
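
Writing hypotheses down works best when each claim is paired with the observation that would falsify it. A sketch of such a list, with illustrative entries:

    # Each hypothesis pairs a claim with the observation that would
    # prove it wrong. If the second field is hard to fill in, the
    # claim is too vague to test.
    hypotheses = [
        ("The installed file was corrupted at install time",
         "its checksum matches the copy on the build server"),
        ("The crash came in with the latest check-in",
         "the previous revision crashes too"),
        ("The crash is specific to this host",
         "a clean machine reproduces it"),
    ]

    for claim, falsifier in hypotheses:
        print(f"HYPOTHESIS: {claim}\n  falsified if: {falsifier}")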

Try to prove it wrong

Once you have a hypothesis, run an experiment. Design the experiment so that it can prove your hypothesis wrong, and so that it changes the fewest variables possible while doing so.

In our example we could diff the file with the one on the build server. If they are identical, we know that the corruption hypothesis was wrong. Scratch it off the list and move on to another one.
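
That diff can be a few lines of checksumming. A sketch, assuming the build-server copy has been fetched locally; both paths are hypothetical:

    import hashlib

    def sha256_of(path):
        # Hash in chunks so large binaries need not fit in memory.
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
        return digest.hexdigest()

    installed = sha256_of("/opt/myapp/bin/myprogram")   # hypothetical paths
    reference = sha256_of("./build_copy/myprogram")

    if installed == reference:
        print("identical: corruption hypothesis falsified, cross it off")
    else:
        print("files differ: the hypothesis survives, dig deeper")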

Rinse and repeat.

Jump start the car before you take the engine apart

The problem with this approach is the sheer number of hypotheses there are. Defining them all narrowly and then designing an experiment to test each one would be very time-consuming. You need an approach that favors getting results quickly. An analogy is a car that won’t start. Nearly any part in the car could be bad. Based on your history with the car you might have some guesses. While you are going through all the possibilities you might as well try to jump-start the car. It’s easy to do and takes only a few minutes. If the car starts, wonderful, you are done. If not, you have quickly ruled out a very common reason for cars not starting (a dead battery).

Rank your hypotheses to get the most coverage

When you start to think about your hypotheses, rank them by how hard they are to test and how much information the test will give you. Start with the easiest experiments that give you the most information. Every test you run gives you data, and all of that data helps you decide which tests to focus on next.

In the example of the crashing program, you might have one hypothesis that a particular line of code is at fault and another that the network connection on the host machine is at fault. An experienced network engineer can rule out 99% of network problems with a few simple tests. If that’s quicker and easier than tweaking the line of code, start there. On the other hand, you may have the code open in an editor, and building and installing may be trivial. In that case you can test your “line of bad code” hypothesis much more easily than the network one.
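
One way to make the ranking concrete is to score each experiment by rough cost and expected information, then sort on the ratio. The entries and numbers below are illustrative gut-feel estimates, not measurements:

    # Rank experiments by information gained per unit of effort:
    # cost in rough minutes of work, info on a 1-10 scale for how much
    # the result narrows the search, whichever way it comes out.
    experiments = [
        {"name": "checksum the installed binary", "cost": 5,   "info": 3},
        {"name": "rule out the network",          "cost": 15,  "info": 6},
        {"name": "bisect the check-ins",          "cost": 60,  "info": 9},
        {"name": "single-step the suspect line",  "cost": 120, "info": 8},
    ]

    ranked = sorted(experiments, key=lambda e: e["info"] / e["cost"], reverse=True)
    for e in ranked:
        print(f"{e['name']}: {e['info'] / e['cost']:.2f} info per minute")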

Stick with the science

It’s tempting to go back to poking your problems with a stick. You rule out a few major suspects and start exploring at random again. This is rarely productive. If you find yourself changing lots of variables at once, you are probably wasting time. If you are stuck for more experiments, some open-ended exploration to gather more data is fine. But if you still have plenty of possible culprits, keep working through the list. It’s just like running any other set of test cases.