Bug prevention in five minutes

These are some notes from a lightning talk I did at STAR East. I didn't say everything below, but I probably said most of it plus some other stuff. Nobody counted officially, but I am positive that I packed more words into my five minutes than any of the other eight speakers.

How many of you have found bugs? Anybody ever found the same bug twice? Ever found the same bug in two different areas of the product, or the same type of bug in two completely different applications? If you haven’t, you will. That’s why we have heuristics and patterns to identify bugs: we know that certain types of bugs will always exist, so we test for them. The problem (for me) is that we keep on finding those same bugs.

Maybe you haven’t got there yet, but I hate finding the same bugs over and over: “Oh look, I put a really big number into the text field and something goes wrong.” Not again… I love a challenge, and I get the most satisfaction, the biggest smile on my face, when I find a bug that would make Sherlock Holmes (or the guys on CSI) proud. The problem is that there are so many of the “easy”, “fish in a barrel” bugs that I don’t get to spend enough time digging, investigating, and finding those really interesting bugs.

The idea behind bug prevention (or defect prevention) is to take the big buckets of bugs (the bugs that happen often) and implement some sort of prevention technique. It’s not always the easiest thing to do, but it’s usually very effective.

So, how do you determine bug prevention techniques?

You start with root cause analysis (or brute cause analysis as Jon Bach calls it). RCA is used extensively outside of the software industry (think about space shuttles and Pintos), but has also had some success in the software industry.

One of the great anecdotes in root cause analysis is Richard Feynman’s analysis of the Challenger shuttle disaster. Through extensive root cause analysis, he determined how and why an O-ring failed.

Of course, there are many ways you can do RCA. There's Cause Analysis, Failure Mode and Effects Analysis (FMEA), and lots of other types of analysis with even fancier names. The people who invented those names will be mad at me for telling you this, but you don’t need to remember any of that.

RCA has traditionally been a heavyweight process that takes extensive research and analysis. In fact, one study I’m familiar with did an extensive analysis of “late cycle defects” found in a major product (bugs found in the last 2 months before release). They looked at 150 or so bugs and wrote (as researchers often do) a 100-page paper saying that better code reviews would have prevented most of them. You could have just asked me :}

Just think of Root Cause Analysis as a careful look at undesired events. That’s all it is. Nothing about that phrase requires a heavyweight process, so let me give you a technique that can carefully look at undesired events and be a great first step toward defect prevention.

The 5 whys is a technique developed in the 1930s at Toyota. Most of the testers I know love to ask questions, so the 5 whys is a perfect technique for testers. The concept is simple. Ask “why” until you get to a point where you can implement a prevention technique. The hypothesis is that each error / bug / defect is the symptom of some other underlying problem. The inventors decided that 5 was about the right number of times to ask why before you could get to something actionable. Sometimes you only need 3 or 4 whys, and sometimes you may need 6 or more.

Let’s give it a shot:

1. Why did you lock your keys in the car?

- Because after I took them out of the ignition, I left them on the seat.

2. Why did you leave them on the seat?

- Because I had to reach on the floor to get my wallet.

3. Why did you have to reach on the floor to get your wallet?

- Because I was in a hurry when I got in the car and threw my bag on the seat without zipping it up and my wallet fell out.

4. Why were you in a hurry?

- Because I overslept.

5. Why did you oversleep?

- Because I was up late last night answering your questions.

From this, we could hypothesize that I locked my keys in the car because I was tired. As you can tell, it doesn't always lead to something that can be prevented, but it's fun anyway. Let’s try it in software.

1. Why did the program crash when I used a really long filename?

- Because the filename passed to the program was larger than the buffer allocated for the filename.

2. Why didn’t the developer use a bigger buffer?

- Because the developer didn’t know filenames could be that long.

3. Why didn’t the developer know filenames could be that long?

- Because the developer wasn’t trained / mentored, or didn’t have proper documentation.

4. Why was there no training or mentorship or documentation for the developer?

- Because training and code review are skipped when the team is busy.

Sometimes, a process change is necessary as a preventative technique.
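
If you want to keep these chains around for later analysis, you don't need a tool with a fancy name. Here is a minimal sketch in Python of one way to record a chain like the one above. The `WhyChain` class and its method names are made up for illustration, not part of any RCA tool I know of, and treating the last answer as the candidate root cause is just an assumption that matches how the example above ends.

```python
from dataclasses import dataclass, field


@dataclass
class WhyChain:
    """One lightweight RCA record: a symptom plus the answer to each 'why'.

    Hypothetical helper for illustration only.
    """
    symptom: str
    answers: list[str] = field(default_factory=list)

    def ask_why(self, answer: str) -> "WhyChain":
        # Each answer becomes the thing we ask "why" about next.
        self.answers.append(answer)
        return self

    def root_cause(self) -> str:
        # Assumption: the last answer is the candidate root cause.
        # With no answers yet, all we have is the symptom itself.
        return self.answers[-1] if self.answers else self.symptom


chain = (
    WhyChain("Program crashes on a really long filename")
    .ask_why("Filename was larger than the buffer allocated for it")
    .ask_why("Developer didn't know filenames could be that long")
    .ask_why("Developer wasn't trained or mentored, or lacked documentation")
    .ask_why("Training and code review are skipped when the team is busy")
)

print(chain.root_cause())
```

Nothing here decides when to stop asking. That part is still up to you, and the chain is only as good as the answers you put in it.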

How do you start RCA? You probably don’t want to start by analyzing every single bug found by your test team, so start with a subset. In the past, I’ve done projects where I’ve done lightweight RCA on every bug found post-release: bugs that needed a patch (i.e. the bugs that cost lots and lots of money). Another good approach may be to analyze all high severity bugs. There are probably other classes of bugs that may be better for your particular situation.

Now, are you going to implement a prevention technique for every root cause you find? Probably not, but this is another case where the Pareto principle (80% of the problems are caused by 20% of the causes) is pretty accurate. Identify a root cause, and don’t worry at first about whether you went deep enough; just dig until you find a root cause that you think is actionable. Once you’ve gathered a few dozen potential root causes, you can bucket / group the data and see if there’s a prevention technique that should be implemented.
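
The bucketing step can be as simple as counting labels. The sketch below assumes you have tagged each analyzed bug with one root cause label. It counts the labels and returns the biggest buckets that together cover a chosen share of the bugs. The function name `top_buckets`, the 80% default, and the sample data are all invented for illustration; real data will not split this neatly.

```python
from collections import Counter


def top_buckets(root_causes, coverage=0.8):
    """Return the most common root causes that together cover `coverage` of bugs.

    Hypothetical helper: `root_causes` holds one label per analyzed bug.
    """
    counts = Counter(root_causes)
    total = sum(counts.values())
    picked, covered = [], 0
    # Walk the buckets from largest to smallest until enough bugs are covered.
    for cause, count in counts.most_common():
        if covered / total >= coverage:
            break
        picked.append((cause, count))
        covered += count
    return picked


# Made-up sample data: one root cause label per bug.
bugs = (
    ["skipped code review"] * 10
    + ["missing input validation"] * 7
    + ["unclear spec"] * 2
    + ["flaky build"] * 1
)

for cause, count in top_buckets(bugs):
    print(f"{cause}: {count}")
```

With this sample, two of the four causes account for 17 of the 20 bugs, so those are the two buckets where a prevention technique is most likely to pay off.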

The 5 whys have some shortcomings – they don’t always identify the true root cause, and they definitely don’t tell you what the preventative technique is. Often, once you find a common root cause, you may have to brainstorm on a preventative technique.

The answer could be to implement a tool or process / policy, or it may be to provide additional education, or some sort of combination of all of those things. The prevention measure may also depend on the context of your organization – training or tools may not be an option on your team. Process changes are often the most difficult to implement, but may be the best solution in many cases.

Often, a change in process is the method of prevention. The Feynman O-ring story I mentioned earlier ended with Feynman writing an extensive report. The report didn’t say that the O-ring failure could have been solved with better O-rings. Instead, he identified that the NASA safety commission and launch process were seriously flawed.

One final thing to remember is that a little goes a long way. Even if you prevent dozens of classes of bugs from making it to test, there will still be more bugs to find, but now the bugs that are left will be a challenge to find, and will continue to make testing fun.

Whew! That's about 1200 words, which is 240 per minute, and a mere 4 per second. That's quite a mouthful, but if you've ever heard me talk, you know I could pull it off!

Want to know more? Pre-order a copy of this book.