Bouncing Zero Bugs, Together

Sorry for not blogging much lately, but good heavens, I've been busy. I'm working on a new book, I cooked an early Thanksgiving turkey dinner for 13 of my favourite people yesterday, and we bounced zero bugs on Friday.

We what on Friday?

Lemme splain.

Visual Studio and the .NET runtime and framework libraries are by any sensible measure absolutely immense pieces of software. They've required the combined efforts of thousands of designers, developers, testers, writers, you name it. Getting such a beast out the door and into the hands of customers is non-trivial to say the least.

One of the ways that we make this happen is by shipping the software several times before it is fully ready -- we ship beta releases. Most people look at beta releases from the customer point of view: betas give customers a chance to preview what's coming up, prepare their organizations to take advantage of the new tools, and provide Microsoft with feedback while there is still time to change things.

But look at betas from the release manager's point of view: if you know that you are going to have to ship a few betas between day one of coding and the final release date, it means that you have to keep the quality relatively high throughout the entire cycle. You can't let things get worse and worse, assuming that you'll have time to fix all the bugs just before release. Doing that is a sure way to release late, broken software. Betas force us to produce a product that works reasonably well throughout much of the development process. We're working on the second beta release of VS 2005 right now, and we've just hit an important date in that process.

Something we do to manage the complexity is to track "bugs" in a big old SQL database. I put scare quotes around "bugs" because, first off, I've always thought that "bug" was a needlessly colloquial expression for what is sometimes a serious flaw in a product. To me, "bug" carries a sense of the trivial that "flaw" does not. Second, we track way more than just software flaws in the bug database -- not-yet-implemented features, work items (such as "foo.cs has not yet been reviewed by the security team"), spec bugs (where the code is correct and the specification is wrong), and so on.

The art of software design, like all design, lies in making tradeoffs. We could quite literally fix bugs forever. We have to get the bugs under control so that we can make rational decisions and allocate scarce developer resources to the most important issues in a timely manner. One of the ways we do this is by working towards "bouncing zero". Zero Bug Bounce (ZBB) day is the day on which, even if only for a single instant, every bug in the database is either (a) postponable to the next release/version, (b) fresh -- discovered within the last 48 hours -- or (c) not fixable right now (because, for instance, your bug is caused by a bug in the C# compiler which has been fixed, but the fix hasn't yet propagated into your copy of the sources).
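To make that rule concrete, here's a minimal sketch of the ZBB condition in C#. The Bug class, the disposition categories, and the field names are all invented for illustration -- this is not our actual bug-tracking schema, just the shape of the check:

    using System;
    using System.Collections.Generic;

    // Hypothetical dispositions for illustration only; the real database
    // tracks far more states than this.
    enum BugDisposition { Active, Postponed, BlockedOnExternalFix }

    class Bug
    {
        public BugDisposition Disposition;
        public DateTime Opened;
    }

    static class ZbbCheck
    {
        // A bug "counts against" ZBB only if it is none of (a), (b), (c):
        // not postponed, not blocked on someone else's fix, and older
        // than 48 hours.
        static bool CountsAgainstZbb(Bug bug, DateTime now)
        {
            if (bug.Disposition != BugDisposition.Active)
                return false;
            return (now - bug.Opened) > TimeSpan.FromHours(48);
        }

        // We've "bounced zero" the instant no open bug counts against ZBB.
        public static bool HaveWeBounced(IEnumerable<Bug> openBugs, DateTime now)
        {
            foreach (Bug bug in openBugs)
                if (CountsAgainstZbb(bug, now))
                    return false;
            return true;
        }
    }

The "even if only for a single instant" part matters: the moment that check would return true, you've bounced, even if a tester files a stale-looking bug five minutes later.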

Of course, when we hit ZBB all kinds of cheerleading emails go out from management, with 72-point red YOU ROCK messages filling up my inbox. But it's not just a cheerleading thing -- hitting ZBB means that it is extremely likely that the incoming bug-finding rate from testing is smaller than the bug-fixing rate from development, which in turn means that it is extremely likely that you can ship the product someday.

Now, of course, it is well known that the first rule of metrics is that you get what you measure. If you reward your developers for fixing bugs, they'll write a lot of bugs and then fix them. Similarly, if we're pressured to hit ZBB for ZBB's sake, then people will cut corners to get there. Fixing bugs is not the only way to progress to ZBB -- postponing more aggressively is another way to get there. Incorrectly resolving lots of bugs as "by design" or "not reproducible" is another. Distracting the testers so that they can't find more bugs also helps -- my old manager Trapper used to joke "everyone go dangle shiny objects outside the testers' offices!" Therefore we measure progress not only against the bug count but against other rates as well, like the postpone rate or the percentage of bugs resolved as "by design". If those spike up as the ZBB target date approaches, there might be a problem.
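Purely as an illustration of that kind of sanity check -- the ratios, names and the 1.5x threshold below are made up, not our actual triage criteria -- the "watch the other rates" idea looks something like this:

    // Hypothetical weekly triage numbers; a spike in postpones or
    // "by design" resolutions near the ZBB date is a warning sign.
    class WeeklyTriageStats
    {
        public int BugsResolved;
        public int BugsPostponed;
        public int ResolvedAsByDesign;

        public double PostponeRate
        {
            get { return (double)BugsPostponed / (BugsResolved + BugsPostponed); }
        }

        public double ByDesignRate
        {
            get { return (double)ResolvedAsByDesign / BugsResolved; }
        }
    }

    static class TriageMonitor
    {
        // Flag a week whose postpone or by-design rate jumps well above
        // the recent average -- the 1.5x threshold here is invented.
        public static bool LooksLikeCornerCutting(
            WeeklyTriageStats thisWeek,
            double averagePostponeRate,
            double averageByDesignRate)
        {
            return thisWeek.PostponeRate > 1.5 * averagePostponeRate
                || thisWeek.ByDesignRate > 1.5 * averageByDesignRate;
        }
    }

The point isn't the particular numbers; it's that the bug count alone is too easy to game, so you watch the rates that gaming it would distort.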

We did something really interesting this year. The VSTO Frameworks dev team -- the half-dozen or so devs who work on the underlying programming model of VSTO 2005 -- were not progressing towards ZBB fast enough to hit the date. We therefore cancelled every dev's meetings, booked a big classroom with lots of computers, monitors, desks, whiteboards and sugary snacks, and moved out of our offices to do nothing but fix bugs for a solid week.

It worked -- we got through the bugs extremely quickly. More quickly than anticipated, actually. We rotated Frameworks testers through so that there was always a tester on hand to help us with tricky repros, give opinions on bugs, etc. We had a program manager on hand too to help triage and clarify the specifications when necessary. And any time there was a problem -- say, a check-in test suite was returning an unexpected failure -- there were people right there who could try it on their machines to see if it was a problem with the suite or a problem with the fix. Code reviews were a snap. (Afterwards, we realized that we should have built a machine that was used just to run clean test suites, which would have helped speed things up even more.)

However, it was also an INCREDIBLY IRRITATING way to work. There's a reason why we have offices with doors! Yes, many of the bugs we were working on were "crank through 'em" bugs -- bugs where both the cause of the flaw and the solution are straightforward. Isolate the problem, make the fix, get a code review, run the test suites, check in, lather, rinse, repeat. I, however, was working on a "bug farm" feature. One of the VSTO features didn't have a particularly clear spec to begin with, and over the last two years we've made many design and implementation changes without always taking the time to spec out the new semantics. As a result, it was extremely easy to introduce new bugs by fixing old ones -- I had four bugs assigned to me in this area, and in reviewing the code to isolate the bugs I found a good half-dozen more.

Trying to patch each one piecemeal and crank through them was not a good idea. First, because I'd probably introduce five new bugs while fixing the ten old ones. Second, because without a clear idea of what the correct semantics are, it's hard to know whether you've made the right fix. I ended up spending the whole week refining the desired semantics, rewriting the code as I went, and covering huge whiteboards with notes on proposed fixes for my companions to review.

On the one hand, having people around to bounce ideas off of all day was a big help in figuring out the right thing to do. But on the other hand, every time one of them called me in for a code review it was a big derailment of my train of thought. And it goes the other way too -- I derailed a lot of people when I'd stand up suddenly and ask some ludicrously complicated question like "what about the case where the item was not loaded from the on-disk cache, then added to the in-memory cache, then removed from the in-memory cache? Do we need to track the fact that the disk cache is still clean?" Very distracting, that.

It was definitely a good thing to try, but I think it works best as an emergency measure rather than as a day-to-day practice.

-------------------------------------------------------

In other news, I've finished the first cut of my Google-on-the-cheap tool and it seems to be working really well. However, I'm going to be going dark again for a couple of weeks. I am heading to Mexico to attend a family wedding and visit an old friend who has recently moved to the middle of Mexico. I'll have absolutely no access to computers while I'm gone. I might post a bit more before I go, but probably not. I'll see y'all in a couple of weeks!