Larry's rules of software engineering, Part 3: "Zero Defects" doesn't result in zero defects

I've held off on writing this particular post for a while, since it's somewhat controversial, but what the heck, you only live once :).

As Fred Brooks pointed out in his seminal "The Mythical Man-Month" (a title that EVERY engineer should have on their shelf), one of the unavoidable aspects of software engineering is the bug curve.  As Brooks explained it, the bug count associated with every software project has a rather distinctive curve.

At the start of the project, the number of bugs is quite small, as developers write new code.  Since the new code has significant numbers of bugs, the bug count associated with that new code increases in proportion to the number of lines of new code written.

Over the course of time, the number of bugs found in the code increases dramatically as test starts finding the bugs in the new code.  Eventually, development finishes writing new code and starts to address the bugs that were introduced with it, so the rate of bug increase starts to diminish.  At some point, development catches up with the bug backlog and starts fixing bugs faster than test discovers new ones.

And that's when the drive to ZBB (Zero Bug Bounce) finally begins, indicating that the project is finally on the track to completion.
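
To make the shape of that curve concrete, here is a toy model of it.  Everything in it is an assumption for illustration: the function name, the weekly rates, and the idea that the project has just two phases (writing new code, then fixing bugs).  It isn't taken from Brooks, it just shows the rise-then-fall pattern described above.

```python
def open_bug_curve(coding_weeks, total_weeks,
                   found_while_coding=10, fixed_while_coding=2,
                   found_after=4, fixed_after=12):
    """Return the open bug count at the end of each week (toy model).

    All rates are hypothetical bugs-per-week figures chosen only to
    produce the characteristic shape; they are not measured data.
    """
    open_bugs = 0
    curve = []
    for week in range(total_weeks):
        if week < coding_weeks:
            # New code is still being written: test finds bugs faster
            # than development fixes them, so the backlog grows.
            open_bugs += found_while_coding - fixed_while_coding
        else:
            # Code complete: development works the backlog down.
            open_bugs = max(0, open_bugs + found_after - fixed_after)
        curve.append(open_bugs)
    return curve


curve = open_bug_curve(coding_weeks=5, total_weeks=12)
peak_week = curve.index(max(curve))   # last week of new code
first_zero_week = curve.index(0)      # where the drive to zero ends
```

With these made-up rates the backlog peaks in the last week of new code and then falls steadily until it first reaches zero.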

Over time, various managers have looked at this bug trend and realized that they can stop this trend of developers introducing new bugs, test finding them, and so on by simply mandating that developers can't have any active bugs in the database before writing new code.  The theory is that if the developers have to address their bugs early, there won't be a bug backlog to manage, and thus there will be a significant reduction in the time to ZBB.  This means that the overall development time for the project will be reduced, which means that the cost of development is lower, the time-to-market is sooner, and everyone is happy.

On paper, this looks REALLY good.  Developers can't start working on new features until they have addressed all the bugs in the previous features, which means that they won't have to worry about a huge backlog of bugs when they start on the new feature.  Forcing development to deal with their bugs earlier also means that they'll have a stronger incentive to not introduce as many bugs (since outstanding bugs keep the developers from the "fun stuff" - writing new code).

The problem is that Zero Defects as often promoted doesn't work in practice, especially on large projects.  Fundamentally, the problem is that forcing developers to keep their bug slate clean means that developers can be perpetually prevented from writing new code.

This is especially true if you're dealing with a component with a significant history - some components have intractable bugs that would require significant rewrites of the code to resolve, but the rewrite that is necessary to fix the bug would potentially destabilize hundreds (or thousands) of applications.  This means that the fix may be worse than the actual bug.  Now those bugs are very real bugs, so they shouldn't be ignored (and thus the bugs shouldn't be resolved "won't fix").  On the other hand, it's not clear that these bugs should stop new development - after all, some of them may have been in the component for two or three releases already.

Now this isn't to say that they shouldn't be fixed eventually.  They absolutely should be, but there are always trade-offs that have to be made.  Chris Pratley (who needs to blog more :)), over on the Office team, has a wonderful blog post about some of the reasons that Microsoft decides not to take bug fixes.  Before anyone criticizes my logic in the previous paragraphs ("But of course you need to fix all the bugs in the product, stupid!"), they should read his post.

But the thing is that under a strict ZD policy these bugs prevent new development from proceeding.  They're real bugs, but they're not likely to be fixed soon, and it may take months to determine the correct fix (which often turns them into new work items).  The other problem that shows up in older code-bases is bug churn.  For some very old code-bases, especially the ones written in the 1980s, there is a constant nonzero incoming bug rate.  Bugs show up at a rate of one or two a month, which is enough to keep a developer from ever starting work on new features.

In practice, teams that attempt to use ZD as a consistent methodology to reduce development time on large scale projects have invariably found that it doesn't reduce the overall development time.

If, on the other hand, you apply some rational modifications to ZD, you can use the ZD concepts to maintain consistent code quality throughout your project.  For instance, instead of mandating absolute zero defects for every developer, set criteria for which bugs must be fixed before new work starts.  Bugs that are older than 2 years may be lower priority than new bugs, security bugs (or potential security bugs) are higher priority than other bugs, etc.
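
As a rough sketch of what such criteria might look like in code (the `Bug` fields, the `blocks_new_work` name, and the exact rule are all my assumptions for illustration, not anybody's real bug database or policy): security bugs always block new work, and other bugs block only if they are less than two years old.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical cutoff, taken from the "older than 2 years" example above.
LEGACY_AGE = timedelta(days=2 * 365)


@dataclass
class Bug:
    # Hypothetical record; a real bug database tracks far more than this.
    title: str
    opened: date
    is_security: bool = False


def blocks_new_work(bug: Bug, today: date) -> bool:
    """Return True if the bug must be fixed before its owner writes new code."""
    if bug.is_security:
        # Security bugs (or potential security bugs) always block.
        return True
    # Long-standing legacy bugs stay open but do not block new development.
    return (today - bug.opened) < LEGACY_AGE


today = date(2004, 6, 1)
legacy = Bug("Crash on malformed legacy file", opened=date(2001, 1, 1))
recent = Bug("Wrong tab order in dialog", opened=date(2004, 5, 1))
old_security = Bug("Buffer overrun in parser", opened=date(2001, 1, 1),
                   is_security=True)
```

Here `legacy` stays in the database without blocking anyone, while `recent` and `old_security` both have to be dealt with first.  The point is that the old bugs remain visible (nothing is resolved "won't fix"), they just don't gate new code.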

Also, instead of setting an absolute "zero" defects policy, set a limit on the number of bugs that each developer can have outstanding - this adds some flexibility in dealing with the legacy bugs that really must be fixed, but shouldn't preclude new development.  And, as this Gamasutra article indicates, it's often useful to have a "ZD push" where the entire team drives to get their bug backlog down.
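
A per-developer cap is simple to check.  In this minimal sketch the cap of 5, the function name, and the input format (one owner name per open bug) are all hypothetical; pick whatever numbers fit your team.

```python
from collections import Counter

# Hypothetical limit on open bugs per developer.
MAX_OPEN_BUGS = 5


def developers_over_cap(bug_owners, cap=MAX_OPEN_BUGS):
    """Return the developers who own more than `cap` open bugs, sorted by name.

    `bug_owners` holds one entry per open bug: the name of its owner.
    """
    counts = Counter(bug_owners)
    return sorted(dev for dev, n in counts.items() if n > cap)


owners = ["alice"] * 6 + ["bob"] * 5
over = developers_over_cap(owners)          # alice only
strict = developers_over_cap(owners, cap=0) # strict ZD: alice and bob
```

Setting `cap=0` gives you the strict ZD policy back, which makes the trade-off easy to see: with a cap of 5 only one developer is held back from new work, with a cap of 0 everybody is.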

In general, I personally think that ZD has a lot in common with a large part of the XP methodology - it works fine in small test groups, but doesn't scale to large projects.  On the other hand, I think that at least one aspect of XP, TDD, has the potential to completely revolutionize the way that mainstream software engineering is done.  And I'll write more about that later.

Edit: Darned newsgator/word/outlook/.text formatting issue.