I know I'm overdue for a post when Bj starts lapping me on posts.

This week, I've been finishing up my presentations for STAR and PSQT (for those who haven't presented at a conference before, presenters are expected to have their presentation complete 2-3 months before the conference date). The two conference presentations are barely related, but the piece that brings them together is something that I have been meaning to post here for a while.

For anyone who has come across this post looking for information about Dr. Dre, all I can tell you is that I'm a big fan of the Ben Folds cover of a Dre song that I can't post the title to here (and certainly not lyrics!). For everyone else, this post is about Defect Removal Effectiveness.

I talk a lot about preventing bugs, and everyone has heard ramblings on moving quality upstream, but nobody really seems to know much about what that means. I believe that most software defects can be prevented, and that the bugs that cannot be prevented should be detected as early as possible. DRE is simply a measure of how effectively defects (bugs) are removed at each stage of the product cycle.

Note: now that I've mentioned stages of the development cycle, some scrumbut is going to proclaim me as waterfall man. Please consider "stages" any serial activity in software development. Even agile purists want to have their note card user stories in place before writing code, so please apply this concept to whatever flavor of water-spiral-v-w-x-xp-xyz-iterative-prototype-rup-nm model you choose to use.

Hold on tight for a moment while I attempt to explain DRE without the use of any fancy formatting tricks such as tables, scrolling text or lightweight ajax controls.

Say, for example, that you find 10 bugs during requirements review (ambiguous statements, conflicting statements, etc.). Note, yet again, that you can find these same sorts of bugs by reviewing user stories for consistency. Say that throughout the rest of the product cycle you find another 15 bugs that relate to requirements. This means that during the initial stage / phase of the product cycle you found 10 of the 25 bugs that were there at the time. Grade school math tells me that your effectiveness during this phase was 40% (10/25). Is that good? I don't know; I'll tell you in a minute.
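For anyone who would rather read code than grade school math, here is a minimal Python sketch of the same calculation. The function name and parameter names are made up for illustration; the only real content is the ratio of bugs found in a phase to bugs that were present during it.

```python
def defect_removal_effectiveness(found_in_phase: int, escaped_to_later: int) -> float:
    """Fraction of the defects present during a phase that were removed in that phase."""
    present = found_in_phase + escaped_to_later
    if present == 0:
        raise ValueError("no defects were present, so DRE is undefined")
    return found_in_phase / present


# Requirements review: 10 bugs found, 15 more requirements bugs surfaced later.
requirements_dre = defect_removal_effectiveness(found_in_phase=10, escaped_to_later=15)
print(f"{requirements_dre:.1%}")  # 40.0%
```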

Now, let's say that while the devs were coding they found another 5 of the requirement bugs, as well as 10 errors in their coding (due to unit test, buddy test or sheer luck). Let's also assume that 15 additional bugs were found in the testing phase which were attributed to developer error during coding.

This means that during the coding phase there were 40 bugs latent in the product (15 requirements defects, 10 dev errors found, and 15 dev errors that were not found until testing). 10 coding errors were found, along with 5 requirements errors. Grade school math (which again comes in handy) says our defect removal effectiveness was 37.5% (15/40).
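The bookkeeping for the coding phase is a little easier to follow when each count gets a name. This is only a sketch of the tally above; the variable names are invented for the example.

```python
# Defects present in the product while the code was being written.
requirements_bugs_still_latent = 15  # requirements bugs that escaped the review
coding_errors_found_by_devs = 10     # caught by unit test, buddy test, or luck
coding_errors_found_in_test = 15     # introduced during coding, caught later

# Defects actually removed during the coding phase.
requirements_bugs_found_by_devs = 5  # part of the 15 latent requirements bugs

present = (requirements_bugs_still_latent
           + coding_errors_found_by_devs
           + coding_errors_found_in_test)
removed = coding_errors_found_by_devs + requirements_bugs_found_by_devs

coding_dre = removed / present
print(f"{removed}/{present} = {coding_dre:.1%}")  # 15/40 = 37.5%
```

Note that the 5 requirements bugs found by the devs count toward the coding phase's removals but not a second time toward its total, since they are already part of the 15 latent requirements bugs.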

The numbers in this example are pretty close, so we can't say whether we're significantly better at one phase vs. the other. However, if you track this sort of metric throughout the product cycle, it allows you to do two important things.

  1. Measuring DRE helps you target where improvement needs to be made. If, for example, developers are introducing a huge number of defects while writing code and finding very few of them, it would point to the necessity of additional detection techniques during that phase.
  2. Measuring DRE helps validate any sort of improvement programs. Say, for example, your development team is implementing unit tests. If they track the number of defects found they can validate the effectiveness of the time invested in writing unit tests.

The big caveat (if you haven't thought of it already) is that if your bug management system doesn't record when each defect was introduced, you can't track DRE. Hardly anyone I know currently tracks this, but I get more converts to this sort of thinking nearly every day.
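To make the caveat concrete, here is a rough sketch of what the calculation could look like if every bug record carried two fields: the phase in which it was introduced and the phase in which it was found. Everything here is an assumption made for illustration, including the three phase names, the record layout, and the choice to treat the 10 requirements bugs not found during coding as found in testing. It is not based on any particular bug tracking tool.

```python
from collections import Counter

# Hypothetical phase names, in the order they occur.
PHASES = ["requirements", "coding", "testing"]


def dre_by_phase(bugs):
    """Compute DRE per phase from (phase_introduced, phase_found) pairs."""
    order = {name: index for index, name in enumerate(PHASES)}
    found = Counter()
    present = Counter()
    for introduced, found_in in bugs:
        start, end = order[introduced], order[found_in]
        if end < start:
            raise ValueError("a bug cannot be found before it is introduced")
        found[found_in] += 1
        # The bug was latent in every phase from introduction through discovery.
        for phase in PHASES[start:end + 1]:
            present[phase] += 1
    return {phase: found[phase] / present[phase] for phase in PHASES if present[phase]}


# The example from this post, expressed as bug records.
example_bugs = (
    [("requirements", "requirements")] * 10
    + [("requirements", "coding")] * 5
    + [("requirements", "testing")] * 10  # assumption: the rest surfaced in testing
    + [("coding", "coding")] * 10
    + [("coding", "testing")] * 15
)

result = dre_by_phase(example_bugs)
for phase, value in result.items():
    print(f"{phase}: {value:.1%}")
```

With this sample data the requirements and coding phases come out at 40% and 37.5%, matching the numbers above. Testing comes out at 100%, but only because the sample contains no bugs found after testing; the last phase always looks perfect until later data arrives.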

ed. 9:20 pm. formatting

**boy I hope my grade school math holds up.

Comments (7)

  1. Joe Strazzere says:

    Interesting post!

    So when do you start and stop counting "bugs found"?

    If I as a QAer read an initial (but not final) draft of the Requirements, and find that some are untestable, are each of those Requirements a bug?

    What if the writer of the Requirements finds and corrects the problem before it’s reviewed by anyone?

    If a Developer finds and fixes a bug during Unit Testing, does that count?

    And at the other end, when do you stop counting?  Never?  Some number of months after release?


  2. Chris Smith says:

    I believe the song you are referring to was by Snoop Dogg, the one titled "* ain’t *"?

    I think this notion of DRE is quite powerful, I recall watching a Channel9 video where they tracked the type of issue and when it was found during the product cycle. If spec issue bugs were being found late in the cycle, that is a pretty strong indication that things aren’t on track.

    Looking forward to reading your slides,


  3. Alan Page says:

    [alanpa] – wow – I just noticed that my comment stylesheets are completely whacked. I’ll get to those in a minute.

    re: Snoop Dogg – I’ve seen Ben Folds perform that song twice, and on both occasions he introduced it as a Dre song – I guess he could be wrong…

    Joe – You hit on a big problem with implementing DRE – most teams don’t track any bugs not found during testing. I think this habit is another force driving us toward finding bugs late in the cycle via testing. Some teams I’ve worked with don’t want the bug database "cluttered" with bugs found during use case review or code review. The argument I give (which I usually win) is that if you don’t track these things, you will have no idea how effective reviews are. On the rare occasion that I lose the argument, I convince them to track the defects in a spreadsheet at the very least.

    Similarly, developers are rarely going to record the code bugs they fixed during unit test. However, they can (and should) record the design bugs they found to prove the efficiency of unit testing in clarifying design issues.

    I think the original proponents of DRE were working in a system similar to PSP/TSP where everything was tracked. I’m working on an approach to make DRE practical to more "typical" software projects, and my thoughts on making this concept accessible to the masses are still in process.

  4. Mark Waite says:

    I am very skeptical of this sort of metric because it is so easily "cheated" or "gamed".  Cem Kaner and Walter Bond have an interesting presentation on the challenges and risks of software metrics at http://www.kaner.com/pdfs/metrics2004.pdf.  

    I’m still trying to understand how to apply their warnings to the metrics my employer has been using for years.

    I would be very worried about the impact of the DRE metric you propose on my team (I’m the manager).  My team would see that I am measuring something, and would assume that must mean it is important, and they should "improve it".  I can see so many ways they would "improve the measure" and either provide no customer value, or actually decrease the customer value we are trying to deliver.

    While they are improving the metric, are they actually improving the product?  Does a bug report, or even a "single check mark on a piece of paper" add customer value?  

    I don’t think counting bug reports adds customer value.  I think the highest value is in the knowledge gained by the finder during the finding of the bug, and the next highest value is in guiding the business decision which needs to be made if the bug should cause a change in the thing the customer values.  Others may disagree with the idea, but I think the purpose of bug reports is to guide business decisions about customer value, not to decide which stage, phase or step most needs our attention.

    I think there are better ways to measure the effectiveness of a code review than the number of bugs it found.  Ask the code reviewers if there is a better way, and if they say yes, have them use that better way.  Code reviews (whether in pair programming, Fagan style inspections, or something else) have seemed very effective in all the times I’ve seen them used.  Gathering code review effectiveness metrics seems like another risk of metrics madness, with metrics too easily "gamed" intentionally or unintentionally.

  5. Alan Page says:

    Great comments Mark – you are on the money on everything. Metrics, for better or for worse, are a passion of mine, and I actually loathe bug metrics, as there are too many factors that can influence those, and that’s definitely the big drawback of DRE.

    On the other hand, I also have a big problem with any sort of improvement plan that doesn’t have an objective measurement associated with it. I’ve seen too many random process changes be judged solely on how people "felt" about them (I realize that’s not exactly what you’re saying above – just making a counterpoint to myself).

    You are also right that metrics will be gamed (e.g. the Hawthorne effect). As I think through DRE, I’m trying to determine what other measurements I could weight with the raw bug numbers to get more accurate results, or if another measurement entirely would be a better approach.

    So, now that I’ve presented DRE, the hundred dollar question is: How can you measure the effectiveness of code reviews, unit testing, static analysis, or any other "early detection" techniques?

  6. Mark Waite says:

    The $100 question is excellent, but I continue to be unable to find an answer.  I know that is not palatable, and that it probably sounds unprofessional, but every attempt I’ve made or considered to measure the human powered, creative activities involved in code reviews, unit testing, static analysis, or any other "early detection" techniques has failed.

    Measures that I think would be direct are too dependent on the measured subject’s self reporting to feel like they can give me reliable data.  How many bugs were found in a code review per hour?  I think it depends on the definition of bug, and the attitude of the reviewers, and the time of day, and so many other factors that I don’t see how to measure it in a way that will reduce it to something that will help us make a business decision.

    I think the business decision is "what will give us the best return on our investment in early detection techniques?"  I think the best people to answer that business decision are the people applying the early detection techniques.  What if we had the product team experiment with different early detection techniques and make their recommendation, without any measurement method?

    Couldn’t we trust a product team to make the best methodology choices for their product?  Could we have different product teams share their observations and experiences in hopes that the product team ideas would spread effectively?

    Even my attempts to apply late detection techniques (like automated test code coverage reporting, and weighted defect find rates in interactive test, and number of unit tests) have not always given trustworthy answers.  The code coverage tools report numbers, and the numbers are directly measurable, and verifiable by analysis, but they still don’t answer the $100 business question: where should I invest my next $1 of testing?  We’ve steadily increased automated code coverage and still had some releases that were considered poor quality and others that were considered good quality.  We’ve met or exceeded all our other measures and had some releases that were poor quality and some that were good quality.

    (Sorry, I probably sound pathetically long winded, feel free to ignore it.  I still haven’t found the answer for good metrics in creative pursuits like programming and testing.  I’ll keep looking, and I’m sure you’ll keep looking, and we’ll keep learning as we measure new things and decide the weaknesses of those new measurements)

  7. Alan Page says:

    Thanks again Mark – I do appreciate the comments.

    I don’t have a good answer either, but this is something I think about a lot. I want to trust the product team to tell me if a change is effective or not, but often when something new is introduced (unit tests, formal inspections, etc.), part of the team will feel it’s a fantastic improvement, part of the team will think it’s a waste of time, and the rest will be indifferent.

    We (as an industry) simply want to create software that’s great quality that gets to our customers when (or before) they need it. Obviously, that’s much more difficult than it sounds, because we (again, as an industry) constantly tweak the software engineering process.

    I think the tweaking is generally good – but I want to know what the tweaking is doing, and that’s what I hope to measure.

    Perhaps this is something that is unmeasurable – it certainly isn’t the only thing we deal with that is (my next blog post will deal with another). My hope is that through discussion like this, something tangible may result.

    Or not. At any rate, I do like the discussion, and think your comments make a great addendum to this post.
