Larry's rules of software engineering #2: Measuring testers by test metrics doesn't.

This one’s likely to get a bit controversial. :)

There is an unfortunate tendency among test leads to measure the performance of their testers by the number of bugs they report.

As best as I’ve been able to figure out, the logic works like this:

Test Manager 1: “Hey, we want to have concrete metrics to help in the performance reviews of our testers. How can we go about doing that?”
Test Manager 2: “Well, the best testers are the ones that file the most bugs, right?”
Test Manager 1: “Hey that makes sense. We’ll measure the testers by the number of bugs they submit!”
Test Manager 2: “Hmm. But the testers could game the system if we do that – they could file dozens of bogus bugs to increase their bug count…”
Test Manager 1: “You’re right. How do we prevent that then? – I know, let’s just measure them by the bugs that are resolved “fixed” – the bugs marked “won’t fix”, “by design” or “not reproducible” won’t count against the metric.”
Test Manager 2: “That sounds like it’ll work, I’ll send the email out to the test team right away.”

Sounds good, right? After all, the testers are going to be rated by an absolute value based on the number of real bugs they find – not the bogus ones, but real bugs that require fixes to the product.

The problem is that this idea falls apart in reality.

Testers are given a huge incentive to find nit-picking bugs – instead of finding significant bugs in the product, they hunt for whatever bugs will pad their bug count. And they get very combative with the developers if the developers dare to resolve their bugs as anything other than “fixed”.

So let’s see how one scenario plays out using a straightforward example:

My app pops up a dialog box with the following:

 

            Plsae enter you password: _______________ 

 

Where the edit control is misaligned with the text.

Without a review metric, most testers would file a single bug with a title like “Multiple errors in password dialog box”, which would then call out the spelling error and the alignment error on the edit control.

They might also file a separate localization bug because there’s not enough room between the prompt and the edit control (separate because it falls under a different bug category).

But if the tester has their performance review based on the number of bugs they file, they now have an incentive to file as many bugs as possible. So the one bug morphs into two bugs – one for the spelling error, the other for the misaligned edit control. 

This version of the problem is a total and complete nit – it’s not significantly more work for me to resolve one bug than it is to resolve two, so it’s not a big deal.

But what happens when the problem isn’t a real bug? Remember, bugs that are resolved “won’t fix” or “by design” don’t count toward the metric, precisely so that the tester can’t flood the bug database with bogus bugs to artificially inflate their bug count.

Tester: “When you create a file while logged on as an administrator, the owner field of the security descriptor on the file is set to BUILTIN\Administrators, not the current user.”
Me: “Yup, that’s the way it’s supposed to work, so I’m resolving the bug as by design. This is because NT considers all administrators interchangeable, so when a member of BUILTIN\Administrators creates a file, the owner is set to the group to allow any administrator to change the DACL on the file.”
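(For the curious, here’s a minimal sketch – not part of the original bug discussion – of how you could observe this behavior yourself with the Win32 security APIs: create a file, read the owner SID from its security descriptor, and compare it against the well-known BUILTIN\Administrators SID. The file path is made up, and on later versions of Windows the default owner for administrators is governed by local security policy and UAC elevation, so you may see either result.)

// Sketch: create a file and check whether its owner is BUILTIN\Administrators.
// Link with Advapi32.lib. The path below is hypothetical.
#include <windows.h>
#include <aclapi.h>
#include <stdio.h>

int main(void)
{
    const wchar_t *path = L"C:\\temp\\owner-test.txt";  // hypothetical path

    // Create the file; the process token determines the default owner.
    HANDLE h = CreateFileW(path, GENERIC_WRITE, 0, NULL,
                           CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h == INVALID_HANDLE_VALUE) {
        wprintf(L"CreateFileW failed: %lu\n", GetLastError());
        return 1;
    }
    CloseHandle(h);

    // Read the owner SID from the file's security descriptor.
    PSID owner = NULL;
    PSECURITY_DESCRIPTOR sd = NULL;
    DWORD err = GetNamedSecurityInfoW((LPWSTR)path, SE_FILE_OBJECT,
                                      OWNER_SECURITY_INFORMATION,
                                      &owner, NULL, NULL, NULL, &sd);
    if (err != ERROR_SUCCESS) {
        wprintf(L"GetNamedSecurityInfoW failed: %lu\n", err);
        return 1;
    }

    // Compare the owner against the well-known BUILTIN\Administrators SID.
    BYTE adminSid[SECURITY_MAX_SID_SIZE];
    DWORD sidSize = sizeof(adminSid);
    if (CreateWellKnownSid(WinBuiltinAdministratorsSid, NULL,
                           (PSID)adminSid, &sidSize)) {
        wprintf(EqualSid(owner, (PSID)adminSid)
                    ? L"Owner is BUILTIN\\Administrators\n"
                    : L"Owner is the individual user account\n");
    }

    LocalFree(sd);
    return 0;
}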

Normally the discussion ends here. But when the tester’s going to have their performance review score based on the number of bugs they submit, they have an incentive to challenge every bug resolution that isn’t “Fixed”. So the interchange continues:

Tester: “It’s not by design. Show me where the specification for your feature says that the owner of a file is set to the BUILTIN\Administrators account”.
Me: “My spec doesn’t. This is the way that NT works; it’s a feature of the underlying system.”
Tester: “Well then I’ll file a bug against your spec since it doesn’t document this.”
Me: “Hold on – my spec shouldn’t be required to explain all of the intricacies of the security infrastructure of the operating system – if you have a problem, take it up with the NT documentation people”.
Tester: “No, it’s YOUR problem – your spec is inadequate, fix your specification. I’ll only accept the “by design” resolution if you can show me the NT specification that describes this behavior.”
Me: “Sigh. Ok, file the spec bug and I’ll see what I can do.”

So I have two choices – either I document all these subtle internal behaviors (and security has a bunch of really subtle internal behaviors, especially relating to ACL inheritance) or I chase down the NT program manager responsible and file bugs against that program manager. Neither of which gets us closer to shipping the product. It may make the NT documentation better, but that’s not one of MY review goals.

In addition, it turns out that the “most bugs filed” metric is often flawed in the first place. The tester who files the most bugs isn’t necessarily the best tester on the project. Often the most valuable tester on the team is the one who goes the extra mile, investigating the underlying causes of the bugs they find and filing reports with detailed information about the likely cause. They’re not the most prolific testers, because they spend their time verifying that they have a clean reproduction and good information about what’s going wrong. Instead of hunting for nit bugs, they spend that time making sure the bugs they find are high quality – the bugs that would have stopped us from shipping, not the “the florblybloop isn’t set when I twiddle the frobjet” bugs.

I’m not saying that metrics are bad. They’re not. But basing people’s annual performance reviews on those metrics is a recipe for disaster.

Somewhat later: After I wrote the original version of this, a couple of other developers and I discussed it a bit at lunch. One of them, Alan Ludwig, pointed out that one of the things I missed in my discussion above is that there should be two halves of a performance review:

            MEASUREMENT: Give me a number that represents the quality of the work that the employee is doing.
            EVALUATION: Given the measurement, is the employee doing a good job or a bad job? In other words, you need to assign a value to the metric – how relevant is the metric to the employee’s performance.

He went on to discuss the fact that any metric is worthless unless it is regularly reevaluated to determine how relevant it still is – a metric is only as good as its validity.

One other comment that was made was that absolute bug count metrics can’t measure the worth of a tester. The tester who spends two weeks and comes up with four buffer overflow errors in my code is likely to be more valuable to my team than the tester who spends the same two weeks and comes up with 20 trivial bugs. Using the severity field of the bug report was suggested as a metric, but Alan pointed out that this only works if the severity field actually has significant meaning, and it often doesn’t: it’s often very difficult to determine the relative severity of a bug, and setting the severity field is usually left to the tester, which leaves room for abuse unless all bugs are externally triaged – and that doesn’t always happen.

By the end of the discussion, we had all agreed that bug counts were an interesting metric, but they couldn’t be the only metric.
