Grading On A Curve

A topic I've been pondering of late is grading test cases. If I have two test cases that appear to do exactly the same thing, how do I decide which one to keep and which one to turn off? If I am wading through a large number of failing or unstable test cases that I inherited from someone else, how do I decide which ones are worth spending time on and which should just be thrown away?

One point to consider is the test case's result history. If it just recently started failing, it probably either a) needs to be updated to reflect a change in the product under test, or b) has found a bug in the product. On the other hand, if it hasn't ever worked then it's much less likely to be worth saving. (Believe it or not, I have seen test cases that have been running in automation runs for years and have been failing that entire time! Someone turned them on but never bothered to check whether they actually worked or not! Pure insanity.)

A different aspect of history is whether the test case corresponds to any bugs. If the test case was part of a planned test suite, it may merely be confirming that the developer correctly handled that particular case. If the test case is related to one or more bugs, however - either cases the developer handled incorrectly or bugs recorded for posterity in executable fashion - the Iceberg Principle ("For every bug you find there are nine related bugs lurking close by") says the test case is worthy of further investigation.

Another point to consider is documentation. Does anything anywhere explain what the test case is supposed to be testing? External documentation (probably way out of date, but better than nothing), a description in the test case management system, comments in the code, the test case's name, the code itself - if none of this gives you any clue then you may as well toss the test. Sure, you may be losing test coverage, but if you don't know what it is you're losing then I would posit that it doesn't really matter.

The test case's verification is another item to inspect. I have seen test cases that didn't bother to verify anything, test cases that simply logged a Pass without regard to what actually happened, and verification that may have made sense three releases ago but is now meaningless or - horrors! - blatantly incorrect. Verification is the most important part of a test case, and the verification's quality (both what is verified and how the results are logged) is a direct indicator of the test case's quality.
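To make the contrast concrete, here's a minimal JUnit sketch; the tiny add method is invented purely for illustration. The first test exercises the code but can never fail, while the second actually states what it expects.

```java
import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class VerificationSketchTest {

    // Trivial code under test, defined inline so the sketch is self-contained.
    static int add(int a, int b) {
        return a + b;
    }

    // Anti-pattern: the code runs, nothing is checked, and the test
    // "passes" no matter what add() returns.
    @Test
    public void add_runsButVerifiesNothing() {
        add(2, 3);  // result ignored - this test cannot fail
    }

    // Better: the expected result is stated explicitly, so a wrong
    // answer (or an unintended behavior change) turns the test red.
    @Test
    public void add_resultIsVerified() {
        assertEquals(5, add(2, 3));
    }
}
```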

If the test cases actually run and do so reliably, you may be able to use code mutation to learn which of them actually catch bugs. This process - changing the code under test in subtle ways (converting an equals comparison to a not-equals check, for example) and then running the tests to see which ones catch the injected bug - is tedious, but it's also eminently automatable. Nester and Jester (for .NET and Java, respectively; find both on SourceForge) are two tools that do just that. This technique is most useful with unit tests but can sometimes be applied to larger-scoped tests as well.
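Here's a small JUnit sketch of the idea; the isEven method is a made-up stand-in, and the comments mark the kind of comparison flip a tool like Jester would inject automatically.

```java
import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;
import org.junit.Test;

public class MutationSketchTest {

    // Code under test. A mutation tool would silently rewrite the comparison
    // (for example, == flipped to !=) and then re-run the test suite.
    static boolean isEven(int n) {
        return n % 2 == 0;
    }

    // This test kills that mutant: with the comparison flipped, isEven(4)
    // returns false and the first assertion fails. Checking an odd input
    // as well guards against other subtle mutations (a changed constant,
    // an always-true return) that a single happy-path check would miss.
    @Test
    public void evenAndOddInputsAreBothChecked() {
        assertTrue(isEven(4));
        assertFalse(isEven(7));
    }
}
```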

The quality of the test case code itself is another indicator. Well-written code doesn't guarantee a good test case, but badly written code does usually translate to a poor test case.

Code coverage can be useful, but you must be careful not to use it incorrectly. Code coverage is useless for telling you how good your testing is. Sure, you may be hitting that line of code. But are you throwing every equivalence class at it? Are you executing it in every different context that can possibly occur? Code coverage can't tell you.
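A small sketch of the problem; the discountedPrice routine below is invented for illustration. The single test hits every line - one hundred percent coverage - yet it says nothing about the boundary between nine and ten items, zero or negative quantities, or a negative price.

```java
import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class CoverageSketchTest {

    // Hypothetical pricing routine, here only to make the point concrete.
    static double discountedPrice(double price, int quantity) {
        if (quantity >= 10) {
            return price * 0.9;  // bulk discount
        }
        return price;
    }

    // Both branches execute, so a coverage tool reports 100% - but entire
    // equivalence classes (boundary values, zero, negatives) go untested.
    @Test
    public void everyLineIsHitButMostClassesAreNot() {
        assertEquals(90.0, discountedPrice(100.0, 10), 0.001);
        assertEquals(100.0, discountedPrice(100.0, 5), 0.001);
    }
}
```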

What code coverage *can* tell you is where your testing is lacking. Again, though, increasing code coverage shouldn't be your goal. Instead, use the data to direct your testing efforts: what tests are you missing? Write and execute them, then check your code coverage numbers again.

My favorite method of evaluation is to map bugs back to test cases. If a test case purports to test something, but a bug in that something is found after the test case reported all clear, then the test case is clearly lacking.

As you can see, I don't have The One True Answer to give you. But, thinking about this question and applying your thoughts to your testing can only make it better!

*** Comments, questions, feedback? Want a fun job on a great team? I need a tester! Send two coding samples and an explanation of why you chose them, and of course your resume, to me at michhu at microsoft dot com. Great coding skills required.