Grading On A Curve

A topic I’ve been pondering of late is grading test cases. If I have two test cases that appear to do exactly the same thing, how do I decide which one to keep and which one to turn off? If I am wading through a large number of failing or unstable test cases that I inherited from someone else, how do I decide which ones are worth spending time on and which should just be thrown away?

One point to consider is the test case’s result history. If it just recently started failing, it probably either a) needs to be updated to reflect a change in the product under test, or b) has found a bug in the product. On the other hand, if it has never worked then it’s much less likely to be worth saving. (Believe it or not, I have seen test cases that have been running in automation runs for years and have been failing that entire time! Someone turned them on but never bothered to check whether they actually worked! Pure insanity.)
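Here is a minimal sketch of that kind of history triage in Python. The function name, the category labels, and the idea of storing history as a list of pass/fail booleans (oldest first) are all my own assumptions for illustration; your test case management system will have its own shape.

```python
from typing import List


def triage_by_history(results: List[bool]) -> str:
    """Classify a test case from its run history (oldest first, True = pass).

    The labels are hypothetical; adapt them to your own process.
    """
    if not results:
        return "no-history"
    if not any(results):
        # Has never passed: much less likely to be worth saving.
        return "never-passed"
    if all(results):
        return "healthy"
    # Count how often the result changed between consecutive runs.
    flips = sum(1 for a, b in zip(results, results[1:]) if a != b)
    if flips == 1 and not results[-1]:
        # Passed steadily, then started failing: stale test or real bug.
        return "recently-broken"
    # Any other mix of passes and failures needs a human look.
    return "mixed-history"


print(triage_by_history([True, True, True, False]))   # recently-broken
print(triage_by_history([False, False, False]))       # never-passed
```

The "recently-broken" bucket is where I would look first, since those are the tests most likely to be telling you something.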

A different aspect of history is whether the test case corresponds to any bugs. If the test case was part of a planned test suite, it may merely be confirming that the developer correctly handled that particular case. If the test case is related to one or more bugs, however – either cases the developer handled incorrectly or bugs recorded for posterity in executable fashion – the Iceberg Principle (“For every bug you find there are nine related bugs lurking close by”) says the test case is worthy of further investigation.

Another point to consider is documentation. Does anything anywhere explain what the test case is supposed to be testing? External documentation (way out of date probably but better than nothing), a description in the test case management system, comments in the code, the test case’s name, the code itself – if none of this gives you any clue then you may as well toss the test. Sure, you may be losing test coverage, but if you don’t know what it is you’re losing then I would posit that it doesn’t really matter.

The test case’s verification is another item to inspect. I have seen test cases that didn’t bother to verify anything, test cases that simply logged a Pass without regard to what actually happened, and verification that may have made sense three releases ago but is now meaningless or – horrors! – blatantly incorrect. Verification is the most important part of a test case, and the verification’s quality (both what is verified and how the results are logged) is a direct indicator of the test case’s quality.

If the test cases actually run and do so reliably, you may be able to use code mutation to learn which of them actually catch bugs. In this process the code under test is changed in subtle ways (an equals comparison converted to a not-equals check, for example), and then the tests are run to see which of them catch the injected bug. It is tedious, but it’s also eminently automatable. Nester and Jester (for .NET and Java, respectively; find both on SourceForge) are two tools that do just that. This technique is most useful with unit tests but can sometimes be applied to larger-scoped tests as well.
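To make the idea concrete, here is a toy sketch in Python using only the standard library's ast module. It is not how Nester or Jester work internally; the function under test, the two checks, and every helper name are hypothetical. It applies one mutation (every == becomes !=) and reports which checks notice.

```python
import ast

# Hypothetical code under test, kept as source so it can be mutated.
SOURCE = """
def is_even(n):
    return n % 2 == 0
"""


class FlipEquality(ast.NodeTransformer):
    """Mutation: turn every == comparison into !=."""

    def visit_Compare(self, node):
        self.generic_visit(node)
        node.ops = [ast.NotEq() if isinstance(op, ast.Eq) else op
                    for op in node.ops]
        return node


def load(tree):
    """Compile a module AST and return the namespace it defines."""
    namespace = {}
    exec(compile(tree, "<mutant>", "exec"), namespace)
    return namespace


def check_even(ns):
    assert ns["is_even"](4)      # verifies the result


def check_runs_only(ns):
    ns["is_even"](4)             # calls the code but verifies nothing


def killers(checks, ns):
    """Return the names of the checks that fail against this namespace."""
    failed = []
    for check in checks:
        try:
            check(ns)
        except AssertionError:
            failed.append(check.__name__)
    return failed


CHECKS = [check_even, check_runs_only]
original = load(ast.parse(SOURCE))
mutant_tree = ast.fix_missing_locations(FlipEquality().visit(ast.parse(SOURCE)))
mutant = load(mutant_tree)

print(killers(CHECKS, original))  # []
print(killers(CHECKS, mutant))    # ['check_even']
```

The check that never verifies anything sails right past the injected bug, which is exactly the kind of test case this technique is good at exposing.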

The quality of the test case code itself is another indicator. Well-written code doesn’t guarantee a good test case, but badly written code does usually translate to a poor test case.

Code coverage can be useful, but you must be careful not to use it incorrectly. Code coverage is useless for telling you how good your testing is. Sure, you may be hitting that line of code. But are you throwing every equivalence class at it? Are you executing it in every different context that can possibly occur? Code coverage can’t tell you.
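A tiny, made-up Python example shows the gap. The function and the check are hypothetical; the point is only that one passing check executes every line, yet a whole equivalence class (the empty list) is never tried.

```python
def average(values):
    """Return the arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def check_average_typical():
    # This single check executes every line of average(),
    # so a line coverage tool would report 100% for it.
    assert average([2, 4]) == 3


check_average_typical()

# The empty-list equivalence class was never exercised, and it fails.
try:
    average([])
    empty_list_outcome = "returned a value"
except ZeroDivisionError:
    empty_list_outcome = "ZeroDivisionError"

print(empty_list_outcome)  # ZeroDivisionError
```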

What code coverage *can* tell you is where your testing is lacking. Again, though, increasing code coverage shouldn’t be your goal. Instead, use the data to direct your testing efforts: what tests are you missing? Write and execute them, then check your code coverage numbers again.

My favorite method of evaluation is to map bugs back to test cases. If a test case purports to test something, but a bug in that something is found after the test case reported all clear, then the test case is clearly lacking.
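If your bug database and your test case records share some notion of feature area, the mapping can be sketched in a few lines of Python. Everything here is an assumption for illustration: the record types, the field names, the sample data, and the simplification that "same area" is a good enough link between a bug and a test case.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CaseRecord:
    name: str
    area: str                 # what the test case claims to cover
    last_result_passed: bool


@dataclass
class BugRecord:
    bug_id: int
    area: str                 # where the bug was found


def escaped_bugs(cases: List[CaseRecord],
                 bugs: List[BugRecord]) -> Dict[str, List[int]]:
    """Map each passing test case to the bugs found in the area it covers."""
    result: Dict[str, List[int]] = {}
    for case in cases:
        if not case.last_result_passed:
            continue  # a failing test did not report "all clear"
        hits = [bug.bug_id for bug in bugs if bug.area == case.area]
        if hits:
            result[case.name] = hits
    return result


cases = [
    CaseRecord("SaveAsBasic", "file-save", True),
    CaseRecord("PrintPreview", "printing", True),
    CaseRecord("UndoTyping", "undo", False),
]
bugs = [BugRecord(101, "file-save"), BugRecord(102, "undo")]

print(escaped_bugs(cases, bugs))  # {'SaveAsBasic': [101]}
```

Any test case that shows up in the result reported all clear while a bug was sitting in its territory, so it goes on the list for a closer look.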

As you can see, I don’t have The One True Answer to give you. But, thinking about this question and applying your thoughts to your testing can only make it better!

*** Comments, questions, feedback? Want a fun job on a great team? I need a tester! Send two coding samples and an explanation of why you chose them, and of course your resume, to me at michhu at microsoft dot com. Great coding skills required.

Comments (5)

  1. Bruce McLeod says:


    Yet another great post, keep them coming !

    Your comment about the iceberg principle really struck a chord with me, and I blogged about it briefly here:



    bruce at teknologika dot com

  2. Neilson Eney says:

    Long time reader, first time commenter…

    This post reminds me a lot of how I use iTunes. I have songs that I really like, I have songs that I hate but are useful to keep around for some special occasion, and there are other songs that were bad rips that I need to delete, but maybe I want to re-encode them before I delete them.

    It occurred to me as I read this post that it would be sweet to have a test case management tool that had some of the same features as iTunes, or other music players. I already rate my test cases by frequency. Some test cases are BVTs and are run often, if not nightly. It all comes down to how "deep" I want the tests to go and how much time I have to execute a given test pass.

    You say you’d like to know which tests to keep and which to turn off? Well, what about something less binary? How about a test you want to run during a test cycle and another test that you’ll let run, but you might want to flag it as a potential problem child that needs revisiting after the product ships. Often it’s during the middle of a test cycle that I realize old tests that we thought were good at the outset actually need some redesign.

    I already put some limited metadata into my test cases such as the frequency that I run them. Why not go ahead and add additional metadata that includes the type of "grading" information you’re discussing here? The more knowledge I can capture as I think about test cases, the better since I can then go back and use tools to organize that knowledge to help me make my test cases better.

  3. The Braidy Tester says:

    It would indeed be great to have an easy way to annotate test cases with this kind of information, and then use that metadata as part of the queries used to decide which test cases are included in a run. Visual Studio Team System, are you listening? <g/>