Test Suite Granularity Matters

I just read a very interesting research paper entitled "The Impact of Test Suite Granularity on the Cost-Effectiveness of Regression Testing" by Gregg Rothermel et al.  In it the authors examine the impact of test suite granularity on several metrics, the two most interesting being the effect on running time and the effect on bug finding.  On both counts, they found that larger test suites were better than smaller ones.

When writing tests, a decision must be made about how to organize them.  The paper makes a distinction between test suites and test cases.  A test case is a sequence of inputs and the expected result.  A test suite is a set of test cases.  How much should each test suite accomplish?  There is a continuum, but the endpoints are making each point of failure a standalone suite and combining many points of failure into a single suite.
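
To make the distinction concrete, here is a minimal sketch in Python's unittest (not the harness or subject programs from the paper; the clamp function and the class names are invented purely for illustration) showing the two endpoints: one point of failure per suite versus several points of failure grouped into one suite.

    import unittest

    def clamp(value, low, high):
        """Toy function under test; stands in for real product code."""
        return max(low, min(value, high))

    # Granular end of the continuum: one point of failure per suite.
    class TestClampLow(unittest.TestCase):
        def test_below_range(self):
            self.assertEqual(clamp(-5, 0, 10), 0)

    class TestClampHigh(unittest.TestCase):
        def test_above_range(self):
            self.assertEqual(clamp(15, 0, 10), 10)

    # Coarse end: several related points of failure grouped into one suite.
    class TestClamp(unittest.TestCase):
        def test_below_range(self):
            self.assertEqual(clamp(-5, 0, 10), 0)

        def test_above_range(self):
            self.assertEqual(clamp(15, 0, 10), 10)

        def test_in_range(self):
            self.assertEqual(clamp(5, 0, 10), 5)

    if __name__ == "__main__":
        unittest.main()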

The argument for very granular test suites (one case or point of failure per suite) is that they can be better tracked and analyzed.  The paper examines the efficacy of different techniques for restricting the number of suites run in a given regression pass and finds that more granular suites can be reduced more effectively.  However, even aggressive reduction of the granular suites did not save enough time to make them competitive with simply running every test case grouped into larger suites.  Grouping test cases into larger suites makes them run faster.  Without reduction, the granular suites in the study ran almost 15 times slower; with reduction, they still ran about 6 times slower.

Why is this?  Mostly it is because of overhead.  Depending on how the test system launches tests, there is a cost to launching each test suite.  In a local test harness like nUnit, this cost is small but can add up over a lot of cases.  In a network-based system, the cost is large.  There is also the cost of setup.  Consider an example from my days of DVD testing.  Testing a particular function of a DVD decoder requires spinning up a DVD and navigating to the right title and chapter.  If this can be done once and many test cases executed against it, the overhead is amortized across all of the cases in the suite.  On the other hand, if each case is its own suite, the overhead is paid again for every case.
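
For illustration, here is roughly how that amortization might look in Python's unittest, which is only a stand-in for the DVD harness described above; spin_up_and_seek and its delay are hypothetical.

    import time
    import unittest

    def spin_up_and_seek(title, chapter):
        """Hypothetical stand-in for the expensive DVD setup described above."""
        time.sleep(0.1)  # pretend this takes a while (disc spin-up, navigation)
        return {"title": title, "chapter": chapter}

    class TitleThreeTests(unittest.TestCase):
        @classmethod
        def setUpClass(cls):
            # Paid once for the whole suite: the cost is amortized across every case.
            cls.decoder_state = spin_up_and_seek(title=3, chapter=1)

        def test_correct_title(self):
            self.assertEqual(self.decoder_state["title"], 3)

        def test_correct_chapter(self):
            self.assertEqual(self.decoder_state["chapter"], 1)

    # If each of these cases lived in its own suite, the spin-up cost would be
    # paid once per case instead of once per suite.

    if __name__ == "__main__":
        unittest.main()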

Perhaps more interesting, however, is that the study found very granular test suites actually missed bugs, sometimes as many as 50% of them.  Why?  Because less granular cases cover more state than more granular ones and are thus more likely to find bugs.
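
As a hypothetical illustration (none of this comes from the study), here is a toy component whose bug only shows up when checks run in sequence against shared state; the isolated, granular checks pass, while the coarser sequence fails.

    # A toy component with a state-dependent bug: stop() forgets to reset the
    # current chapter, so play() after stop() resumes in the wrong place.
    class FakeDecoder:
        def __init__(self):
            self.playing = False
            self.chapter = 1

        def seek(self, chapter):
            self.chapter = chapter

        def stop(self):
            self.playing = False
            # BUG: should also reset self.chapter to 1

        def play(self):
            self.playing = True
            return self.chapter

    # Granular style: each check starts from a fresh decoder, so the bug hides.
    def check_seek():
        d = FakeDecoder()
        d.seek(5)
        assert d.play() == 5

    def check_stop():
        d = FakeDecoder()
        d.stop()
        assert d.play() == 1

    # Coarser style: the same checks run against one decoder in sequence, and
    # the state left over from seek() exposes the bad stop() behavior.
    def check_sequence():
        d = FakeDecoder()
        d.seek(5)
        assert d.play() == 5
        d.stop()
        assert d.play() == 1, "stop() should rewind to chapter 1"

    if __name__ == "__main__":
        check_seek()
        check_stop()
        try:
            check_sequence()
        except AssertionError as exc:
            print("Coarse suite caught a bug the granular checks missed:", exc)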

It is important to note that there are diminishing returns on both fronts.  It is not wise to write all of your test cases in one giant suite.  Result tracking becomes a problem, and it can be hard to differentiate bugs that happen in the same suite.  After a certain size, the overhead costs are sufficiently amortized and enough state is traversed that the benefits of a bigger suite become negligible.

I have had first-hand experience writing tests of both sorts.  I can confirm that we have found bugs in large test suites that were caused by an interaction between cases.  These would have been missed by granular execution.  I have also seen the immense waste of time that accompanies granular cases.  One thing not mentioned in the study is that granular cases also tend to require a lot more maintenance time.

My rule of thumb is to create differentiated test cases but to run them all in one instance of the test harness.  This gets the benefits of a large test suite without many of the side effects of putting too much into one case.  It amortizes program startup, device enumeration, etc. but still allows for more precise tracking and easier reproduction of bugs.  If there is a lot of overhead, as in the DVD case mentioned above, test cases should be merged or otherwise structured so as not to pay the high overhead each time.
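
As a rough sketch of that arrangement, again in Python's unittest rather than the harnesses I actually used (the class and test names below are invented), many differentiated cases are loaded into one suite and run in a single harness process, so startup is paid once while each case still reports its own result.

    import unittest

    # Hypothetical test classes; imagine these live in separate modules.
    class NavigationTests(unittest.TestCase):
        def test_menu_loads(self):
            self.assertTrue(True)  # placeholder body

    class PlaybackTests(unittest.TestCase):
        def test_play_starts(self):
            self.assertTrue(True)  # placeholder body

    def build_suite():
        loader = unittest.TestLoader()
        suite = unittest.TestSuite()
        # Many differentiated cases, one harness instance: program startup,
        # device enumeration, etc. are paid once for the whole run, while each
        # case still gets its own pass/fail record for tracking and repro.
        for cls in (NavigationTests, PlaybackTests):
            suite.addTests(loader.loadTestsFromTestCase(cls))
        return suite

    if __name__ == "__main__":
        unittest.TextTestRunner(verbosity=2).run(build_suite())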