“The first priority, young man, is to find the bugs…”

 Prologue

One of my teammates recently set up a Wiki site for
my team, and I must admit that I have become quite addicted to it.  One of the
pages I’ve created is a list of common test areas, a checklist of things most of our tests will want to concern themselves with.  Things like “bad parameters” and “different transport protocols”.

Let me interject at this point that this concept of a common test checklist is by
no means a new one.  Anyone among my three readers who has done software testing
before knows what I mean.  Despite all the prior art, however, I believe it is useful to have a team-specific list; something
that targets those test areas particularly relevant to that team.  To pick a
team completely at random, for example, .NET Remoting likely has specific test areas
and checklist items that are particularly relevant to it.

My checklist currently has about 20-30 items on it; today I would like to focus on a single one of them, two simple words: Multithreaded Testing.

In reality, it’s a topic that can fill a library.  I’m going to chat about my empirical experiences, but if you are interested, there is also considerable computer science research in this area.

As I write this, I find myself mentioning test topics that deserve more discussion in a future post.  I’ll mark those with a *Ping* so I remember to go back to them some other day.

“The first priority, young man, is to find the
bugs…”

Many moons ago, in the before-time, I wrote a test suite for the IErrorInfo interface
and the associated COM infrastructure.  (Yes, it is my fault if it doesn’t work
right…)  One set of tests that I wrote was designed to discover race condition
problems when multiple objects on different threads got error objects back from the
same target object.

I was actually quite proud of these tests.  In one case, I would have threads
A and B call object O, where the calls arrived in the order A, then B, but by judiciously
blocking the calls in the target object, they would return in the order B, then A. 
In another case, I would force the order as A-calls, B-calls, A-returns, B-returns.
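
For flavor, here’s a minimal sketch of that style of forced interleaving, written against plain standard C++ threads rather than the original COM plumbing; TargetObject, DoWork, and the two gates are hypothetical stand-ins for the object under test and for the blocking I did inside the real target.

```cpp
#include <future>
#include <iostream>
#include <thread>

struct TargetObject {
    std::promise<void> aArrived;   // signaled once A's call has arrived
    std::promise<void> bReturned;  // signaled once B's call has returned

    char DoWork(char caller) {
        if (caller == 'A') {
            aArrived.set_value();
            // Judiciously block A inside the target until B has come and gone,
            // so the returns come back in the order B, then A.
            bReturned.get_future().wait();
        }
        return caller;  // stand-in for handing back an error object
    }
};

int main() {
    TargetObject target;

    std::thread a([&] {
        std::cout << "A returned " << target.DoWork('A') << "\n";
    });

    std::thread b([&] {
        target.aArrived.get_future().wait();   // make sure A's call arrived first
        std::cout << "B returned " << target.DoWork('B') << "\n";
        target.bReturned.set_value();          // now let A's call complete
    });

    a.join();
    b.join();
}
```

Every ordering is scripted in advance; nothing happens that the test didn’t explicitly arrange.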

I had about four of these test variations.  I had beautiful charts in my test
specification describing the control flow.  I had bountiful program output, describing
the scenario in loving detail for anyone who might happen to need to debug a failure. 
I loved those, because if anything failed, I could point to the exact repro scenario
and documentation needed to demonstrate and debug the bug.  *Ping*

What I forgot (or hadn’t learned yet) was that documentation and easy repros, while important, are all “priority two”.  The first priority is to *find the bugs*.

Hindsight is 20/20

You see, by artificially controlling the ordering of the thread actions, I’m also
artificially constraining the product code paths that my test explores.  For
example, my test would never try the case where a call arrives at the target object
at the exact same time as another call returned from that object; the care I took
in synchronizing the scenario prohibited it.

What I should have done was kick off about a hundred threads, set up some loops so
they continuously hit the target object, and let it run for a minute or two. 
Sure, random testing is not deterministic; there is no guarantee that a given failure
will repro, and figuring out what happened when something does fail is a major pain
in the rear.  But remember, that is all “priority two”.  Think about all
those calls, twisting and twining, overlapping and conflicting, throughout the internal
IErrorInfo infrastructure.  It’s gonna be *tons* better at finding bugs than the four simple variations I wrote years ago.
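
For contrast, here’s roughly what that looks like as a sketch, again in plain standard C++ with a hypothetical TargetObject standing in for the real code under test: no scripted ordering, just a pile of threads hammering away until the clock runs out.

```cpp
#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>
#include <vector>

struct TargetObject {
    std::atomic<long> calls{0};
    void DoWork() { ++calls; }   // stand-in for setting/fetching an error object
};

int main() {
    TargetObject target;
    std::atomic<bool> stop{false};

    std::vector<std::thread> workers;
    for (int i = 0; i < 100; ++i) {              // "about a hundred threads"
        workers.emplace_back([&] {
            while (!stop.load()) {
                target.DoWork();                 // continuously hit the target object
            }
        });
    }

    std::this_thread::sleep_for(std::chrono::minutes(2));   // let it run a minute or two
    stop = true;

    for (auto& t : workers) {
        t.join();
    }
    std::cout << "Completed " << target.calls << " calls\n";
}
```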

Now, even that random test case isn’t enough.  There may be races that will just
never show up on your machine, in your configuration.  To catch stuff like that,
you’ll want to induce errors or delays; the joys of fault injection.  *Ping*
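
Fault injection deserves its own post, but even a crude hypothetical helper like the one below, sprinkled through the worker loop in the sketch above, will shake out interleavings that never occur naturally on a lightly loaded machine.

```cpp
#include <chrono>
#include <random>
#include <thread>

// Hypothetical helper: call this before or after each operation in the stress loop.
void MaybeDelay() {
    thread_local std::mt19937 rng{std::random_device{}()};
    std::uniform_int_distribution<int> dist(0, 9);

    int roll = dist(rng);
    if (roll == 0) {
        // Occasionally sleep long enough for other threads to race past this one.
        std::this_thread::sleep_for(std::chrono::milliseconds(5));
    } else if (roll < 5) {
        std::this_thread::yield();   // more often, just give up the time slice
    }
    // Otherwise do nothing; most calls proceed at full speed.
}
```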

You also might want to run tests for longer than a minute or two, which brings us
to our next topic…

How I learned to stop worrying and love the stress

Sometimes, a race only shows up once every couple months.  Sometimes it will
only show up on a single machine; the one that has the magic combination of system
components that demonstrate the failure.  To catch these, the Windows team has
this thing called “Office Stress”.  This runs a bunch of different tests, exercising
many of the features of Windows.  It crushes the machine – office stress will
routinely peg the CPU, and things run so slowly that failures are the norm.

Now, honestly, that is mostly useful for testing software that cannot fail. 
Things like winlogon or rpcss; if they fail, the machine fails.  These core
system components have to keep functioning even if 90% of their memory allocations
start failing – and office stress will force that condition.
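
To make that concrete, here’s a rough sketch of what “90% of allocations failing” looks like from the code’s point of view; it’s an illustrative allocator shim in standard C++, not how office stress actually works (office stress gets there by starving the whole machine).

```cpp
#include <atomic>
#include <cstdio>
#include <cstdlib>
#include <new>

std::atomic<bool> g_injectFaults{false};

// Fail roughly 90% of allocation requests while fault injection is switched on.
static bool ShouldFail() {
    if (!g_injectFaults.load(std::memory_order_relaxed)) return false;
    thread_local unsigned state = 12345u;
    state = state * 1664525u + 1013904223u;   // tiny LCG; no allocation of its own
    return (state % 100u) < 90u;
}

void* operator new(std::size_t size) {
    if (ShouldFail()) throw std::bad_alloc{};
    if (void* p = std::malloc(size ? size : 1)) return p;
    throw std::bad_alloc{};
}

void operator delete(void* p) noexcept { std::free(p); }
void operator delete(void* p, std::size_t) noexcept { std::free(p); }

int main() {
    g_injectFaults = true;                     // starve the code under test
    int ok = 0, failed = 0;
    for (int i = 0; i < 1000; ++i) {
        try { delete new int(i); ++ok; }
        catch (const std::bad_alloc&) { ++failed; }
    }
    g_injectFaults = false;                    // stop before touching the C library's I/O
    std::printf("%d allocations succeeded, %d failed\n", ok, failed);
}
```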

Office stress is not so useful for testing other kinds of programs.  Your typical
application doesn’t expect to keep working with memory failures – it just dies, hopefully
with some appropriate error message.  Typically, when you run these programs under office stress, they’ll die in the first 10% of the code, so you end up never testing the other 90%.  For these programs, a lower-intensity variant
of stress is appropriate.  We’ll tune the test configuration so that it runs
at about 70% resource utilization, and just let it run continuously.  The ASP.NET
team does a lot of their stress testing like this; in addition to finding race conditions
and other multithreading issues, it is also good for finding slow resource leaks.

On a side note, we also have a concept of “long-haul” stress.  Teams at Microsoft
often have ship criteria where we won’t ship a product unless it has run on so-many-hundreds
of machines, for a certain number of days, under stress, without failures.  For
Windows, for example, I believe it is something like forty days.  (Once we start
the last one of these forty-day test passes, it is a major pain in the rear if someone
finds a showstopper bug that resets testing…)  Depending on the team, long-haul
stress may consist mainly of high-intensity or low-intensity stress testing.

In the managed world, we do an interesting variation on fault-injection combined with
stress.  We have this tool called GCStress, which basically forces the garbage
collector to do a collection on every program step.  Yeah, it’s really slow. 
By running this along with our tests, we can surface memory failures pretty much at
the point where they occur, which makes them much easier to debug.  (Similar
to using the appverifier and turning on full pageheap, in the unmanaged world.)

Anyway, I’m definitely rambling away from the original topic of this entry, so I’ll
sign off now.  I’ve enjoyed writing this up; it helps me clarify the concepts
in my mind.  Please do let me know if you find it interesting as well!