Test tracing - it is not optional!

I’d like to talk about one of the first lessons I learned as a tester at Microsoft:

Anticipate failure investigations, and log everything.

Let me tell you a fairly rambling tale of my early days at Microsoft:

We’d just shipped Windows NT 4.0 (and IIS 1.0).  I’d recently moved from a contract
position in the Windows build lab; to a full-time position (aka FTE, aka ‘blue-badge’)
in the COM test team.

In those days, the COM team was just beginning its transition from a consumer, drag-n-drop,
linking-and-embedding technology; to a distributed computing, enterprise-application
sort of focus.  This was all before COM+; DCOM was the new thing.  Given
that history, it is perhaps not surprising that we had a bunch of old tests covering
a variety of those aforementioned consumer-oriented applications of COM.  Linking
and embedding, COM common dialogs, and more; these were tests that had been lovingly
handed down from tester to tester over the years.

You’ll recall I mentioned a transition.  Now, those who have been in the industry
for a while know what it means when the product changes focus like that.  Resources
shift.  Things happen, like taking these “upper layer” tests from the five experienced
testers who owned them, and giving them to a new, green tester fresh from the build
lab.  Sink or swim, as they say.  Let me tell you, taking over maintenance
of a large test suite (written by a number of different people, in a number of different
styles *ping* ) is a great way to learn
some valuable lessons about test code design.

For example, there was a test for the OLE common dialogs, used for inserting OLE objects
and other operations.  This test basically ran a large script ( *ping* )
that iterated through a bunch of scenarios exercising these dialogs.  In total
the scripts performed thousands of operations – click that button, move that window,
select this other menu option, etc.  Now, any time you have one program executing
a script like that, you basically have one program running another program. 
This means it is particularly difficult to figure out, from a debugger, what the dang
thing is doing.  When you break in, the stack just shows you what the script
interpreter is doing, and it requires some investigation of the program variables
to figure out where you are in the script being interpreted.

Now, the test had a log file output, but unfortunately when I acquired the test this
output was fairly minimal.  Let me tell you, debugging a failure in this test
was an adventure!  It was time-consuming, and technically challenging. 
(Technically educational, too, I will say.)  So I went into the interpreter and
did something very simple – I just logged each unique operation as it executed:

LOG TRACE: Menu Select ‘File Open’

The logs got a lot bigger, and the test ran marginally more slowly (those UI tests
tended to be dog slow, anyway), but now when something failed, I could look at the
log and see the precise script actions that led up to the failure.

So what I learned is that it is really, really important in a test to log *everything*. 
Log your inputs; log your outputs; log all the program variables that seem even halfway
interesting.  It will make your life so much easier.  It will make the life
of the next test maintainer so much better.  If it causes a severe performance
hit, *then* you can perhaps think about
placing some of the trace output under a switch, that you can enable or disable as
needed for debugging purposes.  Even there, make sure you can turn on the trace
output without having to recompile the program – you’ll thank yourself later.

On a side note, in the last couple years I’ve noticed a trend in Microsoft product
design to add this kind of detailed tracing to our products.  Debug level tracing,
in the production code – something you can enable or disable with a program or registry
entry.  I know that DTC has done this for a while, and I believe it has been
standard practice in the mainframe world for decades.  Of course, for performance
reasons the tracing is typically turned off in normal execution – but it is there,
and it doesn’t require a re-compile to enable.

PSS loves this – it really helps them track down and solve customer problems.