Finding an easier way to reproduce a bug


 

We spent a good amount of time trying to reproduce the exact customer-reported crash in the merge pages addin.  Theo and I sent about eight emails back and forth, and three of us testers here donated our time to try to solve this problem.  After the big clue that Theo was not seeing the crash with English settings and was only seeing it with Dutch, we set our test machines to use Dutch and still could not see the problem.


 


Suspecting the crash was due to some international setting, we started to change our regional settings to Spanish, Hindi, Chinese and Greek.  We managed to reproduce the problem with Greek but still did not know why the crash occurred.  As it turns out, it was a number format setting, but international characters or Unicode text could potentially have been a problem as well.


 


Jeff stepped through the code in the debugger, then found and fixed the problem, which was the comma used as a decimal separator.


 


In hindsight, there was a much simpler way to repro the crash.  On any machine, change the Windows regional settings to use a comma (or any character other than a period) as the decimal separator.  Wham – crash.
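To show the general class of bug, here is a minimal Python sketch.  It is not the addin's code (which is not shown in this post), and every name in it (format_number, parse_fragile, parse_with_separator) is hypothetical.  It only assumes that some code receives a number as text formatted with the user's decimal separator and then parses it as if the separator were always a period.

```python
def format_number(value: float, decimal_sep: str) -> str:
    # Stand-in for the text a program receives under a given regional setting.
    return f"{value:.2f}".replace(".", decimal_sep)


def parse_fragile(text: str) -> float:
    # Assumes the decimal separator is always a period.
    return float(text)


def parse_with_separator(text: str, decimal_sep: str) -> float:
    # Normalizes the configured separator to a period before parsing.
    # (Thousands separators are ignored to keep the sketch short.)
    return float(text.replace(decimal_sep, "."))


if __name__ == "__main__":
    for sep in (".", ","):
        text = format_number(8.5, sep)
        try:
            print(repr(text), "-> fragile parse:", parse_fragile(text))
        except ValueError as err:
            print(repr(text), "-> fragile parse failed:", err)
        print(repr(text), "-> separator-aware parse:", parse_with_separator(text, sep))
```

In Python the fragile parse raises a ValueError that can be caught; in unmanaged code, the equivalent unhandled failure is the kind of thing that could show up as a crash.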


 


One of the ongoing tasks in testing is to find the simplest way to reproduce a bug.  Those simple steps make it easy for anyone to verify the bug remains fixed during further (regression) testing.  For this example, I can regression test this bug on any machine now that I know there is a common Windows setting I can change to put the addin in the needed state.  With the original steps, I would (perhaps) have needed to install font support or change other settings to get Windows to work with the UI for the language the repro steps called for.  Since I do not need to install anything, I can regress the bug much more quickly.  Plus, whoever is fixing the bug does not have to install any unneeded components, saving him or her time as well.
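Once the repro is reduced to a single setting, the regression check can be reduced the same way.  The Python sketch below is only an illustration under assumptions: parse_number stands in for whatever fixed code is under test, and run_regression is a hypothetical helper that tries several separators and reports the ones that still fail.

```python
def parse_number(text: str, decimal_sep: str) -> float:
    # Hypothetical fixed parser: normalize the separator, then parse.
    return float(text.replace(decimal_sep, "."))


def run_regression(parse, separators=(".", ",", "'", "/")) -> list:
    """Return the separators for which parse() fails or gives a wrong value."""
    failures = []
    for sep in separators:
        text = "12.75".replace(".", sep)  # the same value as each setting would show it
        try:
            if parse(text, sep) != 12.75:
                failures.append(sep)
        except ValueError:
            failures.append(sep)
    return failures


if __name__ == "__main__":
    print("separators still failing:", run_regression(parse_number))
```

Passing a parser that ignores the separator, such as lambda text, sep: float(text), makes every separator other than the period show up in the failure list, which is a quick way to confirm the check actually catches the original bug.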


 


Minimal repro steps would have saved hours in this case (three testers and one customer, at about two hours each to isolate the problem) and could have resulted in a faster fix.  So we will apply this lesson in our future testing.


 


Comments, concerns, questions and criticisms?


John

Comments (3)

  1. d says:

    Something that is turning out to be quite useful is doing testing inside a VM and snapshotting the state when an error occurs.

    This can be easier for another party to observe, and the snapshot provides a known point that can be returned to repeatedly to figure out, piece by piece, what’s going on in the environment when an error pops up.

    Time to examine a known system state where an error is occurring or has just occurred could provide the opportunity to tease out the details of what’s going on and cut the time it takes to trial-and-error some random combo of settings on a live system.

  2. TechieBird says:

    I used to work in a PSS team at Microsoft.  We always used to joke that about 20% of our job was trying to make things work, 80% was trying to make them fail in a predictable way.

    Now that I’m a customer again, I find it hugely helpful to know that’s what the engineer is looking for when I raise a case.

  3. JohnGuin says:

    My first year at MS was as a contractor on Win95 tech support.  I know *exactly* what you mean.

    Getting to the root of the problem was, and still is, key.  "When did it last work?  What did you change? What have you already tried?" and the like were key questions.

    I learned so much as a PSS tech.  All the jokes (ALL the jokes) you hear about tech support calls are not jokes, they are true.  I use that experience at design meetings: if someone says "No one will ever do that," I can say "Oh, yes someone will.  Let me pull out my list of names of who will call tech support about this…"

    John