the Zune issue

Article
01/06/2009

As you can imagine there is a pretty lively debate going on over the Zune date math issue here in the hallways and on our internal mailing lists. There are plenty of places one can find analyses of the bug itself, like here, but I am more interested in the testing implications.

One take: this is a small bug, a simple comparator that was ‘greater than’ but should have been ‘greater than or equal to.’ It is a classic off-by-one bug, easily found by code review and easily fixed then forgotten. Moreover, it wasn’t a very important bug because its lifespan was only one day every leap year and it only affected the oldest of our product line. In fact, it wasn’t even our bug; it was in reused code. Testing for such proverbial needles is an endless proposition, blame it on the devs and ask them not to do it again. (Don’t get your knickers in a twist, surely you can detect the sarcasm.)

Another take: this is a big bug, in the startup script for the device and thereby affected every user. Moreover, its effect is nothing short of bricking the device, even if only for a day (as it turns out, music is actually a big deal on that specific day). This is a pri-1, sev-1, run-down-the-halls-screaming-about-it kind of bug.

As a tester can I take any view but the latter? But the bug happened. Now we need to ask what can we learn from this bug?

Clearly, the code review that occurred on this particular snippet is suspect. Every code review I have ever been part of, a check on every single loop termination condition is a top priority, particularly on code that runs at startup. This is important because loop termination bugs are not easily found in testing. They require a “coming together” of inputs, state and environment conditions that are not likely to be pulled out of a hat by a tester or cobbled together using unthinking automation.

This brings me to my first point. We testers don’t do a good job of checking on the quality of code reviews and unit testing where this bug could have been more easily found. If I was still a professor I would give someone a PhD for figuring out how to normalize code review results, unit test cases and system test cases (manual and automated). If we could aggregate these results we could actually focus system testing away from the parts of the system already covered by upstream ‘testing.’ Testers would, for once, be taking credit for work done by devs, as long as we can trust it.

The reason that system testing has so much trouble dealing with this bug is that the tester would have to recognize that the clock was an input (seems obvious to many, but I don’t think it is a given), devise a way to modify the clock (manually or as part of their automation) and then create the conditions of the last day of a year that contained 366 days. I don’t think that’s a natural scenario to gravitate toward even if you are specifically testing date math. I can imagine a tester thinking about February 29, March 1 and the old and new daylight savings days in both Fall and Spring. But what would make you think to distinguish Dec 31, 2008 as any different from Dec 31, 2007? Y2K seems an obvious year to choose and so would 2017, 2035, 2999 and a bunch of others, but 2008?

This brings me to my second point. During the discussions about this bug on various internal forums no less than a dozen people had ideas about testing for date related problems that no one else involved in the discussions had thought of. I was struck by a hallway debate between two colleagues who were discussing how they would have found the bug and what other test cases needed to be run for date math issues. Two wicked smart testers that clearly understood the problem date math posed but had almost orthogonal approaches to testing it!

The problem with arcane testing knowledge (security, y2k, localization all come to mind) is that we share our knowledge by discussing it and explaining to a tester how to do something. “You need to test leap year boundaries” is not an ineffective way of communicating. But it is exactly how we are communicating. What we should be doing is sharing our knowledge by passing test libraries back and forth. I wish the conversation had been: “you need to test leap year boundaries and here’s my library of test cases that do it.” Or, “counting days is a dangerous way to implement date math, when you find your devs using that technique, run these specific test cases to ensure they did it right.”

The testing knowledge it took to completely cover the domain of this specific date math issue was larger than the set of folks discussing it. The discussion, while educational and stimulating, isn’t particularly transportable to the test lab. Test cases (or models/abstractions thereof) are transportable and they are a better way to encapsulate testing knowledge. If we communicated in terms of test cases, we could actually accumulate knowledge and spread it to all corners of the company (we have a lot of apps and devices that do date math) much faster than sitting around explaining the vagaries of counting time. Someone who doesn’t understand the algorithms to count time could still test those algortihms using the test assets of someone else who did understand it.

Test cases, reusable and reloadable, are the basis for accumulated knowledge in software testing. Testing knowledge is simply far too distributed across various experts’ heads for any other sharing mechanism to work.

the Zune issue

Additional resources