More on Quality

In my last post I discussed the subject of shipping quality products (which is really rather different from code quality or stability, or many other measures one could use). And let’s not forget design quality – obviously, if a product does not meet the customer’s needs or is too hard to use, it doesn’t matter how stable or well written it is. I got quite a few comments on my last post, so I guess this is an interesting topic. For the record, I am not writing a book here – so I am not necessarily exhaustive when I describe bugs or coding techniques. The remarkable thing about Watson that made me want to write about it is that it is a new way to make products better that tries to measure the real world. It is not an excuse to avoid good code architecture, or to not think about product design. One person pointed out that non-crashes can be more annoying than crashes – that’s very true – although crashes and hangs are pretty nasty. In fact we’re looking into ways to extend Watson beyond crashes and hangs. The Watson guys have a bunch of exciting ideas around that.

I mentioned last time that the goal is to ship a product with the highest known quality, not necessarily with the fewest bugs. This is a counter-intuitive concept, so I’ll explain a little more. The naïve way to think about bugs is that if you fix a bug, the product is better as a result. The truth is that it probably is better, but you cannot say for certain right after you check in your change that it absolutely is better. You may have inadvertently introduced a different problem by fixing the issue you were dealing with. That kind of bug is called a “regression” – the product quality has regressed (decreased) as a result of trying to improve it.

In the example I gave, fixing a redraw problem in a toolbar button may seem pretty harmless, but in fact something is now different, which might introduce a change somewhere else you did not anticipate. If you doubt this, then you haven’t developed enough software or had enough contact with the people who use your stuff to know that you sometimes make things worse, and you don’t always find out immediately. In the toolbar button example, if you do it by setting a flag on a call to the system which causes the video driver to be called in a different way, that’ll probably be fine on your machine, but some fraction of people out there might be using a video driver that can’t handle that particular request, and they go down. Unless you or your test team happen to have that video card and driver, and happen to try this new code out, you will have no idea that your fix to a minor visual glitch has now caused 1.5% of your future users to have a horrible experience with bizarre crashes that seem to have no known cause since their actions are not causing the problem. It may take you months to discover what has happened – months of complaints from a seemingly random set of customers who have no repro steps for their crashes – truly a nightmare. If you fixed this bug right before you thought you were done, then you’ll probably find out about the problem from your customers rather than your test team, which is not good. Once you do find out about this other bug and finally track it down to the video driver (if you ever do), you will of course fix it, but it is a little late – the damage is done. Some customer might have deployed your software on thousands of machines, and your fix is going to have to be deployed on all those machines as well. That little bug might cost your support lines and customers more money than they paid you for the software, or than you made from it.

So the goal is to get the product to a known state of quality before you put it out there. To maximize the known quality, the process we use in Office and the rest of Microsoft for the most part is a staged reduction in bug fixing, believe it or not. Naturally during the course of a project we fix just about every bug we find. But near the end of the project, as it gets harder to find bugs and our predicted ship date approaches, we need to make sure we can control the quality of the product in the end game. We begin a process of deciding which bugs to fix and which ones our customers can live with (because they won’t notice them, mainly). Some people will read this and say “I knew it! See, Microsoft guy admits shipping buggy software!” If you conclude that, you did not understand.

Bugs are not all equal – we use a measure called “severity” to gauge the impact of a bug independent of how common it is. This is a measure that a tester can apply without having to make a judgment on the importance of fixing a bug. It’s pretty simple: a “Sev 1” bug is a crash or hang or data loss. Sev 2 is serious impact on functionality – cannot use a feature. Sev 3 is minor impact on functionality – feature doesn’t quite work right. Sev 4 is a minor glitch – visual annoyance, etc. The severity doesn’t control whether we fix a bug or not. For that we use a separate, subjective measure – the opinion of a program manager, mainly. For example, if the splash screen for Word has “Word” misspelled, then that is a Sev 4, but clearly is a must-fix.
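
To make the distinction concrete, here is a rough sketch – not our actual bug database schema, the names and fields are invented purely for illustration – of how severity might be recorded separately from the subjective must-fix decision:

    /* Rough sketch only – not the actual Office bug tracking schema.
     * Severity measures impact on its own; whether to fix is a separate,
     * subjective call recorded independently. */
    enum severity {
        SEV1 = 1,   /* crash, hang, or data loss */
        SEV2,       /* serious impact: cannot use a feature */
        SEV3,       /* minor impact: feature doesn't quite work right */
        SEV4        /* minor glitch: visual annoyance, etc. */
    };

    struct bug {
        enum severity sev;  /* impact, independent of how common the bug is */
        int must_fix;       /* triage decision, e.g. a misspelled splash screen */
    };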

We know that statistically our testing process flushes out most regressions within about 3 months of the time they were introduced. Since there is such a long lead time, we start reducing the number of changes months before our expected ship date. Of course usually regressions are found within a day or two if not sooner because we test specifically for them, but some of them lurk, as I have explained. If a regression is not found by the special “regression testing” done specifically to try to catch these, then that area of the product may not be revisited for 2-3 months as the test team works on other areas, hence the long period.

Because of this long time that regression bugs can lurk, some months before we decide we can ship the product we begin a “triage” process. We start rejecting bugs (marking them as “won’t fix” or “postpone”) that have little to no customer impact, and that only a very persistent customer would run into (like typing 1,000,000,000,000,000 into the footnote number control causes the footnote format to look incorrect – who cares?). The goal is to reduce “code churn”. Any sizeable software project is something you have to manage as an organic thing since, as I discussed, it is no longer possible to know for certain what effect any particular code change will have. By simply reducing the amount of change, you can reduce the amount of random problems (regressions) being introduced.

As time goes by, we raise the “triage bar” that a bug must meet in order to be worth fixing. That is, bugs need to be a little more serious to get fixed. As a result of this process, fewer and fewer changes are happening to the code base. Time passes, and we can be more confident that new bugs caused by bug fixes are being introduced in much smaller numbers. Finally the number of bug fixes we are taking per week is down to single digits, and then finally no bugs remain that we would consider “must fix”. We now have a “release candidate”, and we will only take “recall class” bugs.  Essentially we could now ship the product, but we leave it in “escrow”, testing like mad to see if any truly heinous bug that would make us recall the product appears, and also whether those last few fixes have regressions. So now we have reached a state where by choosing to not fix bugs that we considered survivable, we have minimized the chance that even worse bugs are now in the code instead. We’re at maximum known quality.

And of course the inexperienced among us are shocked, because they can’t understand why this or that bug did not get fixed. But the wiser ones remember what happens when you buck the process and fix that last bug.

A last anecdote to leave you with. Even re-linking your code (not even a recompile) can introduce a crashing bug. A few years ago, we were working on the release candidate for an Asian-language version of Word 97. We thought we were done, and ran our last optimization on the build. We have some technology at Microsoft that profiles code usage and arranges the code modules so that they are in the executable in the optimal order to produce the best possible boot speed. After running this process, which involves mainly the linker and is considered very safe, we put the code in escrow while the testers tried to break it. And they did – they found that it would crash on some of the machines they had when a certain feature was used. But the unoptimized build did not crash on the same machine with the same steps.

So we ran the “debug” build (a version of the build that has the same code as the “ship” build you all use, but includes extra information to allow debugging) and it also did not crash. We then tried to debug the ship build – but just running the ship build in the debugger caused the crash to go away. Some developers love this sort of mystery. One of the future inventors of Watson stopped by to get involved. No matter what they did, as soon as they tried to find out what was causing the crash, it went away. We went to get an ICE – a hardware debugger that might give us some clues.

Then we noticed that there was a pattern. The only machines that showed this crash had Pentium processors of 150MHz or less, and some of the 150MHz machines did not show the problem. We had a hunch, and searched the Intel web site for “errata” (their word for bugs – we prefer “issues”). Sure enough, there was a flaw in the Pentium chip which under certain obscure circumstances could cause a fault – there needed to be a JMP instruction followed exactly 33 bytes later by a branch, and the JMP instruction had to be aligned with a “page” (4096 byte block) boundary in memory. Talk about specific. The flaw had been fixed in later versions of the 150MHz generation, and all chips produced later.

Now, there are actually flaws in chips quite often (one or two are even famous). Those chips include software, after all. But as they are discovered, chip manufacturers tell the people who write compilers about the bugs, and the compiler people modify their compilers to produce code that cannot hit the bug. So the flaws seem to just disappear (thank goodness for software!). It turned out we were using a slightly older version of the compiler which did not know about this flaw. An Intel rep at Microsoft confirmed the details of the problem, and rather than relinking or taking any chances whatsoever, we ran a check for this byte sequence, and manually moved three bytes in our 5MB executable to make sure those instructions were 34 bytes apart. Problem solved. So now when someone tells me a fix is “safe”, I can tell them that no fix is truly safe – you really can never know.
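
For illustration, here is a minimal sketch of that kind of byte-pattern scan. The specific opcodes (0xE9 for the near JMP, 0x70–0x7F for a short conditional branch) and the treatment of file offsets as load addresses are assumptions made for the example, not details of the actual check we ran:

    /* Sketch of a scan for "JMP on a page boundary followed 33 bytes later
     * by a branch". Opcode choices (0xE9 near JMP, 0x70-0x7F short conditional
     * branches) and treating file offsets as load addresses are illustrative
     * assumptions, not the real tool. */
    #include <stdio.h>
    #include <stdlib.h>

    #define PAGE_SIZE 4096
    #define GAP       33    /* bytes between the JMP and the following branch */

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <executable>\n", argv[0]);
            return 1;
        }

        FILE *f = fopen(argv[1], "rb");
        if (!f) { perror("fopen"); return 1; }

        fseek(f, 0, SEEK_END);
        long size = ftell(f);
        rewind(f);

        unsigned char *buf = malloc((size_t)size);
        if (!buf || fread(buf, 1, (size_t)size, f) != (size_t)size) {
            fclose(f);
            return 1;
        }
        fclose(f);

        /* only a JMP sitting exactly on a 4096-byte boundary can trigger the
         * flaw, so step through the image one page at a time */
        for (long off = 0; off + GAP < size; off += PAGE_SIZE) {
            if (buf[off] == 0xE9 &&                                 /* JMP rel32 (assumed) */
                buf[off + GAP] >= 0x70 && buf[off + GAP] <= 0x7F) { /* Jcc rel8 (assumed) */
                printf("possible hit at file offset 0x%lx\n", off);
            }
        }

        free(buf);
        return 0;
    }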

Comments (41)

  1. moo says:

    I actually like the way some people resolve bugs in RAID as "By Design".

    If something is raised as "By Design" I would question the design, and not do what most PMs do and just brush it aside.

    The problem with software developed on the Open Source model on SourceForge etc. is their very nature of being homebrew. No design, no planning, no post-mortems. A lot are just hacked up, and if you dare question anything it's "Yeah and yo mama sux.." or "j0 lamer n00b".

    Talking to an egotistical Linux head or Open Source developer for an SF project is like playing a game of Unreal Tournament 2003 and reading the chat window.

  2. moo says:

    Software should be kept soft. I await the day when we will have programmable CPUs, so we can just update the CPU with the latest code and bug fixes, either like a BIOS update or on the fly.

  3. moo says:

    You may find this funny 😀

    Eat Collins dictionary biatch 😀

  4. senkwe says:

    This is fascinating stuff. Subscribed!

  5. Larry Osterman says:

    Chris, the 33 byte bug you reference is EXACTLY the same bug that I use in my war stories about why you never change any code ever.

    In my case, it was an interim build of the Exchange 5.5 IMAP server that would crash 100% of the time on a particular machine but on no others, and, like this bug, would only fail with no debugger attached.

    Cool to find others who ran into it 🙂

  6. moo,

    Actually, some CPU updates DO happen when you flash the BIOS. The BIOS contains microcode updates for the processor. Microcode is sort of an "assembly language for assembly language". Microcode defines how the x86 assembler instructions get turned into microinstructions that get executed in the RISC core of the CPU.

    Newer revs of the BIOS from your motherboard maker often contain microcode updates.

    Granted, it’s not a huge update, but an update nonetheless.

  7. Rick Schaut says:

    I’ve often referred to bugs whose ease of reproduction appears to be inversely proportional to the rigor with which one attempts to observe their cause as "Heisenberg" bugs.

    We had one of these in Mac Word 98, and it was caused by a string change. The code that referenced the string copied it from the resource database into a stack-based buffer, making substitutions as it went along. For performance reasons, the code relied on the existence of a sentinel character at the end of the resource string. When it found the sentinel character, it would slam a NUL character in its place, thereby terminating the string C-style.

    Well, the string change omitted the sentinel character, so the code would quite contentedly march right on past the input string until such time as it found the sentinel character. Turns out, however, that the particular sentinel character in question was commonly the same as the highest-order byte of an instruction address.

    So, depending upon what actually happened to be on the call stack when this routine was called, the code in question might, or might not, stomp a return address that’s on the call stack. Oftentimes, the result was completely innocuous. Other times, the result was catastrophic.

    Moral of the story: even string changes can leave you with an unknown bug that’s difficult to find.
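
    For illustration, a minimal C sketch of that failure mode – the sentinel value, buffer size, and names are invented for the example, and this is not the actual Mac Word 98 code:

        #define SENTINEL 0x7F   /* assumed sentinel byte at the end of a resource string */

        /* Copies a resource string into a caller-supplied buffer, trusting the
         * sentinel to terminate the source. No length check: if the sentinel is
         * missing, the loop marches on through memory. */
        static void copy_resource_string(char *dst, const char *src)
        {
            while (*src != SENTINEL)
                *dst++ = *src++;    /* runs past dst's end when the sentinel is absent */
            *dst = '\0';            /* slam a NUL where the sentinel was expected */
        }

        void example(void)
        {
            char buf[32];   /* stack buffer; a return address sits just beyond it */

            /* a correctly authored resource string ends with the sentinel */
            static const char good[] = { 'O', 'K', SENTINEL };
            copy_resource_string(buf, good);

            /* a string edited without the sentinel keeps the loop copying until it
             * happens to hit that byte somewhere in memory, sometimes stomping the
             * return address on the stack – innocuous one run, catastrophic the next */
        }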

  8. Chris Pratley writes about bug fixing within Microsoft and how Watson, a software reporting tool, has changed their debugging process. He follows up with a second post about attention to detail and how hardware obscurities sometimes lead to buggy software….

  9. Woo Seok Seo says:

    After reading your article, I’m just wondering about your bug tracking system. What kind is it?

  10. moo says:

    It’s called RAID. And no, it’s not Redundant Array of Inexpensive Developers (in India).

  11. moo says:

    The bug tracking system used by Microsoft is known as RAID, with a SQL Server backend and file attachments stored as normal files on a shared folder over NetBIOS.

    It tracks statistics, severity, priority, who raised the bug, who it is assigned to, which component / feature area it belongs to, etc. – what one normally expects from a bug tracking system. RAID is an internal custom solution.

    For source control they use Source Depot, which was first deployed for the Windows NT kernel (I think there is an article on this on the net somewhere, if I remember right).

    Source Depot is mostly command driven, and people internally usually hack up VS plugins and UIs and dump them onto http://toolbox, the internal tool sharing repository.

    Usually each business unit doing a product will run their own RAID bug tracking system and Source Depot repository. These servers are maintained by OTG, and only authorized persons get access to the server rooms with prior permission from OTG.

  12. moo says:

    Do these CPU updates that are done on a BIOS update actually change the CPU, or are they just patching the BIOS with changes? If so, where is there information on how to modify a CPU?

  13. The microcode files are distributed by Intel and are statically linked into the BIOS binary. When you update your BIOS, you are also updating the microcode. During the early stages of POST, the microcode instructions are downloaded into the CPU.

    I believe that if the BIOS provides no microcode, the CPU has an internal "default" microcode that it will use instead.

    Of course, it’s up to the BIOS engineer to grab the latest microcode from Intel and link it into his/her BIOS.

  14. Also, I believe that Linux and Windows both offer "microcode update" drivers that allow them to update the CPU’s microcode, in case the BIOS doesn’t do it.

  15. moo says:

    That could be nasty if I just take a popular BIOS, modify it, and host that. Lots of duff CPUs 😀 Could be more fun than disabling the CPU fan in the hacked BIOS and OCing it to the hilt so it burns 😀

  16. moo says:

    This CPU modifying is just Intel’s wisdom; this isn’t possible with AMD as far as I know. I guess Intel has to do this because they make crap products. Anybody who buys Intel is just a dumb suit that doesn’t realise it’s overpriced shit that’s costing them productivity.

  17. Does N. Matter says:

    Hey moo, been away from Redmond for a long time, huh? RAID, gee. Nobody’s using that any more. We’re on PS now. 🙂

  18. moo says:

    Not everybody is using PS 😀 Some still use RAID

    Whoever said I was in Redmond? Not me.

  19. moo says:

    So what product do you work on, Does.N.Matter? I would gladly pulverise your product and zero-day the bugs to every public channel available, with full repro code.

    Want a challenge?

  20. moo,

    Actually, you could make a bogus BIOS pretty easily. Just put together a 256kb file with a .BIN extension. Fill it with bogus code.

    BTW, are you actually trying to be a troll? AMD CPUs have an equivalent microcode update system too. All major x86 systems do.

  21. moo says:

    That’s news to me; I was told by somebody that only Intel had this feature. Guess they were wrong. Do you have any links to references on this info?

  22. moo says:

    I guess we are moving towards softer CPUs or FPGAs (and rightly so, as long as they are digitally signed – though maybe I don’t know, as that would lock out the non-commercial research elements).

  23. If your CPU has a "device driver" listed in Windows, it supports microcode updates.

    Practically every x86 since the Pentium Pro supports microcode. They need some kind of CISC-to-RISC translation table, and that needs code to work.

    Here’s a link:

  24. moo says:

    Here’s maybe a topic: how to design with testability or verifiability in mind, blah blah, etc.

    Things I do are to add lots of events that would maybe not be of use to a caller of the component (well, not in a usual use case) but would be invaluable for proving the component or testing various parts. The test harnesses can subscribe to these events, etc.

    I also, for debugging, set a lot of retvals so I can see clearly every API call’s status if there is a retval. IntelliSense helps out here a lot.

    I also expose a lot of properties showing internal state for testing.

    Things like the above make testing much easier by providing the hooks for the testers, instead of them scratching their heads wondering "how can I do this?". Saves me aggro in the long term also.
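
    A minimal C sketch of the idea – the names and the callback shape are invented purely for the example:

        #include <stdio.h>

        /* A component with an optional test hook: a harness can subscribe to
         * internal events, and an accessor exposes internal state for checks. */
        typedef void (*test_hook_fn)(const char *event, int detail);

        struct component {
            int internal_state;
            test_hook_fn hook;   /* NULL in production; set by a test harness */
        };

        void component_set_test_hook(struct component *c, test_hook_fn hook)
        {
            c->hook = hook;
        }

        int component_get_internal_state(const struct component *c)
        {
            return c->internal_state;   /* exposed purely for verification */
        }

        void component_do_work(struct component *c)
        {
            c->internal_state += 1;                      /* the normal work */
            if (c->hook)
                c->hook("work-done", c->internal_state); /* fire only if subscribed */
        }

        /* example harness callback */
        static void harness_hook(const char *event, int detail)
        {
            printf("event %s, detail %d\n", event, detail);
        }

        int main(void)
        {
            struct component c = { 0, NULL };
            component_set_test_hook(&c, harness_hook);   /* only in the test build */
            component_do_work(&c);
            return component_get_internal_state(&c) == 1 ? 0 : 1;
        }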

  25. Jack Mayhoff [MSFT] says:

    What kind of testing do YOU do (and I don’t mean theoretical spouting, but what you do in actual reality), and what kind of automation for BVTs and so on?

    Is it mostly scripting, automated with static data, auto-generated data, what?

    How much of the data is machine-generated at runtime or pre-runtime, or is it human-made?

    If machine-generated, how do you generate it (storage isn’t an issue), and how do you plug that data into the application under test? Templates with variable placeholders, or just general reading from storage and dumping it in?

  26. Jack Mayhoff [MSFT] says:

    My basic question (damn, why can’t I edit my comments) is: how automated are you? Is it a one-button generate-and-test app, or a lot of time spent on doing the cases? I’m a strong advocate of more than 50% automated; the rest of the time we can do ad hoc or special cases, and we can even automate ad hoc cases via controlled randomization.

  27. Jack, I am wondering why you put "[MSFT]" at the end of your name. I cannot find you in the internal address book. Are you a Microsoft employee?

  28. Jack Mayhoff [MSFT] says:

    You mean the LDAP directory or headtrax?

  29. No, I mean you do not work at Microsoft, period. I was able to confirm that today. I see that you and "moo" are in fact the same person as well. Of course you’re welcome to call yourself whatever you like; however, I for one would appreciate it if you did not mislead others into thinking you are an employee. Thanks.

  30. Jack Mayhoff [MSFT] says:

    How do you know? You don’t, that’s right.

    Just because this name isn’t in Headtrax doesn’t mean I do or don’t work there. I guess this would show up in your test matrix if you think like that. (Full of holes.)

  31. Jack Mayhoff [MSFT] says:

    Put it this way: I am what you are up against when you go RTW or RTM on a product.

    I have a policy of zero-day full-source repro on any serious bug. Having tried to contact people to notify them of issues, and gotten an arrogant, ignorant fool with an ego the size of the moon who is nonetheless clueless, this is the only way to go.

  32. Big Red Blob says:

    An excellent and insightful peek into the mind of a Microsoft developer as he ruminates on software bugs and the risks that come from even attempting to fix software flaws. There are dangers everywhere. Even re-linking isn’t safe, as this article shows. The reluctance to even touch the code for…

  33. When a Windows program crashes, Windows XP gives you the opportunity to send an error report to Microsoft. The process is called Online Crash Analysis. My advice: Do it. Here’s a perfect example of why it’s good for you and for your fellow PC users. For years, I’ve encountered a sporadic…