Recall class


So in our drive to get beta2 out the door we’ve been continuously raising the bar on what sort of changes are allowed to the codebase.  As you can well imagine there is risk with pretty much any change and at a certain point the need to keep the product stable outweighs the benefit of a fix.  Even the most innocuous fix can end up causing problems in ways that you never realized (especially in a huge nondeterministic product like VS with all the threads that it contains). 

As of this week we raised the bar to what we call “recall class”.  Specifically, a bug fix needs to be for something so bad that we would recall the shipping of the product with all the enormous cost involved in that.  Why is that cost enormous?  Well, not only does it mean recalling the media we’ve printed, throwing it all away, and reprinting it all over again, but it also means that we’ve now delayed all the companies out there that want to use beta2, and we’ve potentially caused them to have to delay their own releases.  It cascades into something pretty horrendous and you want to avoid it if at all possible.

So what does it take for a bug to be “recall class” worthy?  Well, that’s tough and invariably comes down to a case by case investigation of each issue that arises.  A good rule of thumb IMO would be “significantly affects a large number of customers in a negative way in a highly common scenario, with no way for the customer to work around the issue.”  To break it down a little, I would expect a recall class bug to have the following characteristics:

  1. affects many customers.  If only 1 out of 10,000 customers is affected by a bug, well, then it’s something that we can wait to fix.
  2. is a significant problem.  If we have a small colorization bug, then we don’t care if it’s affecting 99% of users since they can continue to be productive in the face of it.
  3. you can’t work around it.  Enough said.  If there is something quite simple that the user can do to get around this, then we can just add this to the “Readme” and still get it out there.
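
The three characteristics above can be read as a single yes/no check. Here is a minimal sketch in Python, not the team's actual triage tooling: the `Bug` record, its field names, and the 10% cutoff for "many customers" are all assumptions made up for illustration, since the real call is made case by case.

```python
from dataclasses import dataclass


@dataclass
class Bug:
    """Hypothetical triage record; every field name here is an assumption."""
    title: str
    affected_fraction: float   # estimated share of customers who hit the bug (0.0 to 1.0)
    blocks_productivity: bool  # True if users cannot keep working once they hit it
    simple_workaround: bool    # True if a short Readme note would get users past it


# Assumed cutoff for "affects many customers"; the post gives no number.
MANY_CUSTOMERS = 0.10


def is_recall_class(bug: Bug) -> bool:
    """Return True only when all three characteristics hold at once."""
    affects_many = bug.affected_fraction >= MANY_CUSTOMERS
    return affects_many and bug.blocks_productivity and not bug.simple_workaround


# The two counterexamples from the list, plus a made-up bug that meets all three.
colorization = Bug("small colorization bug", 0.99, False, False)
rare_crash = Bug("crash seen by 1 in 10,000 customers", 1 / 10_000, True, False)
blocker = Bug("hypothetical blocker with no workaround", 0.80, True, False)

for bug in (colorization, rare_crash, blocker):
    print(f"{bug.title}: recall class = {is_recall_class(bug)}")
```

Only the last bug passes, because the check is a strict AND: failing any one of the three characteristics is enough to keep a fix out.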

So, for example, the runtime being unable to load any DLL without dying, no matter what… that would be bad.   That’s an easy call.

The C# compiler producing unverifiable code in very common code… that would be bad.  Now, if you could make a small change (say a command line switch) to get around this, then it probably wouldn’t be “recall class”.  *But* if it required the user to work around the problem anywhere they used this “common code” then that would probably make it “recall class”.

Now, where it gets fuzzy is when it’s not clear whether all of the above criteria are met.  For example, take the following hypothetical bug:

“When typing the switch keyword there is a high probability of a hang in the IDE”  i.e. if you have:

class C {
    void Foo() {
        switch

    void Bar() {
    }
}

then VS completely hangs.

Well, it would certainly affect many customers.  The entire C# user base would be affected by this as it would make typing a common construct incredibly risky.  It would certainly be significant.  Hangs are up there with crashes and data loss as being the worst bugs in the product.  Even though we have automatic backups (every few minutes or so), it means that your current work is lost, not to mention the time wasted killing/restarting VS and figuring out what you need to redo.  Now… you can work around it.  You can avoid the “switch” construct, or you can do something like the following:

Start by typing in the body of the switch like so:

class C {
    void Foo() {
        {
        }

    void Bar() {
    }
}

Then add the switch keyword:

class C {
    void Foo() {
        switch
        {
        }

    void Bar() {
    }
}

and you’ll be fine.

So is this recall class?  It’s just not clear.  You could work around it… except that it’s not clear that users will figure out the workaround on their own.  And even if you do figure it out and you’re savvy enough to use it, you could still be hosed a few minutes later when you forget and just type “switch”.  So, one could argue that the workaround isn’t actually simple.  It’s not just a one-time fix that prevents the problem entirely; it’s a complete change to how a user would normally edit code, and one the user would have to vigilantly enforce on themselves.  Given that, it could probably be considered bad enough.

My desperate hope is that the C# IDE doesn’t have anything as bad as that show up while we’re doing tons of testing over beta2.  I just want this out the door and in your hands.  Now, it just remains to be seen if our code is good enough.


Comments (6)

  1. AndrewSeven says:

    "So is this recall class? It’s just not clear."

    Yes, absolutely.

    How could you be expected to be taken seriously if the system fails on something so trivial.

    Just imagine what they would say on slushpot err I mean slashdot.

  2. Andrew: So it would be worth recalling a beta (which is expected to have bugs), in this case?

    And just because something is trivial doesn’t mean it should necessarily get fixed either. As I mentioned, it’s a combination of factors. If we kept on fixing trivial bugs then we’d never get the beta done. *ever*

  3. Steve Hall says:

    For the above hypothetical example, I would expect such a "trivial" bug to be "prevalent". I.e., the MTTF (mean-time-to-finding it) would be hours after the beta gets posted on MSDN. And it probably wouldn’t be just a single user…

    I think the problem here is the terminology you used to categorize the problem: "trivial". I don’t think ANY user would classify such a problem using that word when they encountered it. (I can think of other more crass swear words though…) The risk assessment must be more objective than using flippant (and emotionally charged) words like "trivial". In short, your idea of trivial is not going to be another person’s definition. Stay away from using those kinds of words!

    In good problem tracking systems, there should always be two separate rankings: priority and severity. In fact, IBM codified a severity ranking decades ago that has been in use in thousands of data-centers for over 30 years: 1=Severe, 2=Serious, 3=Moderate, 4=Low. Most data-centers added a 5=Wish-list to account for problems that would be "nice to solve at some point in the future". (Note that IBM published very explicit definitions in the ’60s to ensure that all customers were in sync with them and these definitions are still in use for all their software and hardware problems.) The priority ranking would go from 1 to 4 or 5 to indicate the order in which problems were to be diagnosed and solved. Typically, priority 1 problems would be solved first, and in ascending severity order. E.g., priority1/sev1 problems solved first, priority1/sev2 problems next, etc.

    Given the above scheme, the above hypothetical problem would be ranked severity 2, since the app. completely fails (but can be circumvented). (If the app. failed repeatedly and there was no circumvention or work-around, then it would be severity 1. "Sev1" problems are those that are an "outage" and require vendor-intervention.)

    Essentially, the severity ranking quantifies the prevalence of the problems, i.e., how many customers are likely to encounter the problem and the resultant "outage".

    If for instance, the IDE were, instead of failing, to display an error and continue running, then that would be a severity 3 problem since the application recovered. Only if the IDE were to recover from the erroneous error state and resume running WITHOUT functional errors, could it be expected to be termed severity 4…which could be called "trivial".

    The above hypothetical error is certainly NOT severity 4 (what you’re calling "trivial").

    Now some have argued over the decades that severity should be broken into its two determinant components: prevalence and error-result. And some problem tracking systems have actually implemented that. But, experience has shown that a dual-ranking is sufficient for most problem domains. A tri-ranking problem domain introduces more complexity in arguments over whether the sort-key (order of problem solving) should be priority-prevalence-errorresult or priority-errorresult-prevalence.

    Now getting back to the notion of a beta recall, you need to decide what the pain-threshold is for back-tracking to fix embarrassing seemingly "trivial" bugs. This is why you should have a priority ranking system in place. Then it becomes simple to define a recall rule like this: Recall a beta if any priority1/sev1 bugs appear within the first 3 days after release. For the above hypothetical case, even though the bug could be classified as priority 1, because it’s a severity 2 problem, it wouldn’t trigger a recall.

    Note that I’ve written a half-dozen large-scale problem tracking systems (and SCCS’s that were integrated with them), and have observed such dual-ranking systems proven to satisfy the needs of large-scale software (and hardware) bug resolution. (I.e., both OS and computer hardware development.) The biggest problem encountered with problem tracking systems (even with ranks to help dictate the order of problem resolution) is the politics of fudging the rankings up or down to change the sort-order to satisfy whichever project manager or product manager is screaming the loudest. Having numeric ranks helps to keep all the stake-holders honest, since each rank is quantified. Problem tracking systems that use words for the ranks always end up with squabbling over what the words mean…just as I pointed out with your use of the word "trivial".

    Hope this helps! (Think objectively. Use dual-ranks. Use numeric ranks.)
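
The dual-ranking scheme and the recall rule described in the comment above can be sketched in a few lines. This is an illustration under stated assumptions, not any real tracker's API: the `Problem` record, its field names, and the sample backlog are hypothetical, and "within the first 3 days" is read here as a reported-on day of 3 or less.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Problem:
    """Hypothetical tracker entry; field names are assumptions."""
    title: str
    priority: int            # 1 = work on first
    severity: int            # 1 = Severe, 2 = Serious, 3 = Moderate, 4 = Low
    days_after_release: int  # day on which the bug was reported, counted from release


def work_order(problems):
    """Sort by priority first, then by severity, both ascending."""
    return sorted(problems, key=lambda p: (p.priority, p.severity))


def triggers_recall(problem, window_days=3):
    """Recall rule from the comment: a priority1/sev1 bug inside the window."""
    return (
        problem.priority == 1
        and problem.severity == 1
        and problem.days_after_release <= window_days
    )


# Sample backlog; the priorities and report days are made up for illustration.
backlog = [
    Problem("IDE shows an error but keeps running", 2, 3, 5),
    Problem("IDE hangs when typing switch", 1, 2, 1),
    Problem("hypothetical outage with no workaround", 1, 1, 2),
]

for p in work_order(backlog):
    print(f"priority{p.priority}/sev{p.severity} recall={triggers_recall(p)}: {p.title}")
```

Under this rule the switch hang sorts second and does not trigger a recall, which matches the comment's conclusion: it is priority 1 but only severity 2.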

  4. Jeff Atwood says:

    Your "hypothetical" examples are scaring me.

  5. AndrewSeven says:

    For a beta, that problem would be ok, after all it encourages you to use polymorphism instead of switches 😉

    I’m only using Beta1 at home, but so far I haven’t noticed anything near that level.

    Off topic : When you type a method name that doesn’t exist, there is a smart tag thingy that appears to generate a method stub.

    Is there a way to generate a property stub the same way?