More thoughts on error tolerance.


I’ve posted before about error tolerance and how I consider it a very important part of VC#.  This time I want to talk about some of the difficulties that arise when you go down that route.

I’m going to start with a bug we recently discovered and how error tolerance may have caused it to be overlooked.  Imagine you are writing:

using System;

class MyAttribute : Attribute {
    public MyAttribute(DateTime dateTime) { /* … */ }
}

[My(   // <-- typing here

We’re going to show you parameter help for the “MyAttribute” constructor right there.  In order to do that we first need to figure out what “My” is.  To do that we go to:

[

and we try to figure out everything that’s valid there.  Because it’s the beginning of an attribute, we know that only types that extend System.Attribute are valid there, and that if you’ve typed a name like “Foo” then we need to search for types named Foo or FooAttribute.  Once we’ve done that, we find all the valid attribute constructors and build up the parameter help tooltip that we want to show you.  Part of that process is figuring out what the text of the parameter help will look like.  You might think it’s always the same, but that’s not actually the case.  If you have an empty file and you start typing:

[My(

then you’ll see “My(System.DateTime dateTime)”.  If you have:

using System;

[My(

then you’ll see “My(DateTime dateTime)”; and if you have:

class System {
    [My(
}

then you’ll see “My(global::System.DateTime dateTime)”; and if you have:

using System;
using SomeOtherNamespaceThatIncludesDateTime;

[My(

then you’ll see “My(System.DateTime dateTime)”, because “DateTime” on its own would be ambiguous.  And so on.
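
The four cases above can be summed up in a small C++ sketch. It is only an illustration of the rule the examples imply, not the actual VC# implementation: Scope, SymbolTable and SimplestTypeName are hypothetical names, and the rule itself (use the short name only when exactly one imported namespace supplies it, and add global:: when the root namespace name is hidden by something in scope) is inferred from the examples.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical model of what is visible at the cursor (not the real VC# data structures).
struct Scope {
    std::vector<std::string> usings;      // namespaces imported with "using"
    std::set<std::string> shadowedNames;  // names in scope that hide a namespace, e.g. "class System"
};

// Toy symbol table: namespace -> names of the types declared in it.
using SymbolTable = std::map<std::string, std::set<std::string>>;

// Picks the shortest spelling of ns.name that still refers to the intended type.
std::string SimplestTypeName(const SymbolTable& symbols, const Scope& scope,
                             const std::string& ns, const std::string& name) {
    assert(!ns.empty() && !name.empty());

    int providers = 0;        // imported namespaces that declare a type called `name`
    bool nsImported = false;  // is the type's own namespace one of them?
    for (const std::string& imported : scope.usings) {
        auto it = symbols.find(imported);
        if (it != symbols.end() && it->second.count(name) > 0) {
            ++providers;
            if (imported == ns) nsImported = true;
        }
    }

    // Exactly one import supplies the name, and it is the right one: short name is enough.
    if (nsImported && providers == 1) return name;

    // The first segment of the namespace is hidden by a local declaration: qualify from the root.
    const std::string root = ns.substr(0, ns.find('.'));
    if (scope.shadowedNames.count(root) > 0) return "global::" + ns + "." + name;

    // Not imported, or ambiguous between several imports: use the fully qualified name.
    return ns + "." + name;
}
```

With an empty scope this yields System.DateTime, with only System imported it yields DateTime, and with a class named System in scope it yields global::System.DateTime.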

As you can see, the only difference is how we qualify the type of the constructor’s parameter.  We try to use the simplest type name possible to give you a clear and concise tooltip.  As it turns out, there was a small bug in this logic.  We had already built up the list of types that are valid after the ‘[’, and we mistakenly reused that list when trying to figure out the simplest type name for the parameter.  Because of that we weren’t even able to bind to the type “System.DateTime” (it doesn’t extend “Attribute”), and so figuring out the simplest type name wasn’t possible.  In the past we took a very error-tolerant route and just output the fully qualified name to the tooltip.  Unfortunately, that decision helped this bug go unnoticed.

The number of different IntelliSense features is huge, and when we talk about the “matrix” (how each feature interacts with every other feature) you can be staggered by how many different interactions there are to consider and test.  So in this case there probably wasn’t a specific test for this exact functionality, and even I never noticed that anything was wrong when I was using attributes (who really thinks anything is wrong when they see “System.DateTime” instead of “DateTime”?).  So how did this bug get caught?  Well, a little while back Kevin decided to change the logic for finding simplified type names a bit.  Instead of being error tolerant and choosing the fully qualified name when we can’t find the best type name, we now fail fast.  This failure tends to propagate up a while until it is caught, and it tends to manifest itself as very obviously broken functionality.  In this case, parameter help might be missing completely.  After making this change we received a large number of bug reports on many features, from “extract interface” to the “code definition window” (there are a heck of a lot of features that sit on top of this functionality).
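
The difference between the old tolerant behavior and the fail-fast change can be sketched in a few lines of C++. This is a hypothetical illustration rather than the VC# code: Mode, TryGetDisplayName and the bindable set are made-up names, and the real lookup is certainly more involved.

```cpp
#include <cassert>
#include <set>
#include <string>

enum class Mode { Tolerant, FailFast };

// Hypothetical helper. `bindable` stands in for the list of types the binder
// was handed. In the bug described above that list held only attribute types,
// so "System.DateTime" could never be found in it.
// Returns false when no display name could be produced.
bool TryGetDisplayName(const std::set<std::string>& bindable,
                       const std::string& fullName,
                       const std::string& simpleName,
                       Mode mode,
                       std::string* out) {
    assert(out != nullptr);
    if (bindable.count(fullName) > 0) {
        *out = simpleName;  // bound; this sketch assumes the short name is then usable
        return true;
    }
    if (mode == Mode::Tolerant) {
        *out = fullName;    // plausible-looking fallback that hides the binding failure
        return true;
    }
    return false;           // fail fast: the caller sees that binding failed
}
```

In tolerant mode the caller still gets a tooltip string and nobody learns that the wrong list was consulted; in fail-fast mode the same mistake comes back as a failure that the caller has to deal with.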

The question that I’m debating right now is how to balance error tolerance vs. ease of finding problems, and how we should ship VC#.  Should we ship with this error-intolerant behavior, where bugs are immediately apparent but sometimes functionality is completely broken?  That is great for finding problems, but if we don’t provide patches in a timely manner then we’re just making life painful for you.  Maybe we could use the SQM system to get notified about these problems so we can fix them.  However, SQM is extremely limited in the information it can send, so I don’t know if it will be able to help here.  For now I think that while we’re in extreme bug finding/fixing mode it’s very worthwhile to cut down on tolerance, and then as we get closer to shipping (but still have enough time to fully test everything) we should go back to a tolerant mode.

This is definitely something I think we need to focus on more when developing things in the future.  Rather than blindly swallowing errors in order to give the user a pretty decent experience, we should figure out how to get at that information (like Watson) so that we can fix the bugs as well.  Does anyone have any experience with this sort of thing?  How do you get the best of both worlds for your users?


Comments (6)

  1. Philip Rieck says:

    How can we answer your question without knowing what the service pack story will be for Visual Studio 2005? Well, we could guess based on VS2002 and VS2003. That is, no service packs.

    Sure there were several hotfixes – but these are not even close to being in the same league as service packs. So with this in mind, I say "don’t break my code editor, or you’ll find I purchase a different one".

    If that means ship with error tolerance that leads to a few obscure quirks, I’m fine with that. If it means ship with no error tolerance because you found all the bugs, great! But if you ship with no tolerance and lots of problems, I will personally write several very whiny blog entries. And you don’t want that.

    (Aside: Obviously, making actual service packs out of the numerous hotfixes would be a great thing to do too. I understand you probably don’t have any personal power over that, but you can go tap the head of the people who do. Do it! Make a difference. Or tell me who they are, and I’ll try.)

  2. David Levine says:

    My own preference is for you to have two builds – one that is tolerant of faults and is released to the outside world, and one that is intolerant and is used for your internal testing. I personally do not want my editing experience degraded to the extent you describe, but your own testing should be as rigorous as possible.

    You stated that you swallow errors – are you literally swallowing exceptions, or do you mean that you just discard error results? If you are swallowing exceptions then you can write code like this…

    try
    {
        // other code here…
    }
    catch (Exception ex)
    {
        Utils.SwallowException(ex, "Context info"); // a static method
    }

    In the builds released to the outside world SwallowException may log the data, discard it, etc. based on config settings – the default could be to ignore it. For your internal builds have it rethrow the exception to stop the code immediately. You can add parameters to fine tune your response in a particular code path.

    We use a similar approach and find that it works well – we found a number of errors and problems that were being hidden.

  3. David: Yes, we’re literally swallowing errors. This code is in the non-exception part of the code base so it looks more like:

    for (SomeLoop()) {
        if (FAILED(hr = BlahBlahBlah())) {
            continue;
        }
        // do stuff
    }

    So we’re "error tolerant" in the sense that the entire operation doesn’t fail if some small part fails. But we lost all information about that failure.

    Note: I do realize how that is pretty much exactly like a "catch(Exception e)" approach. It’s not how I want it to be and I want this to be better in the future. I’m just not sure how.

  4. Mark Levison says:

    What about using a logging tool (log4net, http://logging.apache.org/log4net/, or its equivalent) to log all the failures? Every so often it could ask permission to email the failures home. Even if you only logged from builds used internally, that would be an improvement.

  5. David Levine says:

    Yes, that’s similar to swallowing an exception but with none of the benefits of having a centralized catch handler.

    You could modify the FAILED macro so that on an error it reported it to a logging facility, and under extreme (i.e. rigorous) testing it could actually throw an exception, thereby halting your code dead in its tracks so that you could load a debugger in the full context of the faulting code sequence.

    That would leave the shipping code fault tolerant and still provide a mechanism for making the internal builds less fault tolerant so bugs can be detected earlier.

    At least this is easier than the "good old days" where I had to spit out numbers to the screen buffer – that was my trace output – to tell me which function the code was in when it died.
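
The idea raised in comment 5 (a FAILED-style check that reports every failure and, in strict internal builds, throws on the spot) could look roughly like the C++ sketch below. Everything in it is an assumption made for illustration: HResult is a stand-in for the Windows HRESULT type, and FailureLog, ReportIfFailed, REPORT_FAILED, ParseItem and ProcessAll are hypothetical names, not part of any real code base.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

typedef long HResult;  // stand-in for HRESULT: negative values mean failure

// Hypothetical failure sink. Shipping builds would leave throwOnFailure off;
// strict internal builds would turn it on.
struct FailureLog {
    bool throwOnFailure = false;
    std::vector<std::string> entries;
};

// Records a failed result and, in strict mode, throws so the problem surfaces
// at the point of failure. Returns true when hr was a failure.
inline bool ReportIfFailed(FailureLog& log, HResult hr,
                           const char* expr, const char* file, int line) {
    assert(expr != nullptr && file != nullptr);
    if (hr >= 0) return false;
    const std::string entry =
        std::string(file) + ":" + std::to_string(line) + ": " + expr + " failed";
    log.entries.push_back(entry);
    if (log.throwOnFailure) throw std::runtime_error(entry);
    return true;
}

// Used where a bare FAILED(...) check would otherwise swallow the error.
#define REPORT_FAILED(log, expr) \
    ReportIfFailed((log), (expr), #expr, __FILE__, __LINE__)

// Toy operation that fails for exactly one item.
inline HResult ParseItem(int i) { return i == 2 ? -1 : 0; }

// Same shape as the loop in comment 3: skip the item that failed, keep going.
inline int ProcessAll(FailureLog& log) {
    int processed = 0;
    for (int i = 0; i < 4; ++i) {
        if (REPORT_FAILED(log, ParseItem(i))) continue;
        ++processed;  // "do stuff"
    }
    return processed;
}
```

With throwOnFailure left off, ProcessAll still skips the failing item but leaves an entry in the log; with it turned on, the same failure throws where it happens instead of being silently skipped.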