What to do when the unexpected happens

01/27/2005

Quite recently I made a pretty visible change to the C# language service for Beta2. For many operations that you can perform in the IDE (like generating a method stub, or setting a named breakpoint) we will throw an exception in the case of failure. These exceptions can happen for many reasons, especially as we call into many COM interfaces, any of which can return a failure HRESULT that we won’t know how to deal with. At the root of every code path that can lead to an exception being thrown we have a catch for our root exception type, so that we can pop up a message saying that we were unable to do what you requested. While these exceptions are unexpected (as we believe they should be), we do not believe that it is wise for the entire application to be torn down just because we couldn’t do something like replace some text in the editor.
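To make the "catch at the root" idea concrete, here is a minimal sketch of the pattern. The names LanguageServiceException, MessageSink and RunCommand are all made up for illustration and are not the real language service types; the sink just records the message so the sketch can run without any UI.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <string>

// Hypothetical root exception type (an assumption, not the real one).
struct LanguageServiceException : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Hypothetical stand-in for the IDE's message box. It only records the text.
struct MessageSink {
    std::string last_message;
    void Show(const std::string& text) { last_message = text; }
};

// Runs one user-initiated operation. A LanguageServiceException is caught
// here, at the root of the code path, so the failure is reported to the user
// and the application keeps running.
// Returns true if the operation completed, false if it was abandoned.
bool RunCommand(const std::function<void()>& operation, MessageSink& ui) {
    try {
        operation();
        return true;
    } catch (const LanguageServiceException& e) {
        ui.Show(std::string("Unable to complete the requested operation: ") + e.what());
        return false;
    }
}
```

Only the root exception type is caught here, so anything else still propagates. That seems like the safer default for a sketch, since it keeps truly unknown failures visible.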

So what have we changed? Well, starting with very recent builds we will now pop up a Watson dialog asking if you would send that information to us. The dialog looks like the regular Watson dialog which you (hopefully) don’t know and love:

However, there are significant differences (though you probably won’t notice them since you’re so used to what this dialog normally means). First of all, this isn’t a regular Watson dialog that indicates that the program terminated abnormally. Rather, we are in a state where we know something has gone wrong but we can recover just fine from the issue. We try to indicate this by telling you that the error is recoverable and that no information has been lost. Second of all, we will only ask you to do this once while VS is running. This is so that we don’t annoy you every time we encounter a problem, and also to ensure that in pathological cases you don’t dismiss this dialog only to get another one a second later because we’re in a loop that’s failing over and over again.
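The "only ask once while VS is running" rule can be sketched as a simple gate. This is just one plausible way to do it, not a description of the actual implementation, and OncePerSessionPrompt is a hypothetical name.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical gate: the first caller is allowed to show the dialog and every
// later caller in the same session is told not to.
class OncePerSessionPrompt {
public:
    // Returns true exactly once for the lifetime of this object, even if
    // several threads call it at the same time.
    bool ShouldPrompt() {
        // exchange() stores true and returns the previous value, so only the
        // first call sees false.
        return !already_prompted_.exchange(true);
    }

private:
    std::atomic<bool> already_prompted_{false};
};
```

With a gate like this, a loop that fails over and over again would ask for the dialog on every iteration but only get it on the first.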

So why are we doing this? Well, as much as we’d like to think that QA and good dev work will prevent all errors, we know that that’s simply not the case. Users will encounter errors and things won’t work and it’s just going to suck for them. By having this information sent back to us during this beta we can see what real issues users are running into, and we’ll know which code we have to focus on and really investigate to determine why it’s failing. How do we do that? Well, when one of those exceptions is thrown, the information that gets sent to us is the callstack at the point where the error occurred. The Watson system will collect those callstacks and make them available to us to investigate. It will also do very convenient things like calculating how many times this issue has been hit ever, how many times it was hit with the same callstack, etc. It will also let us send out a survey, so that if we need some more info then the next batch of people who hit this will get asked if they can give us some extra information that might help us with this. For example, we might ask them “are you using a source control system? Did that system change the readonly status of the file outside the IDE? Etc.”
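As a rough illustration of the counting described above (total hits, and hits with the same callstack), here is a tiny tally keyed by callstack. This says nothing about how Watson actually stores or groups reports; ReportTally, CallStack and the frame names in the usage below are invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// A callstack modelled as a list of frame names, innermost frame first.
using CallStack = std::vector<std::string>;

// Hypothetical tally: counts every report, and separately counts reports that
// arrived with an identical callstack.
class ReportTally {
public:
    void Record(const CallStack& stack) {
        ++total_;
        ++per_stack_[stack];
    }

    std::size_t TotalHits() const { return total_; }

    std::size_t HitsFor(const CallStack& stack) const {
        auto it = per_stack_.find(stack);
        return it == per_stack_.end() ? 0 : it->second;
    }

private:
    std::size_t total_ = 0;
    std::map<CallStack, std::size_t> per_stack_;
};
```

Sorting the per-callstack counts would give a crude version of the "quantity of cases" ordering.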

By sending this information to us you let us start working on the issues that are actually affecting you, and we can do that in order of importance (which we can determine from the quantity of cases, what operation you’re being prevented from doing, and other factors).

So that’s all lead-up to the debate we’ve been having internally. We felt that this was a safe action to take because when these exceptions are thrown we know for certain that something has gone wrong. We asked someone to do something and they returned a failure which has caused us to abort the current action. However, there’s another type of error that occurs inside the IDE. Specifically, in our VS2003 codebase (which didn’t use exceptions) we used a very common coding idiom:

if (FAILED(hr = SomeCall(…))) {
    VSFAILF("…");
    return hr;
}

Basically what we’re doing is emulating exceptions in a codebase that didn’t use them. After every call we check an error value and if we can’t handle it we propagate it up. We also assert (VSFAILF) so that if we’re in a debug build we’ll stop executing so we can attach a debugger. Recently I’ve started switching that code to use a macro instead:

#define IfFailAssertReturnHR(expr) \
    if (FAILED(hr = (expr))) { \
        VSFAILF(""); \
        return hr; \
    }

This macro has many benefits over the preceding code, but I don’t really want to get into them now. One benefit that could be added is that we could change the macro to include the line “Watson::ReportUnexpectedCondition()”, which would then proffer the dialog I showed earlier.
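Here is roughly what that change could look like. To keep the sketch compilable outside of Windows, HRESULT, FAILED and VSFAILF are replaced with simple stand-ins, and Watson::ReportUnexpectedCondition() is a stub that only counts calls. Step and DoTwoSteps are made-up helpers. I’ve also wrapped the body in do { } while (0), which the macro above doesn’t do, so that it behaves like a single statement.

```cpp
#include <cassert>
#include <cstdint>

// Stand-ins for the Windows definitions (assumptions for this sketch only).
typedef std::int32_t HRESULT;
#define FAILED(hr) ((hr) < 0)
const HRESULT kOk = 0;
const HRESULT kFailure = -1;  // any negative value counts as a failure here

// Stand-in for the debug assert: counts instead of breaking into a debugger.
static int g_assert_count = 0;
#define VSFAILF(msg) (++g_assert_count)

// Stub for the reporting call named above; it only counts how often it runs.
namespace Watson {
static int report_count = 0;
inline void ReportUnexpectedCondition() { ++report_count; }
}

// Expects a local variable named hr in the calling function.
#define IfFailAssertReturnHR(expr)                \
    do {                                          \
        if (FAILED(hr = (expr))) {                \
            VSFAILF("");                          \
            Watson::ReportUnexpectedCondition();  \
            return hr;                            \
        }                                         \
    } while (0)

// Hypothetical operation that either succeeds or fails on request.
HRESULT Step(bool succeed) { return succeed ? kOk : kFailure; }

// The first failing step asserts, reports, and returns its HRESULT.
HRESULT DoTwoSteps(bool first_ok, bool second_ok) {
    HRESULT hr = kOk;
    IfFailAssertReturnHR(Step(first_ok));
    IfFailAssertReturnHR(Step(second_ok));
    return hr;
}
```

Every failure that passes through the macro produces exactly one report, whether or not the failure was really unexpected.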

So why wouldn’t we do this? Well, unfortunately, unlike our exception-based code where we know that an exception is something bad, these failure cases are not always unexpected. For example, inside the IntelliSense engine one might be trying to iterate over the members of a type. However, some of those members might be inaccessible to the caller because they’re private. Unfortunately, to indicate that a member cannot be retrieved, an error HRESULT will get returned. That error then gets bubbled up and might cause some other operation to fail. This is rather unfortunate, but we can’t fix up this code to distinguish between these different concepts without an inordinate amount of code churn.
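A toy version of that situation, to show how an expected outcome ends up looking like a failure. None of this is the real IntelliSense code: Member, GetMemberName, ListMembers and the kInaccessible value are invented, and HRESULT is again a portable stand-in.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

typedef std::int32_t HRESULT;              // stand-in for the Windows type
inline bool Failed(HRESULT hr) { return hr < 0; }
const HRESULT kOk = 0;
const HRESULT kInaccessible = -1;          // hypothetical "member is private" code

struct Member {
    std::string name;
    bool is_private;
};

// A private member is a perfectly normal thing to encounter, but the only way
// this function can say "you can't have it" is a failure HRESULT.
HRESULT GetMemberName(const Member& m, std::string* out) {
    if (m.is_private) return kInaccessible;
    *out = m.name;
    return kOk;
}

// Propagate-everything style: the first failure aborts the whole listing.
HRESULT ListMembers(const std::vector<Member>& type, std::vector<std::string>* names) {
    for (const Member& m : type) {
        std::string name;
        HRESULT hr = GetMemberName(m, &name);
        if (Failed(hr)) return hr;         // this is where the assert would fire
        names->push_back(name);
    }
    return kOk;
}
```

From the caller’s side there’s no difference between this benign failure and a real one, and that’s exactly what makes reporting from the macro noisy.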

So this means that 95% of the time when we get one of these asserts it’s a bad thing, but 5% of the time it’s actually ok and we’re just getting the assert because of poor architectural decisions. That 5% is rather high in my mind, and it means that if you’re running the IDE (and thus performing millions of operations behind the scenes) you have a good chance of running into one of these benign failures. If we submitted a report for each of them, we’d be likely to miss an important error in the noise, and we’d also probably annoy you every time you ran Beta2.

So what to do instead? Well, Watson has the concept of a “queued report”. Rather than interrupting you when the problem happens the information is logged to a queue. Later on (usually after a reboot in my experience) you’ll get a message saying something like “The following applications experienced problems during execution. Would you like to send the information blah blah blah…” This would allow us to still report immediately about significant problems (which also has the added benefit of letting you know that while we were unable to do what you just asked we’ll now know about it and we’ll hopefully be able to do something) while getting valuable feedback about these other errors that people are experiencing that we are recovering from.
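The shape of the idea is just "log now, ask later". Here is a minimal sketch with an in-memory queue. The real Watson queue is not something I’m describing here, and ReportQueue is a hypothetical name.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical queue of reports for problems we recovered from.
class ReportQueue {
public:
    // Called at the point of failure. Nothing is shown to the user.
    void Enqueue(std::string description) {
        pending_.push_back(std::move(description));
    }

    // Called later, at a quiet moment. Hands back everything collected so far
    // and leaves the queue empty, so one prompt can cover all of it.
    std::vector<std::string> Drain() {
        std::vector<std::string> out;
        out.swap(pending_);
        return out;
    }

    bool Empty() const { return pending_.empty(); }

private:
    std::vector<std::string> pending_;
};
```

The immediate dialog would stay for the exception case, and something like Enqueue would be what the HRESULT macro calls instead.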

Of course, like all things nowadays we’re incredibly wary about making changes (especially highly visible ones) so close to Beta2. God forbid I have a bug in my code which submits this stuff to Watson and now instead of gracefully recovering from an exception like we normally do we crash 10% of the time. So it’s a tough call between asking ourselves if the information gathered is worth the code risks, and also if the fundamental change in behavior will be ok with our users.

Personally, I feel that it’s worth it. We’re doing this for a beta, and the point of a beta (in my mind) is to find out what needs work so that we can get everything up to extremely high quality for the final release. We depend on users to help us with that by trying out the product and reporting back to us the problems they have. But now we can make that process a lot easier by automatically collecting information that is incredibly valuable. While exact repro steps are ideal for diagnosing and fixing issues, callstacks to problem areas are the next best thing.

How do you feel about this sort of thing happening to you while you use the next beta? Also, how do you feel about this sort of thing happening while you use the final product? Is it worth sending us the information so that we can better address issues that you’re running into? Or is the intrusiveness just far too much to handle? What other feelings or concerns do you think exist with this?
