What to do when the unexpected happens


Quite recently I made a pretty visible change to the C# language service for Beta2.  For many operations that you can perform in the IDE (like generating a method stub, or setting a named breakpoint) we will throw an exception in the case of failure.  These exceptions can happen for many reasons, especially as we call into many COM interfaces, any of which can return a failure HRESULT that we won’t know how to deal with.  At the root of every code path that can lead to an exception being thrown we have a catch for our root exception type so that we can pop up a message saying that we were unable to do what you requested.  While these exceptions are unexpected (as we believe they should be), we do not believe that it is wise for the entire application to be torn down just because we couldn’t do something like replace some text in the editor.
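The root-level catch described above can be sketched roughly as follows. This is only an illustration under assumptions, not the language service's actual code: `LanguageServiceException`, `RunCommand`, `ShowMessage` and `ShownMessages` are hypothetical names, and the "dialog" here just records the text it would have displayed.

```cpp
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical root exception type; stands in for the language service's own.
struct LanguageServiceException : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Hypothetical UI hook: records the message that would be shown to the user.
inline std::vector<std::string>& ShownMessages() {
    static std::vector<std::string> messages;
    return messages;
}
inline void ShowMessage(const std::string& text) { ShownMessages().push_back(text); }

// Root of a command's code path: run the operation and, if it throws the root
// exception type, tell the user and carry on instead of tearing down the app.
inline bool RunCommand(const std::string& name, const std::function<void()>& operation) {
    try {
        operation();
        return true;
    } catch (const LanguageServiceException& e) {
        ShowMessage("Unable to " + name + ": " + e.what());
        return false;
    }
}
```

Only the root exception type is caught here, so anything else still propagates; whether that matches the real code is an assumption.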

 

So what have we changed?  Well, starting with very recent builds we will now pop up a Watson dialog asking if you would like to send that information to us.  The dialog looks like the regular Watson dialog which you (hopefully) don’t know and love:

 

 

 

However, there are significant differences (though you probably won’t notice them since you’re so used to what this dialog normally means).  First of all, this isn’t a regular Watson dialog that indicates that the program terminated abnormally.  Rather, we are in a state where we know something has gone wrong but can recover just fine from the issue.  We try to indicate this by telling you that the error is recoverable and that no information has been lost.  Second of all, we will only ask you to do this once while VS is running.  This is so that we don’t annoy you every time we encounter a problem, and also to ensure that in pathological cases, after you dismiss this dialog, you don’t just get another one a second later because we’re in a loop that’s failing over and over again.
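The "ask only once while VS is running" rule can be sketched as a small guard. This is a minimal illustration, assuming a per-process flag; `OncePerSessionPrompt` and `ShouldPrompt` are hypothetical names, not part of the actual implementation.

```cpp
#include <atomic>

// Hypothetical guard: allows the report dialog at most once per process.
class OncePerSessionPrompt {
public:
    // Returns true only for the first caller. Every later call returns false,
    // so a failure repeating in a tight loop cannot re-open the dialog.
    bool ShouldPrompt() {
        return !already_prompted_.exchange(true);
    }

private:
    std::atomic<bool> already_prompted_{false};
};
```

Using an atomic exchange means the check stays correct even if failures are reported from more than one thread at the same time.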

 

So why are we doing this?  Well, as much as we’d like to think that QA and good dev work will prevent all errors, we know that that’s simply not the case.  Users will encounter errors and things won’t work and it’s just going to suck for them.  By having this information sent back to us during this beta we can see what real issues users are running into, and we will know which code we have to focus on, investigate, and determine why it’s failing.  How do we do that?  Well, when one of those exceptions is thrown, the information that gets sent to us is the call stack at the point where the error occurred.  The Watson system will collect those call stacks and make them available to us to investigate.  It will also do very convenient things like calculating how many times this issue has ever been hit, how many times it was hit with the same call stack, etc.  It will also let us send out a survey, so that if we need more info, the next batch of people who hit this will get asked if they can give us some extra information that might help.  For example, we might ask them “Are you using a source control system?  Did that system change the read-only status of the file outside the IDE?” and so on.
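The kind of counting described above (total hits, and hits with the same call stack) can be pictured with a small sketch. This is not how Watson itself is implemented; `CallStack` and `ReportBuckets` are hypothetical names, and a call stack is simplified to a list of frame names.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Simplified call stack: frame names, innermost first (an assumption).
using CallStack = std::vector<std::string>;

// Hypothetical tally of incoming reports, grouped by identical call stack.
class ReportBuckets {
public:
    void Record(const CallStack& stack) {
        ++total_;
        ++hits_by_stack_[stack];
    }

    // How many reports have been recorded in total.
    std::size_t TotalHits() const { return total_; }

    // How many reports arrived with exactly this call stack.
    std::size_t HitsFor(const CallStack& stack) const {
        auto it = hits_by_stack_.find(stack);
        return it == hits_by_stack_.end() ? 0 : it->second;
    }

private:
    std::size_t total_ = 0;
    std::map<CallStack, std::size_t> hits_by_stack_;
};
```

Sorting the buckets by hit count would then give a rough "what to look at first" ordering.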

 

By sending this information to us you let us start working on issues that are actually affecting you, and we can do that in order of importance (which we can determine by the quantity of cases, and what operation you’re being prevented from doing, as well as other factors).

 

So that’s all lead-up to the debate we’ve been having internally.  We felt that this was a safe action to take because when these exceptions are thrown we know for certain that something has gone wrong.  We asked someone to do something and they returned a failure which has caused us to abort the current action.  However, there’s another type of error that occurs inside the IDE.  Specifically with our VS2003 codebase (which didn’t use exceptions) we used a very common coding idiom:

 

if (FAILED(hr = SomeCall(…))) {
    VSFAILF("…");
    return hr;
}

 

Basically what we’re doing is emulating exceptions in a codebase that didn’t use them.  After every call we check an error value and if we can’t handle it we propagate it up.  We also assert (VSFAILF) so that if we’re in a debug build we’ll stop executing so we can attach a debugger.  Recently I’ve started switching that code to use a macro instead:

 

#define IfFailAssertReturnHR(expr) \
    if (FAILED(hr = (expr))) { \
        VSFAILF(""); \
        return hr; \
    }

 

This macro has many benefits over the preceding code, but I don’t really want to get into them now.  One benefit that could be added is that we could change the macro to include the line “Watson::ReportUnexpectedCondition()”, which would then proffer the dialog I showed earlier.
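A sketch of what the macro might look like with the reporting line added. Everything other than the macro name and `Watson::ReportUnexpectedCondition` is an assumption made so that the example compiles on its own: `HRESULT`, `S_OK`, `E_FAIL`, `FAILED` and `VSFAILF` are stand-ins for the real Windows and VS definitions, the hook only counts calls, and the `do { } while (0)` wrapper is an addition, not part of the original macro.

```cpp
#include <cstdint>

// Stand-ins so the sketch builds without the Windows SDK or VS headers.
using HRESULT = std::int32_t;
constexpr HRESULT S_OK = 0;
constexpr HRESULT E_FAIL = static_cast<HRESULT>(0x80004005u);
#define FAILED(hr) (((HRESULT)(hr)) < 0)
#define VSFAILF(msg) ((void)0) /* the real macro asserts in debug builds */

namespace Watson {
// Stand-in for the hook named in the post; here it only counts invocations.
inline int& ReportCount() {
    static int count = 0;
    return count;
}
inline void ReportUnexpectedCondition() { ++ReportCount(); }
}  // namespace Watson

#define IfFailAssertReturnHR(expr)               \
    do {                                         \
        if (FAILED(hr = (expr))) {               \
            VSFAILF("");                         \
            Watson::ReportUnexpectedCondition(); \
            return hr;                           \
        }                                        \
    } while (0)

// Hypothetical caller. The macro expects a local named hr to be in scope.
inline HRESULT ReplaceText(HRESULT editorResult) {
    HRESULT hr = S_OK;
    IfFailAssertReturnHR(editorResult);
    return hr;
}
```

Wrapping the body in `do { } while (0)` makes the macro behave like a single statement, so it can be followed by a semicolon inside an unbraced `if`/`else` without surprises.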

 

So why wouldn’t we do this?  Well, unfortunately, unlike the exception-based code base we have, where we know that an exception is something bad, these failure cases are not always unexpected.  For example, inside the IntelliSense engine one might be trying to iterate over the members of a type.  However, some of those members might be inaccessible to the caller because they’re private.  Unfortunately, to indicate that a member cannot be retrieved, an error HRESULT will get returned.  That error then gets bubbled up and might cause some other operation to fail.  This is rather unfortunate, but without an inordinate amount of code churn we won’t be able to fix up this code to deal with these different concepts, so we can’t change it.
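To make the two concepts concrete, here is a sketch of what separating an expected "member is inaccessible" result from a genuine failure could look like in new code. It is not the IntelliSense engine's code: `Member`, `GetMember`, `ListAccessibleMembers` and the error code `E_INACCESSIBLE_MEMBER` are all hypothetical, and `HRESULT` is a stand-in type.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Stand-ins so the sketch builds without the Windows SDK.
using HRESULT = std::int32_t;
constexpr HRESULT S_OK = 0;
constexpr HRESULT E_FAIL = static_cast<HRESULT>(0x80004005u);
// Hypothetical code meaning "member exists but the caller may not see it".
constexpr HRESULT E_INACCESSIBLE_MEMBER = static_cast<HRESULT>(0x80040001u);

inline bool Failed(HRESULT hr) { return hr < 0; }

// Hypothetical member record; lookup_result is what retrieving it returns.
struct Member {
    std::string name;
    HRESULT lookup_result;
};

inline HRESULT GetMember(const Member& member, std::string* name) {
    if (Failed(member.lookup_result)) return member.lookup_result;
    *name = member.name;
    return S_OK;
}

// Collects the names the caller can see. An inaccessible member is expected
// and skipped quietly; any other failure is unexpected and propagated.
inline HRESULT ListAccessibleMembers(const std::vector<Member>& members,
                                     std::vector<std::string>* names) {
    for (const Member& member : members) {
        std::string name;
        HRESULT hr = GetMember(member, &name);
        if (hr == E_INACCESSIBLE_MEMBER) continue;  // expected, not an error
        if (Failed(hr)) return hr;                  // genuinely unexpected
        names->push_back(name);
    }
    return S_OK;
}
```

The existing code does not make this distinction at the call sites, which is why a blanket assert-and-report macro would fire on the expected case too.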

 

So this means that 95% of the time when we get one of these asserts it’s a bad thing, but 5% of the time it’s actually ok and we’re just getting the assert because of poor architectural decisions.  That 5% is rather high in my mind, and it means that if you’re running the IDE (and thus performing millions of operations behind the scenes) you have a good chance of running into one of these benign cases.  If we then submit a report for these, we’re likely to miss an important error, and we’re also probably going to annoy you every time you run Beta2.

 

So what to do instead?  Well, Watson has the concept of a “queued report”.  Rather than interrupting you when the problem happens the information is logged to a queue.  Later on (usually after a reboot in my experience) you’ll get a message saying something like “The following applications experienced problems during execution.  Would you like to send the information blah blah blah…”  This would allow us to still report immediately about significant problems (which also has the added benefit of letting you know that while we were unable to do what you just asked we’ll now know about it and we’ll hopefully be able to do something) while getting valuable feedback about these other errors that people are experiencing that we are recovering from.
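The split between "report immediately" and "queue for later" can be sketched like this. It does not use the real Watson API; `Severity`, `ReportQueue`, `Submit` and `Drain` are hypothetical names, and the queue lives in memory only, whereas a real one would have to persist across restarts.

```cpp
#include <cstddef>
#include <deque>
#include <string>
#include <vector>

// Hypothetical classification of a failure.
enum class Severity { Recovered, Significant };

// Hypothetical queue: significant problems prompt right away, failures we
// recovered from are logged silently and offered to the user later.
class ReportQueue {
public:
    // Returns true if the user should be prompted immediately.
    bool Submit(Severity severity, const std::string& details) {
        if (severity == Severity::Significant) return true;
        pending_.push_back(details);
        return false;
    }

    // Called at some later point (for example the next startup): hands back
    // everything queued so far, oldest first, and empties the queue.
    std::vector<std::string> Drain() {
        std::vector<std::string> reports(pending_.begin(), pending_.end());
        pending_.clear();
        return reports;
    }

    std::size_t PendingCount() const { return pending_.size(); }

private:
    std::deque<std::string> pending_;
};
```

The design choice is that only `Significant` failures ever interrupt the user; everything else costs at most one batched prompt later on.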

 

Of course, like all things nowadays we’re incredibly wary about making changes (especially highly visible ones) so close to Beta2.  God forbid I have a bug in my code which submits this stuff to Watson and now instead of gracefully recovering from an exception like we normally do we crash 10% of the time.  So it’s a tough call between asking ourselves if the information gathered is worth the code risks, and also if the fundamental change in behavior will be ok with our users. 

 

Personally, I feel that it’s worth it.  We’re doing this for a beta and the point of a beta (in my mind) is to find out what needs work so that we can get everything up to extremely high quality for the final release.  We depend on users to help us with that by trying out the product and reporting back to us the problems they have.  But now we can make that process a lot easier by automatically collecting information that is incredibly valuable.  While exact repro steps are ideal for diagnosing and fixing issues, call stacks to problem areas are the next best thing.

 

How do you feel about this sort of thing happening to you while you use the next beta?  Also, how do you feel about this sort of thing happening while you use the final product?  Is it worth sending us the information so that we can better address issues that you’re running into?  Or is the intrusiveness just far too much to handle?  What other feelings or concerns do you think exist with this?


Comments (30)

  1. Matt says:

    Awesome stuff, we use exactly the same style of error reporting in our .NET application – we have the ability to report a caught exception or an unexpected condition (it’s also hooked up to the top level exception handlers). We can either prompt the user to enter a description of what they were doing at the time (they usually just type particular four letter words), or just submit the error silently. The error reports include callstacks, product versions, system and environment information (memory, GDI resources, disk space, etc), as well as a list of recently executed SQL commands, keystrokes and control navigation, and a number of other pieces of data relevant to the application.

    If what you’re proposing means that an error that occurs while we’re developing can get reported to Microsoft and as a result gets fixed, I say bring it on – our experience as an end user will be all the better for it in the long run.

    Besides, we do the same thing to our users, so it would be hypocritical of us to object to Microsoft doing the same to us 🙂

  2. I like the idea of using Watson infrastructure to report non-critical problems. I would even tolerate it in a shipping product if it’s not very intrusive (e.g. there could be an "automatically send any reports and don’t ask me again" option or something).

    Maybe the OS Watson should also start reporting recoverable problems? I could see something like this:

    Application XYZ is impacting performance by generating too many page faults/using lots of memory/etc. Do you want to submit an error report?

  3. Adam Young says:

    It might confuse users… I’m used to the app being closed and my unsaved work being lost when I see this dialog; but the behaviour in this case will be different to what I expect. I’d recommend changing the dialog, rewording it to make it clearer… if this is going to be used elsewhere in VS for reporting purposes, the difference between the Watsons should be visually obvious (i.e. 1 means, the app is closing, your work is lost, the other means the app is ok, your work is ok, but please report the issue). Could this not be incorporated into customer feedback in some way…

    Overall I think it’s a good idea in principle, but MS don’t tend to ship service packs for the VS.net 1.x versions; back in the day, each VS release prior to .net would be periodically service packed. This was a Good Thing. Then VS .net came along in 2001 & the practise of patching bugs stopped (well, apart from VS 1.1, which was really just a service pack, although marketed as a new version). So, we’ve had two versions (or perhaps 1 and a half versions) in around 4 years. I’ve used VS for serious dev from beta1 (Dec 2000), so it probably seems a lot longer to me than it should.

    What I’m trying to say is, great, take diagnostic info to improve the product, but if you only ship a version once every year / 2 years then it’s pretty much useless to me – I still have to live with the bugs!

  4. Luc Cluitmans says:

    I agree with all points made by Adam.

    In particular, make the dialog clearly different from the ‘real’ watson dialog, the example image looks too much like the normal ‘crash dialog’, also for seasoned developers.

    I don’t think you should use the ‘queued report’. It will be almost impossible for someone to remember what the heck they were doing, unless asked straight away. And, as a ‘drawback’ of the stability of windows these days, who reboots their computers regularly anymore? I think I reboot my main development machine at the office at most once per three weeks or so. Getting a message about sending an error report for something that happened two weeks ago is, to put it mildly, confusing.

  5. Alex says:

    I think for a regular application non critical error reporting would be too in your face, but for Visual Studio the target audience is very different and this sort of scheme works well. I recently experienced a crash with Lutz Roeder’s Reflector – it packaged the exception and asked if it could be sent with an optional user feedback comment (as to what I was doing). I sent the report and was shortly after contacted about the issue; we managed to quickly narrow it down to a 1.0 compiled plugin.

    To me it makes sense to have such a system, and while the level of interaction that I had with Reflector may not scale to Studio I think you can leverage the intrinsic knowledge of your userbase and they will appreciate the involvement (especially with an optional "don’t bother me again" style switch), especially if, as Adam states, there are actually fixes forthcoming (and not just hotfixes – they aren’t good enough).

    I would also agree on the issue of reports on reboots – I currently pretty much only reboot for Windows Updates and if I’ve got a lot of work on I’ll even wait a few days on that (with the reminder dialog in the corner of my screen). In fact I’d say the same about Studio itself often it will be left on for weeks at a time when I have a major project on.

  6. Samuel Jack says:

    I also agree that this dialog needs distinguishing from those that are displayed when a fatal error occurs.

    How about having a setting (in Tools->Options say) to

    (a) Report all non-fatal errors silently

    (b) Ask what I want to do about a particular error

    (c) Never ask and never report non-fatal errors.

  7. Laura T. says:

    Over all, diagnostics is a very good idea.

    What is missing is diagnostics response. I mean, when I submit an error report/dump/trace (whatever), I will never know (today) if it’s a known problem, there is a fix or workaround or something. This is something Dr. Watson/Mr.Hyde does not do.

    It would be more useful, if the reporting system could also give me some feedback, like "Yes we know this. Look this Q or download this fix" or "This a new one" (so you save the work to google it).

  8. Code Monkey says:

    I’m all for it, the long term benefits are obvious. In the short term, I would think having this in beta2 would be enormously helpful. Just have a couple good devs go over your code so everyone is comfortable you won’t blow up beta2. 🙂

  9. Philip Rieck says:

    Adam hit it on the head. Since no version of VS.NET has ever had a service pack regardless of known bugs (and hidden hotfixes), I’m not likely to hit "send" on this dialog.

    Why not (when I would love to report errors to make products better)? If a bug causes vs to pop this up, I have two choices. 1) Hit send, wait 10-60 seconds, hit okay, lose "the zone"… or 2) hit "don’t send", keep working immediately.

    If I had any hope that doing #1 would cause the bug to stop six months down the road, I’d do it. But why waste one minute times thousands of users for no reason?

  10. Won’t it dramatically affect the perceived quality of the product to constantly inform the user that something went wrong?

  11. Philip Rieck says:

    On my previous – clearly this applies only to production.

    On a beta, the more ways and times I can send you fault information, the better. I love the "queued report" thing, but for beta2 I wouldn’t even care on that… I’d always send.

    While I wouldn’t want you to crash in an instance where you used to gracefully and silently fail, I’d rather that 10% bump (during the beta) than have you guys never know about a scenario that fails, since that scenario may become somewhat common down the road. That’s what a beta is for, in my opinion. I know that perception of quality and stability is important, even at beta2, but quality of RTM is much more important to me (especially since it may never be patched 🙂 )

  12. Lots of things to respond to. I’ll start with Paul.

    Paul: "Won’t it dramatically affect the perceived quality of the product to constantly inform the user that something went wrong? "

    In return i might ask you: "won’t it dramatically affect the perceived quality of the product to constantly fail to do something that you asked us to do?"

    Should we look professional while acting incompetent, or should we recognize our shortcomings and make a best effort to try to fix them?

  13. Pavel: "I like the idea of using Watson infrastructure to report non-critical problems. I would even tolerate it in a shipping product if it’s not very intrusive (e.g. there could be a "automatically send any reports and don’t ask me again" option or something). "

    Those options do exist in my implementation, but we haven’t exposed them (and most likely won’t for beta2). UI changes are very difficult to make considering all the issues like localization and accessibility that have to go into them. However, i will see if something like this can be done for the release.

  14. Adam: "It might confuse users… I’m used to the app being closed and my unsaved work being lost when I see this dialog; but the behaviour in this case will be different to what I expect. I’d recommend changing the dialog, rewording it to make it clearer… if this is going to be used elsewhere in VS for reporting purposes, the difference between the Watsons should be visually obvious (i.e. 1 means, the app is closing, your work is lost, the other means the app is ok, your work is ok, but please report the issue). Could this not be incorporated into customer feedback in some way… "

    I agree with you about the confusion. This dialog is owned by the Watson team, so the only customization that can be done is through the hooks that they give us. I’ll see if i can make enough changes so that it appears quite different from the regular crash dialog.

    "Overall I think it’s a good idea in principle, but MS don’t tend to ship service packs for the VS.net 1.x versions; back in the day, each VS release prior to .net would be periodically service packed. This was a Good Thing. Then VS .net came along in 2001 & the practise of patching bugs stopped (well, apart from VS 1.1, which was really just a service pack, although marketed as a new version). So, we’ve had two versions (or perhaps 1 and a half versions) in around 4 years. I’ve used VS for serious dev from beta1 (Dec 2000), so it probably seems a lot longer to me than it should.

    What I’m trying to say is, great, take diagnostic info to improve the product, but if you only ship a version once every year / 2 years then it’s pretty much useless to me – I still have to live with the bugs! "

    Understood. I’m not involved in the patching process for VS so i’m not sure why things have been so different for 2002 and 2003. You rightly perceived that 2003 was really a big service pack for 2002 (which is why it only cost something like $10 to get). I’m hoping we do better for 2005.

  15. Luc: "I don’t think you should use the ‘queued report’. It will be almost impossible for someone to remember what the heck they were doing, unless asked straight away."

    Luc, if we need info from the user then we can tell watson to not queue it up and instead ask the user for data. It’s only when we don’t need any extra info that it would be immediately queued.

    "And, as a ‘drawback’ of the stability of windows these days, who reboots their computers regularly anymore? I think I reboot my main development machine at the office at most once per three weeks or so. Getting a message about sending an error report for something that happened two weeks ago is, to put it mildly, confusing. "

    True, but is that really such a negative?

  16. Alex: Thanks for the observation about our expected userbase. I agree that the VS audience is probably the best to understand and work with us to work out these kinks. It’s nice when you’re a developer writing software for developers.

    Also, the anecdote about Reflector was very encouraging.

  17. Laura: "Over all, diagnostics is a very good idea.

    What is missing is diagnostics response. I mean, when I submit an error report/dump/trace (whatever), I will never know (today) if it’s a known problem, there is a fix or workaround or something. This is something Dr. Watson/Mr.Hyde does not do.

    It would be more useful, if the reporting system could also give me some feedback, like "Yes we know this. Look this Q or download this fix" or "This a new one" (so you save the work to google it). "

    That’s an excellent suggestion. I’ll send that to the Ladybug team (msdn/ProductFeedback). It might be nice, if we do ship service packs, to then list the bugs that have been fixed since the previous release.

  18. Philip: "If I had any hope that doing #1 would cause the bug to stop six months down the road, I’d do it. But why waste one minute times thousands of users for no reason? "

    So true.

    I’m doing this so that we will be able to fix problems. If that doesn’t happen i’ll be very unhappy. At the very least i know we’ll be fixing issues from beta2, and i hope that we do that as well for the released product.

  19. Sean Chase says:

    Personally, I have no problem sending error info back to the product group if it results in better products and service packs. This is true for betas or RTMs. Not only do I want VS to work very well for me, but also the MS Anti-spyware beta and the MSN Desktop Search beta. In short: I’m responding well to the "help me, help you" request. 🙂

  20. Cyrus,

    I think there is a better way to frame it than error reporting that’s all. If everytime I opened VS2005 you showed me a list of errors and said these things broke, I might get the impression that the product is a piece of crap that is broken every time I run it.

    Now, I should say I, because I personally love the tool, understand how it works and know for a fact quality matters to the people who are working on it.

    I’m just throwing it out there because it seems like you guys are making an effort with the low end SKUs to attach VS2005 to markets that compete with IntelliJ and Eclipse.

    Perception is reality, so while everyone who loves Microsoft might see the error as fixing obscure bugs in a high quality product, those taking a peek at the tools who have been writing Java for the last 10 years might think it’s just another piece of buggy Microsoft software.

  21. Paul: A couple of things should temper this opinion. First of all, we’re making this change for the beta, and not necessarily for the final release. Second of all, when i’ve used products like Eclipse and Netbeans I’ve experienced times when an exception was thrown and caught, and a top level dialog popped up to inform me about it.

    So if there are people who delight in finding out that beta products have bugs in them… well i guess that’s just unfortunate. In the meantime i think we owe it to our user base to not provide the *appearance* of a solid product but to actually provide a solid product instead.

    I think it’s worth the risk.

    We’ll re-evaluate this for the final release when the audience becomes much more widespread.

  22. I think this is a mistake. Don’t put a debug button there and don’t make it in any way/shape/form similar to Dr. Watson. Reason: if I had seen this dialog w/o reading this blog, I would let VS send the error report and then I would promptly close VS. I would simply not trust it to be in a good state. If you can get user’s OK to send reports upfront, I wouldn’t even ask that.

  23. James Finnigan says:

    Please give me an option to always submit these for VS – then I won’t be bothered by the dialogs disrupting my thinking, I won’t perceive that the app is buggy, and I can know that the real bugs that I’m hitting are being looked at.

    It would be even better if I could know if my issue(s) had been looked at / fixed – then I would absolutely love this feature and really feel the improved quality enabled by it (like the Reflector case).

  24. James: The options for configuring this are specified in my later blog post on this topic. By setting the regkeys:

    HKCUSOFTWAREMICROSOFTVISUALSTUDIO8.0CSHARPEDITORWatson_SendExceptionReportWithoutPrompting – true

    HKCUSOFTWAREMICROSOFTVISUALSTUDIO8.0CSHARPEDITORWatson_SendFailureReportWithoutPrompting – true

    You will never be bothered while you’re in the middle of something.

  25. AT says:

    Hmm .. Will you trust me or nope – but I’ve just seen this dialog on my PC.

    It was not a happy experience for me.

    Actually I do not care that much about how it looks or what it is supposed to do.

    The biggest problem for me is that the IDE hangs for at least a minute (or two or three) before showing this dialog.

    And this is definitely Cyrus’s fault: "ModName: cslangsvc.dll" 😉

    I’ve just created a repro and will file a bug report on Product Feedback.

  26. AT: your product feedback was marked as spam. Can you send it to me through the contact link.

    And yes, we know that there will be a hang while the minidump is collected. We feel that that’s acceptable behavior during the beta given the enormous feedback it will give us about problems you are having.

  27. AT says:

    I’m sorry. It was me who marked it as spam to remove from database.

    I’ve clicked Submit twice and got two exactly the same reports on Product Feedback.

    Take a look on FDBK21069

  28. Kurt says:

    I think this is an important step in improving product quality, for both beta and production releases. I agree however with some comments that the dialog may give the impression that the program is still buggy when the user sees this notice. Even if it is only displayed once per program execution, they will get the impression that every time I use the program an error occurs. Some users may perform the same tasks using different methods, users may begin to some extent blaming themselves… it never does that when I do it, you must be doing something wrong. (Well perhaps not us developers as we’re not as dumb but if this idea is incorporated into other products like Office, which it probably should be eventually).

    I agree with the idea of persistent options where a user can specify to automatically send errors and/or never send errors. However, I would completely do away with the bulky dialog. In its place you might try a status bar icon that briefly displays a notification when it appears and then fades away, leaving the icon behind, much like MS Outlook does when it is having a problem connecting to the mail server. Error is a fairly strong word when the situation is recoverable; perhaps something more like the following would be less scary to users:

    "To improve product quality Microsoft is collecting information about the actions that have just been performed. Your preferences currently indicate that you would like to be notified when this occurs. Click here to learn more about improving product quality, provide additional feedback, or to change your preferences."

    The user can click on the tool tip window before it fades, or the icon afterwards, to show the modal dialog. This is less intrusive as they can completely ignore it if they like or go back to it when it is convenient for them. I especially hate popup dialogs that change the focus window while I am typing! A tool tip could easily be shown during a loop as well; it might stay on the screen longer than the normal interval but I don’t think anyone would really notice or care. This would make it less important to only show the error once. In fact I would like to know about each occurrence: if some action I performed did not produce the correct result (i.e. if I do a refactor extract method and the method is created but the parameters weren’t added) then I’d like to know that Microsoft knows this occurred. If I don’t see the message at all then I might assume that a manual report is necessary. I would also be able to click the tool tip and type additional information if I thought it might be useful, such as “this seems to only occur when the local variable starts with an underscore”, etc.

    – Kurt