Why haven’t our Exception Management Practices evolved since the 1960s?

It is amazing to think how much has changed over the last decade or so – evolution of the cloud, broad adoption of natural user interfaces and general acceptance of agile development techniques – just to name a few. But one thing that causes me no end of frustration is that during this entire period of time we do not appear to have improved our user experiences when exceptions occur. In fact I would argue that they haven’t changed significantly since abends were incorporated into the IBM OS/360 in 1964.  

Before we talk about the solution let me give you some recent examples that have frustrated me. Just to ensure you don’t think I am trying to say Microsoft is any better / worse than anyone else I have cited a Microsoft example as well.  

- I recently purchased an iTunes card for my dad for Christmas. We entered the card ID exactly as it was printed on the card but kept receiving an invalid card error message. It didn’t say why it was invalid or what we should do to resolve the problem – it just said the ID was invalid. We went into the Apple store and asked the support person for help. First he tried to get us to call support instead, but finally under duress he started the process of getting us a replacement card. During that time I mentioned it was urgent because dad was returning to Australia – at which point he goes “aha” – that is the problem – US cards do not work with Australian iTunes accounts! Why didn’t the error inside iTunes provide this information?

- Inside Microsoft we have deployed DirectAccess – which is arguably one of the coolest security innovations in years – allowing us to connect to the Microsoft network remotely without having to use our Smart Cards. However, it doesn’t always work. And when it doesn’t work, this is the error you receive: “Direct Access Connectivity is not working” – nothing to help you understand what the problem is or how to resolve it. The problem is usually caused by recent security patches not having been installed – but not always.

So how do we as an industry resolve this? In my opinion there are two changes that have to occur:

1. Add a “User Exception Testing” phase to your development lifecycle – As part of code reviews during development we should investigate every single exception that is thrown and triage the user experience as we would any other user interaction. If the user can’t diagnose the error with the information provided then raise a bug. This would have resolved both of the above issues.


2. Incorporate links to “Live Troubleshooting Guidance” – In some cases there are legitimate reasons why an exception occurs, and it isn’t always possible to know every problem and solution during development, so a more flexible solution is required. A good example of how our Information Experience team is now resolving this problem is by allowing certain events from within the Event Viewer to link directly to the TechNet Wiki – for example, Event ID 100016 redirects to a wiki page where Microsoft support and community members can update the content as new problems and solutions are identified.
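
To make the two changes above a little more concrete, here is a minimal Python sketch of what I have in mind. Everything in it is an assumption for illustration only: the `UserFacingError` class, the `triage` check, the `redeem_card` function and the example.com URL are hypothetical names, not part of iTunes, DirectAccess or any Microsoft product. The idea is simply that an error carries what happened, what the user can do about it and a link to guidance that can be updated after release, and that a review step raises a bug whenever any of that is missing.

```python
class UserFacingError(Exception):
    """An error that carries enough context for the user to act on it."""

    def __init__(self, what_happened, how_to_fix="", help_url=""):
        super().__init__(what_happened)
        self.what_happened = what_happened
        self.how_to_fix = how_to_fix
        self.help_url = help_url

    def render(self):
        """Build the message that is actually shown to the user."""
        parts = [self.what_happened]
        if self.how_to_fix:
            parts.append("What to do: " + self.how_to_fix)
        if self.help_url:
            parts.append("Latest troubleshooting guidance: " + self.help_url)
        return "\n".join(parts)


def triage(error):
    """Return the user-experience bugs to raise for this error (empty if none)."""
    bugs = []
    if not error.how_to_fix:
        bugs.append("no remedy given")
    if not error.help_url:
        bugs.append("no link to live troubleshooting guidance")
    return bugs


def redeem_card(card_region, account_region):
    """Hypothetical redemption step that explains a region mismatch."""
    if card_region != account_region:
        raise UserFacingError(
            what_happened=(
                f"This card was issued for {card_region} but your account "
                f"is registered in {account_region}."
            ),
            how_to_fix="Use a card bought in the same country as your account.",
            # Hypothetical URL standing in for a page support staff can edit.
            help_url="https://example.com/help/card-region-mismatch",
        )
    return "redeemed"
```

A bare “the card ID is invalid” error would fail `triage` on both counts, while the region mismatch raised by `redeem_card("US", "AU")` passes because it tells the user what went wrong, what to do next and where to find current guidance.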

I would love to hear your thoughts on what needs to occur – plus feel free to use this blog as a way to rant about any specific exceptions that are driving you nuts at the moment.

Comments (5)

  1. Paul Smith says:

    I couldn't agree more with this post. Encountering an exception is bad enough – but for me to have to default to Google / Bing in an attempt to resolve the issue is unacceptable in this day and age.

  2. Oliver says:

    Agree completely – I just had the following error when I tried to sync my Windows phone with Outlook. Error 0x86000108. That's it. Nothing else…

  3. Jason Hogg says:

    Oliver – that one I can help with. My wife has a Windows Mobile 6.5 phone and it gets that error about every 6 months. From my experience it means there is some corrupt data in either email, calendar, notes or contacts. So what I do is isolate which of these has the corrupt data by removing them from the sync process. Then when you know which source contains the corrupt data – let's just say it is mail – go to Outlook and move all your mail (inbox and sent) into an archive folder. Then re-enable the data source on your phone and resync. It should work. Then on your computer move the archived emails back into the inbox and resync your phone. Should work…

  4. AlexMiguel says:

    Jason, interesting problem with Dad's iTunes:

    1) Have you thought about the tension between security requirements (to maintain internal secrets and avoid increasing the chance of system vulnerabilities) and the need to be user friendly while throwing error messages and exceptions? Are we amplifying the problem by not solving this tension first?

    2) Is there a need for capturing "richer error-context" or the environment configuration where the exception was thrown?

    3) How about a user engaged in spoofing, for whom more descriptive errors can help jeopardize the integrity of the system? In that case, should the application capture the user's intent before they are able to get more descriptive errors?

    4) How many errors are amplified by the lack of error handling for distributed and/or autonomous components?

    5) Additionally, do we need to leverage patterns for claims based authorization to handle 'what error should be reported across a series of complex or conflicting rules?'

  5. Jason Hogg says:

    Hey Alex –

    1 – One of the patterns our initial Security Patterns guide described was that of Exception Shielding (see msdn.microsoft.com/…/aa480591.aspx), which directly addresses this problem. The pattern was reprinted in the SOA Patterns Guide.

    2 – IMO it is about people designing software that focuses not on exposing exception information – but on helping the end user troubleshoot. If the troubleshooting guidance is to contact an administrator – then that should FAIL.

    3 – All systems should have formal threat models and risks associated with such scenarios should be analyzed. But from my perspective this is not the majority of the time.

    4 – 10 🙂

    5 – Potentially – but I am thinking about something even simpler. Just establish some formal practices (potentially even a sub-stage of User Acceptance testing) where you simulate every single possible exception and test the end user experience. We do this for all positive scenarios (how many commerce sites would not test whether a credit card can be validated?) but appear to ignore all negative scenarios. For anyone who has done Kano analysis, these exceptions are imo dissatisfiers and one of the major reasons frustrated people such as me give up on certain products.


