Windows Azure Guidance – Failure recovery – Part III (Small tweak, great benefits)

In the previous post, my question was about a small change in the code that would yield a big improvement. The answer is:
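A minimal sketch of the reordered save, assuming an Entity Framework-style context. The ExpensesDataContext, the receiptStorage helper, and the entity shapes are illustrative stand-ins, not necessarily the sample's actual types:

```csharp
public void SaveExpense(Expense expense)
{
    using (var db = new ExpensesDataContext())
    {
        // 1. Write the receipt images and the detail records first.
        foreach (var detail in expense.Details)
        {
            this.receiptStorage.AddReceipt(detail.Id.ToString(), detail.Receipt);
            db.ExpenseDetails.AddObject(detail.ToEntity(expense.Id));
        }

        db.SaveChanges();

        // 2. Write the "header" (master) record last. Note there is no
        //    try/catch anywhere in this method. Until this final
        //    SaveChanges succeeds, no query that starts from the header
        //    can ever surface a partially saved expense.
        db.Expenses.AddObject(expense.ToEntity());
        db.SaveChanges();
    }
}
```

Reversing the order is what makes the missing try/catch safe: the header write acts as the commit record for the whole expense.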


What changed?

  1. No try/catch
  2. We reversed the order of writes: first we write the details, then we write the “header” or “master” record for the expense.

If the last SaveChanges fails, there will be orphaned detail records and images, but the user will never see a partially saved expense report (only the exception) and would presumably re-enter it. In the meantime, a background process eventually cleans up the orphans. Simple and efficient.
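A sketch of what that cleanup could look like, run periodically from a worker role. The schema (CreatedOn, ExpenseId), the DeleteReceipt helper, and the one-hour age threshold are all assumptions for illustration, not taken from the sample:

```csharp
// Requires: using System; using System.Linq;  (illustrative types throughout)
public void CleanupOrphanedDetails()
{
    using (var db = new ExpensesDataContext())
    {
        // A detail row is an orphan if no header was ever written for it
        // and it is old enough that the original save cannot still be
        // in flight. The one-hour cutoff is an arbitrary assumption.
        var cutoff = DateTime.UtcNow.AddHours(-1);
        var orphans = db.ExpenseDetails
            .Where(d => d.CreatedOn < cutoff
                        && !db.Expenses.Any(e => e.Id == d.ExpenseId))
            .ToList();

        foreach (var orphan in orphans)
        {
            // Delete the image first, then the record; if we crash in
            // between, repeating this pass later is harmless.
            this.receiptStorage.DeleteReceipt(orphan.Id.ToString());
            db.ExpenseDetails.DeleteObject(orphan);
        }

        db.SaveChanges();
    }
}
```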

Comments (4)

  1. naiman3000 says:

    Why did you remove the try/catch? You can handle the exception and show a user-friendly message.

    Putting the header save in a nested try/catch could help you track which part crashed and show an appropriate error message.

    Or is it just that you want to show an exception to the user to warn him that something went wrong, and the "beautification" of your code is left as an exercise for the reader? 🙂

  2. Thanks Panagiotis,

    You can keep the try/catch. In a system like this, the repository is likely called through many layers of different components before control returns to the user, so the exception thrown at this level will be caught by someone before it reaches the user.

    The point of the blog post is that you can’t rely on your "catch" executing at all. The server might go away before it reaches that point. What do you do then?

    The code above leaves the system in a fairly consistent state, at least from the user’s perspective.

  3. Oh, I see your point now. You successfully described this as a "pull the plug" scenario. You can’t rely on your catch, but that’s the case in non-cloud solutions too: at any moment a running application can terminate ungracefully, leaving the system in a bad, "dirty" if you want, state.

    I’m not sure if such a scenario exists in Windows Azure. I mean, there is a 30-second notice before shutdown of your instances. A possible hardware failure would probably be foreseen before it happens, allowing the platform to provide those 30 seconds before bringing a new instance to life on some new hardware. If your application is running over the wire, there are some other points of failure too.

    Then again, having an "always, something can fail" approach is not bad, if properly applied. My concern is how to walk the thin line between an "overkill" approach, which can hurt productivity by being over-cautious, and being completely reckless.

    Thank you,