Windows Azure Guidance – Failure recovery and data consistency – Part II

I had some great answers to the question in my previous post, like Simone’s. Some were closer than others, but in general you got it right. Thanks!

The recovery strategy depicted there assumes that all failures are external. That is, writing to a table fails, for example, and you have a chance to run the cleanup code. But what happens if your own code fails? Remember: the entire VM can go away at any point in time!

Note: as someone told me once, you should design for “unplug scenarios”. That is, at any point in time your system should be able to recover from someone unplugging your server.

For example, what happens if your VM evaporates just before it executes the SaveChanges:

[image]

Then you end up with some blobs, a few messages in a queue notifying a worker that image compression is needed, a record in the “Expense” table (the “write master” in the diagram above), but no details…
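To make the failure window concrete, here is a minimal in-memory sketch. It is not the real Azure storage code: `Store`, `save_expense`, `VmEvaporated` and the `crash_before_details` flag are hypothetical names, and the order of the first three writes is an assumption. The only thing taken from the scenario is that the details are written last.

```python
from dataclasses import dataclass, field


class VmEvaporated(Exception):
    """Stands in for the VM disappearing in the middle of a request."""


@dataclass
class Store:
    # In-memory stand-ins for blob storage, the queue and the two tables.
    blobs: dict = field(default_factory=dict)
    queue: list = field(default_factory=list)
    expenses: dict = field(default_factory=dict)  # the "write master"
    details: dict = field(default_factory=dict)   # expense id -> detail rows


def save_expense(store, expense_id, title, items, crash_before_details=False):
    # 1. Upload each receipt image and ask a worker to compress it.
    for i, item in enumerate(items):
        blob_name = f"{expense_id}/receipt-{i}"
        store.blobs[blob_name] = item["receipt"]
        store.queue.append({"action": "compress", "blob": blob_name})
    # 2. Write the master record.
    store.expenses[expense_id] = {"title": title}
    # 3. The VM goes away right here, before the details are saved.
    if crash_before_details:
        raise VmEvaporated()
    # 4. Write the details (the step that never runs in the failure case).
    store.details[expense_id] = [item["description"] for item in items]


store = Store()
try:
    save_expense(
        store,
        "e1",
        "Trip to Redmond",
        [{"description": "Taxi", "receipt": b"fake image bytes"}],
        crash_before_details=True,
    )
except VmEvaporated:
    pass  # a real crash gives you no chance to catch anything

print("e1" in store.expenses, "e1" in store.details)  # True False
```

The exception is only there so the sketch can be run. In the real scenario no handler executes at all, which is exactly why cleanup code inside the request cannot help.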

The additional problem is that the user can go back to the site and might even see the expense in the grid, but then when attempting to navigate to the details, guess what… nothing will be there.

The background process that looks for orphaned records might eventually pick up a “detail-less” Expense record and clean it up. But this is probably not a great solution.
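For reference, such a background sweep could look roughly like the sketch below. Everything in it is an assumption made for illustration: the dictionaries stand in for the tables and blob storage, the `created` timestamp and the `grace` period are hypothetical, and leftover queue messages are not handled. The grace period is there so the sweep does not delete an expense that is still in the middle of being saved.

```python
from datetime import datetime, timedelta, timezone


def find_orphaned_expenses(expenses, details, now, grace=timedelta(minutes=15)):
    """Return ids of expenses older than `grace` that have no details."""
    return [
        expense_id
        for expense_id, expense in expenses.items()
        if expense_id not in details and now - expense["created"] > grace
    ]


def clean_orphans(expenses, details, blobs, now, grace=timedelta(minutes=15)):
    """Delete orphaned expenses and their blobs; return the ids removed."""
    orphans = find_orphaned_expenses(expenses, details, now, grace)
    for expense_id in orphans:
        # Blob names are assumed to be prefixed with the expense id.
        for name in [n for n in blobs if n.startswith(f"{expense_id}/")]:
            del blobs[name]
        del expenses[expense_id]
    return orphans


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
expenses = {
    "e1": {"created": now - timedelta(hours=1)},    # old, no details: orphan
    "e2": {"created": now - timedelta(minutes=2)},  # may still be saving
    "e3": {"created": now - timedelta(hours=1)},    # old, but complete
}
details = {"e3": ["Taxi"]}
blobs = {"e1/receipt-0": b"...", "e3/receipt-0": b"..."}

removed = clean_orphans(expenses, details, blobs, now)
print(removed)           # ['e1']
print(sorted(expenses))  # ['e2', 'e3']
```

Note that until the sweep runs, the user can still see the broken expense in the grid, which is the weakness described above.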

A small change can greatly improve the user experience. Can you suggest one?