The Windows Azure platform is a solid, highly available, highly scalable environment but, like any system (on-premise or in the cloud), it carries risks that could threaten the desired operation of your app (what I refer to herein as the 'normal scenario'). In this article, I discuss: the idea that failures can be categorised according to severity, and that many of the effects associated with them are avoidable; the various failure scenarios you might need to deal with (on any cloud platform, but specifically Windows Azure); how to understand and define risk; and why it is important that you design for these failures and tolerate them in your cloud solutions.
First though, allow me to highlight my reasons for writing this article:
- Developers tend to polarise when it comes to cloud services: you either love them, or you hate them (admittedly, there are also plenty on the fence). Of the advocates, many feel that the cloud is invincible and resistant to all failures, effectively being permanently available save for an Act of God. The reality is that the cloud, like any system, is subject to failure. But while a failure in many on-premise data centres is a major event, in the cloud (and certainly on the Windows Azure platform) failure isn't such a big deal: in fact it's expected and anticipated, and you should be pleased about that (I'll show you why);
- In my experience, many developers tend to focus on mitigating the risk of downtime caused by failure by adding more servers and some load balancing: they haven't actually dealt with the failure scenario at all;
- I have observed a general trend for developers to metaphorically 'give up' when they encounter a condition that they would not normally expect: for example, when a remote database server is unavailable, an exception is simply thrown.
The idea behind this article is that failure is generally OK: what we try to avoid is the manifestation of individual risks becoming a problem. Risk avoidance is absolutely essential to cloud development and, to borrow an old phrase, sometimes you have to 'roll with the punches'. Accept that your cloud application might fail at some point and in some way, and that it's your job to complement the 99.95% availability guarantee with strategies that help you quantify and mitigate the risks beyond the up-time guarantee.
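To put the 99.95% figure into perspective, here is a rough sketch in Python that converts an availability percentage into the downtime it still permits. The 30-day and 365-day periods are my own assumptions for illustration; how the guarantee is actually measured is defined by the service agreement, not by this snippet.

```python
def allowed_downtime_minutes(availability_percent: float, period_days: float) -> float:
    """Return the minutes of downtime an availability guarantee still permits."""
    period_minutes = period_days * 24 * 60
    return period_minutes * (1 - availability_percent / 100)


# Assumed measurement periods: a 30-day month and a 365-day year.
monthly = allowed_downtime_minutes(99.95, 30)
yearly = allowed_downtime_minutes(99.95, 365)

print(f"{monthly:.1f} minutes per 30-day month")  # 21.6 minutes
print(f"{yearly / 60:.2f} hours per 365-day year")  # 4.38 hours
```

In other words, even when the guarantee is met in full, your app could still be unavailable for over four hours in a year, which is the gap your own strategies need to cover.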
Building highly available, highly scalable applications in the cloud requires us to embrace this principle: to understand that we will encounter failures and that many of them can often be transparently handled until normal operation is restored.
Finally, before we continue I want to make an observation:
When you deploy to Windows Azure, you are asked to choose which data centre you want to deploy to. Note that this is singular: you are selecting one data centre. Thus, although the likelihood of a data centre failing is very slim, you do in theory have a single point of failure. You can absolutely mitigate this risk (see the foot of this article if you require pointers on how to mitigate risk above the data centre level). However, in this article I am capping the discussion at the failure of a single data centre: we accept that the risk of a data centre failure is minimal, and that is enough for us here. I appreciate that for others this is not acceptable and multiple failover options are required. This scenario is possible on Windows Azure; I point out where to start at the foot of this article, and may cover how to do it in depth in a later article.
Let us begin with a quick discussion about failure: what it means in this context and how it applies to you on Windows Azure.
Stuff goes wrong: accept it!
The basic principle of failure states that at some point:
- the hardware your code runs on will fail (disk, RAM and power failures are all relatively common over the 2-3 year lifetime of a server that is running 24 hours per day, 365 days per year);
- a remote service your app is dependent upon may not be available.
There is just no such thing as a mechanical item that isn't subject to failure.
In this article I am excluding from the scope any errors and/or risks caused by poorly written application code.
Let's get started with the exercise of defining what the risks are to your deployment on the Windows Azure platform.
Identifying the risks and understanding & categorising the effects
"Risk is the potential that a chosen action or activity (including the choice of inaction) will lead to a loss (an undesirable outcome). The notion implies that a choice having an influence on the outcome exists (or existed). Potential losses themselves may also be called risks".1
This definition hints at two things: that there is the potential for risk in any situation, and that the outcome of any given situation may be influenced in some way (otherwise, it is a certainty) so as to lessen the effect or prevent it from being noticeable. In this section, we will identify what the risks are and what the effect of each risk is once it manifests.
Integral to your deployment on Windows Azure should be an understanding of:
- What the risks are;
- What steps can be taken to mitigate the effect of the risk surfacing;
- What category the effect falls within.
For example, when you buy a car, you know that there is a risk that it might get damaged, either by you (racing around again!) or by another road user. Assuming you're a law-abiding citizen, you'll buy insurance to mitigate the risk of damage to your car, or somebody else's. But within your policy document will be a list of expectations around what happens when your car is damaged: you'll be told how long your car will be unavailable, whether you'll have the use of a rental car, etc.
It is the same for deployments on Windows Azure, except this time we're not talking about the effects to your car, rather the effect to your business caused by risk actually becoming a reality (or, 'surfacing').
I've often found that the effects of the risk (the effect the risk has on your app once it has manifested) can generally be categorised according to the following scale (in order of descending severity):
- Catastrophic: there is nothing that can be done to mitigate the effect to normal operation;
- Fault: with careful planning and development work, a suitable mitigation can be automatically implemented to prevent the effect from surfacing;
- Avoidable: the effect can be avoided with a trivial amount of effort.
In this discussion, I'm assuming that the primary risk we're attempting to mitigate is downtime caused by loss of connectivity to the data centre. In my example deployment, we're talking about a simple web application with two web roles, two worker roles and a dependency on a database on SQL Azure. If we dig further, our full risk register may look similar to the following:
|Risk|Effect|Category|
|---|---|---|
|Instance taken offline for patching/maintenance, where only one instance of that role is deployed|Your app goes offline.|Catastrophic|
|Instance taken offline for patching/maintenance, where two or more instances of the role are deployed|Potential for increased load on remaining instances; but otherwise no disruption to service.|Avoidable|
|Instance (in a multi-instance deployment) goes offline due to failure of the instance itself|As above.|Avoidable|
|Connectivity failure to a dependent resource in the data centre|The resource is unavailable for the duration of the disruption to connectivity.|Fault|
|Failure of the dependent resource|Potential for data loss. The resource is unavailable until it is recovered either manually or automatically.|Fault|
|Total loss of inbound and/or outbound connectivity to the data centre|Your app goes offline.|Catastrophic|
|Catastrophic loss of the data centre|Your app goes offline.|Catastrophic|
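If it helps to keep the register alongside your code, it can be captured as a simple data structure. The following is only a sketch in Python: the `Severity`, `Risk` and `by_severity` names are hypothetical, and the entries are abbreviated versions of the rows above.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Effect categories; a higher value means a more severe effect."""
    AVOIDABLE = 1
    FAULT = 2
    CATASTROPHIC = 3


@dataclass(frozen=True)
class Risk:
    description: str
    effect: str
    severity: Severity


# Abbreviated entries taken from the risk register above.
RISK_REGISTER = [
    Risk("Single-instance role patched", "App goes offline", Severity.CATASTROPHIC),
    Risk("Multi-instance role patched", "Increased load on remaining instances", Severity.AVOIDABLE),
    Risk("Instance failure (multi-instance)", "Increased load on remaining instances", Severity.AVOIDABLE),
    Risk("Connectivity failure to dependent resource", "Resource unavailable", Severity.FAULT),
    Risk("Failure of dependent resource", "Possible data loss; resource unavailable", Severity.FAULT),
    Risk("Total loss of data centre connectivity", "App goes offline", Severity.CATASTROPHIC),
    Risk("Catastrophic loss of data centre", "App goes offline", Severity.CATASTROPHIC),
]


def by_severity(register, severity):
    """Return the risks in the register whose effect falls in the given category."""
    return [risk for risk in register if risk.severity == severity]


for risk in by_severity(RISK_REGISTER, Severity.FAULT):
    print(f"{risk.description}: {risk.effect}")
```

Filtering by category like this makes it easy to see which risks deserve planning and development effort (the 'Fault' rows) as opposed to those you can avoid cheaply or must simply accept.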
Only once both your technical team and your business leaders are aware of the risks, their manifested effect and what can technically be achieved to mitigate them, can a discussion about the extent to which you wish to implement these measures take place. Try to avoid the tendency to shoot for 100% availability across 100% of your dependent resources, and remember that different parts of an app can often tolerate different failures differently! Understand that risks have a field of impact, too. For example, a catastrophic data centre failure would affect the whole of your app, whereas the failure of a database would impact only those sections which require connectivity to it.
Crucial to this is an open and honest discussion with the business, and with your customers, about what level of risk is acceptable to them, because this determines how much effort goes into your risk avoidance strategy.
On Windows Azure, one significant advantage is that the cost of maintaining a highly available, highly scalable solution that is both maintained and secure is generally orders of magnitude cheaper than the equivalent private, on-premise setup. The last thing you'd want to do is erode that saving by planning and deploying avoidance techniques that are completely over the top: so be reasonable with your understanding of acceptable risk.
This exercise may seem academic and fairly obvious, but it is often overlooked for that very reason. Without it, though, it is difficult to fully appreciate what steps are necessary, or to tell your UX designers which failure scenarios could occur and may need to be surfaced in your app so that your users know what is happening.
We've covered risk; now let's turn our attention to what we need to do should the worst happen: a risk has manifested and the effect has begun.
It's a common misconception that disaster recovery and risk mitigation are the same thing.
'Disaster recovery' refers to the things you do (either automatically or as part of a manual activity) that restore you to your normal scenario; for example, something exceptional occurred and you have suffered a catastrophic event and need to get back to 'business as usual' as fast as possible, while minimising loss. Risk mitigation, on the other hand, is about the things you can do before a condition occurs that triggers your failure scenario.
So that you can do this effectively, you need to first understand what risk has surfaced, what your recovery options are for that particular risk, and therefore what your recovery strategy and objectives actually are.
Let's put this into context:
Your app went offline due to a failure of a database connection. The effect was that users of your app could no longer publish new content. There are potentially two recovery options available to you here: you could write new content to a separate store temporarily and automatically update the failed database when it becomes available again, or you could simply wait until the failed database is available again. Your strategy for recovering from this particular risk therefore depends directly on what your business expects you to achieve in this scenario.
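The first of those options might look something like the sketch below. It is written in Python purely for illustration, and everything in it is an assumption on my part: `ContentPublisher`, `PrimaryUnavailable` and the in-memory `_pending` list are hypothetical stand-ins, and a real deployment would need a durable temporary store (a queue or blob storage, say) rather than a list in memory.

```python
class PrimaryUnavailable(Exception):
    """Raised by the (hypothetical) primary store when it cannot be reached."""


class ContentPublisher:
    """Writes to a primary store, parking items temporarily while it is down."""

    def __init__(self, primary_write):
        # primary_write: callable(item) that may raise PrimaryUnavailable.
        self._primary_write = primary_write
        # Stand-in for a durable temporary store; in-memory for illustration only.
        self._pending = []

    def publish(self, item) -> bool:
        """Return True if the item reached the primary store, False if it was parked."""
        try:
            self._primary_write(item)
            return True
        except PrimaryUnavailable:
            self._pending.append(item)
            return False

    def resync(self) -> int:
        """Replay parked items in order, stopping at the first failure.

        Returns the number of items successfully replayed.
        """
        replayed = 0
        while self._pending:
            try:
                self._primary_write(self._pending[0])
            except PrimaryUnavailable:
                break
            self._pending.pop(0)
            replayed += 1
        return replayed
```

You would call `resync()` periodically, or whenever you detect that the database is reachable again. Note that items published while the database is healthy can land ahead of parked ones, so this sketch only suits content where strict ordering does not matter.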
Putting it all together
We've introduced the notion that risks are no less likely to occur on the Windows Azure platform than on-premise, and we know that Azure is capable of recovering from most of these risks without any input from you. What we're trying to look at here is what steps you can take as developers to stop any non-catastrophic effects from impacting your app, causing a 'failure scenario'. If you embrace the concept of expecting failure, it becomes quite easy to see what you must do in order to maintain normal operation during a failure situation. In general, remember you can:
- Use alternative persistent storage should a database become unavailable, and resync when it is available again;
- Continue retrying a failed connection until it succeeds, or 'defer failure' until after a certain number of retries.
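As a minimal sketch of the second point, the Python helper below retries an operation and only lets the failure surface once an agreed number of attempts has been used up. The `retry` name, its parameters and the doubling delay between attempts are my own assumptions for illustration, not anything prescribed by the platform.

```python
import time


def retry(operation, max_attempts=5, base_delay=0.5,
          sleep=time.sleep, retry_on=(ConnectionError,)):
    """Call operation() until it succeeds or max_attempts is reached.

    Waits base_delay, then 2x, 4x, ... that delay between attempts
    (an assumed back-off policy). The last error is re-raised once
    the attempts are exhausted, which is the 'deferred' failure.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # we have deferred the failure as long as we agreed to
            sleep(base_delay * 2 ** (attempt - 1))
```

A call might look like `retry(lambda: save_post(post))`, where `save_post` is a hypothetical function that raises `ConnectionError` when the database cannot be reached. Passing `sleep` as a parameter is just a convenience so that the waiting can be swapped out in tests.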
When designing for high availability, it is a good idea to keep these questions in mind:
- Prevention: what can you do to stop the risks you've anticipated from occurring?
- Detection: how will you detect that your app is no longer in its 'normal state'?
- Recovery: what can your app do to temporarily mask the failure condition and maintain the appearance that everything is OK, and what steps must take place to get things back to normal operation?
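On the detection question, one simple approach is to probe a dependency regularly and treat a run of consecutive failures as leaving the normal state. The Python sketch below assumes exactly that; `HealthMonitor`, the probe callable and the threshold of three failures are all hypothetical choices for illustration.

```python
class HealthMonitor:
    """Reports 'degraded' after a run of consecutive failed probes."""

    def __init__(self, probe, failure_threshold=3):
        # probe: callable returning a truthy value when the dependency is healthy;
        # an exception raised by the probe is treated as a failed check.
        self._probe = probe
        self._threshold = failure_threshold
        self._consecutive_failures = 0

    def check(self) -> str:
        """Run one probe and return 'normal' or 'degraded'."""
        try:
            healthy = bool(self._probe())
        except Exception:
            healthy = False
        if healthy:
            self._consecutive_failures = 0
        else:
            self._consecutive_failures += 1
        if self._consecutive_failures >= self._threshold:
            return "degraded"
        return "normal"
```

Requiring several failures in a row is a deliberate trade-off: it avoids reacting to a single transient blip, at the cost of taking a little longer to notice a genuine outage.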
Do not rely on the availability guarantee: it isn't enough (a 100% up-time guarantee wouldn't be, either) and remember, availability is only one part of the equation. To go back to the car insurance metaphor, you don't just buy insurance to mitigate the risk of injury or damage to yourself or to others: you also drive safely and obey traffic rules. So what really matters is adopting a philosophy and taking a series of actions.
In summary, Windows Azure is and will remain a highly available, stable and reliable cloud platform and it will continue to be enhanced and improved over time. As developers though, we have to appreciate that failures of course can, and do, occur. Every object is subject to entropy, and hard disks, network cables and switches are no exception. Understanding that there are parts of the availability equation that you can - and should - take responsibility for is essential to a healthy cloud deployment and arguably, even if your app is deployed on-premise, you might want to consider adopting 'cloud risk principles', too!
My point ultimately is that risk isn't a problem: not understanding the effects of the risks and how your app can restore normal operation with minimal loss and minimal effort is a problem.
Need higher availability?
Deployments on Windows Azure are typically within a single data centre, and in my experience few developers realise that by simply using Windows Azure Traffic Manager CTP, you can deploy your application to multiple data centres globally, and fail over to your backup data centres within a predetermined period.
I'd just like to finish by saying that I expect this to be the first of a group of related articles and I welcome and encourage all feedback.
If you, or your team, need some help getting started with designing and developing resilient applications on the Windows Azure platform, then please get in touch and we'd be happy to assist.
1 Risk (Wikipedia)
Original Post by Richard Parker on Mar 31st, 2012