The team have recently released a new whitepaper: Disaster Recovery and High Availability for Windows Azure Applications.
The whitepaper outlines the necessary architecture steps to be taken to disaster-proof a Windows Azure deployment so that the larger business continuity process can be implemented. A business continuity plan is a roadmap for continuing operations under adverse conditions. This could be a failure with technology, such as a downed service, or a natural disaster, such as a storm or power outage. Application resiliency for disasters is only a subset of the larger DR process as described in this NIST document: Contingency Planning Guide for Information Technology Systems.
The following, taken from the blog post for the paper, describes what the paper covers. I would strongly suggest that anyone putting any mission-critical systems on any cloud provider has a good read of this paper.
Characteristics of Resilient Cloud
A well architected application can withstand
capability failures at a tactical level and can also tolerate strategic
system-wide failures at the datacenter level. The following sections define the
terminology referenced throughout the document to describe various aspects of
resilient cloud services.
A highly available cloud application implements
strategies to absorb outages of its dependencies, such as the managed services
offered by the cloud platform. In spite of possible failures of the Cloud
platform capabilities, this approach permits the application to continue to
exhibit the expected functional and non-functional systemic characteristics as
defined by the designers. This is covered in depth in the paper Failsafe: Guidance for
Resilient Cloud Architectures.
The implementation of the application needs to factor
in the probability of a capability outage. It also needs to consider the impact
it will have on the application from the business perspective before diving deep
into the implementation strategies. Without due consideration to the business
impact and the probability of hitting the risk condition, the implementation can
be expensive and potentially unnecessary.
Consider an automotive analogy for high availability.
Even quality parts and superior engineering do not prevent occasional
failures. For example, when your car gets a flat tire, the car still runs, but
it is operating with degraded functionality. If you planned for this potential
occurrence, you can use one of those thin-rimmed spare tires until you reach a
repair shop. Although the spare tire does not permit fast speeds, you can still
operate the vehicle until the tire is replaced. In the same way, a cloud service
that plans for potential loss of capabilities can prevent a relatively minor
problem from bringing down the entire application. This is true even if the
cloud service must run with degraded functionality.
There are a few key characteristics of highly
available cloud services: availability, scalability, and fault tolerance.
Although these characteristics are interrelated, it is important to understand
each and how they contribute to the overall availability of the solution.
An available application considers the availability of
its underlying infrastructure and dependent services. Available applications
remove single points of failure through redundancy and resilient design. When we
talk about availability in Windows Azure, it is important to understand the
concept of the effective availability of the platform. Effective
availability considers the Service Level Agreements (SLA) of each dependent
service and their cumulative effect on the total system availability.
System availability is the measure of what percentage
of a time window the system will be able to operate. For example, the
availability SLA of at least two instances of a web or worker role in Windows
Azure is 99.95%. This percentage represents the amount of time that the roles
are expected to be available (99.95%) out of the total time they could be
available (100%). It does not measure the performance or functionality of the
services running on those roles. However, the effective availability of your
cloud service is also affected by the various SLAs of the other dependent
services. The more moving parts within the system, the more care must be taken
to ensure the application can resiliently meet the availability requirements of
its end users.
Consider the following SLAs for a Windows Azure
service that uses Windows Azure roles (Compute), Windows Azure SQL Database, and
Windows Azure Storage.
Windows Azure Service | SLA | Potential Minutes Downtime/Month (30 days)
You must plan for all services to potentially go down
at different times. In this simplified example, the total number of minutes per
month that the application could be down is 108 minutes. A 30-day month has a
total of 43,200 minutes. 108 minutes is 0.25% of the total number of minutes in a
30-day month (43,200 minutes). This gives you an effective availability of
99.75% for the cloud service.
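The arithmetic above can be checked with a short sketch. The 99.95% compute figure comes from the text; the 99.9% figures used here for SQL Database and Storage are an assumption, chosen because they are consistent with the 108-minute total. The function names are illustrative only.

```python
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200 minutes


def downtime_minutes(sla_percent, total_minutes=MINUTES_PER_30_DAY_MONTH):
    """Minutes per month a service may be down while still meeting its SLA."""
    return total_minutes * (100.0 - sla_percent) / 100.0


def effective_availability(sla_percents, total_minutes=MINUTES_PER_30_DAY_MONTH):
    """Worst case: every service is down at a different time, so downtime adds up."""
    total_down = sum(downtime_minutes(s, total_minutes) for s in sla_percents)
    return 100.0 * (1.0 - total_down / total_minutes)


# Assumed SLA values; only the compute figure is stated in the text.
slas = {"Compute": 99.95, "SQL Database": 99.9, "Storage": 99.9}

total_down = sum(downtime_minutes(s) for s in slas.values())
overall = effective_availability(slas.values())
```

With these inputs, `total_down` comes to roughly 108 minutes and `overall` to roughly 99.75%, matching the simplified example. Dropping a service from `slas` shows how much each dependency costs in effective availability.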
However, using availability techniques described in
this paper can improve this. For example, if you design your application to
continue running when SQL Database is unavailable, you can remove that line from
the equation. This might mean that the application runs with reduced
capabilities, so there are also business requirements to consider. For a
complete list of Windows Azure SLAs, see Service Level Agreements.
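As a minimal sketch of continuing to run when a database is unavailable, the function below serves the last known copy of a record and flags the response as degraded. Everything here is hypothetical: `DependencyUnavailable`, `get_product_page`, and the `load_from_database` callable stand in for whatever data access layer an application really uses.

```python
class DependencyUnavailable(Exception):
    """Hypothetical error raised when a dependent service cannot be reached."""


def get_product_page(product_id, load_from_database, cache):
    """Return product details, falling back to cached data if the database is down."""
    try:
        details = load_from_database(product_id)
        cache[product_id] = details  # remember the last good copy
        return {"details": details, "degraded": False}
    except DependencyUnavailable:
        # Reduced capability: possibly stale (or missing) data, but the page still renders.
        return {"details": cache.get(product_id), "degraded": True}
```

Whether stale data is acceptable is a business decision, which is why the result carries an explicit `degraded` flag that the caller can act on.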
Scalability directly affects availability—an
application that fails under increased load is no longer available. Scalable
applications are able to meet increased demand with consistent results in
acceptable time windows. When a system is scalable, it scales horizontally or
vertically to manage increases in load while maintaining consistent performance.
In basic terms, horizontal scaling adds more machines of the same size while
vertical scaling increases the size of the existing machines. In the case of
Windows Azure, you have vertical scaling options for selecting various machine
sizes for compute. But changing the machine size requires a re-deployment.
Therefore, the most flexible solutions are designed for horizontal scaling. This
is especially true for compute, because you can easily increase the number of
running instances of any web or worker role to handle increased traffic through
the Azure Web portal, PowerShell scripts, or code. This decision should be based
on increases in specific monitored metrics. In this scenario user performance or
metrics do not suffer a noticeable drop under load. Typically, the web and
worker roles store any state externally to allow for flexible load balancing and
to gracefully handle any changes to instance counts. Horizontal scaling also
works well with services, such as Windows Azure Storage, which do not provide
tiered options for vertical scaling.
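The idea of scaling out on a monitored metric can be sketched as a pure function that proposes an instance count. This is an illustration, not an Azure API: the CPU target, the upper bound, and the function name are assumptions. The floor of two instances reflects the SLA condition mentioned earlier.

```python
import math


def desired_instance_count(current_instances, observed_cpu_percent,
                           target_cpu_percent=60.0,
                           min_instances=2, max_instances=20):
    """Propose how many role instances keep average CPU near the target.

    Assumes load spreads evenly across instances, which is reasonable only
    when the roles are stateless and sit behind a load balancer.
    """
    if current_instances < 1:
        raise ValueError("current_instances must be at least 1")
    needed = math.ceil(current_instances * observed_cpu_percent / target_cpu_percent)
    # Clamp to the allowed range so the count never drops below the SLA minimum.
    return max(min_instances, min(max_instances, needed))
```

For example, four instances averaging 90% CPU against a 60% target would suggest six instances. The result would then be applied through the portal, a PowerShell script, or code, as described above.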
Cloud deployments should be seen as a collection of
scale-units, which allows the application to be elastic in servicing the
throughput needs of the end users. The scale units are easier to visualize at
the web and application server level as Windows Azure already provides stateless
compute nodes through web and worker roles. Adding more compute scale-units to
the deployment will not cause any application state management side effects as
compute scale-units are stateless. A storage scale-unit is responsible for
managing a partition of data, either structured or unstructured. Examples of
storage scale-units include Windows Azure Table partition, Blob container, and
SQL Database. Even the usage of multiple Windows Azure Storage accounts has a
direct impact on the application scalability. A highly scalable cloud service
needs to be designed to incorporate multiple storage scale-units. For instance,
if an application uses relational data, the data needs to be partitioned across
several SQL Databases so that the storage can keep up with the elastic compute
scale-unit model. Similarly, Windows Azure Storage allows data partitioning schemes that
require deliberate designs to meet the throughput needs of the compute layer.
For a list of best practices for designing scalable cloud services, see Best
Practices for the Design of Large-Scale Services on Windows Azure Cloud Services.
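One simple way to spread data across several storage scale-units is to hash a partition key and map it to a unit. The sketch below assumes a fixed list of scale-unit names (hypothetical database or storage account identifiers) and is not taken from the whitepaper.

```python
import hashlib


def pick_scale_unit(partition_key, scale_units):
    """Map a partition key to one storage scale-unit in a stable way."""
    if not scale_units:
        raise ValueError("at least one scale unit is required")
    # A cryptographic hash gives the same result on every machine and run,
    # unlike Python's built-in hash(), which is randomized per process for strings.
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(scale_units)
    return scale_units[index]


# Hypothetical scale-unit names used only for illustration.
units = ["sqldb-0", "sqldb-1", "sqldb-2", "sqldb-3"]
target = pick_scale_unit("customer-42", units)
```

Plain modulo hashing remaps most keys when the number of units changes, so a design that expects to add scale-units over time would need a lookup table or consistent hashing instead.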
Applications need to assume that every dependent cloud
capability can and will go down at some point in time. A fault tolerant
application detects and maneuvers around failed elements to continue and return
the correct results within a specific timeframe. For transient error conditions,
a fault tolerant system will employ a retry policy. For more serious faults, the
application is able to detect problems and fail over to alternative hardware or
contingency plans until the failure is corrected. A reliable application is able
to properly manage the failure of one or more parts and continue operating
properly. Fault tolerant applications can use one or more design strategies,
such as redundancy, replication, or degraded functionality.
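A retry policy for transient errors, followed by a fallback for more serious faults, might look like the sketch below. `TransientError`, `call_with_retry`, and the backoff values are assumptions made for illustration; real client libraries usually ship their own retry policies.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for errors that are worth retrying."""


def call_with_retry(operation, fallback=None, max_attempts=3,
                    base_delay_seconds=0.5, sleep=time.sleep):
    """Retry a transient failure with exponential backoff, then fail over."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                break  # retries exhausted; treat as a more serious fault
            # Delays grow as 0.5s, 1s, 2s, ... so a struggling service gets room to recover.
            sleep(base_delay_seconds * 2 ** (attempt - 1))
    if fallback is not None:
        return fallback()  # alternative resource or degraded result
    raise TransientError("operation failed after %d attempts" % max_attempts)
```

The `sleep` parameter is injectable so the policy can be tested without waiting. Only errors known to be transient are retried; anything else propagates immediately.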
A cloud deployment might cease to function due to a
systemic outage of the dependent services or the underlying infrastructure.
Under such conditions, a business continuity plan triggers the disaster recovery
(DR) process. This process typically involves both operations personnel and
automated procedures in order to reactivate the application at a functioning
datacenter. This requires the transfer of application users, data, and services
to the new datacenter. This involves the use of backup media or ongoing replication.
Consider the previous analogy that compared high
availability to the ability to recover from a flat tire through the use of a
spare. By contrast, disaster recovery involves the steps taken after a car crash
where the car is no longer operational. In that case, the best solution is to
find an efficient way to change cars, perhaps by calling a travel service or a
friend. In this scenario, there is likely going to be a longer delay in getting
back on the road as well as more complexity in repairing and returning to the
original vehicle. In the same way, disaster recovery to another datacenter is a
complex task that typically involves some downtime and potential loss of data.
To better understand and evaluate disaster recovery strategies, it is important
to define two terms: recovery time objective (RTO) and recovery point objective (RPO).
The recovery time objective (RTO) is the maximum
amount of time allocated for restoring application functionality. This is based
on business requirements and is related to the importance of the application.
Critical business applications require a low RTO.
The recovery point objective (RPO) is the acceptable
time window of lost data due to the recovery process. For example, if the RPO is
one hour, then the data must be completely backed up or replicated at least
every hour. Once the application is brought up in an alternate datacenter, the
backup data could be missing up to an hour of data. Like RTO, critical
applications target a much smaller RPO.
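The relationship between backup frequency, RPO, and RTO can be expressed as a small check. This is a sketch under simple assumptions: backups complete instantly at a fixed interval, and the recovery time is an estimate supplied by the operations team. The function names are illustrative.

```python
from datetime import datetime, timedelta


def data_lost(last_backup_at, failure_at):
    """Window of data missing after restoring from the most recent backup."""
    if failure_at < last_backup_at:
        raise ValueError("failure cannot precede the last backup")
    return failure_at - last_backup_at


def meets_objectives(backup_interval, rpo, estimated_recovery_time, rto):
    """Compare a DR plan against its recovery objectives.

    The worst-case data loss equals the backup interval, because a failure
    may occur just before the next backup is taken.
    """
    return {
        "rpo_met": backup_interval <= rpo,
        "rto_met": estimated_recovery_time <= rto,
    }


plan = meets_objectives(
    backup_interval=timedelta(minutes=30),
    rpo=timedelta(hours=1),
    estimated_recovery_time=timedelta(hours=5),
    rto=timedelta(hours=4),
)
```

In this example the plan satisfies the one-hour RPO but misses the four-hour RTO, so the recovery procedure, not the backup schedule, is what needs attention.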