Azure High Availability

When designing application deployments for the cloud, one of the most common issues my customers face is understanding what high availability truly means. The separate but related term of disaster recovery is also rarely understood. Together, HADR (High Availability / Disaster Recovery) is driven by two separate and definable metrics: RPO, or recovery point objective, and RTO, or recovery time objective.

Recovery point objective is a tolerance for data loss. If a disaster were to occur, how much data can the business stand to lose? Recovery time objective is a tolerance for downtime. If a disaster were to occur, how much time can the application be offline before significant financial losses occur? RPO and RTO are typically measured in minutes or even seconds. The RPO translates directly into our DR strategy. The RTO translates directly into our typical thinking of the high availability SLA measured in 9s (i.e. 99.95% available). RTO will be the focus of this article; however, it is important that RPO is not ignored.
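To make those 9s concrete, here is a quick back-of-the-envelope calculation (in Python) of the downtime budget a 99.95% SLA actually allows:

# Downtime budget implied by an availability SLA
sla = 0.9995                       # 99.95% available
minutes_per_month = 30 * 24 * 60   # 43,200 minutes in a 30-day month
budget = (1 - sla) * minutes_per_month
print(round(budget, 1))            # ~21.6 minutes of allowable downtime per month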

If you ask any CTO, most often you'll hear that RPO and RTO should be zero. There is no tolerance for losing a single byte of data, and the application must be available 24/7 with zero downtime. While those are certainly noble goals, in the real world they are rarely achievable. There is a direct correlation between these metrics and the cost associated with achieving them. As a result, an important exercise in any high availability design is to first define the true RPO and RTO for a given application.

In the on-premises world the strategy for high availability is often to depend on our infrastructure. We turn to technologies like highly available SANs with multiple storage controllers and redundant sets of disks. We leverage virtualization hosts with replication and live migration technologies like Hyper-V Replica or VMware vMotion. We have redundant power, cooling, network, and so on. The assumption is that if any one component of the physical world fails, my virtual machine can continue to run on a paired set of infrastructure. This is the "VM down" approach. While this approach can certainly add value in some scenarios, it misses an important point about availability and the typical causes of downtime. Anecdotally speaking, when I've been called in to help customers who have experienced a disaster, more often than not the cause was not a physical failure. The cause was in the "VM up": a problem with the guest OS, patching, configuration, or even the application itself. The infrastructure-based approach conveniently ignores all of this and assumes that if the VM is up and running, so too must be the application.

In the cloud the reality is that none of the public providers are spending big bucks on providing highly available infrastructure. For a given hardware component the metric measured is mean time between failure (MTBF). Hardware parts vary widely in MTBF and in associated cost. The major cloud players (Amazon, Microsoft, Google) aren't buying the most robust enterprise-grade disks, storage controllers, server chassis, etc. This is part of the economies of scale and is well established. All of these "clouds" run their infrastructure on commodity hardware. The idea is that any one component failing should have no substantial impact on the business. The cost of replacing a failed component is substantially cheaper than purchasing one that has a higher MTBF. There is also no concept of VM replication in the public cloud. You're not going to automagically have your VM moved from physical host to physical host as a result of a failure.

Before we dive deeper I think it is also important to define how we scale applications. In the on-premises world we typically refer to a "single instance deployment" as a self-contained application in a single VM. If your application is running out of resources you might add more CPU cores, memory, or disk to that VM. This concept is often referred to as scaling vertically, or scaling up. The other common scale pattern, and the one most often used in the cloud, is referred to as scaling horizontally, or scaling out. In this model, instead of adding more resources we add more servers. This gives us a deployment that allows us to add/remove resources based on user demand while simultaneously ensuring high availability with no single point of failure.

In the cloud we must change our thinking. One of the core tenets of cloud architecture is "design for failure". You must account for these facts in your design and assume that your underlying infrastructure is not highly available. This means that instead of the VM down approach to HA we need to look at the VM up side of things to see what can be done. Often this means that the application itself must have some concept of high availability baked in. That is, it should be able to run on multiple virtual machines and accept load balanced traffic. It should have a data tier that matches. Essentially, the application must have no single point of failure in its design. This is where the conversation forks down two separate paths. Is this an application you've designed and built, or one you've purchased/obtained from another developer?

If you're trying to achieve high availability for an application that you've purchased or obtained from elsewhere, then your hands may be tied. I've run into plenty of customers who have line-of-business applications built on the concept of infrastructure-based high availability. They assume that to scale means vertically. They do not support a deployment other than "single instance". If this is true of your application, then depending on your RTO the cloud might not be the best place to host it.

For applications that do support a scale-out instead of a scale-up model, Azure provides a number of capabilities to help mitigate these limitations. The most important concept to understand is the availability set. When we make a deployment into Azure, either via IaaS or PaaS, our servers should be assigned to an availability set. In a typical 3-tier application (front end, application, and database) each of these tiers would be a different availability set. What this tells Azure is that any virtual machine instance in a given availability set can satisfy the same role as the other members of the set. As long as at least one of the members of that set is up and taking requests, the application as a whole is also available. The availability set directly translates into Azure fault domains. In Azure we have 2 (and soon to be 3) fault domains. These are unique sets of hardware in the Azure data center that share no single point of failure. The VMs in an availability set will automatically be provisioned by Azure to spread across these fault domains. The promise you are given is that any physical failure inside of an Azure data center will at most impact half (and soon to be one third) of the VMs for that set. In order to get the 99.95% availability SLA from Microsoft, your application must be deployed in this pattern. The SLA is on the availability set, not the individual VM.

Another Azure capability that needs to be understood is update domains. With Microsoft's goal of providing you a secure and stable environment to host your applications, the software inside Azure occasionally needs to be updated. Certain patches may require the virtualization hosts to reboot. If this happens we need a mechanism to ensure that the VMs hosting your application don't all reboot at the same time. We provide 5 update domains by default, and this is configurable up to 20. When you deploy VMs into an availability set we automatically slot them into update domains as well. If a given set has 5 VMs in it, for example, they will look something like this by default:

VM     Update Domain   Fault Domain
VM1    0               0
VM2    1               1
VM3    2               0
VM4    3               1
VM5    4               0

The promise is that if we have a physical failure in fault domain 1, then VM1, VM3, and VM5 will remain online. The Azure fabric controller will automatically relocate all affected instances to different hosts and reconfigure the network to route traffic to them. If we deploy a patch to Azure that requires a reboot, then we will only reboot one VM at a time since each has a different update domain.
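As a rough illustration (this is not Azure's actual placement algorithm, just a sketch that reproduces the default table above), the slotting behaves like a simple round-robin across both dimensions:

# Round-robin slotting of availability set members into update and fault
# domains. Illustrative only; mimics the default assignment shown above.
UPDATE_DOMAINS = 5   # default; configurable up to 20
FAULT_DOMAINS = 2    # soon to be 3

for i in range(5):
    print("VM%d" % (i + 1), i % UPDATE_DOMAINS, i % FAULT_DOMAINS)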

If we look at a standard 3-tier web application that uses an OLTP database for the backend, it might look something like this.

Web tier -


For the servers accepting HTTP requests directly from a client, the traffic will be load balanced across multiple instances. Depending on the application this may require "affinity". This means that when a user begins interacting with the application, their session is established on a single server. Any subsequent requests to the application for that user should come to the same server. This commonly happens if the application stores some user-specific data in memory and does not leverage a distributed cache approach (more on that later). This is often referred to as "sticky sessions" and is a capability that must be provided by the load balancer. We can call this design "stateful".

In Azure we have two load balancers. The first is Layer-4, which does the balancing at the TCP transport layer. This is essentially agnostic to the type of traffic being balanced. It is just forwarding TCP packets from a source IP to a destination IP. It can provide simplistic affinity based on a 3-tuple hash of source IP, destination IP, and port. This has challenges in certain scenarios and can lead to an asymmetric balance of traffic. If all of your users live behind a proxy server or NAT, then the source IP for each user will appear to be the same and it will incorrectly establish affinity. To mitigate this problem you need a Layer-7 load balancer. We provide this capability via the recently released Application Gateway offering. There are also many 3rd party and partner solutions from companies like Barracuda and F5. With this approach the load balancer is aware of the traffic that is being sent through it. It speaks the HTTP protocol. If SSL is required for the application, then the encrypt/decrypt must happen at the load balancer. To establish affinity the load balancer will typically do something like insert a cookie into the user's session so that on subsequent requests it can identify the user and route them to the correct server. In my view all of this is a bit of a Band-Aid solution to something that can be solved in the application itself.
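To see why the 3-tuple hash breaks down behind a NAT, here is a toy sketch of the hashing (the hash function and backend names are made up for illustration, not Azure's implementation):

# Toy 3-tuple hash affinity.
import hashlib

backends = ["web1", "web2", "web3"]

def pick_backend(src_ip, dst_ip, dst_port):
    key = ("%s|%s|%d" % (src_ip, dst_ip, dst_port)).encode()
    digest = int(hashlib.md5(key).hexdigest(), 16)
    return backends[digest % len(backends)]

# Every user behind the same proxy/NAT presents the same source IP, so
# thousands of users all hash to the same backend:
print(pick_backend("203.0.113.7", "10.0.0.4", 443))
print(pick_backend("203.0.113.7", "10.0.0.4", 443))  # always identical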

The correct approach is to move to a "stateless" design. This essentially means that the web front ends do not store any session data for the user in memory. All of it is offloaded to a separate caching tier which can be consulted from any of the web front ends. It doesn't matter which server gets a given request for a given user; they are all capable of responding to it. In Azure we provide Redis as a managed caching solution. For many web development frameworks, including ASP.NET, there are providers that allow you to offload session state to Redis. With this model load balancing is no longer a concern and you can have true symmetric balancing using our Layer-4 option. This results in significantly lower costs and a better user experience.
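As a minimal sketch of the idea using the redis-py client (the host name and access key below are placeholders), any front end can read or write any user's session:

# Offloading session state to a shared Redis cache (pip install redis).
import json
import redis

r = redis.StrictRedis(host="mycache.redis.cache.windows.net",
                      port=6380, password="<access-key>", ssl=True)

def save_session(session_id, data, ttl_seconds=1200):
    # Sessions expire automatically after the TTL.
    r.setex("session:" + session_id, ttl_seconds, json.dumps(data))

def load_session(session_id):
    # Any web front end can serve the request; no sticky sessions required.
    raw = r.get("session:" + session_id)
    return json.loads(raw) if raw else None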

Scaling of the web tier can be done by adding/removing instances. If you're using Azure PaaS and Cloud Services, this can be done automatically for you based on resource utilization or queue depth. If you're deployed in an IaaS model this can still be done; however, you must pre-create your maximum number of instances in advance. You can then configure Azure to turn servers on/off based on threshold criteria.
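The threshold logic amounts to something like the sketch below (start_instance and stop_instance are hypothetical stand-ins for the management API; in PaaS, Azure's built-in autoscale does this for you):

# Threshold-based scale out/in across pre-created instances.
def autoscale(avg_cpu, running, stopped, start_instance, stop_instance):
    if avg_cpu > 75 and stopped:               # under pressure: add capacity
        start_instance(stopped[0])
    elif avg_cpu < 25 and len(running) > 2:    # idle: shed capacity, keep 2 for HA
        stop_instance(running[-1])

# Toy usage with print() standing in for real management calls.
autoscale(82, running=["web1", "web2"], stopped=["web3"],
          start_instance=lambda vm: print("starting", vm),
          stop_instance=lambda vm: print("stopping", vm))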

Application tier -


There are a few types of "application" servers, and you typically need to look at what work is actually being done here to design for HA. One type of application server might be a consolidated services tier which provides access to the database for the front end. In the ASP.NET world these could be Web API or WCF services that are consumed either via the web front ends or directly by the client. It is important that these are separated from the web front ends so that they can be scaled and secured independently. For web services, the same concerns around affinity and load balancing apply here.

Another application tier might exist to do long running or non-real-time tasks. These are jobs created by users or on a schedule that satisfy some application requirement. Handling long running tasks in a fragile infrastructure world can be a real challenge. A failure in the middle of a job could force the process to start over and could even cause data loss. If the work allows for it, we try to break a single long running task into many smaller tasks and allocate the work to multiple nodes using a brokered queue. Azure provides two managed queue offerings, via either Service Bus or Azure Storage, depending on the needs. The typical pattern is that each step of the process generates a message to the queue with the required data, which is then picked up by a worker and completed. If further processing is required, another queue message is generated. This process is repeated until the work is done. The queue message has an associated visibility timeout which will allow another instance to process it should it not be completed within the allocated limit. This more granular approach ensures that if any given server fails while processing a subset of the task, another can pick up where it left off. It also allows you to process many long running tasks in parallel across many servers at the same time, which facilitates the auto-scale horizontal model.
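The sketch below shows the visibility-timeout pattern with a tiny in-memory queue standing in for Azure Storage queues or Service Bus (the class and its methods are illustrative, not a real SDK):

# Visibility-timeout worker pattern against a toy in-memory brokered queue.
import time
import uuid

class InMemoryQueue:
    def __init__(self):
        self.messages = {}                     # id -> [content, visible_at]

    def put(self, content):
        self.messages[str(uuid.uuid4())] = [content, 0.0]

    def get(self, visibility_timeout=300):
        now = time.time()
        for mid, slot in self.messages.items():
            if slot[1] <= now:                 # message is visible: lease it
                slot[1] = now + visibility_timeout
                return mid, slot[0]
        return None                            # nothing to do right now

    def delete(self, mid):
        self.messages.pop(mid, None)

q = InMemoryQueue()
q.put("step-1")

leased = q.get(visibility_timeout=300)         # invisible to other workers now
if leased:
    mid, content = leased
    print("processing", content)               # do one small unit of work
    q.put("step-2")                            # enqueue the next step
    q.delete(mid)                              # delete only after work is durable
# Had this worker crashed before delete(), the lease would expire and
# another worker would pick the message up and retry.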


Database tier -


The easiest approach to high availability on the database tier is to leverage one of the many Azure managed services. These include SQL Azure, Azure Tables, and DocumentDB. There are many customers, however, for whom these offerings are not a good fit and who already have a database strategy which requires them to deploy their own servers. The most common pattern to achieve HA is to leverage two servers in an active/passive mirrored cluster. That is to say, only one of the two servers is "active" and accepting connections. The transactions on that node are synchronously committed to both the active and passive nodes. In the event the active node becomes unavailable, a cluster quorum process promotes the passive node to active and it begins accepting connections. In the Microsoft SQL Server world this is achieved with AlwaysOn Availability Groups. Similar capabilities exist in Oracle via their Data Guard offering (note that Oracle RAC is not supported in Azure because it requires multicast networking, while Azure networks are unicast only). There are also ways in which you can deploy MySQL to achieve the same thing.
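From the client's point of view, a failover looks like a brief window of failed connections followed by success against the same endpoint. A minimal retry sketch (connect_fn is a hypothetical stand-in for your database driver; with AlwaysOn, the availability group listener hides the promotion from you):

# Retry loop: once the passive node is promoted, reconnecting to the same
# listener endpoint succeeds again.
import time

def connect_with_retry(connect_fn, retries=5, delay_seconds=5):
    for attempt in range(retries):
        try:
            return connect_fn()                # returns a live connection
        except ConnectionError:
            time.sleep(delay_seconds)          # wait out quorum + promotion
    raise RuntimeError("database unavailable after failover window")

# Toy usage: a connect_fn that succeeds immediately.
print(connect_with_retry(lambda: "connected"))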

All of the discussion so far has focused on how to achieve high availability within a given Azure region. It is possible, although unlikely, that a region-wide failure could occur (think natural disaster). How can we architect a solution that leverages the geographic breadth of Azure to ensure availability even in the face of such a major event? Many of the Azure managed services already provide this capability for you, such as SQL Azure, Tables, Blobs, and Queues. Every Azure region has an identified pair which is the target for replication. Where possible, the promise we make is that the paired location is in the same geo-political region but in separate geological and climate zones, with at least 400 miles separating them. This is to mitigate the likelihood that the same natural disaster could impact more than one Azure region at a time. The promise made is that for any block of data you write to a geo-replicated Azure storage account, we will synchronously write 3 copies of the data to the local Azure region and asynchronously write 3 copies to the paired region. You can optionally obtain read-only access to the replicated copy, which can help facilitate your cross-regional HA design.

Here is the current list of paired regions as of the time of writing:

Primary                Secondary
North Central US       South Central US
South Central US       North Central US
East US                West US
West US                East US
US East 2              Central US
Central US             US East 2
North Europe           West Europe
West Europe            North Europe
South East Asia        East Asia
East Asia              South East Asia
East China             North China
North China            East China
Japan East             Japan West
Japan West             Japan East
Brazil South           South Central US
Australia East         Australia Southeast
Australia Southeast    Australia East

https://msdn.microsoft.com/en-us/library/azure/hh873027.aspx


Replicating data isn't enough to achieve regional high availability, though. By itself that is only a disaster recovery solution, in that it ensures your data is protected. If you want to design an application that has cross-regional high availability, things become much more complex. Thankfully, Azure has some capabilities that can help with this.

When we route users to a server we leverage the DNS infrastructure to determine where to go. With Azure we have a service called Traffic Manager that can route users cross-regionally to the Azure region nearest to them. This is done leveraging real-time network heuristics to determine the quickest route between two points on the internet. You set up a CNAME for your application and point it to Azure's Traffic Manager. When a user looks up that application, they are returned the IP address of the instance of your application nearest to them. This same capability can be used to handle availability failovers across regions. Each region is treated as a separate deployment of your application and registered with Traffic Manager. In the event of a failure of one region, it can be taken out of the pool and the user can be routed to another deployment in another region.

For an application that has a significant data tier, this can be problematic in ensuring that the data for a given user is replicated across regions and available. A common pattern is for the user or tenant to have affinity to a primary Azure region and for their data to be replicated into read-only containers in secondary regions. If the primary region for a given user is offline, their experience of the application may become read-only. Another option would be to leverage Azure Service Bus and a publish/subscribe pattern for data writes, which allows consumers in each region to commit the writes locally to their deployment, ensuring consistency across regions. When the primary region comes back online, it can catch up on the writes by consuming the pending messages. This mimics a transaction log in an OLTP database. Of course, the Service Bus itself must reside in an Azure region, so nothing is without the possibility of failure. You can leverage multiple Service Bus instances and push each message to both a primary and a secondary queue. Your consumers should be smart enough to fail over to the secondary.
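A minimal sketch of the dual-queue publish (the send functions here stand in for whatever queue SDK you use; nothing below is a real Service Bus API):

# Publish each write to both a primary and a secondary queue so a regional
# outage doesn't swallow writes. Consumers de-duplicate by message id and
# fail over to the secondary queue when the primary is unreachable.
def publish_write(message, primary_send, secondary_send):
    failures = 0
    for send in (primary_send, secondary_send):
        try:
            send(message)
        except Exception:
            failures += 1                      # that region is unreachable
    if failures == 2:
        raise RuntimeError("both regions unavailable; surface to the caller")

# Toy usage: print() stands in for the real send operations.
publish_write({"id": "42", "op": "insert"}, primary_send=print, secondary_send=print)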

Now that we have our deployment architected and up in Azure, how can we test it? One of the interesting concepts that has emerged in cloud architecture is the "chaos monkey". This is an application or process that simulates random failure of components in your deployment. We are trying to replicate the experience of a real monkey sitting in a datacenter somewhere yanking random power and networking cables. If we've properly architected our system, the user should never know about the devious behavior of the monkey. A truly highly available system should be tolerant of failure of any single component and be self-healing.
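To close, here is a toy chaos monkey loop; stop_component is a hypothetical stand-in for whatever management API you would use to power off a VM or disable a network interface:

# Toy chaos monkey: periodically "fail" a random component of the deployment.
import random
import time

components = ["web1", "web2", "app1", "app2", "db-primary"]

def chaos_monkey(stop_component, rounds=3):
    for _ in range(rounds):
        victim = random.choice(components)
        stop_component(victim)                 # yank the virtual cable
        time.sleep(random.uniform(1, 5))       # wait, then strike again

# Dry run: just print which component would have been failed.
chaos_monkey(stop_component=lambda name: print("failing", name))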