High Availability On The Azure Platform

Currently, both Windows Azure and SQL Azure offer high availability within a single data center. As long as a data center remains operational and accessible from the Internet, services hosted there can achieve high availability.

Windows Azure

Windows Azure uses a combination of resource management, elasticity, load balancing, and partitioning to enable high availability within a single data center. The service developer must do some additional work to benefit from these features.

Resource Management

All services hosted by Windows Azure are collections of web, worker and/or virtual machine roles. One or more instances of a given role can run concurrently. The number of instances is determined by configuration. Windows Azure uses Fabric Controllers (FCs) to monitor and manage role instances. FCs detect and respond to both software and hardware failure automatically.

  • Every role instance runs in its own VM and communicates with its FC through a guest agent (GA). The GA collects resource and node metrics, including VM usage, status, logs, resource usage, exceptions, and failure conditions. The FC queries the GA at configurable intervals, and reboots the VM if the GA fails to respond.
  • In the event of hardware failure, the FC responsible for the failed node moves all affected role instances to a new hardware node and reconfigures the network to route traffic there. FCs use the same mechanisms to ensure the continuous availability of the services they provide.

Elasticity

The FC dynamically adjusts the number of worker role instances, up to the limit defined by the service through configuration, according to system load.

Load Balancing

All inbound traffic to a web role passes through a stateless load balancer, which distributes client requests among the role instances. Individual role instances do not have public IP addresses and are not directly addressable from the Internet. Web roles are stateless, so any client request can be routed to any role instance. A StatusCheck event is raised in each role instance every 15 seconds, giving the instance a chance to report whether it is able to accept requests.
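The routing behavior described above can be illustrated with a small sketch. It assumes a round-robin policy, which the text does not specify, and every name in it (RoundRobinBalancer, report_status, route, the instance names) is hypothetical rather than part of any Windows Azure API.

```python
from typing import Dict, List


class RoundRobinBalancer:
    """Toy balancer that hands each request to the next ready instance."""

    def __init__(self, instances: List[str]) -> None:
        self.instances = list(instances)
        self.busy: Dict[str, bool] = {name: False for name in instances}
        self._next = 0  # rotation pointer (assumed policy: round-robin)

    def report_status(self, instance: str, busy: bool) -> None:
        # Stand-in for the periodic status check: an instance reports
        # whether it can currently take traffic.
        self.busy[instance] = busy

    def route(self) -> str:
        # Try each instance at most once, starting at the rotation pointer.
        for _ in range(len(self.instances)):
            candidate = self.instances[self._next]
            self._next = (self._next + 1) % len(self.instances)
            if not self.busy[candidate]:
                return candidate
        raise RuntimeError("no instance is ready to accept requests")


balancer = RoundRobinBalancer(["web_0", "web_1", "web_2"])
balancer.report_status("web_1", busy=True)   # web_1 leaves the rotation
routed = [balancer.route() for _ in range(4)]
print(routed)  # ['web_0', 'web_2', 'web_0', 'web_2']
```

Because no request depends on state held by a particular instance, skipping a busy instance changes only where the work is done, not its result.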

Partitioning

FCs use two types of partitions: upgrade domains and fault domains.

  • An upgrade domain is used to upgrade a service’s role instances in groups. For an in-place upgrade, the FC brings down all the instances in one upgrade domain, upgrades them, and then restarts them before moving to the next upgrade domain. This approach ensures that in the event of an upgrade failure, some instances will still be available to service requests.
  • A fault domain represents potential points of hardware or network failure. For any role with more than one instance, the FC ensures that the instances are distributed across multiple fault domains, in order to prevent isolated hardware failures from disrupting service. All exposure to VM and cluster failure in Windows Azure is governed by fault domains.

According to the Windows Azure SLA[1], Microsoft guarantees that when two or more web role instances are deployed to different fault and upgrade domains, they will have external connectivity at least 99.95% of the time. There is no way to control the number of fault domains, but Windows Azure allocates them and distributes role instances across them automatically. At least the first two instances of every role are placed in different fault and upgrade domains in order to ensure that any role with at least two instances will satisfy the SLA.
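The effect of spreading instances across domains can be sketched as follows. The round-robin assignment, the domain counts, and the function names are assumptions chosen for illustration; the actual placement algorithm used by the FC is not described here.

```python
from typing import List, Tuple


def assign_domains(instance_count: int, fault_domains: int,
                   upgrade_domains: int) -> List[Tuple[int, int]]:
    """Return a (fault_domain, upgrade_domain) pair per instance, round-robin."""
    return [(i % fault_domains, i % upgrade_domains)
            for i in range(instance_count)]


def instances_up_during_upgrade(placement: List[Tuple[int, int]],
                                upgrade_domains: int) -> List[int]:
    """For each upgrade step, count the instances that keep running."""
    return [sum(1 for _, ud in placement if ud != domain)
            for domain in range(upgrade_domains)]


# Hypothetical service: 5 instances, 2 fault domains, 3 upgrade domains.
placement = assign_domains(instance_count=5, fault_domains=2, upgrade_domains=3)
still_running = instances_up_during_upgrade(placement, upgrade_domains=3)
print(placement)      # [(0, 0), (1, 1), (0, 2), (1, 0), (0, 1)]
print(still_running)  # [3, 3, 4]
```

Under this assignment the first two instances always differ in both fault domain and upgrade domain, provided each domain count is at least two, which matches the placement rule stated above.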

Implementation

To benefit from the features described above, the service developer must do some additional work.

  • To benefit from resource management, developers should ensure that all service roles are stateless, so that they can go down at any time without creating inconsistencies in the transient or persistent state of the service.
  • To achieve elasticity, developers should configure each worker role with a maximum instance count large enough to handle the largest expected load.
  • To optimize load balancing, developers should handle the StatusCheck event so that a role instance that has reached capacity can indicate that it is busy and should be temporarily removed from the load-balancer rotation.
  • To achieve effective partitioning, developers should configure at least two instances of every role, and at least two upgrade domains for every service.

The requirement to keep roles stateless deserves further comment. It implies, among other things, that all related rows in a SQL Azure database should be changed in a single transaction, if possible. For example, instead of inserting a parent in one transaction and its children in another, the code should insert both the parent and the children in the same transaction. Then, if the role instance goes down after writing just one of the row sets, the transaction is rolled back and the data is left in a consistent state.
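A minimal sketch of the parent-and-children case follows. It uses Python's built-in sqlite3 module with an in-memory database as a stand-in for SQL Azure, and the table names, column names, and insert_family helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parent (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE child ("
    " id INTEGER PRIMARY KEY,"
    " parent_id INTEGER NOT NULL REFERENCES parent(id),"
    " name TEXT NOT NULL)"
)
conn.commit()


def insert_family(conn, parent_name, child_names):
    """Insert a parent and all of its children in one transaction."""
    # Used as a context manager, the connection commits if the block
    # succeeds and rolls back if it raises.
    with conn:
        cur = conn.execute("INSERT INTO parent (name) VALUES (?)", (parent_name,))
        parent_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO child (parent_id, name) VALUES (?, ?)",
            [(parent_id, name) for name in child_names],
        )


insert_family(conn, "order-1", ["line-1", "line-2"])

# A failure part-way through (None violates NOT NULL) also undoes the parent.
try:
    insert_family(conn, "order-2", ["line-1", None])
except sqlite3.IntegrityError:
    pass

parent_count = conn.execute("SELECT COUNT(*) FROM parent").fetchone()[0]
child_count = conn.execute("SELECT COUNT(*) FROM child").fetchone()[0]
print(parent_count, child_count)  # 1 2
```

The second call fails on its last child row, and the rollback removes the parent row written earlier in the same transaction, so no parent is left without its full set of children.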

Of course, it is not always possible to make all changes in a single transaction. Special care must be taken to ensure that role failures do not cause problems when they interrupt long running operations that span two or more updates to the persistent state of the service.

For example, in a service that partitions data across multiple stores, if a worker role goes down while relocating a shard, the relocation of the shard may not complete, or may be repeated from its inception by a different worker role, potentially causing orphaned data or data corruption. To prevent problems, long running operations must be idempotent (i.e., repeatable without changing the outcome) and/or incrementally restartable (i.e., able to continue from the most recent point of failure).

  • To be idempotent, a long running operation should have the same effect no matter how many times it is executed, even when it is interrupted during execution.
  • To be incrementally restartable, a long running operation should consist of a sequence of smaller atomic operations, and it should record its progress in durable storage, so that each subsequent invocation picks up where its predecessor stopped.
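The two properties can be combined in one operation, as in the following sketch of a shard relocation. Plain dictionaries stand in for the destination store and for durable progress storage, and the names relocate_shard, checkpoint, and Crash are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


class Crash(Exception):
    """Simulates a role instance going down in the middle of an operation."""


def relocate_shard(source: List[Tuple[int, str]], destination: Dict[int, str],
                   checkpoint: Dict[str, int],
                   crash_after: Optional[int] = None) -> None:
    """Copy rows to the destination, resuming from the recorded checkpoint."""
    start = checkpoint.get("next_index", 0)
    for index in range(start, len(source)):
        if crash_after is not None and index - start >= crash_after:
            raise Crash(f"stopped before row {index}")
        row_id, value = source[index]
        # Keyed write: repeating it leaves the destination unchanged.
        destination[row_id] = value
        # Record progress only after the row has been written.
        checkpoint["next_index"] = index + 1


source = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
destination: Dict[int, str] = {}
checkpoint: Dict[str, int] = {}

try:
    relocate_shard(source, destination, checkpoint, crash_after=2)
except Crash:
    pass  # another worker would pick the operation up later

resumed_from = checkpoint["next_index"]
relocate_shard(source, destination, checkpoint)  # continues at row index 2
relocate_shard(source, destination, checkpoint)  # repeating it changes nothing
print(resumed_from, destination)  # 2 {1: 'a', 2: 'b', 3: 'c', 4: 'd'}
```

Writing the row before advancing the checkpoint means a crash between the two steps causes the row to be written again on restart, which is harmless only because the write itself is idempotent.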

Finally, all long running operations should be invoked repeatedly until they succeed. For example, a provisioning operation might be placed in an Azure queue, and removed from the queue by a worker role only when it succeeds. Garbage collection may be needed to clean up data created by interrupted operations.
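The remove-only-on-success pattern might look like the sketch below. An in-memory deque stands in for the queue and does not model the real Azure queue API; the drain function, the message name, and the retry limit are hypothetical.

```python
from collections import deque
from typing import Callable, Deque, Dict, List


def drain(queue: Deque[str], handler: Callable[[str], None],
          max_attempts: int = 10) -> List[str]:
    """Process messages, removing each one only after its handler succeeds."""
    completed: List[str] = []
    attempts = 0
    while queue and attempts < max_attempts:
        attempts += 1
        message = queue[0]      # peek: the message stays in the queue
        try:
            handler(message)
        except Exception:
            continue            # leave it queued so it is retried
        queue.popleft()         # remove only after success
        completed.append(message)
    return completed


calls: Dict[str, int] = {"provision-tenant": 0}


def flaky_provision(message: str) -> None:
    """Hypothetical handler that fails twice before succeeding."""
    calls[message] += 1
    if calls[message] < 3:
        raise RuntimeError("transient failure")


work: Deque[str] = deque(["provision-tenant"])
completed = drain(work, flaky_provision)
print(completed, calls)  # ['provision-tenant'] {'provision-tenant': 3}
```

Since the handler may run several times for the same message, this pattern is safe only when the operation it invokes is idempotent or incrementally restartable.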

Common long running operations that create special challenges include provisioning, deprovisioning, rolling upgrade, data replication, restoring backups and garbage collection.

SQL Azure

SQL Azure uses a combination of replication and resource management to provide high availability within a single data center. Services benefit from these features just by using SQL Azure. No additional work is required by the service developer.

Replication

SQL Azure exposes logical rather than physical servers. A logical server is assigned to a single tenant, and may span multiple physical servers. Databases in the same logical server may therefore reside in different SQL Server instances.

Every database has three replicas: one primary and two secondaries. All reads and writes go to the primary, and all writes are replicated asynchronously to the secondaries. However, every transaction commit requires a quorum: the primary and at least one of the secondaries must confirm that the log records have been written before the transaction is considered committed. Most production data centers have hundreds of SQL Server instances, so it is unlikely that any two databases with primary replicas on the same machine will have secondary replicas that also share a machine.
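The quorum rule can be expressed in a few lines. This is only an illustration of the rule as stated above, not of SQL Azure internals; the Replica class and commit function are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Replica:
    name: str
    online: bool = True
    log: List[str] = field(default_factory=list)

    def write(self, record: str) -> bool:
        """Append a log record; return False if the replica is unreachable."""
        if not self.online:
            return False
        self.log.append(record)
        return True


def commit(primary: Replica, secondaries: List[Replica], record: str) -> bool:
    """Commit only if the primary and at least one secondary confirm."""
    if not primary.write(record):
        return False
    confirmations = [secondary.write(record) for secondary in secondaries]
    return any(confirmations)


primary = Replica("primary")
secondaries = [Replica("secondary-1"), Replica("secondary-2", online=False)]
one_secondary_down = commit(primary, secondaries, "txn-1")

secondaries[0].online = False
both_secondaries_down = commit(primary, secondaries, "txn-2")
print(one_secondary_down, both_secondaries_down)  # True False
```

With three replicas, the rule tolerates the loss of either secondary without blocking commits, while still guaranteeing that every committed record exists on at least two replicas.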

Resource Management

Like Windows Azure, SQL Azure uses a fabric to manage resources. However, instead of a fabric controller, it uses a ring topology to detect failures. Every replica in a cluster has two neighbors, and is responsible for detecting when they go down. When a replica goes down, its neighbors trigger a Reconfiguration Agent (RA) to recreate it on another machine. Engine throttling is provided to ensure that a logical server does not use too many resources on a machine, or exceed the machine’s physical limits.
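Neighbor-based failure detection in a ring can be sketched as follows. The node names and the neighbors and detect_failures functions are hypothetical, and the sketch models only which neighbors would notice a failure, not the recreation of the replica by the RA.

```python
from typing import Dict, List, Set, Tuple


def neighbors(ring: List[str], name: str) -> Tuple[str, str]:
    """Return the two nodes adjacent to `name` in the ring."""
    i = ring.index(name)
    return ring[i - 1], ring[(i + 1) % len(ring)]


def detect_failures(ring: List[str], alive: Set[str]) -> Dict[str, List[str]]:
    """Map each failed node to the live neighbors that would report it."""
    reports: Dict[str, List[str]] = {}
    for node in ring:
        if node in alive:
            continue
        reports[node] = [n for n in neighbors(ring, node) if n in alive]
    return reports


ring = ["r0", "r1", "r2", "r3", "r4"]
reports = detect_failures(ring, alive={"r0", "r1", "r3", "r4"})
print(reports)  # {'r2': ['r1', 'r3']}
```

Because each node watches only its two neighbors, the monitoring work per node stays constant as the ring grows, and no single controller has to poll every replica.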


[1] https://www.microsoft.com/windowsAzure/sla/