Architecting Cloud Services for Resiliency

Yesterday, DavidMak and I presented a session on architecting for resiliency. With the advent of the cloud, it has become even more important to account for intermittent failures, dropped connections, server failovers, and the like.

Here are some of the key considerations:

Design for failure

  1. Hardware/Software failure is inevitable
  2. People make operational errors that cause failures
  3. At cloud scale, low-frequency failures happen every day

Netflix has blogged about their approach to testing for failures, and this is what saved them from buckling under when AWS suffered an extended outage in its Virginia datacenter that brought down the likes of Reddit and Quora. Netflix's lessons can be used as a guide to designing for failure.

Resiliency schemes

Application resiliency (proactive / reactive)

    1. Retry policies (handle dropped connections on SQL Azure and Windows Azure Storage, and employ back-off schemes; see the retry sketch after this list)
    2. Avoiding single points of failure (what if the DB becomes unavailable? Consider providing a degraded experience, e.g. the site goes into read-only mode with all data served from cache)
    3. Handling throttling (employ multiple caches or use dedicated caches to fight cache throttling; on the storage side, use multiple storage accounts to logically divide the application data, and keep diagnostics data in its own storage account)
    4. Queues, idempotency & poison messages (idempotency means an operation produces the same result no matter how many times it is performed; see the message-handling sketch after this list)
    5. Internal IP address changes (lost connections) on role recycle (e.g. WCF communication based on IP, or running MongoDB)
    6. Extensive diagnostics / logging (this does not prevent failures, but it is the after-the-fact provision you need to diagnose and prevent recurring failures in the solution)
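
To make the retry point concrete, here is a minimal sketch of a retry wrapper with exponential back-off and jitter. The is_transient predicate and the execute_query call in the usage comment are placeholders rather than a specific client library API; the actual transient error codes for SQL Azure / Windows Azure Storage come from whichever client library you use.

```python
import random
import time

def with_retries(operation, is_transient, max_attempts=5, base_delay=0.5):
    """Run 'operation', retrying transient failures with exponential back-off plus jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as ex:
            # Give up if this was the last attempt or the error is not transient.
            if attempt == max_attempts or not is_transient(ex):
                raise
            # Exponential back-off (0.5s, 1s, 2s, ...) with a little random jitter.
            time.sleep(base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.1))

# Hypothetical usage: execute_query and TransientDbError are placeholders for
# whatever your data-access layer exposes.
# result = with_retries(lambda: execute_query("SELECT 1"),
#                       is_transient=lambda ex: isinstance(ex, TransientDbError))
```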
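
And here is a minimal sketch of the idempotency / poison-message point: track which message IDs have already been applied so a redelivered message becomes a no-op, and park a message that keeps failing once its dequeue count crosses a threshold. The message shape and the apply_change callable are assumptions for illustration, not a particular queue API.

```python
MAX_DEQUEUE_COUNT = 3
processed_ids = set()   # would be a durable store in production (e.g. a table keyed by message id)
poison_messages = []    # parked for offline inspection

def handle_message(msg, apply_change):
    """Process a queue message so that redeliveries and poison messages are handled safely."""
    if msg["dequeue_count"] > MAX_DEQUEUE_COUNT:
        # The message has failed repeatedly; move it aside so it stops blocking the queue.
        poison_messages.append(msg)
        return
    if msg["id"] in processed_ids:
        # Duplicate delivery: the work was already applied, so this is a no-op.
        return
    apply_change(msg["body"])   # the actual work; should itself be safe to re-run
    processed_ids.add(msg["id"])

# Hypothetical usage:
# handle_message({"id": "42", "dequeue_count": 1, "body": {"op": "resize", "item": "photo1"}},
#                apply_change=lambda body: print("applying", body))
```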

High Availability (proactive)

– Out-Of-The-Box capabilities

  1. Admin-free HA
  2. ACID properties maintained by fabric
  3. Automated failover
  4. Dynamic routing of db connections
  5. Redundancy
  6. Windows Azure SQL DB availability SLA
  7. Windows Azure Compute availability SLA
  8. Live Upgrades (hot swap or upgrade domains)
  9. Windows Azure Traffic Manager

– Data

  1. Queues for writing to multiple destinations (one approach is sketched after this list)
  2. Data sync
  3. CDN edge nodes
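
One way to read "queues for writing to multiple destinations" is a fan-out worker: the front end enqueues a write once, and a background worker applies it to each destination, so a slow or failed store does not block the others. A rough sketch; the destinations dictionary of callables is purely illustrative:

```python
def fan_out_write(record, destinations):
    """Apply one queued write to every destination, collecting per-destination failures."""
    failures = []
    for name, write in destinations.items():
        try:
            write(record)
        except Exception as ex:
            # A failed destination does not block the others; the caller can
            # re-enqueue the record for just the destinations that failed.
            failures.append((name, ex))
    return failures

# Hypothetical usage: each value is a callable that persists the record.
# failures = fan_out_write({"id": 42, "value": "..."},
#                          {"primary_db": save_to_db, "backup_blob": save_to_blob})
```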

– Client configurations to multiple service endpoints (a thick client connecting to service endpoints could load them from configuration with multiple endpoints listed; if one is unavailable, try the next one, as sketched below)
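
A minimal sketch of what that client-side failover could look like; the endpoint URLs and the call_service callable are illustrative assumptions, not any particular client stack:

```python
SERVICE_ENDPOINTS = [
    "https://service-us.example.com/api",   # hypothetical primary endpoint
    "https://service-eu.example.com/api",   # hypothetical secondary endpoint
]

def call_with_failover(call_service, endpoints=SERVICE_ENDPOINTS):
    """Try each configured endpoint in order and return the first successful response."""
    last_error = None
    for endpoint in endpoints:
        try:
            return call_service(endpoint)
        except Exception as ex:
            last_error = ex   # remember the failure and move on to the next endpoint
    # Every endpoint failed (or none were configured).
    raise last_error or RuntimeError("no service endpoints configured")
```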

Disaster Recovery (Reactive - Datacenter failures, Human errors, bugs)

– Building for the lowest common denominator (provides flexibility if the service is down in all DCs and allows moving to other providers or on-premises; this is a very rare scenario and thus might not be the top driver in design)

– Windows Azure Storage

  1. Geo-replication for tables and blobs
  2. RPO and RTO
  3. Datacenter implications (co-location of compute with storage might be lost)

– Windows Azure SQL Database

  1. DB Copy + Import / Export to Windows Azure Storage
  2. On Roadmap: Point in Time Backup / Restore (watch the TechEd session on Business Continuity Solutions for Windows Azure SQL Database)
  3. On Roadmap: Geo DR (Configurable RPOs, Multiple geo-secondaries)
  4. Data Sync
  • Easy to configure and use (++)
  • No transactional consistency (--)

– ACS (can take point-in-time backups of namespaces, rules, mappings, etc. using the PowerShell cmdlets; Windows Azure takes a backup of ACS namespaces once a day from a DR perspective; for a more granular backup scheme, use the cmdlets)

– Service Bus (do not use queues for data; use them for commands. This allows the data associated with a command to be stored in blobs and thus be readily available in a DR scenario; see the sketch below.)
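
In other words, a command message carries a pointer to the data rather than the data itself. A small sketch of what such a message could look like; the command name and blob URL are illustrative, and this is not a Service Bus API:

```python
import json
import uuid

def make_command(command_name, blob_url):
    """Build a small command message that points at durable data instead of embedding it."""
    return json.dumps({
        "id": str(uuid.uuid4()),   # lets the consumer de-duplicate redeliveries
        "command": command_name,
        "data_blob": blob_url,     # payload lives in blob storage, not in the queue message
    })

# Hypothetical usage:
# msg = make_command("ResizeImage",
#                    "https://myaccount.blob.core.windows.net/images/photo1.jpg")
```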

Would love to hear what other patterns/solutions you use today.