Architecting Cloud Services for Resiliency

Yesterday, DavidMak and I presented a session on architecting for resiliency. With the advent of the cloud, it has become even more important to account for intermittent failures, dropped connections, server failovers, and the like.

Here are some of the key considerations:

Design for failure

  1. Hardware/Software failure is inevitable
  2. People make operational errors that cause failures
  3. At cloud scale, low-frequency failures happen every day

Netflix has blogged about their approach to testing for failures, and this is what saved them from buckling under when AWS suffered an extended outage in its Virginia datacenter that brought down the likes of Reddit and Quora. Netflix's lessons can be used as a guide to designing for failure.

Resiliency schemes

Application resiliency (proactive / reactive)

    1. Retry policies (handle dropped connections on SQL Azure and Windows Azure Storage, and employ back-off schemes; see the retry sketch after this list)
    2. Avoiding single points of failure (what if the DB becomes unavailable? Consider providing a degraded experience, e.g. the site goes into read-only mode with all data served from cache)
    3. Handling throttling (employ multiple caches or use dedicated caches to fight cache throttling; on the storage side, use multiple storage accounts to logically divide the application data, and keep diagnostics data in its own storage account)
    4. Queues, idempotency & poison messages (idempotency means an operation produces the same result no matter how many times it is performed; see the message-handling sketch after this list)
    5. Internal IP address changes (lost connections) on role recycle (e.g. WCF communication based on IP, or running MongoDB)
    6. Extensive diagnostics / logging (this does not prevent failures, but it is the after-the-fact provision you need to diagnose and prevent recurring failures in the solution)
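
To make the retry point concrete, here is a minimal sketch of a retry wrapper with exponential back-off and jitter. The is_transient predicate and the execute_query call in the usage comment are placeholders rather than a specific client library API; the actual transient error codes for SQL Azure / Windows Azure Storage come from whichever client library you use.

```python
import random
import time

def with_retries(operation, is_transient, max_attempts=5, base_delay=0.5):
    """Run 'operation', retrying transient failures with exponential back-off plus jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as ex:
            # Give up if this was the last attempt or the error is not transient.
            if attempt == max_attempts or not is_transient(ex):
                raise
            # Exponential back-off (0.5s, 1s, 2s, ...) with a little random jitter.
            time.sleep(base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.1))

# Hypothetical usage: execute_query and TransientDbError are placeholders for
# whatever your data-access layer exposes.
# result = with_retries(lambda: execute_query("SELECT 1"),
#                       is_transient=lambda ex: isinstance(ex, TransientDbError))
```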
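
And here is a minimal sketch of the idempotency / poison-message point: track which message IDs have already been applied so a redelivered message becomes a no-op, and park a message that keeps failing once its dequeue count crosses a threshold. The message shape and the apply_change callable are assumptions for illustration, not a particular queue API.

```python
MAX_DEQUEUE_COUNT = 3
processed_ids = set()   # would be a durable store in production (e.g. a table keyed by message id)
poison_messages = []    # parked for offline inspection

def handle_message(msg, apply_change):
    """Process a queue message so that redeliveries and poison messages are handled safely."""
    if msg["dequeue_count"] > MAX_DEQUEUE_COUNT:
        # The message has failed repeatedly; move it aside so it stops blocking the queue.
        poison_messages.append(msg)
        return
    if msg["id"] in processed_ids:
        # Duplicate delivery: the work was already applied, so this is a no-op.
        return
    apply_change(msg["body"])   # the actual work; should itself be safe to re-run
    processed_ids.add(msg["id"])

# Hypothetical usage:
# handle_message({"id": "42", "dequeue_count": 1, "body": {"op": "resize", "item": "photo1"}},
#                apply_change=lambda body: print("applying", body))
```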

High Availability (proactive)

– Out-Of-The-Box capabilities

  1. Admin-free HA
  2. ACID properties maintained by fabric
  3. Automated failover
  4. Dynamic routing of db connections
  5. Redundancy
  6. Windows Azure SQL DB availability SLA
  7. Windows Azure Compute availability SLA
  8. Live Upgrades (hot swap or upgrade domains)
  9. Windows Azure Traffic Manager

– Data

  1. Queues for writing to multiple destinations (one approach is sketched after this list)
  2. Data sync
  3. CDN edge nodes
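
One way to read "queues for writing to multiple destinations" is a fan-out worker: the front end enqueues a write once, and a background worker applies it to each destination, so a slow or failed store does not block the others. A rough sketch; the destinations dictionary of callables is purely illustrative:

```python
def fan_out_write(record, destinations):
    """Apply one queued write to every destination, collecting per-destination failures."""
    failures = []
    for name, write in destinations.items():
        try:
            write(record)
        except Exception as ex:
            # A failed destination does not block the others; the caller can
            # re-enqueue the record for just the destinations that failed.
            failures.append((name, ex))
    return failures

# Hypothetical usage: each value is a callable that persists the record.
# failures = fan_out_write({"id": 42, "value": "..."},
#                          {"primary_db": save_to_db, "backup_blob": save_to_blob})
```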

– Client configurations to multiple service endpoints (a thick client connecting to service endpoints could load them from configuration with multiple endpoints listed; if one is unavailable, try the next one, as sketched below)
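
A minimal sketch of what that client-side failover could look like; the endpoint URLs and the call_service callable are illustrative assumptions, not any particular client stack:

```python
SERVICE_ENDPOINTS = [
    "https://service-us.example.com/api",   # hypothetical primary endpoint
    "https://service-eu.example.com/api",   # hypothetical secondary endpoint
]

def call_with_failover(call_service, endpoints=SERVICE_ENDPOINTS):
    """Try each configured endpoint in order and return the first successful response."""
    last_error = None
    for endpoint in endpoints:
        try:
            return call_service(endpoint)
        except Exception as ex:
            last_error = ex   # remember the failure and move on to the next endpoint
    # Every endpoint failed (or none were configured).
    raise last_error or RuntimeError("no service endpoints configured")
```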

Disaster Recovery (Reactive - Datacenter failures, Human errors, bugs)

– Building for the lowest common denominator (provides flexibility if the service is down in all DCs and allows moving to other providers or on-premises; this is a very rare scenario and thus might not be the top driver in design)

– Windows Azure Storage

  1. Geo-replication for tables and blobs
  2. RPO and RTO
  3. Datacenter implications (co-location of compute with storage might be lost)

– Windows Azure SQL Database

  1. DB Copy + Import / Export to Windows Azure Storage
  2. On Roadmap: Point in Time Backup / Restore (watch the TechEd session on Business Continuity Solutions for Windows Azure SQL Database)
  3. On Roadmap: Geo DR (Configurable RPOs, Multiple geo-secondaries)
  4. Data Sync
  • Easy to configure and use (++)
  • No transactional consistency (--)

– ACS (can take point-in-time backups of namespaces, rules, mappings, etc. using the PowerShell cmdlets; Windows Azure takes a backup of ACS namespaces once a day from a DR perspective; for a more granular backup scheme, use the cmdlets)

– Service Bus (do not use queues for data; use them for commands. This allows the data associated with a command to be stored in blobs and thus be readily available in a DR scenario; see the sketch below.)
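
In other words, a command message carries a pointer to the data rather than the data itself. A small sketch of what such a message could look like; the command name and blob URL are illustrative, and this is not a Service Bus API:

```python
import json
import uuid

def make_command(command_name, blob_url):
    """Build a small command message that points at durable data instead of embedding it."""
    return json.dumps({
        "id": str(uuid.uuid4()),   # lets the consumer de-duplicate redeliveries
        "command": command_name,
        "data_blob": blob_url,     # payload lives in blob storage, not in the queue message
    })

# Hypothetical usage:
# msg = make_command("ResizeImage",
#                    "https://myaccount.blob.core.windows.net/images/photo1.jpg")
```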

Would love to hear what other patterns/solutions you use today.