A Summary of the Amazon Web Services June 29 Outage

 

Summary of the AWS Service Event in the US East Region

...from Amazon <https://aws.amazon.com/message/67457/>

 

The event was triggered during a large-scale electrical storm that swept through the Northern Virginia area.

 

Though the resources in the affected datacenter, including Elastic Compute Cloud (EC2) instances, Elastic Block Store (EBS) storage volumes, Relational Database Service (RDS) instances, and Elastic Load Balancer (ELB) instances, represent a single-digit percentage of the total resources in the US East-1 Region, there was significant impact to many customers. The impact manifested in two forms. The first was the unavailability of instances and volumes running in the affected datacenter. This kind of impact was limited to the affected Availability Zone; other Availability Zones in the US East-1 Region continued functioning normally. The second form of impact was degradation of service “control planes,” which allow customers to take action and create, remove, or change resources across the Region. While control planes aren’t required for the ongoing use of resources, they are particularly useful in outages, where customers are trying to react to the loss of resources in one Availability Zone by moving to another.
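To make the control-plane/data-plane distinction concrete, here is a minimal sketch in Python using the boto3 SDK (which post-dates this outage); the AMI ID and health-check URL are hypothetical. Creating, changing, or removing resources requires regional control-plane API calls, while traffic to an already-running instance does not.

# Minimal sketch (hypothetical AMI ID and health-check URL) contrasting the two planes.
import urllib.request
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Control plane: creating, changing, or removing resources goes through regional APIs.
volume = ec2.create_volume(AvailabilityZone="us-east-1a", Size=100)
reservation = ec2.run_instances(
    ImageId="ami-12345678",                     # hypothetical AMI
    InstanceType="m1.large",
    MinCount=1,
    MaxCount=1,
    Placement={"AvailabilityZone": "us-east-1a"},
)
print("Launched", reservation["Instances"][0]["InstanceId"], "and", volume["VolumeId"])

# Data plane: traffic to an already-running instance needs no control-plane call,
# which is why resources in healthy Availability Zones kept serving during the event.
print(urllib.request.urlopen("http://app.example.com/health", timeout=5).status)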

 

 

Systems Affected

  • Elastic Compute Cloud (EC2)
  • Elastic Block Store (EBS)
  • Relational Database Service (RDS)
  • Elastic Load Balancer (ELB)
  • ElastiCache
  • Elastic MapReduce
  • Elastic Beanstalk

 

 

Timeline - June 29-30, 2012

 

Time (PDT)                       | System        | Event
8:04 PM (June 29)                | All           | Servers began losing power
8:21 PM                          | All           | Amazon status update: We are investigating connectivity issues for a number of instances in the US-EAST-1 Region
9:10 PM                          | Control plane | Control plane functionality was restored for the Region
10:00 PM                         | RDS           | A large number of the affected Single-AZ RDS instances had been brought online
11:00 PM                         | RDS           | The remaining Multi-AZ instances were processed as EBS volume recovery completed for their storage volumes
11:15 PM to just after midnight  | EC2           | Instances came back online
2:45 AM (June 30)                | EBS           | 90% of outstanding volumes had been turned over to customers

 

Note: Amazon is strangely ambiguous about the timing of the ELB outage.

 

 

Summary of Amazon control plane issues

Why is this important?

  • Applications deployed to AWS are expected to be designed for failure if they are to be resilient in the face of an outage. Such fault-tolerant designs rely on several capabilities enabled by the control plane, so when the control plane fails, even plans to mitigate failure can fail (see the sketch after the details list below).

 

Details:

  • Degradation of service “control planes,” which allow customers to take action and create, remove, or change resources across the Region
  • Customers were not able to launch new EC2 instances, create EBS volumes, or attach volumes in any Availability Zone in the US-East-1 Region
  • A bug caused the ELB control plane to attempt to scale the affected ELBs to larger ELB instance sizes, resulting in a sudden flood of requests which began to backlog the control plane
  • The ELB service’s inability to quickly process new requests delayed recovery for many customers who were replacing lost EC2 capacity by launching new instances in other Availability Zones
    • From GigaOm: the AWS outage resulted in a control plane backlog that prohibited customers from failing over into Availability Zones not affected by the generator failure
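As an illustration of why a backlogged control plane blocks this kind of recovery, below is a minimal sketch in Python/boto3 (the SDK post-dates the outage) of replacing lost capacity in a healthy Availability Zone; the AMI ID and load balancer name are hypothetical. Every step is a control-plane request, so when those requests queue up, the failover path stalls with them.

# Minimal sketch (hypothetical AMI and load balancer names) of cross-AZ recovery.
# Every step below is a control-plane request, so a backlogged control plane
# stalls exactly this path.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
elb = boto3.client("elb", region_name="us-east-1")   # classic ELB API, as used in 2012

def replace_lost_capacity(healthy_az="us-east-1b", count=4):
    # 1. Launch replacement instances in an unaffected Availability Zone.
    reservation = ec2.run_instances(
        ImageId="ami-12345678",                  # hypothetical
        InstanceType="m1.large",
        MinCount=count,
        MaxCount=count,
        Placement={"AvailabilityZone": healthy_az},
    )
    instance_ids = [i["InstanceId"] for i in reservation["Instances"]]

    # 2. Wait for them to reach the running state (more control-plane calls).
    ec2.get_waiter("instance_running").wait(InstanceIds=instance_ids)

    # 3. Register them with the load balancer -- the ELB control-plane step that
    #    the scaling bug described above backlogged.
    elb.register_instances_with_load_balancer(
        LoadBalancerName="my-frontend-elb",      # hypothetical
        Instances=[{"InstanceId": i} for i in instance_ids],
    )
    return instance_ids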

 

Some AWS-Hosted Companies Affected

  • Netflix
  • Instagram
  • Pinterest
  • Heroku

 

More on Netflix

Why did Netflix go out?

  • Amazon control plane issues (see above)
  • Problems with [Netflix's] load-balancing architecture that ended up compounding the problem by “essentially caus[ing] gridlock inside most of our services as they tried to traverse our middle-tier.”
  • Chaos Gorilla, the Simian Army member tasked with simulating the loss of an Availability Zone, was built for exactly this purpose, but the outage highlighted the need for additional tools and use cases for both Chaos Gorilla and other parts of the Simian Army (a rough sketch of such an AZ-loss drill follows this list).
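Netflix has not published Chaos Gorilla's internals, so the following is only a rough Python/boto3 sketch of what an Availability Zone loss drill conceptually does (the tag name, target zone, and dry-run default are assumptions): find one application's instances in a target zone and terminate them so that failover paths get exercised before a real outage forces the issue.

# Rough sketch of an AZ-loss drill in the spirit of Chaos Gorilla; not Netflix's
# actual implementation. The tag name, target zone, and dry-run default are assumed.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

def simulate_az_loss(target_az="us-east-1a", app_tag="my-app", dry_run=True):
    # Find one application's running instances in the target Availability Zone.
    pages = ec2.get_paginator("describe_instances").paginate(
        Filters=[
            {"Name": "availability-zone", "Values": [target_az]},
            {"Name": "tag:app", "Values": [app_tag]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    victims = [
        inst["InstanceId"]
        for page in pages
        for res in page["Reservations"]
        for inst in res["Instances"]
    ]
    print(f"Would terminate {len(victims)} instances in {target_az}: {victims}")
    if victims and not dry_run:
        # Killing them forces the surviving zones to absorb the load -- the same
        # failure mode this outage produced for real.
        ec2.terminate_instances(InstanceIds=victims)
    return victims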

 

What went right for Netflix?

  • Regional isolation contained the problem to users being served out of the US-EAST region. Our European members were unaffected.
  • Cassandra, our distributed cloud persistence store which is distributed across all zones and regions, dealt with the loss of one third of its regional nodes without any loss of data or availability (a sketch of the kind of replication settings that enable this follows below).
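Netflix has not published the exact keyspace settings behind this, so the following is only a sketch using the DataStax Python driver with assumed values (keyspace and table names, seed addresses, and a replication factor of 3 with one replica per zone). It shows the general mechanism by which losing one of three zones leaves LOCAL_QUORUM reads and writes available.

# Sketch only: assumed keyspace/table names, seed addresses, and a replication
# factor of 3 with one replica per Availability Zone (the EC2 snitch maps zones to racks).
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["10.0.0.11", "10.0.1.11", "10.0.2.11"])   # hypothetical seeds, one per zone
session = cluster.connect()

# Three replicas in the "us-east" data center, spread one per zone by the snitch.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS member_state
    WITH replication = {'class': 'NetworkTopologyStrategy', 'us-east': 3}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS member_state.sessions (
        member_id text PRIMARY KEY,
        state     text
    )
""")

# LOCAL_QUORUM needs 2 of the 3 replicas, so losing an entire zone's replica
# leaves both writes and reads available, with no data loss once the zone returns.
write = SimpleStatement(
    "INSERT INTO member_state.sessions (member_id, state) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.LOCAL_QUORUM,
)
session.execute(write, ("member-123", "active"))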

 

Sources used

 

Amazon Services Dashboard
