A Summary of the Amazon Web Services June 29 Outage

07/06/2012

Summary of the AWS Service Event in the US East Region

...from Amazon <https://aws.amazon.com/message/67457/>

The event was triggered during a large-scale electrical storm that swept through the Northern Virginia area.

Though the resources in this datacenter, including Elastic Compute Cloud (EC2) instances, Elastic Block Store (EBS) storage volumes, Relational Database Service (RDS) instances, and Elastic Load Balancer (ELB) instances, represent a single-digit percentage of the total resources in the US East-1 Region, there was significant impact to many customers. The impact manifested in two forms. The first was the unavailability of instances and volumes running in the affected datacenter. This kind of impact was limited to the affected Availability Zone. Other Availability Zones in the US East-1 Region continued functioning normally. The second form of impact was degradation of service “control planes” which allow customers to take action and create, remove, or change resources across the Region. While control planes aren’t required for the ongoing use of resources, they are particularly useful in outages where customers are trying to react to the loss of resources in one Availability Zone by moving to another.

Systems Affected

Elastic Compute Cloud (EC2)
Elastic Block Store (EBS)
Relational Database Service (RDS)
Elastic Load Balancer (ELB)
ElastiCache
Elastic MapReduce
Elastic Beanstalk

Timeline - June 29-30, 2012

Time (PDT)	System	Event
8:04 pm	All	Servers began losing power
8:21 pm	All	Amazon status update: "We are investigating connectivity issues for a number of instances in the US-EAST-1 Region"
9:10 pm	Control plane	Control plane functionality was restored for the Region
10:00 pm	RDS	A large number of the affected Single-AZ RDS instances had been brought online
11:00 pm	RDS	The remaining Multi-AZ instances were processed when EBS volume recovery completed for their storage volumes
11:15 pm to just after midnight	EC2	Instances came back online
2:45 am	EBS	90% of outstanding volumes had been turned over to customers

Note: Amazon's summary is ambiguous about the timing of the ELB outage.

Summary of Amazon control plane issues

Why is this important?

Applications deployed to AWS are expected to be designed for failure if they are to stay resilient during an outage. Such fault-tolerant designs rely on several capabilities enabled by the control plane. Hence, when the control plane fails, even the plans to mitigate failure fail.

Details:

degradation of service “control planes” which allow customers to take action and create, remove, or change resources across the Region
customers were not able to launch new EC2 instances, create EBS volumes, or attach volumes in any Availability Zone in the US-East-1 Region
As power and systems returned, a bug caused the ELB control plane to attempt to scale a large number of ELBs to larger ELB instance sizes. This resulted in a sudden flood of requests which began to backlog the control plane.
ELB service’s inability to quickly process new requests delayed recovery for many customers who were replacing lost EC2 capacity by launching new instances in other Availability Zones
From GigaOM: the AWS outage resulted in a control plane backlog that prohibited customers from failing over into Availability Zones not affected by the generator failure
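
The dependency described above can be made concrete with a little arithmetic. If a failover plan needs the control plane to launch replacement instances, it stalls when the control plane is backlogged. One commonly discussed alternative is to keep enough capacity running in every Availability Zone that the surviving zones can absorb peak load with no new launches. The sketch below is only an illustration of that sizing rule: the function names and the example numbers are hypothetical and do not come from Amazon's or Netflix's reports.

```python
import math


def instances_per_zone(peak_instances: int, zones: int, zones_lost: int = 1) -> int:
    """Instances to keep running in each zone so that the zones that
    survive can carry the full peak load without launching anything new.

    All parameters are hypothetical sizing inputs, not AWS API values.
    """
    surviving = zones - zones_lost
    if surviving <= 0:
        raise ValueError("at least one zone must survive")
    # Each surviving zone must hold an equal share of the peak load.
    return math.ceil(peak_instances / surviving)


def overprovision_ratio(peak_instances: int, zones: int, zones_lost: int = 1) -> float:
    """Total running capacity divided by the capacity needed at peak."""
    total = instances_per_zone(peak_instances, zones, zones_lost) * zones
    return total / peak_instances


if __name__ == "__main__":
    # Hypothetical service: 90 instances at peak, spread over 3 zones.
    print(instances_per_zone(90, 3))   # instances to run in each zone
    print(overprovision_ratio(90, 3))  # cost of tolerating one zone loss
```

Under these assumptions, tolerating the loss of one zone out of three means running 45 instances per zone (135 in total), or 1.5 times the peak requirement. That is the price of not depending on the control plane at the moment of failure.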

Some AWS Hosted Companies Affected

Netflix
Instagram
Pinterest
Heroku

More on Netflix

Why did Netflix go out?

Amazon control plane issues (see above)
Problems with [Netflix's] load-balancing architecture that ended up compounding the problem by “essentially caus[ing] gridlock inside most of our services as they tried to traverse our middle-tier.”
Chaos Gorilla, the Simian Army member tasked with simulating the loss of an availability zone, was built for exactly this purpose. This outage highlighted the need for additional tools and use cases for both Chaos Gorilla and other parts of the Simian Army.

What went right for Netflix?

Regional isolation contained the problem to users being served out of the US-EAST region. Our European members were unaffected.
Cassandra, our distributed cloud persistence store which is distributed across all zones and regions, dealt with the loss of one third of its regional nodes without any loss of data or availability.
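
Why a replicated store can lose a third of its nodes and stay available comes down to quorum arithmetic. Cassandra's quorum consistency level requires floor(RF / 2) + 1 replicas to respond, where RF is the replication factor. The sketch below checks whether that quorum is still reachable after a zone is lost. The source does not state Netflix's replication factor or consistency level, so RF = 3 with one replica per zone is an assumption made purely for illustration, and the helper names are hypothetical.

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must respond for a quorum read or write."""
    return replication_factor // 2 + 1


def survives_zone_loss(replication_factor: int, zones: int, zones_lost: int = 1) -> bool:
    """True if a quorum of replicas remains after losing `zones_lost` zones.

    Assumes replicas are spread evenly across zones, so the replication
    factor must be a multiple of the zone count.
    """
    if replication_factor % zones != 0:
        raise ValueError("replication factor must be a multiple of the zone count")
    replicas_per_zone = replication_factor // zones
    remaining = replication_factor - replicas_per_zone * zones_lost
    return remaining >= quorum(replication_factor)


if __name__ == "__main__":
    # Assumed layout: 3 replicas, one in each of 3 zones.
    print(quorum(3))                 # replicas needed for quorum
    print(survives_zone_loss(3, 3))  # one zone lost
    print(survives_zone_loss(2, 2))  # two replicas over two zones, one lost
```

With three replicas across three zones, losing one zone leaves two replicas, which still meets the quorum of two. With two replicas across two zones, losing one zone leaves a single replica, which is below the quorum of two.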

Sources used

https://gigaom.com/cloud/latest-outage-raises-more-questions-about-amazon-cloud/

https://gigaom.com/cloud/netflix-were-bullish-on-the-cloud-despite-outage/

https://techblog.netflix.com/2012/07/lessons-netflix-learned-from-aws-storm.html

Amazon Services Dashboard
