So, one morning a few weeks ago, my colleague, Ruwen Hess, stuck his head in my door and commented on an article on Gmail and user confidence he had just read. He also pointed out a very interesting article on the S3 failure from July 20th (which I had seen before). Now, before I get to the point of this blog post, I want to say that this is NOT meant as a slam on our competitors. This is a complex and nascent space in which we are all learning. I want to compliment my former company, Amazon, on the openness exhibited in their discussion of the event of July 20th. It is a new thing for Amazon to explain what went wrong when stuff happens, and I think it is great for the industry and great for Amazon. Google is just fighting the good fight to provide highly available services. They are both fascinating companies and worthy competitors. Still, there are some interesting aspects to this publicly available news.
Of course, this leads my mind down too many different paths. This was going to be a short post but I always get carried away! Let me cover:
- Some Observations about Reliable Process Pairs,
- Less Is More,
- N-Version Programming,
- Availability Over Consistency,
- Eventual Consistency,
- Front-Ending the Cloud, and
- It's Going To Be a Fun Ride!
Some Observations about Reliable Process Pairs
Let me start by discussing independence of failures (and the need to focus on it to achieve high availability). During the 1980s, when I was developing system software at Tandem Computers, some of our important programs were process pairs. In this scheme, the same software ran in two different processes on different computers within the distributed system, connected via messaging. The goal was that if one of the computers (and, hence, processes) failed, the other would "take over" and continue offering service. Of course, it was essential to be able to restart a backup and fill it with sufficient state to continue the computation. Specifically, every failure and take-over created the need to restart the other process (which used to be the primary). Restarting implies filling the new backup with enough state to be ready to take over.
During this time, I was privileged to be able to see a number of different implementations of process pairs within the Tandem System and, indeed, to do major development and implementation on a large process pair (in addition to a number of other pieces of software with different approaches to fault tolerance). As I lived and breathed the triage of product support, crashes, dump analyses, patches, and major product upgrades, it became clear to me that there was a pattern. Some process pairs were implemented to ship entire data structures. Some were implemented to send very narrow and minimalist descriptions of an operational change in state (e.g. in the transaction manager: "We just advanced transaction-X to the stage of flushing the log records to disk"). In a different example, one process pair would send a message containing its entire address space of data to its partner whenever a change occurred! Lots of lively debates raged about the best way to implement reliable process pairs.
I observed that the implementations which sent the minimum amount of data seemed to be the most resilient. There were many crash dump analyses that showed a process running along and it would trash some data structure. Then, a checkpoint would copy the data structure to the backup process, including its corruption. (A checkpoint is the message used to communicate state from the primary process to the backup process in a process pair.) Soon, the corruption caused the primary process (and its computer) to crash. Now, the backup (with the corruption received from the checkpoint) takes over to do the work of the process pair. Before long, the backup gets tangled up because of the corruption in the data structure and IT falls over, too! The entire fault tolerant system has succumbed to a single failure because there is insufficient isolation between the parts.
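The contrast between the two checkpointing styles can be sketched in a few lines of code. This is a toy illustration with hypothetical names, not Tandem's actual implementation: a backup that accepts a full-structure checkpoint inherits whatever corruption sits in the primary's memory, while a backup that accepts only a narrow, validated operational delta stays clean.

```python
import copy

class Backup:
    """Backup process in a hypothetical process pair (all names illustrative)."""
    def __init__(self):
        self.state = {"committed": 0}

    def receive_full_checkpoint(self, primary_state):
        # Full-structure checkpoint: whatever is in the primary's memory,
        # including any corruption, lands in the backup verbatim.
        self.state = copy.deepcopy(primary_state)

    def receive_delta_checkpoint(self, op, value):
        # Minimal checkpoint: a narrow operational change that can be
        # sanity-checked before it is applied.
        if op == "advance_committed" and isinstance(value, int) and value >= 0:
            self.state["committed"] = value
        else:
            raise ValueError("rejecting malformed checkpoint")

primary_state = {"committed": 7}
primary_state["scratch"] = ["dangling", "garbage"]   # a latent corruption

b1, b2 = Backup(), Backup()
b1.receive_full_checkpoint(primary_state)            # corruption now lives in b1 too
b2.receive_delta_checkpoint("advance_committed", 7)  # b2 stays clean
print("scratch" in b1.state, "scratch" in b2.state)  # True False
```

When the primary later crashes on that corrupt structure, b1 takes over carrying the same poison, while b2 takes over with only the state it was able to validate.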
Less Is More
Communicating less information within a message is usually best. If you send extra stuff, you increase the chances of propagating corruption!
The failure of Amazon's S3 on July 20th is fascinating and, yet, very typical. S3 uses a Gossip Protocol to spread state around the network without any master. The knowledge is simply propagated to anyone that will listen and, very quickly, spreads around the network. Unfortunately, there was a bit of poison introduced into the system. Now, it just so happens that this was introduced because Amazon's MD5 hash protection of their messages did not cover some of the system status bits, but that is not my main point. All (non-protected) data communications will occasionally have transmission errors and, unfortunately, the stuff being gossiped by S3 was not protected. The article simply describes it as system status state.
I know that in these complex distributed systems, it is essential to keep reasonably accurate track of the state of the system. What is not clear from the Amazon S3 article about July 20th is the shape and form of the system status information and why it did not settle out automatically after the corruption. The article mentions that the system status state is given a higher priority and was starving out the real work as the servers gossiped about the system state.
The main point is that when you are communicating information, it is essential to keep the information flow as sparse as possible.
While S3 is a brilliant system which is designed to continue functioning during a data center outage, it had a flaw (which I'm sure is now fixed) which allowed for some bad state to propagate to ALL the data centers. To eliminate the problem, all the data centers were taken offline and then the entire system was restarted.
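The coverage gap described above is easy to demonstrate. Here is a minimal sketch (the message framing and field names are my invention, not Amazon's): the digest protects the payload but not the status bits, so a single bit flipped in transit sails through verification and gets gossiped to every listener.

```python
import hashlib

def make_message(payload: bytes, status: int) -> dict:
    # Hypothetical framing: the digest covers the payload but NOT the
    # status field -- analogous to the coverage gap in the S3 incident.
    return {"payload": payload,
            "status": status,
            "digest": hashlib.md5(payload).hexdigest()}

def verify(msg: dict) -> bool:
    # Only the payload is checked; the status field is trusted blindly.
    return hashlib.md5(msg["payload"]).hexdigest() == msg["digest"]

msg = make_message(b"server-42 is healthy", status=0)
msg["status"] ^= 0x40        # a single flipped bit in transit
print(verify(msg))           # True -- the corruption passes verification
                             # and would be spread to anyone who listens
```

The fix is equally simple in principle: fold every field you gossip into the digest, so that any transmission error is detected and dropped rather than propagated.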
Let me reiterate my deep respect for the engineering at both Amazon and Google in addition to what I see at Microsoft. Still, as the industry matures, we will all learn by the school of hard knocks to keep the information we spread across failure units as crisp and concise as semantically possible.
N-Version Programming
I've never personally experienced it but have read about uses of N-Version Programming. In this scheme, specifications of a program are drawn up and presented to N different teams, each of which will produce a version of the system. Now, if you develop 3 versions of the system, you may then implement a voting scheme in which when there are two disparate answers, you pick the one with two votes. The concept is to reduce the probability of a software bug causing the entire system to fail. Of course, one of the challenges in this approach is the development of comprehensive and clear specifications.
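The voting scheme with three versions can be sketched as follows (a toy illustration; the three "implementations" are hypothetical stand-ins for three independently developed programs):

```python
from collections import Counter

def majority_vote(answers):
    """Pick the answer produced by a majority of the N independent versions."""
    winner, count = Counter(answers).most_common(1)[0]
    if count <= len(answers) // 2:
        raise RuntimeError("no majority -- fail safe rather than guess")
    return winner

# Three hypothetical teams independently implement one spec: "square x".
def version_a(x): return x * x
def version_b(x): return x ** 2
def version_c(x): return x * x + 1   # this team shipped a bug

print(majority_vote([f(4) for f in (version_a, version_b, version_c)]))  # 16
```

The single buggy version is outvoted; the system survives the software fault precisely because the three failures are (one hopes) independent.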
The reason I think of N-Version Programming is that it involves the "Less Is More" principle. To do N-Version Programming correctly, you need to isolate your development teams and prohibit any out-of-band communication. If the teams communicate (other than through the specifications), it increases the chances of propagating some corruption.
Availability Over Consistency
I am not aware of any information from Google about the cause of their outage but, again, we as an industry are still learning how to keep ever more complex applications and services available. The article cited above expresses the dismay of many users when a service they've grown to depend upon is simply not there for a while.
In many cases, the user would gladly take a "good enough" answer NOW rather than wait for the "correct" answer LATER. In fact, it is MORE common to see users who just want to keep going. Right now, I am typing most of this on the bus with Windows Live Writer inserting notes to myself about the hyperlinks to fix when I am back online... It's great to just "keep going" even with a reduced experience!
Amazon published a wonderful paper on Dynamo at SOSP 2007. This paper provides an excellent overview of the Dynamo storage system (which I had a role in encouraging from the sidelines -- I can't take credit for it). Dynamo is in production with at least two running services and numerous fascinating techniques employed in its implementation. I would encourage you to read the paper.
The reason that I raise this here is that Dynamo provides availability over consistency. In a distributed system, it has been proven that Brewer's CAP Conjecture is true... Hence, it is now called the CAP Theorem. The idea is that you can have only two of Consistency, Availability, and Partition Tolerance. You can have a consistent (and by this, the idea is a classic transactional ACID consistency) and partition-tolerant system, but it may not be available under some partitions. Alternatively, you can have an available system which tolerates partitions, but it won't have the classic notion of consistency. Increasingly, I see applications designed with looser notions of consistency. Most of the time, customers really want availability at the expense of classic consistency! New means of expressing looser consistencies are emerging to provide availability even when failures occur!
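The two sides of that trade-off can be made concrete with a toy replica (entirely my own sketch, not Dynamo's design): during a partition, a "consistent" read must refuse to answer, while an "available" read answers immediately at the risk of being stale.

```python
class Replica:
    """Toy replica illustrating the CAP trade-off (all names hypothetical)."""
    def __init__(self):
        self.value = "v1"
        self.partitioned = False   # True when cut off from the other replicas

    def read_consistent(self):
        # CP choice: refuse to answer rather than risk returning a stale value.
        if self.partitioned:
            raise TimeoutError("unavailable until the partition heals")
        return self.value

    def read_available(self):
        # AP choice: always answer NOW, possibly with a stale value.
        return self.value

r = Replica()
r.partitioned = True
print(r.read_available())   # "v1" -- possibly stale, but the user keeps going
```

Most of the customers described below pick the second read.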
I learned over 20 years ago working in transactional systems to ASK a customer what their priority was when dealing with an outage. Indeed, some customers wanted you to ensure that every transaction was correct before bringing the system online. At first, it shocked me to learn that many, many customers wanted the system up even if the results of the work might be somewhat "less than perfect". Indeed, I saw this in some banking systems with humongous amounts of money being pushed around. It wasn't that they were doing anything wrong... if the system came up, most of the transfers could be accomplished and overnight interest gathered for them. The funds involved were large enough that hundreds of bank workers would simply stay up all night and verify the accuracy of the work, cleaning up as necessary. The timeliness mattered more than the accuracy!
Again, more often than not, availability matters more than strict (classic) consistency!
Eventual Consistency
Also fascinating (at least to me) is the area of eventual consistency (and, again, this term implies a LOOSER form of consistency than classic ACID transactions).
The basic idea is that it is OK for replicas to diverge in their opinions. What is needed is a system that coalesces when communication is reestablished between the replicas. I talked about this in my talk The Irresistible Forces Meet the Movable Objects. I am bringing this up because, I am arguing, people really want availability over consistency. When divergent changes are made while disconnected, they also want the messes to clean themselves up as much as possible.
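The simplest possible example of replicas that diverge and then coalesce is a grow-only set, whose merge operation is just set union (a minimal sketch; the laptop/server framing is my own illustration):

```python
class GSet:
    """Grow-only set: replicas may diverge while disconnected,
    and merging (set union) always makes them coalesce."""
    def __init__(self):
        self.items = set()
    def add(self, x):
        self.items.add(x)
    def merge(self, other):
        self.items |= other.items

laptop, server = GSet(), GSet()
laptop.add("draft-edit-1")   # a change made offline, on the bus
server.add("draft-edit-2")   # a change made elsewhere in the meantime

laptop.merge(server)         # reconnect: the replicas exchange state
server.merge(laptop)
print(laptop.items == server.items)  # True -- the replicas have coalesced
```

No matter which order the merges happen in, or how many times they are repeated, both replicas settle on the same set: the mess cleans itself up.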
It is my opinion that we will be designing systems to support eventual consistency. Part of this trend will be to back away from our traditional separation of storage (e.g. database) from applications. In my opinion, we will evolve towards integrated solutions that combine the storage with the execution of the application in a way that will make it easier to redo the operation by the application and create eventual consistency across different executions of the app. This is much easier to do than to cope with reorderability of write operations against a data store.
Front-Ending the Cloud
So, I love and believe in cloud computing. I think the entire space will grow and evolve. The cost structures for many companies will make it very cost effective to use big data centers for hosting lots of their computation. As I look at data center cost structures, it is clear that it is going to be a competitive business with many advantages to large data center managers with large economies of scale. In a handful of years, most companies will look to offsite providers for their reliable servers.
Still, I wonder if there will be a market for local front-end systems running a replica of a portion of the applications needed to keep the business going. Not all the computation a company performs is "mission critical". In other words, the company can function with a subset of its computation (at least for a short while). Will there be an emerging space of loosely-coupled eventually consistent computation run either at the local company or at a different hosting company than the main cloud-purveyor? If there is an outage of the cloud services, this front-end would keep the business going. When reconnected, the results of the work would be shared and reconciled.
This is very much like the relationship between Outlook and Exchange. Outlook provides a LOT of its functionality while disconnected, and that is of huge value.
This may be an interesting economic point -- combining the cost savings of the cloud with independent failures of the front-end.
It's Going To Be a Fun Ride!
I am SO glad to be working in this space! It seems to me this is a great time for distributed systems geeks!