Key To The Cloud: Design For Failure

Last week, Amazon Web Services suffered a major outage in one of their datacenters. There has been a lot of coverage in the media, as several companies that built their services on AWS were unavailable during this multi-day outage. As someone who evangelizes Windows Azure & cloud computing in general, I followed the news closely. What type of fears would this AWS outage put into the minds of companies contemplating a move to the cloud? Would this episode be a setback for all of us who are enthusiasts of a cloud computing future?

The headlines are titillating for certain, but I think this outage requires some deeper thought (and reading) before folks deem cloud computing to be unreliable. Since the outage, recaps and post-mortems of the event have begun to appear online. Over the weekend, I caught up on some of these, which I think are worth sharing here:

At many of our MSDN events where I talk about Windows Azure, I often encounter folks who just want to know, “will it run in Azure?”, where the “it” is their existing application or website. Sometimes the follow-up questions lead me to believe that folks think of Windows Azure or the cloud as “just another place” they can host their application. They want to know if they can take their existing applications and move them offsite as-is.

In some cases, that may be possible, especially if you view the cloud as “just another host”. But the cloud is so much more than “just another host”. It truly is a whole new hosting AND programming paradigm that requires thought about how you design your system from the get-go. Yes, you can “migrate” things to the cloud and they may work (very well, might I add!). But if you really want to take advantage of what the cloud has to offer, then you need to design for it.

What does that mean? Well, some of the promises of the cloud are scalability, high availability, and elasticity. But those things don’t come for free. How to achieve them is beyond the scope of this blog post. I will say here that designing your system to achieve those things in the cloud is an emerging skill set which developers & IT pros would be smart to pick up. One key skill is the ability to design your system for failure. As last week’s outage showed, this is critical for high availability.
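To make “design for failure” a little more concrete, here is a minimal sketch of one small piece of the idea: assume any single region can fail, and fall back to another instead of going down with it. This is an illustration only, not how AWS, Windows Azure, or any of the affected sites actually do it. The names `call_with_failover`, `primary_region`, and `secondary_region` are hypothetical stand-ins I made up for this example.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


class AllReplicasFailed(Exception):
    """Raised when every replica has been tried and none succeeded."""


def call_with_failover(
    replicas: Sequence[Callable[[], T]],
    attempts_per_replica: int = 2,
) -> T:
    """Try each replica in order and return the first successful result.

    Hypothetical helper: each item in `replicas` is a zero-argument
    callable standing in for "send this request to region X".
    """
    errors: List[Exception] = []
    for replica in replicas:
        for _ in range(attempts_per_replica):
            try:
                return replica()
            except Exception as exc:  # treat any error as a failed attempt
                errors.append(exc)
    # Every replica failed: surface all collected errors to the caller.
    raise AllReplicasFailed(errors)


# Hypothetical stand-ins for two independently hosted copies of a service.
def primary_region() -> str:
    raise ConnectionError("primary region unavailable")


def secondary_region() -> str:
    return "served from secondary region"


result = call_with_failover([primary_region, secondary_region])
print(result)
```

A real system needs far more than this (replicated data, health checks, timeouts, backoff), but the mindset is the same: the failure of one part is an expected case that the design handles, not an exception to it.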

I encourage folks to read the AWS post-mortem articles I linked above. I think you’ll find that “designing for failure” is a common theme in all of them. It is what led to success for those sites & services that did not fail when parts of the cloud did.

[Update 4-29-2011:] Amazon has posted their own post-mortem of the outage.