In the light of the recent story about Amazon’s EC2 cloud platform being zapped by a lightning strike, I was reminded of the forecasters’ maxims – “…calculate the development time, then double it” and “…calculate the required budget, then double it”.
I don’t know how many times in every walk of life you’ve found this to be true – but it just seems to work that way for me, every time. Both professionally and personally. My recent family holiday cost twice as much as I’d originally predicted. Home improvements always take twice as long as hoped. And so it is with disaster planning.
Enormous thought and planning have gone into the Windows Azure data-centres: the way the 3 phases of the power supply are routed to different racks, the way cooling and water are distributed, even the way different physical parts of the buildings link with other parts and their contents (power, cooling, server racks, hardware etc). Data is written 3 times within a data-centre, and the fabric very carefully considers all the variables of what can fail, when and where. What if there’s a leak and part of the data-centre gets flooded? What about obvious things like cooling and power failures – how will they affect the availability of a system?
So built right into the architecture of Windows Azure is the notion of fault domains and update domains: ways of dividing up the physical assets of the service so that it keeps running in a disaster. It’s very similar with Amazon’s EC2 and of course with other cloud service providers’ data-centre architectures as well.
You could be forgiven for thinking “it’s all taken care of” – because that is indeed one of the main thrusts of the cloud phenomenon: that the boring and plain un-sexy stuff is taken care of. But some disasters can have an impact on everything in a data-centre. The most oft-cited disaster is an earthquake.
Ask a solution or an enterprise architect how they have built disaster-tolerance into their solution with the cloud and they’ll talk about the cost-benefit analysis. The chances of an entire data-centre being affected by a significant earthquake in Western Europe are small. Not non-existent, but small. Small enough to end up as a consideration in a spec-document somewhere, but that’s it.
However – the EC2 story shows us that despite the considerable effort cloud platform operators like Microsoft and Amazon put into their data-centres, there is actually a very good chance they could be affected by lightning, and this could have a massive impact on the entire data-centre. Anybody who has suffered a lightning strike in an enterprise data-centre knows the havoc it causes to a business.
So – does lightning strike twice in the same place? And how many times does lightning strike the ground? Well, yes – lightning does indeed strike in the same place twice. Lightning has even been known to strike the same person twice – multiple times in some cases. You may have heard the story about the WWII bomber crewman who fell 18,000 feet and survived. He fell into a huge snowdrift. Later in life he was struck by lightning several times and ended up selling life insurance. True.
According to National Geographic, lightning strikes are a very common occurrence – 50 to 100 times per second. Or put another way, 180,000 to 360,000 times per hour. A data-centre is much more likely to be hit by lightning than to suffer an earthquake. If the hit is significant, it could take out the whole data-centre and take some time to get things back online again. Amazon are saying it’ll take 48 hours before full service is resumed.
Perhaps a more realistic statistic would be the total number of times buildings are hit by lightning. Tall buildings in built-up areas are the biggest target. But data-centres tend to be built in low-rise areas, so they may often be the tallest buildings in the locale.
For this reason, geo-distribution of applications, data, services etc might be more of a consideration than has been the case in the past. Moving applications and services to the cloud has a lot to do with the outsourcing of risk. Moving the entire estate of all business applications to a single data-centre would be a bad move. A simple lightning strike could cripple the entire business. So it seems some critical applications would be architected in a way that allowed for geo-distribution, so they could survive a strike. Other applications might be categorised and distributed to different data-centres. For example it’d be mad to put all collaboration applications in the same data-centre. But to distribute them over several geographically separated data-centres means say, Instant Messaging might be knocked out, but workers can still communicate with each other over email.
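To make the categorise-and-distribute idea concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the data-centre names, the application catalogue and the simple round-robin placement are all hypothetical, and a real placement would also have to weigh cost, latency and data-residency rules.

```python
from itertools import cycle

# Hypothetical data-centre names and application catalogue, for illustration only.
DATA_CENTRES = ["dc-north", "dc-west", "dc-south"]

APPS = {
    "collaboration": ["email", "instant-messaging", "video-conferencing"],
    "finance": ["payroll", "invoicing"],
}


def spread_by_category(apps_by_category, data_centres):
    """Round-robin each category's applications across the data centres,
    so that no single data centre hosts a whole category (as long as
    there are at least two data centres and two applications)."""
    placement = {}
    for apps in apps_by_category.values():
        targets = cycle(data_centres)  # restart the rotation for each category
        for app in apps:
            placement[app] = next(targets)
    return placement


def survivors(placement, apps_by_category, failed_dc):
    """Return, per category, the applications still running if failed_dc goes offline."""
    return {
        category: [app for app in apps if placement[app] != failed_dc]
        for category, apps in apps_by_category.items()
    }


placement = spread_by_category(APPS, DATA_CENTRES)
print(survivors(placement, APPS, "dc-west"))
```

With this placement, losing the hypothetical "dc-west" knocks out Instant Messaging and invoicing, but email and video-conferencing carry on, so workers can still talk to each other.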
In some parts of the world though – like North America and Europe – there is often legislation that says the application or data can’t live off European (or US) soil. In that case, it’s obviously key that the cloud provider has multiple data centres in the geographic region covered by the legislation. As far as “off country soil” legislation is concerned, the US is well covered by cloud operators that have multiple in-country data-centres. But that’s not so much the case outside of the US.
There’s also a case for sharing your cloud architecture with your business partners. In a manufacturing business with a long and complicated supply chain, the entire operation could be compromised if all the companies happened to host their supply-chain systems in the same cloud data-centre. If you think about it, it’s a fairly likely scenario as the cloud becomes more mainstream and say, European-based companies automatically select their local data-centre for their systems.
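If partners did share where they host, spotting the risky concentration would be simple. The Python sketch below is only an illustration: the partner names, data-centre names and the 50% threshold are all made up, and it assumes each partner is willing to disclose a single hosting location.

```python
from collections import Counter

# Hypothetical supply-chain partners and the data centre each one hosts in.
PARTNER_HOSTING = {
    "our-factory": "eu-dc-1",
    "parts-supplier": "eu-dc-1",
    "logistics": "eu-dc-1",
    "packaging": "eu-dc-2",
}


def concentration(hosting):
    """Return the fraction of partners hosted in each data centre."""
    counts = Counter(hosting.values())
    total = len(hosting)
    return {dc: n / total for dc, n in counts.items()}


def shared_risk(hosting, threshold=0.5):
    """List data centres hosting more than `threshold` of the supply chain."""
    shares = concentration(hosting)
    return sorted(dc for dc, share in shares.items() if share > threshold)


print(shared_risk(PARTNER_HOSTING))
```

Here three of the four hypothetical partners sit in "eu-dc-1", so one strike on that data-centre would stall most of the chain.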
As I said at the start about calculating a number and doubling it: it would seem wise to do the same thing with the risks to business-critical applications – to think of a disaster and double it…
Planky – GBR-257
This article is cross-posted to the UK MSDN Team Blog.