In the web services world I’ve heard a lot of really stupid things said by a lot of really smart people. I will admit, without details, that I have been guilty of saying some of those stupid things myself. I like to believe that I am learning and evolving, though, as I now recognize those previous statements as being less than wise.
One area rife with absurdity is web service deployment. I could go on for pages about the number of foolhardy things I have heard, seen, and done with respect to service deployment. For today, however, I choose to pick on just one: the 50/50 upgrade.
I have heard this one time and again, as recently as last week. It is often said with perfunctory conviction, a puffed-up chest, and a broad, beaming, confident smile by the engineers who have finally implemented it. The statement goes like this: “We finally have a great deployment solution. We can do 50/50 upgrades across two datacenters.”
Yuuuck!! Every time I hear this, I wrench handfuls of hair out of the pitiful amount remaining about my head and run screaming from the conference room. It is perhaps the most pathetic-neanderthalic-misanthropic statement an engineer could make. What really gets me is how it is always said with such bravado, such confidence, such unthinking belief that this is a great idea, that you suddenly realize the engineer saying it, and those nodding their heads like little Chihuahua bobbleheads, really believe it.
So what is a 50/50 upgrade, you ask? It works essentially like this.
A web service is built out with all the components needed to function, and often this includes a set of web front-end servers, maybe a middle-tier server, and a database of some sort on the very back end. Separating out the components of a web service is a common architecture in the enterprise, in traditional online services, and even within cloud computing. It allows for improved utilization of hardware resources, improved security by cordoning off critically sensitive customer data, and, when done correctly, more rapid development on each sub-component. Even the Microsoft Azure service works on this model.
A v1 “new” service may make the mistake of prioritizing speed to market over maintenance features and thus launch with an architecture that is tightly coupled and in massive flux. This will force the development team into many rapid releases that cause the entire stack to change. Applying upgrades to every server in production flows from the service pack mindset of the enterprise, where a poor operations engineer has been assigned the task of applying the new service pack or security patch to every machine in production by the end of the weekend. Apply the upgrade everywhere, innovate everywhere in tandem, upgrade to the new version of the service all at once so we know where we are...
By the way, I really can’t stand tight coupling and all-or-nothing architectures. Perhaps I’ll get to another post on general SOA concepts as opposed to this one about Stupid SOA (SSOA).
This is a bad approach for web services and leads toward tight coupling and monolithic upgrades. Monolithic upgrades of a service often result in the well-known “Sorry, our website is down for maintenance” sign being put up for the world to see. These kinds of signs range from somewhat cutesy to downright ugly.
The maintenance window is actually an acceptable option. Many major web services have used this concept. Services including parts of MSN, eBay, and even SalesForce.com have used the maintenance window concept on a weekly or monthly basis. The problem is that it is a crutch, and the web service really needs to focus on how to have zero-downtime deployments.
The decision making process on how to move to a 50/50 upgrade model tends to go something like this:
1. We can’t keep taking these maintenance downtimes. Our competition isn’t down, why are we? Get me one of those zero downtime deployment solutions and get it now!
2. Oh, and we should also become geo-redundant. I hear that is a great thing in case we get hit by one of those earthquake things.
3. Let’s build out a duplicate in a new data center so we can do a 50/50 upgrade! All we need is an expensive network load balancer at the edges and a bunch of virtual IPs.
4. Great, but we’ll need to build it out with enough capacity to handle all the user load.
5. That’s right, and our peak loads usually come right around the holidays, back to school, or year-end accounting, or federal tax day, or Valentine’s Day. This last example probably only affects http://hallmark.com (see the Hallmark case study slides among the TiP slides), but you get the point that most services have a peak utilization that is calendar or event driven.
6. This will be great because if the upgrade doesn’t work well we can always fail back to the half that hasn’t been upgraded.
7. We’ll need to build out with some buffer too, because we don’t want a service performance slowdown when we get a really big spike. I recommend 125% of peak.
8. Perfect, we don’t want any risk with this new 50/50 upgrade architecture.
9. Great, let’s build out another duplicate cluster.
At this point the meeting ends and everyone goes their separate ways feeling quite pious about the great decisions they just made. The move to a 50/50 upgrade is underway.
Let me briefly explain some of the mechanics of a 50/50 upgrade. A service builds out a duplicate infrastructure, often in a different data center, though it could be in the same one; either way it involves new hardware. The same mistake can be made within cloud computing by creating duplicate VMs. This new build-out needs to be at full scale, as the maintenance window might become twenty-four or forty-eight hours. Each of these units is often called a cluster or a scale unit. Fortunately, with modern network load balancers, such as those built by F5 Networks, it is easy to fail over from one cluster to the other, thus allowing you to take one cluster out of rotation for maintenance. Upon completion of maintenance the upgraded cluster is put into production while the other half is taken offline for the upgrade.
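To make the rotation concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the function name, the cluster names, and the `is_healthy` callback are stand-ins I made up for illustration, and the "upgrade" is just a dictionary update rather than a call to any real load balancer or deployment tool. It only models the order of operations: drain a unit, upgrade it, check it, and put it back before touching the next one.

```python
from typing import Callable, Dict, List


def rolling_upgrade(
    units: List[str],
    versions: Dict[str, str],
    new_version: str,
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Upgrade one unit at a time; return the units left in rotation."""
    in_rotation = list(units)
    for unit in units:
        in_rotation.remove(unit)      # load balancer drains this unit
        versions[unit] = new_version  # stand-in for the real upgrade work
        if not is_healthy(unit):
            # Stop here: this unit stays out of rotation on the new
            # version, and the remaining units keep serving the old one.
            return in_rotation
        in_rotation.append(unit)      # upgraded unit goes back into production
    return in_rotation


# Hypothetical two-cluster (50/50) setup where every health check passes.
clusters = ["cluster-a", "cluster-b"]
versions = {name: "v1" for name in clusters}
serving = rolling_upgrade(clusters, versions, "v2", is_healthy=lambda unit: True)
print(serving, versions)
```

With two units this is the 50/50 upgrade; passing a longer list gives the same rotation over smaller slices. Note what happens in the sketch when the health check fails: traffic stays on the old units, but the drained unit is simply left sitting on the new version.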
With teams I’ve run, we typically put the upgraded servers back into production overnight to see how they perform. We want to ensure the users are able to connect and use the service as expected. We also watch to see if the servers suddenly start to throw lots of errors or crash.
Once we are sure the new service is running as expected, we can begin to upgrade the remaining machines. This gives us our rapid rollback plan in case the new version has a major flaw and cannot be released. Using 50/50 upgrades as a tool for rapid rollback is the first stupidity I want to call out, though not the most significant.
RANT: If I design an upgrade that is so invasive it takes 25 hours to upgrade a cluster of, say, 100 servers, then put it into production and it fails, and my rollback plan is to fail back over to the non-upgraded machines, then what the heck am I to do with the mess I created on the first half? Does anyone have an automated tool for me to get them back to the previous version? Nooo. Does anyone have a plan on how to fix the bug in the new version so I can move safely and easily forward? Noooooo. All I have is a net to land in when the service falls over, but unfortunately there is no way out of this net. I’m stuck halfway between flying seamlessly through the air with the greatest of ease and smashing my server guts all over the concrete floor. If a trapeze artist falls every other time they try to jump from one bar to the other, it isn’t much of a show. Proper practice to get the transfer right is the way to go; the net is there for only the worst of incidents. Teams that fall into the 50/50 upgrade trap and point to their rollback plan as failing back to the untouched servers are cutting corners and hoping the net will never snap.
So, that’s the first stupidity. The second stupidity is the one that really chaps my hide and that’s the one around frivolously adding massive cost to an online service. In fact because of this one factor, I would rather a service stick with maintenance windows than move to the 50/50 upgrade option. Table 1 shows just how expensive a 50/50 upgrade can be and how quickly the cost drops if you simply move to a smaller scale for upgrades such as a ¼ upgrade.
Upgrade % | # of units to upgrade | Minimum capacity per unit for peak load | % capacity for entire service | % wasted capacity
--- | --- | --- | --- | ---
50/50 | 2 | 100% + 25% or more buffer | ~250% | 150%
1/3, 1/3, 1/3 | 3 | 50% + 12.5% buffer | ~187.5% | 87.5%
1/4, 1/4, 1/4, 1/4 | 4 | 33.33% + 8.33% buffer | ~166.7% | 66.7%
Table 1: % of wasted capacity moving from a 50/50 upgrade to a minimum of four clusters with shared redundancy.
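The arithmetic behind the table is simple enough to sketch in a few lines of Python. This is my own back-of-the-envelope model, not a sizing tool: it assumes one unit is out of rotation at a time, that the remaining units must carry peak load plus the buffer, and that all units are the same size. The names `CapacityPlan` and `capacity_plan` are made up for this example.

```python
from dataclasses import dataclass


@dataclass
class CapacityPlan:
    units: int
    per_unit_pct: float  # capacity of each unit, as % of peak load
    total_pct: float     # capacity of the whole service, as % of peak load
    wasted_pct: float    # capacity above peak load, as % of peak load


def capacity_plan(units: int, buffer: float = 0.25) -> CapacityPlan:
    """Capacity needed to serve peak load with one unit out for upgrade."""
    if units < 2:
        raise ValueError("need at least two units to upgrade without downtime")
    # The remaining (units - 1) units must carry peak load plus the buffer.
    per_unit = 100.0 * (1.0 + buffer) / (units - 1)
    total = per_unit * units
    return CapacityPlan(units, per_unit, total, total - 100.0)


for n in (2, 3, 4, 8):
    plan = capacity_plan(n)
    print(f"{n} units: {plan.per_unit_pct:.1f}% each, "
          f"{plan.total_pct:.1f}% total, {plan.wasted_pct:.1f}% above peak")
```

For two units this gives 250% total capacity and 150% above peak, which is the 50/50 row. The overhead keeps shrinking as the number of units grows, because the capacity sitting idle during an upgrade is one unit’s worth and each unit gets smaller.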
With a typical 25% capacity buffer, a 50/50 upgrade scenario will push a team to build out a minimum of 150% extra capacity. Remember that this is 150% wasted capacity above a peak utilization that might come just once a year. For Microsoft Office we know our biggest peak is right around back to school in the early fall, and so we have extra capacity available at http://www.office.com for this huge spike, but we don’t do 50/50 upgrades. The real waste in terms of capital is probably greater than 200%, and once you add in the operational expense of powering and maintaining that much extra capacity, a service can easily go from being cost effective to cost prohibitive.
This is the stupidity I really loathe, as it is so obvious and yet never really considered up front. For a little extra effort a team can often engineer their deployment scenario to the point where they can move from a maintenance window model to an N-cluster deployment option where N is four or greater. That is what I point out to every team I work with that falls for the 50/50 trap. Still, many ignore me and take the easier route of throwing hardware at the problem. The problem is that this bad decision, once made, sinks capital and actually limits the team’s ability to be more innovative through loose coupling and more dynamic and automated deployments.
I’m planning on writing a few more blogs on Stupid SOA and other stupid things about services. If you have some to share, I’d love to hear yours. I’ve also tweeted some gems on my Twitter account at https://twitter.com/rkjohnston. Tweet, comment, send me email, but whatever you do, if you have a good one, please send it to me.