Bing Outage Exposes Testing in Production (TiP) Hazards

I have been a big proponent of shipping services into production more frequently and conducting more testing on the code once it is in production. Those of us who support this methodology tend to call it TiP (Testing in Production). Links to previous blog posts on this subject are at the end of this post.

[Images: Bing during the outage; Bing back up and working]

After the recent Bing outage (evening of 12/3/2009), I find myself thinking about the hazards of TiP, and thought I might post some lessons I have drawn from this production outage and what has been written about it so far. ZDNet posted a somewhat sarcastic blog entry titled "Microsoft is making progress on search: You noticed Bing's glitch." According to the official blog post by the Bing team (here), “The cause of the outage was a configuration change during some internal testing that had unfortunate and unintended consequences.”

Internal testing that had unfortunate and unintended consequences

Despite this black mark, I still believe TiP is the right direction for services testing, but clearly there are some hazards and lessons we can extrapolate.

These two posts imply that the outage was widespread, noticed by a lot of people, and caused by an errant configuration change in support of a test. My assessment is that while there was clearly an attempt to run a test configuration in production, the test itself did not cause the outage. The problem was that the test configuration change somehow went to all of production.

The core concept of TiP is to minimize risk while testing in production. To accept the risk of running less stable code in production for testing, that code must be easily sandboxed. Whatever happened here was likely a configuration management mistake, not a testing error.
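To make the sandboxing idea concrete, here is a minimal sketch, purely my own illustration and not how Bing actually does it; the ring names, the `ConfigChange` structure, and the `apply_change` function are all hypothetical. The point is that a change flagged as a test is only ever allowed to touch a small, isolated slice of production.

```python
# Minimal sketch of sandboxing a test configuration change.
# The "rings" and field names here are hypothetical, purely for illustration.

from dataclasses import dataclass

PRODUCTION_RING = "prod"
TEST_SANDBOX_RING = "tip-sandbox"   # small, isolated slice of production capacity


@dataclass
class ConfigChange:
    name: str
    settings: dict
    target_ring: str                # which slice of machines this change may touch
    is_test: bool = False


def apply_change(change: ConfigChange, server_ring: str) -> bool:
    """Apply the change only if the server belongs to the ring the change targets."""
    if change.is_test and change.target_ring != TEST_SANDBOX_RING:
        raise ValueError("Test configuration must be scoped to the sandbox ring")
    if server_ring != change.target_ring:
        # Skip servers outside the targeted ring.
        return False
    print(f"Applying {change.name} to a server in ring '{server_ring}'")
    return True


if __name__ == "__main__":
    test_change = ConfigChange(
        name="experimental-ranking-flag",
        settings={"new_ranker": True},
        target_ring=TEST_SANDBOX_RING,
        is_test=True,
    )
    apply_change(test_change, TEST_SANDBOX_RING)   # applied inside the sandbox
    apply_change(test_change, PRODUCTION_RING)     # skipped: outside the sandbox
```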

I have an axiom for my operations team: all manual processes will eventually fail, unless they don’t.

I like Bing, but half-an-hour downtime is unacceptable these days. Do you guys not have failover systems?

Comment on Bing Blog by b.dylan.walker

The reality is that the Bing system is very automated. The team has shared some information about their infrastructure, so I won’t go into details here lest I share something not disclosed. An outage like this, caused by a test configuration change impacting production, is clearly a case of fast-moving automation.

In order to enable TiP and take more risk into production, the change management system of a service must be rock solid and fully automated. From what has been shared, the Bing team clearly has a state-of-the-art system. In fact, it is likely this state-of-the-art system that allowed the errant change to propagate so quickly and require a full rollback.

Therefore the gap must be in the safety mechanisms that should prevent such a mistake, combined with how fast the mistake rolled out to all environments. Another factor in successful TiP is metering of change in production. This change simply moved too fast, and while the Bing system is highly automated, it still takes a long time to undo a change across so many servers.
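To show what metering could look like in practice, here is a hedged sketch of a wave-based rollout. Everything in it (the wave sizes, the health probe, the apply/revert hooks) is made up for illustration; the idea is simply that a change touches a small slice of servers first, and the rollout halts and reverts the moment a health check fails.

```python
# Hedged sketch of a metered (wave-based) rollout with an automatic halt.
# Wave sizes, the health check, and the apply/revert hooks are hypothetical.

import random
from typing import Callable, List


def metered_rollout(
    servers: List[str],
    apply_change: Callable[[str], None],
    revert_change: Callable[[str], None],
    check_health: Callable[[str], bool],
    wave_fractions=(0.01, 0.05, 0.25, 1.0),
) -> bool:
    """Roll a change out in progressively larger waves; revert everything on failure."""
    touched: List[str] = []
    start = 0
    for fraction in wave_fractions:
        end = max(start + 1, int(len(servers) * fraction))
        wave = servers[start:end]
        for server in wave:
            apply_change(server)
            touched.append(server)
            if not check_health(server):
                print(f"Health check failed on {server}; reverting {len(touched)} servers")
                for s in reversed(touched):
                    revert_change(s)
                return False
        start = end
        print(f"Wave of {len(wave)} servers healthy; continuing")
    return True


if __name__ == "__main__":
    fleet = [f"server-{i:03d}" for i in range(200)]
    metered_rollout(
        fleet,
        apply_change=lambda s: None,                    # stand-in for the real deployment step
        revert_change=lambda s: None,                   # stand-in for the real rollback step
        check_health=lambda s: random.random() > 0.01,  # stand-in health probe
    )
```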

My takeaway from this outage is that TiP does work, but you need solid change management:

1. A fully automated deployment system

2. Rock solid controls on change management approval (see the sketch after this list)

3. Every change must be a metered change, so when a mistake does happen it doesn’t affect every server in production.
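As a sketch of point 2, an approval gate can be as simple as refusing to run a deployment unless the change record carries the required sign-offs. The record format and approver roles below are hypothetical, not anything Bing has described.

```python
# Minimal sketch of an automated approval gate in front of a deployment.
# The change-record fields and required approver roles are hypothetical.

REQUIRED_APPROVALS = {"test-owner", "ops-on-call"}


def approved_for_production(change_record: dict) -> bool:
    """A change may deploy to production only when every required role has signed off."""
    approvals = {a["role"] for a in change_record.get("approvals", [])}
    return REQUIRED_APPROVALS.issubset(approvals)


def deploy(change_record: dict) -> None:
    if not approved_for_production(change_record):
        missing = REQUIRED_APPROVALS - {a["role"] for a in change_record.get("approvals", [])}
        raise PermissionError(f"Deployment blocked; missing approvals: {sorted(missing)}")
    print(f"Deploying change {change_record['id']}")


if __name__ == "__main__":
    record = {
        "id": "CFG-1234",
        "approvals": [{"role": "test-owner", "who": "alice"}],
    }
    try:
        deploy(record)          # blocked: ops-on-call has not signed off
    except PermissionError as err:
        print(err)
```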

Those are the lessons I’ve drawn so far. What do you think?

Other Blogs on TiP

· TIP-ING SERVICES TESTING BLOG #1: THE EXECUTIVE SUMMARY

· Ship your service into production and then start testing!

Images from a blog post on WhatWillWeUse.com - https://whatwillweuse.com/2009/12/03/hold-on-40-minutes-while-i-bing-that/