Testing in Production (TiP), a common and costly technical malpractice???

Cool… I am seeing more hits on Bing and other search engines (I comparison shop my queries) for “Testing in Production”. Here’s one:

Why testing in production is a common and costly technical malpractice

Yipes. Not the message I’ve been trying to drive, but even within Microsoft, especially among those working on shrinkwrap products (i.e. not services), Testing in Production can hearken back to those dark days of developers sneaking releases past the QA team and subjecting users to their buggy mess. But no, this article is indeed about services, specifically those running on IBM’s WebSphere. And this is the WebSphere SWAT team warning their users with this message. Let’s see what’s up.

One mistake in particular was that the bank had created two application servers on a single installation of WebSphere Application Server base: one of the application servers was the test server and the other was the production server. Because both application servers ran on the same base WebSphere Application Server, their logs are shared, their ports are configured to avoid conflict, and, most importantly, because they ran on the same binary codebase, any upgrade to the Software Development Kit (SDK) would disrupt both application servers. While frequent updating to the test system is necessary, the repeated disruption to the production system was intolerable.

I agree, that sounds pretty bad. Their test processes (the SDK updates of the test system) are destabilizing their production systems and impacting users. This is obviously not acceptable. So what does WebSphere SWAT recommend?

Simply, you should have a separate test system that is identical to your production system. You can conduct load and stress tests on the separate test system before moving the application to production, where it should expect to work without disruption. Having a separate test system has many advantages, including:

  • Prevents unintended production disruption from test activities.
  • Provides a platform for performing functionality and integrity tests before performing major upgrades.
  • Provides an environment for duplicating production problems and testing fixes.

Simple, eh? Maybe if you’re making “IBM Mo-ney” you can afford to build a complete and identical copy of your production data center for testing. Sounds good to me. Anybody have 100,000-plus servers I can borrow to create my Bing test environment? :-)

OK, I’m just teasing our friends at IBM a bit… they do concede that if complete production duplication is not an option, you can instead try these:

    • Maintaining a scaled-down environment with load generators that duplicate the expected load.
    • Maintaining a model or simulation of the environment.

So what’s the big take-away I want to share with you here? It is that Testing in Production (TiP) is not wrong, but you can do it wrong. TiP is a great way to get exposure to the diversity of the real world, and to uncover bugs you just won’t find in the lab (assuming your lab is not exactly like your production), but there are right and wrong ways to do it. The WebSphere example above is a wrong way.

  • Even if you do duplicate your production environment in the lab, you should load that environment with real data from production, such as user workflows, data, and resources. Even when testing in a scaled-down lab, bringing in this production data (even if scaled down itself) gets you that much closer to TiP.
  • When you do TiP, you should mitigate risk to your production users.

One way to avoid the mistake made in the WebSphere example when Testing in Production is to isolate your test systems deployed in production from the production systems when possible. Netflix does this:

These provisional APIs can also be deployed and operated on different server clusters than the stable APIs, minimizing the chance of less tested, more rapidly developed code being tested with thousands of users from potentially destabilizing services being used by millions of other users. --API Strategy Evolution at Netflix. December 1, 2010
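The isolation idea in that quote can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not Netflix's implementation: the cluster names, the hostnames, the `/provisional/` path prefix, and the helpers `pick_cluster` and `assert_isolated` are all hypothetical. The point is simply that less-tested code gets its own servers, and that you verify the two pools share nothing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    """A named pool of hosts (hypothetical model, for illustration only)."""
    name: str
    hosts: tuple


# Assumed, made-up pools: stable and provisional code never share a host.
STABLE = Cluster("stable", ("stable-1.example.internal", "stable-2.example.internal"))
PROVISIONAL = Cluster("provisional", ("prov-1.example.internal",))

# Assumed convention: provisional APIs live under this path prefix.
PROVISIONAL_PREFIXES = ("/provisional/",)


def pick_cluster(path: str) -> Cluster:
    """Send provisional API requests to their own cluster, everything else to stable."""
    if path.startswith(PROVISIONAL_PREFIXES):
        return PROVISIONAL
    return STABLE


def assert_isolated(a: Cluster, b: Cluster) -> None:
    """Raise ValueError if two clusters share any host (the WebSphere mistake)."""
    shared = set(a.hosts) & set(b.hosts)
    if shared:
        raise ValueError(f"{a.name} and {b.name} share hosts: {sorted(shared)}")


# Check isolation once at startup, before routing any traffic.
assert_isolated(STABLE, PROVISIONAL)
```

Running `assert_isolated` at startup is the cheap part; the bank in the WebSphere story would have failed it immediately, since its test and production servers sat on the same installation.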

And there’s another risk mitigation strategy also illustrated by this approach: exposure control, which lets us expose the new code to only a limited number of users. We still gain from the exposure to real users, but we control the impact of any problems in the test code.
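To make exposure control a bit more concrete, here is a minimal sketch in Python of one common way to do it: hash each user into a bucket and expose the new code only to buckets below a threshold. Everything named here is my own assumption for illustration (the function names, the `new_ranker` feature name, the 5 percent default); it is not how Netflix or Bing actually does it.

```python
import hashlib


def in_exposure_group(user_id: str, feature: str, percent: float) -> bool:
    """Return True if this user falls inside the exposed slice for `feature`.

    The user is hashed together with the feature name, so the same user
    always gets the same answer for a given feature, and different
    features expose different slices of users.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # bucket in 0..9999
    return bucket < percent * 100  # e.g. 5.0 percent -> buckets 0..499


def choose_code_path(user_id: str, feature: str = "new_ranker", percent: float = 5.0) -> str:
    """Pick the code under test for exposed users, the stable code for everyone else."""
    return "test" if in_exposure_group(user_id, feature, percent) else "stable"
```

Because the assignment is deterministic, a user does not flip between old and new code from one request to the next, and dialing `percent` down to 0 acts as a quick kill switch if the test code misbehaves.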

And as always when talking about bad-TiP’ing, I will pick on my friends at Amazon, and say, don’t do this. :-)