For some time now I’ve been promising to lay out all my thinking on Testing in Production (TiP). I first introduced the topic in a blog post in June 2009 titled “Ship your service into production and then start testing!” Over the summer I had some family commitments. In late summer I finished up my internal Microsoft ThinkWeek (full article in Seattle Times here) white paper on the subject (5 five-star reviews so far) and had to prepare a couple of talks on the subject for STARWest and the TwinSPIN & Benchmark QA SIG in Minneapolis. All those excuses are over and now I’m ready to get back to the blog.
What follows is an updated version of the executive summary of the ThinkWeek paper that I co-authored with Seth Eliot (Seth's new blog can be found here) and Ravi Vedula on TiP-ing Service Testing. The paper did quite well: we received numerous five-star reviews, and the chair gave us a four-star review with a must-read comment. The paper has seen over five hundred views so far, so I'd call it a success. It is my best overview of the subject and makes the case for why TiP is an important framing concept for evolving our approach to services testing. For more on ThinkWeek check out the YouTube video or news article links above.
When it comes to services testing there are two diametrically opposed perspectives. The most common perspective focuses on making test environments in the lab as close to production as possible. This approach grows out of the oft expletive-laden post-RTW (Release to Web) admonishment, “how could test have missed this bug,” and the equally common retort, “our test environment isn’t enough like production so we couldn’t catch it.”
Many a tester has been burned by the bug that was missed, and when this happens they often become defensive and more risk averse. They tend to drift into the first school of thought, which attempts to continually increase the precision with which their test environment matches production. All sorts of techniques are used, from purchasing the exact same servers for testing as production, to purchasing load balancers, and even taking sanitized dumps of production data into the test lab. After all, it is the job of testing to make certain bugs don’t get away, and if even a single bug is missed in test because there was a missing network device or we only had one SQL Database instead of mirrored databases like production, then we must close that gap and stop the bugs from getting out into the wild.
The funny thing is, whether it is a game, a desktop application or a web service, if you are a tester that has shipped a product, you have missed a bug. I know I have missed a lot of bugs in my career, and yet I remain employed. While I too work to find all the bugs in a product, when I hear a manager ask that same tired question, “How could test miss this bug,” I tend to take a deep breath and just let it pass.
Personally I am always confident that my team and I have done a good job, and so my advice to others is to just let the comment pass. Think of it this way: it’s not your fault. If the bug hadn’t been designed in the first place or coded in the second place, it wouldn’t exist. The fact that it exists is not your fault. You found tons of bugs in the product, so a few got past. Go yell at the development team for daring to write the bug in the first place. When you become defensive you become obsessed with never missing a bug. I have a long talk and even a tutorial on what we can learn from bugs that get away; for now, just let it go.
The second perspective accepts that test can never find every bug in the test lab and that, with respect to services, test can never fully emulate production. Trying to find all possible production bugs in test is a daunting, expensive and ultimately unachievable aim. There are simply a vast number of bugs that will only ever be found in production. This line of thinking pushes one toward considering risk mitigation once a bug is out. After accepting that bugs will get out, you can free your mind to start thinking about how to leverage production for testing and how to make production more test friendly.
TiP is really a paradigm shift requiring the acceptance that test environments have limitations, test environments cannot catch all the bugs, and testing in production is the most viable option.
I should probably pause a moment and define what I mean by Production and Testing in Production.
Definition of PRODUCTION: For this paper production will be the current version (v-current) of the service, including all the data centers and machines v-current runs across. It will also include v-Next instances that live in the data center and have real world traffic hitting them.
Broad Definition of TiP: TiP constitutes all testing activities occurring on hardware in a data center.
To most testers and operations engineers the notion that you would test in production seems verboten. TiP is anathema. TiP is just not allowed. The reality is that it is allowed and it is much more commonplace than many think. In fact this paper will argue that we need to further invest in and enhance our abilities to test in production.
Figure 2-Test Account used frequently on Hotmail for production testing
If you were to look at TiP along a continuum, at one end you would find the “Lab Centric” perspective and at the other you would find the “TiP Zealot.” The Lab argument would focus on making the lab as much like production as possible. If we are looking at Windows or Office, that would include all kinds of hardware and peripheral devices, all kinds of applications for compatibility testing, and thousands of machines running automated tests. To be fair, this has worked well for Microsoft with respect to desktop applications and server products. Even so, we work with beta partners who deploy the next version of our products into production so we can get early feedback. In this case though, production is often treated as a backup to testing in the lab; the test focus is still within the test lab.
This approach to testing has been used by Microsoft and many other companies as the primary approach to testing web services.
The TiP zealot would argue that the lab will never be able to replicate the real world so all final sign off must happen in the real world, in production. Looking again at the software side of things, we see the example where we run betas that sometimes include millions of customers and identify selected Technical Adoption Program (TAP) partners to sign off on final Release to Manufacturing (RTM).
Figure 3: Services TiP Continuum shows the shift to more production testing
The reality is that, whether we are talking software or a service, we already conduct much of our testing in the real world and in production. What this series of posts attempts to do is organize a collection of best practices within the services space that will move us further along the TiP continuum safely toward a TiP-centric perspective.
Future posts in this series will focus on three main sub-sections from the original white paper. These upcoming sections are essentially the why, the what, and the how of TiP:
· Why – Factors pushing us toward TiP
· What – Best Practices for TiP
· How – Enabling technologies for TiP
Why – Factors pushing us toward TiP discusses how factors such as Scale and Integration are forcing many services to test in the data center and even forcing them to test the data center itself. Technology shifts such as Cloud Computing make it easier to test in production and other shifts such as an ever deepening stack of infrastructure layers make it impossible for us to replicate production in test labs.
What – Best Practices for TiP is likely the most fun for readers as I will simply lay out my top TiP Tips as a collection of best practices. These practices will cover everything from the simple idea of purchasing production hardware in the data center that is used both for testing and production interchangeably to the relatively bold concept of shipping test hooks into production.
How – Enabling technologies for TiP will wrap up this series by outlining some areas where we need new enabling technologies and improved architecture. Many of these solutions should be built into the coming wave of Cloud Computing offerings being made by just about every major software and Telco company in the world.
As I’m building this series off of the white paper written by me (@rkjohnston), Seth Eliot and Ravi Vedula, the next installment should come out fairly soon. In the meantime, if you have questions about TiP or just want to add your thoughts, please post a comment.