Today I hosted four hours of interactive learning on S+S testing, with table topics such as "Testing in Production, How far can we go?" and "Release Cadence in an S+S era." Every time I get together with smart engineers, new and better ideas are generated.
One interesting example that came up in the afternoon session was the impact that background or maintenance tasks can have on a data center's infrastructure. In this example, a rather large Microsoft service was getting ready for a big launch and needed to upgrade thousands of servers in a data center to the latest version of the service. The deployment of this service is fully automated (I'll write a post ranting about the importance of deployment in the near future), so with the push of a single button the deployment was off. The "bug," if you will, occurred because all the machines became very busy pulling down bits and reading and writing to disk, and they hit a higher average CPU utilization than they would during normal production use. This massive load caused power failures in the data center. So where is the bug?
Should we fix this in software, or rely on a new policy of never upgrading thousands of machines at the exact same time again? This is a very interesting edge case and I don't have the answer. I offer it up as an example of how much there is to consider and learn as we move into the S+S and Web 2.0 worlds of cloud computing and multiple devices.
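To make the "fix it in software" option a little more concrete, here is a minimal sketch of one possible approach: split the fleet into waves so that only a bounded number of machines upgrade at once. This is not how the service in the story handles it (I don't have the answer there). The function name `plan_waves`, the `max_concurrent` limit, and the server names are all made up for illustration.

```python
from typing import List, Sequence


def plan_waves(hosts: Sequence[str], max_concurrent: int) -> List[List[str]]:
    """Split hosts into consecutive waves of at most max_concurrent machines."""
    if max_concurrent < 1:
        raise ValueError("max_concurrent must be at least 1")
    return [
        list(hosts[start:start + max_concurrent])
        for start in range(0, len(hosts), max_concurrent)
    ]


# Hypothetical fleet: 10 servers, with at most 4 upgrading at any one time.
fleet = [f"server-{n:02d}" for n in range(10)]
waves = plan_waves(fleet, max_concurrent=4)

for number, wave in enumerate(waves, start=1):
    print(f"wave {number}: {len(wave)} hosts")
```

A real deployment system would presumably pick the wave size from the data center's power and cooling headroom and wait for each wave to settle before starting the next. The sketch only shows the batching itself.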
A few more ideas that I gathered today: measure not just time to deploy but also time to roll back in case the deployment is flawed; decide whether to target the 75th or the 95th percentile when measuring and signing off on Page Load Times (PLT); and include post-RTW measurements in your release criteria before you really sign off. All of these are great ideas that are at least a new twist on an important topic, if not completely new. The great thing is they all came up during the training session today and I was lucky enough to hear them. Though most of the content is not public, I will dive deeper into some of the hot topics next week.
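To show why the choice of percentile matters for PLT sign-off, here is a small sketch that computes both from the same set of measurements. The page load times are invented for illustration, and the nearest-rank definition of a percentile is my assumption; other tools may interpolate and give slightly different numbers.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Made-up page load times in seconds, including two slow outliers.
plt_seconds = [0.8, 0.9, 1.0, 1.0, 1.1, 1.2, 1.2, 1.3, 1.4, 1.5,
               1.5, 1.6, 1.7, 1.8, 1.9, 2.0, 2.2, 2.5, 6.0, 9.5]

p75 = percentile(plt_seconds, 75)
p95 = percentile(plt_seconds, 95)
print(f"P75 = {p75}s, P95 = {p95}s")
```

With this sample the 75th percentile is 1.9 seconds while the 95th is 6.0 seconds, so a sign-off bar of, say, 2 seconds passes at P75 and fails badly at P95. The slow tail only shows up at the higher percentile.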
The other experience I had today was delivering a webinar to a SIG in Bogota over Live Meeting. This session was in support of the book I helped write with my colleagues Alan Page and B.J. Rollison, titled “How We Test Software at Microsoft.” For more information on the book, visit www.hwtsam.com. For this session I delivered some content on Chapter 14, which focuses on S+S testing, and then answered some questions.
Doing a webinar with Q&A can be challenging. Add the translation piece and no video of the audience to gauge their reactions, and it becomes very challenging.
In the webinar I introduced the topic of Testing in Production (TiP) for services. It is a growing field of thought within Microsoft and, from what I can tell, a process used very heavily by some of our competitors. The notion that one would ship something into production and then test it seems anathema to software testers. Needless to say, this became the major topic of Q&A.
The real way to look at TiP is to ask what can safely and effectively be tested in production. The next question is how to make testing in production a fast-turnaround process that is cheaper than testing in a lab. When production testing is both cheaper and faster than lab testing, and we are getting there with cloud computing, you really should move all the testing that you can out of the lab and into production.
I have an article under way just on TiP and hope to publish it in the near future.
Thank you to Javier Andres Caceres Alvis for the opportunity to discuss S+S and services testing with your group.