To bring the Windows Live experience to life, we host a variety of interconnected services. These are large-scale services that run on thousands of servers and, because of their scale, cannot be upgraded atomically. One of the challenges is making sure that when one of these services is upgraded, the rest of the system keeps working properly. Upgrades to the full site require a coordinated test effort that often involves running one service against multiple versions of another. To ensure that upgrades go smoothly, Windows Live uses several strategies to cover every version to which our users will be exposed during a release.
Keeping the state throughout upgrades
When testing an upgrade in a stateful system like ours, we need to think about how the new version of the software will interact with preexisting state and data. It is a common mistake to only test a service with data created using the new version, when existing customer data has been around for much longer and has gone through multiple upgrades and migrations. The question is how we replicate this state in our labs, particularly using our automation. Some of our approaches are the following:
- Know your data scenarios: One strategy is having a list of the relevant user data scenarios and having multiple instances of each of them provisioned in our lab. We can then check on the state of the data throughout the upgrade. Without knowing the states that the user can be in, it is hard to make sure that they will be working on the new version.
- Build automation for upgrades: Generally, test automation has setup, execution and verification stages. When defining tests for upgrade scenarios, however, we can think in terms of pre-upgrade setup steps and post-upgrade validation steps. We can then structure tests so that we run the setup step before the upgrade begins, and then run any number of validation actions after each step of the upgrade. Building tests designed for upgrades gives us the ability to automate the process of testing an upgrade.
- Run a practice upgrade: It is always good to have an environment in the labs which has gone through the same upgrade steps as production has, with accounts that have gone through the same migration process. This often lets us detect unexpected issues that we did not catch through our automation.
- Test throughout the full upgrade: Beyond having tests that run after each upgrade step, another strategy is to have tests that are continuously running during the upgrade. These tests start with an existing user state, and execute operations with corresponding verification steps as the upgrade runs through. This lets us find issues that may only happen when the upgrade is actually running.
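The setup-then-validate structure described in the list above can be sketched as follows. This is a minimal illustration, not our actual test framework: `FakeService`, `UpgradeTest`, `run_upgrade` and the two upgrade steps are hypothetical names invented for the example, and the "service" is just an in-memory dictionary.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class FakeService:
    """Hypothetical in-memory stand-in for a lab service (not a real API)."""
    version: str = "Vcurrent"
    data: Dict[str, str] = field(default_factory=dict)


@dataclass
class UpgradeTest:
    name: str
    setup: Callable[[FakeService], None]     # runs once, before the upgrade begins
    validate: Callable[[FakeService], bool]  # runs after every upgrade step


Step = Tuple[str, Callable[[FakeService], None]]


def run_upgrade(service: FakeService, steps: List[Step],
                tests: List[UpgradeTest]) -> List[Tuple[str, str]]:
    """Create state on the old version, then validate after each upgrade step.

    Returns the (step name, test name) pairs whose validation failed.
    """
    for test in tests:
        test.setup(service)
    failures = []
    for step_name, step in steps:
        step(service)
        for test in tests:
            if not test.validate(service):
                failures.append((step_name, test.name))
    return failures


# Hypothetical upgrade steps, for illustration only.
def migrate_schema(service: FakeService) -> None:
    # A well-behaved migration keeps every existing record.
    service.data = dict(service.data)


def switch_binaries(service: FakeService) -> None:
    service.version = "Vnext"


contact_test = UpgradeTest(
    name="existing contact survives",
    setup=lambda s: s.data.update({"user1/contact": "alice"}),
    validate=lambda s: s.data.get("user1/contact") == "alice",
)

STEPS = [("migrate_schema", migrate_schema), ("switch_binaries", switch_binaries)]
failures = run_upgrade(FakeService(), STEPS, [contact_test])
print(failures)
```

Because the failure list records the step name, a migration that loses data is attributed to the step where the loss first shows up, rather than being discovered only once the whole upgrade has finished.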
Another interesting issue that comes up when upgrading stateful systems is that old versions of client software may not support state created with new versions of the service. Testing an upgrade that covers both client and server scenarios requires us to consider the version matrix and test each combination. Often this matrix is reduced by enforcing that a new version of the client software cannot run against an old version of the service code, which means that the transition we need to test is:
- Client: Vcurrent, Service: Vcurrent
- Client: Vcurrent, Service: Vnext
- Client: Vnext, Service: Vnext
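The matrix reduction described above can be sketched by enumerating every client and service pairing and dropping the one that is ruled out. The `is_allowed` rule below is an assumption that encodes only the constraint stated in the text (a new client never runs against an old service); the function names are hypothetical.

```python
from itertools import product
from typing import List, Tuple

VERSIONS = ["Vcurrent", "Vnext"]


def is_allowed(client: str, service: str) -> bool:
    # Assumed enforcement rule: a new client cannot run against an old service.
    return not (client == "Vnext" and service == "Vcurrent")


def reduced_matrix() -> List[Tuple[str, str]]:
    """Return the (client, service) combinations that still need testing."""
    return [(c, s) for c, s in product(VERSIONS, VERSIONS) if is_allowed(c, s)]


matrix = reduced_matrix()
for client, service in matrix:
    print(f"Client: {client}, Service: {service}")
```

With two versions of each side, the full matrix has four combinations and the enforcement rule removes one, leaving three to cover.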
Testing strategies when upgrading multiple services
Upgrading a single service in the cloud has its own level of complexity, but in practice we are often upgrading multiple services at a time. Common issues during upgrades happen when one service is rolled out sooner or later than expected. It is thus important to build a timeline of the full upgrade in order to understand the dependencies that one service has on another. Ideally, we would replicate this upgrade procedure on a production-like environment and then run end-to-end application tests. However, this is not always easy to coordinate and execute. It is generally easier to list the different versions of the services that we plan to run against during the upgrade. With this information, we can build a test matrix to confirm that we run cleanly against each of those versions.
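As a rough sketch of turning an upgrade timeline into a test matrix, the code below walks a planned rollout and records every combination of service versions the system passes through. It also enumerates the combinations that appear if services roll out in a different order than planned. The service names `storage` and `frontend` and both function names are hypothetical, chosen only for illustration.

```python
from itertools import permutations
from typing import Dict, List, Set, Tuple

State = Dict[str, str]


def versions_during_rollout(initial: State,
                            timeline: List[Tuple[str, str]]) -> List[State]:
    """Return each system state, in order, as services are upgraded one by one."""
    state = dict(initial)
    states = [dict(state)]
    for service, new_version in timeline:
        state[service] = new_version
        states.append(dict(state))
    return states


def states_for_any_order(initial: State,
                         timeline: List[Tuple[str, str]]) -> Set[Tuple]:
    """Return every distinct state reachable if the rollout order is not respected."""
    seen = set()
    for order in permutations(timeline):
        for state in versions_during_rollout(initial, list(order)):
            seen.add(tuple(sorted(state.items())))
    return seen


# Hypothetical services and planned order: storage first, then frontend.
INITIAL = {"storage": "Vcurrent", "frontend": "Vcurrent"}
PLANNED = [("storage", "Vnext"), ("frontend", "Vnext")]

planned_states = versions_during_rollout(INITIAL, PLANNED)
all_states = states_for_any_order(INITIAL, PLANNED)
for state in planned_states:
    print(state)
print(len(all_states), "states if the order is not guaranteed")
```

The gap between the two results is the set of combinations that only occur when a service ships sooner or later than expected. Enumerating all orderings grows factorially with the number of services, so this brute-force approach is only practical for a handful of them.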
One thing to keep in mind is that upgrade test procedures often rely on moving a test lab back and forth from one version to the other. How easy it is to execute upgrade testing depends entirely on how easy it is to bring a lab environment to a particular state. If you work on a system that is constantly running through upgrades, or that has a complex upgrade matrix, investing in easy and quick deployments is crucial to being able to test for and find upgrade-related issues.
In conclusion, upgrade testing requires thinking in terms of versions of services and clients, and about the different states that a user may be in when the process begins, executes and completes. Upgrade tests are often more involved and sometimes hard to execute; however, they are crucial to ensuring that our customers have a good experience when we offer them the new version of our service. Even if customers are exposed to the system mid-upgrade for only a short period of time, a small bug can lead to problems such as data corruption, which can have long-term consequences. Upgrade testing is often overlooked because it is hard and because the upgrade state is transient. However, our customers always deserve the highest quality service, and that includes the periods of time when we are upgrading.
Federico Gomez Suarez, Microsoft