On 29 September 2017, we experienced a major service incident that lasted just over 8 hours and affected many VSTS customers using our build features (incident blog here). We know how important VSTS is to our users and deeply regret the interruption in service. Over the last several days, we’ve worked to understand the incident’s full timeline, what caused it, how we responded to it, and what improvements we need to make to avoid similar outages in the future.
Starting at 15:51 UTC on 29 September, some VSTS users were not able to view their build definitions and the status of completed builds. The incident lasted 8 hours; however, most customer impact was mitigated within about 5 hours. During this time, some VSTS users who attempted to view their build definitions received a message that none were available:
And when navigating to a specific build definition directly, they received a message that no builds were available:
This issue was caused by a bug in a sprint deployment upgrade script. While an account was in the process of being upgraded, QueryBuildDefinitions and QueryBuilds could return zero results, so the user interface showed no build definitions or build results. Based on our telemetry, the total number of accounts potentially impacted by this bug was 8,244. This estimate is based on the accounts that made at least one call to the build service at any time during the incident. It is likely an overestimate of impact, since the scope of accounts affected was reduced as the release progressed across the environment. The graph below shows the incident timeline along with volume of build calls (red) to illustrate customer usage:
What went wrong:
This bug was introduced while trying to fix a long-standing issue related to searching for both builds and build definitions that many customers have encountered and reported. Builds and build definitions saved with certain characters would not be returned within the search results.
Fixing the search issues involved encoding build definition names and build numbers before storing them in our database, and decoding them as they are read back into the application tier. The upgrade first updated the stored procedures to write and read these new formats and then encoded the existing data to match the new format.
The deployment revealed a timing issue we weren’t previously aware of. Once the stored procedures were updated, they wouldn’t be able to correctly read existing build definition and build data until their format had been updated by a secondary upgrade task.
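The mismatch window can be illustrated with a minimal Python sketch. It is not the actual VSTS code: the real encoding scheme is not described here, so percent-encoding via `urllib.parse.quote` stands in for it, and `encode_name`, `new_lookup`, and `migrate` are hypothetical names used only for illustration.

```python
from urllib.parse import quote


def encode_name(name: str) -> str:
    # Assumption: percent-encoding stands in for the real (undisclosed) encoding.
    return quote(name, safe="")


def new_lookup(table: dict, name: str):
    # New read path: assumes every stored key is already in the encoded format.
    return table.get(encode_name(name))


def migrate(table: dict) -> dict:
    # Secondary upgrade task: rewrite every legacy key into the encoded format.
    return {encode_name(key): value for key, value in table.items()}


# A row written before the upgrade, stored under its raw (unencoded) name.
legacy_table = {"Nightly build #1": "def-1"}

# Between "read path updated" and "data migrated", the lookup finds nothing.
during_window = new_lookup(legacy_table, "Nightly build #1")

# Once the data has been migrated, the same lookup succeeds again.
after_migration = new_lookup(migrate(legacy_table), "Nightly build #1")
```

In this sketch the row is never lost; it is only unreachable through the new read path until the data migration catches up, which is the behavior described above.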
The data upgrade script was originally written to minimize that window: it updated those rows as aggressively as possible, all in a single batch per database. We were able to progress through two of our deployment rings without this approach causing an issue. In the third ring, we hit a database that had a significant amount of build data. Trying to update that data in a single batch caused resource constraints on the database, resulting in a rollback of the upgrade. Because of this, we changed the script to perform updates in smaller batches, with a sleep between transactions for any databases that were under load.
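A batched update of this kind might look roughly like the following sketch, which uses an in-memory SQLite database rather than the production data tier. The `definitions` table, the `encoded` flag column, and the function names are assumptions made for the example, and percent-encoding again stands in for the real encoding.

```python
import sqlite3
import time
from urllib.parse import quote


def make_demo_db(names):
    # Hypothetical schema: an "encoded" flag marks rows already in the new format.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE definitions ("
        "id INTEGER PRIMARY KEY, name TEXT, encoded INTEGER DEFAULT 0)"
    )
    conn.executemany(
        "INSERT INTO definitions (name) VALUES (?)", [(n,) for n in names]
    )
    conn.commit()
    return conn


def encode_in_batches(conn, batch_size=2, pause_seconds=0.0):
    """Encode legacy rows in small transactions; return the number of batches."""
    batches = 0
    while True:
        rows = conn.execute(
            "SELECT id, name FROM definitions "
            "WHERE encoded = 0 ORDER BY id LIMIT ?",
            (batch_size,),
        ).fetchall()
        if not rows:
            return batches
        with conn:  # commits this batch as one short transaction
            conn.executemany(
                "UPDATE definitions SET name = ?, encoded = 1 WHERE id = ?",
                [(quote(name, safe=""), row_id) for row_id, name in rows],
            )
        batches += 1
        time.sleep(pause_seconds)  # give a loaded database room to breathe
```

Each batch holds its transaction only briefly, which eases pressure on the database, but the total time until every row is encoded grows with the number of batches and the pause between them.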
We continued the upgrade with this new batching approach in place. Unfortunately, this extended the window in which customers would experience the user interface issues mentioned previously, which compounded the problem. As people started to notice their build definitions were missing, some users recreated them, often with the same name. Since our new stored procedures were in place, these new definitions were created with the new name format. The new definition’s encoded value and the old definition’s unencoded value could live side by side; however, when the upgrade attempted to encode the old one, it violated the primary key (PK) constraint, causing the upgrade to fail.
We were actively investigating the slowness of the upgrade, and once we noticed the PK violations we immediately patched the offending stored procedures so they would return both the unencoded and encoded values. In parallel, we modified the upgrade script to handle the PK violation, resolving name conflicts by prepending “restored.” to the definition names. This mitigated the issue.
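The two mitigations can be sketched in Python as follows. This is an illustration under assumptions, not the patched stored procedures: the table is modeled as a dict keyed by the stored name (standing in for the primary key), each value carries a hypothetical `is_encoded` flag, and percent-encoding stands in for the real encoding.

```python
from urllib.parse import quote

RESTORED_PREFIX = "restored."


def encode_name(name: str) -> str:
    # Assumption: percent-encoding stands in for the real encoding.
    return quote(name, safe="")


def find_definitions(table: dict, name: str) -> list:
    # Tolerant read: return matches stored in either the new or the legacy format.
    matches = []
    for key in (encode_name(name), name):
        if key in table and table[key][0] not in matches:
            matches.append(table[key][0])
    return matches


def migrate_with_conflicts(table: dict) -> dict:
    # Rows already in the new format are kept untouched.
    migrated = {k: v for k, v in table.items() if v[1]}
    for raw_name, (definition_id, is_encoded) in table.items():
        if is_encoded:
            continue
        candidate = raw_name
        # On a key collision, prepend "restored." to the old definition's name.
        while encode_name(candidate) in migrated:
            candidate = RESTORED_PREFIX + candidate
        migrated[encode_name(candidate)] = (definition_id, True)
    return migrated


# A legacy definition and a recreated one with the same visible name.
table = {
    "Build #1": ("old-def", False),
    "Build%20%231": ("new-def", True),
}
both = find_definitions(table, "Build #1")
result = migrate_with_conflicts(table)
```

Here the recreated definition keeps its name, while the original one survives the migration under a "restored." prefix instead of failing on a key collision.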
When customer-impacting incidents occur, we want to detect and resolve the issue as quickly as possible. For this issue it took us 8 hours, which is too long. As part of our postmortem review, we identified the following areas where we could have reduced our time-to-mitigate:
- Detection – We did not have an automated alert that detected this issue and instead we became aware of the issue after 45 minutes when a customer reported the problem. See the repair item below regarding detection improvements for more details.
- Safe Deployment Practices – We deploy all changes through a series of deployment rings to minimize the risk of widespread customer impact. As stated above, we realized the deployment script needed to be updated after progressing through several deployment rings. In retrospect, we should have been more cautious when restarting the refactored deployment in a later deployment ring. If we had limited the initial number of accounts that received the new update and actively monitored the deployment, we could have detected and mitigated the impact faster. This would also have involved the author of the change, who was best suited to understand the issue. See the safe deployment improvement below for more details.
We currently have automated test coverage for both binary and data-tier upgrades. However, this issue revealed a gap in our tests for compatibility mismatches between steps within a database upgrade. The challenge is that we never run the same upgrade more than once, so generic testing can only catch so much. Even so, there are improvements we can make in our processes and automated coverage that will help to avoid similar issues in the future.
|Automated test gap||We do not have coverage for issues between steps of database servicing. The challenge is that every upgrade is different. However, there are classes of upgrade patterns that we can automate coverage on||Add automation between steps of database upgrade to catch common patterns of issues:
|Code review policy||The product team didn’t catch the effect of the change to the batching and the change of the encoding while it was rolling out. There was a gap between requirements and delivery.||More complete sign-off procedures for any compatibility change, including mandatory compat testing for any change to data shape in the schema.|
|Alerting||Although we found out quite soon that the upgrades were taking too long to complete, the Build APIs were not alerting on the sudden change in return size, which would have gotten us to a mitigation and root cause faster.||Alerting in our APIs when we have sudden shifts in payload size.|
|Safe Deployment Practices||We needed to refactor our deployment logic after progressing through several of our deployment rings, which are designed to limit the blast radius of any change-related issues. Because the updated script didn’t go through the full set of deployment rings, we were unable to identify this issue earlier and limit the scope of impact.||We are updating our Safe Deployment process to account for situations like this. If we need to update our deployment logic and can’t run the change through our full set of deployment rings, we’ll implement a process where the change is carefully rolled out to a subset of accounts and actively monitored for any issues.|
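As one possible shape for the alerting repair item, the sketch below flags a sudden drop in the number of results an API call returns, compared with a rolling baseline. It is a simplified illustration, not the alerting that was actually built: `PayloadSizeMonitor`, its thresholds, and the per-call result counts are all assumptions.

```python
from collections import deque
from statistics import mean


class PayloadSizeMonitor:
    """Flag observations that fall far below a rolling baseline."""

    def __init__(self, window=20, drop_ratio=0.5, min_samples=5):
        self.history = deque(maxlen=window)  # recent "normal" result counts
        self.drop_ratio = drop_ratio         # alert below this fraction of baseline
        self.min_samples = min_samples       # wait for a baseline before alerting

    def observe(self, result_count: int) -> bool:
        """Record one observation; return True if it should raise an alert."""
        alert = False
        if len(self.history) >= self.min_samples:
            baseline = mean(self.history)
            alert = baseline > 0 and result_count < baseline * self.drop_ratio
        if not alert:
            # Only normal observations feed the baseline, so an outage
            # does not quietly become the new "normal".
            self.history.append(result_count)
        return alert


monitor = PayloadSizeMonitor(window=10, drop_ratio=0.5, min_samples=5)
warmup = [monitor.observe(n) for n in [40, 42, 38, 41, 39]]
empty_response_alert = monitor.observe(0)   # definitions suddenly "missing"
recovered_alert = monitor.observe(37)       # back to the usual size
```

Keeping anomalous observations out of the baseline is a deliberate choice in this sketch; a production system would also need to account for legitimate shifts, such as an account that really does delete most of its definitions.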
Again, we want to offer our apologies for the impact that this incident had on our users. We take the reliability and performance of our service very seriously. Please understand that we are fully committed to learning from this event and delivering the improvements needed to avoid future issues.
Group Engineering Manager, VSTS Build