On 6 November 2017 we experienced a series of incidents in close succession that caused significant impact to our customers (service blog here). The combined impact of these incidents resulted in an outage nearly 3 hours long. The greatest impact was to users in the North Central US region, while users in other regions experienced intermittent connectivity issues and errors.
We know how important VSTS is to our users and deeply regret the interruption in service. Over the last several days we’ve worked to understand the incident’s full timeline, what caused it, how we responded to it, and what improvements we need to make to avoid similar outages in the future.
The timeline for these incidents is as follows:
1. Starting at 18:45 UTC until 19:15 UTC on 6 November 2017, a deployment issue caused a VSTS Configuration database to become unavailable. As a result, approximately 2,000 users across West Europe, Central US, South Central US, and North Central US experienced intermittent outages with different features.
2. For 15 minutes, from 19:15 until 19:30 UTC, VSTS Shared Platform Services (SPS), our central service for authentication and authorization, experienced a temporary outage that caused failures and significant slowdowns for approximately 5,000 users. Because this is a shared service, users across all regions were impacted. This issue was triggered by a network failure in North Central US.
3. The Azure network outage impacted most of the North Central US datacenter and lasted until 21:07 UTC. During this time, VSTS accounts hosted in the North Central US datacenter were offline. We don't have an accurate count of users impacted because our application telemetry logging was unavailable due to the network outage. During this period, access attempts to any VSTS account hosted in North Central US returned an error page.
4. At 20:35 UTC, during the tail end of the network issue, SPS experienced a complete network outage for 10 minutes. Since SPS is a central service, there was global impact across VSTS.
5. Network health was fully restored at 21:07 UTC. At this point, customers could access North Central US accounts. As users started reconnecting, the service saw a significant spike in slow commands while it processed the backlog of requests. This lasted until 21:25 UTC.
Here's a graphical view of the number of impacted users over the lifetime of the incidents:
What went wrong:
Monday, 6 November was a challenging day from a live-site perspective. We experienced multiple distinct issues within a 3-hour window. In addition, there was a widespread internet outage that we were aware of, though it did not directly affect our services. Our alerting was effective in detecting each discrete system issue. However, on our incident bridge we would have been more effective with an aggregated, higher-level view of our system and customer health, particularly during overlapping problems like these. This contributed to delays in providing status updates to our customers and details on exactly what impact customers were seeing.
VSTS Deployment Issue:
We deliver major updates to the service every 3 weeks as part of a Sprint deployment. The Sprint 125 payload contained new versions of several extensions that enable VSTS features. As the new versions were initializing, significant contention occurred on a local extension cache, resulting in data-tier resource issues that caused broad customer impact.
Network Outage in North Central US datacenter:
The VSTS instance in the North Central US region was completely offline due to the Azure networking issue that severed all connectivity to the deployment in that region. Issues were introduced as part of planned maintenance in the region which impacted network routing and connectivity. You can read more about this in the Azure postmortem here (see: ‘RCA – Network Infrastructure – North Central US’ entry dated 11/6).
During the network issue there were also two periods where SPS experienced slowdown and failures:
- At the time of network issues in the North Central US region, our SPS service started to experience slow and failed calls into the AAD cluster in North Central US. After SPS traffic was redirected to a different AAD datacenter, the performance issues and errors were resolved. This was denoted as number 2 in the image above.
- Towards the tail end of the North Central US region’s networking issue, SPS experienced a complete loss of network connectivity for 10 minutes. During this time, SPS was fully offline. Once the network connectivity was restored, SPS came back online and started operating as expected. This was denoted as number 4 in the image above.
Next steps:
- Extension Updates - The design for deploying new extension versions in VSTS has been updated to avoid contention from multiple concurrent requests. The first thread requesting the new version of an extension performs the cache refresh; all other threads continue to use the previous version until the cache is fully updated. Since VSTS supports backwards compatibility for extensions, this flow is transparent to active users. This new design avoids the type of resource contention experienced during the incident.
- Multi-instance SPS - As we’ve mentioned in previous postmortems, SPS is a singleton service, so issues in it can have global impact on VSTS users. We are working on partitioning SPS into smaller, discrete deployments spread across Azure regions, with a subset of hosted accounts in each instance. When this work is complete, it will reduce the blast radius of customer impact when a regional outage such as this one occurs. We have engineers dedicated to implementing the partitioning of the SPS service; however, it is a complex effort that requires major architectural changes and will take a year or more to complete.
- Incident Management Improvements - The VSTS service status page has a dependency on Azure Storage, which impaired our ability to communicate effectively during the outage. We are working on a design that lets us update the page directly, reducing the number of dependencies that can fail. Additionally, we’re creating global views that layer the alert state of services over a logical deployment diagram, which will help us understand complex issues such as this one much more quickly in the future.
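The single-refresher design described in the extension updates item above can be sketched as follows. This is a minimal, hypothetical illustration (the class and loader callback are inventions for this sketch, not the actual VSTS implementation): a non-blocking lock ensures exactly one thread rebuilds the cache while every other thread keeps serving the previous, still-compatible version.

```python
import threading

class ExtensionCache:
    """Hypothetical sketch of a single-refresher extension cache.

    The first thread to observe a newer version wins a non-blocking lock
    and rebuilds the cache; all other threads fall through and keep
    serving the previous version until the refresh completes. Extension
    backwards compatibility makes serving the stale version safe.
    """

    def __init__(self, loader, initial_version, initial_payload):
        self._loader = loader              # callable: version -> payload
        self._version = initial_version
        self._payload = initial_payload
        self._refresh_lock = threading.Lock()

    def get(self, latest_version):
        if latest_version != self._version:
            # Non-blocking acquire: one thread refreshes, the rest do not wait.
            if self._refresh_lock.acquire(blocking=False):
                try:
                    payload = self._loader(latest_version)
                    self._payload, self._version = payload, latest_version
                finally:
                    self._refresh_lock.release()
        return self._version, self._payload
```

The key property is that no thread ever blocks on the refresh: losers of the lock race serve the old version immediately, so a slow cache rebuild cannot pile up concurrent requests the way the Sprint 125 deployment did.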
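The multi-instance SPS item above amounts to routing each hosted account to one of several smaller deployments. A common way to do this, shown here purely as an assumed illustration (the function name and hashing choice are not from the source), is a stable hash of the account identifier, so that an outage in one instance affects only the accounts mapped to it:

```python
import hashlib

def sps_partition(account_id: str, num_partitions: int) -> int:
    """Hypothetical sketch: map a hosted account to one SPS instance.

    A stable cryptographic hash of the account ID picks the partition,
    so the assignment is deterministic across routers and restarts, and
    a regional outage only impacts the subset of accounts in that
    partition rather than every VSTS user.
    """
    digest = hashlib.sha256(account_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

With, say, four partitions, losing one instance would cap the blast radius at roughly a quarter of accounts instead of the global impact a singleton SPS produces today.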
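The global health views mentioned in the incident management item above can be thought of as a rollup of per-region alert states into a single status per service. The sketch below is an assumption about how such an aggregation might work (the state names and functions are invented for illustration): the worst state across regions wins, which is what makes overlapping regional problems visible at a glance.

```python
# Hypothetical severity ordering for per-service alert states.
SEVERITY = {"healthy": 0, "degraded": 1, "unhealthy": 2}

def rollup(states):
    """Return the worst health state among a list of states."""
    return max(states, key=lambda s: SEVERITY[s])

def service_health(regions_by_service):
    """Aggregate per-region states into one top-level status per service.

    Input shape: {service: {region: state}}. A worst-of rollup surfaces
    any regional outage at the service level, giving the aggregated,
    higher-level view an incident bridge needs during overlapping issues.
    """
    return {
        service: rollup(list(by_region.values()))
        for service, by_region in regions_by_service.items()
    }
```

During an incident like this one, such a view would have shown SPS as unhealthy globally while other services were only degraded in North Central US, making the dependency chain easier to reason about.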
This was a significant outage that caused extended impact for many users. The series of issues was complex, evolved over time, and highlighted areas where we need to improve our resiliency design, communication processes, and monitoring. Please accept our apologies, and know that we are committed to learning from outages like this and improving over time.
Tom Moore, VSTS SRE Group Manager