On 10 July 2017, we experienced a major service incident that lasted just over 4 hours and affected many VSTS customers using the service at that time (incident blog here). We know how important VSTS is to our users and deeply regret the interruption in service. Over the last several days, we’ve worked to understand the incident’s full timeline, what caused it, how we responded to it, and what improvements we need to make to avoid similar outages in the future.
Starting at 16:45 UTC on 10 July 2017, VSTS users experienced intermittent performance issues and internal server errors when interacting with the service. The incident lasted 4 hours and 3 minutes. During this time, some users were unable to sign in to the service, and users who could still access the site may have received HTTP 500 errors when performing various interactive actions.
The chart below shows the percentage of active VSTS users who were impacted by performance issues and/or errors while interacting with the site during the incident. Users who had valid cached tokens and who were not interacting with identities or user profiles were not necessarily impacted:
What went wrong:
VSTS relies on Azure Active Directory (AAD) to authenticate users. At 16:10 UTC, an internal Microsoft business application started driving too much traffic to AAD, which caused AAD to throttle the shared Microsoft AAD tenant used internally to authenticate users for many different cloud services, including VSTS. Initially, this throttling impacted only Microsoft internal VSTS users, as well as the ability of VSTS to renew cached tokens. VSTS caches user tokens to avoid a round trip to AAD on every user request. As these cached tokens expired and could not be renewed, the rate of calls from VSTS into AAD increased dramatically, largely driven by high-volume internal engineering systems. VSTS then exceeded AAD rate limits, triggering a second round of AAD throttling, which expanded the scope of impact to external VSTS users.
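The caching behavior described above can be sketched as follows. This is a minimal, hypothetical illustration (not the actual VSTS implementation): tokens are served from the cache until they expire, so steady-state traffic to the identity provider stays low, but once renewals fail or a large population of tokens expires at once, every request becomes a round trip and call volume spikes.

```python
import time


class TokenCache:
    """Minimal sketch of a per-user token cache (hypothetical names).

    Cached tokens are returned until they expire; only a miss or an
    expired entry triggers a call to the identity provider.
    """

    def __init__(self, ttl_seconds, fetch_token):
        self.ttl = ttl_seconds
        self.fetch_token = fetch_token  # call-out to the provider (assumed)
        self.cache = {}                 # user -> (token, expiry time)
        self.fetch_count = 0            # round trips to the provider

    def get(self, user, now=None):
        now = time.time() if now is None else now
        entry = self.cache.get(user)
        if entry and entry[1] > now:
            return entry[0]  # cache hit: no provider round trip
        # Miss or expired: every request for this user now reaches the
        # provider -- this is the call-volume spike once renewals fail
        # en masse.
        self.fetch_count += 1
        token = self.fetch_token(user)
        self.cache[user] = (token, now + self.ttl)
        return token
```

In this sketch, if many entries share a similar expiry and renewal starts failing, `fetch_count` grows with every request instead of staying flat, which mirrors the spike in VSTS-to-AAD calls described above.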
As stated above, there were two distinct phases of service impact:

1. Impact was limited to Microsoft internal users due to throttling on the internal Microsoft AAD tenant.
2. Impact expanded to external VSTS users when VSTS exceeded AAD throttling limits.
The first chart below shows the successes (green) and failures (red) of our "outside in" tests, along with the time it takes to access the service. This represents the experience of users without valid cached tokens who perform a new login. The second chart shows the number of requests that VSTS sent to AAD for identity-related purposes. The typical volume is around 50K, and during the incident we quickly spiked to roughly 15 times that level. Finally, the timeline below shows how the initial impact expanded after the AAD IP-layer throttling started, and the key incident response milestones:
A couple of challenges delayed our efforts to mitigate the outage. At the start of the incident, the focus was on the Microsoft AAD tenant throttling, which drew attention away from other failures in VSTS. After resolving the initial throttling issue, there were delays in fully understanding why service health was not restored. Using telemetry in both VSTS and AAD, we eventually determined that call volumes from VSTS were exceeding AAD IP throttle limits. To mitigate the issue, we decided to increase the AAD throttling quota for VSTS; however, the change took almost 90 minutes to deploy. Much of that delay was the time needed to configure the update and to roll it out gradually across the AAD rings used to introduce changes safely.
In retrospect, there are several things that we could have done to mitigate the incident faster. Below we list the key repair items that we’ll be implementing to improve our telemetry and alerting. Once these are in place, they will help us to understand the scope and source of issues like this much faster. Additionally, the proper approach for addressing AAD IP throttling issues is to reduce traffic to levels below the AAD IP throttle limits rather than asking AAD to increase the quota. We are looking at several flow control mechanisms to ensure our call volumes stay under the AAD throttle limits.
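One common flow-control mechanism of the kind mentioned above is a client-side token bucket, which caps outbound call volume at a known limit instead of relying on the dependency to raise its quota. The sketch below is hypothetical (the post does not say which mechanism VSTS chose): calls are only sent while tokens remain in the bucket, and the bucket refills at the configured rate.

```python
class TokenBucket:
    """Sketch of a client-side flow-control gate (hypothetical).

    Keeps outbound call volume to a dependency under a known throttle
    limit: `rate_per_sec` is the sustained limit, `burst` the maximum
    short-term burst.
    """

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.last = 0.0  # timestamp of the last refill

    def try_acquire(self, now):
        # Refill proportionally to elapsed time, capped at burst size.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True   # under the limit: send the call
        return False      # over the limit: shed, queue, or retry later
```

Calls rejected by `try_acquire` can be queued or failed fast locally, so the dependency never sees traffic above its throttle limit.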
The table below categorizes the issues and gaps being addressed, with details on the solutions we are committed to delivering:
|Area|Issue|Solution|
|---|---|---|
|Resiliency|VSTS needs to respond appropriately to AAD throttling responses.|Tenant-scoped circuit breaker for AAD’s throttling responses, including tenant-scoped backpressure from SPS to other VSTS services.|
|Resiliency|Service-to-service call volume increases weren't detected and throttled.|Improved detection of call volume increases in service-to-service traffic and responsive backpressure of that traffic.|
|Telemetry|Gaps in the telemetry needed to understand the rates of calls to different AAD APIs increased the time it took to isolate the specific failure modes for both issues detailed above (i.e. throttling on the Microsoft AAD tenant and IP-layer throttling of VSTS calls to AAD).|We are developing reports that will show when significant throttling events prevent VSTS users from successfully logging in at login.microsoftonline.com. In addition, VSTS will add new telemetry to track call rates between VSTS and all AAD APIs.|
|Alerting|Limited visibility into sustained levels of throttling from AAD.|New alerts based on usage and error telemetry.|
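The tenant-scoped circuit breaker and backpressure named in the resiliency items above could be sketched as follows. This is a hypothetical illustration, not the committed VSTS design: after enough consecutive throttle signals (e.g. HTTP 429) for one tenant, the breaker opens and callers back off locally for that tenant only, instead of piling more traffic onto the already-throttled dependency.

```python
class ThrottleCircuitBreaker:
    """Sketch of a per-tenant circuit breaker (hypothetical names).

    Scoping the breaker to a tenant means one throttled tenant does not
    block traffic for every other tenant.
    """

    def __init__(self, threshold, cooldown_seconds):
        self.threshold = threshold    # consecutive throttles to trip
        self.cooldown = cooldown_seconds
        self.failures = {}            # tenant -> consecutive throttle count
        self.open_until = {}          # tenant -> time the circuit closes

    def allow(self, tenant, now):
        # Calls are allowed unless this tenant's circuit is open.
        return now >= self.open_until.get(tenant, 0.0)

    def record(self, tenant, throttled, now):
        if not throttled:
            self.failures[tenant] = 0  # success resets the count
            return
        self.failures[tenant] = self.failures.get(tenant, 0) + 1
        if self.failures[tenant] >= self.threshold:
            # Open the circuit for this tenant only; other tenants'
            # traffic is unaffected (tenant-scoped backpressure).
            self.open_until[tenant] = now + self.cooldown
```

While a tenant's circuit is open, upstream services can reject or queue that tenant's requests cheaply, which is the backpressure from SPS to other VSTS services described in the table.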
Again, we want to offer our apologies for the impact this incident had on our users. We take the reliability and performance of our service very seriously, and we are fully committed to learning from this event and delivering the improvements needed to avoid similar issues in the future.
Tom Moore, VSTS SRE Group Manager