Postmortem – Availability issues with Visual Studio Team Services on 10 October 2017

On 10 October 2017 we had a global incident with Visual Studio Team Services (VSTS) that had a serious impact on the availability of our service (incident blog here). We know how important VSTS is to our users; we deeply regret the interruption in service, and we are working hard to avoid similar outages in the future.

Customer Impact
This was a global incident that caused performance issues and errors across all instances of VSTS, impacting many different scenarios. The incident occurred within Shared Platform Services (SPS), which contains identity and account information for VSTS.

The incident started on October 10th at 7:16 UTC and ended at 14:10 UTC.

The graph below shows the number of impacted users during the incident:

What Happened
This incident was caused by a change delivered as part of our sprint 124 deployment of SPS. In that deployment, we pulled in version 5.1.4 of the System.IdentityModel.Tokens.Jwt library. This library contained a performance regression that caused CPU utilization on the SPS web roles to spike to 100%.

The root cause of the spike was the use of a compiled regex in the JWT token parsing code. The regex library maintains a fixed-length cache for all compiled regular expressions, and the regexes used by the JWT token parser are marked as culture-sensitive, so a separate cache entry was created for each locale in use by our users. Compiled regexes are fast as long as they are compiled once and reused from the cache, but they are computationally expensive to generate, which is why they are normally cached. Because of the wide variety of locales among the users who came online at about 07:00 UTC, we exceeded the capacity of the regex cache, causing us to thrash on this cache and peg the CPUs with excessive regex compilation. Additionally, compilation is serialized, which led to lock contention.
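The failure mode above can be sketched with a small analogy: a fixed-length LRU cache of compiled regexes whose keys include the locale (because the patterns are culture-sensitive). The sketch below is illustrative Python, not the actual .NET Regex cache; the cache size of 15 is an assumption based on the default value of .NET's Regex.CacheSize.

```python
from collections import OrderedDict

CACHE_SIZE = 15           # assumed: .NET's default Regex.CacheSize is 15
compile_count = 0         # stand-in for the expensive compilation work


def get_compiled_regex(cache, pattern, locale):
    """Return a cached 'compiled' regex, paying the compile cost on a miss."""
    global compile_count
    key = (pattern, locale)           # culture-sensitive: locale is part of the key
    if key in cache:
        cache.move_to_end(key)        # LRU hit: cheap
        return cache[key]
    compile_count += 1                # miss: expensive (re)compilation
    if len(cache) >= CACHE_SIZE:
        cache.popitem(last=False)     # evict the least-recently-used entry
    cache[key] = f"compiled:{pattern}:{locale}"
    return cache[key]


# One locale: the first request compiles, every later request hits the cache.
cache = OrderedDict()
for _ in range(1000):
    get_compiled_regex(cache, "jwt-claim-pattern", "en-US")
single_locale_compiles = compile_count

# Many locales cycling through: more distinct keys than cache slots, so
# every lookup misses and recompiles -- the thrashing behavior.
compile_count = 0
cache = OrderedDict()
locales = [f"locale-{i}" for i in range(30)]   # > CACHE_SIZE distinct locales
for _ in range(100):
    for loc in locales:
        get_compiled_regex(cache, "jwt-claim-pattern", loc)
many_locale_compiles = compile_count

print(single_locale_compiles)   # 1
print(many_locale_compiles)     # 3000: every single lookup recompiles
```

With a cyclic access pattern over more distinct keys than cache slots, an LRU cache misses on every lookup, so the compile cost is paid on every request rather than once per pattern.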

Ideally, we would have rolled back the SPS deployment; however, we had already upgraded our databases, so rollback was not an option. Going forward, we will add a 24-hour pause before upgrading the databases, enabling us to observe the behavior of the service under peak load.

To mitigate the issue, we scaled out the web roles; in retrospect, it is clear we should have attempted this mitigation much earlier. During the incident there was some concern that increasing the web role instance count could cause downstream issues. As part of our postmortem, we agreed that defining standard mitigation steps for common scenarios like this (e.g., high CPU on web roles) will help us address such issues faster.

There was an additional delay in mitigating the issue due to challenges adding the new web role instances. One of the existing web roles was stuck in a “busy” state, which prevented the new instances from coming online. While we were investigating how to resolve this, the problematic web role self-healed, allowing the new capacity to come online at 13:26 UTC.

As the additional capacity became active, the service started to drain the request backlog. This overwhelmed the CPU of two backend databases and triggered concurrency circuit breakers on the web roles, causing a spike of customer-visible errors for approximately one hour.
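A concurrency circuit breaker of the kind described above can be sketched as follows: once the number of in-flight requests to a backend exceeds a threshold, new requests fail fast instead of queuing behind a slow database. This is an illustrative Python sketch, not the actual VSTS implementation; the class name and threshold are assumptions.

```python
class CircuitOpenError(Exception):
    """Raised when the breaker rejects a request to protect the backend."""


class ConcurrencyCircuitBreaker:
    """Hypothetical breaker: reject work once too many requests are in flight."""

    def __init__(self, max_concurrent):
        self.max_concurrent = max_concurrent
        self.in_flight = 0
        self.rejected = 0

    def begin_request(self):
        if self.in_flight >= self.max_concurrent:
            self.rejected += 1            # fail fast: a customer-visible error
            raise CircuitOpenError()
        self.in_flight += 1

    def end_request(self):
        self.in_flight -= 1


breaker = ConcurrencyCircuitBreaker(max_concurrent=3)

# Normal load: each request completes and releases its slot, so none trip.
for _ in range(5):
    breaker.begin_request()
    breaker.end_request()
normal_errors = breaker.rejected

# A backlog drain: many requests arrive before any complete (the backend is
# slow), so the breaker trips once the concurrency limit is reached.
drain_errors = 0
for _ in range(10):
    try:
        breaker.begin_request()
    except CircuitOpenError:
        drain_errors += 1

print(normal_errors)   # 0
print(drain_errors)    # 7: first 3 admitted, remaining 7 rejected
```

The design trade-off is visible here: the breaker converts a slow, queue-everything failure into fast, bounded errors, which protects the backend but produces the spike of visible errors seen during the backlog drain.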

Once the backlog of requests was drained the CPU utilization on the two databases returned to normal levels.

After ensuring the customer impact was fully mitigated, we prepared a code change which eliminated the use of the constructor containing the regex for the JWT tokens. This fix has been deployed to SPS.

Next Steps
In retrospect, the biggest mistake in this incident was not that it occurred, but that it lasted so long and had such wide impact. We should have been able to mitigate it in minutes, but instead it took hours.

  1. We will ensure we can roll back binaries in SPS for the first 24 hours after a sprint deployment by delaying the database servicing changes by 24 hours.
  2. We are updating our standard operating procedures to define prescriptive mitigation steps for common patterns such as high CPU on web roles.
  3. We are following up with Azure to understand what caused the delay in adding additional web role capacity.
  4. We are going to further constrain compute on our internal dogfood instance of SPS so that this class of issue is more likely to surface before we ship to external customers.
  5. We are working on partitioning SPS. We currently have a dogfood instance in production, though the access pattern needed to trigger this issue was not present there (insufficient number of locales). We have engineers dedicated to implementing a partitioned SPS service, which will allow for an incremental, ring-based deployment that limits the impact of issues.
  6. We have updated the AAD library to fix the JWT token regex parsing code.
  7. We are fixing every other place in our code where compiled regexes are used incorrectly, especially compiled expressions that are not using the CultureInvariant flag.

We again apologize for the disruption this caused you. We are fully committed to improving our service so that we can limit the damage from an issue like this and mitigate it more quickly if it occurs.

Ed Glas
Group Engineering Manager, VSTS Platform
