Post Mortem: Networking failures for Visual Studio Team Services on 12 July 2017


On 12 July 2017, we experienced a 19-minute networking incident that affected VSTS customers (you can find the incident blog details here). We sincerely regret the inconvenience this caused our users. We have conducted an internal post mortem with our partners in Azure networking to examine the incident in detail and have identified improvements we need to make to avoid similar outages in the future.

Customer Impact: 

Starting at 06:34 UTC on 12 July 2017, for a period of 19 minutes, VSTS users experienced performance issues and internal server errors while trying to sign in and interact with the service.

The chart below shows the percentage of active VSTS users who experienced errors while trying to sign in or access the service during the incident window.

 

What went wrong: 

A firmware update was being rolled out to route reflectors across the WAN fleet as part of ongoing upgrade and maintenance operations by the Azure networking team. During this upgrade, a human error caused a set of upgrade operations to be performed on multiple route reflectors simultaneously, which removed critical redundancies. The route reflectors recovered as soon as the upgrade completed, after which networking services and access to VSTS were fully restored. More details about the Azure Network incident can be found here.

The incident was detected through our outside-in monitoring tests, as shown in the chart below, and also through a series of Circuit Breaker Exceptions, which are fired when a key dependency of VSTS fails (in this case, networking).
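For readers unfamiliar with the pattern, the following is a minimal, generic sketch of a circuit breaker. It is not the VSTS implementation; the class name, thresholds, and timeout are hypothetical and chosen only for illustration. After a configured number of consecutive failures the breaker opens and rejects calls immediately with an exception, which is the kind of signal that monitoring can alert on.

```python
import time


class CircuitBreakerOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Generic circuit breaker sketch (hypothetical, not the VSTS code)."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open
        self._clock = clock                         # injectable for testing
        self._failures = 0
        self._opened_at = None

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Still open: fail fast instead of calling the dependency.
                raise CircuitBreakerOpenError(
                    "dependency unavailable; failing fast")
            # Timeout elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                # Open (or re-open) the breaker and restart the timer.
                self._opened_at = self._clock()
            raise
        # Success closes the breaker and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Wrapping each call to a dependency in `breaker.call(...)` means that, once the dependency is failing consistently, callers get a fast, distinct error rather than waiting on timeouts.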

Next Steps:

As part of our continuous efforts to improve the overall service, we have identified the following areas of improvement in partnership with the networking team:

  • Introduce additional software checks to prevent multiple router updates from occurring simultaneously.
  • Review the route maintenance sequence to ensure that critical router dependencies in a region are not updated during the same day.
  • While the incident was detected through our automated outside-in alerts, we have identified additional alerting that can be introduced in VSTS to isolate networking-specific alerts for some of our core services.
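To illustrate the first item, here is a minimal sketch of the kind of software check that could refuse to start an upgrade while another device in the same redundancy group is still being upgraded. It is an assumption-laden illustration, not the Azure networking team's tooling: the class `UpgradeGuard`, the grouping by region, and the device names are all hypothetical.

```python
import threading


class UpgradeGuard:
    """Hypothetical guard limiting concurrent upgrades per redundancy group."""

    def __init__(self, redundancy_groups, max_concurrent_per_group=1):
        # Map each device to the redundancy group it belongs to.
        self._group_of = {
            device: group
            for group, devices in redundancy_groups.items()
            for device in devices
        }
        self._limit = max_concurrent_per_group
        self._in_progress = {group: set() for group in redundancy_groups}
        self._lock = threading.Lock()

    def try_begin(self, device):
        """Return True and record the upgrade if it is safe to start."""
        group = self._group_of[device]
        with self._lock:
            active = self._in_progress[group]
            if device in active or len(active) >= self._limit:
                return False  # starting now would reduce redundancy
            active.add(device)
            return True

    def finish(self, device):
        """Mark the upgrade of a device as complete."""
        group = self._group_of[device]
        with self._lock:
            self._in_progress[group].discard(device)


# Usage with made-up device names: rr2 must wait until rr1 is finished.
guard = UpgradeGuard({"region-a": ["rr1", "rr2"], "region-b": ["rr3"]})
started_rr1 = guard.try_begin("rr1")
started_rr2 = guard.try_begin("rr2")
```

The design choice here is to make the check a precondition enforced by the rollout tooling itself, so that an operator mistake results in a refused operation rather than a simultaneous upgrade.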

We extend our apologies for the impact this caused our users. We strive to learn from every incident that our customers experience and will ensure that the repair items and improvement opportunities we identified are completed in a timely manner.

Sincerely,
Harish Thekethil, VSTS SRE Manager

