Postmortem: Global VSTS CI/CD outage due to service bus failure – 13 April 2018

Customer Impact: On 13 April 2018, we had an incident which impacted CI/CD workflows in all data centers. This was caused by a global Service Bus instance, which we use to orchestrate CI/CD workflows, becoming unavailable due to authentication errors. Users reported that their CI/CD pipelines were stuck at various stages including releases which…


Release Management performance degradation in West Europe – 04/18 – Mitigated

Final Update: Wednesday, April 18th 2018 13:52 UTC We’ve confirmed that all systems are back to normal as of 12:30 UTC. While the issue has self-healed, during the incident we were able to collect key diagnostic information from our web front-ends that the team is actively reviewing in order to understand the root cause of…


VSTS AAD linked accounts experiencing 403/500 errors when they are deleted and recreated – 04/17 – Advisory

Between 04/12/2018 01:12 UTC and 04/18/2018 22:40 UTC, some customers may have experienced authentication issues if they had Visual Studio Team Services accounts that were linked to an Azure Active Directory tenant, after a user was deleted from that tenant and created again. Our DevOps team has released a fix, which addressed this problem for affected…


CDN outage in Azure impacting multiple VSTS features across Western Europe – Mitigated

Final Update: Tuesday, April 17th 2018 21:29 UTC We have confirmed that the CDN issue has been mitigated. We have verified with customers that functionality has been restored for the scenarios which were failing. Sincerely, Tom Initial Update: Tuesday, April 17th 2018 21:14 UTC A CDN outage in Western Europe is impacting multiple features across VSTS Next…


Performance Degradation in South Central US – 04/17 – Mitigated

Final Update: Tuesday, April 17th 2018 21:43 UTC We’ve confirmed that all systems are back to normal as of April 17th 2018 20:35 UTC. We apologize for any inconvenience this may have caused. The Engineering team is still investigating the definitive root cause in order to prevent a potential recurrence. Sincerely, Daniel Update: Tuesday, April 17th 2018…


Mitigated issues with Release Management Feature in West Europe – 04/16 – Mitigated

Final Update: Monday, April 16th 2018 11:46 UTC We’ve confirmed that all systems are back to normal as of 10:55 UTC Monday, April 16th. Our logs show the incident started at 07:55 UTC Monday, April 16th and that during the 3 hours that it took to resolve the issue customers might have experienced slow Release…


Performance Degradation in South Central US – 04/13 – Mitigated

We had performance issues in South Central US due to a SQL database upgrade on two databases. Users might have noticed intermittent connection issues or 500 errors during the impact time. We worked with the SQL Azure team and fixed the database upgrade issues. We lost the database quorum during the upgrade and that caused all the write…


Performance degradation in multiple regions – 04/13 – Mitigated

We had a performance impact across major VSTS features and across multiple regions. The issue is now mitigated, and in our initial investigation we saw a spike in Identity calls which impacted multiple services. We are actively investigating the root cause. Incident Timeline: 10 minutes – 2018/04/13 17:25 UTC through 2018/04/13 17:35 UTC. Sincerely, Bapayya


Performance Degradation in West Europe – 04/11 – Mitigated

Final Update: Wednesday, April 11th 2018 13:32 UTC We’ve confirmed that all systems are back to normal as of 12:50 UTC April 11th 2018. Our logs show the incident started at 12:15 UTC April 11th 2018 and that during the 35 minutes that it took to resolve the issue some customers may have experienced slow…


Performance Degradation in South Central US – 04/09 – Mitigated

Final Update: Monday, April 9th 2018 19:36 UTC The service is back to a healthy state. Instead of scaling out, we ended up blocking a single user, which addressed the high CPU and slow commands. The user was a Microsoft internal service account and we’re working with the account owner to determine how to unblock…