We had a serious outage in early November. It should not have happened, and we are very sorry to everyone who was affected. More and more users rely on Labs, and we are very conscious that we need to make sure events like this do not happen. It was a first for us, and now that we have a bit of perspective, we want to share what happened.
The problem happened over 3 days starting on November 6th, with a few key moments along the way.
11:00 PST - The issue first appears as a spike in transient 404 errors in our synthetic monitors for one super-region (a set of regions served by a shared set of DevTest Labs stamps). We continually test our own labs with synthetic data to detect anomalies. We were already investigating a slowdown when we detected the spike, so our first line of inquiry is to establish or disprove a link between the two issues. At this stage, this is nothing out of the ordinary for a service running at scale.
11:15 - The first customer reports come in, from both internal and external customers (Labs is used extensively by teams across Microsoft). We quickly mobilize and find that the issue lies between our service and the Azure front door. In Azure, every request goes to the Azure Resource Manager (ARM) front door, which dispatches it to the appropriate Resource Provider to execute what is requested; in this case, the DevTest Labs provider.
12:00 - By now we believe that the front door is reacting to the intermittent 404s we had detected, and we engage with the front door team. Their engineers join the bridge that was set up to address the outage. They confirm that they see the problem in their logs and that the front door reacted to the errors by deprovisioning the faulty Labs from the front door, a mechanism that protects the integrity of Azure as a whole by preventing a faulty service from consuming too many resources.
13:00 - We now have a few threads running in parallel. We answer every new ticket that comes in, we create a job to identify all affected subscriptions, we prepare a resynch job with the Azure front door, and we investigate the source of the intermittent 404s, which have completely disappeared in the meantime! Since we firmly believe that “bugs that go away for no reason, come back for no reason”, we keep digging.
17:00 - Over the afternoon we find the code path that caused the 404s and close it off with a “hotfix”, then start working on a more permanent correction. We have also identified the affected subscriptions and can now start running the resynch job with the front door.
20:00 - During the day we realized that the deprovisioning had removed some Labs permissions, so we engage with the RBAC (Role-Based Access Control) team to confirm. We believe (hope) that the resynch might also correct the permissions, and we are reluctant to run another job simultaneously, so we decide to wait for the resynch job to finish.
02:00 (November 7th) - We issue the formal patch to the service for the initial problem.
03:00 - The resynch jobs end, having taken longer than expected. We confirm that the RBAC problem is not solved and that we will need to restore permissions to validate the resynch. The RBAC team creates a tool to do so and starts applying it.
06:00 - The permissions have been restored and customers can finally access their labs, or so we thought. We now realize that the resynch job has actually desynched additional subscriptions rather than correcting the ones we had identified. We diagnose the issue with the resynch job and realize that, because of a previously unknown interaction between our service and the front door, we need to resynch Resource Group by Resource Group instead of the whole subscription at once.
07:00 - We start identifying all the affected subscriptions and run the corrected resynch manually on each resource group, validating as we go. This takes longer, but by now we do not want to take any chances with the resynch job.
22:00 - The resynchs are all completed and validated, and we can now use the RBAC tool to recreate all permissions.
06:00 (November 8th) - The service is back to normal for all users!
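As a rough illustration of the corrected approach from the 07:00 step (resynching one resource group at a time and validating each before moving on), here is a minimal Python sketch. The names `resynch_by_resource_group`, `resynch_group`, and `validate_group` are hypothetical and are not part of any Azure API; the real resynch and validation operations are simply passed in as callables, so this describes the shape of the loop rather than the actual tooling.

```python
from typing import Callable, Dict, Iterable, List


def resynch_by_resource_group(
    resource_groups: Iterable[str],
    resynch_group: Callable[[str], None],   # hypothetical: resynchs one resource group
    validate_group: Callable[[str], bool],  # hypothetical: True if the group looks healthy
) -> Dict[str, List[str]]:
    """Resynch resource groups one at a time, validating each before moving on."""
    result: Dict[str, List[str]] = {"ok": [], "failed": []}
    for group in resource_groups:
        try:
            resynch_group(group)
            healthy = validate_group(group)
        except Exception:
            # Treat any error as a failure for this group only, then keep going,
            # so one bad group does not block or affect the others.
            healthy = False
        result["ok" if healthy else "failed"].append(group)
    return result
```

Working at resource group granularity is slower than resynching a whole subscription at once, but a failure stays confined to a single group and is recorded for follow-up instead of silently spreading.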
The event was triggered by transient 404 errors returned from the DevTest Labs Resource Provider (DTL RP) and compounded by an interaction we had not previously understood between our Resource Provider returning 404s, the ARM front door cache eviction logic, and the persistence behavior of RBAC role assignments.
We are very grateful for the help from all the teams involved and for the work our engineers put in until everything was resolved. However, we are conscious that we need to do better to prevent similar situations, and we are continuing to add more logging, better error handling, and more resiliency mechanisms to the service.
The Azure DevTest Labs team