Outage on Monday Nov-24

Just Friday, the week before, I was speaking to Buck about the incident we had the previous week and expressing a strong desire to avoid another incident anytime soon.  We agreed to postpone the VS Online deployment so that we didn’t have to rush it in a short holiday week and we could ensure that everyone we needed was on hand in case something went wrong.

Then I woke up Monday morning to see that we were in the middle of an incident.  It started at 7:06 UTC (11:06 Sunday night, PST).  By the time I got online at about 11:30 UTC, the team had been engaged for hours and was working diligently on it.

Ultimately, the issue was fully resolved by 14:41 UTC (with partial mitigations in the preceding hour or two).

I’m sorry for the pain it caused – particularly in Europe and regions in a similar time zone, as it affected them most of the day on Monday.  Every single incident that we have is carefully scrutinized so that we can learn from it and make the product better and more reliable.

What happened?

The primary issue was caused by an update made to SQL Azure.  The change was described as follows:

Included in the SQL Azure November update was an enhancement to SQL Server telemetry to track SQL exception stats at a per-DB granularity.  This enhancement enables SQL Azure to perform automated repair actions on a per database basis.  This deployment completed in SQL Azure’s first GeoRegion before an issue was detected by the VS Online CRI.  The core issue is lock contention on a per-DB hash table tracking exception stats.  Only customers with very high exception stat rates on a single DB were impacted.  Deployment to the second GeoRegion was halted and rolled back while the issue was investigated.

As best they can tell, we were the only database in all of SQL Azure that was affected; certainly the only one affected significantly.

VS Online has a stored procedure (prc_ReadGroupMembership) that uses IGNORE_DUP_KEY to suppress duplicate record errors.  Based on telemetry, this stored procedure was generating 170,000 exceptions per minute.  The change to SQL Server telemetry did not account for a high rate of “errors” that were intended to be ignored, and trying to keep track of them all created lock contention that consumed all worker threads, causing SQL processing to effectively stop on a key VS Online database.
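To see why suppressed duplicates still count as exceptions internally, here’s a minimal sketch of the IGNORE_DUP_KEY pattern.  The table and index names are hypothetical, not the actual VS Online schema:

```sql
-- Hypothetical sketch: a unique index created WITH (IGNORE_DUP_KEY = ON)
-- silently discards duplicate rows instead of failing the INSERT.
CREATE TABLE tbl_GroupMembership (
    GroupId  INT NOT NULL,
    MemberId INT NOT NULL
);

CREATE UNIQUE INDEX IX_GroupMembership
    ON tbl_GroupMembership (GroupId, MemberId)
    WITH (IGNORE_DUP_KEY = ON);

INSERT INTO tbl_GroupMembership (GroupId, MemberId) VALUES (1, 100);

-- This second insert succeeds with only the informational message
-- "Duplicate key was ignored." -- but internally SQL Server still raises
-- and suppresses an exception, which is what the new telemetry counted.
INSERT INTO tbl_GroupMembership (GroupId, MemberId) VALUES (1, 100);
```

At 170,000 such suppressed exceptions per minute, a per-DB hash table guarded by a lock becomes a contention hot spot very quickly.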

Unfortunately, it took longer to identify and repair than it should have.  The good news is, according to standard practice, the new behavior was coded behind a feature flag.  So, once we identified the cause, it was easy to mitigate by disabling the feature flag on our database.

Even more unfortunately, once we cleared that issue at 13:44 UTC (5:44 PST), we immediately hit a second issue (which took the remaining hour of the incident to resolve).  The second issue was lock contention on a particular table in our database.  We have no reason to believe that any SQL Azure change caused it.  It turned out to be an unwise choice of a MERGE statement.  The scenario was updating a table to include all of the rows from a second table that didn’t already exist in the first table.  The key is that it was inserts only, no row updates.  Given that, it turns out to be WAY more efficient to use an INSERT statement like the one below rather than a MERGE statement.  We believe that the fail-over we attempted to mitigate the first issue produced a worse than usual query plan for this MERGE statement, causing it to cascade into a serious lock contention problem.  Regardless, that problem is now permanently fixed.  We’ve also updated the SQL online documentation for MERGE to encourage people to use INSERT in this scenario rather than MERGE.

INSERT tbl_A (col, col2)
SELECT col, col2
FROM tbl_B
WHERE NOT EXISTS (SELECT col FROM tbl_A A2 WHERE A2.col = tbl_B.col);
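For contrast, the MERGE form we moved away from would have looked roughly like this (a sketch using the same illustrative table names; the real procedure was more involved):

```sql
-- Insert-only upsert expressed as a MERGE.  Functionally equivalent to
-- the INSERT ... WHERE NOT EXISTS above, but the optimizer can choose a
-- much more expensive plan for it, especially after a fail-over when
-- cached plans and statistics are cold.
MERGE tbl_A AS target
USING tbl_B AS source
    ON target.col = source.col
WHEN NOT MATCHED BY TARGET THEN
    INSERT (col, col2) VALUES (source.col, source.col2);
```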

You’ll also notice at 6:11, in the table below, that the stored procedure causing the second issue (prc_ProvisionAttributesUpsert) was manually updated to be a no-op.  This disabled VS roaming settings but allowed all other functions on the service to come back to life.  This was, in essence, a manual circuit breaker.
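The manual circuit breaker amounted to something like the following sketch (illustrative only; the actual procedure body and parameter list are not shown here, and in practice the parameter list must be preserved so existing callers still bind):

```sql
-- Manual circuit breaker: make the stored procedure return immediately
-- so callers fail fast instead of piling up on tempdb page latches.
ALTER PROCEDURE prc_ProvisionAttributesUpsert
    -- (original parameter list preserved, omitted in this sketch)
AS
BEGIN
    RETURN 0;  -- no-op: skip the upsert entirely
END
```

Callers (the IDE roaming settings feature) swallow the resulting failures, so the rest of the service keeps functioning while that one feature is dark.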

Timeline

In the interest of full disclosure, here’s the raw incident log recording events and actions taken.

Time (PST)         Description

11/23/2014 11:06   First alerts for high queued requests and high active worker counts in the SPS DB.

11/23/2014 11:48   LSI (Live Site Incident) bridge spun up for investigation.

11/24/2014 12:15   First outside-in monitoring failure.

11/24/2014 12:18   First customer report of the VS Online outage via Twitter, reporting inability to access accounts.

11/24/2014 01:10   Identified unknown wait stats via sys.dm_exec_requests causing waits on prc_ReadGroupMembership.  Unknown wait stat: SE_REPL_EXCEPTION_STATS.

11/24/2014 01:15   Attempted to kill all active SQL workers via SSMS.  We were unable to obtain a connection to the server and run scripts to kill all active sessions; SQL Azure doesn’t support admin connections either.

11/24/2014 01:23   Attempted to kill all active SQL workers to free up resources on SQL.  Performed an ALTER DB change via the Management Portal that killed all workers, but we saw an immediate spike of workers blocked on prc_ReadGroupMembership and exhausted the max of 1600 workers within a few minutes.

11/24/2014 01:40   Engaged WASE (Windows Azure Service Engineering) for analysis of SE_REPL_EXCEPTION_STATS.

11/24/2014 02:13   Identified that only one SPS partition database was affected by the unknown SE_REPL_EXCEPTION_STATS wait stat.  The majority of accounts in this partition DB are in SU1.  Updated the connection string for this partition database in the SPS Config DB to an invalid one, to short-circuit requests to the partition database in an effort to bring back the other SUs.

11/24/2014 02:40   A SQL Azure performance engineer identified the issue as a locking problem; however, the wait stat was unknown and the root cause and workaround were unclear.

11/24/2014 02:43   Requests queued on the TFS ATs remained high, as the change to the SPS partition database connection string did not have the desired effect.  TFS request retries kicked in, still causing slow requests and a convoy of pending requests.

11/24/2014 03:30   The SQL Azure team identified a recent change to the cluster of the affected partition database that may have introduced the locking issue.  The cause of the wait stat was still unknown at the time; the recommendation was to first attempt a fail-over to the secondary and, if that didn’t work, to restart the primary.

11/24/2014 03:36   Flip to the secondary completed, and the connection string was restored to the correct one to resume traffic to the partition DB.

11/24/2014 03:50   Started seeing SQL workers blocked on the unknown wait stat SE_REPL_EXCEPTION_STATS again.  Requests queued on the ATs were still not draining, and the decision was made to shut down all TFS SU1 ATs.  Azure reached out to the appropriate product groups to understand this wait stat.

11/24/2014 04:05   All TFS SU1 ATs stopped.  Accounts on SUs other than SU1 returned to normal and became accessible.

11/24/2014 05:04   The SQL Azure team identified that this wait stat was committed on 11/04 and deployed in the preceding 24 hours.  The feature logs errors generated by executing queries and takes a mutex lock, which caused worker threads to wait and block.  We identified that the feature could be disabled via a (non-standard) feature flag.

11/24/2014 05:44   Still waiting to deploy the feature flag change via SQL Azure Deployment Engineering (SADE).  While SADE was trying to figure out how to update and deploy the non-standard feature flag, the decision was made to switch the flag off directly on the server in a 1-second loop.

11/24/2014 05:48   All TFS SU1 ATs brought back online.  Immediately saw the number of running SQL workers spike and exhaust the 1600 max allowed workers within a few minutes in the SPS Configuration DB.  We identified the prc_ProvisionAttributesUpsert stored procedure as blocked on page latches in the temp database.  This caused requests to be queued on SPS and impacted all TFS SU instances.

11/24/2014 06:00   All TFS SU1 ATs were stopped again to allow accounts on other SUs to continue to function.

11/24/2014 06:11   The prc_ProvisionAttributesUpsert stored procedure was updated to return immediately, effectively short-circuiting it.  This stored procedure is used largely by IDE roaming settings.  Exceptions generated by this change are swallowed by the IDE, effectively shutting down IDE roaming settings.

11/24/2014 06:20   Started bringing the TFS SU1 ATs back online one by one.

11/24/2014 06:31   Also identified a missing index used by prc_ProvisionAttributesUpsert that was adding additional pressure.  An additional index was created to improve the stored procedure’s performance.

11/24/2014 06:41   All TFS SUs’ ATs back online and service performance back to normal.  Status set to green.

 

Learnings

There were a number of things that should have gone better here.

VS Online needs to be more resilient to failures – I’ve mentioned this one in previous outages.  This is an area we’ve been working on but the work is not complete.

  • We still have one global service (SPS – Shared Platform Services) that needs to be decomposed into smaller services and redundancy provided.  All other services are already multi-instanced.
  • We’re partway through implementing a circuit breaker pattern through our codebase but still have a fair amount of work to do to complete it.
  • We need to improve mitigation controls so that, during a live site incident, we can more effectively shut down fractions of traffic while we fix underlying problems.  Being down 30% is better than being down 100%.
  • We need to do a better job tracking metrics (like SQL I/Os) over time and looking for bad query patterns like table scans.  These are things that are often transparent until the data size gets too large and then cause sudden cascading failures.
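On that last point, a sketch of the kind of check we mean, using the standard SQL Server DMV sys.dm_db_index_usage_stats (the thresholds and the idea of comparing scans to seeks are illustrative, not our actual monitoring query):

```sql
-- Flag tables whose heap (index_id = 0) or clustered index (index_id = 1)
-- is mostly scanned rather than seeked -- a rough signal for the kind of
-- table-scan query patterns that stay invisible until data grows large.
SELECT  OBJECT_NAME(s.object_id) AS table_name,
        s.user_scans,
        s.user_seeks
FROM    sys.dm_db_index_usage_stats AS s
WHERE   s.database_id = DB_ID()
  AND   s.index_id IN (0, 1)
  AND   s.user_scans > s.user_seeks
ORDER BY s.user_scans DESC;
```

Run periodically and trended over time, a query like this surfaces tables that are drifting toward the sudden cascading failures described above.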

Handle new code changes better – There are numerous ways we could have rolled this change out better.

  • Clearly the engineering team missed a use case where there are valid scenarios for very high exception rates.  But, I don’t think this is the most important issue – there’s no way to stop all bugs from getting through.
  • We need to roll out new changes in a more controlled fashion.  I’ve been suggesting for a couple of years now that we have an Azure wide canary instance.  Today each service has their own canary and, as a result, there’s no single place that all code changes across Azure can be tested against all services in a controlled fashion.  I’ve reopened the conversation on doing this.
  • We need to provide better documentation for changes.  The operations team did not know what this new SE_REPL_EXCEPTION_STATS meant and that it was the result of a new code change.  Due to this, time was lost exploring “standard” performance issue mitigations – like failing over to alternate nodes.  Those mitigations, of course, had no effect.
  • Once the change was identified as the culprit and we knew there was a feature flag, it took approximately 45 minutes to figure out how to turn it off due to tooling issues and lack of documentation.

Improvements in incident response

  • It took 66 minutes to get ahold of an engineer that understood the malfunctioning code.  Granted it was the middle of the night but our expectation is 15 minutes.
  • We should always check the change logs on affected clusters.  Time was wasted on “standard” mitigations before the team realized that a deployment had happened on that cluster hours before the incident.

 

Summary

It was a 7.5 hour incident.  I’m sorry it was so long.  I’m sorry it followed so closely on last week’s incident.  That’s not normal – usually we go months with no significant incident and I hope to get to more than a year.  All told it was probably 3-4 hours longer than it had to be due to inefficiencies.

  • We’ve got work to do on VS Online resilience – that work is under way.  It will take many months to complete all of it but I hope to have significant improvements in the next few months.
  • We’ve got work to do on better coordination of rollouts across our services.
  • We’ve got some things to improve in how we manage live site incidents.

We’re going to work hard to ensure this is the beginning of another many month, incident free, stretch.  Thank you for your patience.

Brian