We’ve had a number of outages and other serious incidents in recent months. It’s clear we haven’t done enough to invest in reliability of the service, and I want to give you some insight into what we are working on that will be coming in January and beyond.
First I want to give you a very brief description of the Visual Studio Online (VSO) service topology. VSO consists of a set of scale units that provide services like version control, work item tracking, and load testing. Each scale unit consists of a set of Azure SQL Databases holding customer data, virtual machines running the application tiers that serve the web UI and provide web services, and job agents running background tasks. We limit how many customers are located in a given scale unit and create new scale units as the service grows; there are currently six. We also have a central set of services that we call Shared Platform Services (SPS), which includes identity, account, profile, client notification, and more, and which nearly every service in VSO uses. When SPS is not healthy, the entire VSO ecosystem suffers.
We have been working on making VSO and especially SPS more resilient, as many of the recent outages have stemmed from problems affecting SPS. While the work will take months to complete, every three-week sprint deployment will include improvements. Here’s an overview of the work.
Breaking apart SPS
One of the lessons we learned from the outages has been to ensure that the less critical services cannot take down the critical services. We’ve had cases where a dependency, such as SQL or Service Bus, becomes slow, and our code has consumed precious resources like the .NET thread pool. While we have fixed the particular issues we’ve hit, the best approach is isolation. As a result, work is currently under way to split out profile and client notification services from SPS and have them be separate, independent services. Profile is responsible for your avatar and roaming settings in Visual Studio. Client notification is responsible for notifying Visual Studio of updates, such as notifying other instances of VS when you change your roaming settings. They are both important services, but most users will not notice if either of those is not working for a short period of time. This is part of an overall move to a smaller independent services model in VSO.
Another lesson has been that we have too much depending on a single configuration database for SPS. We are going to partition that database so that if there is an issue with one database, it affects only a subset of VSO accounts rather than all of them. Part of this work will be determining how many partitions we’ll use. This will also provide us more headroom, allowing us to absorb more spikes in load, such as those occurring during upgrades.
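To make the idea concrete, here is a minimal sketch (not our actual implementation) of how accounts can be mapped to one of N configuration-database partitions with a stable hash; the function name and partition count are purely illustrative:

```python
import hashlib

def partition_for_account(account_id: str, partition_count: int) -> int:
    """Map an account to one of N configuration-database partitions
    via a stable hash, so an outage in a single partition affects
    only the subset of accounts that live there."""
    # Normalize so the same account always hashes the same way.
    digest = hashlib.sha1(account_id.lower().encode("utf-8")).digest()
    # Use the first four bytes of the digest as a stable integer key.
    return int.from_bytes(digest[:4], "big") % partition_count
```

One design consequence worth noting: a plain modulo scheme like this re-maps most accounts if the partition count changes, which is one reason choosing the number of partitions up front matters.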
Today we have redundancy provided by our dependencies. For example, SQL Azure uses clusters of three nodes, Azure Storage maintains three copies of each blob, and we use multiple PaaS VMs for each role in a scale unit.
However, we don’t have redundancy at the service level. For SPS in particular, we need to provide redundancy to be able to fail over to a secondary instance of SPS if there is a problem in the primary. We will begin this work in the first quarter of this year.
Graceful degradation is a key principle of resilient services. We are actively working on making VSO services more resilient to their underlying dependencies, whether those are Azure services like SQL and Storage, or our own VSO services, such as SPS. Our first initiative is to contain failures by implementing circuit breakers. Circuit breakers work by detecting an overload condition, such as a backup caused by a slow dependency, and then failing subsequent calls fast, returning a default value, or taking some other appropriate action to reduce load and prevent the exhaustion of precious resources, such as the databases and thread pool. We now have them implemented in several places in the code. To have them work effectively, each needs to be tuned so that it trips only when needed. Based on the telemetry, we’ll configure them and let them operate automatically. We have quite a few places that need circuit breakers, and our primary focus is on SPS and the underlying VSO framework.
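As a sketch of the pattern (not our production code), a minimal circuit breaker might look like the following; the failure threshold and reset interval are exactly the knobs that need per-call-site tuning based on telemetry:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: after `threshold` consecutive failures,
    the circuit opens and calls fail fast (returning a fallback) for
    `reset_after` seconds; then one trial call is allowed (half-open)."""

    def __init__(self, threshold=5, reset_after=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # time the circuit opened, or None if closed

    def call(self, fn, fallback=None):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Open: fail fast without touching the slow dependency.
                return fallback
            # Reset window elapsed: go half-open and allow one trial call.
            self.opened_at = None
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()  # trip the breaker
            return fallback
        self.failures = 0  # success closes the circuit fully
        return result
```

The key property is the third branch: once the breaker is open, the dependency is not called at all, so a backed-up service stops consuming threads and connections while it recovers.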
We will also build additional throttling into the system, both to prevent abuse and to rein in tools that run operations too frequently or very inefficiently.
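A common way to implement this kind of throttling is a token bucket, which permits short bursts while capping the sustained rate per caller. This is an illustrative sketch, not our actual throttling code; the rate and burst values are assumptions:

```python
import time

class TokenBucket:
    """Token-bucket throttle: a caller may make `rate` requests per
    second on average, with bursts of up to `burst` requests."""

    def __init__(self, rate: float, burst: float, clock=time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock
        self.tokens = burst   # start with a full bucket
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill tokens for the time elapsed, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # over the limit: reject (or queue) the request
```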
Also, we are going to be introducing back pressure into the system – indications in API responses that there is a problem and that callers need to back off. Azure services such as SQL and Service Bus already provide this today, using particular error codes for failed requests that tell callers whether to retry and, if so, when. We make use of that information when calling those services, and we need to introduce the same for the services we build. This work will be starting within the next couple of months.
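On the calling side, honoring back pressure looks roughly like the sketch below. The `send` signature, status codes, and delay values are assumptions for illustration: the point is that a server-supplied retry hint takes precedence over the client's own exponential backoff.

```python
import random
import time

def call_with_backoff(send, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call `send()` and honor back-pressure hints. `send` is assumed to
    return (status, retry_after, body); a 429 or 503 status signals
    overload and retry_after, when present, says how long to wait."""
    for attempt in range(max_attempts):
        status, retry_after, body = send()
        if status < 400:
            return body
        if status in (429, 503):
            if retry_after is not None:
                delay = retry_after  # trust the server's hint
            else:
                # Exponential backoff with jitter to avoid thundering herds.
                delay = base_delay * (2 ** attempt) * (1 + random.random())
            sleep(delay)
            continue
        raise RuntimeError(f"non-retryable status {status}")
    raise RuntimeError("retries exhausted")
```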
In order to systematically analyze key components like SPS, we are adapting resiliency modeling and analysis (RMA) to our needs. This is a process we are just starting. In addition to the follow-up items it will generate, we want it to help us build a culture of reliability.
In order to verify our improvements as well as to continue to discover new issues, we’ve started investing in chaos monkey testing (a term that Netflix coined). We’re still in the early stages, but it’s something we will do a lot more with over the next few months.
SPS handles a very large volume of calls. We’ve been spending time understanding the sources of those calls and what we can do to reduce the call volume either by eliminating the calls altogether or caching frequently requested data, either in SPS or in the services making the requests. That’s resulted in a significant reduction in the number of calls made. Of course, the service continues to grow, so this work continues.
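A read-through cache with a time-to-live (TTL) is the simplest form of this idea: repeated reads of slowly changing data are served locally, and only stale or missing entries hit the backing service. This is an illustrative sketch, with the TTL and loader entirely hypothetical:

```python
import time

class TtlCache:
    """Tiny read-through cache with a per-entry TTL, to absorb repeated
    reads of slowly changing data (e.g. identity or profile lookups)."""

    def __init__(self, ttl: float, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._store = {}  # key -> (value, time cached)

    def get(self, key, loader):
        entry = self._store.get(key)
        now = self.clock()
        if entry is not None and now - entry[1] < self.ttl:
            return entry[0]          # fresh: serve from cache
        value = loader(key)          # miss or stale: hit the backing service
        self._store[key] = (value, now)
        return value
```

The trade-off is staleness: the TTL bounds how long callers may see old data, so it has to be chosen per data type rather than globally.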
We are also looking at the query plans and I/O performance data of our top stored procedures by using dynamic management views (DMVs) in SQL. We’ve done exercises to examine these in the past, and we are looking at how we can automate this. In January we will complete the process of moving our SQL Azure databases to the new version of SQL Azure that will provide us with XEvents for more insight into what’s happening at runtime. The result of this analysis will be tuning or in some cases rewriting stored procedures for optimal performance. The new version of SQL Azure also provides us with benefits such as improved tempdb performance.
When there is a problem, such as a service upgrade that puts too much load on SPS, we need to be able to quickly reduce the traffic to SPS in order to recover. We’ve seen several incidents lately where getting SPS back to a healthy state has required effectively draining the queue of requests that have resulted from the system being overwhelmed. Without doing this, the system doesn’t make sufficient progress to recover. Along with circuit breakers, we’ll add switches to fail requests fast, clear the queue, and get back to a healthy state quickly.
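Such a switch can be as simple as an operator-controlled flag that is consulted before doing expensive work; when tripped, requests fail immediately with a fallback instead of queuing behind an overloaded dependency. A minimal sketch (all names are illustrative):

```python
class KillSwitch:
    """Operator-controlled switch: when a feature is disabled, calls
    guarded by it fail fast with a fallback instead of piling up
    behind an overloaded dependency, letting the backlog drain."""

    def __init__(self):
        self._disabled = set()

    def disable(self, feature):
        """Trip the switch for a feature (e.g. during an incident)."""
        self._disabled.add(feature)

    def enable(self, feature):
        """Restore normal operation for a feature."""
        self._disabled.discard(feature)

    def guard(self, feature, fn, fallback=None):
        if feature in self._disabled:
            return fallback  # fail fast, shed load
        return fn()
```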
Monitoring and diagnosis
When there is an incident, having the right telemetry available saves valuable time. We already have extensive telemetry on VSO that generates gigabytes of data per day, which we analyze through alerting systems, dashboards, a SQL warehouse, and reports. We also use Application Insights Global System Monitoring for an “outside in” view of VSO availability. We are adding more telemetry around the effectiveness of caches, identity calls, and other areas where we’ve found we haven’t had enough insight into the system during live site investigations.
We continue to learn from every live site incident. We are investing heavily in making VSO reliable and more resilient to failures.
Just like our customers, we are entirely dependent upon the stability of VSO for the work we do. Since May, we have had the entire team that builds VSO using VSO for source control, work item tracking, builds, etc. As part of that migration we set up a scale unit that just has our team on it and where we deploy changes first before rolling changes out to the scale units with customers. This has proven very valuable for finding issues that happen under load at scale that are very difficult to find in testing. There is no place like production.
While the focus of this post has been to describe what we are working on, I want to share some highlights of improvements we’ve made recently.
- Decreased total calls to SPS by 25%
- Client notification and profile are throttled to limit resource usage
- Decreased CPU pressure on the web service VMs by 10%
- Reduced peak average SQL sessions on configuration database by 10%
We apologize for the outages we’ve had recently, and I wanted to let you know we are working hard at making the service more reliable.
Follow me on Twitter at twitter.com/tfsbuck