You haven’t heard from me for a while because I’ve been taking a bit of a break for the past month to help out at home with our new daughter. I returned to work on Monday. As I’m getting back into the swing of things, I thought it would be a great time to write a new blog post. I apologize for the length – there’s a lot to say and I didn’t really have the time to break it into chunks and dribble it out.
Expanding our Dogfooding
As you know the VSTS team has been using TFS internally for well over a year. John Lawrence (http://blogs.msdn.com/johnlawr) has been blogging various statistics from our dogfood server for quite some time. I think I’ve also mentioned that after we released V1, our plan was to roll out TFS to the rest of the Developer Division. Well, we’ve been working on that for the past couple of months – getting internal tools to work with TFS, updating scripts, porting data, migrating teams, etc. While we’re not nearly done, we’ve made some great progress and I thought I’d share it with you.
Getting all of DevDiv to use TFS is a significant challenge. A single branch of our source code is over 600,000 files today. We are in the process of adding all of our test source to our branches (historically they have been in a separate repository). This will add more than 1,000,000 more files to each branch. By the time all is said and done, considering all of the branches we use, the TFS server will contain 100,000,000 files or more (no, that’s not a typo). This will make it one of the (if not the) largest source code databases on the planet. There are other systems in the world with more source (for example the Windows NT source base is bigger, however it is broken up across approximately 9 different version control databases).
This is an ambitious goal to take on within a month of shipping TFS. Today we have about 1,000 users and over 13,000,000 files in our TFS installation. We’ve learned a great deal from the exercise and have plans to continue growing both the number of users and number of branches over the next 6 months or so. Over that time, I’m sure we’ll learn even more. We simulated very large databases in our labs before we shipped V1 but there’s nothing quite like seeing what happens in a production environment where the unbridled masses are unleashed.
Here’s a snapshot of the overall server stats that John has been publishing for a while so that you can see how it has grown in all of the dimensions.
Recent users: 806
Users with assigned work items: 1,023
Version control users: 1,229
Work items: 94,695
Areas & Iterations: 5,897
Work item versions: 721,531
Attached files: 24,569
Stored Queries: 9,160
Total compressed file sizes: 171.8GB
Total checkins: 72,223
Pending changes: 106,113
Commands (last 7 days)
Work Item queries: 22,085
Work Item updates: 5,909
Work Item opens: 35,072
As I said, we’ve learned a lot from the exercise so far and have made several product fixes as a result. We are rolling all of the fixes we’ve made into a service pack that we will make available publicly later this year (please don’t ask me to be more concrete than that as we are still in the midst of planning the release).
The good news is that with relatively minor changes the system is performing well under impressive load and scale. Most of the growth pains we’ve had have been around version control data. As you can see above, the number of files under version control has grown by almost a factor of 10 since John last reported it in February. I’d also like to point out that although I’m going to describe some issues we’ve dealt with in the rollout to DevDiv, nowhere in the process have we experienced a single data corruption of any type.
As we started growing the amount of data in the server we ran into some service issues. These service issues were caused by operations taking way longer than they should and blocking other people from using the server for periods of time. With the patches and operational changes we’ve made over the past month, we’ve restored the server’s level of performance and availability.
This is what dogfooding at Microsoft is all about. It allows us to really push the system under real production conditions with massive amounts of data and significant load. We can examine carefully how the server is behaving. We can experiment (albeit carefully) with alternative approaches to really push the scale of the system. In the spirit of transparency (and hopefully to share some interesting challenges we’ve faced), I’m going to describe many of the issues we’ve hit, what we’ve done and what further we are planning on doing. I hope your take away from this is not simply that the server has had problems – every system has problems at some level of scale or load. What I hope you take away is that we are in this with our customers and are pushing the system as hard as we can and addressing the issues. All of our customers (including those internally at Microsoft) will benefit from the effort we put in as we continually use and improve the system.
What we’ve learned
Most of the issues can ultimately be traced back to the size of data in some form. There are two dimensions of size that affect the server – how much data is in the server (making tables and indexes larger, etc.) and how big individual operations are (consuming memory, CPU, locks, etc.). As I mentioned above – a single branch in the VS database is over 600,000 files. As we started to manipulate single branches this large (checking in all 600,000 files, merging 100,000’s of changes, and so on), we ran into some issues. We found that having 600,000 pending changes in a single workspace didn’t work well. As the warehouse data grew to millions and millions of files, it had issues of its own. Our initial approach to work around some of these problems was to break up these really large operations and do them in smaller chunks (for example – checkin 50,000 files at a time rather than 600,000). In many cases (but not all) this is a fine solution and only a minor annoyance.
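To make the chunking workaround concrete, here is a minimal Python sketch. It is not TFS code: the `checkin` callable and the function names are hypothetical stand-ins for whatever actually submits a batch of files to the server.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], chunk_size: int) -> Iterator[List[str]]:
    """Yield successive slices of at most chunk_size items."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(items), chunk_size):
        yield list(items[start:start + chunk_size])


def checkin_in_chunks(paths: Sequence[str], chunk_size: int,
                      checkin: Callable[[List[str]], None]) -> int:
    """Call the (hypothetical) checkin function once per chunk.

    Returns the number of separate checkins that were made.
    """
    calls = 0
    for batch in chunked(paths, chunk_size):
        checkin(batch)  # each batch is its own, much smaller, operation
        calls += 1
    return calls
```

With 600,000 paths and a chunk size of 50,000 this makes 12 separate checkins. The trade-off is that the change is no longer a single atomic checkin, which is why it is a workaround rather than a fix.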
We’ve completed our analysis of the underlying constraints that we were hitting and have fixed, or are working on fixes to address all of them. Here’s a summary of some of the things we have learned.
Sprocs & Query plans
We’ve made a variety of tweaks to our stored procedures to induce better query plans when we find ones that aren’t working very well. As the amount of data grows, any query plan that is not optimal can quickly go from a few milliseconds to minutes. Changes we’ve made include:
Checkin, Undo, Rename - Changed the sprocs because the performance degraded when there were 100,000’s of pending changes in a single workspace
SetPermission, SetPermissionInheritance - Fixed a performance problem that appeared when the number of files in the system got really large and the depth of the tree grew.
Get – We discovered that many of the get operations were for individual files or for very small groups (often done by automated tools). Our V1 implementation took approximately the same amount of time whether you were doing a get of a single file or an entire workspace of 10,000’s of files (ignoring any file download time). We have enhanced the get sproc now to optimize for what is actually asked for. The result is that single file gets have dropped from about 5 seconds to a few hundred milliseconds.
Merge – We found that a big part of the time spent in merge was computing what changes needed to be made to update the client. This was causing blocking of other operations on the server. We moved the client calculation outside the merge transaction and reduced concurrency problems.
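The Merge change above follows a general pattern: hold the transaction only for the work that has to be atomic, and do the expensive per-client calculation afterwards. The Python sketch below illustrates that pattern only. The `MergeServer` class is hypothetical, and a `threading.Lock` stands in for the database transaction; the real fix was made in the stored procedures.

```python
import threading
from typing import Dict, Iterable, List


class MergeServer:
    """Toy model: a lock stands in for the merge transaction."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._versions: Dict[str, int] = {}  # path -> latest version number

    def merge(self, changes: Iterable[str],
              client_versions: Dict[str, int]) -> List[str]:
        changes = list(changes)
        # Inside the "transaction": apply the changes and capture a snapshot.
        with self._lock:
            for path in changes:
                self._versions[path] = self._versions.get(path, 0) + 1
            snapshot = {path: self._versions[path] for path in changes}
        # Outside the "transaction": work out what the client must update.
        # Other callers are no longer blocked while this runs.
        return sorted(path for path, version in snapshot.items()
                      if client_versions.get(path, 0) < version)
```

The snapshot taken under the lock is what keeps the later calculation consistent, even if another merge commits in the meantime.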
We’ve run into an issue with database locking. The biggest effect has been on the pending change table when doing things like really large checkins. In these scenarios we need to take write locks for every pending change being affected. SQL’s row level locking does a great job. However, to limit the amount of memory a transaction can use for locks, at a point (about 5,000 row locks) SQL stops locking each individual row and “escalates” to a table lock. This means that the transaction has every row in the table locked. While, for the vast majority of applications, this is not a consideration (either tables aren’t that big, not that many rows need to be locked or concurrency is low) – it is an issue for our application. It means that for the duration of a checkin transaction of over 5,000 items no one else can pend any changes (checkout) because the table is locked. For checkins of small numbers of 1,000’s of items (5,000, 10,000, 20,000) it’s not too noticeable because the checkin is fast enough. However, checkins of 100,000’s of files can take many minutes and everyone screams when they can’t checkout for minutes at a time.
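The escalation behavior can be pictured with a deliberately simplified Python model. This is an illustration, not how SQL Server decides: the real escalation logic considers more than a single row count, and the threshold constant is just the approximate figure quoted above.

```python
ESCALATION_THRESHOLD = 5000  # approximate figure from the text, not an exact rule


def lock_granularity(rows_locked: int,
                     threshold: int = ESCALATION_THRESHOLD) -> str:
    """Return 'row' or 'table' for a transaction locking rows_locked rows."""
    return "table" if rows_locked > threshold else "row"


def blocks_other_checkouts(rows_locked: int,
                           threshold: int = ESCALATION_THRESHOLD) -> bool:
    """A table lock on the pending change table blocks everyone else."""
    return lock_granularity(rows_locked, threshold) == "table"
```

Under this model a 100-file checkin leaves other users free to pend changes, while a 600,000-file checkin locks the whole table for as long as the transaction runs.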
The pending change table is not the only place we can hit this. We can also hit it on the LocalVersion table, the Version table, etc but it tends to be the one we’ve hit the most. The problem, however, is general and we are looking for a general solution.
We have worked with the SQL team to understand all of our options. The first option we are trying is to disable row locks on the tables/indexes where this affects us. This will cause SQL to use page locks instead. Based on the way our tables and indexes are clustered, we don’t believe this coarser granularity of locking is going to be a problem for us. Another approach we will try, either together with the first or separately, is to disable lock escalation, causing SQL to continue to take finer grained locks no matter how many are needed (never escalating to table locks). Of course, this means that the server can use more memory for locks but we are planning for that. We’re in the process of upgrading to our final production hardware, which is a 64-bit server with 32GB of RAM. We won’t know for sure that this fully addresses the problem until we’ve finished testing it.
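To get a feel for the memory cost of never escalating, here is a back-of-the-envelope Python estimate. The per-lock size is an assumption, not a figure from this post: the default of 96 bytes is a placeholder, so check the SQL Server documentation for your version and platform before relying on it.

```python
def lock_memory_mb(lock_count: int, bytes_per_lock: int = 96) -> float:
    """Estimate memory used by lock_count locks, in MB (1 MB = 1024 * 1024 bytes).

    bytes_per_lock is an assumed placeholder value, not a measured one.
    """
    return lock_count * bytes_per_lock / (1024 * 1024)
```

Under that assumption, 600,000 row locks held by one checkin come to roughly 55 MB. That is small next to 32GB of RAM, though the total grows with the number of large transactions running at once.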
In the mean time we’ve chunked most of the really large operations up (as I described above) and moved them more to off hours and this is not creating a problem for us at the moment.
Read Committed Snapshot Isolation (RCSI)
RCSI is a really cool new feature in SQL Server that allows readers to maintain read committed isolation semantics without taking read locks. It does this by proactively making copies of data that is changed and allowing readers to access these “old versions” of the rows when needed. RCSI is a database level setting so when enabled, it applies to all tables in the database. There are some tables (most notably the LocalVersion table) that benefit substantially from RCSI. RCSI gives us substantially better concurrency and performance.
However, we have discovered a side effect. It’s not huge but it aggravates the locking issues a bit. RCSI is pretty smart about how it works and it actually only copies the changed portions of a row to minimize the amount of data copied. However, some of our really large operations delete large numbers of rows from some tables. For example, in the 600,000 file checkin case, we need to delete 600,000 pending change rows when the checkin is complete. RCSI has to make a copy of the “changed” data but because the operation is a delete, all of the data in the row “changes”. This means that RCSI has to copy all 600,000 rows. We have found that under load this alone can take several minutes and is sometimes the single most expensive part of an operation. When we disable RCSI, the same deletes complete in less than half the time.
We are investigating a variety of options. The first is to change the pattern so that rather than deleting the rows in the transaction, we instead change the value of a non-indexed column to indicate that the row is “deleted”. We could then delete those rows in a separate transaction with an isolation level below read committed (read uncommitted) and thereby avoid the row copying that happens with RCSI. If we’re not happy with that, our alternative is to change RCSI from “on” for the entire database to “enabled” and then go through all of the sprocs and specifically state which transactions need that semantic and which don’t. This would be a much more impactful change requiring much more testing.
When we started trying the huge operations (like checkins of 600,000 files), we saw a variety of out of memory problems on the application tier server. After investigation, we discovered that some of these operations were requiring up to 1.5GB of memory on the AT to hold the results before sending them back to the client. Because our AT is a 32-bit application we started running out of virtual memory and the web service would recycle. Recall that normal 32-bit processes have only 2GB of virtual address space. Although there is a configuration to increase this to 3GB, it is generally not recommended for IIS worker processes.
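The first option, flagging rows as deleted and physically removing them later, can be sketched as follows. This is only an illustration of the pattern, using Python’s built-in sqlite3 module. SQLite has no RCSI, and the `pending_change` table, the `is_deleted` column and the function names are all hypothetical rather than the actual TFS schema.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a toy pending change table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE pending_change ("
        "workspace TEXT, path TEXT, is_deleted INTEGER NOT NULL DEFAULT 0)")
    return conn


def checkin(conn: sqlite3.Connection, workspace: str) -> None:
    """Inside the checkin transaction: flag the rows instead of deleting them."""
    with conn:  # commits on success, rolls back on an exception
        conn.execute(
            "UPDATE pending_change SET is_deleted = 1 WHERE workspace = ?",
            (workspace,))


def purge_deleted(conn: sqlite3.Connection) -> int:
    """Later, in a separate transaction: physically remove the flagged rows."""
    with conn:
        cur = conn.execute("DELETE FROM pending_change WHERE is_deleted = 1")
    return cur.rowcount
```

Every reader of the table then has to filter on the flag (`WHERE is_deleted = 0`), which is the main cost of this approach. In return, the expensive delete moves out of the checkin transaction.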
The biggest occurrence of this was operations that return what we call “GetOperations”. This is data that tells the client what needs to be updated. For example, if I call the “Get” web service, it returns an array of “GetOperations” to tell the client what to do to bring itself up to date. This also happens with delete, rename, merge and others. When doing these operations on 100,000’s of files the AT would run out of memory. We also saw the Warehouse hit this problem when it tried to process code churn information for a single checkin of 100,000’s of files for the same reason.
We’ve taken a variety of approaches to help with this problem. First, the single biggest portion of a GetOperation is the “download ticket” that allows you to download the file. Some of the operations getting these large numbers of GetOperations had no intention of downloading the files, so we modified them to request that the tickets not be generated. Some operations were changed to request the download tickets in chunks after the initial list of GetOperations is returned. We also made some memory consumption optimizations to reduce how much memory is used during processing.
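The two ticket changes (skip the tickets when they are not needed, or fetch them later in chunks) can be pictured with this Python sketch. The class, field and function names are hypothetical and the ticket format is a placeholder; this is not the TFS web service contract.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Optional


def make_ticket(path: str) -> str:
    """Placeholder for generating a download ticket (format is made up)."""
    return "ticket:" + path


@dataclass
class GetOperation:
    path: str
    download_ticket: Optional[str] = None  # the biggest part of each result


def get_operations(paths: List[str],
                   include_tickets: bool = True) -> List[GetOperation]:
    """Return one GetOperation per path, optionally without tickets."""
    return [GetOperation(p, make_ticket(p) if include_tickets else None)
            for p in paths]


def tickets_in_chunks(ops: List[GetOperation],
                      chunk_size: int) -> Iterator[Dict[str, str]]:
    """Yield path -> ticket maps for chunk_size operations at a time."""
    for start in range(0, len(ops), chunk_size):
        batch = ops[start:start + chunk_size]
        yield {op.path: make_ticket(op.path) for op in batch}
```

A caller that never downloads files passes `include_tickets=False`. A caller that does can walk `tickets_in_chunks`, so only one chunk of tickets has to be held in memory at a time.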
Ultimately, there will always be some level at which the AT will run out of memory. We can trim down and optimize so maybe it will take a single operation of 10,000,000 files instead of 500,000 but there will always be a limit. Our future approach will be to make the AT work as a 64-bit process so that virtual memory will no longer be a meaningful limitation.
The warehouse too has had issues as a result of the flood of data. I mentioned the problem with AT memory above. In addition we have had an issue with the warehouse consuming too many resources (CPU and RAM) on the live server. With all of that data, it got to the point that the hourly cube processing could consume high CPU load for as much as 20 minutes. We are investigating the causes for this high load with the Analysis Services team. As a short term fix we made a change to our server topology to move the warehouse off of the live server and onto a separate warehouse server. In general, the large scale data warehouse best practice is to put your warehouse on different hardware than your operational server. Unfortunately, this is not a supported production configuration in TFS V1 so I cannot recommend it. Doing this WILL create problems with the serviceability of a TFS installation. We are investigating making this topology supported in the future.
When looking hard at the warehouse performance in the light of all of the data we were pumping we noticed that there was quite a lot of network traffic between AS and the SQL server (I don’t remember the details). As a result we updated all of the server to server network connections to be Gigabit ethernet.
We hit another problem (that required a sproc change) when we rapidly (with a tool) added 2,000 to 3,000 Areas. The work item tracking system was unable to consume this much change so rapidly and stopped working until we could fix it. We now have a patch available that can remedy the issue. If anyone reading this hits the issue, Customer Support should be able to help you.
As you can see we’ve hit a wide range of issues. I expect that as we continue the rollout and increase the data size substantially we’ll uncover new problems. However, the system is already holding up under a stunningly large amount of data. We’ll continue to fix issues as we find them and will make sure to deliver those fixes in the commercially available product. Many of the fixes we’ve done so far actually help performance even when the data sizes are not nearly so big. It’s just that the fixes go from being “nice to have” with smaller data set sizes to “necessary” at REALLY large data set sizes.
We set out to build TFS as a product that could scale from small teams to the largest enterprises. I’m immensely proud of what we have accomplished with the product so far. I really do hope this information doesn’t scare people. As I’ve said, every system has limits and will run into problems at some level of scale or load – we’re just being up front about what ours are. Those of you with hundreds of thousands or even small numbers of millions of files can use TFS V1 just fine with no additional fixes or patches and it will serve you well. As you grow, we’ll be ready for you. We won’t stop until TFS has more headroom than any team development system in existence.
Thanks for listening – if you’ve actually read this far I’m impressed.