The information on server sizing in this blog post is no longer up to date. This was very early guidance before we had done many measurements. See http://blogs.msdn.com/bharry/archive/2006/01/04/509314.aspx and later posts on my blog for more up to date information on our recommendations.
We’re deep into our load testing and server sizing efforts and I thought I’d share with everyone what we are doing and how we are thinking about it.
How big of a server do I need to support my team? Should I use a “single server” or separate Application tier and Data tier configuration? If I double the size of my team, will I need to increase the capacity of my server? We’ll provide high level guidelines for a mapping from team size to server configuration but if you want to understand how we do it and replicate it yourself, I’ll describe it for you.
To date, we’ve been very conservative on our server size recommendations. I think we’ve blogged this before, but our dogfood server configuration is two servers: a 2P 2.?GHz 4GB AT (application tier) and a 4P 2.?GHz 16GB DT (data tier), serving about 400 people. You’ll note our memory config is larger than what is recommended below. Officially our recommendations are as follows (they may have changed slightly from this but they are close).
Configuration                   | Tiers                            | CPU                          | HD     | Memory
One server, fewer than 20 users | Application and data tier server | Single processor, 2.2 GHz    | 8 GB   | 1 GB
One server, 20 to 100 users     | Application and data tier server | Dual processors, 2.2 GHz     | 30 GB  | 2 GB
Two servers, 100 to 250 users   | Application tier server          | Single processor, 2.2 GHz    | 20 GB  | 1 GB
Two servers, 100 to 250 users   | Data tier server                 | Dual processors, 2.2 GHz     | 80 GB  | 2 GB
Two servers, 250 to 500 users   | Application tier server          | Dual processors, 2.2 GHz     | 40 GB  | 2 GB
Two servers, 250 to 500 users   | Data tier server                 | Quadruple processors, 2.2 GHz | 150 GB | 4 GB
Once we have finished our load testing, I expect we will revise the configurations above. Based on what I’ve seen so far, we will be increasing the supported team sizes.
Approach to load testing
The first question to answer is “how much load does a “typical” user put on the system?” We’ve gathered this information by monitoring usage of our dogfood server for months. Of course this can vary by team based on the sizes of the projects they work on, their development methodologies and the number and type of automated tools that they have. For now we are making the assumption that, to a first approximation, the usage we see on our dogfood server is representative of what other people will see. We are currently in the process of comparing our load to what other teams at Microsoft experience to do a little more validation of that assumption. It is possible for you to collect this data for your own team and here’s how…
Team Foundation Server has a feature to log every server “command” (a command is a TFS web service call) to a database on the data tier. The database is called TFSActivityLogging. Each row records what the request was, when it happened, how long it took, who did it, from what IP address, and more. This information is tracked in a rolling 7 day window – there’s a nightly SQL job that deletes data over 7 days old.
We have taken snapshots of this data over the past few months and done some processing on it to determine what the typical “peak” load on the server is. To do this we aggregated the commands into 10 minute windows and divided by 600 (# of seconds in a 10 minute window) to give the number of “commands/sec” in 10 minute chunks. We then graphed these 10 minute windows over a 24 hour day to see what the load profile throughout a day looks like. Based on that we can easily see what the peak “commands/sec” is (in 10 minute windows). For our dogfood server this came out to be about 35 commands/sec.
You can then take this number and divide by the number of users using the server – 35 peak commands/sec / 350 team users = 0.1 peak commands/sec/user. This tells you how many peak commands/sec each member of your team contributes – and 0.1 is what we have computed for our dogfood server (and are currently validating that it is consistent with other teams).
To enable command logging you’ll need to edit the setting for CommandLogging in the top level TFS web.config.
An important note – As I said, the TFSActivityLogging database includes all commands executed against the server. This includes not only requests made from clients but also requests made from one service to another on the server itself. For example, if you call the Version Control service, it may call the security service to do a permission check. Both are recorded. However, when computing the peak commands/sec/user above, you need to exclude the service-to-service requests because you only want to look at requests generated directly by clients. This will be important in how we craft the load testing below. You can filter these out easily because the rows in the TFSActivityLogging database include a “UserAgent” – what application was used to make the request. All service-to-service requests have the user agent set such that “w3wp.exe” appears in the string. You need to exclude these before you do the 10 minute aggregations.
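Here is a minimal sketch of that calculation in Python, just to make the steps concrete. It assumes you have already pulled the log rows out of the database as (timestamp, user agent) pairs; that row shape and the function names are my own illustration, not the actual TFSActivityLogging schema or any tool we ship.

```python
from collections import Counter
from datetime import datetime

WINDOW_SECONDS = 600  # one 10 minute window
EPOCH = datetime(1970, 1, 1)


def peak_commands_per_sec(rows, window_seconds=WINDOW_SECONDS):
    """Return the busiest window's average commands/sec.

    rows: iterable of (timestamp, user_agent) pairs, where timestamp is a
    naive datetime. This row shape is a hypothetical stand-in for the log.
    """
    counts = Counter()
    for timestamp, user_agent in rows:
        # Drop service-to-service requests, which carry "w3wp.exe"
        # somewhere in the user agent string.
        if "w3wp.exe" in user_agent.lower():
            continue
        # Assign the command to its 10 minute bucket.
        bucket = int((timestamp - EPOCH).total_seconds()) // window_seconds
        counts[bucket] += 1
    if not counts:
        return 0.0
    # Commands in the busiest window divided by the seconds in a window.
    return max(counts.values()) / window_seconds


def peak_commands_per_sec_per_user(rows, team_size):
    """Divide the peak commands/sec by the number of people on the server."""
    return peak_commands_per_sec(rows) / team_size
```

With our dogfood numbers, a peak of 35 commands/sec across 350 users is what gives the 0.1 figure.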
Now you know how to compute the peak commands/sec/user (0.1 for our server). How do you use this to answer the questions above? Well, let’s say I want to grow my team from 50 people to 100 people on the same server. Will my server handle it? To handle it, the server will need to support 0.1 * 100 == 10 commands/sec peak.
To measure how much load a server can take, we’ve written a set of load tests for the Team System load testing tool. We can then run these load tests with a ramp load (increasing tests/sec) against different hardware configurations and different data set sizes to see how many commands/sec we can get and what the corresponding response times are. We can measure the tests/sec from the load test agents to determine how many commands per second we are generating. Unfortunately this also requires a bit of calculating to get it exactly right. There isn’t a 1-1 correspondence between tests/sec and commands/sec, because some tests execute multiple commands. To account for this, we run the load tests for a while, count the number of commands logged (excluding server-to-server ones) and divide that by the number of tests that the load test ran. This gives the number of commands per test, which is the ratio of commands/sec to tests/sec.
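The conversion is simple enough to sketch in a few lines of Python. The function names and the example counts in the comments are hypothetical, chosen only for illustration; they are not measurements from our runs.

```python
def commands_per_test(client_commands_logged, tests_run):
    """Ratio of logged client commands to load tests executed.

    client_commands_logged must already exclude server-to-server commands.
    """
    if tests_run <= 0:
        raise ValueError("tests_run must be positive")
    return client_commands_logged / tests_run


def commands_per_sec(tests_per_sec, ratio):
    """Convert a measured tests/sec rate into commands/sec."""
    return tests_per_sec * ratio


# Hypothetical example: 51,900 client commands logged over 10,000 tests
# works out to 5.19 commands per test.
example_ratio = commands_per_test(51900, 10000)
```

Once you have the ratio, multiplying it by the tests/sec reported by the load test agents gives the commands/sec the server was handling.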
Two other important issues remain with the load tests. First, you need to determine what the appropriate “mix” of commands to simulate is. To do this, we’ve gone back to the command log on our dogfood database. We run a query that aggregates the counts of each command and gives each one as a percentage of all commands. It turns out that relatively few commands make up the vast majority of that load. As a result we’ve actually only written load tests for about 30-40 of the 130 or so possible commands, because all of the others combined don’t account for more than 1% of all commands executed against our dogfood server. Using this data, we created a load test mix that approximates the load distribution we see on the dogfood server.
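If you want to compute a mix like this from your own command log, the aggregation might look something like the following Python sketch. It assumes you have a flat list of command names taken from the log with service-to-service calls already removed; the function name and the min_share cutoff are my own illustration rather than the query we actually run.

```python
from collections import Counter


def command_mix(commands, min_share=0.0):
    """Return (command, share) pairs, most frequent first.

    commands: iterable of command names, one entry per logged call.
    min_share: drop commands whose share of all calls is below this fraction.
    """
    counts = Counter(commands)
    total = sum(counts.values())
    if total == 0:
        return []
    mix = [(name, count / total) for name, count in counts.most_common()]
    return [(name, share) for name, share in mix if share >= min_share]
```

Raising min_share is one way to decide which commands are worth writing load tests for and which can be left out of the mix.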
Here’s the dogfood distribution we are basing the load tests on. This includes all web methods with more than 0.1% of the total execution time. Note, we are still doing some perf work so I expect this to change some over time.
Service           | Web method             | Execution Time (s) | # of calls | % Execution Time
Version Control   | Download               | 83528.15           | 2494408    | 24.51%
Version Control   | UpdateLocalVersion     | 68361.83           | 25179      | 20.06%
Version Control   | PendChanges            | 53533.50           | 2834       | 15.71%
Version Control   | Get                    | 42588.98           | 17913      | 12.50%
Version Control   | QueryPendingSets       | 19670.23           | 14065      | 5.77%
Version Control   | QueryItemsExtended     | 11612.75           | 7974       | 3.41%
WorkItem Tracking | QueryWorkitemCount     | 8007.08            | 126827     | 2.35%
WorkItem Tracking | QueryWorkitems         | 6563.08            | 88849      | 1.93%
WorkItem Tracking | PageWorkitemsByIds     | 6388.91            | 41729      | 1.87%
Version Control   | UndoPendingChanges     | 6345.44            | 1021       | 1.86%
WorkItem Tracking | Update                 | 5991.57            | 6488       | 1.76%
Version Control   | Unshelve               | 2382.38            | 373        | 0.70%
Version Control   | QueryHistory           | 2209.30            | 6228       | 0.65%
Version Control   | Resolve                | 2009.92            | 1845       | 0.59%
WorkItem Tracking | GetWorkItem            | 1991.33            | 46362      | 0.58%
Version Control   | QueryItemsById         | 1717.28            | 88816      | 0.50%
Version Control   | QueryItems             | 1660.01            | 18974      | 0.49%
WorkItem Tracking | GetMetadata            | 1650.80            | 9244       | 0.48%
Version Control   | FilterPaths            | 1463.33            | 10597      | 0.43%
Integration       | IsMember               | 1412.33            | 289420     | 0.41%
WorkItem Tracking | PageWorkitemsByIdRevs  | 1158.99            | 1537       | 0.34%
Version Control   | Checkin                | 1128.93            | 498        | 0.33%
Version Control   | Upload                 | 1048.00            | 8096       | 0.31%
WorkItem Tracking | GetStoredQueries       | 1038.85            | 17445      | 0.30%
WorkItem Tracking | PageItemsOnBehalfOf    | 609.14             | 4194       | 0.18%
Integration       | GetRegistrationEntries | 564.51             | 6833       | 0.17%
Version Control   | QueryShelvesets        | 561.55             | 1365       | 0.16%
Version Control   | QueryWorkspaces        | 491.65             | 4748       | 0.14%
Integration       | ListProjects           | 395.92             | 4286       | 0.12%
Version Control   | QueryChangeset         | 382.45             | 3916       | 0.11%
TFS Build         | StartBuild             | 357.30             | 5          | 0.10%
Second, you need to think about how much data is on the server you are testing. This can have a huge impact on your results. Running against a small database can yield extremely high rps (requests per second) because all of the data is kept in memory. A much larger database might support 2, 3 or 4 times fewer rps. To be able to experiment with different sizes of data, we created a tool we call “dbfiller”. You tell it how many Team Projects, Files, Work items, attachments, … that you want and it creates a database with that data in it (a really large database might take a day or two to create). Another approach is to take a backup of your production database and restore it onto a test server to run load tests against. You wouldn’t want to run the load tests against a production server as the load tests simulate many commands (including creating, modifying and deleting work items, files, etc).
When we run the load test, we run it with a step load, stepping 1 user every 30 minutes and we run it for 8 hours. Note that the number of “users” in the load test has no real correspondence to the number of people we are simulating. Each “user” in the load test ramp represents 1 thread in the load test engine running requests back to back as fast as it possibly can. We are using the calculations above to determine how many people we can support. The load test “users” are just a way to gradually increase the load on the server. The actual count is not important.
While it runs, we collect perf counters, including tests/sec, CPU utilization, disk queue length, available memory, etc. We also collect response times for the tests. In determining when the server has reached the most load it can service we look at a couple of things. First we look at CPU utilization. Remember what we’re measuring against here is “peak” load so we run the CPU fairly hot. That said, I’d say once the server hits 80-85% utilization you’re done. You definitely don’t want to run your server past that (even at peak times) and if you plan on seeing growth on your server, I’d not want to run it past 50-60% peak utilization. Second we look at response times of the tests. The main thing we are looking for here is “the knee in the curve”. The response time is how long a user will be waiting for that command to complete. You want to make sure that the times are “reasonable” and below any “knee”. The “knee” is the point in the curve where small increases in load start yielding disproportionately larger increases in the response time. When this happens your server is too busy.
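There is no single agreed way to pick out the knee, and we mostly eyeball it on a graph. As a rough illustration only, here is a Python sketch of one possible heuristic: treat the slope of the first segment of the curve as the baseline and report the first load level after which the slope exceeds a chosen multiple of it. The function name, the factor of 3 and the sample data are all assumptions of mine, not something taken from our tooling.

```python
def find_knee(points, factor=3.0):
    """Return the load at which the response time curve starts to bend.

    points: list of (load, response_time) pairs with strictly increasing load.
    factor: how many times steeper than the first segment a later segment
            must be to count as the knee (an arbitrary, tunable choice).
    Returns None if no segment is that steep.
    """
    if len(points) < 3:
        return None
    # Slope of the response time curve on each segment between samples.
    slopes = [
        (y1 - y0) / (x1 - x0)
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    ]
    baseline = slopes[0]
    if baseline <= 0:
        return None  # heuristic needs a rising baseline to compare against
    for i, slope in enumerate(slopes[1:], start=1):
        if slope > factor * baseline:
            # The segment starting at points[i] is the first steep one.
            return points[i][0]
    return None


# Made-up samples: response time in ms at load levels 1 through 6.
sample = [(1, 100), (2, 120), (3, 140), (4, 160), (5, 300), (6, 700)]
knee_load = find_knee(sample)
```

On the made-up samples the curve is flat up to load 4 and bends sharply after it, so that is the load the heuristic reports. A perfectly linear curve returns None, which matches the case where the test never pushed the server hard enough to reach the knee.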
Putting it all together
A few weeks ago we did our first full load test run using this methodology. We used the numbers I’ve quoted above and in addition to that we used:
Single server install configuration:
  Dual Processor, 3.4GHz, 4GB RAM, SATA drive
Data size (the most interesting subset of information):
  36 Team Projects
  21,000 change sets
  27,750 work items
  194,250 work item revisions
  2,000 work item queries
Note, the database grows as the load test runs because part of what the load tests do is create new work items, add new files, etc.
We achieved 13.6 tests/sec at 50% CPU utilization and 20.4 tests/sec at about 80% CPU utilization. The response times were more linear than we predicted. I don’t think we ran the tests to high enough load to see the knee (we certainly didn’t see it up to 85% CPU utilization). We’re investigating this a bit more because you certainly should see it at some point.
Using the math above, with 5.19 commands/test as the ratio of commands to tests for this load test mix, this server configuration can support a team of: 20.4 tests/sec * 5.19 commands/test / 0.1 peak commands/sec/user = roughly 1060 users.
That’s the absolute peak. In sizing a server for my team, I’d use the 50% CPU utilization number, resulting in: 13.6 tests/sec * 5.19 commands/test / 0.1 peak commands/sec/user = 706 users.
Now, the truth is that a team of 706 people would actually have more data than we used to populate the database (we based these numbers on a 100-200 person team) so it wouldn’t actually support a team of that size – the additional data size would cause some slow down. However, it tells you that if this is your data set size, you’re covered for a pretty darn long way.
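For anyone who wants to plug in their own measurements, the sizing arithmetic can be written as a short Python sketch. The function names are mine; the inputs in the example are the figures quoted in this post.

```python
def supported_users(tests_per_sec, commands_per_test, peak_cmds_per_sec_per_user):
    """Team size a server can carry at a given measured test rate."""
    return tests_per_sec * commands_per_test / peak_cmds_per_sec_per_user


def required_commands_per_sec(team_size, peak_cmds_per_sec_per_user):
    """Peak commands/sec a server must sustain for a team of this size."""
    return team_size * peak_cmds_per_sec_per_user


# Figures from this post: 5.19 commands/test, 0.1 peak commands/sec/user.
at_80_percent_cpu = supported_users(20.4, 5.19, 0.1)  # about 1059 users
at_50_percent_cpu = supported_users(13.6, 5.19, 0.1)  # about 706 users
growth_target = required_commands_per_sec(100, 0.1)   # about 10 commands/sec
```

The same two functions answer both directions of the question: how many people a measured server can carry, and how many commands/sec a planned team size will demand.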
I hope this analysis was interesting and helpful. Sorry for writing so much but it’s a pretty involved topic. I’ll try to find some shorter things to write about. If there is interest, we can look at posting our load tests, dbfiller and other tools that we used so customers can use them in their own capacity planning exercises. As I said before, we’ll be running this over a variety of configurations and creating an updated table like the one at the top for people who don’t want to invest a week in crafting their own custom load test configuration.