Back in April, the week before the VS2010 worldwide launch, we successfully upgraded the server to TFS2010 RTM. Because this is such a large server and almost 4,000 people in the division depend on it for their day-to-day work, it took a couple of months of planning, testing and dry-runs to get done. Since then, we've also upgraded our proxy servers to Windows 2008 R2 + TFS2010, upgraded our SQL server to Windows 2008 R2 + SQL Server 2008 R2, moved to a new set of hardware, and upgraded and consolidated a couple of other servers onto this server. A busy year so far!
This server has had an interesting history, which makes this upgrade particularly important. It originally started as the dogfood server for the TFS team in December 2004, as you can see from the very first checkin:
C:\>tf history /collection:http://vstfdevdiv:8080 $/ /version:C1 /format:detailed /noprompt /stopafter:1
Date: Friday, December 10, 2004 10:04:32 AM
Initial creation of the repository
A brief history
Since that first checkin, it has been constantly patched and upgraded ahead of each release (CTPs, Betas, SPs, etc.). Then in early 2008 the whole division on-boarded to the server and it became the single source & bug repository for the division. During this on-boarding period there were lots and lots of patches made so the server could scale to the unique demands of the division. These patches were all rolled into the product and shipped as part of TFS2008 SP1.
Then in mid-2008 we started to have some big growing pains as the number of users and the demands on the server increased. There was a lot of pressure from up the chain and across the division to fix things and make it better. This ultimately led to what we referred to internally as “the schema change” and you can read more about the impacts of it in Matt’s change to slot mode in TFS2010 blog post.
The improvement that this change brought is pretty clear from the following chart which shows Command Time vs. Command Count – up until the patch was deployed, performance for large operations (Gets, Merges, Branches of millions of files) was pretty bad.
However, getting the schema upgrade deployed was neither smooth sailing nor a silver bullet for our problems. The chart below shows our availability over the last 2 years. As you can see, we were not in good shape towards the end of 2008. The schema upgrade involved adding a new non-NULL column to a 5 billion row table. Our initial attempt performed this in a single transaction, which ran for many hours. After it had been running for ~48 hours, our SQL cluster failed over to the passive node, which caused the transaction to start rolling back. This is when we discovered that rollback is single-threaded and lower priority, so we had to wait almost 4 days for the transaction to roll back before we could bring the server back online. That was not a good week and we learnt many lessons from that upgrade.
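To illustrate why one giant transaction is risky, here is a minimal sketch of a common alternative: add the column as nullable first, then populate it in small batches that are each committed separately, so a failure only loses the batch in flight. This is not the approach the TFS team used (the post does not describe it), and it runs against an in-memory SQLite database rather than SQL Server. The table `items`, the column `new_flag`, and both function names are hypothetical.

```python
import sqlite3


def make_demo_db(n_rows):
    """Build a small in-memory table standing in for the very large one."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO items (name) VALUES (?)",
        ((f"file{i}",) for i in range(n_rows)),
    )
    # Step 1: add the column as nullable, so no existing row has to be rewritten yet.
    conn.execute("ALTER TABLE items ADD COLUMN new_flag INTEGER")
    conn.commit()
    return conn


def backfill_in_batches(conn, batch_size=1000):
    """Step 2: fill the new column one key range at a time, committing each batch."""
    cur = conn.cursor()
    last_id = 0
    batches = 0
    while True:
        # Find the keys that make up the next batch.
        cur.execute(
            "SELECT id FROM items WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        )
        rows = cur.fetchall()
        if not rows:
            break
        upper = rows[-1][0]
        cur.execute(
            "UPDATE items SET new_flag = 0 "
            "WHERE id > ? AND id <= ? AND new_flag IS NULL",
            (last_id, upper),
        )
        # Committing here keeps each transaction (and any rollback) small.
        conn.commit()
        last_id = upper
        batches += 1
    return batches


if __name__ == "__main__":
    demo = make_demo_db(2500)
    print("batches committed:", backfill_in_batches(demo, batch_size=1000))
```

Because each batch is committed on its own, an interruption such as a cluster failover would only roll back the current batch, and the loop can resume from the last committed key instead of starting over.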
Once that upgrade was complete and the TFS problems were fixed, it was like a spotlight came on and exposed some problems in our underlying infrastructure (cluster failovers, poor disk performance, network failures, hardware failures). Over the next six months, we had a team of people from both the product group and the operations side dedicated to getting to the bottom of all the issues, with a focus on getting the division stable again.
The end result is that, by dogfooding our own product, we’ve changed how we approach upgrades and made them more robust, which makes for a much better experience for everybody.
Here’s the latest DevDiv TFS Statistics:
- Team Projects: 72
- Files: 981,754,813
- Uncompressed File Sizes: 20,317,315
- Checkins: 1,912,072
- Shelvesets: 244,324
- Merge History: 2,342,520,807
- Workspaces: 38,625
- Local Copies: 4,251,932,059
- Users with Assigned Work Items: 5,121
- Total Work Items: 897,787
- Areas & Iterations: 11,835
- Work Item Versions: 8,540,808
- Work Item Attachments: 472,487
- Work Item Queries: 93,572