Since joining Microsoft, I’ve become intimately familiar with running a TFS server for ~3,500 users in Developer Division, and with its performance characteristics.
One thing I’ve learnt is that Performance Counters rule. You might observe the server being “slow” and you might notice that it “takes a while” to do certain operations – but you need evidence to back up your claims before anybody will take you seriously. The evidence that everybody has access to, that is reliable, and that people take seriously is the set of perf counters built into Windows.
If I think about the problems we’ve overcome in the last 12 months, the issues come down to these:
- IO – If the “LogicalDisk\Avg. Disk sec/Transfer” perf counter for any of your disk drives is more than 0.030 (30ms) – then you’re hosed. This counter is a primary indicator of disk latency. Get that fixed before doing anything. (see below for more details)
- Workspace Mappings – If you have unnecessary paths in your workspace mappings, then Get() will be much slower than it needs to be. E.g. DON’T map $/ to C:\Code and think that everything will be good. A root mapping isn’t truly a bad thing but if you aren’t careful it can lead to unexpected and potentially slower results.
- Latency/Download requests – Proxy servers help here by offloading Download() requests from the main server. They don’t help Work Item Tracking.
- CPU – Processor performance isn’t linear. If you’re running higher than ~70% CPU for periods of time, then you need to increase your processing capacity.
- SQL indexes/fragmentation – Sometimes the TFS SQL Jobs that update statistics & rebuild/reorganize indexes stop running, or don’t run for whatever reason. Check that the SQL jobs are running successfully and check for index fragmentation.
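To make the CPU rule of thumb above a bit more concrete, here is a minimal Python sketch that scans samples of the "Processor(_Total)\% Processor Time" counter (for example, exported from a counter log) and flags sustained load above ~70%. The 60-second sample interval and the 15-minute "sustained" window are assumptions chosen for illustration, since "periods of time" isn't a precise figure. The function names are hypothetical.

```python
def longest_run_above(samples, threshold=70.0):
    """Return the longest run of consecutive samples strictly above threshold."""
    longest = current = 0
    for value in samples:
        if value > threshold:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


def cpu_needs_capacity(samples, interval_sec=60, min_sustained_sec=900,
                       threshold=70.0):
    """True if CPU stayed above threshold for at least min_sustained_sec.

    interval_sec and min_sustained_sec are illustrative assumptions,
    not figures from any official guidance.
    """
    sustained_sec = longest_run_above(samples, threshold) * interval_sec
    return sustained_sec >= min_sustained_sec


if __name__ == "__main__":
    # Twenty one-minute samples of % Processor Time (made-up numbers).
    cpu = [45, 52, 71, 75, 80, 82, 79, 77, 90, 88,
           85, 76, 74, 73, 72, 78, 81, 60, 55, 50]
    print("Longest run above 70%:", longest_run_above(cpu), "samples")
    print("Needs more capacity:", cpu_needs_capacity(cpu))
```

A short spike resets nothing on its own; only a sample at or below the threshold ends a run, so a single busy minute won't trip the check.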
The tools you can use to diagnose performance issues are:
- PerfMon. Set up a perfmon counter log for the important counters. Track them and work out what’s “normal” for your load/environment.
- TfsActivityLogging database. Dive into this database and look for trends, heavy users, heavy tools, etc. Understand where your load is coming from.
- Download & install my TFS Performance Report Pack and look at the Execution Time Summary report.
- TfsServerManager.exe (Comes with the Team Foundation Server Power Tools, see Brian’s blog for more details)
- No shortcut gets created. Run it from "C:\Program Files\Microsoft Team Foundation Server 2008 Power Tools\TfsServerManager.exe"
- If users are reporting a problem, try to catch it while it’s still executing. Look at the “Source Control Request Queue” report. Is their request on top?
- Run the following query a few times in SQL to see if any blocking is occurring. It lists sessions that are blocking others without being blocked themselves. If the same spid hangs around for a while, run DBCC INPUTBUFFER(spid_here) to see what stored procedure it is and try to match that to a TFS command, e.g. prc_Get = Get()
    SELECT a.status, a.*
    FROM sys.sysprocesses a
    WHERE spid > 50
      AND spid <> @@spid
      AND blocked = 0
      AND EXISTS ( SELECT *
                   FROM sys.sysprocesses b
                   WHERE b.blocked = a.spid )
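If the logic of that query isn't obvious, here is a small Python sketch of the same filter applied to (spid, blocked) pairs, such as you might copy out of sys.sysprocesses. It is only an illustration of what the query selects, not a replacement for running it. The function name, the own_spid parameter, and the sample data are hypothetical.

```python
def head_blockers(sessions, own_spid=None):
    """Return spids that block other sessions but are not blocked themselves.

    sessions: iterable of (spid, blocked) pairs, where blocked is the spid
    of the blocking session, or 0 if the session is not blocked.
    """
    sessions = list(sessions)
    # Every spid that some other session is waiting on.
    blocking = {blocked for _, blocked in sessions if blocked != 0}
    return sorted(
        spid
        for spid, blocked in sessions
        if spid > 50            # skip low-numbered (typically system) sessions
        and spid != own_spid    # mirrors "spid <> @@spid"
        and blocked == 0        # not waiting on anyone
        and spid in blocking    # but somebody is waiting on it
    )


if __name__ == "__main__":
    # 52 waits on 51, 53 waits on 52, 54 is idle: 51 is the head of the chain.
    rows = [(51, 0), (52, 51), (53, 52), (54, 0)]
    print(head_blockers(rows))  # prints [51]
```

Note that spid 52 is blocking 53 but is itself blocked, so it is not reported. The head of the chain is the one to investigate first.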
To determine whether you have a significant disk latency issue, use the following performance counters:
- Object: [Physical Disk] or [Logical Disk]
- Counter: [Avg. Disk Sec/Transfer]
- Instance: Ideally you collect this for individual disks however you may also use [_Total] to identify general issues. If [_Total] is high then further collections can be taken to isolate the specific disks affected.
- Collection Interval: Ideally you should collect at least every 1 minute. The collection should be run for a significant period of time to show it is an ongoing issue and not just a transient spike. 15 minutes is the minimum suggested interval.
- Issue Thresholds (seconds):
- < 0.020: Normal time and no I/O latency issues are apparent
- > 0.020 – 0.050: You may be somewhat concerned. Continue to collect and analyze data. Try to correlate application performance issues to these spikes.
- > 0.050 – 0.100: You are concerned and should escalate to SAN administrators with your data and analysis. Correlate spikes to application performance concerns.
- > 0.100: You are very concerned and should escalate to SAN administrators. Correlate spikes to application performance concerns.
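The thresholds above are easy to apply mechanically once you have samples of "Avg. Disk sec/Transfer" out of a counter log. Here is a minimal Python sketch that buckets each sample into one of the four bands. The band labels and function names are hypothetical, the sample values are made up, and treating a reading of exactly 0.020 as the second band is an assumption, since the list doesn't say which side that boundary falls on.

```python
from collections import Counter


def classify_latency(avg_sec_per_transfer):
    """Map an Avg. Disk sec/Transfer value (in seconds) to a concern band."""
    if avg_sec_per_transfer < 0.020:
        return "normal"
    if avg_sec_per_transfer <= 0.050:
        # Assumption: exactly 0.020 lands here, as the list leaves it open.
        return "somewhat concerned"
    if avg_sec_per_transfer <= 0.100:
        return "concerned"
    return "very concerned"


def summarize(samples):
    """Count how many samples fall into each band."""
    return Counter(classify_latency(s) for s in samples)


if __name__ == "__main__":
    # Made-up one-minute samples, in seconds.
    samples = [0.008, 0.012, 0.035, 0.120, 0.015, 0.060]
    for band, count in summarize(samples).items():
        print(f"{band}: {count}")
```

Counting samples per band, rather than averaging the whole log, keeps transient spikes visible instead of smoothing them away.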
If you want to understand more about Windows server fundamentals, take a look at the Microsoft Windows Server 2003 Performance Guide. It was published in 2005, but it is a valuable resource on PerfMon, Relog, Performance troubleshooting and performance monitoring. Most of the counters and tools are still valid for Windows 2008 and beyond.