After spending some time out in the field looking at customers’ TFS environments and, more recently, looking at some of Microsoft’s internal on-premises TFS deployments, I realised that some environments are better configured and maintained than others.
Some of the general concepts and the very TFS-specific configurations are talked about in Part 5 of my Professional Team Foundation Server 2012 book, but many of the basics were considered out of scope or assumed knowledge. Also, not everybody has read the book, even though it gets 5 stars and is considered “THE Reference for the TFS Administrator and expert!” on Amazon.
The purpose of this blog post is to give the Service Owners of TFS a check-list of things to hold different roles accountable for in the smooth operation of the server. It’s broken into 5 sections that roughly translate to the different roles in a typical enterprise IT department. In some cases, it might all be the one person. In other cases, it could be a virtual team of 50 spread all throughout the company and the globe.
- The initial setup and provisioning of the hardware, operating system and SQL platform
- Regular OS system administrator tasks
- Regular SQL DBA tasks
- TFS-specific configurations
- Regular TFS administrator tasks
The list is in roughly descending priority order, so even if you do the first item in each section, that’s better than not doing any of them. I’ll add as many reference links as I can, but if you need specific instructions for the steps, leave a comment and I’ll queue up a follow-up blog post.
- Apply all security updates that the MBSA tool identifies. ‘Critical’ security updates should be applied within 48 hours – there are no excuses for missing Critical security updates. They are very targeted fixes for very specific and real threats. The risk of not patching soon enough is often greater than the risk of introducing a regression.
- Be on the latest TFS release. (TFS 2012.4 RC4 at the time this post was written or TFS2013 RTM after November 13 2013. If you’re stuck on TFS2010, see here for the latest service packs and hotfixes.)
- Be on the latest edition of SQL that is supported by the TFS version. Check your SQL version here. (TFS 2010 = SQL2008R2SP3, TFS 2012.4 = SQL2012 SP1, TFS 2013 = SQL2012 SP1). Be on Enterprise edition for high-scale environments.
- Be on the latest OS release supported by the combination of SQL + TFS. Most likely Windows Server 2008 R2 SP1 or 2012.
- Be on the latest supported drivers for your hardware (NIC & SAN/HBA drivers especially).
Initial OS Configuration and Regular Management Tasks
- Collect a performance counter baseline for a representative period of time to identify any bottlenecks and serve as a useful diagnostics tool in the future. A collection over a 24 hour period on a weekday @ 1-5min intervals to a local file should be sufficient. Don’t know which counters to collect? Download the PAL tool and look at the “threshold files” for “System Overview” on all your servers, “SQL Server” on your data tier servers, and "IIS" and ".NET (ASP.NET)" for your application tier servers.
- Ensure antivirus exclusions are correct for TFS, SQL and SharePoint. (KB2636507)
- Ensure firewall rules are correct. I had an outage once where the network profile changed from ‘domain’ to ‘public’ due to a switch gateway change, and our firewall policy blocked SQL access for the ‘public’ profile which effectively took SQL offline for TFS.
- Ensure page file settings are configured for an appropriately sized disk and memory dump settings are configured for Complete memory dump (KB254649). If you get a bluescreen, having a dump greatly increases your chances of getting a root cause + fix. Test the settings using NotMyFault.exe (during a maintenance window, of course).
- Don’t run SQL or TFS as a local administrator.
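Once you have a baseline collected, a few lines of script can turn it into numbers you can compare against later. Here is a minimal Python sketch that summarises a counter log exported as CSV. It assumes the first column is the sample timestamp, every other column is one counter, and blank cells mean a missed sample; the function name `summarize_counter_log` is made up for this example, and a binary log would need converting to CSV first.

```python
import csv
import io
from statistics import mean


def summarize_counter_log(csv_text):
    """Return {counter_name: (min, mean, max)} for a counter log in CSV form.

    Assumption: column 0 is the timestamp and each remaining column is a counter.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    counters = header[1:]  # skip the timestamp column
    samples = {name: [] for name in counters}
    for row in reader:
        for name, raw in zip(counters, row[1:]):
            try:
                samples[name].append(float(raw))
            except ValueError:
                # Blank or non-numeric cell: no sample for this interval.
                continue
    # Counters with no usable samples are left out of the summary.
    return {
        name: (min(values), mean(values), max(values))
        for name, values in samples.items()
        if values
    }
```

Read the exported file into a string, pass it to the function, and keep the result alongside the raw log so that the next time someone says "the server feels slow" you have something to compare against.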
Initial SQL Configuration
- SQL Pre-Deployment Best Practices (SQLIO/IOmeter to benchmark storage performance)
- SQL recommended IO configuration. SQLCAT Storage Top 10 best practices
- Check disk partition alignments for a potential 30% IO performance improvement (especially if your disks were ever attached to a server running Windows Server 2003, and sometimes also if you used disks that were pre-partitioned by an OEM).
- Ensure that Instant File Initialization is enabled, if the performance vs. security trade-off is appropriate in your environment (the article has more details). This enables SQL to create data files without having to zero-out the contents, which makes it “instant”. This requires the service account that SQL runs as to have the ‘Perform Volume Maintenance Tasks’ (SE_MANAGE_VOLUME_NAME) permission.
- Separate LUNs for data/log/tempdb/system.
- Multiple data files for TempDB and TPC databases. (See here for guidance on the “right” number of files. If you have 8 or fewer cores, use #files = #cores. If you have more than 8 cores, use 8 files and, if you’re seeing in-memory contention, add 4 more files at a time.)
- Consider splitting tbl_Content out to a separate filegroup so that it can be managed differently
- Consider changing ‘max degree of parallelism’ (MAXDOP) to a value other than ‘0’ (a single command can peg all CPUs and starve other commands). The trade-off here is slower execution time vs. higher concurrency of multiple commands from multiple users.
- Consider these SQL startup traceflags. Remember, the answer to “should I do this on all my servers?” is not “yes”, the answer is “it depends on the situation”.
- T1211 (prevent table lock escalation) (KB934005 and here)
- T1118 (reduce tempdb contention, Paul says everyone should turn it on, there’s no downside.)
- T1222 (XML deadlock graph, you’re unlikely to get deadlocks because we find most of them while dogfooding, but this information is useful if you do hit them.)
- T1117 (equal file autogrowth for tempdb files).
- Configure daily SQL ErrorLog rollover and 30 day retention.
- Set an appropriate ‘max server memory’ value for SQL server. If it’s a server dedicated to SQL (assuming TFS, SSRS and SSAS are on different machines), then a loose formula you can use is to reserve: 1 GB of RAM for the OS, 1 GB for each 4 GB of RAM installed from 4–16 GB, and then 1 GB for every 8 GB RAM installed above 16 GB RAM. So, for a 32GB dedicated server, that’s 32-1-4-2=25GB. If you are running SSRS/SSAS/TFS on the same hardware, then you will need to reduce the amount further.
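The two sizing rules in the list above (the starting number of TempDB data files and the loose ‘max server memory’ formula) are easy to get wrong by hand, so here is a minimal Python sketch of both. The function names are made up for this example. The memory function reads “1 GB for each 4 GB from 4–16 GB” as 4 GB reserved on a server with 16 GB or more, which is the reading that reproduces the 32-1-4-2=25 example; treat the result as a starting point, not a definitive setting.

```python
def tempdb_data_file_count(cores):
    """Starting number of TempDB data files: one per core, capped at 8."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return min(cores, 8)


def max_server_memory_gb(installed_gb):
    """Suggested 'max server memory' in GB for a server dedicated to SQL.

    Reserves 1 GB for the OS, 1 GB per 4 GB of RAM up to 16 GB,
    and 1 GB per 8 GB of RAM above 16 GB.
    """
    if installed_gb < 1:
        raise ValueError("installed_gb must be at least 1")
    reserved = 1                                # operating system
    reserved += min(installed_gb, 16) // 4      # first 16 GB
    reserved += max(installed_gb - 16, 0) // 8  # everything above 16 GB
    return installed_gb - reserved


print(tempdb_data_file_count(16))  # 8
print(max_server_memory_gb(32))    # 25
```

If SSRS, SSAS or TFS share the hardware, subtract their expected memory footprint from the result as well.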
Regular SQL DBA Maintenance
(These are not TFS specific and apply to most SQL servers)
- Backup according to the supported backup procedure (marked transactions, transaction logs, SSRS encryption key and use SQL backup compression and WITH CHECKSUM). It’s important to ensure that transaction log backups run frequently – they allow you to do a point-in-time recovery. It also checkpoints and allows the transaction log file to be reused. If you don’t run transaction log backups (and you’re running in FULL recovery mode, which is the default), then your transaction logfiles will continue to grow. If you need to shrink them, follow the advice in this article.
- Run DBCC CHECKDB regularly to detect physical/logical corruption and have the best chance at repairing and then preventing it in the future. Ola Hallengren's SQL Server Integrity Check scripts are an effective way of doing this, if your organisation doesn't have an established process already. Even though the solution is free, if you use it, send Ola an email to say that you appreciate his work. The solution can also be used for backups and index maintenance for non-TFS databases. (TFS rebuilds its own indexes when needed, and it requires marked transactions as per the supported backup procedure.)
- Ensure PAGE_VERIFY=CHECKSUM is enabled to prevent corruption. If it’s not, you have to rebuild indexes after enabling it to get the checksums set.
- Manage data/log file freespace and growth.
- Monitor for TempDB freespace (<75% available).
- Monitor for long-running transactions (>60 minutes, excluding index rebuilds, backup jobs).
- Monitor table sizes & row counts (there’s a script on my blog here, search the page for sp_spaceused).
- Monitor SQL ERRORLOG for errors and warnings.
TFS Configuration Optimizations
- At least two application tiers in a load balanced configuration. That gives you redundancy, increased capacity for requests/sec, and two job agents for running background jobs. Ensure that your load balancer configuration has a TCP Idle Timeout of 60 minutes, or that all your clients are running a recent version. See here for more details.
- Ensure that SQL Page Compression is enabled for up to a 3X storage reduction on tables other than tbl_Content (if running on SQL Enterprise or Datacenter edition). To enable it, do the opposite of KB2712111.
- Ensure that table partitioning is enabled for version control (if a large number of workspaces and running SQL Enterprise). Not recommended unless you have >1B rows in tbl_LocalVersion. Contact Customer Support for the script, since it’s an undocumented feature for only the very largest TFS instances (i.e. DevDiv).
- Check that SOAP gzip compression is enabled (should’ve been done by TFS 2010 SP1 install. I have seen up to an 80% reduction in traffic across the wire and vastly improved user experience response times for work item operations).
- Disable / monitor the IIS Log files so they don’t fill the drive: %windir%\system32\inetsrv\appcmd set config -section:system.webServer/httpLogging /dontLog:"True" /commit:apphost
- Change the TFS App Pool Idle Timeouts from 20 minutes to 0 (no idle timeout), and disable scheduled recycling so that you don’t have an app-pool recycle during business hours.
- Implement a TFS Proxy Server and make sure people use it (especially build servers). Even if no users are remote, it reduces the requests/sec load on the ATs. Configure it as the default proxy for your AD site using: tf proxy /add
- Enable work item tracking metadata filtering if appropriate.
- Enable SMTP settings and validate that they work. The most common issue here is that a SMTP server won’t relay for the service account that TFS is running as.
- Set TFS’s NotificationJobLogLevel = 2, so that you get the full errors for any event notification jobs that fail.
- Consider moving the application tier file cache to a separate physical and/or logical drive. See here for how to set a different dataDirectory, but don’t touch any of the other settings. The reason you want it on its own drive is 1) to separate the I/O load and 2) if you ever have to restore the database to an earlier point in time, you have to clear the cache so that you don’t end up sending the wrong content to users. If you make it a separate drive, you can just do a quick-format, which takes seconds. Otherwise you have to delete all the folders/files individually, which takes much longer.
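If you would rather keep the IIS logs than disable them, the alternative is to prune them on a schedule. Here is a minimal Python sketch of that approach, under a few assumptions: the log files end in `.log`, age is judged by last-modified time, and the function name `prune_old_logs` and the 30 day retention in the usage note are made up for this example.

```python
import time
from pathlib import Path


def prune_old_logs(log_root, max_age_days, now=None):
    """Delete *.log files under log_root older than max_age_days.

    Returns the list of deleted paths so the caller can record what was removed.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds in a day
    deleted = []
    # Materialise the listing first so files are not deleted mid-iteration.
    for path in list(Path(log_root).rglob("*.log")):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            deleted.append(path)
    return deleted
```

Point it at your IIS log directory (commonly `%SystemDrive%\inetpub\logs\LogFiles`, but check your site configuration) with something like `prune_old_logs(log_dir, 30)`, and run it as a scheduled task under an account that has delete rights on that folder.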
Regular TFS Administrator Maintenance
- Periodically run the Team Foundation Server Best Practices Analyzer (BPA) tool that is included with the Team Foundation Server Power Tools. It gets continuously updated with rules to detect common configuration problems and issues that lead to TFS support calls.
- Periodically review the activity log and job monitoring sections of the TFS “Operations Interface” at http://yourserver:8080/tfs/_oi/
- Check for heavy users using Execution Time reports from the Performance report pack and tbl_Command in the TPC databases.
- Check build retention policies to ensure stale build logs and results and drops are being cleaned up.
- Clean-up tbl_Content by running the Test Attachment Cleaner tool. (Terje has a great article on how to do this)
- Clean-up unused workspaces and shelvesets. The Workspace and Shelveset sidekicks from the Team Foundation Sidekicks are great for this. Remember, it’s "tf workspace /delete", not "tf workspaces /remove".
- Clean-up unused work item tracking fields (witadmin listfields /unused).
- Check Cube and Warehouse health using Admin report pack.
- Check work item tracking metadata size, and clean up constants / global list sizes (can’t do this without a script in 2010, automatic cleanup in 2012.2). Look at the file/folder sizes in %localappdata%\Microsoft\Team Foundation\4.0\Cache. The files are named things like ‘ruleconstants1.curcache’, and more files mean larger metadata. There have been a lot of improvements in TFS2012 + TFS2013 around controlling the size of this metadata, but it can still become unwieldy and need manual intervention. See this MSDN article for more background on the structure.
- Evaluate work item tracking fields that are set to reportingtype=’dimension’. Do they really need to be in the cube? If not, set them to ‘detail’ and query them using the Relational Warehouse (Tfs_Warehouse).
- Evaluate if you have custom work item tracking fields that are used in many work item queries and would benefit from being indexed. (witadmin indexfield /index:on).
- Check tbl_EventSubscriptions for invalid email and SOAP subscriptions. Use TFS 2012 web access as an admin to view ‘All Alerts’ and delete them. (http://yourserver:8080/tfs/YourCollection/YourProject/admin/_alerts)
René's blog post Top 10 of things every TFS Administrator should do also covers some other things.
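For the metadata size check in the list above, eyeballing folder properties in Explorer gets tedious. Here is a minimal Python sketch that totals the bytes under each subfolder of a cache directory. The function name `folder_sizes` is made up for this example, and it assumes that the interesting breakdown is one total per direct subfolder; the actual cache layout on your machine may differ.

```python
from pathlib import Path


def folder_sizes(cache_root):
    """Return {subfolder_name: total_bytes} for each direct child folder of cache_root."""
    sizes = {}
    for child in Path(cache_root).iterdir():
        if child.is_dir():
            # Sum every file at any depth below this subfolder.
            sizes[child.name] = sum(
                f.stat().st_size for f in child.rglob("*") if f.is_file()
            )
    return sizes
```

On a client machine you could pass it the expanded form of %localappdata%\Microsoft\Team Foundation\4.0\Cache (for example via `os.path.expandvars`) and track the numbers over time to see whether the metadata is growing.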
Regular TFS Build Administrator Maintenance
This is a community contribution from Jesse on regular maintenance around Build Agents, Symbols and Drop shares:
- Monitor disk space usage on the build agents
- Monitor queue time for the builds, spin up additional agents if available and needed
- Clean up the \Builds folder on build agents to remove old workspaces
- Backup the Symbols share regularly
- Backup the Builds Drop folder regularly
- Exclude \Builds, \Symbols, \Drop, Team Explorer Cache from Anti-virus real time scanning
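Disk space monitoring on build agents does not need a full monitoring suite to get started. Here is a minimal Python sketch that reports whether a drive has dropped below a free-space threshold. The function names and the 15% default threshold are assumptions for this example, so pick a value that matches how quickly your builds fill the disk.

```python
import shutil


def free_fraction(path):
    """Fraction of the volume containing path that is currently free (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def is_low_on_space(path, min_free_fraction=0.15):
    """True when the volume containing path has less free space than the threshold."""
    return free_fraction(path) < min_free_fraction
```

Run it from a scheduled task against the drive that holds the \Builds folder and send an alert (or kick off a workspace clean-up) when it returns True.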
Another community contribution from Jesse – this is a set of things to check for when a user rolls-off a project or otherwise stops using the server:
- Check for locked or checked out files
- Check for queued builds
- Check for remaining workspaces
- Check for work items assigned to this account
- Check for Builds, Source control items that are exclusively owned by the user
- Back up their personal work item queries by exporting them all to WIQL
The ALM Rangers are a group of individuals from the TFS Product Group, members of Microsoft Services, Microsoft Most Valuable Professionals (MVPs) and technical specialists from technology communities around the globe, giving you a real-world view from the field, where the technology has been tested and used. If you haven’t seen some of the resources that they produce and maintain, I highly recommend that you check them out:
- ALM Rangers Solutions Catalog
- Team Foundation Server Planning Guide – Includes a capacity planning worksheet, hardware recommendations and quick reference guides.
- Team Foundation Server Upgrade Guide – Includes workflows on how to get from your current OS + TFS + SQL version to the latest with the approaches and considerations of each.
- ALM Assessment Guidance – This is general guidance around ALM practices, rather than TFS administration.
- Team Foundation Build Customization Guide
- Team Foundation Server Branching and Merging Guide
- Enable client-side tracing and Enable client-side performance view – Useful for diagnosing performance issues
Hopefully this blog post has been an effective use of my limited keystrokes and together we can improve the predictability, reliability and availability of Team Foundation Server in your organisation.
[October 9 2013]: Added notes on local admin, SQL Instant File Initialization, max server memory, transaction log shrinking, SMTP settings, cache directory settings, build administrator tasks and exit procedures.
[October 19 2013]: Added link to Ola's solution for integrity checks and database backups.
[November 1 2013]: Added link to René's blog post on Top 10 TFS administrator tasks
[November 16 2013]: Added reference to IIS & ASP.NET threshold files for PAL. Thanks Chetan.