Ever since we started dogfooding TFS internally, I’ve been working on a tool I call TFSServerManager. It’s a bit of a misnomer because it’s not really a management tool as much as it is an operations tool. In the very early days it started as a tool to help me produce the dogfood statistics you see on my blog every month – I’ve been sending them around internally in email for a couple of years now.
Over time I added more and more capabilities to it (some of which are admittedly frivolous) but it has served as an outlet for me to do a little programming here and there. My day job doesn’t really allow me to do feature development any more. Anyway, I thought I’d post a little bit about it here and see what people think. Occasionally I get requests in email or other forms for a tool to produce dogfood stats so I think people might be interested in it. Based on the response, I’ll decide how hard I work to get it pushed out as a PowerTool quickly.
The tool actually comes in 2 parts – a client tool called TFSServerManager and a server tool called TFSMonitor that pings the servers every 5 minutes to make sure they are still alive and functioning and records a variety of interesting data.
What follows is a bunch of screenshots and a little commentary on what they mean. The tool has way too many functions to show them all here but I’ll highlight some of the cool stuff.
This is the initial screen. I’ve added a few of each different type of server (TFS, Proxy and Monitor).
You add servers to this list with the “Add Server…” File menu option. When you double click one of the Team Foundation Servers, you get a window with many different tabs covering a variety of server aspects. The first, shown here, is the “Requests” tab. It shows the Version Control requests that the server is currently processing. I really want it to be all requests but only the version control subsystem on the server supports this capability – and they tend to be the most expensive operations. In future versions we’ll have all of the server components supporting this.
It’s pretty normal for a team our size to have half a dozen or so requests on the dogfood server at a time. When we just had a few hundred people using it, it was hard to catch more than 1 or 2 people at a time and most of the time there was no one.
At the bottom right hand corner you’ll notice a “Completed Requests…” button. As opposed to the currently executing requests above, this allows you to see all requests that have completed processing (assuming you have enabled the activity logging in the web.config on your server). Fortunately, this actually is all requests. All server components support activity logging, unlike “current requests” which only version control supports. Here’s what it looks like when you press Completed Requests:
The top part allows you to filter the requests (we get like 30 million per week – that’s over 100 requests per second, so filtering really helps). One cool thing is that by default, the activity log records the parameters to any request that fails or takes more than 30 seconds so if things are going badly, you can drill in to understand better.
After the Requests tab comes the Summary tab but it’s boring so I’m skipping it here. Then comes the Statistics tab. This is what I generate my dogfood reports from. It actually contains a lot more data than my dogfood reports do.
The table at the bottom supports pivoting and drill through. You can pivot by “command” as done here, or by user, requesting application, or client IP address. Very handy for finding out who/what is putting all of the load on your server and why. Here’s an example of a drill through into the checkin requests and a pivot by user to see who is doing all of the checkins.
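If it helps to see the idea in code, here is a rough sketch of what pivoting and drill through amount to. This is not the tool's actual code. The row layout, the field names (`command`, `user`, `application`, `ip`, `ms`) and the sample data are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical activity-log rows. The field names and values are
# assumptions for illustration, not the real activity log schema.
requests = [
    {"command": "Checkin", "user": "alice", "application": "devenv.exe", "ip": "10.0.0.1", "ms": 1200},
    {"command": "Checkin", "user": "bob", "application": "tf.exe", "ip": "10.0.0.2", "ms": 800},
    {"command": "Get", "user": "alice", "application": "devenv.exe", "ip": "10.0.0.1", "ms": 300},
    {"command": "Checkin", "user": "alice", "application": "devenv.exe", "ip": "10.0.0.1", "ms": 500},
    {"command": "QueryHistory", "user": "carol", "application": "tf.exe", "ip": "10.0.0.3", "ms": 150},
]


def pivot(rows, key):
    """Group rows by `key` and return {value: (request_count, total_ms)}."""
    totals = defaultdict(lambda: [0, 0])
    for row in rows:
        bucket = totals[row[key]]
        bucket[0] += 1          # number of requests in this group
        bucket[1] += row["ms"]  # total execution time in this group
    return {value: tuple(pair) for value, pair in totals.items()}


def drill_through(rows, key, value):
    """Return only the rows behind one cell of a pivot."""
    return [row for row in rows if row[key] == value]


# Pivot by command, then drill into Checkin and pivot again by user.
by_command = pivot(requests, "command")
checkins = drill_through(requests, "command", "Checkin")
checkins_by_user = pivot(checkins, "user")
```

The same `pivot` call works for any of the other columns (`"application"` or `"ip"` in this sketch), which is really all that switching the pivot means.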
Next is the performance tab. It’s mostly a clone of perfmon except that it runs off of data logged by the activity monitor and allows correlation to TFS events. For example, if you see a spike in server activity and want to know what was happening, you can right click the data point, and choose “Show Activity Summary…” to get a pivot like the above. From there, you can drill down all the way to the actual requests that were executing during that hour.
One really interesting thing you can see on this graph is right around Christmas, you see a substantial drop in the spikes of the perf counters. This is because we identified another performance bottleneck and applied a patch to the server on 12/22. The next day (and from there forward) we started seeing much lower average server utilization. I’m expecting to see another drop when we deploy the first Orcas build in about a month.
Next is the Health tab. It displays both the events from the application tier event log and data from the TFS Availability Monitor. The timeline at the top helps you visualize the timeline of events and outages. It supports drill through into both events and the availability monitor sample data. It has two tabs. The first includes the events as a list like this:
and the second includes a graph of server availability over time:
Now, pride mandates that I comment on this some now that you’ve seen it in its raw form. This graph shows some embarrassing dips in availability. Our target minimum is to stay above 98% with a goal of over 99%. You can see that we’ve had drops to 95% and 97% (on 7 day running average) in recent history. These were not actually application drops – they were infrastructure drops. We’ve had problems with some network infrastructure flakiness in the data center. In a way this availability data is a worst case analysis. When the monitor samples the server, it touches many different services. If any one of them returns an error, it records the server as “unavailable” for a 5 minute window. It’s not uncommon to have dropped packets cause availability drops that users of the systems don’t actually notice. We’ve talked about adding some simple retry logic to avoid these “false positives” but I feel a bit uncomfortable with the idea – a failure is a failure, no matter what the cause. We’ll see where we go with that.
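To make the worst case point concrete, here is a minimal sketch of how a number like this could be computed, assuming one sample every 5 minutes and a plain average over the last 7 days of samples. The function names and service names are hypothetical, and the real monitor may well calculate it differently.

```python
SAMPLES_PER_WEEK = 7 * 24 * 12  # one sample every 5 minutes = 2016 per week


def window_available(service_results):
    """A 5 minute window counts as available only if every service
    that was touched answered without an error (no retries)."""
    return all(service_results.values())


def running_availability(samples, window=SAMPLES_PER_WEEK):
    """Percentage of available samples among the most recent `window`
    samples. `samples` is a list of booleans, oldest first."""
    recent = samples[-window:]
    if not recent:
        return None  # nothing sampled yet
    return 100.0 * sum(recent) / len(recent)


# One failing service marks the whole window as unavailable.
# The service names here are made up for the example.
sample = window_available({"VersionControl": True, "WorkItemTracking": False})

# A week in which 100 of the 2016 windows were marked unavailable.
week = [True] * 1916 + [False] * 100
availability = running_availability(week)
```

With these assumptions, 100 bad windows in a week (a bit over 8 hours in total) is enough to pull the 7 day figure down to roughly 95%, even if each one was only a dropped packet on a single service.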
The last tab is the maintenance tab. It’s my least favorite of the set. I’m not even going to show it to you because I really want to redo it. The idea is that it is there to help with data maintenance: finding and cleaning up all of the workspaces, shelvesets, etc. that people aren’t using any more.
That’s it for the TFS Server part. The next interesting thing is the proxy. The proxy exposes a lot less info than the TFS server does, so what we can show is limited. In future versions we’ll beef this up some. For now, you can get a list of the servers that a proxy supports and the current statistics for each server:
Last but not least, you can double click on a monitor on the main screen and inspect it. The primary purpose of this is to enable you to administer what machines a monitor is monitoring:
That’s 60-70% of it and most of the interesting stuff. As I say, I’m slowly working towards getting this ready for external consumption but I figured I’d test the waters and see how valuable you think this would be. Let me know…