TFS 2010: Introducing the TFS Background Job Agent and Service

Historical Background

 

TFS 2008's Application Tier had two Windows services:

  • Visual Studio Team Foundation Server Task Scheduler (TFSServerScheduler)
  • Code Coverage Analysis Service (CoverAn)

 

In TFS 2008, the TFS web application (hosted by w3wp.exe) performed some periodic tasks independent of end user requests. One example is an hourly identity sync with AD. Some user requests caused work to be done asynchronously - checking in started versioned file deltification running on another thread. (As a side note, I find "TFS WebApp" ambiguous given that we also have the Proxy WebApp and the TSWA WebApp. If you can think of a better name let me know.)

 

In TFS 2010, we're dedicating the TFS WebApp soley to user requests and offloading those periodic tasks and asynchronous work to another process. Configuring the Application Tier installs the "Visual Studio Team Foundation Background Job Agent" (TFSJobAgent). The TFS Job Agent replaces 2008's TFSServerScheduler service. A plug-in to the job agent replaces 2008's CoverAn service. Identity sync and versioned file deltification is performed by other plug-ins.

 

We're even using the job agent for some new longer-running end user actions. Team Project Collection creation is performed by the job agent. If you use the client OM to create a collection, it contacts the WebApp to queue the creation and then polls for progress.

 

Overview

 

The TFS Job Agent executes jobs via plug-ins. Plug-in classes implement the Run method in ITeamFoundationJobExtension. Assemblies with one or more  job plug-ins are placed in %ProgramFiles%\Microsoft Team Foundation Server 2010\Application Tier\TFSJobAgent\plugins\.  In TFS 2010 the job agent uses reflection to load each DLL under this directory and find job extensions. In a later release we might move to a manifest of plug-in classes or assemblies. When a plug-in finishes execution it returns a result (Succeeded, Partially Succeeded, Failed, …) and a result message.

 

The job agent runs on the AT using the same service account as the WebApp. The job agent has direct access to the data tier(s), same as the WebApp. For that reason, most of the TFS job extensions use the server object models to access the data tier directly, as opposed to using the client object models and making web requests to the TFS webApp. By the way, if you're familiar with the TFS client OMs you'll find that TFS 2010's server OMs are similar. (Client: myTeamFoundationServer.GetService(VersionControlServer); Server: requestContext.GetService<TeamFoundationVersionControlService>())

 

Before a job can be run it has to be defined. A  JobDefinition has a JobId, a display name, and an ExtensionName. The ExtensionName is the fully-qualified plug-in class name. A JobDefinition optionally has XML JobData and a schedule. The JobId is a GUID. For many of our jobs the JobId is constant and well-known; we can queue Code Churn just by doing jobService.QueueJobNow(CodeChurnConstants.CodeChurnJobGuid).

 

Jobs run because of their schedule or because they are queued. Many TFS jobs like code churn are queued in response to a user action like checkin, but also have a backup schedule (every 15-60 minutes). The backup schedule is used to ensure code churn runs in the event where your checkin gets committed but the WebApp crashes prior to code churn getting queued.

 

We also have the concept of "delayed queuing". Yes, this is an oxymoron.  But when a JobDefinition has a schedule it generally indicates that the job runs periodically: at the top of the hour, daily at 3am UTC, etc.  By delay-queuing a job you're requesting that it run within the next N seconds. Also a job with an EnabledState of SchedulesDisabled will still run when it is delay-queued.

 

A given job has one of five states: In progress (Running, Paused, Stopping, …), queued, scheduled for future execution, host-blocked or not scheduled. A job is host-blocked because the host/team project collection it is associated with is offline or dormant. When that host becomes active that job will return to queued or scheduled depending on its QueueTime. When a job completes the result and result message are recorded and kept for 31 days (by default). This state and the job results are all maintained on the data tier.

 

Job Service API

 

The client OM is Microsoft.TeamFoundation.Framework.Client.ITeamFoundationJobService. The server OM is Microsoft.TeamFoundation.Framework.Server.TeamFoundationJobService. You'll find more convenience overloads in the client OM but you can accomplish the same things with either OM.

 

To get a list of existing job definitions, use QueryJobs. You can also use this to get a particular job definition if you already know the JobId.

 

To check out the results of prior job executions, use QueryJobHistory or QueryLatestJobHistory.

 

To queue a job, use QueueJobsNow or QueueDelayedJobs.

 

To delete, create or update job definitions, use DeleteJobs or UpdateJobs.

 

If you want to create a job definition and queue it for one execution in a single call, check out QueueOneTimeJob. The job definitions created by QueueOneTimeJob get deleted after 7 days by default. This is intended for when each time you execute a job plug-in you supply it with different JobData.

 

Powershell Client OM Example - Fix Email Notifications

 

Let's say you're not getting email notifications, even though you've set up project alerts through Team Explorer. Grant describes how to diagnose notification failures primarily using SQL. I'll show you how you can investigate without even getting access to the AT.

 

Let's first check out the last job result:

 

[Reflection.Assembly]::Load("Microsoft.TeamFoundation.Client, Version=10.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a");

 

# Modify the collection URL as necessary.

$tpc = new-object Microsoft.TeamFoundation.Client.TfsTeamProjectCollection "https://localhost:8080/tfs/DefaultCollection"

 

$jobService = $tpc.GetService([Microsoft.TeamFoundation.Framework.Client.ITeamFoundationJobService])

 

# Browse the API

$jobService | gm

 

# Look for the "Team Foundation Server Event Processing" job - you don't have the JobId handy.

$jobService.QueryJobs() | sort Name | select JobId, Name | ft -a

 

# Get the latest non-successful result.

# (Notifications get deleted after so many delivery attempts, so the job may occasionally succeed when it has no emails to send.)

$jobService.QueryJobHistory([Guid[]] @('a4804dcf-4bb6-4109-b61c-e59c2e8a9ff7')) | ?{ $_.Result -ne "Succeeded" } | select -last 1

 

You see this:

 

HistoryId          : 46

EndTime            : 2/11/2010 4:41:57 PM

ResultMessage      : There were errors or warnings during notification delivery.

 

                     0/1 emails delivered.

                     0/0 soap notifications delivered.

                     1 errors.

                     0 warnings.

 

                     Set /Service/Integration/Settings/NotificationJobLogLevel in the TF registry to 2 to see errors and warnings.

 

 

JobSource          : bc9cc30d-4a57-4ef8-ae37-7507c5efee57

JobId              : a4804dcf-4bb6-4109-b61c-e59c2e8a9ff7

QueueTime          : 2/11/2010 4:41:54 PM

ExecutionStartTime : 2/11/2010 4:41:55 PM

AgentId            : ba2e59f8-dae1-4f75-9416-1225ed06a69b

Result : Failed

 

Okay let's get more detailed errors by using the registry service to increase the log level:

 

$tfsRegService = $tpc.GetService([Microsoft.TeamFoundation.Framework.Client.ITeamFoundationRegistry])

 

# Browse the API

$tfsRegService | gm

 

$tfsRegService.SetValue("/Service/Integration/Settings/NotificationJobLogLevel", 2)

 

Do something like checking in to produce another notification and switch back to the job service:

 

# Force the job to run again immediately

# (Team Foundation Server Event Processing gets delay queued, so you'll have to wait up to 2 minutes otherwise.)

$jobService.QueueJobNow([Guid] 'a4804dcf-4bb6-4109-b61c-e59c2e8a9ff7', $false)

 

# Keep checking for a new job history till you see a new entry.

$jobService.QueryLatestJobHistory([Guid[]] @('a4804dcf-4bb6-4109-b61c-e59c2e8a9ff7'))

 

Now you see this:

 

HistoryId          : 52

EndTime            : 2/11/2010 5:04:35 PM

ResultMessage      : There were errors or warnings during notification delivery.

 

                     0/1 emails delivered.

                     0/0 soap notifications delivered.

                     2 errors.

                     0 warnings.

 

                     -------------------------------

                     TF271000: An error occurred while attempting to send an e-mail notification to the following address: chrisid@microsoft.com. Ver

                     ify that the e-mail notification settings are correct in the Team Foundation Administration Console.

 

                     Notification: CheckinEvent (DeliveryType: EmailHtml; Address: chris@contoso.com)

 

                     Exception: System.Net.Mail.SmtpException: Mailbox unavailable. The server response was: 5.7.1 Client does not have permissions to send as this sender

                        at System.Net.Mail.MailCommand.CheckResponse(SmtpStatusCode statusCode, String response)

                        at System.Net.Mail.MailCommand.Send(SmtpConnection conn, Byte[] command, String from)

                        at System.Net.Mail.SmtpTransport.SendMail(MailAddress sender, MailAddressCollection recipients, String deliveryNotify, SmtpFa

                     iledRecipientException& exception)

                        at System.Net.Mail.SmtpClient.Send(MailMessage message)

                        at Microsoft.TeamFoundation.JobService.Extensions.Core.NotificationJobExtension.SendEmail(TeamFoundationRequestContext reques

                     tContext, TeamFoundationNotification notification)

JobSource          : bc9cc30d-4a57-4ef8-ae37-7507c5efee57

JobId              : a4804dcf-4bb6-4109-b61c-e59c2e8a9ff7

QueueTime          : 2/11/2010 5:04:32 PM

ExecutionStartTime : 2/11/2010 5:04:33 PM

AgentId            : ba2e59f8-dae1-4f75-9416-1225ed06a69b

Result             : Failed

 

In my case I was using Network Service as the service account for the WebApp and job agent, which doesn't have permission to send emails through our internal SMTP server. I changed the service account and the from address in the TFS Administration Console, checked in, forced the job to run and received an email in my inbox.

 

Team Foundation Server Event Processing runs often and the result messages are rather large when there are failures at logging level 2. I'm going to return the logging level to the default until our next investigation:

 

$tfsRegService.DeleteEntries("/Service/Integration/Settings/NotificationJobLogLevel")

 

 

C# Server OM Example - Run cleanup Jobs at a different time of the day

 

TFS has several cleanup jobs that run daily at 3 a.m. UTC. Maybe you're team is active at 3 a.m. UTC, you've noticed an impact from the cleanup jobs and want to run the jobs at 3 p.m. UTC instead.

 

Warning #1: Try this at your own risk, or better yet just consider it an example. We have not moved our jobs on our production machines and confirmed there are no ramifications.

 

Warning #2: There isn't a good way to programmatically identify cleanup jobs. This code is conservative and doesn't touch several daily jobs that could be considered cleanup jobs.

 

            try

            {

                using (TeamFoundationRequestContext requestContext = serviceHost.CreateSystemContext())

                {

                    TeamFoundationJobService jobService = requestContext.GetService<TeamFoundationJobService>();

 

                    IEnumerable<TeamFoundationJobDefinition> allJobs = jobService.QueryJobDefinitions(requestContext, null);

 

                    List<TeamFoundationJobDefinition> modifiedJobs = new List<TeamFoundationJobDefinition>();

                    foreach (TeamFoundationJobDefinition job in allJobs)

                    {

                        // There is no good programmatic way to identify a cleanup job.

                        // This code is conservative.

                        if (Regex.IsMatch(job.Name, @"\bcleanup\b", RegexOptions.IgnoreCase))

                        {

                            Boolean jobModified = false;

                            foreach (TeamFoundationJobSchedule schedule in job.Schedule)

                            {

                                if ((schedule.Interval % DAILY) == 0

                                    && schedule.ScheduledTime.Hour == 3) // 3 a.m. UTC.

                                {

                                    Console.WriteLine("Changing the schedule for {0}.", job.Name);

 

                                    schedule.ScheduledTime = schedule.ScheduledTime.AddHours(12); // 3 p.m. UTC instead.

 

                                    if (!jobModified)

                                    {

                                        jobModified = true;

                                        modifiedJobs.Add(job);

                                    }

                                }

                            }

                        }

                    }

 

                    // We could just supply allJobs instead of modifiedJobs, but this avoids the race condition where

                    // someone changes a job definition and we revert their changes by supplying the original job definition.

                    Console.WriteLine("Committing changes to {0} jobs.", modifiedJobs.Count);

                    jobService.UpdateJobDefinitions(requestContext, null, modifiedJobs);

                }

            }

            catch (HostShutdownException)

            {

                Console.WriteLine("Host is offline. Skipping...");

            }

 

See attached file for full example.

 

More Details

 

Unless you disable it, the job agent runs on every AT. If you have an NLB configuration you'll have job agents competing for queued jobs. By default the job agent will run 3 jobs per processor concurrently. So on a quad-core machine a job agent could be running up to 12 jobs at once.

 

If you queue a job that wasn't already queued or in-progress, the job agent will probably take about three seconds to start running it if the job queue isn't backed up. You shouldn't take a dependency on this, but its probably what you'll observe for TFS 2010. Queuing a job N times in quick succession doesn't mean that job will run N times. Queuing a job that is already queued is generally a no-op. Assuming a job takes long enough to run, quickly queuing a job 100 times will likely result in the job running once or twice. Your first call to QueueJobNow will queue the job. Maybe some of your following calls will be no-ops. Then the job agent will acquire the job and start running it. The next call to QueueJobNow will mark the in-progress job for re-queuing.  Calls after that while the job is in-progress are again no-ops.

 

We added QueueDelayedJob to batch work and incur the overhead of acquiring, running and releasing jobs less. When the Team Foundation Server Event Processing job runs, it can send notification emails about 20 checkins and 40 work item changes and 1 build instead of a single checkin. Calling QueueDelayedJob will never delay job execution. It only makes a job run sooner or is a no-op. QueueJobsNow and QueueDelayedJobs are currently implemented the same way. They're two different methods just to prevent people from supplying a delay and high-priority. Supplying both is somewhat confusing, and I don't think anyone needs to say "This job will be urgent in 5 minutes."

 

Multiple JobDefinitions can reference the same plug-in. In this case the XML JobData will likely be used to supply different arguments to the plug-in. It is not intended that a JobDefinition's JobData will be updated frequently. Say you do this:

 

jobService.QueueJobsNow(fooJob);

fooJob.Data = newJobData;

Thread.Sleep(60 * 1000);

jobService.QueueJobsNow(fooJob);

 

Will there be two separate executions of the foo job? Probably, but not if the job agent isn't running or the queue is backed up.  In that case the second call to QueueJobsNow is a no-op. If there is only one execution of the job, it will definitely use the new job data. But if there are two executions, the first execution might use the old job data or the new job data. It depends on how quickly the job agent acquires the job and reads the job definition.

 

ChangeTimeOfCleanupJobs.cs