TFS build fails with the Team Foundation Background Job Agent service stopped error as soon as it reaches Revert Snapshot in the Lab template.

Recently, we came across a very interesting customer scenario in which the build failed as soon as it tried to revert a snapshot. The customer had the following setup:

TFS 2012/2013
SCVMM 2012

TFS machines were physically placed in the Americas.
The build agent, controller, SCVMM host and VMs were in Asia.

Whenever the build reached the Revert Snapshot step, it failed with the error stack below, even though the snapshot revert kicked off fine in SCVMM.

Exception Message: Team Foundation Server stopped the operation on this environment because the environment was unresponsive. The Visual Studio Team Foundation Background Job Agent service might have unexpectedly stopped. Contact your system administrator to restart the service. (type LabDeploymentProcessException)

Exception Stack Trace:

Server stack trace:

at Microsoft.TeamFoundation.Lab.Workflow.Activities.BaseLabOperationAsyncState.WaitForActiveLabOperationToComplete()

at Microsoft.TeamFoundation.Lab.Workflow.Activities.RestoreLabEnvironment.RunCommand(AsyncState state)

at System.Runtime.Remoting.Messaging.StackBuilderSink._PrivateProcessMessage(IntPtr md, Object[] args, Object server, Object[]& outArgs)

at System.Runtime.Remoting.Messaging.StackBuilderSink.AsyncProcessMessage(IMessage msg, IMessageSink replySink)

Exception rethrown at [0]:

at System.Runtime.Remoting.Proxies.RealProxy.EndInvokeHelper(Message reqMsg, Boolean bProxyCase)

at System.Runtime.Remoting.Proxies.RemotingProxy.Invoke(Object NotUsed, MessageData& msgData)

at System.Func`2.EndInvoke(IAsyncResult result)

at Microsoft.TeamFoundation.Lab.Workflow.Activities.RestoreLabEnvironment.EndExecute(AsyncCodeActivityContext context, IAsyncResult result)

at System.Activities.AsyncCodeActivity.CompleteAsyncCodeActivityData.CompleteAsyncCodeActivityWorkItem.Execute(ActivityExecutor executor, BookmarkManager bookmarkManager)

We couldn’t find any additional errors in the build logs, the Event Viewer, or even the TFS Job Agent trace.
On further investigation, we found that the clocks on the machines in Asia were off by 10-15 minutes.
This mismatch in local machine time fed wrong information into the timeout check and made the build fail.

Cause:

The build controller throws the timeout error only when DateTime.Now.Subtract(restore job heartbeat time in local time zone) > 10 minutes.

So, by the time it receives the response from SCVMM, the controller thinks the operation has already timed out!
The code logic handles the time zone conversions correctly, but the wrong time set on the machines caused this failure.
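To make the effect concrete, here is a minimal Python sketch of a heartbeat check like the one described above. It is not the TFS code: the function name, the sample timestamps and the 12-minute skew are assumptions for illustration, and it assumes the heartbeat is stamped by a machine with a correct clock while the controller clock runs fast. Only the 10-minute threshold comes from the condition above.

```python
from datetime import datetime, timedelta, timezone

# Threshold taken from the condition above: the operation is treated as
# timed out once the last heartbeat looks more than 10 minutes old.
HEARTBEAT_TIMEOUT = timedelta(minutes=10)


def is_timed_out(heartbeat, controller_now):
    """Return True if the heartbeat appears older than the timeout.

    Both arguments are timezone-aware datetimes, so a difference in
    time zones alone never changes the result.
    """
    return controller_now - heartbeat > HEARTBEAT_TIMEOUT


# Hypothetical scenario: the heartbeat is stamped at 12:00:00 UTC and the
# controller reads it 5 seconds later.
heartbeat = datetime(2014, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
true_now = datetime(2014, 1, 1, 12, 0, 5, tzinfo=timezone.utc)

# Controller clock is correct: the heartbeat is 5 seconds old.
in_sync = is_timed_out(heartbeat, true_now)

# Controller clock runs 12 minutes fast (assumed direction of the skew):
# the same heartbeat now looks more than 12 minutes old.
skewed = is_timed_out(heartbeat, true_now + timedelta(minutes=12))

# Same correct instant expressed in a UTC+8 zone: time zones are not the problem.
asia = timezone(timedelta(hours=8))
other_zone = is_timed_out(heartbeat, true_now.astimezone(asia))
```

With a correct clock the check passes whatever time zone the controller is in. A clock that is fast by more than the threshold makes a fresh heartbeat look stale, which matches the immediate failure we saw.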

So, make sure the clocks on all the machines are synced with the time server, and you should be good!
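If you want a quick way to see how far a machine has drifted, the sketch below estimates the local clock offset with a single SNTP query. It is a rough approximation and not a replacement for the Windows Time service. The function names and the server you pass in are assumptions, and it needs outbound UDP port 123.

```python
import socket
import struct
import time

# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_EPOCH_OFFSET = 2208988800


def parse_transmit_time(packet):
    """Extract the server transmit timestamp (Unix seconds) from an SNTP reply."""
    if len(packet) < 48:
        raise ValueError("SNTP reply must be at least 48 bytes")
    # Bytes 40-47 hold the transmit timestamp: 32-bit seconds, 32-bit fraction.
    seconds, fraction = struct.unpack("!II", packet[40:48])
    return seconds - NTP_EPOCH_OFFSET + fraction / 2**32


def clock_offset_seconds(server, timeout=5.0):
    """Approximate offset of the local clock (server time minus local time)."""
    request = b"\x1b" + 47 * b"\0"  # LI=0, version 3, mode 3 (client)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sent_at = time.time()
        sock.sendto(request, (server, 123))
        reply, _ = sock.recvfrom(512)
        received_at = time.time()
    # Compare the server time against the midpoint of the round trip.
    return parse_transmit_time(reply) - (sent_at + received_at) / 2
```

Calling clock_offset_seconds with the name of your time server on the build controller returns the estimated offset in seconds. An absolute value anywhere near 600 means the 10-minute check above is at risk.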

Content created by: Manigandan Balachandran
Content reviewed by: Deepak Mittal.