In the first part of this blog, I described our e2e experience for setting up our test automation using RM. This is the second (and final) installment of that blog, in which I will describe some of the design choices / challenges we faced while setting up the system, and how we addressed them.
Single agent pool vs multiple agent pools
Problem Statement: How do we route a Release Definition (RD) to the correct agent machine i.e. the machine which has the pre-requisites for that RD installed on it?
Solution: When we started authoring our RDs, it became clear to us that we needed to route each RD to the correct agent because the pre-requisites for each test suite were different. Our initial approach was to create a different agent pool per RD, but this was causing a management headache with a blow up in the number of agent pools that we had to manage. Finally, based on guidance from Chris Patterson (from the Build team), we settled on the design of having a single agent pool called RMAgentPool where each agent was differentiated from others by having a user capability. Each Release Definition (RD) and Build Definition (BD) now routes its run to the correct agent using the RmCdpCapability on the agent. E.g. the machine which has been prepared for RM.CDP.TfsOnPrem has the capability RmCdpCapability=TfsOnPrem:
And the RM.CDP.TfsOnPrem RD routes its runs to this agent using the demand RmCdpCapability=TfsOnPrem:
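The capability/demand routing described above can be modeled in a few lines. This is only a simplified sketch of the idea, not the agent's actual matching code: the agent names and the second capability value are made up for illustration, and every demand is treated as an exact Name=Value match.

```python
def parse_pairs(pairs):
    """Turn ["Name=Value", ...] into a {name: value} dict."""
    result = {}
    for pair in pairs:
        name, _, value = pair.partition("=")
        result[name.strip()] = value.strip()
    return result


def satisfies(demands, capabilities):
    """True when every demanded name=value pair is present on the agent."""
    return all(capabilities.get(name) == value for name, value in demands.items())


def eligible_agents(demands, pool):
    """Return the names of the agents in the pool that meet all demands."""
    wanted = parse_pairs(demands)
    return [name for name, caps in pool.items()
            if satisfies(wanted, parse_pairs(caps))]


# Hypothetical contents of the single pool. Agent names are invented, and
# the DevFabricUpgrade capability value is an assumption for this sketch.
rm_agent_pool = {
    "agent-01": ["RmCdpCapability=TfsOnPrem"],
    "agent-02": ["RmCdpCapability=DevFabricUpgrade"],
}

# The RM.CDP.TfsOnPrem RD demands RmCdpCapability=TfsOnPrem.
print(eligible_agents(["RmCdpCapability=TfsOnPrem"], rm_agent_pool))
```

With one pool and one user capability per machine, adding a new test suite means tagging a machine rather than creating and managing another pool.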
Problem statement: How do we debug intermittent failures when the log files provide insufficient data? By the time the dev gets to the issue, the next test run has wiped out the “bad state” on the test machine.
Solution: When a test suite shows “flaky” behavior, we enable the “Pause Agent on test failure” task in the RD:
This task is configured to pause the release when the flaky task fails. In that case, the dev can remote-desktop into the agent machine, and the state of the machine will be exactly what it was when the task failed. Note that this is possible only if the timeout for the environment is set to 0 i.e. no timeout.
The Pause task takes the following arguments (highlighted in the image above):
-AlternateCredentialsUserName $(release.alternateusername) -AlternateCredentialsPassword $(release.alternatepassword) -ReleaseId $(Release.ReleaseId) -TaskName $(TaskToDebugName) -NumberOfSeconds $(TaskToDebugTimeInSeconds)
The key here is $(TaskToDebugName): The value of this is set to “Run RMCDPOnPrem tests” for RM.CDP.TfsOnPrem i.e. the name of the task which is failing intermittently. All the RM.CDP.* RDs have this task with the same arguments, but the value of the $(TaskToDebugName) in each RD corresponds to the task that we want to debug for flakiness.
You can find the source code for this task here. (You can replace “YourAccount” with your account name and start using it.)
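For readers who just want the gist of the pause logic, here is a rough sketch. It is not the source of the task linked above: the function name, the status strings, and the way the status is looked up are all assumptions. In the real task the status of $(TaskToDebugName) would presumably be fetched from the release identified by -ReleaseId; here a plain callable stands in for that lookup.

```python
import time


def pause_if_task_failed(get_task_status, task_name, seconds, sleep=time.sleep):
    """Hold the agent for `seconds` if the named task failed.

    get_task_status is a hypothetical callable that returns a status
    string for a task name. Returns True if the agent was paused.
    """
    if get_task_status(task_name) != "failed":
        return False
    # Sleeping keeps the release (and the machine state) in place so that
    # a dev can remote-desktop in and investigate.
    sleep(seconds)
    return True


# Stand-in for the status lookup; the task name is the one from the post.
statuses = {"Run RMCDPOnPrem tests": "failed"}
paused = pause_if_task_failed(statuses.get, "Run RMCDPOnPrem tests", 0.01)
print(paused)
```

Passing the task name in as a parameter mirrors the $(TaskToDebugName) variable: the same step can sit in every RD and only the variable value changes.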
How do we run the tests in admin mode on the agent machine?
Problem statement: Even if the agent is run with admin credentials, it runs the task in non-admin mode. This makes it impossible to run a task that needs admin mode, e.g. installing a service on the agent box.
Solution: We solved this problem by using the “Powershell on Target Machines” task, and remoting back to the current agent machine as admin. For example, we install the TFS service in RM.CDP.RmEqTfs using this technique:
Tests that can’t be re-authored to use the VsTest task
Problem statement: Some tests were authored quite some time back (let's call them “legacy tests”), and they are not VsTest task compliant. How do we get all the reporting benefits of a VsTest task-compliant test in such scenarios?
Solution: We solved this problem by:
(1) Running the legacy tests through a batch script or powershell task – whatever was convenient – and noting the location of the output log file.
(2) Adding a VsTest task that parses the output log file from the above step to determine whether the legacy tests passed or failed.
Our RM.CDP.DevFabricUpgrade RD is one such RD: The main test is run through a powershell script RunAOConfigTest.ps1.
The next task is a VsTest task that parses the log file and checks if there is an error or not.
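The log-parsing step can be sketched roughly as follows. This is an illustration of the technique, not our actual test code: the error markers and the sample log text are assumptions, and the real log format of RunAOConfigTest.ps1 is not shown here.

```python
import re

# Assumed markers for a failure line; adjust to the real log format.
ERROR_PATTERN = re.compile(r"ERROR|FAILED|Exception")


def find_errors(log_text, context=1):
    """Return each matching line with `context` lines before and after it."""
    lines = log_text.splitlines()
    findings = []
    for i, line in enumerate(lines):
        if ERROR_PATTERN.search(line):
            start = max(0, i - context)
            findings.append("\n".join(lines[start:i + context + 1]))
    return findings


def check_log(log_text):
    """Fail with the real error text so that the test report shows it."""
    findings = find_errors(log_text)
    if findings:
        raise AssertionError("Legacy test failed:\n" + "\n---\n".join(findings))


# Made-up log content, for illustration only.
sample_log = "step 1 ok\nERROR: service did not start\nstep 3 skipped"
try:
    check_log(sample_log)
except AssertionError as failure:
    print(failure)
```

Putting the matched lines into the failure message is what saves devs from opening the full log just to learn what went wrong.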
Side note: Till pretty recently, the VsTest-compliant test code for RM.CDP.DevFabricUpgrade used to just output the text “Expected True, but found False” in the log file, which made it a pain for devs to debug. Devs then had to open the log file output of RunAOConfigTest.ps1 (which itself took quite some time since it was pretty large), and had to grovel through it to find out what the real error was e.g. the below screenshot:
We recently revved this code to also output the real reason for the error by parsing the RunAOConfigTest.ps1 output log file more thoroughly. The test now outputs more detail in case of failure:
How do we eliminate bottlenecks in the test runs?
Problem Statement: Some RDs take very long to run, and the queue for such RDs builds up quite a bit. This increases the turnaround time (for getting all the test results of a build) significantly. How can we do better?
Solution: Since most of our tests are designed to deploy the service and run the tests on the agent machine itself, we can just throw more hardware (with the correct RmCdpCapability) at these “bottleneck” RDs.
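A back-of-the-envelope way to see the effect of adding agents: if the queued releases of one RD are spread evenly over identical agents, the time to drain the queue shrinks roughly in proportion to the number of agents. The numbers below are hypothetical, and the model assumes every release of that RD takes the same time.

```python
import math


def drain_time_minutes(queued_releases, minutes_per_release, agents):
    """Estimate the time to work through a queue with identical agents."""
    if agents < 1:
        raise ValueError("need at least one agent")
    # Each agent handles its share of the queue one release at a time.
    return math.ceil(queued_releases / agents) * minutes_per_release


# Hypothetical bottleneck RD: 6 queued releases at 45 minutes each.
print(drain_time_minutes(6, 45, agents=1))  # 270
print(drain_time_minutes(6, 45, agents=3))  # 90
```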
How can we run UI tests?
Problem Statement: How can we run UI tests on the agent machine?
Solution: The quick and dirty way to do this is to run the agent in interactive mode (as opposed to in service mode). (Reason: When the agent runs in service mode, it is unable to interact with the desktop.) However, the downside of running in interactive mode is that the interactive agent doesn’t support auto-upgrade functionality. So whenever the agent is upgraded, you will need to log in to the agent machine and restart the agent.
A more hands-free approach is to run the agent in service mode, and then use the “Visual Studio Test Agent Deployment” and “Visual Studio Test using Test Agent” tasks respectively. We use a technique similar to the “Powershell on Target Machines” technique described above where we remote back to the current agent machine as admin. These tasks run the Test Service on the agent, and this service is configured to interact with the desktop.
To use this, we first need to create a machine group with the appropriate machines in the Test hub:
We then need to use this machine group in the 2 VS Test Agent tasks mentioned above. The VsTest Agent Deployment task must have “Interactive Process” selected:
And finally, the “Vs Test using Test Agent” task uses the same machine group to run the tests. The test filters used are the same as those used with the standard VsTest task:
(This experience will become more in-line in the near future, where we won’t necessarily need to create Machine Groups for the VS Test Agent tasks, and we can just specify $(Agent.MachineName) inline as the machine name.)
What are the variables available at runtime?
Problem Statement: What are all the variables available at runtime?
Solution: The variables available at runtime (and which you can use in your RD) can be found at the top of the logs:
Will RM be able to scale to suit our team’s needs?
Problem Statement: As we near the end of a sprint, the number of checkins in our team increases sharply, and there were concerns about RMO being able to handle that load. 15 checkins / day is common during release time, and this maps to 15 * 9 = 135 deployments / day (since each checkin triggers 9 RDs).
Solution: Initially, RMO did have some jitter when there were high volumes, but we have hardened the service through a bunch of stress testing over the latter part of 2015. RMO can now easily withstand our team’s load. The screenshot below shows ~20 releases of RM.CDP.DevFabricUpgrade in the past 24 hours, which corresponds to 20 * 9 = 180 releases for our team during the same time period.
Hopefully this will help you get through most of your issues if you decide to use RM in your test automation. Happy New Year!
Edit (3/23/2016): RMO has now added support for running environments in parallel with deployment conditions. We have taken advantage of this to move our test automation to a single RD with multiple environments. I have blogged about that here.