Tracking branch health and identifying flaky tests in RM driven test automation

In my previous blog post, I wrote about how our team has a single Release Definition that runs all the test environments in parallel.  Now that Release Management supports branch-based filters when listing releases, it is very easy to track the health of a particular branch.  Further, now that the test team has made test case history branch and environment aware, it has become significantly easier to pinpoint the checkin that caused a particular test to start failing, and to identify flaky tests.

Branch filters

The “Releases” view of our team’s Release Definition (named RM.CDP – which runs a bunch of test automation per checkin) tells a pretty sordid story regarding the test pass rate (see screenshot below).  There are lots of reds, and I have no idea whether we can deploy to prod or not.

However, note the “Branch: All” control in the upper right corner (highlighted), which was added recently.  This view lists releases across all branches, and I have highlighted a few of the branches that are being worked on.

image

I want to check the health of the release branch for the previous sprint, i.e. releases/M106, to see whether we can deploy it to prod.  I can do this by selecting this value from the dropdown:

image

This branch looks much healthier, with a lot more greens.  I dig deeper into the latest build from this branch:

image

I find that the compat tests are broken (i.e. we have issues when the version of RM is greater than or less than that of TFS).  Once we rationalize/fix up these compat tests in releases/M106, we can deploy releases/M106 to prod.

Thus the branch filter feature can be used to easily find the health of a particular branch.
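To make the idea concrete, here is a minimal Python sketch of what filtering by branch amounts to. It does not call Release Management at all: the `releases` list, its field names, and the statuses are hypothetical sample data, and only the branch names come from this post.

```python
from collections import Counter

# Hypothetical release records. Only the branch names come from the post;
# the ids, field names, and statuses are made up for illustration.
releases = [
    {"id": 101, "branch": "features/rmmaster", "status": "failed"},
    {"id": 102, "branch": "releases/M106", "status": "succeeded"},
    {"id": 103, "branch": "features/rmmaster", "status": "failed"},
    {"id": 104, "branch": "releases/M106", "status": "succeeded"},
    {"id": 105, "branch": "releases/M106", "status": "failed"},
    {"id": 106, "branch": "features/rmmaster", "status": "succeeded"},
]


def branch_health(releases, branch):
    """Count release statuses for a single branch (the 'branch filter')."""
    return Counter(r["status"] for r in releases if r["branch"] == branch)


if __name__ == "__main__":
    for branch in ("releases/M106", "features/rmmaster"):
        print(branch, dict(branch_health(releases, branch)))
```

Looking at all branches together mixes these counts, which is why the unfiltered view looks so red.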

Analyzing test failures and identifying flaky tests

There are usually 3 kinds of test failures that we see in our automation:

(1) A dev checkin caused the test to start failing

(2) There is an infrastructure issue which is causing the test to fail e.g. the machine is out of disk space

(3) The test is flaky

Here, I will walk you through how to identify each of these scenarios using the cool “View History” feature that the Test team added, which is now Branch aware as well as Environment aware.

How to find which checkin caused a test to start failing?

Let's look at the health of the features/rmmaster branch (which is our team’s working branch), and dig into the latest release by double-clicking on it.

image

I focus on the TfsOnPrem environment to start with (which runs the tests on an on-prem TFS installation), and notice that 96.55% of the tests passed.

image

I click on this number (96.55%) to see which tests failed.  I note that there is one failing test.  Check out the “View History” link on the right.

image

I now click on the “View History” link to see when the test began failing.  In this view, I first filter by Environment, then by my branch of interest (features/rmmaster).  Finally, I click on the first red bar in the TfsOnPrem environment.

image

This gives me the build that caused the problem. 

image

If I click on the build, I can see the commit that caused the failure.

image
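The manual steps above (filter by environment and branch, then find the first red bar of the current run of failures) can be sketched in Python as follows. This is only an illustration of the reasoning, not how “View History” is implemented: the `history` structure, its field names, and the build numbers are assumptions.

```python
def first_failing_build(history, branch, environment):
    """Return the build where the current run of failures started.

    `history` is assumed to be ordered oldest to newest, each entry a dict
    with hypothetical keys: build, branch, environment, passed.
    Returns None if the most recent matching run passed (or none match).
    """
    runs = [
        h for h in history
        if h["branch"] == branch and h["environment"] == environment
    ]
    first_bad = None
    # Walk backwards from the latest run until we hit a passing run.
    for run in reversed(runs):
        if run["passed"]:
            break
        first_bad = run["build"]
    return first_bad


# Hypothetical history for one test; the build numbers are made up.
history = [
    {"build": "B1", "branch": "features/rmmaster", "environment": "TfsOnPrem", "passed": True},
    {"build": "B2", "branch": "features/rmmaster", "environment": "TfsOnPrem", "passed": True},
    {"build": "B2", "branch": "releases/M106", "environment": "TfsOnPrem", "passed": True},
    {"build": "B3", "branch": "features/rmmaster", "environment": "TfsOnPrem", "passed": False},
    {"build": "B4", "branch": "features/rmmaster", "environment": "TfsOnPrem", "passed": False},
]

if __name__ == "__main__":
    print(first_failing_build(history, "features/rmmaster", "TfsOnPrem"))
```

The build returned is the one whose commits are worth inspecting first.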

Is it an infrastructure issue?

If the commit seems unrelated, then you need to check if there is an infra issue.  If the test began failing across multiple branches at the same time, then it is likely an infra issue (assuming that the same checkin isn’t made into multiple branches simultaneously).

In the above example, it turned out that the test actually began failing across multiple branches (features/rmmaster and releases/M106) at around the same time:

image

It was indeed an infrastructure issue: the PAT for the External TFS server had expired, and therefore all tests that were trying to access this TFS server were failing.  This same failure in fact also caused the tests in the releases/M106 branch to go red after we deployed M106 to prod.
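The "began failing across multiple branches at around the same time" heuristic can be written down as a small Python check. This is a sketch under stated assumptions: the `failure_starts` mapping, the timestamps, and the six-hour window are all hypothetical, and "around the same time" is a judgment call you would tune for your own pipeline.

```python
from datetime import datetime, timedelta


def likely_infra_issue(failure_starts, window=timedelta(hours=6)):
    """Return True if at least two branches began failing close together.

    `failure_starts` maps a branch name to the time the test first went
    red on that branch. The window size is an arbitrary assumption.
    """
    times = sorted(failure_starts.values())
    # After sorting, any two starts within the window imply that some
    # adjacent pair is within the window too.
    return any(later - earlier <= window
               for earlier, later in zip(times, times[1:]))


# Hypothetical timestamps; only the branch names come from the post.
failure_starts = {
    "features/rmmaster": datetime(2016, 10, 3, 9, 0),
    "releases/M106": datetime(2016, 10, 3, 10, 30),
}

if __name__ == "__main__":
    print(likely_infra_issue(failure_starts))
```

As noted above, this heuristic gives a false positive when the same checkin really is merged into several branches at once, so treat a positive result as a prompt to look at the infrastructure, not as proof.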

Is it a flaky test?

If the test has reds and greens interspersed for the same branch and the same environment, then it is likely a flaky test:

image
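One rough way to express "reds and greens interspersed" is to count how often the result flips between pass and fail for a single branch and environment. The Python sketch below does that; the threshold of three flips is an assumption chosen for illustration, not a rule from the tooling.

```python
def count_flips(results):
    """Count pass/fail transitions in an ordered list of booleans."""
    return sum(1 for a, b in zip(results, results[1:]) if a != b)


def looks_flaky(results, min_flips=3):
    """Heuristic: many flips on the same branch and environment.

    A genuine regression that is later fixed produces at most two flips
    (green to red, then red to green), so the default threshold of three
    is an assumed cut-off just above that.
    """
    return count_flips(results) >= min_flips


if __name__ == "__main__":
    regression = [True, True, True, False, False, False]
    flaky = [True, False, True, True, False, True]
    print(looks_flaky(regression), looks_flaky(flaky))
```

The input should already be filtered to one branch and one environment, otherwise differences between environments will look like flakiness.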

Conclusion

Hopefully this blog has helped explain how we use Release Management’s branch-awareness features to keep our working branch healthy, or at least as healthy as we can :)