Release Flow: How We Do Branching on the VSTS Team
April 19, 2018
Whenever I talk to somebody about Git and version control, one question always comes up:
How do you do your branching at Microsoft?
And there’s no one answer to this question. Although we’ve been moving everybody in the company into one engineering system, standardizing on Git hosted in Visual Studio Team Services, what we haven’t done is move everybody into the same branching and development model.
Some teams, like Windows, have kept a branching strategy similar to the one they’ve been using for many years. It’s hard to argue with this approach: they’ve got a lot of tooling to support it, and the developers have institutional knowledge about how things move between branches. Moving a team that big to Git was challenging enough; you can only boil so many oceans at the same time.
But this has led to some confusion in the way we talk about using Git: for example, Raymond Chen recently wrote an interesting series of blog posts explaining how you shouldn’t cherry-pick commits in Git. And while this is perfectly reasonable advice for his team’s workflows, it goes against the workflows that we use to build Visual Studio Team Services itself, and how the VSTS team works on a daily basis.
So, then: how do we do branching on the VSTS team? First, we follow a trunk-based development approach. But unlike some trunk-based models, like GitHub Flow, we do not continuously deploy master to production. Instead, we release our master branch every sprint by creating a branch for each release. When we need to bring hotfixes into production, we cherry-pick those changes from master into the release branch. It’s a strategy that we call “Release Flow”.

Why Trunk-Based Development
We’re big fans of trunk-based development on the VSTS team. We like a simple branching structure where there’s a single master branch that everybody works in. This is much simpler than our old branching structure back in the dark days, many years ago, when our team was in the same TFVC repository as the Visual Studio IDE. We used to have this multi-level branching strategy that was, to be polite, “complex”.

The more I talk to developers, the more I’ve observed something that tends to happen to teams that don’t do trunk-based development. No matter how organized they think they are, in fact, they tend to structure their branches in the same way. It’s a bit of a corollary to Conway’s Law:
Organizations tend to produce branching structures that copy the organization chart.
We were no exception: you could basically watch your code flowing through the org chart. When you checked in to your branch, your code would eventually be “forward integrated” into the next branch closer to trunk, eventually landing in trunk. Once that happened, you’d want to “reverse integrate” trunk back up to all the feature branches so that you’d have everybody else’s code.
This is merge hell — yes, that’s actually what we called it — and we had a person employed full-time to deal with merging, conflict resolution and making sure all this integration continued to build. Whatever we paid him, it probably wasn’t enough.
Why We Don’t Continuously Deploy
When you back away from feature branches and start thinking about trunk-based branching strategies, the one that often comes up is GitHub Flow. (Note, that’s GitHub Flow, not Git Flow, which has two “trunks” and is therefore not really trunk-based at all.)
I’m very familiar with GitHub Flow from my time working at GitHub. Overall, I really like this system; it’s lightweight and with good tooling and automation, you can be very productive. This system works pretty well for GitHub, but unfortunately, it doesn’t scale to the VSTS team’s needs. That’s because there’s a subtlety to GitHub Flow that often goes overlooked. You actually deploy your changes to production before merging the pull request:
Once your pull request has been reviewed and the branch passes your tests, you can deploy your changes to verify them in production… Now that your changes have been verified in production, it is time to merge your code into the master branch.
— Understanding GitHub Flow
This system is extremely clever: when you’re ready to check in, you get immediate feedback on how a pull request will behave in production, and that feedback happens before you complete the pull request. So if there’s a problem with your code changes, you can simply abandon the deployment, and your bad code never gets merged into master. This lets you take a step back and look at the monitoring data to understand why your changes were problematic, then iterate on the pull request and try again.
The problem with this development strategy is that it scales extremely poorly to larger teams, because there’s contention when you’re trying to deploy to production:
During peak work hours, multiple developers are often trying to deploy their changes to production. To avoid confusion and give everyone a fair chance, we can ask Hubot to add us to the deployment queue.
— Deploying Branches to GitHub.com
(If you’re not familiar with Hubot, it’s the core of GitHub’s chatops infrastructure. GitHub uses Hubot to perform their deployments from within Slack.)
When you have a few developers, you’re going to need a deployment queue to ensure that only one pull request can be deployed at once. This is great, but as you start to grow and hire more developers, there are more people in the queue. As your codebase grows, builds start to take longer. And as you get more popular, your infrastructure grows and with it, the time it takes to deploy.
Visual Studio Team Services has hundreds of developers working on it. On average, we build, review and merge over 200 pull requests a day into our master branch. If we wanted to deploy each of those before merging it, it would decimate our velocity.
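To put a rough number on that contention, here is a back-of-the-envelope sketch. The 200 pull requests a day comes from the figure above; the 30 minutes per deployment is purely an assumption for illustration, not a measured value, and the function name is made up.

```python
# Back-of-the-envelope model of a serialized deployment queue.
# 200 pull requests/day is the figure quoted in the post; the deployment
# duration below is a hypothetical value chosen only for illustration.

PULL_REQUESTS_PER_DAY = 200
ASSUMED_MINUTES_PER_DEPLOY = 30  # assumption, not a measured number


def queue_hours_per_day(pull_requests: int, minutes_per_deploy: float) -> float:
    """Hours of one-at-a-time deployment needed for a day's pull requests."""
    return pull_requests * minutes_per_deploy / 60


hours_needed = queue_hours_per_day(PULL_REQUESTS_PER_DAY, ASSUMED_MINUTES_PER_DEPLOY)

# Longest deployment that would still let the queue drain within 24 hours.
max_minutes_per_deploy = 24 * 60 / PULL_REQUESTS_PER_DAY

print(f"Queue needs {hours_needed:.0f} hours of deployments per day")
print(f"Break-even deployment time: {max_minutes_per_deploy:.1f} minutes")
```

Under that assumption the queue would need far more than 24 hours of deployment time per day, and it would only keep up if every deployment finished in a little over seven minutes.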
We try to strike a balance where we want to code fast and get changes into master quickly, even if it takes them a little while longer to get into production. So instead of deploying every pull request to production, we deploy master to production at the end of each sprint — every three weeks. This means that a new feature could take that long to get into production. (And, of course, that new feature might only be enabled in testing, and not for all users, since we use feature flags in production.)
Sprintly Deployments
At the end of a sprint, when we’re ready to do a release, we create a new branch from master. This will be the release branch for the remainder of the sprint. While new feature work and development goes on in master, production stays nicely isolated from that work. Again, this keeps our development velocity moving quickly; we don’t have to worry about how long it takes to deploy these changes to our cloud of hundreds of servers spread across multiple Azure regions. We just open a pull request, get a code review and merge it into master.

We name these branches after the sprint that they correspond with. At the end of sprint 129, we create a branch named releases/M129 from master and deploy that. Once we finish development in sprint 130, we’re ready to deploy those changes to production; at that point, we forget about the old releases/M129 branch. Instead, we create a new branch named releases/M130 from master and deploy it. Once the releases/M130 deployment finishes — which would take a while, since we use a ringed deployment strategy — we don’t care about the old releases/M129 branch anymore. Once all the servers are running releases/M130 and there’s nothing with M129 in production, that branch is only of historical interest. We could even delete it.
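The naming scheme is simple enough to express in a few lines. This is only a sketch: the `releases/M<sprint>` pattern is taken from the examples above, while the function names are invented for illustration and are not part of any VSTS tooling.

```python
def release_branch(sprint: int) -> str:
    """Name of the release branch cut from master at the end of a sprint."""
    return f"releases/M{sprint}"


def branches_after_release(sprint: int) -> dict:
    """The branch now being deployed, and the one that becomes historical."""
    return {
        "deploying": release_branch(sprint),
        "historical": release_branch(sprint - 1),
    }


print(branches_after_release(130))
```

Because each release branch is cut fresh from master, nothing ever has to be merged from one release branch into the next.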

Cherry-Picking Changes into Production
Of course, we don’t want production to exist in a vacuum. If there’s a high-priority bug or an availability issue, we need to be able to fix the problem quickly and deploy it immediately. That’s where cherry-picking comes in.
When we need to bring a change to production, we first make the change against the master branch. We get it code reviewed as usual — though at a bit higher priority than normal — and merge it into master. Then we cherry-pick that pull request into the current production release branch and start deploying it.

We find this workflow so useful that you can cherry-pick a pull request right from VSTS:

This actually cherry-picks the whole pull request, bringing each commit that made up the PR from one branch to another.
We always make production changes this way, starting in master; that’s because how the code gets into production is as important as the code that ultimately gets there. If we were to hotfix production directly, we might accidentally forget to bring a change back to master for the next release. But by bringing changes into master first, we ensure that we never have regressions in production.
This is so important that we ask if you’ve done it in the pull request template for our release branches:

The only exception, of course, is when the change doesn’t make sense to bring into master. Perhaps there’s been some refactoring that means that this bug doesn’t exist in master anymore. That’s the only time pull requests can go directly into a release branch without going through master first.
Even though this “master first” policy takes a few extra minutes, it’s always worth it. That’s especially true when you feel the time pressure to resolve a production incident, when you might be tempted to cut corners. It ensures that we only fix these bugs once and that we won’t have a repeat availability incident due to the same problem.
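The effect of the “master first” policy can be sketched with a toy model in which a branch is just an ordered list of change names. This is not how Git stores history, and the change names and the `cherry_pick` helper are hypothetical; the point is only to show which changes end up where.

```python
def cherry_pick(target: list, change: str) -> None:
    """Copy one change onto the target branch without bringing anything else."""
    if change not in target:
        target.append(change)


# End of sprint 129: the release branch is cut from master.
master = ["feature-a", "feature-b"]
releases_m129 = list(master)

# Sprint 130 work continues in master, then a hotfix lands in master first.
master.append("feature-c")
master.append("hotfix-1")

# Only the hotfix is cherry-picked into the production release branch.
cherry_pick(releases_m129, "hotfix-1")

# End of sprint 130: the next release branch is cut from master again.
releases_m130 = list(master)

print(releases_m129)
print(releases_m130)
```

In this model the unfinished sprint 130 work never reaches `releases/M129`, and `releases/M130` contains the hotfix automatically because it was in master before the branch was cut.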
I hope that this gives some context behind the branching strategy we use on the VSTS team and why it works for us. Of course, for your branching strategy, you need to pick an approach that works for the team you have and the product that you’re building. And you should be willing to re-evaluate as those things change: as we transitioned from building an on-premises product shipping every few years to a cloud service deploying all the time, we had to change our branching strategy to fit. We needed a structure that would meet the challenges that we face today instead of fighting battles of the past. You do, too.
If you have any questions, please feel free to leave a comment — or if you’re coming to the Build 2018 conference on May 7th, then I’d love to chat in person. You can drop by the version control area on the expo floor, where I’ll be hanging out.
Please tell us about your testing process:
* How you run tests
* How you create UI tests (JavaScript)
* How you remove all TRA tests
* How you find bugs at component boundaries
Hi Nick, I’m afraid that my expertise is not testing. Thankfully, Munil Shah – Director of Engineering – has a great discussion about how we’re doing testing: https://www.visualstudio.com/learn/shift-left-make-testing-fast-reliable/
I read this article.
And there are no answers to my questions 🙁
> If we were to hotfix production directly, we might accidentally forget to bring a change back to master for the next release. But by bringing changes into master first, we ensure that we never have regressions in production.
Do you not have a scheduled or automated process to regularly merge/RI to master from the production bits branch? That would make it unnecessary to worry about whether changes to production will go back to master or not, while also making it unnecessary to cherry-pick changes from master to production.
(My personal opinion: Even if you put aside the commit graph problems that git cherry-picking and rebasing can cause, push-button cherry-picking and squash+merging always make me queasy because you are putting commits onto master, or whatever the target branch is, that _have never been built or tested_. So that’s why I’d like to avoid it whenever possible.)
You’re very much right; I agree that you should never deploy code that you haven’t built or tested. I should have clarified what our cherry-pick button does: it actually creates a new pull request against the target branch. That PR is built and has our test infrastructure run against it. You can do a test deployment for more manual testing. Once you’re happy, then you can merge it into the production branch and deploy it to production.
As for automation: we never bring changes from production to master; changes _always_ flow in the opposite direction. Since we put things in master first, there are never changes in production that are not in master, so there would be no value in merging back.
How can we handle a long-lived feature in trunk-based development, given that such a feature would take 2-3 sprints to complete? What approach is needed in that scenario?
Hi Mustafa – we’d recommend that instead of branching to achieve feature isolation, you use a feature flag to hide the feature from users. That is to say: continue to work on the feature, and check in the changes to your master branch, but keep it isolated from users by hiding it behind a feature flag. https://docs.microsoft.com/en-us/vsts/articles/phase-features-with-feature-flags?view=vsts
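As a minimal sketch of that idea (not the VSTS feature flag system; the flag name and functions here are hypothetical), the unfinished code ships in master but stays unreachable until the flag is switched on:

```python
# Hypothetical flag store; a real service would read this from configuration.
DEFAULT_FLAGS = {"new-dashboard": False}


def is_enabled(flag: str, flags: dict) -> bool:
    """Unknown flags are treated as off, so unfinished work stays hidden."""
    return flags.get(flag, False)


def render_home(flags: dict) -> str:
    """Serve the in-progress feature only when its flag is on."""
    if is_enabled("new-dashboard", flags):
        return "new dashboard"
    return "classic dashboard"


print(render_home(DEFAULT_FLAGS))
print(render_home({"new-dashboard": True}))
```

The feature can then be merged into master across several sprints and enabled for users once it is complete.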
I saw a talk about this at Build and I’m genuinely interested. I’m curious about what sort of automation you guys have for automatically creating sprint release branches, as well as for specifying which branch you are creating a release from. (I know how that could be achieved from a build perspective.)
Good question – there’s not a lot of work to be done: it’s a matter of creating a new branch from master (we always create the release branch from master), and then changing the build definition for the release branch to point to that new branch. I _suspect_ that there’s a PowerShell script that we run, just to make sure that we don’t miss any steps, but if I’m honest, I don’t actually know. 😀
Hi, I missed your great presentation at Build and watched it on YouTube later on. I wonder why you don’t branch from the release branch, merge that into master, and then cherry-pick it back to the release branch?
Interesting question – what are the advantages to this sort of a workflow? One of the advantages of keeping them reasonably separate (that is, never merging between master and the release branch) is that the change in each branch may actually need to be different. So if we branched off the release branch, fixed an issue, and then tried to merge _that_ into master, we may have to fix that code to be applicable for master. Imagine that we did some refactoring in master and changed the name of a variable in master. Imagine that we need to do a hotfix where we set the value of a variable to a default value.
That means that if we branch the release branch to fix this, we’ll update the variable – using the old name. Seems reasonable. Now if we merge our hotfix branch into master, we’ll have to update it to point to the new name. That’s fine, that’s just updating it to reflect the state of the new world that we’ve refactored. But now if we cherry-pick that change from master into the release branch, we’ll have to update it _again_ to point back at the old name. So this is a workable plan, of course, but it’s not particularly efficient. We don’t really need to do a cherry-pick in this scenario: we could instead merge the hotfix branch into the release branch and into the master branch.
The reason we fix in master first is about the _process_ though. It’s so that we _always_ have a fix in master so that we don’t regress during the next feature release. Creating a hotfix branch from the release branch is certainly technically workable, but it might encourage people to neglect the merge back to master.
We’ve had availability outages recur because we forgot to bring a fix to master (and only fixed it in production). That’s bad and avoidable, and that’s why we use this process this way.
I enjoyed the Build presentation and the article. Thank you.
If I understood the process correctly, a production bug fix is first implemented and tested against the master branch. It is possible that 2 weeks or more of development has occurred in that branch since the last production release – so you are effectively building and testing against something other than production. I realize you then cherry-pick the changes and create a second pull request. How often does this new pull request break unit tests and require an additional round of coding and testing in the master branch to “retrofit” the bug fix to the last production release?
I realize the bug fix has to work in both branches, but by starting with the master branch, you effectively delay the production fix. I guess that is what your “override” form is designed to handle – the case where the bug fix really needs to hit production first.
Overall the approach has merit for my environment in that our teams are 1-3 people and the pure GitHub Flow could probably work.
Good question – I suspect that it happens less often than you would think, to be honest. But actually, I’m not sure that we’ve ever tracked it. It’s not something that’s particularly painful, and you tend to have a pretty good idea of whether the cherry-pick will go cleanly, or whether you’ll need to do a little work to massage it for the release branch, before you even get started. We do delay the production fix by starting in master _but_ we consider that a worthy trade-off to ensure that we don’t have a regression later.
At the scale of 1-3 people, if you want a true continuous delivery pipeline, I think that GitHub Flow can be very effective.
What happens in the scenario where a feature or refactor has been merged into master after the last release, then there is a hotfix you have to introduce in production in that code area? Are your hotfix branches branched from the same commit as the release? I’m not clear on how cherry picking into the release branch would work in this scenario? Thanks.
Great question – it does sometimes happen that we do a refactoring or make some changes in master so that we can’t cleanly cherry-pick a commit into the release branch. Maybe we’ve changed the name of a variable in the master branch, and the hotfix needs to change the way that variable is assigned. In that case, we could either cherry-pick the commit and then fix it up before merging it into the release branch, or we could just re-code the commit entirely. Generally these fixes are small and targeted, so it’s not overly burdensome to do that. It really just depends on which method is the least amount of work for us.
What if master has changes in place since the last release, and you do not want them to go out with the hotfix? Let’s say they have complex database changes, have not been tested or gone through UAT, are awaiting a Build announcement, etc. I’m guessing that you will say feature flags, but is that it?
Right – no, we actually do not use feature flags here; we definitely do not want to introduce any new changes into production, even if they are behind feature flags. That’s why we do a cherry-pick into the release branch instead of a merge. A cherry-pick brings just the single hotfix commit from master into the release branch, without bringing along those complex database changes. A cherry-pick can isolate just that single change.
Now, this doesn’t _always_ work; sometimes the commit that hotfixes the bug will have dependencies on prior commits. Maybe there was some refactoring done in master: consider the case where a variable is renamed, and now the fix in master is to change the way that variable is assigned. We couldn’t cherry-pick that change in cleanly (because the variable still has the old name in the release branch), so sometimes we need to cherry-pick the commit and then clean it up, and sometimes we have to rewrite the code entirely, with one fix going into master and a different fix going into the release branch.
But we definitely try to minimize the changes going into the release branch to hotfix only the issue that we’re seeing; we desperately want to avoid making the issue worse by bringing in unrelated changes.
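The rename scenario described above can be sketched with a toy patch model. This is a deliberate simplification of what Git does (a patch here is one exact line replaced by another), and the variable names and the `apply_patch` helper are hypothetical.

```python
def apply_patch(lines: list, old: str, new: str) -> list:
    """Replace an exact line; fail if the line the patch expects is missing."""
    if old not in lines:
        raise ValueError(f"patch does not apply: {old!r} not found")
    return [new if line == old else line for line in lines]


# The release branch still has the old variable name.
release = ["timeout = None"]
# Master renamed the variable during a refactoring after the branch was cut.
master = ["request_timeout = None"]

# The hotfix is written against master first.
fix_for_master = ("request_timeout = None", "request_timeout = 30")
master = apply_patch(master, *fix_for_master)

# Cherry-picking the same change into the release branch does not apply cleanly.
conflict = None
try:
    apply_patch(release, *fix_for_master)
except ValueError as err:
    conflict = str(err)

# So the fix is reworked to use the old name before it goes into the release.
release = apply_patch(release, "timeout = None", "timeout = 30")

print(conflict)
print(master, release)
```

Both branches end up with the same behavioral fix, expressed against the variable name each branch actually has.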
How do you manage automated deployment from the release branch? If releases were done from the master branch, CI/CD could easily be configured to trigger once a check-in is made to that branch. However, a release branch is created for every sprint; in that case, do you trigger the release manually?
Also, how do you handle reverting a specific feature if a feature flag is not used and the QA/business team didn’t approve it or deferred it to the next release?
We have a CI/CD pipeline set up from the release branch. (We use the Build and Release functionality in VSTS, which probably won’t be a surprise to you. 😂). We change the branch that it targets to point to the latest release branch – right now our “VSO.Release.CI” process is targeting the “release/M135” branch. So we’re able to take advantage of our build / release process, including any approval gates, just by changing that branch in the release definition.
We’ve never been in a position to revert a specific feature; we would always put a feature flag around it in the first place. And – to be honest – if you said to me that we needed to remove a feature, I think that I would retroactively put a feature flag around it. It sounds like there’s a lot of risk around reverting a feature – there might be schema changes for example – and it sounds a lot less risky to feature flag it than to try to truly revert the code.
It is a very nice article. We have a similar flow in TFS VC, but we have only one DEV branch, which is merged into trunk/master. At the end of a sprint, we build a version from trunk/master and deploy it to the test environment. I would like to ask how you handle bugs found in the test environment. We have more than 10 bugs that could require large code changes, and merging those changes into trunk/master and then into the release branch (we merge into the release branch from trunk/master) is time-consuming. So we create the release branch when we deploy the version to the production environment. That can take more than a week, during which we cannot merge from the DEV branch into trunk/master because the test version is in trunk/master. A large queue of changes can then build up.
We have a second problem: we release a version to the test environment and there is a bug that prevents deploying it to the production environment. Our customer says they want this version without the feature that contains the bug. We must then roll back all changes in the trunk/master branch that are affected by that feature. I think your flow has the same problem. How do you handle a bug in some area that prevents deployment to the production environment?
Hi, thanks for the post.
I’m tempted to implement something like this. The only thing I’m not clear on yet is how you deal with hotfixes in the master branch when master has more commits in it since the release was made.
I understand the purpose of cherry-pick: it takes the commit changes and applies them to the release branch. But what if the fix commit is very large? How does the cherry-pick work in this scenario?
I’ve been trying this on a test Git repository, and sometimes the cherry-pick fails in VSTS, so I have to do the merge locally. I think this manual merging, caused by the fact that I want to keep the master branch always updated, can cause some unexpected errors. Am I right?
What do you do when the cherry-pick fails in VSTS?
Thanks!
Thanks for the interesting article and summary presentation. If release-branching is more efficient than github-flow for larger teams, is there an argument to instead split into smaller teams & components that can be deployed independently, so that continuous delivery can be preserved instead of introducing 3-week batches? And are there ever cases where the 3-week batch needs to be rolled-back instead of hot-fixed, delaying delivery of the work for hundreds of developers? (or do the feature flags protect against this)