Controlling exposure through feature flags in VS Team Services

Buck Hodges

September 30th, 2016

One question that I often get from customers is how we manage exposing features in the service. Features may not be complete or need to be revealed at a particular time. We may want to get early feedback. With the team working in master and deploying every three-week sprint, let’s take a look at how we do this for Team Services.

Goals

Our first goal is decoupling deployment and exposure. We want to be able to control when a feature is available to users without having to time when the code is committed. This gives engineering the freedom to implement the feature based on our needs while also giving the business control over when a feature is announced. Next, we want to be able to change the setting at any scope, from globally to particular scale units to accounts to individual users. This granularity gives us a great deal of flexibility. We can deploy a feature and then expose it to select users and accounts. That allows us to get feedback early, which includes not only what users tell us but also how the feature is used based on aggregated telemetry. Additionally, we want to be able to react quickly and turn a feature off if it causes issues.

To make all of this work well, we need to be able to change a feature flag’s state without re-deploying any of our services. We need each service to react automatically to the change to minimize the propagation delay.

As a result, we have the following goals.

Decouple deployment and exposure
Control down to an individual user
Get feedback early
Turn off quickly
Change without redeployment
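
One way to picture the scoping goal is as a lookup that walks from the most specific scope to the most general one. The sketch below is illustrative only: the type names, the resolveFlag function, and the rule that the most specific defined value wins are assumptions, not the VSTS implementation. The three states mirror the Off, On, and Undefined values that Set-FeatureFlag accepts.

```typescript
// Hypothetical sketch: names and resolution order are assumptions for illustration.
type FlagState = "On" | "Off" | "Undefined";

interface FlagSettings {
    global: FlagState;
    byScaleUnit: Record<string, FlagState>;
    byAccount: Record<string, FlagState>;
    byUser: Record<string, FlagState>;
}

interface RequestScope {
    scaleUnit: string;
    account: string;
    user: string;
}

// Returns true when the flag is on for this request. The most specific scope with
// a defined state wins; "Undefined" or a missing entry defers to the next broader scope.
function resolveFlag(settings: FlagSettings, scope: RequestScope): boolean {
    const candidates: Array<FlagState | undefined> = [
        settings.byUser[scope.user],
        settings.byAccount[scope.account],
        settings.byScaleUnit[scope.scaleUnit],
        settings.global,
    ];
    for (const state of candidates) {
        if (state === "On") { return true; }
        if (state === "Off") { return false; }
    }
    return false; // Nothing defined at any scope: treat the feature as off.
}

// Example data (made up): off globally, on for one account, off again for one user.
const revertFlag: FlagSettings = {
    global: "Off",
    byScaleUnit: {},
    byAccount: { "contoso": "On" },
    byUser: { "alice@contoso": "Off" },
};
```

With this data, a user in the contoso account sees the feature unless a user-level setting says otherwise, and every other account falls back to the global Off.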

Feature flags

Feature flags, sometimes called feature switches, allow us to achieve our goals. At the core, a feature flag is nothing more than an input to an if statement in the code: if the flag is enabled, execute the new code path; otherwise, execute the existing code path.
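
As a minimal sketch of that if statement, here are a new and an existing code path side by side. The isFeatureEnabled function is a stand-in backed by an in-memory set, and the flag name and renderSummary function are hypothetical.

```typescript
// Stand-in flag store for illustration; a real service would look the state up per request.
const enabledFlags = new Set<string>(["Example.NewSummary"]);

function isFeatureEnabled(flagName: string): boolean {
    return enabledFlags.has(flagName);
}

function renderSummary(title: string): string {
    if (isFeatureEnabled("Example.NewSummary")) {
        // New code path, only reached while the flag is on.
        return `[new] ${title}`;
    } else {
        // Existing code path, unchanged and still reachable by turning the flag off.
        return title;
    }
}
```

Because the existing path stays in place, turning the flag off restores the prior behavior without touching the code.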

Let’s look at an actual example. In this case I want to control whether a new feature to revert a pull request is available to the user. I’ve highlighted the Revert button in the screenshot.

First we need to define the feature flag. We do that by defining it in an XML file. Each service in VSTS has its own set of flags. Here’s part of the actual file that defines the feature flag for this button with the name of the feature flag highlighted.

<?xml version="1.0" encoding="utf-8"?>
<!-- In this group we should register TFS specific features and sets their states. -->
<ServicingStepGroup name="TfsFeatureAvailability" … >
  <Steps>
    <!-- Feature Availability -->
    <ServicingStep name="Register features" stepPerformer="FeatureAvailability" … >
      <StepData>
        <!--specifying owner to allow implicit removal of features -->
        <Features owner="TFS">
          <!-- Begin TFVC/Git -->
          <Feature name="SourceControl.Revert" description="Source control revert features" />

When we deploy the service that defined this feature flag, the deployment engine will create the feature flag in the database.
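
To make the registration step concrete, here is a rough model of what reconciling declared flags with stored flags could look like. Everything in it is an assumption for illustration: the registerFeatures function, the record shapes, the choice to add new flags in the Off state, and the reading of the owner attribute as "remove this owner's flags that are no longer declared". The actual deployment engine is not shown in this post.

```typescript
// Hypothetical model of a registration step; not the actual deployment engine.
interface FeatureDefinition {
    name: string;
    description: string;
}

interface RegisteredFeature extends FeatureDefinition {
    owner: string;
    state: "On" | "Off" | "Undefined";
}

// Reconciles the stored flags for one owner with the definitions shipped in a deployment.
function registerFeatures(
    stored: RegisteredFeature[],
    owner: string,
    declared: FeatureDefinition[],
): RegisteredFeature[] {
    const declaredNames = declared.map(d => d.name);
    // Keep other owners' flags untouched; drop this owner's flags that are no longer declared.
    const kept = stored.filter(f => f.owner !== owner || declaredNames.indexOf(f.name) >= 0);
    const keptNames = kept.filter(f => f.owner === owner).map(f => f.name);
    // Add newly declared flags in the Off state; existing flags keep their current state.
    const added = declared
        .filter(d => keptNames.indexOf(d.name) < 0)
        .map((d): RegisteredFeature => ({ name: d.name, description: d.description, owner, state: "Off" }));
    return kept.concat(added);
}
```

The point of the sketch is that registration is idempotent: redeploying the same definitions leaves existing flag states alone.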

Using the feature flag in code is simple. Here’s the TypeScript code that is used to create the button. I’ve combined contents from two files. The export is from a file that defines constants. The rest is from the code to create the button on the page. I’ve highlighted the flag and the button creation. In this case, there was no prior code, so if the flag is off, nothing gets added to the page.

export module FeatureAvailabilityFlags {
    export var SourceControlRevert = "SourceControl.Revert";
}

import FeatureAvailability = require("VSS/FeatureAvailability/Services");
private _addRevertButton(): void {
    if (FeatureAvailability.isFeatureEnabled(FeatureAvailabilityFlags.SourceControlRevert)) {
        this._calloutButtons.unshift(
            <button onClick={ () => Dialogs.revertPullRequest(
                this.props.repositoryContext,
                this.props.pullRequest.pullRequestContract(),
                this.props.pullRequest.branchStatusContract().sourceBranchStatus,
                this.props.pullRequest.branchStatusContract().targetBranchStatus) }
            >{ VCResources.PullRequest_Revert_Button }</button>
        );
    }
}

In addition to the web UI, the code for the MVC controller is also protected with a feature flag. I’m omitting the definition of the constant and some of the code for brevity, but as you can see the only mention of the feature flag is in the attribute on the controller, making it really easy to control the feature with a flag.

namespace Microsoft.TeamFoundation.SourceControl.WebServer
{
    [FeatureEnabled(FeatureAvailabilityFlags.SourceControlRevert)]
    public class GitRevertsController : GitApiController
    {
        [HttpPost]
        public HttpResponseMessage CreateRevert(
            [FromBody] WebApi.GitAsyncRefOperationParameters revertToCreate,
            [ClientParameterType(typeof(Guid), true)] string repositoryId,
            [ClientIgnore] string projectId = null)
        {
        }

        [HttpGet]
        public GitRevert GetRevertForRefName(
            [FromUri] string refName,
            [ClientParameterType(typeof(Guid), true)] string repositoryId,
            [ClientIgnore] string projectId = null)
        {
        }

        [HttpGet]
        public GitRevert GetRevert(
            int revertId,
            [ClientParameterType(typeof(Guid), true)] string repositoryId,
            [ClientIgnore] string projectId = null)
        {
        }
    }
}

The FeatureEnabled attribute checks to see if the specified feature flag is enabled and throws an exception if not.

    public class FeatureEnabledAttribute : AuthorizationFilterAttribute
    {
        public FeatureEnabledAttribute(string featureFlag)
        {
            this.FeatureFlag = featureFlag;
        }

        public string FeatureFlag { get; private set; }

        public override void OnAuthorization(HttpActionContext actionContext)
        {
            base.OnAuthorization(actionContext);
            TfsApiController tfsController = actionContext.ControllerContext.Controller as TfsApiController;
            if (tfsController != null)
            {
                if (!tfsController.TfsRequestContext.IsFeatureEnabled(this.FeatureFlag))
                {
                    throw new FeatureDisabledException(FrameworkResources.FeatureDisabledError());
                }
            }
        }
    }

Other than some similar code for a menu entry for revert, the feature flag has now been added for the new feature.
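
The same guard-the-entry-point idea can be expressed outside of C# attributes. The following is an illustrative TypeScript analogue, not VSTS code: requireFeature and FlagCheck are hypothetical names, and a plain Error stands in for FeatureDisabledException.

```typescript
// Illustrative TypeScript analogue; the names below are hypothetical, not VSTS APIs.
type FlagCheck = (featureFlag: string) => boolean;

// Wraps a handler so the flag is checked before every call, much like the
// attribute checks the flag before a controller action runs.
function requireFeature<A extends unknown[], R>(
    featureFlag: string,
    isEnabled: FlagCheck,
    handler: (...args: A) => R,
): (...args: A) => R {
    return (...args: A): R => {
        if (!isEnabled(featureFlag)) {
            throw new Error(`Feature '${featureFlag}' is disabled.`);
        }
        return handler(...args);
    };
}
```

Keeping the check in a wrapper means the handler itself never mentions the flag, which makes it easy to delete the wrapper once the feature is fully available.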

Controlling feature flags

We have both PowerShell commands and a web UI to turn the feature flags off and on for different scopes.

Our PowerShell commands are what you would expect: Get-FeatureFlag and Set-FeatureFlag. Here are some examples of what they can do.

Getting all flags and their states globally	Get-FeatureFlag
Getting all flags and states for a single account	Get-FeatureFlag -Host account
Getting information for one flag globally	Get-FeatureFlag -Name feature_name
Getting information for one flag in an account	Get-FeatureFlag -Name feature_name -Host account
Setting a feature flag globally	Set-FeatureFlag -Name feature_name -State {Off|On|Undefined}
Setting a feature flag for a single account	Set-FeatureFlag -Name feature_name -State {Off|On|Undefined} -Host account

We also have an internal site that provides an interactive way to do the same operations. In this example, you can see that I have the feature flag for the revert feature on for one of my personal accounts and off for the other. This is a great way to be able to control the flags for accounts on demand, such as when we allowed customers to request SSH before it became broadly available.

Turn it off!

It’s important to have the right telemetry to monitor new features. If we find that a feature is causing problems, we can turn the feature flag off. Since there’s no deployment involved – just a script or change in the administrative web UI – we can quickly revert to the prior behavior.
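
Turning a flag off quickly depends on every service instance picking up the change without a redeployment. The sketch below shows one simple way that could work, using an in-process store that notifies subscribers. It is only an illustration: FlagStore and ServiceInstance are hypothetical, and this post does not describe how VSTS actually propagates changes between services.

```typescript
// Hypothetical sketch of change propagation; the real mechanism is not described in this post.
type Listener = (flagName: string, enabled: boolean) => void;

class FlagStore {
    private states: { [name: string]: boolean } = {};
    private listeners: Listener[] = [];

    subscribe(listener: Listener): void {
        this.listeners.push(listener);
    }

    set(flagName: string, enabled: boolean): void {
        this.states[flagName] = enabled;
        // Notify every subscriber right away so no redeployment or restart is needed.
        for (const listener of this.listeners) {
            listener(flagName, enabled);
        }
    }

    isEnabled(flagName: string): boolean {
        return this.states[flagName] === true;
    }
}

// A service instance keeps a local copy for fast checks and refreshes it on change.
class ServiceInstance {
    private cache: { [name: string]: boolean } = {};

    constructor(store: FlagStore) {
        store.subscribe((name, enabled) => { this.cache[name] = enabled; });
    }

    isFeatureEnabled(flagName: string): boolean {
        return this.cache[flagName] === true;
    }
}
```

A single call such as store.set("SourceControl.Revert", false) then acts as the kill switch for every subscribed instance.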

Testing

New features that are hidden behind a feature flag are deployed with the flag turned off. As we start turning on the feature for users or accounts, both the old and new code will be executing. We need to test both with the feature flag on and off to ensure it works. This is also critical for ensuring the feature can be turned off quickly if something goes wrong.

Tests can easily control whether a flag is on or off by calling methods like the following.

public static void RegisterFeature(TestCollection testCollection, string featureName)
public static void UnregisterFeature(TestCollection testCollection, string featureName)
public static void SetFeatureStateForApplication(TestCollection testCollection, string featureName, bool state)
public static void SetFeatureStateForDeployment(TestCollection testCollection, string featureName, bool state)
public static bool IsFeatureEnabled(TestCollection testCollection, string featureName)
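
To show what covering both code paths might look like, here is a small sketch. It uses TypeScript with an in-memory flag map rather than the C# helpers listed above, and withFeatureState, pullRequestActions, and the action names are hypothetical.

```typescript
// In-memory stand-in for the flag service, for illustration only.
const flagStates: { [name: string]: boolean } = {};

function isFeatureEnabled(name: string): boolean {
    return flagStates[name] === true;
}

// Runs the body with the flag forced to the given state, then restores the previous value.
function withFeatureState<T>(name: string, state: boolean, body: () => T): T {
    const previous = flagStates[name];
    flagStates[name] = state;
    try {
        return body();
    } finally {
        if (previous !== undefined) {
            flagStates[name] = previous;
        } else {
            delete flagStates[name];
        }
    }
}

// Code under test: offers Revert only when the flag is on.
function pullRequestActions(): string[] {
    const actions = ["Complete", "Abandon"];
    if (isFeatureEnabled("SourceControl.Revert")) {
        actions.unshift("Revert");
    }
    return actions;
}

// Exercise both paths so the feature can be turned on or off safely.
const actionsWhenOn = withFeatureState("SourceControl.Revert", true, pullRequestActions);
const actionsWhenOff = withFeatureState("SourceControl.Revert", false, pullRequestActions);
```

Restoring the previous state in the finally block keeps one test's flag setting from leaking into the next.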

Tests can be run conditionally based on the state of a flag.

[TestMethod, Owner("buck"), Priority(1)]
[Description("Verify that Revert works correctly.")]
[RequiresFeature(FeatureAvailabilityFlags.SourceControlRevert)]
public void SourceControl_Revert()
{
   ...
}

Since most feature flags are used initially with a default state of off, it’s also easy to set them in the test environment for the test run.

<?xml version="1.0" encoding="utf-8"?>
<TestEnvironment>
  <TestVariables>
    <Value Key="SetTfsFeaturesOn">SourceControl.Revert</Value>
  </TestVariables>
</TestEnvironment>

Stages

I mentioned earlier that we can use feature flags to get feedback. We also use them to allow our own team to begin using the feature to help flush out bugs (it’s great to build the service that we use). Rather than have every team invent their own process, we established a process to roll out features in a standardized way. This provides an opportunity to gather feedback and bugs early on in the development process.

We have standard stages that we’ve defined that each team can use. How quickly a feature goes through the stages depends on the scope of the feature, feedback, and telemetry. The stages include an increasingly broad group of users with increasingly diverse perspectives.

Stage 0 – Canary
This is the first phase and covers the account used by the VSTS team plus select internal accounts. Program managers are responsible for sending out communication to the users once the feature flags are enabled.

Stage 1 – MVPs & Select Customers
This is the second phase and will include MVPs and select top customers who have opted in. Program managers are again responsible for emailing the users.

Stage 2 – Private Preview
Private preview is used for major new features and services and is designed to test new features with a broader set of customers that we don’t have regular contact with. There are many ways to collect a list of customers for a private preview – from forum interaction, blog comments, etc. We’ve also done invitation codes, publicized email aliases for customers to request access, as we did for SSH access, and sometimes created in-product “opt-in” experiences for customers, as we’ve done for the new navigation UI. Individual teams will manage and communicate directly with their private preview customers.

Stage 3 – Public Preview
Public preview is a state reserved for major new features and services where we want to gather feedback but are not yet ready to provide a full SLA, etc. Public Preview features are enabled for all VSTS customers but their main entry points in the UI are annotated with a “preview” designation so that customers understand this is not a completed feature. When a feature enters public preview, it is announced in the VSTS news feed and may also be accompanied by marketing communication, depending on the feature.

Stage 4 – General Availability (GA)
GA denotes when a feature/service is available to all customers and fully supported (SLA, etc).
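
The stages form an ordered sequence in which each one widens the audience of the one before it. As a rough sketch of that ordering, the code below tags each account with the earliest stage at which it receives a feature. The stage names follow the list above, but the account names, the accountStage table, and the isExposed function are hypothetical; in practice private preview lists are managed per feature by each team.

```typescript
// Stage names follow the post; the account-to-stage mapping is a hypothetical illustration.
enum Stage {
    Canary = 0,
    MvpsAndSelectCustomers = 1,
    PrivatePreview = 2,
    PublicPreview = 3,
    GeneralAvailability = 4,
}

// Earliest stage at which each account receives a feature.
// Accounts that are not listed wait for public preview.
const accountStage: { [account: string]: Stage } = {
    "vsts-team": Stage.Canary,
    "mvp-account": Stage.MvpsAndSelectCustomers,
    "preview-customer": Stage.PrivatePreview,
};

// A feature rolled out to currentStage is visible to every account
// whose own stage is at or before it.
function isExposed(account: string, currentStage: Stage): boolean {
    const listed = accountStage[account];
    const stage = listed !== undefined ? listed : Stage.PublicPreview;
    return stage <= currentStage;
}
```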

Cleaning up flags

It’s easy to accumulate a lot of flags that are no longer used, so after a feature has been in production and fully available, teams decide when to delete the feature flag and the old code path. This may happen a few weeks or a few months later, depending on the scope of the feature.

Events

One of our goals is to allow for features to be unveiled at events in some cases. If you want to unveil a feature at an event, when do you enable it?

We learned the hard way that it’s not the morning of the event. Several years ago at the Connect 2013 event we turned on a large set of feature flags just before a keynote and demo. The service was unusable. By turning on so many feature flags at the same time, we had a large amount of new code interacting with the system under production load, and the system fell apart. You can read about details of the incident in Brian’s post, A Rough Patch. At that time, we didn’t have stages, and we only had one public scale unit.

As a result of that experience, we ensure that feature flags are enabled in production at least 24 hours ahead of an event so that the code is under full production load. That leaves us time to react, whether to fix problems or disable features, before the event starts. Of course, this means that new features could be discovered early. That’s certainly a possibility, but it’s mitigated by controlling the entry points, making the new features hard to find except for customers who’ve gotten early access or who know where to look (perhaps by setting a special cookie). Some of it comes down to a judgment call.
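
The 24-hour rule is easy to encode as a pre-event check. This is a small sketch with hypothetical function names; the 24-hour default comes from the policy above, and any dates passed in are up to the caller.

```typescript
const MS_PER_HOUR = 60 * 60 * 1000;

// Latest moment a flag may be enabled while still leaving the required lead time before the event.
function latestEnableTime(eventStart: Date, leadHours: number = 24): Date {
    return new Date(eventStart.getTime() - leadHours * MS_PER_HOUR);
}

// True when the flag was turned on early enough to run under production load before the event.
function enabledEarlyEnough(enabledAt: Date, eventStart: Date, leadHours: number = 24): boolean {
    return enabledAt.getTime() <= latestEnableTime(eventStart, leadHours).getTime();
}
```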

We followed this policy when we unveiled the Marketplace at the Connect 2015 event. In contrast to the event in 2013, everything worked as it should during the event.

Summary

Feature flags have become a critical part of how we roll out features, get feedback, and allow engineering and marketing to proceed on their own schedules. It’s hard to imagine DevOps services without them!

While we built our own implementation, you don’t have to. LaunchDarkly offers feature flags as a service, including a LaunchDarkly VSTS extension to integrate with VSTS work items and releases.

Follow me at twitter.com/tfsbuck