Controlling exposure through feature flags in VS Team Services


One question that I often get from customers is how we manage exposing features in the service. Features may not be complete, or they may need to be revealed at a particular time. We may want to get early feedback. With the team working in master and deploying every three-week sprint, let’s take a look at how we do this for Team Services.

Goals

Our first goal is decoupling deployment and exposure. We want to be able to control when a feature is available to users without having to time when the code is committed. This gives engineering the freedom to implement the feature based on our needs while letting the business control when a feature is announced. Next, we want to be able to change the setting at any scope: globally, for particular scale units, for accounts, or for individual users. This granularity gives us a great deal of flexibility. We can deploy a feature and then expose it to select users and accounts. That allows us to get feedback early, which includes not only what users tell us but also how the feature is used based on aggregated telemetry. Additionally, if a feature causes issues, we want to be able to turn it off quickly.

To make all of this work well, we need to be able to change a feature flag’s state without re-deploying any of our services. We need each service to react automatically to the change to minimize the propagation delay.

As a result, we have the following goals.

  • Decouple deployment and exposure
  • Control down to an individual user
  • Get feedback early
  • Turn off quickly
  • Change without redeployment
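
To make the scoping goal concrete, here is a minimal sketch of how a flag set at several scopes could be resolved for one request. The precedence order (user, then account, then scale unit, then global), the meaning of an undefined state, and every name in the code are assumptions for illustration, not the actual Team Services implementation.

```typescript
// Hypothetical model: none of these names come from the VSTS code base.
type FlagState = "On" | "Off" | "Undefined";

interface FlagScopes {
    global: FlagState;
    scaleUnits: Map<string, FlagState>;
    accounts: Map<string, FlagState>;
    users: Map<string, FlagState>;
}

interface RequestScope {
    scaleUnit: string;
    account: string;
    user: string;
}

// Assumed precedence: the most specific scope with a defined state wins.
// "Undefined" (or no entry at all) means "defer to the next broader scope".
function resolveFlag(flag: FlagScopes, scope: RequestScope): boolean {
    const candidates: FlagState[] = [
        flag.users.get(scope.user) ?? "Undefined",
        flag.accounts.get(scope.account) ?? "Undefined",
        flag.scaleUnits.get(scope.scaleUnit) ?? "Undefined",
        flag.global,
    ];
    const decided = candidates.find((state) => state !== "Undefined");
    return decided === "On";
}

// Off globally, on for one account, and off again for one user in that account.
const revertFlag: FlagScopes = {
    global: "Off",
    scaleUnits: new Map<string, FlagState>(),
    accounts: new Map<string, FlagState>([["contoso", "On"]]),
    users: new Map<string, FlagState>([["alice", "Off"]]),
};
```

With a model like this, exposing a feature to one account is a single entry at the account scope, and a global "Off" still acts as the default for everyone else.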

Feature flags

Feature flags, sometimes called feature switches, allow us to achieve our goals. At its core, a feature flag is nothing more than an input to an if statement in the code: if the flag is enabled, execute the new code path; otherwise, execute the existing code path.
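
Reduced to its essentials, the pattern looks like the sketch below. The flag store, the flag name, and the function names are hypothetical and exist only to show the two code paths side by side.

```typescript
// Hypothetical flag store and flag name, for illustration only.
const enabledFlags = new Set<string>();

function isFeatureEnabled(name: string): boolean {
    return enabledFlags.has(name);
}

function greeting(user: string): string {
    if (isFeatureEnabled("Example.NewGreeting")) {
        // New code path: only runs while the flag is on.
        return `Welcome back, ${user}!`;
    }
    // Existing code path: unchanged, and still what users get when the flag is off.
    return `Hello, ${user}`;
}
```

Both paths ship in the same build; which one runs is decided at the time of the call.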

Let’s look at an actual example. In this case I want to control whether a new feature to revert a pull request is available to the user. I’ve highlighted the Revert button in the screenshot.

[Image: pull request page with the Revert button highlighted]

First we need to define the feature flag. We do that by defining it in an XML file. Each service in VSTS has its own set of flags. Here’s part of the actual file that defines the feature flag for this button with the name of the feature flag highlighted.

<?xml version="1.0" encoding="utf-8"?>
<!-- In this group we should register TFS specific features and sets their states. -->
<ServicingStepGroup name="TfsFeatureAvailability" ... >
  <Steps>
    <!-- Feature Availability -->
    <ServicingStep name="Register features" stepPerformer="FeatureAvailability" ... >
      <StepData>
        <!--specifying owner to allow implicit removal of features -->
        <Features owner="TFS">
          <!-- Begin TFVC/Git -->
          <Feature name="SourceControl.Revert" description="Source control revert features" />

When we deploy the service that defined this feature flag, the deployment engine will create the feature flag in the database.
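
As a rough illustration of what such a registration step might do, the sketch below applies a list of declared features to a flag table. It assumes that registration is idempotent, that a flag's current state survives a redeploy, that new flags start off, and that the owner attribute is what drives the "implicit removal" mentioned in the XML comment. The in-memory map and all names are hypothetical stand-ins, not the real deployment engine.

```typescript
// Hypothetical shapes: an in-memory map stands in for the flag table.
interface FeatureDefinition {
    name: string;
    description: string;
}

interface FeatureRow extends FeatureDefinition {
    owner: string;
    enabled: boolean;
}

function registerFeatures(
    table: Map<string, FeatureRow>,
    owner: string,
    definitions: FeatureDefinition[],
): void {
    const declared = new Set(definitions.map((d) => d.name));

    // Implicit removal: drop flags this owner registered earlier but no longer declares.
    table.forEach((row, name) => {
        if (row.owner === owner && !declared.has(name)) {
            table.delete(name);
        }
    });

    for (const def of definitions) {
        const existing = table.get(def.name);
        // Assumption: redeploying keeps a flag's current state; new flags start off.
        table.set(def.name, {
            name: def.name,
            description: def.description,
            owner,
            enabled: existing !== undefined ? existing.enabled : false,
        });
    }
}
```

Scoping removal to the owner means one service's deployment never deletes flags that another service registered.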

Using the feature flag in code is simple. Here’s the TypeScript code that is used to create the button. I’ve combined contents from two files. The export is from a file that defines constants. The rest is from the code to create the button on the page. I’ve highlighted the flag and the button creation. In this case, there was no prior code, so if the flag is off, nothing gets added to the page.

export module FeatureAvailabilityFlags {
    export var SourceControlRevert = "SourceControl.Revert";
}

import FeatureAvailability = require("VSS/FeatureAvailability/Services");
private _addRevertButton(): void {
    if (FeatureAvailability.isFeatureEnabled(FeatureAvailabilityFlags.SourceControlRevert)) {
        this._calloutButtons.unshift(
            <button onClick={ () => Dialogs.revertPullRequest(
                this.props.repositoryContext,
                this.props.pullRequest.pullRequestContract(),
                this.props.pullRequest.branchStatusContract().sourceBranchStatus,
                this.props.pullRequest.branchStatusContract().targetBranchStatus) }
            >{ VCResources.PullRequest_Revert_Button }</button>
        );
    }
}

In addition to the web UI, the code for the MVC controller is also protected with a feature flag. I’m omitting the definition of the constant and some of the code for brevity, but as you can see the only mention of the feature flag is in the attribute on the controller, making it really easy to control the feature with a flag.

namespace Microsoft.TeamFoundation.SourceControl.WebServer
{
    [FeatureEnabled(FeatureAvailabilityFlags.SourceControlRevert)]
    public class GitRevertsController : GitApiController
    {
        [HttpPost]
        public HttpResponseMessage CreateRevert(
            [FromBody] WebApi.GitAsyncRefOperationParameters revertToCreate,
            [ClientParameterType(typeof(Guid), true)] string repositoryId,
            [ClientIgnore] string projectId = null)
        {
        }

        [HttpGet]
        public GitRevert GetRevertForRefName(
            [FromUri] string refName,
            [ClientParameterType(typeof(Guid), true)] string repositoryId,
            [ClientIgnore] string projectId = null)
        {
        }

        [HttpGet]
        public GitRevert GetRevert(
            int revertId,
            [ClientParameterType(typeof(Guid), true)] string repositoryId,
            [ClientIgnore] string projectId = null)
        {
        }
    }
}

The FeatureEnabled attribute checks to see if the specified feature flag is enabled and throws an exception if not.

    public class FeatureEnabledAttribute : AuthorizationFilterAttribute
    {
        public FeatureEnabledAttribute(string featureFlag)
        {
            this.FeatureFlag = featureFlag;
        }

        public string FeatureFlag { get; private set; }

        public override void OnAuthorization(HttpActionContext actionContext)
        {
            base.OnAuthorization(actionContext);
            TfsApiController tfsController = actionContext.ControllerContext.Controller as TfsApiController;
            if (tfsController != null)
            {
                if (!tfsController.TfsRequestContext.IsFeatureEnabled(this.FeatureFlag))
                {
                    throw new FeatureDisabledException(FrameworkResources.FeatureDisabledError());
                }
            }
        }
    }
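
For readers who work mostly on the TypeScript side, the same guard idea can be sketched as a wrapper around a request handler. This is only an analogue of the C# filter above: the error class, the context interface, and the wrapper are hypothetical names, not part of any VSTS library.

```typescript
// Hypothetical TypeScript analogue; these names are not part of the VSTS SDK.
class FeatureDisabledError extends Error {
    readonly featureFlag: string;

    constructor(featureFlag: string) {
        super(`Feature '${featureFlag}' is disabled.`);
        this.name = "FeatureDisabledError";
        this.featureFlag = featureFlag;
    }
}

interface RequestContext {
    isFeatureEnabled(flag: string): boolean;
}

// Wraps a handler so the flag is checked once, before any handler code runs.
function requireFeature<Req extends { context: RequestContext }, Res>(
    flag: string,
    handler: (request: Req) => Res,
): (request: Req) => Res {
    return (request: Req): Res => {
        if (!request.context.isFeatureEnabled(flag)) {
            throw new FeatureDisabledError(flag);
        }
        return handler(request);
    };
}

interface RevertRequest {
    context: RequestContext;
    refName: string;
}

// The handler body never mentions the flag; only the wrapper does.
const createRevert = requireFeature(
    "SourceControl.Revert",
    (request: RevertRequest) => `revert of ${request.refName}`,
);
```

As with the attribute, the benefit is that the flag check lives in one place and the handler itself stays free of flag logic.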

Other than some similar code for a menu entry for revert, the feature flag has now been added for the new feature.

Controlling feature flags

We have both PowerShell commands and a web UI to turn the feature flags off and on for different scopes.

Our PowerShell commands are what you would expect: Get-FeatureFlag and Set-FeatureFlag. Here are some examples of what they can do.

  • Getting all flags and their states globally: Get-FeatureFlag
  • Getting all flags and states for a single account: Get-FeatureFlag -Host account
  • Getting information for one flag globally: Get-FeatureFlag -Name feature_name
  • Getting information for one flag in an account: Get-FeatureFlag -Name feature_name -Host account
  • Setting a feature flag globally: Set-FeatureFlag -Name feature_name -State {Off|On|Undefined}
  • Setting a feature flag for a single account: Set-FeatureFlag -Name feature_name -State {Off|On|Undefined} -Host account

 

We also have an internal site that provides an interactive way to do the same operations. In this example, you can see that I have the feature flag for the revert feature on for one of my personal accounts and off for the other. This is a great way to be able to control the flags for accounts on demand, such as when we allowed customers to request SSH before it became broadly available.

[Image: internal site showing the revert feature flag on for one account and off for another]

Turn it off!

It’s important to have the right telemetry to monitor new features. If we find that a feature is causing problems, we can turn the feature flag off. Since there’s no deployment involved – just a script or change in the administrative web UI – we can quickly revert to the prior behavior.
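
Turning a flag off without a deployment only helps if running services pick up the change. One way to picture that, purely as a sketch, is an in-process cache that applies change notifications and tells interested code about them. How notifications actually reach each service is not described here, so the transport is left out, and the class and method names are hypothetical.

```typescript
// Hypothetical in-process cache; the notification transport is not modeled.
type FlagListener = (name: string, enabled: boolean) => void;

class FlagCache {
    private readonly states = new Map<string, boolean>();
    private readonly listeners: FlagListener[] = [];

    // Flags that were never set are treated as off.
    isEnabled(name: string): boolean {
        return this.states.get(name) ?? false;
    }

    onChange(listener: FlagListener): void {
        this.listeners.push(listener);
    }

    // Called when a change notification arrives from wherever flags are stored.
    applyChange(name: string, enabled: boolean): void {
        if (this.isEnabled(name) === enabled) {
            return; // Nothing changed, so nobody needs to be notified.
        }
        this.states.set(name, enabled);
        for (const listener of this.listeners) {
            listener(name, enabled);
        }
    }
}
```

Because every check reads the cache at call time, the next request after a change sees the new state with no restart.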

Testing

New features that are hidden behind a feature flag are deployed with the flag turned off. As we start turning on the feature for users or accounts, both the old and new code will be executing. We need to test with the feature flag both on and off to ensure that both paths work. This is also critical for ensuring the feature can be turned off quickly if something goes wrong.

Tests can easily control whether a flag is on or off by calling methods like the following.

public static void RegisterFeature(TestCollection testCollection, string featureName)
public static void UnregisterFeature(TestCollection testCollection, string featureName)
public static void SetFeatureStateForApplication(TestCollection testCollection, string featureName, bool state)
public static void SetFeatureStateForDeployment(TestCollection testCollection, string featureName, bool state)
public static bool IsFeatureEnabled(TestCollection testCollection, string featureName)

Tests can be run conditionally based on the state of a flag.

[TestMethod, Owner("buck"), Priority(1)]
[Description("Verify that Revert works correctly.")]
[RequiresFeature(FeatureAvailabilityFlags.SourceControlRevert)]
public void SourceControl_Revert()
{
   ...
}
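
As explained in the comments at the end of this post, the attribute makes sure the flag is on while the test runs and restores its original state afterward. That set-run-restore behavior can be sketched as a small helper. The store interface and helper name are hypothetical, and this is not how the attribute itself is implemented.

```typescript
// Hypothetical flag store interface; a real test would talk to the test environment.
interface FlagStore {
    get(name: string): boolean;
    set(name: string, enabled: boolean): void;
}

// Turns the flag on for the duration of the test body, then restores the prior state.
function withFeatureEnabled<T>(store: FlagStore, name: string, testBody: () => T): T {
    const original = store.get(name);
    store.set(name, true);
    try {
        return testBody();
    } finally {
        // Runs even if the test body throws, so one failing test cannot leak state.
        store.set(name, original);
    }
}
```

Restoring in a finally block matters most when tests share an environment, since a leaked flag would change the behavior of every test that runs later.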

Since most feature flags are used initially with a default state of off, it’s also easy to set them in the test environment for the test run.

<?xml version="1.0" encoding="utf-8"?>
<TestEnvironment>
  <TestVariables>
    <Value Key="SetTfsFeaturesOn">SourceControl.Revert</Value>
  </TestVariables>
</TestEnvironment>

Stages

I mentioned earlier that we can use feature flags to get feedback. We also use them to allow our own team to begin using the feature to help flush out bugs (it’s great to build the service that we use). Rather than have every team invent their own process, we established a process to roll out features in a standardized way. This provides an opportunity to gather feedback and bugs early on in the development process.

We have standard stages that we’ve defined that each team can use. How quickly a feature goes through the stages depends on the scope of the feature, feedback, and telemetry. The stages include an increasingly broad group of users with increasingly diverse perspectives.

Stage 0 - Canary
This is the first phase and includes the account used by the VSTS team plus select internal accounts. Program managers are responsible for sending out communication to the users once the feature flags are enabled.

Stage 1 - MVPs & Select Customers
This is the second phase and will include MVPs and select top customers who have opted in. Program managers are again responsible for emailing the users.

Stage 2 - Private Preview
Private preview is used for major new features and services and is designed to test new features with a broader set of customers that we don't have regular contact with. There are many ways to collect a list of customers for a private preview - from forum interaction, blog comments, etc. We've also done invitation codes, publicized email aliases for customers to request access, as we did for SSH access, and sometimes create in-product "opt-in" experiences for customers, as we’ve done for the new navigation UI. Individual teams will manage and communicate directly with their private preview customers.

Stage 3 - Public Preview
Public preview is a state reserved for major new features and services where we want to gather feedback but are not yet ready to provide a full SLA, etc. Public Preview features are enabled for all VSTS customers but their main entry points in the UI are annotated with a "preview" designation so that customers understand this is not a completed feature. When a feature enters public preview, it is announced in the VSTS news feed and may also be accompanied by marketing communication, depending on the feature.

Stage 4 - General Availability (GA)
GA denotes when a feature/service is available to all customers and fully supported (SLA, etc).
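
The widening audience can be sketched as an ordered list of stages plus a lookup of which accounts are included at the current stage. The stage names follow the list above, but the account model, including the idea of recording each account's earliest stage, is an assumption made for illustration.

```typescript
// Stage names follow the list above; the account model is hypothetical.
const STAGES = [
    "Canary",
    "MvpsAndSelectCustomers",
    "PrivatePreview",
    "PublicPreview",
    "GeneralAvailability",
] as const;
type Stage = (typeof STAGES)[number];

interface Account {
    name: string;
    // Earliest stage at which this account is included (assumed to be recorded somewhere).
    earliestStage: Stage;
}

// Returns the accounts that should have the flag on once a feature reaches `current`.
function accountsToEnable(accounts: Account[], current: Stage): string[] {
    const currentIndex = STAGES.indexOf(current);
    return accounts
        .filter((account) => STAGES.indexOf(account.earliestStage) <= currentIndex)
        .map((account) => account.name);
}
```

From public preview onward the feature is enabled for all customers, so at that point a single global setting would likely replace any per-account list like this one.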

Cleaning up flags

It’s easy to accumulate a lot of flags that are no longer used, so after a feature has been in production and fully available, teams decide when to delete the feature flag and the old code path. This may happen a few weeks or a few months later, depending on the scope of the feature.

Events

One of our goals is to allow for features to be unveiled at events in some cases. If you want to unveil a feature at an event, when do you enable it?

We learned the hard way that it’s not the morning of the event. Several years ago at the Connect 2013 event we turned on a large set of feature flags just before a keynote and demo. The service was unusable. By turning on feature flags at the same time, we had a large amount of new code interacting with the system under production load, and the system fell apart. You can read about details of the incident in Brian’s post, A Rough Patch. At that time, we didn’t have stages, and we only had one public scale unit.

As a result of that experience, we ensure that feature flags are enabled in production at least 24 hours ahead of an event so that the code is under full production load. That leaves us time to react, whether to fix problems or disable features, before the event starts. Of course, this means that new features could be discovered early. That’s certainly a possibility, but it’s mitigated by controlling the entry points, making the new features hard to find except for customers who’ve gotten early access or who know where to look (perhaps by setting a special cookie). Some of it comes down to a judgment call.

We followed this policy when we unveiled the Marketplace at the Connect 2015 event. In contrast to the event in 2013, everything worked as it should during the event.

Summary

Feature flags have become a critical part of how we roll out features, get feedback, and allow engineering and marketing to proceed on their own schedules. It’s hard to imagine DevOps services without them!

While we built our own implementation, you don’t have to. LaunchDarkly offers feature flags as a service, including a LaunchDarkly VSTS extension to integrate with VSTS work items and releases.

Follow me at twitter.com/tfsbuck


Comments (9)

  1. Erik HT says:

    One thing about feature flags that has always confused me is this. How exactly do you implement both features if the feature requires database changes? Let’s say the feature requires changing the primary key of table from a string to an int? Or maybe it requires a whole bunch of new lookup table data? How would you be able to just “turn off and on” something like this? I ask because it seems 90+% of features I’ve worked on fall into this category.

    1. Buck Hodges says:

      Erik, it depends on the situation and how much you are willing to invest in it. One of the largest changes we’ve done this way was with Work Item Tracking. WIT originally used a wide schema where every new field in a work item process meant a new column in the tables that store work items. That’s a problem at scale since the number of columns is limited to 1024. We changed that to a long schema where each field becomes a row instead. When we rolled that out, we actually ran both the old and the new in parallel. For accounts where we turned on the new one, we used jobs to backfill the new tables with the existing data and installed triggers to keep it up to date. Then we were able to use both at the same time (for example, write to both schemas and measure performance of the new one). We could switch accounts to the new schema and switch back if we had to. If it sounds like a lot of work, it was. However, we were rewarded by having that roll out go very smoothly as a result. While this was one of our bigger changes, there’s always some amount of extra work that’s required if the DB schema is also changing.

  2. Thanks for the very interesting post. You certainly got me interested in the concept. :-)

  3. Hamid Shahid says:

    Excellent post. Thanks for sharing how VSTS team uses feature toggles.

    One of the challenges we often face in this approach is how to deal with schema changes, where the format of the table e.g. is modified. or the message format is modified. Would be really interesting if you could shed some light on it.

    1. Buck Hodges says:

      Hamid, thank you. I will add going deeper on DB changes behind feature flags to my list for a follow up post.

      1. s says:

        never did go into that? ;)

  4. Aseem says:

    Great post! :)

  5. How did you have the test run conditionally based on this tag ?
    Do you just check for this in the TestInitialize method to do Assert.Inconclusive ?
    Or did you do more internal modification ?

    1. Buck Hodges says:

      Frederic, the attribute will ensure that the feature flag is on when the test is run and then restore the feature flag to its original state when the test is finished.
