Intelligent routing on Service Fabric with Træfik

This post was coauthored by Joni Collinge and Lawrence Gripper

We recently worked with several customers to explore adding Service Fabric support to Træfik, a popular open source reverse proxy, as an edge proxy for Service Fabric applications.

Edge reverse proxies sit between the web and your applications and services. They provide a way to control, route and augment web requests. They typically offer advanced HTTP functions such as TLS termination, conditional routing, circuit breakers, dynamic load balancing and content caching. This allows you to centralize these functions in a common layer, keeping your services clean and simple. In a microservices architecture, reverse proxies are commonly used to aggregate many different service APIs into a single logical API while enforcing routing and security policies.

So why Træfik? Træfik is a reverse proxy that has been developed with microservices in mind. Its built-in support for dynamic configuration, circuit breakers and smart load balancing makes it ideal for orchestrators such as Service Fabric. We’d had good experiences working with Træfik on Kubernetes and wanted to see whether we could replicate them on Service Fabric. We submitted a speculative pull request (PR) and got chatting to the owners of the Træfik project, Containous, who helped us improve the integration and make it more idiomatic.


Træfik is designed with pluggability in mind. It already had support for various orchestrators, including Docker Swarm, Kubernetes, Marathon and Consul, with each supported orchestrator having its own provider.

Træfik defers responsibility for getting the details of the services running on the orchestrator to these providers. The provider periodically builds a view of the cluster and passes it back to the Træfik server to apply to its routing. This service resolution happens in the background, meaning that Træfik can forward incoming requests directly without having to re-resolve the endpoint should something change. To add Service Fabric support, we wrote a custom provider that implements this service resolution pattern against the Service Fabric API. At the time, providers had to be compiled into the Træfik binary rather than invoked via a plugin model. This meant we had to contribute our provider back into Træfik’s upstream code base. Træfik is written in the open source language Go. As Service Fabric doesn’t supply a Go SDK, we wrote our own minimal SDK on top of the management REST API, which is available here:

The new Service Fabric provider imports the SDK package and uses it to poll Service Fabric for the current state of services. It also checks to make sure services are healthy and are intended to be exposed before adding them to the configuration.
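The polling pattern described above can be sketched in a few lines of Go. This is a minimal illustration and not the provider's actual code: the Service and Config types, buildConfig, poll and the fetch callback are all hypothetical stand-ins, and the real provider talks to the Service Fabric REST API rather than a hard-coded list.

```go
package main

import (
	"fmt"
	"time"
)

// Service is a hypothetical, simplified view of a Service Fabric service.
type Service struct {
	Name    string
	Healthy bool
	Exposed bool
}

// Config is a hypothetical stand-in for the routing configuration
// that a provider hands back to the proxy.
type Config struct {
	Backends []string
}

// buildConfig keeps only services that are healthy and meant to be exposed.
func buildConfig(services []Service) Config {
	cfg := Config{}
	for _, s := range services {
		if s.Healthy && s.Exposed {
			cfg.Backends = append(cfg.Backends, s.Name)
		}
	}
	return cfg
}

// poll calls fetch on every tick and sends the resulting configuration
// on out until stop is closed.
func poll(fetch func() []Service, interval time.Duration, out chan<- Config, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			select {
			case out <- buildConfig(fetch()):
			case <-stop:
				return
			}
		case <-stop:
			return
		}
	}
}

func main() {
	// fetch stands in for a call to the cluster's management API.
	fetch := func() []Service {
		return []Service{
			{Name: "fabric:/App/Web", Healthy: true, Exposed: true},
			{Name: "fabric:/App/Worker", Healthy: true, Exposed: false},
			{Name: "fabric:/App/Broken", Healthy: false, Exposed: true},
		}
	}
	out := make(chan Config)
	stop := make(chan struct{})
	go poll(fetch, 10*time.Millisecond, out, stop)

	cfg := <-out // the proxy side would apply this to its routing table
	close(stop)
	fmt.Println(cfg.Backends)
}
```

Because the configuration arrives over a channel, the request path never waits on the cluster API; it simply keeps using the last configuration it received.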


Each provider has an associated default template file. The template file uses Go’s templating language to dynamically generate a valid Træfik configuration given some data (in this case an array of services) and a map of functions to invoke. These functions extend the templating language and are defined in the custom provider. For example:

{{range $service := .Services}}
  {{if isEnabled $service}}
    {{range $partition := $service.Partitions}}
      {{if eq $partition.ServiceKind "Stateless"}}
        {{if hasLoadBalancerLabel $service}}
          method = "{{getLoadBalancerMethod $service}}"
        {{end}}
      {{end}}
    {{end}}
  {{end}}
{{end}}

This snippet loops through each service, and then through each partition of that service, using the built-in range action. It checks the partition’s ServiceKind property to see whether it is “Stateless” or “Stateful” before starting to define the properties for a new Træfik backend. The provider renders this template file into a Træfik configuration object and sends it back to the Træfik server using a Go channel. If the default Service Fabric template file does not meet your needs, you can provide a custom template file by following the instructions here:
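Separately from those instructions, the template mechanism itself can be tried out with Go's standard text/template package. The sketch below registers three functions with the same names as in the excerpt and renders a cut-down template. It is only an illustration under stated assumptions: the Service and Partition types, the function bodies, the render helper and the traefik.backend.loadbalancer.method label key are assumptions made for this example, and the output is a fragment rather than a complete Træfik configuration.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// Partition and Service are hypothetical, simplified data types.
type Partition struct {
	ServiceKind string
}

type Service struct {
	Name       string
	Labels     map[string]string
	Partitions []Partition
}

// lbLabel is an assumed label key, used here only for illustration.
const lbLabel = "traefik.backend.loadbalancer.method"

// tmpl mirrors the structure of the excerpt; "-}}" trims the
// whitespace that follows an action.
const tmpl = `{{range $service := .Services -}}
{{if isEnabled $service -}}
{{range $partition := $service.Partitions -}}
{{if eq $partition.ServiceKind "Stateless" -}}
{{if hasLoadBalancerLabel $service -}}
# {{$service.Name}}
method = "{{getLoadBalancerMethod $service}}"
{{end}}{{end}}{{end}}{{end}}{{end}}`

// render executes tmpl against services using a custom function map.
func render(services []Service) (string, error) {
	funcs := template.FuncMap{
		"isEnabled": func(s Service) bool {
			return s.Labels["traefik.expose"] == "true"
		},
		"hasLoadBalancerLabel": func(s Service) bool {
			_, ok := s.Labels[lbLabel]
			return ok
		},
		"getLoadBalancerMethod": func(s Service) string {
			return s.Labels[lbLabel]
		},
	}
	// Funcs must be called before Parse so the names resolve.
	t, err := template.New("servicefabric").Funcs(funcs).Parse(tmpl)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	data := struct{ Services []Service }{Services: services}
	if err := t.Execute(&buf, data); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	out, err := render([]Service{{
		Name:       "fabric:/App/Web",
		Labels:     map[string]string{"traefik.expose": "true", lbLabel: "drr"},
		Partitions: []Partition{{ServiceKind: "Stateless"}},
	}})
	if err != nil {
		fmt.Println("render failed:", err)
		return
	}
	fmt.Print(out)
}
```

Services without the traefik.expose label, and partitions that are not stateless, produce no output in this sketch.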


Most of the other orchestrators supported by Træfik have the concept of labels: metadata you can attach to services as key-value pairs. Service Fabric does not have a canonical method for adding arbitrary metadata to a service in a way that is then accessible via its REST API. We tried a number of options before settling on the <Extensions> tag under <ServiceTypes> in the ServiceManifest.xml. This allowed us to embed a schema-less section of XML in which we could define labels.

  <StatelessServiceType ServiceTypeName="WebServicePkg" UseImplicitHost="true">
    <Extensions>
      <Extension Name="Traefik">
        <Labels xmlns="">
          <Label Key="traefik.frontend.rule.default">PathPrefixStrip: /v1</Label>
          <Label Key="traefik.expose">true</Label>
        </Labels>
      </Extension>
    </Extensions>
  </StatelessServiceType>
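To make the label format concrete, here is a minimal Go sketch that reads an Extension fragment like the one above with the standard encoding/xml package and flattens it into key-value pairs. The struct names, sampleExtension and parseLabels are hypothetical, and the sketch assumes the fragment is already available as a string; how the provider actually retrieves and parses it is not shown in this post.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// label maps one <Label Key="...">value</Label> element.
type label struct {
	Key   string `xml:"Key,attr"`
	Value string `xml:",chardata"`
}

// extension maps the <Extension> element and its nested labels.
type extension struct {
	Name   string  `xml:"Name,attr"`
	Labels []label `xml:"Labels>Label"`
}

// sampleExtension repeats the fragment from the manifest excerpt.
const sampleExtension = `<Extension Name="Traefik">
  <Labels xmlns="">
    <Label Key="traefik.frontend.rule.default">PathPrefixStrip: /v1</Label>
    <Label Key="traefik.expose">true</Label>
  </Labels>
</Extension>`

// parseLabels turns an <Extension> fragment into a map of labels.
func parseLabels(fragment string) (map[string]string, error) {
	var ext extension
	if err := xml.Unmarshal([]byte(fragment), &ext); err != nil {
		return nil, err
	}
	labels := make(map[string]string, len(ext.Labels))
	for _, l := range ext.Labels {
		labels[l.Key] = l.Value
	}
	return labels, nil
}

func main() {
	labels, err := parseLabels(sampleExtension)
	if err != nil {
		fmt.Println("parse failed:", err)
		return
	}
	fmt.Println(labels["traefik.expose"], labels["traefik.frontend.rule.default"])
}
```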

The ServiceManifest.xml is shared across all instances that are created for a given service type. This means that any labels you define are also going to be shared. This is not ideal if you plan to run multiple service instances, so we added the ability to optionally provide run-time overrides that can be used to dynamically update a service’s labels. This is done using the Service Fabric Property Management API. You can use standard tools such as curl to call the API to set new labels, and we also added support to the Service Fabric CLI tool, sfctl, to make it easier:

sfctl property put --name "GettingStartedApplication/WebService" --property-description "{\"PropertyName\":\"traefik.frontend.rule.default\",\"Value\":{\"Kind\":\"String\",\"Data\":\"PathPrefixStrip: /a/path/to/strip\"}}"
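If you would rather set an override from code than from the command line, the JSON passed to --property-description is straightforward to build. The Go sketch below produces the same payload as the command above; the type and function names are hypothetical, and the HTTP call, endpoint and authentication needed to actually submit it are deliberately left out.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// propertyValue and propertyDescription mirror the JSON shape used in
// the sfctl example; the Go names themselves are hypothetical.
type propertyValue struct {
	Kind string `json:"Kind"`
	Data string `json:"Data"`
}

type propertyDescription struct {
	PropertyName string        `json:"PropertyName"`
	Value        propertyValue `json:"Value"`
}

// labelOverride builds the JSON description for a string-valued label.
func labelOverride(key, value string) (string, error) {
	desc := propertyDescription{
		PropertyName: key,
		Value:        propertyValue{Kind: "String", Data: value},
	}
	b, err := json.Marshal(desc)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	body, err := labelOverride("traefik.frontend.rule.default", "PathPrefixStrip: /a/path/to/strip")
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Println(body)
}
```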


We wanted our integration to support blue-green deployments, A/B testing and canary releases. We were able to achieve this by using a Service Fabric Application Package as the unit of deployment and manipulating Træfik’s concept of priorities. Priorities allow you to order the set of routing rules you apply, with higher priority rules masking those below.
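The masking effect of priorities can be shown with a toy model in Go. This is a simplified sketch and not Træfik's matching code: the Rule type, the prefix-only matching and the route function are assumptions made for illustration, and weighted traffic splitting for canary releases is not covered.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Rule is a hypothetical routing rule: requests whose path starts with
// PathPrefix go to Backend, and higher Priority wins.
type Rule struct {
	PathPrefix string
	Priority   int
	Backend    string
}

// route returns the backend of the highest-priority rule matching path.
func route(rules []Rule, path string) (string, bool) {
	// Sort a copy so the caller's slice is left untouched.
	sorted := append([]Rule(nil), rules...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].Priority > sorted[j].Priority
	})
	for _, r := range sorted {
		if strings.HasPrefix(path, r.PathPrefix) {
			return r.Backend, true
		}
	}
	return "", false
}

func main() {
	// "Blue" is the version currently serving traffic.
	rules := []Rule{{PathPrefix: "/api", Priority: 10, Backend: "blue"}}
	before, _ := route(rules, "/api/orders")

	// Deploying "green" with a higher priority masks the blue rule.
	rules = append(rules, Rule{PathPrefix: "/api", Priority: 20, Backend: "green"})
	after, _ := route(rules, "/api/orders")

	fmt.Println(before, after)
}
```

In this model, rolling back is just a matter of removing the green rule or lowering its priority, at which point the blue rule matches again.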

Examples of how to achieve this can be found in our demos:

Insights and Operations

When operating complex microservice architectures it can be very difficult to aggregate your web server telemetry in a standardized way. As all requests will pass through a Træfik instance, it is a great place to report access logs and HTTP metrics. However, having routing centralized in a single layer raises concerns regarding availability. How do you ensure this layer is behaving as expected, how do you get notified if it is not and how can you troubleshoot it?

We looked to address these concerns by separating log aggregation from availability monitoring. For availability monitoring, Træfik exposes a metrics endpoint that can be used to obtain cumulative statistics. We wrote a watchdog service which fetches Træfik’s metrics and sends them to Application Insights. It runs as an isolated process so that it can identify failures even when Træfik is unresponsive. The watchdog also tests Træfik’s routing by periodically sending itself a request through Træfik. It uses a nonce, a cookie and sticky sessions to ensure that the same watchdog receives the request and that the payload matches what it sent. These synthetic requests are timed and also shipped off to Application Insights to allow operational alerts to be configured. All information is tagged with additional metadata to allow a targeted response in case of an issue.

For log aggregation, we evaluated several options but ran into a few problems. Træfik has been developed in a Unix-centric way. By default, the Træfik process writes logs to stdout and stderr. You can also add configuration to make Træfik write logs to a file. The expectation is that the log file will then be processed by external tools such as logrotate. The problem here is that Windows and Linux treat files very differently.

On Linux it is possible to rename the log file whilst the process is writing to it, i.e. mv traefik.log traefik.0.log. You can then prompt Træfik to reopen the original log file path by sending the process a USR1 signal. Thus, you can easily rotate the log files with no data loss and then process the archived files separately. On Windows, you cannot rename a file that is locked by another process. You can use tools such as logrotatewin with the copytruncate configuration to copy the contents of the log file to a new file and then truncate the original, bypassing the file lock. However, this method is susceptible to losing any data written to the file once the copy and reset operation has begun.

We did investigate adding support for Windows Event Objects to Træfik that could prompt it to close and reopen its log file whilst the log file was rotated. Again, this method suffered from potential data loss as it had to perform the log rotation in separate steps, i.e. close, move, open. We contemplated adding support for ETW directly to Træfik but this felt a little too intrusive. Then we discovered that the logging framework Træfik uses, Logrus, supports custom hooks. We are currently investigating adding support to our provider for Application Insights using a custom Logrus hook we wrote:

Wrapping up

Træfik’s pluggable architecture allowed us to create a lightweight integration with Service Fabric, which lets users benefit from Træfik’s advanced feature set. The integration has been merged into Træfik’s upstream, which means Service Fabric users can simply adopt upstream Træfik, version 1.5 or greater. This is the initial stage of the integration, and we will continue to work with customers to make sure users have a great experience using Træfik on Service Fabric. If you’re interested in trying the integration out, head over to and have a play. We encourage you to evaluate the work, raise issues and contribute changes to help build, test and grow the project. We have a #service-fabric channel on the official Træfik Slack, so please feel free to come say "Hi":

