Regular readers of this blog will be aware that Azure App Service Environment (ASE) is available in Azure Government and that it offers many advantages for Government Web Applications. For mission critical applications, it may be a requirement to fail over to an alternative region in the event of an application and/or data center outage. In this blog post, I will discuss how Azure Traffic Manager (ATM) can assist with this, even for web applications with private endpoints (in an ASE), in spite of it being designed for applications with public endpoints. As is often the case, I will be demonstrating this for Azure Government, but it will work the same in Azure Commercial. If you are not familiar with ATM, I would encourage you to have a look at the documentation before reading on.
A Traffic Manager Profile in Azure will have a frontend FQDN (e.g. mytm.usgovtrafficmanager.net in Azure Government) and a number of endpoints. These endpoints can point to Azure Resources or any FQDN or IP address. The traffic manager will probe these endpoints at regular intervals and send traffic to the endpoints that are healthy. Traffic doesn't actually flow through the ATM. It is a DNS load balancer: it just resolves client DNS requests to different endpoints. How the traffic is distributed can be based on performance, priority, etc. This is very useful for applications with public endpoints, where the traffic manager will make sure that traffic is routed to a healthy endpoint.
If your application doesn't have a public endpoint, the ATM probes will fail. At first glance this would mean that ATM has no utility for private endpoint applications. However, if we look closer at the ATM documentation on endpoint monitor status, we see that in the case where an endpoint is "Enabled" and in a "Degraded" state (i.e., the probe cannot reach the site), the rule is:
"Endpoint monitoring health checks are failing. The endpoint is not included in DNS responses and does not receive traffic.
An exception to this is if all endpoints are degraded, in which case all of them are considered to be returned in the query response."
This means that as long as all the enabled endpoints are unhealthy, ATM will include all of them in the DNS query response. With private endpoints the probes always fail, so this exception always applies. By making sure that only one endpoint is enabled at a time, we can direct traffic to one specific site. If we have more endpoints enabled, traffic will flow to all of them. We can use this to have an easy failover switch in the form of a traffic manager. I will first describe how to do this manually and in a later blog post add some automation to orchestrate the failover.
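To make the rule concrete, here is a minimal sketch in Python that models it. This is not Traffic Manager code or an Azure API; the Endpoint class and the endpoints_in_dns_response function are hypothetical names introduced only for illustration, and the model ignores routing methods such as priority or performance.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Endpoint:
    """Hypothetical model of a Traffic Manager endpoint (illustration only)."""
    name: str
    enabled: bool   # endpoint status: Enabled / Disabled
    healthy: bool   # False models the "Degraded" monitor status


def endpoints_in_dns_response(endpoints: List[Endpoint]) -> List[Endpoint]:
    """Model of the documented rule: degraded endpoints are left out,
    unless every enabled endpoint is degraded, in which case all enabled
    endpoints are returned."""
    enabled = [e for e in endpoints if e.enabled]
    healthy = [e for e in enabled if e.healthy]
    return healthy if healthy else enabled


# Private endpoints can never be probed, so healthy is always False here.
normal = [Endpoint("virginia", True, False), Endpoint("texas", False, False)]
failed_over = [Endpoint("virginia", False, False), Endpoint("texas", True, False)]
both_enabled = [Endpoint("virginia", True, False), Endpoint("texas", True, False)]

print([e.name for e in endpoints_in_dns_response(normal)])        # ['virginia']
print([e.name for e in endpoints_in_dns_response(failed_over)])   # ['texas']
print([e.name for e in endpoints_in_dns_response(both_enabled)])  # ['virginia', 'texas']
```

The three cases correspond to normal operation, a failover, and the situation where both endpoints are left enabled and clients may be sent to either region.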
To illustrate the scenario, let's consider the following configuration:
In this configuration, we have two ASEs (ILB configuration): one in usgovvirginia and one in usgovtexas. They each serve a simple website with an indication of where the response is coming from (see below). If you need some templates for setting up ASE configurations, have a look at my iac repository on GitHub. The client would be located somewhere in a network peered with both these networks. It could be a separate virtual network or even an on-premises network connected to Azure. We also have a traffic manager profile and we have set up the following DNS records (you need some sort of public DNS server that the traffic manager can reach):
The idea here is that under normal circumstances, when a client attempts to access http://tmsite.cloudynerd.us, they will be sent to 10.0.1.11, and in case we need to fail over, we would like to have an easy switch to send clients to 10.1.2.11 instead. The traffic manager has been configured with two endpoints:
Notice that the Virginia endpoint is in a "Degraded" state, but it is the only one that is "Enabled". So if we try to access http://tmsite.cloudynerd.us, we will get something like:
If we would like to fail over to Texas, we can simply enable the Texas endpoint and disable Virginia:
It will take a couple of minutes for this to propagate, and you may want to do an ipconfig /flushdns. After that, traffic will be going to the Texas endpoint:
You can (of course) easily fall back to the primary region by changing the status of the ATM endpoints.
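The manual switch can also be scripted. The following is a minimal sketch, not the automation promised for the follow-up post. It assumes the Azure CLI is installed and signed in to the right cloud, and it builds calls to az network traffic-manager endpoint update with the --endpoint-status option. The resource group name (my-rg), the endpoint names (virginia, texas), and the externalEndpoints type are assumptions for illustration; substitute the values from your own profile. By default the script only prints the commands.

```python
import subprocess
from typing import List


def endpoint_status_command(resource_group: str, profile: str, endpoint: str,
                            status: str,
                            endpoint_type: str = "externalEndpoints") -> List[str]:
    """Build the Azure CLI call that sets one endpoint to Enabled or Disabled."""
    if status not in ("Enabled", "Disabled"):
        raise ValueError(f"unexpected status: {status}")
    return [
        "az", "network", "traffic-manager", "endpoint", "update",
        "--resource-group", resource_group,
        "--profile-name", profile,
        "--name", endpoint,
        "--type", endpoint_type,        # assumption: endpoints are external
        "--endpoint-status", status,
    ]


def failover_commands(resource_group: str, profile: str,
                      activate: str, deactivate: str) -> List[List[str]]:
    """Enable the target endpoint, then disable the current one
    (same order as the manual steps in the text)."""
    return [
        endpoint_status_command(resource_group, profile, activate, "Enabled"),
        endpoint_status_command(resource_group, profile, deactivate, "Disabled"),
    ]


def run(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print the commands, or execute them when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


# Hypothetical names: resource group "my-rg", endpoints "virginia" and "texas".
to_texas = failover_commands("my-rg", "mytm", activate="texas", deactivate="virginia")
run(to_texas)  # dry run: prints the two az commands without executing them
```

Falling back to the primary region is the same call with the two endpoint names swapped. Enabling the new endpoint before disabling the old one means that, for a short time, both regions may be returned in DNS responses.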
And that's it. We have demonstrated that Azure Traffic Manager can be used to fail over to a secondary region even if the endpoints are private. In this simple example, we did the switch between regions manually, and that may well suffice for many applications, but the proposed setup sets the stage for automation, which I will dive into in a subsequent blog post. Specifically, we can leverage some of the serverless features in Azure to set the status of the endpoints and thus automate the failover.
Let me know if you have questions/comments/suggestions.