One Thing You Must Do When: Service Endpoint is showing Offline in Azure Traffic Manager

Recently I came across several cases where service endpoints were showing as Offline/Degraded in Traffic Manager. The endpoints were valid and accessible, and the respective services were running fine, yet the endpoints still showed as offline in Traffic Manager.

image

In the screenshot above, the endpoint is accessible, but Traffic Manager shows it as Degraded.

Interestingly, the cause was the same in all of those cases: the endpoint in question was returning an HTTP response code other than 200 (302, 307, etc.).

Traffic Manager expects an HTTP 200 response from each endpoint you have configured, and it expects it within 10 seconds. If that doesn't happen, the monitoring system either retries (when there is no response within 10 seconds) or assumes that the endpoint is not available.

From MSDN:

Traffic Manager only considers an endpoint to be Online if the return message is a 200 OK. If a non-200 response is received, it will assume the endpoint is not available and will count this as a failed check

The monitoring system performs a GET, but does not receive a response in 10 seconds or less. It then performs three more tries at 30 second intervals. This means that at most, it takes approximately 1.5 minutes for the monitoring system to detect when a service becomes unavailable. If one of the tries is successful, then the number of tries is reset. Although not shown in the diagram, if the 200 OK response(s) come back greater than 10 seconds after the GET, the monitoring system will still count this as a failed check.
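To make the rule in the quoted text concrete, here is a minimal Python sketch of how a single health check is judged. It is only an illustration of the documented behaviour, not Traffic Manager's actual implementation; the function name `probe_succeeded` and the constant name are my own.

```python
from typing import Optional

# From the quoted documentation: a response must arrive within 10 seconds.
PROBE_TIMEOUT_SECONDS = 10.0


def probe_succeeded(status_code: Optional[int], elapsed_seconds: float) -> bool:
    """Return True only for a 200 OK received within the timeout.

    status_code is None when no response was received at all.
    Any other status (302, 307, 401, 500, ...) counts as a failed check,
    and so does a 200 that arrives too late.
    """
    if status_code is None:
        return False
    return status_code == 200 and elapsed_seconds <= PROBE_TIMEOUT_SECONDS


if __name__ == "__main__":
    # (status, seconds) pairs chosen purely for illustration.
    for status, seconds in [(200, 0.4), (302, 0.1), (200, 12.0), (None, 10.0)]:
        print(status, seconds, probe_succeeded(status, seconds))
```

Note that the redirect case fails even though it is by far the fastest response: the speed of the reply does not matter if the status is not 200.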

In the example above, a redirection rule configured on the website was returning HTTP 302, and so the endpoint was showing as Degraded. After the redirection was removed, the endpoint became Online again.

image

This may look like a very obvious concept, but on a bad day it can turn out to be really tricky (speaking from past experience).

Common scenarios where you may run into this:

1. Any sort of redirection rule on the endpoint (default page, SSL redirection, etc.)

2. Any custom authentication mechanism or URL Rewrite rule that doesn't return HTTP 200 for the first request

3. The home page takes more than 10 seconds to load
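A browser hides most of these cases because it follows redirects silently, so the site "looks fine". One way to see what a monitoring probe would see is to issue a plain GET that does not follow redirects and to time it. The sketch below does that with Python's standard `http.client`; the function names `raw_probe` and `looks_healthy` are hypothetical helpers of mine, and the 10-second limit is taken from the documentation quoted earlier.

```python
import http.client
import time
from typing import Optional, Tuple


def raw_probe(host: str, path: str = "/", port: int = 80,
              use_tls: bool = False,
              timeout: float = 10.0) -> Tuple[Optional[int], float]:
    """Send one GET without following redirects.

    Returns (status_code, elapsed_seconds); status_code is None if the
    request failed or timed out. http.client never follows redirects,
    so a 302/307 is reported as-is.
    """
    conn_cls = http.client.HTTPSConnection if use_tls else http.client.HTTPConnection
    # Note: this timeout applies to each socket operation, not the whole request.
    conn = conn_cls(host, port, timeout=timeout)
    start = time.monotonic()
    try:
        conn.request("GET", path)
        status = conn.getresponse().status
    except (OSError, http.client.HTTPException):
        status = None
    finally:
        conn.close()
    return status, time.monotonic() - start


def looks_healthy(status: Optional[int], elapsed: float,
                  timeout: float = 10.0) -> bool:
    """Apply the rule described above: 200 OK, received within the timeout."""
    return status == 200 and elapsed <= timeout
```

For example, calling `raw_probe("myapp.example.com", "/", port=443, use_tls=True)` (the hostname is a placeholder) against a site with an SSL or default-page redirect would typically return a 301/302 status, which `looks_healthy` reports as unhealthy. Pointing the monitoring path at a page that answers 200 directly avoids the problem.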

Further references:

Details about the WATM probe and how it determines when the primary service is down and when to failover

Overview of how WATM works

How the different policies work, including details on how failover works

Hope this helps!

Ashish Goyal