Heartbeats, Recovery, and the Load Balancer

Some of the more common questions I get are around heartbeats/probes, how the fabric recovers from failed probes, and how the load balancer manages traffic to these instances.


Q: How does the fabric know that an instance has failed, and what actions does it take to recover that instance?

A: There is a chain of heartbeat probes between the fabric and the instance: Fabric <-> Host Agent <-> Guest Agent (WaAppAgent.exe) <-> Host Bootstrapper (WaHostBootstrapper.exe) <-> Host Process (typically WaIISHost.exe or WaWorkerHost.exe).

    1. If the Fabric <-> Host Agent probe fails then the fabric will attempt to restart the host.  If a restart fails to resolve the problem, heuristics in the fabric take progressively more aggressive actions, until ultimately the fabric may determine that the server itself is bad, at which point it will create a new host on a new server and start all of the affected guest VMs on that new host.
    2. If the Host Agent <-> Guest Agent probe fails then the Host will attempt to restart the Guest OS, again with a set of heuristics for additional actions, including attempting to start that Guest VM on a new server.  If the Host <-> Guest probe succeeds then the fabric takes no further action on that instance and any further recovery is handled by the guest agent within the VM.
    3. The only recovery action that the guest agent will take is to restart the host stack (WaHostBootstrapper and all of its children) if one of the child processes crashes.  If the probe times out then the guest agent assumes the host process is busy working and lets it continue running indefinitely.  The guest agent will not restart the VM as part of a recovery process. 
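
A practical consequence of #3 for your own code: the host process is one of those children, so if your RoleEntryPoint.Run method returns, the host process exits and the guest agent restarts the host stack, recycling the instance.  A minimal worker role sketch that keeps Run blocked for the life of the instance:

    using System;
    using System.Threading;
    using Microsoft.WindowsAzure.ServiceRuntime;

    public class WorkerRole : RoleEntryPoint
    {
        public override void Run()
        {
            // Run must block for the lifetime of the instance.  If this
            // method returns, the host process exits and the guest agent
            // restarts the host stack, recycling the role instance.
            while (true)
            {
                // ... do work ...
                Thread.Sleep(TimeSpan.FromSeconds(30));
            }
        }
    }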

See https://blogs.msdn.com/b/kwill/archive/2011/05/05/windows-azure-role-architecture.aspx for more information about the processes and probes on the Guest OS.


Q: How does the load balancer know when an instance is unhealthy?

A: There are two different mechanisms the load balancer can use to determine instance health, and therefore whether or not to include that instance in the round-robin rotation and send new traffic to it.

    • The default mechanism is that the load balancer sends probes to the Guest Agent to request the instance health.  If the Guest Agent returns anything besides 'Ready' then the load balancer will mark that instance as unhealthy and remove it from the rotation.  Looking back at the heartbeats from the guest agent to the host process, this means that if any of the processes running in the Guest OS has crashed or hung then the guest agent will not return Ready and the instance will be removed from the LB rotation.  (A sketch of reporting a non-Ready status yourself follows this list.)
    • The other mechanism is for you to define a custom LoadBalancerProbe in your service definition.  A LoadBalancerProbe gives you much more control over how the load balancer determines instance health and allows you to more accurately reflect the status of your service, in particular the health of w3wp.exe and any other external dependencies your service has.  Make sure your probe path is not a simple HTML page, but actually includes logic to determine your service health (e.g., try to connect to your SQL database); a sample probe handler sketch also follows this list.
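
With the default mechanism you can also take an instance out of rotation deliberately: the guest agent raises the RoleEnvironment.StatusCheck event each time it gathers instance health, and calling SetBusy on the event arguments reports Busy instead of Ready.  A minimal sketch, where dependenciesHealthy is a stand-in for whatever health state your role tracks:

    using Microsoft.WindowsAzure.ServiceRuntime;

    public class WebRole : RoleEntryPoint
    {
        // Placeholder for your own health tracking.
        private volatile bool dependenciesHealthy = true;

        public override bool OnStart()
        {
            RoleEnvironment.StatusCheck += (sender, e) =>
            {
                // Reporting Busy (instead of Ready) takes this instance out
                // of the load balancer rotation until a later check reports
                // Ready again.
                if (!dependenciesHealthy)
                {
                    e.SetBusy();
                }
            };
            return base.OnStart();
        }
    }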
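And as a sketch of what the custom probe target might look like, assuming a hypothetical handler mapped to /healthcheck.ashx (the path your LoadBalancerProbe element would point at) and an assumed connection string named 'AppDb':

    using System.Configuration;
    using System.Data.SqlClient;
    using System.Web;

    // Hypothetical probe endpoint; the LoadBalancerProbe element in
    // ServiceDefinition.csdef would point its path at /healthcheck.ashx.
    public class HealthCheckHandler : IHttpHandler
    {
        public bool IsReusable { get { return true; } }

        public void ProcessRequest(HttpContext context)
        {
            try
            {
                // Exercise a real dependency instead of serving a static page.
                // "AppDb" is an assumed connection string name.
                using (var conn = new SqlConnection(
                    ConfigurationManager.ConnectionStrings["AppDb"].ConnectionString))
                {
                    conn.Open();
                }
                context.Response.StatusCode = 200;  // healthy: stay in rotation
            }
            catch
            {
                context.Response.StatusCode = 503;  // unhealthy: drop out of rotation
            }
        }
    }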


Q: What does the load balancer do when an instance is detected as unhealthy?

A: The load balancer will route new incoming TCP connections to instances which are in rotation.  The instances that are in rotation are either:

    1. Returning a 'Ready' state from the guest agent for roles which do not have a LoadBalancerProbe.
    2. Returning an HTTP 200 (or, for TCP probes, completing the TCP handshake) from the endpoint defined in a LoadBalancerProbe element.

If an instance drops out of rotation, the load balancer will not terminate any existing TCP connections.  So if the client and server maintain the TCP connection then traffic on that connection will still be sent to the instance which has dropped out of rotation, but no new TCP connections will be sent to that instance.  If the TCP connection is broken by the server (i.e., the VM restarts or the process holding the TCP connection closes) then the client should retry the connection, at which time the load balancer will see it as a new TCP connection and route it to an instance which is in rotation.  If your website is unavailable, the default behavior is to return an HTTP 503 and keep the TCP connection open, which causes clients to not get load balanced to a new server.  To change this behavior you can set the loadBalancerCapabilities property (aka the "Service Unavailable" Response Type) to TcpLevel, which will cause the TCP connection to terminate and the client to automatically retry the connection and get load balanced to a healthy role instance.
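
The client-side half of this behavior is the retry: a broken connection should simply be re-established so the load balancer can route it to a healthy instance.  A minimal sketch of such a retry loop (the attempt count and backoff are arbitrary choices):

    using System;
    using System.Net.Http;
    using System.Threading.Tasks;

    public static class ResilientClient
    {
        // Retries connection-level failures; each new TCP connection is
        // load balanced independently, so a retry can land on a healthy
        // instance.
        public static async Task<string> GetWithRetryAsync(string url, int maxAttempts = 3)
        {
            using (var client = new HttpClient())
            {
                for (int attempt = 1; ; attempt++)
                {
                    try
                    {
                        return await client.GetStringAsync(url);
                    }
                    catch (HttpRequestException) when (attempt < maxAttempts)
                    {
                        // Brief backoff before opening a new connection.
                        await Task.Delay(TimeSpan.FromSeconds(attempt));
                    }
                }
            }
        }
    }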

Note that for single instance deployments, the load balancer considers that instance to always be in rotation.  So regardless of the status of the instance the load balancer will send traffic to that instance.


Q: How can you determine if a role instance was recycled or moved to a new server?

A: There is no direct way to know if an instance was recycled.  Fabric-initiated restarts (i.e., OS updates) will raise the Stopping/OnStop events, but for unexpected shutdowns you will not receive these events.  There are some strategies to detect these restarts:

    1. The most common way to achieve this is to write a log in the RoleEntryPoint.OnStart method (see the sketch after this list).  If you unexpectedly see an instance of this log then you know a role instance was recycled and you can look at various pieces of evidence to determine why.
    2. If an instance is moved to a new VM/server then the Changing/Changed events will be raised on all other roles and instances with a type of RoleEnvironmentTopologyChange (a detection sketch also follows this list).  Note that this will only happen if you have an InternalEndpoint defined.  Also note that an InternalEndpoint is implicitly defined for you if you have enabled RDP.
    3. See https://blogs.msdn.com/b/kwill/archive/2012/09/19/role-instance-restarts-due-to-os-upgrades.aspx for information about determining when an instance is restarted due to OS updates.
    4. The guest agent logs (see the Role Architecture blog post for the log file location) will contain evidence of all restarts, both planned and unplanned, but they are internal, undocumented logs and interpreting them is not trivial.  But if you are following #1 and you know the timestamp of when your role restarted then you can focus on a specific timeframe in the agent logs.
    5. The host bootstrapper logs (reference the Role Architecture blog post for log file location) will tell you if a startup task or host process failed and caused the guest agent to recycle the instance.
    6. The state of the drives on the guest OS can provide information about what happened.  See https://blogs.msdn.com/b/kwill/archive/2012/10/05/windows-azure-disk-partition-preservation.aspx.
    7. If none of the above helps, the support team can investigate further through a support incident.
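
For #1, the log write itself is trivial; the value is in the timestamp, which you can use to narrow the timeframe in the agent logs from #4.  A minimal sketch (Trace is just one possible sink; any durable log store works):

    using System;
    using System.Diagnostics;
    using Microsoft.WindowsAzure.ServiceRuntime;

    public class WorkerRole : RoleEntryPoint
    {
        public override bool OnStart()
        {
            // An unexpected occurrence of this entry means the instance
            // was recycled; use its timestamp to narrow the agent logs.
            Trace.TraceInformation(
                "OnStart at {0:u} on instance {1}",
                DateTime.UtcNow,
                RoleEnvironment.CurrentRoleInstance.Id);
            return base.OnStart();
        }
    }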
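And for #2, a minimal sketch of watching for the topology change from the other instances:

    using System.Linq;
    using Microsoft.WindowsAzure.ServiceRuntime;

    public static class TopologyWatcher
    {
        public static void Hook()
        {
            RoleEnvironment.Changed += (sender, e) =>
            {
                // A RoleEnvironmentTopologyChange indicates an instance was
                // added, removed, or moved to a new VM/server.
                foreach (var change in e.Changes.OfType<RoleEnvironmentTopologyChange>())
                {
                    System.Diagnostics.Trace.TraceInformation(
                        "Topology change for role: {0}", change.RoleName);
                }
            };
        }
    }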