How does Azure identify a faulty role instance?

The Azure platform in the form of the Fabric Controller (FC) provides monitoring for a health of a role instance. The table below summarises common problems, their detection mechanism and the action taken by the fabric controller:

Problem

How Detected

Fabric Action

Role crashes

Guest FC agent monitors role termination

FC will request that the agent restart the role

Guest VM or agent crashes

The root agent will notice an absence of heartbeats from the guest

FC will restart the VM, and the roles hosted therein

Root OS or agent crashes

The FC will notice an absence of heartbeats from the guest

After some retry logic, FC will reallocate the roles assigned to this node to a healthy node

Physical hardware problems

FC agent will report disk, memory, or CPU failure to FC

FC will migrate the roles hosted on the node to other nodes, and mark the node “out for repair”

Conditions which constitute a role crash include the host process (WaWorkerHost or WaWebHost) terminating for any reason because of an unhandled exception or the role exiting (i.e. a worker role exiting the Run() function).

In addition, a role instance can indicate also indicate it is unhealthy to the fabric controller which will cause the agent to restart the instance.

There are however conditions where a failure of a role will not recognised by the Azure fabric controller. These conditions include:

  • Role instance goes into infinite loop
  • Role instance hangs
  • Role instance performance has degraded.

There are no external interfaces available in Azure today (May 2010) which allows the triggering of a restart for a specific role instance. Using the scenario where the performance of a particular role instance has degraded due to some resource leak, for example only processing 1% of its typical throughput. In an on-premise deployment this condition would be typically be detected by system management software (e.g. Operations Manager) using performance counter rules and triggering a restart of an application pool or windows service as its corrective action. With Azure it is not possible to trigger this restart from the outside world and such health conditions need to be incorporated within the role function themselves.

An application can signal the health of a role instance to Azure by implementing the RoleEnvironment.StatusCheck event. If the handler for StatusCheck returns anything other than RoleInstanceStatus.Ready then the load balancer will stop sending incoming requests to that role instance. It will not recycle your role, however the role runtime API (RoleEnvironment) lets an instance request its own recycle. If you want to force a role recycle based on an internal health check a role must call RoleEnvironment.RequestRecycle. Care must be however taken not to open the service to any denial of service attacks which target the triggering of the condition.

If you wanted to provide the ability for external system management software to trigger the restart of an instance then this mechanism would need to be built into the role itself to monitor for an external trigger, perhaps some storage that the role instances polls.

Written by Michael Royster