How does Azure identify a faulty role instance?

Article
05/10/2010

The Azure platform in the form of the Fabric Controller (FC) provides monitoring for a health of a role instance. The table below summarises common problems, their detection mechanism and the action taken by the fabric controller:

Problem	How Detected	Fabric Action
Role crashes	Guest FC agent monitors role termination	FC will request that the agent restart the role
Guest VM or agent crashes	The root agent will notice an absence of heartbeats from the guest	FC will restart the VM, and the roles hosted therein
Root OS or agent crashes	The FC will notice an absence of heartbeats from the guest	After some retry logic, FC will reallocate the roles assigned to this node to a healthy node
Physical hardware problems	FC agent will report disk, memory, or CPU failure to FC	FC will migrate the roles hosted on the node to other nodes, and mark the node “out for repair”

Conditions which constitute a role crash include the host process (WaWorkerHost or WaWebHost) terminating for any reason because of an unhandled exception or the role exiting (i.e. a worker role exiting the Run() function).

In addition, a role instance can indicate also indicate it is unhealthy to the fabric controller which will cause the agent to restart the instance.

There are however conditions where a failure of a role will not recognised by the Azure fabric controller. These conditions include:

Role instance goes into infinite loop
Role instance hangs
Role instance performance has degraded.

There are no external interfaces available in Azure today (May 2010) which allows the triggering of a restart for a specific role instance. Using the scenario where the performance of a particular role instance has degraded due to some resource leak, for example only processing 1% of its typical throughput. In an on-premise deployment this condition would be typically be detected by system management software (e.g. Operations Manager) using performance counter rules and triggering a restart of an application pool or windows service as its corrective action. With Azure it is not possible to trigger this restart from the outside world and such health conditions need to be incorporated within the role function themselves.

An application can signal the health of a role instance to Azure by implementing the RoleEnvironment.StatusCheck event. If the handler for StatusCheck returns anything other than RoleInstanceStatus.Ready then the load balancer will stop sending incoming requests to that role instance. It will not recycle your role, however the role runtime API (RoleEnvironment) lets an instance request its own recycle. If you want to force a role recycle based on an internal health check a role must call RoleEnvironment.RequestRecycle. Care must be however taken not to open the service to any denial of service attacks which target the triggering of the condition.

If you wanted to provide the ability for external system management software to trigger the restart of an instance then this mechanism would need to be built into the role itself to monitor for an external trigger, perhaps some storage that the role instances polls.

Written by Michael Royster

How does Azure identify a faulty role instance?

Additional resources