Windows Azure PaaS – Role instance health and load balancer requests.

There is a very common question I get asked on a regular basis, and that is “How does Windows Azure determine role health, and how can I take machines in/out of the load balancer rotations?” There’s lots of information out on the web, but I wanted to re-post some relevant information to get you up to speed quickly.

Here’s the typical question I get, this one from a customer deploying WCF services:

How does Azure determine if a web role instance is healthy or not? Does it use a health check page, and if so, how is it configured? What is the best practice to notify the load balancer that a particular web role instance needs to be taken out of the pool? In our case, we want to take the web role instance out of the load balancer pool if the WCF service has a problem. Thanks!

I’ve taken the liberty of consolidating information from MSDN and other blog posts, to make the answers a bit more concise:

To take the instance out of the load balancer pool, here is a proven practice:

You can use the RoleEnvironment.StatusCheck event of the Role Environment to change the status of the role instance. A role instance may indicate that it is in one of two states: Ready or Busy. If the state of a role instance is Ready, it is prepared to receive requests from the load balancer. If the state of the instance is Busy, it will not receive requests from the load balancer. By calling the SetBusy method of RoleInstanceStatusCheckEventArgs, you can temporarily set the status of the role instance to Busy, which removes the role instance from the load balancer.
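For the WCF scenario in the question, a minimal sketch of this pattern might look like the following. It assumes a role project that references Microsoft.WindowsAzure.ServiceRuntime. IsWcfServiceHealthy is a hypothetical placeholder of mine, not part of the SDK, and you would replace it with whatever probe suits your service. Because the status check is raised at regular intervals, the sketch calls SetBusy on every check for as long as the probe fails.

```csharp
using Microsoft.WindowsAzure.ServiceRuntime;

public class WebRole : RoleEntryPoint
{
    public override bool OnStart()
    {
        // The role runtime raises StatusCheck periodically for this instance.
        RoleEnvironment.StatusCheck += OnStatusCheck;
        return base.OnStart();
    }

    private static void OnStatusCheck(object sender, RoleInstanceStatusCheckEventArgs e)
    {
        // Keep this handler fast; it runs on every status check.
        if (!IsWcfServiceHealthy())
        {
            // Report Busy for this check so the load balancer stops
            // routing new requests to this instance.
            e.SetBusy();
        }
    }

    // Hypothetical placeholder: replace with a real probe of your WCF service,
    // for example a lightweight call to a local endpoint.
    private static bool IsWcfServiceHealthy()
    {
        return true;
    }
}
```

Once the probe succeeds again, the handler simply stops calling SetBusy and the instance reports Ready on the next check.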

 

As for "health", it is determined by fairly complex interactions between the Fabric Controller and the monitoring agents on the PaaS VMs, which run in their own processes:

The Azure platform, in the form of the Fabric Controller (FC), monitors the health of each role instance.

The table below summarizes some common problems (it does not cover everything the platform detects), how each is detected, and the action taken by the Fabric Controller:

Problem | How Detected | Fabric Action
------- | ------------ | -------------
Role crashes | The guest VM Fabric Controller agent monitors role termination. | Fabric Controller will request that the agent restart the role.
Guest VM or agent crashes | The root Fabric Controller agent will notice an absence of heartbeats from the guest. | Fabric Controller will restart the VM, and the roles hosted therein.
Root OS or agent crashes | The Fabric Controller will notice an absence of heartbeats from the root. | After some retry logic, Fabric Controller will reallocate the roles assigned to this node to a healthy node.
Physical hardware problems | The Fabric Controller agent will report disk, memory, or CPU failure to Fabric Controller. | Fabric Controller will migrate the roles hosted on the node to other nodes, and mark the node “out for repair”.

 

Timing of course varies for each of these conditions, but generally the Fabric Controller responds within several seconds of detecting one of them.
Conditions that constitute a role crash include the host process (WaWorkerHost or WaWebHost) terminating for any reason, such as an unhandled exception, and the role exiting (for example, a worker role returning from its Run() method).
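To make the "role exiting" case concrete, here is a rough sketch of a worker role whose Run() method never returns and which handles per-item failures itself. ProcessNextWorkItem and the one-second pause are placeholders I made up for illustration. Whether you swallow an exception or let the instance crash and restart is a design decision for your service.

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using Microsoft.WindowsAzure.ServiceRuntime;

public class WorkerRole : RoleEntryPoint
{
    public override void Run()
    {
        // Returning from Run() counts as the role exiting, so loop indefinitely.
        while (true)
        {
            try
            {
                ProcessNextWorkItem();
            }
            catch (Exception ex)
            {
                // Handle per-item failures here so that an unhandled exception
                // does not terminate the host process.
                Trace.TraceError("Work item failed: {0}", ex);
            }

            // Assumed pause between iterations; tune for your workload.
            Thread.Sleep(TimeSpan.FromSeconds(1));
        }
    }

    // Hypothetical placeholder for the role's real work.
    private void ProcessNextWorkItem()
    {
    }
}
```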

Note that any child processes the role launches are, by default, tied to the same Process Job Object, so a crash in one of those children or grandchildren triggers the same response.

In addition, a role instance can itself indicate to the Fabric Controller that it is unhealthy, which causes the agent to restart the instance.

There are, however, conditions where a failure of a role will not be recognized by the Azure Fabric Controller. These conditions include:

  • Role instance goes into an infinite loop
  • Role instance hangs
  • Role instance performance has degraded
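Since the platform will not notice these conditions on its own, the role has to watch for them itself. One possible approach, sketched below for a worker role, is to record a timestamp each time the work loop makes progress and to report Busy from the status check when that timestamp goes stale. The two-minute threshold and DoOneUnitOfWork are assumptions for illustration only. The sketch also relies on the status check being raised on a different thread from Run(), so it can still fire while Run() is stuck.

```csharp
using System;
using System.Threading;
using Microsoft.WindowsAzure.ServiceRuntime;

public class WorkerRole : RoleEntryPoint
{
    // Assumed threshold: how long without progress before we call it a stall.
    private static readonly TimeSpan StallThreshold = TimeSpan.FromMinutes(2);

    // Ticks (UTC) of the last completed unit of work.
    private long lastProgressTicks = DateTime.UtcNow.Ticks;

    public override bool OnStart()
    {
        RoleEnvironment.StatusCheck += (sender, e) =>
        {
            long ticks = Interlocked.Read(ref lastProgressTicks);
            var lastProgress = new DateTime(ticks, DateTimeKind.Utc);
            if (DateTime.UtcNow - lastProgress > StallThreshold)
            {
                // The work loop looks hung or badly degraded; report Busy.
                e.SetBusy();
            }
        };
        return base.OnStart();
    }

    public override void Run()
    {
        while (true)
        {
            DoOneUnitOfWork();
            // Record progress so the status check can detect a stall.
            Interlocked.Exchange(ref lastProgressTicks, DateTime.UtcNow.Ticks);
        }
    }

    // Hypothetical placeholder for the role's real work.
    private void DoOneUnitOfWork()
    {
        Thread.Sleep(TimeSpan.FromSeconds(1));
    }
}
```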

When the RoleEnvironment.RequestRecycle method is called, the load balancer takes the role instance out of the rotation and the normal shutdown cycle is initiated. New requests are not routed to the role instance while it is restarting.

Windows Azure raises the Stopping event and calls the OnStop method, where you can run whatever code is needed to prepare the role instance for recycling. Consider a scenario where the performance of a particular role instance has degraded because of a resource leak, so that it processes only 1% of its typical throughput. In an on-premises deployment, this condition would typically be detected by system management software (e.g. Operations Manager) using performance counter rules, which would trigger a restart of an application pool or Windows service as the corrective action.
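As a sketch of what "prepare the role instance to be recycled" can look like in a worker role, the example below uses OnStop to ask the work loop to finish and then waits briefly for it. The wait events, the 20-second timeout and DoWork are my own assumptions. The platform limits how long shutdown may take, so keep whatever you do here short.

```csharp
using System;
using System.Threading;
using Microsoft.WindowsAzure.ServiceRuntime;

public class WorkerRole : RoleEntryPoint
{
    private readonly ManualResetEvent stopRequested = new ManualResetEvent(false);
    private readonly ManualResetEvent runFinished = new ManualResetEvent(false);

    public override void Run()
    {
        // WaitOne returns false on timeout, so the loop continues until
        // OnStop signals stopRequested.
        while (!stopRequested.WaitOne(TimeSpan.FromSeconds(1)))
        {
            DoWork();
        }
        runFinished.Set();
    }

    public override void OnStop()
    {
        // Ask the work loop to finish its current item and exit.
        stopRequested.Set();

        // Assumed grace period; it must fit within the platform's shutdown limit.
        runFinished.WaitOne(TimeSpan.FromSeconds(20));

        base.OnStop();
    }

    // Hypothetical placeholder for the role's real work.
    private void DoWork()
    {
    }
}
```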

An application can signal the health of a role instance to Azure by handling the RoleEnvironment.StatusCheck event. If the handler reports any status other than RoleInstanceStatus.Ready (for example, by calling SetBusy), the load balancer stops sending incoming requests to that role instance. Reporting Busy does not recycle your role; however, the role runtime API (RoleEnvironment) lets an instance request its own recycle. To force a recycle based on an internal health check, the role must call RoleEnvironment.RequestRecycle. Take care, however, not to expose the service to denial-of-service attacks that deliberately trigger the recycle condition.
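Here is one way the self-recycle idea could be sketched, with a couple of guards aimed at the denial-of-service concern: the instance only recycles after several consecutive failed checks, and never within a minimum time after startup. The SelfRecycle class, its thresholds and the ReportHealthCheck method are hypothetical. Only RoleEnvironment.RequestRecycle comes from the SDK.

```csharp
using System;
using Microsoft.WindowsAzure.ServiceRuntime;

public static class SelfRecycle
{
    // Assumed policy values; tune them for your service.
    private const int FailuresBeforeRecycle = 5;
    private static readonly TimeSpan MinimumUptime = TimeSpan.FromMinutes(10);

    private static readonly DateTime startedUtc = DateTime.UtcNow;
    private static readonly object gate = new object();
    private static int consecutiveFailures;

    // Call this with the result of each internal health check.
    public static void ReportHealthCheck(bool healthy)
    {
        bool recycle;
        lock (gate)
        {
            consecutiveFailures = healthy ? 0 : consecutiveFailures + 1;

            // Require sustained failure and a minimum uptime, so a single bad
            // check (or a deliberately triggered one) cannot cause rapid
            // back-to-back restarts.
            recycle = consecutiveFailures >= FailuresBeforeRecycle
                && DateTime.UtcNow - startedUtc >= MinimumUptime;
        }

        if (recycle)
        {
            // Takes the instance out of rotation and starts the normal shutdown cycle.
            RoleEnvironment.RequestRecycle();
        }
    }
}
```

It also helps to base the health check on internal measurements (throughput, queue depth, memory) rather than on anything a caller can influence directly through request content.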

If you want external system management software to be able to trigger the restart of an instance, that mechanism needs to be built into the role itself so that it monitors for an external trigger, for example a flag in storage that the role instances poll.
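A rough sketch of that polling mechanism follows. To avoid guessing at a particular storage API, the check is passed in as a delegate. RecycleTriggerPoller and the recycleRequested delegate are hypothetical names of mine, and how the flag is stored (a blob, a table row, a queue message) is left to you.

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using Microsoft.WindowsAzure.ServiceRuntime;

public sealed class RecycleTriggerPoller
{
    // Hypothetical delegate: returns true when an external tool has asked
    // this instance to restart (for example, by setting a flag in storage).
    private readonly Func<bool> recycleRequested;
    private readonly TimeSpan interval;

    public RecycleTriggerPoller(Func<bool> recycleRequested, TimeSpan interval)
    {
        if (recycleRequested == null) throw new ArgumentNullException("recycleRequested");
        this.recycleRequested = recycleRequested;
        this.interval = interval;
    }

    public void Start()
    {
        // Background thread so polling never blocks role shutdown.
        var thread = new Thread(Poll) { IsBackground = true };
        thread.Start();
    }

    private void Poll()
    {
        while (true)
        {
            Thread.Sleep(interval);
            try
            {
                if (recycleRequested())
                {
                    RoleEnvironment.RequestRecycle();
                    return;
                }
            }
            catch (Exception ex)
            {
                // A failed poll should not crash the role; log it and try again.
                Trace.TraceWarning("Recycle trigger poll failed: {0}", ex);
            }
        }
    }
}
```

You would typically create and start the poller from OnStart. The delegate should clear the flag, or key it by RoleEnvironment.CurrentRoleInstance.Id, so that an instance does not find the same request waiting and recycle again as soon as it comes back up.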

 

Hope this helps make it a bit more concise for folks.

Eric L. Golpe
Senior Consultant II - Microsoft Consulting Services – US Northwest District