Consider a scenario in which an AOS cluster has been configured for high availability of services by following “Configuring network load balancing for services AX2012” (http://technet.microsoft.com/en-us/library/hh397322.aspx).
The NLB cluster is working fine and suddenly one of the AOS services stops. The rest of the Windows Server components on that server are still up and running, so the NLB service still considers the server available. It continues to send requests to a server that is online but whose AOS service is not running. This can cause communication errors that affect the user experience.
To react to changes in the AOS service status, we can use the service's Recovery tab to automate some management tasks. The Recovery tab can also run a program, such as a PowerShell script, when the service fails, and that script can stop or remove the AOS server from the NLB cluster.
Below are two ways to "notify" the NLB cluster that the AOS server should be considered "out of service": Shutdown.exe and the Stop-NlbClusterNode cmdlet.
A. How to Stop/Restart the AOS Server when the service stops?
Use of Shutdown.exe
Open the Services console and go to the AOS Service Properties.
Configure the AOS Service properties within the Recovery tab:
In this example the service will try to:
- Restart the service.
- If the service fails to start, the computer will be restarted.
- And if the AOS Service is not able to start afterwards, it will run a program.
In this case, the “shutdown” command is run with the "/s" parameter, which shuts the server down completely.
Once the server is down, the NLB service detects that the Windows Server node is unavailable and stops sending requests to the AOS server.
We could use the parameter “/r” to reboot the server instead. More information on shutdown.exe can be found at http://technet.microsoft.com/en-us/library/cc732503.aspx
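The same recovery actions can also be set from the command line with sc.exe instead of the Services console. The following is a minimal sketch rather than a tested configuration: the service name AOS60$01 is an assumption (check the real name with Get-Service), and the delays and reset period are illustrative values.

```powershell
# Assumption: AOS60$01 is the AOS service name on this server; verify with Get-Service.
# Single quotes keep PowerShell from expanding the "$01" part of the name.
$aosService = 'AOS60$01'

# First failure: restart the service. Second failure: reboot the computer.
# Subsequent failures: run the program given in "command=".
# Delays are in milliseconds; "reset=" (failure counter reset period) is in seconds.
sc.exe failure $aosService reset= 86400 actions= restart/60000/reboot/60000/run/60000 command= "shutdown.exe /s"

# Display the configured recovery actions to confirm them.
sc.exe qfailure $aosService
```

In Windows PowerShell, type sc.exe in full, because sc on its own is an alias for Set-Content.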
Note: This approach is valid only where the platform allows the server to be shut down or restarted. It is also important to have an alerting framework in place, such as SCOM or another mechanism, to notify administrators about the status of the AOS server and service. Where the server cannot be stopped (for example, because other critical services run on it), NLB can be managed with PowerShell cmdlets instead, as described below.
B. How to Stop or Exclude the AOS Server when the service stops?
Use of PowerShell.exe
We start by preparing the PowerShell script that will be executed.
To stop the NLB traffic to the faulting AOS node, we will use the Stop-NlbClusterNode cmdlet.
In this example we will use StopNLBNode.ps1, which contains only the following statement:
Stop-NlbClusterNode -HostName AX2012R2A
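A slightly more defensive variant of the script is sketched below. It loads the NLB module explicitly, which older PowerShell versions require, and takes the host name from the local computer instead of hard-coding it. It assumes that the NLB host name matches the computer name, which may not hold in every environment.

```powershell
# StopNLBNode.ps1 (sketch, not the original one-line script)

# Load the NLB cmdlets; PowerShell 3.0 and later load the module automatically,
# but the explicit import is harmless there.
Import-Module NetworkLoadBalancingClusters

# Assumption: the NLB host name is the same as the local computer name.
Stop-NlbClusterNode -HostName $env:COMPUTERNAME
```

With this form the same script file can be copied unchanged to every AOS node in the cluster.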
It is also possible to remove the node completely from the NLB cluster by using the “Remove-NlbClusterNode” cmdlet. For more information, see TechNet: http://technet.microsoft.com/es-ES/library/hh801262.aspx
Note: Stop-NlbClusterNode is the cmdlet that stops a node in a Network Load Balancing (NLB) cluster. When a node is stopped, client connections that are already in progress are interrupted. To avoid interrupting active connections, consider using the Drain parameter, which allows the node to continue servicing active connections but disables all new traffic to that node. From http://technet.microsoft.com/es-ES/library/hh801288.aspx
Note 2: In our case the AOS service is the core role of the computer and it has failed, so no sessions remain active and the Drain parameter can be omitted. Consider using Drain if you plan to run a similar script before the AOS service has closed all working sessions, for example when the server needs to go into admin mode.
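For the planned case described in Note 2, a drain-based stop could look like the sketch below. The 10-minute timeout is an illustrative value, not a recommendation; the Timeout parameter is documented as the number of minutes to wait for draining before the node is stopped.

```powershell
Import-Module NetworkLoadBalancingClusters

# Stop accepting new connections and let existing ones finish.
# After the timeout (10 minutes here, an arbitrary example) the node is stopped
# and any remaining connections are dropped.
Stop-NlbClusterNode -HostName AX2012R2A -Drain -Timeout 10
```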
In our example, the AOS Service is configured with the following Recovery Options:
- To run a program in all cases.
- The program to run is PowerShell.exe
- Provide the PowerShell script as a command line parameter.
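As an illustration, the values entered on the Recovery tab and a command-line equivalent might look as follows. The path C:\Scripts\StopNLBNode.ps1 and the service name AOS60$01 are assumptions, and whether -ExecutionPolicy Bypass is acceptable depends on local security policy.

```powershell
# Values for the Recovery tab (comments only, since they are typed into the dialog):
#   Program:                 powershell.exe
#   Command line parameters: -NoProfile -ExecutionPolicy Bypass -File C:\Scripts\StopNLBNode.ps1

# Command-line equivalent: run the program on the first, second and subsequent failures.
# Assumptions: service name and script path are examples only.
$aosService = 'AOS60$01'
sc.exe failure $aosService reset= 86400 actions= run/1000/run/1000/run/1000 command= "powershell.exe -NoProfile -ExecutionPolicy Bypass -File C:\Scripts\StopNLBNode.ps1"
```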
Once set and running, we can force the service failure:
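One way to simulate a failure in a test environment is to terminate the AOS process abnormally, since recovery actions are triggered by an unexpected exit rather than by a normal service stop. The sketch below assumes that the AOS process is named Ax32Serv and that only one AOS instance runs on the server; do not run it against a production system.

```powershell
# Test environments only: kill the AOS process so that the Service Control Manager
# registers a failure and runs the configured recovery action.
# Assumption: the process name is Ax32Serv and there is a single AOS instance.
Stop-Process -Name Ax32Serv -Force
```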
The system will automatically run the PowerShell script and therefore stop the NLB node:
No further requests should now be routed to the stopped NLB node.
Remember to plan the administrative tasks needed once the issue with the AOS is fixed. The node stays stopped until it is started again, and until then the AOS will not be accessible through the NLB name or IP address. The node can be started manually from the NLB console or managed with PowerShell scripts using the NLB cmdlets: Add-NlbClusterNode, Get-NlbClusterNode, Resume-NlbClusterNode, Set-NlbClusterNode, Start-NlbClusterNode, Stop-NlbClusterNode, Suspend-NlbClusterNode.
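The node state can be checked from PowerShell as well as from the NLB console. A minimal check, run on one of the cluster nodes, might be:

```powershell
Import-Module NetworkLoadBalancingClusters

# List every node of the local NLB cluster together with its current state.
Get-NlbClusterNode | Format-Table -AutoSize
```

The node that ran the recovery script should be reported in a stopped state, while the other nodes continue to handle traffic.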
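Bringing the node back into rotation could be scripted along these lines. This is a sketch under assumptions: the service name AOS60$01 is an example, and the check only confirms that the service is running, not that the AOS is fully ready to serve requests.

```powershell
Import-Module NetworkLoadBalancingClusters

$aosService = 'AOS60$01'   # assumption: the actual service name may differ
$nlbHost    = 'AX2012R2A'  # host name used in the example in this article

# Rejoin the node only once the AOS service is running again.
if ((Get-Service -Name $aosService).Status -eq 'Running') {
    Start-NlbClusterNode -HostName $nlbHost
} else {
    Write-Warning "AOS service $aosService is not running; NLB node left stopped."
}
```

If the node was removed with Remove-NlbClusterNode rather than stopped, it has to be added back with Add-NlbClusterNode instead.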