Some tricks with service restart logic

Today I will venture outside the safe confines of Office Communications Server and discuss some quirks I have noticed with Windows services.  (Disclaimer: I am not in the Windows org, so these are just my observations after some experimentation.)  Some of you may be aware that in the Services control panel, you can right-click a service and use the "Recovery" tab to manage what happens when the service fails.

The following are the options.

First failure: what should occur the first time the service fails.  Valid options are "Take No Action", "Restart the Service", "Run a Program", and "Restart the Computer".

Second failure: the same options, applied the second time the service fails.

Subsequent failures: the same options, applied to every failure after the second.

Reset fail count after: the number of days the service must run without failing before the failure count is reset to zero.

Restart service after: the number of minutes to wait before restarting the service.
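
If you would rather script these settings than click through the dialog, the same values can be handed to the sc failure command.  Here is a minimal Python sketch that only builds the argument list and does not run anything.  The service name "MyService" and the helper name are made up for illustration.  It assumes the command-line units as I understand them (seconds for reset=, milliseconds for each action delay), so it converts from the days and minutes shown in the dialog, and it only handles the restart and reboot actions.

```python
# Hypothetical helper: builds the arguments for an `sc failure` call.
# Assumption: `reset=` takes seconds and each action delay is in milliseconds.
ALLOWED_ACTIONS = {"restart", "reboot"}  # "run" and "take no action" are not covered here


def build_sc_failure_args(service, actions, reset_days, restart_delay_minutes):
    """Return the argument list for configuring recovery options.

    `actions` is ordered: first failure, second failure, subsequent failures.
    """
    for action in actions:
        if action not in ALLOWED_ACTIONS:
            raise ValueError(f"unsupported action: {action!r}")

    reset_seconds = int(reset_days * 24 * 60 * 60)
    delay_ms = int(restart_delay_minutes * 60 * 1000)

    # sc expects "action/delay" pairs joined by slashes.
    action_spec = "/".join(f"{action}/{delay_ms}" for action in actions)

    # sc treats "reset=" and its value as separate tokens.
    return ["sc", "failure", service,
            "reset=", str(reset_seconds),
            "actions=", action_spec]


args = build_sc_failure_args("MyService", ["restart", "restart", "reboot"],
                             reset_days=1, restart_delay_minutes=1)
print(" ".join(args))
```

On a Windows machine the list could be passed to subprocess.run from an elevated prompt, but try it on a test service first.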

This is very nice, but it is easy to misunderstand what these values actually do.  I have seen a number of services set these values to 0 days and 0 minutes (and I tried this myself).  The problem is that if the failure count resets after 0 days, the count goes back to zero every time the service manages to start correctly, so every failure is treated as the first one.  The result is that only the "First failure" option will ever be run, and if that option is "Restart the Service", your service will restart continually.

To fix this, set the failure count to reset after one day.  The drawback to this approach is that your service may stay stopped after failing several times, but that likely means something is toast anyway.
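
To make the difference between 0 days and 1 day concrete, here is a toy Python model of the failure counter.  It is only a sketch of the behavior I observed, not how the service control manager actually keeps its books: it assumes the count is cleared whenever the time since the previous failure is at least the reset period, and the function name and action labels are invented for the example.

```python
def actions_taken(failure_times_days, reset_after_days, actions):
    """Return which recovery action fires for each failure.

    Toy model (an assumption, not the real implementation): the failure
    count is cleared when the time since the previous failure is at least
    `reset_after_days`.
    """
    count = 0
    last_failure = None
    taken = []
    for t in failure_times_days:
        if last_failure is not None and t - last_failure >= reset_after_days:
            count = 0  # enough failure-free time has passed
        count += 1
        # The last entry in `actions` covers all subsequent failures.
        taken.append(actions[min(count, len(actions)) - 1])
        last_failure = t
    return taken


labels = ["first", "second", "subsequent"]
failures = [0.0, 0.1, 0.2, 0.3]  # four failures within a few hours

zero_day = actions_taken(failures, 0, labels)
one_day = actions_taken(failures, 1, labels)
print(zero_day)  # ['first', 'first', 'first', 'first']
print(one_day)   # ['first', 'second', 'subsequent', 'subsequent']
```

With a reset period of 0 the counter never gets past one, which is why the second and subsequent options are never reached.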

One more thing to take into account: not all services will work with the restart logic.  In other words, just setting the recovery options on a service does not guarantee that it will restart.  For the service to restart, it must exit abnormally.  This generally means the service must exit with a non-zero exit code and the service status must not be stopped.  (Note: this has changed for Vista.  It is possible to set the service status to stopped and provide an exit code to trigger the restart logic.)
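
The rule above can be written down as a small Python predicate.  This encodes only the rule as I have described it, not the real decision logic inside Windows, and the function and parameter names are hypothetical.  The non_crash_flag parameter stands in for the Vista-era opt-in; I believe it corresponds to the fFailureActionsOnNonCrashFailures setting, but treat that mapping as an assumption.

```python
def triggers_restart(exit_code, reported_stopped, non_crash_flag=False):
    """Return True if the recovery options should kick in.

    Encodes the rule from the text only:
      - a zero exit code never counts as a failure;
      - a non-zero exit code without a "stopped" status counts as a failure;
      - a non-zero exit code with a "stopped" status counts only when the
        Vista-and-later opt-in (assumed here as `non_crash_flag`) is set.
    """
    if exit_code == 0:
        return False
    if not reported_stopped:
        return True
    return non_crash_flag


crash = triggers_restart(exit_code=1, reported_stopped=False)
clean_stop = triggers_restart(exit_code=0, reported_stopped=True)
stopped_with_error = triggers_restart(exit_code=1, reported_stopped=True)
stopped_with_error_vista = triggers_restart(exit_code=1, reported_stopped=True,
                                            non_crash_flag=True)
print(crash, clean_stop, stopped_with_error, stopped_with_error_vista)
```

So a service that catches its own errors, reports stopped, and returns an error code will not be restarted unless the opt-in is enabled.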