Operations Manager Health Service Restarts due to Exceeding Handle Count Threshold

Two of my coworkers, Chris Maiden and Phil Bracher, recently ran across an issue where many of their agents were restarting. They thought this issue might be affecting others so they summarized their experience below:

The MonitoringHost processes were causing the Healthservice to restart at least once a day on several of our agents, mostly servers running Windows Server 2008 R2 SP1 and hosting Exchange 2010. The problem shows up as gaps in performance data: because the Healthservice is restarting, the system has no data for the periods when the service was stopped.
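
As a rough illustration of how such gaps could be spotted in exported performance data, here is a minimal Python sketch. The function name `find_gaps`, the 5-minute collection interval, and the sample timestamps are all assumptions made for the example; they do not come from the management pack.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, expected_interval, tolerance=1.5):
    """Return (start, end) pairs where consecutive samples are farther
    apart than expected_interval * tolerance."""
    limit = expected_interval * tolerance
    ordered = sorted(timestamps)
    gaps = []
    for earlier, later in zip(ordered, ordered[1:]):
        if later - earlier > limit:
            gaps.append((earlier, later))
    return gaps


# Hypothetical samples collected every 5 minutes, with three samples missing.
start = datetime(2014, 1, 1, 0, 0)
samples = [start + timedelta(minutes=5 * i) for i in range(12) if i not in (5, 6, 7)]

for gap_start, gap_end in find_gaps(samples, timedelta(minutes=5)):
    print(f"No data between {gap_start:%H:%M} and {gap_end:%H:%M}")
```

A gap found this way only suggests that the service may have been down; it would still need to be confirmed against the monitor state changes on the agent.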

--Update: attached is the MP Phil used to collect and view data on the handle count.

When we believe the Healthservice is being restarted on an agent, the first things to look at are the memory monitors associated with the Healthservice. There are four monitors related to both the Healthservice and MonitoringHost processes. The monitors change state when either the Handle Count or Private Bytes threshold is exceeded. Additionally, the parent aggregate monitor has a recovery script that restarts the Healthservice process. Kevin Holman has a great blog post about this process.

Since we knew the two main reasons for Healthservice restarts were related to memory (Private Bytes and Handle Count), we decided to create a few views to observe the trends associated with both processes. We could also have begun alerting on the individual restarts per Kevin’s blog, but we wanted to observe the effects of threshold overrides over time to see whether the trend would eventually stabilize. The views we created included both processes and their respective Private Bytes and Handle Count counters.

The views provided the following results for the MonitoringHost process. The agent can create multiple MonitoringHost processes, and in almost every case it was the initial process (MonitoringHost) causing the problem. The process was creating handles and not releasing them, leaking approximately 6,000 handles per day.
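
A leak rate like this can be estimated from collected Handle Count samples by fitting a straight line to them. The sketch below is one way to do that in Python. The function name `handles_per_day` and the readings are hypothetical, chosen only to resemble the trend described above. They are not our actual data.

```python
def handles_per_day(samples):
    """Least-squares slope of handle count against time.

    samples: list of (days_since_start, handle_count) pairs.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_h = sum(h for _, h in samples) / n
    num = sum((t - mean_t) * (h - mean_h) for t, h in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("samples must cover more than one point in time")
    return num / den


# Hypothetical readings taken every 12 hours (assumed values).
readings = [(0.0, 1500), (0.5, 4500), (1.0, 7500), (1.5, 10500)]
print(round(handles_per_day(readings)))  # prints 6000
```

A slope that stays positive over several days, with no plateau, is what distinguishes a leak from a process that simply needs a higher steady-state handle count.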

We were up to date on the most recent .NET patches (the agent uses the .NET Framework), so we decided to select a few test agents and begin overriding the threshold for the MonitoringHost Handle Count to see if it would stabilize. Our thinking was that if we could find the sweet spot, we could decide whether or not to increase the threshold and leave it.

We started with a threshold of 15,000 handles. The Healthservice continued to restart. We then increased to 30,000 handles. Same result.
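
Some back-of-the-envelope arithmetic shows why raising the threshold only postpones the restart when the growth never levels off. The Python sketch below assumes strictly linear growth at roughly 6,000 handles per day and a hypothetical baseline of 1,500 handles right after a restart. Both the baseline and the function name `days_until_restart` are assumptions for illustration.

```python
def days_until_restart(threshold, baseline, leak_per_day):
    """Days for a linearly growing handle count to reach the threshold."""
    if leak_per_day <= 0:
        return float("inf")  # no growth: the threshold is never reached
    return max(threshold - baseline, 0) / leak_per_day


# Thresholds we tried; baseline and leak rate are assumed values.
for threshold in (15_000, 30_000):
    days = days_until_restart(threshold, baseline=1_500, leak_per_day=6_000)
    print(f"{threshold:,} handles: about {days:.1f} days until the next restart")
```

Under these assumptions, doubling the threshold roughly doubles the time between restarts but never removes them, which matches what we saw.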

Even at 100,000 handles the process would consume everything and a restart would occur. At this point we decided not to increase it again but rather to look for any fixes that might resolve the problem, since this was obviously a leak. Looking through the various hotfixes for System Center Operations Manager, we found KB2878378.

The article describes a specific symptom where you might observe grey agents. Though we did not observe grey agents or have any of the affected advapi32.dll versions listed in the article, we decided to install the hotfix on one of our Exchange 2010 servers in the lab. The outcome looked promising, but we still leaked handles. Below you can see the agent without the patch and then with the patch applied.

[Image: MonitoringHost handle count on the agent without the patch and then with the patch applied]

We looked further for any hotfixes, whether for Operations Manager or the operating system, that might have an impact on this leak. We found the following two optional patches and decided to install them.

KB2685811 - Kernel-Mode Driver Framework version 1.11 update for Windows Vista, Windows Server 2008, Windows 7, and Windows Server 2008 R2.

KB2685813 - User-Mode Driver Framework version 1.11 update for Windows Vista, Windows Server 2008, Windows 7, and Windows Server 2008 R2.

With these updates installed, the handle leak was resolved and the Healthservice restarts stopped.

Sample.Handle.Count.for.MonitoringHost.renametoxml