Maintenance mode has expired and the monitored server is still down, yet the alert view of the OpsMgr 2007 console shows no alert indicating that the server is offline. This scenario happens too often with the current implementation of availability monitoring. What our customers face and fear is that if a server which enters maintenance mode each night is not rebooted successfully, nothing indicates that it is unavailable, and this could lead to a bad user experience at the very least, if not more …
As described in this knowledge base article, one needs to put the computer, the health service hosted by that computer, and the watcher that monitors this health service remotely into maintenance mode to avoid unexpected unavailability alerts. But “unfortunately”, this is also the direct reason why no alert is generated when maintenance mode ends.
When an instance of a managed entity enters maintenance mode (MM), all its monitoring is suspended. When it leaves MM, monitoring is “restarted” with a clean slate. Because the current version of OpsMgr 2007 implements no probe that detects the current availability of the instance, there is no indication that it is unavailable. Luckily, there is a runtime component which is always aware of the current availability of the instances it cares about: the dependency monitor. It always knows the availability status of its contributing instances. This gives us a chance to implement an alert mechanism that notifies about possible instance unavailability.
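The gap described above can be illustrated with a small toy model. This is only a sketch of the reasoning, not OpsMgr code: the class `ToyMonitor`, its method names, and the state name `uninitialized` are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMonitor:
    """Toy stand-in for an availability monitor (hypothetical, not an OpsMgr API)."""
    state: str = "healthy"
    in_maintenance: bool = False
    alerts: list = field(default_factory=list)

    def on_heartbeat_missed(self):
        # While monitoring is suspended, the failure is not recorded at all.
        if self.in_maintenance:
            return
        if self.state != "error":
            self.state = "error"
            self.alerts.append("health service unavailable")

    def enter_maintenance(self):
        self.in_maintenance = True

    def leave_maintenance(self):
        # Clean slate: the state is reset and nothing re-probes availability.
        self.in_maintenance = False
        self.state = "uninitialized"


monitor = ToyMonitor()
monitor.enter_maintenance()
monitor.on_heartbeat_missed()  # server goes down during MM
monitor.leave_maintenance()    # MM expires, server is still down
print(monitor.state, monitor.alerts)
```

In this model the failure happened while monitoring was suspended, so no state change was recorded and no alert exists after MM ends, even though the server never came back.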
Back to the subject of this post: a workaround for health service availability. An instance of the health service makes it possible to monitor discovered instances of other entities. An instance of the health service watcher monitors the availability of a health service instance. A relationship already exists between these two entities, which allows a dependency monitor to be created. If we inspect the monitor topology for the health service watcher, we can see that such a dependency monitor already exists.
Unfortunately, this dependency monitor does not allow the alert to be customized if one is enabled through the “Generate Alert” override, simply because the released dependency monitor has no alert configuration.
Such “issues” indicate that the workaround should implement its own dependency monitor. The new monitor is “equal” to the existing one; the only difference is that it defines an alert configuration. This also means we could disable the original dependency monitor (Local Health Service Availability), because the new monitor provides the same monitoring.
How does this new monitor change its state to error? One way is that the contributing health service instance experiences a problem and contributes an error state. The other is to use the health state “error” when the contributing instance is not available.
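The two paths to the error state can be sketched as a simple rollup function. This is a minimal illustration under assumptions: the worst-of policy, the function name `rollup`, and the use of `None` to mark an unavailable contributing instance are all hypothetical and are not taken from the management pack.

```python
# Severity order used for the assumed worst-of rollup.
SEVERITY = {"success": 0, "warning": 1, "error": 2}


def rollup(contributing_states, unavailable_as="error"):
    """Return the worst state among the contributing instances.

    A state of None marks a contributing instance that is not available;
    it is replaced by `unavailable_as` before the rollup is computed.
    """
    resolved = [unavailable_as if s is None else s for s in contributing_states]
    if not resolved:
        return "success"
    return max(resolved, key=SEVERITY.__getitem__)


# Path 1: the health service itself reports a problem.
print(rollup(["error"]))
# Path 2: the health service is not available at all.
print(rollup([None]))
```

Both calls yield the error state, which is what lets the monitor raise an alert even when the watched health service cannot report anything itself.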
Here is a copy of the “Knowledge base” text I used with the monitor implemented in this workaround. It tries to explain how to troubleshoot the monitor’s error state and how to recognize (to some extent) that the health service is not available.
The error health state of this dependency monitor could be caused by unavailability of the watched health service. An alert is generated each time.
The state change events tab and the context of this monitor’s state change need to be investigated. An ICMP ping diagnostic should be executed; its result carries information about the availability of the computer hosting the watched health service.
The computer is online when no diagnostic output is available. In that case, expand the contributing monitor hierarchy; it will point to a problem with the local state of the watched health service.
You can try the following to diagnose and remediate the issue when the computer is not online (that is, when diagnostic output exists):
· Perform a trace route to check whether there is packet loss at a router or switch on the way to the target Health Service. You can use the tracert.exe utility in Windows. If you have Windows XP, Windows Server 2003, or higher, you can use pathping.exe.
· Physically check the target computer to ensure it is connected or plugged into the network.
· Check with your network administrators if there are any known issues or outages that may be affecting the target Health Service and its parent Management Server.
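The troubleshooting flow above can be summarized as a small decision helper. This is only a sketch: the function `next_steps` and the host name `server01.example.local` are hypothetical, and the helper only builds the command lines for tracert.exe and pathping.exe rather than running them.

```python
def next_steps(diagnostic_output, target):
    """Suggest follow-up actions based on whether the ping diagnostic produced output."""
    if not diagnostic_output:
        # No diagnostic output: the computer is online, so look at the
        # local state of the watched health service instead.
        return ["expand the contributing monitor hierarchy"]
    # Diagnostic output exists: the computer is not online.
    return [
        f"tracert.exe {target}",
        f"pathping.exe {target}",
        "physically check that the computer is connected to the network",
        "ask the network administrators about known issues or outages",
    ]


for step in next_steps("Request timed out.", "server01.example.local"):
    print(step)
```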
Please import the attached management pack into your test environment to evaluate whether this workaround works for you. It is not sealed and can be customized further if you wish. Test entering MM for all entities as described here (you can use the script from this post), shut down the server, and wait for the alert. The steps from the knowledge base article should then be self-explanatory.