This all started as a question about how “Computer Not Reachable” works. It was asked in a newsgroup, and I thought I would help and provide some insight. First, I would like to repeat that, in my opinion, basing “Computer Not Reachable” detection on a missing heartbeat while targeting “Health Service Watcher” is at least questionable and not very fortunate (it is almost like trying to find causality without a root-cause engine). The design came from performance limitations: we did not want to end up with as many workflows doing ICMP ping (which, from the OpsMgr perspective, is a blocking operation) as there are computers in the topology, and we had no efficient targeting story for recognizing an unreachable computer. So we now have this “nasty beast”: we execute a diagnostic on the heartbeat monitor's state change, which is later supposed to (through a set of recoveries) cause a monitor state change for “Computer Not Reachable”. This whole design was not very reliable in RTM, and I had to introduce some mitigations and reliability improvements in SP1 (which of course was not entirely positive, because two alerts are now raised instead of the one raised in RTM, but believe me, you want reliability :)).
Regardless of my rant, I decided to show how to add custom diagnostics to the Heartbeat monitor and, eventually, to answer the question and provide a management pack that sets the state of a custom monitor based on recovery output. This may be advanced material; please comment and/or ask questions through this blog if something requires further explanation (I will not discuss this on the newsgroup, since all of this information is about undocumented functionality that may break in any future release!).
Adding a custom diagnostic is really easy. In this case everything we need (the monitor and the target) is public and accessible outside the management pack where those elements are defined. In the end, it is just a matter of adding the following raw XML to a custom management pack (assuming you have all MP references set properly).
<Diagnostic ID="Microsoft.SystemCenter.Community.Diagnostic.NetViewDiagnostic" Comment="In response to heartbeat failure, net view machine" Accessibility="Internal" Enabled="true" Target="SCLibrary!Microsoft.SystemCenter.HealthServiceWatcher" Monitor="SC2007!Microsoft.SystemCenter.HealthService.Heartbeat" ExecuteOnState="Error" Remotable="true" Timeout="300">
<ProbeAction ID="Command" TypeID="System!System.CommandExecuterProbe">
Acting on the result is where the real fun starts! First, I need to mention that “Computer Not Reachable” is defined as internal, so it is not accessible, and a new monitor must be defined rather than disabling the original through an override included with the management pack. I will not say much about why it is an “aggregate” monitor; the only thing I will say is that in OpsMgr an aggregate monitor is the only kind of monitor without a workflow, and the runtime magically makes all the state changes. In our case we set the state through a recovery, so we do not want to “WASTE” a workflow (and there would be as many of them as there are monitored computers in the enterprise) if we never expect that workflow to set the state anyway. Also, there are some “magic” modules I WILL NOT DISCUSS (maybe ever, but we will see in the next release); those modules set the state of the monitor. (Some of you may be willing to do some reverse engineering, and you may get an idea of how we set the state to critical when the recovery result reports that the “net view” command failed, and how we set the state when the command succeeded.) So here is the recap through screenshots:
1. After import, the monitor state is NEVER set until the net view command is executed.
2. When the RMS recognizes that the heartbeat is missing, we execute the “net view” command inside the diagnostic. When net view succeeds, we set the state of the monitor to “Healthy”; otherwise we set it to “Critical”.
Attached is the management pack that provides this functionality. It may be used AS IS and confers no rights or support. Enjoy!
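To make the two steps above concrete outside of OpsMgr, here is a minimal Python sketch of the same decision logic. It is not how the management pack or the “magic” modules work internally; the function names (`run_net_view`, `monitor_state`) and the “Not monitored” label are my own assumptions for illustration, and I assume that a zero exit code from `net view` means success. The 300-second timeout simply mirrors the `Timeout="300"` attribute on the diagnostic.

```python
import subprocess
from typing import Callable, Optional


def run_net_view(computer: str, timeout: int = 300) -> int:
    """Run `net view \\\\computer` and return its exit code (Windows only).

    Hypothetical helper: a timeout or a failure to launch the command is
    reported as a non-zero exit code.
    """
    try:
        completed = subprocess.run(
            ["net", "view", "\\\\" + computer],
            capture_output=True,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        return 1
    return completed.returncode


def monitor_state(exit_code: Optional[int]) -> str:
    """Map the diagnostic result to a monitor state.

    None means the diagnostic has never run, so the state was never set.
    Exit code 0 is assumed to mean that `net view` succeeded.
    """
    if exit_code is None:
        return "Not monitored"
    return "Healthy" if exit_code == 0 else "Critical"


def on_heartbeat_missing(
    computer: str, probe: Callable[[str], int] = run_net_view
) -> str:
    """Run the diagnostic after a missed heartbeat and return the new state."""
    return monitor_state(probe(computer))
```

The probe is passed in as a parameter so the mapping can be exercised without a real network call, for example `on_heartbeat_missing("SERVER01", probe=lambda name: 0)`, where `SERVER01` is a made-up computer name.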