Heartbeat detection in MOM 2005


MOM 2005 agents routinely report their presence to their assigned management server by sending a heartbeat.  Understanding agent heartbeating is helpful because adjusting the default values may be beneficial in some environments.  Let’s go through how this works.


Heartbeating is divided into two parts – the agent and the management server.  The agent heartbeat settings are adjusted through global settings on the management server(s) as shown in figure 1 of the attachment.


The one configurable setting for agents is the ‘heartbeat interval’.  By default, the agent is configured to send a heartbeat via UDP port 1270 every 10 seconds.  You will note that this screen also shows the management server ‘heartbeat scan interval’ – this value defines how often the management server will look for a heartbeat from a particular agent.  More on that in a moment but for now just note that the heartbeat scan interval needs to be longer, by default three times longer, than the heartbeat interval.


On the management server side we have several more configuration options as shown in figure 2 of the attachment.


The first block of settings is to configure ‘Heartbeat Scan’.  There are two options here.  The first option, ‘Interval to Scan for Agent Heartbeats’, defines how often the management server will look to see if it has received a heartbeat from an agent.  The default setting is 30 seconds.  As you will recall, the agent will, by default, send in a heartbeat every 10 seconds.  With the default settings, then, the agent will have up to 3 opportunities to send a heartbeat before the management server looks to see if one has been received.  Since heartbeats are sent over UDP, it’s possible one may not arrive.  Using these settings, MOM accounts for that fact and avoids flagging a problem simply because of a potential and transient communications failure.
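

To make the arithmetic above concrete, here is a small Python sketch.  It is only an illustration, not anything MOM runs: the function names are mine, and the loss calculation assumes each UDP heartbeat is lost independently with the same probability, which real networks will not strictly follow.

```python
def heartbeats_per_scan(heartbeat_interval_s: int = 10,
                        scan_interval_s: int = 30) -> int:
    """Number of heartbeats an agent can send within one scan window.

    Defaults are the MOM 2005 defaults described in the post.
    """
    if heartbeat_interval_s <= 0 or scan_interval_s <= 0:
        raise ValueError("intervals must be positive")
    return scan_interval_s // heartbeat_interval_s


def chance_all_lost(loss_rate: float, attempts: int) -> float:
    """Probability that every heartbeat in a scan window is lost.

    ASSUMPTION: each heartbeat is lost independently with probability
    loss_rate. This is a simplification for illustration only.
    """
    if not 0.0 <= loss_rate <= 1.0:
        raise ValueError("loss_rate must be between 0 and 1")
    return loss_rate ** attempts


if __name__ == "__main__":
    attempts = heartbeats_per_scan()
    print("heartbeats per scan window:", attempts)
    print("chance all are lost at 10% loss:", chance_all_lost(0.10, attempts))
```

The point of the sketch is simply that several heartbeats per scan window make a false alarm from one dropped packet unlikely.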


Also in the first block is the setting ‘Scan agentless computers every specified number of times Management Server performs agent scan’.  The default setting is 3.  This setting is specific to machines that are monitored agentlessly – not a common scenario – and by default indicates that the management server should scan agentless machines every 90 seconds (3 times the 30 second scan interval defined for agent managed machines).


The second block of settings is to configure ‘Heartbeat Ping’ behavior.  During heartbeat checking, as we will see in a minute, each time the management server looks for a heartbeat and fails to find one, MOM will initiate a ping to determine if the agent machine is actually online.  Just because a machine fails to send a heartbeat doesn’t mean that the machine is down – MOM heartbeat checking uses ping checks to distinguish machines that are offline from those that simply haven’t sent a heartbeat.


The ‘Number of Ping attempts’ setting defines how many pings will be done to determine if the target machine responds.  The ‘Time between pings’ setting defines how long to wait between each ping attempt.  The ‘Ping time out’ setting defines how long to wait without hearing a response before the ping attempt is considered a failure.  The ‘Number of scans before generating service unavailability’ setting defines how many scan attempts will be done prior to flagging the MOM agent service as unavailable.
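

One way to keep these settings straight is to write them down as a small data structure.  The Python sketch below is purely illustrative: the class and field names are my own, not MOM identifiers.  The scan defaults come from the values described above; the ping settings have no defaults here because this post does not state them, and the worst-case ping duration assumes the pings run one after another.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScanSettings:
    """Heartbeat and scan settings, with the defaults described in the post."""
    heartbeat_interval_s: int = 10
    scan_interval_s: int = 30
    agentless_scan_every_n: int = 3
    scans_before_unavailable: int = 3

    def __post_init__(self) -> None:
        # The scan interval needs to be longer than the heartbeat interval.
        if self.scan_interval_s <= self.heartbeat_interval_s:
            raise ValueError("scan interval must exceed heartbeat interval")

    @property
    def agentless_scan_period_s(self) -> int:
        """How often agentless machines are scanned, in seconds."""
        return self.agentless_scan_every_n * self.scan_interval_s


@dataclass(frozen=True)
class PingSettings:
    """Ping settings; values must be supplied (defaults not given in the post)."""
    attempts: int
    time_between_pings_s: float
    timeout_s: float

    @property
    def worst_case_duration_s(self) -> float:
        # ASSUMPTION: pings are sequential, every attempt times out,
        # and the wait applies only between attempts.
        waits = max(self.attempts - 1, 0) * self.time_between_pings_s
        return self.attempts * self.timeout_s + waits


if __name__ == "__main__":
    print(ScanSettings().agentless_scan_period_s)
    # Placeholder ping values, chosen only to exercise the code.
    print(PingSettings(attempts=3, time_between_pings_s=1.0,
                       timeout_s=2.0).worst_case_duration_s)
```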


Let’s pull all of this together to discuss how this mechanism works.  Assume all settings are default and a MOM agent that is heartbeating every 10 seconds suddenly stops, due to a system problem, server reboot, etc.  The MOM management server is somewhere in its 30 second detection period when this happens.  Assuming MOM has received a valid heartbeat within the current 30 second window, the management server will wait for another 30 second period and then check again for a heartbeat.  This time no heartbeat will be seen.  In response to that, the management server will initiate a series of pings.


Assuming the ping attempt fails, MOM will immediately generate an event/alert indicating the ping failed and the target machine may be down.  When a machine is actually down, the notification happens as close to real time as possible.


Assuming the ping attempt succeeds, MOM will wait another 30 second window to see if a heartbeat arrives.  If no heartbeat has arrived by the end of that second 30 second window, MOM will again initiate the ping test to verify the system is online.  Assuming that succeeds, MOM will wait a third 30 second window and, if there is still no heartbeat, will initiate a third series of pings.  Assuming that comes back OK, MOM will generate an event/alert indicating it has not received current heartbeats from the agent but did verify the agent machine was online.  Remember, the 3 scan attempts are driven by the setting on the management server and are configurable.
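

The decision flow just described can be sketched as a tiny Python function.  This is my own simplification for illustration, not MOM’s actual code: the function name and the outcome strings are hypothetical, and each entry in the input stands for the overall result of one ping series following a scan that found no heartbeat.

```python
from typing import Iterable, Optional


def evaluate_missed_scans(ping_ok_per_scan: Iterable[bool],
                          scans_before_unavailable: int = 3) -> Optional[str]:
    """Walk successive scans in which no heartbeat was found.

    Returns:
      "machine_down"      as soon as a ping series fails,
      "agent_unavailable" once the configured number of scans have all
                          missed a heartbeat while pings kept succeeding,
      None                if neither condition has been reached yet.
    """
    missed_scans = 0
    for ping_ok in ping_ok_per_scan:
        missed_scans += 1
        if not ping_ok:
            # Machine did not answer the ping: alert immediately.
            return "machine_down"
        if missed_scans >= scans_before_unavailable:
            # Machine is online but the agent has stayed silent.
            return "agent_unavailable"
    return None


if __name__ == "__main__":
    print(evaluate_missed_scans([False]))
    print(evaluate_missed_scans([True, True, True]))
    print(evaluate_missed_scans([True, True]))
```

If a heartbeat does arrive between scans, the sequence simply ends and no alert is raised, which corresponds to the `None` result here.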


Based on the above description, you may see an event/alert combination after approx. 30 seconds when MOM realizes a machine is totally offline.  If the machine is actually OK but the MOM agent is the one having problems, there will be a delay of approx. 2 minutes before you receive the heartbeat failure event/alert.


These default settings can be adjusted to fit the needs of each operating environment, but it is crucial to understand how all of these settings interact in order to predict the end behavior of MOM.  If, for example, the default number of scans were adjusted from 3 to 10, MOM would delay notification on missing heartbeats for approx. 6-7 minutes.  Adjusting combinations of settings may affect this time period even more drastically.
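

For a rough feel of how the scan count drives the delay, the Python sketch below estimates lower and upper bounds.  It is a back-of-the-envelope model of my own, not a MOM formula: it assumes the agent stops somewhere inside a scan window that still passes its check, and it leaves out the time spent on the ping series entirely, so real delays will run longer than these bounds (which is why the 10-scan result comes out lower than the approx. 6-7 minutes quoted above).

```python
from typing import Tuple


def alert_delay_bounds_s(scans_before_alert: int,
                         scan_interval_s: int = 30) -> Tuple[int, int]:
    """Rough (min, max) seconds from agent failure to alert.

    ASSUMPTIONS: the window in which the agent stops still passes its
    check, so between 0 and one full scan interval elapses before the
    counted scans begin; ping time is NOT included.
    """
    if scans_before_alert < 1 or scan_interval_s <= 0:
        raise ValueError("need at least one scan and a positive interval")
    counted = scans_before_alert * scan_interval_s
    return counted, counted + scan_interval_s


if __name__ == "__main__":
    # Machine down: the first failed ping alerts after a single missed scan.
    print("machine down:", alert_delay_bounds_s(1))
    # Agent silent but machine online, default of 3 scans.
    print("agent silent, 3 scans:", alert_delay_bounds_s(3))
    # Same case with the scan count raised to 10.
    print("agent silent, 10 scans:", alert_delay_bounds_s(10))
```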


One further comment on this.  MOM heartbeat data is stored in the database, but this information is NOT what is used to determine the last heartbeat received from an agent.  Instead, each management server maintains an in-memory list of each of its managed agents and their last heartbeat time.  This is what is used for heartbeat checking.
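

Conceptually, that in-memory list behaves something like the Python sketch below.  This is only a guess at the general shape for illustration: the class, its methods, and the agent names are all hypothetical, and MOM’s real internal structure is not documented in this post.

```python
from typing import Dict, List


class HeartbeatTable:
    """Hypothetical in-memory map of agent name to last heartbeat time."""

    def __init__(self) -> None:
        self._last_seen: Dict[str, float] = {}

    def record(self, agent: str, received_at: float) -> None:
        """Remember when the latest heartbeat from this agent arrived."""
        self._last_seen[agent] = received_at

    def silent_agents(self, now: float,
                      scan_interval_s: float = 30.0) -> List[str]:
        """Agents with no heartbeat inside the scan window ending at 'now'."""
        cutoff = now - scan_interval_s
        return sorted(agent for agent, seen in self._last_seen.items()
                      if seen < cutoff)


if __name__ == "__main__":
    table = HeartbeatTable()
    table.record("web01", received_at=100.0)  # made-up agent names
    table.record("sql01", received_at=125.0)
    print(table.silent_agents(now=135.0))
```

Keeping this in memory means a scan never has to query the database just to decide whether an agent has gone quiet.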


I may blog more on this in future submissions as there are even more ‘behind the scenes’ details as to how this works, both in terms of the mechanics and the rules that detect these potential failures.


-Steve

Agent and management server heartbeat settings.zip

Comments (10)

  1. pengchen_syd says:

    Hi

    What happens when the MOM server cannot find the IP address of the monitored server because there is no WINS or DNS server?

    regards

  2. steverac says:

    I’m not sure exactly what you are asking.  Heartbeats are sent from the agent to the management server – I wouldn’t expect that we will try to resolve the name when we receive the heartbeat.

  3. pengchen_syd says:

    There are two types of heartbeats, one from agent to server and one from server to agent.  I am referring to the latter one.

    The problem I am having occurs when I simulate a power failure by:

    1. Shutting down MOM-Server,
    2. Shutting down MOM-Agent (simulating a power failure),
    3. Bringing up MOM-Server and leaving MOM-Agent off.

    I get “Critical Error for MOM-Server MOM Heartbeat Failure Summary” only and State view showed MOM-Agent is OK.

    Because there is no WINS or DNS server in the environment there is no way for the MOM-Server to resolve the IP address of the MOM-Agent. Would this be the cause of not receiving a heartbeat failure alert for the agent?

    I have since added the MOM-Agent entry to the hosts file on the MOM-Server but ended up with the same result.

    regards
    Peng

  4. steverac says:

    Peng, I am a bit confused by your question.  There are only two types of heartbeats in MOM 2005 – agent heartbeats and server heartbeats.  The agent heartbeats are initiated by the agent and sent to the management server over port 1270 UDP – the server heartbeats are management server to management server.

  5. michael@ullitz.net says:

    I have a problem with the MOM 2005 heartbeat rule.  Because the rule is dedicated to the MOM server, it will find all servers with a heartbeat problem and generate an alert.

    In my organization we work with production servers and test servers.  When a production server loses its heartbeat, the rule must send an SMS to a mobile telephone (24 hours a day).  But when it is a test server, I only want an alert in my MOM console.

    Because the MOM heartbeat rule is dedicated to the MOM server group, I can’t add a disable filter to the rule for a specific group.

    Can somebody help?

  6. steverac says:

    If I understand the question correctly, one thought would be to modify the rule to fire a script response.  The script would check the generated alert to see what computer it is for.  If it is for one of the test servers, the script would exit; if it is for a production server, the script would proceed to fire your SMS message.

  7. Wilson says:

    Hi,

    You may also check how to send SMS from MOM 2005 with Ozeki NG SMS Gateway

    BR