State based alerts in MOM 2005


Alerts in MOM are generally used to inform administrators of a potential problem condition.  By the time the administrator reviews the alert, the problem condition that caused the alert may already be resolved. 


State based alerts are special types of alerts that were introduced, in part, to allow an administrator to tell at a glance whether the condition that caused the alert to show up is an ongoing condition. 


The MOM operator console uses system state to describe graphically the general health of systems – generally grouped by server function (Exchange, AD, SQL, etc).  In some cases, the thresholds used to describe healthy vs. unhealthy can be configured by the administrator.



At the individual level, state based alerts have a ‘problem state’ property as part of the alert description.  This property describes the current state of the alert.  If the problem is still occurring, the value will be listed as ‘active’.  If the problem has been resolved, the value will be listed as ‘inactive’.  An inactive alert will remain until it is manually cleared or cleared by grooming.  Unresolved alerts whose state is set to ‘active’ can only be removed manually, even if grooming is configured to remove alerts.
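The interaction between problem state and grooming can be pictured with a small toy model. This is only a sketch in Python, not MOM code: the `Alert` class, the `problem_state` field, and the `groom` function are hypothetical names made up for illustration, and the model ignores the age threshold that real grooming would also apply.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    name: str
    problem_state: str  # "active" or "inactive" (hypothetical field name)


def groom(alerts):
    """Return the alerts that survive a grooming pass.

    In this toy model grooming only removes inactive alerts;
    active alerts are always kept and must be cleared manually.
    """
    return [a for a in alerts if a.problem_state == "active"]


alerts = [
    Alert("Disk degraded", "active"),
    Alert("Service stopped", "inactive"),
]
remaining = groom(alerts)
print([a.name for a in remaining])
```

Only the inactive alert is removed by the grooming pass; the active one stays until someone clears it by hand.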



There are two requirements for creating an alert that is state based. The first is to select the ‘enable state alert properties’ option on the alert itself and set the Instance value – typically $Logging Computer$.


 


The second is to define which criteria will cause the alert to be generated with its severity and its state set to ‘active’, and which criteria will cause an active alert to change state to ‘inactive’ – commonly known as the ‘clearing event’.  State based alerts only work if the conditions that flag a problem state are defined.  As shown in the graphic below, these criteria can be virtually any MOM property – event ID, event source, event parameter, etc.
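The pairing of a raising condition with a clearing condition can be sketched as a small lookup keyed by instance. This is an illustrative Python model, not the way MOM evaluates rules internally: the `StateTracker` class, the event IDs 9999 and 9998, and the server names are all assumptions chosen for the example.

```python
from dataclasses import dataclass, field

RAISE_EVENT_ID = 9999  # hypothetical event that sets the state to 'active'
CLEAR_EVENT_ID = 9998  # hypothetical 'clearing event'


@dataclass
class StateTracker:
    # Maps an instance (here the logging computer) to its problem state.
    states: dict = field(default_factory=dict)

    def process(self, computer, event_id):
        """Apply one event and return the resulting state for that computer."""
        if event_id == RAISE_EVENT_ID:
            self.states[computer] = "active"
        elif event_id == CLEAR_EVENT_ID and computer in self.states:
            self.states[computer] = "inactive"
        return self.states.get(computer)


tracker = StateTracker()
tracker.process("SERVER01", 9999)
tracker.process("SERVER02", 9999)
tracker.process("SERVER01", 9998)  # SERVER02 never receives a clearing event
print(tracker.states)
```

In this sketch SERVER01 ends up ‘inactive’ while SERVER02 stays ‘active’, because no clearing event ever arrived for it.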




Questions arise from time to time about alerts that remain active and never receive the ‘clearing event’ that flags them as inactive and allows them to be groomed from the database.  This can occur for several reasons, including:


-The actual ‘clearing event’ was not received – perhaps a script that produces the ‘clearing event’ was not executed.
-The system has been removed from a computer group so that it is no longer able to run the rules that would generate the ‘clearing event’.
-The problem condition is continuing to occur.

Hope this helps – feel free to post comments with further questions.


-Steve

Comments (9)

  2. pedrofaustino says:

    Hi Steve,

    I’ve been playing around with state based alerts in MOM 2005. I even created a script that was called by an event rule every say 15 min. This script was:

    – creating 1st alert (state-based of course);

    – waiting for 90 seconds;

    – changing problem state of 1st alert from active to inactive; raising a 2nd alert;

    – waiting for 90 seconds;

    – changing problem state of 2nd alert from active to inactive;

    Everything was working fine.

    But if I change the 90 seconds time interval to around 1 second OR to more than 1 hour, the problem state is not correctly changed from active to inactive.

    Did this happen to you? Any ideas?

    Thank you,

    pedrofaustino

  3. steverac says:

    Sounds odd – I haven’t run into this problem and I have tested in similar scenarios to what you describe.  The only thing that comes to mind right away is a potential timing issue.  Also, it seems from your description that the script you are using is manually manipulating the values which is going around the normal processes MOM typically uses to change state values and such.  If that is the case you should test whether this happens when you let MOM do the work – a simple test would be:

    1.  Create an event rule to run a script – the script would fire an event 9999, wait one second, fire an event 9998, and keep doing this for several iterations (perhaps a for loop).

    2.  Create another event rule that looks for both event 9999 and 9998 – configure the rule so that event 9999 causes an alert to be fired and an event 9998 sets an all clear status.

    I would say though that setting anything to run every second is a bit extreme.

    Steve

  4. pedrofaustino says:

    Thanks for the reply!

    My specific problem is: I need to monitor 3DM2 RAID cards on all computers of the network. I’m looking for events in the event log of all machines that satisfy my criteria. E.g.: if a disk breaks down, there’s a DEGRADED_UNIT event and the RAID controller takes care of rebuilding the unit, thus immediately firing a REBUILD_STARTED event. This occurs within seconds. But the rebuild process takes time, so only after approx. 1h30m the 3DM2 RAID controller fires a REBUILD_DONE event.

    I need to fire 3 alerts to MOM, one for each event. Whenever the next event in the logical sequence occurs, I change the problem state of the previous alert to inactive. I could do this with an event rule that fires alerts whenever it finds specific events (and just configure the "alert" tab). But the problem is that I need to access specific fields of the events which MOM doesn’t have access to. This is the reason for a script response.

    This is a timeline version of what I need:

    DEGRADED_UNIT -> Critical Error alert

    (some seconds of wait)

    REBUILD_STARTED -> change problem state of previous critical error alert to inactive and fire a Warning alert

    (more than 1 hour of wait)

    REBUILD_DONE -> change problem state of previous warning alert to inactive and fire Success alert

    My script can’t change the problem state of the (previous) warning alert on the third and last step. In other words, 3DM2 RAID problem is solved but in the state view there’s a yellow warning sign attached to the computer.

    If you have any ideas…

    Thank you again.

    Best,

    Pedro

  5. pedrofaustino says:

    About the timing issue you refer to. Maybe that can be the problem for events within 1 second or less of interval. Maybe the MOM client queues the alert firing job, maybe some database issue. I don’t know, just wildly guessing.

    But what about more than 1 hour?! It’s odd. I’ve successfully changed problem state for time intervals less than 1 hour (in my case it was 59 min and 5 secs). My working day just finished but tomorrow I’ll try to test what happens when time interval is say 1h01.

    Thank you again.
    Best,
    Pedro

  6. steverac says:

    Pedro,

    You do have an interesting problem to solve and MOM should be able to solve it for you – even if we do have to do some scripting.  Addressing something like this is somewhat problematic in this forum so I would suggest the best and most timely avenue for you would be to open a case with Product Support Services for MOM with the details of what you are trying to do.

    Steve

  7. Will says:

    Hi,

    You may also check how to send SMS alerts from MOM 2005 with Ozeki NG SMS Gateway:

    sms-integration.com/p_14-mom-2005-sms.html

    BR