What is new: OpsMgr 2007 R2 - Alert Storm Recognition (possible rule misconfiguration)

03/26/2009

What is new?

The OpsMgr 2007 R2 Release Candidate has finally been released and can be downloaded from Connect. What is new in this release? PLENTY! Some of you got a glimpse of these features while evaluating the Beta; others will see most of the improvements for the first time. Very exciting!

That is why I would like to start a small series in which I comment on some of the changes and additions. With this post, I would like to mention a design change that suspends alert creation in order to prevent an alert storm. Yes, we did bring the MOM 2005 feature back (at least for rules)!

Alert storm mitigation at a glance:

I need to clarify that we are not trying to solve the generic data storm problem; that is a vNext scenario. We were only addressing the case of a possible “rogue” alert-generating rule flooding the operational DB and/or raising too many notifications.

The settings used to recognize such a problem are per agent (across all targeted instances) and per individual management group (in a multi-homed scenario, the registry holds separate settings for each management group). The default throttle settings are 50/60/10: if one rule generates more than 50 alerts within 60 seconds, that rule is suspended for 10 minutes (its alert generation is disabled).
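To make the 50/60/10 behavior concrete, here is a minimal sketch of a per-rule throttle. It is a toy model, not OpsMgr code: the class name, the sliding-window bookkeeping, and the decision to drop the alert that crosses the threshold are all assumptions made for illustration.

```python
from collections import deque


class AlertThrottle:
    """Toy model of the per-rule throttle described above (not OpsMgr code)."""

    def __init__(self, alert_count=50, count_interval_s=60, suspend_interval_s=600):
        self.alert_count = alert_count                # default 50 alerts
        self.count_interval_s = count_interval_s      # default 60 seconds
        self.suspend_interval_s = suspend_interval_s  # default 10 minutes
        self._recent = deque()        # timestamps of alerts still inside the window
        self._suspended_until = None  # None means the rule is not suspended

    def allow(self, now):
        """Return True if an alert raised at time `now` (in seconds) gets through."""
        if self._suspended_until is not None:
            if now < self._suspended_until:
                return False          # rule is still suspended
            self._suspended_until = None
            self._recent.clear()

        # Forget alerts that have fallen out of the counting window.
        while self._recent and now - self._recent[0] >= self.count_interval_s:
            self._recent.popleft()

        self._recent.append(now)
        if len(self._recent) > self.alert_count:
            # More than `alert_count` alerts inside the window: suspend the rule.
            # Assumption: the alert that crosses the threshold is dropped too.
            self._suspended_until = now + self.suspend_interval_s
            self._recent.clear()
            return False
        return True


throttle = AlertThrottle()
burst = [throttle.allow(second) for second in range(51)]  # 51 alerts in 51 seconds
```

In this model the first 50 alerts of the burst pass, the 51st triggers the suspension, and nothing gets through until the 10 minutes have elapsed.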

The option to customize the threshold values still exists … Customization will not work in one very special deployment scenario: an OpsMgr 2007 R2 agent multi-homed to at least one management group monitored by an OpsMgr 2007 SP1 server. The reason is that such an agent is forced to use the SP1 management packs, which obviously lack the new configuration required for threshold customization. For the runtime to recognize customized values, the health service must be restarted!

When the runtime recognizes that a possible storm is happening, event 5399 is raised. The following is the English version of the event definition:

;// Suspend alert generating rule
;// %1 = management group name
;// %2 = workflow name
;// %3 = name of targeted instance
;// %4 = instance id
;// %5 = alert origin (name or message id)
;// %6 = count
;// %7 = time
;// %8 = disabled time

MessageId=5399
SymbolicName=MSG_HS_HM_ALERT_SUSPENDED
Severity=Warning
Language=English

A rule has generated %6 alerts in the last %7 seconds. Usually, when a rule generates this many alerts, it is because the rule definition is misconfigured. Please examine the rule for errors. In order to avoid excessive load, this rule will be temporarily suspended until %8.
%nRule: %2
%nInstance: %3
%nInstance ID: %4
%nManagement Group: %1.
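As a rough illustration of how the numbered inserts end up in the event text, the sketch below substitutes sample values into the message. It assumes `%n` denotes a line break, as in Windows message-table text; the helper name and every sample value (management group, rule, instance, timestamp format) are hypothetical.

```python
import re

# Message text of event 5399, copied from the definition above.
TEMPLATE = (
    "A rule has generated %6 alerts in the last %7 seconds. Usually, when a "
    "rule generates this many alerts, it is because the rule definition is "
    "misconfigured. Please examine the rule for errors. In order to avoid "
    "excessive load, this rule will be temporarily suspended until %8."
    "%nRule: %2%nInstance: %3%nInstance ID: %4%nManagement Group: %1."
)


def render_event(template, inserts):
    """Replace %1..%99 with the matching insert and %n with a newline."""
    def substitute(match):
        token = match.group(1)
        if token == "n":
            return "\n"
        return str(inserts[int(token) - 1])  # %1 maps to inserts[0]

    return re.sub(r"%(n|[1-9][0-9]?)", substitute, template)


# Hypothetical insert values, in the order %1..%8 listed in the comments above.
SAMPLE_INSERTS = [
    "MyMG",                        # %1 management group name
    "Contoso.Sample.AlertRule",    # %2 workflow name
    "server01.contoso.example",    # %3 name of targeted instance
    "{hypothetical-instance-id}",  # %4 instance id
    "Contoso.Sample.AlertRule",    # %5 alert origin (unused in the text)
    51,                            # %6 count
    60,                            # %7 time
    "2009-03-26 10:15:00",         # %8 disabled time (format is a guess)
]

text = render_event(TEMPLATE, SAMPLE_INSERTS)
print(text)
```

Note that `%5` (alert origin) is declared in the comments but does not appear in the message text, so it never shows up in the rendered event.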

OpsMgr 2007 R2 health monitoring recognizes this event and raises an alert to notify the operator about the problem. The alert needs to be closed manually once corrective action has been taken or the conditions causing the possible storm have been mitigated.

The following is an example of customized threshold values. It shows a 15/30/5 customization (15 alerts within 30 seconds cause a suspension of 5 minutes, or 300 seconds). It also shows where in the registry such customization should be done. One must create “Alert Count”, “Alert Count Interval” and “Alert Suspend Interval” under “HKLM\System\CurrentControlSet\Services\HealthService\Parameters\Management Groups\<name of MG>”.

[Image: threshold customizations]
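For readers who prefer text to a screenshot, here is a small sketch that builds the contents of a `.reg` file for the 15/30/5 customization. Two things are assumptions rather than facts from this post: that the three values are DWORDs, and that both intervals are stored in seconds (which the "300 seconds" above suggests). The management group name `MyMG` and the function name are hypothetical.

```python
# Registry path taken from the text above; the management group name is appended.
BASE_KEY = (
    "HKEY_LOCAL_MACHINE\\System\\CurrentControlSet\\Services\\HealthService"
    "\\Parameters\\Management Groups"
)


def build_reg_text(management_group, alert_count, count_interval_s, suspend_interval_s):
    """Return .reg file text that sets the three throttle values.

    Assumption: all three values are REG_DWORD and the intervals are in seconds.
    """
    values = [
        ("Alert Count", alert_count),
        ("Alert Count Interval", count_interval_s),
        ("Alert Suspend Interval", suspend_interval_s),
    ]
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        "[{}\\{}]".format(BASE_KEY, management_group),
    ]
    for name, number in values:
        # .reg files encode DWORDs as eight lowercase hex digits.
        lines.append('"{}"=dword:{:08x}'.format(name, number))
    return "\r\n".join(lines) + "\r\n"


# 15 alerts within 30 seconds suspend the rule for 5 minutes (300 seconds).
reg_text = build_reg_text("MyMG", 15, 30, 300)
print(reg_text)
```

Whichever way the values are created, remember that the health service must be restarted before the runtime picks them up.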

I hope you enjoy this product as much as we hope you will. I always feel happy about a release; this time I also feel rather confident about its quality and value! Questions, comments, feedback (anything): please let me know. I will try to continue this series often, so if there is anything in particular you want covered, scream and I will move it higher in my TODO list!
