This has been a common request. Suppose you have an OpsMgr environment with a variety of monitors working in it. At the core, monitors tell you whether your systems are healthy or not. Green is healthy, yellow is warning and red is critical. There are lots of different types of monitors, and as they operate their state changes as different conditions are encountered. You will see a monitor go from green to red when a problem condition is detected, and if that problem condition clears automatically the monitor will return to green and even close the alert if configured to do so.
Let’s discuss an example where things might get a bit confusing for the operator.
Let’s assume that one of your monitors was green but during operation came across a problem, flagged the health of your system as critical and fired a corresponding alert. And let’s say your operator saw the alert and cleared it after putting the action to fix the problem on their ‘to do’ list. Then the operator goes out of the office for several weeks. While the operator is out, the system remains in a critical state but you are never again alerted to the problem. Why? When the system state changed from green to red (or yellow, for that matter) OpsMgr DID alert, but nothing was done to fix the underlying problem, so the system remained critical. If the system had come out of the critical state and then gone back into it, THEN you would have gotten another alert, but you get no new alert while it simply stays critical. This is just one example.
Remember, monitors are dynamic, meaning that if a system enters a critical state and then comes out of it, the monitor will detect that change and can restore the system to a healthy state in your views. But if the state doesn’t change, the system doesn’t ‘realert’. There are some provisions built in to help with this, like monitors that reset their health state on a timer, but not all monitors are built that way.
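To make the ‘no state change, no new alert’ behavior concrete, here is a toy model in Python. It is only a sketch of the idea described above, not OpsMgr code: the ToyMonitor class and its observe method are made up for illustration, and it assumes an alert fires only when the state changes to something other than green.

```python
from dataclasses import dataclass, field

HEALTHY, WARNING, CRITICAL = "green", "yellow", "red"


@dataclass
class ToyMonitor:
    """Hypothetical stand-in for a monitor; not part of any OpsMgr API."""
    state: str = HEALTHY
    alerts: list = field(default_factory=list)

    def observe(self, new_state):
        # Alert only when the state CHANGES to an unhealthy value.
        # Seeing the same unhealthy state again raises nothing.
        if new_state != self.state and new_state != HEALTHY:
            self.alerts.append(f"{self.state} -> {new_state}")
        self.state = new_state


# A system that goes critical and stays critical.
stuck = ToyMonitor()
for reading in [CRITICAL, CRITICAL, CRITICAL]:
    stuck.observe(reading)
stuck_alerts = len(stuck.alerts)

# A system that goes critical, recovers, then goes critical again.
flapping = ToyMonitor()
for reading in [CRITICAL, HEALTHY, CRITICAL]:
    flapping.observe(reading)
flapping_alerts = len(flapping.alerts)

print(stuck_alerts, flapping_alerts)
```

The stuck system produces a single alert no matter how long it stays red, while the one that recovers and fails again produces a second alert. That is the gap the operator in the example above fell into.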
So, how can you be sure that the health state of your systems is periodically reset so that you DO get realerted on an ongoing problem? While there are several possible solutions, I just saw a blog link from a colleague that looks really good. Introducing the GREEN MACHINE. This tool will go through OpsMgr and reset the health state of your monitors to green so you can basically start fresh. The tool is very flexible and even comes with source code. Having this tool in the OpsMgr admin’s toolbelt will help ensure that problems don’t get hidden due to accidental mishandling of an alert.
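Why does resetting to green help? Continuing with a toy model in Python (again, a sketch with made-up names such as reset_unhealthy and evaluate, not the Green Machine’s actual code or any OpsMgr API), the idea is that once a monitor is back at green, the next evaluation that still finds the problem counts as a fresh state change and so alerts again.

```python
def reset_unhealthy(states):
    """Return a fresh all-green copy plus the names that were not green."""
    reset_names = [name for name, state in states.items() if state != "green"]
    fresh = {name: "green" for name in states}
    return fresh, reset_names


def evaluate(states, name, observed):
    """Update one monitor; return True if this observation would alert."""
    # Assumption: an alert fires only on a change to a non-green state.
    alert = observed != states[name] and observed != "green"
    states[name] = observed
    return alert


# Hypothetical monitor names and states, for illustration only.
states = {"disk-space": "red", "cpu": "green", "service-x": "yellow"}

# The disk problem is still there, but the state is unchanged: no alert.
silent = evaluate(states, "disk-space", "red")

# Reset everything to green, then evaluate the same condition again.
states, reset_names = reset_unhealthy(states)
realert = evaluate(states, "disk-space", "red")

print(silent, reset_names, realert)
```

In this sketch the first evaluation stays silent and the one after the reset alerts, which is the ‘start fresh’ effect described above. How the real tool selects and resets monitors is up to its own options and source code.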