Cluster resource groups are not monitored! Is there anything I can do?

Recently I saw some questions about why cluster infrastructure is only partially discovered and not monitored even though all cluster services are healthy. I replied to some of those questions, either internally or on the newsgroup, trying to explain why such a health model is possible (that answer is not the subject of this post, but I will explain again if needed; just ask through a comment and I will get back to you). Back to the original problem: I was never really able to say what was actually wrong with cluster infrastructure discovery. In fact, the issue turned out to be rather tricky to reproduce locally in order to investigate it. Luckily, thanks to Brian, we now know that one of the possible root causes is trailing spaces in a cluster resource group name and/or in a cluster resource name.

PROBLEM

There is a bug in the current discovery script: the value of the key property is trimmed when the instance is created, but trailing spaces are not removed when a relationship instance is created for that same instance. As a result, our discovery insertion module fails discovery data validation when the data is about to be inserted, and the following event is logged (all GUIDs may/will differ in your topology):

Event Type: Error
Event Source: Health Service Modules
Event Category: None
Event ID: 10801
Date: 4/30/2008
Time: 11:16:41 AM
User: N/A
Computer: PACER01D
Description:

Discovery data couldn't be inserted to the database. This could have happened because of one of the following reasons:

  - Discovery data is stale. The discovery data is generated by an MP recently deleted.

  - Database connectivity problems or database running out of space.

  - Discovery data received is not valid.

The following details should help to further diagnose:

 DiscoveryId: 7d5fdd7b-2e94-406b-8178-8434625a7885

 HealthServiceId: 6127d93a-76c8-c381-a593-4ebd7849ca7e

 Invalid relationship target specified in the discovery data item.

RelationshipTargetBaseManagedEntityId: 14a220e0-68b4-59e8-a789-16502957eef2

RuleId: 7d5fdd7b-2e94-406b-8178-8434625a7885

Instance:

<?xml version="1.0" encoding="utf-16"?>

<RelationshipInstance TypeId="{121337e9-cd89-393f-2571-bcfaaf54d1eb}" SourceTypeId="{72700ece-9199-408a-9d74-241df8143e2c}" TargetTypeId="{5bb0294e-6846-b398-740c-06dd351f772a}">

  <Settings />

  <SourceRole>

    <Settings>

      <Setting>

        <Name>{A0F90386-D16E-43AD-C419-77525FD9F62B}</Name>

        <Value>smcluster90</Value>

      </Setting>

    </Settings>

  </SourceRole>

  <TargetRole>

    <Settings>

      <Setting>

        <Name>{6A12CC8B-89C2-29A3-3D5A-B3A6266481CD}</Name>

        <Value>smcluster90</Value>

      </Setting>

      <Setting>

        <Name>{AB1EB97B-7A96-2544-F713-BA27D1D9BAF3}</Name>

        <Value>marius trail space </Value>

      </Setting>

    </Settings>

  </TargetRole>

</RelationshipInstance>.
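If you want to check whether an event 10801 in your own environment has the same root cause, the Instance XML can be scanned for values that end in whitespace. The following is only a minimal sketch in Python, not part of any management pack: the function name find_untrimmed_settings is hypothetical, and the XML is an abbreviated copy of the TargetRole section from the event above (attributes and SourceRole omitted).

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the Instance XML from event 10801 (TargetRole only).
INSTANCE_XML = """<RelationshipInstance>
  <TargetRole>
    <Settings>
      <Setting>
        <Name>{6A12CC8B-89C2-29A3-3D5A-B3A6266481CD}</Name>
        <Value>smcluster90</Value>
      </Setting>
      <Setting>
        <Name>{AB1EB97B-7A96-2544-F713-BA27D1D9BAF3}</Name>
        <Value>marius trail space </Value>
      </Setting>
    </Settings>
  </TargetRole>
</RelationshipInstance>"""


def find_untrimmed_settings(instance_xml):
    """Return (property id, raw value) pairs whose value ends in whitespace.

    Hypothetical helper for illustration; it only inspects the XML text.
    """
    root = ET.fromstring(instance_xml)
    suspects = []
    for setting in root.iter("Setting"):
        name = setting.findtext("Name", default="")
        value = setting.findtext("Value", default="")
        # A value that changes when right-stripped carries trailing whitespace.
        if value != value.rstrip():
            suspects.append((name, value))
    return suspects


if __name__ == "__main__":
    for name, value in find_untrimmed_settings(INSTANCE_XML):
        # repr() makes the trailing space visible in the output.
        print(name, repr(value))
```

Any setting reported this way is a candidate for the trailing-space problem described in this post.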

You can quickly verify whether the failing discovery is the one responsible for cluster infrastructure discovery by querying the database directly for the discovery ID from event 10801, which in my case would be:

use OperationsManager

select discoveryname from discovery where discoveryid = '7d5fdd7b-2e94-406b-8178-8434625a7885'

You can also look it up using PowerShell:

get-discovery -criteria "Id like '7d5fdd7b-2e94-406b-8178-8434625a7885'"

SOLUTION

This issue will be addressed in the next web release of the management packs responsible for monitoring cluster infrastructure. The nearest upcoming release is the Windows 2008 Failover Cluster management pack. Right now I do not have enough information to disclose a detailed release date; the MP will first be verified and deployed in MSIT and with TAP customers to get as much verification of its functionality as possible before it is released to the web. If I had to guess, I'd say the midsummer or early fall timeframe.

This also means that something may need to be done before the official fix is available for download. The most obvious solution is to remove trailing spaces from group or resource names. If there are many names that contain trailing spaces, or if changing a name is simply unacceptable, the attached management pack provides a workaround discovery script.
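To see why renaming helps, here is a toy model of the mismatch described in the PROBLEM section, written in Python. It is not the actual discovery script or the real validation logic: the functions discover and invalid_targets are hypothetical, and the assumption is simply that a relationship target must match a discovered instance key exactly.

```python
def discover(group_names):
    """Mimic the buggy flow: instance keys are trimmed, relationship targets are not."""
    instance_keys = {name.strip() for name in group_names}
    relationship_targets = list(group_names)  # bug: raw names, no trimming
    return instance_keys, relationship_targets


def invalid_targets(instance_keys, relationship_targets):
    """Return relationship targets that match no discovered instance key."""
    return [target for target in relationship_targets if target not in instance_keys]


if __name__ == "__main__":
    # One group name carries a trailing space, as in the event above.
    keys, targets = discover(["smcluster90", "marius trail space "])
    print(invalid_targets(keys, targets))

    # After renaming the group, keys and targets agree again.
    keys, targets = discover(["smcluster90", "marius trail space"])
    print(invalid_targets(keys, targets))
```

In this model the untrimmed target has no matching instance, which corresponds to the "Invalid relationship target" message; once the name has no trailing space, nothing is rejected.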

The workaround MP file contains a new class definition that extends Microsoft.Windows.Cluster.Group with a new property called OriginalGroupName. This property is needed to execute tasks related to the group, because the cluster WMI provider only succeeds when it is given the name including its trailing spaces. A new managed entity class had to be defined because the original class lives in a sealed MP and its properties cannot be extended from outside; that can only be done when the issue is fixed in the next version of the original management pack. That is also the reason why new tasks appear in the cluster group state view.

Another important thing to mention is that the cluster discovery interval is set to 30 minutes, which means that discovery should succeed and fix the original problem within that time after the management pack reaches the cluster node. This value is overridable, so it can be customized further. Please remember that the local discovery cache does not send data through the distributed workflow to the database unless properties have changed, so on a fairly stable cluster (by "stable" I mean one where cluster groups and resources rarely or never change) the interval could be increased, once successfully discovered data has appeared in the state views, just to lower CPU usage on the agent itself.

Because the officially released management packs are still required, the event describing the discovery data issue will continue to be raised in the Operations Manager event log. Also, unlike the official management pack, the workaround contains only polling discovery. This means that deleting a cluster resource group from the console will (very likely) succeed, but the group will stay in the view until the next discovery removes its instance from the database, which should happen within the interval specified for the discovery itself (30 minutes by default).
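The idea behind OriginalGroupName can be pictured with a small Python sketch. This is an illustration under assumptions, not the content of the attached MP: the DiscoveredGroup type, its field names, and the helper functions are hypothetical, and I am assuming that the trimmed name is used consistently as the key while the raw name is kept only for tasks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiscoveredGroup:
    key_name: str             # trimmed; assumed to be used for the instance key
    original_group_name: str  # raw name with trailing spaces, kept for WMI-based tasks


def make_group(raw_name):
    """Build one group record from the raw cluster group name."""
    return DiscoveredGroup(key_name=raw_name.strip(), original_group_name=raw_name)


def relationship_target(group):
    """Use the same trimmed key for relationships as for the instance itself."""
    return group.key_name


def task_target(group):
    """Tasks pass the untouched name, because the cluster WMI provider needs it."""
    return group.original_group_name


if __name__ == "__main__":
    group = make_group("marius trail space ")
    print(repr(relationship_target(group)))
    print(repr(task_target(group)))
```

Keeping both values side by side is what lets discovery validation and the group tasks each see the form of the name they expect.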

DISCLAIMER

Please import the attached management pack into your test environment to evaluate whether this cluster infrastructure discovery works for you. It is not sealed and can be further customized if you wish to do so. As expected, this workaround is provided AS IS, with no warranties, and confers no rights. Use of this management pack is subject to the terms specified at Microsoft.

Microsoft.Windows.2003.Cluster.Quick.Unofficial.Discovery.Fix.xml