Validating a Cluster with Zero Downtime

Hi Cluster Fans,

As you may already know, the requirements for having a supported cluster have been significantly simplified since the release of Windows Server 2008.  There is no longer a need to have every component in the configuration listed in the Windows Server catalog, as any commodity hardware should now support clustering.  However, every component must receive a Windows Server 2008 (R2) logo, and the entire solution must pass the built-in “Validate a Configuration” (“Validate”) suite of tests.

Validate (Step by Step Guide) can be run at any time once the Failover Clustering feature has been installed, including before the cluster has been deployed, during cluster creation and while the cluster is running.  In fact, additional tests are executed once the cluster is in use, and these check that best practices are being followed for the highly-available workloads.  Moreover, according to the Microsoft Support Policy for Failover Clusters, the “Validate test should also be run whenever a major component of the cluster is changed or updated.  The following are examples…Adding a node to the cluster; Upgrading or replacing the storage hardware; Upgrading the firmware or the driver for host bus adapters (HBAs); Updating the multipathing software or the DSM; Changing or updating any network adapter…”  So how can you minimize the impact of Validating a cluster while it is in production?

When the “Validate a Configuration” wizard is launched, it offers the choice to run all tests or a subset of tests.  With this granularity, it is possible to select all the tests which do not impact the cluster and skip those which can impact high availability for a group.  In fact, almost all of the tests can be run while the cluster is online without impacting anything running on the cluster, other than a slight performance hit due to processing the tests themselves.  There are five categories of tests:

  • Cluster Configuration – These tests are only executed on clusters that have been deployed to ensure that best practices are being followed.  They provide a simple way to review cluster settings and determine whether they are properly configured.
  • Inventory – These tests will inventory the hardware, software, and settings (such as network settings) on the servers, and information about the storage.
  • Network – These tests will ensure that your networks are set up correctly for clustering.
  • Storage – These tests will analyze the shared cluster storage to check it is behaving correctly and supports the required functions of the cluster.
  • System Configuration – These tests will check the system software and configuration settings across servers for compatibility.

Across these dozens of tests, only a few will impact running cluster workloads, and these are all within the Storage category, so skipping this entire category is an easy way to avoid disruptive tests.  Within Storage, the tests fall into three groups.  Listing All Disks and Potential Cluster Disks will not impact anything.  Validating Disk Access Latency, File System, Microsoft MPIO-based disks, and SCSI device Vital Product Data (VPD) can impact the disk’s performance because the tests are performed against a disk which is in use; however, they will cause no downtime unless the disk latency becomes so high that it triggers an alert.  Several tests will actually trigger failovers and move the disks and groups to different cluster nodes, which will cause downtime.  These include Validating Disk Arbitration, Disk Failover, Multiple Arbitration, SCSI-3 Persistent Reservation, and Simultaneous Failover.  So if you want to test a majority of the functionality of your cluster without impacting availability, exclude these tests.
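The grouping of storage tests above can be written down as a small lookup table.  The following Python sketch is only an illustration of that classification: the test names are spelled out from the descriptions in this post and may not match the exact strings your version of Validate uses (running `Test-Cluster -List` on a cluster node should show the real names), and the `Impact` levels and the helper `tests_to_skip` are hypothetical names invented for this example.

```python
from enum import Enum


class Impact(Enum):
    """How a storage test affects running workloads, per the post."""
    NONE = "none"                # no effect on running workloads
    PERFORMANCE = "performance"  # may slow the disk, no downtime expected
    DOWNTIME = "downtime"        # triggers failovers, causes downtime


# Assumed test names, expanded from the descriptions in the post.
STORAGE_TESTS = {
    "List All Disks": Impact.NONE,
    "List Potential Cluster Disks": Impact.NONE,
    "Validate Disk Access Latency": Impact.PERFORMANCE,
    "Validate File System": Impact.PERFORMANCE,
    "Validate Microsoft MPIO-based disks": Impact.PERFORMANCE,
    "Validate SCSI device Vital Product Data (VPD)": Impact.PERFORMANCE,
    "Validate Disk Arbitration": Impact.DOWNTIME,
    "Validate Disk Failover": Impact.DOWNTIME,
    "Validate Multiple Arbitration": Impact.DOWNTIME,
    "Validate SCSI-3 Persistent Reservation": Impact.DOWNTIME,
    "Validate Simultaneous Failover": Impact.DOWNTIME,
}


def tests_to_skip(allow_performance_hit=True):
    """Return the storage tests to exclude for a zero-downtime run.

    Downtime-causing tests are always skipped; tests that only affect
    disk performance are skipped too when allow_performance_hit is False.
    """
    blocked = {Impact.DOWNTIME}
    if not allow_performance_hit:
        blocked.add(Impact.PERFORMANCE)
    return [name for name, impact in STORAGE_TESTS.items() if impact in blocked]


if __name__ == "__main__":
    for name in tests_to_skip():
        print(name)
```

The resulting list is what you would clear in the wizard, or, assuming the names match, pass to the `-Ignore` parameter of the `Test-Cluster` cmdlet.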

Failover Clustering does come with a built-in safeguard to prevent accidental downtime when running the storage tests in Validate.  If the cluster has any online groups when Validate is initiated, and the storage tests remain selected, it will prompt the user to confirm whether they want to run all the tests (and cause downtime), or to skip testing the disks of any online groups to avoid downtime.  If the entire storage category was excluded from being tested, then this prompt is not displayed.  This will enable cluster validation with no downtime, but of course it is not complete, as some of the tests have been skipped.  Yet according to the Microsoft Support Policy, “the proposed solution must pass the full Validate test.”  So what happens if you need to test your storage, yet all of your disks are being used by running workloads?  Or how do you test that your Windows Server 2003 storage will work after a migration without actually impacting your production 2003 cluster?
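To make the safeguard easier to reason about, here is a minimal Python sketch of the decision described above.  It is a model of the behavior as this post describes it, not the wizard's actual implementation; the function name `validation_outcome` and its parameters are hypothetical.

```python
def validation_outcome(online_groups, storage_selected, accept_downtime=False):
    """Describe what Validate does for a given selection, per the post.

    online_groups    -- number of groups online when Validate starts
    storage_selected -- True if any storage tests remain selected
    accept_downtime  -- the user's answer if the confirmation prompt appears
    """
    if not storage_selected:
        # The whole storage category was excluded, so no prompt is shown.
        return "no prompt: storage tests excluded, no downtime"
    if online_groups == 0:
        # Nothing is online, so there is nothing to protect.
        return "no prompt: no online groups, all storage tests run"
    if accept_downtime:
        return "prompt: user chose to run all tests, downtime expected"
    return "prompt: disks of online groups skipped, no downtime"


if __name__ == "__main__":
    print(validation_outcome(online_groups=3, storage_selected=True))
```

Only the third branch causes downtime, and it requires an explicit confirmation from the user.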

This can be done by simply creating a new cluster disk from the same storage array, exposing it to all nodes and running all tests against just that disk.  This gives you the benefits of running Validate against that type of disk to ensure that it will work, while not risking any downtime to production workloads.  To do so, run Validate with all the tests selected and, when prompted, choose to keep any running services or applications online, so that only the new, unused disk goes through the disruptive storage tests.
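If you prefer to script this rather than use the wizard, the `Test-Cluster` PowerShell cmdlet can be pointed at a specific disk.  The Python sketch below only builds the command line as a string so you can review it before running it yourself.  It assumes a version of the cmdlet that supports the `-Disk` parameter, which may not be present on older releases (check `Get-Help Test-Cluster` on your nodes); the disk name "Cluster Disk 5" and the helper `build_validate_command` are hypothetical.

```python
def build_validate_command(disk_name, include=None):
    """Build a Test-Cluster command line that targets a single spare disk.

    disk_name -- name of the new, unused cluster disk (hypothetical here)
    include   -- optional list of test or category names to restrict the
                 run to; when omitted, all tests are run against the disk
    """
    if not disk_name:
        raise ValueError("a disk name is required")
    parts = ["Test-Cluster"]
    if include:
        quoted = ",".join('"{}"'.format(name) for name in include)
        parts.extend(["-Include", quoted])
    parts.extend(["-Disk", '"{}"'.format(disk_name)])
    return " ".join(parts)


if __name__ == "__main__":
    # Prints: Test-Cluster -Disk "Cluster Disk 5"
    print(build_validate_command("Cluster Disk 5"))
```

Because the spare disk is not assigned to any running workload, validating it this way should leave your production groups untouched.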

Now you can feel confident that your disks should work on your new cluster!  It is still important to note that the support policy technically requires the entire configuration to be tested (including all disks), so if you have a storage-related issue and have not run Validate against all disks, you may be asked to do so.  Nevertheless, this is still a great approach to Validating your cluster after a configuration change or migration with minimal impact.

Symon Perriman
Technical Evangelist
Private Cloud Technologies