Event 1006 FailoverClustering: Cluster Service was halted due to incomplete connectivity with the other nodes

Ran into this event on an Exchange DAG where the customer was reporting that Exchange seemed to be making arbitrary decisions to randomly failover from Node1 to Node2 of the mailbox cluster. Some investigation showed that the 1006 event always corresponded to a 4082 MSExchangeRepl event in the application log and the events seemed to happen consistently betwee 7:00 PM and 8:00 PM, but not every night and oddly at 15 minute intervals (7:16 one night, 7:31 a different night).

Most of the reseach indicated this was related to a network failure. We checked the hardware, software patches, drivers and everything seemed okay. Other events would also show up around the same time of day:

7024 Service Control Manager: The cluster service terminated with the service-specific error, "The membership engine requested shutdown of the cluster service on this node."

7031 Service Control Manager: "The cluster service terminated unexpectedly. It has done this 1 time(s). Correction action: restart service.

7036 Service Control Manager: The cluster service entered the running state.

 

Ultimately, this was traced to a backup utility running on the DAG member.