Troubleshooting Cluster Shared Volume Recovery Failure – System Event 5142

In the last post http://blogs.msdn.com/b/clustering/archive/2014/12/08/10579131.aspx we discussed event 5120, which indicates that Cluster Shared Volumes (CSV) observed and error and attempted to recover. In this post we will discuss cases when recovery does not succeed. When CSV recovery does not succeed, an Event 5142 is logged to the System event log.

Log Name:      System
Source:        Microsoft-Windows-FailoverClustering
Event ID:      5142
Task Category: Cluster Shared Volume
Level:         Error
Description:
Cluster Shared Volume 'Volume1' ('Cluster Disk 1') is no longer accessible from this cluster node because of error '(1460)'. Please troubleshoot this node's connectivity to the storage device and network connectivity.

In this post we will go over several possible root causes which may result in a 5142, and how to identify which issue are you hitting.

Cluster Service Failed

When the Cluster Service fails on a node, the Cluster Shared Volumes file system (CSVFS) will invalidate all file objects on all the volumes on that node. You may not see an event 5142 in this case because cluster may not have an opportunity to log it due to the service failed. You can find these cases by scanning Microsoft-Windows-FailoverClustering-CsvFs/Operational for the following sequence of events

 

The first 8960 event indicates that CSVFS is moving the volume to Init state and that DcmSequenceIs is empty, which means that this command did not come from the cluster service. CSVFS initiated this activity on its own

Log Name:      Microsoft-Windows-FailoverClustering-CsvFs/Operational
Source:        Microsoft-Windows-FailoverClustering-CsvFs-Diagnostic
Event ID:      8960
Task Category: Volume State Change Started
Level:         Information
Keywords:      Volume State
Description:
Volume {ca4ce06f-6b06-4405-b058-fd9d1cf869b3} transitioning from Init to Init.
Event Xml:
<Event xmlns=”http://schemas.microsoft.com/win/2004/08/events/event”>
  <EventData>
    <Data Name=”Volume”>0xffffe000badfb1b0</Data>
    <Data Name=”VolumeId”>{CA4CE06F-6B06-4405-B058-FD9D1CF869B3}</Data>
    <Data Name=”CurrentState”>0</Data>
    <Data Name=”NewState”>0</Data>
    <Data Name=”DcmSequenceId”>
    </Data>
  </EventData>
</Event>

Next 9216 event tells us that CSVFS successfully finished transition to Init state.

Log Name:      Microsoft-Windows-FailoverClustering-CsvFs/Operational
Source:        Microsoft-Windows-FailoverClustering-CsvFs-Diagnostic
Event ID:      9216
Task Category: Volume State Change Completed
Level:         Information
Keywords:      Volume State
Description:
Volume {ca4ce06f-6b06-4405-b058-fd9d1cf869b3} moved to state Init. Reason Transition to Init; Status 0x0.
Event Xml:
<Event xmlns=”http://schemas.microsoft.com/win/2004/08/events/event”>
  <EventData>
    <Data Name=”Volume”>0xffffe000badfb1b0</Data>
    <Data Name=”VolumeId”>{CA4CE06F-6B06-4405-B058-FD9D1CF869B3}</Data>
    <Data Name=”State”>0</Data>
    <Data Name=”Source”>8</Data>
    <Data Name=”Status”>0x0</Data>
    <Data Name=”DcmSequenceId”>
    </Data>
  </EventData>
</Event>

And finally an event 49152 is logged which provides details why this transitions was done, in this case because CSVFS observed that the Cluster Service is terminating.

Log Name:      Microsoft-Windows-FailoverClustering-CsvFs/Operational
Source:        Microsoft-Windows-FailoverClustering-CsvFs-Diagnostic
Event ID:      49152
Task Category: ClusterDisconnected
Level:         Information
Keywords:      ClusterServiceState
Description:
Cluster service disconnected.
Event Xml:
<Event xmlns=”http://schemas.microsoft.com/win/2004/08/events/event”>
  <EventData>
    <Data Name=”FileObject”>0xffffe000bab597c0</Data>
    <Data Name=”ProcessId”>0x1070</Data>
  </EventData>
</Event>

In the cluster logs every time cluster starts you will see a log statement which contains “————–+” so you can look for the last statement from the previous cluster instance to see what was the last thing cluster service have been doing right before terminating.

Here is what was logged in the cluster log when I terminated Cluster Service:
00001070.000015e0::2014/10/23-00:12:29.885 DBG   [API] s_ApiOpenKey: “ServerForNFS\ReadConfig” failed with error 2

And then the ClusSvc was started again:
00000f10.00000fa4::2014/10/23-00:13:03.287 INFO  —————————–+ LOG BEGIN +—————————–

In the event that that it is unknown why the cluster service terminated, you can start reading cluster logs from this point back trying to understand why cluster service went down.

When cluster fails on one of the nodes it will cause CSV volumes on this node to go down. CSV volumes will stay up on the other nodes. If node with failed clussvc was the coordinator then CSV on the other nodes will be paused until cluster fails over and onlines disk on a surviving node.

Disk Failure or Offline

When cluster exhausts all restart attempts to online a disk after too many failed attempts or when user manually offlines the disk, cluster will move CSV volumes corresponding to this disk to the Init state.

For instance if you offline a disk using Failover Cluster Manager then in the Microsoft-Windows-FailoverClustering-CsvFs/Operational channel we would see following events

  

 Log Name:      Microsoft-Windows-FailoverClustering-CsvFs/Operational
Source:        Microsoft-Windows-FailoverClustering-CsvFs-Diagnostic
Event ID:      8960
Task Category: Volume State Change Started
Level:         Information
Keywords:      Volume State
Description:
Volume {ca4ce06f-6b06-4405-b058-fd9d1cf869b3} transitioning from Active to Init.
Event Xml:
<Event xmlns=”http://schemas.microsoft.com/win/2004/08/events/event”>
  <EventData>
    <Data Name=”Volume”>0xffffe000885581b0</Data>
    <Data Name=”VolumeId”>{CA4CE06F-6B06-4405-B058-FD9D1CF869B3}</Data>
    <Data Name=”CurrentState”>4</Data>
    <Data Name=”NewState”>0</Data>
    <Data Name=”DcmSequenceId”>&lt;1:60129542151&gt;&lt;60129542147&gt;</Data>
  </EventData>
</Event>

Log Name:      Microsoft-Windows-FailoverClustering-CsvFs/Operational
Source:        Microsoft-Windows-FailoverClustering-CsvFs-Diagnostic
Event ID:      9216
Task Category: Volume State Change Completed
Level:         Information
Keywords:      Volume State
Description:
Volume {ca4ce06f-6b06-4405-b058-fd9d1cf869b3} moved to state Init. Reson Transition to Init; Status 0x0.
Event Xml:
<Event xmlns=”http://schemas.microsoft.com/win/2004/08/events/event”>
  <EventData>
    <Data Name=”Volume”>0xffffe000885581b0</Data>
    <Data Name=”VolumeId”>{CA4CE06F-6B06-4405-B058-FD9D1CF869B3}</Data>
    <Data Name=”State”>0</Data>
    <Data Name=”Source”>8</Data>
    <Data Name=”Status”>0x0</Data>
    <Data Name=”DcmSequenceId”>&lt;1:60129542151&gt;&lt;60129542147&gt;</Data>
  </EventData>
</Event>

DcmSequenceId is not empty, so this means that the command came to CSVFS from the cluster. Using DcmSequenceId <1:60129542151><60129542147> you can correlate this to the place in the cluster log where cluster service initiated that state transition
[Verbose] 000004dc.00001668::2014/10/23-00:57:00.587 INFO  [DCM] FilterAgent: ChangeCsvFsState: uniqueId ca4ce06f-6b06-4405-b058-fd9d1cf869b3, state CsvFsVolumeStateInit, sequence <1:60129542151><60129542147>

And working from this point backwards I see that there was a manual offline of the disk
[Verbose] 000004dc.000012dc::2014/10/23-00:57:00.527 INFO  [RCM] rcm::RcmApi::OfflineResource: (Cluster Disk 3, 0)

If disk is failing then you will find in the cluster logs records about that and the reason for the failure.

Volume Is Failing Too Often

If cluster observes that a CSV volume on one of the nodes is failing too often, and is unable to stay without interruption in a good state for 5 minutes without running into auto pause then the volume will be taken down on that node. On the other nodes the volume will remain active. After several minutes cluster will attempt to revive the volume back to active state. In this case in the cluster logs you would see statements similar to this:
00000ab8.0000109c::2014/10/21-02:36:23.388 INFO  [DCM] UnmapPolicy::enter_CountingToBad(aec3c2e8-a7eb-45e9-9509-f63190659ba4): goodTimer P0…75, badTimer R0…150, badCounter 1 state CountingToBad

00000ab8.0000109c::2014/10/21-02:36:23.544 INFO  [DCM] CsvFs Listener: state [volume aec3c2e8-a7eb-45e9-9509-f63190659ba4, sequence <><145>, state CsvFsVolumeStateChangeFromInit->CsvFsVolumeStateInit, status 0x0]

And in the Microsoft-Windows-FailoverClustering-CsvFs/Operational you will see correlating events 8960 and 9216 with matching DcmSequenceId – <><145>. Note that first part of the sequence Id is empty. This is because the action is not global for all nodes, but is only for the current node. In general sequence format is:

<Id of the cluster node that initiated this action:Sequence number of action on the node that initiated this action><Sequence number of the action on the node where the action was executed >

Recovery Is Taking Too Long

Disk Is Timing Out During Online

Once CSVFS completes a state transition it waits for the cluster to start the next one, but CSVFS will not wait indefinitely. Depending on the volume type and state CSVFS will wait from 1 to 10 minutes. For example with a snapshot volume CSVFS would wait for only 1 minute. For volume which has some dirty pages in the file system cache CSVFS would wait for 3 minutes. For a volume which has no dirty pages CSVFS would wait up to 10 minutes. If cluster does not start the next state transition in that time then CSVFS will move volume to Init state.

In the Microsoft-Windows-FailoverClustering-CsvFs/Operational you will see events 8960 and 9216 similar to the case when cluster service was terminated. DcmSequenceId will be empty because state change was initiated by CSVFS. You will NOT see event 49152 saying that cluster service has disconnected. In the cluster logs you will see a log record that volume went to Init state, and sequence number will be empty.
00000af0.00000f88::2014/10/17-06:37:58.325 INFO  [DCM] CsvFs Listener: state [volume aec3c2e8-a7eb-45e9-9509-f63190659ba4, sequence, state CsvFsVolumeStateChangeFromInit->CsvFsVolumeStateInit, status 0x0]

The next step is to find the node which owns the cluster Physical Disk resource at the moment of the failure and use cluster logs to try to identify why the disk online is taking so much time.

CSV State Transition Is Taking Too Long

This case is similar to the case above – CSV transitioned the volume to Init state because it timed out waiting for cluster to start next state transition. The reason why this happens may vary. In this case the disk may be staying online and healthy the whole time, but CSVFS on another node might take too long to finish its state transitions. The result will be similar, and events that you find in CSVFS logs and cluster logs will be similar

Summary

In this blog post we went over common reasons for the event 5142 when the Cluster Service fails to recover CSV, and you see your workloads failing. This blog post explained the sequences of events you will see in this case in the CSVFS operational channel, and how to correlate them with cluster logs.

Thanks!
Vladimir Petter
Principal Software Engineer
High-Availability & Storage
Microsoft

To learn more, here are others in the Cluster Shared Volume (CSV) blog series:

Cluster Shared Volume (CSV) Inside Out
http://blogs.msdn.com/b/clustering/archive/2013/12/02/10473247.aspx

Cluster Shared Volume Diagnostics
http://blogs.msdn.com/b/clustering/archive/2014/03/13/10507826.aspx

Cluster Shared Volume Performance Counters
http://blogs.msdn.com/b/clustering/archive/2014/06/05/10531462.aspx

Cluster Shared Volume Failure Handling
http://blogs.msdn.com/b/clustering/archive/2014/10/27/10567706.aspx

Troubleshooting Cluster Shared Volume Auto-Pauses – Event 5120
http://blogs.msdn.com/b/clustering/archive/2014/12/08/10579131.aspx