Troubleshooting Hangs Using Live Dump

In this blog post (https://blogs.msdn.microsoft.com/clustering/2014/12/08/troubleshooting-cluster-shared-volume-auto-pauses-event-5120/) we discussed what a Cluster Shared Volumes (CSV) event ID 5120 means and how to troubleshoot it. In particular, we discussed auto-pause due to STATUS_IO_TIMEOUT (c00000b5) and some options for troubleshooting it. In this post we will discuss how to troubleshoot it using LiveDump, which lets you debug the system with no downtime.

First, let’s discuss what LiveDump is. Some of you are probably familiar with kernel crash dumps (https://support.microsoft.com/en-us/kb/927069). Kernel crash dumps pose at least two challenges:

  1. A bugcheck halts the system, resulting in downtime.
  2. The entire contents of memory are dumped to a file. On a system with a lot of memory, you might not have enough space on your system drive for the OS to save the dump.

The good news is that LiveDump solves both of these issues. LiveDump is a feature that was added in Windows Server 2012 R2. For the purpose of this discussion, you can think of LiveDump as an OS feature that creates a consistent snapshot of kernel memory and saves it to a dump file for later analysis. Taking this snapshot does NOT cause a bugcheck, so there is no downtime. LiveDump does not include all kernel memory; it excludes information that is not valuable for debugging, such as pages from the standby list and file caches. The kind of LiveDump that the cluster collects for you also omits pages consumed by the hypervisor, and in Windows Server 2016 the cluster also excludes the CSV cache. As a result, a LiveDump file is much smaller than the dump you would get by bugchecking the server, and it does not require as much space on your system drive. Windows Server 2016 also adds a new bugcheck option called “Active Dump”, which similarly excludes unnecessary information to create a smaller dump file during bugchecks.

You can create a LiveDump manually using LiveKd from Windows Sysinternals (https://technet.microsoft.com/en-us/sysinternals/bb897415.aspx). To generate a LiveDump, run the command “livekd -ml -o <path to a dump file>” from an elevated command prompt. The dump file does not have to be on the system drive; you can save it to any location. Here is an example of creating a live dump on a Windows 10 desktop with 12 GB of RAM, which resulted in a dump file of only about 2.8 GB (2,773,164,032 bytes).

D:\>livekd -ml -o d1.dmp
LiveKd v5.40 - Execute kd/windbg on a live system
Sysinternals - www.sysinternals.com

Copyright (C) 2000-2015 Mark Russinovich and Ken Johnson

Saving live dump to D:\d1.dmp... done.

D:\>dir *.dmp

Directory of D:\

02/25/2016 12:05 PM     2,773,164,032 d1.dmp
1 File(s) 2,773,164,032 bytes
0 Dir(s) 3,706,838,417,408 bytes free

If you are wondering how much disk space a LiveDump would need, you can generate one using LiveKd and check its size.

You might wonder what is so great about LiveDump for troubleshooting. Logs and traces work well when something fails, because hopefully the log contains a record where some component admits that it is failing operations and points at whatever caused the failure. LiveDump is great when we need to troubleshoot a problem where something is taking a long time and nothing is technically failing. If we start a watchdog when an operation starts, and the watchdog expires before the operation completes, we can take a dump of the system and hope to walk the wait chain for that operation to see who owns it and why it is not completing. Looking at a LiveDump is just like looking at a kernel dump: it requires some skill and an understanding of Windows internals. It has a steep learning curve for customers, but it is a great tool for Microsoft support and product teams who already have that expertise. If you reach out to Microsoft support with an issue where something is stuck in the kernel, and you provide a live dump taken while it was stuck, the chances of promptly finding the root cause are much higher.

Windows Server Failover Clustering has many watchdogs that control how long it waits for cluster resources to execute calls such as resource online or offline, or for CSVFS to complete a state transition. From experience we know that in most cases some of these scenarios get stuck in the kernel, so we automatically ask Windows Error Reporting (WER) to generate a LiveDump. It is important to note that LiveKd uses a different API, which produces a LiveDump without checking any other conditions. The cluster uses WER, and WER throttles LiveDump creation. We use WER because it manages disk space consumption for us and also sends telemetry about the incident to Microsoft, where we can see which issues are affecting customers. This helps us prioritize and strategize fixes. Starting with Windows Server 2016 you can control WER telemetry through the common telemetry settings; before that, a separate Control Panel applet controlled what WER was allowed to share with Microsoft.

By default, Windows Error Reporting allows only one LiveDump per report type per 7 days and only one LiveDump per machine per 5 days. You can change that by setting the following registry values:

reg add "HKLM\Software\Microsoft\Windows\Windows Error Reporting\FullLiveKernelReports" /v SystemThrottleThreshold /t REG_DWORD /d 0 /f
reg add "HKLM\Software\Microsoft\Windows\Windows Error Reporting\FullLiveKernelReports" /v ComponentThrottleThreshold /t REG_DWORD /d 0 /f

Once a LiveDump is created, WER launches a user-mode process that creates a minidump from the LiveDump and then immediately deletes the LiveDump. The minidump is only a couple hundred kilobytes, but unfortunately it is not helpful: it contains only the call stack of the thread that invoked LiveDump creation, and we need all the other kernel threads to track down where we are stuck. You can tell WER to keep the original LiveDumps using these two registry values:

reg add "HKLM\Software\Microsoft\Windows\Windows Error Reporting\FullLiveKernelReports" /v FullLiveReportsMax /t REG_DWORD /d 10 /f
reg add "HKLM\SYSTEM\CurrentControlSet\Control\CrashControl" /v AlwaysKeepMemoryDump /t REG_DWORD /d 1 /f

Set FullLiveReportsMax to the number of dumps you want to keep. How many to keep depends on how much free space you have and on the size of a LiveDump.
You need to reboot the machine for the Windows Error Reporting registry values to take effect.
LiveDumps created by Windows Error Reporting are located in %SystemDrive%\Windows\LiveKernelReports.
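
As a rough aid for choosing FullLiveReportsMax, the arithmetic can be sketched in Python. This is only an illustration: the function name, the idea of reserving part of the free space for other uses, and the 50 percent default are assumptions made for the example, not WER behavior.

```python
def max_dumps_to_keep(free_bytes: int, dump_bytes: int, reserve_percent: int = 50) -> int:
    """Estimate how many LiveDumps fit in the free space on a drive.

    reserve_percent is a hypothetical safety margin: the share of free
    space to leave untouched for everything else on the drive.
    """
    if dump_bytes <= 0:
        raise ValueError("dump_bytes must be positive")
    if not 0 <= reserve_percent < 100:
        raise ValueError("reserve_percent must be in the range 0..99")
    usable = free_bytes * (100 - reserve_percent) // 100
    return usable // dump_bytes


# Example: a 2,773,164,032-byte dump (the LiveKd sample earlier in this post)
# and an assumed 100 GiB of free space on the system drive.
estimate = max_dumps_to_keep(100 * 1024**3, 2_773_164_032)
print(estimate)
```

Measure the dump size on the actual server first (for example with LiveKd as shown earlier), since it varies with the workload.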

Windows Server version 1709

In the Windows Server version 1709 release, the Windows Error Reporting registry keys that control LiveDump behavior changed to the following:

reg add "HKLM\SYSTEM\CurrentControlSet\Control\CrashControl\FullLiveKernelReports" /v FullLiveReportsMax /t REG_DWORD /d 10 /f
reg add "HKLM\SYSTEM\CurrentControlSet\Control\CrashControl\FullLiveKernelReports" /v SystemThrottleThreshold /t REG_DWORD /d 0 /f
reg add "HKLM\SYSTEM\CurrentControlSet\Control\CrashControl\FullLiveKernelReports" /v ComponentThrottleThreshold /t REG_DWORD /d 0 /f
reg add "HKLM\SYSTEM\CurrentControlSet\Control\CrashControl" /v AlwaysKeepMemoryDump /t REG_DWORD /d 1 /f

To keep your scripts simple, you can set the values in both the old and the new locations.

Windows Server 2016

In Windows Server 2016, failover cluster LiveDump creation is on by default. You can turn it on or off by changing the lowest bit of the cluster DumpPolicy public property. By default this bit is set, which means the cluster is allowed to generate LiveDumps.

PS C:\Windows\system32> (get-cluster).DumpPolicy
1118489

If you set this bit to 0, the cluster will stop generating LiveDumps.

PS C:\Windows\system32> (get-cluster).DumpPolicy=1118488

You can set the bit back to 1 to enable LiveDump generation again:

PS C:\Windows\system32> (get-cluster).DumpPolicy=1118489

Changes take effect immediately on all cluster nodes. You do NOT need to reboot cluster nodes.
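
Because only the lowest bit controls LiveDump generation, it can be safer to derive the new DumpPolicy value from the current one than to type a literal number, so that all other bits stay as they were. The Python sketch below shows only the bit arithmetic. The helper names are illustrative and not part of any cluster API, and the sketch assumes, as described above, that bit 0 is the only bit involved.

```python
LIVE_DUMP_BIT = 0x1  # lowest bit of DumpPolicy, as described in the text


def live_dump_enabled(dump_policy: int) -> bool:
    """Return True if the LiveDump bit is set in a DumpPolicy value."""
    return bool(dump_policy & LIVE_DUMP_BIT)


def with_live_dump(dump_policy: int, enabled: bool) -> int:
    """Return dump_policy with only the LiveDump bit changed."""
    if enabled:
        return dump_policy | LIVE_DUMP_BIT
    return dump_policy & ~LIVE_DUMP_BIT


default_policy = 1118489  # default value shown above
disabled_policy = with_live_dump(default_policy, False)
print(live_dump_enabled(default_policy), disabled_policy)
```

The resulting integer is the value you would assign to (Get-Cluster).DumpPolicy, as in the commands above.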

Here is the list of LiveDump report types generated by the cluster. Dump file names have the report type string as a prefix.

Report Type  Description
-----------  -----------
CsvIoT       A CSV volume AutoPaused due to STATUS_IO_TIMEOUT and cluster on the coordinating node created LiveDump
CsvStateIT   CSV state transition to Init state is taking too long
CsvStatePT   CSV state transition to Paused state is taking too long
CsvStateDT   CSV state transition to Draining state is taking too long
CsvStateST   CSV state transition to SetDownLevel state is taking too long
CsvStateAT   CSV state transition to Active state is taking too long

You can learn more about CSV state transition in this blog post:

Following is the list of LiveDump report types that the cluster generates when a cluster resource call is taking too long:

Report Type  Description
-----------  -----------
ClusResCO    Cluster resource Open call is taking too long
ClusResCC    Cluster resource Close call is taking too long
ClusResCU    Cluster resource Online call is taking too long
ClusResCD    Cluster resource Offline call is taking too long
ClusResCK    Cluster resource Terminate call is taking too long
ClusResCA    Cluster resource Arbitrate call is taking too long
ClusResCR    Cluster resource Control call is taking too long
ClusResCT    Cluster resource Type Control call is taking too long
ClusResCI    Cluster resource IsAlive call is taking too long
ClusResCL    Cluster resource LooksAlive call is taking too long
ClusResCF    Cluster resource Fail call is taking too long

You can learn more about cluster resource state machine in these two blog posts:

You can control which resource types generate LiveDumps by changing the value of the lowest bit of the resource type DumpPolicy public property. Here are the default values:

C:\> Get-ClusterResourceType | ft Name,DumpPolicy

Name                                DumpPolicy
----                                ----------
Cloud Witness                       5225058576
DFS Replicated Folder               5225058576
DHCP Service                        5225058576
Disjoint IPv4 Address               5225058576
Disjoint IPv6 Address               5225058576
Distributed File System             5225058576
Distributed Network Name            5225058576
Distributed Transaction Coordinator 5225058576
File Server                         5225058576
File Share Witness                  5225058576
Generic Application                 5225058576
Generic Script                      5225058576
Generic Service                     5225058576
Health Service                      5225058576
IP Address                          5225058576
IPv6 Address                        5225058576
IPv6 Tunnel Address                 5225058576
iSCSI Target Server                 5225058576
Microsoft iSNS                      5225058576
MSMQ                                5225058576
MSMQTriggers                        5225058576
Nat                                 5225058576
Network File System                 5225058577
Network Name                        5225058576
Physical Disk                       5225058577
Provider Address                    5225058576
Scale Out File Server               5225058577
Storage Pool                        5225058577
Storage QoS Policy Manager          5225058577
Storage Replica                     5225058577
Task Scheduler                      5225058576
Virtual Machine                     5225058576
Virtual Machine Cluster WMI         5225058576
Virtual Machine Configuration       5225058576
Virtual Machine Replication Broker  5225058576
Virtual Machine Replication Coor... 5225058576
WINS Service                        5225058576

By default, Physical Disk resources produce LiveDumps. You can disable that by setting the lowest bit to 0. Here is an example of how to do that for the Physical Disk resource type:

(Get-ClusterResourceType -Name "Physical Disk").DumpPolicy=5225058576

Later on you can enable it again:

(Get-ClusterResourceType -Name "Physical Disk").DumpPolicy=5225058577

Changes take effect immediately on all new calls; there is no need to take the resource offline and online or to restart the cluster.
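
To see at a glance which resource types have the LiveDump bit set, you can filter the text output of the Get-ClusterResourceType command shown above. The Python sketch below is one hedged way to do that offline. The function name is made up for this example, the sample text is a shortened copy of the default listing, and the parser assumes the DumpPolicy value is always the last whitespace-separated token on a row.

```python
SAMPLE = """\
Name                                DumpPolicy
----                                ----------
Network Name                        5225058576
Physical Disk                       5225058577
Storage Pool                        5225058577
Virtual Machine                     5225058576
"""


def types_with_live_dump(table_text: str) -> list:
    """Return resource type names whose DumpPolicy has the lowest bit set."""
    enabled = []
    for line in table_text.splitlines():
        # Names may contain spaces, so split off only the last column.
        parts = line.rsplit(None, 1)
        if len(parts) != 2 or not parts[1].isdigit():
            continue  # skip blank lines, the header, and the separator row
        name, policy = parts[0].strip(), int(parts[1])
        if policy & 1:
            enabled.append(name)
    return enabled


print(types_with_live_dump(SAMPLE))
```

Run against the full default listing above, this approach would report the storage-related types whose value ends in 7.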

The last group is the report types that the cluster service generates when it observes that some operations are taking too long.

Report Type   Description
-----------   -----------
ClusWatchDog  Cluster service watchdog
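
Since dump file names carry the report type as a prefix, a small lookup can tell you what kind of hang a given dump came from. The Python sketch below encodes the report types listed in this post. The function name and the example file names are hypothetical; the text states only that the report type is a prefix, so nothing else about the file name format is assumed.

```python
REPORT_TYPES = {
    "CsvIoT": "CSV volume AutoPaused due to STATUS_IO_TIMEOUT",
    "CsvStateIT": "CSV state transition to Init is taking too long",
    "CsvStatePT": "CSV state transition to Paused is taking too long",
    "CsvStateDT": "CSV state transition to Draining is taking too long",
    "CsvStateST": "CSV state transition to SetDownLevel is taking too long",
    "CsvStateAT": "CSV state transition to Active is taking too long",
    "ClusResCO": "Cluster resource Open call is taking too long",
    "ClusResCC": "Cluster resource Close call is taking too long",
    "ClusResCU": "Cluster resource Online call is taking too long",
    "ClusResCD": "Cluster resource Offline call is taking too long",
    "ClusResCK": "Cluster resource Terminate call is taking too long",
    "ClusResCA": "Cluster resource Arbitrate call is taking too long",
    "ClusResCR": "Cluster resource Control call is taking too long",
    "ClusResCT": "Cluster resource Type Control call is taking too long",
    "ClusResCI": "Cluster resource IsAlive call is taking too long",
    "ClusResCL": "Cluster resource LooksAlive call is taking too long",
    "ClusResCF": "Cluster resource Fail call is taking too long",
    "ClusWatchDog": "Cluster service watchdog",
}


def describe_dump(file_name: str):
    """Return (report_type, description) for a dump file name, or None."""
    lowered = file_name.casefold()  # Windows file names are case-insensitive
    # Try longer prefixes first so a short prefix never shadows a longer one.
    for prefix in sorted(REPORT_TYPES, key=len, reverse=True):
        if lowered.startswith(prefix.casefold()):
            return prefix, REPORT_TYPES[prefix]
    return None


# Hypothetical file names, used only to exercise the lookup.
print(describe_dump("CsvIoT-example.dmp"))
print(describe_dump("ClusResCU-example.dmp"))
```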

Windows Server 2012 R2

We had such a positive experience troubleshooting issues using LiveDump on Windows Server 2016 that we backported a subset of this functionality to Windows Server 2012 R2. You need to make sure that you have all the recommended patches outlined here. On Windows Server 2012 R2, LiveDump is not generated by default; it can be enabled using the following PowerShell command:

Get-Cluster | Set-ClusterParameter -create LiveDumpEnabled -value 1

LiveDump can be disabled using the following command:

Get-Cluster | Set-ClusterParameter -create LiveDumpEnabled -value 0

Only the CSV report types were backported; as a result, you will not see LiveDumps from cluster resource calls or the cluster service watchdog. Windows Error Reporting throttling will also need to be adjusted as discussed above.

CSV AutoPause due to STATUS_IO_TIMEOUT (c00000b5)

Let’s see how LiveDump helps troubleshoot this issue. In the blog post https://blogs.msdn.microsoft.com/clustering/2014/12/08/troubleshooting-cluster-shared-volume-auto-pauses-event-5120/ we discussed that it is usually caused by an IO on the coordinating node taking a long time. As a result, CSVFS on a non-coordinating node gets the error STATUS_IO_TIMEOUT and notifies the cluster service about the event. The cluster service then creates a LiveDump with report type CsvIoT on the coordinating node where the IO is taking time. If we are lucky and the IO has not completed before the LiveDump is generated, we can load the dump in WinDbg, try to find the IO that is taking a long time, and see who owns it.

Thanks!
Vladimir Petter
Principal Software Engineer
High-Availability & Storage
Microsoft

 

Additional Resources:

To learn more, here are others in the Cluster Shared Volume (CSV) blog series:

Cluster Shared Volume (CSV) Inside Out
http://blogs.msdn.com/b/clustering/archive/2013/12/02/10473247.aspx

Cluster Shared Volume Diagnostics
http://blogs.msdn.com/b/clustering/archive/2014/03/13/10507826.aspx

Cluster Shared Volume Performance Counters
http://blogs.msdn.com/b/clustering/archive/2014/06/05/10531462.aspx

Cluster Shared Volume Failure Handling
http://blogs.msdn.com/b/clustering/archive/2014/10/27/10567706.aspx

Troubleshooting Cluster Shared Volume Auto-Pauses – Event 5120
http://blogs.msdn.com/b/clustering/archive/2014/12/08/10579131.aspx

Troubleshooting Cluster Shared Volume Recovery Failure – System Event 5142
http://blogs.msdn.com/b/clustering/archive/2015/03/26/10603160.aspx

Cluster Shared Volume – A Systematic Approach to Finding Bottlenecks
https://blogs.msdn.microsoft.com/clustering/2015/07/29/cluster-shared-volume-a-systematic-approach-to-finding-bottlenecks/