Cluster Log Enhancements
This is the first in a series of Blogs that will provide details about the improvements we have made in the tools and methods for troubleshooting Failover Clusters with Windows Server 2016.
Failover Cluster has diagnostic logs running on each server that allow in-depth troubleshooting of problems without having to reproduce the issue. This log is valuable for Microsoft’s support as well as those out there who have expertise at troubleshooting failover clusters.
Tip: Always go to the System event log first, when troubleshooting an issue. Failover cluster posts events in the System event log that are often enough to understand the nature and scope of the problem. It also gives you the specific date/time of the problem, which is useful if you do look at other event logs or dig into the cluster.log if needed.
Generating the Cluster.log
This is not new, but will be helpful information for those that aren’t familiar with generating the cluster log.
Get-ClusterLog is the Windows PowerShell cmdlet that will generate the cluster.log on each server that is a member of the cluster and is currently running. The output looks like this on a 3 node cluster:
The Cluster.log files can be found in the <systemroot>\cluster\reports directory (usually c:\windows\cluster\Reports) on each node.
You can use the –Destination parameter to cause the files to be copied to a specified directory with the Server’s name appended to the log name, which makes it much easier to get and analyze logs from multiple servers:
Other useful parameters are discussed in the rest of this blog.
I’m going to highlight the enhancements to the information in the Windows Server 2016 cluster.log that will be the most interesting and useful to the general audience interested in troubleshooting failover clusters, and leave detailing every item in the log to a future blog(s). I’m including references and links to resources related to troubleshooting clusters and using the cluster log at the end of this blog.
The cluster.log is a dump of the information from the system and captured in a text file. The time stamps default to UTC (which some people call GMT). Therefore if you are in a time zone that is UTC+8 you need to look at the time stamp in the cluster log and add 8 hours. For instance, if you are in that time zone and a problem occurred at 1:38pm (13:38), UTC time stamp in the cluster log would be (21:38).
We offer 2 enhancements in the cluster.log that makes this time zone and UTC offset easier to discover and work with:
- UTC offset of the server: The Top of the cluster.log notes the UTC offset of the originating server. In the example below, it notes that the server is set to UTC + a 7 hour offset (420 minutes). Specifically noting this offset in the log removes the guesswork related to the system’s time zone setting.
- Cluster log uses UTC or local time. The top of the cluster.log notes whether the log was created using UTC or local time for the timestamps. The –UseLocalTime parameter for Get-ClusterLog causes the cluster.log to write timestamps that are already adjusted for the server’s time zone instead of using UTC. This is not new, but it became obvious that it’s helpful to know if that parameter was used or not, so it’s noted in the log.
UTC= localtime + time zone offset; with daylight savings, the time zone offset of this machine is 420 minutes, or 7 hours
The logs were generated using Coordinated Universal Time (UTC). ‘Get-ClusterLog -UseLocalTime’ will generate in local time.
Tip:The sections of the cluster.log are encased in [=== ===], which makes it easy to navigate down the log to each section by doing a find on “[===”. As a bit of trivia, this format was chosen because it kind of looks like a Tie Fighter and we thought it looked cool.
The cluster has objects that are part of its configuration. Getting the details of these objects can be useful in diagnosing problems. These objects include resources, groups, resource types, nodes, networks, network interfaces, and volumes. The cluster.log now dumps these objects in a Comma Separated Values list with headers.
Here is an example:
[===Networks ===] Name,Id,description,role,transport,ignore,AssociatedInterfaces,PrefixList,address,addressMask,ipV6Address,state,linkSpeed,rdmaCapable,rssCapable,autoMetric,metric, Cluster Network 1,27f2d19b-7e23-4ee3-a226-287d4ebe9113,,1,TCP/IP,false,82e5107c-5375-473a-ab9f-5b6450bf5c7f30ff5ff6-00a3-494b-84b6-62a27ef99bb3 187c582d-f23c-48f4-8c37-6a452b2a238b,10.10.1.0/24 ,10.10.1.0,255.255.255.0,,3,1000000000,false,false,true,39984, Cluster Network 2,e6efd1f6-474b-410a-bd7b-5ece99476cd8,,1,TCP/IP,false,57d9b74d-8d9e-4afe-8667-e91e0bd23412617bb075-3803-4e5e-a039-db513d60603d 51c4fd42-9cb4-4f2e-a65c-01fea9bfa582,10.10.3.0/24 ,10.10.3.0,255.255.255.0,,3,1000000000,false,false,true,39985, Cluster Network 3,1a5029c7-7961-40bb-b6b9-dcbbe4187034,,3,TCP/IP,false,d3cdef35-82bc-4a60-8ed4-5c2b278f7c0e83c7c4b8-b588-425c-bfae-0c69d7a45bcd c1fb12d2-071b-4cb2-8ca7-fa04e972cd1c,126.96.36.199/22 2001:4898:28:4::/64,188.8.131.52,255.255.252.0,2001:4898:28:4::,3,100000000,false,false,true,80000,
These sections can be consumed by any application that can parse CSV text. Or, you can copy/paste into an Excel spreadsheet, which makes it easier to read as well as provides filter/sort/search. For the example below, I pasted the above section into a spreadsheet and then used the “Text to Columns” action in the “DATA” tab of Microsoft’s Excel.
New Verbose Log
New for Windows Server 2016 is the DiagnosticVerbose event channel. This is a new channel that is in addition to the Diagnostic channel for FailoverClustering.
In most cases the diagnostic channel, with the default log level set to the default of 3, gets enough information that an expert troubleshooter or Microsoft’s support engineers can understand a problem. However, there are occasions where we need more verbose logging and it’s necessary to set the cluster log level to 5, causing the diagnostic channel to start adding the verbose level of events to the log. After changing the log level you have to reproduce the problem and analyze the logs again.
The question arises, why don’t we suggest keeping the log level at 5? The answer is that it causes the logs to have more events and therefore wrap faster. Being able to go back for hours or days in the logs is also desirable so the quicker wrapping poses its own troubleshooting problem.
To accommodate wanting verbose logging for the most recent time frame, and having logging that provides adequate history, we implemented a parallel diagnostic channel we call DiagnoticVerbose. The DiagnosticVerbose log is always set for the equivalent of the cluster log level 5 (verbose) and runs in parallel to the Diagnostic channel for FailoverClustering.
You can find the DiagnosticVerbose section in the cluster.log by doing a find on “DiagnosticeVerbose”. It will go the section header:
[=== Microsoft-Windows-FailoverClustering/DiagnosticVerbose ===] [Verbose] 00000244.00001644::2015/04/22-01:04:29.623 DBG [RCM] rcm::PreemptionTracker::GetPreemptedGroups() [Verbose] 00000244.00001644::2015/04/22-01:04:29.623 DBG [RCM] got asked for preempted groups, returning 0 records
The Diagnostic channel (default log level of 3) can be found by doing a find on “Cluster Logs”:
[=== Cluster Logs ===] 00000e68.00000cfc::2015/03/23-22:12:24.682 DBG [NETFTAPI] received NsiInitialNotification 00000e68.00000cfc::2015/03/23-22:12:24.684 DBG [NETFTAPI] received NsiInitialNotification
Events From Other Channels
There is a “Tip” above that notes the recommendation to start in the system event log first. However, it’s not uncommon for someone to generate the cluster logs and send them to their internal 3rd tier support or to other experts. Going back and getting the system or other event logs that may be useful in diagnosing the problem can take time, and sometimes the logs have already wrapped or have been cleared.
New in Windows Server 2016 cluster log, the following event channels will also be dumped into the cluster.log for each node. Since they are all in one file, you no longer need to go to the nodes and pull each log individually.
[=== System ===] [=== Microsoft-Windows-FailoverClustering/Operational logs ===] [=== Microsoft-Windows-ClusterAwareUpdating-Management/Admin logs ===] [=== Microsoft-Windows-ClusterAwareUpdating/Admin logs ===]
Here is an example:
[=== System ===] [System] 00000244.00001b3c::2015/03/24-19:46:34.671 ERR Cluster resource 'Virtual Machine <name>' of type 'Virtual Machine' in clustered role '<name>' failed.
Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.
[System] 00000244.000016dc::2015/04/14-23:43:09.458 INFO The Cluster service has changed the password of account 'CLIUSR' on node '<node name>'.
Tip: If the size of the cluster.log file is bigger than you desire, the –TimeSpan switch for Get-ClusterLog will limit how far back (in minutes) it will go back in time for the events. For instance, Get-Clusterlog –TimeSpan 10 will cause the cluster.log on each node to be created and only include events from the last 10 minutes. That includes the Diagnostic, DiagnosticVerbose, and other channels that are included in the report.
Troubleshooting Windows Server 2012 Failover Clusters, How to get to the root of the problem: http://windowsitpro.com/windows-server-2012/troubleshooting-windows-server-2012-failover-clusters