Windows Server Failover Clustering logs information about cluster activities in a text file called cluster.log. This includes normal operations, such as updates between nodes, as well as errors and warnings related to problems that occurred on the cluster. The information in the cluster.log is very valuable when trying to troubleshoot just about any problem encountered with a cluster. Having been in Microsoft’s support team that helps customers resolve Failover Cluster problems, I can tell you that having this detailed logging makes the difference between identifying the cause of a problem at the first occurrence and having to configure logging after the fact and wait for the problem to happen again.
As valuable as it is, there is room for improvement for the Windows Server 2003 cluster.log. The text-based file format means using a text editor/viewer to parse the log, which is not ideal when dealing with very large logs.
For those of you who don’t want all the gritty details, I’m going to first provide a quick outline of how to get the equivalent of the cluster.log from a Windows Server 2008 Failover Cluster, along with some background, and then get into some of the details.
CREATING THE CLUSTER.LOG:
From one of the nodes of the cluster, open a Command Prompt with Administrator rights. The simplest command to create the log is “cluster log /g”. A cluster.log file will be generated and stored in the %windir%\Cluster\Reports directory on each node of the cluster. Note that with all commands you can use either “cluster … ” or “cluster.exe …” as they have the same functionality.
Here are some cool new commands that can make this even easier:
· /Copy:<directory> (example: /Copy:logs) This switch takes the cluster.log that is generated on each node and copies it to a single directory. This makes it incredibly easy to get all the logs for analysis. One thing to note: the directory that you specify should be a subdirectory of the command prompt’s current directory. If you want to save the logs at c:\archive\logs, then you need to change the command prompt to c:\archive and then execute the “cluster log /g /copy:logs” command.
· /Span:<minutes> (example /Span:15). This specifies the number of minutes to go back in time for the log collection. For instance, you reproduce a problem and you then generate the cluster.log. If you don’t use this switch, you will get up to several days of history. Using this switch, you can limit the contents of the cluster.log to only include the last few minutes which you have specified. So, what if you specified 15 minutes but it was really 20 minutes before? No problem, generating the cluster.log does not remove any data from the servers. You just run the command again specifying additional minutes for this /span option.
· /Node:<node name> (example /Node:”node A”). This switch limits log generation to the specified node; the other nodes will not have a log generated. If this option is not specified, all nodes in the cluster will have a cluster.log generated. This is particularly useful if not all the cluster nodes are up, or some don’t have the cluster service started. Either situation can cause a long delay in the cluster log command, because it tries to issue the command to those missing servers and waits for a response that will never come.
· /Level:<0-5> (example /Level:4) This is another “cluster log” option that you may find in the help output for the command. The /Level switch changes the logging level being captured. For Windows Server 2008, the default level is 3, which is the equivalent of what is captured by cluster.log in previous versions of Windows Server. If you change this level to a higher number, more detailed information will be logged, but the .etl file that is capturing the tracing will fill faster and there can be a small impact on system performance. Setting this level lower than 3 means there is less tracing information, and it may not be useful if analysis of a problem is needed. For Windows Server 2008, 5 is the maximum effective level, although the command help notes that the level can be set between 0 and 10. Any setting over 5 is functionally equivalent to 5. The level range was set to 10 to allow for further options if needed in the future.
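For anyone who collects logs from a script rather than by hand, the switches above can be combined programmatically. The following is only a minimal Python sketch: the function names build_cluster_log_command and generate_cluster_log are hypothetical helpers invented for illustration, and the sketch assumes cluster.exe is on the PATH and accepts the switches exactly as described above. It has not been validated against a live cluster, in particular for node names that contain spaces.

```python
import subprocess
from typing import List, Optional


def build_cluster_log_command(
    copy_dir: Optional[str] = None,
    span_minutes: Optional[int] = None,
    node: Optional[str] = None,
) -> List[str]:
    """Return the argument list for 'cluster log /g' plus optional switches.

    Hypothetical helper: it only assembles the switches described in the post.
    """
    args = ["cluster", "log", "/g"]
    if copy_dir is not None:
        # /copy expects a subdirectory of the current working directory.
        args.append(f"/copy:{copy_dir}")
    if span_minutes is not None:
        if span_minutes <= 0:
            raise ValueError("span_minutes must be a positive number of minutes")
        args.append(f"/span:{span_minutes}")
    if node is not None:
        args.append(f"/node:{node}")
    return args


def generate_cluster_log(parent_dir: str, **options) -> int:
    """Run the command with parent_dir as the working directory.

    Using cwd=parent_dir mirrors the advice to change to c:\\archive before
    running 'cluster log /g /copy:logs'. Returns the process exit code.
    """
    completed = subprocess.run(build_cluster_log_command(**options), cwd=parent_dir)
    return completed.returncode
```

For example, generate_cluster_log(r"c:\archive", copy_dir="logs", span_minutes=15) would correspond to running “cluster log /g /copy:logs /span:15” from c:\archive. If the window turns out to be too short, calling it again with a larger span_minutes is safe, since generating the log does not remove any data.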
THE GORY DETAILS:
Windows Server 2008 introduced new eventing and diagnostic channels and Failover Cluster moved to using ETW (Event Tracing for Windows). You can see this new tracing exposed in the “Reliability and Performance Monitor” under “Data Collector Sets\Event Tracing Session\Failover Clustering” (shown below).
The logging is saved in files at %windir%\System32\winevt\logs\Clusterlog.etl
Each time the server is rebooted, a new log file is used, with a number appended as an extension of the log name, like ClusterLog.etl.001. Up to 5 log files are kept, so after 5 reboots the older log files will start to be removed. The default log file size is 100 MB (for each .etl file), which can be changed using the command “cluster log /size:<size in MB>” (example: cluster log /size:120). Although 100 MB may seem like a large log file, there is a significant amount of detail being saved for each entry due to this format change, and 100 MB provides a reasonable amount of history. To view the log file size setting, execute “cluster /prop” at a command prompt opened with Administrator privileges. That command will list the properties for the cluster, including the “ClusterLogSize” and “ClusterLogLevel” properties.
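If you want to check these two properties from a script, something like the Python sketch below could work. Treat it as an illustration only: the layout of the “cluster /prop” output shown in SAMPLE_OUTPUT is an assumption (property name followed by a decimal value), and parse_log_properties and max_trace_disk_mb are hypothetical helper names, not part of any Microsoft tool.

```python
from typing import Dict

# The two property names mentioned in the post.
WANTED = ("ClusterLogSize", "ClusterLogLevel")

# Assumed, simplified layout of "cluster /prop" output; the real output
# may have different columns, so adjust the parser to what you actually see.
SAMPLE_OUTPUT = """\
D  mycluster  ClusterLogLevel  3 (0x3)
D  mycluster  ClusterLogSize   100 (0x64)
"""


def parse_log_properties(prop_output: str) -> Dict[str, int]:
    """Pick out the wanted properties, taking the token right after the name."""
    found: Dict[str, int] = {}
    for line in prop_output.splitlines():
        tokens = line.split()
        for name in WANTED:
            if name in tokens:
                idx = tokens.index(name)
                if idx + 1 < len(tokens) and tokens[idx + 1].isdigit():
                    found[name] = int(tokens[idx + 1])
    return found


def max_trace_disk_mb(log_size_mb: int, files_kept: int = 5) -> int:
    """Upper bound on disk used by the .etl files (up to 5 files are kept)."""
    return log_size_mb * files_kept
```

With the default size of 100 MB and up to 5 files kept, the upper bound on disk space used by the trace files works out to 500 MB per node, which is worth keeping in mind before raising the size with /size.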
The .etl files themselves are not consumable by any viewer directly, but you can dump the contents into several different formats using tracerpt.exe (this TechNet article has the information on using tracerpt.exe: http://technet.microsoft.com/en-us/library/bb490959.aspx). You can dump the contents to EVTX and view in Event Viewer, or .XML and manipulate the information in many ways. For instance, you can apply a script that parses the file and provides formatting to a subset of the events. Here is an example of how the display can look:
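Separately from that formatted display, here is a rough Python sketch of the scripted approach described above: dump the .etl file to XML with tracerpt.exe, then pull a subset of fields out of each event. Several things here are assumptions. The helper names are hypothetical, the “-o” and “-of XML” options are used as documented for tracerpt.exe, and the XML in SAMPLE_XML is a simplified stand-in for the real output (events carrying a Level element and Data elements), so the parser matches on element names and ignores namespaces.

```python
import xml.etree.ElementTree as ET
from typing import Iterator, List, Tuple

# Simplified stand-in for tracerpt XML output (assumed layout, not real data).
SAMPLE_XML = """\
<Events>
  <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
    <System><Level>4</Level></System>
    <EventData><Data Name="msg">node joined</Data></EventData>
  </Event>
  <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
    <System><Level>2</Level></System>
    <EventData><Data Name="msg">resource failed</Data></EventData>
  </Event>
</Events>
"""


def build_tracerpt_command(etl_path: str, xml_path: str) -> List[str]:
    """Argument list that asks tracerpt.exe to dump an .etl file as XML."""
    return ["tracerpt", etl_path, "-o", xml_path, "-of", "XML"]


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_events(xml_text: str) -> Iterator[Tuple[str, str]]:
    """Yield (level, joined data text) for every Event element found."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        if local_name(elem.tag) != "Event":
            continue
        level = ""
        texts = []
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "Level":
                level = (child.text or "").strip()
            elif name == "Data" and child.text:
                texts.append(child.text.strip())
        yield level, " ".join(texts)
```

A script built on this could, for instance, keep only the events whose level marks an error and print them one per line, which is one way of applying formatting to a subset of the events.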
As you can see, the new logging mechanism provides more options and flexibility. I personally love the /copy and /span options, because they make targeting a specific time frame and getting the logs from all nodes much easier.
Senior Program Manager
Clustering & High Availability