Hi cluster fans,
Many of you have asked about starting up an offline cluster in Windows Server 2008. What is the correct process? Can I start all nodes at once? Do I need to stagger boot times by 30-60 seconds, as was recommended in Windows Server 2003? The short answer is that you can start all your nodes simultaneously without any issues, but let’s look at this in some more detail…
In Windows Server 2003, a node starting the cluster service would first try to join the cluster by looking for a sponsor node. If no sponsor node was found after 30 seconds, the node would try to form the cluster by arbitrating for the quorum resource. Once a node determined that it was going to form the cluster, it would no longer try to join with other nodes – either it would succeed in quorum arbitration and form, or it would fail and restart the cluster service (ClusSvc) 60 seconds later. Thus, if you booted all nodes at the same time, there was a reasonable probability that they would all start looking for a sponsor at the same time. All nodes would wait for 30 seconds, and then they would all arbitrate for the quorum resource at the same time. Only one would win, and the others would need to back off and start ClusSvc again later. Sixty seconds later, the nodes that lost quorum arbitration would all start looking for a sponsor again at the same time, and they would all find the one node that had formed the cluster.
However, Windows Server 2003 also had the problem that join was serialized – only one node could join at a time. Especially in large clusters, the nodes that had to wait to join might time out and terminate ClusSvc, which would then start again 120 seconds later for another join attempt. You can see that giving one node a head start allows it to search for sponsors and then arbitrate for the quorum resource with no interference. After that, if the remaining nodes are staggered by a few seconds, then there is no contention to join, and each node will find a sponsor and join as it starts ClusSvc.
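To make the Windows Server 2003 timing easier to picture, here is a small toy model in Python. It is only a sketch of the behavior described above, not the actual ClusSvc logic: the function name simulate_2003_startup and both constants are made up for illustration, arbitration is treated as instantaneous, and the serialized join and its 120-second retry are deliberately left out.

```python
# Toy model of Windows Server 2003 cluster startup timing (illustrative only).
# Assumptions: arbitration is instantaneous, ties are won by the lowest node
# index, and join serialization / join timeouts are ignored.

SPONSOR_SEARCH_S = 30   # time a node looks for a sponsor before arbitrating
RESTART_DELAY_S = 60    # delay before ClusSvc restarts after losing arbitration


def simulate_2003_startup(boot_times):
    """Return the time (in seconds) at which each node is in the cluster."""
    start = list(boot_times)          # current ClusSvc start time per node
    online = [None] * len(start)      # time each node formed or joined
    formed_at = None                  # time the cluster was formed

    while any(t is None for t in online):
        pending = [i for i, t in enumerate(online) if t is None]
        if formed_at is None:
            # The first node to finish its sponsor search wins arbitration.
            winner = min(pending, key=lambda i: (start[i] + SPONSOR_SEARCH_S, i))
            formed_at = start[winner] + SPONSOR_SEARCH_S
            online[winner] = formed_at
            # Nodes that arbitrated at the same moment lose and restart later.
            for i in pending:
                if i != winner and start[i] + SPONSOR_SEARCH_S == formed_at:
                    start[i] = formed_at + RESTART_DELAY_S
        else:
            # Everyone else finds the forming node as a sponsor and joins.
            for i in pending:
                online[i] = max(start[i], formed_at)
    return online


if __name__ == "__main__":
    print(simulate_2003_startup([0, 0, 0]))    # all nodes booted together
    print(simulate_2003_startup([0, 45, 50]))  # one node given a head start
```

In this model, booting three nodes together gives [30, 90, 90]: one node forms at 30 seconds and the two losers only get in after their restart. Giving the first node a head start gives [30, 45, 50], so the later nodes join as soon as they start.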
In Windows Server 2008, there are three major changes that affect this behavior. First, a node no longer limits its search for sponsors to 30 seconds. In fact, a node will continually try to establish connections to missing nodes. Because a node continually tries to contact missing nodes, there is no need to wait a full 30 seconds prior to quorum arbitration. A node starting ClusSvc that finds itself one vote below quorum will arbitrate for the quorum resource after only a few seconds. Second, a majority of nodes can form the cluster without arbitrating for the quorum resource. This means the cluster will actually form faster if all nodes start at the same time – there will be no need to arbitrate for the quorum resource, which takes an extra few seconds. Finally, the join operation is much less serialized, so multiple nodes can join the cluster practically simultaneously.
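The vote counting behind the first two changes can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the real ClusSvc algorithm: it assumes one vote per node plus one vote for the quorum resource, and the names votes_needed and startup_action are hypothetical.

```python
# Sketch of the Windows Server 2008 startup decision (illustrative only).
# Assumptions: each node has one vote, the quorum resource has one vote,
# and quorum is a simple majority of all votes.


def votes_needed(total_nodes, has_quorum_resource=True):
    """Smallest number of votes that is a majority of all votes."""
    total_votes = total_nodes + (1 if has_quorum_resource else 0)
    return total_votes // 2 + 1


def startup_action(total_nodes, connected_nodes, has_quorum_resource=True):
    """What a starting node does, given how many nodes are connected.

    connected_nodes includes the node itself.
    """
    needed = votes_needed(total_nodes, has_quorum_resource)
    if connected_nodes >= needed:
        # A majority is already present: form without arbitrating.
        return "form"
    if has_quorum_resource and connected_nodes == needed - 1:
        # One vote short: the quorum resource could supply the missing vote.
        return "arbitrate"
    # Too few votes either way: keep trying to reach the missing nodes.
    return "keep connecting"


if __name__ == "__main__":
    for connected in range(1, 5):
        print(connected, startup_action(4, connected))
```

For a four-node cluster with a quorum resource there are five votes, so three are needed. Three or four connected nodes form immediately, two connected nodes arbitrate for the quorum resource, and a single node keeps trying to reach the others. That is why starting every node at once is the fastest path: the majority shows up right away and arbitration is skipped.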
As an example, the Windows Server 2008 ‘Create a Cluster’ operation (either through the Failover Cluster Manager, ClusAPI, or cluster.exe) starts ClusSvc on all nodes at the same time, and this works just fine.
Once again, this is some great innovation from the Clustering and High Availability team to make clustering even easier.
Principal Development Lead
Clustering & High Availability