Taking a Fresh Look at Hyper-V Clusters (Part 1)

Introduction

Quite some time ago, I wrote an article series describing the process of setting up Hyper-V to use Windows failover clustering. As you can imagine, a lot has changed since then. In fact, Microsoft has made a number of improvements to both Hyper-V and failover clustering in Windows Server 2012 R2. That being the case, I decided it was time to take a fresh look at Hyper-V clusters.

Quorum Improvements

As previously mentioned, Microsoft has made a tremendous number of improvements to failover clustering in Windows Server 2012 R2. There are so many improvements in fact, that I could easily write an entire article series based solely around the new features. However, I want to start off by talking about some of the improvements that Microsoft has made with regard to the quorum model.

Windows Server clusters have long used a majority node set model. This model is based around the idea that each cluster node gets a vote and that a majority of the cluster nodes must cast a vote in order for the cluster to retain quorum. Microsoft defines a node majority as half of the total votes, rounded down, plus one.

The problem with the majority node set model is that it can lead to a cluster losing quorum even while a number of nodes are still online. Imagine, for instance, a seven-node cluster. Using the majority node set model, four of the seven nodes would have to cast votes in order for the cluster to retain quorum. The problem is that if four nodes were to fail, the cluster would lose quorum even though three nodes were still perfectly functional.
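The arithmetic is simple enough to sketch in a few lines of Python (the function name here is mine, purely for illustration, and is not part of any Windows API):

```python
def votes_needed(total_votes: int) -> int:
    """Classic node-majority quorum: half the total votes,
    rounded down, plus one."""
    return total_votes // 2 + 1

# In a seven-node cluster, four votes are needed; if four nodes
# fail, the three survivors cannot retain quorum on their own.
print(votes_needed(7))  # 4
```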

In Windows Server 2012, Microsoft introduced a new feature called dynamic quorum. The basic idea behind dynamic quorum is that each node's state is taken into account with regard to its quorum vote. If a node shuts down (or crashes), it loses its vote. If the node comes back online, it regains its quorum vote. The advantage of this model is that the cluster is able to dynamically adjust the number of votes that are required for the cluster to retain quorum.
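The difference between the static and dynamic models can be illustrated with a small Python sketch (the sets and function name are illustrative only, not a real cluster API). Under the static model, every configured node always holds a vote; under dynamic quorum, a node that shuts down cleanly surrenders its vote first, shrinking the majority threshold for the nodes that remain:

```python
def has_quorum(voting_nodes: set, reachable_nodes: set) -> bool:
    """Quorum is held when a majority of the *currently voting*
    nodes can still communicate with one another."""
    needed = len(voting_nodes) // 2 + 1
    return len(voting_nodes & reachable_nodes) >= needed

seven = {f"node{i}" for i in range(1, 8)}

# Static model: all seven nodes always vote, so if four crash at
# once, the three survivors cannot reach the required four votes.
print(has_quorum(seven, {"node1", "node2", "node3"}))   # False

# Dynamic quorum: nodes shut down one at a time each give up their
# vote first, so three voters with three reachable nodes still works.
print(has_quorum({"node1", "node2", "node3"},
                 {"node1", "node2", "node3"}))          # True
```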

Although the dynamic quorum feature was a very welcome improvement, Microsoft made some additional improvements in Windows Server 2012 R2. One such improvement was the introduction of the dynamic witness. The basic idea here is that if the Windows Server 2012 R2 failover cluster is configured to use dynamic quorum (which it is by default) then the witness vote also becomes dynamic. There are two implications to having a dynamic witness.

First of all, the witness only gets a vote in the event that there are an even number of cluster nodes. If an even number of nodes exists then the witness server can act as a tie breaker. Otherwise, the witness server does not cast a vote.

The other implication is that the witness server only gets a vote if it is healthy. If the witness server fails or is taken offline then the cluster will automatically set the witness vote value to zero.
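Those two rules can be expressed as a tiny sketch (again, purely illustrative Python rather than anything exposed by the cluster service):

```python
def witness_vote(node_votes: int, witness_healthy: bool) -> int:
    """Dynamic witness sketch: the witness casts a vote only when it
    is healthy AND the node vote count is even, so that its vote
    breaks a tie rather than creating one."""
    return 1 if witness_healthy and node_votes % 2 == 0 else 0

print(witness_vote(8, True))    # even node count, healthy witness -> 1
print(witness_vote(7, True))    # odd node count -> 0
print(witness_vote(8, False))   # witness failed or offline -> 0
```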

Given the changes that Microsoft has made to the quorum model, it is easy to wonder if a witness server is really even necessary. However, Microsoft recommends that every Windows Server 2012 R2 cluster be provided with a quorum witness.

As you probably already know, a witness server’s job is to act as a tie breaker in situations in which there is a 50 / 50 node split within the failover cluster. However, Microsoft has implemented an automatic tie breaker function that can dynamically adjust the vote count so that there is always an odd number of votes (in an effort to avoid the dreaded 50 / 50 node split).

The automatic tie breaker functionality is designed to work in situations in which the witness server is unavailable either due to a failure or due to being taken offline. If in this type of situation an even number of nodes are running then Windows will remove the quorum vote from one of the cluster nodes in order to force an odd number of votes.
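A rough sketch of that adjustment looks like the following (the names are mine, and the way I pick which node loses its vote is deliberately simplified; as discussed later, the real cluster lets you influence that choice):

```python
def apply_tie_breaker(node_votes: dict, witness_available: bool) -> dict:
    """If no witness is available and an even number of nodes hold a
    vote, strip one node's vote so the total is always odd."""
    votes = dict(node_votes)
    voters = sorted(n for n, v in votes.items() if v == 1)
    if not witness_available and voters and len(voters) % 2 == 0:
        votes[voters[0]] = 0   # simplified: real selection is configurable
    return votes

cluster = {f"node{i}": 1 for i in range(1, 9)}   # eight voting nodes
adjusted = apply_tie_breaker(cluster, witness_available=False)
print(sum(adjusted.values()))                    # 7
```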

To help to illustrate the quorum model’s functionality, imagine for a moment that you have an eight node cluster. Now let’s suppose that the cluster spans two physical datacenters and that you have a witness server that is located at a third location. Let’s also assume that each of the two datacenters contains four cluster nodes.

The reason the witness server is kept at a third site, separate from the other two, is so that it can serve as an independent tie breaker. Imagine that the WAN link between the two sites were to fail. Each site would interpret the link failure as a failure of the four nodes at the remote site. The witness server would allow one of the two sites to retain quorum, depending on which site could still communicate with the witness.

So with this example in mind, let’s examine how the situation would be different using a dynamic witness and the automatic tie breaker functionality. For the purposes of this example, we will stick with the model in which four cluster nodes exist in each of two datacenters and a witness server exists at a separate location.

With this in mind, let’s pretend that the witness server was taken down for maintenance. In this type of situation, the cluster would still maintain quorum because there are plenty of functional nodes. However, since Windows Server 2012 R2 is using a dynamic witness in this situation, the server operating system removes the witness server’s vote. This means that the entire cluster now has eight quorum votes (one for each cluster node).

The problem with having an even number of votes of course is that it becomes possible for the cluster to experience a 50 / 50 split if the WAN link between the two datacenters were to fail. Microsoft sometimes refers to this condition as split brain syndrome.

In an effort to eliminate the possibility of a 50 / 50 split, Windows removes the quorum vote from a random cluster node. This means that even though eight cluster nodes are functioning, one of the sites has four votes while the other only has three. This means that if the WAN link were to fail then the site with the four votes would retain quorum and continue to function.
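The arithmetic for this scenario works out as follows (a quick sanity check in Python):

```python
# Eight functioning nodes, witness offline, tie breaker applied:
# one site keeps four votes, the other is left with three.
site_a, site_b = 4, 3
total_votes = site_a + site_b        # 7 votes in the cluster
needed = total_votes // 2 + 1        # majority = 4

# If the WAN link fails, only the four-vote site holds a majority.
print(site_a >= needed)   # True  -> this site retains quorum
print(site_b >= needed)   # False -> this site goes offline
```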

Of course this raises two questions. First, what happens if the witness server comes back online? Second, can you control which node loses a vote?

If the witness server comes back online, it regains its vote. Windows senses that the witness server has returned and restores the vote to the cluster node from which it was removed. Incidentally, there is a way to pick which node loses its vote in this type of situation (in Windows Server 2012 R2, the cluster's LowerQuorumPriorityNodeID property controls this). The advantage of doing so is that you can force your primary datacenter to retain quorum during a link failure.

Conclusion

As you can see, a lot has changed with regard to the failover clustering quorum model. In Part 2 of this series, I will discuss some more changes that Microsoft has made.

