Clusters & Quorum

I spent yesterday refreshing my knowledge of clusters and quorum. Hadn’t worked on these since Windows Server 2003! So here’s a brief intro to this topic:

There are two types of clusters – Network Load Balancing (NLB) and Server Clusters.

Network Load Balancing is “share all” in that every server has a full copy of the app (and the data too, if that’s feasible). Each server is active, and requests are sent to them randomly (or using some load-distributing algorithm). You can easily add/remove servers. Examples where you’d use NLB are SMTP servers, web servers, etc. Each server in this case is independent of the others as long as they are all configured identically.
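
To make the idea concrete, here’s a rough Python sketch (the server names are made up) of two simple ways requests could be spread across identical NLB nodes:

    import itertools
    import random

    # Hypothetical, identically configured nodes behind the NLB virtual IP
    servers = ["web1", "web2", "web3"]

    def pick_random():
        # Simplest distribution: any request can go to any server
        return random.choice(servers)

    round_robin = itertools.cycle(servers)

    def pick_round_robin():
        # A simple load-distributing algorithm: rotate through the servers
        return next(round_robin)

    for _ in range(6):
        print(pick_random(), pick_round_robin())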

Server Clusters are “share nothing” in that only one server has a full copy of the app and is active. The other servers are in standby mode, waiting to take over if the active one fails. Shared storage is used, which is how the standby servers can take over when the active server fails.

The way clusters work is that clients see one “virtual” server (not to be confused with the virtual servers of virtualization). Behind this virtual server are the physical servers (actually called “cluster nodes”) that make up the cluster. As far as clients are concerned there’s one end point – an IP address or MAC address – and what happens behind that is unknown to them. This virtual server is “created” when the cluster forms; it doesn’t exist before that. (It is important to remember this because even if the servers are in a cluster the virtual server may not be created – as we shall see later.)

In the case of server clusters something called “quorum” comes into play.

Imagine you have 5 servers in a cluster. Say server 4 is the active server and it goes offline. Immediately, the other servers detect this and one of them becomes the new active server. But what if server 4 isn’t offline, just disconnected from the rest of the group? Now we’ll have server 4 continuing to be active, but the other servers can’t see this, and so one of them too becomes the active server – resulting in two active servers! To prevent such scenarios you have the concept of quorum – a term you might have heard of in other contexts, such as decision-making groups. A quorum is the minimum number of people required to make a decision. Say a group of 10 people is deciding something; one could stipulate that at least 6 members must be present during the decision-making process, else the group is not allowed to decide. A similar concept applies in the case of clusters.

In its simplest form you designate one resource (a server or a disk) as the quorum, and whichever cluster contains that resource sets itself as the active cluster while the other clusters deactivate themselves. This resource also holds the cluster database, which is a database containing the state of the cluster and its nodes and is accessed by all the nodes. In the example above, initially all 5 servers are connected and can see the quorum, so the cluster is active and one of the servers in it is active. When the split happens and server 4 (the currently active server) is separated, it can no longer see the quorum, so it disables the cluster (which is just itself, really) and stops being the active server. The other 4 servers can still see the quorum, so they continue being active and set a new server as the active one.
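
If it helps, here’s a tiny sketch of that rule (the node names and the can_see check are just illustrative): whichever side can still reach the quorum resource stays up, and the other side stands down.

    # "Disk Only" rule: a partition stays active only if it can reach the
    # quorum resource. can_see() is a made-up check for illustration.
    def partition_active(partition, can_see):
        return any(can_see(node) for node in partition)

    can_see = lambda node: node != "s4"       # server 4 is cut off

    print(partition_active(["s1", "s2", "s3", "s5"], can_see))  # True  -> stays active
    print(partition_active(["s4"], can_see))                    # False -> deactivates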

In this simple form the quorum is really like a token you hold. If your cluster holds the token it’s active; if it does not you disband (deactivate the cluster).

This simple form of quorum is usually fine, but it does not scale well when you have clusters across sites. Moreover, the quorum resource is a single point of failure. Say that resource is the one that’s disconnected or offline – now no one has the quorum, and worse, the cluster database is lost. For this reason the simple form of quorum is not used nowadays. (This mode of quorum is called “Disk Only”, by the way.)

There are three alternatives to the Disk Only mode of quorum.

One mode is called “Node Majority” and, as the name suggests, it is based on majority. Here each node has a copy of the cluster database (so there’s no single point of failure) and whichever cluster has more than half the nodes of the original cluster wins. So in the previous example, say the cluster of 5 servers splits into clusters of 3 and 2 servers each – since the first cluster has more than half of the nodes, that one wins. (In practice a vote takes place to decide this. Each node has a vote, so cluster one has 1+1+1 = 3 votes and cluster two has 1+1 = 2 votes. Cluster one wins.)
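
As a quick sketch (again with made-up numbers), the vote comes down to a simple majority check against the original cluster size:

    # Node Majority: each node has one vote; a partition wins only if it
    # has more than half the nodes of the original cluster.
    def has_quorum(nodes_in_partition, total_nodes):
        return nodes_in_partition > total_nodes / 2

    print(has_quorum(3, 5))  # True  -> the 3-node side wins
    print(has_quorum(2, 5))  # False -> the 2-node side deactivates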

Quorums based on majority have a disadvantage in that if the number of nodes is even then you can have a tie. That is, if the above example were a 6 node cluster and it split into two clusters of 3 nodes each, both clusters will deactivate as neither has more than half the nodes (i.e. neither has the quorum). You need a tie breaker!

It is worth noting that the effect of quorum extends to the number of servers that can fail in a cluster. Say we have a 3 node cluster. For the cluster to have quorum, at least 2 servers must be up. If one server fails, the cluster still has 2 servers so it will function as usual. But if one more server fails – or the two remaining servers are disconnected from each other – the remaining cluster does not have quorum (there’s only 1 node, which is not more than half the nodes) and so the cluster stops and the servers deactivate. This means that even though we have one server still running, and intuitively one would expect that server to continue servicing requests (as it would have in the case of an NLB cluster), it does not do so in the case of a server cluster, due to quorum! This is important to remember.
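
Put as a small calculation (my own summary of the rule above, nothing Windows specific): since more than half the nodes must survive, a cluster of n nodes can lose at most (n - 1) // 2 of them.

    # Failures tolerated under Node Majority: more than half the nodes
    # must remain, so at most (n - 1) // 2 nodes can fail.
    def tolerated_failures(n):
        return (n - 1) // 2

    for n in range(2, 7):
        print(n, "nodes ->", tolerated_failures(n), "failure(s) tolerated")
    # 2 -> 0, 3 -> 1, 4 -> 1, 5 -> 2, 6 -> 2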

Another mode is called “Node & Disk Majority” and, as the name suggests, it is a mix of the “Node Majority” and “Disk Only” modes. This mode is for clusters with an even number of nodes (where, as we know, “Node Majority” fails) and the way it works is that each node has a vote, the resource designated as quorum (a disk, usually called a “disk witness”) has a vote too, and the cluster with more than half the total votes is the active one. Essentially the disk witness acts as the tie breaker. (So a cluster with 6 nodes has 7 votes in total and still requires more than half of them to be considered active; if it splits into two clusters of 3 nodes each, when it comes to voting the cluster holding the disk witness will have (3+1=) 4 votes and hence win quorum.)
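
Here’s the same kind of sketch for this mode (again just illustrative vote counting): the disk witness is simply one more vote in the count.

    # Node & Disk Majority: total votes = nodes + 1 (the disk witness);
    # a partition needs more than half of all the votes to win.
    def has_quorum(nodes_in_partition, owns_witness, total_nodes):
        total_votes = total_nodes + 1
        votes = nodes_in_partition + (1 if owns_witness else 0)
        return votes > total_votes / 2

    # 6 node cluster splitting 3/3: the side holding the disk witness wins
    print(has_quorum(3, True, 6))   # True  -> 4 of 7 votes
    print(has_quorum(3, False, 6))  # False -> 3 of 7 votes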

In “Node & Disk Majority” mode, unlike the “Disk Only” mode, the cluster database is present on all the nodes and so it is not a single point of failure either.

The last mode is called “Node & File Share Majority” and this is a variant of the “Node Majority” mode. This mode too is for clusters with an even number of nodes, and it works similarly to “Node & Disk Majority” except that instead of a disk a file share is used. A file share (called a “file share witness” in this case) is selected on any server – not necessarily a part of the cluster – and one node in the cluster locks a file on this share, effectively telling the others that it “owns” the file share. So instead of using a disk as the tie breaker, ownership of the file share is used as the tie breaker. (As in the “Node & Disk Majority” mode, the node that owns the file share has an additional vote.) Using the previous examples, if a cluster with 6 nodes splits into two clusters of 3 nodes each, whichever cluster has the node owning the file share will win quorum while the other cluster deactivates. If the cluster with 6 nodes splits into clusters of 4 and 2 nodes each, and say the 2 node cluster owns the file share, it will still lose as more than half the nodes are in the 4 node cluster (in terms of votes the winning cluster has 4 votes, the losing cluster has 3 votes). When a cluster deactivates, the current owner releases the lock and a node in the new active cluster takes ownership.
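
The vote counting is the same as with the disk witness; here’s the 4 and 2 node split from above as plain arithmetic (just a sketch to check the numbers):

    # 6 nodes + the file share witness = 7 votes; more than 3.5 are needed.
    total_votes = 6 + 1
    side_a = 4        # 4 nodes, does not own the file share witness
    side_b = 2 + 1    # 2 nodes plus the witness vote
    print(side_a > total_votes / 2)  # True  -> wins quorum
    print(side_b > total_votes / 2)  # False -> deactivates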

An advantage of the “Node & File Share Majority” mode is that the file share can be anywhere – even on another cluster (which is preferable). The file share can also be on a node in the cluster itself, but that’s not preferred because if that node fails you lose two votes (that of being the file share owner as well as the node itself).

At work we have two HP LeftHand boxes in a SAN cluster. The only quorum mode this uses is the “Node Majority” one, which as we know fails for an even number of nodes, and so HP supplies a virtual appliance called “Failover Manager” that is installed on the ESX hosts and is used as the tie breaker. If the Failover Manager is powered off and both LeftHands are powered on together – i.e. both devices happen to come up at the same time – a cluster is not formed, as neither has the quorum. To avoid such situations the Failover Manager has to be present when they power on, or we have to time the powering on such that one of the LeftHands is online before the other.