Contact

Subscribe via Email

Subscribe via RSS/JSON

Categories

Creative Commons Attribution 4.0 International License
© Rakhesh Sasidharan

Elsewhere

[Aside] Under the Hood with DAGs

Watching this Ignite 2015 video: Under the Hood with DAGs, by Tim McMichael.

Adding some links here to supplement the video:

  • Tuning Failover Cluster Network thresholds – useful when you have stretched DAGs
  • The mystery of the 9223372036854775766 copy queue… – never had this one but good info.
    • Basically, the cluster registry keeps track of the last log number per database and also the timestamp. When a node wants to see its copy queue length (i.e. how behind it is in terms of processing the logs) it can compare this log number with the log number it has actually processed. Sometimes, however, some node might be having issue updating the cluster registry or reading the cluster registry and so they fall behind in terms of receiving updates. In such cases the last log number will match what they have processed, but it is actually outdated info and so if the Exchange Replication service on the server hosting the passive copy notices that the timestamp is 12 minutes ago it puts its database copy into self-protection mode. This is done by putting the copy queue length (a.k.a. CQL) manually as 9 quintillion (the maximum a 64-bit integer can be). No one can actually have such large a copy queue length so it’s as good number to choose.
    • The video suggests rebooting each node until you find one which might be holding updates. But the link above suggests a different method.

DAC

Came across some Datacenter Activation Coordination (DAC) from Tim’s blog: part 1, followed by a series of posts you can see at the end of part 1.

DAC mode works by using a bit stored in memory by Active Manager called the Datacenter Activation Coordination Protocol (DACP). DACP is simply a bit in memory set to either a 1 or a 0. A value of 1 means Active Manager can issue mount requests, and a value of 0 means it cannot.

The starting bit is always 0, and because the bit is held in memory, any time the Microsoft Exchange Replication service (MSExchangeRepl.exe) is stopped and restarted, the bit reverts to 0. In order to change its DACP bit to 1 and be able to mount databases, a starting DAG member needs to either:

  • Be able to communicate with any other DAG member that has a DACP bit set to 1; or
  • Be able to communicate with all DAG members that are listed on the StartedMailboxServers list.

The bit I italicized is important. If you read his blog post you’ll see why. If DAC is activated and you are starting up a previously shutdown DAG, even though the DAG might have quorum it will not start up if some members are still offline. (I had missed that when reading about DAC earlier). To summarize it succinctly from part 2 of his series:

Remember, with DAC mode enabled, different rules apply for mounting databases on startup. The starting DAG member must be able to participate in a cluster that has quorum, and it must be able to communicate with another DAG member that has a DACP value of 1 or be able to communicate with all DAG members listed on the StartedMailboxServers list.

Here’s highlights from some of the interesting posts in Tim’s series:

  • Part 4 has info on the steps to do a datacenter switchover and the cmdlets available when DAC is enabled. Essentially: you 1) Stop-DatabaseAvailabilityGroup with –configurationOnly:$TRUE switch for the site that is down – this marks the servers in the site that is down as down, 2) Stop-Service CLUSSVC on the nodes in the site that is up, and finally 3) Restore-DatabaseAvailabilityGroup specifying the site that is up. This Microsoft doc on datacenter switchovers is worth reading side-by-side. It contains info on both DAC and non-DAC scenarios so watch out for that.
  • Part 5 has info on how to use the Start-DatabaseAvailabilityGroup cmdlet to set the DACP bit as 1 on a specified server thus bringing up the DAG by forcing a consensus.
  • Part 6 is an interesting story. A nice edge case of DAC being enabled and graceful shutdown.
  • Part 8 is another interesting story on what happens due to a typo in a cmdlet.

Very briefly, the DAC cmdlets:

  • Stop-DatabaseAvailabilityGroup – mark a specified server, or all server in a specified AD site, as down. Use the -ConfigurationOnly switch to mark the server as down in AD only but not actually do anything on the server(s). Need to use this switch if the servers are already offline but AD is up and accessible in that site. This cmdlet also forces a sync of AD across sites so the information is propagated.
  • Start-DatabaseAvailabilityGroup – same as above, but mark as up. Can use the -ConfigurationOnly switch to not really do anything but only mark in AD.
  • Restore-DatabaseAvailabilityGroup – it evicts any stopped servers, it can configure the DAG to use an alternate witness server, and it brings up the DAG after doing this. This cmdlet can only be used against a DAG with DAC enabled.

Dynamic Quorum

Came across dynamic quorum from the videos (wasn’t previously aware of it). Am being lazy and will put in some screenshots from the video:

The highlighted part is the key thing.

Remember that quorum is defined as “(the number of votes)/2 + 1“. Each node (or witness) typically has a single vote, and (number of votes)/2 is rounded down (i.e. 7/2 = 3.5, rounded down to 3).

With dynamic quorum once a node (or set of nodes) fail, and if the remaining set of nodes form a quorum (note – they have to form quorum), then the required quorum of the cluster is adjusted to reflect the remaining number of nodes.

Take a look at the scenario below:

We have two data centers. 6 nodes + a witness, so initially the quorum was 7/2 + 1 = 4.

The link between the two data centers goes down. Data center B has 3 nodes, which is below the quorum of 4 so all 3 nodes shutdown. Data center A has 3 nodes + witness, thus meeting the quorum and it stays up.

At this point if any further node in data center A goes down, they will fall below the quorum and the cluster will shutdown. To avoid such a situation is where dynamic quorum comes in. With dynamic quorum (introduced since Server 2012) when the nodes in data center A form quorum, the new quorum requirements is 4/2 + 1 = 3.

If a node goes down in data center A, leaving 2 nodes + a witness, since they meet the new quorum of 3 the cluster stays up. The quorum then gets revised to be 2/2 + 1 = 2. If yet another node goes down, the remaining node + witness still meets the new quorum of 2 and so the cluster continues to stay up.

Another slide:

Two data centers, 2 nodes + 1 node, no witness; the quorum is therefore 3/2 + 1 = 2.

One of the nodes in data center A goes down. Since the number of remaining nodes meets quorum, the cluster can stay up. But since there is no fail share witness each node cannot be given an equal vote (I wasn’t aware of this). Thus the cluster service picks up one of the nodes (the one with the lowest node ID) and gives it a vote of 0. The node in data center A has a vote of 0. The new quorum is thus 1/2 + 1 = 1.

If the link between the two data centers goes down, the node in data center B stays up even though the node in data center A too could have formed quorum! Nothing wrong with it, just an edge case to keep in mind as chances are you probably wanted data center A to remain up as that’s why you provisioned two nodes there in the first place.

Now for a variant in which there is a witness:

So two data centers, 2 nodes + 2 nodes, 1 witness in data center A; the quorum is therefore 5/2 + 1 = 3.

As before, one of the nodes in data center A goes down. Since there are 3 nodes + witness remaining, they meet quorum and the cluster continues. The new quorum is 4/2 + 1 = 3. Again, the data center link goes down. Everything goes down! :) Why? Coz no one has a clear majority. Each side has 2 votes, not the 3 required.

Interestingly I have this setup at work. So a critical thing to keep in mind is that if I were to update & reboot the witness or one of the nodes in data center A (my preferred data center), and the WAN link were to go down – I could lose the cluster! No such problems if I update & reboot a node in data center B and the link goes down, as data center A has the majority. Funny, it’s like you must keep the witness in the less preferred data center.

Windows Server 2012R2 improves upon dynamic quorum by adding dynamic witness.

So if the number of votes is odd, the witness vote is removed. And if the witness is offline or failed, then too it is removed (that includes reboots too, right?). 

Now things get tricky.

Going back to the previous example: so two data centers, 2 nodes + 2 nodes, 1 witness in data center A; the quorum is therefore 5/2 + 1 = 3.

As before a node data center A goes down (the picture is a bit incorrect as I skipped some intermediate slides), the remaining nodes have quorum so the cluster stays put. The new quorum is 4/2 + 1 = 3. But since the number of nodes is now ODD, cluster service removes the witness from the vote calculations. So the new quorum turns out to be 3/2 + 1 = 2. At this point if the link goes down, the nodes in data center B have quorum and so they form a cluster while the remaining node in data center A is shut down. So unlike the Server 2012 case, which had no dynamic witness, the whole cluster does not go down!

Now, going back to the case where one of the nodes (not witness) had its vote removed as there were only one node in each data center, I mentioned that the node with the lowest ID gets removed. The next two slides talk about that, including how to select a node in a cluster that we’d preferentially like to remove the vote of in such situations.

At this point I’d also like to link to this Microsoft doc on cluster quorum. Am going to quote some parts from there as they explain well and I’d like to keep it here as as reference to myself.

How cluster quorum works

When nodes fail, or when some subset of nodes loses contact with another subset, surviving nodes need to verify that they constitute the majority of the cluster to remain online. If they can’t verify that, they’ll go offline.

But the concept of majority only works cleanly when the total number of nodes in the cluster is odd (for example, three nodes in a five node cluster). So, what about clusters with an even number of nodes (say, a four node cluster)?

There are two ways the cluster can make the total number of votes odd:

  1. First, it can go up one by adding a witness with an extra vote. This requires user set-up.
  2. Or, it can go down one by zeroing one unlucky node’s vote (happens automatically as needed).

I didn’t know about point 2 until watching this video.

Worth bearing in mind that this also applies in the case of the witness being lost. So any time your witness is offline the cluster service automatically zeroes the vote of one of the nodes. If you have 2 nodes in each data center + a witness in one data center, and you reboot the witness – that is fine. One of the nodes will have its vote zeroed out, but there’s no impact and when the witness returns the zeroed out node gets its vote back. But if during the time your witness is rebooting you also have a network outage between the two data centers, then the data center with majority nodes (i.e. not the data center containing the node whose vote was zeroed) wins and the cluster fails over there.

Some more:

Dynamic witness

Dynamic witness toggles the vote of the witness to make sure that the total number of votes is odd. If there are an odd number of votes, the witness doesn’t have a vote. If there is an even number of votes, the witness has a vote. Dynamic witness significantly reduces the risk that the cluster will go down because of witness failure. The cluster decides whether to use the witness vote based on the number of voting nodes that are available in the cluster.

Dynamic quorum works with Dynamic witness in the way described below.

Dynamic quorum behavior

  • If you have an even number of nodes and no witness, one node gets its vote zeroed. For example, only three of the four nodes get votes, so the total number of votes is three, and two survivors with votes are considered a majority.
  • If you have an odd number of nodes and no witness, they all get votes.
  • If you have an even number of nodes plus witness, the witness votes, so the total is odd.
  • If you have an odd number of nodes plus witness, the witness doesn’t vote.

Am pretty sure am going to forget all this a few days from today so I’ll re-link to the docs again as it goes into more detail and has examples etc.

Clusters & Quorum

I spent yesterday refreshing my knowledge of clusters and quorum. Hadn’t worked on these since Windows Server 2003! So here’s a brief intro to this topic:

There are two types of clusters – Network Load Balancing (NLB) and Server Clusters.

Network Load Balancing is “share all” in that every server has a full copy of the app (and the data too if it can be done). Each server is active, the requests are sent to each of them randomly (or using some load distributing algorithm). You can easily add/ remove servers. Examples where you’d use NLB are SMTP servers, Web servers, etc. Each server in this case is independent of the other as long as they are configured identically.

Server Clusters is “share nothing” in that only one server has a full copy of the app and is active. The other servers are in a standby mode, waiting to take over if the active one fails. A shared storage is used, which is how standby servers can take over if the active server fails.

The way clusters work is that clients see one “virtual” server (not to be confused with virtual servers of virtualization). Behind this virtual server are the physical servers (called “cluster nodes” actually) that make up the cluster. As far as clients are concerned there’s one end point – an IP address or MAC address – and what happens behind that is unknown to them. This virtual server is “created” when the cluster forms, it doesn’t exist before that. (It is important to remember this because even if the servers are in a cluster the virtual server may not be created – as we shall see later).

In the case of server clusters something called “quorum” comes into play.

Imagine you have 5 servers in a cluster. Say server 4 is the active server and it goes offline. Immediately, the other servers detect this and one of them becomes the new active server. But what if server 4 isn’t offline, it’s just disconnected from the rest of the group. Now we’ll have server 4 continuing to be active, but the other servers can’t see this, and so one of these too becomes the active server – resulting in two active servers! To prevent such scenarios you have the concept of quorum – a term you might have heard of in other contexts, such as decision making groups. A quorum is the minimum number of people required to make a decision. Say a group of 10 people are deciding something, one could stipulate that at least 6 members must be present during the decision making process else the group is not allowed to decide. A similar concept applies in the case of clusters.

In its simplest form you designate one resource (a server or a disk) as the quorum and whichever cluster contains that resource sets itself as the active cluster while the other clusters deactivate themselves. This resource also holds the cluster database, which is a database containing the state of the cluster and its nodes, and is accessed by all the nodes. In the example above, initially all 5 servers are connected and can see the quorum, so the cluster is active and one of the servers in it is active. When the split happens and server 4 (the currently active server) is separated, it can no longer see the quorum and so disables the cluster (which is just itself really) and stops being the active server. The other 4 servers can still see the quorum, so they continue being active and set a new server as the active one.

In this simple form the quorum is really like a token you hold. If your cluster holds the token it’s active; if it does not you disband (deactivate the cluster).

This simple form of quorum is usually fine, but does not scale to when you have clusters across sites. Moreover, the quorum resource is a single point of failure. Say that resource is the one that’s disconnected or offline – now no one has the quorum, and worse, the cluster database is lost. For this reason the simple form of quorum is not used nowadays. (This mode of quorum is called “Disk Only” by the way).

There are three alternatives to the Disk Only mode of quorum.

One mode is called “Node Majority” and as the name suggests it is based on majority. Here each node has a copy of the cluster database (so there’s no single point of failure) and whichever cluster has more than half the nodes of the cluster wins. So in the previous example, say the cluster of 5 servers splits into one of 3 and 2 servers each – since the first cluster has more than half of the nodes, that wins. (In practice a voting takes place to decide this. Each node has a vote. So cluster one has 1+1+1 = 3 votes; cluster two has 1+1 = 2 votes. Cluster one wins).

Quorums based on majority have a disadvantage in that if the number of nodes are even then you have a tie. That is, if the above example were a 6 node cluster and it split into two clusters of 3 nodes each, both clusters will deactivate as neither have more than half the nodes (i.e. neither have the quorum). You need a tie breaker!

It is worth noting that the effect of quorum extends to the number of servers that can fail in a cluster. Say we have a 3 node cluster. For the cluster to be valid, it must have at least 2 servers. If one server fails, the cluster still has 2 servers so it will function as usual. But if one more server fails – or these two servers are disconnected from each other – the remaining cluster does not have quorum (there’s only 1 node, which is less than more than half the nodes) and so the cluster stops and the servers deactivate. This means even though we have one server still running, and intuitively one would expect that server to continue servicing requests (as it would have in the case of an NLB cluster), it does not do so in the case of a server cluster due to quorum! This is important to remember.

Another mode is called “Node & Disk Majority” and as the name suggests it is a mix of the “Node Majority” and “Disk Only” modes. This mode is for clusters with an even number of modes (where, as we know, “Node Majority” fails) and the way it works is that the cluster with more than half the nodes and which also contains the resource (a disk, usually called a “disk witness”) designated as quorum is the active one. Essentially the disk witness essentially acts as the tie breaker. (In practice the disk witness has an extra vote. So a cluster with 6 nodes will still require more than 3 nodes to consider it active and so if it splits into 3 nodes each, when it comes to voting one of the clusters will have (3+1=) 4 votes and hence win quorum).

In “Node & Disk Majority” mode, unlike the “Disk Only” mode the cluster database is present with all the nodes and so it is not a single point of failure either.

The last mode is called “Node & File Share Majority” and this is a variant of the “Node Majority” mode. This mode too is for clusters with an even number of nodes, and it works similar to “Node & Disk Majority” except for the fact that instead of a resource a file share is used. A file share (called a “file witness” in this case) is selected on any server – not necessarily a part of the cluster – and one node in the cluster locks a file on this share, effectively telling others that it “owns” the file share. So instead of using a resource as a tie breaker, ownership of the file share is used as the tie breaker. (As in the “Node & Disk Majority” mode the node that owns the file share has an additional vote as it owns the file share). Using the previous examples, if a cluster with 6 nodes splits into 3 nodes each, whichever cluster has the node owning the file share will win quorum while the other cluster will deactivate. If the cluster with 6 nodes splits into clusters of 4 and 2 nodes each, and say the 2 node cluster owns the file share, it will still lose as more than half the nodes are in the 4 node cluster (in terms of votes the winning cluster has 4 votes, the losing cluster has 3 votes). When the cluster deactivates, the current owner will release the lock and a node in the new cluster will take ownership.

An advantage of the “Node & File Share Majority” mode is that the file share can be anywhere – even on another cluster (preferably). The file share can also be on a node in the cluster, but that’s not preferred as if the node fails you lose two votes (that of being the file share owner as well as the node itself).

Here are some good links that contain more information (Windows specific):

At work we have two HP LeftHand boxes in a SAN cluster. The only quorum mode used by this is is the “Node Majority” one, which as we know fails for an even number of nodes, and so HP supplies a virtual appliance called “Failover Manager” that is installed on the ESX hosts and is used as the tie breaker. If the Failover Manager is powered off and both LeftHands are powered on together, suppose both devices happen to come on at the same time a cluster is not formed as neither has the quorum. To avoid such situations the Failover Manager has to be present when they power on, or we have to time the powering on such that one of the LeftHands is online before the other.