[Aside] Under the Hood with DAGs

Watching this Ignite 2015 video: Under the Hood with DAGs, by Tim McMichael.

Adding some links here to supplement the video:

  • Tuning Failover Cluster Network thresholds – useful when you have stretched DAGs
  • The mystery of the 9223372036854775766 copy queue… – never had this one but good info.
    • Basically, the cluster registry keeps track of the last log number per database, along with a timestamp. When a node wants to see its copy queue length (i.e. how far behind it is in terms of processing the logs) it can compare this log number with the log number it has actually processed. Sometimes, however, a node might have issues updating or reading the cluster registry, and so it falls behind in terms of receiving updates. In such cases the last log number will match what it has processed, but that is actually outdated info; so if the Exchange Replication service on the server hosting the passive copy notices that the timestamp is more than 12 minutes old, it puts its database copy into self-protection mode. It does this by manually setting the copy queue length (a.k.a. CQL) to roughly 9 quintillion (about the maximum value a signed 64-bit integer can hold). No one can actually have such a large copy queue length, so it’s a good number to choose.
    • The video suggests rebooting each node until you find one which might be holding updates. But the link above suggests a different method.

DAC

Came across Datacenter Activation Coordination (DAC) on Tim’s blog: part 1, followed by a series of posts you can see linked at the end of part 1.

DAC mode works by using a bit stored in memory by Active Manager called the Datacenter Activation Coordination Protocol (DACP). DACP is simply a bit in memory set to either a 1 or a 0. A value of 1 means Active Manager can issue mount requests, and a value of 0 means it cannot.

The starting bit is always 0, and because the bit is held in memory, any time the Microsoft Exchange Replication service (MSExchangeRepl.exe) is stopped and restarted, the bit reverts to 0. In order to change its DACP bit to 1 and be able to mount databases, a starting DAG member needs to either:

  • Be able to communicate with any other DAG member that has a DACP bit set to 1; or
  • Be able to communicate with all DAG members that are listed on the StartedMailboxServers list.

The second bullet – being able to communicate with all DAG members on the StartedMailboxServers list – is the important bit. If you read his blog post you’ll see why. If DAC is activated and you are starting up a previously shut-down DAG, even though the DAG might have quorum it will not start up if some members are still offline. (I had missed that when reading about DAC earlier.) To summarize it succinctly from part 2 of his series:

Remember, with DAC mode enabled, different rules apply for mounting databases on startup. The starting DAG member must be able to participate in a cluster that has quorum, and it must be able to communicate with another DAG member that has a DACP value of 1 or be able to communicate with all DAG members listed on the StartedMailboxServers list.

Here are highlights from some of the interesting posts in Tim’s series:

  • Part 4 has info on the steps to do a datacenter switchover and the cmdlets available when DAC is enabled. Essentially you: 1) run Stop-DatabaseAvailabilityGroup with the -ConfigurationOnly:$TRUE switch for the site that is down – this marks the servers in that site as down; 2) run Stop-Service CLUSSVC on the nodes in the site that is up; and finally 3) run Restore-DatabaseAvailabilityGroup specifying the site that is up. This Microsoft doc on datacenter switchovers is worth reading side by side. It contains info on both DAC and non-DAC scenarios, so watch out for that.
  • Part 5 has info on how to use the Start-DatabaseAvailabilityGroup cmdlet to set the DACP bit to 1 on a specified server, thus bringing up the DAG by forcing a consensus.
  • Part 6 is an interesting story – a nice edge case of DAC being enabled and a graceful shutdown.
  • Part 8 is another interesting story on what happens due to a typo in a cmdlet.

Very briefly, the DAC cmdlets:

  • Stop-DatabaseAvailabilityGroup – marks a specified server, or all servers in a specified AD site, as down. Use the -ConfigurationOnly switch to mark the server as down in AD only and not actually do anything on the server(s). You need to use this switch if the servers are already offline but AD is up and accessible in that site. This cmdlet also forces a sync of AD across sites so the information is propagated.
  • Start-DatabaseAvailabilityGroup – same as above, but marks servers as up. Can use the -ConfigurationOnly switch to not really do anything and only mark them as up in AD.
  • Restore-DatabaseAvailabilityGroup – it evicts any stopped servers, it can configure the DAG to use an alternate witness server, and it brings up the DAG after doing this. This cmdlet can only be used against a DAG with DAC enabled.
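
To put these together, a rough sketch of the part 4 switchover sequence. DAG1, Site-A (the failed site), Site-B (the surviving site) and ALTWITNESS are made-up names for illustration; run this from the Exchange Management Shell:

    # 1) Mark the servers in the failed site as down – AD-only, since those servers are unreachable
    Stop-DatabaseAvailabilityGroup -Identity DAG1 -ActiveDirectorySite "Site-A" -ConfigurationOnly:$TRUE

    # 2) Stop the cluster service on every node in the surviving site
    Stop-Service CLUSSVC

    # 3) Evict the stopped servers and bring the DAG up in the surviving site,
    #    optionally pointing it at an alternate witness
    Restore-DatabaseAvailabilityGroup -Identity DAG1 -ActiveDirectorySite "Site-B" -AlternateWitnessServer "ALTWITNESS" -AlternateWitnessDirectory "C:\DAG1"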

Dynamic Quorum

Came across dynamic quorum from the videos (wasn’t previously aware of it). Am being lazy and will put in some screenshots from the video:

The highlighted part is the key thing.

Remember that quorum is defined as “(the number of votes)/2 + 1“. Each node (or witness) typically has a single vote, and (number of votes)/2 is rounded down (i.e. 7/2 = 3.5, rounded down to 3).
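
Just to make the rounding explicit, in PowerShell terms:

    # Required quorum = floor(votes / 2) + 1
    function Get-RequiredQuorum ($votes) { [math]::Floor($votes / 2) + 1 }

    Get-RequiredQuorum 7   # 6 nodes + witness -> 4
    Get-RequiredQuorum 4   # 4 remaining votes -> 3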

With dynamic quorum, once a node (or set of nodes) fails, and if the remaining nodes still form a quorum (note – they have to form quorum), the required quorum of the cluster is adjusted to reflect the remaining number of nodes.

Take a look at the scenario below:

We have two data centers. 6 nodes + a witness, so initially the quorum was 7/2 + 1 = 4.

The link between the two data centers goes down. Data center B has 3 nodes, which is below the quorum of 4, so all 3 nodes shut down. Data center A has 3 nodes + witness, thus meeting the quorum, so it stays up.

At this point if any further node in data center A goes down, it will fall below the quorum and the cluster will shut down. Avoiding such a situation is where dynamic quorum comes in. With dynamic quorum (introduced in Server 2012), once the nodes in data center A form quorum, the new quorum requirement is 4/2 + 1 = 3.

If a node goes down in data center A, leaving 2 nodes + a witness, since they meet the new quorum of 3 the cluster stays up. The quorum then gets revised to be 2/2 + 1 = 2. If yet another node goes down, the remaining node + witness still meets the new quorum of 2 and so the cluster continues to stay up.

Another slide:

Two data centers, 2 nodes + 1 node, no witness; the quorum is therefore 3/2 + 1 = 2.

One of the nodes in data center A goes down. Since the number of remaining nodes meets quorum, the cluster can stay up. But since there is no file share witness, each node cannot be given an equal vote (I wasn’t aware of this). Thus the cluster service picks one of the nodes (the one with the lowest node ID) and gives it a vote of 0. The node in data center A has a vote of 0. The new quorum is thus 1/2 + 1 = 1.

If the link between the two data centers goes down, the node in data center B stays up even though the node in data center A could just as well have been the one to survive! Nothing wrong with it, just an edge case to keep in mind, as chances are you wanted data center A to remain up – that’s why you provisioned two nodes there in the first place.

Now for a variant in which there is a witness:

So two data centers, 2 nodes + 2 nodes, 1 witness in data center A; the quorum is therefore 5/2 + 1 = 3.

As before, one of the nodes in data center A goes down. Since there are 3 nodes + witness remaining, they meet quorum and the cluster continues. The new quorum is 4/2 + 1 = 3. Again, the data center link goes down. Everything goes down! :) Why? Coz no one has a clear majority. Each side has 2 votes, not the 3 required.

Interestingly I have this setup at work. So a critical thing to keep in mind is that if I were to update & reboot the witness or one of the nodes in data center A (my preferred data center), and the WAN link were to go down – I could lose the cluster! No such problems if I update & reboot a node in data center B and the link goes down, as data center A has the majority. Funny, it’s like you must keep the witness in the less preferred data center.

Windows Server 2012R2 improves upon dynamic quorum by adding dynamic witness.

So if the number of node votes is odd, the witness vote is removed. And if the witness is offline or has failed, its vote is removed too (that includes reboots, right?).

Now things get tricky.

Going back to the previous example: so two data centers, 2 nodes + 2 nodes, 1 witness in data center A; the quorum is therefore 5/2 + 1 = 3.

As before a node in data center A goes down (the picture is a bit incorrect as I skipped some intermediate slides); the remaining nodes have quorum so the cluster stays up. The new quorum is 4/2 + 1 = 3. But since the number of nodes is now ODD, the cluster service removes the witness from the vote calculations. So the new quorum turns out to be 3/2 + 1 = 2. At this point if the link goes down, the nodes in data center B have quorum and so they form a cluster, while the remaining node in data center A is shut down. So unlike the Server 2012 case, which had no dynamic witness, the whole cluster does not go down!

Now, going back to the case where one of the nodes (not the witness) had its vote removed because there was only one node left in each data center: I mentioned that the node with the lowest ID gets its vote removed. The next two slides talk about that, including how to select a node in the cluster whose vote we’d preferentially like to remove in such situations.
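
Noting it down here: on Server 2012 R2 I believe the node whose vote gets dropped first can be influenced via the LowerQuorumPriorityNodeId cluster property. A quick sketch, with a made-up node name:

    # Prefer dropping the vote of the DR-site node when the cluster has to zero a vote
    $node = Get-ClusterNode -Name "NODE-DR1"
    (Get-Cluster).LowerQuorumPriorityNodeId = $node.Id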

At this point I’d also like to link to this Microsoft doc on cluster quorum. Am going to quote some parts from there as they explain things well and I’d like to keep them here as a reference to myself.

How cluster quorum works

When nodes fail, or when some subset of nodes loses contact with another subset, surviving nodes need to verify that they constitute the majority of the cluster to remain online. If they can’t verify that, they’ll go offline.

But the concept of majority only works cleanly when the total number of nodes in the cluster is odd (for example, three nodes in a five node cluster). So, what about clusters with an even number of nodes (say, a four node cluster)?

There are two ways the cluster can make the total number of votes odd:

  1. First, it can go up one by adding a witness with an extra vote. This requires user set-up.
  2. Or, it can go down one by zeroing one unlucky node’s vote (happens automatically as needed).

I didn’t know about point 2 until watching this video.

Worth bearing in mind that this also applies in the case of the witness being lost. So any time your witness is offline the cluster service automatically zeroes the vote of one of the nodes. If you have 2 nodes in each data center + a witness in one data center, and you reboot the witness – that is fine. One of the nodes will have its vote zeroed out, but there’s no impact and when the witness returns the zeroed out node gets its vote back. But if during the time your witness is rebooting you also have a network outage between the two data centers, then the data center with majority nodes (i.e. not the data center containing the node whose vote was zeroed) wins and the cluster fails over there.

Some more:

Dynamic witness

Dynamic witness toggles the vote of the witness to make sure that the total number of votes is odd. If there are an odd number of votes, the witness doesn’t have a vote. If there is an even number of votes, the witness has a vote. Dynamic witness significantly reduces the risk that the cluster will go down because of witness failure. The cluster decides whether to use the witness vote based on the number of voting nodes that are available in the cluster.

Dynamic quorum works with Dynamic witness in the way described below.

Dynamic quorum behavior

  • If you have an even number of nodes and no witness, one node gets its vote zeroed. For example, only three of the four nodes get votes, so the total number of votes is three, and two survivors with votes are considered a majority.
  • If you have an odd number of nodes and no witness, they all get votes.
  • If you have an even number of nodes plus witness, the witness votes, so the total is odd.
  • If you have an odd number of nodes plus witness, the witness doesn’t vote.
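
A handy way to see all of this on a live cluster is to check the vote assignments – NodeWeight is the configured vote, DynamicWeight is what dynamic quorum is currently assigning, and (on 2012 R2 and later) WitnessDynamicWeight does the same for the witness:

    # Configured vote vs. the vote dynamic quorum is currently giving each node
    Get-ClusterNode | Format-Table Name, State, NodeWeight, DynamicWeight

    # Whether the witness currently has a vote (1) or not (0)
    (Get-Cluster).WitnessDynamicWeight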

Am pretty sure am going to forget all this a few days from today so I’ll re-link to the docs again as it goes into more detail and has examples etc.

P2V a SQL cluster by breaking the cluster

Need to P2V a SQL cluster at work. Here are screenshots of what I did in a test environment to see if an idea of mine would work.

We have a SQL cluster of 2 physical nodes. The requirement was to convert this into a single virtual machine.

P2V-ing a single server is easy. Use VMware Converter. But P2V-ing a cluster like this is tricky. You could P2V each node and end up with a cluster of 2 virtual nodes, but that wasn’t what we wanted. We didn’t want to deal with RDMs and such for the cluster, so we wanted to get rid of the cluster itself. VMware can provide HA if anything happens to the single node.

My idea was to break the cluster and get one of the nodes of the cluster to assume the identity of the cluster. Have SQL running off that. Virtualize this single node. And since there’s no change as far as the outside world is concerned no one’s the wiser.

Found a blog post that pretty much does what I had in mind. Found one more which was useful but didn’t really pertain to my situation. Have a look at the latter post if your DTC is on the Quorum drive (wasn’t so in my case).

So here we go.

1) Make the node that I want to retain the active node of the cluster (so it owns all the disks and databases). Then shut down SQL Server.

[screenshot: sqlshutdown]

2) Shut down the cluster.

[screenshot: clustershutdown]

3) Remove the node we want to retain from the cluster.

We can’t remove/ evict the node via the GUI as the cluster is offline. Nor can we remove the Failover Cluster feature from the node as it is still part of a cluster (even though the cluster is shut down). So we need to do a bit of “surgery”. :)

Open PowerShell and do the following:
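
Something along these lines – the cmdlet in question is Clear-ClusterNode, and -Force skips the confirmation prompt:

    # Wipe the cluster configuration from this node
    Clear-ClusterNode -Force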

This simply clears any cluster related configuration from the node. It is meant to be used on evicted nodes.

Once that’s done, remove the Failover Clustering feature and reboot the node. If you want to do this via PowerShell:
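
A sketch of that (Remove-WindowsFeature is an alias of Uninstall-WindowsFeature on newer versions of Windows Server):

    Import-Module ServerManager   # needed on older versions; auto-loads on 2012 and later

    # Remove the Failover Clustering feature and reboot
    Remove-WindowsFeature Failover-Clustering
    Restart-Computer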

4) Bring online the previously shared disks.

Once the node is up and running, open Disk Management and mark as online the shared disks that were previously part of the cluster.

[screenshot: disksonline]

5) Change the IP and name of this node to that of the cluster.

Straightforward. Add CNAME entries in DNS if required. Also, you will have to remove the cluster computer object from AD before renaming this node to that name.

6) Make some registry changes.

The SQL Server is still not running as it expects to be on a cluster. So make some registry changes.

First go to HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\Setup and open the entry called SQLCluster and change its value from 1 to 0.

Then take a backup (just in case; we don’t really need it) of the key called HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\Cluster and delete it.

Note that MSSQL10_50.MSSQLSERVER may vary if you have a different version or instance of SQL Server than in my case.
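
If you prefer to make these registry changes from PowerShell rather than regedit, a sketch (same caveat about the instance name applies; the backup path is just an example):

    $base = 'HKLM:\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER'

    # Tell SQL Server it is no longer clustered
    Set-ItemProperty -Path "$base\Setup" -Name SQLCluster -Value 0

    # Back up the Cluster key (just in case), then delete it
    reg export 'HKLM\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\Cluster' C:\Temp\SQLCluster-backup.reg
    Remove-Item -Path "$base\Cluster" -Recurse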

7) Start the SQL services and change their startup type to Automatic.

I had 3 services.
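
If you want to do this from PowerShell, something along these lines works – the service names below are the typical default-instance ones (SQL Server, SQL Server Agent, SQL Browser); adjust them to match whatever three services you have:

    # Set each service to start automatically and start it now
    $services = 'MSSQLSERVER', 'SQLSERVERAGENT', 'SQLBrowser'
    foreach ($svc in $services) {
        Set-Service -Name $svc -StartupType Automatic
        Start-Service -Name $svc
    }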

Now your SQL server should be working.

8) Restart the server – not needed, but I did so anyways.

Test?

If you are doing this in a test environment (like I was) and don’t have any SQL applications to test with, do the following.

Right click the desktop on any computer (or the SQL server computer itself) and create a new text file. Then rename that to blah.udl. The name doesn’t matter as long as the extension is .udl. Double click on that to get a window like this:

[screenshot: udl]

Now you can fill in the SQL server name and test it.

One thing to keep in mind (if you are not a SQL person – I am not): Windows NT Integrated security is what you need to use if you want to authenticate against the server with an AD account. It is tempting to select the “Use a specific user name …” option and put an AD username/ password there, but that won’t work. That option is for SQL authentication.

If you want to use a different AD account you will have to launch the tool via “run as” with that account.

Also, on a fresh install of SQL Server, SQL authentication is disabled by default. You can create SQL accounts but authentication will fail. To enable SQL authentication, right click on the server in SQL Server Management Studio and go to Properties, then go to Security and enable SQL authentication.

[screenshot: sqlauth]

That’s all!

Now one can P2V this node.

Notes on vSphere High Availability (HA)

Just some notes on vSphere HA as I read along on that. Nothing new here …

Starting with vSphere 5.0, HA has a Master/ Slave model. One ESXi host is elected as the Master, the rest are Slaves. The Master is the one with the largest number of datastores connected to it; if all ESXi hosts have the same number of datastores connected to them, the Master is the one with the largest Managed Object ID (MOID). Note that the MOID is interpreted lexically – so an MOID of 99 is larger than 100. PowerCLI can be used to view the MOIDs:
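
For example (assuming you are already connected to vCenter with Connect-VIServer):

    # The Id property is the MOID, of the form HostSystem-host-nnn
    Get-VMHost | Select-Object Name, Id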

Also, the MOID is a vCenter-specific construct. Whenever a host, VM, datastore, etc. is added to vCenter it is assigned an MOID. For instance, here are the MOIDs of my datastores:
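
Same idea, just against the datastores this time:

    # The Id here takes the form Datastore-datastore-nnn
    Get-Datastore | Select-Object Name, Id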

Although I haven’t used this, it’s also possible to find MOIDs via the vSphere Managed Object Browser. See this KB article for more info.

Back to the topic – the above is how a Master is elected. There’s only one Master per cluster. When it comes to HA, the Fault Domain Manager (FDM) on this Master is responsible for most of the tasks (which is why even if vCenter is down for a while HA can continue working). vCenter checks with the Master and the Master communicates with vCenter to keep each other abreast of the cluster situation.

  • FDM is installed at /opt/vmware/fdm/fdm/
  • FDM config files are at /etc/opt/vmware/fdm/

The Master monitors the Slave hosts, and if a Slave goes down/ is unreachable the Master is responsible for starting its Protected VMs elsewhere. The Master is also responsible for keeping the Slaves abreast of the cluster configuration.

Slaves are limited to monitoring the VMs running on them. Slaves monitor VM health, and if a Protected VM powers down they inform the Master so it can be restarted. (Note on Protected VMs: once you enable VM monitoring on a cluster or set a VM as Protected, the VM must be powered off and powered on to be protected.) Slaves also keep in touch with each other, and if they find the Master is down they conduct an election to select a new Master.

The only time vCenter communicates with Slaves is when a new Master needs to be elected or when the Master reports a Slave as missing and so vCenter tries to contact it.

Slaves send network heartbeats to the Master every second. When a Master stops receiving heartbeats from a Slave it knows it is offline or partitioned/ isolated. Similarly when a Slave stops receiving heartbeats from a Master it knows the Master is offline or partitioned/ isolated.

  • If a Slave is cut off from all other hosts (Master and Slaves) it is considered isolated (caveat: you can also specify up to 10 isolation IP addresses to ping – if these are reachable but the Master and Slaves are not, the Slave does not consider itself isolated, only partitioned).
  • If a Slave is cut off from the Master but still has contact with some of the other Slaves, then it is considered partitioned.

In the past if a Slave were isolated/ partitioned the Master would consider it as offline and restart its Protected VMs elsewhere. Starting with vSphere 5.0 the Master also sends a ping (ICMP packet) to the Slave to see if it responds, and uses datastore heartbeats to verify the Slave is really down. It could be that the Management network is down but the VM and storage networks are up, so the VMs are still functioning as expected.

Datastore heartbeats work thus (and remember they are only used in case of isolation/ partition scenarios):

  • When enabling HA for a cluster, a datastore is automatically selected (or can be selected manually by the user) to be used for datastore heartbeats.
  • On this datastore a folder called .vSphere-HA is created, within which a sub-folder named FDM-<Fault Domain ID>-<vCenter Server Name> is created. (Such a name allows the same datastore to be used by multiple clusters.)
  • Each host creates a file named after its MOID in this sub-folder, like this: [screenshot: heartbeats]
  • Notice the host-X-hb file above? That is created by each host (you can check the /var/log/fdm.log file on each host to see it creating this file). When a Slave does not get heartbeats from the Master it updates its file above (and also checks the timestamp of the Master’s file – if that has updates it means the Master is alive). Similarly, when a Master does not hear from a Slave it checks the Slave’s file above to see if there are updates. This is how datastore heartbeats work.
  • If a Slave is network partitioned – i.e. it cannot contact the Master – but can see some of the other Slaves, the Master and Slave can conclude that each other is still alive from the datastore heartbeats as above.
    • If the Master is down – i.e. the Slaves think they are partitioned because actually the Master is down – they can now elect a new Master since there are no datastore heartbeats from the Master.
    • If the Slave is down – i.e. the Master is not getting any datastore heartbeats from the Slave – then it restarts the Protected VMs on other hosts. (If the Slave were actually up but had lost network access to the datastore and so cannot update heartbeats, it is as good as down because the VMs have probably crashed by now).
  • If a Slave is network isolated – i.e. it cannot contact the Master or any other Slave (nor can it ping the isolation addresses) – then the Slave adds a special bit in the host-X-poweron file above. This tells the Master that the Slave is network isolated.
    • The Master then locks the file called protectedlist. This is a list of all Protected VMs. Once the Master has locked this file, the Slave knows the Master has taken responsibility for the Protected VMs and the Slave can leave these powered on, shut down, or power off (depending on which of these is selected as the host isolation response when setting up HA).
    • The protectedlist file thus ensures that unless another host has taken over these VMs the current host will not shut down/ power off these.

Two advanced options to keep in mind:

  • I mentioned this earlier: das.isolationAddress[0-9] allow one to specify up to 10 isolation IP addresses to check before a host considers itself isolated.
  • And das.allowNetwork[0-9] allow one to specify up to 10 port groups to use for HA. See this KB article for examples.
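
If I’m not mistaken, both can be set from PowerCLI as HA advanced settings – roughly like this (the cluster name, address and port group name are made up):

    $cluster = Get-Cluster -Name "Production"

    # An extra isolation address for hosts to ping before declaring themselves isolated
    New-AdvancedSetting -Entity $cluster -Type ClusterHA -Name "das.isolationaddress1" -Value "192.168.10.1"

    # Restrict HA to a particular port group
    New-AdvancedSetting -Entity $cluster -Type ClusterHA -Name "das.allowNetwork0" -Value "Management Network"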

Lastly, I haven’t read it fully but this HA Deepdive is a great resource.