Notes on vSphere High Availability (HA)

Just some notes on vSphere HA as I read along. Nothing new here …

Starting with vSphere 5.0, HA has a Master/ Slave model. One ESXi host is elected as a Master, the rest are Slaves. The Master is the one with the most datastores connected to it; if all ESXi hosts have the same number of datastores connected to them, the Master is the one with the largest Managed Object ID (MOID). Note that the MOID is interpreted lexically – so an MOID of 99 is larger than an MOID of 100. PowerCLI can be used to view the MOIDs:
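Something along these lines does the trick (a rough sketch; “Cluster01” is a placeholder for your cluster name):

    # List the hosts in a cluster along with their MOIDs (the Id property)
    Get-Cluster "Cluster01" | Get-VMHost | Select-Object Name, Id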

Also, the MOID is a vCenter specific construct. Whenever a host, VM, datastore, etc is added to vCenter it is assigned an MOID. For instance here are the MOIDs of my datastores:
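The equivalent one-liner (again just a sketch) would be:

    # List datastores and their MOIDs
    Get-Datastore | Select-Object Name, Id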

Although I haven’t used this, it’s also possible to find MOIDs via the vSphere Managed Object Browser. See this KB article for more info.

Back to the topic – the above is how a Master is elected. There’s only one Master per cluster. When it comes to HA, the Fault Domain Manager (FDM) on this Master is responsible for most of the tasks (which is why HA can continue working even if vCenter is down for a while). vCenter and the Master communicate with each other to keep abreast of the cluster situation.

  • FDM is installed at /opt/vmware/fdm/fdm/
  • FDM config files are at /etc/opt/vmware/fdm/

The Master monitors the Slave hosts, and if a Slave goes down/ is unreachable the Master is responsible for restarting its Protected VMs elsewhere. The Master is also responsible for keeping the Slaves abreast of the cluster configuration.

Slaves are limited to monitoring the VMs running on them. Slaves monitor VM health, and if a Protected VM powers down they inform the Master so it can be restarted. (Note on Protected VMs: once you enable VM monitoring on a cluster or set a VM as Protected, the VM must be powered off and powered on again to become protected.) Slaves also keep in touch with each other, and if they find the Master is down they conduct an election to select a new Master.

The only time vCenter communicates with Slaves is when a new Master needs to be elected or when the Master reports a Slave as missing and so vCenter tries to contact it.

Slaves send network heartbeats to the Master every second. When a Master stops receiving heartbeats from a Slave it knows it is offline or partitioned/ isolated. Similarly when a Slave stops receiving heartbeats from a Master it knows the Master is offline or partitioned/ isolated.

  • If a Slave is cut off from all other hosts (Master and Slaves) it is considered isolated (caveat: you can also specify up to 10 isolation IP addresses to ping – if these are reachable but the Master and Slaves are not, the Slave does not consider itself isolated, only partitioned).
  • If a Slave is cut off from the Master but still has contact with some of the other Slaves, it is considered partitioned.

In the past if a Slave were isolated/ partitioned the Master would consider it offline and restart its Protected VMs elsewhere. Starting with vSphere 5.0 the Master also sends a ping (ICMP packet) to the Slave to see if it responds, and uses datastore heartbeats to verify the Slave really is down. It could be that only the Management network is down while the VM and storage networks are up, so the VMs are still functioning as expected.

Datastore heartbeats work thus (and remember they are only used in case of isolation/ partition scenarios):

  • When enabling HA for a cluster, a datastore is automatically selected (or can be selected manually by the user) to be used for datastore heartbeats.
  • On this datastore a folder called .vSphere-HA is created within which a sub-folder of name FDM-<Fault Domain ID>-<vCenter Server Name> is created. (Such a name allows the same datastore to be used by multiple clusters).
  • Each host creates a file with its MOID name in this sub-folder. Like thus:
    heartbeats
  • Notice the host-X-hb file above? That is created by each host (you can check the /var/log/fdm.log file on each host to see it creating this file). When a Slave does not get heartbeats from a Master it updates its file above (and also checks the timestamp of the Master’s file – if that has updates it means the Master is alive). Similarly, when a Master does not hear from a Slave it checks the Slave’s file above to see if there are updates. This is how datastore heartbeats work.
  • If a Slave is network partitioned – i.e. it cannot contact the Master but can see some of the other Slaves – the Master and the Slave can each conclude from the datastore heartbeats, as above, that the other is still alive.
    • If the Master is down – i.e. the Slaves think they are partitioned because actually the Master is down – they can now elect a new Master since there are no datastore heartbeats from the Master.
    • If the Slave is down – i.e. the Master is not getting any datastore heartbeats from the Slave – then it restarts the Protected VMs on other hosts. (If the Slave were actually up but had lost network access to the datastore and so cannot update heartbeats, it is as good as down because the VMs have probably crashed by now).
  • If a Slave is network isolated – i.e. it cannot contact the Master or any other Slave (nor can it ping the isolation addresses) – then the Slave adds a special bit in the host-X-poweron file above. This tells the Master that the Slave is network isolated.
    • The Master then locks the file called protectedlist. This is a list of all Protected VMs. Once the Master has locked this file, the Slave knows the Master has taken responsibility for the Protected VMs and the Slave can leave these powered on, shut down, or power off (depending on which of these is selected as the host isolation response when setting up HA).
    • The protectedlist file thus ensures that unless another host has taken over these VMs the current host will not shut down/ power off these.

Two advanced options to keep in mind:

  • I mentioned this earlier: das.isolationAddress[0-9] allow one to specify up to 10 isolation IP addresses to check before a host considers itself isolated.
  • And das.allowNetwork[0-9] allow one to specify up to 10 port groups to use for HA. See this KB article for examples. (A PowerCLI sketch for setting these follows below.)
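Both can be set from PowerCLI – a rough sketch using the New-AdvancedSetting cmdlet (the cluster name and addresses are placeholders):

    # Add two isolation addresses to the cluster's HA advanced settings
    New-AdvancedSetting -Entity (Get-Cluster "Cluster01") -Type ClusterHA -Name "das.isolationaddress0" -Value "192.168.1.1"
    New-AdvancedSetting -Entity (Get-Cluster "Cluster01") -Type ClusterHA -Name "das.isolationaddress1" -Value "192.168.1.2"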

Lastly, I haven’t read it fully but this HA Deepdive is a great resource.

Creating a Server 2012 Failover Cluster for iSCSI target

This post is about setting up a Server 2012 R2 failover cluster that acts as an iSCSI target server.

I have four servers: WIN-DATA01, WIN-DATA02, WIN-DATA03, WIN-DATA04. I will be putting WIN-DATA03 & WIN-DATA04 in the cluster. As you know, clusters require shared storage, so that’s what WIN-DATA01 and WIN-DATA02 are for. In a real world setup WIN-DATA01 & WIN-DATA02 are your SAN boxes whose storage you want to make available to clients. Yes, you could have clients access the two SAN boxes directly, but by having a cluster in between you can provide failover. Plus, Windows now has a cool thing called Storage Pools which lets you do software RAID sort of stuff.

Prepare the iSCSI target server

First step, prepare the iSCSI target servers that will provide storage for the cluster. The steps for this are in my previous post so very briefly here’s what I did:
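In essence it boils down to something like this (a sketch – the path, size, and target name are placeholders; assumes the iSCSI Target Server role is already installed):

    # Create a virtual disk to serve out, create a target restricted to the two cluster nodes,
    # and map the disk to the target
    New-IscsiVirtualDisk -Path C:\iSCSI\ClusterDisk1.vhdx -SizeBytes 10GB
    New-IscsiServerTarget -TargetName "win-cluster" -InitiatorIds "DNSName:WIN-DATA03.rakhesh.local","DNSName:WIN-DATA04.rakhesh.local"
    Add-IscsiVirtualDiskTargetMapping -TargetName "win-cluster" -Path C:\iSCSI\ClusterDisk1.vhdx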

Repeat the above for the other server (you can login to that server and issue the commands, or do it remotely like I did below). Everything’s the same as above except for the addition of the -ComputerName switch.
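For instance (same placeholders as before, just pointed at the second server):

    # Same commands, run remotely against WIN-DATA02 via the -ComputerName switch
    New-IscsiVirtualDisk -Path C:\iSCSI\ClusterDisk2.vhdx -SizeBytes 10GB -ComputerName WIN-DATA02
    New-IscsiServerTarget -TargetName "win-cluster" -InitiatorIds "DNSName:WIN-DATA03.rakhesh.local","DNSName:WIN-DATA04.rakhesh.local" -ComputerName WIN-DATA02
    Add-IscsiVirtualDiskTargetMapping -TargetName "win-cluster" -Path C:\iSCSI\ClusterDisk2.vhdx -ComputerName WIN-DATA02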

Excellent!

Now let’s move on to the two servers that will form the cluster.

Add shared storage to the servers

Here we add the two iSCSI targets created above to the two servers WIN-DATA03 & WIN-DATA04.

Login to one of the servers, open Server Manager > Tools > iSCSI Initiator.

Targets

Easiest option is to enter WIN-DATA01 in the Target field and click Quick Connect. That should list the targets on this server in the Discovered targets box. Select the ones you want and click Connect.

Another option is to go to the Discovery tab.

Discovery tab

Click Discover Portal, enter the two server names (WIN-DATA01, WIN-DATA02), refresh (if needed), and then the first tab will automatically show all targets on these two servers. Select the ones you want and click Connect as before.

The GUI sets these connections as persistent (it calls them “Favorite Target”) so they are always reconnected when the server reboots.

You can also use PowerShell to add these connections, though that isn’t as easy as the point and click above. Instructions are in my earlier post so here they are briefly:

Unlike the GUI, PowerShell does not mark these targets as persistent so I have to specify that explicitly when connecting.
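Roughly along these lines (a sketch; the portals are the two target servers from before):

    # Register both target servers as portals, then connect to every discovered target persistently
    New-IscsiTargetPortal -TargetPortalAddress WIN-DATA01
    New-IscsiTargetPortal -TargetPortalAddress WIN-DATA02
    Get-IscsiTarget | Connect-IscsiTarget -IsPersistent $true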

Prepare the shared storage

The shared storage we added is offline and needs to be initialized. You can do this via the Disk Management UI or using PowerShell as below. The initializing bits need to be done on any one server only (WIN-DATA03 or WIN-DATA04), but you have to bring the disks online on both servers.
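A sketch of the PowerShell route (disk numbers will vary, so check with Get-Disk first):

    # Bring the new disks online (on both nodes), then initialize the raw ones (on one node only)
    Get-Disk | Where-Object IsOffline | Set-Disk -IsOffline $false
    Get-Disk | Where-Object PartitionStyle -eq 'RAW' | Initialize-Disk -PartitionStyle GPT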

Create the cluster

Login to one of the servers that will form the cluster, open Server Manager, go to Tools > Failover Cluster Manager and click Create Cluster on the right side (the Actions pane). This launches the Create Cluster Wizard.

Click Next on the first screen, enter the server names on the second …

Adding Servers to Cluster

… do or don’t do the validation tests (I skipped as this is a lab setup for me), give a name for the cluster and an IP address, confirm everything (I chose to not add all eligible storage just so I can do that separately), and that’s it.

Couple of things to note here:

  • If your cluster servers don’t have any interfaces with DHCP configured, you will also be prompted for an IP address. Otherwise a DHCP address is automatically assigned. (In my case I had an interface with DHCP).
  • A computer object with the cluster server name you specify is created in the same OU as the servers that make up the cluster. You can specify a different OU by giving the full name as the cluster name – so in the example above I would use “CN=WIN-CLUSTER01,OU=Clusters,OU=Server,DC=rakhesh,DC=local” to create the object in the Clusters OU within the Servers OU. It would have been good if the wizard mentioned that.
  • Later, when you add roles to this cluster server, it creates more virtual servers automatically. These are placed in the same OU where the cluster server object is so you must give this object rights to add/ remove computers in that OU. So it’s best you have a separate OU which you can delegate rights to. I used the Delegation Control Wizard to give the WIN-CLUSTER01 object full control over Computer Objects in the rakhesh.local/Servers/Clusters OU.

Instead of the Create Cluster Wizard one can use PowerShell as below:
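Something like this (a sketch – the IP address is a placeholder, and -NoStorage mirrors my choice of not adding the eligible disks yet):

    # Create the cluster from the two nodes with a static address, without adding storage
    New-Cluster -Name WIN-CLUSTER01 -Node WIN-DATA03,WIN-DATA04 -StaticAddress 192.168.1.10 -NoStorage
    # To drop the computer object into a specific OU, pass the distinguished name as the cluster name:
    # New-Cluster -Name "CN=WIN-CLUSTER01,OU=Clusters,OU=Server,DC=rakhesh,DC=local" -Node WIN-DATA03,WIN-DATA04 -NoStorage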

Configuring the cluster

A cluster has many resources assigned to it – disks, networks, its name, and so on. These can be seen in the summary page of the cluster or via PowerShell.

Cluster Resources
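For the PowerShell route a plain Get-ClusterResource does the job:

    # Lists every cluster resource (disks, IP addresses, the cluster name, etc.) and its state
    Get-ClusterResource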

Network

I am interested in changing the IP address of my cluster. Currently it’s taken an IP from the DHCP pool and I don’t want that. Being a lab setup both my servers had a NAT interface to connect to the outside world and so the cluster is currently picking up an IP from that. I want it to use the internal network instead.

If I right click on the IP Address resource I can change it.

IP Address

In my case I don’t want to use this network itself so I have to go to the Networks section in the UI …

Networks

… where I can see there are two networks specified, with one of them having no cluster use while the other is configured for cluster & client use, so I right click on the first network and …

Network Settings

… enable it for cluster access, then I right click on the second network and …

Network Settings

… rename it (for my reference) as well as disable it from the cluster.

Now if I go to the resources section and right click the IP address, I can select the second network and assign a static IP address.

Here’s how to do the above via PowerShell.

To change the network name use the Get-ClusterNetwork cmdlet to select the network you want (the result is an object) and directly assign the new name as a value:
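For example (the network names here are placeholders for my setup):

    # Rename the second cluster network by assigning to its Name property
    (Get-ClusterNetwork "Cluster Network 2").Name = "NAT"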

One would expect to be able to set IP addresses too via this, but unfortunately these are read-only properties (notice it only has the get method; properties which you can modify will also have the set method):

To set the IP address via PowerShell we have to remember that we went to the Cluster Resource view to change it. So we do the same here. Use the Get-ClusterResource cmdlet.

That doesn’t totally help. What we need are the parameters of the resource, so we pipe that to the Get-ClusterParameter cmdlet.

Perfect! And to modify these parameters use the Set-ClusterParameter cmdlet.
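Putting it together, a sketch (the values are placeholders; “Cluster IP Address” is the default name of the resource):

    # View the parameters of the IP address resource
    Get-ClusterResource "Cluster IP Address" | Get-ClusterParameter

    # Point it at the other network, give it a static address, and turn off DHCP
    Get-ClusterResource "Cluster IP Address" | Set-ClusterParameter -Multiple @{
        "Network"    = "Cluster Network 1"
        "Address"    = "192.168.1.10"
        "SubnetMask" = "255.255.255.0"
        "EnableDhcp" = 0
    }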

To take the Cluster IP Address resource offline & online use the Stop-ClusterResource and Start-ClusterResource cmdlets:

Although it doesn’t say so, you have to also stop and start the Cluster Name resource for the name to pick up the new IP address.
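For example:

    # Cycle the IP address resource, then the name resource, so the new address takes effect
    Stop-ClusterResource "Cluster IP Address"
    Start-ClusterResource "Cluster IP Address"
    Stop-ClusterResource "Cluster Name"
    Start-ClusterResource "Cluster Name"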

Lastly, to change whether a particular network is used for cluster communication or not, one can use the same technique that was used to change the network name – just use the Role property. I am not sure what all the possible values are, but from my two networks I can see that a value of 0 means the network is not used for cluster communication, while a value of 3 means it is used for cluster communication and client traffic.
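A sketch (network names as before; 1 appears to mean cluster-only traffic, for completeness):

    # 3 = cluster and client traffic, 0 = not used for cluster communication
    (Get-ClusterNetwork "Cluster Network 1").Role = 3
    (Get-ClusterNetwork "NAT").Role = 0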

The GUI is way easier than PowerShell for configuring this network stuff!

Storage (Disks)

Next I want to add disks to my cluster. Usually all available disks get added by default, but in this case I want to do it manually.

Right click Failover Cluster Manager > Storage > Disk and select Add Disk. This brings up a window with all the available disks. Since this is a cluster not every disk present on the system can be used. The disk must be something visible to all members of the cluster as it is shared by them.

Add Disk

Via PowerShell:

To add these disks to the cluster pipe the output to the Add-ClusterDisk cmdlet. There’s no way to select specific disks so you must either pipe the output to a Where-Object cmdlet first to filter the disks you want, or use the Get-Disk cmdlet (as it lets you specify a disk number) and pipe that to the Add-ClusterDisk cmdlet.
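A sketch of both approaches (assuming the listing above came from Get-ClusterAvailableDisk; the disk number is a placeholder):

    # Add every disk the cluster considers available
    Get-ClusterAvailableDisk | Add-ClusterDisk

    # Or add a specific disk by number
    Get-Disk -Number 4 | Add-ClusterDisk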

(I won’t be adding these disks to my cluster yet as I want to put them in a storage pool. I’ll be doing that in a bit).

Quorum

Quorum is a very important concept for clusters (see an earlier post of mine for more about quorum).

The cluster which I created is currently in the Node Majority mode (because I haven’t added any Disk or File Share witness to it). That’s not a good mode to be in, so let’s change that.

Go to the Configure Cluster Quorum Settings as in the screenshot below, click Next …

Quorum

… choose the second option (Select the quorum witness) and click Next …

Witness

… in my case I want to use a File Share witness so I choose that option and click Next …

File Witness

… create a file share someplace, point to that, and click Next …

File Share

… click Next and then Finish.

Now it’s configured.
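The PowerShell equivalent is a one-liner (a sketch; the share path is a placeholder):

    # Switch the quorum to Node and File Share Majority using the given share as the witness
    Set-ClusterQuorum -NodeAndFileShareMajority "\\SOMESERVER\ClusterWitness"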

Storage (Pool)

(If you plan to create a storage pool, add an extra shared disk to your cluster. I created a new target on the WIN-DATA02 & WIN-DATA01 servers (no need to do it on both, but I did so it’s consistent), mapped another virtual disk to it, and used the iSCSI initiator on WIN-DATA03 & WIN-DATA04 to map it. Storage Pools on Failover Clusters require a minimum of three disks.)

I want to use Storage Spaces (they are called Storage Pools in Failover Cluster Manager). Storage Spaces is a new feature in Server 2012 (and Windows 8) and it lets you combine disks and create virtual disks on them that are striped, mirrored, or have parity (think of it as software RAID).

We can create a new Storage Pool by right clicking on Failover Cluster Manager > Storage > Pools and select New Storage Pool.

Pool

Click Next, give the Pool a name …

Pool Name

… select the disks that will make up the pool (notice that the disks shown below are the iSCSI disks that are visible to both nodes; neither disk is actually on the local computer – they are both from a SAN box someplace, in this case the WIN-DATA01 & WIN-DATA02 servers where we created these earlier) …

Pool Disks

… and click Create.

Right click on the pool that was created and make virtual disks on that. These virtual disks are software RAID equivalents. (The pool is a placeholder for all your disks. The virtual disks are the logical entities you create out of this pool). Here’s the confirmation page of the pool I created:

Pool Virtual Disk

Once the disk is created, be sure not to uncheck the “Create a volume when this wizard closes” check box. If you did uncheck it, you’ll have to go to Server Manager > File and Storage Services > Volumes, click on TASKS, and select New Volume. The virtual disk is just a disk; what we have to do now is create volumes on it.

Below is a screenshot of the volume I created. I chose ReFS for no particular reason and assigned the full space to the volume. The screenshot also shows that I didn’t assign a drive letter – that’s incorrect; I did, I just forgot to do so before taking this screenshot. (Note that I assigned the drive letter R. This will be used later.)

Volume

Via PowerShell:
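Roughly like this (a sketch – the pool and disk names are placeholders, and the friendly name of the clustered storage subsystem can vary between versions, so check Get-StorageSubSystem first):

    # Pool all the poolable disks, then carve a mirrored virtual disk out of the pool
    $disks = Get-PhysicalDisk -CanPool $true
    New-StoragePool -FriendlyName "Pool01" -StorageSubSystemFriendlyName "Clustered*" -PhysicalDisks $disks
    New-VirtualDisk -StoragePoolFriendlyName "Pool01" -FriendlyName "Mirror01" -ResiliencySettingName Mirror -UseMaximumSize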

As with the GUI, once the disk is created a cmdlet has to be run to create a volume on the disk. It’s probably the New-Volume cmdlet, but it’s throwing errors in my case and I am too tired to investigate further so I’ll skip it for now.

Add Roles

Finally let’s add the iSCSI Target Server role.

Right click Failover Cluster Manager > Roles and select Configure Roles.

Click Next, select iSCSI Target Server, and click Next. This step assumes the iSCSI Target Server role is installed on both nodes of the cluster. If not, install it via Server Manager or PowerShell.
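The PowerShell way would be something like this (a sketch; assumes remoting to both nodes works):

    # Install the iSCSI Target Server role on both cluster nodes
    Invoke-Command -ComputerName WIN-DATA03,WIN-DATA04 -ScriptBlock { Install-WindowsFeature FS-iSCSITarget-Server }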

Give the role (the virtual server that hosts the role) a name and IP address …

ISCSI Target Server

… select storage (the previously created volume; if you missed out on creating the volume go back and do it now), click Next, Next, and that’s it.

At this point it’s worth taking a step back to understand what we have done. What we did just now is create a virtual server (a role in the cluster, actually) and assign it some storage space. One might expect this storage to be the one that’s presented by the iSCSI server as a target, but no, that’s not the case. Think of this server and its storage as similar to the WIN-DATA01 and WIN-DATA02 servers that we dealt with initially. What did we do to set those up as iSCSI targets? We created a target, created virtual iSCSI disks, and assigned mappings to them. That’s exactly what we have to do here too!

Ideally one should be able to use the GUI and do this, but Server Manager seems to have trouble communicating with the newly created WIN-DATA server. So I’ll use PowerShell instead.
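The commands are the same as before, just pointed at the clustered iSCSI target server via -ComputerName (a sketch – the target and disk names are placeholders; R: is the volume assigned to the role earlier):

    # Create a virtual disk on the clustered volume, a target that allows WIN-DATA01 & WIN-DATA02, and map them
    New-IscsiVirtualDisk -Path R:\iSCSI\ClusteredDisk1.vhdx -SizeBytes 5GB -ComputerName WIN-DATA
    New-IscsiServerTarget -TargetName "clustered-target" -InitiatorIds "DNSName:WIN-DATA01.rakhesh.local","DNSName:WIN-DATA02.rakhesh.local" -ComputerName WIN-DATA
    Add-IscsiVirtualDiskTargetMapping -TargetName "clustered-target" -Path R:\iSCSI\ClusteredDisk1.vhdx -ComputerName WIN-DATA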

And that’s it! Now I can add this target back to the WIN-DATA01 & WIN-DATA02 servers if I want (as those are the Initiator IDs I specified) and whatever I write will be written back to their disks via this clustered iSCSI target. I am not sure if the mirroring will happen across both servers though, but this is all just for fun anyway and not a real life scenario.

Lastly …

Before I conclude here’s something worth checking out.

The WIN-DATA virtual server is currently on the WIN-DATA04 node. This means WIN-DATA04 is the node currently providing this role; WIN-DATA03 is on standby.

If I login to WIN-DATA04 and check its disks I will see the mirrored volume I created:

If I do the same on WIN-DATA03 I won’t see the volume.
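A quick way to see this is to run Get-Volume on each node:

    # The R: volume from the clustered disk shows up only on the node that currently owns the role
    Get-Volume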

If I right click on the role in Failover Cluster Manager, select Move > Select Node, and select the new node as WIN-DATA03, then this becomes the new active node. Now if I check the volumes and disks of both servers the information will be the other way around. WIN-DATA03 will have the volume and disk, WIN-DATA04 won’t have anything!

This is how clustering ensures both servers don’t write to the shared storage simultaneously. Remember, the shared storage is just a block device. It doesn’t have any file locking nor is it aware of who is writing to it. So it’s up to the cluster to take care of all this.

Also …

Be sure to add WIN-CLUSTER01 and WIN-DATA to DNS with the correct IPs. If you don’t do that Server Manager and other tools won’t be able to resolve the name.

There’s more …

This post is just the tip of the iceberg. There are so many more cool things you can do with iSCSI, Clustering, and Storage Spaces in Server 2012, so be sure to check these out elsewhere!

Clusters & Quorum

I spent yesterday refreshing my knowledge of clusters and quorum. Hadn’t worked on these since Windows Server 2003! So here’s a brief intro to this topic:

There are two types of clusters – Network Load Balancing (NLB) and Server Clusters.

Network Load Balancing is “share all” in that every server has a full copy of the app (and the data too if it can be done). Each server is active, the requests are sent to each of them randomly (or using some load distributing algorithm). You can easily add/ remove servers. Examples where you’d use NLB are SMTP servers, Web servers, etc. Each server in this case is independent of the other as long as they are configured identically.

Server Clusters are “share nothing” in that only one server has a full copy of the app and is active. The other servers are in standby mode, waiting to take over if the active one fails. Shared storage is used, which is how standby servers can take over if the active server fails.

The way clusters work is that clients see one “virtual” server (not to be confused with virtual servers of virtualization). Behind this virtual server are the physical servers (called “cluster nodes” actually) that make up the cluster. As far as clients are concerned there’s one end point – an IP address or MAC address – and what happens behind that is unknown to them. This virtual server is “created” when the cluster forms; it doesn’t exist before that. (It is important to remember this because even if the servers are in a cluster the virtual server may not be created – as we shall see later.)

In the case of server clusters something called “quorum” comes into play.

Imagine you have 5 servers in a cluster. Say server 4 is the active server and it goes offline. Immediately, the other servers detect this and one of them becomes the new active server. But what if server 4 isn’t offline, just disconnected from the rest of the group? Now we’ll have server 4 continuing to be active, but the other servers can’t see this, and so one of them too becomes the active server – resulting in two active servers! To prevent such scenarios you have the concept of quorum – a term you might have heard of in other contexts, such as decision making groups. A quorum is the minimum number of people required to make a decision. Say a group of 10 people is deciding something; one could stipulate that at least 6 members must be present during the decision making process, else the group is not allowed to decide. A similar concept applies in the case of clusters.

In its simplest form you designate one resource (a server or a disk) as the quorum and whichever cluster contains that resource sets itself as the active cluster while the other clusters deactivate themselves. This resource also holds the cluster database, which is a database containing the state of the cluster and its nodes, and is accessed by all the nodes. In the example above, initially all 5 servers are connected and can see the quorum, so the cluster is active and one of the servers in it is active. When the split happens and server 4 (the currently active server) is separated, it can no longer see the quorum and so disables the cluster (which is just itself really) and stops being the active server. The other 4 servers can still see the quorum, so they continue being active and set a new server as the active one.

In this simple form the quorum is really like a token you hold. If your cluster holds the token it’s active; if it does not you disband (deactivate the cluster).

This simple form of quorum is usually fine, but does not scale to when you have clusters across sites. Moreover, the quorum resource is a single point of failure. Say that resource is the one that’s disconnected or offline – now no one has the quorum, and worse, the cluster database is lost. For this reason the simple form of quorum is not used nowadays. (This mode of quorum is called “Disk Only” by the way).

There are three alternatives to the Disk Only mode of quorum.

One mode is called “Node Majority” and as the name suggests it is based on majority. Here each node has a copy of the cluster database (so there’s no single point of failure) and whichever cluster has more than half the nodes wins. So in the previous example, say the cluster of 5 servers splits into clusters of 3 and 2 servers – since the first cluster has more than half of the nodes, it wins. (In practice a voting takes place to decide this. Each node has a vote. So cluster one has 1+1+1 = 3 votes; cluster two has 1+1 = 2 votes. Cluster one wins.)

Quorums based on majority have a disadvantage in that if the number of nodes is even you can have a tie. That is, if the above example were a 6 node cluster and it split into two clusters of 3 nodes each, both clusters would deactivate as neither has more than half the nodes (i.e. neither has quorum). You need a tie breaker!

It is worth noting that the effect of quorum extends to the number of servers that can fail in a cluster. Say we have a 3 node cluster. For the cluster to have quorum, at least 2 servers must be present. If one server fails, the cluster still has 2 servers so it will function as usual. But if one more server fails – or these two servers are disconnected from each other – the remaining cluster does not have quorum (there’s only 1 node, which is not more than half of the 3 nodes) and so the cluster stops and the server deactivates. This means even though we have one server still running, and intuitively one would expect that server to continue servicing requests (as it would have in the case of an NLB cluster), it does not do so in the case of a server cluster due to quorum! This is important to remember.

Another mode is called “Node & Disk Majority” and as the name suggests it is a mix of the “Node Majority” and “Disk Only” modes. This mode is for clusters with an even number of nodes (where, as we know, “Node Majority” fails) and the way it works is that the cluster which holds the resource (a disk, usually called a “disk witness”) designated as quorum gets an extra vote, and the cluster with more than half the votes is the active one. The disk witness essentially acts as the tie breaker. (So a cluster with 6 nodes will still require more than 3 votes to consider itself active; if it splits into two clusters of 3 nodes each, when it comes to voting the cluster holding the disk witness will have (3+1=) 4 votes and hence win quorum.)

In “Node & Disk Majority” mode, unlike the “Disk Only” mode, the cluster database is present on all the nodes, so it is not a single point of failure either.

The last mode is called “Node & File Share Majority” and this is a variant of the “Node Majority” mode. This mode too is for clusters with an even number of nodes, and it works similarly to “Node & Disk Majority” except that instead of a disk resource a file share is used. A file share (called a “file witness” in this case) is selected on any server – not necessarily a part of the cluster – and one node in the cluster locks a file on this share, effectively telling the others that it “owns” the file share. So instead of using a disk resource as a tie breaker, ownership of the file share is used as the tie breaker. (As in the “Node & Disk Majority” mode, the node that owns the file share gets an additional vote.) Using the previous examples, if a cluster with 6 nodes splits into 3 nodes each, whichever cluster has the node owning the file share will win quorum while the other cluster will deactivate. If the cluster with 6 nodes splits into clusters of 4 and 2 nodes, and say the 2 node cluster owns the file share, it will still lose as more than half the nodes are in the 4 node cluster (in terms of votes the winning cluster has 4 votes, the losing cluster has 3 votes). When the cluster deactivates, the current owner will release the lock and a node in the new cluster will take ownership.

An advantage of the “Node & File Share Majority” mode is that the file share can be anywhere – even on another cluster (preferably). The file share can also be on a node in the cluster, but that’s not preferred as if the node fails you lose two votes (that of being the file share owner as well as the node itself).

Here are some good links that contain more information (Windows specific):

At work we have two HP LeftHand boxes in a SAN cluster. The only quorum mode used by this is the “Node Majority” one, which as we know fails for an even number of nodes, and so HP supplies a virtual appliance called “Failover Manager” that is installed on the ESX hosts and is used as the tie breaker. If the Failover Manager is powered off and both LeftHands happen to come online at the same time, a cluster is not formed as neither has quorum. To avoid such situations the Failover Manager has to be present when they power on, or we have to time the powering on such that one of the LeftHands is online before the other.