
vSphere client does not depend on the Inventory Service

On numerous occasions I have noticed that my vSphere client always has the correct inventory of objects in vCenter whereas the vSphere web client tends to lag behind. While reading Mastering VMware vSphere 5.5 (a great book if you want to really understand how all this works!) I learnt that that’s because the web client depends on the vCenter Inventory Service as a cache between itself and vCenter, whereas the regular client talks to vCenter directly.

The vCenter Inventory Service does two things – one, it caches inventory objects from vCenter so that each time the web client needs something it doesn’t have to ask vCenter (thus reducing load on vCenter); two, it allows for tags. Having the Inventory Service allows for more web client sessions with less load on the vCenter server.

This also means it’s a good idea to place the Inventory Service with the web client, not with the vCenter server.

vCenter and vSphere editions (5.5)

vCenter editions. Just three.

  • Essentials
  • Foundation
  • Standard

Standard is what you usually want. No limits or restrictions.

Essentials is only available when purchased as part of vSphere Essentials or vSphere Essentials Plus kits. Not sold separately. These kits are targeted for SMBs. Limited to 3 hosts of 2 CPUs each. Self-contained – cannot be used with other editions.

Foundation is also for 3 hosts only.

All editions of vCenter include the Management service, SSO, Inventory service, Orchestrator, Web client – everything. There’s no difference in the components included in each edition.

vSphere is the suite. There are three plus two editions of the vSphere suite.

Two editions are the kits:

  • Essentials
  • Essentials Plus

Three editions are bundled with vCenter Operations Manager:

  • Standard
  • Enterprise
  • Enterprise Plus

The Essentials & Essentials Plus editions only work with vCenter Essentials. The Standard, Enterprise, and Enterprise Plus work with vCenter Foundation or Standard.

Essentials is pretty basic. Remember it is for 3 hosts of 2 CPUs each. Standalone. In addition you don’t get features like vMotion either. All you get is (1) Thin Provisioning, (2) Update Manager, and (3) vStorage APIs for Data Protection (VADP). Note the latter is only the APIs. It is not the VMware solution vSphere Data Protection (VDP). Also, no VSAN.

Essentials Plus is a bit more than basic. Once again, only for 3 hosts of 2 CPUs each. Standalone. However, in addition to the three features above you also get (4) vSphere Data Protection, (5) High Availability (HA), (6) vMotion, and (7) vSphere Replication. So you get some useful features. In fact, if I had just 3 hosts and was unlikely to expand further, this is the option I would go for – for me vMotion is very useful and so is HA. Sadly, no Distributed Resource Scheduling (DRS). But you do get VSAN.

Moving on to the big boys …

Standard gives you all the above plus useful features like (8) Storage vMotion, (9) Fault Tolerance, and some more (Hot Add & vShield Endpoint). Still no DRS.

Enterprise gives you all the above plus (10) Storage APIs for Array Integration (nice! but useful only in an Enterprise context where you are likely to have a SAN array and need something like this), (11) DRS, (12) DPM, and (13) Storage APIs for Multi-pathing. As expected, features that are more useful when you have a lot of hosts and are in an Enterprise-y setup. Except DRS :) which would have been nice to have in Standard/ Essentials Plus too.

Finally, Enterprise Plus. All the above plus (14) Distributed Switches, (15) Host Profiles, (16) Auto Deploy, (17) Storage DRS – four of my favorite features – and a bunch of others like App HA, Storage IO Control, Network IO Control, etc.

vCenter – Cannot load the users for the selected domain

I spent the better part of today evening trying to sort this issue. But didn’t get anywhere. I don’t want to forget the stuff I learnt while troubleshooting so here’s a blog post.

Today evening I added one of my ESXi hosts to my domain. The other two wouldn’t add, until I discovered that the time on those two hosts was out of sync. I spent some time trying to troubleshoot that but didn’t get anywhere. The NTP client on these hosts was running, the ports were open, the DC (which was also the forest PDC and hence the time keeper) was reachable – but time was still out of sync.

Found an informative VMware KB article. The ntpq command (short for “NTP query”) can be used to see the status of the NTP daemon on the client side. Like thus:
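(The DC name below is a placeholder and the numbers are illustrative, not my actual output.)

```
~ # ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 dc01.mydomain.l  .INIT.          16 u    -   64    0    0.000    0.000   0.000
```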

The command has an interactive mode (which you get into if run without any switches; read the manpage for more info). The -p switch tells ntpq to output a list of peers and their state. The KB article above suggests running this command every 2 seconds using the watch command but you don’t really need to do that.

Important points about the output of this command:

  • If it says “No association ID's returned” it means the ESXi host cannot reach the NTP server. Considering I didn’t get that, it means I have no connectivity issue.
  • If it says “***Request timed out” it means the response from the NTP server didn’t get through. That’s not my problem either.
  • If there’s an asterisk before the remote server name it means there is a huge gap between the time on the host and the time given by the NTP server. Because of the huge gap NTP is not changing the time (to avoid any issues caused by a sudden jump in the OS time). Manually restarting the NTP daemon (/etc/init.d/ntpd restart) should sort it out.
    • The output above doesn’t show it but one of my problem hosts had an asterisk. Restarting the daemon didn’t help.

The refid field shows the time stream to which the client is syncing. For instance here’s the w32tm output from my domain:
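(This is w32tm /monitor; the output below is trimmed and the DC names and addresses are placeholders rather than my actual ones.)

```
C:\> w32tm /monitor
dc01.mydomain.local *** PDC *** [10.50.0.10:123]:
    ...
    RefID: 'LOCL'
    Stratum: 1
dc02.mydomain.local [10.50.0.11:123]:
    ...
    RefID: dc01.mydomain.local [10.50.0.10]
    Stratum: 2
```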

Notice the PDC has a refid of LOCL (indicating it is its own time source) while the rest have a refid of the PDC name. My ESXi host has a refid of .INIT. which means it has not received any response from the NTP server (shouldn’t the error message have been something else!?). So that’s the problem in my case.

Obviously the PDC is working because all my Windows machines are keeping correct time from it. So is vCenter. But some of my ESXi hosts aren’t.

I have no idea what’s wrong. After some troubleshooting I left it because that’s when I discovered my domain had some inconsistencies. Fixing those took a while, after which I hit upon a new problem – vCenter clients wouldn’t show me vCenter or any hosts when I login with my domain accounts. Everything appears as expected under the administrator@vsphere.local account but the domain accounts return a blank.

While double-checking that the domain admin accounts still have permissions to vCenter and SSO I came across the following error:

Cannot load the users

Great! (The message is “Cannot load the users for the selected domain“).

I am using the vCenter appliance. Digging through the /var/log/messages on this I found the following entries:

Searched Google a bit but couldn’t find any resolutions. Many blog posts suggested removing vCenter from the domain and re-adding but that didn’t help. Some blog posts (and a VMware KB article) talk about ensuring reverse PTR records exist for the DCs – they do in my case. So I am drawing a blank here.

Odd thing is the appliance is correctly connected to the domain and can read the DCs and get a list of users. The appliance uses Likewise (now called PowerBroker Open) to join itself to the domain and authenticate with it. The /opt/likewise/bin directory has a bunch of commands which I used to verify domain connectivity:
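(A rough sketch of the sort of checks possible from that directory, assuming the standard Likewise tools are present; output omitted.)

```
vcsa:~ # cd /opt/likewise/bin
vcsa:/opt/likewise/bin # ./domainjoin-cli query    # shows the domain the appliance is joined to
vcsa:/opt/likewise/bin # ./lw-get-status           # shows whether the domain is online
vcsa:/opt/likewise/bin # ./lw-enum-users           # enumerates the domain users Likewise can see
vcsa:/opt/likewise/bin # ./lw-enum-groups          # same, for groups
```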

All looks well! In fact, I added a user to my domain and re-ran the lw-enum-users command and it correctly picked up the new user. So the appliance can definitely see my domain and get a list of users from it. The problem appears to be in the upper layers.

In /var/log/vmware/sso/ssoAdminServer.log I found the following each time I’d query the domain for users via the SSO section in the web client:

Makes no sense to me but the problem looks to be in Java/ SSO.

I tried removing AD from the list of identity sources in SSO (in the web client) and re-added it. No luck.

Tried re-adding AD but this time I used an SPN account instead of the machine account. No luck!

Finally I tried adding AD as an LDAP Server just to see if I can get it working somehow – and that clicked! :)

AD as LDAP

So while I didn’t really solve the problem I managed to work around it …

Update: Added the rest of my DCs as time sources to the ESXi hosts and restarted the ntpd service. Maybe that helped; NTP is now working on the hosts.

 

Upgrading iLO firmware manually (working around a stuck HP logo screen when updating)

The past two weeks I have been upgrading the iLO and ROM of all our servers (a bunch of HP DL 360s basically – Gen6 to Gen8) following which I upgrade them from ESXi 4.1 to 5.5. Side by side I have also been upgrading the iLO and ROM of our LeftHand/ StoreVirtual boxes following which I upgrade them from LeftHand OS 8.5 to 12.0. Yes, I’ve been busy!

Interesting thing about the firmware upgrades is that even between servers of the same model, when upgrading with the same Service Pack for Proliant (SPP) CD version, I get different errors. Some odd ones really. For instance some servers simply power off once the SPP CD boots, others give me a Pink Screen of Death, and yet others simply hang with the pulsating HP logo.

Pink Screen

Pulsating Logo

I couldn’t find any solutions for the servers that power off (I used SPP versions 2015.04, 2014.09, 2014.06 and 2013.02 – same results for all). I was able to work around the pink screen by using an older version (for instance, I was using 2015.04 and that failed but 2014.09 worked). And I sort of worked around the pulsating logo problem.

For the pulsating logo issue apparently the fix is to upgrade iLO first and then run the SPP. In my case the servers had really ancient versions of iLO – “1.87 06/03/2009” – so I upgraded them via the iLO webpage. The blog post I linked to before (and also this one) shows a way of updating iLO via SSH but that didn’t work for me sadly. (Could just be the web server I was running. I used TinyWeb to run a small web server off my desktop machine).

Before upgrading iLO via SSH or the webpage, you need to get the iLO firmware first. That should be easy but I had trouble getting it. For anyone else looking for the latest and greatest version of iLO 2, this HP page is what you want (and the “Revision History” tab on that page gives you older versions too). That page lets you download versions of the firmware for flashing via Linux or Windows. I downloaded the Windows version, right-clicked on it (it’s an EXE file) with 7-Zip (any other zip tool should do), and extracted the contents. The result is a file with a name like “ilo2_225.bin”. This is the binary image of the iLO 2 firmware that you can flash via SSH or the webpage.

Flashing via the webpage is easy. Go to the “Administration” tab, click Browse to select this file, and click “Send firmware image”.

GUI firmware

Use a modern browser if you can. :-) I used the ancient version of IE on my server and that didn’t do anything, but when I used Firefox I was able to see a progress bar and the firmware actually got updated.

GUI firmware flashing

After doing this I was able to run the SPP without any issue.

Another thing I learnt is that for the LeftHand/ StoreVirtual servers, simply upgrading the OS or patching it is enough to upgrade the ROM too. So I could have saved some time for myself with the LeftHand/ StoreVirtual servers by updating the iLO (as above) and upgrading the OS. No need to run the SPP.

On a related note, I had some servers with an “Internal Health LED failed” error even though everything seemed to be alright with them. Upgrading the iLO sorted that out!

And while on the topic of iLO I had some servers whose iLO was not responsive. I couldn’t ping the iLO IP address nor could I connect to it. I was able to fix some of those servers by completely powering off the server, removing the power cables, removing the iLO cable, waiting a few minutes, putting the power cables back and powering on the server, and once it had loaded the OS, putting the iLO cable back in. (I have also read reports on the Internet where there was no need to remove/ re-insert the iLO cable so YMMV).

One server though had no luck – its iLO chip was faulty I guess. I tried to upgrade its iLO firmware and ROM by physically being in front of the server but it would hang at the pulsating logo as above. I think the faulty iLO was causing SPP to fail. Because of the faulty iLO though, ESXi would hang at “loading module ipmi_si_drv” for about 30 minutes each time it would boot (or when I’d run the installer to upgrade to 5.5). The solution is as detailed in this blog post. (Note: the argument is noipmiEnabled – I was mistakenly typing noipmiEnable the first few times and nothing happened). Post-install I configured the VMkernel.Boot.ipmiEnabled advanced configuration option to 0 (I unchecked it). This way I don’t have to enter the boot options each time.

That’s all!

VMware/ PowerCLI – find disks used by a template

It’s easy finding the disks used by a VM. Just check its settings via the client or use PowerCLI.
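For instance, something like this sketch using the HardDisks property on the VM object (the VM name is a placeholder; newer PowerCLI versions may flag this property as deprecated):

```powershell
# list the disks of a VM via the HardDisks property on the VM object
(Get-VM -Name "testvm01").HardDisks | Select-Object Name, Filename
```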

Can’t do the same for a template though as the Get-Template output has no similar property.

But then I came across Get-HardDisk:
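(A sketch; the template name is a placeholder.)

```powershell
# Get-HardDisk accepts templates as well as VMs
Get-Template -Name "w2k12r2-template" | Get-HardDisk | Select-Object Parent, Name, Filename
```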

Sweet! Same command works for templates (as above) and VMs:
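(Again a sketch; the VM name is a placeholder.)

```powershell
Get-VM -Name "testvm01" | Get-HardDisk | Select-Object Parent, Name, Filename
```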

That’s all. Hope this helps someone.

Enable & Disable SSH on ESXi host via PowerCLI

I alluded to this in another post but couldn’t find it when I was searching my posts for the cmdlet. So here’s a separate post.

The Get-VMHostService cmdlet is your friend when dealing with services on ESXi hosts. You can use it to view the services thus:
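(Something along these lines; the host name is a placeholder.)

```powershell
# list the services on a host, their startup policy and whether they are running
Get-VMHost "esx01.mydomain.local" | Get-VMHostService
```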

To start and stop services we use the Start-VMHostService and Stop-VMHostService but these take (an array of) HostService objects.  HostService objects are what we get from the Get-VMHostService cmdlet above. Here’s how you stop the SSH & ESXi Shell services for instance:
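(A sketch; the host name is a placeholder. “TSM-SSH” is the SSH service and “TSM” is the ESXi Shell.)

```powershell
Get-VMHost "esx01.mydomain.local" | Get-VMHostService |
    Where-Object { $_.Key -eq "TSM-SSH" -or $_.Key -eq "TSM" } |
    Stop-VMHostService -Confirm:$false
```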

Since the cmdlet takes an array, you can give it HostService objects of multiple hosts. Here’s how I start SSH & ESXi Shell for all hosts:
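(Again a sketch – this hits every host vCenter knows about.)

```powershell
# start SSH and the ESXi Shell on all hosts
Get-VMHost | Get-VMHostService |
    Where-Object { $_.Key -eq "TSM-SSH" -or $_.Key -eq "TSM" } |
    Start-VMHostService
```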

As an aside here’s a nice post on six different ways to enable SSH on a host. Good one!

Misc ESXi/ vSphere stuff

Just some notes to myself so I can refer to this later.

  • You can only have a maximum of 256 VMFS datastores per ESXi host. (This is one reason why you wouldn’t want to create a LUN/ datastore per VM. Wouldn’t work if you have a lot of VMs!)
    • Other maximums (for vSphere 5.5) can be found at this link.
  • When you create distributed switch port group there are 3 port binding options:
    • Static Binding (the default): VM NICs are connected to the port group at VM creation and remain so until the VM is removed from the port group. Powering off a VM or disconnecting the NIC from the port group does not remove it from the port group – the port is still kept aside for the VM. What this means is that once you connect a VM to a port it stays with that port forever.
      • Since the ports are assigned at VM creation, even if vCenter is down when the VM later powers on/ connects to the port group, it will continue to have network connectivity. (Note the emphasis on “later”. If the VM were already running and vCenter were to go down network traffic isn’t affected in either of the binding options).
    • Dynamic Binding (deprecated): VM NICs are connected to the port group only when the VM is powered on and its NIC connected to the port group. Power off the VM or disconnect the NIC and it is no longer connected to the same port when it comes back on or is reconnected.
      • Since the port binding happens only when the VM is powered on or connected, and the port group resides with vCenter, what this means is that you can only power on / off such VMs via vCenter. If vCenter is off / unreachable when the VM powers on / connects, it will not have network connectivity as it won’t have a port in the port group. (As above, note that this doesn’t affect VMs that are already running).
      • Dynamic Binding is deprecated but is useful when the number of VMs is larger than the number of ports in the port group and not all VMs will be on / connected at the same time.
    • Ephemeral Binding: Similar to Dynamic Binding, VM NICs are connected to the port group only when the VM is powered on and its NIC connected to the port group. Powering off the VM or disconnecting it results in the port being removed from the port group.
      • Although Dynamic and Ephemeral Bindings seem similar, they don’t have similar limitations. Thus while VMs with Dynamic Binding port groups won’t have network connectivity if they are powered on / connected when vCenter is off / unreachable, VMs with Ephemeral Binding have no such limitation. They don’t get a proper port number from the port group, but get a temporary one like h-1 which changes to a proper port number whenever connectivity with vCenter is restored.
      • Below screenshot shows the port numbers of three VMs, each connected to a port group of a different binding (Ephemeral, Dynamic, Static from top to bottom) and powered on when vCenter was unreachable.
      • Although the NIC is unable to get a port – like Dynamic Binding – with an Ephemeral Binding port group the host creates a fake port and connects the VM anyway. 
      • I don’t understand why Dynamic Binding even exists as an option – unless it’s for backward compatibility? Ephemeral Binding seems to have the advantage of Dynamic Binding – ports are created at VM connection / powering on and so you can oversubscribe a port group – but doesn’t have the disadvantage of lost connectivity when vCenter is off / unreachable. (I assume Ephemeral port groups too can be used for oversubscribing, though the official KB articles don’t say anything like this so I could be wrong).
      • Dynamically creating / removing ports from the port group is an expensive operation so Dynamic and Ephemeral Binding port groups have a performance overhead. Static Binding is the preferred one.
      • Also, Ephemeral Binding port groups lose their history and security controls across host reboots. Apparently Dynamic Binding port groups don’t do this as I don’t see any mention of this as a Dynamic Binding limitation anywhere.

That’s all for now!

 

Get ESXi host network info using PowerShell/ PowerCLI

Not an exhaustive post, I am still exploring this stuff.

To get a list of network adapters on a host:
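(The host name in these examples is a placeholder.)

```powershell
Get-VMHost "esx01.mydomain.local" | Get-VMHostNetworkAdapter
```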

To get a list of virtual switches on a host, with the NICs assigned to these:
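(Same placeholder host name.)

```powershell
Get-VMHost "esx01.mydomain.local" | Get-VirtualSwitch | Select-Object Name, Nic
```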

To get a list of port groups on a host:
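(Same placeholder host name.)

```powershell
Get-VMHost "esx01.mydomain.local" | Get-VirtualPortGroup
```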

To get a list of port groups, the virtual switches they are mapped to, and the NICs that make up these switches:
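(A sketch of one way of stitching these together; same placeholder host name.)

```powershell
Get-VMHost "esx01.mydomain.local" | Get-VirtualSwitch | ForEach-Object {
    $vSwitch = $_
    # for each switch, list its port groups along with the switch name and its physical NICs
    Get-VirtualPortGroup -VirtualSwitch $vSwitch | Select-Object Name,
        @{N = "vSwitch"; E = { $vSwitch.Name }},
        @{N = "Nic";     E = { $vSwitch.Nic -join "," }}
}
```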

This essentially combines the first and third cmdlets above.

More later!

ESXi 4.0 “unsupported” mode

At work we still use some ESXi 4.0 hosts so this one’s a reminder to myself as ESXi Shell access works slightly differently with that one.

On ESXi 4.0 once we are on the DCUI screen, pressing Alt+F1 gives access to a different (hidden) console. Whatever you type here seems to have no effect, but if you type the word unsupported and press Enter, you will be prompted to enter the root password and enter the Tech Support Mode (TSM). For screenshots and such of this check out this blog post.

On ESXi 4.1 and above you can enable this via the DCUI. See this KB article for the deets.

On that note here’s a good blog post detailing various ways of enabling SSH access on an ESXi host. Informative.

Number of IPv4 routes did not match

Was creating / migrating some ESXi hosts during the week and came across the above error “Number of IPv4 routes did not match” when checking for host profile compliance of one of the hosts. All network settings of this host appeared to be same as the rest so I was stumped as to what could be wrong. Via a VMware KB article I came across the esxcfg-route command that helped identify the problem. To run this command SSH into the host:
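(The gateway below is a placeholder, not my actual output.)

```
~ # esxcfg-route
VMkernel default gateway is 10.50.0.1
```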

By default the command only outputs the default gateway but you can pass it the -l switch to list all routes:

In my case the above output was from one of the hosts, while the following was from the non-compliant host:

Notice the vmk2 interface has the wrong network. Not sure how that happened. Oddly the GUI didn’t show this incorrect network but obviously something was corrupt somewhere.

To fix that I thought I’ll remove the vmk2 interface and re-add it. Big mistake! Possibly because its network was the same as that of the management network (10.50.0.0/24), removing this interface caused the host to lose connectivity from vCenter. I could ping it but couldn’t connect to it via SSH, vSphere Client, or vCenter. Finally I had to reset the network via the DCUI – it’s under “Network Restore Options”. I tried “Restore vDS” first, which didn’t help, so did a “Restore Standard Switch”. This is very useful – it creates a new standard switch and moves the Management Network onto that so you get connectivity to the host. This way I was able to reconnect to the host, but now I stumbled upon a new problem.

The host didn’t have the vmk2 interface any more but when I tried to recreate it I got an error that the interface already exists. But no, it does not – the GUI has no trace of it! Some forum posts suggested restarting the vCenter service as that clears its cache and puts it in sync with the hosts but that didn’t help either. Then I came across this post which showed me that it is possible for the host to still have the VMkernel port but vCenter to not know of it. For this the esxcli command is your friend. To list all VMkernel ports on a host do the following:
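(Run on the host over SSH.)

```
# lists every VMkernel interface the host knows about (including any vCenter has lost track of)
esxcli network ip interface list
```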

After that, removing the VMkernel interface can be done by a variant of the same command:
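(vmk2 being the stale interface in my case.)

```
# remove the stale VMkernel interface
esxcli network ip interface remove --interface-name=vmk2
```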

Now I could re-add the interface via vSphere and get the hosts into compliance.

Before I conclude this post though, a few notes on the commands above.

If you have PowerCLI installed you can run all the esxcli commands via the Get-EsxCli cmdlet. For example:
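(A sketch; the host name is a placeholder. This is the older v1 Get-EsxCli object, where the esxcli namespaces become dotted properties and methods.)

```powershell
$esxcli = Get-EsxCli -VMHost "esx01.mydomain.local"
# equivalent of running "esxcli network ip interface list" on the host itself
$esxcli.network.ip.interface.list()
```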

If I wanted to remove the interface via PowerCLI the command would be slightly different:
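(A sketch – the v1 object takes the method arguments positionally, so depending on your PowerCLI version you may need to pass placeholders for the optional dvport/dvs parameters; calling the method with no arguments shows the expected parameter list.)

```powershell
# remove vmk2 via the esxcli object; the first two arguments are the optional dvport-id and dvs-name
$esxcli.network.ip.interface.remove($null, $null, "vmk2")
```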

I would have written more on the esxcli command itself but this excellent blog post covers it all. It’s an all powerful command that can be used to manage many aspects of the ESXi host, even set it in maintenance mode!

Heck you can even use esxcli to upgrade from one ESXi version to another. It is also possible to run the esxcli command from a remote computer (Windows or Linux) by installing the vSphere CLI tools on that computer. Additionally, there’s also the vSphere Management Assistant (VMA) which is a virtual appliance that offers command line tools.

The esxcli is also useful if you want to kill a VM. For instance the following lists all running VMs on a host:
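(Run on the host; the output includes each VM’s World ID, which is needed to kill it.)

```
esxcli vm process list
```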

If that VM were stuck for some reason and cannot be stopped or restarted via vSphere it’s very useful to know the esxcli command can be used to kill the VM (has happened a couple of times to me in the past):
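(The World ID below is a placeholder taken from the list output; start with a soft kill.)

```
esxcli vm process kill --type=soft --world-id=1234567
```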

Regarding the type of killing you can do:

There are three types of VM kills that can be attempted: [soft, hard, force]. Users should always attempt ‘soft’ kills first, which will give the VMX process a chance to shutdown cleanly (like kill or kill -SIGTERM). If that does not work move to ‘hard’ kills which will shutdown the process immediately (like kill -9 or kill -SIGKILL). ‘force’ should be used as a last resort attempt to kill the VM. If all three fail then a reboot is required. (required)

Another command line option is vim-cmd which I stumbled upon from one of the links above. I haven’t used it much so as a reference to myself here’s a blog post explaining it in detail.

Lastly there’s also a bunch of esxcfg-* commands, one of which we came across above.

I haven’t used these much. They seem to be present for compatibility reasons with ESXi 3.x and prior. Back then you had commands with a vicfg- prefix, now you have the same but with an esxcfg- prefix. For instance, esxcfg-vmknic is now replaced with esxcli network ip interface as we saw above.

That’s all for now!

Update: Thought I’d use this post to keep track of other useful commands.

To get IPv4 address details:
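(Run on the host.)

```
esxcli network ip interface ipv4 get
```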

Replace ipv4 with ipv6 if that’s what you want.

To set an IPv4 address:
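(The interface name and addresses below are placeholders.)

```
# set a static IPv4 address on a VMkernel interface
esxcli network ip interface ipv4 set --interface-name=vmk1 --ipv4=10.50.0.72 --netmask=255.255.255.0 --type=static
```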

To ping an address from the host:
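(The address is a placeholder; vmkping is an alternative.)

```
esxcli network diag ping --host=10.50.0.1 --count=3
```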

Change keyboard layout:

Get current keyboard layout:
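(Presumably via the esxcli system settings keyboard layout namespace.)

```
esxcli system settings keyboard layout get
```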

List available layouts:
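(Same namespace.)

```
esxcli system settings keyboard layout list
```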

Set a new layout:
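(The layout name below is just an example of the sort of value the list command returns.)

```
esxcli system settings keyboard layout set --layout="United Kingdom"
```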

Remotely enable SSH

The esxcli commands are cool but you need to enable SSH each time you want to connect to the host and run these (unless you install the CLI tools on your machine). If you have PowerCLI though you can enable SSH remotely.

To list the services:
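(The host name is a placeholder.)

```powershell
Get-VMHost "esx01.mydomain.local" | Get-VMHostService
```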

To enable SSH and the ESXi shell:
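(Same placeholder host.)

```powershell
Get-VMHost "esx01.mydomain.local" | Get-VMHostService |
    Where-Object { $_.Key -eq "TSM-SSH" -or $_.Key -eq "TSM" } |
    Start-VMHostService
```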

 

Load balancing in vCenter and ESXi

One of the things you can do with a portgroup is define teaming for the underlying physical NICs.

teaming

If you don’t do anything here, the default setting of “Route based on originating virtual port” applies. What this does is quite obvious. Each virtual port on the virtual switch is mapped to a physical NIC behind the scenes; so all traffic to & from that virtual port goes & comes via that physical NIC. Since your virtual NIC connects to a virtual port this is equivalent to saying all traffic for that virtual NIC happens via a particular physical NIC.

In the screenshot above, for instance, I have two physical NICs dvUplink1 and dvUplink2. If I left teaming at the default setting and say I had 4 VMs connecting to 4 virtual ports, chances are two of these VMs will use dvUplink1 and two will use dvUplink2. They will continue using these mappings until one of the dvUplinks dies, in which case the other will take over – so that’s how you get failover.

This is pretty straightforward and easy to set up. And the only disadvantage, if any, is that you are limited to the bandwidth of a single physical NIC. If each of dvUplink1 & dvUplink2 were 1Gb NICs it isn’t as though the underlying VMs had 2Gb (2 NICs x 1Gb each) available to them. Since each VM is mapped to one uplink, 1Gb is all they get.

Moreover, if say two VMs were mapped to an uplink, and one of them was hogging up all the bandwidth of this uplink while the remaining uplink was relatively free, the other VM on this uplink won’t automatically be mapped to the free uplink to make better use of resources. So that’s a bummer too.

A neat thing about “Route based on originating virtual port” is that the virtual port is fixed for the lifetime of the virtual machine so the host doesn’t have to calculate which physical NIC to use each time it receives traffic to & from the virtual machine. Only if the virtual machine is powered off, deleted, or moved to a different host does it get a new virtual port.

The other options are:

  • Route based on MAC hash
  • Route based on IP hash
  • Route based on physical NIC load
  • Explicit failover

We’ll ignore the last one for now – that just tells the host to use the first physical NIC in the list for all VMs.

“Route based on MAC hash” is similar to “Route based on originating virtual port” in that it uses the MAC address of the virtual NIC instead of the virtual port. I am not very clear on how this is better than the latter. Since the MAC address of a virtual machine is usually constant (unless it is changed or a different virtual NIC used) all traffic from that MAC address will use the same physical NIC always. Moreover, there is the additional overhead in that the host has to check each packet for the MAC address and decide which physical NIC to use. VMware documentation says it provides a more even distribution of traffic but I am not clear how.

“Route based on physical NIC load” is a good one. It starts off with “Route based on originating virtual port” but if a physical NIC is loaded, then the virtual ports mapped to it are moved to a physical NIC with less load! This load balancing option is only available for distributed switches. Every 30s the distributed switch checks the physical NIC load and if it exceeds 75% then the virtual port of the VM with highest utilization is moved to a different physical NIC. So you have the advantages of “Route based on originating virtual port” with one of its major disadvantages removed.

In fact, except for “Route based on IP hash” none of the other load balancing mechanisms have an option to utilize more than a single physical NIC bandwidth. And “Route based on IP hash” does not do this entirely as you would expect.

“Route based on IP hash”, as the name suggests, does load balancing based on the IP hash of the virtual machine and the remote end it is communicating with. Based on a hash of these two IP addresses all traffic for the communication between these two IPs is sent through one NIC. So if a virtual machine is communicating with two remote servers, it is quite likely that traffic to one server goes through one physical NIC while traffic to the other goes via another physical NIC – thus allowing the virtual machine to use more bandwidth than that of one physical NIC. However – and this is an often overlooked point – all traffic between the virtual server and one remote server is still constrained by the bandwidth of the physical NIC it happens via. Once traffic is mapped to a particular physical NIC, if more bandwidth is required or the physical NIC is loaded, it is not as though an additional physical NIC is used. This is a catch with “Route based on IP hash” that’s worth remembering.

If you select “Route based on IP hash” as a load balancing option you get two warnings:

  • With IP hash load balancing policy, all physical switch ports connected to the active uplinks must be in link aggregation mode.
  • IP hash load balancing should be set for all port groups using the same set of uplinks.

What this means is that unlike the other load balancing schemes where there was no additional configuration required on the physical NICs or the switch(es) they connect to, with “Route based on IP hash” we must combine/ bond/ aggregate the physical NICs as one. There’s a reason for this.

In all the other load balancing options the virtual NIC MAC is associated with one physical NIC (and hence one physical port on the physical switch). So incoming traffic for a VM knows which physical port/ physical NIC to go via. But with “Route based on IP hash” there is no such one to one mapping. This causes havoc with the physical switch. Here’s what happens:

  • Different outgoing traffic flows choose different physical NICs. With each of these packets the physical switch will keep updating its MAC address table with the port the packet was got from. So for instance, say the two physical NICs are connected to physical switch Port1 and Port2 and the virtual NIC MAC address is VMAC1. When an outgoing traffic packet goes via the first physical NIC, the switch will update its tables to reflect that VMAC1 is connected to Port1. Subsequent traffic flows might continue using the first physical NIC so all is well. Then say a traffic flow uses the second physical NIC. Now the switch will map VMAC1 to Port2; then a traffic flow could use Port1 so the mapping gets changed to Port1, and then Port2, and so on …
  • When incoming traffic hits the physical switch for MAC address VMAC1, the switch will look up its tables and decide which port to send traffic on. If the current mapping is Port1 traffic will go out via that; if the current mapping is Port2 traffic will go out via that. The important thing to note is that the incoming traffic flow port chosen is not based on the IP hash mapping – it is purely based on whatever physical port the switch currently has mapped for VMAC1.
  • So what’s required is a way of telling the physical switch that the two physical NICs are to be considered as bonded/ aggregated such that traffic from either of those NICs/ ports is to be treated accordingly. And that’s what EtherChannel does. It tells the physical switch that the two ports/ physical NICs are bonded and that it must route incoming traffic to these ports based on an IP hash (which we must tell EtherChannel to use while configuring it).
  • EtherChannel also helps with the MAC address table in that now there can be multiple ports mapped to the same MAC address. Thus in the above example there would now be two mappings VMAC1-Port1 and VMAC1-Port2 instead of them over-writing each other!

“Route based on IP hash” is a complicated load balancing option to implement because of EtherChannel. And as I mentioned above, while it does allow a virtual machine to use more bandwidth than a single physical NIC, an individual traffic flow is still limited to the bandwidth of a single physical NIC. Moreover there is more overhead on the host because it has to calculate the physical NIC used for each traffic flow (essentially each packet).

Prior to vCenter 5.1 only static EtherChannel was supported (unless you use a third party virtual switch such as the Cisco Nexus 1000V). Static EtherChannel means you explicitly bond the physical NICs. But from vCenter 5.1 onwards the inbuilt distributed switch supports LACP (Link Aggregation Control Protocol) which is a way of automatically bonding physical NICs. Enable LACP on both the physical switch and distributed switch and the physical NICs will automatically be bonded.

(To enable LACP on the physical NICs go to the uplink portgroup that these physical NICs are connected to and enable LACP).

lacp

That’s it for now!

Update

Came across this blog post which covers pretty much everything I covered above but in much greater detail. A must read!

VCSA: Unable to connect to server. Please try again.

Most likely you set the VCSA to regenerate its certificates upon reboot and forgot to uncheck it after the reboot. (It’s under Admin > Certificate Regeneration Enabled). So each time you reboot VCSA gets a new certificate and your browser throws the above error.

Fix is to refresh (Ctrl+F5 in Firefox) the page so the new certificate is fetched and you get a prompt about it.

A very brief intro to Port Groups, Standard and Distributed switches

A year ago I went for VMware training but never got around to using it at work. Now I am finally into it, but I’ve forgotten most of the concepts. And that sucks!

So I am slowly re-learning things as I go along. I am in this weird state where I sort of remember bits and pieces from last year but at the same time I don’t really remember them.

What I have been reading about these past few days (or rather, trying to read these past few days) is networking. The end goal is distributed switches but for now I am starting with the basics. And since I like to blog these things as I go along, here we go.

You have a host. The server that runs ESXi (hypervisor).

This host has physical NICs. Hopefully oodles of them, all connected to your network.

This server runs virtual machines (a.k.a guests). These guests see virtual NICs that don’t really exist except in software, exposed by ESXi.

What you need is for all these virtual NICs to be able to talk to each other (if needed) as well as talk to the outside world (via the physical NICs and the switches they connect to).

You could create one big virtual switch and connect all the physical and virtual NICs to it. (This virtual switch is again something which does not physically exist). All the guests can thus talk to each other (as they are on the same switch) and also talk to the outside world (because the virtual switch is connected to the outside world via whatever it is connected to).

But maybe you don’t want all the virtual NICs to be able to talk to each other. You want little networks in there – a la VLANs – to isolate certain traffic from the rest. There are two options here:

  1. Create separate virtual switches for each network, and assign some virtual NICs to some switches. The physical NICs that connect to these virtual switches will connect to separate physical switches so you are really limited in the number of virtual switches you have by the number of physical NICs you have. Got 2 physical NICs, you can create 2 virtual switches; got 5 physical NICs, you can create 5 virtual switches.
  2. Create one big virtual switch as before, but use port groups. Port groups are the VMware equivalent of VLANs (well, sort of; they do more than just VLANs). They are a way of grouping the virtual ports on the virtual switch such that only the NICs connected to a particular port group can talk to each other. You can create as many port groups as you want (within limits) and assign all your physical NICs to this virtual switch and use VLANs so the traffic flowing out of this virtual switch to the physical switch is on separate networks. Pretty nice stuff!

(In practice, even if you create separate virtual switches you’d still create a port group on that – essentially grouping all the ports on that switch into one. That’s because port groups are used to also apply policies to the ports in the group. Policies such as security, traffic shaping, and load balancing/ NIC teaming of the underlying physical NICs. Below is a screenshot of the options you have with portgroups).

Example of a Portgroup

Now onto standard and distributed switches. In a way both are similar – in that they are both virtual switches – but the difference is that a standard switch exists on & is managed by a host whereas a distributed switch exists on & is managed by vCenter. You create a distributed switch using vCenter and then you go to each host and add its physical NICs to the distributed switch. As with standard switches you can create portgroups in distributed switches and assign VM virtual NICs to these portgroups.

An interesting thing when it comes to migration (obvious but I wasn’t sure about this initially) is that if you have a host with two NICs – one of which is a member of a standard switch and the other of a distributed switch – but both NICs connect to the same physical network (or VLAN), and you have VMs in this host some of which are on the standard switch and others are on the distributed switch, all these VMs can talk to each other through the underlying physical network. Useful when you want to migrate stuff.

I got side tracked at this point with other topics so I’ll conclude this post here for now.

[Aside] Hyper-V R2 replicas

Saw this post in the Technet newsletter. Good stuff. I won’t be using it now (wish I were managing a Hyper-V environment in the first place! setting up replicas is a far away dream :)) but thought I should point to it anyway. Maybe some day I’ll need to use Hyper-V replicas and I’ll search my blog and come across this … maybe maybe! :)

From my ESXi training I know VMware has something similar (vSphere Replication). The latter is a paid additional feature while Hyper-V replicas are free. Also, here’s a comparison by Aidan Finn.