vMotion NIC load balancing fails even though there is an active link

The other day I blogged about how I had a host whose vMotion VMkernel interface seemed to be broken. Any vMotion attempts to it would hang at 14%.

At that time I logged on to the destination host, then used vmkping with the -I switch (to explicitly specify the vMotion VMkernel interface of the destination host), and found that I couldn’t ping the VMkernel interface of the other hosts. These hosts could ping each other but couldn’t ping the destination host.

The VMKernel interface is backed by two physical NICs. I found that if I remove one of the physical NICs from the VMkernel it works. Interestingly this link wasn’t showing any CDP info either, so it looked like something was wrong with it (the physical NIC shows as unclaimed coz the screenshot was taken after I moved it to unclaimed).

So the first question is why did the VMkernel fail when only one of the physical NICs failed? Since the other physical NIC backing the VMkernel NIC is still active shouldn’t it have continued working?

The reason why it failed is that by default network failover detection is via “Link status only”. This only detects failures to the link – like say the cable is broken, the switch is down, or the NIC has failed – while failures such as the link being connected but blocked by switch are not detected. In my case as you can see from the screenshot above the link status is connected – so the host doesn’t consider the link failed even though it isn’t actually working, thus continues to use it.

Next I discovered that other hosts too similarly had their second vMotion physical NIC in a failed state as above yet they weren’t failing like this host. The simple explanation for this is that the host above somehow selected the faulty physical NIC as the one to use, didn’t detect it as failed and so continued to use it; whereas other hosts were more lucky and chose the physical NIC that works alright, so didn’t have any issues.

I am not sure that’s the entire answer though. For once the host that failed was ESXi 5.5 and using a distributed switch, while the other two hosts were ESXi 4.0 and using standard switches. Did that make a difference?

The default load balancing method for both standard and distributed switches is the same. (For a standard switch you check this under the vSwitch properties on the host. For a distributed switch you check this under the portgroup in the Networking section of vSphere (web) client).

Load balancing is what I am concerned about here because that’s what the hosts should be using to balance between both the NICs. That’s what the host will be using to select the physical NIC to use for that particular traffic flow. The load balancing method is same between standard and distributed switches yet why were the distributed switch/ ESXi 5.5 hosts behaving differently?

I am still not sure of an answer but I have my theory. My theory is that since a distributed switch is across multiple hosts the load balancing method (above) of choosing a route based on virtual port ID comes into play. Here’s screenshots from two of my hosts connected to the same distributed switch port group for instance:

As you can see the virtual port number is different for the VMkernel NIC of each host. So each host could potentially use a different underlying physical NIC depending on how the load balancing algorithm maps it.

But what about a standard switch? Since the standard switch is only on the host, and the only VMkernel NIC connected to it (in the case of vMotion) is the single VMKernel NIC I have assigned for vMotion, there is no load balancing algorithm coming into play! If, instead of a VMkernel I had a Virtual Machine network, then the virtual port number matters because there are multiple VMs connecting to the various port numbers; but that doesn’t matter for VMkernel NICs as there is only one of them. And so my theory is that for a VMkernel NIC (such as vMotion) backed by multiple physical NICs and using the default load balancing algorithm of virtual port ID – all traffic by default goes into one of the physical NICs and the other physical NIC is never used unless the chosen one fails. And that is why my hosts using the standard switches were always using the same physical NIC (am guessing the lower numbered one as that’s what both hosts chose) while hosts using distributed switches would have chosen different physical NICs per host.

That’s all! Just thought I’d put this out there in case anyone else has the same question.