Migrating VMkernel port from Standard to Distributed Switch fails

I am putting a link to the official VMware documentation on this as I Googled it just to confirm to myself that I am not doing anything wrong! What I need to do is migrate the physical NICs and the Management/VM Network VMkernel NIC from a standard switch to a distributed switch. The process is simple and straightforward, and one that I have done numerous times; yet it fails for me now!

Here’s a copy paste from the documentation:

  1. Navigate to Home > Inventory > Networking.
  2. Right-click the dVswitch.
  3. If the host is already added to the dVswitch, click Manage Hosts, else Click Add Host.
  4. Select the host(s), click Next.
  5. Select the physical adapters (vmnic) to use for the vmkernel, click Next.
  6. Select the Virtual adapter (vmk) to migrate and click the Destination port group field. For each adapter, select the correct port group from the dropdown, click Next.
  7. Click Next to omit virtual machine networking migration.
  8. Click Finish after reviewing the new vmkernel and Uplink assignment.
  9. The wizard and the job completes moving both the vmk interface and the vmnic to the dVswitch.

Basically you add physical NICs to the distributed switch and migrate the vmk NICs as part of the process. For good measure I usually migrate only one physical NIC from the standard switch to the distributed switch, and then separately migrate the vmk NICs.
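
Before kicking off the wizard I usually do a quick sanity check from the ESXi shell to see what the host currently has. The commands below are just a sketch – the switch and NIC names will obviously differ per environment:

  # Physical NICs on the host, with link status
  esxcli network nic list

  # Standard switches, their uplinks and portgroups
  esxcli network vswitch standard list

  # VMkernel NICs (vmk0 is typically Management)
  esxcli network ip interface list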

Here’s what happens when I do the above now. (Emphasis on now – I never had an issue with this earlier. I’m guessing it must be some bug in a newer 5.5 update, or something wrong in the underlying network at my firm. I don’t think it’s the network, because I got my network admins to take a look, and I tested that all NICs on the host have connectivity to the outside world by making each NIC the active one and disabling the others.)

First it’s stuck in progress:

And then vCenter cannot see the host any more:

Oddly, I can still ping the host on the vmk NIC IP address. However I can’t SSH into it, so it’s the Management bits that seem to be down. The host has connectivity to the outside world because it passes the Management network tests from the DCUI (which I can connect to via iLO). I restarted the Management agents too, but nope – I still cannot SSH in or get vCenter to see the host. Something in the migration step breaks things. The only solution is to reboot, after which vCenter can see the host again.
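
(For the record, restarting the management agents can be done from the DCUI under Troubleshooting Options, or from a console/SSH session with the usual commands – roughly:)

  # Restart hostd and vpxa individually...
  /etc/init.d/hostd restart
  /etc/init.d/vpxa restart

  # ...or restart all management agents in one go
  services.sh restart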

Here’s what I did to work around it anyway.

First I moved one physical NIC to the distributed switch.

Then I created a new management portgroup and a VMkernel NIC on it for management traffic, and assigned it a temporary IP.
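
I did this via the GUI, but the rough ESXi shell equivalent is below. The portgroup name and temporary IP are placeholders (and if the new portgroup is on the distributed switch, esxcli wants --dvs-name and --dvport-id instead of --portgroup-name):

  # Create a new VMkernel NIC on the temporary management portgroup
  esxcli network ip interface add --interface-name=vmk4 --portgroup-name=Mgmt-Temp

  # Give it a temporary static IP
  esxcli network ip interface ipv4 set --interface-name=vmk4 --ipv4=192.0.2.50 --netmask=255.255.255.0 --type=static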

Next I opened a console to the host. Here’s the current config on the host:

The interface vmk0 (or its IPv4 address rather) is what I wanted to migrate. The interface vmk4 is what I created temporarily. 
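
(The config in the screenshot comes from commands along these lines; output omitted here:)

  # List VMkernel NICs and the switch/portgroup each one sits on
  esxcli network ip interface list

  # Show the IPv4 configuration of each VMkernel NIC
  esxcli network ip interface ipv4 get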

I now removed the IPv4 address of the existing vmk NIC and assigned it to the new one. I also confirmed the changes, just to be sure. As soon as I did so, vCenter picked up the changes. I then tried to move the remaining physical NIC over to the distributed switch, but that failed with an error that the existing connection was forcibly closed by the host. So I rebooted the host. Post-reboot I found that the host now thought it had no IP, even though it was responding to the old IP via the new vmk. So this approach was a no-go (but I’m still leaving it here as a reminder to myself that it does not work).

I now migrated vmk0 from the standard switch to the distributed switch. As before, this fails – vCenter loses connectivity to the ESXi host. But that’s why I have a console open. As expected, the output of esxcli network ip interface list shows that vmk0 hasn’t moved to the distributed switch:

So now I went ahead, removed the IPv4 address of vmk0, and assigned it to vmk4 (the new one). Again, I confirmed the changes.
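
In shell terms the swap looks roughly like this – 192.0.2.10 below is a placeholder for the real management IP:

  # Clear the IPv4 config on the old management vmk
  esxcli network ip interface ipv4 set --interface-name=vmk0 --type=none

  # Assign the management IP to the new vmk
  esxcli network ip interface ipv4 set --interface-name=vmk4 --ipv4=192.0.2.10 --netmask=255.255.255.0 --type=static

  # Confirm the changes
  esxcli network ip interface ipv4 get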

Next I rebooted the host, and via the CLI I removed vmk0 (for some reason the GUI showed both vmk0 and vmk4 with the same IP I had assigned above).
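
The removal itself is a one-liner from the shell:

  # Remove the old management VMkernel NIC
  esxcli network ip interface remove --interface-name=vmk0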

Reboot again!

Post-reboot I can go back to the GUI and move the remaining physical NIC over to the distributed switch. :) Yay!

vMotion NIC load balancing fails even though there is an active link

The other day I blogged about how I had a host whose vMotion VMkernel interface seemed to be broken. Any vMotion attempts to it would hang at 14%.

At that time I logged on to the destination host, then used vmkping with the -I switch (to explicitly specify the vMotion VMkernel interface of the destination host), and found that I couldn’t ping the VMkernel interface of the other hosts. These hosts could ping each other but couldn’t ping the destination host.
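
In case anyone wants to repeat the test, it’s along these lines (the vmk name and IP are examples):

  # From the destination host: ping the vMotion VMkernel IP of another host,
  # forcing the packet out of this host's own vMotion interface (vmk1 here)
  vmkping -I vmk1 192.0.2.21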

The VMkernel interface is backed by two physical NICs. I found that if I remove one of the physical NICs from the VMkernel, it works. Interestingly, this link wasn’t showing any CDP info either, so it looked like something was wrong with it (the physical NIC shows as unclaimed because the screenshot was taken after I moved it to unclaimed).
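
(CDP info can also be pulled on the host itself rather than via the GUI – the command below dumps the network hints, including CDP details, that the host has observed on each physical NIC:)

  # Show observed network hints (CDP info, observed IP ranges) per physical NIC
  vim-cmd hostsvc/net/query_networkhint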

Missing CDP info

So the first question: why did the VMkernel fail when only one of the physical NICs failed? Since the other physical NIC backing the VMkernel NIC was still active, shouldn’t it have continued working?

The reason it failed is that by default network failover detection is via “Link status only”. This only detects failures of the link itself – say the cable is broken, the switch is down, or the NIC has failed – while failures such as the link being connected but blocked by the switch are not detected. In my case, as you can see from the screenshot above, the link status is connected – so the host doesn’t consider the link failed even though it isn’t actually working, and thus continues to use it.
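
For a standard switch you can see this from the shell too – the “Network Failure Detection” line in the output is the one in question; for a distributed switch the setting lives on the portgroup in the web client. The vSwitch name below is an example:

  # Show the teaming and failover policy of a standard switch
  esxcli network vswitch standard policy failover get --vswitch-name=vSwitch0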

Next I discovered that other hosts similarly had their second vMotion physical NIC in a failed state, yet they weren’t failing like this host. The simple explanation is that this host happened to select the faulty physical NIC as the one to use, didn’t detect it as failed, and so continued to use it; whereas the other hosts were luckier and chose the physical NIC that works fine, so they didn’t have any issues.

I am not sure that’s the entire answer though. For one, the host that failed was ESXi 5.5 and using a distributed switch, while the other two hosts were ESXi 4.0 and using standard switches. Did that make a difference?

The default load balancing method for both standard and distributed switches is the same. (For a standard switch you check this under the vSwitch properties on the host. For a distributed switch you check this under the portgroup in the Networking section of the vSphere (web) client.)
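
For completeness, the standard switch portgroup-level policy can also be checked from the shell (the portgroup name below is an example); the “Load Balancing” line should read srcport, i.e. route based on the originating virtual port ID:

  # Show the teaming and failover policy of a specific portgroup
  esxcli network vswitch standard portgroup policy failover get --portgroup-name=vMotion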

default load balancing

Load balancing is what I am concerned with here because that’s what the hosts should be using to balance between the two NICs – it’s what the host uses to select the physical NIC for a particular traffic flow. The load balancing method is the same for standard and distributed switches, so why were the distributed switch/ESXi 5.5 hosts behaving differently?

I am still not sure of an answer, but I have a theory. My theory is that since a distributed switch spans multiple hosts, the load balancing method (above) of choosing a route based on the originating virtual port ID comes into play. Here are screenshots from two of my hosts connected to the same distributed switch port group, for instance:

port number

As you can see, the virtual port number is different for the VMkernel NIC of each host. So each host could potentially use a different underlying physical NIC, depending on how the load balancing algorithm maps it.

But what about a standard switch? Since the standard switch exists only on the host, and the only VMkernel NIC connected to it (in the case of vMotion) is the single VMkernel NIC I have assigned for vMotion, there is no load balancing algorithm coming into play! If, instead of a VMkernel NIC, I had a Virtual Machine network, then the virtual port number would matter because there are multiple VMs connecting to the various port numbers; but that doesn’t matter for VMkernel NICs as there is only one of them. And so my theory is that for a VMkernel NIC (such as vMotion) backed by multiple physical NICs and using the default load balancing algorithm of virtual port ID, all traffic by default goes over one of the physical NICs and the other is never used unless the chosen one fails. And that is why my hosts using standard switches were always using the same physical NIC (I’m guessing the lower numbered one, as that’s what both hosts chose), while hosts using distributed switches would have chosen different physical NICs per host.
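
One way to confirm which physical NIC a given vmk is actually pinned to is esxtop on the host – press n for the network view and check the TEAM-PNIC column against the port used by the vMotion vmk:

  # Start esxtop, then press n for the network view.
  # The TEAM-PNIC column shows the uplink each port (including vmk ports) is currently using.
  esxtop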

That’s all! Just thought I’d put this out there in case anyone else has the same question.

A very brief intro to Port Groups, Standard and Distributed switches

A year ago I went for VMware training but never got around to using it at work. Now I am finally into it, but I’ve forgotten most of the concepts. And that sucks!

So I am slowly re-learning things as I go along. I am in this weird state where I sort of remember bits and pieces from last year but at the same time I don’t really remember them.

What I have been reading about these past few days (or rather, trying to read these past few days) is networking. The end goal is distributed switches but for now I am starting with the basics. And since I like to blog these things as I go along, here we go.

You have a host. The server that runs ESXi (hypervisor).

This host has physical NICs. Hopefully oodles of them, all connected to your network.

This server runs virtual machines (a.k.a guests). These guests see virtual NICs that don’t really exist except in software, exposed by ESXi.

What you need is for all these virtual NICs to be able to talk to each other (if needed) as well as to the outside world (via the physical NICs and the switches they connect to).

You could create one big virtual switch and connect all the physical and virtual NICs to it. (This virtual switch is again something which does not physically exist). All the guests can thus talk to each other (as they are on the same switch) and also talk to the outside world (because the virtual switch is connected to the outside world via whatever it is connected to).

But maybe you don’t want all the virtual NICs to be able to talk to each other. You want little networks in there – a la VLANs – to isolate certain traffic from the rest. There are two options here:

  1. Create separate virtual switches for each network, and assign some virtual NICs to some switches. The physical NICs that connect to these virtual switches will connect to separate physical switches, so you are really limited in the number of virtual switches by the number of physical NICs you have. Got 2 physical NICs? You can create 2 virtual switches; got 5 physical NICs, you can create 5 virtual switches.
  2. Create one big virtual switch as before, but use port groups. Port groups are the VMware equivalent of VLANs (well, sort of; they do more than just VLANs). They are a way of grouping the virtual ports on the virtual switch such that only the NICs connected to a particular port group can talk to each other. You can create as many port groups as you want (within limits) and assign all your physical NICs to this virtual switch and use VLANs so the traffic flowing out of this virtual switch to the physical switch is on separate networks. Pretty nice stuff!

(In practice, even if you create separate virtual switches you’d still create a port group on each of them – essentially grouping all the ports on that switch into one. That’s because port groups are also used to apply policies to the ports in the group – policies such as security, traffic shaping, and load balancing/NIC teaming of the underlying physical NICs. Below is a screenshot of the options you have with portgroups.)

Example of a Portgroup
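
To make the above concrete, here’s roughly what creating a standard switch with a port group and VLAN looks like from the ESXi shell – the switch, portgroup, NIC, and VLAN ID below are made up:

  # Create a standard virtual switch
  esxcli network vswitch standard add --vswitch-name=vSwitch1

  # Give it an uplink (a physical NIC)
  esxcli network vswitch standard uplink add --uplink-name=vmnic2 --vswitch-name=vSwitch1

  # Create a port group on it and tag it with a VLAN
  esxcli network vswitch standard portgroup add --portgroup-name=Guest-Network --vswitch-name=vSwitch1
  esxcli network vswitch standard portgroup set --portgroup-name=Guest-Network --vlan-id=100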

Now onto standard and distributed switches. In a way both are similar – in that they are both virtual switches – but the difference is that a standard switch exists on and is managed by a host, whereas a distributed switch exists on and is managed by vCenter. You create a distributed switch using vCenter and then go to each host and add its physical NICs to the distributed switch. As with standard switches, you can create portgroups in distributed switches and assign VM virtual NICs to these portgroups.
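
From the host’s side you can see both kinds of switches it participates in (the distributed switch itself is created and managed in vCenter):

  # Standard switches on this host
  esxcli network vswitch standard list

  # Distributed switches this host participates in, with their uplinks
  esxcli network vswitch dvs vmware list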

An interesting thing when it comes to migration (obvious, but I wasn’t sure about this initially): if you have a host with two NICs – one a member of a standard switch and the other of a distributed switch – but both NICs connect to the same physical network (or VLAN), and you have VMs on this host, some on the standard switch and others on the distributed switch, then all these VMs can talk to each other through the underlying physical network. Useful when you want to migrate stuff.

I got side tracked at this point with other topics so I’ll conclude this post here for now.