Was migrating one of our offices to a new IP scheme the other day and vMotion started failing. I had a good idea what the problem could be (coz I encountered something similar a few days ago in another context) so here’s a blog post detailing what I did.
For simplicity let’s say the hosts have two VMkernel NICs – vmk0 and vmk1. vmk0 is connected to the Management Network. vmk1 is for vMotion. Both are on separate VLANs.
When our network admins gave out the new IPs they gave IPs from the same range for both functions. That is, for example, vmk0 had an IP of 10.20.1.2/24 (and 10.20.1.3/24 and 10.20.1.4/24 on the other hosts) and vmk1 had an IP of 10.20.1.12/24 (and 10.20.1.13/24 and 10.20.1.14/24 on the other hosts).
Since both interfaces are on separate VLANs (basically separate LANs) the above setup won’t work. That’s because as far as the hosts are concerned both interfaces are on the same network yet physically they are on separate networks. Here’s the routing table on the hosts:
    ~ # esxcli network ip route ipv4 list
    Network      Netmask        Gateway      Interface  Source
    -----------  -------------  -----------  ---------  ------
    default      0.0.0.0        10.20.1.254  vmk0       MANUAL
    10.20.1.0    255.255.255.0  0.0.0.0      vmk0       MANUAL
Notice that any traffic to the 10.20.1.0/24 network goes via vmk0. That includes the vMotion traffic, because the vMotion IPs are in the same network! And since vmk0 sits on a different VLAN (physically a separate network) from the vMotion interfaces of the other hosts, this traffic never reaches them.
So even though you have designated vmk1 as your vMotion NIC, it never gets used for outgoing traffic because the routing table sends everything for that subnet via vmk0.
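To make the route selection concrete, here is a rough Python sketch of a longest-prefix-match lookup over the routing table shown earlier. It is a simplified model and not how ESXi implements routing internally; the `egress_interface` helper, the CIDR form of the routes, and the 10.20.2.0/24 subnet in the last example are my own assumptions for illustration.

```python
import ipaddress

# The two rows from the esxcli output, rewritten in CIDR form.
ROUTES = [
    ("0.0.0.0/0", "vmk0"),     # the "default" row (gateway 10.20.1.254)
    ("10.20.1.0/24", "vmk0"),  # the directly connected network
]


def egress_interface(destination, routes=ROUTES):
    """Return the interface of the most specific route matching destination."""
    dest = ipaddress.ip_address(destination)
    best = None  # (network, interface) of the longest matching prefix so far
    for cidr, interface in routes:
        net = ipaddress.ip_network(cidr)
        if dest in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, interface)
    return best[1] if best else None


# A vMotion peer in the shared subnet is reached via vmk0, not vmk1.
print(egress_interface("10.20.1.13"))  # vmk0

# Hypothetical: with vMotion on its own subnet, vmk1 would get its own route.
separate = ROUTES + [("10.20.2.0/24", "vmk1")]
print(egress_interface("10.20.2.13", separate))  # vmk1
```

Nothing in the table mentions vmk1, so no destination can ever be matched to it unless the interface is forced explicitly.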
If you force the outgoing traffic to use vmk1 specifically, it works. Below are the results of vmkping using the default route vs explicitly using vmk1:
    ~ # vmkping 10.20.1.13
    PING 10.20.1.13 (10.20.1.13): 56 data bytes

    --- 10.20.1.13 ping statistics ---
    2 packets transmitted, 0 packets received, 100% packet loss
    ~ # vmkping -I vmk1 10.20.1.13
    PING 10.20.1.13 (10.20.1.13): 56 data bytes
    64 bytes from 10.20.1.13: icmp_seq=0 ttl=64 time=0.433 ms
    64 bytes from 10.20.1.13: icmp_seq=1 ttl=64 time=0.198 ms

    --- 10.20.1.13 ping statistics ---
    2 packets transmitted, 2 packets received, 0% packet loss
    round-trip min/avg/max = 0.198/0.316/0.433 ms
The solution here is to either remove the VLANs and continue with the existing IP scheme, or to keep using VLANs but assign a different IP network for the vMotion interfaces.
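If you want to sanity check an addressing plan before rolling it out, something like the following Python sketch would flag the problem. It simply looks for VMkernel interfaces that sit on different VLANs but share a subnet. The `find_conflicts` helper, the VLAN IDs 10 and 20, and the 10.20.2.0/24 range are made up for illustration; they are not from our actual setup.

```python
import ipaddress
from itertools import combinations


def find_conflicts(interfaces):
    """Return pairs of interfaces on different VLANs whose subnets overlap.

    interfaces maps an interface name to (vlan_id, "address/prefix").
    """
    conflicts = []
    for (name_a, (vlan_a, addr_a)), (name_b, (vlan_b, addr_b)) in combinations(
        sorted(interfaces.items()), 2
    ):
        net_a = ipaddress.ip_interface(addr_a).network
        net_b = ipaddress.ip_interface(addr_b).network
        if vlan_a != vlan_b and net_a.overlaps(net_b):
            conflicts.append((name_a, name_b))
    return conflicts


# The scheme we were handed: same /24 on two VLANs (VLAN IDs are hypothetical).
broken = {"vmk0": (10, "10.20.1.2/24"), "vmk1": (20, "10.20.1.12/24")}

# One possible fix: a separate (hypothetical) subnet for vMotion.
fixed = {"vmk0": (10, "10.20.1.2/24"), "vmk1": (20, "10.20.2.12/24")}

print(find_conflicts(broken))  # [('vmk0', 'vmk1')]
print(find_conflicts(fixed))   # []
```

The other fix, removing the VLANs, also passes this check because both interfaces would then share a VLAN ID.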
Update: Came across the following from this blog post while searching for something else:
If the management network (actually the first VMkernel NIC) and the vMotion network share the same subnet (same IP-range) vMotion sends traffic across the network attached to first VMkernel NIC. It does not matter if you create a vMotion network on a different standard switch or distributed switch or assign different NICs to it, vMotion will default to the first VMkernel NIC if same IP-range/subnet is detected.
Please be aware that this behavior is only applicable to traffic that is sent by the source host. The destination host receives incoming vMotion traffic on the vMotion network!
That answered another question I had but didn’t blog about in my post above. You see, my network admins had also set the iSCSI networks to be in the same subnet as the management network (but on separate VLANs), yet the iSCSI traffic was correctly flowing over its own VLAN instead of defaulting to the management VMkernel NIC. Now I understand why! It’s only vMotion that defaults to the first VMkernel NIC when that NIC is in the same IP range/subnet as the vMotion NIC.