Subscribe via Email

Subscribe via RSS


Creative Commons Attribution 4.0 International License
© Rakhesh Sasidharan

vSphere Distributed Switches are Layer 2 devices (doh!)

This is a very basic post. Was trying to read up on NSX and before I could appreciate it I wanted to go down and explore how things are without NSX so I can better understand what NSX is trying to do. I wanted to put it down in writing as I spent some time on it, but there’s nothing new or grand here.

So. vSphere Distributed Switches (VDS). These are Layer 2 switches that exist on each ESX host and which contain port groups that you can connect VMs running on a host onto. In case it wasn’t obvious from the name “switch”, these are Layer 2. Which means that all the hosts connecting to a particular Distributed Switch must be on the same Layer 2. Once you create a Distributed Switch and add ESXi hosts and their physical NICs to it, you can create VMKernel ports for Management, vMotion, Fault Tolerance, etc but these VMKernel ports aren’t used by the port groups you create on the Distributed Switch. The port groups are just like Layer 2 switches – they communicate via broadcasting using the underlying physical NICs that are assigned to the Distributed Switch; but since there’s no IP address as such assigned to a port group there’s no routing involved. (This is an obvious point but I keep forgetting it).

For example say you have two ESX hosts – HostA and HostB – and these are on two separate physical networks (i.e. separated by a router). You create a new Distributed Switch comprising of a physical NIC each from each host. Then you make a port group on this switch and put VM-A on HostA and VM-B on HostB. When creating the Distributed Switch and adding physical NICs to it, VMware doesn’t care if the physical NICs aren’t in the same Layer 2 domain. It will happily add the NICs but when you try to send traffic from VM-A to VM-B it will fail. That’s because when VM-A tries to communicate with VM-B (let’s assume these two VMs know each others MAC address so there’s no need for ARP communication first), VM-A will send Ethernet frames to the Distributed Switch on HostA who will then broadcast it to the Layer 2 network its physical NIC assigned to the Distributed Switch is connected to. Since these broadcasted frames won’t reach the physical NIC of HostB the VM-B there never sees it, and so the two VMs cannot communicate with each other. 

So – keep in mind that all physical NICs connecting to the same Distributed Switch must be on the same Layer 2. If the underlying physical NICs are on separate Layer 3 networks, and these Layer 3 networks have connectivity to each other, it doesn’t matter – the VMs in the port groups will not be able to communicate. 

And this is where NSX comes in. Using the concept of VXLANs, NSX stretches a Layer 2 network across Layer 3. Basically it encapsulates the Layer 2 traffic within Layer 3 packets and gives the illusion of all VMs being on the same Layer 2 network – but this illusion is what Network Virtualization if all about, right? :) VXLAN is an overlay

VXLAN encapsulates Layer 2 frames in UDP packets. The VXLAN is like a tunnel to which all the hosts connecting to this VXLAN hook into. On each host there’s something called a Virtual Tunnel End Point (VTEP) which is the “thing” that actually hooks into the VXLAN. If a VXLAN is a Distributed Switch made up of physical NICs from the host, the VTEP is the VMKernel ports of this Distributed Switch that do the actual communication (like how vMotion traffic between two hosts happens via the VMKernel ports you assign for vMotion). In fact, during an NSX install you install three VIBs on the ESXi hosts – one of these enhances the existing Distributed Switch with VXLAN capabilities (the encapsulation stuff  I mentioned above). 

Once you have NSX you can create multiple Logical Switches. These are basically VXLAN switches that operate like Layer 2 switches but can actually stretch multiple Layer 3 networks. Logical Switches are overlay switches. ;o) Each Logical Switch corresponds to one VXLAN. 

ps. VXLAN is one of the cool features of NSX. The other cool features are the Distributed Logical Router (DLR) and the Distributed Firewall (DFW). I mentioned that a ESXi host has 3 VIBs installed as part of NSX, and that one of them is VXLAN functionality? Well the other two are DLR and DFW (god, so many acronyms!). Prior to DLR if an ESXi host had two VMs connected to different Distributed Switches, and if these two hosts wanted to talk to each other, the traffic would go down from one of the VMs, to the host, to the underlying physical network router, and back to the host and up to the VM on the other Distributed Switch. But with DLR, the ESXi hypervisor kernel can do Layer 3 routing too, so it will simply send traffic directly to the VM in the other Distributed Switch. 

Similarly, DFW just means each ESXi hypervisor can also apply firewall rules to the packets, so you don’t need one centralized firewall place any more. You simply create rules and push it out to the ESXi hosts and they can do firewalling between VMs. Everything is virtual! :)

pps. Some other jargon. East-West traffic means network traffic that’s usually within or between servers (ESXi hosts in our case). North-South traffic means any other network traffic – basically, traffic that goes out of this layer of ESXi hosts. With NSX you try and have more traffic East-West rather than North-South. 

vMotion NIC load balancing fails even though there is an active link

The other day I blogged about how I had a host whose vMotion VMkernel interface seemed to be broken. Any vMotion attempts to it would hang at 14%.

At that time I logged on to the destination host, then used vmkping with the -I switch (to explicitly specify the vMotion VMkernel interface of the destination host), and found that I couldn’t ping the VMkernel interface of the other hosts. These hosts could ping each other but couldn’t ping the destination host.

The VMKernel interface is backed by two physical NICs. I found that if I remove one of the physical NICs from the VMkernel it works. Interestingly this link wasn’t showing any CDP info either, so it looked like something was wrong with it (the physical NIC shows as unclaimed coz the screenshot was taken after I moved it to unclaimed).

Missing CDP infoSo the first question is why did the VMkernel fail when only one of the physical NICs failed? Since the other physical NIC backing the VMkernel NIC is still active shouldn’t it have continued working?

The reason why it failed is that by default network failover detection is via “Link status only”. This only detects failures to the link – like say the cable is broken, the switch is down, or the NIC has failed – while failures such as the link being connected but blocked by switch are not detected. In my case as you can see from the screenshot above the link status is connected – so the host doesn’t consider the link failed even though it isn’t actually working, thus continues to use it.

Next I discovered that other hosts too similarly had their second vMotion physical NIC in a failed state as above yet they weren’t failing like this host. The simple explanation for this is that the host above somehow selected the faulty physical NIC as the one to use, didn’t detect it as failed and so continued to use it; whereas other hosts were more lucky and chose the physical NIC that works alright, so didn’t have any issues.

I am not sure that’s the entire answer though. For once the host that failed was ESXi 5.5 and using a distributed switch, while the other two hosts were ESXi 4.0 and using standard switches. Did that make a difference?

The default load balancing method for both standard and distributed switches is the same. (For a standard switch you check this under the vSwitch properties on the host. For a distributed switch you check this under the portgroup in the Networking section of vSphere (web) client).

default load balancingLoad balancing is what I am concerned about here because that’s what the hosts should be using to balance between both the NICs. That’s what the host will be using to select the physical NIC to use for that particular traffic flow. The load balancing method is same between standard and distributed switches yet why were the distributed switch/ ESXi 5.5 hosts behaving differently?

I am still not sure of an answer but I have my theory. My theory is that since a distributed switch is across multiple hosts the load balancing method (above) of choosing a route based on virtual port ID comes into play. Here’s screenshots from two of my hosts connected to the same distributed switch port group for instance:

port numberAs you can see the virtual port number is different for the VMkernel NIC of each host. So each host could potentially use a different underlying physical NIC depending on how the load balancing algorithm maps it.

But what about a standard switch? Since the standard switch is only on the host, and the only VMkernel NIC connected to it (in the case of vMotion) is the single VMKernel NIC I have assigned for vMotion, there is no load balancing algorithm coming into play! If, instead of a VMkernel I had a Virtual Machine network, then the virtual port number matters because there are multiple VMs connecting to the various port numbers; but that doesn’t matter for VMkernel NICs as there is only one of them. And so my theory is that for a VMkernel NIC (such as vMotion) backed by multiple physical NICs and using the default load balancing algorithm of virtual port ID – all traffic by default goes into one of the physical NICs and the other physical NIC is never used unless the chosen one fails. And that is why my hosts using the standard switches were always using the same physical NIC (am guessing the lower numbered one as that’s what both hosts chose) while hosts using distributed switches would have chosen different physical NICs per host.

That’s all! Just thought I’d put this out there in case anyone else has the same question.

Misc ESXI/ vSphere stuff

Just some notes to myself so I can refer to this later.

  • You can only have a maximum of 256 VMFS datastores per ESXI host. (This is one reason why you wouldn’t want to create a LUN/ datastore per VM. Wouldn’t work if you have a lot of VMs!)
    • Other maximums (for vSphere 5.5) can be found at this link.
  • When you create distributed switch port group there are 3 port binding options:
    • Static Binding (the default): VM NICs are connected to the port group at VM creation and remain so until the VM is removed from the port group. Power off a VM or disconnecting the NIC from the port group does not remove it from the port group – the port is still kept aside for the VM. What this means is that once you connect a VM to a port it stays with that forever.
      • Since the ports are assigned at VM creation, even if vCenter is down when the VM later powers on/ connects to the port group, it will continue to have network connectivity. (Note the emphasis on “later”. If the VM were already running and vCenter were to go down network traffic isn’t affected in either of the binding options).
    • Dynamic Binding (deprecated): VM NICs are connected to the port group only when the VM is powered on and its NIC connected to the port group. Power off the VM or disconnect the NIC and it is not longer connected to the same port when it comes back on or is reconnected.
      • Since the port binding happens only when the VM is powered on or connected, and the port group resides with vCenter, what this means is that you can only power on / off such VMs via vCenter. If vCenter is off / unreachable when the VM powers on / connects, it will not have network connectivity as it won’t have a port in the port group. (As above, note that this doesn’t affect VMs that are already running).
      • Dynamic Binding is deprecated but is useful when the number of VMs is larger than the number of ports in the port group and not all VMs will be on / connected at the same time.
    • Ephemeral Binding: Similar to Distributed Binding, VM NICs are connected to the port group only when the VM is powered on and its NIC connected to the port group. Powering off the VM or disconnecting it results in the port being removed from the port group. 
      • Although Dynamic and Ephemeral Bindings seem similar, they don’t have similar limitations. Thus while VMs with Dynamic Binding port groups won’t have network connectivity if they are powered on / connected when vCenter is off / unreachable, VMs with Ephemeral Binding have no such limitation. They don’t get a proper port number from the port group, but get a temporary one like h-1 which changes to a proper port number whenever connectivity with vCenter is restored.
      • Below screenshot shows the port numbers of three VMs, each connected to a port group of different binding (Ephemeral, Dynamic, Standard from top to bottom) and powered on when the vCenter was unreachable. Bindings
      • Although the NIC is unable to get a port – like Dynamic Binding – with an Ephemeral Binding port group the host creates a fake port and connects the VM anyway. 
      • I don’t understand why Dynamic Binding even exists as an option – unless it’s for backward compatibility? Ephemeral Binding seems to have the advantage of Dynamic Binding – ports are created at VM connection / powering on and so you can oversubscribe to a port group – but doesn’t have the disadvantage of lost connectivity when vCenter is off / unreachable. (I assume Ephemeral port groups too can be used for over subscribing, though the official KB articles don’t say anything like this so I could be wrong).
      • Dynamically creating / removing ports from the port group is an expensive operation so Dynamic and Ephemeral Binding port groups have a performance overhead. Static Binding is the preferred one.
      • Also, Ephemeral Binding port groups lose their history and security controls across host reboots. Apparently Dynamic Binding port groups don’t do this as I don’t see any mention of this as a Dynamic Binding limitation anywhere.

That’s all for now!


A very brief intro to Port Groups, Standard and Distributed switches

A year ago I went for VMware training but never got around to using it at work. Now I am finally into it, but I’ve forgotten most of the concepts. And that sucks!

So I am slowly re-learning things as I go along. I am in this weird state where I sort of remember bits and pieces from last year but at the same time I don’t really remember them.

What I have been reading about these past few days (or rather, trying to read these past few days) is networking. The end goal is distributed switches but for now I am starting with the basics. And since I like to blog these things as I go along, here we go.

You have a host. The server that runs ESXi (hypervisor).

This host has physical NICs. Hopefully oodles of them, all connected to your network.

This server runs virtual machines (a.k.a guests). These guests see virtual NICs that don’t really exist except in software, exposed by ESXi.

What you need is for all these virtual NICs to be able to talk to each other (if needed) as well as talk to the outside world (via the physical NICs and they switches they connect to).

You could create one big virtual switch and connect all the physical and virtual NICs to it. (This virtual switch is again something which does not physically exist). All the guests can thus talk to each other (as they are on the same switch) and also talk to the outside world (because the virtual switch is connected to the outside world via whatever it is connected to).

But maybe you don’t want all the virtual NICs to be able to talk to each other. You want little networks in there – a la VLANs – to isolate certain traffic from other. There’s two options here:

  1. Create separate virtual switches for each network, and assign some virtual NICs to some switches. The physical NICs that connect to these virtual switches will connect to separate physical switches so you are really limited in the number of virtual switches you have by the number of physical NICs you have. Got 2 physical NICs, you can create 2 virtual switches; got 5 physical NICs, you can create 5 virtual switches.
  2. Create one big virtual switch as before, but use port groups. Port groups are the VMware equivalent of VLANs (well, sort of; they do more than just VLANs). They are a way of grouping the virtual ports on the virtual switch such that only the NICs connected to a particular port group can talk to each other. You can create as many port groups as you want (within limits) and assign all your physical NICs to this virtual switch and use VLANs so the traffic flowing out of this virtual switch to the physical switch is on separate networks. Pretty nice stuff!

(In practice, even if you create separate virtual switches you’d still create a port group on that – essentially grouping all the ports on that switch into one. That’s because port groups are used to also apply policies to the ports in the group. Policies such as security, traffic shaping, and load balancing/ NIC teaming of the underlying physical NICs. Below is a screenshot of the options you have with portgroups).

Example of a Portgroup

Now onto standard and distributed switches. In a way both are similar – in that they are both virtual switches – but the difference is that a standard switch exists on & is managed by a host whereas a distributed switch exists on & is managed by vCenter. You create a distributed switch using vCenter and then you go to each host and add its physical NICs to the distributed switch. As with standard switches you create can portgroups in distributed switches and assign VM virtual NICs to these portgroups.

An interesting thing when it comes to migration (obvious but I wasn’t sure about this initially) is that if you have a host with two NICs – one of which is a member of a standard switch and the other of a distributed switch – but both NICs connect to the same physical network (or VLAN), and you have VMs in this host some of which are on the standard switch and others are on the distributed switch, all these VMs can talk to each other through the underlying physical network. Useful when you want to migrate stuff.

I got side tracked at this point with other topics so I’ll conclude this post here for now.