Yay! (VXLAN) contd. + Notes to self while installing NSX 6.3 (part 2)

In my previous post I said the following (in gray). Here I’d like to add on:

  • A VDS uses VMkernel ports (vmk ports) to carry the host's own traffic. These are virtual ports bound to the physical NICs on an ESXi host, and there can be multiple vmk ports per VDS for various tasks (vMotion, FT, etc.). Similarly, we need to create a new vmk port on each host for VXLAN – this vmk port acts as the host's VTEP. 
    • Unlike regular vmk ports, though, we don't create these or assign IP addresses manually. Instead we either use DHCP or create an IP pool when configuring VXLAN for a cluster. (It is possible to specify a static IP, either via a DHCP reservation or as mentioned in the install guide.)
      • The number of VTEP vmk ports (and hence IP addresses) depends on the teaming policy chosen when configuring VXLAN; with the source ID / source MAC based policies it corresponds to the number of uplinks. So a host with 2 uplinks will have two VTEP vmk ports, hence two IP addresses taken from the pool. Bear that in mind when creating the pool.
    • Each cluster uses one VDS for its VXLAN traffic. This can be a pre-existing VDS – there's nothing special about it except that you point to it when enabling VXLAN on the cluster, and the VTEP vmk port is created on it. NSX automatically creates another portgroup on this VDS, and that's where the vmk port is assigned.
    • VXLANs are created on this VDS – they are basically portgroups in the VDS. Each VXLAN has an ID – the VXLAN Network Identifier (VNI) – which NSX refers to as a segment ID. 
      • Before creating VXLANs we have to allocate a pool of segment IDs (the VNIs), taking into account any VNIs that may already be in use in the environment.
      • The number of segment IDs is also limited by the fact that a single vCenter only supports a maximum of 10,000 portgroups.
      • The web UI only allows us to configure a single segment ID range, but multiple ranges can be configured via the NSX API (see the sketch after this list).
  • Logical Switch == VXLAN (identified by a segment ID / VNI) == portgroup. All of this lives on a VDS. 
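
A couple of commands I've found handy for poking at this. The esxcli VXLAN netstack options only exist on an NSX-prepared host, and the segment ID call is how I read the NSX API guide – the NSX Manager hostname, credentials and the 6000–6999 range below are placeholders, so double-check against the API guide before using any of it:

    # List the vmk ports on the VXLAN netstack of an ESXi host -- these are the
    # VTEPs, with IPs handed out from the pool (or DHCP) configured per cluster.
    esxcli network ip interface list --netstack=vxlan
    esxcli network ip interface ipv4 get --netstack=vxlan

    # Add a second segment ID (VNI) range via the NSX Manager API, since the web
    # UI only exposes one range. Endpoint/body as per my reading of the NSX-v API;
    # nsxmgr.lab.local, the credentials and the range are made up.
    curl -k -u 'admin:VMware1!' -X POST \
      -H 'Content-Type: application/xml' \
      -d '<segmentRange><name>Range-2</name><begin>6000</begin><end>6999</end></segmentRange>' \
      https://nsxmgr.lab.local/api/2.0/vdn/config/segments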

While installing NSX I came across “Transport Zones”.

Remember ESXi hosts are part of a VDS, VXLANs are created on a VDS, and each VXLAN is a portgroup on this VDS. Not all hosts need to be part of the same VXLANs, but since all hosts are part of the same VDS – and hence have visibility to all the VXLANs – we need some way of marking which hosts are part of which VXLAN. We also need some place to specify whether a VXLAN runs in unicast, multicast, or hybrid mode. This is where Transport Zones come in.

If all your VXLANs are going to behave the same way (multicast etc.) and span the same hosts, then you just need one transport zone. Otherwise you would create separate zones based on your requirements. (That said, when you create a Logical Switch/ VXLAN you have the option to specify the control plane mode (multicast mode etc.). I'm guessing that overrides the zone setting, so you don't need to create separate zones just to specify different modes.) 
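
For what it's worth, transport zones can also be created via the NSX API instead of the web UI. Something along these lines – the manager hostname, credentials, zone name and the cluster MoRef (domain-c26) are placeholders, and the exact XML is from my memory of the API guide, so verify it before relying on it:

    # Create a transport zone (the API calls it a "network scope") containing one
    # cluster, with unicast as the zone's default control plane mode.
    curl -k -u 'admin:VMware1!' -X POST \
      -H 'Content-Type: application/xml' \
      -d '<vdnScope>
            <name>TZ-Lab</name>
            <clusters>
              <cluster>
                <cluster>
                  <objectId>domain-c26</objectId>
                </cluster>
              </cluster>
            </clusters>
            <controlPlaneMode>UNICAST_MODE</controlPlaneMode>
          </vdnScope>' \
      https://nsxmgr.lab.local/api/2.0/vdn/scopes

    # Existing zones (and their IDs, vdnscope-N) can be listed with:
    curl -k -u 'admin:VMware1!' https://nsxmgr.lab.local/api/2.0/vdn/scopes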

Note: I keep saying hosts above (last two paragraphs) but that's not correct. It's actually clusters. I keep forgetting, so thought I should note it separately here rather than correct my mistake above. 1) VXLANs are configured on clusters, not hosts. 2) All hosts within a cluster must be connected to a common VDS (at least one common VDS, for VXLAN purposes). 3) NSX Controllers are optional and can be skipped if you are using multicast replication (they are only needed for unicast and hybrid modes). 4) Transport Zones are made up of clusters – i.e. all hosts in a cluster; you cannot pick & choose just some hosts. This makes sense when you remember that a cluster is for HA and DRS, so you wouldn't want to exclude some hosts from where a VM can vMotion to – that would make things difficult. 

Worth keeping in mind: 1) A cluster can belong to multiple transport zones. 2) A logical switch can belong to only one transport zone. 3) A VM cannot be connected to logical switches in different transport zones. 4) A DLR (Distributed Logical Router) cannot connect to logical switches in multiple transport zones. Ditto for an ESG (Edge Services Gateway). 

After creating a transport zone, we can create a Logical Switch. This automatically assigns a segment ID from the pool, and this (finally!!) is your VXLAN. Each logical switch creates yet another portgroup. Once you create a logical switch you can assign VMs to it – that basically changes their portgroup to the one created by the logical switch. Now your VMs have connectivity to each other even if they are on hosts in separate L3 networks. 
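
Logical switches can be created via the API too, and as far as I can tell this is also where the per-switch control plane mode I guessed about earlier lives: if the create spec includes a controlPlaneMode it is used for that switch, otherwise the transport zone's default applies. A rough sketch – vdnscope-1, the switch name, tenant ID and credentials are placeholders, so check the API guide for the exact spec:

    # Create a logical switch (a "virtual wire") inside transport zone vdnscope-1.
    # NSX picks the next free segment ID from the pool and creates the backing
    # portgroup(s) on the VDS(es) of the zone's clusters.
    curl -k -u 'admin:VMware1!' -X POST \
      -H 'Content-Type: application/xml' \
      -d '<virtualWireCreateSpec>
            <name>LS-Web</name>
            <tenantId>virtual wire tenant</tenantId>
            <controlPlaneMode>UNICAST_MODE</controlPlaneMode>
          </virtualWireCreateSpec>' \
      https://nsxmgr.lab.local/api/2.0/vdn/scopes/vdnscope-1/virtualwires
    # The call returns the new switch's ID (virtualwire-N); connecting a VM to the
    # switch is then just a matter of moving its vNIC to the generated portgroup.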

Something I hadn’t realized: 1) Logical Switches are created on Transport Zones. 2) Transport Zones are made up of / can span clusters. 3) Within a cluster the logical switches (VXLANs) are created on the VDS that’s common to the cluster. 4) What I hadn’t realized was this: nowhere in the previous statements is it implied that transport zones are limited to a single VDS. So if a transport zone is made up of multiple clusters, each/some of which have their own common VDS, any logical switch I create will be created on all these VDSes.  
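
A way to see this for yourself from the ESXi shell – the vxlan namespace under esxcli comes from the NSX VIBs, so it only exists on prepared hosts, "Compute_VDS" is a placeholder, and the syntax is as far as I remember it:

    # On a host in each cluster of the transport zone: show the VDS-level VXLAN
    # config (VTEP count, MTU, segment ID, etc.)...
    esxcli network vswitch dvs vmware vxlan list

    # ...and the VXLAN networks (logical switches) instantiated on that host's VDS.
    # Run this on hosts in different clusters and the same VNI shows up on each
    # cluster's VDS.
    esxcli network vswitch dvs vmware vxlan network list --vds-name=Compute_VDS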

Sadly, I don’t feel like saying yay at this point, unlike before. I am too tired. :(

Which also brings me to the question of how I got this working with VMware Workstation. 

By default VMware Workstation emulates an e1000 NIC in the VMs, and this doesn’t support an MTU larger than 1500 bytes. We can edit the .vmx file of a VM and replace “e1000” with “vmxnet3”, which swaps the emulated Intel 82545EM Gigabit Ethernet NIC for a paravirtual VMXNET3 NIC. That NIC supports an MTU larger than 1500 bytes, and VXLAN will begin working. One thing though: a quick way of testing whether the VTEP VMkernel NICs can talk to each other with a larger MTU is a command such as ping ++netstack=vxlan -I vmk3 -d -s 1600 xxx.xxx.xxx.xxx. If you do this once you add a VMXNET3 NIC though, it crashes the ESXi host. I don’t know why. It only crashes when using the VXLAN network stack; the same command with any other VMkernel NIC works fine (so I know the MTU part is OK). Also, when testing the Logical Switch connectivity via the Web UI (see example here) there’s no crash with a VXLAN standard test packet – maybe that doesn’t use the VXLAN network stack? I spent a fair bit of time chasing after the ping ++netstack command until I realized that even though it was crashing my host the VXLAN was actually working!
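
For reference, this is roughly what the change looks like. The VM (i.e. the nested ESXi host) has to be powered off while editing the .vmx, and ethernet0 / vmk3 / the target IP below are placeholders for whichever NIC and VTEP you are testing:

    # In the nested ESXi VM's .vmx file, swap the emulated e1000 for vmxnet3:
    #   before: ethernet0.virtualDev = "e1000"
    ethernet0.virtualDev = "vmxnet3"

    # MTU test from the ESXi shell using the VXLAN netstack (this is the command
    # that crashed my hosts). Note -d sets don't-fragment and -s is the payload
    # size, so to stay within the usual 1600-byte VXLAN MTU the payload has to
    # leave room for the 20-byte IP + 8-byte ICMP headers, i.e. -s 1572.
    ping ++netstack=vxlan -I vmk3 -d -s 1572 192.168.100.52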

Before I conclude, a hat-tip to this post for the Web UI test method and also for generally writing up how the author set up his NSX test lab. That’s an example of how to post something like this properly, instead of the stream of thoughts my last few posts have been. :)

Update: Short-lived happiness. The next step was to create an Edge Services Gateway (ESG), and there I bumped into the MTU issues again. And this time when I ran the test via the Web UI it failed and crashed the hosts. Disappointed, I decided it was time to move on from VMware Workstation. :-/

Update 2: Continued here …