NSX Firewall not working on Layer 3; OpenBSD VMware Tools; IP Discovery, etc.

I have two security groups: Network 1 VMs (a group that contains my VMs in the 192.168.1.0/24 network) and Network 2 VMs (similar, for the 192.168.2.0/24 network). 

Both are dynamic groups. I select members based on whether the VM name contains -n1 or -n2. (The whole exercise is just for fun/ getting to know this stuff). 

I have two firewall rules making use of these groups: one at Layer 2 and one at Layer 3. 

The Layer 2 rule works but the Layer 3 one does not! Weird. 

I decided to troubleshoot this via the command line. Figured it would be a good opportunity.

To troubleshoot I have to check the rules on the hosts (because remember, that’s where the firewall is; it’s a kernel module in each host). For that I need to get the host-id, and for that I first need to get the cluster-id. Sadly there’s no command to list all hosts directly (or at least I don’t know of any). 
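From the NSX Manager central CLI it’s roughly a two-step affair – list the clusters to get the cluster-ids, then list a cluster to get its host-ids. Something along these lines (the prompt and IDs are placeholders from my lab, and I’m going from memory so the exact syntax may vary by NSX version):

```
nsxmgr> show cluster all
nsxmgr> show cluster domain-c33
```

The first command lists the clusters and their cluster-ids; the second lists the hosts (and their host-ids) in that cluster.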

So now I have my host-ids.

Let’s also take a look at my VMs (thankfully it’s a short list! I wonder how admins do this in real life):
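Again from the central CLI, something like this (the host and VM IDs are placeholders):

```
nsxmgr> show host host-28
nsxmgr> show vm vm-120
```

The first lists the VMs (and their vm-ids) on a host; the second shows a VM’s vNICs and the firewall filters attached to them.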

We can see the filters applied to each VM. To summarize:

And are these filters applied on the hosts themselves?
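From the central CLI that’s something like the following (or you can run summarize-dvfilter directly on the host itself); the host-id is again a placeholder:

```
nsxmgr> show dfw host host-28 summarize-dvfilter
```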

Hmm, that too looks fine. 

Next I picked up one of the rule sets and explored it further:
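Something along these lines, where the filter name is a placeholder picked from the summarize-dvfilter output:

```
nsxmgr> show dfw host host-28 filter nic-38549-eth0-vmware-sfw.2 rules
```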

The Layer 3 & Layer 2 rules are in separate rule sets. I have marked the ones which I am interested in. One works, the other doesn’t. So I checked the address sets used by both:
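Same command as above, but asking for the address sets instead of the rules:

```
nsxmgr> show dfw host host-28 filter nic-38549-eth0-vmware-sfw.2 addrsets
```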

Tada! And there we have the problem. The address set for the Layer 3 rule is empty. 

I checked this for the other rules too – same situation. I modified my Layer 3 rule to specifically target the subnets:

And the address set for that rule is not empty:

And because of this the firewall rules do work as expected. Hmm.

I modified this rule to be a group with my OpenBSD VMs from each network explicitly added to it (i.e. not dynamic membership in case that was causing an issue). But nope, same result – empty address set!

But the address set is still empty. :o)

So now I have an idea of the problem. I am not too surprised by this because I vaguely remember reading something about VMware Tools and IP detection inside a VM (i.e. NSX makes use of VMware Tools to know the IP address of a VM) and also because I am aware OpenBSD does not use the official VMware Tools package (it has its own and that only provides a subset of functions).

Googling a bit on this topic I came across the IP address Discovery section in the NSX Admin guide – prior to NSX 6.2, if VMware Tools wasn’t installed (or was stopped) NSX wouldn’t be able to detect the IP address of the VM. From NSX 6.2 onwards it can do DHCP & ARP snooping to work around a missing/ stopped VMware Tools. We configure the latter in the host installation page:

I am going to go ahead and enable both on all my clusters. 

That helped. But it needs time. Initially the address set was empty. I started pings from one VM to another and the source VM’s IP was discovered and put in the address set; but since the destination VM wasn’t in the list, traffic was still being allowed. I stopped pings, started pings, waited a while … tried again … and by then the second VM’s IP too was discovered and put in the address set – effectively blocking communication between them. 

Side by side I installed a Windows 8.1 VM with VMware Tools etc and tested to see if it was being automatically picked up (I did this before enabling the snooping above). It was. In fact its IPv6 address too was discovered via VMware Tools and added to the list:

Nice! Picked up something interesting today. 

Notes to self while installing NSX 6.3 (part 4)

Reading through the VMware NSX 6.3 Install Guide after having installed the DLR and ESG in my home lab. Continuing from the DLR section.

As I had mentioned earlier NSX provides routing via DLR or ESG.  

  • DLR == Distributed Logical Router.
  • ESG == Edge Services Gateway.

A DLR consists of an appliance that provides the control plane functionality. This appliance does not do any routing itself. The actual routing is done by the VIBs on the ESXi hosts. The appliance uses the NSX Controller to push out updates to the ESXi hosts. (Note: this applies only to the DLR; the ESG does not depend on the Controller to push out routes). A couple of points to keep in mind:

  • A DLR instance cannot connect to logical switches in different transport zones. 
  • A DLR cannot connect to a dvPortgroup with VLAN ID 0.
  • A DLR cannot connect to a dvPortgroup with a VLAN ID if that DLR also connects to logical switches spanning more than one VDS. 
    • This confused me. Why would a logical switch span more than one VDS? I dunno. There are probably reasons, the same way you could have multiple clusters in the same data center with different VDSes instead of sharing one. 
  • If you have portgroups on different VDSes with the same VLAN ID, and these VDSes share some hosts, then DLR cannot connect these. 

I am not entirely clear on the above points. They seem to be about enforcing that transport zones and logical switches align correctly, but I haven’t entirely understood it, so I am simply going to make a note as above and move on …

In a DLR the firewall rules only apply to the uplink interface and are limited to traffic destined for the edge virtual appliance. In other words they don’t apply to traffic between the logical switches a DLR instance connects to. (Note that this refers to the firewall settings found under the DLR section, not in the Firewall section of NSX). 

A DLR has many interfaces. The ones exposed to VMs for routing are the Logical InterFaces (LIFs). Here’s a screenshot of the interfaces on my DLR. 

The ones of type ‘Internal’ are the LIFs. These are the interfaces that the DLR will route between. Each LIF connects to a separate network – in my case a logical switch each. The IP address assigned to a LIF will be the address you set as the gateway for the devices in that network. So for example: one of the LIFs has the IP address 192.168.1.253 and connects to my 192.168.1.0/24 segment. All the VMs there will have 192.168.1.253 as their default gateway. Suppose we ignore the ‘Uplink’ interface for now (it’s optional; I created it for external routing to work), and all our DLR had were the two ‘Internal’ LIFs, and the VMs on each side had the respective LIF IP address set as their default gateway – then our DLR would route between these two networks. 

Unlike a physical router though, which exists outside the virtual network and which you can point to as “here’s my router”, there’s no such concept with DLRs. The DLR isn’t a VM you can point to as your router. Nor is it a VM to which packets between these networks (logical switches) are sent for routing. The DLR, as mentioned above, is simply your ESXi hosts. Each ESXi host that has logical switches a DLR connects into gets the LIFs created on it, with the LIF IP addresses and a virtual MAC assigned, so VMs can send packets to them. The DLR is your ESXi host. (That is pretty cool, isn’t it! I shouldn’t be amazed because I had mentioned it earlier when reading about all this, but it is still cool to actually “see” it now that I have implemented it).

The screenshot above is of my two VMs on the same VXLAN but on different hosts. Note that the default gateway (192.168.1.253) MAC is the same for both. Each of their hosts will respond to this MAC entry. 

(Note to self: Need to explore the net-vdr command sometime. Came across it as I was Googling on how to find the MAC address table seen by the LIF on a host. Didn’t want to get side-tracked so didn’t explore too much. There’s something called a VDR (not encountered it yet in my readings).

  • net-vdr -I -l will list all the VDRs on a host.
  • net-vdr -L -l <vdrname> will list the LIFs.
  • net-vdr -N -l <vdrname> will list the MAC addresses (ARP info)

)

When creating a DLR it is possible to create it with or without the appliance. Remember that the appliance provides the control plane functionality. It is the appliance that learns of new routes etc. and pushes them to the DLR modules in the ESXi hosts. Without an appliance the DLR modules will only do static routing (which might be more than enough, especially in a test environment like my nested lab) so it is ok to skip it if your requirements are such. Adding an appliance means you get to (a) select if it is deployed in HA config (i.e. two appliances), (b) their locations etc., (c) the IP address and such for the appliance, as well as enable SSH. The appliance is connected to a different interface for HA and SSH – this is independent of the LIFs or Uplink interfaces. That interface isn’t used for any routing. 

Apart from the control plane, the appliance also controls the firewall on the DLR. If there’s no appliance you can’t make any firewall changes to the DLR – makes sense coz there’s nothing to change. You won’t be connecting to the DLR for SSH or anything coz you do that to the appliance on the HA interface. 

According to the docs you can’t add an appliance once a DLR instance is deployed. Not sure about that as I do see an option to deploy an appliance on my non-appliance DLR instance. Maybe it will fail when I actually try and create the appliance – I didn’t bother trying. 

Discovered this blog post while Googling for something. I’ve encountered & linked to his posts previously too. He has a lot of screenshots and step-by-step instructions, so it’s worth checking out if you want to see some screenshots and a much better explanation than mine. :) Came across some commands from his blog which can be run on the NSX Controller to see the DLRs it is aware of and their interfaces. Pasting the output from my lab here for now; I will have to explore this later …
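The commands themselves are along these lines, run on one of the NSX Controllers (the logical router instance ID below is a placeholder; it comes from the output of the first command):

```
nsx-controller # show control-cluster logical-routers instance all
nsx-controller # show control-cluster logical-routers interface-summary 0x570d4555
```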

I have two DLRs. One has an appliance, the other doesn’t. I made these two, and a bunch of logical switches to hook them to, to see if there’s any difference in functionality or options.

One thing I realized as part of this exercise is that a particular logical switch can only connect to one DLR. Initially I had one DLR which connected to 192.168.1.0/24 and 192.168.2.0/24. Its uplink was on the 192.168.0.0/24 logical switch, which is where the ESG hooked into too. Later, when I made one more DLR with its own internal links and tried to connect its uplink to the 192.168.0.0/24 network used by the previous DLR, I saw that it didn’t even appear in the list of options. That’s when I realized it’s better to use a smaller-range logical switch for the uplinks – say a /30 network. This way each DLR instance connects to an ESG on its own /30 logical switch (as in the output above). 

A DLR can have up to 8 uplink interfaces and 1000 internal interfaces.


Moving on to ESG. This is a virtual appliance. While a DLR provides East-West routing (i.e. within the virtual environment), an ESG provides North-South routing (i.e. out of the virtual environment). The ESG also provides services such as DHCP, NAT, VPN, and Load Balancing. (Note to self: DLR does not provide DHCP or Load Balancing as one might expect (at least I did! :p). DLR provides DHCP Relay though). 

The uplink of an ESG will be on a VDS (Distributed Switch) portgroup, as that’s what eventually connects an ESXi environment to the physical network. 

An ESG needs an appliance to be deployed. You can enable/ disable SSH into this appliance. If enabled you can SSH into the ESG appliance from the uplink address or from any of the internal link IP addresses. In contrast, you can only SSH into a DLR instance if it has an associated appliance. Even then, you cannot SSH into the appliance from the internal LIFs (coz these don’t really exist, remember … they are on each ESXi host). With a DLR we have to SSH into the interface used for HA (this can be used even if there’s only one appliance and hence no HA). 

When deploying an ESG appliance HA can be enabled. This deploys two appliances in an active/passive mode (and the two appliances will be on separate hosts). These two appliances will talk to each other to keep in sync via one of the internal interfaces (we can specify one, or NSX will just choose any). On this internal interface the appliances will have a link local IP address (a /30 subnet from 169.254.0.0/16) and communicate over that (doesn’t matter that there’s some other IP range actually used in that segment, as these are link local addresses and unlikely anyone’s going to actually use them). In contrast, if a DLR appliance is deployed with HA we need to specify a separate network from the networks that it will be routing between. This can be a logical switch or a DVS, and as with ESG the two appliances will have link local IP addresses (a /30 subnet from 169.254.0.0/16) for communication. Optionally, we can specify an IP address in this network via which we can SSH into the DLR appliance (this IP address will not be used for HA, however).

After setting up all this, I also created two NAT rules just for kicks. 

And with that my basic setup of NSX is complete! (I skipped OSPF as I don’t think I will be using it any time soon in my immediate line of work; and if I ever need to I can come back to it later). Next I need to explore firewalls (micro-segmentation) and possibly load balancing etc … and generally fiddle around with this stuff. I’ve also got to start figuring out the troubleshooting and command-line stuff. But the base is done – I hope!

Yay! (VXLAN) contd. + Notes to self while installing NSX 6.3 (part 3)

Finally continuing with my NSX adventures … some two weeks have passed since my last post. During this time I moved everything from VMware Workstation to ESXi. 

Initially I tried doing a lift and shift from Workstation to ESXi. Actually, initially I went with ESXi 6.5 and that kept crashing. Then I learnt it was because I was using the HPE customized version of ESXi 6.5, and since the server model I was using isn’t supported by ESXi 6.5 it has a tendency to PSOD. Strangely, the non-HPE customized version has no such issues. After trying the HPE version and failing a couple of times, I gave up and went to ESXi 5.5. Set it up, tried exporting from VMware Workstation to ESXi 5.5, and that failed as the VM hardware level on Workstation was newer than what ESXi 5.5 supports. 

Not an issue – I fired up VMware Converter and converted each VM from Workstation to ESXi. 

Then I thought hmm, maybe the MAC addresses will change and that will cause an issue, so I SSH’ed into the ESXi host and manually changed the MAC addresses of all my VMs to whatever they were in Workstation. Also changed the adapters to VMXNET3 wherever they weren’t. Reloaded the VMs in ESXi, created all the networks (portgroups) etc., hooked up the VMs to these, and fired them up. That failed coz the MAC address ranges belonged to VMware Workstation and ESXi refuses to work with those! *grr* Not a problem – change the config files again to add a parameter asking ESXi to ignore this MAC address problem – and finally it all loaded. 
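If I remember right, the parameter in question is this one, added per vNIC to each VM’s .VMX file (shown here for the first adapter; I’m going from memory, so double-check the exact spelling):

```
ethernet0.checkMACAddress = "FALSE"
```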

But all my Windows VMs had their adapters reset to a default state. Not sure why – maybe the drivers are different? I don’t know. I had to reconfigure all of them again. Then I turned to OpnSense – that too had reset all its network settings, so I had to configure those too – and finally to nested ESXi hosts. For whatever reason none of them were reachable; and worse, my vCenter VM was just a pain in the a$$. The web client kept throwing some errors and simply refused to open. 

That was the final straw. So in frustration I deleted it all and decided to give up.

But then …

I decided to start afresh. 

Installed ESXi 6.5 (the VMware version, non-HPE) on the host. Created a bunch of nested ESXi VMs in that from scratch. Added a Windows Server 2012R2 as the shared iSCSI storage and router. Created all the switches and port groups etc, hooked them up. Ran into some funny business with the Windows Firewall (I wanted to assign some interfaces as Private, others as Public, and enable the firewall on only the Public ones – but after each reboot Windows kept resetting this). So I added OpnSense into the mix as my DMZ firewall.

So essentially you have my ESXi host -> which hooks into an internal vSwitch portgroup that has the OpnSense VM -> which hooks into another vSwitch portgroup where my Server 2012R2 is connected, and that in turn connects to further vSwitch portgroups (a couple of them actually) where my nested ESXi hosts are connected (I need a couple of portgroups as my ESXi hosts have to be in separate L3 networks so I can actually see a benefit of VXLANs). OpnSense provides NAT and firewalling so none of my VMs are exposed to the outside network, yet they can connect to the outside network if needed. (I really love OpnSense by the way! An amazing product). 

Then I got to the task of setting this all up. Create the clusters, shared storage, DVS networks, install my OpenBSD VMs inside these nested ESXi hosts. Then install NSX Manager, deploy controllers, configure the ESXi hosts for NSX, set up VXLANs, segment IDs, transport zones, and finally create the Logical Switches! :) I was pissed off initially at having to do all this again, but on the whole it was good as I am now comfortable setting these up. Practice makes perfect, and doing this all again was like revision. Ran into problems at each step – small niggles, but it was frustrating. Along the way I found that my (virtual) network still did not seem to support large MTU sizes – but then I realized it’s coz my Server 2012R2 VM (which is the router) wasn’t set up with the large MTU size. Changed that, and that took care of the MTU issue. Now both the Web UI and CLI tests for VXLAN succeed. Finally!

Third time lucky, hopefully. Above are my two OpenBSD VMs on the same VXLAN, able to ping each other. They are actually on ESXi hosts in separate L3 networks, so without NSX they wouldn’t be able to see each other. 

Not sure why there are duplicate packets being received. 

Next I went ahead and set up a DLR so there’s communication between the VXLANs. 

Yeah baby! :o)

Finally I spent some time setting up an ESG and got these OpenBSD VMs talking to my external network (and vice versa). 

The two command prompt windows are my Server 2012R2 on the LAN. It is able to ping the OpenBSD VMs and vice versa. This took a bit more time – not on the NSX side – as I forgot to add the routing info on the ESG for my two internal networks (192.168.1.0/24 and 192.168.2.0/24) as well as on the Server 2012R2 (192.168.0.0/16). Once I did that, routing worked as above. 

I am aware this is more of a screenshots plus talking post rather than any techie details, but I wanted to post this here as a record for myself. I finally got this working! Yay! Now to read the docs and see what I missed out and what I can customize. Time to break some stuff finally (intentionally). 

:o)

Yay! (VXLAN) contd. + Notes to self while installing NSX 6.3 (part 2)

In my previous post I said the following (in gray). Here I’d like to add on:

  • A VDS uses VMKernel ports (vmk ports) to carry out the actual traffic. These are virtual ports bound to the physical NICs on an ESXi host, and there can be multiple vmk ports per VDS for various tasks (vMotion, FT, etc). Similar to this we need to create a new vmk port for the host to connect into the VTEP used by the VXLAN. 
    • Unlike regular vmk ports though we don’t create and assign IP addresses manually. Instead we either use DHCP or create an IP pool when configuring the VXLAN for a cluster. (It is possible to specify a static IP either via DHCP reservation or as mentioned in the install guide).
      • The number of vmk ports (and hence IP addresses) corresponds to the number of uplinks. So a host with 2 uplinks will have two VTEP vmk ports, hence two IP addresses taken from the pool. Bear that in mind when creating the pool.
    • Each cluster uses one VDS for its VXLAN traffic. This can be a pre-existing VDS – there’s nothing special about it just that you point to it when enabling VXLAN on a cluster; and the vmk port is created on this VDS. NSX automatically creates another portgroup, which is where the vmk port is assigned to.
    • VXLANs are created on this VDS – they are basically portgroups in the VDS. Each VXLAN has an ID – the VXLAN Network Identifier (VNI) – which NSX refers to as segment IDs. 
      • Before creating VXLANS we have to allocate a pool of segment IDs (the VNIs) taking into account any VNIs that may already be in use in the environment.
      • The number of segment IDs is also limited by the fact that a single vCenter only supports a maximum of 10,000 portgroups
      • The web UI only allows us to configure a single segment ID range, but multiple ranges can be configured via the NSX API
  • Logical Switch == VXLAN -> which has an ID (called segment ID or VNI) == Portgroup. All of this is in a VDS. 

While installing NSX I came across “Transport Zones”.

Remember ESXi hosts are part of a VDS. VXLANs are created on a VDS. Each VXLAN is a portgroup on this VDS. However, not all hosts need be part of the same VXLANs; but since all hosts are part of the same VDS, and hence have visibility to all the VXLANs, we need some way of marking which hosts are part of a VXLAN. We also need some place to identify if a VXLAN is in unicast, multicast, or hybrid mode. This is where Transport Zones come in.

If all your VXLANs are going to behave the same way (multicast etc) and have the same hosts, then you just need one transport zone. Else you would create separate zones based on your requirement. (That said, when you create a Logical Switch/ VXLAN you have an option to specify the control plane mode (multicast mode etc). Am guessing that overrides the zone setting, so you don’t need to create separate zones just to specify different modes). 

Note: I keep saying hosts above (last two paragraphs) but that’s not correct. It’s actually clusters. I keep forgetting, so thought I should note it separately here rather than correct my mistake above. 1) VXLANs are configured on clusters, not hosts. 2) All hosts within a cluster must be connected to a common VDS (at least one common VDS, for VXLAN purposes). 3) NSX Controllers are optional and can be skipped if you are using multicast replication? 4) Transport Zones are made up of clusters (i.e. all hosts in a cluster; you cannot pick & choose just some hosts – this makes sense when you think that a cluster is for HA and DRS so naturally you wouldn’t want to exclude some hosts from where a VM can vMotion to as this would make things difficult). 

Worth keeping in mind: 1) A cluster can belong to multiple transport zones. 2) A logical switch can belong to only one transport zone. 3) A VM cannot be connected to logical switches in different transport zones. 4) A DLR (Distributed Logical Router) cannot connect to logical switches in multiple transport zones. Ditto for an ESG (Edge Services Gateway). 

After creating a transport zone, we can create a Logical Switch. This assigns a segment ID from the pool automatically and this (finally!!) is your VXLAN. Each logical switch creates yet another portgroup. Once you create a logical switch you can assign VMs to it – that basically changes their port group to the one created by the logical switch. Now your VMs will have connectivity to each other even if they are on hosts in separate L3 networks. 

Something I hadn’t realized: 1) Logical Switches are created on Transport Zones. 2) Transport Zones are made up of / can span clusters. 3) Within a cluster the logical switches (VXLANs) are created on the VDS that’s common to the cluster. 4) What I hadn’t realized was this: nowhere in the previous statements is it implied that transport zones are limited to a single VDS. So if a transport zone is made up of multiple clusters, each / some of which have their own common VDS, any logical switch I create will be created on all these VDSes.  

Sadly, I don’t feel like saying yay at this point unlike before. I am too tired. :(

Which also brings me to the question of how I got this working with VMware Workstation. 

By default VMware Workstation emulates an e1000 NIC in the VMs and this doesn’t support an MTU larger than 1500 bytes. We can edit the .VMX file of a VM and replace “e1000” with “vmxnet3” to replace the emulated Intel 82545EM Gigabit Ethernet NIC with a paravirtual VMXNET3 NIC in the VMs. This NIC supports an MTU larger than 1500 bytes and VXLAN will begin working. One thing though: a quick way of testing if the VTEP VMkernel NICs are able to talk to each other with a larger MTU is via a command such as ping ++netstack=vxlan -I vmk3 -d -s 1600 xxx.xxx.xxx.xxx. If you do this once you add a VMXNET3 NIC though, it crashes the ESXi host. I don’t know why. It only crashes when using the VXLAN network stack; the same command with any other VMkernel NIC works fine (so I know the MTU part is ok). Also, when testing the Logical Switch connectivity via the Web UI (see example here) there’s no crash with a VXLAN standard test packet – maybe that doesn’t use the VXLAN network stack? I spent a fair bit of time chasing after the ping ++netstack command until I realized that even though it was crashing my host the VXLAN was actually working!
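To be concrete, the edit is to the ethernetN.virtualDev line in the VM’s .VMX file, one line per adapter, with the VM powered off:

```
ethernet0.virtualDev = "vmxnet3"
```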

Before I conclude a hat-tip to this post for the Web UI test method and also for generally posting how the author set up his NSX test lab. That’s an example of how to post something like this properly, instead of the stream of thoughts my few posts have been. :)

Update: Short-lived happiness. Next step was to create an Edge Services Gateway (ESG) and there I bumped into the MTU issues. And this time when I ran the test via the Web UI it failed and crashed the hosts. Disappointed, I decided it was time to move on from VMware Workstation. :-/

Update 2: Continued here … 

Yay! (VXLAN)

I decided to take a break from my NSX reading and just go ahead and set up a VXLAN in my test lab. Just go with a hunch of what I think the options should be based on what the menus ask me and what I have read so far. Take a leap! :)

*Ahem* The above is actually incorrect, and I am an idiot. A super huge idiot! Each VM is actually just pinging itself and not the other. Unbelievable! And to think that I got all excited thinking I managed to do something without reading the docs etc. The steps below are incomplete. I should just delete this post, but I wrote this much and had a moment of excitement that day … so am just leaving it as it is with this note. 

Above we have two OpenBSD VMs running in my nested ESXi hypervisors. 

  • obsd-01 is running on host 1, which is on network 10.10.3.0/24.
  • obsd-02 is running on host 2, which is on network 10.10.4.0/24. 
  • Note that each host is on a separate L3 network.
  • Each host is in a cluster of its own (doesn’t matter but just mentioning) and they connect to the same VDS.
  • In that VDS there’s a port group for VMs and that’s where obsd-01 and obsd-02 connect to. 
  • Without NSX, since the hosts are on separate networks, the two VMs wouldn’t be able to see each other. 
  • With NSX, I am able to create a VXLAN network on the VDS such that both VMs are now on the same network.
    • I put the VMs on a 192.168.0.0/24 network so that’s my overlay network. 
    • VXLANs are basically port groups within your NSX enhanced VDS. The same way you don’t specify IP/ network information on the VMware side when creating a regular portgroup, you don’t do anything when creating the VXLAN portgroup either. All that is within the VMs on the portgroup.
  • A VDS uses VMKernel ports (vmk ports) to carry out the actual traffic. These are virtual ports bound to the physical NICs on an ESXi host, and there can be multiple vmk ports per VDS for various tasks (vMotion, FT, etc). Similar to this we need to create a new vmk port for the host to connect into the VTEP used by the VXLAN. 
    • Unlike regular vmk ports though we don’t create and assign IP addresses manually. Instead we either use DHCP or create an IP pool when configuring the VXLAN for a cluster. (It is possible to specify a static IP either via DHCP reservation or as mentioned in the install guide). 
    • Each cluster uses one VDS for its VXLAN traffic. This can be a pre-existing VDS – there’s nothing special about it just that you point to it when enabling VXLAN on a cluster; and the vmk port is created on this VDS. NSX automatically creates another portgroup, which is where the vmk port is assigned to. 

And that’s where I am so far. After doing this I went through the chapter for configuring VXLAN in the install guide and I was pretty much on the right track. Take a look at that chapter for more screenshots and info. 

Yay, my first VXLAN! :o)

p.s. I went ahead with OpenBSD in my nested environment coz (a) I like OpenBSD (though I have never got to play around much with it); (b) it has a simple & fast install process and I am familiar with it; (c) the ISO file is small, so doesn’t take much space in my ISO library; (d) OpenBSD comes with VMware tools as part of the kernel, so nothing additional to install; (e) I so love that it still has a simple rc based system and none of that systemd stuff that newer Linux distributions have (not that there’s anything wrong with systemd just that I am unfamiliar with it and rc is way simpler for my needs); (f) the base install has manpages for all the commands unlike minimal Linux ISOs that usually seem to skip these; (g) take a look at this memory usage! :o)

p.p.s. Remember to disable the PF firewall via pfctl -d.

Yay again! :o)

Update: Short-lived excitement sadly. A while later the VMs stopped communicating. Turns out VMware Workstation doesn’t support an MTU larger than 1500 bytes, and VXLAN requires 1600 bytes. So the VTEP interfaces of both ESXi hosts are unable to talk to each other. Bummer!

Update 2: I finally got this working. Turns out I had missed some stuff; and also I had to make some changes to allow VMware Workstation to work with larger MTU sizes. I’ll blog this in a later post.

Notes to self while installing NSX 6.3 (part 1)

(No sense or order here. These are just notes I took when installing NSX 6.3 in my home lab, while reading this excellent NSX for Newbies series and the NSX 6.3 install guide from VMware (which I find to be quite informative). Splitting these into parts as I have been typing this for a few days).

You can install NSX Manager in VMware Workstation (rather than in the nested ESXi installation if you are doing it in a home lab). You won’t get a chance to configure the IP address, but you can figure it out from your DHCP server. Browse to that IP in a browser and login as username “admin” password “default” (no double quotes). 

If you want to add a certificate from your AD CA to NSX Manager create the certificate as usual in Certificate Manager. Then export the generated certificate and your root CA and any intermediate CA certificates as a “Base-64 encoded X.509 (.CER)” file. Then concatenate all these certificates into a single file (basically, open up Notepad and make a new file that has all these certificates in it). Then you can import it into NSX Manager. (More details here).
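The concatenated file is just the PEM blocks one after the other. As far as I know the usual order is the issued certificate first, then any intermediates, then the root:

```
-----BEGIN CERTIFICATE-----
 (the NSX Manager certificate)
-----END CERTIFICATE-----
-----BEGIN CERTIFICATE-----
 (intermediate CA certificate, if any)
-----END CERTIFICATE-----
-----BEGIN CERTIFICATE-----
 (root CA certificate)
-----END CERTIFICATE-----
```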

During the Host Preparation step on an ESXi 5.5 host it failed with the following error: 

“Could not install image profile: ([], “Error in running [‘/etc/init.d/vShield-Stateful-Firewall’, ‘start’, ‘install’]:\nReturn code: 1\nOutput: vShield-Stateful-Firewall is not running\nwatchdog-dfwpktlogs: PID file /var/run/vmware/watchdog-dfwpktlogs.PID does not exist\nwatchdog-dfwpktlogs: Unable to terminate watchdog: No running watchdog process for dfwpktlogs\nFailed to release memory reservation for vsfwd\nResource pool ‘host/vim/vmvisor/vsfwd’ release failed. retrying..\nResource pool ‘host/vim/vmvisor/vsfwd’ release failed. retrying..\nResource pool ‘host/vim/vmvisor/vsfwd’ release failed. retrying..\nResource pool ‘host/vim/vmvisor/vsfwd’ release failed. retrying..\nResource pool ‘host/vim/vmvisor/vsfwd’ release failed. retrying..\nSet memory minlimit for vsfwd to 256MB\nFailed to set memory reservation for vsfwd to 256MB, trying for 256MB\nFailed to set memory reservation for vsfwd to failsafe value of 256MB\nMemory reservation released for vsfwd\nResource pool ‘host/vim/vmvisor/vsfwd’ released.\nResource pool creation failed. Not starting vShield-Stateful-Firewall\n\nIt is not safe to continue. Please reboot the host immediately to discard the unfinished update.”)” Error 3/16/2017 5:17:49 AM esx55-01.fqdn

Initially I thought maybe NSX 6.3 wasn’t compatible with ESXi 5.5 or that I was on an older version of ESXi 5.5 – so I Googled around on pre-requisites (ESXi 5.5 seems to be fine) and also updated ESXi 5.5 to the latest version. Then I took a closer look at the error message above and saw the bit about the 256MB memory reservation. My ESXi 5.5 host only had 3GB RAM (I had installed with 4GB and reduced it to 3GB) so I bumped it up to 4GB RAM and tried again. And voila! the install worked. So NSX 6.3 requires an ESXi 5.5 host with minimum 4GB RAM (well maybe 3.5 GB RAM works too – I was too lazy to try!) :o)

If you want, you can browse to “https://<NSX_MANAGER_IP>/bin/vdn/nwfabric.properties” to manually download the VIBs that get installed as part of the Host Preparation. This is in case you want to do a manual install (the thought had crossed my mind as part of troubleshooting above).

NSX Manager is your management layer. You install it first and it communicates with vCenter server. A single NSX Manager install is sufficient. There’s one NSX Manager per vCenter. 

The next step after installing NSX Manager is to install NSX Controllers. These are installed in odd numbers to maintain quorum. This is your control plane. Note: No data traffic flows through the controllers. The NSX Controllers perform many roles and each role has a master controller node (if this node fails another one takes its place via election). 

Remember that in NSX the VXLAN is your data plane. NSX supports three control plane modes – multicast, unicast, and hybrid – when it comes to BUM (Broadcast, unknown Unicast, and Multicast) traffic. BUM traffic is basically traffic that doesn’t have a specific, already-known unicast destination, so it has to be flooded. (More info: [1], [2], [3] … and many more on the Internet, but these three are what I came across initially via Google searches).

  • In unicast mode a host replicates all BUM traffic to all other hosts on the same VXLAN and also picks a host in every other VXLAN to do the same for hosts in their VXLANs. Thus there’s no dependence on the underlying hardware. There could, however, be increased traffic as the number of VXLANs increase. Note that in the case of unknown unicast the host first checks with the NSX Controller for more info. (That’s the impression I get at least from the [2] post above – I am not entirely clear). 
  • In multicast mode a host depends on the underlying networking hardware to replicate BUM traffic via multicast. All hosts on all VXLAN segments join multicast groups so any BUM traffic can be replicated by the network hardware to this multicast group. Obviously this mode requires hardware support. Note that multicast is used for both Layer 2 and Layer 3 here. 
  • In hybrid mode some of the BUM traffic replication is handed over to the first hop physical switch (so rather than a host sending unicast traffic to all other hosts connected to the same physical switch it relies on the switch to do this) while the rest of the replication is done by the host to hosts in other VXLANs. Note that multicast is used only for Layer 2 here. Also note that as in the unicast mode, in the case of unknown unicast traffic the Controller is consulted first. 

NSX Edge provides the routing. This is either via the Distributed Logical Router (DLR), which is installed on the hypervisor + a DLR virtual appliance; or via the Edge Services Gateway (ESG), which is a virtual appliance. 

  • A DLR can have up to 8 uplink interfaces and 1000 internal interfaces.
    • A DLR uplink typically connects to an ESG via a Layer 2 logical switch. 
    • DLR virtual appliance can be set up in HA mode – in an active/ standby configuration.
      • Created from NSX Manager?
    • The DLR virtual appliance is the control plane – it supports dynamic routing protocols and exchanges routing updates with Layer 3 devices (usually ESG).
      • Even if this virtual appliance is down the routing isn’t affected. New routes won’t be learnt that’s all.
    • The ESXi hypervisors have DLR VIBs which contain the routing information etc. obtained from the controllers (note: not from the DLR virtual appliance). This is the data layer. It performs ARP lookups, route lookups, etc. 
      • The VIBs also add a Logical InterFace (LIF) to the hypervisor. There’s one for each Logical Switch (VXLAN) the host connects to. Each LIF, of each host, is set to the default gateway IP of that Layer 2 segment. 
  • An ESG can have up to 10 uplink and internal interfaces. (With a trunk an ESG can have up to 200 sub-interfaces). 
    • There can be multiple ESG appliances in a datacenter. 
    • Here’s how new routes are learnt: the ESG learns a new route -> this is picked up by the DLR virtual appliance as they are connected -> the DLR virtual appliance passes this info to the NSX Controllers -> the NSX Controllers pass this to the ESXi hosts.
    • The ESG is what connects to the uplink. The DLR connects to ESG via a Logical Switch. 

Logical Switch – this is the switch for a VXLAN. 

NSX Edge provides Logical VPNs, Logical Firewall, and Logical Load Balancer. 

TIL: Control Plane & Data Plane (networking)

Reading a bit of networking stuff, which is new to me, as I am trying to understand and appreciate NSX (instead of already diving into it). Hence a few of these TIL posts like this one and the previous. 

One common term I read in the context of NSX or SDN (Software Defined Networking) in general is “control plane” and “data plane” (a.k.a “forwarding” plane). 

This forum post is a good intro. Basically, when it comes to networking your network equipment does two sorts of things. One is the actual pushing of packets that come to it on to others. The other is figuring out which packets need to go where. The latter is where various routing protocols like RIP and EIGRP come in. Control plane traffic is used to update a network device’s routing tables or configuration state, and its processing happens on the network device itself. Data plane traffic passes through the router. Control plane traffic determines what should be done with the data plane traffic. Another way of thinking about the control plane and data plane is where the traffic originates from/ is destined to. Basically, control plane traffic is sent to/ from the network device to control it (e.g. RIP, EIGRP); while data plane traffic is what passes through a network device.

(Control plane traffic doesn’t necessarily mean it’s traffic for controlling a network device. For example, SSH or Telnet could be used to connect to a network device and control it, but it’s not really in the control plane. These come more under a “management” plane – which may or may not be considered as a separate plane.)

Once you think of network devices along these lines, you can see that a device’s actual work is in the data plane. How fast can it push packets through. Yes, it needs to know where to push packets through to, but the two aren’t tied together. It’s sort of like how one might think of a computer as being hardware (CPU) + software (OS) tied together. If we imagine the two as tied together, then we are limiting ourselves on how much each of these can be pushed. If improvements in the OS require improvements in the CPU then we limit ourselves – the two can only be improved in-step. But if the OS improvements can happen independent of the underlying CPU (yes, a newer CPU might help the OS take advantage of newer features or perform better, but it isn’t a requirement) then OS developers can keep innovating on the OS irrespective of CPU manufacturers. In fact, OS developers can use any CPU as long as there are clearly defined interfaces between the OS and the CPU. Similarly, CPU manufacturers can innovate independent of the OS. Ultimately if we think (very simply) of CPUs as having a function of quickly processing data, and OS as a platform that can make use of a CPU to do various processing tasks, we can see that the two are independent and all that’s required is a set of interfaces between them. This is how things already are with computers so what I mentioned just now doesn’t sound so grand or new, but this wasn’t always the case. 

With SDN we try to decouple the control and data planes. The data plane then is the physical layer comprising network devices or servers. They are programmable and expose a set of interfaces. The control plane now can be a VM or something independent of the physical hardware of the data plane. It is no longer limited to what a single network device sees. The control plane is aware of the whole infrastructure and accordingly informs/ configures the data plane devices.  

If you want a better explanation of what I was trying to convey above, this article might help. 

In the context of NSX, the data plane would be the VXLAN based Logical Switches and the ESXi hosts that make them up. And its control plane would be the NSX Controllers. It’s the NSX Controllers that take care of knowing what to do with the network traffic. They identify all this, inform the hosts that are part of the data plane accordingly, and let them do the needful. The NSX Controller VMs are deployed in odd numbers (preferably 3 or higher, though you could get away with 1 too) for HA and cluster quorum (that’s why odd numbers) but they are independent of the data plane. Even if all the NSX Controllers are down the data flow would not be affected.


I saw a video from Scott Shenker on the future of networking and the past of protocols. Here’s a link to the slides, and here’s a link to the video on YouTube. I think the video is a must watch. Here are some of the salient points from the video + slides though – mainly as a reminder to myself (note: since I am not a networking person I am vague in many places as I don’t fully understand it myself):

  • Layering is a useful thing. Layering is what made networking successful. The TCP/IP model, the OSI model. Basically you don’t try and think of the “networking problem” as a big composite thing, but you break it down into layers with each layer doing one task and the layer above it assuming that the layer below it has somehow solved that problem. It’s similar to Unix pipes and all that. Break the problem into discrete parts with interfaces, and each part does what it does best and assumes the part below it is taking care of what it needs to do. 
  • This layering was useful when it came to the data plane mentioned above. That’s what TCP/IP is all about anyways – getting stuff from one point to another. 
  • The control plane used to be simple. It was just about the L2 or L3 tables – where to send a frame to, or where to send a packet to. Then the control plane got complicated by way of ACLs and all that (I don’t know what all to be honest as I am not a networking person :)). There was no “academic” approach to solving this problem similar to how the data plane was tackled; so we just kept adding more and more protocols to the mix to simply solve each issue as it came along. This made things even more complicated, but that’s OK as the people who manage all these liked the complexity and it worked after all. 
  • A good quote (from Don Norman) – “The ability to master complexity is not the same as the ability to extract simplicity”. Well said! So simple and prescient. 
    • It’s OK if you are only good at mastering complexity. But be aware of that. Don’t be under a misconception that just because you are good at mastering the complexity you can also extract simplicity out of it. That’s the key thing. Don’t fool yourself. :)
  • In the context of the control plane, the thing is we have learnt to master its complexity but not learnt to extract simplicity from it. That’s the key problem. 
    • To give an analogy with programming, we no longer think of programming in terms of machine language or registers or memory spaces. All these are abstracted away. This abstraction means a programmer can focus on tackling the problem in a totally different way compared to how he/ she would have had to approach it if they had to take care of all the underlying issues and figure it out. Abstraction is a very useful tool. E.g. Object Oriented Programming, Garbage Collection. Extract simplicity! 
  • Another good quote (from Barbara Liskov) – “Modularity based on abstraction is the way things get done”.
    • Or put another way :) Abstractions -> Interfaces -> Modularity (you abstract away stuff; provide interfaces between them; and that leads to modularity). 
  • As mentioned earlier the data plane has good abstraction, interfaces, and modularity (the layers). Each layer has well-defined interfaces and the actual implementation of how a particular layer gets things done is down to the protocols used in that layer or its implementations. The layers above and below do not care. E.g. Layer 3 (IP) expects Layer 2 to somehow get its stuff done. The fact that it uses Ethernet and frames etc. is of no concern to IP. 
  • So, what are the control plane problems in networking? 
    • We need to be able to compute the configuration state of each network device. As in what ACLs it is supposed to be applying, what its forwarding tables are like …
    • We need to be able to do this while operating without communication guarantees. So we have to deal with communication delays or packet drops etc as changes are pushed out. 
    • We also need to be able to do this while operating within the limitations of the protocol we are using (e.g. IP). 
  • Anyone trying to master the control plane has to deal with all three. To give an analogy with programming, it is as though a programmer had to worry about where data is placed in RAM, take care of memory management and process communication etc. No one does that now. It is all magically taken care of by the underlying system (like the OS or the programming language itself). The programmer merely focuses on what they need to do. Something similar is required for the control plane. 
  • What is needed?
    • We need an abstraction for computing the configuration state of each device. [Specification Abstraction]
      • Instead of thinking of how to compute the configuration state of a device or how to change a configuration state, we just declare what we want and it is magically taken care of. You declare how things should be, and the underlying system takes care of making it so. 
      • We think in terms of specifications. If the intention is that Device A should not have access to Device B, we simply specify that in the language of our model without thinking of the how in terms of the underlying physical model. The shift in thinking here is that we view each thing as a layer and only focus on that. To implement a policy that Device A should not have access to Device B we do not need to think of the network structure or the devices in between – all that is just taken care of (by the Network Operating System, so to speak). 
      • This layer is  Network Virtualization. We have a simplified model of the network that we work with and which we specify how it should be, and the Network Virtualization takes care of actually implementing it. 
    • We need an abstraction that captures the lack of communication guarantees- i.e. the distributed state of the system. [Distributed State Abstraction]
      • Instead of thinking how to deal with the distributed network we abstract it away and assume that it is magically taken care of. 
      • Each device has access to an annotated network graph that they can query for whatever info they want. A global network view, so to say. 
      • There is some layer that gathers an overall picture of the network from all the devices and presents this global view to the devices. (We can think of this layer as being a central source of information, but it can be decentralized too. Point is that’s an implementation problem for whoever designs that layer). This layer is the Network Operating System, so to speak. 
    • We need an abstraction of the underlying protocol so we don’t have to deal with it directly.  [Forwarding Abstraction]
      • Network devices have a Management CPU and a Forwarding ASIC. We need an abstraction for both. 
      • The Management CPU abstraction can be anything. The ASIC abstraction is OpenFlow. 
      • This is the layer that closest to the hardware. 
  • SDN abstracts these three things – distribution, forwarding, and configuration. 
    • You have a Control Program that configures an abstract network view based on the operator requirements (note: this doesn’t deal with the underlying hardware at all) ->
    • You have a Network Virtualization layer that takes this abstract network view and maps it to a global view based on the underlying physical hardware (the specification abstraction) ->
    • You have a Network OS that communicates this global network view to all the physical devices to make it happen (the distributed state abstraction (for disseminating the information) and the forwarding abstraction (for configuring the hardware)).
  • Very important: Each piece of the above architecture has a very limited job that doesn’t involve the overall picture. 

From this Whitepaper:

SDN has three layers: (1) an Application layer, (2) a Control layer (the Control Program mentioned above), and (3) an Infrastructure layer (the network devices). 

The Application layer is where business applications reside. These talk to the Control Program in the Control layer via APIs. This way applications can program their network requirements directly. 

OpenFlow (mentioned in Scott’s talk under the ASIC abstraction) is the interface between the control plane and the data/ forwarding place. Rather than paraphrase, let me quote from that whitepaper for my own reference:

OpenFlow is the first standard communications interface defined between the control and forwarding layers of an SDN architecture. OpenFlow allows direct access to and manipulation of the forwarding plane of network devices such as switches and routers, both physical and virtual (hypervisor-based). It is the absence of an open interface to the forwarding plane that has led to the characterization of today’s networking devices as monolithic, closed, and mainframe-like. No other standard protocol does what OpenFlow does, and a protocol like OpenFlow is needed to move network control out of the networking switches to logically centralized control software.

OpenFlow can be compared to the instruction set of a CPU. The protocol specifies basic primitives that can be used by an external software application to program the forwarding plane of network devices, just like the instruction set of a CPU would program a computer system.

OpenFlow uses the concept of flows to identify network traffic based on pre-defined match rules that can be statically or dynamically programmed by the SDN control software. It also allows IT to define how traffic should flow through network devices based on parameters such as usage patterns, applications, and cloud resources. Since OpenFlow allows the network to be programmed on a per-flow basis, an OpenFlow-based SDN architecture provides extremely granular control, enabling the network to respond to real-time changes at the application, user, and session levels. Current IP-based routing does not provide this level of control, as all flows between two endpoints must follow the same path through the network, regardless of their different requirements.

I don’t think OpenFlow is used by NSX though. It is used by Open vSwitch and was used by NVP (Nicira Virtualization Platform – the predecessor of NSX).

Speaking of NVP and NSX: VMware acquired Nicira (a company founded by Martin Casado, Nick McKeown and Scott Shenker – the same Scott Shenker whose video I was watching above), and with it their product NVP. NVP primarily ran on the Xen hypervisor. VMware renamed it to NSX, and it now has two flavors. NSX-V is the version that runs on the VMware ESXi hypervisor and is in active development. There’s also NSX-MH which is a “multi-hypervisor” version that’s supposed to be able to run on Xen, KVM, etc. but I couldn’t find much information on it. There are some presentation slides in case anyone’s interested. 

Before I conclude here’s some more blog posts related to all this. They are in order of publishing so we get a feel of how things have progressed. I am starting to get a headache reading all this network stuff, most of which is going above my head, so I am going to take a break here and simply link to the articles (with minimal/ half info) and not go much into it. :)

  • This one talks about how the VXLAN specification doesn’t specify any control plane.
    • There is no way for hosts participating in a VXLAN network to know the MAC addresses of other hosts or VMs in the VXLAN so we need some way of achieving that. 
    • Nicira NVP uses OpenFlow as a control-plane protocol. 
  • This one talks about how OpenFlow is used by Nicira NVP. Some points of interest:
    • Each Open vSwitch (OVS) implementation has 1) a flow-based forwarding module loaded in the kernel; 2) an agent that communicates with the Controller; and 3) an OVS DB daemon that keeps track of the local configuration. 
    • NVP had clusters of 3 or 5 controllers. These used the OpenFlow protocol to download forwarding entries into the OVS, and OVSDB (a.k.a. ovsdb-daemon) to configure the OVS itself (creating/ deleting/ modifying bridges, interfaces, etc). 
    • Read that post on how the forwarding tables and tunnel interfaces are modified as new devices join the overlay network. 
    • Broadcast traffic, unknown Unicast traffic, and Multicast traffic (a.k.a. BUM traffic) can be handled in two ways – either by sending these to an extra server that replicates these to all devices in the overlay network; or the source hypervisor/ physical device can encapsulate the BUM frame and send it as unicast to all the other devices in that overlay. 
  • This one talks about how Nicira NVP seems to be moving away from OpenFlow or supplementing it with something (I am not entirely clear).
    • This is a good read though just that I was lost by this point coz I have been doing this reading for nearly 2 days and it’s starting to get tiring. 

One more post from the author of the three posts above. It’s a good read. Kind of obvious stuff, but good to see in pictures. That author has some informative posts – wish I was more brainy! :)

TIL: VXLAN is a standard

VXLAN == Virtual eXtensible LAN.

While reading about NSX I was under the impression VXLAN is something VMware cooked up and owns (possibly via Nicira, which is where NSX came from). But turns out that isn’t the case. It was originally created by VMware & Cisco (check out this Register article – a good read) and is actually covered under RFC 7348. The encapsulation mechanism is standardized, and so is the UDP port used for communication (port number 4789 by the way). A lot of vendors now support VXLAN, and similar to NSX being an implementation of VXLAN we also have Open vSwitch. Nice!

(Note to self: got to read more about Open vSwitch. It’s used in XenServer and is a part of Linux. The *BSDs too support it). 

VXLAN is meant to both virtualize Layer 2 and also replace VLANs. You can have up to 16 million VXLANs (the NSX Logical Switches I mentioned earlier). In contrast you are limited to 4094 VLANs (the VLAN ID is a 12-bit field, with two values reserved). I like the analogy of how VXLAN is to IP addresses what cell phones are to telephone numbers. Prior to cell phones, when everyone had landline numbers, your phone number was tied to your location. If you shifted houses/ locations you got a new phone number. In contrast, with cell phone numbers it doesn’t matter where you are as the number is linked to you, not your location. Similarly with VXLAN your VM’s IP address is linked to the VM, not its location. 

Update:

  • Found a good whitepaper by Arista on VXLANs. Something I hadn’t realized earlier was that the 24-bit VXLAN Network Identifier is called the VNI (24 bits is what lets you have ~16 million (2^24) VXLAN segments/ NSX Logical Switches) and that a VM’s MAC is combined with its VNI – thus allowing multiple VMs with the same MAC address to exist across the network (as long as they are on separate VXLAN segments). 
  • Also, while I am noting acronyms I might as well also mention VTEPs. These stand for Virtual Tunnel End Points. This is the “thing” that encapsulates/ decapsulates packets for VXLAN. This can be virtual bridges in the hypervisor (ESXi or any other); or even VXLAN aware VM applications or VXLAN capable switching hardware (wasn’t aware of this until I read the Arista whitepaper). 
  • VTEP communicates over UDP. The port number is 4789 (NSX 6.2.3 and later) or 8472 (pre-NSX 6.2.3).
  • A post by Duncan Epping on VXLAN use cases. Probably dated in terms of the VXLAN issues it mentions (traffic tromboning) but I wanted to link it here as (a) it’s a good read and (b) it’s good to know such issues as that will help me better understand why things might be a certain way now (because they are designed to work around such issues). 

Cisco CME outgoing caller ID not showing individual extensions

Been working with our Cisco CME (Cisco Unified Communications Manager Express – as a reminder to myself!) at work for the past 2-3 days. I have no idea about Cisco telephony, but wanted to tackle an issue anyway. Good way to learn a new system.

The issue was that whenever anyone makes an outgoing external call from our system, the caller ID number shown at the remote end is that of our main number. That is to say, if our main number extension is 900 (which externally appears as 1234900) and my extension is 929, when I make an outgoing external call to (say) my mobile the number appears as 1234900 instead of 1234929.

A useful command to debug such situations is the following command:

This shows what is sent to the ISP when I place an outgoing external call.
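Depending on the trunk type, that would be something along the lines of one of these:

    debug isdn q931
    debug ccsip messages

The first is for ISDN/ PRI trunks, the second for SIP trunks.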

Another useful command is:

This shows the internal processing that happens on CME/ CUCM. Things like what translations happen, what dial peers are selected, etc. It’s a lot of output compared to the first command.
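If I remember right, this one is:

    debug voip ccapi inout

(older IOS versions call it debug voice ccapi inout).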

To see what debugging is enabled on your system the following command is useful:
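That one is:

    show debugging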

To turn off all debugging (coz it takes a toll on your router and so you must disable it once you are done):
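Which is just:

    undebug all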

Lastly, to see the debugging output if you are SSH’d into the router rather than on the console, do the following:
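That's:

    terminal monitor

(and terminal no monitor to turn it off again).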

In my case I found out that even though CME was correctly sending the calling extension/ number as 9xx to the ISP, it looked like the ISP was ignoring it. I thought that maybe it expects the number in a proper format (as an external number) so I made a translation rule for outgoing calling numbers to change 9xx to the correct format (12349xx) and pass to ISP – and that fixed the issue.

Here’s the rule and translation profile I created. I played around a bit with the output and saw that I need to pass a “0” before the full number for the ISP to recognize it correctly.
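It was something along these lines; the rule number, profile name, and the plan/ type keywords below are just the ones I happened to pick:

    voice translation-rule 10
     rule 1 /^\(9..\)$/ /01234\1/ type any national plan any isdn
    !
    voice translation-profile PSTN-OUT
     translate calling 10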

I set the plan and type too though it didn’t make any difference in my case. I saw some posts on the Internet where it does seem to make a difference, so I didn’t remove it.

Lastly I apply this profile to the voice port:
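Roughly like this, where the voice port is a placeholder for whichever port the trunk is on:

    voice-port 0/0/0:15
     translation-profile outgoing PSTN-OUT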

That’s all.

vCenter unable to connect to hosts; vSphere client gives error '"ServiceInstance.RetrieveContent" for object "ServiceInstance" on Server "IP-Address" failed'

Our Network team had been making some changes at work and suddenly vCenter in our London office lost connectivity with all the ESX hosts in one of our remote offices. Moreover, when trying to connect from the vSphere Client to any of the remote hosts directly we were getting the following error –

client error

Connectivity from vSphere Client in the remote office to the ESX host in the same office was fine; it was only connectivity from other offices to this remote office. So it definitely indicated a network issue.

This KB article is a handy one to know what ports are required by various VMware products. Port 443 is what needs to be open to ESX hosts for vCenter Server to be able to talk to them. I did a telnet from the vCenter server to each of the remote office hosts on port 443 and it went through fine – so wasn’t a firewall issue. (Another post with port numbers, just FYI, is this one).
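For reference, the sort of check I mean is below; the host name is a placeholder, and Test-NetConnection is the PowerShell way of doing the same on newer Windows:

    telnet esx01.remoteoffice.example.com 443
    Test-NetConnection esx01.remoteoffice.example.com -Port 443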

After a fair bit of troubleshooting we tracked the issue down to MTU.

Digressing into MTUs

Communication between two IP addresses (i.e. layer 3) happens through packets. Thus when my London vCenter Server communicates with my remote office ESX host, the two send TCP/IP packets to each other. When these packets from the vCenter Server reach the switch/ router on the same LAN as the ESX host, it becomes a layer 2 communication (because they are on the same network and it’s a matter of data reaching the ESX host from the switch/ router). In the case of Ethernet, this layer 2 communication happens via Ethernet frames. The frames encapsulate the IP packets – so the switch/ router breaks the packets and fits them into multiple frames, while the ESX host receives these frames and re-assembles the packets (and vice versa). (The picture on this Wikipedia page is worth a look to see the encapsulation). 

How much data can be held by a layer 2 frame is defined by the Maximum Transmission Unit (MTU). Larger MTUs are good because you can carry more data; but they have a downside in that each frame takes longer to be transmitted, and in case of any errors more data has to be re-transmitted when the frame is resent. So a balance is important. In the case of Ethernet, RFC 894 (see errata also) defines the MTU as a maximum of 1500 bytes. In the case of other layer 2 protocols, the MTU varies: for example 4464 bytes for Token Ring; 4352 bytes for FDDI; 9180 bytes for ATM; etc. In the case of Ethernet there are now also jumbo frames, which are frames with an MTU size of 9000 bytes (see this page for a table comparing regular frames and jumbo frames) and are commonly used in iSCSI networks.

Taking the case of Ethernet, assume the MTU of all Ethernet networks is 1500 bytes. So when two devices are conversing with each other over layer 3, and this conversation spans multiple Ethernet networks, it is helpful if the devices know that the MTU of the underlying layer 2 network is 1500 bytes. That way the two devices can keep the size of their layer 3 packets to less than 1500 bytes. Why? Because if the size of the layer 3 packets is greater than 1500 bytes, then the devices and all the routers/ switches in between will have to fragment (break) the layer 3 packets into smaller packets of less than 1500 bytes to fit them in the Ethernet frames. This is a waste of resources for all, so it's best if the two devices know of the underlying layer 2 MTU and act accordingly.

Now, note that the Ethernet MTU is defined as a maximum of 1500 bytes. So the MTU for a particular LAN segment can be set to a lower number for whatever reason (maybe there are additional fields in the Ethernet frame and to accommodate these the data portion must be reduced). Similarly, a layer 3 conversation between two devices can go over a mix of layer 2 networks – Ethernet, Token Ring, etc – each with a different MTU. So what the two devices really need is a way of knowing the lowest MTU across all these layer 2 networks, so they can use it as the MTU of the layer 3 packets for their conversation. This is known as the Path MTU or IP MTU – it is basically the smallest MTU of all the underlying layer 2 networks that the conversation traverses. It is discovered through a process known as "Path MTU Discovery" (PMTUD) (check the Wikipedia article, or Google this term to learn more). Very briefly, in the case of IPv4 each device sends packets with a flag set that says "do not fragment this packet". Packets smaller than the lowest layer 2 MTU along the path get through; a packet larger than some hop's MTU is dropped at that hop because it cannot be fragmented (due to the flag), and an ICMP "fragmentation needed" message is returned to the sender telling it to try a smaller size. Thus the Path MTU is discovered. Each direction of the conversation discovers its own Path MTU.
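A quick way to see this for yourself is a do-not-fragment ping. On Windows -f sets the DF flag and -l sets the payload size; 1472 bytes of payload plus 28 bytes of ICMP and IP headers is exactly 1500 (the Linux equivalent is ping -M do -s 1472). The host name here is just a placeholder:

    ping -f -l 1472 some-remote-host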

So we have layer 2 MTUs and layer 3 MTUs. Layer 2 MTUs have a maximum value that is dependent on the layer 2 network technology. But what about the minimum value? RFC 791, which defines the Internet Protocol (the IP in TCP/IP), requires that all devices supporting IP must be able to forward packets of 68 bytes without fragmenting (68 bytes because the IP header can be up to 60 bytes and the smallest fragment of data is 8 bytes) and be able to accept packets of size 576 bytes, either as one packet or as multiple fragments that require reassembling. Because of this the minimum MTU can be thought of as 68 bytes. In a practical sense, however, most IP devices can receive 576-byte packets without needing them fragmented, so 576 bytes is usually treated as the practical minimum MTU for layer 3 conversations.

Just for completeness I will also mention Maximum Segment Size (MSS), which is a layer 4 MTU (of sorts) that defines the maximum TCP segment (which is what a TCP packet is called) that can be accepted by devices. It has a default value of 536 bytes. This is based on the 576 bytes that IP requires hosts to accept at minimum, minus 20 bytes for IP headers and 20 bytes for TCP headers. The idea behind using 576 bytes as the base is that this way the TCP segment can be expected to arrive without fragmenting. In a practical sense again, for TCP/IP traffic over Ethernet (which is the common case), since Ethernet frames have an MTU of 1500, the MSS is usually set to 1500 minus 20 minus 20 = 1460 bytes.

This is a good article I came upon. Just linking it as a reference to myself.

Back to our issue

In our case the router in the remote site had the following set in its configuration:

I am not entirely clear where it was set or why it was set, as that comes under the Network team. What this does though is tell the router not to clear the "Do Not Fragment" (DF) bit in the IP header. If the DF bit is set on a packet then the router will not fragment it even when the packet is larger than the MTU; it just drops it (this is also what PMTUD relies on). I am not sure why this was set – part of some testing I suppose – but because of this larger packets were not getting through to the other side and hence failing. Our Network team removed this statement and then communication with the ESX hosts started working fine.

I wanted to write more about this statement but I am running out of time. This and this are two good links worth reading for more info. Especially the Scenario 4 section in the second link – that’s pretty much what was happening in our case, I think.

OpenVPN not setting default gateway

Note to self. Using OpenVPN on Windows, if your Internet traffic still goes via the local network rather than the VPN network, check whether OpenVPN has set itself as the default gateway.
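To check, from a command prompt, look at the 0.0.0.0 destination entries under Active Routes and compare their metrics:

    route print -4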

Notice the second entry with the lower metric above. That's OpenVPN. If such an entry exists, and its metric is lower than the other entries, then it will be used as the default route. If such an entry does not exist then double check that you are running the OpenVPN GUI with "Run as Administrator" rights. I wasn't, and it took me a while to realize that was the reason why OpenVPN wasn't setting the default route on my laptop!

Hope this helps someone.

PowerShell: Add a line to multiple files

Trivial stuff, but I don't get to use PowerShell as much as I would like to so I end up forgetting elementary things that should just be second nature to me. Case in point, I wanted to add a line to a bunch of text files. Here's what I came up with – just posting it here as a reference to my future self.
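It was a one-liner along these lines (pia.txt being the credentials file I mention below):

    # append the auth-user-pass directive to every OpenVPN config file in the current folder
    Get-ChildItem -Filter *.ovpn | ForEach-Object { Add-Content -Path $_.FullName -Value "auth-user-pass pia.txt" }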

For the background behind this, I use Private Internet Access for my VPN needs and since last month or so my ISP's been blocking traffic to it. Private Internet Access offers a client that lets me connect to their servers via UDP or TCP. The UDP option began failing but the TCP option surprisingly worked. Of course I don't want to use TCP as that's slow and so I went to Private Internet Access's website where they give a bunch of OpenVPN config files we can use. These are cool in the sense that some of them connect via IP (instead of server name) while others connect to well-known ports or use a well-known local port and so there's less chance of being blocked by the ISP. In my case it turned out just connecting via IP was more than enough, so it looks like the ISP isn't blocking OpenVPN UDP ports, it's just blocking UDP traffic to these server names.

Anyhow, the next step was to stop the client from prompting for a username/ password. OpenVPN has an option auth-user-pass which lets you specify a file name where the username and password are on separate lines. So all I had to do was create this file and add a line such as auth-user-pass pia.txt to all the configuration files. That's what the code snippet above does.

Notes on Teredo (part 4)

Continuing my posts on Teredo, consider the following network:

IPv6 - Teredo

There are two Windows clients behind two different NAT routers. There's a Teredo server and a Teredo relay. And there are IPv6-only and IPv6/IPv4 resources on the Internet that these clients would like to access. The Teredo server and relay are on different networks, and the Teredo relay is on a different network from the clients and the IPv6 resources. They don't need to be on different networks or even different servers; I drew them that way just to illustrate that they are independent devices and not necessarily on the same network as the clients and resources.

The clients communicate with the Teredo server and relay through IPv6 packets in UDP messages in IPv4 packets. The Teredo server and relay communicate with the IPv6 parts of the Internet over IPv6, except when the host has both IPv4 and IPv6 in which case the host acts as its own relay and talks to the clients directly over IPv4 (using IPv6 in UDP messages as usual).

Once I configure both clients with the Teredo server address they get Teredo IPv6 addresses as below. These are both link-local as well as 2001::/32 prefix addresses. The Teredo relay too gets Teredo IPv6 addresses while the Teredo server has link-local Teredo IPv6 addresses only.

IPv6-Teredo2

Decoding the Teredo IPv6 address

Let’s decode the Teredo IPv6 address of one of the clients to see what it means.

Client 192.168.99.200 has address 2001:0:1117:34fa:1027:374b:e1fc:f635. So the network prefix is 2001:0:1117:34fa, host identifier is 1027:374b:e1fc:f635.

Dealing with the network prefix:

  • 2001::/32 is the Teredo network prefix.
  • 1117:34fa is 32 bits of the Teredo server IPv4 address in hex. 0x11 = 17, 0x17 = 23, 0x34 = 52, 0xFA = 250. Which converts to 17.23.52.250.

And the host identifier:

  • 1027 is 16  bits of flags and random bits.
  • 374b is 16 bits of the obfuscated external port of the NAT. (Obfuscation happens by XOR-ing the digits with 0xFFFF).
    • 0x374b = 0xc8b4 unobfuscated = 51380.
  • e1fc:f635 is 32 bits of obfuscated external IPv4 address of the NAT.
    • 0xe1 = 0x1e unobfuscated = 30
    • 0xfc = 0x3 unobfuscated = 3
    • 0xf6 = 0x9 unobfuscated = 9
    • 0x35 = 0xca unobfuscated = 202
    • So the NAT’s external IPv4 address is 30.3.9.202.
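Just to double-check the math, here's a little PowerShell that does the same decoding on the address above (nothing official, just string slicing):

    $groups = '2001:0:1117:34fa:1027:374b:e1fc:f635' -split ':'

    # groups 3-4 are the Teredo server's IPv4 address, one hex byte at a time
    ((($groups[2] + $groups[3]) -split '(..)') -ne '' |
        ForEach-Object { [convert]::ToInt32($_,16) }) -join '.'            # 17.23.52.250

    # group 6 is the NAT's external port, XOR'd with 0xFFFF
    0xFFFF -bxor [convert]::ToInt32($groups[5],16)                          # 51380

    # groups 7-8 are the NAT's external IPv4 address, each byte XOR'd with 0xFF
    ((($groups[6] + $groups[7]) -split '(..)') -ne '' |
        ForEach-Object { 0xFF -bxor [convert]::ToInt32($_,16) }) -join '.'  # 30.3.9.202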

Packet flow

Let’s look at some Wireshark captures to see what happens. From one of the clients (192.168.20.200) I am pinging an IPv6 host and here’s what happens.

(The Teredo IPv6 addresses in these captures are different from the ones above because I made the captures a while later and the NAT IPv4 address and port changed by then).

First the client makes a DNS request (over IPv4) to resolve the name, one of the name servers replies with the IPv6 address (the AAAA records).

Capture1

Next the client contacts the Teredo server to configure itself. It sends an ICMPv6 Router Solicitation message to the ff02::2 multicast IPv6 address (link-local scope, all routers). The source is set to a link-local IPv6 address fe80::ffff:ffff:fffe which as far as I can tell seems to be an anycast address but I am not too sure. (Typically Router Solicitation messages set the source address as fe80::/64 followed by a random or EUI-64 based host identifier).

Capture2

It's worth noting that this ICMPv6 message is actually sent within a UDP message in an IPv4 packet addressed to the Teredo server IPv4 address. Pay attention to the destination IPv4 address of this packet – it's the first IPv4 address of the Teredo server. From the capture above you'll see there are two Router Solicitation messages sent and two replies received. That's because the second Router Solicitation message is to the second IPv4 address of the Teredo server. Both addresses send a reply and that's how the Teredo client identifies whether it's behind a symmetric NAT or not.

Capture3

In the capture above it’s also worth noting the port number used by the client’s NAT: 52643. There is now a mapping on the NAT device from this port to the client. The mapping could be a cone or restricted, but that does not matter to us (as long as it’s not symmetric).

Moving on, the Teredo client sends an IPv6 Connectivity Test message from its Teredo IPv6 address to the IPv6 address of the destination. The client has auto-configured an IPv6 address for itself. And it knows the destination IPv6 address from the initial DNS query. This test message is really an ICMPv6 ping request to the destination, sent within a UDP message over IPv4 to the Teredo server. I had mentioned this previously in the section on Teredo relays. This is how a Teredo client finds the Teredo relay closest to the destination.

Capture4

The Teredo server now does a Neighbor Solicitation to find the MAC address of the destination. From the IPv6 address it knows the destination is on the same network as itself. If it were on a different network, the Teredo server would have sent this packet to its gateway for forwarding.

Notice the destination address ff02::1:ff00:254. As mentioned in an earlier post this is a solicited node multicast address.

  • ff02 indicates the scope is link-local.
  • ::1:ffxx:xxxx is the solicited-node address that a host whose IPv6 address ends in the 24 bits xx:xxxx would have subscribed to. For the destination host IPv6 address 2dcc:7c4e:3651:52::254, the last 24 bits are 00:0254, so the solicited-node address is ff02::1:ff00:254.

Capture5

The destination host now sends a Neighbor Solicitation message to find its default gateway’s MAC address, gets a reply, and sends the ping reply to the Teredo client. Notice the ping reply is a pure IPv6 ICMPv6 packet.

Capture6

This packet reaches the Teredo relay as it is advertising the 2001::/32 prefix. This packet is discarded.

The Teredo relay sends a UDP message to the NAT router of the client, on the port that's mapped to it. The NAT router replies with a prohibited message. This tells the Teredo relay that any previous mappings the client's NAT might have had for the relay are gone and so a new mapping must be made. Since it can't contact the client directly it must take help from the Teredo server.

Capture7

The Teredo relay now sends a bubble packet from itself to the IPv6 Teredo address of the client. This bubble packet is sent via UDP over IPv4 to the Teredo server. (Remember this is how the need for hole punching is communicated to the server).

Capture8

The Teredo server sends this packet via UDP to the port that's mapped on the client's NAT router. Note the IPv6 portion of the packet it sends is the same as the one it received. This way the client is able to identify the Teredo relay's IPv4 and Teredo IPv6 address.

Capture9

I was under the impression the Teredo relay address is sent to the client as an "Origin Indication" by the Teredo server but apparently the client picks it up from the IPv6 packet the relay sends it. The Teredo server doesn't modify the IPv6 payload. The payload is a bubble packet (next header is 59) and contains details of the relay.

Now the Teredo client contacts the Teredo relay and sends it bubble packets. The relay responds and thus the client and relay know of each other.

Capture12

All this was just prep work done behind the scenes when the client wants to talk to an IPv6 host. I generated this traffic by trying to ping the IPv6 host from the client so below is the traffic for that:

Capture14

The Teredo client contacts the relay via UDP/ IPv4 and passes on the ICMPv6 ping request packet. That’s sent on the IPv6 network to the destination. A reply arrives. The reply is passed via UDP/IPv4 to the client. No more additional steps.

Below is the capture for all four ping requests and replies.

Capture15

And that’s it!

Microsoft has a good page on Teredo that's worth looking at. It goes into more detail on the processes and scenarios than I do above.

Notes on Teredo (part 3)

In the previous two parts I talked about Teredo in general and also about NAT &  Teredo. In this post I hope to talk more about how Teredo works.

Teredo Clients

Microsoft has made available Teredo servers on the Internet. These are reachable at win8.ipv6.microsoft.com and teredo.ipv6.microsoft.com, and Windows clients have one of these addresses already set as their Teredo server.

If the Teredo server address is not reachable, the client is in an offline state:

If the Teredo server address is reachable, the client is in a dormant state. As the name indicates this is a state in which the Teredo client is not active, but when required it can contact the server and auto-configure an IPv6 address and send/ receive packets.

Send some IPv6 traffic and the state automatically changes to qualified. (Note how the first ping reply took a lot more time than the rest as the Teredo interface was being configured. Sometimes the first reply can timeout too).

Now the Teredo state also shows the type of NAT the client is behind and also the local and external mappings.
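All of this state can be viewed with:

    netsh interface teredo show state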

Another thing to note in the output above is the "Network" field, which is currently set to "unmanaged". Since Teredo allows a client to be reached across a firewall/ NAT, and this is something an organisation might not want for its managed machines, the Teredo client tries to accommodate that and before initializing itself it checks whether the computer is on a managed network. If the computer is domain joined and on a network where its domain controllers are reachable – i.e. within an organisation – the Teredo client detects that it's on a managed network and disables itself.

This setting can be changed to set the Teredo client as qualified even in a managed network. This can be done via GPOs, PowerShell, or netsh. The netsh command for this is:

This command must be run as an administrator. When a Teredo client is in a managed network and qualified, it is known as an Enterprise Client. Hence the name.
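If I remember the syntax right it's this via netsh, with Set-NetTeredoConfiguration being the PowerShell equivalent:

    netsh interface teredo set state type=enterpriseclient
    Set-NetTeredoConfiguration -Type EnterpriseClient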

It is also possible to configure clients with a manually specified Teredo server. This can be done via PowerShell …

… or netsh
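Roughly like this; the server name below is just a placeholder:

    Set-NetTeredoConfiguration -ServerName teredo.example.com
    netsh interface teredo set state servername=teredo.example.com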

Teredo Servers

Setting up your own Windows Teredo Server is easy. Windows 7, Windows 8, Windows Server 2008 R2, Windows Server 2012, and later can function as a Teredo server.

Here’s how I enable one of these as a Teredo server:
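From memory it's a netsh one-liner along these lines:

    netsh interface teredo set state type=server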

And that’s it really! The computer is now working as a Teredo server.

Running the above command again shows updated stats.

One thing to keep in mind is that even a computer functioning as a Teredo server must itself be set up with a Teredo server. If the pre-configured Teredo server (e.g. win8.ipv6.microsoft.com) is unreachable, the Teredo interface will be disabled and the computer will not work as a Teredo server. The Teredo state will show as offline even though this computer itself is a server.

To fix this, set this computer as its own Teredo server.

Teredo Relays

Previously I wrote about how a Teredo relay sends packets to a Teredo client. How do clients know which Teredo relay to use though? So far we haven’t set a Teredo relay anywhere in our client and server configuration, so where does it enter the picture?

While Teredo servers are specific to a client – i.e. the client is assigned a Teredo server and each client uses only one Teredo server – Teredo relays are specific to the remote end and a particular client will use different relays for different destinations. Here’s how the process works:

  1. When a Teredo client needs to contact a remote IPv6 host, it first sends an ICMPv6 packet to the remote host.
  2. Since it doesn’t know how to contact this host, and this is an initial setup connection, the client sends this packet to the Teredo server as an UDP message in IPv4.
  3. The Teredo server receives this message, decapsulates the IPv6 packet, and sends it on the IPv6 network. Note: this IPv6 packet has the destination address set as the IPv6 address of the remote host, and source address set as the Teredo IPv6 address of the Teredo client.
  4. Now for the fun part! The IPv6 packet reaches the destination host, the host creates a reply IPv6 packet with itself as the source and the Teredo client IPv6 address as the destination. This packet is sent on the IPv6 network. On the IPv6 network are many Teredo relays, all of them advertising the 2001::/32 prefix. The packet will reach the relay nearest to the destination host, which will then send it to the Teredo client. Once the Teredo client receives the ICMPv6 reply, it knows which relay was used and thus knows the IPv4 address of the relay closest to the destination.
  5. The Teredo client then sends the actual IPv6 packet as a UDP message in an IPv4 packet to this Teredo relay. And since hole punching has been done for this relay address, further packets to and from this relay can travel through.

Similarly when an IPv6 host has a packet for a Teredo client, the packet makes its way to the relay closest to that host. The relay then checks whether it already has a communication set up with the client, in which case it sends the packet over via IPv4. If there’s no on-going communication, or it’s been a while, the relay goes through the hole punching process again and sends the packet.

Similar to the Teredo server, Windows 7, Windows 8, Windows Server 2008 R2, Windows Server 2012, and later can function as a Teredo relay. Setting up one of these as a Teredo relay is quite straight-forward. All one has to do is:

  1. Ensure the Teredo interface is ready – i.e. the relay can reach a Teredo server and the interface is not offline.
  2. Enable forwarding on the Teredo interface. Enable forwarding on the interface(s) to the IPv6 network.
  3. Publish a route for the 2001::/32 prefix.
  4. Enable IPv6 router advertisements on the IPv6 network so other routers pick up the published route.

And that’s it! Here are the commands:
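From memory, the netsh versions of the above are along these lines (the interface names are whatever yours happen to be called):

    netsh interface ipv6 set interface "Teredo Tunneling Pseudo-Interface" forwarding=enabled
    netsh interface ipv6 set interface "Ethernet" forwarding=enabled advertise=enabled
    netsh interface ipv6 add route 2001::/32 "Teredo Tunneling Pseudo-Interface" publish=yes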

That’s all for now!

Notes on Teredo (part 2)

Before going more into Teredo it’s worth talking about the types of NAT.

Types of NAT

When an internal device sends a packet to an Internet device, the source address and port (i.e. the private IPv4 address and port number of the internal device) are translated by the NAT. The outgoing packet from the NAT device will have the source IPv4 address set as the public IPv4 address of the NAT box and a newly assigned port on the NAT box as the source port address of the packet. And a mapping for this will be stored on the NAT box (internal IPv4 address, port number <-> Public IPv4 address, port number).

What happens next is what defines the various types of NAT:

Cone NAT

In a cone NAT, once a mapping is thus stored, packets from any external device to the public IPv4 address and port number of the mapping will be forwarded to the internal IPv4 address and port number. The key word here is any. So my machine behind the NAT can send a packet to (say) Google.com from port 1234 of my machine -> this will create a mapping on the NAT box from my internal IPv4 address and port 1234 to the public IPv4 address and (say) port 5678. Now a totally different server, (say) Yahoo.com, can send a packet to port 5678 of the public IPv4 address and it will be forwarded to my machine's port 1234.

Essentially my internal machine has a port mapped on the NAT now which is always forwarded to my machine. Of course, the mapping could expire after a while of disuse and then I'll get a new external port number and mapping, but as long as I keep sending traffic to keep the mapping alive the port is forever forwarded to me.

You can imagine this looks like a cone. The tip of the cone is the public IPv4 address and port. The base – the open end – is everything in the external world. Everything in the external world can contact me on that port. It’s like a cone, hence the name.

Restricted NAT

In a restricted NAT, once a mapping is stored, only packets from the particular external device to which the packets were originally sent can use the mapping and be forwarded. That is, in the example above, once there's a mapping from my internal IPv4 address and port 1234 to the external public IPv4 and port 5678 for Google.com, only Google.com can send a packet to the external IPv4 address and port 5678 and have it forwarded to my machine; if Yahoo.com sends a packet to the same IPv4 address and port it will be discarded.

There is a stricter form of restricted NAT where even the port number of the sender is checked. That is, if the initial packet was to port 9999 of Google.com, only packets from port 9999 of Google.com are allowed to enter via that mapping. Packets from port 9998 of Google.com will be silently discarded!

Symmetric NAT

Symmetric NAT takes things one step further!

In the two NAT types above the mapping is stored for all traffic from a given internal IP address and port. That is, if I send packets from port 1234 of my internal IPv4 address, a mapping is created to port 5678 on the NAT box and that is used for all traffic from port 1234 of my internal machine. So – I contact Google.com from internal port 1234, the same mapping is used. I contact Yahoo.com from the same internal port, the same mapping is used. I contact Bing.com from the same internal port, the same mapping is used! The only difference between the two types above was to do with the incoming packets – whether they were from an IPv4 address (and port, in the stricter version) that was previously contacted. But the mapping was always the same for all traffic.

However, with a symmetric NAT a new mapping is created for each destination. So if I contact Google.com from port 1234 of my internal machine, it could be mapped to port 5678 of the NAT box and a mapping created for that. Next, if I contact Yahoo.com from port 1234 of the internal machine (same internal port as before), it could be mapped to port 5889 of the NAT box and a new mapping created for that. And later if I contact Bing.com from again the same port 1234 of the internal machine, yet another mapping will be created on the NAT box, say to port 6798. And worse, each of these mappings behaves like a restricted NAT mapping – i.e. only the IP address to which the packets were initially sent when creating that mapping can use the mapping to send back packets.

You can see why this is called symmetric NAT. It's because each flow from the internal side has its own mapping on the external side. There is a one-to-one mapping from the internal port & IPv4 address to an external port for each destination.

Why do the types matter?

It's important to know what type of NAT a particular client is behind when using Teredo. Because remember my quick overview of Teredo – at step 2 the host behind the NAT asks the server for an IPv6 address and it gets it. What this also does is create a mapping on the NAT box from the host's internal IPv4 address and port number to the NAT box's public IPv4 address and an external port. Now …

If the NAT were Cone, then at step 8 a Teredo relay can forward IPv6 over IPv4 packets to the internal host by sending to the external port and NAT public IPv4 address.

But if the NAT were Restricted, then step 8 would fail as the internal host hasn't contacted the relay yet and so there's no mapping for the relay's IPv4 address and/ or port (it could be that the internal host has contacted the relay to send some packets and this is a response from the relay, in which case it will be let through, but it could also be a fresh connection from the relay forwarding packets from an IPv6 host that are not in response to anything from the internal host – these will be discarded). So in the case of a restricted NAT Teredo has some additional stuff to do first – namely create a mapping in the NAT table for the Teredo relay's IPv4 address.

First the Teredo relay checks if the IPv4 address and port of the Teredo client – which it extracts from the Teredo IPv6 address of the client – are known to it. If it is known (as in the relay and client have communicated recently) it means a mapping exists and the relay can send the packet to the client.

If a mapping does not exist the Teredo relay takes the help of the Teredo server to punch a hole in the NAT. This is akin to the Romeo & Juliet example I described yesterday. The relay needs to contact the client but the client's NAT box will discard any packets that are not in response to something that the client has sent out, so the relay needs a third party server "friend" to help out. Here's what happens:

  1. The Teredo relay queues the incoming packet.
  2. From the Teredo IPv6 address of the client it extracts the IPv4 address of the Teredo server.
  3. The relay then creates a “bubble packet” with the source address as the relay’s IPv6 address, destination as the client’s Teredo IPv6 address, and sends it to the Teredo server’s IPv4 address. A bubble packet is essentially an empty IPv6 packet. Since it is sent to the IPv4 address of the Teredo server, it will be encapsulated in an IPv4 packet.
  4. The Teredo server extracts the IPv6 bubble packet. From the bubble packet's destination IPv6 address the Teredo server notes that it itself is the Teredo server for that client. This tells the server that its help is required for hole punching. It notes the IPv4 address of the relay from the source address of the packet (this is used in step 7).
  5. The Teredo server extracts the NAT IPv4 address and port from the host portion of the client’s Teredo address. It puts the bubble packet within a UDP message and sends it over IPv4 to the IPv4 address and port of the NAT box. The NAT box forwards this packet to the internal host (the Teredo client) as a mapping already exists for the Teredo server IPv4 address.
    • The UDP packet has to contain the Teredo server IPv4 address and port as the source address and port. It has to else the packet won’t pass through NAT. But the client also needs to know the IPv4 address of the Teredo relay. So the Teredo server sets an “Origin Indication” within this UDP packet that specifies the IPv4 address of the Teredo relay.
  6. The Teredo client receives the bubble packet in UDP. From the "Origin Indication" it knows the IPv4 address of the Teredo relay. From the bubble packet it knows the IPv6 address of the Teredo relay. And since there's no existing NAT mapping for the relay and this packet came via its Teredo server – indicating that the client has to do its part of the hole punching – the client will now create a new IPv6 bubble packet, setting its Teredo IPv6 address as the source and the Teredo relay's IPv6 address as the destination, put this within a UDP message, set the IPv4 address of the Teredo relay as the destination, and send it out.
  7. The packet passes through NAT and reaches the Teredo relay. Since this is a response to the bubble it previously sent, the Teredo relay knows the mapping is ready. Now the Teredo relay sends the previously queued incoming packet to the Teredo client and it gets through …!

Phew! Now we know why Teredo is a tunnel of last resort. There's so much behind the scenes stuff that has to happen to keep it working. And that's not to mention additional stuff like regular bubble packets from the Teredo client to server to keep the NAT mapping alive, checks to ensure there's no spoofing done, and many more. Added to that, for security reasons an update to the Teredo RFC (the Teredo RFC is RFC 4380, the update is RFC 5991) specifies that Teredo should always assume it's behind a Restricted NAT and so the above steps must always be done, even for clients behind Cone NATs.

Back to NATs – if the NAT were Symmetric, Teredo does not work at all unless you make some changes on the NAT to exempt the Teredo clients. (Teredo in Windows Vista and above can work between Teredo clients if only one Teredo client is behind a Symmetric NAT and the other is behind a Cone/ Restricted NAT).

Identifying the type of NAT

Here’s how a Teredo client identifies the type of NAT it is behind.

Two things are in play here:

  1. The Teredo client IPv6 address has a flag that specifies whether it is behind a cone NAT or not. This flag is in the host bits of the address – remember I had previously mentioned the host bits have some flags, random bits, and the NAT IPv4 address and port number? One of these flags specifies whether the client is behind a cone NAT or not.
  2. The Teredo server has two public IPv4 addresses. The Teredo RFC does not expect these to be consecutive IPv4 addresses, but Windows Teredo clients expect these to be.

When the Teredo client contacts the Teredo server initially for an IPv6 address, it sends a Router Solicitation message as all IPv6 clients do (the difference being this is a unicast message, sent within a UDP message over IPv4 to a specific Teredo server address). The Router Solicitation message requires a link-local address – all Router Solicitation messages do – so the Teredo Client generates one, following the same format as a regular Teredo IPv6 address. The network prefix of this link-local address is set to fe80::/64, with the host bits having flags and random bits as usual but with the embedded IPv4 address being the private IPv4 address and port of the internal host rather than the public IPv4 address and port of the NAT box (because the Teredo client doesn’t know what this is).

This link-local address sets the cone flag to 1 – meaning the client believes it is behind a cone NAT.

When the Teredo server receives this it sends a Router Advertisement message as usual (as a UDP message within IPv4 unlike usual). The server does a trick here though. Instead of responding from the public IPv4 address on which it received the UDP message from the client, it responds from the second public IPv4 address. If this reply from a different IPv4 address gets through the NAT, then the client knows it is indeed behind a cone NAT. But if no replies come through (after an appropriate time-out (default: 4s) and number of retries (3 times)), the client realizes it may not be behind a cone NAT and so it sends a new Router Solicitation message to the Teredo server only this time it sets the cone flag to 0 (i.e. not behind a cone NAT).

Again the Teredo server receives the message and sends a Router Advertisement message, but now since the cone flag is 0 it sends the reply from the same IPv4 address as it received the message on. This will get through, confirming to the client that it is behind a non-cone NAT. (Note: If this reply too does not get through, after an appropriate time-out and number of retries the client realizes that UDP messages are blocked/ not getting through NAT/ firewall and so it sets the Teredo interface as disconnected/ off-line. Teredo cannot be used in this situation).

Next the client needs to know if it is behind a symmetric NAT. It now contacts the Teredo server on the second IPv4 address with a Router Solicitation message, setting the cone flag to 0 so the server uses the same IPv4 address when replying, and when it gets a reply from the Teredo server it compares the NAT port in the reply with the NAT port in the previous reply. If the ports are same the client determines that it is behind a restricted NAT; if the ports are different the client determines it is behind a symmetric NAT and that Teredo might not work.

(Note: I oversimplified a bit above to keep things easy. When the Teredo server sends a Router Advertisement, this includes the network prefix only. The host bits are set by the Teredo client once it identifies the type of NAT it is behind. The host bits require knowledge of the NAT IPv4 address and port, but how does the Teredo client know these? It knows these because the Router Advertisements from the Teredo server contains an “Origin Indication” field specifying the public IPv4 address and port. This is how the client gets the port number used for both Router Advertisements and determines if it is behind a symmetric NAT or not. Once that determination is done, the client has all the info required to self-assign a Teredo IPv6 address).