Creative Commons Attribution 4.0 International License
© Rakhesh Sasidharan

Yay! (VXLAN) contd. + Notes to self while installing NSX 6.3 (part 2)

In my previous post I said the following (in gray). Here I’d like to add on:

  • A VDS uses VMKernel ports (vmk ports) to carry out the actual traffic. These are virtual ports bound to the physical NICs on an ESXi host, and there can be multiple vmk ports per VDS for various tasks (vMotion, FT, etc). Similar to this we need to create a new vmk port for the host to connect into the VTEP used by the VXLAN. 
    • Unlike regular vmk ports though we don’t create and assign IP addresses manually. Instead we either use DHCP or create an IP pool when configuring the VXLAN for a cluster. (It is possible to specify a static IP either via DHCP reservation or as mentioned in the install guide).
      • The number of vmk ports (and hence IP addresses) corresponds to the number of uplinks. So a host with 2 uplinks will have two VTEP vmk ports, hence two IP addresses taken from the pool. Bear that in mind when creating the pool.
    • Each cluster uses one VDS for its VXLAN traffic. This can be a pre-existing VDS – there’s nothing special about it, just that you point to it when enabling VXLAN on a cluster; and the vmk port is created on this VDS. NSX automatically creates another portgroup, to which the vmk port is assigned.
    • VXLANs are created on this VDS – they are basically portgroups in the VDS. Each VXLAN has an ID – the VXLAN Network Identifier (VNI) – which NSX refers to as segment IDs. 
      • Before creating VXLANs we have to allocate a pool of segment IDs (the VNIs), taking into account any VNIs that may already be in use in the environment.
      • The number of segment IDs is also limited by the fact that a single vCenter only supports a maximum of 10,000 portgroups.
      • The web UI only allows us to configure a single segment ID range, but multiple ranges can be configured via the NSX API.
  • Logical Switch == VXLAN -> which has an ID (called segment ID or VNI) == Portgroup. All of this is in a VDS. 
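Since the number of VTEP vmk ports (and hence pool IPs) tracks the number of uplinks per host, the pool sizing is simple multiplication. A quick back-of-the-envelope sketch (my own helper, not an NSX tool):

```python
def vtep_pool_size(hosts, uplinks_per_host):
    """Each uplink gets its own VTEP vmk port, and each vmk port
    draws one IP address from the pool."""
    return hosts * uplinks_per_host

def fits_in_pool(usable_addresses, hosts, uplinks_per_host):
    """True if the IP pool has enough usable addresses for the cluster."""
    return usable_addresses >= vtep_pool_size(hosts, uplinks_per_host)

# A 4-host cluster with 2 uplinks per host needs 8 VTEP IPs,
# so a pool of 10 addresses is fine but a pool of 6 is not.
print(vtep_pool_size(4, 2))    # 8
print(fits_in_pool(10, 4, 2))  # True
print(fits_in_pool(6, 4, 2))   # False
```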

While installing NSX I came across “Transport Zones”.

Remember ESXi hosts are part of a VDS. VXLANs are created on a VDS. Each VXLAN is a portgroup on this VDS. However, not all hosts need be part of the same VXLANs; but since all hosts are part of the same VDS, and hence have visibility to all the VXLANs, we need some way of marking which hosts are part of a VXLAN. We also need some place to record whether a VXLAN is in unicast, multicast, or hybrid mode. This is where Transport Zones come in.

If all your VXLANs are going to behave the same way (multicast etc.) and have the same hosts, then you just need one transport zone. Else you would create separate zones based on your requirements. (That said, when you create a Logical Switch/ VXLAN you have an option to specify the control plane mode (multicast mode etc.). I’m guessing that overrides the zone setting, so you don’t need to create separate zones just to specify different modes.)

Note: I keep saying hosts above (last two paragraphs) but that’s not correct. It’s actually clusters. I keep forgetting, so I thought I should note it separately here rather than correct my mistake above. 1) VXLANs are configured on clusters, not hosts. 2) All hosts within a cluster must be connected to a common VDS (at least one common VDS, for VXLAN purposes). 3) NSX Controllers are optional and can be skipped if you are using multicast replication? 4) Transport Zones are made up of clusters (i.e. all hosts in a cluster; you cannot pick & choose just some hosts – this makes sense when you consider that a cluster is for HA and DRS, so naturally you wouldn’t want to exclude some hosts from where a VM can vMotion to, as this would make things difficult).

Worth keeping in mind: 1) A cluster can belong to multiple transport zones. 2) A logical switch can belong to only one transport zone. 3) A VM cannot be connected to logical switches in different transport zones. 4) A DLR (Distributed Logical Router) cannot connect to logical switches in multiple transport zones. Ditto for an ESG (Edge Services Gateway). 
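The rules above can be captured as a toy validation model. This is entirely my own sketch for making the constraints concrete – none of these class names are NSX API objects:

```python
# Toy model of the transport-zone rules above. Illustrative only --
# these are not NSX API objects.

class LogicalSwitch:
    def __init__(self, name, transport_zone):
        self.name = name
        self.transport_zone = transport_zone  # exactly one zone per switch

class VM:
    def __init__(self, name):
        self.name = name
        self.switches = []

    def connect(self, switch):
        # Rule 3: a VM cannot be connected to logical switches
        # in different transport zones.
        zones = {s.transport_zone for s in self.switches}
        if zones and switch.transport_zone not in zones:
            raise ValueError("switches span transport zones")
        self.switches.append(switch)

ls_web = LogicalSwitch("web", "tz-A")
ls_app = LogicalSwitch("app", "tz-A")
ls_db  = LogicalSwitch("db",  "tz-B")

vm = VM("vm01")
vm.connect(ls_web)
vm.connect(ls_app)       # fine: same transport zone
try:
    vm.connect(ls_db)    # different transport zone: rejected
except ValueError as e:
    print("rejected:", e)
```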

After creating a transport zone, we can create a Logical Switch. This assigns a segment ID from the pool automatically and this (finally!!) is your VXLAN. Each logical switch creates yet another portgroup. Once you create a logical switch you can assign VMs to it – that basically changes their port group to the one created by the logical switch. Now your VMs will have connectivity to each other even if they are on hosts in separate L3 networks. 

Something I hadn’t realized: 1) Logical Switches are created on Transport Zones. 2) Transport Zones are made up of / can span clusters. 3) Within a cluster the logical switches (VXLANs) are created on the VDS that’s common to the cluster. 4) What I hadn’t realized was this: nowhere in the previous statements did I imply that transport zones are limited to a single VDS. So if a transport zone is made up of multiple clusters, each / some of which have their own common VDS, any logical switch I create will be created on all these VDSes.

Sadly, I don’t feel like saying yay at this point, unlike before. I am too tired. :(

Which also brings me to the question of how I got this working with VMware Workstation. 

By default VMware Workstation emulates an e1000 NIC in the VMs, and this doesn’t support an MTU larger than 1500 bytes. We can edit the .VMX file of a VM and replace “e1000” with “vmxnet3” to swap the emulated Intel 82545EM Gigabit Ethernet NIC for a paravirtual VMXNET3 NIC. This NIC supports an MTU larger than 1500 bytes, and VXLAN will begin working. One thing though: a quick way of testing whether the VTEP VMkernel NICs are able to talk to each other with a larger MTU is a command such as ping ++netstack=vxlan -I vmk3 -d -s 1600 xxx.xxx.xxx.xxx. If you run this after adding a VMXNET3 NIC, though, it crashes the ESXi host. I don’t know why. It only crashes when using the VXLAN network stack; the same command with any other VMkernel NIC works fine (so I know the MTU part is OK). Also, when testing the Logical Switch connectivity via the Web UI (see example here) there’s no crash with a VXLAN standard test packet – maybe that doesn’t use the VXLAN network stack? I spent a fair bit of time chasing after the ping ++netstack command until I realized that even though it was crashing my host, the VXLAN was actually working!
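The MTU requirement comes from VXLAN’s encapsulation overhead: the outer IP, UDP, and VXLAN headers (per RFC 7348) wrap around the original Ethernet frame, so a full-sized 1500-byte payload no longer fits in a 1500-byte underlay MTU. A quick sanity check of the arithmetic:

```python
# VXLAN overhead per RFC 7348 (IPv4 underlay, no VLAN tags).
OUTER_IP   = 20   # outer IPv4 header
OUTER_UDP  = 8    # outer UDP header
VXLAN_HDR  = 8    # VXLAN header carrying the 24-bit VNI
INNER_ETH  = 14   # the encapsulated frame's own MAC header

def underlay_mtu_needed(vm_mtu=1500):
    """Size of the outer IP packet carrying a full-sized inner frame.
    The underlay (VTEP) MTU must be at least this big."""
    inner_frame = vm_mtu + INNER_ETH
    return OUTER_IP + OUTER_UDP + VXLAN_HDR + inner_frame

print(underlay_mtu_needed())  # 1550 -> hence the usual 1600-byte recommendation
```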

Before I conclude a hat-tip to this post for the Web UI test method and also for generally posting how the author set up his NSX test lab. That’s an example of how to post something like this properly, instead of the stream of thoughts my few posts have been. :)

Yay! (VXLAN)

I decided to take a break from my NSX reading and just go ahead and set up a VXLAN in my test lab. Just go with a hunch of what I think the options should be based on what the menus ask me and what I have read so far. Take a leap! :)

*Ahem* The above is actually incorrect, and I am an idiot. A super huge idiot! Each VM is actually just pinging itself and not the other. Unbelievable! And to think that I got all excited thinking I managed to do something without reading the docs etc. The steps below are incomplete. I should just delete this post, but I wrote this much and had a moment of excitement that day … so am just leaving it as it is with this note. 

Above we have two OpenBSD VMs running in my nested ESXi hypervisors.

  • obsd-01 is running on host 1, which is on network 10.10.3.0/24.
  • obsd-02 is running on host 2, which is on network 10.10.4.0/24. 
  • Note that each host is on a separate L3 network.
  • Each host is in a cluster of its own (doesn’t matter but just mentioning) and they connect to the same VDS.
  • In that VDS there’s a port group for VMs and that’s where obsd-01 and obsd-02 connect to. 
  • Without NSX, since the hosts are on separate networks, the two VMs wouldn’t be able to see each other. 
  • With NSX, I am able to create a VXLAN network on the VDS such that both VMs are now on the same network.
    • I put the VMs on a 192.168.0.0/24 network so that’s my overlay network. 
    • VXLANs are basically port groups within your NSX enhanced VDS. The same way you don’t specify IP/ network information on the VMware side when creating a regular portgroup, you don’t do anything when creating the VXLAN portgroup either. All that is within the VMs on the portgroup.
  • A VDS uses VMKernel ports (vmk ports) to carry out the actual traffic. These are virtual ports bound to the physical NICs on an ESXi host, and there can be multiple vmk ports per VDS for various tasks (vMotion, FT, etc). Similar to this we need to create a new vmk port for the host to connect into the VTEP used by the VXLAN. 
    • Unlike regular vmk ports though we don’t create and assign IP addresses manually. Instead we either use DHCP or create an IP pool when configuring the VXLAN for a cluster. (It is possible to specify a static IP either via DHCP reservation or as mentioned in the install guide). 
    • Each cluster uses one VDS for its VXLAN traffic. This can be a pre-existing VDS – there’s nothing special about it just that you point to it when enabling VXLAN on a cluster; and the vmk port is created on this VDS. NSX automatically creates another portgroup, which is where the vmk port is assigned to. 

And that’s where I am so far. After doing this I went through the chapter for configuring VXLAN in the install guide and I was pretty much on the right track. Take a look at that chapter for more screenshots and info. 

Yay, my first VXLAN! :o)

p.s. I went ahead with OpenBSD in my nested environment coz (a) I like OpenBSD (though I have never got to play around much with it); (b) it has a simple & fast install process and I am familiar with it; (c) the ISO file is small, so doesn’t take much space in my ISO library; (d) OpenBSD comes with VMware tools as part of the kernel, so nothing additional to install; (e) I so love that it still has a simple rc based system and none of that systemd stuff that newer Linux distributions have (not that there’s anything wrong with systemd just that I am unfamiliar with it and rc is way simpler for my needs); (f) the base install has manpages for all the commands unlike minimal Linux ISOs that usually seem to skip these; (g) take a look at this memory usage! :o)

p.p.s. Remember to disable the PF firewall via pfctl -d.

Yay again! :o)

Update: Short-lived excitement, sadly. A while later the VMs stopped communicating. Turns out VMware Workstation doesn’t support an MTU larger than 1500 bytes, and VXLAN requires 1600 bytes. So the VTEP interfaces of both ESXi hosts are unable to talk to each other. Bummer!

Update 2: I finally got this working. Turns out I had missed some stuff; and also I had to make some changes to allow VMware Workstation to work with larger MTU sizes. I’ll blog this in a later post.

Notes to self while installing NSX 6.3 (part 1)

(No sense or order here. These are just notes I took when installing NSX 6.3 in my home lab, while reading this excellent NSX for Newbies series and the NSX 6.3 install guide from VMware (which I find to be quite informative). Splitting these into parts as I have been typing this for a few days).

You can install NSX Manager in VMware Workstation (rather than in the nested ESXi installation, if you are doing it in a home lab). You won’t get a chance to configure the IP address, but you can figure it out from your DHCP server. Browse to that IP in a browser and log in as username “admin”, password “default” (no double quotes).

If you want to add a certificate from your AD CA to NSX Manager create the certificate as usual in Certificate Manager. Then export the generated certificate and your root CA and any intermediate CA certificates as a “Base-64 encoded X.509 (.CER)” file. Then concatenate all these certificates into a single file (basically, open up Notepad and make a new file that has all these certificates in it). Then you can import it into NSX Manager. (More details here).
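The concatenation step is just stacking the PEM blocks into one file, leaf certificate first and root CA last. A small sketch of the mechanics (file names are made up; the stand-in “certificates” are fakes just to show the idea):

```python
# Sketch: concatenate the issued certificate plus the CA chain into
# one PEM file for import into NSX Manager. File names are examples.
def concat_pems(cert_paths, out_path):
    with open(out_path, "w") as out:
        for path in cert_paths:
            with open(path) as f:
                out.write(f.read().strip() + "\n")

# Fake stand-in certificates just to demonstrate the mechanics:
for name, body in [("nsxmanager.cer", "LEAF"), ("root-ca.cer", "ROOT")]:
    with open(name, "w") as f:
        f.write(f"-----BEGIN CERTIFICATE-----\n{body}\n-----END CERTIFICATE-----")

concat_pems(["nsxmanager.cer", "root-ca.cer"], "nsx-chain.cer")
print(open("nsx-chain.cer").read().count("BEGIN CERTIFICATE"))  # 2
```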

The Host Preparation step on an ESXi 5.5 host failed with the following error:

“Could not install image profile: ([], “Error in running [‘/etc/init.d/vShield-Stateful-Firewall’, ‘start’, ‘install’]:\nReturn code: 1\nOutput: vShield-Stateful-Firewall is not running\nwatchdog-dfwpktlogs: PID file /var/run/vmware/watchdog-dfwpktlogs.PID does not exist\nwatchdog-dfwpktlogs: Unable to terminate watchdog: No running watchdog process for dfwpktlogs\nFailed to release memory reservation for vsfwd\nResource pool ‘host/vim/vmvisor/vsfwd’ release failed. retrying..\nResource pool ‘host/vim/vmvisor/vsfwd’ release failed. retrying..\nResource pool ‘host/vim/vmvisor/vsfwd’ release failed. retrying..\nResource pool ‘host/vim/vmvisor/vsfwd’ release failed. retrying..\nResource pool ‘host/vim/vmvisor/vsfwd’ release failed. retrying..\nSet memory minlimit for vsfwd to 256MB\nFailed to set memory reservation for vsfwd to 256MB, trying for 256MB\nFailed to set memory reservation for vsfwd to failsafe value of 256MB\nMemory reservation released for vsfwd\nResource pool ‘host/vim/vmvisor/vsfwd’ released.\nResource pool creation failed. Not starting vShield-Stateful-Firewall\n\nIt is not safe to continue. Please reboot the host immediately to discard the unfinished update.”)” Error 3/16/2017 5:17:49 AM esx55-01.fqdn

Initially I thought maybe NSX 6.3 wasn’t compatible with ESXi 5.5, or that I was on an older version of ESXi 5.5 – so I Googled around on prerequisites (ESXi 5.5 seems to be fine) and also updated ESXi 5.5 to the latest version. Then I took a closer look at the error message above and saw the bit about the 256MB memory reservation. My ESXi 5.5 host only had 3GB RAM (I had installed with 4GB and reduced it to 3GB), so I bumped it up to 4GB RAM and tried again. And voila – the install worked. So NSX 6.3 requires an ESXi 5.5 host with a minimum of 4GB RAM (well, maybe 3.5GB RAM works too – I was too lazy to try!) :o)

If you want, you can browse to “https://<NSX_MANAGER_IP>/bin/vdn/nwfabric.properties” to manually download the VIBs that get installed as part of the Host Preparation. This is in case you want to do a manual install (the thought had crossed my mind as part of troubleshooting above).

NSX Manager is your management layer. You install it first and it communicates with vCenter server. A single NSX Manager install is sufficient. There’s one NSX Manager per vCenter. 

The next step after installing NSX Manager is to install NSX Controllers. These are installed in odd numbers to maintain quorum. This is your control plane. Note: No data traffic flows through the controllers. The NSX Controllers perform many roles and each role has a master controller node (if this node fails another one takes its place via election). 

Remember that in NSX the VXLAN is your data plane. NSX supports three control plane modes: multicast, unicast, and hybrid when it comes to BUM (Broadcast, unknown Unicast, and Multicast) traffic. BUM traffic is basically traffic that doesn’t have a specific Layer 3 destination. (More info: [1], [2], [3] … and many on the Internet but these three are what I came across initially via Google searches).

  • In unicast mode a host replicates all BUM traffic to all other hosts on the same VXLAN and also picks a host in every other VXLAN to do the same for hosts in their VXLANs. Thus there’s no dependence on the underlying hardware. There could, however, be increased traffic as the number of VXLANs increases. Note that in the case of unknown unicast the host first checks with the NSX Controller for more info. (That’s the impression I get at least from the [2] post above – I am not entirely clear). 
  • In multicast mode a host depends on the underlying networking hardware to replicate BUM traffic via multicast. All hosts on all VXLAN segments join multicast groups so any BUM traffic can be replicated by the network hardware to this multicast group. Obviously this mode requires hardware support. Note that multicast is used for both Layer 2 and Layer 3 here. 
  • In hybrid mode some of the BUM traffic replication is handed over to the first hop physical switch (so rather than a host sending unicast traffic to all other hosts connected to the same physical switch it relies on the switch to do this) while the rest of the replication is done by the host to hosts in other VXLANs. Note that multicast is used only for Layer 2 here. Also note that as in the unicast mode, in the case of unknown unicast traffic the Controller is consulted first. 
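My rough mental model of unicast-mode (head-end) replication, based on the posts linked above – this is a toy sketch of the idea, not NSX’s actual implementation: the source host unicasts a copy to every other VTEP in its own segment, plus one proxy VTEP per remote segment, which then re-replicates locally.

```python
# Toy model of unicast-mode BUM replication: the source host sends
# copies to every other VTEP in its own IP segment, and to one proxy
# VTEP per remote segment, which replicates onward locally.
def replication_targets(source, vteps_by_segment):
    """vteps_by_segment: {segment: [vtep, ...]}. Returns the VTEPs the
    source host must send copies to directly."""
    targets = []
    for segment, vteps in vteps_by_segment.items():
        if source in vteps:
            targets += [v for v in vteps if v != source]  # local replication
        else:
            targets.append(vteps[0])  # one proxy per remote segment
    return targets

vteps = {
    "10.10.3.0/24": ["esx-01", "esx-02"],
    "10.10.4.0/24": ["esx-03", "esx-04"],
}
print(replication_targets("esx-01", vteps))  # ['esx-02', 'esx-03']
```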

NSX Edge provides the routing. This is either via the Distributed Logical Router (DLR), which is installed on the hypervisor + a DLR virtual appliance; or via the Edge Services Gateway (ESG), which is a virtual appliance. 

  • A DLR can have up to 8 uplink interfaces and 1000 internal interfaces.
    • A DLR uplink typically connects to an ESG via a Layer 2 logical switch. 
    • The DLR virtual appliance can be set up in HA mode – in an active/ standby configuration.
      • Created from NSX Manager?
    • The DLR virtual appliance is the control plane – it supports dynamic routing protocols and exchanges routing updates with Layer 3 devices (usually ESG).
      • Even if this virtual appliance is down, routing isn’t affected. New routes won’t be learnt, that’s all.
    • The ESXi hypervisors have DLR VIBs which contain the routing information etc. received from the controllers (note: not from the DLR virtual appliance). This is the data layer. It performs ARP lookups, route lookups, etc. 
      • The VIBs also add a Logical InterFace (LIF) to the hypervisor. There’s one for each Logical Switch (VXLAN) the host connects to. Each LIF, of each host, is set to the default gateway IP of that Layer 2 segment. 
  • An ESG can have up to 10 uplink and internal interfaces. (With a trunk an ESG can have up to 200 sub-interfaces). 
    • There can be multiple ESG appliances in a datacenter. 
    • Here’s how new routes are learnt: the ESG learns a new route -> this is picked up by the DLR virtual appliance as they are connected -> the DLR virtual appliance passes this info to the NSX Controllers -> the NSX Controllers pass this to the ESXi hosts.
    • The ESG is what connects to the uplink. The DLR connects to ESG via a Logical Switch. 
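The LIF idea above can be sketched as follows (my own toy model with made-up names): every host carries the same set of LIFs, one per logical switch, each holding that segment’s default-gateway IP – so a VM’s default gateway is always on its local hypervisor, no matter which host it runs on.

```python
# Toy model of DLR Logical Interfaces (LIFs). Every host holds the
# same LIF table: one LIF per logical switch, set to that segment's
# default-gateway IP -- the distributed default gateway.
class HostDLR:
    def __init__(self, lifs):
        self.lifs = lifs  # {logical_switch_name: gateway_ip}

    def gateway_for(self, switch):
        return self.lifs.get(switch)

lifs = {"web-ls": "192.168.1.1", "app-ls": "192.168.2.1"}
host1, host2 = HostDLR(lifs), HostDLR(lifs)

# A VM on web-ls sees the same gateway regardless of its host:
print(host1.gateway_for("web-ls"))  # 192.168.1.1
print(host2.gateway_for("web-ls"))  # 192.168.1.1
```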

Logical Switch – this is the switch for a VXLAN. 

NSX Edge provides Logical VPNs, Logical Firewall, and Logical Load Balancer. 

TIL: Control Plane & Data Plane (networking)

Reading a bit of networking stuff, which is new to me, as I am trying to understand and appreciate NSX (instead of already diving into it). Hence a few of these TIL posts like this one and the previous. 

One common term I read in the context of NSX or SDN (Software Defined Networking) in general is “control plane” and “data plane” (a.k.a “forwarding” plane). 

This forum post is a good intro. Basically, when it comes to networking, your network equipment does two sorts of things. One is the actual pushing of packets that come to it. The other is figuring out which packets need to go where. The latter is where various networking protocols like RIP and EIGRP come in. Control plane traffic is used to update a network device’s routing tables or configuration state, and its processing happens on the network device itself. Data plane traffic passes through the device. Control plane traffic determines what should be done with the data plane traffic. Another way of thinking about the control and data planes is where the traffic originates from/ is destined to: control plane traffic is sent to/ from the network devices to control them (e.g. RIP, EIGRP), while data plane traffic is what passes through a network device.

(Control plane traffic doesn’t necessarily mean it’s traffic for controlling a network device. For example, SSH or Telnet could be used to connect to a network device and control it, but they’re not really in the control plane. These come more under a “management” plane – which may or may not be considered a separate plane.)

Once you think of network devices along these lines, you can see that a device’s actual work is in the data plane. How fast can it push packets through. Yes, it needs to know where to push packets through to, but the two aren’t tied together. It’s sort of like how one might think of a computer as being hardware (CPU) + software (OS) tied together. If we imagine the two as tied together, then we are limiting ourselves on how much each of these can be pushed. If improvements in the OS require improvements in the CPU then we limit ourselves – the two can only be improved in-step. But if the OS improvements can happen independent of the underlying CPU (yes, a newer CPU might help the OS take advantage of newer features or perform better, but it isn’t a requirement) then OS developers can keep innovating on the OS irrespective of CPU manufacturers. In fact, OS developers can use any CPU as long as there are clearly defined interfaces between the OS and the CPU. Similarly, CPU manufacturers can innovate independent of the OS. Ultimately if we think (very simply) of CPUs as having a function of quickly processing data, and OS as a platform that can make use of a CPU to do various processing tasks, we can see that the two are independent and all that’s required is a set of interfaces between them. This is how things already are with computers so what I mentioned just now doesn’t sound so grand or new, but this wasn’t always the case. 

With SDN we try to decouple the control and data planes. The data plane then is the physical layer comprising network devices or servers. They are programmable and expose a set of interfaces. The control plane now can be a VM or something independent of the physical hardware of the data plane. It is no longer limited to what a single network device sees. The control plane is aware of the whole infrastructure and accordingly informs/ configures the data plane devices.

If you want a better explanation of what I was trying to convey above, this article might help. 

In the context of NSX, its data plane would be the VXLAN-based Logical Switches and the ESXi hosts that make them up. And its control plane would be the NSX Controllers. It’s the NSX Controllers that take care of knowing what to do with the network traffic. They identify all this, inform the hosts that are part of the data plane accordingly, and let them do the needful. The NSX Controller VMs are deployed in odd numbers (preferably 3 or higher, though you could get away with 1 too) for HA and cluster quorum (that’s why odd numbers), but they are independent of the data plane. Even if all the NSX Controllers are down, the data flow would not be affected.
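The odd-number requirement is just majority-quorum arithmetic – an even count tolerates no more failures than the odd count below it, so the extra node buys nothing. A quick illustration:

```python
def tolerated_failures(n):
    """A majority-quorum cluster of n nodes keeps quorum as long as
    a strict majority (n // 2 + 1) of nodes survive."""
    majority = n // 2 + 1
    return n - majority

for n in range(1, 6):
    print(n, "controllers ->", tolerated_failures(n), "failure(s) tolerated")
# 1 -> 0, 2 -> 0, 3 -> 1, 4 -> 1, 5 -> 2: even counts add no tolerance
```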


I saw a video from Scott Shenker on the future of networking and the past of protocols. Here’s a link to the slides, and here’s a link to the video on YouTube. I think the video is a must watch. Here are some of the salient points from the video+slides though – mainly as a reminder to myself (note: since I am not a networking person I am vague in many places as I don’t understand it all myself):

  • Layering is a useful thing. Layering is what made networking successful. The TCP/IP model, the OSI model. Basically you don’t try and think of the “networking problem” as a big composite thing, but you break it down into layers with each layer doing one task and the layer above it assuming that the layer below it has somehow solved that problem. It’s similar to Unix pipes and all that. Break the problem into discrete parts with interfaces, and each part does what it does best and assumes the part below it is taking care of what it needs to do. 
  • This layering was useful when it came to the data plane mentioned above. That’s what TCP/IP is all about anyways – getting stuff from one point to another. 
  • The control plane used to be simple. It was just about the L2 or L3 tables – where to send a frame to, or where to send a packet to. Then the control plane got complicated by way of ACLs and all that (I don’t know what all to be honest as I am not a networking person :)). There was no “academic” approach to solving this problem similar to how the data plane was tackled; so we just kept adding more and more protocols to the mix to simply solve each issue as it came along. This made things even more complicated, but that’s OK as the people who manage all these liked the complexity and it worked after all. 
  • A good quote (from Don Norman) – “The ability to master complexity is not the same as the ability to extract simplicity”. Well said! So simple and prescient. 
    • It’s OK if you are only good at mastering complexity. But be aware of that. Don’t be under a misconception that just because you are good at mastering the complexity you can also extract simplicity out of it. That’s the key thing. Don’t fool yourself. :)
  • In the context of the control plane, the thing is we have learnt to master its complexity but not learnt to extract simplicity from it. That’s the key problem. 
    • To give an analogy with programming, we no longer think of programming in terms of machine language or registers or memory spaces. All these are abstracted away. This abstraction means a programmer can focus on tackling the problem in a totally different way compared to how he/ she would have had to approach it if they had to take care of all the underlying issues and figure it out. Abstraction is a very useful tool. E.g. Object Oriented Programming, Garbage Collection. Extract simplicity! 
  • Another good quote (from Barbara Liskov) – “Modularity based on abstraction is the way things get done”.
    • Or put another way :) Abstractions -> Interfaces -> Modularity (you abstract away stuff; provide interfaces between them; and that leads to modularity). 
  • As mentioned earlier the data plane has good abstraction, interfaces, and modularity (the layers). Each layer has well defined interfaces, and the actual implementation of how a particular layer gets things done is down to the protocols used in that layer or its implementations. The layers above and below do not care. E.g. Layer 3 (IP) expects Layer 2 to somehow get its stuff done. The fact that it uses Ethernet and frames etc. is of no concern to IP. 
  • So, what are the control plane problems in networking? 
    • We need to be able to compute the configuration state of each network device. As in, what ACLs it is supposed to be applying, what its forwarding tables look like …
    • We need to be able to do this while operating without communication guarantees. So we have to deal with communication delays or packet drops etc as changes are pushed out. 
    • We also need to be able to do this while operating within the limitations of the protocol we are using (e.g. IP). 
  • Anyone trying to master the control plane has to deal with all three. To give an analogy with programming, it is as though a programmer had to worry about where data is placed in RAM, take care of memory management and process communication etc. No one does that now. It is all magically taken care of by the underlying system (like the OS or the programming language itself). The programmer merely focuses on what they need to do. Something similar is required for the control plane. 
  • What is needed?
    • We need an abstraction for computing the configuration state of each device. [Specification Abstraction]
      • Instead of thinking of how to compute the configuration state of a device or how to change a configuration state, we just declare what we want and it is magically taken care of. You declare how things should be, and the underlying system takes care of making it so. 
      • We think in terms of specifications. If the intention is that Device A should not have access to Device B, we simply specify that in the language of our model without thinking of the how in terms of the underlying physical model. The shift in thinking here is that we view each thing as a layer and only focus on that. To implement a policy that Device A should not have access to Device B we do not need to think of the network structure or the devices in between – all that is just taken care of (by the Network Operating System, so to speak). 
      • This layer is Network Virtualization. We have a simplified model of the network that we work with, in which we specify how things should be, and the Network Virtualization takes care of actually implementing it. 
    • We need an abstraction that captures the lack of communication guarantees – i.e. the distributed state of the system. [Distributed State Abstraction]
      • Instead of thinking how to deal with the distributed network we abstract it away and assume that it is magically taken care of. 
      • Each device has access to an annotated network graph that they can query for whatever info they want. A global network view, so to say. 
      • There is some layer that gathers an overall picture of the network from all the devices and presents this global view to the devices. (We can think of this layer as being a central source of information, but it can be decentralized too. Point is that’s an implementation problem for whoever designs that layer). This layer is the Network Operating System, so to speak. 
    • We need an abstraction of the underlying protocol so we don’t have to deal with it directly.  [Forwarding Abstraction]
      • Network devices have a Management CPU and a Forwarding ASIC. We need an abstraction for both. 
      • The Management CPU abstraction can be anything. The ASIC abstraction is OpenFlow. 
      • This is the layer that is closest to the hardware. 
  • SDN abstracts these three things – distribution, forwarding, and configuration. 
    • You have a Control Program that configures an abstract network view based on the operator requirements (note: this doesn’t deal with the underlying hardware at all) ->
    • You have a Network Virtualization layer that takes this abstract network view and maps it to a global view based on the underlying physical hardware (the specification abstraction) ->
    • You have a Network OS that communicates this global network view to all the physical devices to make it happen (the distributed state abstraction (for disseminating the information) and the forwarding abstraction (for configuring the hardware)).
  • Very important: Each piece of the above architecture has a very limited job that doesn’t involve the overall picture. 
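My attempt to express that three-step pipeline as code – entirely my own toy model; none of the names or structures correspond to a real controller:

```python
# Toy end-to-end of the three SDN abstractions above.

# 1. Control Program: operator intent against an *abstract* network
#    view -- no knowledge of the physical hardware.
intent = {"allow": [("A", "B")], "deny": [("A", "C")]}

# 2. Network Virtualization: map abstract endpoints to physical
#    switch ports (the specification abstraction).
placement = {"A": ("sw1", 1), "B": ("sw2", 4), "C": ("sw2", 5)}

def compile_rules(intent, placement):
    """Turn abstract allow/deny pairs into per-switch port rules."""
    rules = {}
    for action in ("allow", "deny"):
        for src, dst in intent[action]:
            switch, port = placement[dst]
            rules.setdefault(switch, []).append((src, port, action))
    return rules

# 3. Network OS: disseminate the per-device rules (the distributed
#    state + forwarding abstractions; here just a dict per switch).
network_view = compile_rules(intent, placement)
print(network_view)
# {'sw2': [('A', 4, 'allow'), ('A', 5, 'deny')]}
```

Note how each step only sees its own slice of the problem – the intent never mentions switches, and the compiled rules never mention the intent’s reasoning – which is the “very important” point above.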

From this Whitepaper:

SDN has three layers: (1) an Application layer, (2) a Control layer (the Control Program mentioned above), and (3) an Infrastructure layer (the network devices). 

The Application layer is where business applications reside. These talk to the Control Program in the Control layer via APIs. This way applications can program their network requirements directly. 

OpenFlow (mentioned in Scott’s talk under the ASIC abstraction) is the interface between the control plane and the data/ forwarding plane. Rather than paraphrase, let me quote from that whitepaper for my own reference:

OpenFlow is the first standard communications interface defined between the control and forwarding layers of an SDN architecture. OpenFlow allows direct access to and manipulation of the forwarding plane of network devices such as switches and routers, both physical and virtual (hypervisor-based). It is the absence of an open interface to the forwarding plane that has led to the characterization of today’s networking devices as monolithic, closed, and mainframe-like. No other standard protocol does what OpenFlow does, and a protocol like OpenFlow is needed to move network control out of the networking switches to logically centralized control software.

OpenFlow can be compared to the instruction set of a CPU. The protocol specifies basic primitives that can be used by an external software application to program the forwarding plane of network devices, just like the instruction set of a CPU would program a computer system.

OpenFlow uses the concept of flows to identify network traffic based on pre-defined match rules that can be statically or dynamically programmed by the SDN control software. It also allows IT to define how traffic should flow through network devices based on parameters such as usage patterns, applications, and cloud resources. Since OpenFlow allows the network to be programmed on a per-flow basis, an OpenFlow-based SDN architecture provides extremely granular control, enabling the network to respond to real-time changes at the application, user, and session levels. Current IP-based routing does not provide this level of control, as all flows between two endpoints must follow the same path through the network, regardless of their different requirements.

I don’t think OpenFlow is used by NSX though. It is used by Open vSwitch and was used by NVP (Nicira Virtualization Platform – the predecessor of NSX).

Speaking of NVP and NSX: VMware acquired Nicira (a company founded by Martin Casado, Nick McKeown and Scott Shenker – the same Scott Shenker whose video I was watching above). The product was called NVP back then and primarily ran on the Xen hypervisor. VMware renamed it to NSX, and it now has two flavors. NSX-V is the version that runs on the VMware ESXi hypervisor, and is in active development. There’s also NSX-MH, which is a “multi-hypervisor” version that’s supposed to be able to run on Xen, KVM, etc., but I couldn’t find much information on it. There are some presentation slides in case anyone’s interested. 

Before I conclude here are some more blog posts related to all this. They are in order of publishing so we get a feel for how things have progressed. I am starting to get a headache reading all this network stuff, most of which is going above my head, so I am going to take a break here and simply link to the articles (with minimal info) and not go much into them. :)

  • This one talks about how the VXLAN specification doesn’t specify any control plane.
    • There is no way for hosts participating in a VXLAN network to know the MAC addresses of other hosts or VMs in the VXLAN so we need some way of achieving that. 
    • Nicira NVP uses OpenFlow as a control-plane protocol. 
  • This one talks about how OpenFlow is used by Nicira NVP. Some points of interest:
    • Each Open vSwitch (OVS) implementation has 1) a flow-based forwarding module loaded in the kernel; 2) an agent that communicates with the Controller; and 3) an OVS DB daemon that keeps track of the local configuration. 
    • NVP had clusters of 3 or 5 controllers. These used the OpenFlow protocol to download forwarding entries into the OVS, and OVSDB (a.k.a. ovsdb-daemon) to configure the OVS itself (creating/ deleting/ modifying bridges, interfaces, etc). 
    • Read that post on how the forwarding tables and tunnel interfaces are modified as new devices join the overlay network. 
    • Broadcast traffic, unknown Unicast traffic, and Multicast traffic (a.k.a. BUM traffic) can be handled in two ways – either by sending these to an extra server that replicates these to all devices in the overlay network; or the source hypervisor/ physical device can encapsulate the BUM frame and send it as unicast to all the other devices in that overlay. 
  • This one talks about how Nicira NVP seems to be moving away from OpenFlow or supplementing it with something (I am not entirely clear).
    • This is a good read though just that I was lost by this point coz I have been doing this reading for nearly 2 days and it’s starting to get tiring. 

One more post from the author of the three posts above. It’s a good read. Kind of obvious stuff, but good to see in pictures. That author has some informative posts – wish I was more brainy! :)

TIL: VXLAN is a standard

VXLAN == Virtual eXtensible LAN.

While reading about NSX I was under the impression VXLAN is something VMware cooked up and owns (possibly via Nicira, which is where NSX came from). But turns out that isn’t the case. It was originally created by VMware & Cisco (check out this Register article – a good read) and is actually covered under RFC 7348. The encapsulation mechanism is standardized, and so is the UDP port used for communication (port number 4789 by the way). A lot of vendors now support VXLAN, and similar to NSX, Open vSwitch is another implementation of it. Nice!

(Note to self: got to read more about Open vSwitch. It’s used in XenServer and is a part of Linux. The *BSDs too support it). 

VXLAN is meant to both virtualize Layer 2 and also replace VLANs. You can have up to 16 million VXLANs (the NSX Logical Switches I mentioned earlier). In contrast you are limited to 4094 VLANs. I like the analogy that VXLAN is to IP addresses what cell phones are to telephone numbers. Prior to cell phones, when everyone had landline numbers, your phone number was tied to your location. If you shifted houses/ locations you got a new phone number. In contrast, with cell phone numbers it doesn’t matter where you are as the number is linked to you, not your location. Similarly with VXLAN your VM IP address is linked to the VM, not its location. 
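The 16-million figure falls straight out of the VXLAN header format in RFC 7348: the VNI field is 24 bits wide (2^24 ≈ 16.7 million), versus the 12-bit VLAN ID (2^12 = 4096, of which 4094 are usable). A quick sketch of the 8-byte VXLAN header:

```python
import struct

def vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header per RFC 7348:
    flags (8 bits, 0x08 = 'VNI present') + 24 reserved bits,
    then the 24-bit VNI + 8 reserved bits."""
    assert 0 <= vni < 2**24, "VNI is only 24 bits - hence ~16 million segments"
    return struct.pack("!II", 0x08 << 24, vni << 8)

hdr = vxlan_header(5001)
print(len(hdr))                              # 8
print(int.from_bytes(hdr[4:8], "big") >> 8)  # 5001
```

This header then sits inside a UDP datagram (destination port 4789), which wraps the original Layer 2 frame.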

Update:

  • Found a good whitepaper by Arista on VXLANs. Something I hadn’t realized earlier was that the 24-bit VXLAN Network Identifier is called the VNI (this is what lets you have 16 million VXLAN segments/ NSX Logical Switches) and that a VM’s MAC is combined with its VNI – thus allowing multiple VMs with the same MAC address to exist across the network (as long as they are on separate VXLAN segments). 
  • Also, while I am noting acronyms I might as well also mention VTEPs. These stand for VXLAN Tunnel End Points. This is the “thing” that encapsulates/ decapsulates packets for VXLAN. This can be virtual bridges in the hypervisor (ESXi or any other); or even VXLAN-aware VM applications or VXLAN-capable switching hardware (wasn’t aware of this until I read the Arista whitepaper). 
  • VTEP communicates over UDP. The port number is 4789 (NSX 6.2.3 and later) or 8472 (pre-NSX 6.2.3).
  • A post by Duncan Epping on VXLAN use cases. Probably dated in terms of the VXLAN issues it mentions (traffic tromboning) but I wanted to link it here as (a) it’s a good read and (b) it’s good to know such issues as that will help me better understand why things might be a certain way now (because they are designed to work around such issues). 

Cisco CME outgoing caller ID not showing individual extensions

Been working with our Cisco CME (Cisco Unified Communications Manager – Express – as a reminder to myself!) at work for the past 2-3 days. I have no idea about Cisco telephony, but wanted to tackle an issue anyway. Good way to learn a new system.

The issue was that whenever anyone made an outgoing external call from our system, the caller ID number shown at the remote end was that of our main number. That is to say, if our main number extension is 900 (which externally appears as 1234900) and my extension is 929, when I make an outgoing external call to (say) my mobile the number appears as 1234900 instead of 1234929.

A useful command to debug such situations is the following command:
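On an ISDN trunk that would typically be:

```
debug isdn q931
```

(If your trunk is SIP rather than ISDN, `debug ccsip messages` gives the equivalent view.)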

This shows what is sent to the ISP when I place an outgoing external call.

Another useful command is:
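For tracing a call through CME’s call control itself, the usual choice is:

```
debug voice ccapi inout
```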

This shows the internal processing that happens on CME/ CUCM. Things like what translations happen, what dial peers are selected, etc. It’s a lot of output compared to the first command.

To see what debugging is enabled on your system the following command is useful:
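On IOS that’s:

```
show debugging
```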

To turn off all debugging (coz it takes a toll on your router and so you must disable it once you are done):
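The quick way (`u all` works as a shortcut):

```
undebug all
```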

Lastly, to see the debugging output if you are SSH’d into the router rather than on the console, do the following:
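Debug output goes to the console by default; this sends it to your SSH session too:

```
terminal monitor
```

(And `terminal no monitor` to stop it.)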

In my case I found out that even though CME was correctly sending the calling extension/ number as 9xx to the ISP, it looked like the ISP was ignoring it. I thought that maybe it expects the number in a proper format (as an external number) so I made a translation rule for outgoing calling numbers to change 9xx to the correct format (12349xx) and pass it to the ISP – and that fixed the issue.

Here’s the rule and translation profile I created. I played around a bit with the output and saw that I need to pass a “0” before the full number for the ISP to recognize it correctly.
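Reconstructing it from the description above (the `9..` pattern and `01234` prefix follow this post’s example numbers; the profile name is arbitrary):

```
voice translation-rule 1
 rule 1 /^\(9..\)$/ /01234\1/ type any national plan any isdn
!
voice translation-profile CALLERID-OUT
 translate calling 1
```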

I set the plan and type too though it didn’t make any difference in my case. I saw some posts on the Internet where it does seem to make a difference, so I didn’t remove it.

Lastly I apply this profile to the voice port:
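Applied under the voice port (the port number is an example, and I’m assuming the profile was named CALLERID-OUT):

```
voice-port 0/0/0:15
 translation-profile outgoing CALLERID-OUT
```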

That’s all.

vCenter unable to connect to hosts; vSphere client gives error ‘”ServiceInstance.RetrieveContent” for object “ServiceInstance” on Server “IP-Address” failed’

Our Network team had been making some changes at work and suddenly vCenter in our London office lost connectivity with all the ESX hosts in one of our remote offices. Moreover, when trying to connect from the vSphere Client to any of the remote hosts directly we were getting the following error –

client error

Connectivity from vSphere Client in the remote office to the ESX host in the same office was fine; it was only connectivity from other offices to this remote office. So it definitely indicated a network issue.

This KB article is a handy one to know what ports are required by various VMware products. Port 443 is what needs to be open to ESX hosts for vCenter Server to be able to talk to them. I did a telnet from the vCenter server to each of the remote office hosts on port 443 and it went through fine – so wasn’t a firewall issue. (Another post with port numbers, just FYI, is this one).

After a fair bit of troubleshooting we tracked the issue down to MTU.

Digressing into MTUs

Communication between two IP addresses (i.e. layer 3) happens through packets. Thus when my London vCenter Server communicates with my remote office ESX host, the two send TCP/IP packets to each other. When these packets from the vCenter Server reach the switch/ router on the same LAN as the ESX host, it becomes a layer 2 communication (because they are on the same network and it’s a matter of data reaching the ESX host from the switch/ router). In the case of Ethernet, this layer 2 communication happens via Ethernet frames. The frames encapsulate the IP packets – so the switch/ router breaks the packets and fits them into multiple frames, while the ESX host receives these frames and re-assembles the packets (and vice versa). (The picture on this Wikipedia page is worth a look to see the encapsulation). 

How much data can be held by a layer 2 frame is defined by the Maximum Transmission Unit (MTU). Larger MTUs are good because you can carry more data; but they have a downside in that each frame takes longer to be transmitted, and in case of any errors more data has to be re-transmitted when the frame is resent. So a balance is important. In the case of Ethernet, RFC 894 (see errata also) defines the MTU as a maximum of 1500 bytes. In the case of other layer 2 protocols, the MTU varies: for example 4464 bytes for Token Ring; 4352 bytes for FDDI; 9180 bytes for ATM; etc. In the case of Ethernet there are now also jumbo frames, which are frames with an MTU size of 9000 bytes (see this page for a table comparing regular frames and jumbo frames) and are commonly used in iSCSI networks.

Taking the case of Ethernet, assume the MTU of all Ethernet networks is 1500 bytes. So when two devices are conversing with each other over layer 3, and this conversation spans multiple Ethernet networks, it is helpful if the devices know that the MTU of the underlying layer 2 network is 1500 bytes. That way the two devices can keep the size of their layer 3 packets below 1500 bytes. Why? Because if the size of the layer 3 packets is greater than 1500 bytes, then the devices and all the routers/ switches in between will have to fragment (break) the layer 3 packets into smaller packets of less than 1500 bytes each to fit them into Ethernet frames. This is a waste of resources for all, so it’s best if the two devices know of the underlying layer 2 MTU and act accordingly.

Now, note that Ethernet MTUs are defined as a maximum of 1500 bytes. So the MTU for a particular LAN segment can be set to a lower number for whatever reason (maybe there are additional fields in the Ethernet frame and to accommodate these the data portion must be reduced). Similarly, a layer 3 conversation between two devices can go over a mix of layer 2 networks – Ethernet, Token Ring, etc – each with a different MTU. So what the two devices really need is a way of knowing the lowest MTU across all these layer 2 networks, so they can use it as the MTU of the layer 3 packets for their conversation. This is known as the Path MTU or IP MTU – it is basically the smallest of all the underlying layer 2 MTUs over which that conversation traverses. It is discovered through a process known as “Path MTU Discovery” (PMTUD) (check this Wikipedia article, or Google this term to learn more). Very briefly, in the case of IPv4 what happens is that each device sends packets of increasing size to the other end, with a flag set that says “do not fragment this packet”. Packets smaller than the lowest layer 2 MTU will get through, but once the size exceeds the lowest MTU the packet cannot be fragmented (due to the flag), so it is dropped and an ICMP “fragmentation needed” error is returned to the sender. Thus the Path MTU is discovered. This check happens in both directions.
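You can do a PMTUD by hand with ping’s do-not-fragment flag – a handy check for exactly the kind of MTU problem this post is about. The host name below is made up; on Windows the equivalent flags are `-f -l <size>`:

```
# Linux: 1472 bytes of ICMP payload + 8 (ICMP header) + 20 (IP header) = 1500
ping -M do -s 1472 esx01.remote.example.com   # gets through if the path MTU is 1500
ping -M do -s 1473 esx01.remote.example.com   # "Frag needed" if the path MTU is 1500
```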

So we have layer 2 MTUs and layer 3 MTUs. Layer 2 MTUs have a maximum value that is dependent on the layer 2 network technology. But what about the minimum value? RFC 791, which defines the Internet Protocol (the IP in TCP/IP), requires that all devices supporting IP must be able to forward packets of 68 bytes without fragmenting (68 bytes because the IP header can be up to 60 bytes and the minimum fragment size is 8 bytes) and be able to accept packets of minimum size 576 bytes, either as one packet or as multiple fragments that require reassembling. Because of this the minimum layer 2 MTU can be thought of as 68 bytes. In a practical sense, however, most IP devices accept 576 bytes without fragmenting, and since practically every layer 2 network supports an MTU of at least that much, the practical minimum layer 2 & layer 3 MTU can be thought of as 576 bytes.

Just for completeness I will also mention Maximum Segment Size (MSS) which is a layer 4 MTU (of sorts) that defines what’s the maximum TCP segment (which is what a TCP packet is called) that can be accepted by devices. It has a default value of 536 bytes. This is based on the 576 bytes that IP requires hosts to accept at minimum, minus 20 bytes for IP headers and 20 bytes for TCP headers. Idea behind using 576 bytes as the base is that this way the TCP segment can be expected to arrive without fragmenting. In a practical sense again, for TCP/IP traffic over Ethernet (which is the common case), since Ethernet frames have an MTU of 1500, the MSS is usually set to 1500 minus 20 minus 20 = 1460 bytes.
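The arithmetic above, as a trivial sketch:

```python
IP_HEADER = 20    # bytes: IPv4 header, no options
TCP_HEADER = 20   # bytes: TCP header, no options

def mss(ip_mtu: int) -> int:
    """TCP Maximum Segment Size for a given IP MTU."""
    return ip_mtu - IP_HEADER - TCP_HEADER

print(mss(576))    # 536  - the default MSS
print(mss(1500))   # 1460 - typical for Ethernet
```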

This is a good article I came upon. Just linking it as a reference to myself.

Back to our issue

In our case the router in the remote site had the following set in its configuration:

I am not entirely clear where it was set or why it was set, as that comes under the Network team. What this does though is tell the router not to clear the “Do Not Fragment” (DF) bit in the IP packets passing through it. If the DF bit is set in a packet then the router will not fragment it even if the packet is larger than the MTU (this is also what PMTUD relies on). I am not sure why this was set – part of some testing I suppose – but because of this larger packets were not getting through to the other side and hence failing. Our Network team removed this statement and then communication with the ESX hosts started working fine.

I wanted to write more about this statement but I am running out of time. This and this are two good links worth reading for more info. Especially the Scenario 4 section in the second link – that’s pretty much what was happening in our case, I think.

OpenVPN not setting default gateway

Note to self. Using OpenVPN on Windows, if your Internet traffic still goes via the local network rather than the VPN network, check whether OpenVPN has set itself as the default gateway.
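You can check with `route print -4` on Windows; the output looks something along these lines (addresses and metrics here are illustrative):

```
C:\> route print -4
Network Destination        Netmask          Gateway       Interface  Metric
          0.0.0.0          0.0.0.0      192.168.1.1    192.168.1.50     281
          0.0.0.0          0.0.0.0         10.8.0.1        10.8.0.6      25
```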

Notice the second entry with the lower metric above. That’s OpenVPN. If such an entry exists, and its metric is lower than the other entries, then it will be used as the default route. If such an entry does not exist then double check that you are running OpenVPN GUI with “Run as Administrator” rights. I wasn’t, and it took me a while to realize that was the reason why OpenVPN wasn’t setting the default route on my laptop!

Hope this helps someone.

PowerShell: Add a line to multiple files

Trivial stuff, but I don’t get to use PowerShell as much as I would like to, so I end up forgetting elementary things that should just be second nature to me. Case in point: I wanted to add a line to a bunch of text files. Here’s what I came up with – just posting it here as a reference to my future self.
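Something to this effect (`Add-Content` does the appending; the file pattern assumes OpenVPN configs, which is what I was after here):

```powershell
# Append the auth-user-pass directive to every OpenVPN config in the folder
Get-ChildItem *.ovpn | ForEach-Object {
    Add-Content -Path $_.FullName -Value "auth-user-pass pia.txt"
}
```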

For the background behind this, I use Private Internet Access for my VPN needs and since last month or so my ISP’s been blocking traffic to it. Private Internet Access offers a client that lets me connect to their servers via UDP or TCP. The UDP option began failing but the TCP option surprisingly worked. Of course I don’t want to use TCP as that’s slow, so I went to Private Internet Access’s website where they give a bunch of OpenVPN config files we can use. These are cool in the sense that some of them connect via IP (instead of server name) while others connect to well-known ports or use a well-known local port, so there’s less chance of being blocked by the ISP. In my case it turned out just connecting via IP was more than enough, so it looks like the ISP isn’t blocking OpenVPN UDP ports, it’s just blocking UDP traffic to these server names.

Anyhow, the next step was to stop the client from prompting for a username/ password. OpenVPN has an option auth-user-pass which lets you specify a file-name where the username and password are on separate lines. So all I had to do was create this file and add a line such as auth-user-pass pia.txt to all the configuration files. That’s what the code snippet above does.

Notes on Teredo (part 4)

Continuing my posts on Teredo, consider the following network:

IPv6 - Teredo

There are two Windows clients behind two different NAT routers. There’s a Teredo server and a Teredo relay. And there are IPv6 only and IPv6/IPv4 resources on the Internet that these clients would like to access. The Teredo server and relay are on different networks, and the Teredo relay is on a different network from the clients and IPv6 resources. They don’t need to be on different networks or even different servers; I’ve drawn them that way to illustrate that they are independent devices and not necessarily on the same network as the client and resources.

The clients communicate with the Teredo server and relay through IPv6 packets in UDP messages in IPv4 packets. The Teredo server and relay communicate with the IPv6 parts of the Internet over IPv6, except when the host has both IPv4 and IPv6 in which case the host acts as its own relay and talks to the clients directly over IPv4 (using IPv6 in UDP messages as usual).

Once I configure both clients with the Teredo server address they get Teredo IPv6 addresses as below. These are both link-local as well as 2001::/32 prefix addresses. The Teredo relay too gets Teredo IPv6 addresses, while the Teredo server has link-local Teredo IPv6 addresses only.

IPv6-Teredo2

Decoding the Teredo IPv6 address

Let’s decode the Teredo IPv6 address of one of the clients to see what it means.

Client 192.168.99.200 has address 2001:0:1117:34fa:1027:374b:e1fc:f635. So the network prefix is 2001:0:1117:34fa, host identifier is 1027:374b:e1fc:f635.

Dealing with the network prefix:

  • 2001::/32 is the Teredo network prefix.
  • 1117:34fa is 32 bits of the Teredo server IPv4 address in hex. 0x11 = 17, 0x17 = 23, 0x34 = 52, 0xFA = 250. Which converts to 17.23.52.250.

And the host identifier:

  • 1027 is 16 bits of flags and random bits.
  • 374b is 16 bits of the obfuscated external port of the NAT. (Obfuscation happens by XOR-ing the digits with 0xFFFF).
    • 0x374b = 0xc8b4 unobfuscated = 51380.
  • e1fc:f635 is 32 bits of the obfuscated external IPv4 address of the NAT.
    • 0xe1 = 0x1e unobfuscated = 30
    • 0xfc = 0x03 unobfuscated = 3
    • 0xf6 = 0x09 unobfuscated = 9
    • 0x35 = 0xca unobfuscated = 202
    • So the NAT’s external IPv4 address is 30.3.9.202.
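All of that decoding can be done mechanically. A small Python sketch (the address is the one decoded above):

```python
import ipaddress

def decode_teredo(addr: str):
    """Split a Teredo address (RFC 4380) into its components:
    bytes 0-3   the 2001:0::/32 prefix
    bytes 4-7   Teredo server IPv4 address
    bytes 8-9   flags
    bytes 10-11 NAT external port, XOR-ed with 0xFFFF
    bytes 12-15 NAT external IPv4 address, XOR-ed with 0xFFFFFFFF"""
    b = int(ipaddress.IPv6Address(addr)).to_bytes(16, "big")
    server = str(ipaddress.IPv4Address(b[4:8]))
    flags = int.from_bytes(b[8:10], "big")
    port = int.from_bytes(b[10:12], "big") ^ 0xFFFF
    nat = str(ipaddress.IPv4Address(int.from_bytes(b[12:16], "big") ^ 0xFFFFFFFF))
    return server, flags, port, nat

print(decode_teredo("2001:0:1117:34fa:1027:374b:e1fc:f635"))
# ('17.23.52.250', 4135, 51380, '30.3.9.202')   -- 4135 == 0x1027
```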

Packet flow

Let’s look at some Wireshark captures to see what happens. From one of the clients (192.168.20.200) I am pinging an IPv6 host and here’s what happens.

(The Teredo IPv6 addresses in these captures are different from the ones above because I made the captures a while later and the NAT IPv4 address and port changed by then).

First the client makes a DNS request (over IPv4) to resolve the name, and one of the name servers replies with the IPv6 address (the AAAA record).

Capture1

Next the client contacts the Teredo server to configure itself. It sends an ICMPv6 Router Solicitation message to the ff02::2 multicast IPv6 address (link-local scope, all routers). The source is set to a link-local IPv6 address fe80::ffff:ffff:fffe which as far as I can tell seems to be an anycast address but I am not too sure. (Typically Router Solicitation messages set the source address as fe80::/64 followed by a random or EUI-64 based host identifier).

Capture2

It’s worth noting that this ICMPv6 message is actually sent within a UDP message in an IPv4 packet addressed to the Teredo server IPv4 address. Pay attention to the destination IPv4 address of this packet – it’s the first IPv4 address of the Teredo server. From the capture above you’ll see there are two Router Solicitation messages sent and two replies received. That’s because the second Router Solicitation message is sent to the second IPv4 address of the Teredo server. Both addresses send a reply and that’s how the Teredo client identifies whether it’s behind a symmetric NAT or not.

Capture3

In the capture above it’s also worth noting the port number used by the client’s NAT: 52643. There is now a mapping on the NAT device from this port to the client. The mapping could be a cone or restricted, but that does not matter to us (as long as it’s not symmetric).

Moving on, the Teredo client sends an IPv6 Connectivity Test message from its Teredo IPv6 address to the IPv6 address of the destination. The client has auto-configured an IPv6 address for itself, and it knows the destination IPv6 address from the initial DNS query. This test message is really an ICMPv6 ping request to the destination, sent within a UDP message over IPv4 to the Teredo server. I had mentioned this previously in the section on Teredo relays. This is how a Teredo client finds the Teredo relay closest to the destination.

Capture4

The Teredo server now does a Neighbor Solicitation to find the MAC address of the destination. From the IPv6 address it knows the destination is on the same network as itself. (If they were on different networks, the Teredo server would have sent this packet to its gateway for forwarding.)

Notice the destination address ff02::1:ff00:254. As mentioned in an earlier post this is a solicited node multicast address.

  • ff02 indicates the scope is link-local.
  • ::1:ffxx:xxxx is the solicited node address that a host whose IPv6 address ends in the 24 bits xx:xxxx would have subscribed to. For the destination host IPv6 address 2dcc:7c4e:3651:52::254, the last 24 bits are 00:0254, giving ff02::1:ff00:254.
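Computing the solicited-node address, as a sketch:

```python
import ipaddress

def solicited_node(addr: str) -> str:
    """RFC 4291: ff02::1:ff00:0/104 plus the low 24 bits of the unicast address."""
    low24 = int(ipaddress.IPv6Address(addr)) & 0xFFFFFF
    base = int(ipaddress.IPv6Address("ff02::1:ff00:0"))
    return str(ipaddress.IPv6Address(base | low24))

print(solicited_node("2dcc:7c4e:3651:52::254"))   # ff02::1:ff00:254
```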

Capture5

The destination host now sends a Neighbor Solicitation message to find its default gateway’s MAC address, gets a reply, and sends the ping reply to the Teredo client. Notice the ping reply is a pure IPv6 ICMPv6 packet.

Capture6

This packet reaches the Teredo relay, as the relay is advertising the 2001::/32 prefix. The relay can’t deliver it to the client yet though, so this packet is discarded.

The Teredo relay sends a UDP message to the NAT router of the client, on the port that’s mapped to it. The NAT router replies with a prohibited message. This tells the Teredo relay that any previous mappings the client’s NAT might have had for the relay are gone and so a new mapping must be made. Since it can’t contact the client directly it must take help from the Teredo server.

Capture7

The Teredo relay now sends a bubble packet from itself to the IPv6 Teredo address of the client. This bubble packet is sent via UDP over IPv4 to the Teredo server. (Remember this is how the need for hole punching is communicated to the server).

Capture8

The Teredo server sends this packet via UDP to the port that’s mapped on the client’s NAT router. Note the IPv6 portion of the packet it sends is the same as the one it receives. This way the client is able to identify the Teredo relay’s IPv4 and Teredo IPv6 address.

Capture9

I was under the impression the Teredo relay address is sent to the client as an “Origin Indication” by the Teredo server but apparently the client picks it up from the IPv6 packet the relay sends it. The Teredo server doesn’t modify the IPv6 payload. The payload is a bubble packet (next header is 59) and contains details of the relay.

Now the Teredo client contacts the Teredo relay and sends it bubble packets. The relay responds and thus the client and relay know of each other.

Capture12

All this was just prep work done behind the scenes when the client wants to talk to an IPv6 host. I generated this traffic by trying to ping the IPv6 host from the client so below is the traffic for that:

Capture14

The Teredo client contacts the relay via UDP/ IPv4 and passes on the ICMPv6 ping request packet. That’s sent on the IPv6 network to the destination. A reply arrives. The reply is passed via UDP/IPv4 to the client. No more additional steps.

Below is the capture for all four ping requests and replies.

Capture15

And that’s it!

Microsoft has a good page on Teredo that’s worth looking at. It goes into more detail on the processes and scenarios than I do above.

Notes on Teredo (part 3)

In the previous two parts I talked about Teredo in general and also about NAT & Teredo. In this post I hope to talk more about how Teredo works.

Teredo Clients

Microsoft has made available Teredo servers on the Internet. These are reachable at win8.ipv6.microsoft.com and teredo.ipv6.microsoft.com, and Windows clients come with one of these addresses already set as their Teredo server.

If the Teredo server address is not reachable, the client is in an offline state:

If the Teredo server address is reachable, the client is in a dormant state. As the name indicates this is a state in which the Teredo client is not active, but when required it can contact the server and auto-configure an IPv6 address and send/ receive packets.

Send some IPv6 traffic and the state automatically changes to qualified. (Note how the first ping reply took a lot more time than the rest as the Teredo interface was being configured. Sometimes the first reply can time out too).

Now the Teredo state also shows the type of NAT the client is behind and also the local and external mappings.

Another thing to note in the output above is the “Network” which is currently set to “unmanaged”. Since Teredo allows a client to be reached across a firewall/ NAT and this is something an organisation might not want for its managed machines, the Teredo client tries to accommodate that and before initializing itself it checks whether the computer is on a managed network. If the computer is domain joined and on a network where its domain controllers are reachable – i.e. within an organisation – the Teredo client detects that it’s on a managed network and disables itself.

This setting can be changed to set the Teredo client as qualified even in a managed network. This can be done via GPOs, PowerShell, or netsh. The netsh command for this is:
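It’s the enterpriseclient state type that does the trick:

```
netsh interface teredo set state type=enterpriseclient
```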

This command must be run as an administrator. When a Teredo client is in a managed network and qualified, it is known as an Enterprise Client. Hence the name.

It is also possible to configure clients with a manually specified Teredo server. This can be done via PowerShell …
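Via the NetworkTransition module cmdlet (server name here is an example):

```powershell
Set-NetTeredoConfiguration -ServerName teredo.example.com
```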

… or netsh
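Again with an example server name:

```
netsh interface teredo set state servername=teredo.example.com
```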

Teredo Servers

Setting up your own Windows Teredo Server is easy. Windows 7, Windows 8, Windows Server 2008 R2, Windows Server 2012, and later can function as a Teredo server.

Here’s how I enable one of these as a Teredo server:
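It’s just a state change (note that a proper Teredo server needs two consecutive public IPv4 addresses), and the state can be checked afterwards:

```
netsh interface teredo set state type=server
netsh interface teredo show state
```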

And that’s it really! The computer is now working as a Teredo server.

Running the above command again shows updated stats.

One thing to keep in mind is that even a computer functioning as a Teredo server must be set up with a Teredo server. If the pre-configured Teredo server (e.g. win8.ipv6.microsoft.com) is unreachable, the Teredo interface will be disabled and the computer will not work as a Teredo server. The Teredo state will show as offline even though this computer itself is a server.

To fix this, set this computer as its own Teredo server.

Teredo Relays

Previously I wrote about how a Teredo relay sends packets to a Teredo client. How do clients know which Teredo relay to use though? So far we haven’t set a Teredo relay anywhere in our client and server configuration, so where does it enter the picture?

While Teredo servers are specific to a client – i.e. the client is assigned a Teredo server and each client uses only one Teredo server – Teredo relays are specific to the remote end and a particular client will use different relays for different destinations. Here’s how the process works:

  1. When a Teredo client needs to contact a remote IPv6 host, it first sends an ICMPv6 packet to the remote host.
  2. Since it doesn’t know how to contact this host, and this is an initial setup connection, the client sends this packet to the Teredo server as a UDP message in IPv4.
  3. The Teredo server receives this message, decapsulates the IPv6 packet, and sends it on the IPv6 network. Note: this IPv6 packet has the destination address set as the IPv6 address of the remote host, and source address set as the Teredo IPv6 address of the Teredo client.
  4. Now for the fun part! The IPv6 packet reaches the destination host, and the host creates a reply IPv6 packet with itself as the source and the Teredo client IPv6 address as the destination. This packet is sent on the IPv6 network. On the IPv6 network are many Teredo relays, all of them advertising the 2001::/32 prefix. The packet will reach the relay nearest to the destination host, which will then send it to the Teredo client. Once the Teredo client receives the ICMPv6 reply, it knows which relay was used and thus knows the IPv4 address of the relay closest to the destination.
  5. The Teredo client then sends the actual IPv6 packet as a UDP message in an IPv4 packet to this Teredo relay. And since hole punching has been done for this relay address, further packets to and from this relay can travel through.

Similarly when an IPv6 host has a packet for a Teredo client, the packet makes its way to the relay closest to that host. The relay then checks whether it already has a communication set up with the client, in which case it sends the packet over via IPv4. If there’s no on-going communication, or it’s been a while, the relay goes through the hole punching process again and sends the packet.

Similar to the Teredo server, Windows 7, Windows 8, Windows Server 2008 R2, Windows Server 2012, and later can function as a Teredo relay. Setting up one of these as a Teredo relay is quite straight-forward. All one has to do is:

  1. Ensure the Teredo interface is ready – i.e. the relay can reach a Teredo server and the interface is not offline.
  2. Enable forwarding on the Teredo interface. Enable forwarding on the interface(s) to the IPv6 network.
  3. Publish a route for the 2001::/32 prefix.
  4. Enable IPv6 router advertisements on the IPv6 network so other routers pick up the published route.

And that’s it! Here are the commands:

That’s all for now!

Notes on Teredo (part 2)

Before going more into Teredo it’s worth talking about the types of NAT.

Types of NAT

When an internal device sends a packet to an Internet device, the source address and port (i.e. the private IPv4 address and port number of the internal device) are translated by the NAT. The outgoing packet from the NAT device will have the source IPv4 address set as the public IPv4 address of the NAT box and a newly assigned port on the NAT box as the source port address of the packet. And a mapping for this will be stored on the NAT box (internal IPv4 address, port number <-> Public IPv4 address, port number).
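
A minimal sketch of such a translation table, with an invented sequential port-allocation scheme (real NAT boxes differ):

```python
# Toy NAT translation table, as described above. The port-allocation
# scheme (sequential from 5000) is invented for illustration.
class NatBox:
    def __init__(self, public_ip):
        self.public_ip = public_ip
        self.next_port = 5000
        self.mappings = {}  # (internal IPv4, internal port) -> external port

    def translate_outgoing(self, internal_ip, internal_port):
        """Return the (public IPv4, port) an outgoing packet is rewritten to."""
        key = (internal_ip, internal_port)
        if key not in self.mappings:
            self.mappings[key] = self.next_port  # store a new mapping
            self.next_port += 1
        return (self.public_ip, self.mappings[key])

nat = NatBox("203.0.113.1")
print(nat.translate_outgoing("192.168.1.10", 1234))  # ('203.0.113.1', 5000)
print(nat.translate_outgoing("192.168.1.10", 1234))  # same mapping is reused
```

The interesting question – who may send packets back in through such a mapping – is exactly what distinguishes the NAT types.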

What happens next is what defines the various types of NAT:

Cone NAT

In a cone NAT, once a mapping is thus stored, packets from any external device to the public IPv4 address and port number of the mapping will be forwarded to the internal IPv4 address and port number. The key word here is any. So my machine behind the NAT can send a packet to (say) Google.com from port 1234 of my machine -> this will create a mapping on the NAT box from my internal IPv4 address and port 1234 to the public IPv4 address and (say) port 5678. Now a totally different server (say) Yahoo.com can send a packet to port 5678 of the public IPv4 address and it will be forwarded to my machine’s port 1234.

Essentially my internal machine now has a port mapped on the NAT which is always forwarded to my machine. Of course, the mapping could expire after a while of disuse, after which I’d get a new external port number and mapping – but as long as I keep sending traffic to keep the mapping alive, the port is forever forwarded to me.

You can imagine why this looks like a cone. The tip of the cone is the public IPv4 address and port; the base – the open end – is everything in the external world. Everything in the external world can contact me on that port, hence the name.

Restricted NAT

In a restricted NAT, once a mapping is stored, only packets from the particular external device which the packets were originally sent to can use the mapping and be forwarded. That is, in the example above, once there’s a mapping from my internal IPv4 address and port 1234 to the external public IPv4 address and port 5678 for Google.com, only Google.com can send a packet to the external IPv4 address and port 5678 and have it forwarded to my machine; if Yahoo.com sends a packet to the same IPv4 address and port it will be discarded.

There is a stricter form of restricted NAT where even the port number of the sender is checked. That is, if the initial packet was to port 9999 of Google.com, only packets from port 9999 of Google.com are allowed to enter via that mapping. Packets from port 9998 of Google.com will be silently discarded!

Symmetric NAT

Symmetric NAT takes things one step further!

In the two NAT types above the mapping is stored for all traffic from the internal device and port. That is, if I send packets from port 1234 of my internal IPv4 address, a mapping is created to port 5678 on the NAT box and that is used for all traffic from port 1234 of my internal machine. So – I contact Google.com from internal port 1234, the mapping is used. I contact Yahoo.com from the same internal port, the same mapping is used. I contact Bing.com from the same internal port, the same mapping is used! The only difference between the two types above was to do with incoming packets – whether they were from an IPv4 address (and port, in the stricter version) that was previously contacted. But the mapping was always the same for all traffic.

However, in a symmetric NAT a new mapping is created for each destination. So if I contact Google.com from port 1234 of my internal machine, it could be mapped to port 5678 of the NAT box and a mapping created for that. Next, if I contact Yahoo.com from port 1234 of the internal machine (same internal port as before), it could be mapped to port 5889 of the NAT box and a new mapping created for that. And later if I contact Bing.com from again the same port 1234 of the internal machine, yet another mapping will be created on the NAT box, say to port 6798. And worse, each of these mappings behaves like a restricted NAT mapping – i.e. only the IP address to which the packets were initially sent when creating that mapping can use the mapping to send back packets.

You can see why this is called symmetric NAT. Each flow from the internal side has its own mapping on the external side – a one-to-one mapping from the internal port & IPv4 address to an external port, per destination.
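
The incoming-packet behaviour of the filtering variants described above can be distilled into a few lines. A sketch only – "port-restricted" here is the stricter restricted form, and symmetric NAT additionally creates a fresh outgoing mapping per destination, which this incoming check alone doesn’t capture:

```python
# Decide whether an incoming packet may use an existing mapping,
# per NAT type. `contacted` records the external peers this internal
# flow has previously sent packets to: a set of (ip, port) tuples.
def incoming_allowed(nat_type, contacted, src_ip, src_port):
    if nat_type == "cone":
        return True                                    # anyone may use the mapping
    if nat_type == "restricted":
        return any(ip == src_ip for ip, _ in contacted)  # same IP only
    if nat_type == "port-restricted":
        return (src_ip, src_port) in contacted           # same IP and port
    raise ValueError(nat_type)

contacted = {("google.example", 9999)}
print(incoming_allowed("cone", contacted, "yahoo.example", 80))              # True
print(incoming_allowed("restricted", contacted, "yahoo.example", 80))        # False
print(incoming_allowed("port-restricted", contacted, "google.example", 9998))  # False
```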

Why do the types matter?

It’s important to know what type of NAT a particular client is behind when using Teredo. Remember my quick overview of Teredo – at step 2 the host behind the NAT asks the server for an IPv6 address and gets it. What this also does is create a mapping on the NAT box from the host’s internal IPv4 address and port number to the NAT box’s public IPv4 address and an external port. Now …

If the NAT were Cone, then at step 8 a Teredo relay can forward IPv6-over-IPv4 packets to the internal host by sending them to the NAT’s public IPv4 address and external port.

But if the NAT were Restricted, then step 8 would fail, as the internal host hasn’t contacted the relay yet and so there’s no mapping for the relay’s IPv4 address and/ or port. (It could be that the internal host has contacted the relay to send some packets and this is a response from the relay, in which case it will be let through; but it could also be a fresh connection from the relay, forwarding packets from an IPv6 host that are not in response to anything from the internal host – these will be discarded.) So in the case of a restricted NAT Teredo has some additional work to do first – namely create a mapping in the NAT table for the Teredo relay’s IPv4 address.

First the Teredo relay checks if the IPv4 address and port of the Teredo client – which it extracts from the Teredo IPv6 address of the client – are known to it. If they are known (as in the relay and client have communicated recently) it means a mapping exists and the relay can send the packet to the client.

If a mapping does not exist the Teredo relay takes the help of the Teredo server to punch a hole in the NAT. This is akin to the Romeo & Juliet example I described yesterday. The relay needs to contact the client, but the client’s NAT box will discard any packets that are not in response to something the client has sent out, so the relay needs a third-party server “friend” to help out. Here’s what happens:

  1. The Teredo relay queues the incoming packet.
  2. From the Teredo IPv6 address of the client it extracts the IPv4 address of the Teredo server.
  3. The relay then creates a “bubble packet” with the source address as the relay’s IPv6 address, destination as the client’s Teredo IPv6 address, and sends it to the Teredo server’s IPv4 address. A bubble packet is essentially an empty IPv6 packet. Since it is sent to the IPv4 address of the Teredo server, it will be encapsulated in an IPv4 packet.
  4. The Teredo server extracts the IPv6 bubble packet. From the bubble packet’s destination IPv6 address the Teredo server notes that it itself is the Teredo server for that client. This tells the server that its help is required for hole punching. It notes the IPv4 address of the relay from the source address of the packet (this is used for the Origin Indication in the next step).
  5. The Teredo server extracts the NAT IPv4 address and port from the host portion of the client’s Teredo address. It puts the bubble packet within a UDP message and sends it over IPv4 to the IPv4 address and port of the NAT box. The NAT box forwards this packet to the internal host (the Teredo client) as a mapping already exists for the Teredo server IPv4 address.
    • The UDP packet has to have the Teredo server’s IPv4 address and port as the source address and port – it has to, else the packet won’t pass through the NAT. But the client also needs to know the IPv4 address of the Teredo relay, so the Teredo server sets an “Origin Indication” within this UDP packet that specifies the IPv4 address of the Teredo relay.
  6. The Teredo client receives the bubble packet in UDP. From the “Origin Indication” it knows the IPv4 address of the Teredo relay, and from the bubble packet it knows the IPv6 address of the Teredo relay. Since this packet came via its Teredo server – indicating that the client has to do its part of the hole punching – the client now creates a new IPv6 bubble packet, setting its Teredo IPv6 address as the source and the Teredo relay’s IPv6 address as the destination, puts this within a UDP message, sets the IPv4 address of the Teredo relay as the destination, and sends it out.
  7. The packet passes through NAT and reaches the Teredo relay. Since this is a response to the bubble it previously sent, the Teredo relay knows the mapping is ready. Now the Teredo relay sends the previously queued incoming packet to the Teredo client and it gets through …!
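
The steps above can be re-enacted with toy classes. Everything here – class names, the dict "message format", the Origin Indication field – is invented for illustration; the real messages are defined in the Teredo RFC (RFC 4380):

```python
class TeredoClient:
    """Toy client: learns the relay's address from the server-forwarded
    bubble and answers with its own bubble (step 6)."""
    def receive_via_server(self, bubble):
        relay_ip = bubble["origin_indication"]      # who to punch the hole for
        return {"to": relay_ip, "type": "bubble"}   # client's bubble to the relay

class TeredoServer:
    """Toy server: forwards the relay's bubble to the client, adding an
    Origin Indication with the relay's IPv4 address (steps 4-5)."""
    def forward_bubble(self, bubble, client):
        bubble["origin_indication"] = bubble["from"]
        return client.receive_via_server(bubble)

class TeredoRelay:
    """Toy relay: queues the packet, triggers hole punching, then delivers
    (steps 1-3 and 7)."""
    def __init__(self, ip):
        self.ip, self.queue, self.hole_open = ip, [], False
    def deliver(self, ipv6_packet, server, client):
        if not self.hole_open:
            self.queue.append(ipv6_packet)                  # step 1: queue it
            bubble = {"from": self.ip, "type": "bubble"}    # step 3
            reply = server.forward_bubble(bubble, client)   # steps 4-6
            if reply["to"] == self.ip:                      # step 7: the client's
                self.hole_open = True                       # bubble got through
        if not self.hole_open:
            return []   # hole punching failed; packets stay queued
        delivered, self.queue = self.queue, []
        return delivered

relay = TeredoRelay("198.51.100.9")
print(relay.deliver("ipv6-packet-from-some-host", TeredoServer(), TeredoClient()))
# ['ipv6-packet-from-some-host']
```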

Phew! Now we know why Teredo is a tunnel of last resort. There’s so much behind-the-scenes stuff that has to happen to keep it working. And that’s not to mention additional stuff like regular bubble packets from the Teredo client to the server to keep the NAT mapping alive, checks to ensure there’s no spoofing, and more. Added to that, for security reasons an update to the Teredo RFC (the Teredo RFC is RFC 4380; the update is RFC 5991) specifies that Teredo should always assume it’s behind a Restricted NAT, and so the above steps must always be performed – even for clients behind Cone NATs.

Back to NATs – if the NAT were Symmetric, Teredo does not work at all unless you make some changes on the NAT to exempt the Teredo clients. (Teredo in Windows Vista and above can work between Teredo clients if only one Teredo client is behind a Symmetric NAT and the other is behind a Cone/ Restricted NAT).

Identifying the type of NAT

Here’s how a Teredo client identifies the type of NAT it is behind.

Two things are in play here:

  1. The Teredo client IPv6 address has a flag that specifies whether it is behind a cone NAT or not. This flag is in the host bits of the address – remember I had previously mentioned the host bits have some flags, random bits, and the NAT IPv4 address and port number? One of these flags specifies whether the client is behind a cone NAT or not.
  2. The Teredo server has two public IPv4 addresses. The Teredo RFC does not expect these to be consecutive IPv4 addresses, but Windows Teredo clients expect them to be.

When the Teredo client contacts the Teredo server initially for an IPv6 address, it sends a Router Solicitation message as all IPv6 clients do (the difference being this is a unicast message, sent within a UDP message over IPv4 to a specific Teredo server address). The Router Solicitation message requires a link-local address – all Router Solicitation messages do – so the Teredo client generates one, following the same format as a regular Teredo IPv6 address. The network prefix of this link-local address is set to fe80::/64, with the host bits having flags and random bits as usual, but with the embedded IPv4 address and port being the private IPv4 address and port of the internal host rather than the public ones of the NAT box (because the Teredo client doesn’t know what those are).

This link-local address sets the cone flag to 1 – meaning the client believes it is behind a cone NAT.

When the Teredo server receives this it sends a Router Advertisement message as usual (as a UDP message within IPv4, unlike usual). The server does a trick here though. Instead of responding from the public IPv4 address on which it received the UDP message from the client, it responds from its second public IPv4 address. If this reply from a different IPv4 address gets through the NAT, then the client knows it is indeed behind a cone NAT. But if no replies come through (after an appropriate time-out (default: 4s) and number of retries (3 times)), the client realizes it may not be behind a cone NAT, and so it sends a new Router Solicitation message to the Teredo server – only this time it sets the cone flag to 0 (i.e. not behind a cone NAT).

Again the Teredo server receives the message and sends a Router Advertisement message, but now, since the cone flag is 0, it sends the reply from the same IPv4 address it received the message on. This will get through, confirming to the client that it is behind a non-cone NAT. (Note: if this reply too does not get through, after an appropriate time-out and number of retries the client concludes that UDP messages are blocked or not getting through the NAT/ firewall, and so it sets the Teredo interface as disconnected/ offline. Teredo cannot be used in this situation.)

Next the client needs to know if it is behind a symmetric NAT. It now contacts the Teredo server on the second IPv4 address with a Router Solicitation message, setting the cone flag to 0 so the server uses the same IPv4 address when replying. When it gets a reply it compares the NAT port in this reply with the NAT port in the previous reply. If the ports are the same the client determines it is behind a restricted NAT; if the ports are different the client determines it is behind a symmetric NAT and that Teredo might not work.
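
The qualification logic of the last few paragraphs boils down to something like this. A hypothetical helper for illustration, not the actual Windows state machine:

```python
def classify_nat(reply_from_second_ip, port_in_first_reply, port_in_second_reply):
    """reply_from_second_ip: did the Router Advertisement sent from the
    server's *other* IPv4 address make it back through the NAT?"""
    if reply_from_second_ip:
        return "cone"
    if port_in_first_reply == port_in_second_reply:
        return "restricted"   # same external port reused for both destinations
    return "symmetric"        # new mapping per destination; Teredo won't work

print(classify_nat(True, 40000, 40000))    # cone
print(classify_nat(False, 40000, 40000))   # restricted
print(classify_nat(False, 40000, 40001))   # symmetric
```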

(Note: I oversimplified a bit above to keep things easy. When the Teredo server sends a Router Advertisement, it includes the network prefix only. The host bits are set by the Teredo client once it identifies the type of NAT it is behind. The host bits require knowledge of the NAT IPv4 address and port, but how does the Teredo client know these? It knows because the Router Advertisements from the Teredo server contain an “Origin Indication” field specifying the public IPv4 address and port. This is how the client gets the port number used for both Router Advertisements and determines whether it is behind a symmetric NAT. Once that determination is done, the client has all the info required to self-assign a Teredo IPv6 address.)
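
Python’s standard ipaddress module can pull the embedded fields back out of a Teredo address, which makes the structure discussed above easy to see. The address used here is the example from RFC 4380 (note the RFC stores the mapped port and client address obfuscated, i.e. XORed with all-ones bits):

```python
import ipaddress

addr = ipaddress.IPv6Address("2001:0:4136:e378:8000:63bf:3fff:fdd2")
server, client = addr.teredo  # (Teredo server IPv4, client's public IPv4)
print(server)                 # 65.54.227.120
print(client)                 # 192.0.2.45

# The flags and the obfuscated mapped port sit in bits 64-95 of the address:
host_bits = int(addr) & (2**64 - 1)
flags = host_bits >> 48                        # 0x8000 -> cone flag set
port = ((host_bits >> 32) & 0xFFFF) ^ 0xFFFF   # de-obfuscate the mapped port
print(hex(flags), port)                        # 0x8000 40000
```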

Notes on Teredo (part 1)

Previously I had talked about ISATAP. Today I want to blog about Teredo.

Teredo is another IPv6 transition mechanism. It is meant to be used as a transition strategy of last resort – i.e. only if other mechanisms such as ISATAP and 6to4 (which I haven’t blogged about yet) fail. This is because Teredo needs support from other servers on the Internet to do its work; and also because Teredo doesn’t encapsulate IPv6 packets within IPv4 directly, rather it puts them within UDP packets that are then carried by IPv4 packets. The latter means there’s extra overhead for using Teredo but it has the advantage that Teredo can work through NAT (with the exception of one type of NAT called Symmetric NAT) and so is more likely to work than ISATAP or 6to4.

Unlike ISATAP Teredo is meant for use over the Internet. And unlike 6to4 Teredo does not require a public IPv4 address. Teredo can work over the Internet from hosts with a private IPv4 address behind a NAT.

Before I go into the details of Teredo here’s a quick overview:

  1. If you have a host with a private IPv4 address, you need some way of assigning it a global IPv6 address. But how do you do that? 6to4 takes the approach of creating an IPv6 address from the IPv4 address, and that works because it requires public IPv4 addresses – which are unique in the first place, resulting in a unique IPv6 address. Teredo doesn’t have that luxury, so it needs an IPv6 address generated through some other means.
  2. Here’s what Teredo does. It asks a server on the Internet (called a Teredo Server) for an IPv6 address. The server assigns it an IPv6 address whose network prefix has the first 32 bits as 2001:0000 and the next 32 bits as the IPv4 address of the Teredo Server in hexadecimal. Thus all Teredo clients connecting to that server have the same network prefix.
    1. Say the Teredo Server IPv4 address is 17.23.52.1. You can use the in-built Windows calculator (in Programmer mode) to convert decimal to hex. 17 = 0x11, 23 = 0x17, 52 = 0x34, 1 = 0x01. So 17.23.52.1 in hex would be 1117:3401, resulting in a Teredo network prefix of 2001:0:1117:3401::/64.
  3. The Teredo server also sets the host portion of the IPv6 address. This consists of some flags and random bits followed by the UDP port the client’s request came from along with the public IPv4 address of the NAT box the client is behind. Thus the host portion is also unique – the uniqueness being provided by the random bits as well as the UDP port of the client request, with some level of uniqueness also being provided by the public IPv4 address of the NAT box (though this is not unique among all clients within that same NAT).
  4. This way an IPv4 only host behind a NAT can get for itself a global unicast IPv6 address. The next question is how will it send and receive packets to the IPv6 Internet?
  5. For this Teredo clients need a Teredo Relay (this is usually a separate server, but one could have the Teredo server doubling as a Teredo relay too).
  6. A Teredo relay is a server set up by an ISP or organization that is happy to act as a “relay” between Teredo clients and IPv6 hosts. The relay advertises to the IPv6 Internet that it can route to the Teredo network prefix 2001::/32 (note that it advertises the entire Teredo prefix, not just a specific network like 2001:0:1117:3401::/64).
  7. So Teredo clients send IPv6 packets encapsulated in IPv4 to the IPv4 address of the Teredo relay. The relay passes them on to the IPv6 Internet as pure IPv6 packets, with the source address set to the global unicast Teredo address of the client.
  8. The relay also receives packets addressed to the Teredo prefix 2001::/32 from the IPv6 portion of the Internet and passes them on to the IPv4 clients. It knows which client to pass each packet on to because the host portion of the Teredo client address contains the public IPv4 address of the NAT box and the UDP port which will be forwarded to the private IPv4 address of the client. So all the relay needs to do is send IPv4 packets (containing IPv6 packets) to this public IPv4 address & UDP port.
    • It’s worth emphasizing here that the Teredo relay does not have an IPv6 routing table entry for each Teredo client. Rather, the packet is sent via IPv4. That’s why a relay is able to advertise the entire 2001::/32 and get packets for any Teredo client.
  9. If an IPv6 host on the Internet has both IPv6 and IPv4 addresses it can skip the Teredo relay altogether when sending packets to the Teredo client. How? When this host receives a packet with the source address set to that of a Teredo client, it knows from the network prefix that this is a Teredo address, and it knows from the host bits how to contact the client via IPv4. So why go a roundabout way through Teredo relays and such, when it can directly send IPv4 packets (containing IPv6 packets) to the Teredo client? That’s precisely what the host does. This functionality is called a Teredo Host-Specific Relay – the host acts as its own relay. It does not matter whether the IPv4 address of such a host is public or private. The IPv6 address, however, must be a native global unicast address or a 6to4 address (obviously, because that’s how the host is reachable on the IPv6 network in the first place).

This, in a nutshell, is how Teredo works. I think it’s a very cool piece of technology! It’s cool not only in terms of how it allows clients behind NATs to have a global unicast IPv6 address and access the IPv6 Internet, but also in terms of some details I skipped over above, like how clients find relays, and how relays punch holes in the firewall/ NAT behind which clients sit so the IPv6 Internet can reach them. Fascinating!
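
The prefix arithmetic from step 2 can be checked in a few lines of Python, using the example server address from above:

```python
import ipaddress

server_ip = ipaddress.IPv4Address("17.23.52.1")
# First 32 bits are 2001:0000; the next 32 bits are the server's IPv4 in hex:
prefix_int = (0x20010000 << 96) | (int(server_ip) << 64)
prefix = ipaddress.IPv6Network((prefix_int, 64))
print(prefix)   # 2001:0:1117:3401::/64
```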

Romeo & Juliet go hole punching!

Imagine Romeo and Juliet are under house arrest by their families. The families don’t allow any phone calls through either, unless it’s from a number Romeo or Juliet had just called out to. So while Romeo or Juliet can call anyone, a call from Romeo to Juliet will be blocked by her family as it’s not in response to a call she had just made. Similarly a call from Juliet to Romeo will be blocked by his family as it’s not in response to a call he made. The two lovers are thus unable to talk to each other, what are they to do?

Here’s one thing they could do. If Juliet were to call Romeo somehow and he were to call her back, then the call would get through. The problem is when Juliet calls Romeo her call is discarded by his family and so there’s no way for Romeo to know she has called him and if he were to call back now it will be let through. But suppose they had a common friend willing to help them, could that make a difference?

What Romeo and Juliet could do is call this mutual friend and not hang up. When Juliet wants to call Romeo she’ll inform the mutual friend, who will pass this news on to Romeo. (A nice twist here is that Romeo and Juliet could use burner phones. When they call the mutual friend they’ll pass on their current phone numbers, which the friend passes on to the other. This way the families have no idea what Romeo or Juliet’s current phone numbers are and they can’t block calls to these numbers!) Juliet will then call Romeo’s number. The call will be discarded by Romeo’s family as it is not in response to a call he made, but Juliet’s family does not know that – as far as they are concerned Juliet has just called Romeo and no one answered. Since Romeo knows Juliet will be calling him and getting blocked, he can now call Juliet, and his call will be let through as her family thinks it’s in response to her previous call. The two lovers can thus talk to each other! Happy days!

What I have described here is what’s known as hole punching in Computer networks. The difference being instead of Romeo, Juliet, and their families, you have computers behind firewalls or NAT (Network Address Translation) routers. The computers behind the firewall or NAT are unable to accept incoming packets from the outside world. They can only accept replies to packets they have sent out. To get two such computers talking to each other we need to “punch a hole” in the firewalls. That is, have a third computer somewhere on the Internet that will co-ordinate between them and help with the initial discovery and connection (if any of the computers are behind NATs it is akin to Romeo or Juliet using a burner phone). This third computer doesn’t have to do much except help with the initial phase. And worst case, if the hole punching does not work for any reason, the third computer can act as a “relay” passing packets from one computer to the other.

What is RPC?

In computer programming a “procedure” is a set of instructions packaged into one unit. For instance: say you regularly need to open a text file, scan it for certain text, and output a yes or no depending on whether there’s a match. One way is to keep writing these instructions at each place in your program where they’re needed. Another way would be to put these instructions at one place in your program, give that logical grouping a name, and then every time you need to run these instructions simply invoke that grouping. The latter is an easier approach, not only because it saves you typing (or copy-pasting) the same code all over the place, but also because if you ever need to improve the instructions you just have to do it at one place. This logical grouping is what’s known as a procedure. Other names for a procedure are function, subroutine, and method.
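
The file-scanning example above, written as a (Python) procedure:

```python
def file_contains(path, needle):
    """Open a text file, scan it for certain text, and report yes or no."""
    with open(path) as f:
        return needle in f.read()

# Every place that needs this check now just invokes the procedure:
# if file_contains("app.log", "ERROR"): ...
```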

When a procedure is invoked at a point in the code – usually with any inputs that are to be passed on to it – the inputs are saved someplace, the main program’s current state is stored at a location, and execution moves to the procedure while the main program is on hold. The procedure reads the inputs from the saved place, does its deeds, stores any outputs in a designated place, and passes execution back to the main program (whose location was previously stored). The main program then reads the output and continues.

To give an analogy: think of it as though you are reading a book and you find a word whose meaning you do not know, so you put down the book and open a dictionary to find the meaning. “Reading a book” is the main program; “looking up the meaning” is a procedure that takes as input a word whose meaning you want, scans a dictionary, and gives as output an understanding of the word. The word itself is stored in your memory, and so is the output (the meaning). That’s pretty much how procedures work.

Procedures can be part of the code of the main program or separate from it. They can even be from a 3rd party, packaged into files (such as DLLs) containing multiple procedures. When writing code the author links to these files and invokes procedures from them. When compiling the code to turn it into machine instructions the compiler puts in the required details on how to find the procedures (their addresses and such). That’s how the processor, which actually executes the program, knows where to find a procedure and what it expects.

When executing a program the processor only has access to its memory space. And since a procedure is linked to the program and has to be executed by the processor, it too must be in the same memory space as the main program. So procedures have to be local to the main program, i.e. on the same computer as the program. They can’t be on a remote computer. If a procedure is on a different machine (a different memory space) the processor has no way of connecting to it.

As computers started to be connected over networks though, it became useful to have procedures that were remote to the main program. This way you could distribute the load across multiple computers, and maybe use powerful computers for some tasks and regular computers for others. Thus came about the concept of Remote Procedure Calls (RPC). RPC is a way of running procedures on remote computers by tricking the local processor into thinking the procedure is local to it.

Here’s how RPC works:

  1. Instead of a local procedure as usual, you have a stub procedure. This is a local procedure that takes input parameters meant for the remote procedure as though it were the remote procedure. As far as the main program is concerned the procedure is local.
    • To avoid confusion later on, we will refer to this stub as a client stub henceforth.
  2. The client stub procedure converts the input parameters into a form that can be understood by the remote computer. This is necessary as the parameters are in a location (memory space) that cannot be accessed by the remote computer, and also because the local and remote computers may use different ways of representing data (big-endian vs little-endian, for instance) and so the data may need conversion to a standard format. This process is called marshaling.
    • On Windows the client stub takes the input parameters (as in copies the values of the input parameters) and passes on to a Client Runtime Library (called rpcrt4.dll). This library does the marshaling.
    • This standard format is called a Network Data Representation (NDR) format. On Windows there are two marshaling “engines” that do this conversion. NDR20 is for 32-bit programs. NDR64 is for 64-bit programs (though 64-bit programs can also use NDR20). Both marshaling engines are part of the Client Runtime Library.
  3. The client stub procedure uses the OS to contact the remote end and send it messages. The OS in turn uses transport layer network protocols.
    • The client stub knows where the remote end is because it is either configured with the information or it looks up a central location to find the remote end – usually the network address and port number(s) of the remote end.
      • For instance Windows domain joined computers can use Active Directory. In Windows 2000 domains, RPC services register under System\RPCServices.
      • This remote network address and port(s) is called an Endpoint.
    • The remote end usually has an RPC service listening on a designated port(s) and network address. This service allocates dynamic Endpoints for client stubs to connect to.
      • This service is called an Endpoint Mapper (EPM). In Windows you have a service called the RPC Endpoint Mapper. It listens on port 135.
    • During this negotiation phase the EPM protocol messages contain items known as towers and floors. I am not too clear on what they are except that they are ways of representing RPC data and also for identifying the host address and port. When doing a network capture of RPC traffic these can be seen (the last link in the list below has examples of capturing RPC traffic).
    • On Windows the Client Runtime Library has three protocol engines – 1) a Connection RPC protocol engine (used when a connection-oriented protocol is required), 2) a Datagram RPC protocol engine (used when a connection-less protocol is required), and 3) a Local RPC protocol engine (used when the remote end is on the same host as the client). These protocol engines are what do the actual message passing.
    • Connection-oriented protocols used are TCP (dynamically assigned ports), SPX (Sequenced Packet Exchange), named pipes (port 445), or HTTP (ports 80, 443, 593). Connection-less protocols used are UDP (dynamically assigned ports) or CDP (Cluster Datagram Protocol).
  4. The remote end has a server stub that receives data from the client stub and unmarshals it. If required, the server stub converts the data to the remote machine’s specific format.
    • Server stubs register themselves with the Endpoint Mapper. Server stubs have a UUID (a well known GUID that identifies the application).
    • Once client stubs get an Endpoint from the Endpoint Mapper the client and server stubs talk to each other directly.
  5. The server stub then invokes the remote procedure with the data it has. As far as the remote procedure is concerned the main program is local to it. It is not aware that it is remote to the main program.
  6. Output from the remote procedure is passed on to the server stub which sends it to the client stub following a similar process as above (marshaling, unmarshaling).

The neat thing about RPC is that programmers can write distributed applications and don’t have to worry about the network details. In terms of the Internet Protocol suite, RPC sits in the Application layer.
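
Python’s standard library ships a small RPC implementation (XML-RPC) that shows all the moving parts: the proxy object plays the client stub, the library marshals the arguments (into XML here, rather than NDR), and the server-side dispatcher plays the server stub. A minimal sketch:

```python
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy
import threading

# Server side: register a procedure with the dispatcher (the server stub).
server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
server.register_function(lambda a, b: a + b, "add")
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client side: the proxy object plays the client stub. The call's arguments
# are marshaled (into XML), carried over HTTP, and the result unmarshaled
# back into a Python value -- the caller never sees the network details.
port = server.server_address[1]
proxy = ServerProxy(f"http://127.0.0.1:{port}/")
print(proxy.add(2, 3))   # 5 -- computed by the remote procedure
```

As far as the calling code is concerned, `proxy.add(2, 3)` looks like an ordinary local call – which is exactly the trick RPC plays.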

Here are some good links on RPC: