
Creative Commons Attribution 4.0 International License
© Rakhesh Sasidharan


Migrating VMkernel port from Standard to Distributed Switch fails

I am putting a link to the official VMware documentation on this as I Googled it just to confirm to myself that I am not doing anything wrong! What I need to do is migrate the physical NICs and the Management/ VM Network VMkernel NIC from a standard switch to a distributed switch. The process is simple and straightforward, and one that I have done numerous times; yet it fails for me now!

Here’s a copy paste from the documentation:

  1. Navigate to Home > Inventory > Networking.
  2. Right-click the dVswitch.
  3. If the host is already added to the dVswitch, click Manage Hosts; else click Add Host.
  4. Select the host(s), click Next.
  5. Select the physical adapters (vmnic) to use for the vmkernel, click Next.
  6. Select the virtual adapter (vmk) to migrate and click the Destination port group field. For each adapter, select the correct port group from the dropdown, click Next.
  7. Click Next to omit virtual machine networking migration.
  8. Click Finish after reviewing the new vmkernel and uplink assignment.
  9. The wizard and the job complete, moving both the vmk interface and the vmnic to the dVswitch.

Basically add physical NICs to the distributed switch & migrate vmk NICs as part of the process. For good measure I usually migrate only one physical NIC from the standard switch to the distributed switch, and then separately migrate the vmk NICs. 

Here’s what happens when I am doing the above now. (Note the “now” – I never had an issue with this earlier. Am guessing it must be some bug in a newer 5.5 update, or something’s wrong in the underlying network at my firm. I don’t think it’s the networking coz I got my network admins to take a look, and I tested that all NICs on the host have connectivity to the outside world (did this by making each NIC the active one and disabling the others).)

First it’s stuck in progress:

And then vCenter cannot see the host any more:

Oddly I can still ping the host on the vmk NIC IP address. However I can’t SSH into it, so the Management bits are what seem to be down. The host has connectivity to the outside world because it passes the Management network tests from DCUI (which I can connect to via iLO). I restarted the Management agents too, but nope – cannot SSH or get vCenter to see the host. Something in the migration step breaks things. Only solution is to reboot and then vCenter can see the host.

Here’s what I did to work around it anyway.

First I moved one physical NIC to the distributed switch.

Then I created a new management portgroup and VMkernel NIC on that for management traffic. Assigned it a temporary IP.

Next I opened a console to the host. Here’s the current config on the host:
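(For reference, the config here is what the usual esxcli network commands show – these are the ones I poke around with on the console:)

```shell
esxcli network ip interface list        # each vmk NIC and the switch/portgroup it is on
esxcli network ip interface ipv4 get    # the IPv4 address assigned to each vmk NIC
```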

The interface vmk0 (or its IPv4 address rather) is what I wanted to migrate. The interface vmk4 is what I created temporarily. 

I now removed the IPv4 address of the existing vmk NIC and assigned that to the new one. Also confirmed the changes, just to be sure. As soon as I did so vCenter picked up the changes. I then tried to move the remaining physical NIC over to the distributed switch, but that failed – it gave an error that the existing connection was forcibly closed by the host. So I rebooted the host. Post-reboot I found that the host now thought it had no IP, even though it was responding to the old IP via the new vmk. So this approach was a no-go (but I am still leaving it here as a reminder to myself that this does not work).

I now migrated vmk0 from the standard switch to the distributed switch. As before, this will fail – vCenter will lose connectivity to the ESX host. But that’s why I have a console open. As expected the output of esxcli network ip interface list shows me that vmk0 hasn’t moved to the distributed switch:

So now I go ahead and remove the IPv4 address of vmk0 and assign that to vmk4 (the new one). Also confirmed the changes. 
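For the record, the esxcli commands for this look something like the below (the IP address and netmask are placeholders, not my actual ones):

```shell
# Remove the IPv4 address from vmk0 ...
esxcli network ip interface ipv4 set -i vmk0 -t none
# ... and assign it to vmk4 instead
esxcli network ip interface ipv4 set -i vmk4 -t static -I -N
```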

Next I rebooted the host, and via the CLI I removed vmk0 (for some reason the GUI showed both vmk0 and vmk4 with the same IP I assigned above). 
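Removing the vmk NIC via the CLI is a one-liner:

```shell
esxcli network ip interface remove -i vmk0
```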

Reboot again!

Post-reboot I can go back to the GUI and move the remaining physical NIC over to the distributed switch. :) Yay!

[Aside] How to quickly get ESXi logs from a web browser (without SSH, vSphere client, etc)

This post made my work easy yesterday –

tl;dr version:  go to https://IP_of_Your_ESXi/host

[Aside] Memory Resource Management in ESXi

Came across this PDF from VMware while reading on memory management. It’s dated, but a good read. Below are some notes I took while reading it. Wanted to link to the PDF and also put these somewhere; hence this post.

Some terminology:

  • Host physical memory <–[mapped to]– Guest physical memory (contiguous address space presented by the Hypervisor to the Guest OS) <–[mapped to]– Guest virtual memory (contiguous address space presented by the Guest OS to its applications).
    • Guest virtual -> Guest physical mapping is in Guest OS page tables
    • Guest physical -> Host physical mapping is in pmap data structure
      • There’s also a shadow page table that the Hypervisor maintains, mapping Guest virtual -> Host physical directly.
      • A VM does Guest virtual -> Guest physical mapping via the hardware Translation Lookaside Buffer (TLB). The hypervisor intercepts updates to the Guest OS page tables and uses them to keep its shadow page tables up to date.
  • Guest physical memory -> Guest swap device (disk) == Guest level paging.
  • Guest physical memory -> Host swap device (disk) == Hypervisor swapping.

Some interesting bits on the process:

  • Applications use OS provided interfaces to allocate & de-allocate memory.
  • OSes have different implementations of how memory is classified as free or allocated – for example, by maintaining two lists (free and allocated).
  • A VM has no pre-allocated physical memory.
  • Hypervisor maintains its own data structures for free and allocated memory for a VM.
  • Allocating memory for a VM is easy. When the VM Guest OS accesses a location that has no host physical memory behind it yet, that generates a page fault. The hypervisor captures this and allocates memory.
  • De-allocation is tricky because there’s no way for the hypervisor to know when memory is no longer in use – those free/allocated lists are internal to the Guest OS. So there’s no straightforward way to take back memory from a VM.
  • The host physical memory assigned to a VM doesn’t keep growing indefinitely though: the guest OS frees and allocates within the range already assigned to it, so it tends to stay within what it has. And side by side the hypervisor tries to take back memory anyway.
    • Only when the VM tries to access memory that is not actually mapped to host physical memory does a page fault happen. The hypervisor will intercept that and allocate memory.
  • For de-allocation, the hypervisor adds the VM assigned memory to a free list. Actual data in the physical memory may not be modified. Only when that physical memory is subsequently allocated to some other VM does it get zeroed out.
  • Ballooning is one way of reclaiming memory from the VM. This is a driver loaded in the Guest OS.
    • Hypervisor tells ballooning driver how much memory it needs back.
    • Driver will pin those memory pages using Guest OS APIs (so the Guest OS thinks those pages are in use and should not assign to anyone else).
    • Driver will inform Hypervisor it has done this. And Hypervisor will remove the physical backing of those pages from physical memory and assign it to other VMs.
    • Basically the balloon driver inflates the VM’s memory usage, giving it the impression a lot of memory is in use. Hence the term “balloon”.
  • Another way is Hypervisor swapping. In this the Hypervisor swaps to physical disk some of the physical memory it has assigned to the VM. So what the VM thinks is physical memory is actually on disk. This is basically swapping – just that it’s done by Hypervisor, instead of Guest OS.
    • This is not at all preferred coz it’s obviously going to affect VM performance.
    • Moreover, the Guest OS too could swap the same memory pages to its disk if it is under memory pressure. Hence double paging.
  • Ballooning is slow. Hypervisor swapping is fast. Ballooning is preferred though; Hypervisor swapping is only used when under lots of pressure.
  • Host (Hypervisor) has 4 memory states (view this via esxtop, press m).
    • High == All Good
    • Soft == Start ballooning. (Starts before the soft state is actually reached).
    • Hard == Hypervisor swapping too.
    • Low == Hypervisor swapping + block VMs that use more memory than their target allocations.


Yay! (VXLAN) contd. + Notes to self while installing NSX 6.3 (part 3)

Finally continuing with my NSX adventures … some two weeks have passed since my last post. During this time I moved everything from VMware Workstation to ESXi.

Initially I tried doing a lift and shift from Workstation to ESXi. Actually, initially I went with ESXi 6.5, and that kept crashing. Then I learnt it’s because I was using the HPE customized version of ESXi 6.5, and since the server model I was using isn’t supported by ESXi 6.5 it has a tendency to PSOD. Strangely, the non-HPE customized version has no such issues. But after trying the HPE version and failing a couple of times, I gave up and went to ESXi 5.5. Set it up, tried exporting from VMware Workstation to ESXi 5.5, and that failed as the VM hardware level on Workstation was newer than ESXi.

Not an issue – I fired up VMware Converter and converted each VM from Workstation to ESXi. 

Then I thought hmm, maybe the MAC addresses will change and that will cause an issue, so I SSH’ed into the ESXi host and manually changed the MAC addresses of all my VMs to whatever they were in Workstation. Also changed the adapters to VMXNet3 wherever it wasn’t. Reloaded the VMs in ESXi, created all the networks (portgroups) etc, hooked up the VMs to these, and fired them up. That failed coz the MAC address ranges were VMware Workstation’s, and ESXi refuses to work with those! *grr* Not a problem – change the config files again to add a parameter asking ESXi to ignore this MAC address problem – and finally it all loaded.
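If I remember correctly, the parameter in question is checkMACAddress – something like the below in each VM’s .vmx file (the MAC here is made up for illustration):

```
ethernet0.addressType = "static"
ethernet0.address = "00:0c:29:aa:bb:cc"
ethernet0.checkMACAddress = "false"
```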

But all my Windows VMs had their adapters reset to a default state. Not sure why – maybe the drivers are different? I don’t know. I had to reconfigure all of them again. Then I turned to OpnSense – that too had reset all its network settings, so I had to configure those too – and finally to nested ESXi hosts. For whatever reason none of them were reachable; and worse, my vCenter VM was just a pain in the a$$. The web client kept throwing some errors and simply refused to open. 

That was the final straw. So in frustration I deleted it all and decided to give up.

But then …

I decided to start afresh. 

Installed ESXi 6.5 (the VMware version, non-HPE) on the host. Created a bunch of nested ESXi VMs in that from scratch. Added a Windows Server 2012R2 as the shared iSCSI storage and router. Created all the switches and port groups etc, hooked them up. Ran into some funny business with the Windows Firewall (I wanted to assign some interfaces as Private, others as Public, and enable the firewall only on the Public ones – but after each reboot Windows kept resetting this). So I added OpnSense into the mix as my DMZ firewall.

So essentially you have my ESXi host -> which hooks into an internal vSwitch portgroup that has the OpnSense VM -> which hooks into another vSwitch portgroup where my Server 2012R2 is connected, and that in turn connects to another vSwitch portgroup (a couple of them actually) where my nested ESXi hosts are connected (I need a couple of portgroups as my ESXi hosts have to be in separate L3 networks so I can actually see a benefit of VXLANs). OpnSense provides NAT and firewalling so none of my VMs are exposed to the outside network, yet they can connect to the outside network if needed. (I really love OpnSense by the way! An amazing product.)

Then I got to the task of setting this all up. Create the clusters, shared storage, DVS networks; install my OpenBSD VMs inside these nested ESXi hosts. Then install NSX Manager, deploy controllers, configure the ESXi hosts for NSX, set up VXLANs, segment IDs, transport zones, and finally create the Logical Switches! :) I was pissed off initially at having to do all this again, but on the whole it was good as I am now comfortable setting these up. Practice makes perfect, and doing this all again was like revision. Ran into problems at each step – small niggles, but it was frustrating. Along the way I found that my (virtual) network still did not seem to support large MTU sizes – but then I realized it’s coz my Server 2012R2 VM (which is the router) wasn’t set up with the large MTU size. Changed that, and that took care of the MTU issue. Now both the Web UI and CLI tests for VXLAN succeed. Finally!

Third time lucky hopefully. Above are my two OpenBSD VMs on the same VXLAN, able to ping each other. They are actually on separate L3 ESXi hosts so without NSX they won’t be able to see each other. 

Not sure why there are duplicate packets being received. 

Next I went ahead and set up a DLR so there’s communication between VXLANs.

Yeah baby! :o)

Finally I spent some time setting up an ESG and got these OpenBSD VMs talking to my external network (and vice versa). 

The two command prompt windows are my Server 2012R2 on the LAN. It is able to ping the OpenBSD VMs and vice versa. This took a bit more time – not on the NSX side – as I forgot to add the routing info on the ESG for my two internal networks ( and as well on the Server 2012R2 ( Once I did that routing worked as above. 

I am aware this is more of a screenshots plus talking post rather than any techie details, but I wanted to post this here as a record for myself. I finally got this working! Yay! Now to read the docs and see what I missed out and what I can customize. Time to break some stuff finally (intentionally). 


Find the profiles in an offline ESXi update zip file

I use esxcli to manually update our ESXi hosts that don’t have access to VUM (e.g. our DMZ hosts). I do so via command-line:
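i.e. something along these lines (the path and profile name are placeholders – they vary with the patch):

```shell
esxcli software profile update -d /vmfs/volumes/datastore1/offline-bundle.zip -p <profile-name>
```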

Usually the VMware page where I download the patch from mentions the profile name, but today I had a patch file and wanted to find the list of profiles it had. 

One way is to open the zip file, then the file in that, and that should contain a list of profiles. Another way is to use esxcli.
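The esxcli way is to point it at the offline bundle (the path here is a placeholder):

```shell
esxcli software sources profile list -d /vmfs/volumes/datastore1/offline-bundle.zip
```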

Screenshot example:

TIL: Transparent Page Sharing (TPS) between VMs is disabled by default

(TIL is short for “Today I Learned” by the way).

I always thought an ESXi host did some page sharing of VM memory between the VMs running on it. The feature is called Transparent Page Sharing (TPS) and it was something I remember from my VMware course and also read in blog posts such as this and this. The idea is that if you have (say) a bunch of Server 2012R2 VMs running on a host, it’s quite likely these VMs have quite a bit of common stuff between them in RAM, so it makes sense for the ESXi host to share that common stuff between the VMs. So even if each VM has (say) 4 GB RAM assigned to it, and there’s about 2GB worth of stuff common between the VMs, the host only needs to use 2GB shared RAM + 2 x 2GB private RAM for a total of 6GB RAM.

Apart from this as the host is under increased memory pressure it resorts to techniques like ballooning and memory swapping to free up some RAM for itself. 

I even made a script today to list out all the VMs in our environment that have greater than 8GB RAM assigned to them and are powered on and to list the amount of shared RAM (just for my own info). 
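The script is essentially a PowerCLI one-liner along these lines (a rough sketch of what I did, not the exact script; the quickStats sharedMemory value is in MB):

```powershell
Get-VM |
    Where-Object { $_.PowerState -eq "PoweredOn" -and $_.MemoryGB -gt 8 } |
    Select-Object Name, MemoryGB,
        @{ N = "SharedMB"; E = { $_.ExtensionData.Summary.QuickStats.SharedMemory } }
```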

Anyhow – around 2015 VMware stopped page sharing of VM memory between VMs. VMware calls this sort of RAM sharing inter-VM TPS. Apparently it is a security risk, and VMware likes to ship their products as secure by default, so via some patches to the 5.x series (and as the default in the 6.x series) it turned off inter-VM TPS and introduced some controls that allow IT Admins to turn it on if they so wish. Intra-VM TPS is still enabled – i.e. the ESXi host will do page sharing within each VM – but it no longer does page sharing between VMs by default.

Using the newly introduced controls, however, it is possible to enable inter-VM TPS for all VMs, or selectively between some VMs. Quoting from this blog post

You can set a virtual machine’s advanced parameter sched.mem.pshare.salt to control its ability to participate in transparent page sharing.  

TPS is only allowed within a virtual machine (intra-VM TPS) by default, because the ESXi host configuration option Mem.ShareForceSalting is set to 2, the sched.mem.pshare.salt is not present in the virtual machine configuration file, and thus the virtual machine salt value is set to unique value. In this case, to allow TPS among a specific set of virtual machines, set the sched.mem.pshare.salt of each virtual machine in the set to an identical value.  

Alternatively, to enable TPS among all virtual machines (inter-VM TPS), you can set Mem.ShareForceSalting to 0, which causes sched.mem.pshare.salt to be ignored and to have no impact.

Or, to enable inter-VM TPS as the default, but yet allow the use of sched.mem.pshare.salt to control the effect of TPS per virtual machine, set the value of Mem.ShareForceSalting to 1. In this case, change the value of sched.mem.pshare.salt per virtual machine to prevent it from sharing with all virtual machines and restrict it to sharing with those that have an identical setting.


I wonder if intra-VM TPS has much memory savings. Looking at the output from my script for our estate I see that many of our server VMs have about half their allocated RAM as shared, so it does make an impact. I guess it will also make a difference when moving to a container architecture wherein a single VM might have many containers. 

I would also like to point to this blog post and another blog post I came across from it, on whether inter-VM TPS even makes much sense in today’s environments and also on the kind of impact it can have during vMotion etc. Good stuff. I am still reading these but wanted to link to them for reference. Mainly – nowadays we have larger page sizes and so the probability of finding an identical page to be shared between two VMs is low; then there is NUMA, which places memory pages closer to the CPU, and TPS could disrupt that; and also, TPS is a process that runs periodically to compare pages, so there is an operational cost as it runs, finds a match, and then does a full compare of the two pages to ensure they are really identical.

Good to know. 

Configure NTP for multiple ESXi hosts

Following on my previous post I wanted to set NTP servers for my ESX servers and also start the service & allow firewall exceptions. Here’s what I did –
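In rough outline it was PowerCLI along these lines (the NTP server name is a placeholder; “ntpd” is the service key and “NTP client” the firewall exception name on ESXi):

```powershell
$ntp = "dc01.mydomain.local"   # placeholder - your time source
Get-VMHost | ForEach-Object {
    Add-VMHostNtpServer -VMHost $_ -NtpServer $ntp
    Get-VMHostFirewallException -VMHost $_ -Name "NTP client" |
        Set-VMHostFirewallException -Enabled:$true
    $svc = Get-VMHostService -VMHost $_ | Where-Object { $_.Key -eq "ntpd" }
    Start-VMHostService -HostService $svc
    Set-VMHostService -HostService $svc -Policy On
}
```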


Exchange DAG fails. Information Store service fails with error 2147221213.

Had an interesting issue at work today. When our Exchange servers (which are in a 2 node DAG) rebooted after patch weekend one of them had trouble starting the Information Store service. The System log had entries such as these (event ID 7024) –

The Microsoft Exchange Information Store service terminated with service-specific error %%-2147221213.

The Application log had entries such as these (event ID 5003) –

Unable to initialize the Information Store service because the clocks on the client and server are skewed. This may be caused by a time change either on the client or on the server, and may require a restart of that computer. Verify that your domain is correctly configured and  is currently online.

So it looked like time synchronization was an issue. Which is odd coz all our servers should be correctly syncing time from the Domain Controllers.

Our Exchange team fixed the issue by forcing a time sync from the DC –
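(The usual w32tm stuff – something along these lines:)

```
w32tm /config /syncfromflags:domhier /update
w32tm /resync /rediscover
```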

I was curious as to why, so I went through the System logs in detail. What I saw was a sequence of entries such as these –

Notice how time jumps ahead from 13:21 when the OS starts to 13:27 suddenly, then jumps back to 13:22 when the Windows Time service starts and begins syncing time from my DC. It looked like this jump of 6 mins was confusing the Exchange services (understandably so). But why was this happening?

I checked the time configuration of the server –
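(i.e. via w32tm:)

```
w32tm /query /configuration
w32tm /query /source
```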

Seems to be normal. It was set to pick time from the site DC via NTP (the first entry under TimeProviders) as well as from the ESXi host the VM is running on (the second entry – VM IC Time Provider). I didn’t think much of the second entry because I know all our VMs have the VMware Tools option to sync time from the host to VM unchecked (and I double checked it anyways).

Only one of the mailbox servers was having this jump though. The other mailbox server had a slight jump but not enough to cause any issues. While the problem server had a jump of 6 mins, the ok server had a jump of a few seconds.

I thought to check the ESXi hosts of both VMs anyways. Yes, they are not set to sync time from the host, but let’s double check the host times anyways. And bingo! Turns out the ESXi hosts have NTP turned off and hence varying times. The host with the problem server was about 6 mins ahead of the DC’s time, while the host with the ok server was about a minute or less ahead – matching the time jumps of the VMs too closely to be a coincidence!

So it looked like the Exchange servers were syncing time from the ESXi hosts even though I thought they were not supposed to. I read a bit more about this and realized my understanding of host-VM time sync was wrong (at least with VMware). When you tick/ untick the option to synchronize VM time with the ESX host, all you are controlling is the periodic synchronization from host to VM. This does not control other scenarios where a VM could synchronize time with the host – such as when it moves to a different host via vMotion, has a snapshot taken, is restored from a snapshot, has its disk shrunk, or (tada!) when the VMware Tools service is restarted (like when the VM is rebooted, as was the case here). Interesting.

So that explains what was happening here. When the problem server was rebooted it synced time with the ESXi host, which was 6 mins ahead of the domain time. This was before the Windows Time service kicked in. Once the Windows Time service started, it noticed the incorrect time and set it right. This time jump confused Exchange – am thinking it didn’t confuse Exchange directly, rather one of the AD services running on the server most likely, and due to this the Information Store was unable to start.

The fix for this is to either stop VMs from synchronizing time with the ESXi host, or to set up NTP on all the ESXi hosts so they have the correct time going forward. I decided to go with the latter.

Update: Found this and this blog post. They have more screenshots and a better explanation, so worth checking out. :)

Troubleshooting ESXi host reboots

Had to troubleshoot an ESXi host reboot today. Came across this link – good one.

Here’s what I did though after the host reboot.

Once the host was online I connected to it via the vSphere client. I didn’t connect to the host directly (though you can do that too). I connected to the vCenter, then navigated to that host, went to the File menu and exported the system logs.


This creates a zip file containing another archive. I extracted the contents of this into a folder. The root of that folder has the usual Linux filesystem structure.


I went into the var folder here. (The log subfolder has many logs but most of these might be from after the reboot. If that’s the case, check the run/log subfolder).

In my case the /var/log/vmksummary.log file had entries for when the host rebooted. None of the other files mentioned anything.

Then I went to the /var/run/log folder via PowerShell and ran a grep for the word reboot –
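(“grep” being Select-String in PowerShell-speak:)

```powershell
Get-ChildItem *.log | Select-String -Pattern "reboot"
```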

Lots of messages indicating that the host was rebooted via the DCUI (lines 2, 4, 5, and 12). Thus I realized someone had manually rebooted the host.

Power cycle/ Reset an HP blade server

Was getting the following error on one of our servers. It’s from ESXi. None of the NICs were working on the server (the NICs themselves seemed fine – it was just that the driver wasn’t loading).


Power cycle required. 

I switched the server off and on but that didn’t help. Turns out that doesn’t actually power cycle the server (because the server still has power – doh!). What you need to do is something called an e-fuse reset. This power cycles the blade. You do this by opening an SSH session to the Onboard Administrator, finding the bay number of the blade you want to power cycle, and typing the command reset server <bay number>

Good to know!

Note: The command does not appear when you type help, but it’s there:

Also, to get a list of your bays and servers use the show server list command. To do the same for interconnects use the show interconnect list command.

Downgrading ESXi Host

Today I upgraded one of our hosts to a newer version than what was supported by our vCenter, so I had to find a way of downgrading it. The host was now at “5.5 Patch 10” (which is after “5.5 Update 3”), while our vCenter version only supported versions prior to “5.5 Update 3”. (See this post for a list of build numbers and versions; see this KB article for why vCenter and the host were now incompatible.)

I found this blog post and KB article that talked about downgrading and upgrading. Based on those two here’s what I did to downgrade my host.

First, some terminology. Read this blog post on what VIBs are. At a very high level a VIB file is like a zip file with some metadata and verification thrown in. They are the software packages for ESX (think of it like a .deb or .rpm file). The VIB file contains the actual files on the host that will be replaced. The metadata tells you more about the VIB file – its dependencies, requirements, issues, etc. And the verification bit lets the host verify that the VIB hasn’t been tampered with, and also allows you to have various “levels” of VIBs – those certified by VMware, those certified by partners of VMware, etc – such that you as a System Admin can decide what level of VIBs you want installed on your host.

You can install/ remove/ update VIBs via the command esxcli:

Here’s a short list of the VIBs installed on my host:
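(That’s just the following – I am not pasting the full output as the list is long:)

```shell
esxcli software vib list                 # name, version, vendor, and acceptance level of each VIB
esxcli software vib get -n tools-light   # full metadata for a single VIB (name is an example)
```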

Next you have Image Profiles. These are a collection of VIBs. In fact, since any installation of ESXi is a collection of VIBs, an image profile can be thought of as defining an ESXi image. For instance, all the VIBs on my currently installed ESXi server – including 3rd party VIBs – together can be thought of as an image profile. I can then deploy this image profile to other hosts to get the exact configuration on those hosts too.

One thing to keep in mind is that image profiles are not anything tangible. As in they are not files as such, they just define the VIBs that make up the profile.

Lastly you have Software Depots. These are your equivalent of Linux package repositories. They contain VIBs and Image Profiles and are accessible online via HTTP/ HTTPS/ FTP or even offline as a ZIP file (which is a neat thing IMHO). You would point to a software depot – online or offline – and specify an image profile you want, which then pulls in the VIBs you want.

Now back to esxcli. As we saw above this command can be used to list, update, remove etc VIBs. The cool thing though is that it can work with both VIB files and software depots (either online or a ZIP file containing a bunch of VIB files). Here’s the usage for the software vib install command which deals with installing VIBs:

You have two options:

  • The -d switch can be used to specify a software depot (online or offline) along with the -n switch to specify the VIBs to be installed from this depot.
  • Or the -v switch can be used to directly specify VIBs to be installed.
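So, schematically (the paths and VIB name here are placeholders):

```shell
# From a depot: -d names the depot, -n the VIB(s) to pull from it
esxcli software vib install -d /vmfs/volumes/datastore1/offline-bundle.zip -n net-e1000e
# Or point at a VIB file directly
esxcli software vib install -v /vmfs/volumes/datastore1/some-driver.vib
```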

The esxcli command can also work with image profiles.

Here you have just one option (coz like I said you can’t download something called an image profile – you have to necessarily use a software depot). You use the -d switch to specify a depot (online or offline) and the -p switch to specify the image profile you are interested in.

Apart from installing VIBs & image profiles, the esxcli command can also remove and update these. When it comes to image profiles though, the command can also downgrade profiles via an --allow-downgrades switch. So that’s what we use to downgrade ESXi versions. 

First find the ESXi version you want to downgrade to. In my case it was ESXi 5.5 Update 2. Go to My VMware (login with your account) and find the 5.5 Update 2 product. Download the offline bundle – which is a ZIP file (basically an offline software depot). In my case I got a file named “”. Now open this ZIP file and go to the “\profiles” folder in that. This gives you the list of profiles in this depot.


You can also get the names from a link such as this which gives more info on the release and the image profiles in it. (I came across it by Googling for “ESXi 5.5 Update 2 profile name”).

The profiles with an “s” in them only contain security fixes while the ones without an “s” contain both security and bug fixes. In my case the profile I am looking for is “ESXi-5.5.0-20140902001-standard”. I wasn’t sure if I need to go for the “no-tools” version or not, but figured I’ll stick with the “standard”.

Now, copy the ZIP file you downloaded to the host. Either upload it to the host directly, or to some shared storage, etc.

Then run a command similar to this:
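i.e. (the datastore path below is a placeholder – wherever you copied the ZIP to):

```shell
esxcli software profile update -d /vmfs/volumes/datastore1/update-depot.zip -p ESXi-5.5.0-20140902001-standard --allow-downgrades
```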

That’s it! Following a host reboot you are now downgraded. Very straight-forward and easy.

Notes on NLB, VMware, etc

Just some notes to myself so I am clear about it while reading about it. In the context of this VMware KB article – Microsoft NLB not working properly in Unicast mode.

Before I get to the article I better talk about a regular scenario. Say you have a switch and it’s got a couple of devices connected to it. A switch is a layer 2 device – meaning, it has no knowledge of IP addresses and networks etc. All devices connected to a switch are in the same network. The devices on a switch use MAC addresses to communicate with each other. Yes, the devices have IPv4 (or IPv6) addresses but how they communicate to each other is via MAC addresses.

Say Server A (IPv4 address wants to communicate with Server B (IPv4 address Both are connected to the same switch, hence on the same LAN. Communication between them happens in layer 2. Here the machines identify each other via MAC addresses, so first Server A checks whether it knows the MAC address of Server B. If it knows (usually coz Server A has communicated with Server B recently and the MAC address is cached in its ARP table) then there’s nothing to do; but if it does not, then Server A finds the MAC address via something called ARP (Address Resolution Protocol). The way this works is that Server A broadcasts to the whole network that it wants the MAC address of the machine with IPv4 address (the address of Server B). This message goes to the switch, the switch sends it to all the devices connected to it, Server B replies with its MAC address and that is sent to Server A. The two now communicate – I’ll come to that in a moment.

When it’s communication from devices in a different network to Server A or Server B, the idea is similar except that you have a router connected to the switch. The router receives traffic for a device on this network – it knows the IPv4 address – so it finds the MAC address as above and passes the traffic to that device. Simple.

Now, how does the switch know which port a particular device is connected to? Say the switch gets traffic addressed to MAC address 00:eb:24:b2:05:ac – how does the switch know which port that is on? Here’s how that happens –

  • First the switch checks if it already has this information cached. Switches have a table called the CAM (Content Addressable Memory) table which holds this cached info.
  • Assuming the CAM table doesn’t have this info, the switch will send the frame (containing the packets for the destination device) out of all ports. Note, this is not like ARP, where a question is sent asking for the device to respond; instead the frame is simply flooded to all ports – effectively broadcast to the whole network.
  • When a switch receives a frame on a port it notes the source MAC address and port, and that’s how it keeps the CAM table up to date. Thus when Server A sends data to Server B, the MAC address and switch port of Server A are stored in the switch’s CAM table. Each entry is only stored for a brief period.
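The three bullets above amount to a tiny “learning switch” algorithm, which can be sketched like this (illustrative names, deliberately simplified):

```python
# Toy "learning switch": the CAM table maps MAC -> port, learned from the
# source address of every frame it sees.
cam = {}

def forward(src_mac, src_port, dst_mac, all_ports):
    """Learn the source, then forward: one port if known, flood if not."""
    cam[src_mac] = src_port                         # learn/refresh source entry
    if dst_mac in cam:
        return [cam[dst_mac]]                       # known destination
    return [p for p in all_ports if p != src_port]  # unknown: flood everywhere

ports = [1, 2, 3, 4]
# Server A (port 1) sends to Server B, whose MAC the switch hasn't seen yet:
print(forward("00:eb:24:b2:05:ac", 1, "00:eb:24:b2:05:ad", ports))  # [2, 3, 4]
# Server B replies; the switch already learned A's port from the first frame:
print(forward("00:eb:24:b2:05:ad", 2, "00:eb:24:b2:05:ac", ports))  # [1]
```

Note the asymmetry: the first frame floods, but the reply goes straight to port 1 because the switch learned Server A’s port from the first frame’s source address.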

Now let’s talk about NLB (Network Load Balancing).

Consider two machines – one with MAC address 00:eb:24:b2:05:ac and one with MAC address 00:eb:24:b2:05:ad. NLB is a form of load balancing wherein you create a Virtual IP (VIP) such that any traffic to the VIP is sent to either of the two machines. Thus you have the traffic being load balanced between the two; and not only that, if one of the machines goes down nothing is affected, because the other machine can continue handling the traffic.

But now we have a problem. If we want a VIP that can send traffic to either host, how will this work when it comes to MAC addresses? That depends on the type of NLB. There are two sorts – Unicast and Multicast.

In Unicast mode the NIC that is used for clustering on each server has its MAC address changed to a new unicast MAC address that’s the same for all hosts. Thus, for example, the NLB NICs in the scenario above will have their MAC addresses changed from 00:eb:24:b2:05:ac and 00:eb:24:b2:05:ad respectively to (say) 00:eb:24:b2:05:af. Note that this is a unicast MAC (which basically means the MAC address looks like a regular MAC address, such as one assigned to a single machine). Since a unicast MAC address can by definition only be assigned to one machine/ switch port, the NLB driver on each machine cheats a bit and changes the source MAC address of outgoing frames back to whatever the original NIC MAC address was. That is to say –

  • The first server
    • Has MAC address 00:eb:24:b2:05:ac
    • Which is changed to 00:eb:24:b2:05:af as part of enabling unicast NLB
    • However when traffic is sent out from this machine the source MAC address is changed back to 00:eb:24:b2:05:ac
  • Same for the second server

Why does this happen? This is because –

  • When a device wants to send data to the VIP address, it will try to find the MAC address using ARP. That is, it sends a broadcast over the network asking for the device with this IP address to respond. Since both servers now have the same MAC address on their NLB NIC, either server will respond with this common MAC address.
  • Now the switch receives frames for this MAC address. The switch does not have this in its CAM table, so it floods the frame to all ports – reaching both servers.
  • But why do both servers change the source MAC address of outgoing traffic? Because if outgoing frames carried the common MAC address, the switch would associate that common MAC address with one port – resulting in all future traffic to the common MAC address going to only one of the servers. By changing the outgoing frame’s MAC address back to the server’s original MAC address, the switch never gets to store the common MAC address in its CAM table, and frames for the common MAC address are always flooded.
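To see why the source-MAC masking matters, here’s a toy simulation (hypothetical names; a deliberately simplified switch model) contrasting the two behaviours:

```python
# Simplified learning-switch model showing why the NLB driver masks the
# shared cluster MAC on outgoing frames. Names are illustrative only.
def deliver(frames, all_ports):
    """Run (src_mac, src_port, dst_mac) frames through a learning switch."""
    cam, out = {}, []
    for src_mac, src_port, dst_mac in frames:
        cam[src_mac] = src_port          # switch learns source MAC -> port
        out.append([cam[dst_mac]] if dst_mac in cam
                   else [p for p in all_ports if p != src_port])
    return out

CLUSTER_MAC = "00:eb:24:b2:05:af"        # shared unicast NLB MAC
ports = [1, 2, 3]                        # 1 = server 1, 2 = server 2, 3 = uplink

# If server 1 ever sent frames FROM the cluster MAC, the switch would learn
# it on port 1 and pin all later cluster traffic to that single server:
print(deliver([(CLUSTER_MAC, 1, "client"), ("client", 3, CLUSTER_MAC)], ports))

# With the source masked back to the NIC's own MAC, the cluster MAC is never
# learned, so inbound cluster frames keep getting flooded to both servers:
print(deliver([("00:eb:24:b2:05:ac", 1, "client"), ("client", 3, CLUSTER_MAC)], ports))
```

In the first run the frame for the cluster MAC goes to port 1 only; in the second it is flooded to ports 1 and 2, which is exactly what NLB needs.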

In the context of VMware what this means is that (a) the port group to which the NLB NICs connect must allow MAC address changes and forged transmits; and (b) the port group’s default behaviour of notifying the physical switch of a VM’s MAC address when the VM is powered on must be disabled, because that notification would expose the cluster MAC address to the switch. Without these changes NLB will not work in Unicast mode with VMware.

(This is a good post to read more about NLB).

Apart from Unicast NLB there’s also Multicast NLB. In this form the NLB NIC’s MAC address is not changed. Instead, a new multicast MAC address is assigned to the NLB NIC, in addition to the regular MAC address of the NIC. The advantage of this method is that since each host retains its existing MAC address, communication between hosts is unaffected. However, since the new MAC address is a multicast MAC address – and switches by default are set to ignore such addresses – some changes need to be made on the switch side to get Multicast NLB working.
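What makes a MAC address “multicast” is simply the least-significant bit of its first octet being set; that’s all the switch looks at. A quick check in Python (the second address is a made-up example in the 03:bf:… style that multicast NLB typically uses):

```python
# A MAC address is multicast when the least-significant bit of its first
# octet is 1; unicast MACs have that bit clear.
def is_multicast(mac: str) -> bool:
    return (int(mac.split(":")[0], 16) & 1) == 1

print(is_multicast("00:eb:24:b2:05:ac"))  # False: ordinary unicast MAC
print(is_multicast("03:bf:01:02:03:04"))  # True: illustrative multicast MAC
```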

One thing to keep in mind: it’s important to add a default gateway address to your NLB NIC. At work, for instance, the NLB IPv4 address was reachable within its own network but not from across networks. Turns out that’s coz Windows 2008 onwards has strong host behavior – traffic coming in via one NIC does not go out via a different NIC, even if both are in the same subnet and the second NIC has a default gateway set. In our case I added the same default gateway to the NLB NIC too and it was then reachable across networks. 

HP DL360 Gen9 with HP FlexFabric 534 adapter and HP Ethernet 530 adapter and ESXi

That’s a very vague subject line, I know, but I couldn’t think of anything concise. Just wanted to put some keywords so that if anyone else comes across the same problem and types something similar into Google hopefully they stumble upon this post.

At work we got some HP DL360 Gen9s to use as ESXi hosts. To these servers we added additional network cards –

  • HP FlexFabric 10Gb 2-port 534FLR-SFP+ Adapter; and
  • HP Ethernet 10Gb 2-port 530SFP+ Adapter.

Each of these adapters has two NICs. Here’s a picture of the adapters in the server and the vmnic numbers ESXi assigns to them.

In this picture –

  • vmnic5 & vmnic4 are the HP FlexFabric 10Gb 2-port 534FLR-SFP+ Adapter;
  • vmnic6 & vmnic7 are the HP Ethernet 10Gb 2-port 530SFP+ Adapter;
  • vmnic0 – vmnic3 are the HP Ethernet 1Gb 4-port 331i Adapter (which comes built into the server); and
  • iLO is the iLO port (which I’ll ignore for now).

We didn’t want to use vmnic0 – vmnic3 as they are only 1Gb. So the idea was to use vmnic4 – vmnic7. Two NICs would be for Management+vMotion (connecting to two different switches); two NICs would be for iSCSI (again connecting to different switches).

We came across two issues. First, the FlexFabric NICs didn’t seem to support iSCSI. ESXi showed two iSCSI adapters, but the NICs mapped to them were the regular Ethernet 10Gb ones, not the FlexFabric 10Gb ones. Second, we wanted to use vmnic4 and vmnic6 for Management+vMotion and vmnic5 and vmnic7 for iSCSI – basically a NIC from each adapter, such that even if an adapter were to fail there’s a NIC from the other adapter for resiliency. This didn’t work for some reason. The Ethernet 10Gb NICs weren’t “connecting” to the network switch. They would connect in the sense that the link status appeared as connected and the LEDs on the switch and NICs blinked, but something was missing. There was no real connectivity.

Here’s what we did to fix these.

But first, for both these fixes you have to reboot the server and go into the System Utilities menu (press F9 during boot).

Change 1: Enable iSCSI on the FlexFabric adapter (vmnic4 and vmnic5)

Once in the System Utilities menu select “System Configuration”.

Select the first FlexFabric NIC (port 1).

Then select the Device Hardware Configuration menu.

You will see that the storage personality is FCoE.

That’s the problem. This is why the FlexFabric adapters don’t show up as iSCSI adapters. Select the FCoE entry and change it to iSCSI.

Now press Esc to go back through the previous menus (you will be prompted to save the changes – do so). Then repeat the above steps for the second FlexFabric NIC (port 2).

With this change the FlexFabric NICs will appear as iSCSI adapters. Now for the second change.

Change 2: Enable DCB for the Ethernet adapters

From the System Configuration menu now select the first Ethernet NIC (port 1).

Then select its Device Hardware Configuration menu.

Notice the entry for “DCB Protocol”. Most likely it is “Disabled” (which is why the NICs don’t work for you).

Change that to “Enabled” and the NICs will work.

That’s it. Once again press Esc (choosing to save the changes when prompted) and then reboot the system. Now all the NICs will work as expected and appear as iSCSI adapters too.

I have no idea what DCB does. From what I can glean via Google it seems to be a set of extensions to Ethernet that provide “hardware-based bandwidth allocation to a specific type of traffic and enhances Ethernet transport reliability with the use of priority-based flow control” (via TechNet) (also check out this Cisco whitepaper for more info). I didn’t read much into it because I couldn’t find anything that mentioned why DCB mattered in this case – as in, why were the NICs not working when DCB was disabled? The NICs are connected to an HP 5920AF switch but I couldn’t find anything suggesting the switch requires DCB enabled for the ports to work. This switch supports DCB but that doesn’t imply it requires DCB.

Anyhow, the FlexFabric adapters have DCB enabled by default which is probably why they worked. That’s how I got the idea to enable DCB on the Ethernet adapters to see if it makes a difference – and it did! The only thing I can think of is that DCB also seems to include a DCBX (Data Centre Bridging Exchange) protocol which is about discovering peers, discovering mismatched configuration etc – so maybe the fact that DCB was disabled on these adapters made the switch not “see” these NICs and soft-disable them somehow. That’s my guess at least.

vMotion NIC load balancing fails even though there is an active link

The other day I blogged about how I had a host whose vMotion VMkernel interface seemed to be broken. Any vMotion attempts to it would hang at 14%.

At that time I logged on to the destination host, then used vmkping with the -I switch (to explicitly specify the vMotion VMkernel interface of the destination host), and found that I couldn’t ping the VMkernel interface of the other hosts. These hosts could ping each other but couldn’t ping the destination host.

The VMkernel interface is backed by two physical NICs. I found that if I removed one of the physical NICs from the VMkernel it worked. Interestingly, this link wasn’t showing any CDP info either, so it looked like something was wrong with it (the physical NIC shows as unclaimed coz the screenshot was taken after I moved it to unclaimed).

So the first question is: why did the VMkernel fail when only one of its physical NICs failed? Since the other physical NIC backing the VMkernel NIC was still active, shouldn’t it have continued working?

The reason it failed is that by default network failover detection is via “Link status only”. This only detects failures of the link itself – say the cable is broken, the switch is down, or the NIC has failed – while failures such as the link being connected but blocked by the switch are not detected. In my case, as you can see from the screenshot above, the link status is connected – so the host doesn’t consider the link failed even though it isn’t actually working, and continues to use it.

Next I discovered that other hosts too had their second vMotion physical NIC in a failed state like this, yet they weren’t failing like this host. The simple explanation is that the host above happened to select the faulty physical NIC as the one to use, didn’t detect it as failed, and so continued to use it; whereas the other hosts were luckier and chose the physical NIC that worked fine, so they had no issues.

I am not sure that’s the entire answer though. For one, the host that failed was ESXi 5.5 and using a distributed switch, while the other two hosts were ESXi 4.0 and using standard switches. Did that make a difference?

The default load balancing method for both standard and distributed switches is the same. (For a standard switch you check this under the vSwitch properties on the host. For a distributed switch you check this under the port group in the Networking section of the vSphere (web) client.)

Load balancing is what I am concerned about here because that’s what the host uses to select the physical NIC for a particular traffic flow. The load balancing method is the same between standard and distributed switches, yet why were the distributed switch/ ESXi 5.5 hosts behaving differently?

I am still not sure of the answer but I have a theory. My theory is that since a distributed switch spans multiple hosts, the load balancing method above – choosing a route based on originating virtual port ID – comes into play. Here are screenshots from two of my hosts connected to the same distributed switch port group, for instance:

As you can see, the virtual port number is different for the VMkernel NIC of each host. So each host could potentially use a different underlying physical NIC, depending on how the load balancing algorithm maps it.

But what about a standard switch? Since the standard switch exists only on the host, and the only VMkernel NIC connected to it (in the case of vMotion) is the single VMkernel NIC I have assigned for vMotion, there is no load balancing coming into play! If, instead of a VMkernel NIC, I had a Virtual Machine network, the virtual port number would matter because multiple VMs connect to the various port numbers; but that doesn’t matter for VMkernel NICs as there is only one of them. And so my theory is that for a VMkernel NIC (such as vMotion) backed by multiple physical NICs and using the default load balancing algorithm of virtual port ID, all traffic by default goes out one of the physical NICs and the other is never used unless the chosen one fails. And that is why my hosts using standard switches were always using the same physical NIC (am guessing the lower-numbered one, as that’s what both hosts chose) while hosts using distributed switches could have chosen different physical NICs per host.
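My theory can be sketched in a few lines (purely illustrative – ESXi’s real port-to-uplink mapping is internal; the point is only that it is deterministic per virtual port):

```python
# Illustrative-only sketch of "route based on originating virtual port ID":
# the uplink is derived deterministically from the virtual port number, so a
# given vmk NIC always maps to the same physical NIC until that NIC fails.
def uplink_for_port(virtual_port_id, uplinks):
    return uplinks[virtual_port_id % len(uplinks)]

uplinks = ["vmnic2", "vmnic3"]           # hypothetical uplink names

# Standard switch: the lone vMotion vmk has one fixed port, hence one uplink,
# and the second uplink sits idle unless the first fails:
print(uplink_for_port(8, uplinks))

# Distributed switch: each host's vmk gets a different virtual port number,
# so different hosts can land on different physical NICs:
print(uplink_for_port(137, uplinks))
```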

That’s all! Just thought I’d put this out there in case anyone else has the same question.

vMotion is using the Management Network (and failing)

Was migrating one of our offices to a new IP scheme the other day and vMotion started failing. I had a good idea what the problem could be (coz I encountered something similar a few days ago in another context) so here’s a blog post detailing what I did.

For simplicity let’s say the hosts have two VMkernel NICs – vmk0 and vmk1. vmk0 is connected to the Management Network. vmk1 is for vMotion. Both are on separate VLANs.

When our network admins gave out the new IPs they gave addresses from the same range for both functions. That is, on every host both vmk0 and vmk1 were given IPs from the same /24 network.

Since both interfaces are on separate VLANs (basically separate LANs) the above setup won’t work. That’s because as far as the hosts are concerned both interfaces are on the same network yet physically they are on separate networks. Here’s the routing table on the hosts:

Notice that any traffic to that network goes via vmk0 – and that includes the vMotion traffic, because it too is in the same network! And since the network vmk0 is on is physically a separate network (because it is a VLAN), this traffic will never reach the vMotion interfaces of the other hosts, because they never see it.

So even though you have specified vmk1 as your vMotion traffic NIC, it never gets used because of the default routes.

If you could force the outgoing traffic to specifically use vmk1 it would work. Below are the results of vmkping using the default route vs explicitly using vmk1 (via the -I switch):

The solution here is to either remove the VLANs and continue with the existing IP scheme, or to keep using VLANs but assign a different IP network for the vMotion interfaces.

Update: Came across the following from this blog post while searching for something else:

If the management network (actually the first VMkernel NIC) and the vMotion network share the same subnet (same IP-range) vMotion sends traffic across the network attached to first VMkernel NIC. It does not matter if you create a vMotion network on a different standard switch or distributed switch or assign different NICs to it, vMotion will default to the first VMkernel NIC if same IP-range/subnet is detected.

Please be aware that this behavior is only applicable to traffic that is sent by the source host. The destination host receives incoming vMotion traffic on the vMotion network!

That answered another question I had but didn’t blog about in my post above. You see, my network admins had also set the iSCSI networks to be in the same subnet as the management network – but on separate VLANs – yet the iSCSI traffic was correctly flowing over that VLAN instead of defaulting to the management VMkernel NIC. Now I understand why! It’s only vMotion that defaults to the first VMkernel NIC when they share the same IP range/ subnet.