Notes on NLB, VMware, etc

Just some notes to myself so I am clear about the topic while reading up on it, in the context of this VMware KB article – Microsoft NLB not working properly in Unicast mode.

Before I get to the article I better talk about a regular scenario. Say you have a switch and it’s got a couple of devices connected to it. A switch is a layer 2 device – meaning it has no knowledge of IP addresses, networks, etc. All devices connected to a switch are in the same network. The devices on a switch use MAC addresses to communicate with each other. Yes, the devices have IPv4 (or IPv6) addresses, but how they communicate with each other is via MAC addresses.

Say Server A (IPv4 address, say) wants to communicate with Server B (IPv4 address Both are connected to the same switch, hence on the same LAN. Communication between them happens in layer 2. Here the machines identify each other via MAC addresses, so first Server A checks whether it knows the MAC address of Server B. If it knows (usually coz Server A has communicated with Server B recently and the MAC address is cached in its ARP table) then there’s nothing to do; but if it does not, then Server A finds the MAC address via something called ARP (Address Resolution Protocol). The way this works is that Server A broadcasts to the whole network that it wants the MAC address of the machine with IPv4 address (the address of Server B). This message goes to the switch, the switch sends it to all the devices connected to it, Server B replies with its MAC address, and that is sent to Server A. The two now communicate – I’ll come to that in a moment.
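On a Windows box you can peek at this ARP cache, by the way – a quick sketch:

```powershell
# Show the ARP (neighbor) cache – IPv4 address to MAC address mappings
Get-NetNeighbor -AddressFamily IPv4 | Select-Object IPAddress, LinkLayerAddress, State

# Or the classic way
arp -a
```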

When it’s communication from devices in a different network to Server A or Server B, the idea is similar except that you have a router connected to the switch. The router receives traffic for a device on this network – it knows the IPv4 address – so it finds the MAC address as above and passes the traffic on to that device. Simple.

Now, how does the switch know which port a particular device is connected to? Say the switch gets traffic addressed to MAC address 00:eb:24:b2:05:ac – how does the switch know which port that is on? Here’s how that happens –

  • First the switch checks if it already has this information cached. Switches have a table called the CAM (Content Addressable Memory) table which holds this cached info.
  • Assuming the CAM table doesn’t have this info, the switch will send the frame (containing the packets for the destination device) to all ports. Note that this is not like ARP, where a question is sent asking for the device to respond; instead the frame is simply flooded to all ports on the network.
  • When a switch receives frames from a port it notes the source MAC address and port and that’s how it keeps the CAM table up to date. Thus when Server A sends data to Server B, the MAC address and switch port of Server A are stored in the switch’s CAM table.  This entry is only stored for a brief period.

Now let’s talk about NLB (Network Load Balancing).

Consider two machines – with MAC address 00:eb:24:b2:05:ac and with MAC address 00:eb:24:b2:05:ad (addresses made up, as before). NLB is a form of load balancing wherein you create a Virtual IP (VIP) such as, such that any traffic to is sent to either or Thus you have traffic being load balanced between the two machines; and not only that, if one of the machines goes down nothing is affected, because the other machine can continue handling the traffic.

But now we have a problem. If we want a VIP that should send traffic to either host, how will this work when it comes to MAC addresses? That depends on the type of NLB. There’s two sorts – Unicast and Multicast.

In Unicast mode the NIC that is used for clustering on each server has its MAC address changed to a new Unicast MAC address that’s the same for all hosts. Thus, in the scenario above, the NICs that hold the NLB IP address will have their MAC addresses changed from 00:eb:24:b2:05:ac and 00:eb:24:b2:05:ad respectively to (say) 00:eb:24:b2:05:af. Note that this is a Unicast MAC (which basically means it looks like a regular MAC address, such as one assigned to a single machine). Since a Unicast MAC address can by definition only be assigned to one machine/ switch port, the NLB driver on each machine cheats a bit and changes the source MAC address of outgoing frames back to whatever the original NIC MAC address was. That is to say –

  • Server
    • Has MAC address 00:eb:24:b2:05:ac
    • Which is changed to 00:eb:24:b2:05:af as part of enabling Unicast NLB
    • However when traffic is sent out from this machine the source MAC address is changed back to 00:eb:24:b2:05:ac
  • Same for Server (00:eb:24:b2:05:ad)

Why does this happen? This is because –

  • When a device wants to send data to the VIP (, it will try to find the MAC address using ARP. That is, it sends a broadcast over the network asking for the device with this IP address to respond. Since both servers now have the same MAC address for their NLB NIC, either server will respond with this common MAC address.
  • Now the switch receives frames for this MAC address. The switch does not have it in its CAM table, so it will flood the frame to all ports – reaching both servers.
  • But why do the servers change the source MAC address of outgoing traffic? That’s because if outgoing frames had the common MAC address, the switch would associate the common MAC address with that port – resulting in all future traffic to the common MAC address going to just one of the servers. By changing the outgoing frame’s MAC address back to the server’s original MAC address, the switch never gets to store the common MAC address in its CAM table, and all frames for the common MAC address are always flooded.

In the context of VMware what this means is that (a) the port group to which the NLB NICs connect must allow MAC address changes and forged transmits; and (b) by default the port group notifies the physical switch of a VM’s MAC address when the VM is powered on – since this would expose the cluster MAC address to the switch, this notification too must be disabled. Without these changes NLB will not work in Unicast mode with VMware.
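Here’s a minimal PowerCLI sketch of those settings, assuming a standard vSwitch port group I’m calling “NLB” on a host esx01 (both names made up):

```powershell
# Hypothetical port group used by the NLB NICs
$pg = Get-VMHost esx01 | Get-VirtualPortGroup -Name "NLB"

# (a) Allow the NLB driver to change the NIC MAC, and to send frames whose
# source MAC differs from the effective MAC (forged transmits)
$pg | Get-SecurityPolicy | Set-SecurityPolicy -MacChanges $true -ForgedTransmits $true

# (b) Stop the host from notifying the physical switch of the VM's MAC,
# so the cluster MAC never ends up in the switch's CAM table
$pg | Get-NicTeamingPolicy | Set-NicTeamingPolicy -NotifySwitches $false
```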

(This is a good post to read more about NLB).

Apart from Unicast NLB there’s also Multicast NLB. In this form the NLB NIC’s MAC address is not changed. Instead, a new Multicast MAC address is assigned to the NLB NIC. This is in addition to the regular MAC address of the NIC. The advantage of this method is that since each host retains its existing MAC address, communication between the hosts is unaffected. However, since the new MAC address is a Multicast MAC address – and switches by default are set to ignore such addresses – some changes need to be made on the switch side to get Multicast NLB working.

One thing to keep in mind is that it’s important to add a default gateway address to your NLB NIC. At work, for instance, the NLB IPv4 address was reachable within its own network but not from across networks. Turns out that’s coz Windows Server 2008 onwards has strong host behavior – traffic coming in via one NIC does not go out via a different NIC, even if both are in the same subnet and the second NIC has a default gateway set. In our case I added the same default gateway to the NLB NIC too and it was then reachable across networks.

Use PowerShell/ PowerCLI to get VM space usage

Wanted to get the space used by all VMs across a bunch of our newer hosts –
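Something along these lines (a sketch – the host name pattern is made up):

```powershell
# Per-VM used space (GB) for all VMs on hosts matching a pattern
Get-VMHost "newhost*" | Get-VM |
    Select-Object Name, @{N="UsedSpaceGB"; E={[math]::Round($_.UsedSpaceGB, 2)}} |
    Sort-Object UsedSpaceGB -Descending
```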

There’s probably a way to show the total too but I used a separate pipeline for that –
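For instance (same made-up host pattern):

```powershell
# Total used space (GB) across the same VMs
(Get-VMHost "newhost*" | Get-VM | Measure-Object UsedSpaceGB -Sum).Sum
```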


HP DL360 Gen9 with HP FlexFabric 534 adapter and HP Ethernet 530 adapter and ESXi

That’s a very vague subject line, I know, but I couldn’t think of anything concise. Just wanted to put some keywords so that if anyone else comes across the same problem and types something similar into Google hopefully they stumble upon this post.

At work we got some HP DL360 Gen9s to use as ESXi hosts. To these servers we added additional network cards –

  • HP FlexFabric 10Gb 2-port 534FLR-SFP+ Adapter; and
  • HP Ethernet 10Gb 2-port 530SFP+ Adapter.

Each of these adapters has two NICs. Here’s a picture of the adapters in the server and the vmnic numbers ESXi assigns to them.

In this picture –

  • vmnic5 & vmnic4 are the HP FlexFabric 10Gb 2-port 534FLR-SFP+ Adapter;
  • vmnic6 & vmnic7 are the HP Ethernet 10Gb 2-port 530SFP+ Adapter; and
  • vmnic0 – vmnic3 are the HP Ethernet 1Gb 4-port 331i Adapter (which comes built into the server);
  • iLO is the iLO port (which I’ll ignore for now).

We didn’t want to use vmnic0 – vmnic3 as they are only 1Gb. So the idea was to use vmnic4 – vmnic7. Two NICs would be for Management+vMotion (connecting to two different switches); two NICs would be for iSCSI (again connecting to different switches).

We came across two issues. First, the FlexFabric NICs didn’t seem to support iSCSI. ESXi showed two iSCSI adapters but the NICs mapped to them were the regular Ethernet 10Gb ones, not the FlexFabric 10Gb ones. Second, we wanted to use vmnic4 and vmnic6 for Management+vMotion and vmnic5 and vmnic7 for iSCSI – basically a NIC from each adapter, so that even if an adapter were to fail there’s a NIC from the other adapter for resiliency. This didn’t work for some reason. The Ethernet 10Gb NICs weren’t really “connecting” to the network switch. They would connect in the sense that the link status appears as connected and the LEDs on the switch and NICs blink, but something was missing. There was no real connectivity.

Here’s what we did to fix these.

But first, for both these fixes you have to reboot the server and go into the System Utilities menu.


Change 1: Enable iSCSI on the FlexFabric adapter (vmnic4 and vmnic5)

Once in the System Utilities menu select “System Configuration”.

Select the first FlexFabric NIC (port 1).

Then select the Device Hardware Configuration menu.

You will see that the storage personality is FCoE.

That’s the problem. This is why the FlexFabric adapters don’t show up as iSCSI adapters. Select the FCoE entry and change it to iSCSI.

Now press Esc to go back to the previous menus (you will be prompted to save the changes – do so). Then repeat the above steps for the second FlexFabric NIC (port 2).

With this change the FlexFabric NICs will appear as iSCSI adapters. Now for the second change.

Change 2: Enable DCB for the Ethernet adapters

From the System Configuration menu now select the first Ethernet NIC (port 1).

Then select its Device Hardware Configuration menu.

Notice the entry for “DCB Protocol”. Most likely it is “Disabled” (which is why the NICs don’t work for you).

Change that to “Enabled” and now the NICs will work.

That’s it. Once again press Esc (choosing to save the changes when prompted) and then reboot the system. Now all the NICs will work as expected and appear as iSCSI adapters too.

I have no idea what DCB does. From what I can glean via Google it seems to be a set of extensions to Ethernet that provide “hardware-based bandwidth allocation to a specific type of traffic and enhances Ethernet transport reliability with the use of priority-based flow control” (via TechNet) (also check out this Cisco whitepaper for more info). I didn’t read much into it because I couldn’t find anything that mentioned why DCB mattered in this case – as in, why were the NICs not working when DCB was disabled? The NICs are connected to an HP 5920AF switch but I couldn’t find anything that suggested the switch requires DCB enabled for the ports to work. This switch supports DCB but that doesn’t imply it requires DCB.

Anyhow, the FlexFabric adapters have DCB enabled by default which is probably why they worked. That’s how I got the idea to enable DCB on the Ethernet adapters to see if it makes a difference – and it did! The only thing I can think of is that DCB also seems to include a DCBX (Data Centre Bridging Exchange) protocol which is about discovering peers, discovering mismatched configuration etc – so maybe the fact that DCB was disabled on these adapters made the switch not “see” these NICs and soft-disable them somehow. That’s my guess at least.

Windows DNS server subnet prioritization and round-robin

Consider the following multiple A records for a DNS record – say www.example.com (all addresses here are made-up examples):

  • www.example.com. IN A
  • www.example.com. IN A
  • www.example.com. IN A
  • www.example.com. IN A
  • www.example.com. IN A

These records are defined on a DNS server. When a client queries the DNS server for the address of www.example.com, the DNS server returns all the addresses above. However, the order of answers returned keeps varying. The first client asking could get them in the following order, for instance:

  • www.example.com. IN A
  • www.example.com. IN A
  • www.example.com. IN A
  • www.example.com. IN A
  • www.example.com. IN A

The second client could get them in the following order:

  • www.example.com. IN A
  • www.example.com. IN A
  • www.example.com. IN A
  • www.example.com. IN A
  • www.example.com. IN A

The third client could get:

  • www.example.com. IN A
  • www.example.com. IN A
  • www.example.com. IN A
  • www.example.com. IN A
  • www.example.com. IN A

This is called round-robin. Basically it rotates between the various IP addresses. All IP addresses are offered as answers, but the order is rotated so that as long as clients choose the first answer in the list every client chooses a different IP address.

Notice I said clients choose the first answer in the list. This needn’t always be the case though. When I said clients above, I meant the client computer that is querying the DNS server for an answer. But that’s not really who’s querying the server. Instead, an application on the client (e.g. Chrome, Internet Explorer) or the client OS itself is the one looking for an answer. These ask the DNS resolver – which is usually a part of the OS – for an answer, and it’s the resolver that actually queries the server and gets the list of answers above.

The DNS resolver can then return the list as it is to the requesting application, or it can apply a re-ordering of its own. For instance, if the client is from the network, the resolver may re-order the answers such that the answer is always first. This is called subnet prioritization. Basically, the resolver prioritizes answers that are from the same subnet as the client. The idea being that client applications would prefer reaching a server in their own subnet (it’s closer to them – no need to go over the WAN link, for instance) than one on a different subnet.

Subnet prioritization can be disabled on the resolver side by adding a registry key PrioritizeRecordData (link) with value 0 (REG_DWORD) at HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\DnsCache\Parameters. By default this key does not exist, and the default behavior corresponds to a value of 1 (subnet prioritization enabled).

Subnet prioritization can also be set on the server side so it orders the responses based on the client network. This is controlled by the registry key LocalNetPriority (link) under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\DNS\Parameters\ on the DNS server. By default this is 0, so the server doesn’t do any subnet prioritization. Change this to 1 and the server will order its responses according to the client subnet.
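Both are simple registry tweaks – a sketch:

```powershell
# Client/resolver side: disable subnet prioritization (PrioritizeRecordData = 0)
New-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Services\DnsCache\Parameters" `
    -Name PrioritizeRecordData -PropertyType DWord -Value 0 -Force

# Server side: enable subnet prioritization (LocalNetPriority = 1), then
# restart the DNS service for it to take effect
New-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Services\DNS\Parameters" `
    -Name LocalNetPriority -PropertyType DWord -Value 1 -Force
Restart-Service DNS
```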

By default the server also does round-robin for the results it returns. This can be turned off via the DNS Management tool (under server properties > advanced tab). If round-robin is off the server returns records in the order they were added.

More on subnet prioritization at this link.

That’s not the end though. :)

Consider a server which has round-robin and subnet prioritization enabled. Now consider the DNS records from above:

  • www.example.com. IN A
  • www.example.com. IN A
  • www.example.com. IN A
  • www.example.com. IN A
  • www.example.com. IN A

The first and last records ( and are from class C networks. The other three (the 10.1.x.x ones) are from class A address space. In reality though, thanks to CIDR, classes don’t matter and these are all treated as /24 networks.

Now say there’s a client with IP address asking the server for answers. On the face of it the client network does not match any of the answer record networks – assuming /24 networks, is not the same network as – so the server will simply return answers as per round-robin, without any re-ordering. But in reality the client is in the same network as; both are part of a larger network that’s simply been broken into multiple /24 networks (to denote clients, servers, etc). What can we do so the server correctly prioritizes that record for this client?

This is where the LocalNetPriorityNetMask registry key under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\DNS\Parameters\ on the DNS server comes into play. This key – which does not exist by default – tells the server what subnet mask to assume when it’s trying to subnet prioritize. By default the server assumes a /24 subnet, but by tweaking this key we can tell the server to use a different subnet in its calculations and thus correctly return an answer.

The LocalNetPriorityNetMask key takes a REG_DWORD value in a hex format. Check out this KB article for more info, but a quick run through:

A netmask can be written as 4 pairs of numbers. The LocalNetPriorityNetMask key is of format 0xaabbccdd – again, 4 pairs of hex numbers. This is a mask that’s applied against, so to calculate this number you subtract the mask you want from and convert the resulting numbers into hex.

For example: you want a /8 netmask. That is Subtracting this from leaves you with What’s that in hex? 00ffffff. So LocalNetPriorityNetMask will be 0x00ffffff. Easy?

So in the example above I want a /20 netmask. That is, I am telling the server to assume the clients and the record IPs to be in /20 networks, and subnet prioritize accordingly. A /20 netmask is Subtract that from to get Which in hex is 00000fff (0 decimal is 00 hex, 15 is 0f, 255 is ff). So all I have to do is put 0x00000fff as LocalNetPriorityNetMask on the DNS server, restart the DNS service, and now the server will correctly return subnet prioritized answers for my /20 network.
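If you don’t fancy the mental arithmetic, here’s a throwaway sketch that does the conversion for any prefix length:

```powershell
# Compute LocalNetPriorityNetMask for a given prefix length
$prefix = 20
$mask   = ([uint64]0xffffffff -shl (32 - $prefix)) -band [uint64]0xffffffff  # /20 ->
$value  = [uint64]0xffffffff - $mask                                         # -
"0x{0:x8}" -f $value                                                         # -> 0x00000fff
```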

It is possible to vMotion VMs across ESX hosts without shared storage

Today (well actually, a few days ago; but today is when I read more about it) I learnt that you can vMotion VMs across hosts without shared storage.

This is only for vSphere 5.1 and above. That’s a pretty cool feature, especially because at work we are migrating all our VMs to new hosts & storage, and one of the things we were wondering about was how to move the VMs across. The new hosts have 3Par storage while the old hosts have StoreVirtual storage, so the thinking was that we’d probably have to give the new hosts access to the StoreVirtual storage and then do a vMotion. Now we won’t have to!

There’s no separate name for this sort of vMotion and it seems to be a feature that isn’t hyped much. For anyone interested, here are some screenshots on how to do such a vMotion.

For starters here’s my testlab setup:

One datacenter. Two clusters. Cluster one has two hosts with shared storage. Cluster two has a single host with no shared storage. UBUNTU1 is a VM I would like to migrate over.

Note that host esx03 has no connectivity to the shared storage either. I have removed the iSCSI VMkernel mappings from it so there’s no confusion.

ESX01 and ESX02 have access to shared storage.

Migration is quite simple. Right click the VM and select Migrate. Choose the option to migrate both host and datastore. If the VM is powered on (which it would be, as we are doing a vMotion instead of a cold migration) you will see the option is grayed out in the older/ C# vSphere client.

That’s because the newer features of vSphere 5.1 are only available in the web client, so you’ll have to use that instead (thanks to this blog post for pointing me to that).

Select the destination host. Note that vMotion is only possible within a datacenter, so you can only choose a host in the same datacenter (as opposed to cold migration, which can happen between datacenters).

The wizard then walks you through selecting the destination datacenter, host, and datastore.

Notice that any datastore accessible from the destination host can be selected.

And that’s it. vMotion begins and I have easily live migrated a VM from one host to another without any shared storage. Cool! :)


Get a list of users in an OU along with last logged on date

Trivial stuff. Wanted to note it down someplace for future reference –
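Something like this does it (a sketch – the OU path is made up; needs the ActiveDirectory module):

```powershell
# Users in an OU along with their last logged on date. LastLogonDate is derived
# from the replicated lastLogonTimestamp attribute, so it can lag by up to ~14 days.
Get-ADUser -SearchBase "OU=Staff,DC=example,DC=com" -Filter * -Properties LastLogonDate |
    Select-Object Name, LastLogonDate |
    Sort-Object LastLogonDate
```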


Start-BitsTransfer does nothing

I had to copy some VMware templates from our head office to the branch offices. Thought I’d copy them out of the datastores manually, then do a BITS transfer to the remote offices. This way I can do the transfer during normal hours but with minimal user impact.

Since PowerShell 2.0 there’s been the Start-BitsTransfer cmdlet to do BITS transfers.

Oddly, however, the command would just exit without any error for me. And it didn’t seem to be doing anything. Then I realized I was pointing the cmdlet at my source folder and that’s why it was failing! Start-BitsTransfer only takes files. You can specify wildcards to select multiple files but you can’t point it to a folder.

So the following doesn’t work:
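For example (made-up paths):

```powershell
# Fails silently – the source is a folder
Start-BitsTransfer -Source "C:\Templates" -Destination "\\branch01\Templates"
```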

But this works:
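Same made-up paths, but pointing at files:

```powershell
# Works – the source is a set of files (wildcards are fine)
Start-BitsTransfer -Source "C:\Templates\*.vmdk" -Destination "\\branch01\Templates"
```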

Since Start-BitsTransfer supports wildcards it’s mostly fine, unless your folder contains sub-folders and you want to copy those and/ or preserve the structure. Fortunately this is PowerShell so it’s just a matter of creating some wrappers around the cmdlet to support folders too. Like this one for instance. Or use CSV files like in this MSDN article.

One thing to keep in mind – Start-BitsTransfer has a default priority (specified via the -Priority switch) of Foreground. This competes with other applications so is probably not what you want. Your alternatives are High, Normal, or Low – each of which does the transfer in the background and uses the idle bandwidth of the client (the priority determines which transfer job gets precedence over other similar BITS transfers).

Another thing to keep in mind is that BITS only really looks at the bandwidth availability of the client (when I say “client” I am not sure if it’s the sender or receiver – I didn’t read much into this). In a LAN environment it could be the case that the WAN side is saturated but the particular client you are targeting is idle – in this case BITS will use the full client bandwidth available to it even though the network itself doesn’t have any spare bandwidth. (This was the case prior to BITS 2.0; since then BITS can use an Internet Gateway Device to try and assess the bandwidth availability on the WAN side. This requires an IGD that is find-able via UPnP and that supports such reporting, so I am not sure how well it works in practice.) You can also use GPOs to control BITS bandwidth usage.

Anyhoo, this is not a BITS intro so I’ll leave it at that. :)

Once you start a BITS transfer you can also pause it via Suspend-BitsTransfer or cancel it via Remove-BitsTransfer. The latter will also delete any files that were already transferred, so if you just want to stop but leave the transferred files as they are, use Complete-BitsTransfer instead.
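For asynchronous transfers you have to call Complete-BitsTransfer at the end anyway, or the files may stay as temporary files. A rough sketch (made-up paths again):

```powershell
# Kick off a background transfer at Normal priority
$job = Start-BitsTransfer -Source "C:\Templates\*" -Destination "\\branch01\Templates" `
    -Asynchronous -Priority Normal

# Wait for it to finish, then commit the files
while ($job.JobState -ne "Transferred" -and $job.JobState -ne "Error") {
    Start-Sleep -Seconds 10
}
$job | Complete-BitsTransfer
```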

That’s all for now!

p.s. Almost forgot. You can also use BITS to download HTTP files. Like in this post for instance.

vMotion NIC load balancing fails even though there is an active link

The other day I blogged about how I had a host whose vMotion VMkernel interface seemed to be broken. Any vMotion attempts to it would hang at 14%.

At that time I logged on to the destination host, then used vmkping with the -I switch (to explicitly specify the vMotion VMkernel interface of the destination host), and found that I couldn’t ping the VMkernel interface of the other hosts. These hosts could ping each other but couldn’t ping the destination host.

The VMkernel interface is backed by two physical NICs. I found that if I removed one of the physical NICs from the VMkernel interface it worked. Interestingly this link wasn’t showing any CDP info either, so it looked like something was wrong with it (the physical NIC shows as unclaimed coz the screenshot was taken after I moved it to unclaimed).

So the first question is: why did the VMkernel interface fail when only one of the physical NICs failed? Since the other physical NIC backing the VMkernel NIC was still active, shouldn’t it have continued working?

The reason it failed is that by default network failover detection is via “Link status only”. This only detects failures of the link itself – say the cable is broken, the switch is down, or the NIC has failed – while failures such as the link being connected but blocked by the switch are not detected. In my case, as you can see from the screenshot above, the link status is connected – so the host doesn’t consider the link failed even though it isn’t actually working, and thus continues to use it.

Next I discovered that other hosts too had their second vMotion physical NIC in a failed state like this, yet they weren’t failing like this host. The simple explanation is that the host above somehow selected the faulty physical NIC as the one to use, didn’t detect it as failed, and so continued to use it; whereas the other hosts were luckier and chose the physical NIC that works, so didn’t have any issues.

I am not sure that’s the entire answer though. For one, the host that failed was ESXi 5.5 and using a distributed switch, while the other two hosts were ESXi 4.0 and using standard switches. Did that make a difference?

The default load balancing method for both standard and distributed switches is the same. (For a standard switch you check this under the vSwitch properties on the host. For a distributed switch you check this under the portgroup in the Networking section of vSphere (web) client).

Load balancing is what I am concerned about here, because that’s what the host uses to select the physical NIC for a particular traffic flow. The load balancing method is the same between standard and distributed switches, yet why were the distributed switch/ ESXi 5.5 hosts behaving differently?

I am still not sure of an answer but I have my theory. My theory is that since a distributed switch is across multiple hosts the load balancing method (above) of choosing a route based on virtual port ID comes into play. Here’s screenshots from two of my hosts connected to the same distributed switch port group for instance:

As you can see the virtual port number is different for the VMkernel NIC of each host. So each host could potentially use a different underlying physical NIC depending on how the load balancing algorithm maps it.

But what about a standard switch? Since the standard switch exists only on the host, and the only VMkernel NIC connected to it (in the case of vMotion) is the single VMkernel NIC I have assigned for vMotion, there is no load balancing algorithm coming into play! If, instead of a VMkernel NIC, I had a Virtual Machine network, then the virtual port number matters because there are multiple VMs connecting to the various port numbers; but that doesn’t matter for VMkernel NICs as there is only one of them.

And so my theory is that for a VMkernel NIC (such as vMotion) backed by multiple physical NICs and using the default load balancing algorithm of virtual port ID, all traffic by default goes to one of the physical NICs, and the other physical NIC is never used unless the chosen one fails. That is why my hosts using standard switches were always using the same physical NIC (am guessing the lower numbered one, as that’s what both hosts chose), while hosts using distributed switches could have chosen different physical NICs per host.

That’s all! Just thought I’d put this out there in case anyone else has the same question.

AppV – Empty package map for package content root

Had an interesting problem at work yesterday about which I wish I could write a long and interesting blog post, but truthfully it was such a simple thing once I identified the cause.

We use AppV for streaming applications. We have many branch offices so there’s a DFS share which points to targets in each office. AppV installations in each office point to this DFS share and, thanks to the magic of DFS referrals, correctly pick up the local Content folder. From the day before, however, one of our offices started getting errors with AppV apps (same as in this post), and when I checked the AppV server I found errors like the one in this post’s title in the Event Logs.

The DFS share seemed to be working OK. I could open it via File Explorer and its contents seemed correct. I checked the number of files and the size of the share and they matched across offices. If I pointed the DFS share to use a different target (open the share in File Explorer, right click, Properties, go to the DFS tab and select a different location target) AppV worked. So the problem definitely looked like something to do with the local target, but what was wrong?

I tried forcing a replication. I checked permissions and used tools like dfsrdiag to confirm things were alright. No issues anywhere. Restarting the DFS Replication service on the server threw up some errors in the Event Logs about some AD objects, so I spent some time chasing up that tree (looks like older replication groups that were still hanging around in AD with missing info but not present in the DFS Management console any more) until I realized all the replication servers were throwing similar errors. Moreover, adding a test folder to the source DFS share correctly resulted in it appearing on the local target immediately – so obviously replication was working correctly.

I also used robocopy to compare the local target and another one and saw that they were identical.

Bummer. Looked like a dead end and I left it for a while.

Later, while sitting through a boring conference call, I had a brainwave: maybe the AppV service runs in a different user context, and that context may not be seeing the DFS share? As in, maybe the error message above is literally what is happening – AppV really is seeing an empty content root, and it’s not a case of a corrupt content root or just some missing files.

So I checked the AppV service and saw that it runs as NT AUTHORITY\NETWORK SERVICE. Ah ha! That means it authenticates with the remote server using the machine account of the server AppV is running on. I thought I’d verify what happens by launching File Explorer or a Command Prompt as NT AUTHORITY\NETWORK SERVICE, but this was a Server 2003 machine and apparently there’s no straightforward way to do that. (You can use psexec to launch something as .\LOCALSYSTEM, and starting from Server 2008 you can create a scheduled task that runs as NT AUTHORITY\NETWORK SERVICE and launch that to get what you want, but I couldn’t use that here. Also, I think you need to first run as the .\LOCALSYSTEM account and then run as the NT AUTHORITY\NETWORK SERVICE account.) So I checked the Audit logs of the server hosting the DFS target and sure enough found errors showing that the machine account of the AppV server was indeed being denied login.

Awesome! Now we are getting somewhere.

I fired up the Local Security Policy console on the server hosting the DFS target (it’s under the Administrative Tools folder, or just type secpol.msc). Then went down to “Local Policies” > “User Rights Assignment” > “Access this computer from the Network”:

Sure enough, this was limited to a set of computers which didn’t include the AppV server. When I compared this with our other DFS servers I saw that they were still on the default values (which include “Everyone”, as in the screenshot above) and that’s why those targets worked.

To dig further I used gpresult and compared the GPOs that affected the above policy between both servers. The server that was affected had this policy modified via a GPO, while the server that wasn’t affected showed the GPO as inaccessible. Both servers were in the same OU, but upon examining the GPO I saw that it was limited to a certain group only. Nice! And when I checked that group, our problem server was a member of it while the rest weren’t! :)

Turns out the server had been added to the group in error two days ago. Removed the server from this group, waited a while for the change to replicate across the domain, did a gpupdate on the server, and tada! the AppV server was able to access the DFS share on this local target again. Yay!

Moral of the story: if one of your services is unable to access a shared folder, check what user account the service runs as.

vMotion is using the Management Network (and failing)

Was migrating one of our offices to a new IP scheme the other day and vMotion started failing. I had a good idea what the problem could be (coz I encountered something similar a few days ago in another context) so here’s a blog post detailing what I did.

For simplicity let’s say the hosts have two VMkernel NICs – vmk0 and vmk1. vmk0 is connected to the Management Network. vmk1 is for vMotion. Both are on separate VLANs.

When our Network admins gave out the new IPs they gave IPs from the same range for both functions. That is, for example, vmk0 had an IP of (with and on the other hosts) and vmk1 had an IP of (with and on the other hosts) – all in the one network. (Addresses here are made up to illustrate the pattern.)

Since both interfaces are on separate VLANs (basically separate LANs), the above setup won’t work. That’s because as far as the hosts are concerned both interfaces are on the same network, yet physically they are on separate networks. Here’s the routing table on the hosts:
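You can pull this up in the ESXi shell like so:

```
# List the VMkernel routing table
~ # esxcfg-route -l

# Or via esxcli
~ # esxcli network ip route ipv4 list
```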

Notice that any traffic to the network goes via vmk0. And that includes the vMotion traffic, because that too is in the same network! And since the network vmk0 is on is physically a separate network (because it is a VLAN), this traffic will never reach the vMotion interfaces of the other hosts – they never see it.

So even though you have specified vmk1 as your vMotion traffic NIC, it never gets used, because of the default routes.

If you could force the outgoing traffic to specifically use vmk1, it would work. Below are vmkpings using the default route vs explicitly using vmk1:
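Using the made-up addresses from above:

```
# Via the default route – leaves via vmk0 and never reaches the other
# host's vMotion interface
~ # vmkping

# Forcing the vMotion VMkernel NIC – works
~ # vmkping -I vmk1
```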

The solution here is to either remove the VLANs and continue with the existing IP scheme, or to keep using VLANs but assign a different IP network for the vMotion interfaces.

Update: Came across the following from this blog post while searching for something else:

If the management network (actually the first VMkernel NIC) and the vMotion network share the same subnet (same IP-range) vMotion sends traffic across the network attached to first VMkernel NIC. It does not matter if you create a vMotion network on a different standard switch or distributed switch or assign different NICs to it, vMotion will default to the first VMkernel NIC if same IP-range/subnet is detected.

Please be aware that this behavior is only applicable to traffic that is sent by the source host. The destination host receives incoming vMotion traffic on the vMotion network!

That answered another question I had but didn’t blog about in my post above. You see, my network admins had also set the iSCSI interfaces to be in the same subnet as the management network – but on separate VLANs – yet the iSCSI traffic was correctly flowing over its VLAN instead of defaulting to the management VMkernel NIC. Now I understand why! It’s only vMotion that defaults to the first VMkernel NIC when it’s in the same IP range/ subnet.


Find Outlook rules that are deleting a message

As part of troubleshooting something I needed to quickly find what Outlook rules the user had for deleting messages. So I came up with this one-liner.
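Something like this (a sketch – the mailbox name is made up):

```powershell
# List inbox rules on a mailbox that delete messages
Get-InboxRule -Mailbox "jdoe" | Where-Object { $_.DeleteMessage } |
    Select-Object Name, Description
```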

The result is a list of rule names and a friendly description of what the rule does.

Run this from the EMS of course.

ESXi host – cannot install HA – no space left on device

These are less of notes and more of links and what I did when I encountered this issue. Just for my future self.

At work we had a host which was giving HA errors. The message was along the lines of vCenter not being able to contact HA. So I tried reconfiguring it for HA (right click the host and select “Reconfigure for vSphere HA”), upon which I got a new error: Cannot install the vCenter Server agent service. Cannot upload agent.

Initially I thought it must just be a permissions issue. But it wasn’t so.

To investigate further I tried logging on to the server. I couldn’t enable SSH and ESXi Shell from the Configuration tab – it gave me an error. So I iLO’d into the server’s DCUI and enabled SSH and ESXi Shell from there. SSH still refused to let me in, and when I’d press Alt+F1 on the console to get the login prompt it was filled with messages like these: /bin/sh cant fork. Initially I thought it might be to do with the HP AMS memory leak (see this and this) but it wasn’t.

I pressed Alt+F12 to see the on-screen logs. It was filled with more messages like the above.

Blimey!

There was nothing more I could do here, basically. Couldn’t login to the server at all; heck, I couldn’t even Shutdown/ Restart it gracefully via F12 in the DCUI (nothing would happen). So I cold booted it and that got it working.

It’s been about 2 hours since I did that and the server seems stable, so maybe it was a one-off thing. I looked at more logs though, and here’s what I found.


These are the standard ESXi log files I went through (names per VMware’s documentation):

  • /var/log/syslog.log (Contains: Management service initialization, watchdogs, scheduled tasks and DCUI use)
  • /var/log/vmkwarning.log (Contains: A summary of Warning and Alert log messages excerpted from the VMkernel logs)
  • /var/log/vobd.log (Contains: VMkernel Observation events)
  • /var/log/vmkernel.log (Contains: Core VMkernel logs, including device discovery, storage and networking device and driver events, and virtual machine startup)
  • /var/log/hostd.log (Contains: Host management service logs, including virtual machine and host Task and Events, communication with the vSphere Client and vCenter Server vpxa agent, and SDK connections)

From these logs one thing was clear. The ESXi RAMdisk hosting the root filesystem had run out of inodes. Possibly caused by the SFCB service. Because of this the root filesystem had run out of space and everything was failing. Great!

In Linux I am used to the df command to check filesystem usage. But in ESXi df only seems to give info on the mounted filesystems, whereas vdf also gives the local filesystems (like RAMdisks and Tardisks (whatever that is)).
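From the ESXi shell:

```
# Show filesystem usage, including the RAMdisks and tardisks
~ # vdf
```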

The above output was after a reboot, and all seemed fine. To check the inode usage use the stat command.
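For the root filesystem:

```
# Inode usage of the root filesystem
~ # stat -f /
```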

Or use esxcli. It gives you the free space as well as the inode count!
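The namespace you want is:

```
# RAMdisk usage – sizes plus inode counts
~ # esxcli system visorfs ramdisk list
```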

Note to self: Make a habit of using the esxcli command as that seems to be the VMware preferred way of doing things. Plus it’s one command with various namespaces you can use for networking and other info.

In my case things look to be fine now.

KB 2037798 talks about this problem. Apparently it is fixed via a patch released in 2013, and as far as I can tell we are properly patched, so we shouldn’t have been hit by this issue. If it happens again though, the same KB article talks about creating a separate RAMdisk for SFCB, so that even if it eats up all the inodes your root filesystem isn’t affected. This involves creating a new RAMdisk at boot time by modifying rc.local (nice!). The esxcli command can be used to create a new RAMdisk and mount it at the mount point required by SFCB:
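Roughly like this (a sketch – check the KB for the exact sizes and path it recommends):

```
# Create a dedicated RAMdisk and mount it where SFCB writes its files
~ # esxcli system visorfs ramdisk add --name sfcb --min-size 0 --max-size 1024 \
      --permissions 0755 --target /var/run/sfcb
```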

Turns out such an issue can also occur because of SNMP. Or, if you have an HP Gen8 blade server, because of the hpHelper.log file – which is fixed via a patch from HP (this server was a Gen8 blade but it didn’t have this log file). KB 2040707 too talks about this. Didn’t help much in my case as that didn’t seem to be my issue.

Two useful links for future reference are:

That’s all for now.

p.s. I keep talking about SFCB above but have no idea what it is. Turns out it is the CIM server for ESXi. Found this blog post on it. 

How to fail-over the primary VC module to another

Probably obvious to most people but it was new to me. To fail over a VC module from primary to backup, log in to the VC Manager, then go to “Tools > Reset Virtual Connect Manager”.

Click “Yes”, but before that tick the box that says “Force Failover”.

That’s all. You will be logged out, and after a few minutes you can log in to the module that is now primary. It takes a few minutes so be patient.

This has no impact on the network or FC connectivity.

Some more ways of doing this can be found at this HP link.

vMotion fails at 14% – ESX hosts failed to connect over the VMotion network

On a newly built host today vMotion was failing while migrating VMs to it. vMotion would get stuck at 14% and then fail with the above error. Found this VMware KB article – it was informative but didn’t help. From the article I learnt though that I can use vmkping with the -I switch to specify an interface to use while pinging. This is handy when you want to ping a remote address via a specific interface – say, the vMotion IP address of a remote host via the vMotion VMkernel NIC of this host. Usually vmkping automatically selects an interface on the network you are trying to ping, but it’s possible you are using the same subnet for vMotion and many other services.

Anyhow, in my case I noticed that if I removed one of the underlying physical adapters I was able to vmkping. So add that to the list of things to try if you too are in a similar situation. Odd that it failed, though! I would have thought a failed physical adapter means it will just try a different one? Clearly in my case the other adapter was working.

I don’t know more details but it could be that the physical NIC was up but the switch was blocking? Not sure.

Brief notes on Windows Time

The w32time service provides time for Windows. Since Windows XP, NTP (Network Time Protocol) is supported. Prior to that it was only SNTP (Simple NTP).

Non domain joined computers (including servers) use SNTP.

This is a good article that explains the Windows Time service and its configurations. Covers both registry keys and GPOs. This is another good article that goes into even more detail.

Any Windows machine can be set up to sync time in one of four ways: (1) no syncing! (2) sync from specified NTP servers (3) sync via domain hierarchy (i.e. members sync from a DC in the domain; DCs sync from the PDC of the parent domain/ forest root domain) (4) use either of the above (i.e. NTP and domain hierarchy). The default mechanism on domain joined computers is domain hierarchy (the setting is called NT5DS). Stand-alone machines have the default as NTP servers (the setting is called NTP; the default server is time.windows.com, though you can change it – and it’s probably recommended that you do).

For machines that go off and on the domain – e.g. laptops – it is better to set their time sync mechanism to option (4) above, so they needn’t always have contact with a DC to sync time.

When specifying NTP time servers you also specify flags. Check this post for an explanation of the flags. There are four possible flags: 0x01 SpecialInterval; 0x02 UseAsFallbackOnly; 0x04 SymmetricActive; 0x08 Client.

  • Flag UseAsFallbackOnly means the server is only used if the others are unavailable. Check out this post for an example of this.
  • Flag SpecialInterval lets you change how often the NTP server is polled. By default the interval is determined by Windows based on the quality of time samples, but you can use the above flag and set a registry key HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\W32Time\TimeProviders\NtpClient\SpecialPollInterval to change the polling interval.
  • I am not sure what the other two flags do. The Client flag seems to be a commonly used one. Some posts/ articles use it, others don’t. The default setting uses this flag as well as SpecialInterval (see the example after this list).
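For instance, here’s roughly how you’d point a machine at a specific NTP server with flags 0x9 (SpecialInterval + Client) – the server name is made up:

```
w32tm /config /manualpeerlist:"ntp.example.com,0x9" /syncfromflags:manual /update
w32tm /resync
```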

p.s. To turn on w32tm debugging check out this link.