Creative Commons Attribution 4.0 International License
© Rakhesh Sasidharan

Enabling SNMPv3 on ESXi hosts

A continuation to my earlier post which was to do with SNMPv2.

As before, connect to the vCenter via PowerCLI. And as before the set() method can be used to configure SNMP – v2, v3, or both. The definition of this method is as follows:

That’s confusing so best to copy paste the definition into notepad or something so you can be sure you are passing the correct arguments.

First things first. There doesn’t seem to be a way of turning off something. As in, say you already have SNMPv2 turned on, you can’t turn it off by setting the community strings to blank. Doing so generates an error. So if you want to turn previous things off it’s best to do a reset and start with a clean slate.

This sets things back to their defaults:

Before going ahead with any SNMPv3 configuration we need to decide on what authentication and privacy protocols to use. In my case I want to use SHA1 and AES-128. So I need to set that first:

Once I have done this I can generate the hashes. I will need these later to configure SNMPv3.

In the example above both my passwords are Password1.

With this in hand I configure SNMPv3:

That’s it really. In the above example I will be using an SNMPv3 user called snmpUser1.
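For reference, roughly the same sequence can be run with esxcli directly on a host instead of going through PowerCLI. This is only a sketch, not the PowerCLI code from this post: the functions just print the commands so they can be reviewed and pasted into an ESXi shell, and AUTH_HASH / PRIV_HASH are placeholders for whatever the hash command returns on your host.

```shell
# Prints esxcli commands for review; nothing is executed here.

# Optional: wipe existing SNMP settings back to defaults first.
snmp_reset_command() {
    echo "esxcli system snmp set --reset"
}

# Steps 1-2: pick the protocols, then generate hashes for the passwords.
# Arguments: auth password, priv password.
snmpv3_prepare_commands() {
    echo "esxcli system snmp set --authentication SHA1 --privacy AES128"
    echo "esxcli system snmp hash --auth-hash $1 --priv-hash $2 --raw-secret"
}

# Step 3: create the user from the two hashes and enable the agent.
# Arguments: user name, auth hash, priv hash.
# The user format is user/auth-hash/priv-hash/security-level.
snmpv3_user_commands() {
    echo "esxcli system snmp set --users $1/$2/$3/priv"
    echo "esxcli system snmp set --enable true"
}

snmpv3_prepare_commands Password1 Password1
# AUTH_HASH and PRIV_HASH are placeholders, not real hashes.
snmpv3_user_commands snmpUser1 AUTH_HASH PRIV_HASH
```

The protocols have to be set before the hashes are generated, which is why the two steps are kept in that order.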

Now to do it across my estate I can make a loop. No need to create password hashes for each host. The hash stays the same as long as you are using the same password for each host.

That’s all!

vSphere Replication does not support changing the length of a replicated disk.

Had to extend a VM disk today and got the above error. This is because the VM is replicated via vSphere Replication so you can’t simply extend the disk as you would do for any regular VM.

[Screenshot: error-1]

Here’s a top level summary of how you do it (based on this KB article).

  1. You have to break the replication. Stop it that is. But doing so deletes the replicated files, so first you want to work around that (as below).
    1. Note the current settings of the replication.
    2. Then pause the replication.
    3. Find out which datastore holds the replicated VM disks.
    4. Rename the replicated VM folder.
    5. Now you can stop the replication because you have kept a copy of the data.
  2. SSH into any ESX host that has access to the above datastore and extend the disk associated with the VMDK via vmkfstools.
  3. Rename the folder back to what it was before.
  4. Recreate the replication, but point the destination to the same datastore as above and select the folder above. vSphere Replication will ask whether you want to use the existing data as seed – answer yes.

That’s it basically.

In terms of the details, I didn’t know how to find which datastore had the replicated VM files. So I SSH’d into one of the hosts in the replicated VM cluster and ran the following:

There must be some better way, but what the heck. Once I found the path above I did the following to find other VMs in it, and using that info I was able to find the datastore name from vSphere client.

You need this datastore name for when setting up a new replication, so you can point to that.

Some more things to keep in mind are the following.

  1. Since we pause the replication rather than stop it, the folder will contain a bunch of hbr* files. Delete those.
  2. The vmkfstools -X switch takes the new size of the disk, not the additional amount. So if the disk is 10GB and you want to add 20GB, you specify the argument as 30GB. If you are getting a “Failed to extend disk : One of the parameters supplied is invalid (1).” error with vmkfstools that’s probably why.
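To make the size point concrete, here is a small sketch that works out the value to pass to -X and prints the resulting command. The datastore and VMDK path are made up for illustration.

```shell
# vmkfstools -X wants the final size, not the increment.
# Arguments: current size in GB, GB to add, path to the VMDK.
extend_command() {
    current_gb=$1
    add_gb=$2
    vmdk=$3
    new_gb=$((current_gb + add_gb))
    echo "vmkfstools -X ${new_gb}G \"${vmdk}\""
}

# Hypothetical path; a 10GB disk growing by 20GB needs "30G".
extend_command 10 20 /vmfs/volumes/datastore1/myvm/myvm.vmdk
# prints: vmkfstools -X 30G "/vmfs/volumes/datastore1/myvm/myvm.vmdk"
```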

VMware client – unable to login with username, password; but able to login with “use windows credentials”

We had this weird issue at work yesterday wherein you could not login to the vCenter server by entering a username/ password, but could if you just ticked on the “Use windows session credentials” checkbox.

The issue got resolved eventually by stopping the “VMware Secure Token Service”, restarting the “VMware VirtualCenter Server” service, and then starting the “VMware Secure Token Service”. No idea why that made a difference though, and whether that actually fixed things or was just coincidental. Around the same time I had seen some VMware Tools errors so I (a) upgraded the tools, (b) moved the vCenter VM to a different host, (c) saw that one of these had caused issues with the network driver so I had to uninstall and reinstall the tools and then reset the secure channel with the domain (since when the vCenter VM came up it didn’t have network connectivity).

So it was a bit of a damper actually. Nothing more frustrating than spending a lot of time troubleshooting something and not really figuring out what the issue is. On the plus side at least the issue got sorted, but it leaves me uneasy not knowing what really went wrong and whether it will re-occur.

In the event logs there were many entries like these:

An account failed to log on.

Subject:
    Security ID:        SYSTEM
    Account Name:        VCENTER01$
    Account Domain:        MYDOMAIN
    Logon ID:        0x3e7

Logon Type:            3

Account For Which Logon Failed:
    Security ID:        NULL SID
    Account Name:        SomeAccount
    Account Domain:        MYDOMAIN.COM

Failure Information:
    Failure Reason:        Unknown user name or bad password.
    Status:            0xc000006d
    Sub Status:        0xc0000064

Process Information:
    Caller Process ID:    0xe20
    Caller Process Name:    E:\Program Files\VMware\Infrastructure\VMware\CIS\vmware-sso\VMwareIdentityMgmtService.exe

Network Information:
    Workstation Name:    VCENTER01
    Source Network Address:    –
    Source Port:        –

Detailed Authentication Information:
    Logon Process:        Advapi  
    Authentication Package:    Negotiate
    Transited Services:    –
    Package Name (NTLM only):    –
    Key Length:        0

Here’s what the error codes mean –

  • NULL SID suggests that the account that was being authenticated could not be identified
  • 0xC000006D means that authentication failed due to bad credentials
  • 0xC0000064 means that the requested user name does not exist.
  • Logon type 3 means the request was received from the network (but given that the request originated from the server itself, it suggests the request was looped back over the network stack).
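If you are wading through a lot of these events, a tiny lookup like the one below saves re-checking the codes each time. It is a sketch covering only the two status codes seen in this event; the function name is my own.

```shell
# Map the status codes from the event above to a short description.
# Only the two codes discussed here are covered.
describe_status() {
    # Upper-case the hex digits so 0xc000006d and 0xC000006D both match.
    code=$(printf '%s' "$1" | tr 'abcdef' 'ABCDEF')
    case "$code" in
        0xC000006D) echo "logon failed: bad credentials" ;;
        0xC0000064) echo "user name does not exist" ;;
        *)          echo "unknown status: $1" ;;
    esac
}

describe_status 0xc000006d
describe_status 0xc0000064
```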

Not that it throws much light on what’s happening.

For info – this KB article lists the useful vCenter log files. I looked at the vpxd-xxxx.log file which had some entries like these –

2016-06-06T16:08:18.046+01:00 [02856 error ‘[SSO]’ opID=138a737d] [UserDirectorySso] AcquireToken exception: class SsoClient::CommunicationException(No connection could be made because the target machine actively refused it)
2016-06-06T16:08:18.046+01:00 [02856 error ‘authvpxdUser’ opID=138a737d] Failed to authenticate user <mydomain\someaccount>

This file is under C:\ProgramData\VMware\VMware VirtualCenter\Logs by the way.

I also found messages like these –

2016-06-06T10:17:59.226+01:00 [06952 error ‘[SSO]’ opID=1790eabb] [UserDirectorySso] AcquireToken exception: class SsoClient::SsoException(Failed to parse Group Identity value: `\Authentication authority asserted identity’; domain or group missing)

Two more logs I looked at are C:\ProgramData\VMware\CIS\logs\vmware-sso\vmware-sts-idmd.log and some files under C:\ProgramData\VMware\CIS\runtime\VMwareSTS\logs. In case of the latter location I just sorted by the recently modified timestamp and found some logs to look at. I focused on one called ssoAdminServer.log. This file had a few entries like these –

[2016-06-06 12:19:08,987 pool-11-thread-1  ERROR com.vmware.identity.admin.server.ims.impl.PrincipalManagementImpl] Idm client exception
com.vmware.identity.idm.IDMException: Invalid group name format for [\Authentication authority asserted identity]
    at com.vmware.identity.idm.server.ServerUtils.getRemoteException(ServerUtils.java:131)
    at com.vmware.identity.idm.server.IdentityManager.findNestedParentGroupsInternal(IdentityManager.java:4006)

I found mention of this message in a forum post which pointed to this being a known issue for vCenter installed on a 2012 server with a 2012 DC. That doesn’t apply to me.

The vSphere Web Client gives an error message “Cannot Parse Group Information” – which too is a symptom if you install vCenter on a 2012 server with a 2012 DC. Moreover it applies to vCenter 5.5 GA, which is what we are on, so all the symptoms point to that issue but it is not so in our case. :(

Back to the vmware-sts-idmd.log, that had entries like these –

2016-06-06 09:00:26,089 WARN   [ActiveDirectoryProvider] obtainDcInfo for domain [VCENTER01] failed Failed to get domain controller information for VCENTER01(dwError – 1355 – ERROR_NO_SUCH_DOMAIN)
2016-06-06 09:00:26,090 WARN   [ActiveDirectoryProvider] obtainDcInfo for domain [VCENTER01] failed Failed to get domain controller information for VCENTER01(dwError – 1355 – ERROR_NO_SUCH_DOMAIN)
2016-06-06 09:00:26,091 ERROR  [ValidateUtil] resolved group name=[\Authentication authority asserted identity] is invalid: not a valid netbios name format  
2016-06-06 09:00:26,092 INFO   [ActiveDirectoryProvider] resolved group name=[\Authentication authority asserted identity] is invalid: not a valid netbios name format  
…<snip>…
2016-06-06 09:02:53,005 INFO   [IdentityManager] Failed to find principal [SomeAccount@mydomain.tld] as FSP group in tenant [vsphere.local]
2016-06-06 09:02:53,008 INFO   [IdentityManager] Failed to find FSP user or gorup [SomeAccount@mydomain.tld]’s nested parent groups in tenant [vsphere.local]
2016-06-06 09:02:53,013 ERROR  [IdentityManager] Failed to find nested parent groups of principal [SomeAccount@mydomain.tld] in tenant [vsphere.local]
2016-06-06 09:02:53,013 ERROR  [ServerUtils] Exception ‘java.lang.IllegalStateException: Invalid group name format for [\Authentication authority asserted identity]’
java.lang.IllegalStateException: Invalid group name format for [\Authentication authority asserted identity]
    at com.vmware.identity.idm.server.provider.activedirectory.ActiveDirectoryProvider.findNestedParentGroupsByPac(ActiveDirectoryProvider.java:2140)
    at com.vmware.identity.idm.server.provider.activedirectory.ActiveDirectoryProvider.findNestedParentGroups(ActiveDirectoryProvider.java:791)
    at com.vmware.identity.idm.server.IdentityManager.findNestedParentGroupsInternal(IdentityManager.java:3985)
    at com.vmware.identity.idm.server.IdentityManager.findNestedParentGroups(IdentityManager.java:3856)
    at sun.reflect.GeneratedMethodAccessor37.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at sun.rmi.server.UnicastServerRef.dispatch(Unknown Source)
    at sun.rmi.transport.Transport$1.run(Unknown Source)
    at sun.rmi.transport.Transport$1.run(Unknown Source)
    at java.security.AccessController.doPrivileged(Native Method)
    at sun.rmi.transport.Transport.serviceCall(Unknown Source)
    at sun.rmi.transport.tcp.TCPTransport.handleMessages(Unknown Source)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(Unknown Source)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)

Again, something to do with DC/ domain … but what!? Found this blog post too that suggested the same.

For my reference, here’s a KB article listing all the SSO log files. And this is a useful blog post in case I happen upon a similar issue later (the case of the flapping VMware Secure Token Service). As is this KB article on an SSO facade error.

Cannot login to vCenter with “use windows session credentials” but can login by entering username & password

Had this issue today (and a few months ago). I open vCenter client, type in the vCenter server name, tick “Use Windows Session Credentials” as usual, and login fails. Says it cannot login with the given credentials.

At the same time I can login with the vSphere Web Client and also by un-ticking the box and manually entering the username/ password.

Fix for both times was to reset the secure channel by logging in to the vCenter server –

 

Enabling SNMP on ESXi hosts

I wanted to enable SNMP on our ESXi hosts for monitoring via Solarwinds. Here’s what I did. (I am doing this kind of generically, using variables etc, so I can script the thing for multiple hosts).

First I connected to the vCenter Server from PowerCLI.

Next I got its ESXCLI object. This will let me run ESXCLI commands against the host.

To view the current status of SNMP you can invoke the get() method –

Nothing’s configured currently. To configure something we can use the set() method. From the definition of this method we can see it takes a whole bunch of parameters –

Here’s what I did to configure SNMP. I want a community string of “public”, enable SNMP, and specify two trap destinations.

The result of that will either be a true or false. The get() method can be used again to confirm it is set correctly. And the test() method can be used to test it works –

Now Solarwinds will be able to poll the host via SNMP.
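The same configuration can also be done with esxcli from the host's own shell rather than via PowerCLI. The sketch below only prints the commands (nothing runs), and the two trap destination addresses are placeholders. Targets take the form host@port/community.

```shell
# Prints the esxcli commands for a community string plus trap targets.
# Arguments: community string, then one or more trap destination hosts.
snmp_v2_commands() {
    community=$1
    shift
    targets=""
    for host in "$@"; do
        # Join targets with commas; 162 is the standard SNMP trap port.
        targets="${targets:+${targets},}${host}@162/${community}"
    done
    echo "esxcli system snmp set --communities ${community}"
    echo "esxcli system snmp set --targets ${targets}"
    echo "esxcli system snmp set --enable true"
    # Confirm the settings, then send a test trap.
    echo "esxcli system snmp get"
    echo "esxcli system snmp test"
}

# Placeholder trap destinations.
snmp_v2_commands public 10.0.0.1 10.0.0.2
```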

To do this en-masse on all your hosts the following should help –

Shout out to this VMware blog post which helped a lot and has more info.

The above script failed on some of our ESX hosts with the following error –

Turns out these hosts only accept 16 parameters instead of 17 (the one called largestorage is missing). Not sure why. All our hosts are ESXi 5.5 but am thinking the problem ones are perhaps not using the HP customized version of ESXi.

Anyways, so I modified my script above to take care of this –

Also, just for my own info – the $null above means the parameter is not set. If that parameter already has a value on the server it is not over-written. To over-write or blank out the existing value replace $null with "".

Configure NTP for multiple ESXi hosts

Following on my previous post I wanted to set NTP servers for my ESX servers and also start the service & allow firewall exceptions. Here’s what I did –

 

Exchange DAG fails. Information Store service fails with error 2147221213.

Had an interesting issue at work today. When our Exchange servers (which are in a 2 node DAG) rebooted after patch weekend one of them had trouble starting the Information Store service. The System log had entries such as these (event ID 7024) –

The Microsoft Exchange Information Store service terminated with service-specific error %%-2147221213.

The Application log had entries such as these (event ID 5003) –

Unable to initialize the Information Store service because the clocks on the client and server are skewed. This may be caused by a time change either on the client or on the server, and may require a restart of that computer. Verify that your domain is correctly configured and  is currently online.

So it looked like time synchronization was an issue. Which is odd coz all our servers should be correctly syncing time from the Domain Controllers.

Our Exchange team fixed the issue by forcing a time sync from the DC –

I was curious as to why, so I went through the System logs in detail. What I saw was a sequence of entries such as these –

Notice how time jumps ahead from 13:21 when the OS starts to 13:27 suddenly, then jumps back to 13:22 when the Windows Time service starts and begins syncing time from my DC. It looked like this jump of 6 mins was confusing the Exchange services (understandably so). But why was this happening?

I checked the time configuration of the server –

Seems to be normal. It was set to pick time from the site DC via NTP (the first entry under TimeProviders) as well as from the ESXi host the VM is running on (the second entry – VM IC Time Provider). I didn’t think much of the second entry because I know all our VMs have the VMware Tools option to sync time from the host to VM unchecked (and I double checked it anyways).

Only one of the mailbox servers was having this jump though. The other mailbox server had a slight jump but not enough to cause any issues. While the problem server had a jump of 6 mins, the ok server had a jump of a few seconds.

I thought to check the ESXi hosts of both VMs anyways. Yes, the VMs are not set to sync time from the host, but let’s double check the host times anyways. And bingo! turns out the ESXi hosts have NTP turned off and hence varying times. The host with the problem server was about 6 mins ahead of the DC, while the host with the ok server was about a minute or less ahead – too close a match to the time jumps of the VMs to be a coincidence!

So it looked like the Exchange servers were syncing time from the ESXi hosts even though I thought they were not supposed to. I read a bit more about this and realized my understanding of host-VM time sync was wrong (at least with VMware). When you tick/ untick the option to synchronize VM time with ESX host, all you are controlling is a periodic synchronization from host to VM. This does not control other scenarios where a VM could synchronize time with the host – such as when it moves to a different host via vMotion, has a snapshot taken, is restored from a snapshot, has a disk shrunk, or (tada!) when the VMware Tools service is restarted (like when the VM is rebooted, as was the case here). Interesting.

So that explains what was happening here. When the problem server was rebooted it synced time with the ESXi host, which was 6 mins ahead of the domain time. This was before the Windows Time service kicked in. Once the Windows Time service started, it noticed the incorrect time and corrected it. This time jump confused Exchange – am thinking it didn’t confuse Exchange directly, rather one of the AD services running on the server most likely, and due to this the Information Store was unable to start.

The fix for this is to either disable VMs from synchronizing time from the ESXi host or setup NTP on all the ESXi hosts so they have the correct time going forward. I decided to go ahead with the latter.

Update: Found this and this blog post. They have more screenshots and a better explanation, so worth checking out. :)

vCenter unable to connect to hosts; vSphere client gives error ‘”ServiceInstance.RetrieveContent” for object “ServiceInstance” on Server “IP-Address” failed’

Our Network team had been making some changes at work and suddenly vCenter in our London office lost connectivity with all the ESX hosts in one of our remote offices. Moreover, when trying to connect from the vSphere Client to any of the remote hosts directly we were getting the following error –

[Screenshot: client error]

Connectivity from vSphere Client in the remote office to the ESX host in the same office was fine; it was only connectivity from other offices to this remote office. So it definitely indicated a network issue.

This KB article is a handy one to know what ports are required by various VMware products. Port 443 is what needs to be open to ESX hosts for vCenter Server to be able to talk to them. I did a telnet from the vCenter server to each of the remote office hosts on port 443 and it went through fine – so wasn’t a firewall issue. (Another post with port numbers, just FYI, is this one).

After a fair bit of troubleshooting we tracked the issue down to MTU.

Digressing into MTUs

Communication between two IP addresses (i.e. layer 3) happens through packets. Thus when my London vCenter Server communicates with my remote office ESX host, the two send TCP/IP packets to each other. When these packets from the vCenter Server reach the switch/ router on the same LAN as the ESX host, it becomes a layer 2 communication (because they are on the same network and it’s a matter of data reaching the ESX host from the switch/ router). In the case of Ethernet, this layer 2 communication happens via Ethernet frames. The frames encapsulate the IP packets – so the switch/ router places each packet inside a frame (a packet too large for one frame has to be fragmented into smaller packets first), while the ESX host unwraps the frames to get the packets back (and vice versa). (The picture on this Wikipedia page is worth a look to see the encapsulation).

How much data can be held by a layer 2 frame is defined by the Maximum Transmission Unit (MTU). Larger MTUs are good because you can carry more data; but they have a downside in that each frame takes longer to be transmitted, and in case of any errors more data has to be re-transmitted when the frame is resent. So a balance is important. In the case of Ethernet, RFC 894 (see errata also) defines the MTU as a maximum of 1500 bytes. In the case of other layer 2 protocols, the MTU varies: for example 4464 bytes for Token Ring; 4352 bytes for FDDI; 9180 bytes for ATM; etc. In the case of Ethernet there are now also jumbo frames, which are frames with an MTU size of 9000 bytes (see this page for a table comparing regular frames and jumbo frames) and are commonly used in iSCSI networks.

Taking the case of Ethernet, assume the MTU of all Ethernet networks is 1500 bytes. So when two devices are conversing with each other over layer 3, and this conversation spans multiple Ethernet networks, it is helpful if the devices know that the MTU of the underlying layer 2 network is 1500 bytes. That way the two devices can keep the size of their layer 3 packets to be less than 1500 bytes. Why? Because if the size of the layer 3 packets are greater than 1500 bytes, then the devices and all the routers/ switches in between will have to fragment (break) the layer 3 packets into smaller packets of less than 1500 bytes to fit it in the Ethernet frame. This is a waste of resources for all, so it’s best if the two devices know of the underlying layer 2 MTU and act accordingly.

Now, note that Ethernet MTUs are defined as a maximum of 1500 bytes. So the MTU for a particular LAN segment can be set to a lower number for whatever reason (maybe there are additional fields in the Ethernet frame and to accommodate these the data portion must be reduced). Similarly, a layer 3 conversation between two devices can go over a mix of layer 2 networks – Ethernet, Token Ring, etc – each with a different MTU. So what the two devices really need is a way of knowing the lowest MTU across all these layer 2 networks, so they can use it as the MTU of the layer 3 packets for their conversation. This is known as the Path MTU or IP MTU – and is basically the smallest of all the underlying layer 2 MTUs over which that conversation traverses. It is discovered through a process known as “Path MTU Discovery” (PMTUD) (check this Wikipedia article, or Google this term to learn more). Very briefly, in the case of IPv4 what happens is that each device sends packets sized to its own local MTU, with a flag set that says “do not fragment this packet”. Packets no larger than the lowest layer 2 MTU get through, but a packet that exceeds the MTU of some link cannot be fragmented (due to the flag), so the router there drops it and sends an ICMP “fragmentation needed” message back to the sender, which then retries with a smaller size. Thus the Path MTU is discovered. This check happens in both directions.
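As a quick illustration of the idea, the Path MTU is just the minimum of the link MTUs along the path. The sketch below computes that for a made-up list of link MTUs, and also works out the payload size for a "do not fragment" ping. On an ESXi host that would be vmkping with -d to set the DF bit and -s for the size; REMOTE_HOST is a placeholder.

```shell
# Path MTU = smallest MTU of all the links the conversation crosses.
# Arguments: the MTU of each link along the path.
path_mtu() {
    min=$1
    shift
    for mtu in "$@"; do
        if [ "$mtu" -lt "$min" ]; then
            min=$mtu
        fi
    done
    echo "$min"
}

# Largest ICMP echo payload that fits in one unfragmented IPv4 packet:
# MTU minus 20 bytes of IP header (no options) minus 8 bytes of ICMP header.
ping_payload() {
    echo $(( $1 - 20 - 8 ))
}

# Made-up example: one link in the middle has a lower MTU.
mtu=$(path_mtu 1500 1400 1500)
echo "path MTU: $mtu"
echo "probe with: vmkping -d -s $(ping_payload "$mtu") REMOTE_HOST"
```

If a ping at that size gets through but one a byte larger does not, the computed Path MTU matches what the network actually allows.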

So we have layer 2 MTUs and layer 3 MTUs. Layer 2 MTUs have a maximum value that is dependent on the layer 2 network technology. But what about the minimum value? RFC 791, which defines the Internet Protocol (the IP in TCP/IP), requires that all devices supporting IP must be able to forward packets of 68 bytes without fragmenting (68 bytes because an IP header can be up to 60 bytes and the smallest fragment carries 8 bytes of data) and be able to accept packets of minimum size 576 bytes either as one packet or multiple packets that require assembling. Because of this the minimum layer 2 MTU can be thought of as 68 bytes. In a practical sense, however, most IP devices accept 576 bytes without fragmenting, and since the MTUs of all common layer 2 networks are higher than this number, the minimum layer 2 & layer 3 MTU can be thought of as 576 bytes.

Just for completeness I will also mention Maximum Segment Size (MSS) which is a layer 4 MTU (of sorts) that defines what’s the maximum TCP segment (which is what a TCP packet is called) that can be accepted by devices. It has a default value of 536 bytes. This is based on the 576 bytes that IP requires hosts to accept at minimum, minus 20 bytes for IP headers and 20 bytes for TCP headers. Idea behind using 576 bytes as the base is that this way the TCP segment can be expected to arrive without fragmenting. In a practical sense again, for TCP/IP traffic over Ethernet (which is the common case), since Ethernet frames have an MTU of 1500, the MSS is usually set to 1500 minus 20 minus 20 = 1460 bytes.
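The arithmetic above is simple enough to sketch. This assumes plain 20 byte IP and TCP headers with no options; the function name is my own.

```shell
# MSS = MTU minus 20 bytes of IP header minus 20 bytes of TCP header
# (assuming no IP or TCP options).
mss_for_mtu() {
    echo $(( $1 - 20 - 20 ))
}

mss_for_mtu 576    # the IP minimum: prints 536
mss_for_mtu 1500   # Ethernet: prints 1460
```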

This is a good article I came upon. Just linking it as a reference to myself.

Back to our issue

In our case the router in the remote site had the following set in its configuration:

I am not entirely clear where it was set or why it was set (part of some testing I suppose), as that comes under the Network team. What this does though is tell the router not to clear the “Do Not Fragment” (DF) bit in IP packets. If the DF bit is set in a packet then the router will not fragment it even if the packet is larger than the MTU; it drops the packet instead (this is how PMTUD also works). Because of this larger packets were not getting through to the other side and hence failing. Our Network team removed this statement and then communication with the ESX hosts started working fine.

I wanted to write more about this statement but I am running out of time. This and this are two good links worth reading for more info. Especially the Scenario 4 section in the second link – that’s pretty much what was happening in our case, I think.

What does the vCloud Air “Enable Service Network” do?

I don’t know. :)

But I think it’s used when you want to connect a vCloud Air network (which is part of a Disaster Recovery or Virtual Private Cloud setup) with the network of a vCloud Air PaaS such as vCloud Air SQL. I base this on this vCloud Air SQL Users Guide that talks about enabling the service network to connect a vDC (vCloud Air Virtual Datacenter) to the vCloud Air SQL network.

Will add more to this post if & when I get to know more.

Troubleshooting ESXi host reboots

Had to troubleshoot an ESXi host reboot today. Came across this link – good one.

Here’s what I did though after the host reboot.

Once the host was online I connected to it via the vSphere client. I didn’t connect to the host directly (though you can do that too). I connected to the vCenter, then navigated to that host, went to the File menu and exported the system logs.

[Screenshot: exportsyslogs]

This creates a zip file containing another archive. I extracted the contents of this into a folder. The root of that folder has the usual Linux filesystem structure.

[Screenshot: dirstructure]

I went into the var folder here. (The log subfolder has many logs but most of these might be from after the reboot. If that’s the case, check the run/log subfolder).

In my case the /var/log/vmksummary.log file had entries for when the host rebooted. None of the other files mentioned anything.

Then I went to the /var/run/log folder via PowerShell and ran a grep for the word reboot –

Lots of messages indicating that the host was rebooted via the DCUI (lines 2, 4, 5, and 12). Thus I realized someone had manually rebooted the host.
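If the extracted logs are on a machine with a regular shell, a plain grep does the same job. This is a sketch of that alternative, not the PowerShell command I ran; the folder path in the usage comment is hypothetical.

```shell
# Search every file under the extracted log folder for "reboot",
# case-insensitively, printing file names and line numbers.
# Argument: path to the extracted var/run/log folder.
find_reboot_lines() {
    grep -r -i -n "reboot" "$1"
}

# Usage (hypothetical path to the extracted bundle):
#   find_reboot_lines ./esx-bundle/var/run/log
```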

Install a vSphere web-client plugin offline

Trying out vCloud Air at work and I wanted to install the vCloud Air plugin for vSphere web-client. The installation kept failing though. Initially it was due to the vCenter server not having access to the Internet (not your browser, the vSphere web-client itself needs to have access) but even after I specified a proxy (check out this post on how to specify a proxy) and gave vSphere web-client access to the Internet the download would begin and fail. 

[Screenshot: install-failed]

It’s possible to download the plugin, but how to add it to vSphere web-client?

Through a bit of trial and error I found a way. :)

Turns out the plugins are stored at C:\Program Files\VMware\Infrastructure\vSphereWebClient\plugin-packages on the server. So all you have to do is:

  1. Download the plugin zip file. 
  2. Create a folder in the above location and extract the zip file to this folder.
  3. Restart the vSphere web-client service. 

And that’s it! Then your plugin will appear under Administration > Client Plug-Ins

[Screenshot: plugins]

It’s very simple but I couldn’t find any info on how to download and install a plugin when I Googled for it, so thought I’d make a post. Hope it helps someone!

Downgrading ESXi Host

Today I upgraded one of our hosts to a newer version than what was supported by our vCenter so had to find a way of downgrading it. The host was now at “5.5 Patch 10” (which is after “5.5 Update 3”) while our vCenter version only supported versions prior to “5.5 Update 3”. (See this post for a list of build numbers and versions; see this KB article for why vCenter and the host were now incompatible).

I found this blog post and KB article that talked about downgrading and upgrading. Based on those two here’s what I did to downgrade my host.

First, some terminology. Read this blog post on what VIBs are. At a very high level a VIB file is like a zip file with some metadata and verification thrown in. They are the software packages for ESX (think of it like a .deb or .rpm file). The VIB file contains the actual files on the host that will be replaced. The metadata tells you more about the VIB file – its dependencies, requirements, issues, etc. And the verification bit lets the host verify that the VIB hasn’t been tampered with, and also allows you to have various “levels” of VIBs – those certified by VMware, those certified by partners of VMware, etc – such that you as a System Admin can decide what level of VIBs you want installed on your host.

You can install/ remove/ update VIBs via the command esxcli:

Here’s a short list of the VIBs installed on my host:

Next you have Image Profiles. These are a collection of VIBs. In fact, since any installation of ESXi is a collection of VIBs, an image profile can be thought of as defining an ESXi image. For instance, all the VIBs on my currently installed ESXi server – including 3rd party VIBs – together can be thought of as an image profile. I can then deploy this image profile to other hosts to get the exact configuration on those hosts too.

One thing to keep in mind is that image profiles are not anything tangible. As in they are not files as such, they just define the VIBs that make up the profile.

Lastly you have Software Depots. These are your equivalent of Linux package repositories. They contain VIBs and Image Profiles and are accessible online via HTTP/ HTTPS/ FTP or even offline as a ZIP file (which is a neat thing IMHO). You would point to a software depot – online or offline – and specify an image profile you want, which then pulls in the VIBs you want.

Now back to esxcli. As we saw above this command can be used to list, update, remove etc VIBs. The cool thing though is that it can work with both VIB files and software depots (either online or a ZIP file containing a bunch of VIB files). Here’s the usage for the software vib install command which deals with installing VIBs:

You have two options:

  • The -d switch can be used to specify a software depot (online or offline) along with the -n switch to specify the VIBs to be installed from this depot.
  • Or the -v switch can be used to directly specify VIBs to be installed.
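The two options above look roughly like this. It is a sketch that only prints the commands; the depot path, VIB name and VIB file are all made-up examples.

```shell
# Option 1: install a named VIB out of a depot (online URL or offline ZIP).
# Arguments: depot location, VIB name.
vib_install_from_depot() {
    echo "esxcli software vib install -d \"$1\" -n \"$2\""
}

# Option 2: install straight from a VIB file or URL.
# Argument: VIB location.
vib_install_from_file() {
    echo "esxcli software vib install -v \"$1\""
}

# Hypothetical depot, VIB name and VIB file.
vib_install_from_depot /vmfs/volumes/datastore1/depot.zip example-vib
vib_install_from_file /vmfs/volumes/datastore1/example-vib.vib
```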

The esxcli command can also work with image profiles.

Here you have just one option (coz like I said you can’t download something called an image profile – you have to necessarily use a software depot). You use the -d switch to specify a depot (online or offline) and the -p switch to specify the image profile you are interested in.

Apart from installing VIBs & image profiles, the esxcli command can also remove and update these. When it comes to image profiles though, the command can also downgrade profiles via an --allow-downgrades switch. So that’s what we use to downgrade ESXi versions. 

First find the ESXi version you want to downgrade to. In my case it was ESXi 5.5 Update 2. Go to My VMware (login with your account) and find the 5.5 Update 2 product. Download the offline bundle – which is a ZIP file (basically an offline software depot). In my case I got a file named “update-from-esxi5.5-5.5_update02-2068190.zip”. Now open this ZIP file and go to the “metadata.zip\profiles” folder in that. This gives you the list of profiles in this depot.

profiles

You can also get the names from a link such as this which gives more info on the release and the image profiles in it. (I came across it by Googling for “ESXi 5.5 Update 2 profile name”).

The profiles with an “s” in them only contain security fixes while the ones without an “s” contain both security and bug fixes. In my case the profile I am looking for is “ESXi-5.5.0-20140902001-standard”. I wasn’t sure if I needed to go for the “no-tools” version or not, but figured I’d stick with the “standard”.
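If you would rather not click through the ZIP by hand, the same lookup can be sketched in Python. This assumes the layout described above (a `metadata.zip` at the top level of the bundle, with a `profiles` folder inside it). The function name `list_profiles` is made up, and the entry names it returns may not match the profile names character for character.

```python
import io
import zipfile


def list_profiles(bundle):
    """Return the entry names found under profiles/ in the bundle's metadata.zip.

    'bundle' may be a path to the offline bundle or a file-like object.
    """
    with zipfile.ZipFile(bundle) as outer:
        inner_bytes = outer.read("metadata.zip")   # nested ZIP, read into memory
    with zipfile.ZipFile(io.BytesIO(inner_bytes)) as inner:
        return sorted(
            name.split("/", 1)[1]
            for name in inner.namelist()
            if name.startswith("profiles/") and not name.endswith("/")
        )


# Usage, with the bundle name from this post:
# for entry in list_profiles("update-from-esxi5.5-5.5_update02-2068190.zip"):
#     print(entry)
```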

Now, copy the ZIP file you downloaded to the host. Either upload it to the host directly, or to some shared storage, etc.

Then run a command similar to this:
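A minimal sketch of such an invocation, assembled in Python so the pieces are explicit. It assumes the `update` sub-command of `esxcli software profile` together with the `-d`, `-p` and `--allow-downgrades` switches mentioned above. The datastore path is hypothetical, so substitute wherever you copied the ZIP file.

```python
import shlex


def profile_update_command(depot, profile, allow_downgrades=False):
    """Return an 'esxcli software profile update' command line as a string."""
    args = ["esxcli", "software", "profile", "update", "-d", depot, "-p", profile]
    if allow_downgrades:
        args.append("--allow-downgrades")   # required when going to an older build
    return shlex.join(args)


# Hypothetical location of the offline bundle on the host.
depot = "/vmfs/volumes/datastore1/update-from-esxi5.5-5.5_update02-2068190.zip"
print(profile_update_command(depot, "ESXi-5.5.0-20140902001-standard",
                             allow_downgrades=True))
```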

That’s it! Following a host reboot you are now downgraded. Very straight-forward and easy.

Hyper-V between Windows 10 & Windows 8.1 in a workgroup

My laptop’s running Windows 10, desktop’s running Windows 8.1. Since both have client Hyper-V I thought it would be cool to install Hyper-V manager on the laptop and use it to manage Hyper-V running on the desktop. Did that and came across the following error –

Hyper-V error

DOGBERT is the Windows 8.1 desktop. The error is from my Windows 10 laptop.

First I followed the steps in this blog post. Actually, I didn’t have to do much as the account I was using on the desktop was already in the local Administrators group and so I didn’t have to do anything in terms of COM (step 3) & WMI (step 4) permissions. But I did enable the firewall rules for the Windows Management Instrumentation (WMI) group (step 2).

Additionally, I noticed that the Windows Remote Management (WS-Man) service was not running on the desktop so I enabled that. For this I used the winrm command.

 

Then I had to enable the Windows Remote Management (WS-Man) service on the laptop and add the desktop as a trusted host. Remember the error message above? It said that either I must use HTTPS or add the remote computer to the TrustedHosts list. I added it thus (from my laptop):

Probably a good idea to see what your existing trusted hosts are before you run this command (so you can append to the list instead of removing existing entries). You can do that thus:
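The "append, don't overwrite" point can be sketched in Python. TrustedHosts is a comma-separated list, so the new value is the existing list plus the new host. The helper name `merge_trusted_hosts` and the existing entry `SERVER1` are hypothetical. The printed `winrm set` line uses the cmd.exe form, and PowerShell would need the `@{...}` part quoted.

```python
def merge_trusted_hosts(existing, new_host):
    """Return a TrustedHosts value that keeps existing entries and adds new_host."""
    hosts = [h.strip() for h in existing.split(",") if h.strip()]
    if "*" in hosts:
        return "*"                       # wildcard already trusts every host
    if new_host.lower() not in (h.lower() for h in hosts):
        hosts.append(new_host)
    return ",".join(hosts)


# 'SERVER1' stands in for whatever the existing list already contains.
value = merge_trusted_hosts("SERVER1", "DOGBERT")
print('winrm set winrm/config/client @{TrustedHosts="%s"}' % value)
```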

After this Hyper-V manager from the laptop was able to connect to the desktop, but in the Virtual Machines section I had the following error:

Access denied. Unable to establish communication between ‘Hyper-V Server’ and ‘Hyper-V Manager’

The solution for that (thanks to this blog post) is to open “Component Services” on the laptop. Alternatively open a run window/ command prompt and type dcomcnfg.

In the window that opens expand to Component Services > Computers > My Computer, right click and go to Properties, then the COM Security tab, and click “Edit Limits” under Access Permissions. Select the ANONYMOUS LOGON username here and tick the box to allow Remote Access.

Component Services

That’s it! After this Hyper-V on my laptop was able to talk to the desktop.

Notes on NLB, VMware, etc

Just some notes to myself so I am clear about this while reading up on it. In the context of this VMware KB article – Microsoft NLB not working properly in Unicast mode.

Before I get to the article I better talk about a regular scenario. Say you have a switch and it’s got a couple of devices connected to it. A switch is a layer 2 device – meaning, it has no knowledge of IP addresses and networks etc. All devices connected to a switch are in the same network. The devices on a switch use MAC addresses to communicate with each other. Yes, the devices have IPv4 (or IPv6) addresses but how they communicate to each other is via MAC addresses.

Say Server A (IPv4 address 10.136.21.12) wants to communicate with Server B (IPv4 address 10.136.21.22). Both are connected to the same switch, hence on the same LAN. Communication between them happens in layer 2. Here the machines identify each other via MAC addresses, so first Server A checks whether it knows the MAC address of Server B. If it knows (usually coz Server A has communicated with Server B recently and the MAC address is cached in its ARP table) then there’s nothing to do; but if it does not, then Server A finds the MAC address via something called ARP (Address Resolution Protocol). The way this works is that Server A broadcasts to the whole network that it wants the MAC address of the machine with IPv4 address 10.136.21.22 (the address of Server B). This message goes to the switch, the switch sends it to all the devices connected to it, Server B replies with its MAC address and that is sent to Server A. The two now communicate – I’ll come to that in a moment.
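The lookup order described above (check the cache first, broadcast only on a miss) can be sketched as a toy model in Python. It is not a real ARP implementation: the class name `Host` and both MAC addresses are made up for illustration, and the "broadcast" is just a loop over the hosts on the LAN.

```python
class Host:
    def __init__(self, ip, mac):
        self.ip, self.mac = ip, mac
        self.arp_cache = {}              # IP address -> MAC address

    def resolve(self, target_ip, lan):
        """Return (mac, how) where how is 'cache', 'arp' or 'unanswered'."""
        if target_ip in self.arp_cache:
            return self.arp_cache[target_ip], "cache"
        for other in lan:                # broadcast: every host sees the request
            if other is not self and other.ip == target_ip:
                self.arp_cache[target_ip] = other.mac   # owner replies, we cache
                return other.mac, "arp"
        return None, "unanswered"


# Hypothetical MAC addresses for Server A and Server B.
server_a = Host("10.136.21.12", "00:eb:24:b2:05:01")
server_b = Host("10.136.21.22", "00:eb:24:b2:05:02")
lan = [server_a, server_b]

print(server_a.resolve("10.136.21.22", lan))   # first lookup goes via ARP
print(server_a.resolve("10.136.21.22", lan))   # second lookup hits the cache
```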

When it’s communication from devices in a different network to Server A or Server B, the idea is similar except that you have a router connected to the switch. The router receives traffic for a device on this network – it knows the IPv4 address – so it finds the MAC address similar to above and passes it to that device. Simple.

Now, how does the switch know which port a particular device is connected to? Say the switch gets traffic addressed to MAC address 00:eb:24:b2:05:ac – how does the switch know which port that is on? Here’s how that happens –

  • First the switch checks if it already has this information cached. Switches have a table called the CAM (Content Addressable Memory) table which holds this cached info.
  • Assuming the CAM table doesn’t have this info the switch will send the frame (containing the packets for the destination device) to all ports other than the one it arrived on. Note, this is not like ARP where a question is sent asking for the device to respond; instead the frame itself is simply sent out of all those ports. In effect it is flooded to the whole network.
  • When a switch receives frames from a port it notes the source MAC address and port and that’s how it keeps the CAM table up to date. Thus when Server A sends data to Server B, the MAC address and switch port of Server A are stored in the switch’s CAM table. This entry is only stored for a brief period.
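The learn-and-flood behaviour in the list above can be sketched as a toy switch in Python. The class name `Switch`, the port numbers and Server B's MAC address are made up for illustration, and entry expiry is left out to keep it short.

```python
class Switch:
    def __init__(self, ports):
        self.ports = list(ports)
        self.cam = {}                    # MAC address -> port

    def receive(self, in_port, src_mac, dst_mac):
        """Handle one frame and return the list of ports it is sent out of."""
        self.cam[src_mac] = in_port      # learn where the sender lives
        if dst_mac in self.cam:
            return [self.cam[dst_mac]]   # known destination: one port only
        return [p for p in self.ports if p != in_port]   # unknown: flood


SERVER_A = "00:eb:24:b2:05:ac"           # on port 1
SERVER_B = "00:eb:24:b2:05:bb"           # on port 2 (hypothetical MAC)

sw = Switch([1, 2, 3])
print(sw.receive(1, SERVER_A, SERVER_B))   # [2, 3]  B unknown, so flooded
print(sw.receive(2, SERVER_B, SERVER_A))   # [1]     A was learned above
print(sw.receive(1, SERVER_A, SERVER_B))   # [2]     B is known now too
```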

Now let’s talk about NLB (Network Load Balancing).

Consider two machines – 10.136.21.11 with MAC address 00:eb:24:b2:05:ac and 10.136.21.12 with MAC address 00:eb:24:b2:05:ad. NLB is a form of load balancing wherein you create a Virtual IP (VIP) such as 10.136.21.10 such that any traffic to 10.136.21.10 is sent to either of 10.136.21.11 or 10.136.21.12. Thus you have the traffic being load balanced between the two machines; and not only that, if any one of the machines goes down, nothing is affected because the other machine can continue handling the traffic.

But now we have a problem. If we want a VIP 10.136.21.10 that should send traffic to either host, how will this work when it comes to MAC addresses? That depends on the type of NLB. There’s two sorts – Unicast and Multicast.

In Unicast the NIC that is used for clustering on each server has its MAC address changed to a new Unicast MAC address that’s the same for all hosts. Thus for example, the NICs that hold the NLB IP address 10.136.21.10 in the scenario above will have their MAC addresses changed from 00:eb:24:b2:05:ac and 00:eb:24:b2:05:ad respectively to (say) 00:eb:24:b2:05:af. Note that the MAC address is a Unicast MAC (which basically means the MAC address looks like a regular MAC address, such as that assigned to a single machine). Since this is a Unicast MAC address, and by definition it can only be assigned to one machine/ switch port, the NLB driver on each machine cheats a bit and changes the source MAC address of outgoing frames to whatever the original NIC MAC address was. That is to say –

  • Server IP 10.136.21.11
    • Has MAC address 00:eb:24:b2:05:ac
    • Which is changed to a MAC address of 00:eb:24:b2:05:af as part of enabling Unicast NLB
    • However when traffic is sent out from this machine the MAC address is changed back to 00:eb:24:b2:05:ac
  • Same for Server 10.136.21.12

Why does this happen? This is because –

  • When a device wants to send data to the VIP address, it will try to find the MAC address using ARP. That is, it sends a broadcast over the network asking for the device with this IP address to respond. Since both servers now have the same MAC address for their NLB NIC either server will respond with this common MAC address.
  • Now the switch receives frames for this MAC address. The switch does not have this in its CAM table so it will send the frame to all ports – reaching both servers, which then decide between themselves which one handles it.
  • But why does outgoing traffic from either server change the MAC address of outgoing traffic? That’s because if outgoing frames have the common MAC address, then the switch will associate this common MAC address with that port – resulting in all future traffic to the common MAC address only going to one of the servers. By changing the outgoing frame MAC address back to the server’s original MAC address, the switch never gets to store the common MAC address in its CAM table and all frames for the common MAC address are always broadcast.
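The effect of masking the source MAC can be sketched with a toy switch in Python. The `Switch` class, the `nlb_send` helper, the port numbers and the client MAC address are all made up for illustration. The server and cluster MAC addresses are the ones from the example above.

```python
class Switch:
    def __init__(self, ports):
        self.ports = list(ports)
        self.cam = {}                    # MAC address -> port

    def receive(self, in_port, src_mac, dst_mac):
        self.cam[src_mac] = in_port      # learn the sender's port
        if dst_mac in self.cam:
            return [self.cam[dst_mac]]
        return [p for p in self.ports if p != in_port]   # unknown: flood


CLUSTER_MAC = "00:eb:24:b2:05:af"
CLIENT_MAC = "00:eb:24:b2:05:01"         # hypothetical client on port 3


def nlb_send(switch, port, real_mac, dst_mac, mask_source=True):
    """A cluster host sends a frame, optionally hiding the cluster MAC."""
    src = real_mac if mask_source else CLUSTER_MAC
    return switch.receive(port, src, dst_mac)


# With masking the switch only ever learns the real MACs.
sw = Switch([1, 2, 3])
nlb_send(sw, 1, "00:eb:24:b2:05:ac", CLIENT_MAC)
nlb_send(sw, 2, "00:eb:24:b2:05:ad", CLIENT_MAC)
print(sw.receive(3, CLIENT_MAC, CLUSTER_MAC))    # [1, 2]  both servers get it

# Without masking the cluster MAC gets pinned to a single port.
sw2 = Switch([1, 2, 3])
nlb_send(sw2, 1, "00:eb:24:b2:05:ac", CLIENT_MAC, mask_source=False)
print(sw2.receive(3, CLIENT_MAC, CLUSTER_MAC))   # [1]     only one server
```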

In the context of VMware what this means is that (a) the port group to which the NLB NICs connect must allow changes to the MAC address and allow forged transmits; and (b) the switch notification must be disabled. By default, when a VM is powered on, the port group notifies the physical switch of the VM’s MAC address; that would expose the cluster MAC address to the switch, which is exactly what we want to avoid. Without these changes NLB will not work in Unicast mode with VMware.

(This is a good post to read more about NLB).

Apart from Unicast NLB there’s also Multicast NLB. In this form the NLB NIC’s MAC address is not changed. Instead, a new Multicast MAC address is assigned to the NLB NIC. This is in addition to the regular MAC address of the NIC. The advantage of this method is that since each host retains its existing MAC address the communication between hosts is unaffected. However, since the new MAC address is a Multicast MAC address – and switches by default are set to ignore such addresses – some changes need to be done on the switch side to get Multicast NLB working.

One thing to keep in mind is that it’s important to add a default gateway address to your NLB NIC. At work, for instance, the NLB IPv4 address was reachable within the network but from across networks it wasn’t. Turns out that’s coz Windows 2008 onwards have a strong host behavior – traffic coming in via one NIC does not go out via a different NIC, even if both are in the same subnet and the second NIC has a default gateway set. In our case I added the same default gateway to the NLB NIC too and it was then reachable across networks. 

Use PowerShell/ PowerCLI to get VM space usage

Wanted to get the space used by all VMs across a bunch of our newer hosts –

There’s probably a way to show the total too but I used a separate pipeline for that –