Contact

Subscribe via Email

Subscribe via RSS

Categories

Recent Posts

Creative Commons Attribution 4.0 International License
© Rakhesh Sasidharan

VCSA migration – “A problem occurred while logging in. Verify the connection details.”

So, I was trying out a Windows vCenter 5.5 to VCSA 6.5 appliance migration and at the stage where I enter the target ESX host name where the appliance will be deployed to I got the above error.

Wasted the better part of my day troubleshooting this as I could find absolutely no mention of what was causing this. The installer log had the following but that didn’t shed much light either.

Tried stuff like 1) try a different ESX host, 2) update it to a later version (it was 5.5 Build 3568722), 3) turn on the ESX Shell and SSH in case that mattered – but nothing helped!

Nothing came up regarding the “vimService creation failed: Error” line either. But then I began Googling on “vimService” and learnt that it is the vSphere Management SDK and that you access the SDK via a URL like https://servername/sdk. That got me thinking whether the VCSA installer looks to the proxy settings of the machine where I am running it from, so I turned off the proxy settings in IE – and that helped!

Who would have thought. :)

P2V a SQL cluster by breaking the cluster

Need to P2V a SQL cluster at work. Here’s screenshots of what I did in a test environment to see if an idea of mine would work.

We have a 2 physical-nodes SQL cluster. The requirement was to convert this into a single virtual machine.

P2V-ing a single server is easy. Use VMware Converter. But P2V-ing a cluster like this is tricky. You could P2V each node and end up with a cluster of 2 virtual-nodes but that wasn’t what we wanted. We didn’t want to deal with RDMs and such for the cluster, so we wanted to get rid of the cluster itself. VMware can provide HA if anything happens to the single node.

My idea was to break the cluster and get one of the nodes of the cluster to assume the identity of the cluster. Have SQL running off that. Virtualize this single node. And since there’s no change as far as the outside world is concerned no one’s the wiser.

Found a blog post that pretty much does what I had in mind. Found one more which was useful but didn’t really pertain to my situation. Have a look at the latter post if your DTC is on the Quorum drive (wasn’t so in my case).

So here we go.

1) Make the node that I want to retain as the active node of the cluster (so it was all the disks and databases). Then shutdown SQL server.

sqlshutdown

2) Shutdown the cluster.

clustershutdown

3) Remove the node we want to retain, from the cluster.

We can’t remove/ evict the node via GUI as the cluster is offline. Nor can we remove the Failover Cluster feature from the node as it is still part of a cluster (even though the cluster is shutdown). So we need to do a bit or “surgery”. :)

Open PowerShell and do the following:

This simply clears any cluster related configuration from the node. It is meant to be used on evicted nodes.

Once that’s done remove the Failover Cluster feature and reboot the node. If you want to do this via PowerShell:

4) Bring online the previously shared disks.

Once the node is up and running, open Disk Management and mark as online the shared disks that were previously part of the cluster.

disksonline

5) Change the IP and name of this node to that of the cluster.

Straight-forward. Add CNAME entries in DNS if required. Also, you will have to remove the cluster computer object from AD first before renaming this node to that name.

6) Make some registry changes.

The SQL Server is still not running as it expects to be on a cluster. So make some registry changes.

First go to HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\Setup and open the entry called SQLCluster and change its value from 1 to 0.

Then take a backup (just in case; we don’t really need it) of the key called HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\Cluster and delete it.

Note that MSSQL10_50.MSSQLSERVER may vary depending on whether you have a different version of SQL than in my case.

7) Start the SQL services and change their startup type to Automatic.

I had 3 services.

Now your SQL server should be working.

8) Restart the server – not needed, but I did so anyways.

Test?

If you are doing this in a test environment (like I was) and don’t have any SQL applications to test with, do the following.

Right click the desktop on any computer (or the SQL server computer itself) and create a new text file. Then rename that to blah.udl. The name doesn’t matter as long as the extension is .udl. Double click on that to get a window like this:

udl

Now you can fill in the SQL server name and test it.

One thing to keep in mind (if you are not a SQL person – I am not). The Windows NT Integrated security is what you need to use if you want to authenticate against the server with an AD account. It is tempting to select the “Use a specific user name …” option and put in an AD username/ password there, but that won’t work. That option is for using SQL authentication.

If you want to use a different AD account you will have to do a run as of the tool.

Also, on a fresh install of SQL server SQL authentication is disabled by default. You can create SQL accounts but authentication will fail. To enable SQL authentication right click on the server in SQL Server Management Studio and go to Properties, then go to Security and enable SQL authentication.

sqlauth

That’s all!

Now one can P2V this node.

Removing Datastores from an ESX host

Datastores in ESX hosts are made up of extents. Extents can be thought of as the underlying physical disk/ LUN that goes into making up the datastore.

A datastore is usually made up of a single extent, but can span multiple extents too. So removing a datastore from an ESX hosts means you dismount the datastore and then detach the extents.

Datastores have friendly names that you assign when creating it. Extents have names that usually start with naa or eui.

In vSphere client when you select a host, go to its Configuration tab, Storage, select Datastores view – the “Identification” column shows the datastore name and the “Device” column shows the extent name.

In PowerCLI the same information can be seeing using  Get-View or the ExtensionData property object of a datastore object (as in my previous post).

Anyways, to remove a datastore from an ESX host you first go to the Datastores screen as above, select the datastore, right click and select “Unmount”. This will do a bunch of checks (such as whether any VMs running on that host have their disks on this datastore) and then let you unmount it. This only removes the datastore name from the ESX host though; the host can still see and mount the datastore. So the next step is to also detach the extent from the host – i.e. unpresent the underlying disk/ LUN from the host.

For this you need the extent names. Get these as above (by expanding the “Device” column to see the name; or use PowerCLI). Then go to the Devices view (instead of the Datastores view that you currently are on). Expand the “Identifier” column now and find the extents that we want to detach. Once you find this right click and select “Detach”. This too does some checks and then lets you detach the extent if it’s not in use.

That’s it.

p.s. Too lazy to take screenshots. Sorry about that. :)

Get a list of VMs running on specific datastores, along with the host

Needed to dismount some datastores/ LUNs from a few hosts but before doing that needed to ensure none of the VMs running on these datastores are hosted on the hosts I want to remove access from. This one-liner PowerCLI will do just that for you:

Replace “PP_” with the pattern you are interested in matching in the datastore name.

A variation of the above where I only list VMs that are hosted the hosts I want to remove access from:

In my case the hosts that should have access to the datastores with a “PP_” in their name will also have numbers 01-03 in them. Any VMs not on hosts with these names are what I am interested in.

PowerCLI, VMware Tools update, etc.

(The following is based on this VMware KB article which is for ESXi 4.0 and earlier but can be made to work for later versions too).

In vSphere client we can see the VMware Tools related settings of a VM in the Options tab of the VM properties window. In PowerCLI these are exposed under the ExtensionData object. Specifically the ExtensionData.Config.Tools object.

The ExtensionData object has many methods and properties – think of it like the advanced options menu in a GUI. One of these methods is ReconfigVM() which takes an object of type VMware.Vim.VirtualMachineConfigSpec and reconfigures the VM accordingly.

So to take the example of modifying the VMware Tools update settings all one has to do is create a new object of the type above and pass it to the ReconfigVM() method. Something as below.

First we create an object of this type:

If we look at this object now we will see that it has various properties and methods. The Tools related settings are controlled by a property called Tools of type VMware.Vim.ToolsConfigInfo. To modify these we need to create a new object of that type:

This has no settings by default:

But we can set the properties we are interested in modifying.

For instance to set VMware Tools to be automatically updated upon power cycle do the following:

To undo that change set the value to “manual” (it only takes two options).

Here’s an example of me changing the VMware Tools updating settings to be manual.

So that’s it. Now to do this en-masse for a bunch of VMs you can make a loop.

If the list of VMs is got from vCenter directly (via say something like Get-VM | where {(Get-Cluster).Name -eq “CLUSTER NAME”}) then the code needs a bit of change (the $VMObj line can be removed).

Just as a reference to future me, the output returned by the ExtensionData object is what you would get via the Get-View cmdlet.

Update: Came across this while writing this post. If you have multiple vCenter servers and want PowerCLI to work against entities in all of them the following will help.

Enabling SNMPv3 on ESXi hosts

A continuation to my earlier post which was to do with SNMPv2.

As before, connect to the vCenter via PowerCLI. And as before the set() method can be used to set SNMP – both v2 and/or v3. The definition of this method is as follows:

That’s confusing so best to copy paste the definition into notepad or something so you can be sure you are passing the correct arguments.

First things first. There doesn’t seem to be a way of turning off something. As in, say you already have SNMPv2 turned on, you can’t turn it off by setting the community strings to blank. Doing so generates an error. So if you want to turn previous things off it’s best to do a reset and start with a clean slate.

This sets things back to their defaults:

Before going ahead with any SNMPv3 configuration we need to decide on what authentication and privacy protocols to use. In my case I want to use SHA1 and AES-128. So I need to set that first:

Once I have done this I can generate the hashes. I will need this later to configure SNMPv3.

In the example above both my passwords are Password1.

With this in hand I configure SNMPv3:

That’s it really. In the above example I will be using an SNMPv3 user called snmpUser1.

Now to do it across my estate I can make a loop. No need to create password hashes for each host. The hash stays the same as long as you are using the same password for each host.

That’s all!

vSphere Replication does not support changing the length of a replicated disk.

Had to extend a VM disk today and got the above error. This is because the VM is replicated via vSphere Replication so you can’t simply extend the disk as you would do for any regular VM.

error-1

Here’s a top level summary of how you do it (based on this KB article).

  1. You have to break the replication. Stop it that is. But doing so deletes the replicated files, so first you want to work around that (as below).
    1. Note the current settings of the replication.
    2. Then pause the replication.
    3. Find out which datastore holds the replicated VM disks.
    4. Rename the replicated VM folder.
    5. Now you can stop the replication because you have kept a copy of the data.
  2. SSH into any ESX host that has access to the above datastore and extend the disk associated with the VMDK via vmkfstools.
  3. Rename the folder back to what it was before.
  4. Recreate the replication, but point the destination to the same datastore as above and select the folder above. vSphere Replication will ask whether you want to use the existing data as seed – answer yes.

That’s it basically.

In terms of the details, I didn’t know how to find which datastore had the replicated VM files. So I SSH’d into one of the hosts in the replicated VM cluster and ran the following:

There must be some better way, but what the heck. Once I found the path above I did the following to find other VMs in it, and using that info I was able to find the datastore name from vSphere client.

You need this datastore name for when setting up a new replication, so you can point to that.

Some more things to keep in mind are the following.

  1. Since we pause the replication rather than stop it, the folder will contain a bunch of hbr* files. Delete those.
  2. The vmkfstools command -X switch takes the new size of the disk. Not the additional amount. So if the disk is 10GB and you want to add 20GB, you specify it the argument as 30GB. If you are getting a “Failed to extend disk : One of the parameters supplied is invalid (1).” error with vmkfstools that’s probably why.

VMware client – unable to login with username, password; but able to login with “use windows credentials”

We had this weird issue at work yesterday wherein you could not login to the vCenter server by entering a username/ password, but could if you just ticked on the “Use windows session credentials” checkbox.

The issue got resolved eventually by stopping the “VMware Secure Token Service”, restarting the “VMware VirtualCenter Server” service, and then starting the “VMware Secure Token Service”. No idea why that made a difference though, and whether that actually fixed things or was just coincidental. Around the same time I had seen some VMware Tools errors so I (a) upgraded the tools, (b) moved the vCenter VM to a different host, (c) saw that one of these had caused issues with the network driver so I had to uninstall and reinstall the tools and then reset the secure channel with the domain (since when the vCenter VM came up it didn’t have network connectivity).

So it was a bit of a damper actually. Nothing more frustrating than spending a lot of time troubleshooting something and not really figuring out what the issue is. On the plus side at least the issue got sorted, but it leaves me uneasy not knowing what really went wrong and whether it will re-occur.

In the event logs there were many entries like these:

An account failed to log on.

Subject:
    Security ID:        SYSTEM
    Account Name:        VCENTER01$
    Account Domain:        MYDOMAIN
    Logon ID:        0x3e7

Logon Type:            3

Account For Which Logon Failed:
    Security ID:        NULL SID
    Account Name:        SomeAccount
    Account Domain:        MYDOMAIN.COM

Failure Information:
    Failure Reason:        Unknown user name or bad password.
    Status:            0xc000006d
    Sub Status:        0xc0000064

Process Information:
    Caller Process ID:    0xe20
    Caller Process Name:    E:\Program Files\VMware\Infrastructure\VMware\CIS\vmware-sso\VMwareIdentityMgmtService.exe

Network Information:
    Workstation Name:    VCENTER01
    Source Network Address:    –
    Source Port:        –

Detailed Authentication Information:
    Logon Process:        Advapi  
    Authentication Package:    Negotiate
    Transited Services:    –
    Package Name (NTLM only):    –
    Key Length:        0

Here’s what the error codes mean –

  • NULL SID suggests that the account that was being authenticated could not be identified
  • 0xC000006D means that authentication failed due to bad credentials
  • 0xC0000064 means that the requested user name does not exist.
  • Logon type 3 means the request was received from the network (but given the request originated from “server”, suggests that the request was looped back from itself over the network stack.

Not that it throws much light on what’s happening.

For info – this KB article lists the useful vCenter log files. I looked at the vpxd-xxxx.log file which had some entries like these –

2016-06-06T16:08:18.046+01:00 [02856 error ‘[SSO]’ opID=138a737d] [UserDirectorySso] AcquireToken exception: class SsoClient::CommunicationException(No connection could be made because the target machine actively refused it)
2016-06-06T16:08:18.046+01:00 [02856 error ‘authvpxdUser’ opID=138a737d] Failed to authenticate user <mydomain\someaccount>

This file is under C:\ProgramData\VMware\VMware VirtualCenter\Logs by the way.

I also found messages like these –

2016-06-06T10:17:59.226+01:00 [06952 error ‘[SSO]’ opID=1790eabb] [UserDirectorySso] AcquireToken exception: class SsoClient::SsoException(Failed to parse Group Identity value: `\Authentication authority asserted identity’; domain or group missing)

Two more logs I looked at are C:\ProgramData\VMware\CIS\logs\vmware-sso\vmware-sts-idmd.log and some files under C:\ProgramData\VMware\CIS\runtime\VMwareSTS\logs. In case of the latter location I just sorted by the recently modified timestamp and found some logs to look at. I focused on one called ssoAdminServer.log. This file had a few entries like these –

[2016-06-06 12:19:08,987 pool-11-thread-1  ERROR com.vmware.identity.admin.server.ims.impl.PrincipalManagementImpl] Idm client exception
com.vmware.identity.idm.IDMException: Invalid group name format for [\Authentication authority asserted identity]
    at com.vmware.identity.idm.server.ServerUtils.getRemoteException(ServerUtils.java:131)
    at com.vmware.identity.idm.server.IdentityManager.findNestedParentGroupsInternal(IdentityManager.java:4006)

I found mention of this message in a forum post which pointed to this being a known issue for vCenter installed on a 2012 server with a 2012 DC. That doesn’t apply to me.

The vSphere Web Client gives an error message “Cannot Parse Group Information” – which too is a symptom if you install vCenter on a 2012 server with a 2012 DC. Moreover it applies to vCenter 5.5 GA, which is what we are on, so all the symptoms point to that issue but it is not so in our case. :(

Back to the vmware-sts-idmd.log, that had entries like these –

2016-06-06 09:00:26,089 WARN   [ActiveDirectoryProvider] obtainDcInfo for domain [VCENTER01] failed Failed to get domain controller information for VCENTER01(dwError – 1355 – ERROR_NO_SUCH_DOMAIN)
2016-06-06 09:00:26,090 WARN   [ActiveDirectoryProvider] obtainDcInfo for domain [VCENTER01] failed Failed to get domain controller information for VCENTER01(dwError – 1355 – ERROR_NO_SUCH_DOMAIN)
2016-06-06 09:00:26,091 ERROR  [ValidateUtil] resolved group name=[\Authentication authority asserted identity] is invalid: not a valid netbios name format  
2016-06-06 09:00:26,092 INFO   [ActiveDirectoryProvider] resolved group name=[\Authentication authority asserted identity] is invalid: not a valid netbios name format  
…<snip>…
2016-06-06 09:02:53,005 INFO   [IdentityManager] Failed to find principal [SomeAccount@mydomain.tld] as FSP group in tenant [vsphere.local]
2016-06-06 09:02:53,008 INFO   [IdentityManager] Failed to find FSP user or gorup [SomeAccount@mydomain.tld]’s nested parent groups in tenant [vsphere.local]
2016-06-06 09:02:53,013 ERROR  [IdentityManager] Failed to find nested parent groups of principal [SomeAccount@mydomain.tld] in tenant [vsphere.local]
2016-06-06 09:02:53,013 ERROR  [ServerUtils] Exception ‘java.lang.IllegalStateException: Invalid group name format for [\Authentication authority asserted identity]’
java.lang.IllegalStateException: Invalid group name format for [\Authentication authority asserted identity]
    at com.vmware.identity.idm.server.provider.activedirectory.ActiveDirectoryProvider.findNestedParentGroupsByPac(ActiveDirectoryProvider.java:2140)
    at com.vmware.identity.idm.server.provider.activedirectory.ActiveDirectoryProvider.findNestedParentGroups(ActiveDirectoryProvider.java:791)
    at com.vmware.identity.idm.server.IdentityManager.findNestedParentGroupsInternal(IdentityManager.java:3985)
    at com.vmware.identity.idm.server.IdentityManager.findNestedParentGroups(IdentityManager.java:3856)
    at sun.reflect.GeneratedMethodAccessor37.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at sun.rmi.server.UnicastServerRef.dispatch(Unknown Source)
    at sun.rmi.transport.Transport$1.run(Unknown Source)
    at sun.rmi.transport.Transport$1.run(Unknown Source)
    at java.security.AccessController.doPrivileged(Native Method)
    at sun.rmi.transport.Transport.serviceCall(Unknown Source)
    at sun.rmi.transport.tcp.TCPTransport.handleMessages(Unknown Source)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(Unknown Source)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)

Again, something to do with DC/ domain … but what!? Found this blog post too that suggested the same.

For my reference, here’s a KB article listing all the SSO log files. And this is a useful blog post in case I happen upon a similar issue later (the case of the flapping VMware Secure Token Service). As is this KB article on an SSO facade error.

Cannot login to vCenter with “use windows session credentials” but can login by entering username & password

Had this issue today (and a few months ago). I open vCenter client, type in the vCenter server name, tick “Use Windows Session Credentials” as usual, and login fails. Says it cannot login with the given credentials.

At the same time I can login with the vSphere Web Client and also by un-ticking the box and manually entering the username/ password.

Fix for both times was to reset the secure channel by logging in to the vCenter server –

 

Enabling SNMP on ESXi hosts

I wanted to enable SNMP on our ESXi hosts for monitoring via Solarwinds. Here’s what I did. (I am doing this kind of generically, using variables etc, so I can script the thing for multiple hosts).

First I connected to the vCenter Server from PowerCLI.

Next I got its ESXCLI object. This will let me run ESXCLI commands against the host.

To view the current status of SNMP you can do can invoke a get() method –

Nothing’s configured currently. To configure something we can use the set() method. From the definition of this method we can see it takes a whole bunch of parameters –

Here’s what I did to configure SNMP. I want a community string of “public”, enable SNMP, and specify two trap destinations.

The result of that will either be a true or false. The get() method can be used again to confirm it is set correctly. And the test() method can be used to test it works –

Now Solarwinds will be able to poll the host via SNMP.

To do this en-masse on all your hosts the following should help –

Shout out to this VMware blog post which helped a lot and has more info.

The above script failed on some of our ESX hosts with the following error –

Turns out these hosts only accept 16 parameters instead of 17 (the one called largestorage is missing). Not sure why. All our hosts are ESXi 5.5 but am thinking the problem ones are perhaps not using the HP customized version of ESXi.

Anyways, so I modified my script above to take care of this –

Also, just for my own info – the $null above means the parameter is not set. If that parameter already has a value on the server it is not over-written. To over-write or blank out the existing value replace $null with "".

Configure NTP for multiple ESXi hosts

Following on my previous post I wanted to set NTP servers for my ESX servers and also start the service & allow firewall exceptions. Here’s what I did –

 

Exchange DAG fails. Information Store service fails with error 2147221213.

Had an interesting issue at work today. When our Exchange servers (which are in a 2 node DAG) rebooted after patch weekend one of them had trouble starting the Information Store service. The System log had entries such as these (event ID 7024) –

The Microsoft Exchange Information Store service terminated with service-specific error %%-2147221213.

The Application log had entries such as these (event ID 5003) –

Unable to initialize the Information Store service because the clocks on the client and server are skewed. This may be caused by a time change either on the client or on the server, and may require a restart of that computer. Verify that your domain is correctly configured and  is currently online.

So it looked like time synchronization was an issue. Which is odd coz all our servers should be correctly syncing time from the Domain Controllers.

Our Exchange team fixed the issue by forcing a time sync from the DC –

I was curious as to why so went through the System logs in detail. What I saw a sequence of entries such as these –

Notice how time jumps ahead 13:21 when the OS starts to 13:27 suddenly, then jumps back to 13:22 when the Windows Time service starts and begins syncing time from my DC. It looked like this jump of 6 mins was confusing the Exchange services (understandably so). But why was this happening?

I checked the time configuration of the server –

Seems to be normal. It was set to pick time from the site DC via NTP (the first entry under TimeProviders) as well as from the ESXi host the VM is running on (the second entry – VM IC Time Provider). I didn’t think much of the second entry because I know all our VMs have the VMware Tools option to sync time from the host to VM unchecked (and I double checked it anyways).

Only one of the mailbox servers was having this jump though. The other mailbox server had a slight jump but not enough to cause any issues. While the problem server had a jump of 6 mins, the ok server had a jump of a few seconds.

I thought to check the ESXi hosts of both VMs anyways. Yes, they are not set to sync time from the host, but let’s double check the host times anyways. And bingo! turns out the ESXi hosts have NTP turned off and hence varying times. The host with the problem server was about 6 mins ahead in terms of time from the DC, while the host with the ok server was about a minute or less ahead – too coincidental to match the time jumps of the VMs!

So it looked like the Exchange servers were syncing time from the ESXi hosts even though I thought they were not supposed to. I read a bit more about this and realized my understanding of host-VM time sync was wrong (at least with VMware). When you tick/ untick the option to synchronize VM time with ESX host, all you are controlling is a periodic synchronization from host to VM. This does not control other scenarios where a VM could synchronize time with the host – such as when it moves to a different host via vMotion, has a snapshot taken, is restored from a snapshot, disk is shrinked, or (tada!) when the VMware Tools service is restarted (like when the VM is rebooted, as was the case here). Interesting.

So that explains what was happening here. When the problem server was rebooted it synced time with the ESXi host, which was 6 mins ahead of the domain time. This was before the Windows Time service kicked in. Once the Windows Time service started, it noticed the incorrect time and set it correct. This time jump confused Exchange – am thinking it didn’t confuse Exchange directly, rather one of the AD services running on the server most likely, and due to this the Information Store is unable to start.

The fix for this is to either disable VMs from synchronizing time from the ESXi host or setup NTP on all the ESXi hosts so they have the correct time going forward. I decided to go ahead with the latter.

Update: Found this and this blog post. They have more screenshots and a better explanation, so worth checking out. :)

vCenter unable to connects to hosts; vSphere client gives error ‘”ServiceInstance.RetrieveContent” for object “ServiceInstance” on Server “IP-Address” failed’

Our Network team had been making some changes at work and suddenly vCenter in our London office lost connectivity with all the ESX hosts in one of our remote office. Moreover, when trying to connect from the vSphere Client to any of the remote hosts directly we were getting the following error –

client error

Connectivity from vSphere Client in the remote office to the ESX host in the same office was fine; it was only connectivity from other offices to this remote office. So it definitely indicated a network issue.

This KB article is a handy one to know what ports are required by various VMware products. Port 443 is what needs to be open to ESX hosts for vCenter Server to be able to talk to them. I did a telnet from the vCenter server to each of the remote office hosts on port 443 and it went through fine – so wasn’t a firewall issue. (Another post with port numbers, just FYI, is this one).

After a fair bit of troubleshooting we tracked the issue down to MTU.

Digressing into MTUs

Communication between two IP addresses (i.e. layer 3) happens through packets. Thus when my London vCenter Server communicates with my remote office ESX host, the two send TCP/IP packets to each other. When these packets from the vCenter Server reach the switch/ router on the same LAN as the ESX host, it becomes a layer 2 communication (because they are on the same network and it’s a matter of data reaching the ESX host from the switch/ router). In the case of Ethernet, this layer 2 communication happens via Ethernet frames. The frames encapsulate the IP packets – so the switch/ router breaks the packets and fits them into multiple frames, while the ESX host receives these frames and re-assembles the packets (and vice versa). (The picture on this Wikipedia page is worth a look to see the encapsulation). 

How much data can be held by a layer 2 frame is defined by the Maximum Transmission Unit (MTU). Larger MTUs are good because you can carry more data; but they have a downside in that each frame takes longer to be transmitted, and in case of any errors more data has to be re-transmitted when the frame is resent. So a balance is important. In the case of Ethernet, RFC 894 (see errata also) defines the MTU as a maximum of 1500 bytes. In the case of other layer 2 protocols, the MTU varies: for example 4464 bytes for Token Ring; 4352 bytes for FDDI; 9180 bytes for ATM; etc. In the case of Ethernet there are now also jumbo frames, which are frames with an MTU size of 9000 bytes (see this page for a table comparing regular frames and jumbo frames) and are commonly used in iSCSI networks.

Taking the case of Ethernet, assume the MTU of all Ethernet networks is 1500 bytes. So when two devices are conversing with each other over layer 3, and this conversation spans multiple Ethernet networks, it is helpful if the devices know that the MTU of the underlying layer 2 network is 1500 bytes. That way the two devices can keep the size of their layer 3 packets to be less than 1500 bytes. Why? Because if the size of the layer 3 packets are greater than 1500 bytes, then the devices and all the routers/ switches in between will have to fragment (break) the layer 3 packets into smaller packets of less than 1500 bytes to fit it in the Ethernet frame. This is a waste of resources for all, so it’s best if the two devices know of the underlying layer 2 MTU and act accordingly.

Now, note that Ethernet MTUs are defined as a maximum of 1500 bytes. So the MTU for a particular LAN segment can be set to a lower number for whatever reason (maybe there are additional fields in the Ethernet frame and to accommodate these the data portion must be reduced). Similarly, a layer 3 conversation between when two devices can go over a mix of layer 2 networks – Ethernet, Token Ring, etc – each with a different MTU. So what is required for the two devices really is a way of knowing what’s the lowest MTU across all these layer 2 devices, so the two devices can use it as the MTU of the layer 3 packets for their conversation. This is known as the Path MTU or IP MTU – and is basically the smallest MTU of all the underlying layer 2 MTUs over which that conversation traverses. It is discovered through a process known as “Path MTU Discovery” (PMTUD) (check this Wikipedia article, or Google this term to learn more). Very briefly, in the case of IPv4 what happens is that each device sends across packets of increasing size to the other end, with a flag set that says “do not fragment this packet”. Packets of size smaller than the lowest layer 2 MTU will get through, but once the size exceeds the lowest MTU the packet will fail & return because it cannot be fragmented (due to the flag) and so is returned via ICMP to the sender. Thus the Path MTU is discovered. This check happens in both directions.

So we have layer 2 MTUs and layer 3 MTUs. Layer 2 MTUs have a maximum value that is dependent on the layer 2 network technology. But what about the minimum value? RFC 791, which defines the Internet Protocol (the IP in TCP/IP), requires that all devices supporting IP must be able to forward packets of 68 bytes without fragmenting (68 bytes because IP headers take 60 bytes size and layer 2 headers take 8 bytes size minimum) and be able to accept packets of minimum size 576 bytes either as one packet or multiple packets that require assembling. Because of this the minimum layer 2 MTU can be thought of as 68 bytes. In a practical sense, however, most IP devices accept 576 bytes without fragmenting, and since this number is higher than the values for all layer 2 networks the minimum layer 2 & layer 3 MTU can be thought of as 576 bytes.

Just for completeness I will also mention Maximum Segment Size (MSS) which is a layer 4 MTU (of sorts) that defines what’s the maximum TCP segment (which is what a TCP packet is called) that can be accepted by devices. It has a default value of 536 bytes. This is based on the 576 bytes that IP requires hosts to accept at minimum, minus 20 bytes for IP headers and 20 bytes for TCP headers. Idea behind using 576 bytes as the base is that this way the TCP segment can be expected to arrive without fragmenting. In a practical sense again, for TCP/IP traffic over Ethernet (which is the common case), since Ethernet frames have an MTU of 1500, the MSS is usually set to 1500 minus 20 minus 20 = 1460 bytes.

This is a good article I came upon. Just linking it as a reference to myself.

Back to our issue

In our case the router in the remote site had the following set in its configuration:

I am not entirely clear where it was set or why it was set, as that comes under the Network team. What this does though is tell the router not to clear the “Do Not Fragment” (DF) bit in Ethernet frames. If a DF bit is present in a frame then the router will not fragment it if the frame size is larger than the MTU (this is how PMTUD also works). I am not sure why this was set – part of some testing I suppose – but because of this larger frames were not getting through to the other side and hence failing. Our Network team removed this statement and then communication with the ESX hosts started working fine.

I wanted to write more about this statement but I am running out of time. This and this are two good links worth reading for more info. Especially the Scenario 4 section in the second link – that’s pretty much what was happening in our case, I think.

What does the vCloud Air “Enable Service Network” do?

I don’t know. :)

But I think it’s used when you want to connect a vCloud Air network (which is part of a Disaster Recovery or Virtual Private Cloud setup) with the network of a vCloud Air PaaS such as vCloud Air SQL. I base this on this vCloud Air SQL Users Guide that talks as about enabling the service network to connect a vDC (vCloud Air Virtual Datacenter) to the vCloud Air SQL network.

Will add more to this post if & when I get to know more.

Troubleshooting ESXi host reboots

Had to troubleshoot an ESXi host reboot today. Came across this link – good one.

Here’s what I did though after the host reboot.

Once the host was online I connected to it via the vSphere client. I didn’t connect to the host directly (though you can do that too). I connected to the vCenter, then navigated to that host, went to the File menu and exported the system logs.

exportsyslogs

This creates a zip file containing another archive. I extracted the contents of this into a folder. The root of that folder has the usual Linux filesystem structure.

dirstructure

I went into the var folder here. (The log subfolder has many logs but most of these might be from after the reboot. If that’s the case, check the run/log subfolder).

In my case the /var/log/vmksummary.log file had entries for when the host rebooted. None of the other files mentioned anything.

Then I went to the /var/run/log folder via PowerShell and ran a grep for the word reboot –

Lots of messages indicating that the host was rebooted via the DCUI (lines 2, 4, 5, and 12). Thus I realized someone had manually rebooted the host.