Creative Commons Attribution 4.0 International License
© Rakhesh Sasidharan

TIL: vCenter inherited permissions are not cumulative

Say you are part of two groups. Group A has full rights on the vCenter. Group B has limited rights on a cluster.

You would imagine that since you are a member of Group A and that has full rights on vCenter itself, your rights on the cluster in question won’t be limited. But nope, you are wrong. Since you are a member of Group B and that has limited rights on the cluster, your rights too are restricted. Bummer if you are a member of multiple groups and some of these groups have limited rights on child objects! :o)

Workaround is to add yourself or Group A explicitly on that cluster, with full rights. Then the permissions become cumulative.
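A rough way to picture this is the sketch below. It is a simplified model of the behaviour described above, not vCenter's actual implementation: it assumes every permission propagates to child objects, and the object, group and privilege names are made up for illustration. The nearest object that has any permission for you wins, and group permissions only combine when they are defined on that same object.

```python
def effective_privileges(obj, user, groups, parents, permissions):
    """Return the privileges `user` ends up with on `obj` in this simplified model.

    parents:     maps each object to its parent (None for the root).
    permissions: maps object -> {principal: set of privileges}; every entry
                 is assumed to propagate to child objects.
    """
    node = obj
    while node is not None:
        entries = permissions.get(node, {})
        # A permission set directly for the user beats any group permission.
        if user in entries:
            return set(entries[user])
        # Otherwise group permissions defined on the SAME object are combined.
        matched = [entries[g] for g in groups if g in entries]
        if matched:
            return set().union(*matched)
        # Nothing applies here, so fall back to what the parent propagates.
        node = parents.get(node)
    return set()


# Hypothetical inventory: a cluster directly under the vCenter root.
parents = {"vcenter": None, "cluster": "vcenter"}
full = {"read", "power_on", "reconfigure", "delete"}
permissions = {
    "vcenter": {"GroupA": full},
    "cluster": {"GroupB": {"read"}},
}
my_groups = ["GroupA", "GroupB"]

before = effective_privileges("cluster", "me", my_groups, parents, permissions)

# Workaround: add Group A explicitly on the cluster as well.
permissions["cluster"]["GroupA"] = full
after = effective_privileges("cluster", "me", my_groups, parents, permissions)

print(sorted(before))
print(sorted(after))
```

Before the workaround only Group B's entry is found on the cluster, so the lookup never reaches the vCenter level. After it, both groups have entries on the cluster and their privileges are merged.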

VMware client – unable to login with username, password; but able to login with “use windows credentials”

We had this weird issue at work yesterday wherein you could not log in to the vCenter server by entering a username/ password, but could if you just ticked the “Use windows session credentials” checkbox.

The issue got resolved eventually by stopping the “VMware Secure Token Service”, restarting the “VMware VirtualCenter Server” service, and then starting the “VMware Secure Token Service”. No idea why that made a difference though, and whether that actually fixed things or was just coincidental. Around the same time I had seen some VMware Tools errors so I (a) upgraded the tools, (b) moved the vCenter VM to a different host, (c) saw that one of these had caused issues with the network driver so I had to uninstall and reinstall the tools and then reset the secure channel with the domain (since when the vCenter VM came up it didn’t have network connectivity).

So it was a bit of a damper actually. Nothing more frustrating than spending a lot of time troubleshooting something and not really figuring out what the issue is. On the plus side at least the issue got sorted, but it leaves me uneasy not knowing what really went wrong and whether it will re-occur.

In the event logs there were many entries like these:

An account failed to log on.

Subject:
    Security ID:        SYSTEM
    Account Name:        VCENTER01$
    Account Domain:        MYDOMAIN
    Logon ID:        0x3e7

Logon Type:            3

Account For Which Logon Failed:
    Security ID:        NULL SID
    Account Name:        SomeAccount
    Account Domain:        MYDOMAIN.COM

Failure Information:
    Failure Reason:        Unknown user name or bad password.
    Status:            0xc000006d
    Sub Status:        0xc0000064

Process Information:
    Caller Process ID:    0xe20
    Caller Process Name:    E:\Program Files\VMware\Infrastructure\VMware\CIS\vmware-sso\VMwareIdentityMgmtService.exe

Network Information:
    Workstation Name:    VCENTER01
    Source Network Address:    –
    Source Port:        –

Detailed Authentication Information:
    Logon Process:        Advapi  
    Authentication Package:    Negotiate
    Transited Services:    –
    Package Name (NTLM only):    –
    Key Length:        0

Here’s what the error codes mean –

  • NULL SID suggests that the account that was being authenticated could not be identified
  • 0xC000006D means that authentication failed due to bad credentials
  • 0xC0000064 means that the requested user name does not exist.
  • Logon type 3 means the request was received from the network (but given that the request originated from the server itself, as the Workstation Name shows, this suggests that the request was looped back to itself over the network stack).
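For quick reference, the lookups above can be kept in a small table. This is only a sketch for decoding the fields by hand: the dictionaries list a handful of well-known NTSTATUS values and logon types rather than the complete sets, and the function name is made up.

```python
# A few NTSTATUS values commonly seen in "An account failed to log on" events.
NTSTATUS = {
    0xC000006D: "STATUS_LOGON_FAILURE: bad user name or authentication information",
    0xC0000064: "STATUS_NO_SUCH_USER: the user name does not exist",
    0xC000006A: "STATUS_WRONG_PASSWORD: user name is correct but the password is wrong",
    0xC0000234: "STATUS_ACCOUNT_LOCKED_OUT: the account is locked out",
    0xC0000072: "STATUS_ACCOUNT_DISABLED: the account is disabled",
    0xC0000071: "STATUS_PASSWORD_EXPIRED: the password has expired",
}

# Common Windows logon types.
LOGON_TYPES = {
    2: "Interactive",
    3: "Network",
    4: "Batch",
    5: "Service",
    7: "Unlock",
    8: "NetworkCleartext",
    10: "RemoteInteractive",
}


def describe_failure(status, sub_status, logon_type):
    """Translate the hex strings and the logon type from the event into text."""
    def lookup(code):
        return NTSTATUS.get(int(code, 16), "unknown status " + code)

    return {
        "status": lookup(status),
        "sub_status": lookup(sub_status),
        "logon_type": LOGON_TYPES.get(int(logon_type), "unknown logon type"),
    }


# Values taken from the event shown above.
info = describe_failure("0xc000006d", "0xc0000064", "3")
for key, value in info.items():
    print(key, "->", value)
```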

Not that it throws much light on what’s happening.

For info – this KB article lists the useful vCenter log files. I looked at the vpxd-xxxx.log file which had some entries like these –

2016-06-06T16:08:18.046+01:00 [02856 error ‘[SSO]’ opID=138a737d] [UserDirectorySso] AcquireToken exception: class SsoClient::CommunicationException(No connection could be made because the target machine actively refused it)
2016-06-06T16:08:18.046+01:00 [02856 error ‘authvpxdUser’ opID=138a737d] Failed to authenticate user <mydomain\someaccount>

This file is under C:\ProgramData\VMware\VMware VirtualCenter\Logs by the way.

I also found messages like these –

2016-06-06T10:17:59.226+01:00 [06952 error ‘[SSO]’ opID=1790eabb] [UserDirectorySso] AcquireToken exception: class SsoClient::SsoException(Failed to parse Group Identity value: `\Authentication authority asserted identity’; domain or group missing)

Two more logs I looked at are C:\ProgramData\VMware\CIS\logs\vmware-sso\vmware-sts-idmd.log and some files under C:\ProgramData\VMware\CIS\runtime\VMwareSTS\logs. In case of the latter location I just sorted by the recently modified timestamp and found some logs to look at. I focused on one called ssoAdminServer.log. This file had a few entries like these –

[2016-06-06 12:19:08,987 pool-11-thread-1  ERROR com.vmware.identity.admin.server.ims.impl.PrincipalManagementImpl] Idm client exception
com.vmware.identity.idm.IDMException: Invalid group name format for [\Authentication authority asserted identity]
    at com.vmware.identity.idm.server.ServerUtils.getRemoteException(ServerUtils.java:131)
    at com.vmware.identity.idm.server.IdentityManager.findNestedParentGroupsInternal(IdentityManager.java:4006)

I found mention of this message in a forum post which pointed to this being a known issue for vCenter installed on a 2012 server with a 2012 DC. That doesn’t apply to me.

The vSphere Web Client gives an error message “Cannot Parse Group Information” – which too is a symptom if you install vCenter on a 2012 server with a 2012 DC. Moreover it applies to vCenter 5.5 GA, which is what we are on, so all the symptoms point to that issue but it is not so in our case. :(

Back to the vmware-sts-idmd.log, that had entries like these –

2016-06-06 09:00:26,089 WARN   [ActiveDirectoryProvider] obtainDcInfo for domain [VCENTER01] failed Failed to get domain controller information for VCENTER01(dwError – 1355 – ERROR_NO_SUCH_DOMAIN)
2016-06-06 09:00:26,090 WARN   [ActiveDirectoryProvider] obtainDcInfo for domain [VCENTER01] failed Failed to get domain controller information for VCENTER01(dwError – 1355 – ERROR_NO_SUCH_DOMAIN)
2016-06-06 09:00:26,091 ERROR  [ValidateUtil] resolved group name=[\Authentication authority asserted identity] is invalid: not a valid netbios name format  
2016-06-06 09:00:26,092 INFO   [ActiveDirectoryProvider] resolved group name=[\Authentication authority asserted identity] is invalid: not a valid netbios name format  
…<snip>…
2016-06-06 09:02:53,005 INFO   [IdentityManager] Failed to find principal [SomeAccount@mydomain.tld] as FSP group in tenant [vsphere.local]
2016-06-06 09:02:53,008 INFO   [IdentityManager] Failed to find FSP user or gorup [SomeAccount@mydomain.tld]’s nested parent groups in tenant [vsphere.local]
2016-06-06 09:02:53,013 ERROR  [IdentityManager] Failed to find nested parent groups of principal [SomeAccount@mydomain.tld] in tenant [vsphere.local]
2016-06-06 09:02:53,013 ERROR  [ServerUtils] Exception ‘java.lang.IllegalStateException: Invalid group name format for [\Authentication authority asserted identity]’
java.lang.IllegalStateException: Invalid group name format for [\Authentication authority asserted identity]
    at com.vmware.identity.idm.server.provider.activedirectory.ActiveDirectoryProvider.findNestedParentGroupsByPac(ActiveDirectoryProvider.java:2140)
    at com.vmware.identity.idm.server.provider.activedirectory.ActiveDirectoryProvider.findNestedParentGroups(ActiveDirectoryProvider.java:791)
    at com.vmware.identity.idm.server.IdentityManager.findNestedParentGroupsInternal(IdentityManager.java:3985)
    at com.vmware.identity.idm.server.IdentityManager.findNestedParentGroups(IdentityManager.java:3856)
    at sun.reflect.GeneratedMethodAccessor37.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at sun.rmi.server.UnicastServerRef.dispatch(Unknown Source)
    at sun.rmi.transport.Transport$1.run(Unknown Source)
    at sun.rmi.transport.Transport$1.run(Unknown Source)
    at java.security.AccessController.doPrivileged(Native Method)
    at sun.rmi.transport.Transport.serviceCall(Unknown Source)
    at sun.rmi.transport.tcp.TCPTransport.handleMessages(Unknown Source)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(Unknown Source)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)

Again, something to do with DC/ domain … but what!? Found this blog post too that suggested the same.

For my reference, here’s a KB article listing all the SSO log files. And this is a useful blog post in case I happen upon a similar issue later (the case of the flapping VMware Secure Token Service). As is this KB article on an SSO facade error.

Cannot login to vCenter with “use windows session credentials” but can login by entering username & password

Had this issue today (and a few months ago). I open vCenter client, type in the vCenter server name, tick “Use Windows Session Credentials” as usual, and login fails. Says it cannot login with the given credentials.

At the same time I can login with the vSphere Web Client and also by un-ticking the box and manually entering the username/ password.

The fix both times was to log in to the vCenter server and reset its secure channel with the domain.

vCenter unable to connect to hosts; vSphere client gives error ‘”ServiceInstance.RetrieveContent” for object “ServiceInstance” on Server “IP-Address” failed’

Our Network team had been making some changes at work and suddenly vCenter in our London office lost connectivity with all the ESX hosts in one of our remote offices. Moreover, when trying to connect from the vSphere Client to any of the remote hosts directly we were getting the following error –

client error

Connectivity from vSphere Client in the remote office to the ESX host in the same office was fine; it was only connectivity from other offices to this remote office. So it definitely indicated a network issue.

This KB article is a handy one to know what ports are required by various VMware products. Port 443 is what needs to be open to ESX hosts for vCenter Server to be able to talk to them. I did a telnet from the vCenter server to each of the remote office hosts on port 443 and it went through fine – so wasn’t a firewall issue. (Another post with port numbers, just FYI, is this one).
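If telnet isn't handy, the same check can be scripted. Below is a minimal sketch using only the Python standard library; the function names are made up. Keep in mind that a successful connect only proves the small packets of the TCP handshake get through, so it says nothing about whether larger packets will.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def check_hosts(hosts, port=443):
    """Print one line per host saying whether the port answered."""
    for host in hosts:
        state = "open" if tcp_port_open(host, port) else "unreachable"
        print(host, port, state)
```

Calling something like `check_hosts(["esx01.example.tld", "esx02.example.tld"])` from the vCenter server (the host names here are placeholders) gives the same answer as running telnet against each host on port 443.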

After a fair bit of troubleshooting we tracked the issue down to MTU.

Digressing into MTUs

Communication between two IP addresses (i.e. layer 3) happens through packets. Thus when my London vCenter Server communicates with my remote office ESX host, the two send TCP/IP packets to each other. When these packets from the vCenter Server reach the switch/ router on the same LAN as the ESX host, it becomes a layer 2 communication (because they are on the same network and it’s a matter of data reaching the ESX host from the switch/ router). In the case of Ethernet, this layer 2 communication happens via Ethernet frames. The frames encapsulate the IP packets: each frame carries one IP packet, and a packet too large to fit in a frame has to be fragmented into smaller packets first. The ESX host receives these frames and extracts the packets from them (and vice versa). (The picture on this Wikipedia page is worth a look to see the encapsulation).

How much data can be held by a layer 2 frame is defined by the Maximum Transmission Unit (MTU). Larger MTUs are good because you can carry more data; but they have a downside in that each frame takes longer to be transmitted, and in case of any errors more data has to be re-transmitted when the frame is resent. So a balance is important. In the case of Ethernet, RFC 894 (see errata also) defines the MTU as a maximum of 1500 bytes. In the case of other layer 2 protocols, the MTU varies: for example 4464 bytes for Token Ring; 4352 bytes for FDDI; 9180 bytes for ATM; etc. In the case of Ethernet there are now also jumbo frames, which are frames with an MTU size of 9000 bytes (see this page for a table comparing regular frames and jumbo frames) and are commonly used in iSCSI networks.

Taking the case of Ethernet, assume the MTU of all Ethernet networks is 1500 bytes. So when two devices are conversing with each other over layer 3, and this conversation spans multiple Ethernet networks, it is helpful if the devices know that the MTU of the underlying layer 2 network is 1500 bytes. That way the two devices can keep the size of their layer 3 packets to 1500 bytes or less. Why? Because if the size of the layer 3 packets is greater than 1500 bytes, then the devices and all the routers/ switches in between will have to fragment (break) the layer 3 packets into smaller packets of 1500 bytes or less to fit them in the Ethernet frames. This is a waste of resources for all, so it’s best if the two devices know of the underlying layer 2 MTU and act accordingly.
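To get a feel for the cost of fragmentation, here is a small sketch that works out how an IPv4 packet would be split for a given MTU. It assumes a plain 20-byte IP header with no options; the function name is made up. Every fragment except the last must carry a multiple of 8 bytes of data, because the fragment offset field counts in 8-byte units.

```python
def fragment_sizes(packet_len, mtu, header_len=20):
    """Return the total length (header + data) of each IPv4 fragment."""
    if packet_len <= mtu:
        return [packet_len]  # fits as it is, no fragmentation needed
    data_len = packet_len - header_len
    # Largest data size per fragment, rounded down to a multiple of 8 bytes.
    max_data = (mtu - header_len) // 8 * 8
    if max_data <= 0:
        raise ValueError("MTU too small to carry any data")
    sizes = []
    while data_len > 0:
        chunk = min(max_data, data_len)
        sizes.append(header_len + chunk)  # each fragment gets its own header
        data_len -= chunk
    return sizes


print(fragment_sizes(4000, 1500))  # [1500, 1500, 1040]
print(fragment_sizes(1500, 1400))  # [1396, 124]
```

Note how a full-size 1500-byte packet crossing a link with an MTU of just 1400 turns into two packets, each with its own header.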

Now, note that Ethernet MTUs are defined as a maximum of 1500 bytes. So the MTU for a particular LAN segment can be set to a lower number for whatever reason (maybe there are additional fields in the Ethernet frame and to accommodate these the data portion must be reduced). Similarly, a layer 3 conversation between two devices can go over a mix of layer 2 networks – Ethernet, Token Ring, etc – each with a different MTU. So what the two devices really need is a way of knowing the lowest MTU across all these layer 2 networks, so they can use it as the MTU of the layer 3 packets for their conversation. This is known as the Path MTU or IP MTU – and is basically the smallest of all the underlying layer 2 MTUs over which that conversation traverses. It is discovered through a process known as “Path MTU Discovery” (PMTUD) (check this Wikipedia article, or Google this term to learn more). Very briefly, in the case of IPv4 what happens is that each device sends its packets at the size of its own interface MTU, with a flag set that says “do not fragment this packet”. Packets no larger than the lowest layer 2 MTU along the path will get through, but a packet that exceeds the MTU of some hop cannot be fragmented (due to the flag), so the router at that hop drops it and sends an ICMP “fragmentation needed” message, containing the MTU of that hop, back to the sender. The sender then lowers its packet size to that value and tries again, repeating until packets get through. Thus the Path MTU is discovered. This check happens in both directions.
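Here is a toy simulation of PMTUD in one direction, following the mechanism in RFC 1191: the sender starts at its own interface MTU with the DF flag set, and lowers its packet size each time a router reports a smaller next-hop MTU. The link MTUs in the example are made up, and real routers, ICMP messages and timers are of course not modelled.

```python
def discover_path_mtu(local_mtu, link_mtus):
    """Simulate Path MTU Discovery along a one-way path.

    local_mtu: MTU of the sender's own interface.
    link_mtus: MTU of each link along the path, in order.
    Returns (path_mtu, attempts), where attempts counts the packets sent
    until one made it all the way through.
    """
    size = local_mtu
    attempts = 0
    while True:
        attempts += 1
        reported = None
        for mtu in link_mtus:
            if size > mtu:
                # DF is set, so this router drops the packet and reports
                # its next-hop MTU back to the sender via ICMP.
                reported = mtu
                break
        if reported is None:
            return size, attempts  # the packet crossed every link
        size = reported  # retry with the size the router told us about


# Hypothetical path: two narrower links between ordinary Ethernet segments.
print(discover_path_mtu(1500, [1500, 1400, 1500, 1300]))  # (1300, 3)
```

If a router or firewall along the way suppresses those ICMP messages, the sender never learns it should shrink its packets and the large ones simply vanish, which is why small exchanges can work while bigger transfers fail.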

So we have layer 2 MTUs and layer 3 MTUs. Layer 2 MTUs have a maximum value that is dependent on the layer 2 network technology. But what about the minimum value? RFC 791, which defines the Internet Protocol (the IP in TCP/IP), requires that all devices supporting IP must be able to forward packets of 68 bytes without fragmenting (68 bytes because an IP header can be up to 60 bytes and the smallest fragment carries 8 bytes of data) and be able to accept packets of minimum size 576 bytes either as one packet or multiple packets that require assembling. Because of this the minimum layer 2 MTU can be thought of as 68 bytes. In a practical sense, however, most IP devices accept 576 bytes without fragmenting, and since the MTUs of the common layer 2 networks are all higher than this number, the minimum layer 2 & layer 3 MTU can be thought of as 576 bytes.

Just for completeness I will also mention Maximum Segment Size (MSS) which is a layer 4 MTU (of sorts) that defines what’s the maximum TCP segment (which is what a TCP packet is called) that can be accepted by devices. It has a default value of 536 bytes. This is based on the 576 bytes that IP requires hosts to accept at minimum, minus 20 bytes for IP headers and 20 bytes for TCP headers. Idea behind using 576 bytes as the base is that this way the TCP segment can be expected to arrive without fragmenting. In a practical sense again, for TCP/IP traffic over Ethernet (which is the common case), since Ethernet frames have an MTU of 1500, the MSS is usually set to 1500 minus 20 minus 20 = 1460 bytes.
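The arithmetic is simple enough to put in a one-liner. This sketch assumes the minimum 20-byte IP and TCP headers (no options); the names are made up.

```python
IP_HEADER = 20   # bytes, IPv4 header without options
TCP_HEADER = 20  # bytes, TCP header without options


def mss_for_mtu(mtu):
    """Largest TCP payload that fits in one unfragmented packet of size `mtu`."""
    return mtu - IP_HEADER - TCP_HEADER


for mtu in (576, 1500, 9000):
    print(mtu, "->", mss_for_mtu(mtu))
```

This gives 536 for the 576-byte minimum, 1460 for standard Ethernet and 8960 for 9000-byte jumbo frames.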

This is a good article I came upon. Just linking it as a reference to myself.

Back to our issue

In our case the router in the remote site had the following set in its configuration:

I am not entirely clear where it was set or why it was set, as that comes under the Network team. What this does though is tell the router not to clear the “Do Not Fragment” (DF) bit in IP packets. If the DF bit is set in a packet then the router will not fragment it even if the packet is larger than the MTU of the next hop; it drops the packet instead (this is how PMTUD also works). I am not sure why this was set – part of some testing I suppose – but because of this larger packets were not getting through to the other side and hence failing. Our Network team removed this statement and then communication with the ESX hosts started working fine.

I wanted to write more about this statement but I am running out of time. This and this are two good links worth reading for more info. Especially the Scenario 4 section in the second link – that’s pretty much what was happening in our case, I think.

Unable to login to vSphere because the admin@system-domain password cannot be reset

vSphere 5.1 has admin@system-domain as the default admin account. vSphere 5.5 changes that to administrator@vsphere.local. However, if you upgrade from 5.1 to 5.5 the default admin account remains admin@system-domain. Which is fine and dandy until the password for this account expires. Then you are unable to reset or login! See below. :)

Trying to login as usual

1 - login

Password has expired, needs a reset

2 - reset

Reset fails though coz you can only reset for the vsphere.local domain

3 - reset fails

Missed out on taking a screenshot but if you were to try and login with administrator@vsphere.local instead you get an error that the credentials are invalid (because that account doesn’t exist!). So you are stuck!

What do you do?

Solution is to reset the admin password

When you do this vSphere automatically creates the administrator@vsphere.local account. Follow the steps in this KB article.

4 - reset password

Now you can login with administrator@vsphere.local and the generated password.

It is possible to vMotion VMs across ESX hosts without shared storage

Today (well actually, a few days ago; but today is when I read more about it) I learnt that you can vMotion VMs across hosts without shared storage.

This is only for vSphere 5.1 and above. That’s a pretty cool feature, especially because at work we are migrating all our VMs to new hosts & storage and one of the things we were wondering about was how to move the VMs across. The new hosts have 3Par storage while the old hosts have StoreVirtual storage, so the thinking was that we’d probably have to give the new hosts access to the StoreVirtual storage and then do a vMotion. Now we won’t have to!

There’s no separate name for this sort of vMotion and it seems to be a not quite hyped feature. For anyone interested here’s some screenshots on how to do such a vMotion.

For starters here’s my testlab setup:

setup

One datacenter. Two clusters. Cluster one has two hosts with shared storage. Cluster two has a single host with no shared storage. UBUNTU1 is a VM I would like to migrate over.

Note that host esx03 has no connectivity to the shared storage either. I have removed the iSCSI VMkernel mappings from it so there’s no confusion.

esx03 shared storage

ESX01 and ESX02 have access to shared storage.

esx01 shared storage

Migration is quite simple. Right click the VM and select Migrate. Choose the option to migrate both host and datastore. If the VM is powered on (which it would be as we are doing vMotion instead of a cold migration) you will see the option is grayed out in the older/ C# vSphere client.

migrate host and datastore - 1

That’s because the newer features of vSphere 5.1 are only available in the web client so you’ll have to use that instead (thanks to this blog post for pointing me to that).

migrate host and datastore - 2

Select the destination host. Note that vMotion is only within a datacenter so you can only choose a host in the same datacenter (as opposed to cold migration which can happen between datacenters).

Select Datacenter

Select Host

Select Datastore

Notice that any datastore accessible from the destination host can be selected.

And that’s it. vMotion begins and I have easily live migrated a VM from one host to another without any shared storage. Cool! :)

setup2

vSphere client does not depend on the Inventory Service

On numerous occasions I have noticed my vSphere client always has the correct inventory of objects in vCenter whereas the vSphere web client tends to lag behind. While reading Mastering VMware vSphere 5.5 (a great book if you want to really understand how all this works!) I learnt that that’s because the web client depends on the vCenter Inventory Service as a cache between the web client and vCenter whereas the regular client talks to it directly.

The vCenter Inventory Service does two things – one, it caches inventory objects from vCenter so that each time the web client needs something it doesn’t have to ask vCenter (thus reducing load on vCenter); two, it allows for tags. Having the Inventory Service allows for more web client sessions with less load on vCenter server.

This also means it’s a good idea to place the Inventory Service with the web client, not with the vCenter server.

vCenter and vSphere editions (5.5)

vCenter editions. Just three.

  • Essentials
  • Foundation
  • Standard

Standard is what you usually want. No limits or restrictions.

Essentials is only available when purchased as part of vSphere Essentials or vSphere Essentials Plus kits. Not sold separately. These kits are targeted for SMBs. Limited to 3 hosts of 2 CPUs each. Self-contained – cannot be used with other editions.

Foundation is also for 3 hosts only.

All editions of vCenter include the Management service, SSO, Inventory service, Orchestrator, Web client – everything. There’s no difference in the components included in each edition.

vSphere is the suite. There are three plus two editions of the vSphere suite.

Two editions are the kits:

  • Essentials
  • Essentials Plus

Three editions are bundled with vCenter Operations Manager:

  • Standard
  • Enterprise
  • Enterprise Plus

The Essentials & Essentials Plus editions only work with vCenter Essentials. The Standard, Enterprise, and Enterprise Plus work with vCenter Foundation or Standard.

Essentials is pretty basic. Remember it is for 3 hosts of 2 CPUs each. Standalone. In addition you don’t get features like vMotion either. All you get is (1) Thin Provisioning, (2) Update Manager, and (3) vStorage APIs for Data Protection (VADP). Note the latter is only APIs. It is not the VMware solution vSphere Data Protection (VDP). Also, no VSAN.

Essentials Plus is a bit more than basic. Once again, only for 3 hosts of 2 CPUs each. Standalone. However, in addition to the three features above you also get (4) vSphere Data Protection, (5) High Availability (HA), (6) vMotion, and (7) vSphere Replication. So you get some useful features. In fact, if I had just 3 hosts and I am unlikely to expand further this is the option I would go for – for me vMotion is very useful and so is HA. Sadly, no Distributed Resource Scheduling (DRS). But you do get VSAN.

Moving on to the big boys …

Standard gives you all the above plus useful features like (8) Storage vMotion, (9) Fault Tolerance, and some more (Hot Add & vShield Endpoint). Still no DRS.

Enterprise gives you all the above plus (10) Storage APIs for Array Integration (nice! but useful only in an Enterprise context where you are likely to have a SAN array and need something like this), (11) DRS, (12) DPM, and (13) Storage APIs for Multi-pathing. As expected, features that are more useful when you have a lot of hosts and are in an Enterprise-y setup. Except DRS :) which would have been nice to have in Standard/ Essentials Plus too.

Finally, Enterprise Plus. All the above plus (14) Distributed Switches, (15) Host Profiles, (16) Auto Deploy, (17) Storage DRS – four of my favorite features – and a bunch of others like App HA, Storage IO Control, Network IO Control, etc.
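Since each edition adds to the one before it, the list above can be jotted down as a small lookup. This is just my summary of the features named in this post, not VMware's full feature matrix, and the function names are made up.

```python
# Features each edition ADDS on top of the previous one, as listed above.
EDITIONS = [
    ("Essentials", ["Thin Provisioning", "Update Manager",
                    "vStorage APIs for Data Protection"]),
    ("Essentials Plus", ["vSphere Data Protection", "High Availability",
                         "vMotion", "vSphere Replication"]),
    ("Standard", ["Storage vMotion", "Fault Tolerance", "Hot Add",
                  "vShield Endpoint"]),
    ("Enterprise", ["Storage APIs for Array Integration", "DRS", "DPM",
                    "Storage APIs for Multi-pathing"]),
    ("Enterprise Plus", ["Distributed Switches", "Host Profiles",
                         "Auto Deploy", "Storage DRS"]),
]


def features(edition):
    """Return every feature available in `edition`, lower editions included."""
    result = []
    for name, added in EDITIONS:
        result.extend(added)
        if name == edition:
            return result
    raise ValueError("unknown edition: " + edition)


def lowest_edition_with(feature):
    """Return the first edition that includes `feature`, or None."""
    for name, added in EDITIONS:
        if feature in added:
            return name
    return None


print(lowest_edition_with("vMotion"))
print(lowest_edition_with("DRS"))
```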

vCenter – Cannot load the users for the selected domain

I spent the better part of today evening trying to sort this issue. But didn’t get anywhere. I don’t want to forget the stuff I learnt while troubleshooting so here’s a blog post.

Today evening I added one of my ESXi hosts to my domain. The other two wouldn’t add, until I discovered that the time on those two hosts was out of sync. I spent some time trying to troubleshoot that but didn’t get anywhere. The NTP client on these hosts was running, the ports were open, the DC (which was also the forest PDC and hence the time keeper) was reachable – but time was still out of sync.

Found an informative VMware KB article. The ntpq command (short for “NTP query”) can be used to see the status of NTP daemon on the client side. Like thus:

The command has an interactive mode (which you get into if run without any switches; read the manpage for more info). The -p switch tells ntpq to output a list of peers and their state. The KB article above suggests running this command every 2 seconds using the watch command but you don’t really need to do that.

Important points about the output of this command:

  • If it says “No association ID's returned” it means the ESXi host cannot reach the NTP server. Considering I didn’t get that, it means I have no connectivity issue.
  • If it says “***Request timed out” it means the response from the NTP server didn’t get through. That’s not my problem either.
  • If there’s an asterisk before the remote server name (like so) it means there is a huge gap between the time on the host and the time given by the NTP server. Because of the huge gap NTP is not changing the time (to avoid any issues caused by a sudden jump in the OS time). Manually restarting the NTP daemon (/etc/init.d/ntpd restart) should sort it out.
    • The output above doesn’t show it but one of my problem hosts had an asterisk. Restarting the daemon didn’t help.

The refid field shows the time stream to which the client is syncing. For instance here’s the w32tm output from my domain:

Notice the PDC has a refid of LOCL (indicating it is its own time source) while the rest have a refid of the PDC name. My ESXi host has a refid of .INIT. which means it has not received any response from the NTP server (shouldn’t the error message have been something else!?). So that’s the problem in my case.
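If you have many hosts to check, spotting a refid of .INIT. can be scripted. The sketch below parses the text printed by `ntpq -p` and flags peers that have never answered: those with a refid of .INIT., or with a reach value of 0 (reach is an octal bitmap of the last eight polls, so 0 means none were answered). The sample output is made up for illustration and is not from my hosts.

```python
def unresponsive_peers(ntpq_output):
    """Return the remotes that have not answered, based on `ntpq -p` text."""
    bad = []
    for line in ntpq_output.splitlines()[2:]:  # skip the two header lines
        if not line.strip():
            continue
        # Column 1 is a one-character status code (a space when there is none).
        fields = line[1:].split()
        remote, refid, reach = fields[0], fields[1], fields[6]
        if refid == ".INIT." or int(reach, 8) == 0:
            bad.append(remote)
    return bad


# Made-up sample in the layout ntpq -p uses.
SAMPLE = "\n".join([
    "     remote           refid      st t when poll reach   delay   offset  jitter",
    "==============================================================================",
    " dc01.example.tld .INIT.          16 u    -   64    0    0.000    0.000   0.000",
    " dc02.example.tld .LOCL.           1 u   33   64  377    0.512    1.234   0.210",
])

print(unresponsive_peers(SAMPLE))
```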

Obviously the PDC is working because all my Windows machines are keeping correct time from it. So is vCenter. But some of my ESXi hosts aren’t.

I have no idea what’s wrong. After some troubleshooting I left it because that’s when I discovered my domain had some inconsistencies. Fixing those took a while, after which I hit upon a new problem – vCenter clients wouldn’t show me vCenter or any hosts when I login with my domain accounts. Everything appears as expected under the administrator@vsphere.local account but the domain accounts return a blank.

While double-checking that the domain admin accounts still have permissions to vCenter and SSO I came across the following error:

Cannot load the users

Great! (The message is “Cannot load the users for the selected domain“).

I am using the vCenter appliance. Digging through the /var/log/messages on this I found the following entries:

Searched Google a bit but couldn’t find any resolutions. Many blog posts suggested removing vCenter from the domain and re-adding but that didn’t help. Some blog posts (and a VMware KB article) talk about ensuring reverse PTR records exist for the DCs – they do in my case. So I am drawing a blank here.

Odd thing is the appliance is correctly connected to the domain and can read the DCs and get a list of users. The appliance uses Likewise (now called PowerBroker Open) to join itself to the domain and authenticate with it. The /opt/likewise/bin directory has a bunch of commands which I used to verify domain connectivity:

All looks well! In fact, I added a user to my domain and when I re-ran the lw-enum-users command it correctly picked up the new user. So the appliance can definitely see my domain and get a list of users from it. The problem appears to be in the upper layers.

In /var/log/vmware/sso/ssoAdminServer.log I found the following each time I’d query the domain for users via the SSO section in the web client:

Makes no sense to me but the problem looks to be in Java/ SSO.

I tried removing AD from the list of identity sources in SSO (in the web client) and re-added it. No luck.

Tried re-adding AD but this time I used an SPN account instead of the machine account. No luck!

Finally I tried adding AD as an LDAP Server just to see if I can get it working somehow – and that clicked! :)

AD as LDAP

So while I didn’t really solve the problem I managed to work around it …

Update: Added the rest of my DCs as time sources to the ESXi hosts and restarted the ntpd service. Maybe that helped, now NTP is working on the hosts.

 

vSphere 5.5 Maximums

This document contains all the vSphere 5.5 maximums. Here are some of the figures for my quick reference:

Hosts per vCenter Server (Appliance) (embedded vPostgres database): 100
VMs per vCenter Server (Appliance) (embedded vPostgres database): 3000
Hosts per vCenter Server (Appliance) (Oracle database): 1000
VMs per vCenter Server (Appliance) (Oracle database): 10000

Hosts per vCenter Server (Windows) (bundled SQL Server Express database): 5
VMs per vCenter Server (Windows) (bundled SQL Server Express database): 50
Hosts per vCenter Server (Windows) (external database): 1000
VMs per vCenter Server (Windows) (external database): 10000

So the Windows install with inbuilt database is the lowest of the lot. You are better off going with the appliance (which has its own limitations of course).

Maximums of appliance and Windows server are the same as long as they use an external database. But appliance can only use Oracle as an external database while Windows server can use SQL too.

VMware/ PowerCLI – find disks used by a template

It’s easy finding the disks used by a VM. Just check its settings via the client or use PowerCLI.

Can’t do the same for a template though as the Get-Template output has no similar property.

But then I came across Get-HardDisk:

Sweet! Same command works for templates (as above) and VMs:

That’s all. Hope this helps someone.

Misc ESXi/ vSphere stuff

Just some notes to myself so I can refer to this later.

  • You can only have a maximum of 256 VMFS datastores per ESXi host. (This is one reason why you wouldn’t want to create a LUN/ datastore per VM. Wouldn’t work if you have a lot of VMs!)
    • Other maximums (for vSphere 5.5) can be found at this link.
  • When you create distributed switch port group there are 3 port binding options:
    • Static Binding (the default): VM NICs are connected to the port group at VM creation and remain so until the VM is removed from the port group. Powering off a VM or disconnecting the NIC from the port group does not remove it from the port group – the port is still kept aside for the VM. What this means is that once you connect a VM to a port it stays with that forever.
      • Since the ports are assigned at VM creation, even if vCenter is down when the VM later powers on/ connects to the port group, it will continue to have network connectivity. (Note the emphasis on “later”. If the VM were already running and vCenter were to go down network traffic isn’t affected in any of the binding options).
    • Dynamic Binding (deprecated): VM NICs are connected to the port group only when the VM is powered on and its NIC connected to the port group. Power off the VM or disconnect the NIC and it is no longer connected to the same port when it comes back on or is reconnected.
      • Since the port binding happens only when the VM is powered on or connected, and the port group resides with vCenter, what this means is that you can only power on / off such VMs via vCenter. If vCenter is off / unreachable when the VM powers on / connects, it will not have network connectivity as it won’t have a port in the port group. (As above, note that this doesn’t affect VMs that are already running).
      • Dynamic Binding is deprecated but is useful when the number of VMs is larger than the number of ports in the port group and not all VMs will be on / connected at the same time.
    • Ephemeral Binding: Similar to Dynamic Binding, VM NICs are connected to the port group only when the VM is powered on and its NIC connected to the port group. Powering off the VM or disconnecting it results in the port being removed from the port group.
      • Although Dynamic and Ephemeral Bindings seem similar, they don’t have similar limitations. Thus while VMs with Dynamic Binding port groups won’t have network connectivity if they are powered on / connected when vCenter is off / unreachable, VMs with Ephemeral Binding have no such limitation. They don’t get a proper port number from the port group, but get a temporary one like h-1 which changes to a proper port number whenever connectivity with vCenter is restored.
      • Below screenshot shows the port numbers of three VMs, each connected to a port group of different binding (Ephemeral, Dynamic, Static from top to bottom) and powered on when the vCenter was unreachable.
      • As with Dynamic Binding, the NIC is unable to get a proper port from the port group, but with an Ephemeral Binding port group the host creates a fake (temporary) port and connects the VM anyway.
      • I don’t understand why Dynamic Binding even exists as an option, unless it’s for backward compatibility? Ephemeral Binding seems to have the advantage of Dynamic Binding (ports are created at VM connection / powering on, so you can oversubscribe a port group) but doesn’t have the disadvantage of lost connectivity when vCenter is off / unreachable. (I assume Ephemeral port groups too can be used for oversubscribing, though the official KB articles don’t say anything like this so I could be wrong).
      • Dynamically creating / removing ports from the port group is an expensive operation so Dynamic and Ephemeral Binding port groups have a performance overhead. Static Binding is the preferred one.
      • Also, Ephemeral Binding port groups lose their history and security controls across host reboots. Apparently Dynamic Binding port groups don’t do this as I don’t see any mention of this as a Dynamic Binding limitation anywhere.
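To keep the three bindings straight in my head, here is a minimal Python sketch of what a VM NIC ends up with at power-on. It is a toy model and not how ESXi implements any of this: the function name, the `reserved_port` and `next_free_port` values are all made up for illustration, and only the temporary `h-1` style name comes from the notes above.

```python
def port_at_power_on(binding, vcenter_up, reserved_port=None, next_free_port=0):
    """Return the port a VM NIC ends up on at power-on, or None for no port.

    Toy model of the notes above. reserved_port and next_free_port are
    made-up illustrative values, not real dvPort IDs.
    """
    if binding == "static":
        # The port was set aside when the VM joined the port group,
        # so vCenter being down at power-on does not matter.
        return reserved_port
    if binding == "dynamic":
        # The port is handed out at power-on and needs vCenter.
        return next_free_port if vcenter_up else None
    if binding == "ephemeral":
        # The host creates a temporary port when vCenter is unreachable.
        return next_free_port if vcenter_up else "h-1"
    raise ValueError(f"unknown binding: {binding}")


# Same scenario as the screenshot: power on while vCenter is unreachable.
for binding in ("ephemeral", "dynamic", "static"):
    port = port_at_power_on(binding, vcenter_up=False, reserved_port=7)
    print(binding, "->", port)
```

The only case that returns no port at all is Dynamic Binding with vCenter down, which is the limitation described above.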

That’s all for now!

 

Load balancing in vCenter and ESXi

One of the things you can do with a portgroup is define teaming for the underlying physical NICs.

teaming

If you don’t do anything here, the default setting of “Route based on originating virtual port” applies. What this does is quite obvious. Each virtual port on the virtual switch is mapped to a physical NIC behind the scenes; so all traffic to & from that virtual port goes & comes via that physical NIC. Since your virtual NIC connects to a virtual port this is equivalent to saying all traffic for that virtual NIC happens via a particular physical NIC.

In the screenshot above, for instance, I have two physical NICs dvUplink1 and dvUplink2. If I left teaming at the default setting and say I had 4 VMs connecting to 4 virtual ports, chances are two of these VMs will use dvUplink1 and two will use dvUplink2. They will continue using these mappings until one of the dvUplinks dies, in which case the other will take over – so that’s how you get failover.

This is pretty straightforward and easy to set up. And the only disadvantage, if any, is that you are limited to the bandwidth of a single physical NIC. If each of dvUplink1 & dvUplink2 were 1Gb NICs it isn’t as though the underlying VMs had 2Gb (2 NICs x 1Gb each) available to them. Since each VM is mapped to one uplink, 1Gb is all they get.

Moreover, if say two VMs were mapped to an uplink, and one of them was hogging up all the bandwidth of this uplink while the remaining uplink was relatively free, the other VM on this uplink won’t automatically be mapped to the free uplink to make better use of resources. So that’s a bummer too.

A neat thing about “Route based on originating virtual port” is that the virtual port is fixed for the lifetime of the virtual machine so the host doesn’t have to calculate which physical NIC to use each time it receives traffic to & from the virtual machine. Only if the virtual machine is powered off, deleted, or moved to a different host does it get a new virtual port.
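Here is a rough Python sketch of the idea, using the four VMs and two uplinks from the example above. I am assuming the mapping is a simple "port ID modulo number of active uplinks"; that is a common way of describing it, but treat the exact formula (and the port IDs 0 to 3) as assumptions rather than what the host really does.

```python
def uplink_for_port(port_id, active_uplinks):
    """Pick one uplink per virtual port (assumed: port ID modulo uplink count)."""
    return active_uplinks[port_id % len(active_uplinks)]


uplinks = ["dvUplink1", "dvUplink2"]

# Four VMs on virtual ports 0..3 (hypothetical port IDs).
mapping = {port: uplink_for_port(port, uplinks) for port in range(4)}
print(mapping)

# Failover: dvUplink1 dies, so every port is remapped to the survivor.
after_failure = {port: uplink_for_port(port, ["dvUplink2"]) for port in range(4)}
print(after_failure)
```

The mapping depends only on the port and the list of active uplinks, never on how busy an uplink is, which is why a VM stays on a loaded uplink even when the other one is idle.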

The other options are:

  • Route based on MAC hash
  • Route based on IP hash
  • Route based on physical NIC load
  • Explicit failover

We’ll ignore the last one for now; that just tells the host to always use the first active physical NIC in the failover order, for all VMs.

“Route based on MAC hash” is similar to “Route based on originating virtual port” except that it uses the MAC address of the virtual NIC instead of the virtual port. I am not very clear on how this is better than the latter. Since the MAC address of a virtual machine is usually constant (unless it is changed or a different virtual NIC used) all traffic from that MAC address will use the same physical NIC always. Moreover, there is the additional overhead in that the host has to check each packet for the MAC address and decide which physical NIC to use. VMware documentation says it provides a more even distribution of traffic but I am not clear how.

“Route based on physical NIC load” is a good one. It starts off with “Route based on originating virtual port” but if a physical NIC is loaded, then the virtual ports mapped to it are moved to a physical NIC with less load! This load balancing option is only available for distributed switches. Every 30s the distributed switch checks the physical NIC load and if it exceeds 75% then the virtual port of the VM with highest utilization is moved to a different physical NIC. So you have the advantages of “Route based on originating virtual port” with one of its major disadvantages removed.
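A minimal Python sketch of one such check cycle, under my own assumptions: the 75% threshold comes from the text above, but the function, the VM names, the Mbps figures and the "move to the least loaded uplink" choice are made up for illustration. The real switch also does this on a 30s timer, which I leave out.

```python
def rebalance(port_to_uplink, port_load_mbps, capacity_mbps, threshold=0.75):
    """Run one check cycle and return the new port-to-uplink mapping.

    For every uplink above the threshold, the busiest port on it is moved
    to the currently least utilized uplink (an assumed tie-break rule).
    """
    new_map = dict(port_to_uplink)
    load = {uplink: 0.0 for uplink in capacity_mbps}
    for port, uplink in new_map.items():
        load[uplink] += port_load_mbps[port]

    for uplink in list(capacity_mbps):
        if load[uplink] / capacity_mbps[uplink] <= threshold:
            continue
        ports = [p for p, u in new_map.items() if u == uplink]
        busiest = max(ports, key=lambda p: port_load_mbps[p])
        target = min(load, key=lambda u: load[u] / capacity_mbps[u])
        if target != uplink:
            new_map[busiest] = target
            load[uplink] -= port_load_mbps[busiest]
            load[target] += port_load_mbps[busiest]
    return new_map


# Two 1Gb uplinks; vm1 and vm2 share dvUplink1 and push it to 90%.
before = {"vm1": "dvUplink1", "vm2": "dvUplink1", "vm3": "dvUplink2"}
loads = {"vm1": 600.0, "vm2": 300.0, "vm3": 100.0}
capacity = {"dvUplink1": 1000.0, "dvUplink2": 1000.0}

after = rebalance(before, loads, capacity)
print(after)
```

In this run dvUplink1 is above the threshold, so the busiest VM on it (vm1) is moved to dvUplink2 and vm2 gets the uplink to itself.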

In fact, except for “Route based on IP hash” none of the other load balancing mechanisms have an option to utilize more than a single physical NIC bandwidth. And “Route based on IP hash” does not do this entirely as you would expect.

“Route based on IP hash”, as the name suggests, does load balancing based on a hash of the IP addresses of the virtual machine and the remote end it is communicating with. All traffic between these two IPs is sent through one NIC. So if a virtual machine is communicating with two remote servers, it is quite likely that traffic to one server goes through one physical NIC while traffic to the other goes via another physical NIC, thus allowing the virtual machine to use more bandwidth than that of one physical NIC. However (and this is an often overlooked point) all traffic between the virtual machine and one remote server is still constrained by the bandwidth of the physical NIC it happens via. Once traffic is mapped to a particular physical NIC, if more bandwidth is required or the physical NIC is loaded, it is not as though an additional physical NIC is used. This is a catch with “Route based on IP hash” that’s worth remembering.
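To make the "one flow, one NIC" point concrete, here is a small Python sketch. I am assuming the hash is an XOR of the two IP addresses taken modulo the number of uplinks, which is how it is often described; the exact hash the host uses, and the IP addresses below, are assumptions for illustration only.

```python
import ipaddress


def uplink_index(src_ip, dst_ip, uplink_count):
    """Pick an uplink index for a source/destination IP pair.

    Assumed hash: XOR of the two addresses, modulo the number of uplinks.
    """
    src = int(ipaddress.ip_address(src_ip))
    dst = int(ipaddress.ip_address(dst_ip))
    return (src ^ dst) % uplink_count


vm_ip = "10.0.0.10"  # hypothetical VM address
for server_ip in ("10.0.0.20", "10.0.0.21"):  # hypothetical remote servers
    print(vm_ip, "->", server_ip, "uses uplink", uplink_index(vm_ip, server_ip, 2))
```

With these made-up addresses the two conversations land on different uplinks, but every packet between the VM and any one server always hashes to the same uplink, so a single conversation never gets more than one NIC’s worth of bandwidth.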

If you select “Route based on IP hash” as a load balancing option you get two warnings:

  • With IP hash load balancing policy, all physical switch ports connected to the active uplinks must be in link aggregation mode.
  • IP hash load balancing should be set for all port groups using the same set of uplinks.

What this means is that unlike the other load balancing schemes where there was no additional configuration required on the physical NICs or the switch(es) they connect to, with “Route based on IP hash” we must combine/ bond/ aggregate the physical NICs as one. There’s a reason for this.

In all the other load balancing options the virtual NIC MAC is associated with one physical NIC (and hence one physical port on the physical switch). So incoming traffic for a VM knows which physical port/ physical NIC to go via. But with “Route based on IP hash” there is no such one to one mapping. This causes havoc with the physical switch. Here’s what happens:

  • Different outgoing traffic flows choose different physical NICs. With each of these packets the physical switch will keep updating its MAC address table with the port the packet arrived on. So for instance, say the two physical NICs are connected to physical switch Port1 and Port2 and the virtual NIC MAC address is VMAC1. When an outgoing packet goes via the first physical NIC, the switch will update its tables to reflect that VMAC1 is connected to Port1. Subsequent traffic flows might continue using the first physical NIC so all is well. Then say a traffic flow uses the second physical NIC. Now the switch will map VMAC1 to Port2; then a traffic flow could use Port1 so the mapping gets changed to Port1, and then Port2, and so on …
  • When incoming traffic hits the physical switch for MAC address VMAC1, the switch will look up its tables and decide which port to send traffic on. If the current mapping is Port1 traffic will go out via that; if the current mapping is Port2 traffic will go out via that. The important thing to note is that the incoming traffic flow port chosen is not based on the IP hash mapping – it is purely based on whatever physical port the switch currently has mapped for VMAC1.
  • So what’s required is a way of telling the physical switch that the two physical NICs are to be considered as bonded/ aggregated such that traffic from either of those NICs/ ports is to be treated accordingly. And that’s what EtherChannel does. It tells the physical switch that the two ports/ physical NICs are bonded and that it must route incoming traffic to these ports based on an IP hash (which we must tell EtherChannel to use while configuring it).
  • EtherChannel also helps with the MAC address table: the switch now treats the bonded ports as one logical link, so VMAC1 is reachable via both Port1 and Port2 instead of the two mappings over-writing each other!
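The flapping described in the list above can be sketched in a few lines of Python. This is a toy learning switch, not any real switch’s behaviour: the function names are mine, and `Po1` is a hypothetical name for the logical bonded interface.

```python
def learn(mac_table, src_mac, ingress_port, bundles=None):
    """Record where src_mac was last seen, as a learning switch would.

    bundles maps physical ports to a logical bonded interface; a port
    that is not in a bundle is learned as itself.
    """
    bundles = bundles or {}
    mac_table[src_mac] = bundles.get(ingress_port, ingress_port)
    return mac_table[src_mac]


def count_moves(ingress_ports, bundles=None):
    """Return (number of times VMAC1 changed place, final table entry)."""
    table, moves, previous = {}, 0, None
    for port in ingress_ports:
        current = learn(table, "VMAC1", port, bundles)
        if previous is not None and current != previous:
            moves += 1
        previous = current
    return moves, table["VMAC1"]


# Outgoing flows from one VM alternate between the two physical NICs.
flows = ["Port1", "Port2", "Port1", "Port2"]

print(count_moves(flows))                                    # no bonding
print(count_moves(flows, {"Port1": "Po1", "Port2": "Po1"}))  # ports bonded
```

Without bonding the entry for VMAC1 keeps moving between Port1 and Port2; once both ports belong to the same logical link the entry stays put.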

“Route based on IP hash” is a complicated load balancing option to implement because of EtherChannel. And as I mentioned above, while it does allow a virtual machine to use more bandwidth than a single physical NIC, an individual traffic flow is still limited to the bandwidth of a single physical NIC. Moreover there is more overhead on the host because it has to calculate the physical NIC used for each traffic flow (essentially each packet).

Prior to vCenter 5.1 only static EtherChannel was supported (unless you use a third party virtual switch such as the Cisco Nexus 1000V). Static EtherChannel means you explicitly bond the physical NICs. But from vCenter 5.1 onwards the inbuilt distributed switch supports LACP (Link Aggregation Control Protocol) which is a way of automatically bonding physical NICs. Enable LACP on both the physical switch and distributed switch and the physical NICs will automatically be bonded.

(To enable LACP on the physical NICs go to the uplink portgroup that these physical NICs are connected to and enable LACP).

lacp

That’s it for now!

Update

Came across this blog post which covers pretty much everything I covered above but in much greater detail. A must read!

VCSA: Unable to connect to server. Please try again.

Most likely you set the VCSA to regenerate its certificates upon reboot and forgot to uncheck it after the reboot. (It’s under Admin > Certificate Regeneration Enabled). So each time you reboot, the VCSA gets a new certificate and your browser throws the above error.

Fix is to refresh (Ctrl+F5 in Firefox) the page so the new certificate is fetched and you get a prompt about it.

vCenter appliance stuck on “Requesting Information” when joining domain

I tried joining my freshly installed vCenter Server appliance to the domain and it was stuck on “Requesting information”. Rebooted, tried again, same results.

Then while rebooting I looked at the console to see if there were any error messages. Noticed some messages about certificate errors and SSL handshakes failing. Then I remembered I had changed the appliance name from the default to something else, and also assigned a static IP etc. The renaming went fine, but could it be that certificates had to be regenerated manually with the new name?

Sure enough, yes! In the appliance go to Admin and change the radio button for “Certificate regeneration enabled” to Yes. Reboot, and now you’ll see messages on the console indicating that certificates are being regenerated due to a name change. Log in, try joining to the domain, and now it works!