rakhesh sasidharan's mostly techie somewhat purpley blog

I spent the better part of this evening trying to sort out this issue, but didn’t get anywhere. I don’t want to forget what I learnt while troubleshooting, so here’s a blog post.

This evening I added one of my ESXi hosts to my domain. The other two wouldn’t join, until I discovered that the time on those two hosts was out of sync. I spent some time trying to troubleshoot that but didn’t get anywhere. The NTP client on these hosts was running, the ports were open, and the DC (which is also the forest PDC and hence the time keeper) was reachable, but time was still out of sync.

Found an informative VMware KB article. The ntpq command (short for “NTP query”) can be used to see the status of the NTP daemon on the client side, like so:

~ # ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 win-dc01.rakhes .INIT.          16 -    - 1024    0    0.000    0.000   0.000

The command has an interactive mode (which you get into if you run it without any switches; read the manpage for more info). The -p switch tells ntpq to output a list of peers and their state. The KB article above suggests running this command every 2 seconds using the watch command, but you don’t really need to do that.

Important points about the output of this command:

If it says “No association ID's returned” it means the ESXi host cannot reach the NTP server. Considering I didn’t get that, it means I have no connectivity issue.
If it says “***Request timed out” it means the response from the NTP server didn’t get through. That’s not my problem either.
If there’s an asterisk before the remote server name, my understanding is that there is a huge gap between the time on the host and the time given by the NTP server. Because of the huge gap NTP is not changing the time (to avoid any issues caused by a sudden jump in the OS time). Manually restarting the NTP daemon (/etc/init.d/ntpd restart) should sort it out. (The ntpq documentation itself describes the asterisk as marking the peer currently selected as the synchronization source, so take this reading with a pinch of salt.)
- The output above doesn’t show it but one of my problem hosts had an asterisk. Restarting the daemon didn’t help.
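The cases above can be turned into a small check. This is only a sketch in Python, assuming the raw text of ntpq -p has been captured into a string with its leading column intact. The function name diagnose_ntpq and the wording of the returned messages are my own, not something ntpq or the KB article provides.

```python
def diagnose_ntpq(output: str) -> str:
    """Map raw `ntpq -p` output to one of the cases listed above."""
    if "No association ID's returned" in output:
        return "cannot reach NTP server"
    if "Request timed out" in output:
        return "response did not get through"

    # Peer rows are the non-empty lines after the ===== separator.
    rows = []
    past_separator = False
    for line in output.splitlines():
        if line.startswith("===="):
            past_separator = True
        elif past_separator and line.strip():
            rows.append(line)

    if not rows:
        return "no peers listed"
    for row in rows:
        # Column 0 is the one-character marker; the rest is whitespace separated.
        fields = row[1:].split()
        if len(fields) > 1 and fields[1] == ".INIT.":
            return "no response received yet (refid .INIT.)"
    if any(row[0] == "*" for row in rows):
        return "a peer is marked with an asterisk"
    return "peers listed, none marked with an asterisk"


SAMPLE = """\
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 win-dc01.rakhes .INIT.          16 -    - 1024    0    0.000    0.000   0.000
"""

print(diagnose_ntpq(SAMPLE))
```

Run against the output from my problem host, it reports the .INIT. case rather than either of the two error messages.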

The refid field shows the time stream to which the client is syncing. For instance here’s the w32tm output from my domain:

C:\Windows\system32>w32tm /monitor /domain:rakhesh.local
WIN-DC01.rakhesh.local *** PDC ***[10.50.0.20:123]:
    ICMP: 0ms delay
    NTP: +0.0000000s offset from WIN-DC01.rakhesh.local
        RefID: 'LOCL' [0x4C434F4C]
        Stratum: 1
WIN-DC02.rakhesh.local[10.50.1.21:123]:
    ICMP: 1ms delay
    NTP: +0.0127058s offset from WIN-DC01.rakhesh.local
        RefID: WIN-DC01.rakhesh.local [10.50.0.20]
        Stratum: 2
WIN-DC03.rakhesh.local[10.50.0.22:123]:
    ICMP: 1ms delay
    NTP: +0.0183887s offset from WIN-DC01.rakhesh.local
        RefID: WIN-DC01.rakhesh.local [10.50.0.20]
        Stratum: 2

Warning:
Reverse name resolution is best effort. It may not be
correct since RefID field in time packets differs across
NTP implementations and may not be using IP addresses.

Notice the PDC has a refid of LOCL (indicating it is its own time source) while the rest have a refid of the PDC name. My ESXi host has a refid of .INIT. which means it has not received any response from the NTP server (shouldn’t the error message have been something else!?). So that’s the problem in my case.
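As an aside, the hex value w32tm prints next to 'LOCL' is just those four characters packed into an integer. The sketch below assumes the bytes are stored least-significant first, which is what the sample output above suggests (0x4C434F4C read backwards byte by byte spells LOCL); the function name is made up for illustration.

```python
def decode_w32tm_refid(value: int) -> str:
    """Turn the hex RefID printed by w32tm into its four ASCII characters.

    Assumption: the integer holds the characters in little-endian order,
    as inferred from the 'LOCL' [0x4C434F4C] pair in the output above.
    """
    return value.to_bytes(4, "little").decode("ascii")


print(decode_w32tm_refid(0x4C434F4C))  # LOCL
```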

Obviously the PDC is working because all my Windows machines are keeping correct time from it. So is vCenter. But some of my ESXi hosts aren’t.

vcenter01:/ # ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 win-dc01.rakhes .LOCL.           1 u   57   64    1    0.440  643.973   0.001

esx01:~ # ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 win-dc01.rakhes .LOCL.           1 u    8   64  377    1.364  14883.9   2.077
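Two columns in these rows are easy to misread. The reach column is an octal bitmask of the last eight polls (377 means all eight got a reply), and delay, offset and jitter are in milliseconds. Here is a rough Python sketch that pulls those fields out of a single peer row; the Peer class and function names are mine, and it assumes the row still has its leading one-character marker column.

```python
from dataclasses import dataclass


@dataclass
class Peer:
    tally: str        # one-character marker in column 0 (space if none)
    remote: str
    refid: str
    stratum: int
    reach: int        # parsed from octal
    offset_ms: float


def parse_peer(row: str) -> Peer:
    """Parse one peer row of `ntpq -p` output."""
    tally, fields = row[0], row[1:].split()
    # Field order: remote refid st t when poll reach delay offset jitter
    return Peer(
        tally=tally,
        remote=fields[0],
        refid=fields[1],
        stratum=int(fields[2]),
        reach=int(fields[6], 8),
        offset_ms=float(fields[8]),
    )


def successful_polls(peer: Peer) -> int:
    """Count how many of the last eight polls were answered."""
    return bin(peer.reach).count("1")


ESX01_ROW = " win-dc01.rakhes .LOCL.           1 u    8   64  377    1.364  14883.9   2.077"
peer = parse_peer(ESX01_ROW)
print(peer.remote, successful_polls(peer), peer.offset_ms / 1000)
```

For esx01 this gives eight answered polls out of eight and an offset of roughly 14.9 seconds, so the host is clearly hearing from the DC even though its clock is well off.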

I have no idea what’s wrong. After some troubleshooting I left it because that’s when I discovered my domain had some inconsistencies. Fixing those took a while, after which I hit upon a new problem: vCenter clients wouldn’t show me vCenter or any hosts when I log in with my domain accounts. Everything appears as expected under the administrator@vsphere.local account but the domain accounts return a blank.

While double-checking that the domain admin accounts still have permissions to vCenter and SSO I came across the following error:

Great! (The message is “Cannot load the users for the selected domain”).

I am using the vCenter appliance. Digging through the /var/log/messages on this I found the following entries:

2015-07-25T18:28:22+00:00 vcenter01 lsassd[3216]: 0x7f29703b2700:Failed to find user, group, or domain by sid (sid = 'S-1-5-21-1753560075-2253055039-199176374', searched host = 'WIN-DC01.rakhesh.local') -> error = 40071, symbol = LW_ERROR_NO_SUCH_OBJECT
2015-07-25T18:28:30+00:00 vcenter01 vmware-idm: GSSAPI Error: Unspecified GSS failure.  Minor code may provide more information (Message stream modified)
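To make the first entry easier to read, here is a small Python sketch that pulls the SID, the host searched and the error symbol out of an lsassd line like the one above. The regular expression is written against that one sample line only, so treat it as an assumption about the format rather than a documented log layout. The helper looks_like_domain_sid is also my own: it relies on the usual convention that a domain SID is S-1-5-21 followed by three sub-authorities, with user and group SIDs adding a relative ID after that.

```python
import re

# Pattern fitted to the single lsassd line shown above (an assumption, not a spec).
LSASSD_SID_ERROR = re.compile(
    r"sid = '(?P<sid>[^']+)', searched host = '(?P<host>[^']+)'\)"
    r" -> error = (?P<code>\d+), symbol = (?P<symbol>\w+)"
)


def looks_like_domain_sid(sid: str) -> bool:
    """True if the SID is S-1-5-21 plus exactly three sub-authorities (no RID)."""
    parts = sid.split("-")
    return parts[:4] == ["S", "1", "5", "21"] and len(parts) == 7


LINE = (
    "2015-07-25T18:28:22+00:00 vcenter01 lsassd[3216]: 0x7f29703b2700:"
    "Failed to find user, group, or domain by sid "
    "(sid = 'S-1-5-21-1753560075-2253055039-199176374', "
    "searched host = 'WIN-DC01.rakhesh.local') "
    "-> error = 40071, symbol = LW_ERROR_NO_SUCH_OBJECT"
)

match = LSASSD_SID_ERROR.search(LINE)
if match:
    print(match["host"], match["symbol"], looks_like_domain_sid(match["sid"]))
```

If that reading of the SID is right, the failed lookup was for the domain SID itself rather than for a particular user or group.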

Searched Google a bit but couldn’t find any resolutions. Many blog posts suggested removing vCenter from the domain and re-adding but that didn’t help. Some blog posts (and a VMware KB article) talk about ensuring reverse PTR records exist for the DCs – they do in my case. So I am drawing a blank here.

Odd thing is the appliance is correctly connected to the domain and can read the DCs and get a list of users. The appliance uses Likewise (now called PowerBroker Open) to join itself to the domain and authenticate with it. The /opt/likewise/bin directory has a bunch of commands which I used to verify domain connectivity:

vcenter01:/opt/likewise/bin # ./lw-get-dc-list rakhesh.local
Got 3 DCs:
===========
DC 1: Name = 'win-dc01.rakhesh.local', Address = '10.50.0.20'
DC 2: Name = 'win-dc03.rakhesh.local', Address = '10.50.0.22'
DC 3: Name = 'win-dc02.rakhesh.local', Address = '10.50.1.21'

vcenter01:/opt/likewise/bin # ./lw-get-dc-name rakhesh.local
Printing LWNET_DC_INFO fields:
===============================
dwDomainControllerAddressType = 23
dwFlags = 62461
dwVersion = 5
wLMToken = 65535
wNTToken = 65535
pszDomainControllerName = WIN-DC01.rakhesh.local
pszDomainControllerAddress = 10.50.0.20
pucDomainGUID(hex) = 16 A2 83 65 D3 68 F4 48 81 45 01 16 20 42 CC DF
pszNetBIOSDomainName = RAXNET
pszFullyQualifiedDomainName = rakhesh.local
pszDnsForestName = rakhesh.local
pszDCSiteName = COCHIN
pszClientSiteName = COCHIN
pszNetBIOSHostName = WIN-DC01
pszUserName = <EMPTY>


vcenter01:/opt/likewise/bin # ./lw-enum-users
User info (Level-0):
====================
Name:              RAXNET\admin
Uid:               1204290036
Gid:               1204290049
Gecos:             <null>
Shell:             /bin/sh
Home dir:          /home/local/RAXNET/admin

User info (Level-0):
====================
Name:              RAXNET\guest
Uid:               1204290037
Gid:               1204290050
Gecos:             <null>
Shell:             /bin/sh
Home dir:          /home/local/RAXNET/guest

All looks well! In fact, when I added a user to my domain and re-ran the lw-enum-users command, it correctly picked up the new user. So the appliance can definitely see my domain and get a list of users from it. The problem appears to be in the upper layers.

In /var/log/vmware/sso/ssoAdminServer.log I found the following each time I’d query the domain for users via the SSO section in the web client:

[2015-07-25 19:45:49,422 pool-1-thread-5  INFO  com.vmware.identity.admin.vlsi.PrincipalDiscoveryServiceImpl] [User {Name: Administrator, Domain: VSPHERE.LOCAL} with role 'Administrator'] Find at most 200 principals by criteria searchString=, domain=rakhesh.local
[2015-07-25 19:45:49,466 pool-1-thread-5  ERROR com.vmware.identity.admin.server.ims.impl.PrincipalManagementImpl] Idm client exception
com.vmware.identity.idm.IDMException: Operations error
        at com.vmware.identity.idm.server.ServerUtils.getRemoteException(ServerUtils.java:134)
        at com.vmware.identity.idm.server.IdentityManager.find(IdentityManager.java:4104)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.lang.reflect.Method.invoke(Unknown Source)
        at sun.rmi.server.UnicastServerRef.dispatch(Unknown Source)
        at sun.rmi.transport.Transport$1.run(Unknown Source)
        at sun.rmi.transport.Transport$1.run(Unknown Source)
        at java.security.AccessController.doPrivileged(Native Method)
        at sun.rmi.transport.Transport.serviceCall(Unknown Source)
        at sun.rmi.transport.tcp.TCPTransport.handleMessages(Unknown Source)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(Unknown Source)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)
        at sun.rmi.transport.StreamRemoteCall.exceptionReceivedFromServer(Unknown Source)
        at sun.rmi.transport.StreamRemoteCall.executeCall(Unknown Source)
        at sun.rmi.server.UnicastRef.invoke(Unknown Source)
        at java.rmi.server.RemoteObjectInvocationHandler.invokeRemoteMethod(Unknown Source)
        at java.rmi.server.RemoteObjectInvocationHandler.invoke(Unknown Source)
        at com.sun.proxy.$Proxy69.find(Unknown Source)
        at com.vmware.identity.idm.client.CasIdmClient.find(CasIdmClient.java:1650)
        at com.vmware.identity.admin.server.ims.impl.PrincipalManagementImpl.find(PrincipalManagementImpl.java:493)
        at com.vmware.identity.admin.vlsi.PrincipalDiscoveryServiceImpl$11.call(PrincipalDiscoveryServiceImpl.java:341)
        at com.vmware.identity.admin.vlsi.PrincipalDiscoveryServiceImpl$11.call(PrincipalDiscoveryServiceImpl.java:330)
        at com.vmware.identity.admin.vlsi.util.VmodlEnhancer.invokeVmodlMethod(VmodlEnhancer.java:153)
        at com.vmware.identity.admin.vlsi.PrincipalDiscoveryServiceImpl.find(PrincipalDiscoveryServiceImpl.java:330)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.lang.reflect.Method.invoke(Unknown Source)
        at com.vmware.vim.vmomi.server.impl.InvocationTask.run(InvocationTask.java:76)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)
[2015-07-25 19:45:49,469 pool-1-thread-5  ERROR com.vmware.identity.admin.vlsi.PrincipalDiscoveryServiceImpl] Idm client exception: Operations error
com.vmware.identity.admin.server.ims.PrincipalManagementException: Idm client exception: Operations error
        at com.vmware.identity.admin.server.ims.impl.PrincipalManagementImpl.logAndThrow(PrincipalManagementImpl.java:2457)
        at com.vmware.identity.admin.server.ims.impl.PrincipalManagementImpl.find(PrincipalManagementImpl.java:495)
        at com.vmware.identity.admin.vlsi.PrincipalDiscoveryServiceImpl$11.call(PrincipalDiscoveryServiceImpl.java:341)
        at com.vmware.identity.admin.vlsi.PrincipalDiscoveryServiceImpl$11.call(PrincipalDiscoveryServiceImpl.java:330)
        at com.vmware.identity.admin.vlsi.util.VmodlEnhancer.invokeVmodlMethod(VmodlEnhancer.java:153)
        at com.vmware.identity.admin.vlsi.PrincipalDiscoveryServiceImpl.find(PrincipalDiscoveryServiceImpl.java:330)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.lang.reflect.Method.invoke(Unknown Source)
        at com.vmware.vim.vmomi.server.impl.InvocationTask.run(InvocationTask.java:76)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)
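Most of that output is stack frames. A quick way to see just the log entries and exception messages is to drop every line that starts with "at ". A minimal Python sketch, assuming the log text has been read into a string; the function name is my own.

```python
from typing import List


def exception_summary(log_text: str) -> List[str]:
    """Keep log entries and exception headers; drop the 'at ...' stack frames."""
    return [
        line.strip()
        for line in log_text.splitlines()
        if line.strip() and not line.lstrip().startswith("at ")
    ]


SAMPLE = """\
[2015-07-25 19:45:49,466 pool-1-thread-5  ERROR com.vmware.identity.admin.server.ims.impl.PrincipalManagementImpl] Idm client exception
com.vmware.identity.idm.IDMException: Operations error
        at com.vmware.identity.idm.server.ServerUtils.getRemoteException(ServerUtils.java:134)
        at com.vmware.identity.idm.server.IdentityManager.find(IdentityManager.java:4104)
"""

for line in exception_summary(SAMPLE):
    print(line)
```

Applied to the full excerpt, this should leave only the INFO line, the two ERROR lines and the two exception messages, both of which say "Operations error".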

Makes no sense to me but the problem looks to be in Java/SSO.

I tried removing AD from the list of identity sources in SSO (in the web client) and re-added it. No luck.

Tried re-adding AD but this time I used an SPN account instead of the machine account. No luck!

Finally I tried adding AD as an LDAP Server just to see if I can get it working somehow – and that clicked! :)

So while I didn’t really solve the problem I managed to work around it …

Update: Added the rest of my DCs as time sources to the ESXi hosts and restarted the ntpd service. Maybe that helped; NTP is now working on the hosts.

esx02: ~ # ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 win-dc01.rakhes .LOCL.           1 u    5   64    1    0.860  976.564   0.000
 win-dc03.rakhes 10.50.0.20       2 u    4   64    1    0.838  980.585   0.000
 win-dc02.rakhes 10.50.0.20       2 u    3   64    1    2.665  964.615   0.000
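With more than one time source listed, a quick sanity check is whether the sources agree with each other. The Python sketch below takes the three rows above and computes the spread between the largest and smallest offset, which ntpq reports in milliseconds. The function name is made up, and it assumes offset is the ninth whitespace-separated field, as in the rows shown.

```python
from typing import List

ESX02_ROWS = [
    " win-dc01.rakhes .LOCL.           1 u    5   64    1    0.860  976.564   0.000",
    " win-dc03.rakhes 10.50.0.20       2 u    4   64    1    0.838  980.585   0.000",
    " win-dc02.rakhes 10.50.0.20       2 u    3   64    1    2.665  964.615   0.000",
]


def offset_spread_ms(rows: List[str]) -> float:
    """Difference between the largest and smallest peer offset, in milliseconds."""
    # Field order: remote refid st t when poll reach delay offset jitter
    offsets = [float(row.split()[8]) for row in rows]
    return max(offsets) - min(offsets)


print(round(offset_spread_ms(ESX02_ROWS), 3))
```

Here the three DCs agree to within about 16 ms of each other, while the host itself is roughly a second off from all of them. The reach value of 1 shows only one poll had been answered at that point, which fits with ntpd having just been restarted.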
