Creative Commons Attribution 4.0 International License
© Rakhesh Sasidharan

Configure NTP for multiple ESXi hosts

Following on from my previous post, I wanted to set NTP servers for my ESXi hosts, start the NTP service, and allow it through the firewall. Here’s what I did –

 

Exchange DAG fails. Information Store service fails with error 2147221213.

Had an interesting issue at work today. When our Exchange servers (which are in a two-node DAG) rebooted after patch weekend, one of them had trouble starting the Information Store service. The System log had entries such as these (event ID 7024) –

The Microsoft Exchange Information Store service terminated with service-specific error %%-2147221213.

The Application log had entries such as these (event ID 5003) –

Unable to initialize the Information Store service because the clocks on the client and server are skewed. This may be caused by a time change either on the client or on the server, and may require a restart of that computer. Verify that your domain is correctly configured and  is currently online.

So it looked like time synchronization was the issue. Which is odd, because all our servers should be syncing time correctly from the Domain Controllers.

Our Exchange team fixed the issue by forcing a time sync from the DC –

I was curious as to why this happened, so I went through the System logs in detail. What I saw was a sequence of entries such as these –

Notice how the time jumps ahead from 13:21, when the OS starts, to 13:27, then jumps back to 13:22 when the Windows Time service starts and begins syncing time from my DC. It looked like this jump of 6 mins was confusing the Exchange services (understandably so). But why was this happening?

I checked the time configuration of the server –

Seems to be normal. It was set to pick time from the site DC via NTP (the first entry under TimeProviders) as well as from the ESXi host the VM is running on (the second entry – VM IC Time Provider). I didn’t think much of the second entry because I know all our VMs have the VMware Tools option to sync time from the host to VM unchecked (and I double checked it anyways).

Only one of the mailbox servers was having this jump though. The other mailbox server had a slight jump but not enough to cause any issues. While the problem server had a jump of 6 mins, the ok server had a jump of a few seconds.

I thought I’d check the ESXi hosts of both VMs anyways. Yes, the VMs are not set to sync time from the host, but let’s double check the host times anyways. And bingo! Turns out the ESXi hosts have NTP turned off and hence varying times. The host with the problem server was about 6 mins ahead of the DC, while the host with the ok server was about a minute or less ahead. That matches the time jumps of the two VMs too closely to be a coincidence!

So it looked like the Exchange servers were syncing time from the ESXi hosts even though I thought they were not supposed to. I read a bit more about this and realized my understanding of host-VM time sync was wrong (at least with VMware). When you tick/ untick the option to synchronize VM time with the ESX host, all you are controlling is a periodic synchronization from host to VM. This does not control other scenarios where a VM could synchronize time with the host – such as when it moves to a different host via vMotion, has a snapshot taken, is restored from a snapshot, has its disk shrunk, or (tada!) when the VMware Tools service is restarted (like when the VM is rebooted, as was the case here). Interesting.

So that explains what was happening here. When the problem server was rebooted it synced time with the ESXi host, which was 6 mins ahead of the domain time. This was before the Windows Time service kicked in. Once the Windows Time service started, it noticed the incorrect time and corrected it. This time jump confused Exchange. I suspect it didn’t confuse Exchange directly but rather one of the AD services running on the server, and because of this the Information Store was unable to start.
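To make the sequence concrete, here is a minimal Python sketch that scans a list of log timestamps for a backward step, which is the unambiguous sign of a clock correction like the one described above. The function name `find_backward_steps` is hypothetical, and the date is made up; only the 13:21, 13:27 and 13:22 times come from the post.

```python
from datetime import datetime, timedelta


def find_backward_steps(timestamps):
    """Return (previous, current, delta) wherever the clock moved backwards
    between two consecutive log entries."""
    steps = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        delta = cur - prev
        if delta < timedelta(0):
            steps.append((prev, cur, delta))
    return steps


# Illustrative timestamps only: the date is invented, the times are from the post.
events = [
    datetime(2015, 1, 1, 13, 21),  # OS starts
    datetime(2015, 1, 1, 13, 27),  # clock now matches the ESXi host
    datetime(2015, 1, 1, 13, 22),  # Windows Time service corrects it from the DC
]

for prev, cur, delta in find_backward_steps(events):
    minutes = delta.total_seconds() / 60
    print(f"{prev:%H:%M} -> {cur:%H:%M} ({minutes:+.0f} min)")
```

A forward gap on its own proves nothing, since time legitimately passes between log entries, which is why the sketch only flags steps backwards.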

The fix for this is to either stop the VMs from synchronizing time from the ESXi host or set up NTP on all the ESXi hosts so they have the correct time going forward. I decided to go ahead with the latter.

Update: Found this and this blog post. They have more screenshots and a better explanation, so worth checking out. :)

Brief notes on Windows Time

The w32time service provides time for Windows. NTP (Network Time Protocol) has been supported since Windows XP. Prior to that it was only SNTP (Simple NTP).

Non domain joined computers (including servers) use SNTP.

This is a good article that explains the Windows Time service and its configurations. Covers both registry keys and GPOs. This is another good article that goes into even more detail.

Any Windows machine can be set up to sync time in one of four ways:

  • No syncing!
  • Sync from specified NTP servers.
  • Sync via the domain hierarchy (i.e. members sync from a DC in the domain; DCs sync from the PDC of the parent domain/ forest root domain).
  • Use either of the above (i.e. NTP and domain hierarchy).

The default mechanism on domain joined computers is domain hierarchy (the setting is called NT5DS). Stand-alone machines default to NTP servers (the setting is called NTP; the default server is time.windows.com, though you can change it, and it is probably recommended that you do).

For machines that are off and on the domain – e.g. laptops – it is better to set their time sync mechanism as any. They needn’t always have contact with the DC to sync time.
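As a rough illustration of the four mechanisms, the Python sketch below builds the `w32tm /config` argument list for each one. It assumes the usual mapping between the registry `Type` values (NoSync, NTP, NT5DS, AllSync) and the `/syncfromflags` values (NO, MANUAL, DOMHIER, ALL). The helper name `w32tm_config_args` and the peer `dc01.example.com` are hypothetical, and the sketch only prints the command rather than running it.

```python
# Registry "Type" value -> w32tm /syncfromflags value.
SYNC_FROM_FLAGS = {
    "NoSync": "NO",        # no syncing
    "NTP": "MANUAL",       # sync from specified NTP servers
    "NT5DS": "DOMHIER",    # sync via the domain hierarchy
    "AllSync": "ALL",      # use either of the above
}


def w32tm_config_args(sync_type, peers=()):
    """Build the argument list for a w32tm /config call.

    peers is a sequence of 'host,flags' strings and is only used for the
    mechanisms that involve manually specified NTP servers.
    """
    if sync_type not in SYNC_FROM_FLAGS:
        raise ValueError(f"unknown sync type: {sync_type!r}")
    args = ["w32tm", "/config", f"/syncfromflags:{SYNC_FROM_FLAGS[sync_type]}"]
    if sync_type in ("NTP", "AllSync") and peers:
        # The peer list is a single space-separated argument.
        args.append("/manualpeerlist:" + " ".join(peers))
    args.append("/update")
    return args


# Hypothetical laptop that is off and on the domain: allow both mechanisms.
print(w32tm_config_args("AllSync", ["dc01.example.com,0x8"]))
```

The list form is meant to be handed to something like `subprocess.run` on the target machine; check the resulting quoting against your own w32tm before relying on it.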

When specifying NTP time servers you also specify flags. Check this post for an explanation of the flags. There are four possible flags: 0x01 SpecialInterval; 0x02 UseAsFallbackOnly; 0x04 SymmetricActive; 0x08 Client.

  • Flag UseAsFallbackOnly means the server is only used if the others are unavailable. Check out this post for an example of this.
  • Flag SpecialInterval lets you change how often the NTP server is polled. By default the interval is determined by Windows based on the quality of time samples, but you can use the above flag and set a registry key HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\W32Time\TimeProviders\NtpClient\SpecialPollInterval to change the polling interval.
  • I am not sure what the other two flags do. The Client flag seems to be a commonly used one. Some posts/ articles use it, others don’t. The default time.windows.com setting uses this flag as well as the SpecialInterval.
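Since the flags are a bitmask, a peer entry such as `time.windows.com,0x9` can be decoded mechanically. The sketch below is a small Python helper (the name `decode_peer` is hypothetical) that splits an entry into host and flag names using the four values listed above.

```python
# Flag values as listed above.
NTP_SERVER_FLAGS = {
    0x01: "SpecialInterval",
    0x02: "UseAsFallbackOnly",
    0x04: "SymmetricActive",
    0x08: "Client",
}


def decode_peer(entry):
    """Split a 'host,0xN' peer entry into (host, [flag names])."""
    host, _, flag_text = entry.partition(",")
    value = int(flag_text, 16) if flag_text else 0
    names = [name for bit, name in NTP_SERVER_FLAGS.items() if value & bit]
    return host, names


print(decode_peer("time.windows.com,0x9"))
# ('time.windows.com', ['SpecialInterval', 'Client'])
```

Here 0x9 is 0x01 | 0x08, which matches the note above that the default time.windows.com setting uses both Client and SpecialInterval.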

p.s. To turn on w32tm debugging check out this link.

vCenter – Cannot load the users for the selected domain

I spent the better part of this evening trying to sort out this issue, but didn’t get anywhere. I don’t want to forget the stuff I learnt while troubleshooting, so here’s a blog post.

This evening I added one of my ESXi hosts to my domain. The other two wouldn’t add, until I discovered that the time on those two hosts was out of sync. I spent some time trying to troubleshoot that but didn’t get anywhere. The NTP client on these hosts was running, the ports were open, the DC (which was also the forest PDC and hence the time keeper) was reachable – but time was still out of sync.

Found an informative VMware KB article. The ntpq command (short for “NTP query”) can be used to see the status of the NTP daemon on the client side. Like so:

The command has an interactive mode (which you get into if run without any switches; read the manpage for more info). The -p switch tells ntpq to output a list of peers and their state. The KB article above suggests running this command every 2 seconds using the watch command but you don’t really need to do that.

Important points about the output of this command:

  • If it says “No association ID's returned” it means the ESXi host cannot reach the NTP server. Considering I didn’t get that, it means I have no connectivity issue.
  • If it says “***Request timed out” it means the response from the NTP server didn’t get through. That’s not my problem either.
  • If there’s an asterisk before the remote server name (like so) it means there is a huge gap between the time on the host and the time given by the NTP server. Because of the huge gap NTP is not changing the time (to avoid any issues caused by a sudden jump in the OS time). Manually restarting the NTP daemon (/etc/init.d/ntpd restart) should sort it out.
    • The output above doesn’t show it but one of my problem hosts had an asterisk. Restarting the daemon didn’t help.

The refid field shows the time stream to which the client is syncing. For instance here’s the w32tm output from my domain:

Notice the PDC has a refid of LOCL (indicating it is its own time source) while the rest have a refid of the PDC name. My ESXi host has a refid of .INIT. which means it has not received any response from the NTP server (shouldn’t the error message have been something else!?). So that’s the problem in my case.
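If you want to check several hosts for this state, the peer table is easy to parse. The Python sketch below assumes the standard ten-column `ntpq -p` layout, with the status character in the first column of each peer line; the function name `parse_ntpq_peers` is hypothetical and the sample output is made up for illustration, not captured from my hosts.

```python
def parse_ntpq_peers(output):
    """Parse the peer table printed by `ntpq -p` into a list of dicts."""
    peers = []
    for line in output.splitlines():
        # Skip blank lines, the column header and the ===== separator.
        if not line.strip() or line.lstrip().startswith("remote") or line.startswith("="):
            continue
        # Assumes the first column holds the status character (or a space).
        tally, fields = line[0], line[1:].split()
        if len(fields) < 10:
            continue  # not a peer line in the expected layout
        peers.append({
            "tally": tally.strip(),
            "remote": fields[0],
            "refid": fields[1],
            "stratum": int(fields[2]),
            "reach": int(fields[6], 8),  # reach is printed in octal
        })
    return peers


# Made-up sample for illustration; dc01.example.com is a hypothetical server.
SAMPLE = """\
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 dc01.example.com .INIT.          16 u    -   64    0    0.000    0.000   0.000
"""

for peer in parse_ntpq_peers(SAMPLE):
    if peer["refid"] == ".INIT.":
        print(f"{peer['remote']}: no response received yet (reach={peer['reach']})")
```

A refid of .INIT. together with a reach of 0 is the combination to look for when a host looks connected but never actually syncs.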

Obviously the PDC is working because all my Windows machines are keeping correct time from it. So is vCenter. But some of my ESXi hosts aren’t.

I have no idea what’s wrong. After some troubleshooting I left it because that’s when I discovered my domain had some inconsistencies. Fixing those took a while, after which I hit upon a new problem – vCenter clients wouldn’t show me vCenter or any hosts when I log in with my domain accounts. Everything appears as expected under the administrator@vsphere.local account but the domain accounts return a blank.

While double-checking that the domain admin accounts still have permissions to vCenter and SSO I came across the following error:

Cannot load the users

Great! (The message is “Cannot load the users for the selected domain“).

I am using the vCenter appliance. Digging through the /var/log/messages on this I found the following entries:

Searched Google a bit but couldn’t find any resolutions. Many blog posts suggested removing vCenter from the domain and re-adding but that didn’t help. Some blog posts (and a VMware KB article) talk about ensuring reverse PTR records exist for the DCs – they do in my case. So I am drawing a blank here.

Odd thing is the appliance is correctly connected to the domain and can read the DCs and get a list of users. The appliance uses Likewise (now called PowerBroker Open) to join itself to the domain and authenticate with it. The /opt/likewise/bin directory has a bunch of commands which I used to verify domain connectivity:

All looks well! In fact, I added a user to my domain, and when I re-ran the lw-enum-users command it correctly picked up the new user. So the appliance can definitely see my domain and get a list of users from it. The problem appears to be in the upper layers.

In /var/log/vmware/sso/ssoAdminServer.log I found the following each time I’d query the domain for users via the SSO section in the web client:

Makes no sense to me but the problem looks to be in Java/ SSO.

I tried removing AD from the list of identity sources in SSO (in the web client) and re-added it. No luck.

Tried re-adding AD but this time I used an SPN account instead of the machine account. No luck!

Finally I tried adding AD as an LDAP Server just to see if I can get it working somehow – and that clicked! :)

AD as LDAP

So while I didn’t really solve the problem I managed to work around it …

Update: Added the rest of my DCs as time sources to the ESXi hosts and restarted the ntpd service. Maybe that helped; NTP is now working on the hosts.