Cannot ping an address but nslookup works (contd)

Earlier today I had blogged about nslookup working but ping and other methods not resolving names to IP addresses. That problem started again, later in the evening.

This morning, though, as a precaution I had enabled the DNS Client logs on my computer. (To do that, open Event Viewer with an admin account, go down to Applications and Services Logs > Microsoft > Windows > DNS Client Events > Operational, and click “Enable log” in the “Actions” pane on the right).
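
If you prefer the command line, I believe the equivalent one-liner is the following (the channel name should correspond to the Event Viewer path above):

  wevtutil sl Microsoft-Windows-DNS-Client-Events/Operational /e:true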

That showed me an error along the following lines:

A name not found error was returned for the name vcenter01.rakhesh.local. Check to ensure that the name is correct. The response was sent by the server at 10.50.1.21:53.

Interesting. So it looked like a particular DC was the culprit. Most likely when I restarted the DNS Client service it just chose a different DC, and the problem temporarily went away. And sure enough, an nslookup for this record against that DNS server returned no answers.
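
For reference, pointing nslookup at that specific server is just (name and IP from the error above):

  nslookup vcenter01.rakhesh.local 10.50.1.21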

I fired up DNS Manager and looked at this server. It seemed quite outdated with many missing records. This is my simulated branch office DC so I don’t always keep it on/ online. Looks like that was coming back to bite me now.

The DNS logs in Event Viewer on that server had errors like this:

The DNS server was unable to complete directory service enumeration of zone TrustAnchors.  This DNS server is configured to use information obtained from Active Directory for this zone and is unable to load the zone without it.  Check that the Active Directory is functioning properly and repeat enumeration of the zone. The extended error debug information (which may be empty) is “”. The event data contains the error.

So Active Directory is the culprit (not surprising as these zones are AD integrated so the fact that they weren’t up to date indicated AD issues to me). I ran repadmin /showrepl and that had many errors:

Naming Context: CN=Configuration,DC=rakhesh,DC=local

Source: COCHIN\WIN-DC03
******* WARNING: KCC could not add this REPLICA LINK due to error.

Naming Context: DC=DomainDnsZones,DC=rakhesh,DC=local
Source: COCHIN\WIN-DC03
******* WARNING: KCC could not add this REPLICA LINK due to error.

Naming Context: DC=ForestDnsZones,DC=rakhesh,DC=local
Source: COCHIN\WIN-DC03
******* WARNING: KCC could not add this REPLICA LINK due to error.

Naming Context: DC=rakhesh,DC=local
Source: COCHIN\WIN-DC03
******* WARNING: KCC could not add this REPLICA LINK due to error.

Naming Context: CN=Configuration,DC=rakhesh,DC=local
Source: COCHIN\WIN-DC01
******* WARNING: KCC could not add this REPLICA LINK due to error.

Naming Context: DC=DomainDnsZones,DC=rakhesh,DC=local
Source: COCHIN\WIN-DC01
******* WARNING: KCC could not add this REPLICA LINK due to error.

Naming Context: DC=ForestDnsZones,DC=rakhesh,DC=local
Source: COCHIN\WIN-DC01
******* WARNING: KCC could not add this REPLICA LINK due to error.

Naming Context: DC=rakhesh,DC=local
Source: COCHIN\WIN-DC01
******* WARNING: KCC could not add this REPLICA LINK due to error.

Great! I fired up AD Sites and Services and the links seemed ok. Moreover I could ping the DCs from each other. Event Logs on the problem DC (WIN-DC02) had many entries like this though:

The Kerberos client received a KRB_AP_ERR_MODIFIED error from the server win-dc01$. The target name used was Rpcss/WIN-DC01. This indicates that the target server failed to decrypt the ticket provided by the client. This can occur when the target server principal name (SPN) is registered on an account other than the account the target service is using. Please ensure that the target SPN is registered on, and only registered on, the account used by the server. This error can also happen when the target service is using a different password for the target service account than what the Kerberos Key Distribution Center (KDC) has for the target service account. Please ensure that the service on the server and the KDC are both updated to use the current password. If the server name is not fully qualified, and the target domain (RAKHESH.LOCAL) is different from the client domain (RAKHESH.LOCAL), check if there are identically named server accounts in these two domains, or use the fully-qualified name to identify the server.

Hmm, secure channel issues? I tried resetting it but that too failed:

(Ignore the above though. Later I realized that this was because I wasn’t running command prompt as an admin. Because of UAC even though I was logged in as admin I should have right clicked and ran command prompt as admin).

Since I know my environment, it looked likely to be a case of this DC losing trust with the other DCs. The KRB_AP_ERR_MODIFIED error also seems to come up with Windows Server 2003 and Windows Server 2012 R2, but mine wasn’t Windows Server 2003. The blog post where I read about that confirmed my suspicions that this was password related.

Time to check the last password set (pwdLastSet) attribute for this DC’s object on all my other DCs, via repadmin /showobjmeta.
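
The invocation is along these lines (the object DN is my assumption of where the DC account sits; output not reproduced here):

  repadmin /showobjmeta WIN-DC01 "CN=WIN-DC02,OU=Domain Controllers,DC=rakhesh,DC=local"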

That command gives the metadata of the WIN-DC02 object as stored on WIN-DC01. I am interested in the pwdLastSet attribute and its timestamp. Here’s a comparison of this across my three DCs:

That confirms the problem. WIN-DC02 thinks its password last changed on 9th May whereas WIN-DC01 changed it on 25th July and replicated it to WIN-DC03.

Interestingly that date of 25th July is when I first started having problems in my test lab. I thought I had sorted them but apparently they were only lurking beneath. The solution here is to reset the WIN-DC02 password on itself and WIN-DC01 and replicate it across. The steps are in this KB article, here’s what I did:

  1. On WIN-DC02 (the problem DC), stop the KDC service and set it to start Manual.
  2. Purge the Kerberos ticket cache. You can view the ticket cache with the command klist; to purge it, run klist purge.
  3. Open a command prompt as administrator (i.e. right click and do a “Run as administrator”) and run: netdom resetpwd /Server:WIN-DC01.rakhesh.local /UserD:MyAdminAccount /PasswordD:* (see the command sketch after this list)
  4. Restart WIN-DC02.
  5. After logon, start the KDC service and set it back to Automatic.
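
A sketch of steps 1-3 as elevated console commands:

  net stop kdc
  sc config kdc start= demand
  klist purge
  netdom resetpwd /Server:WIN-DC01.rakhesh.local /UserD:MyAdminAccount /PasswordD:*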

I checked the Event Logs to see if there were any errors. None initially, but after I forced a sync via repadmin /syncall /e I got a few. All of them had the following error:

2148074274 (0x80090322, SEC_E_WRONG_PRINCIPAL): The target principal name is incorrect.

Odd. But at least it was different from the previous errors and we seemed to be making progress.

After a bit of trial and error I noticed that whenever the KDC service on the DC was stopped, things seemed to work fine.

I could access other servers (file shares), connect to them via DNS Manager, etc. But start the KDC and all of these would break, with errors indicating the target name was wrong or that “a security package specific error occurred”.

Eventually I left the KDC off, let the domain sync via repadmin /syncall, and waited a fair amount of time (about 15-20 mins) for things to settle. I kept an eye on repadmin /replsummary to see the deltas between WIN-DC02 and the rest, and also kept an eye on the DNS zones to see if WIN-DC02 was picking up newer entries from the others. Once both of these looked positive, I started the KDC. And finally things were working!

 

Brief notes on HP SUM and SPP

HP SUM (Smart Update Manager) can be downloaded from http://h17007.www1.hp.com/us/en/enterprise/servers/products/service_pack/hpsum/index.aspx. This is just the tool. Its home page is http://www8.hp.com/us/en/products/server-software/product-detail.html?oid=5182020. As of this post date the home page says the latest version is 7.3.0 but the download page only has 7.1.0. Not sure why.

I am on Windows so I downloaded the ISO and the ZIP file (which can be found later on in the page). The ISO file is bootable. You can add firmware and drivers to this and boot up. The ZIP file has the HP SUM tool for Windows and Linux and can be extracted to these OSes and run from there. It’s not meant for booting up and deploying.

From Windows computers you can run HP SUM and update Windows, Linux, VMware, HP-UX, iLO, Virtual Connect, etc. From Linux computers you can do all these except Windows.

Documentation can be found at http://h17007.www1.hp.com/us/en/enterprise/servers/solutions/info-library/index.aspx?cat=smartupdate&subcat=hp_sum.

An SPP (Service Pack for ProLiant) is HP SUM along with a set of firmware and drivers current as of a certain date. These have been tested to ensure they work well together.

HP SUM only works with VMware if you are using the HP customized version of VMware. These can be found at http://www8.hp.com/us/en/products/servers/solutions.html?compURI=1499005#tab=TAB4. If your installation of VMware is not an HP customized version then the inventory step will fail with an error that the username/ password is incorrect.

A baseline is the set of updates that you want all the nodes added into SUM to be at. If you run SUM from an SPP then the baseline is that of the SPP – for example 2015.04 if you are running the 2015.04 SPP. Otherwise SUM creates a baseline from the packages you add to it the first time it runs. In addition to a baseline you can also add extra components (I am not too sure about that, haven’t played with it).

So you create a baseline (or it happens implicitly). You add nodes and do an inventory of the nodes. That tells you what’s present on each system. Then in the next screen you review what needs to be done and deploy accordingly. On this screen you can choose whether reboots happen immediately or should be delayed. You can also see which updates will cause a reboot. In some cases you can even downgrade via this screen.

Some of the components will appear as “Force” or “Disabled”. This means no update is required. If you click on the details link for these components you will usually see that the installed component is already at the version with SUM. If you want you can re-install/ overwrite some of these components. The ones you can overwrite are shown as “Force”; the ones you cannot are shown as “Disabled”. If you toggle “Force” it becomes “Forced”.

SUM can be run via a GUI – which is actually served by a web server you point your browser at. Or you can run it via the command line; the latter gives you finer control over the process, I think.

Extract secret keys from Two-Factor Authentication (TFA) QR codes

Got me Pebble Time yesterday! Yay. Found a cool app for Two-Factor Authentication codes called QuickAuth (it’s open source too, amazing!). 

The app requires you to enter the secret keys for your Two-Factor Authentication sites. Unfortunately I never saved these when I set up TFA on my devices. I was smart enough to save the QR code for each site and this way I was always able to add new devices by just scanning the saved QR code, but now I had to enter the secret key and I was stuck. 

Enter another open source project, Zebra Crossing (zxing). This is a library for processing QR codes, and they have an Android app called Barcode Scanner. Get this app, scan the QR code, and you get the decoded contents – a URI starting with otpauth://. The secret key you want is the secret parameter in that URI. Enter this into QuickAuth.
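
For the curious, a decoded TOTP QR code looks something along these lines (a made-up example – the secret parameter is what QuickAuth wants):

  otpauth://totp/Example:alice@example.com?secret=JBSWY3DPEHPK3PXP&issuer=Example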

If you don’t want to download the app there’s also an online interface to upload a QR code and decode. Nice!

p.s. In case it helps anyone – on the face of it there seems to be no easy way to delete a key/ site once you enter it into QuickAuth. Later I realized if I long press the select button on the pebble when it shows a code I get many options. One of these lets you delete the key/ site. 

Cannot ping an address but nslookup works

Had an odd thing happen yesterday. I could nslookup names from my computer but couldn’t ping them. Ping would give an error as below:

Very odd.

And it wasn’t just ping – nothing on my system could connect to these addresses.

Nslookup works by querying the DNS servers directly. Ping and the rest of the OS work by asking the Windows DNS resolver to query the DNS servers. Both should be querying the same servers (unless I specifically tell nslookup to check with a different server), so in theory both should behave the same – but oddly they didn’t.

I restarted the DNS Client service (called dnscache) and then everything began working as expected. Not sure what was really wrong …
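
For the record, restarting it is just the following from an elevated prompt (or the equivalent Restart-Service dnscache from PowerShell):

  net stop dnscache
  net start dnscache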

DNS request timed out

Whenever I’d do an nslookup I noticed these timeout messages as below:

Everything seemed to be working but I was curious about the timeouts.

Then I realized the problem was that I had missed the root domain at the end of the name. You see, a name such as “www.msftncsi.com” isn’t an absolute name. For all the DNS resolver knows it could just be a hostname – like say “eu.mail” in a full name such as “eu.mail.somedomain.com”. To tell the resolver this is the full name, one must terminate it with a dot, like thus: “www.msftncsi.com.”. In common practice we omit the dot and the DNS resolver puts it in implicitly, so we usually don’t notice it.

The dot is what tells the resolver about the root of the domain name hierarchy. A name such as “rakhesh.com.” actually means the “rakhesh” domain in the “com” domain in the “.” domain. It is “.” that knows of all the sub-domains such as “com”, “io”, “net”, “pl”, etc.

In the case above since I had omitted the dot my resolver was trying to append my DNS search suffixes to the name “www.msftncsi.com” to come up with a complete name. I have search suffixes “rakhesh.local.” and “dyn.rakhesh.local.” (I didn’t put the dot while specifying the search suffixes but the resolver puts it in because I am telling it these are the absolute domain names) so the resolver was actually expanding “www.msftncsi.com” to “www.msftncsi.com.rakhesh.local.” and “www.msftncsi.com.dyn.rakhesh.local.” and searching for these. That fails and so I get these “DNS request timed out” messages.

If I re-try with the proper name the query goes through fine:

Just to eliminate any questions of whether a larger timeout is the solution, no it doesn’t help:

If I replace my existing DNS suffix search list with just the dot domain, that too helps (not practical, because then I can’t query by just the hostname – I will always have to put in the fully qualified name):

Or I could tell nslookup to not use the DNS suffix search lists at all (a more practical solution):
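
To illustrate those nslookup variants (the absolute name, a larger timeout, and skipping the suffix search list, respectively – outputs not reproduced here):

  nslookup www.msftncsi.com.
  nslookup -timeout=15 www.msftncsi.com
  nslookup -nosearch www.msftncsi.com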

So there you go. That’s why I was getting those timeout errors.

Updating Windows DNS Server root hints

Somehow I came upon the root hints of my Windows DNS Server today and had a thought to update it. Never done that before so why not give it a shot?

You can find the root hints by right clicking on the server and going to the ‘Root Hints’ tab.


Or you could click the server name in DNS Manager and select ‘Root Hints’ in the right pane. Either way you get to the same screen. From here you can add/ remove/ edit root server names and IP addresses. If you want to update this list you can do so entry by entry, or click the ‘Copy from Server’ button to update the list with a new bunch of entries. Note that ‘Copy from Server’ does not overwrite the list, so you are better off removing all the entries first and then doing a ‘Copy from Server’.

The ‘Copy from Server’ option had me stumped though. You can find the root hints on the IANA website – there’s an FTP link to the file containing the root hints, as well as an HTTP link (http://www.internic.net/domain/named.root). I thought simply entering this URL in the ‘Copy from Server’ window should suffice, but it doesn’t – the OK button stays grayed out.

The window says it wants a server name or IP address, so I removed everything except the server name and clicked OK. That looked like it was doing something, but then failed with a message that it couldn’t get the root hints. The message said the specified DNS server could not be contacted, which gave me the idea that it was looking for a DNS server which had the root hints.

So I tried inputting the name of one of my DNS servers. This DNS server knows of the root servers because it has them already. (You can verify that a server knows of the root hints via nslookup, as below.)

My DNS server doesn’t give an authoritative answer because all it has is the info that’s present with it by default. The real answers could have changed by now (and they often do – the root hints list these servers come with can have outdated entries), but that’s fine because it has some answers at least. If I input this server’s name or IP address into the ‘Copy from Server’ dialog, the server being updated gets the root hints from this DNS server and updates itself.

Even better, though, would be to put in the IP address of one of the root servers, like a.root-servers.net, which has an IPv4 address of 198.41.0.4. (Don’t go by my output – you can get the latest IP addresses from IANA.) If I query that address for the root servers I get an authoritative answer:
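
For reference, that query looks like this:

  nslookup -type=ns . 198.41.0.4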

So there you have it. I put this IPv4 address into the ‘Copy from Server’ window and my server updated itself with the IP addresses. I noticed that it had missed some of the IPv6 addresses (not sure why – maybe because it can’t validate them?), but when I did a ‘Copy from Server’ again, without removing any existing entries, and input the same IPv4 address, this time it picked up all the addresses.

(Note to self: The %WINDIR%\System32\dns\cache.dns file seems to contain root hints. I replaced this file with the root hints from IANA but that did not update the server. When I updated the server as above and checked this file it hadn’t changed either. Restarting the DNS service didn’t update the file/ root hints either, so am not sure how this file comes into play).

 

ESXi 6 seems to have a hard minimum requirement of 4GB RAM

With 2GB RAM the boot up process hangs at “User loaded successfully”.

With 2.5GB RAM the boot up process goes past the above but throws many errors and eventually ends at a blank screen.

You can login to the system by pressing F2 but all the network configuration options are grayed out. This is the case even if you boot up with 4GB RAM, configure the network, then reduce the RAM and reboot. The options are grayed out and you can’t ping the host. Trying to start the management services manually (enable the ESXi Shell, press Alt+F1, log in, and run /sbin/services.sh restart) doesn’t work either.

With 3GB of RAM the boot up process hangs a while at “Running rabbitmqproxy start” but then pulls through. No IP address though, and I can’t configure anything (even if I configure initially by booting up with 4GB RAM). I tried repairing the network settings but that simply fails.

Didn’t try with 3.5GB RAM!

With 4GB of RAM the boot up process goes normally. In my case, since I had been playing with the host by increasing RAM in 500MB increments, the network was still bust after booting with 4GB RAM. But I was able to restore the network and get it working.

So it looks like 4GB of RAM is effectively a hard minimum requirement for ESXi 6. That sucks! I am able to install ESXi 5.5 with 4GB RAM and then decrease it to 2GB post-install with no issues. In my case all these hosts run in a laptop anyway, so I just need them up and running with the bare minimum. Guess I won’t be able to do that with ESXi 6 (unless someone has a workaround – I didn’t search much on this topic, just noticed it today and played around a bit).

Notes on vSphere High Availability (HA)

Just some notes on vSphere HA as I read along on the topic. Nothing new here …

Starting with vSphere 5.0, HA has a Master/ Slave model. One ESXi host is elected as the Master; the rest are Slaves. The Master is the one with the most datastores connected to it; if all ESXi hosts have the same number of datastores connected, the Master is the one with the largest Managed Object ID (MOID). Note that the MOID is interpreted lexically – so an MOID of 99 is larger than an MOID of 100. PowerCLI can be used to view the MOIDs:
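
A sketch of that – the Id property on PowerCLI objects carries the MOID:

  Get-VMHost | Select-Object Name, Id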

Also, the MOID is a vCenter-specific construct. Whenever a host, VM, datastore, etc. is added to vCenter it is assigned an MOID. For instance, here are the MOIDs of my datastores:
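
Same idea, for datastores:

  Get-Datastore | Select-Object Name, Id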

Although I haven’t used it, it’s also possible to find MOIDs via the vSphere Managed Object Browser. See this KB article for more info.

Back to the topic – the above is how a Master is elected. There’s only one Master per cluster. When it comes to HA, the Fault Domain Manager (FDM) on this Master is responsible for most of the tasks (which is why even if vCenter is down for a while HA can continue working). vCenter checks with the Master and the Master communicates with vCenter to keep each other abreast of the cluster situation.

  • FDM is installed at /opt/vmware/fdm/fdm/
  • FDM config files are at /etc/opt/vmware/fdm/

The Master monitors the Slave hosts and if a Slave goes down/ is unreachable the Master is responsible for starting these Protected VMs elsewhere. The Master is also responsible for keeping the Slaves abreast of the cluster configuration.

Slaves are limited to monitoring the VMs running on them. Slaves monitor VM health, and if a Protected VM powers down they inform the Master so it can be restarted. (Note on Protected VMs: once you enable VM monitoring on a cluster or set a VM as Protected, the VM must be powered off and powered on for it to become protected). Slaves also keep in touch with each other, and if they find the Master is down they conduct an election to select a new Master.

The only time vCenter communicates with Slaves is when a new Master needs to be elected or when the Master reports a Slave as missing and so vCenter tries to contact it.

Slaves send network heartbeats to the Master every second. When a Master stops receiving heartbeats from a Slave it knows it is offline or partitioned/ isolated. Similarly when a Slave stops receiving heartbeats from a Master it knows the Master is offline or partitioned/ isolated.

  • If a Slave is cut off from all other hosts (Master and Slaves) it is considered isolated (caveat: you can also specify up to 10 isolation IP addresses to ping – if these are reachable but the Master and Slaves are not, the Slave does not consider itself isolated, only partitioned).
  • If a Slave is cut off from the Master but still has contact with some of the other Slaves, then it is considered partitioned.

In the past if a Slave were isolated/ partitioned the Master would consider it as offline and restart its Protected VMs elsewhere. Starting with vSphere 5.0 the Master also sends a ping (ICMP packet) to the Slave to see if it responds, and uses datastore heartbeats to verify the Slave is really down. It could be that only the Management network is down while the VM and storage networks are up, so the VMs are still functioning as expected.

Datastore heartbeats work thus (and remember they are only used in case of isolation/ partition scenarios):

  • When enabling HA for a cluster, a datastore is automatically selected (or can be selected manually by the user) to be used for datastore heartbeats.
  • On this datastore a folder called .vSphere-HA is created within which a sub-folder of name FDM-<Fault Domain ID>-<vCenter Server Name> is created. (Such a name allows the same datastore to be used by multiple clusters).
  • Each host creates a file with its MOID in the name in this sub-folder, like thus (see the sketch after this list):
  • Notice the host-X-hb files? Each host creates one (you can check /var/log/fdm.log on a host to see it creating the file). When a Slave does not get heartbeats from the Master it updates its own file (and also checks the timestamp of the Master’s file – if that has been updated recently it means the Master is alive). Similarly, when the Master does not hear from a Slave it checks the Slave’s file for updates. This is how datastore heartbeats work.
  • If a Slave is network partitioned – i.e. it cannot contact the Master – but can see some of the other Slaves, the Master and Slave can conclude that each other is still alive from the datastore heartbeats as above.
    • If the Master is down – i.e. the Slaves think they are partitioned because actually the Master is down – they can now elect a new Master since there are no datastore heartbeats from the Master.
    • If the Slave is down – i.e. the Master is not getting any datastore heartbeats from the Slave – then it restarts the Protected VMs on other hosts. (If the Slave were actually up but had lost network access to the datastore and so cannot update heartbeats, it is as good as down because the VMs have probably crashed by now).
  • If a Slave is network isolated – i.e. it cannot contact the Master or any other Slave (nor can it ping the isolation addresses) – then the Slave adds a special bit in the host-X-poweron file above. This tells the Master that the Slave is network isolated.
    • The Master then locks the file called protectedlist. This is a list of all Protected VMs. Once the Master has locked this file, the Slave knows the Master has taken responsibility for the Protected VMs and the Slave can leave these powered on, shut down, or power off (depending on which of these is selected as the host isolation response when setting up HA).
    • The protectedlist file thus ensures that unless another host has taken over these VMs the current host will not shut down/ power off these.
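
Here’s roughly what the heartbeat folder looks like (a sketch based on the file names mentioned above; the MOIDs are made up):

  /vmfs/volumes/<datastore>/.vSphere-HA/FDM-<Fault Domain ID>-<vCenter Server Name>/
      host-22-hb
      host-22-poweron
      host-33-hb
      host-33-poweron
      protectedlist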

Two advanced options to keep in mind:

  • I mentioned this earlier: das.isolationAddress[0-9] allow one to specify up to 10 isolation IP addresses to check before a host considers itself isolated.
  • And das.allowNetwork[0-9] allow one to specify up to 10 port groups to use for HA. See this KB article for examples.

Lastly, I haven’t read it fully but this HA Deepdive is a great resource.

CentOS NAT

My home lab setup is such that everything runs in my laptop within VMware Workstation (currently version 11) on a Windows 8.1 OS. Hyper-V might be a better choice here for performance but I am quite happy with VMware Workstation and it does what I want. Specifically – VMware allows for nested virtualization, so I can install a hypervisor such as ESXi or Hyper-V within VMware Workstation and run VMs in that! Cool, isn’t it?

VMware Workstation also natively supports installing ESXi within it as a VM. I can’t do that if I were using Hyper-V instead.

Finally, VMware Workstation has some laptop-lab benefits in that it easily lets you configure networking – getting a NAT network up and running quickly, for instance. I can do the same with Hyper-V but that requires a bit more configuring (expected, because Hyper-V is meant for a different purpose – I should really be comparing Hyper-V with ESXi, but that’s a comparison I can’t even make here).

Within VMware Workstation I have a Windows environment setup where I play with AD, IPv6, DirectAccess, etc. This has a Corporate Network spread over two sites. I also have some CentOS VMs that act as routers – to provide routing between the two Corporate Network sites above, for instance – and also act as NAT routers for remote use. Yes, my Workstation network also has a fake Internet and some fake homes which is a mix of CentOS VMs acting as routers/ NAT and Server 2012 for DNS (root zones, Teredo, etc).

The CentOS VM that acts as a router between the Corporate Networks can also do real Internet routing. For this I have an interface that’s connected to the NAT network of VMware Workstation. I chose to NAT this interface because back when I created this lab I used to hook up the laptop to a network with static IP. I had only one IPv4 address to use so couldn’t afford to bridge this interface because I had no IPv4 address to assign it.

Because of the NAT interface the CentOS VM itself can access the real Internet. But what about the other VMs that forward to this VM for their Internet routing? You would think that simply enabling packet forwarding on this VM is sufficient – but that won’t do. Packet forwarding is for forwarding between networks, but in this case the external network does not know anything about my virtual networks behind the NAT (of VMware Workstation), so simply forwarding won’t work. What you need to do, apart from forwarding, is also set up NAT on the CentOS VM, so that as far as the external network is concerned everything is coming from the CentOS VM (via its interface that is NAT’d with VMware Workstation).

I have done all this in the past but today I needed to revisit this after some time and forgot what exactly I had done. So this post is just a reminder to myself on what needs to be done.

First, enable packet forwarding in the OS. Add the following line to /etc/sysctl.conf:
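
  net.ipv4.ip_forward = 1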

Reboot the VM, or run the following to load it right away:
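
  sysctl -p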

Now modify the firewall rules to allow packet forwarding as well as NAT (aka MASQUERADE in iptables):

Lines 1-6 and 16 were the relevant ones in my rule set. I have a few extras, like allowing ICMP and SSH, but those don’t matter for what I am doing here.
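
A minimal sketch of the relevant bits (interface names are my assumptions – eth0 being the VMware-NAT’d interface, eth1 the internal one):

  # allow forwarding between the internal network and the NAT'd interface
  iptables -A FORWARD -i eth1 -o eth0 -j ACCEPT
  iptables -A FORWARD -i eth0 -o eth1 -m state --state ESTABLISHED,RELATED -j ACCEPT
  # masquerade everything leaving via the NAT'd interface
  iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE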

vSphere client does not depend on the Inventory Service

On numerous occasions I have noticed my vSphere client always has the correct inventory of objects in vCenter whereas the vSphere web client tends to lag behind. While reading Mastering VMware vSphere 5.5 (a great book if you want to really understand how all this works!) I learnt that that’s because the web client depends on the vCenter Inventory Service as a cache between the web client and vCenter whereas the regular client talks to it directly.

The vCenter Inventory Service does two things – one, it caches inventory objects from vCenter so that each time the web client needs something it doesn’t have to ask vCenter (thus reducing load on vCenter); two, it allows for tags. Having the Inventory Service allows for more web client sessions with less load on the vCenter server.

This also means it’s a good idea to place the Inventory Service with the web client, not with the vCenter server.

vCenter and vSphere editions (5.5)

vCenter editions. Just three.

  • Essentials
  • Foundation
  • Standard

Standard is what you usually want. No limits or restrictions.

Essentials is only available when purchased as part of vSphere Essentials or vSphere Essentials Plus kits. Not sold separately. These kits are targeted for SMBs. Limited to 3 hosts of 2 CPUs each. Self-contained – cannot be used with other editions.

Foundation is also for 3 hosts only.

All editions of vCenter include the Management service, SSO, Inventory service, Orchestrator, Web client – everything. There’s no difference in the components included in each edition.

vSphere is the suite. There are three plus two editions of the vSphere suite.

Two editions are the kits:

  • Essentials
  • Essentials Plus

Three editions are bundled with vCenter Operations Manager:

  • Standard
  • Enterprise
  • Enterprise Plus

The Essentials & Essentials Plus editions only work with vCenter Essentials. The Standard, Enterprise, and Enterprise Plus work with vCenter Foundation or Standard.

Essentials is pretty basic. Remember it is for 3 hosts of 2 CPUs each. Standalone. In addition you don’t get features like vMotion either. All you get is (1) Thin Provisioning, (2) Update Manager, and (3) vStorage APIs for Data Protection (VADP). Note the latter is only the APIs – it is not the VMware solution vSphere Data Protection (VDP). Also, no VSAN.

Essentials Plus is a bit more than basic. Once again, only for 3 hosts of 2 CPUs each. Standalone. However, in addition to the three features above you also get (4) vSphere Data Protection, (5) High Availability (HA), (6) vMotion, and (7) vSphere Replication. So you get some useful features. In fact, if I had just 3 hosts and I am unlikely to expand further this is the option I would go for – for me vMotion is very useful and so is HA. Sadly, no Distributed Resource Scheduling (DRS). But you do get VSAN.

Moving on to the big boys …

Standard gives you all the above plus useful features like (8) Storage vMotion, (9) Fault Tolerance, and some more (Hot Add & vShield Endpoint). Still no DRS.

Enterprise gives you all the above plus (10) Storage APIs for Array Integration (nice! but useful only in an Enterprise context where you are likely to have a SAN array and need something like this), (11) DRS, (12) DPM, and (13) Storage APIs for Multi-pathing. As expected, features that are more useful when you have a lot of hosts and are in an Enterprise-y setup. Except DRS :) which would have been nice to have in Standard/ Essentials Plus too.

Finally, Enterprise Plus. All the above plus (14) Distributed Switches, (15) Host Profiles, (16) Auto Deploy, and (17) Storage DRS – four of my favorite features – and a bunch of others like App HA, Storage IO Control, Network IO Control, etc.

vCenter – Cannot load the users for the selected domain

I spent the better part of this evening trying to sort this issue, but didn’t get anywhere. I don’t want to forget the stuff I learnt while troubleshooting, so here’s a blog post.

Today evening I added one of my ESXi hosts to my domain. The other two wouldn’t add, until I discovered that the time on those two hosts was out of sync. I spent some time trying to troubleshoot that but didn’t get anywhere. The NTP client on these hosts was running, the ports were open, the DC (which was also the forest PDC and hence the time keeper) was reachable – but time was still out of sync.

Found an informative VMware KB article. The ntpq command (short for “NTP query”) can be used to see the status of the NTP daemon on the client side. Like thus:

The command has an interactive mode (which you get into if you run it without any switches; read the manpage for more info). The -p switch tells ntpq to output a list of peers and their state. The KB article above suggests running this command every 2 seconds using the watch command, but you don’t really need to do that.
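
For reference, the invocation on the ESXi host (output not reproduced here):

  ntpq -p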

Important points about the output of this command:

  • If it says “No association ID's returned” it means the ESXi host cannot reach the NTP server. Considering I didn’t get that, it means I have no connectivity issue.
  • If it says “***Request timed out” it means the response from the NTP server didn’t get through. That’s not my problem either.
  • If there’s an asterisk before the remote server name it means there is a huge gap between the time on the host and the time given by the NTP server. Because of the huge gap NTP is not changing the time (to avoid any issues caused by a sudden jump in the OS time). Manually restarting the NTP daemon (/etc/init.d/ntpd restart) should sort it out.
    • The output above doesn’t show it but one of my problem hosts had an asterisk. Restarting the daemon didn’t help.

The refid field shows the time stream to which the client is syncing. For instance, here’s the w32tm output from my domain:
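
That output comes from something along these lines, run on a domain machine (output not reproduced here):

  w32tm /monitor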

Notice the PDC has a refid of LOCL (indicating it is its own time source) while the rest have the PDC’s name as their refid. My ESXi host has a refid of .INIT., which means it has not received any response from the NTP server (shouldn’t the error message have been something else!?). So that’s the problem in my case.

Obviously the PDC is working, because all my Windows machines are keeping correct time from it. So is vCenter. But some of my ESXi hosts aren’t.

I have no idea what’s wrong. After some troubleshooting I left it, because that’s when I discovered my domain had some inconsistencies. Fixing those took a while, after which I hit upon a new problem – the vSphere clients wouldn’t show me vCenter or any hosts when I log in with my domain accounts. Everything appears as expected under the administrator@vsphere.local account, but the domain accounts return a blank.

While double-checking that the domain admin accounts still have permissions to vCenter and SSO I came across the following error:

Cannot load the users for the selected domain

Great!

I am using the vCenter appliance. Digging through the /var/log/messages on this I found the following entries:

Searched Google a bit but couldn’t find any resolutions. Many blog posts suggested removing vCenter from the domain and re-adding but that didn’t help. Some blog posts (and a VMware KB article) talk about ensuring reverse PTR records exist for the DCs – they do in my case. So I am drawing a blank here.

Odd thing is the appliance is correctly connected to the domain and can read the DCs and get a list of users. The appliance uses Likewise (now called PowerBroker Open) to join itself to the domain and authenticate with it. The /opt/likewise/bin directory has a bunch of commands which I used to verify domain connectivity:
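
From memory, the commands I used were along these lines (treat the exact names as approximate):

  /opt/likewise/bin/domainjoin-cli query
  /opt/likewise/bin/lw-get-status
  /opt/likewise/bin/lw-enum-users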

All looks well! In fact, I added a user to my domain, re-ran the lw-enum-users command, and it correctly picked up the new user. So the appliance can definitely see my domain and get a list of users from it. The problem appears to be in the upper layers.

In /var/log/vmware/sso/ssoAdminServer.log I found the following each time I’d query the domain for users via the SSO section in the web client:

Makes no sense to me but the problem looks to be in Java/ SSO.

I tried removing AD from the list of identity sources in SSO (in the web client) and re-added it. No luck.

Tried re-adding AD but this time I used an SPN account instead of the machine account. No luck!

Finally I tried adding AD as an LDAP Server just to see if I can get it working somehow – and that clicked! :)


So while I didn’t really solve the problem I managed to work around it …

Update: I added the rest of my DCs as time sources to the ESXi hosts and restarted the ntpd service. Maybe that helped – NTP is now working on the hosts.

 

Fixing “The DNS server was unable to open Active Directory” errors

For no apparent reason my home testlab went wonky today! Not entirely surprising – the DCs in there are not always on/ online, and I keep hibernating the entire lab as it runs off my laptop, so there are bound to be errors lurking behind the scenes.

Anyways, after a reboot my main DC was acting weird. For one, it took a long time to start up – indicating DNS issues, though that shouldn’t have been the case as I had another DC/ DNS server running – and after boot up DNS refused to work, giving the error message in the title. The Event Logs were filled with two errors:

  • Event ID 4000: The DNS server was unable to open Active Directory.  This DNS server is configured to obtain and use information from the directory for this zone and is unable to load the zone without it. Check that the Active Directory is functioning properly and reload the zone. The event data is the error code.
  • Event id 4007: The DNS server was unable to open zone <zone> in the Active Directory from the application directory partition <partition name>. This DNS server is configured to obtain and use information from the directory for this zone and is unable to load the zone without it. Check that the Active Directory is functioning properly and reload the zone. The event data is the error code.

A quick Google search brought up this Microsoft KB. Looks like the DC has either lost its secure channel with the PDC, or it holds all the FSMO roles and is pointing to itself as a DNS server. Either of these could be the culprit in my case as this DC indeed had all the FSMO roles (and hence was also the PDC), and so maybe it lost trust with itself? Pretty bad state to be in, having no trust in oneself … ;-)

The KB article is worth reading for possible resolutions. In my case since I suspected DNS issues in the first place, and the slow loading usually indicates the server is looking to itself for DNS, I checked that out and sure enough it was pointing to itself as the first nameserver. So I changed the order, gave the DC a reboot, and all was well!

In case the DC had lost trust with itself, the solution (according to the KB article) was to reset the DC password. Not sure how that resets trust, but apparently it does. This involves the netdom command, which is installed on Server 2008 and up (it’s also present on Windows 8 if RSAT is installed, and can be downloaded for 2003 as part of the Support Tools package). The command has to be run on the computer whose password you want to reset (so you must log in with an account whose credentials are cached, or use a local account). Then run the command thus:
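
For reference, the same command I used earlier in the secure channel post, with placeholders:

  netdom resetpwd /Server:<PDC name> /UserD:<admin account> /PasswordD:*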

Of course the computer must have access to the PDC. And if you are running it on a DC the KDC service must be stopped first.

I have used netdom in the past to reset my testlab computer passwords. A lot of the machines are usually offline for many days, and after a while AD changes the computer account password while the machine still has the old one; when I later boot up the machine it usually gives an error like this: “The trust relationship between this workstation and the primary domain failed.”

A common suggestion for such messages is to dis-join the machine from the domain and re-join it, effectively getting it a new password. That’s a PITA though – I just use netdom and reset the password as above. :)

 

vSphere 5.5 Maximums

This document contains all the vSphere 5.5 maximums. Here are some of the figures for my quick reference:

Hosts per vCenter Server (Appliance, embedded vPostgres database): 100
VMs per vCenter Server (Appliance, embedded vPostgres database): 3000
Hosts per vCenter Server (Appliance, Oracle database): 1000
VMs per vCenter Server (Appliance, Oracle database): 10000

Hosts per vCenter Server (Windows, bundled SQL Server Express database): 5
VMs per vCenter Server (Windows, bundled SQL Server Express database): 50
Hosts per vCenter Server (Windows, external database): 1000
VMs per vCenter Server (Windows, external database): 10000

So the Windows install with the inbuilt database is the lowest of the lot. You are better off going with the appliance (which has its own limitations, of course).

The maximums of the appliance and the Windows server are the same as long as they use an external database. But the appliance can only use Oracle as an external database, while the Windows server can use SQL Server too.

The NETWORK SERVICE account

Someone asked me why a certain service worked when run as the NETWORK SERVICE account but not as a service account. Was there anything inherently powerful about the NETWORK SERVICE account?

The short answer is “No”. The long answer is “It depends”. :)

The NETWORK SERVICE is a special account that presents the credentials of the computer it is running on to the remote services it connects to. Thus, for example, if you have an SQL Service (which was the case here) running on a computer COMPUTER1 connecting to a file share on COMPUTER2, and this SQL Service runs as NETWORK SERVICE, when it connects to COMPUTER2 it will connect with the computer account COMPUTER1$. So the question then becomes does the COMPUTER1$ account have any additional rights to the service on COMPUTER2 over the service account? In this case – yes, the COMPUTER1$ account did have additional rights and that’s why the service worked when running as NETWORK SERVICE. Once I gave the service account the additional rights, the SQL service worked as expected under that context too.
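
As an illustration, granting such access on COMPUTER2 might look like this (a sketch – the domain name and path are made up):

  icacls D:\Share /grant "RAKHESH\COMPUTER1$:(OI)(CI)M"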

For info – there are three such service accounts.

  • NT AUTHORITY\NetworkService – the NETWORK SERVICE account I mentioned above.
  • NT AUTHORITY\LocalService – an account with minimum privileges on the local computer, presents anonymous credentials to remote services – meant for running services locally with minimum privileges and no network access.
  • .\LocalSystem – an all powerful local account, even more powerful than the Administrator account; presents the computer credentials to remote services like the NETWORK SERVICE account.

See this StackOverflow answer for more.