TIL: VXLAN is a standard

VXLAN == Virtual eXtensible LAN.

While reading about NSX I was under the impression VXLAN is something VMware cooked up and owns (possibly via Nicira, which is where NSX came from). Turns out that isn't the case. It was originally created by VMware & Cisco (check out this Register article – a good read) and is actually covered by RFC 7348. The encapsulation mechanism is standardized, and so is the UDP port used for communication (port 4789, by the way). A lot of vendors now support VXLAN, and just as NSX is an implementation of VXLAN, so is Open vSwitch. Nice!

(Note to self: I've got to read more about Open vSwitch. It's used in XenServer and is a part of Linux. The *BSDs support it too.)

VXLAN is meant to both virtualize Layer 2 and replace VLANs. You can have up to 16 million VXLANs (the NSX Logical Switches I mentioned earlier); in contrast you are limited to 4094 VLANs. I like the analogy that VXLAN is to IP addresses what cell phones are to telephone numbers. Prior to cell phones, when everyone had landline numbers, your phone number was tied to your location – if you shifted houses/ locations you got a new phone number. With cell phone numbers it doesn't matter where you are, as the number is linked to you, not your location. Similarly, with VXLAN your VM's IP address is linked to the VM, not its location.
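Those two limits fall straight out of the ID field widths, by the way. A quick PowerShell sanity check (just arithmetic, nothing VMware-specific):

[math]::Pow(2,24)       # 16777216 possible segments from the 24-bit VXLAN Network Identifier
[math]::Pow(2,12) - 2   # 4094 usable VLANs from the 12-bit VLAN ID (0 and 4095 are reserved)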

Update:

  • Found a good whitepaper by Arista on VXLANs. Something I hadn't realized earlier was that the 24-bit VXLAN Network Identifier is called the VNI (this is what lets you have 16 million VXLAN segments/ NSX Logical Switches) and that a VM's MAC is combined with its VNI – thus allowing multiple VMs with the same MAC address to exist across the network (as long as they are on separate VXLANs).
  • Also, while I am noting acronyms I might as well also mention VTEPs. These stand for Virtual Tunnel End Points. This is the “thing” that encapsulates/ decapsulates packets for VXLAN. This can be virtual bridges in the hypervisor (ESXi or any other); or even VXLAN aware VM applications or VXLAN capable switching hardware (wasn’t aware of this until I read the Arista whitepaper). 
  • VTEPs communicate over UDP. The port number is 4789 (NSX 6.2.3 and later) or 8472 (pre-NSX 6.2.3).
  • A post by Duncan Epping on VXLAN use cases. Probably dated in terms of the VXLAN issues it mentions (traffic tromboning) but I wanted to link it here as (a) it’s a good read and (b) it’s good to know such issues as that will help me better understand why things might be a certain way now (because they are designed to work around such issues). 

vSphere Distributed Switches are Layer 2 devices (doh!)

This is a very basic post. I was trying to read up on NSX, and before I could appreciate it I wanted to go down and explore how things are without NSX, so I could better understand what NSX is trying to do. I wanted to put it down in writing as I spent some time on it, but there's nothing new or grand here.

So. vSphere Distributed Switches (VDS). These are Layer 2 switches that exist on each ESXi host and contain port groups that you connect the VMs running on a host to. In case it wasn't obvious from the name "switch", these are Layer 2 – which means all the hosts connecting to a particular Distributed Switch must be on the same Layer 2 network. Once you create a Distributed Switch and add ESXi hosts and their physical NICs to it, you can create VMKernel ports for Management, vMotion, Fault Tolerance, etc., but these VMKernel ports aren't used by the port groups you create on the Distributed Switch. The port groups behave just like Layer 2 switches – they communicate via broadcasts using the underlying physical NICs assigned to the Distributed Switch; and since there's no IP address as such assigned to a port group, there's no routing involved. (This is an obvious point but I keep forgetting it.)

For example, say you have two ESXi hosts – HostA and HostB – and these are on two separate physical networks (i.e. separated by a router). You create a new Distributed Switch comprising one physical NIC from each host. Then you make a port group on this switch and put VM-A on HostA and VM-B on HostB. When creating the Distributed Switch and adding physical NICs to it, VMware doesn't care whether the physical NICs are in the same Layer 2 domain. It will happily add the NICs, but when you try to send traffic from VM-A to VM-B it will fail. That's because when VM-A tries to communicate with VM-B (let's assume these two VMs know each other's MAC addresses so there's no need for ARP first), VM-A will send Ethernet frames to the Distributed Switch on HostA, which will then broadcast them to the Layer 2 network its assigned physical NIC is connected to. Since these broadcast frames won't reach the physical NIC of HostB, VM-B never sees them, and so the two VMs cannot communicate with each other.

So – keep in mind that all physical NICs connecting to the same Distributed Switch must be on the same Layer 2 network. If the underlying physical NICs are on separate Layer 3 networks, it doesn't matter that those Layer 3 networks have connectivity to each other – the VMs in the port groups will not be able to communicate.
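A quick way to eyeball this from PowerCLI is to list which hosts are attached to a given Distributed Switch and which physical NICs each contributes (the switch name below is made up):

$vds = Get-VDSwitch -Name "DSwitch01"   # hypothetical VDS name
foreach ($esx in (Get-VMHost -DistributedSwitch $vds)) {
    # the uplink NICs this host contributes – these are what must share a Layer 2 domain
    Get-VMHostNetworkAdapter -VMHost $esx -Physical -VirtualSwitch $vds |
        Select-Object VMHost, Name, Mac
}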

And this is where NSX comes in. Using the concept of VXLANs, NSX stretches a Layer 2 network across Layer 3. Basically it encapsulates the Layer 2 traffic within Layer 3 packets and gives the illusion of all VMs being on the same Layer 2 network – but this illusion is what Network Virtualization is all about, right? :) VXLAN is an overlay network.

VXLAN encapsulates Layer 2 frames in UDP packets. The VXLAN is like a tunnel that all the hosts connecting to this VXLAN hook into. On each host there's something called a Virtual Tunnel End Point (VTEP), which is the "thing" that actually hooks into the VXLAN. If a VXLAN is like a Distributed Switch made up of physical NICs from the hosts, the VTEP is like the VMKernel port of this Distributed Switch that does the actual communication (the way vMotion traffic between two hosts happens via the VMKernel ports you assign for vMotion). In fact, during an NSX install you install three VIBs on the ESXi hosts – one of these enhances the existing Distributed Switch with VXLAN capabilities (the encapsulation stuff I mentioned above).

Once you have NSX you can create multiple Logical Switches. These are basically VXLAN switches that operate like Layer 2 switches but can actually stretch across multiple Layer 3 networks. Logical Switches are overlay switches. ;o) Each Logical Switch corresponds to one VXLAN.

ps. VXLAN is one of the cool features of NSX. The other cool features are the Distributed Logical Router (DLR) and the Distributed Firewall (DFW). Remember I mentioned that an ESXi host has 3 VIBs installed as part of NSX, and that one of them is the VXLAN functionality? Well, the other two are DLR and DFW (god, so many acronyms!). Prior to DLR, if an ESXi host had two VMs connected to different Distributed Switches, and these two VMs wanted to talk to each other, the traffic would go down from one of the VMs, to the host, to the underlying physical network router, then back to the host and up to the VM on the other Distributed Switch. But with DLR the ESXi hypervisor kernel can do Layer 3 routing too, so it will simply send traffic directly to the VM on the other Distributed Switch.

Similarly, DFW just means each ESXi hypervisor can also apply firewall rules to packets, so you don't need one centralized firewall any more. You simply create rules and push them out to the ESXi hosts, and they can do firewalling between VMs. Everything is virtual! :)

pps. Some other jargon. East-West traffic means network traffic that’s usually within or between servers (ESXi hosts in our case). North-South traffic means any other network traffic – basically, traffic that goes out of this layer of ESXi hosts. With NSX you try and have more traffic East-West rather than North-South. 

TIL: Transparent Page Sharing (TPS) between VMs is disabled by default

(TIL is short for “Today I Learned” by the way).

I always thought an ESXi host did some page sharing of VM memory between the VMs running on it. The feature is called Transparent Page Sharing (TPS) and it is something I remember from my VMware course and have also read about in blog posts such as this and this. The idea is that if you have (say) a bunch of Server 2012 R2 VMs running on a host, it's quite likely these VMs have quite a bit of common stuff in RAM, so it makes sense for the ESXi host to share that common stuff between the VMs. So even if each VM has (say) 4 GB RAM assigned to it, and there's about 2 GB worth of stuff common between the VMs, the host only needs 2 GB shared RAM + 2 x 2 GB private RAM, for a total of 6 GB RAM.

Apart from this, as the host comes under increased memory pressure it resorts to techniques like ballooning and memory swapping to free up RAM.

I even made a script today to list all the VMs in our environment that have more than 8 GB RAM assigned to them and are powered on, along with the amount of shared RAM (just for my own info).
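The script was along these lines – a sketch from memory; note that the QuickStats counters are reported in MB:

Get-VM | Where-Object { $_.PowerState -eq "PoweredOn" -and $_.MemoryGB -gt 8 } |
    Select-Object Name, MemoryGB,
        @{N="SharedGB"; E={ [math]::Round($_.ExtensionData.Summary.QuickStats.SharedMemory / 1024, 1) }}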

Anyhow – around 2015 VMware stopped page sharing of VM memory between VMs. VMware calls this sort of RAM sharing inter-VM TPS. Apparently it is a security risk, and VMware likes to ship their products as secure by default, so via some patches to the 5.x series (and as the default in the 6.x series) it turned off inter-VM TPS and introduced some controls that allow IT admins to turn it on if they so wish. Intra-VM TPS is still enabled – i.e. the ESXi host will do page sharing within each VM – but it no longer does page sharing between VMs by default.

Using the newly introduced controls, however, it is possible to enable inter-VM TPS for all VMs, or selectively between some VMs. Quoting from this blog post:

You can set a virtual machine’s advanced parameter sched.mem.pshare.salt to control its ability to participate in transparent page sharing.  

TPS is only allowed within a virtual machine (intra-VM TPS) by default, because the ESXi host configuration option Mem.ShareForceSalting is set to 2, the sched.mem.pshare.salt is not present in the virtual machine configuration file, and thus the virtual machine salt value is set to unique value. In this case, to allow TPS among a specific set of virtual machines, set the sched.mem.pshare.salt of each virtual machine in the set to an identical value.  

Alternatively, to enable TPS among all virtual machines (inter-VM TPS), you can set Mem.ShareForceSalting to 0, which causes sched.mem.pshare.salt to be ignored and to have no impact.

Or, to enable inter-VM TPS as the default, but yet allow the use of sched.mem.pshare.salt to control the effect of TPS per virtual machine, set the value of Mem.ShareForceSalting to 1. In this case, change the value of sched.mem.pshare.salt per virtual machine to prevent it from sharing with all virtual machines and restrict it to sharing with those that have an identical setting.

Nice! 
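For future me, setting the above from PowerCLI would look something like this (the VM/host names and the salt value are made up):

# allow TPS among a specific set of VMs by giving them an identical salt
$vms = Get-VM "VM01", "VM02"
New-AdvancedSetting -Entity $vms -Name "sched.mem.pshare.salt" -Value "1234" -Confirm:$false

# or enable inter-VM TPS host-wide by zeroing Mem.ShareForceSalting
$esx = Get-VMHost "esx01"
Get-AdvancedSetting -Entity $esx -Name "Mem.ShareForceSalting" | Set-AdvancedSetting -Value 0 -Confirm:$false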

I wonder if intra-VM TPS gives much memory savings. Looking at the output from my script for our estate, I see that many of our server VMs have about half their allocated RAM shown as shared, so it does make an impact. I guess it will also make a difference when moving to a container architecture, wherein a single VM might have many containers.

I would also like to point to this blog post, and another blog post I came across from it, on whether inter-VM TPS even makes much sense in today's environments, and on the kind of impact it can have during vMotion etc. Good stuff. I am still reading these but wanted to link to them for reference. Mainly: nowadays we have larger page sizes, so the probability of finding an identical page to share between two VMs is low; there is NUMA, which places memory pages closer to the CPU, and TPS could disrupt that; and TPS is a process that runs periodically to compare pages, so there is an operational cost as it runs, finds a match, and then does a full compare of the two pages to ensure they are really identical.

Good to know. 

ODBC DSN creation for vCenter database

Was creating a DSN on my vCenter 6.5 to point to a SQL 2014 database and the following post was useful. Take special note of the path from where you can install the SQL driver for the DSN.


Unable to ping Nested VMs (XenServer/ VMware ESXi)

Spent the better part of two days chasing an issue only to find it was no issue at all. So irritated! It wasn't a total waste of time as I got to read stuff, but it sidetracked me from the main issue.

Here’s my setup. I have a Windows Server 2012R2 physical server. This runs VMware Workstation 12.5. Within it I have XenServer and VMware ESXi (the hypervisor isn’t relevant to the story but I mention it anyways). Within the hypervisor I have a Windows 8.1 VM – well two of them actually, but again it doesn’t matter much to the story.

Within VMware Workstation I have a couple of other servers too – a mix of Windows Server 2012 R2, 2016, and FreeBSD.

Let's call the VMs within VMware Workstation "VMs", and the VMs within the nested hypervisors "VVMs". The issue was that from the VVMs I was able to ping the VMs and get info from them (e.g. IP addresses), but I couldn't ping the VVMs from the VMs. It didn't matter which hypervisor the VVM was on. Also, the VVMs couldn't ping each other.

There are a lot of forum and blog posts on this topic, but their issue seems to be different. Their issue is that the VVMs are unable to see the outside world (i.e. the VMs). My issue was the reverse: the VVMs could see the outside world; it was the outside world that couldn't see them. All the forum and blog posts pointed to it being a case of the virtual switch not allowing promiscuous mode or forged MAC addresses, and the fix was to enable these. In my case I couldn't find any such setting in VMware Workstation, so I began suspecting it as the culprit.

Some good links I found while reading on these; putting them here as info for myself:

Oh, and if you are on a Linux host (where VMware Workstation is running) then you need to do some extra stuff to enable Promiscuous mode.

Nowhere could I find anything on what to do for VMware Workstation running on Windows and whether it had promiscuous mode enabled or not.

Finally I resorted to using tcpdump (on XenServer)/ tcpdump-uw (on ESXi) to see if the nested hypervisor was receiving the ICMP packets – it was. The ARP requests had the correct MAC addresses too. Next I installed Wireshark on a VM and a VVM to see what was happening, and I could see that the VVM was receiving packets but not replying. So the switch in VMware Workstation was definitely in promiscuous mode – the problem was in the VVM. I didn't suspect a VVM firewall at all as I had disabled the Windows firewall service; but just for the heck of it I enabled the firewall service and simply turned off the firewall. And what do you know – suddenly the VVM was responding to ICMP packets!!

I have no idea why this is so. I had always thought disabling the firewall service was enough to … well, disable the firewall. But it looks like actually disabling the firewall for each of the network profiles is the important thing. Weird.
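Note to self: the cleaner way to turn the firewall off per network profile (rather than disabling the service) is:

# PowerShell on the guest – turn the firewall off for all three profiles
Set-NetFirewallProfile -Profile Domain,Private,Public -Enabled False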

Anyways – after two days of scratching my head I now have connectivity from my VMs to VVMs.
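ps. For future me, the packet captures were along these lines (interface names are examples; yours will differ):

tcpdump -i xenbr0 icmp    # on the XenServer console, capturing on the bridge
tcpdump-uw -i vmk0 icmp   # on the ESXi shell, capturing on a VMkernel interface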

Reboot a bunch of ESXi hosts one after the other

Not a big deal, I know, but I felt like posting this. :)

Our HP Gen8 ESXi hosts were randomly crashing ever since we applied the latest ESXi 5.5 updates to them in December. Logged a call with HPE, and it turns out that until a proper fix is issued by VMware/ HPE we need to change a setting on all our hosts and reboot them. I didn't want to do it manually, so I used PowerCLI to do it en masse.

Here’s the script I wrote to target Gen8 hosts and make the change:
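It was essentially this; the actual advanced setting HPE asked us to change isn't named here, so "Some.Setting" and its value are placeholders:

Get-VMHost | Where-Object { $_.Model -match "Gen8" } | ForEach-Object {
    # substitute the setting name/value from HPE's advisory
    Get-AdvancedSetting -Entity $_ -Name "Some.Setting" | Set-AdvancedSetting -Value 0 -Confirm:$false
}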

I could have done the reboot along with this, but I didn’t want to. Instead I copy pasted the list of affected hosts into a text file (called ESXReboot.txt in the script below) and wrote another script to put them into maintenance mode and reboot one by one.
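A rough reconstruction of that second script – drain each host, reboot it, wait for it to come back, then exit maintenance mode before moving on (a sturdier version would first wait for the host to drop to NotResponding):

foreach ($name in (Get-Content .\ESXReboot.txt)) {
    $esx = Get-VMHost -Name $name
    Set-VMHost -VMHost $esx -State Maintenance -Evacuate | Out-Null
    Restart-VMHost -VMHost $esx -Confirm:$false | Out-Null
    do {
        Start-Sleep -Seconds 60
        $esx = Get-VMHost -Name $name
    } until ($esx.ConnectionState -eq "Maintenance")   # i.e. it has rebooted and reconnected
    Set-VMHost -VMHost $esx -State Connected | Out-Null
}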

The screenshot output is slightly different from what you would get from the script as I modified it a bit since taking the screenshot. Functionality-wise there’s no change.

VCSA migration – “A problem occurred while logging in. Verify the connection details.”

So, I was trying out a Windows vCenter 5.5 to VCSA 6.5 appliance migration, and at the stage where I enter the target ESX host name (where the appliance will be deployed to) I got the above error.

Wasted the better part of my day troubleshooting this as I could find absolutely no mention of what was causing this. The installer log had the following but that didn’t shed much light either.

Tried stuff like 1) a different ESX host, 2) updating it to a later version (it was 5.5 Build 3568722), 3) turning on the ESXi Shell and SSH in case that mattered – but nothing helped!

Nothing came up regarding the "vimService creation failed: Error" line either. But then I began Googling "vimService" and learnt that it is the vSphere Management SDK, and that you access the SDK via a URL like https://servername/sdk. That got me thinking whether the VCSA installer picks up the proxy settings of the machine I am running it from, so I turned off the proxy settings in IE – and that helped!

Who would have thought. :)

P2V a SQL cluster by breaking the cluster

Need to P2V a SQL cluster at work. Here are screenshots of what I did in a test environment to see if an idea of mine would work.

We have a SQL cluster with two physical nodes. The requirement was to convert this into a single virtual machine.

P2V-ing a single server is easy – use VMware Converter. But P2V-ing a cluster like this is tricky. You could P2V each node and end up with a cluster of two virtual nodes, but that wasn't what we wanted. We didn't want to deal with RDMs and such for the cluster, so we wanted to get rid of the cluster itself. VMware can provide HA if anything happens to the single node.

My idea was to break the cluster and have one of its nodes assume the identity of the cluster. Have SQL running off that. Virtualize this single node. And since there's no change as far as the outside world is concerned, no one's the wiser.

Found a blog post that pretty much does what I had in mind. Found one more which was useful but didn’t really pertain to my situation. Have a look at the latter post if your DTC is on the Quorum drive (wasn’t so in my case).

So here we go.

1) Make the node that I want to retain the active node of the cluster (so it has all the disks and databases). Then shut down SQL Server.

[screenshot: shutting down SQL Server]

2) Shut down the cluster.

[screenshot: shutting down the cluster]

3) Remove the node we want to retain from the cluster.

We can't remove/ evict the node via the GUI as the cluster is offline. Nor can we remove the Failover Cluster feature from the node as it is still part of a cluster (even though the cluster is shut down). So we need to do a bit of "surgery". :)

Open PowerShell and do the following:
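Presumably this cmdlet, going by the description that follows:

# strips cluster configuration from a node that is no longer part of a cluster
Clear-ClusterNode -Force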

This simply clears any cluster-related configuration from the node. It is meant to be used on evicted nodes.

Once that’s done remove the Failover Cluster feature and reboot the node. If you want to do this via PowerShell:
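Something like this should do it on Server 2012 R2:

Uninstall-WindowsFeature -Name Failover-Clustering
Restart-Computer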

4) Bring online the previously shared disks.

Once the node is up and running, open Disk Management and mark as online the shared disks that were previously part of the cluster.

disksonline

5) Change the IP and name of this node to that of the cluster.

Straightforward. Add CNAME entries in DNS if required. Also, you will have to remove the cluster computer object from AD before renaming this node to that name.

6) Make some registry changes.

The SQL Server is still not running as it expects to be on a cluster. So make some registry changes.

First go to HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\Setup and open the entry called SQLCluster and change its value from 1 to 0.

Then take a backup (just in case; we don’t really need it) of the key called HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\Cluster and delete it.

Note that MSSQL10_50.MSSQLSERVER may vary if your version of SQL Server differs from mine.
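The same registry surgery in PowerShell, for reference (adjust the instance name as noted above):

$base = "HKLM:\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER"
Set-ItemProperty -Path "$base\Setup" -Name SQLCluster -Value 0
# export a backup of the Cluster key, then delete it
reg export "HKLM\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\Cluster" C:\Temp\Cluster.reg
Remove-Item -Path "$base\Cluster" -Recurse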

7) Start the SQL services and change their startup type to Automatic.

I had 3 services.
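Via PowerShell that would be something like this (service names depend on the instance; these are typical for a default instance):

Get-Service MSSQLSERVER, SQLSERVERAGENT |
    Set-Service -StartupType Automatic -PassThru |
    Start-Service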

Now your SQL server should be working.

8) Restart the server – not needed, but I did so anyways.

Test?

If you are doing this in a test environment (like I was) and don’t have any SQL applications to test with, do the following.

Right click the desktop on any computer (or the SQL server computer itself) and create a new text file. Then rename it to blah.udl. The name doesn't matter as long as the extension is .udl. Double click it to get a window like this:

[screenshot: the Data Link Properties window]

Now you can fill in the SQL server name and test it.

One thing to keep in mind (if you are not a SQL person – I am not): the "Windows NT Integrated security" option is what you need if you want to authenticate against the server with an AD account. It is tempting to select the "Use a specific user name …" option and put an AD username/ password there, but that won't work. That option is for SQL authentication.

If you want to use a different AD account you will have to launch the tool via "run as" with that account.

Also, on a fresh install of SQL Server, SQL authentication is disabled by default. You can create SQL accounts but authentication will fail. To enable SQL authentication, right click on the server in SQL Server Management Studio and go to Properties, then go to Security and enable SQL Server authentication.

[screenshot: enabling SQL Server authentication]

That’s all!

Now one can P2V this node.

Removing Datastores from an ESX host

Datastores in ESX hosts are made up of extents. Extents can be thought of as the underlying physical disk/ LUN that goes into making up the datastore.

A datastore is usually made up of a single extent, but it can span multiple extents too. So removing a datastore from an ESX host means you unmount the datastore and then detach the extents.

Datastores have friendly names that you assign when creating it. Extents have names that usually start with naa or eui.

In vSphere client when you select a host, go to its Configuration tab, Storage, select Datastores view – the “Identification” column shows the datastore name and the “Device” column shows the extent name.

In PowerCLI the same information can be seen using Get-View or the ExtensionData property of a datastore object (as in my previous post).
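For example, something along these lines maps each VMFS datastore to its extent names:

Get-Datastore | Select-Object Name,
    @{N="Extents"; E={ $_.ExtensionData.Info.Vmfs.Extent | ForEach-Object { $_.DiskName } }}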

Anyways, to remove a datastore from an ESX host you first go to the Datastores screen as above, select the datastore, right click and select “Unmount”. This will do a bunch of checks (such as whether any VMs running on that host have their disks on this datastore) and then let you unmount it. This only removes the datastore name from the ESX host though; the host can still see and mount the datastore. So the next step is to also detach the extent from the host – i.e. unpresent the underlying disk/ LUN from the host.

For this you need the extent names. Get these as above (by expanding the "Device" column to see the name, or via PowerCLI). Then go to the Devices view (instead of the Datastores view you are currently on). Expand the "Identifier" column and find the extents we want to detach. Once you find them, right click and select "Detach". This too does some checks and then lets you detach the extent if it's not in use.

That’s it.

p.s. Too lazy to take screenshots. Sorry about that. :)

Get a list of VMs running on specific datastores, along with the host

Needed to dismount some datastores/ LUNs from a few hosts but before doing that needed to ensure none of the VMs running on these datastores are hosted on the hosts I want to remove access from. This one-liner PowerCLI will do just that for you:
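Reconstructed from memory, it was something like:

Get-VM -Datastore (Get-Datastore -Name "PP_*") |
    Select-Object Name, @{N="Host"; E={ $_.VMHost.Name }}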

Replace “PP_” with the pattern you are interested in matching in the datastore name.

A variation of the above, where I only list VMs that are hosted on the hosts I want to remove access from:
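Again a reconstruction; the -notmatch pattern encodes the host naming convention described below:

Get-VM -Datastore (Get-Datastore -Name "PP_*") |
    Where-Object { $_.VMHost.Name -notmatch "0[1-3]" } |
    Select-Object Name, @{N="Host"; E={ $_.VMHost.Name }}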

In my case the hosts that should have access to the datastores with a “PP_” in their name will also have numbers 01-03 in them. Any VMs not on hosts with these names are what I am interested in.

PowerCLI, VMware Tools update, etc.

(The following is based on this VMware KB article which is for ESXi 4.0 and earlier but can be made to work for later versions too).

In vSphere client we can see the VMware Tools related settings of a VM in the Options tab of the VM properties window. In PowerCLI these are exposed under the ExtensionData object. Specifically the ExtensionData.Config.Tools object.

The ExtensionData object has many methods and properties – think of it like the advanced options menu in a GUI. One of these methods is ReconfigVM() which takes an object of type VMware.Vim.VirtualMachineConfigSpec and reconfigures the VM accordingly.

So, to take the example of modifying the VMware Tools update settings, all one has to do is create a new object of the type above and pass it to the ReconfigVM() method. Something like below.

First we create an object of this type:
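Something like:

$VMConfigSpec = New-Object VMware.Vim.VirtualMachineConfigSpec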

If we look at this object now we will see that it has various properties and methods. The Tools related settings are controlled by a property called Tools of type VMware.Vim.ToolsConfigInfo. To modify these we need to create a new object of that type:
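Again, something like:

$VMConfigSpec.Tools = New-Object VMware.Vim.ToolsConfigInfo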

This has no settings by default:

But we can set the properties we are interested in modifying.

For instance to set VMware Tools to be automatically updated upon power cycle do the following:
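Something like this should do it (the VM name is hypothetical; ToolsUpgradePolicy takes "manual" or "upgradeAtPowerCycle"):

$VMConfigSpec.Tools.ToolsUpgradePolicy = "upgradeAtPowerCycle"
$vm = Get-VM -Name "VM01"
$vm.ExtensionData.ReconfigVM($VMConfigSpec)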

To undo that change set the value to “manual” (it only takes two options).

Here's an example of me changing the VMware Tools update setting to manual.

So that’s it. Now to do this en-masse for a bunch of VMs you can make a loop.
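The loop would look roughly like this, reading VM names from a text file (the file name is made up):

foreach ($name in (Get-Content .\VMList.txt)) {
    $VMObj = Get-VM -Name $name
    $spec = New-Object VMware.Vim.VirtualMachineConfigSpec
    $spec.Tools = New-Object VMware.Vim.ToolsConfigInfo
    $spec.Tools.ToolsUpgradePolicy = "upgradeAtPowerCycle"
    $VMObj.ExtensionData.ReconfigVM($spec)
}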

If the list of VMs comes from vCenter directly (via, say, something like Get-Cluster "CLUSTER NAME" | Get-VM) then the code needs a bit of change (the $VMObj line can be removed).

Just as a reference to future me, the output returned by the ExtensionData object is what you would get via the Get-View cmdlet.

Update: Came across this while writing this post. If you have multiple vCenter servers and want PowerCLI to work against entities in all of them the following will help.
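That setting, for the record – it makes subsequent PowerCLI cmdlets operate against all connected vCenter servers rather than just the most recently connected one:

Set-PowerCLIConfiguration -DefaultVIServerMode Multiple -Confirm:$false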

Enabling SNMPv3 on ESXi hosts

A continuation to my earlier post which was to do with SNMPv2.

As before, connect to the vCenter via PowerCLI. And as before the set() method can be used to set SNMP – both v2 and/or v3. The definition of this method is as follows:

That’s confusing so best to copy paste the definition into notepad or something so you can be sure you are passing the correct arguments.

First things first. There doesn't seem to be a way of turning something off. As in, say you already have SNMPv2 turned on; you can't turn it off by setting the community strings to blank – doing so generates an error. So if you want to turn previous things off it's best to do a reset and start with a clean slate.

This sets things back to their defaults:

Before going ahead with any SNMPv3 configuration we need to decide on what authentication and privacy protocols to use. In my case I want to use SHA1 and AES-128. So I need to set that first:

Once I have done this I can generate the hashes. I will need this later to configure SNMPv3.

In the example above both my passwords are Password1.

With this in hand I configure SNMPv3:

That’s it really. In the above example I will be using an SNMPv3 user called snmpUser1.
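For reference, the equivalent plain esxcli commands (which the set()/hash() methods wrap) go roughly like this – double-check the flags against the docs for your ESXi version:

# reset SNMP config to defaults
esxcli system snmp set --reset
# choose the authentication and privacy protocols
esxcli system snmp set --authentication SHA1 --privacy AES128
# generate hashes from the raw passwords
esxcli system snmp hash --auth-hash Password1 --priv-hash Password1 --raw-secret
# configure the v3 user with the two hashes printed by the previous command
esxcli system snmp set --users snmpUser1/<auth-hash>/<priv-hash>/priv
esxcli system snmp set --enable true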

Now to do it across my estate I can make a loop. No need to create password hashes for each host. The hash stays the same as long as you are using the same password for each host.

That’s all!

vSphere Replication does not support changing the length of a replicated disk.

Had to extend a VM's disk today and got the above error. This is because the VM is replicated via vSphere Replication, so you can't simply extend the disk as you would for any regular VM.

[screenshot of the error]

Here’s a top level summary of how you do it (based on this KB article).

  1. You have to break the replication. Stop it that is. But doing so deletes the replicated files, so first you want to work around that (as below).
    1. Note the current settings of the replication.
    2. Then pause the replication.
    3. Find out which datastore holds the replicated VM disks.
    4. Rename the replicated VM folder.
    5. Now you can stop the replication because you have kept a copy of the data.
  2. SSH into any ESX host that has access to the above datastore and extend the disk associated with the VMDK via vmkfstools.
  3. Rename the folder back to what it was before.
  4. Recreate the replication, but point the destination to the same datastore as above and select the folder above. vSphere Replication will ask whether you want to use the existing data as seed – answer yes.

That’s it basically.

In terms of the details, I didn't know how to find which datastore had the replicated VM files. So I SSH'd into one of the hosts in the replicated VM's cluster and ran the following:
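It would have been something along these lines (VMName is a placeholder):

find /vmfs/volumes -iname "VMName*"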

There must be some better way, but what the heck. Once I found the path above I did the following to find other VMs in it, and using that info I was able to find the datastore name from vSphere client.
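Roughly:

# list the folders on that volume – the other VM names there let you spot the datastore in the vSphere client
ls /vmfs/volumes/<datastore-uuid>/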

You need this datastore name for when setting up a new replication, so you can point to that.

Some more things to keep in mind are the following.

  1. Since we pause the replication rather than stop it, the folder will contain a bunch of hbr* files. Delete those.
  2. The vmkfstools -X switch takes the new size of the disk, not the additional amount. So if the disk is 10GB and you want to add 20GB, you specify the argument as 30GB. If you are getting a "Failed to extend disk : One of the parameters supplied is invalid (1)." error with vmkfstools, that's probably why.
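For example, to grow a 10 GB disk to a total of 30 GB (the paths are placeholders):

vmkfstools -X 30g /vmfs/volumes/<datastore>/<VMName>/<VMName>.vmdk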

VMware client – unable to login with username, password; but able to login with “use windows credentials”

We had this weird issue at work yesterday wherein you could not log in to the vCenter server by entering a username/ password, but could if you just ticked the "Use windows session credentials" checkbox.

The issue eventually got resolved by stopping the "VMware Secure Token Service", restarting the "VMware VirtualCenter Server" service, and then starting the "VMware Secure Token Service" again. No idea why that made a difference though, or whether it actually fixed things or was just coincidental. Around the same time I had seen some VMware Tools errors, so I (a) upgraded the tools, (b) moved the vCenter VM to a different host, and (c) found that one of these had caused issues with the network driver, so I had to uninstall and reinstall the tools and then reset the secure channel with the domain (since when the vCenter VM came up it didn't have network connectivity).

So it was a bit of a damper actually. Nothing more frustrating than spending a lot of time troubleshooting something and not really figuring out what the issue is. On the plus side at least the issue got sorted, but it leaves me uneasy not knowing what really went wrong and whether it will re-occur.

In the event logs there were many entries like these:

An account failed to log on.

Subject:
    Security ID:        SYSTEM
    Account Name:        VCENTER01$
    Account Domain:        MYDOMAIN
    Logon ID:        0x3e7

Logon Type:            3

Account For Which Logon Failed:
    Security ID:        NULL SID
    Account Name:        SomeAccount
    Account Domain:        MYDOMAIN.COM

Failure Information:
    Failure Reason:        Unknown user name or bad password.
    Status:            0xc000006d
    Sub Status:        0xc0000064

Process Information:
    Caller Process ID:    0xe20
    Caller Process Name:    E:\Program Files\VMware\Infrastructure\VMware\CIS\vmware-sso\VMwareIdentityMgmtService.exe

Network Information:
    Workstation Name:    VCENTER01
    Source Network Address:    –
    Source Port:        –

Detailed Authentication Information:
    Logon Process:        Advapi  
    Authentication Package:    Negotiate
    Transited Services:    –
    Package Name (NTLM only):    –
    Key Length:        0

Here’s what the error codes mean –

  • NULL SID suggests that the account that was being authenticated could not be identified
  • 0xC000006D means that authentication failed due to bad credentials
  • 0xC0000064 means that the requested user name does not exist.
  • Logon type 3 means the request was received from the network (but given the request originated from the server itself, this suggests the request was looped back over the network stack).

Not that it throws much light on what’s happening.

For info – this KB article lists the useful vCenter log files. I looked at the vpxd-xxxx.log file which had some entries like these –

2016-06-06T16:08:18.046+01:00 [02856 error ‘[SSO]’ opID=138a737d] [UserDirectorySso] AcquireToken exception: class SsoClient::CommunicationException(No connection could be made because the target machine actively refused it)
2016-06-06T16:08:18.046+01:00 [02856 error ‘authvpxdUser’ opID=138a737d] Failed to authenticate user <mydomain\someaccount>

This file is under C:\ProgramData\VMware\VMware VirtualCenter\Logs by the way.

I also found messages like these –

2016-06-06T10:17:59.226+01:00 [06952 error ‘[SSO]’ opID=1790eabb] [UserDirectorySso] AcquireToken exception: class SsoClient::SsoException(Failed to parse Group Identity value: `\Authentication authority asserted identity’; domain or group missing)

Two more logs I looked at are C:\ProgramData\VMware\CIS\logs\vmware-sso\vmware-sts-idmd.log and some files under C:\ProgramData\VMware\CIS\runtime\VMwareSTS\logs. In case of the latter location I just sorted by the recently modified timestamp and found some logs to look at. I focused on one called ssoAdminServer.log. This file had a few entries like these –

[2016-06-06 12:19:08,987 pool-11-thread-1  ERROR com.vmware.identity.admin.server.ims.impl.PrincipalManagementImpl] Idm client exception
com.vmware.identity.idm.IDMException: Invalid group name format for [\Authentication authority asserted identity]
    at com.vmware.identity.idm.server.ServerUtils.getRemoteException(ServerUtils.java:131)
    at com.vmware.identity.idm.server.IdentityManager.findNestedParentGroupsInternal(IdentityManager.java:4006)

I found mention of this message in a forum post which pointed to this being a known issue for vCenter installed on a 2012 server with a 2012 DC. That doesn’t apply to me.

The vSphere Web Client gives an error message “Cannot Parse Group Information” – which too is a symptom if you install vCenter on a 2012 server with a 2012 DC. Moreover it applies to vCenter 5.5 GA, which is what we are on, so all the symptoms point to that issue but it is not so in our case. :(

Back to the vmware-sts-idmd.log, that had entries like these –

2016-06-06 09:00:26,089 WARN   [ActiveDirectoryProvider] obtainDcInfo for domain [VCENTER01] failed Failed to get domain controller information for VCENTER01(dwError – 1355 – ERROR_NO_SUCH_DOMAIN)
2016-06-06 09:00:26,090 WARN   [ActiveDirectoryProvider] obtainDcInfo for domain [VCENTER01] failed Failed to get domain controller information for VCENTER01(dwError – 1355 – ERROR_NO_SUCH_DOMAIN)
2016-06-06 09:00:26,091 ERROR  [ValidateUtil] resolved group name=[\Authentication authority asserted identity] is invalid: not a valid netbios name format  
2016-06-06 09:00:26,092 INFO   [ActiveDirectoryProvider] resolved group name=[\Authentication authority asserted identity] is invalid: not a valid netbios name format  
…<snip>…
2016-06-06 09:02:53,005 INFO   [IdentityManager] Failed to find principal [SomeAccount@mydomain.tld] as FSP group in tenant [vsphere.local]
2016-06-06 09:02:53,008 INFO   [IdentityManager] Failed to find FSP user or gorup [SomeAccount@mydomain.tld]’s nested parent groups in tenant [vsphere.local]
2016-06-06 09:02:53,013 ERROR  [IdentityManager] Failed to find nested parent groups of principal [SomeAccount@mydomain.tld] in tenant [vsphere.local]
2016-06-06 09:02:53,013 ERROR  [ServerUtils] Exception ‘java.lang.IllegalStateException: Invalid group name format for [\Authentication authority asserted identity]’
java.lang.IllegalStateException: Invalid group name format for [\Authentication authority asserted identity]
    at com.vmware.identity.idm.server.provider.activedirectory.ActiveDirectoryProvider.findNestedParentGroupsByPac(ActiveDirectoryProvider.java:2140)
    at com.vmware.identity.idm.server.provider.activedirectory.ActiveDirectoryProvider.findNestedParentGroups(ActiveDirectoryProvider.java:791)
    at com.vmware.identity.idm.server.IdentityManager.findNestedParentGroupsInternal(IdentityManager.java:3985)
    at com.vmware.identity.idm.server.IdentityManager.findNestedParentGroups(IdentityManager.java:3856)
    at sun.reflect.GeneratedMethodAccessor37.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at sun.rmi.server.UnicastServerRef.dispatch(Unknown Source)
    at sun.rmi.transport.Transport$1.run(Unknown Source)
    at sun.rmi.transport.Transport$1.run(Unknown Source)
    at java.security.AccessController.doPrivileged(Native Method)
    at sun.rmi.transport.Transport.serviceCall(Unknown Source)
    at sun.rmi.transport.tcp.TCPTransport.handleMessages(Unknown Source)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(Unknown Source)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)

Again, something to do with DC/ domain … but what!? Found this blog post too that suggested the same.

For my reference, here’s a KB article listing all the SSO log files. And this is a useful blog post in case I happen upon a similar issue later (the case of the flapping VMware Secure Token Service). As is this KB article on an SSO facade error.

Cannot login to vCenter with “use windows session credentials” but can login by entering username & password

Had this issue today (and a few months ago too). I open the vCenter client, type in the vCenter server name, tick "Use Windows Session Credentials" as usual, and login fails. Says it cannot login with the given credentials.

At the same time I can login with the vSphere Web Client and also by un-ticking the box and manually entering the username/ password.

Fix for both times was to reset the secure channel by logging in to the vCenter server –
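Resetting the secure channel from an elevated PowerShell prompt on the server goes something like this:

# verify the machine's secure channel with the domain, and repair it if broken
Test-ComputerSecureChannel -Repair -Credential (Get-Credential)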