Creative Commons Attribution 4.0 International License
© Rakhesh Sasidharan

P2V a SQL cluster by breaking the cluster

Need to P2V a SQL cluster at work. Here are screenshots of what I did in a test environment to see if an idea of mine would work.

We have a 2 physical-nodes SQL cluster. The requirement was to convert this into a single virtual machine.

P2V-ing a single server is easy. Use VMware Converter. But P2V-ing a cluster like this is tricky. You could P2V each node and end up with a cluster of 2 virtual-nodes but that wasn’t what we wanted. We didn’t want to deal with RDMs and such for the cluster, so we wanted to get rid of the cluster itself. VMware can provide HA if anything happens to the single node.

My idea was to break the cluster and get one of the nodes of the cluster to assume the identity of the cluster. Have SQL running off that. Virtualize this single node. And since there’s no change as far as the outside world is concerned no one’s the wiser.

Found a blog post that pretty much does what I had in mind. Found one more which was useful but didn’t really pertain to my situation. Have a look at the latter post if your DTC is on the Quorum drive (wasn’t so in my case).

So here we go.

1) Make the node that I want to retain the active node of the cluster (so it has all the disks and databases). Then shut down SQL server.

2) Shutdown the cluster.

3) Remove the node we want to retain, from the cluster.

We can’t remove/ evict the node via GUI as the cluster is offline. Nor can we remove the Failover Cluster feature from the node as it is still part of a cluster (even though the cluster is shutdown). So we need to do a bit of “surgery”. :)

Open PowerShell and do the following:
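A minimal sketch of this step, assuming the cmdlet meant here is Clear-ClusterNode from the FailoverClusters module (the description that follows matches it). Run it in an elevated session on the node being retained:

```powershell
# Assumption: Clear-ClusterNode is the intended cmdlet. Run elevated, on the node being kept.
Import-Module FailoverClusters

# Wipe the cluster configuration from the local node; -Force skips the confirmation prompt.
Clear-ClusterNode -Force
```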

This simply clears any cluster related configuration from the node. It is meant to be used on evicted nodes.

Once that’s done remove the Failover Cluster feature and reboot the node. If you want to do this via PowerShell:
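Something along these lines should do it, assuming the ServerManager module that ships with Windows Server 2008 R2 and later (on 2012 and later the cmdlet is also available as Uninstall-WindowsFeature):

```powershell
Import-Module ServerManager

# Remove the Failover Clustering feature; -Restart reboots the node if the removal needs it.
Remove-WindowsFeature -Name Failover-Clustering -Restart
```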

4) Bring online the previously shared disks.

Once the node is up and running, open Disk Management and mark as online the shared disks that were previously part of the cluster.

5) Change the IP and name of this node to that of the cluster.

Straight-forward. Add CNAME entries in DNS if required. Also, you will have to remove the cluster computer object from AD first before renaming this node to that name.

6) Make some registry changes.

The SQL Server is still not running as it expects to be on a cluster. So make some registry changes.

First go to HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\Setup and open the entry called SQLCluster and change its value from 1 to 0.

Then take a backup (just in case; we don’t really need it) of the key called HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\Cluster and delete it.

Note that MSSQL10_50.MSSQLSERVER may vary depending on whether you have a different version of SQL than in my case.
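The same registry changes can be sketched in PowerShell. This assumes the default SQL Server 2008 R2 instance key shown above; the backup file location is a hypothetical choice:

```powershell
# Assumption: default SQL Server 2008 R2 instance; adjust the instance key for your version.
$instanceKey = 'HKLM:\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER'

# Tell SQL Server it is no longer clustered (1 -> 0).
Set-ItemProperty -Path "$instanceKey\Setup" -Name SQLCluster -Value 0

# Back up the Cluster key (hypothetical backup path), then delete it.
reg export "HKLM\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\Cluster" "$env:TEMP\sql-cluster-key.reg"
Remove-Item -Path "$instanceKey\Cluster" -Recurse
```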

7) Start the SQL services and change their startup type to Automatic.

I had 3 services.
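A rough PowerShell equivalent, assuming a default instance where the engine and agent services have their usual names (MSSQLSERVER and SQLSERVERAGENT). Add whatever other SQL services you have to the list:

```powershell
# Assumption: default instance, so these are the usual service names; yours may differ.
$services = 'MSSQLSERVER', 'SQLSERVERAGENT'

foreach ($name in $services) {
    Set-Service -Name $name -StartupType Automatic
    Start-Service -Name $name
}

# Confirm they are running.
Get-Service -Name $services | Select-Object Name, Status
```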

Now your SQL server should be working.

8) Restart the server – not needed, but I did so anyways.

Test?

If you are doing this in a test environment (like I was) and don’t have any SQL applications to test with, do the following.

Right click the desktop on any computer (or the SQL server computer itself) and create a new text file. Then rename that to blah.udl. The name doesn’t matter as long as the extension is .udl. Double click on that to open the Data Link Properties window.

Now you can fill in the SQL server name and test it.

One thing to keep in mind (if you are not a SQL person – I am not). The Windows NT Integrated security is what you need to use if you want to authenticate against the server with an AD account. It is tempting to select the “Use a specific user name …” option and put in an AD username/ password there, but that won’t work. That option is for using SQL authentication.

If you want to use a different AD account you will have to do a run as of the tool.

Also, on a fresh install of SQL server SQL authentication is disabled by default. You can create SQL accounts but authentication will fail. To enable SQL authentication right click on the server in SQL Server Management Studio and go to Properties, then go to Security and enable SQL authentication.

That’s all!

Now one can P2V this node.

PowerShell – List of machines with CPU and disk info

Had to come up with a list of machines and their CPU, disk info etc. Thought it better to make a small script for it.
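A sketch of what such a script can look like (not necessarily the original one). It assumes a hypothetical machines.txt with one computer name per line, and that WMI is reachable on each machine:

```powershell
# Assumption: machines.txt (hypothetical) has one computer name per line.
$computers = Get-Content .\machines.txt

$report = foreach ($computer in $computers) {
    $cpus  = @(Get-WmiObject -Class Win32_Processor -ComputerName $computer)
    # DriveType 3 = local fixed disks
    $disks = Get-WmiObject -Class Win32_LogicalDisk -ComputerName $computer -Filter "DriveType = 3"

    foreach ($disk in $disks) {
        # One row per disk, with the machine's CPU details repeated on each row.
        New-Object PSObject -Property @{
            Computer = $computer
            CPU      = $cpus[0].Name
            Sockets  = $cpus.Count
            Cores    = ($cpus | Measure-Object -Property NumberOfCores -Sum).Sum
            Drive    = $disk.DeviceID
            SizeGB   = [math]::Round($disk.Size / 1GB, 1)
            FreeGB   = [math]::Round($disk.FreeSpace / 1GB, 1)
        }
    }
}

$report | Select-Object Computer, CPU, Sockets, Cores, Drive, SizeGB, FreeGB | Format-Table -AutoSize
```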

Hope it helps someone.

Removing Datastores from an ESX host

Datastores in ESX hosts are made up of extents. Extents can be thought of as the underlying physical disk/ LUN that goes into making up the datastore.

A datastore is usually made up of a single extent, but can span multiple extents too. So removing a datastore from an ESX host means you dismount the datastore and then detach the extents.

Datastores have friendly names that you assign when creating them. Extents have names that usually start with naa or eui.

In vSphere client when you select a host, go to its Configuration tab, Storage, select Datastores view – the “Identification” column shows the datastore name and the “Device” column shows the extent name.

In PowerCLI the same information can be seen using Get-View or the ExtensionData property object of a datastore object (as in my previous post).
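For instance, a sketch that lists each VMFS datastore with its extent names, assuming you are already connected via Connect-VIServer:

```powershell
# Assumes an existing Connect-VIServer session.
# NFS datastores have no extents, so only VMFS ones are listed.
Get-Datastore | Where-Object { $_.Type -eq 'VMFS' } | Select-Object Name,
    @{ Name = 'Extents'; Expression = { $_.ExtensionData.Info.Vmfs.Extent | ForEach-Object { $_.DiskName } } }
```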

Anyways, to remove a datastore from an ESX host you first go to the Datastores screen as above, select the datastore, right click and select “Unmount”. This will do a bunch of checks (such as whether any VMs running on that host have their disks on this datastore) and then let you unmount it. This only removes the datastore name from the ESX host though; the host can still see and mount the datastore. So the next step is to also detach the extent from the host – i.e. unpresent the underlying disk/ LUN from the host.

For this you need the extent names. Get these as above (by expanding the “Device” column to see the name; or use PowerCLI). Then go to the Devices view (instead of the Datastores view that you currently are on). Expand the “Identifier” column now and find the extents that we want to detach. Once you find this right click and select “Detach”. This too does some checks and then lets you detach the extent if it’s not in use.

That’s it.

p.s. Too lazy to take screenshots. Sorry about that. :)

Get a list of VMs running on specific datastores, along with the host

Needed to dismount some datastores/ LUNs from a few hosts but before doing that needed to ensure none of the VMs running on these datastores are hosted on the hosts I want to remove access from. This one-liner PowerCLI will do just that for you:
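A sketch of such a one-liner, assuming an existing Connect-VIServer session (the exact original may have differed):

```powershell
# VMs on datastores whose name matches the pattern, with the host each VM runs on.
Get-Datastore | Where-Object { $_.Name -match "PP_" } | Get-VM | Select-Object Name, VMHost
```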

Replace “PP_” with the pattern you are interested in matching in the datastore name.

A variation of the above where I only list VMs that are hosted on the hosts I want to remove access from:
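Again a sketch under the same assumptions, with the host name pattern taken from the description that follows:

```powershell
# Same as before, but keep only VMs whose host name does NOT contain 01, 02 or 03.
Get-Datastore | Where-Object { $_.Name -match "PP_" } | Get-VM |
    Where-Object { $_.VMHost.Name -notmatch "0[1-3]" } | Select-Object Name, VMHost
```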

In my case the hosts that should have access to the datastores with a “PP_” in their name will also have numbers 01-03 in them. Any VMs not on hosts with these names are what I am interested in.

Kindles – Voyage & Oasis

Recently I decided to upgrade my Kindle. I went on a splurge and first bought the Voyage, and then the Oasis (on a 5 month installment scheme from Amazon). This was a huge upgrade for me – the device I had hitherto used for reading being the first gen Paperwhite.

The first gen Paperwhite was my first and only Kindle up to this point. When the subsequent generations were released I never upgraded. Mainly coz my reading habits were off and on, and also because I used to supplement the Paperwhite with the Kindle apps on my iPad and Nexus tablet. Neither of them were as good as the Paperwhite but like I said my reading was off and on, and I used to read other stuff like PDFs and Instapaper and Longform, plus for a long time I was into comics. 

Fast forward to the present: I slowly stopped reading all those other mediums too and pretty much stopped reading altogether. I think after a long time I read “A Slight Trick of the Mind” by Mitch Cullin, mainly because I saw that excellent movie “Mr. Holmes” which is based on the book and was so in love with the movie and its background score. The book didn’t live up to either of them but I persisted and finished it nevertheless over a weekend. After this I think I read a few books on the Paperwhite – mostly non-fiction.

A few months later I signed up on Audible to try it out, this time with yet another Holmes book – that of the elder brother (a book called “Mycroft Holmes”). I didn’t enjoy this book much either but I bought the Kindle version to read side by side and also try out Whispersync. That was nice. The book wasn’t great but I enjoyed the ability to sync and read together etc. Anyways, I didn’t manage much on Audible either and was about to close it after the trial month but Amazon offered a 3 month extension at half the price and so I stuck on. Good that I did coz now I am hooked on to Audible.

I guess it’s coz of Audible and a rekindling of my interest in reading/ prose, plus a nudge from Amazon in terms of a reduced price on the Voyage for Prime members, that I bought the Voyage. This was such a giant step forward from the aging first gen Paperwhite that I was hooked and started reading voraciously. Then I wanted to try the Oasis too, and even though it is pricey and has many negative reviews regarding its screen (and I am not rich and don’t have cash to throw around) I decided to buy it.

Both are delightful devices. My favorite features would be the single pane of glass (without the depression on the screen as with the Paperwhite) – not sure why that matters, but it feels good – plus the ability to turn pages via PagePress or the physical buttons. I especially love the latter. Makes it so convenient reading single handedly.

I like both devices. I think I prefer the Voyage slightly more coz it feels more polished; but the Oasis has a lot more “cute” or children’s book sort of feel to it. It’s a nice little device. Sort of short and squarish. And more handy reading in a dark room as the page turn is via physical buttons as opposed to pressing the bezel on the Voyage (which is a hit and miss in the dark). Plus I love the cover and I feel it a lot easier to hold in hand. That’s not fair to the Voyage though as my comparison doesn’t include the Voyage case (which I don’t have). 

Initially I thought my Oasis had lighting issues as I felt one side is a bit darker than the other. I still feel so but when reading in the dark it doesn’t feel so, so maybe it’s just the external lighting. The Voyage consistently feels better in terms of lighting though. And maybe I am wrong but the text on the Voyage seems slightly sharper – but that’s probably just me nitpicking.

Anyhow. For anyone sitting on the fence these are excellent devices and a worthy upgrade over the Paperwhite (which is a good device too – what I mean is that you are getting some value for the extra cash you dole out for the Voyage or Oasis). 

Installing a new license key in KMS

KMS is something you login to once in a blue moon and then you wonder how the heck are you supposed to install a license key and verify that it got added correctly. So as a reminder to myself.

To install a license key:
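This is done with the slmgr.vbs script and its /ipk switch. A sketch, with a placeholder where the real key goes:

```powershell
# Replace the placeholder with your actual KMS host key.
cscript //nologo "$env:windir\System32\slmgr.vbs" /ipk XXXXX-XXXXX-XXXXX-XXXXX-XXXXX
```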

Then activate it:
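Activation uses the /ato switch of the same script. This sketch assumes the KMS host has internet access for online activation:

```powershell
# Activate the installed key online.
cscript //nologo "$env:windir\System32\slmgr.vbs" /ato
```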

If you want to check that it was added correctly:
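The /dlv switch shows detailed license information, including the partial product key and the license status:

```powershell
# Display detailed license information for the installed key.
cscript //nologo "$env:windir\System32\slmgr.vbs" /dlv
```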

I use cscript so that the output comes in the command prompt itself and I can scroll up and down (or put into a text file) as opposed to a GUI window which I can’t navigate.

PowerCLI, VMware Tools update, etc.

(The following is based on this VMware KB article which is for ESXi 4.0 and earlier but can be made to work for later versions too).

In vSphere client we can see the VMware Tools related settings of a VM in the Options tab of the VM properties window. In PowerCLI these are exposed under the ExtensionData object. Specifically the ExtensionData.Config.Tools object.

The ExtensionData object has many methods and properties – think of it like the advanced options menu in a GUI. One of these methods is ReconfigVM() which takes an object of type VMware.Vim.VirtualMachineConfigSpec and reconfigures the VM accordingly.

So to take the example of modifying the VMware Tools update settings all one has to do is create a new object of the type above and pass it to the ReconfigVM() method. Something as below.

First we create an object of this type:

If we look at this object now we will see that it has various properties and methods. The Tools related settings are controlled by a property called Tools of type VMware.Vim.ToolsConfigInfo. To modify these we need to create a new object of that type:

This has no settings by default:

But we can set the properties we are interested in modifying.

For instance to set VMware Tools to be automatically updated upon power cycle do the following:

To undo that change set the value to “manual” (it only takes two options).

Here’s an example of me changing the VMware Tools updating settings to be manual.
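Put together, the steps above can be sketched as follows. This assumes an existing Connect-VIServer session, and "MyVM" is a hypothetical VM name:

```powershell
# Assumes an existing Connect-VIServer session; "MyVM" is a hypothetical VM name.
$vm = Get-VM -Name "MyVM"

# Empty reconfiguration spec; only the properties we set get changed.
$vmConfigSpec = New-Object VMware.Vim.VirtualMachineConfigSpec
$vmConfigSpec.Tools = New-Object VMware.Vim.ToolsConfigInfo

# Either "manual" or "upgradeAtPowerCycle".
$vmConfigSpec.Tools.ToolsUpgradePolicy = "manual"

# Apply the change to the VM.
$vm.ExtensionData.ReconfigVM($vmConfigSpec)

# Check the result on a freshly fetched VM object.
(Get-VM -Name "MyVM").ExtensionData.Config.Tools.ToolsUpgradePolicy
```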

So that’s it. Now to do this en-masse for a bunch of VMs you can make a loop.
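A sketch of such a loop, assuming the VM names come from a hypothetical vms.txt with one name per line:

```powershell
# Assumption: vms.txt (hypothetical) lists one VM name per line.
$vmConfigSpec = New-Object VMware.Vim.VirtualMachineConfigSpec
$vmConfigSpec.Tools = New-Object VMware.Vim.ToolsConfigInfo
$vmConfigSpec.Tools.ToolsUpgradePolicy = "manual"

foreach ($VM in Get-Content .\vms.txt) {
    # Turn the name into a VM object, then reconfigure it with the shared spec.
    $VMObj = Get-VM -Name $VM
    $VMObj.ExtensionData.ReconfigVM($vmConfigSpec)
}
```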

If the list of VMs is got from vCenter directly (via, say, something like Get-Cluster "CLUSTER NAME" | Get-VM) then the code needs a bit of change (the $VMObj line can be removed).

Just as a reference to future me, the output returned by the ExtensionData object is what you would get via the Get-View cmdlet.

Update: Came across this while writing this post. If you have multiple vCenter servers and want PowerCLI to work against entities in all of them the following will help.
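One way to do this (possibly not the exact tip referred to above) is to switch PowerCLI to multiple-server mode and then connect to each vCenter. The server names here are hypothetical:

```powershell
# Let cmdlets run against all connected vCenter servers rather than just the last one.
Set-PowerCLIConfiguration -DefaultVIServerMode Multiple -Scope User -Confirm:$false

# Hypothetical vCenter names.
Connect-VIServer -Server vcenter01.mydomain.com, vcenter02.mydomain.com
```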

Enabling SNMPv3 on ESXi hosts

A continuation to my earlier post which was to do with SNMPv2.

As before, connect to the vCenter via PowerCLI. And as before the set() method can be used to set SNMP – both v2 and/or v3. The definition of this method is as follows:

That’s confusing so best to copy paste the definition into notepad or something so you can be sure you are passing the correct arguments.

First things first. There doesn’t seem to be a way of turning off something. As in, say you already have SNMPv2 turned on, you can’t turn it off by setting the community strings to blank. Doing so generates an error. So if you want to turn previous things off it’s best to do a reset and start with a clean slate.

This sets things back to their defaults:

Before going ahead with any SNMPv3 configuration we need to decide on what authentication and privacy protocols to use. In my case I want to use SHA1 and AES-128. So I need to set that first:
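The reset and the protocol selection can be sketched as below. Note that this uses the newer -V2 interface of Get-EsxCli with named arguments, not the positional set() call described above, so treat it as an alternative. The host name is hypothetical:

```powershell
# Assumes an existing Connect-VIServer session; the host name is hypothetical.
$esxcli = Get-EsxCli -VMHost (Get-VMHost -Name "esx01.mydomain.com") -V2

# Put the SNMP agent configuration back to its defaults.
$esxcli.system.snmp.set.Invoke(@{ reset = $true })

# Choose the SNMPv3 authentication and privacy protocols.
$esxcli.system.snmp.set.Invoke(@{ authentication = 'SHA1'; privacy = 'AES128' })
```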

Once I have done this I can generate the hashes. I will need this later to configure SNMPv3.

In the example above both my passwords are Password1.

With this in hand I configure SNMPv3:

That’s it really. In the above example I will be using an SNMPv3 user called snmpUser1.

Now to do it across my estate I can make a loop. No need to create password hashes for each host. The hash stays the same as long as you are using the same password for each host.

That’s all!

vSphere Replication does not support changing the length of a replicated disk.

Had to extend a VM disk today and got the above error. This is because the VM is replicated via vSphere Replication so you can’t simply extend the disk as you would do for any regular VM.

Here’s a top level summary of how you do it (based on this KB article).

  1. You have to break the replication. Stop it that is. But doing so deletes the replicated files, so first you want to work around that (as below).
    1. Note the current settings of the replication.
    2. Then pause the replication.
    3. Find out which datastore holds the replicated VM disks.
    4. Rename the replicated VM folder.
    5. Now you can stop the replication because you have kept a copy of the data.
  2. SSH into any ESX host that has access to the above datastore and extend the disk associated with the VMDK via vmkfstools.
  3. Rename the folder back to what it was before.
  4. Recreate the replication, but point the destination to the same datastore as above and select the folder above. vSphere Replication will ask whether you want to use the existing data as seed – answer yes.

That’s it basically.

In terms of the details, I didn’t know how to find which datastore had the replicated VM files. So I SSH’d into one of the hosts in the replicated VM cluster and ran the following:

There must be some better way, but what the heck. Once I found the path above I did the following to find other VMs in it, and using that info I was able to find the datastore name from vSphere client.

You need this datastore name for when setting up a new replication, so you can point to that.

Some more things to keep in mind are the following.

  1. Since we pause the replication rather than stop it, the folder will contain a bunch of hbr* files. Delete those.
  2. The vmkfstools command -X switch takes the new size of the disk. Not the additional amount. So if the disk is 10GB and you want to add 20GB, you specify the argument as 30GB. If you are getting a “Failed to extend disk : One of the parameters supplied is invalid (1).” error with vmkfstools that’s probably why.

Very Brief Notes on Windows Memory etc

I have no time nowadays to update this blog but I wanted to dump these notes I made for myself today. Just so the blog has some update and I know I can find these notes here again when needed.

This is on the various types of memory etc in Windows (and other OSes in general). I was reading up on this in the context of Memory Mapped Files.

Virtual Memory

  • Amount of memory the OS can access. Not related to the physical memory. It is only related to processor and OS – is it 32-bit or 64-bit.
  • 32-bit means 2^32 = 4GB; 64-bit means 2^64 = a lot! :)
  • On a default install of 32-bit Windows the kernel reserves 2GB for itself and applications can use the remaining 2GB. Each application gets its own 2GB, coz this memory doesn’t really exist and is not limited by the physical memory in the machine. The OS just lies to the applications that they have 2GB of virtual memory for themselves.
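
The arithmetic behind these numbers, as a quick PowerShell sketch (pure calculation, nothing is read from the system):

```powershell
$addressable32 = [math]::Pow(2, 32)                     # bytes addressable with 32 bits

"{0} GB" -f ($addressable32 / 1GB)                      # total 32-bit virtual address space: 4 GB
"{0} GB" -f (($addressable32 / 1GB) - 2)                # per application, after the kernel's 2 GB: 2 GB
"{0} EB" -f ([math]::Pow(2, 64) / [math]::Pow(2, 60))   # 64-bit address space in exbibytes: 16 EB
```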

Physical Memory

  • This is physical. But not limited to the RAM modules. It is RAM modules plus paging file/ disk.

Committed Memory

  • When a virtual memory page is touched (read/ write/ committed) it becomes “real” – i.e. put into Physical Memory. This is Committed Memory. It is a mix of RAM modules and disk.

Commit Limit

  • The total amount of Committed Memory is obviously limited by your Physical Memory – i.e. the RAM modules plus disk space. This is the Commit Limit.

Working Set

  • Set of virtual memory pages that are committed and fully belong to that process.
  • These are memory pages that exist. They are backed by Physical Memory (RAM plus paging files). They are real, not virtual.
  • So a working set can be thought of as the subset of a process’s Virtual Memory space that is valid; i.e. can be referenced without a page fault.
    • Page fault means when the process requests for a virtual page and that is not in the Physical Memory, and so has to be loaded from disk (not page file even), the OS will put that process on hold and do this behind the scene. Obviously this causes a performance impact so you want to avoid page faults. Again note: this is not RAM to page file fault; this is Physical Memory to disk fault. Former is “soft” page fault; latter is “hard” page fault.
    • Hard faults are bad and tied to insufficient RAM.

Life cycle of a Page

  • Pages in Working Set -> Modified Pages -> Standby Pages -> Free Pages -> Zeroed pages.
  • All of these are still in Physical RAM, just different lists on the way out.

From http://stackoverflow.com/a/22174816:

Memory can be reserved, committed, first accessed, and be part of the working set. When memory is reserved, a portion of address space is set aside, nothing else happens.

When memory is committed, the operating system guarantees that the corresponding pages could in principle exist either in physical RAM or on the page file. In other words, it counts toward its hard limit of total available pages on the system, and it formally creates pages. That is, it creates pages and pretends that they exist (when in reality they don’t exist yet).

When memory is accessed for the first time, the pages that formally exist are created so they truly exist. Either a zero page is supplied to the process, or data is read into a page from a mapping. The page is moved into the working set of the process (but will not necessarily remain in there forever).

Memory Mapped Files

Windows (and other OSes) have a feature called memory mapped files.

Typically your files are on a physical disk and there’s an I/O cost involved in using them. To improve performance what Windows can do is map a part of the virtual memory allocated to a process to the file(s) on disk.

This doesn’t copy the entire file(s) into RAM, but a part of the virtual memory address range allocated to the process is set aside as mapping to these files on disk. When the process tries to read/ write these files, the parts that are read/ written get copied into the virtual memory. The changes happen in virtual memory, and the process continues to access the data via virtual memory (for better performance) and behind the scenes the data is read/ written to disk if needed. This is what is known as memory mapped files.
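To get a feel for this, here is a small sketch that maps a scratch file into memory from PowerShell using the .NET MemoryMappedFile class. It assumes .NET 4 or later (PowerShell 3.0+); the file name is made up for the example:

```powershell
# Requires .NET 4 or later (PowerShell 3.0+). The file name is hypothetical.
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'mmf-demo.bin'
[System.IO.File]::WriteAllBytes($path, (New-Object byte[] 4096))   # 4 KB of zeros

# Map the file into this process's virtual address space.
$mmf  = [System.IO.MemoryMappedFiles.MemoryMappedFile]::CreateFromFile($path, [System.IO.FileMode]::Open)
$view = $mmf.CreateViewAccessor()

$view.Write(0, [byte]42)   # a plain memory write; no explicit file I/O call
$view.ReadByte(0)          # read it back through the mapping

# Flush and release the mapping; the changed page ends up in the file.
$view.Flush()
$view.Dispose()
$mmf.Dispose()

[System.IO.File]::ReadAllBytes($path)[0]   # the first byte of the file on disk has changed
Remove-Item $path
```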

My understanding is that even though I say “virtual memory” above, it is actually restricted to the Physical RAM and does not include page files (coz obviously there’s no advantage to using page files instead of the location where the file already is). So memory mapped files are mapped to Physical RAM. Memory mapped files are commonly used by Windows with binary images (EXE & DLL files).

In Task Manager the “Memory (Private Working Set)” column does not show memory mapped files. For this look to the “Commit Size” column.

Also, use tools like RAMMap (from SysInternals) or Performance Monitor.

Use PowerShell to get a list of GPOs without Authenticated Users in the delegation

Must have seen this recent Windows update that broke GPOs which were missing the Read permission for the “Authenticated Users” group. Solution is to get a list of these GPOs and add the “Authenticated Users” group to them. Here’s a one liner that gets you such a list –
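A sketch of such a one-liner, assuming the GroupPolicy module from RSAT (on older versions the permission cmdlet is named Get-GPPermissions). Note that it only checks whether “Authenticated Users” has any entry on the GPO, not specifically Read:

```powershell
# Requires the GroupPolicy module (Import-Module GroupPolicy on older PowerShell versions).
Get-GPO -All | Where-Object { -not (Get-GPPermission -Guid $_.Id -TargetName "Authenticated Users" -TargetType Group -ErrorAction SilentlyContinue) } | Select-Object -ExpandProperty DisplayName > GPOs.txt
```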

This puts it into a file called GPOs.txt in the current directory. Remove/ Modify that last re-direct as needed.

VMware client – unable to login with username, password; but able to login with “use windows credentials”

We had this weird issue at work yesterday wherein you could not login to the vCenter server by entering a username/ password, but could if you just ticked on the “Use windows session credentials” checkbox.

The issue got resolved eventually by stopping the “VMware Secure Token Service”, restarting the “VMware VirtualCenter Server” service, and then starting the “VMware Secure Token Service”. No idea why that made a difference though, and whether that actually fixed things or was just coincidental. Around the same time I had seen some VMware Tools errors so I (a) upgraded the tools, (b) moved the vCenter VM to a different host, (c) saw that one of these had caused issues with the network driver so I had to uninstall and reinstall the tools and then reset the secure channel with the domain (since when the vCenter VM came up it didn’t have network connectivity).

So it was a bit of a damper actually. Nothing more frustrating than spending a lot of time troubleshooting something and not really figuring out what the issue is. On the plus side at least the issue got sorted, but it leaves me uneasy not knowing what really went wrong and whether it will re-occur.

In the event logs there were many entries like these:

An account failed to log on.

Subject:
    Security ID:        SYSTEM
    Account Name:        VCENTER01$
    Account Domain:        MYDOMAIN
    Logon ID:        0x3e7

Logon Type:            3

Account For Which Logon Failed:
    Security ID:        NULL SID
    Account Name:        SomeAccount
    Account Domain:        MYDOMAIN.COM

Failure Information:
    Failure Reason:        Unknown user name or bad password.
    Status:            0xc000006d
    Sub Status:        0xc0000064

Process Information:
    Caller Process ID:    0xe20
    Caller Process Name:    E:\Program Files\VMware\Infrastructure\VMware\CIS\vmware-sso\VMwareIdentityMgmtService.exe

Network Information:
    Workstation Name:    VCENTER01
    Source Network Address:    –
    Source Port:        –

Detailed Authentication Information:
    Logon Process:        Advapi  
    Authentication Package:    Negotiate
    Transited Services:    –
    Package Name (NTLM only):    –
    Key Length:        0

Here’s what the error codes mean –

  • NULL SID suggests that the account that was being authenticated could not be identified
  • 0xC000006D means that authentication failed due to bad credentials
  • 0xC0000064 means that the requested user name does not exist.
  • Logon type 3 means the request was received from the network (but given that the request originated from the server itself, this suggests it was looped back over the network stack).

Not that it throws much light on what’s happening.

For info – this KB article lists the useful vCenter log files. I looked at the vpxd-xxxx.log file which had some entries like these –

2016-06-06T16:08:18.046+01:00 [02856 error ‘[SSO]’ opID=138a737d] [UserDirectorySso] AcquireToken exception: class SsoClient::CommunicationException(No connection could be made because the target machine actively refused it)
2016-06-06T16:08:18.046+01:00 [02856 error ‘authvpxdUser’ opID=138a737d] Failed to authenticate user <mydomain\someaccount>

This file is under C:\ProgramData\VMware\VMware VirtualCenter\Logs by the way.

I also found messages like these –

2016-06-06T10:17:59.226+01:00 [06952 error ‘[SSO]’ opID=1790eabb] [UserDirectorySso] AcquireToken exception: class SsoClient::SsoException(Failed to parse Group Identity value: `\Authentication authority asserted identity’; domain or group missing)

Two more logs I looked at are C:\ProgramData\VMware\CIS\logs\vmware-sso\vmware-sts-idmd.log and some files under C:\ProgramData\VMware\CIS\runtime\VMwareSTS\logs. In case of the latter location I just sorted by the recently modified timestamp and found some logs to look at. I focused on one called ssoAdminServer.log. This file had a few entries like these –

[2016-06-06 12:19:08,987 pool-11-thread-1  ERROR com.vmware.identity.admin.server.ims.impl.PrincipalManagementImpl] Idm client exception
com.vmware.identity.idm.IDMException: Invalid group name format for [\Authentication authority asserted identity]
    at com.vmware.identity.idm.server.ServerUtils.getRemoteException(ServerUtils.java:131)
    at com.vmware.identity.idm.server.IdentityManager.findNestedParentGroupsInternal(IdentityManager.java:4006)

I found mention of this message in a forum post which pointed to this being a known issue for vCenter installed on a 2012 server with a 2012 DC. That doesn’t apply to me.

The vSphere Web Client gives an error message “Cannot Parse Group Information” – which too is a symptom if you install vCenter on a 2012 server with a 2012 DC. Moreover it applies to vCenter 5.5 GA, which is what we are on, so all the symptoms point to that issue but it is not so in our case. :(

Back to the vmware-sts-idmd.log, that had entries like these –

2016-06-06 09:00:26,089 WARN   [ActiveDirectoryProvider] obtainDcInfo for domain [VCENTER01] failed Failed to get domain controller information for VCENTER01(dwError – 1355 – ERROR_NO_SUCH_DOMAIN)
2016-06-06 09:00:26,090 WARN   [ActiveDirectoryProvider] obtainDcInfo for domain [VCENTER01] failed Failed to get domain controller information for VCENTER01(dwError – 1355 – ERROR_NO_SUCH_DOMAIN)
2016-06-06 09:00:26,091 ERROR  [ValidateUtil] resolved group name=[\Authentication authority asserted identity] is invalid: not a valid netbios name format  
2016-06-06 09:00:26,092 INFO   [ActiveDirectoryProvider] resolved group name=[\Authentication authority asserted identity] is invalid: not a valid netbios name format  
…<snip>…
2016-06-06 09:02:53,005 INFO   [IdentityManager] Failed to find principal [SomeAccount@mydomain.tld] as FSP group in tenant [vsphere.local]
2016-06-06 09:02:53,008 INFO   [IdentityManager] Failed to find FSP user or gorup [SomeAccount@mydomain.tld]’s nested parent groups in tenant [vsphere.local]
2016-06-06 09:02:53,013 ERROR  [IdentityManager] Failed to find nested parent groups of principal [SomeAccount@mydomain.tld] in tenant [vsphere.local]
2016-06-06 09:02:53,013 ERROR  [ServerUtils] Exception ‘java.lang.IllegalStateException: Invalid group name format for [\Authentication authority asserted identity]’
java.lang.IllegalStateException: Invalid group name format for [\Authentication authority asserted identity]
    at com.vmware.identity.idm.server.provider.activedirectory.ActiveDirectoryProvider.findNestedParentGroupsByPac(ActiveDirectoryProvider.java:2140)
    at com.vmware.identity.idm.server.provider.activedirectory.ActiveDirectoryProvider.findNestedParentGroups(ActiveDirectoryProvider.java:791)
    at com.vmware.identity.idm.server.IdentityManager.findNestedParentGroupsInternal(IdentityManager.java:3985)
    at com.vmware.identity.idm.server.IdentityManager.findNestedParentGroups(IdentityManager.java:3856)
    at sun.reflect.GeneratedMethodAccessor37.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at sun.rmi.server.UnicastServerRef.dispatch(Unknown Source)
    at sun.rmi.transport.Transport$1.run(Unknown Source)
    at sun.rmi.transport.Transport$1.run(Unknown Source)
    at java.security.AccessController.doPrivileged(Native Method)
    at sun.rmi.transport.Transport.serviceCall(Unknown Source)
    at sun.rmi.transport.tcp.TCPTransport.handleMessages(Unknown Source)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(Unknown Source)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)

Again, something to do with DC/ domain … but what!? Found this blog post too that suggested the same.

For my reference, here’s a KB article listing all the SSO log files. And this is a useful blog post in case I happen upon a similar issue later (the case of the flapping VMware Secure Token Service). As is this KB article on an SSO facade error.

Solarwinds not seeing correct disk size; “Connection timeout. Job canceled by scheduler.” errors

Had this issue at work today. Notice the disk usage data below in Solarwinds –

[Screenshot: Disk Usage]

The ‘Logical Volumes’ section shows the correct info but the ‘Disk Volumes’ section shows 0 for everything.

Added to that all the Application Monitors had errors –

[Screenshot: Timeout]

I searched Google on the error message “Connection timeout. Job canceled by Scheduler.” and found this Solarwinds KB article. Corrupt performance counters seemed to be a suspect. That KB article was a bit confusing to me in that it gives three resolutions and I wasn’t sure if I was to do all three or just pick and choose. :)

Event Logs on the target server did show corrupt performance counters.

[Screenshot: Initial Errors]

I tried to get the counters via PowerShell to double check and got an error as expected –

[Screenshot: Broken Get-Counter]

Ok, so performance counter issue indeed. Since the Solarwinds KB article didn’t make much sense to me I searched for the Event ID 3001 as in the screenshot and came across a TechNet article. Solution seemed simple – open up command prompt as an admin, run the command lodctr /R. This command apparently rebuilds the performance counters from scratch based on current registry settings and backup INI files (that’s what the help message says). The command completed straight-forwardly too.

lodctr - 1

With this the performance counters started working via PowerShell.

Working Get-Counter
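The rebuild-and-verify steps above can be sketched roughly as follows. This is a minimal sketch, assuming an elevated PowerShell prompt on the affected server. The counter path is only an example I picked as a smoke test, not necessarily the one in the screenshots.

```powershell
# Rebuild the performance counter registry strings from the current
# registry settings and backup INI files. Needs an elevated prompt.
lodctr /R

# Assumption: any standard counter works as a quick check that Get-Counter
# responds again; this path is illustrative only.
Get-Counter -Counter '\LogicalDisk(_Total)\% Free Space' -SampleInterval 1 -MaxSamples 3
```

If Get-Counter still throws an error after the rebuild, the corruption probably lies in a specific counter set rather than in the core counters.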

Event Logs still had some errors but those were to do with the performance counters of ASP.Net and Oracle etc.

More Errors

The fix for this seemed to be a bit more involved and required rebooting the server. I decided to skip it for now as I don’t think these additional counters have much to do with Solarwinds. So I let those messages be and tried to see if Solarwinds was picking up the correct info. Initially I took a more patient approach of waiting and trying to make it poll again; then I got impatient and did things like removing the node from monitoring and adding it back (and then waiting again for Solarwinds to poll it etc) and eventually it began working. Solarwinds now sees the disk space correctly and all the Application Monitors work without any errors too.

Here’s what I am guessing happened (based on that Solarwinds KB article I linked to above). The performance counters of the server got corrupted. Solarwinds uses counters to get the disk info etc. Due to this corruption the poller spent more time than usual fetching info from the server. As a result the Application Monitor components never got a chance to run, because the poller had already run out of time to poll the server. Thus the Application Monitors gave the timeout errors above. In reality the timeout was not from those components, it was from the corrupt performance counters.

Get a list of services and “Log On As” accounts

Wanted to find what account our NetBackup service is running under on a bunch of servers –

You have to use WMI for this coz Get-Service doesn’t show the Log On As user.
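Something along these lines does the job. This is a minimal sketch and not necessarily the exact command used here: the server names are hypothetical, and the service name is taken from the truncated error output further down in this post, so treat it as an assumption.

```powershell
# Hypothetical server names, passed in-line; replace with your own.
# Win32_Service exposes the "Log On As" account as the StartName property.
Get-WmiObject -Class Win32_Service -ComputerName server01, server02 `
    -Filter 'Name = "NetBackup Client Service"' |
    Select-Object PSComputerName, Name, StartName
```

Filtering on the WMI side with -Filter means only the matching service comes back from each server, instead of pulling every service across and filtering locally.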

Wheee!! Had a tweet from Jeffrey Snover for this post.

Following on that tweet I noticed something odd.

The following command works –

Or this –

In the second one I am explicitly casting the arguments as an array.

But this variant doesn’t work –

That generates the following error –

Get-WmiObject : The RPC server is unavailable. (Exception from HRESULT: 0x800706BA)
At line:1 char:1
+ Get-WmiObject Win32_Service -cn $Servers -Filter 'Name= "NetBackup Client Service …
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidOperation: (:) [Get-WmiObject], COMException
    + FullyQualifiedErrorId : GetWMICOMException,Microsoft.PowerShell.Commands.GetWmiObjectCommand

Get-WmiObject : The RPC server is unavailable. (Exception from HRESULT: 0x800706BA)
At line:1 char:1
+ Get-WmiObject Win32_Service -cn $Servers  -Filter 'Name= "NetBackup Client Service …
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidOperation: (:) [Get-WmiObject], COMException
    + FullyQualifiedErrorId : GetWMICOMException,Microsoft.PowerShell.Commands.GetWmiObjectCommand

The error is generated for each entry in the array.

My guess is that when I pass the list of servers as an array variable PowerShell connects to each server in a different way (PowerShell remoting/ WinRM, perhaps) than when I specify the list in-line, though I haven’t verified this. I didn’t search much on this but found this Reddit thread with the same issue. Something to keep in mind …
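One way to narrow this down is to check what the variable actually holds and then query the servers one at a time, so each failure can be tied to a specific name. This is a minimal sketch under assumptions: the server names are hypothetical and the service name comes from the truncated error above. It does not confirm any particular cause.

```powershell
# Hypothetical server names; substitute your own list.
$Servers = 'server01', 'server02'

# Confirm the variable really is an array with one name per element.
$Servers.GetType().FullName
$Servers.Count

foreach ($Server in $Servers) {
    try {
        # -ErrorAction Stop turns the RPC failure into a catchable error.
        Get-WmiObject -Class Win32_Service -ComputerName $Server `
            -Filter 'Name = "NetBackup Client Service"' -ErrorAction Stop |
            Select-Object PSComputerName, Name, StartName
    }
    catch {
        # Brackets around the name make stray spaces visible.
        Write-Warning ("[{0}] failed: {1}" -f $Server, $_.Exception.Message)
    }
}
```

If every name works individually, the difference is more likely in how the variable was built than in how the servers are reached.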

 

Cannot login to vCenter with “use windows session credentials” but can login by entering username & password

Had this issue today (and a few months ago). I open vCenter client, type in the vCenter server name, tick “Use Windows Session Credentials” as usual, and login fails. Says it cannot login with the given credentials.

At the same time I can login with the vSphere Web Client and also by un-ticking the box and manually entering the username/ password.

Fix for both times was to reset the secure channel by logging in to the vCenter server –