Bug in Server 2016 and DNS zone delegation with CNAME records

A colleague at work mentioned a Server 2016 DNS zone delegation bug he had found. I found just one post on the Internet when I searched for this. According to my colleague, Microsoft has now confirmed this as a bug in the support call he raised.

DNS being an area of interest I wanted to replicate the issue and post it – so here goes. Hopefully it makes sense. :)

Imagine a split-DNS scenario for a zone rockylabs.zero

  • This zone is hosted externally by a DNS server (doesn’t matter what OS/ software) called data01
  • This zone is hosted internally by two DNS servers: a Server 2012R2 (called DC2012-01), and a Server 2016 (called DC2016-01). 

Now say there’s a record rakhesh.rockylabs.zero that is the same both internally and externally. As in, we want both internal and external users to get the same (external) IP address for this record. 

What you would typically do is add this record to your external DNS server and create a delegation from your two internal DNS servers, for this record, to the external DNS server. Here are some screenshots:

The zone on my external DNS server. Notice I have an A record for rakhesh.rockylabs.zero.  

Ignore the rakhesh2.rockylabs.zero record for now. That comes in later. :)

Here’s a look at the delegation from my Server 2012R2 internal DNS server to the external DNS server for the rakhesh.rockylabs.zero record. Basically I create a delegation within the rockylabs.zero zone on the internal server, for the rakhesh domain, and point it to the external DNS server. On the external DNS server rakhesh.rockylabs.zero is defined as an A record so that will be returned as an answer when this delegation chain is followed. 

In my case both the internal DNS servers are also DCs, and the rockylabs.zero zone is AD integrated, so a similar delegation is automatically created on the other DNS server too. 

As would be expected, I am able to resolve this A record correctly from both internal DNS servers.
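
For reference, resolving against a specific server is a one-liner from PowerShell (nslookup works just as well); the server and record names below are the ones from this lab:

    # Query each internal DNS server directly for the delegated record
    Resolve-DnsName rakhesh.rockylabs.zero -Server DC2012-01
    Resolve-DnsName rakhesh.rockylabs.zero -Server DC2016-01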

Now for the fun part!

Notice the rakhesh2.rockylabs.zero record on my external DNS server? Unlike rakhesh.rockylabs.zero, this one is a CNAME record. This too should be common for both internal and external users. Shouldn't be a problem really, as it should work similarly to the A record. Following the chain of delegation, when I resolve rakhesh2.rockylabs.zero to a CNAME record pointing to rakhesh.com, my DNS server should automatically resolve the A record for rakhesh.com and give me its address as the answer for rakhesh2.rockylabs.zero. It works with the Server 2012R2 internal DNS server as expected –

But breaks for the 2016 internal DNS server!

And that’s it! That’s the bug basically. 

Here's the odd bit though. If I were to query rakhesh.com (the domain to which the CNAME record points), and then try to resolve the delegated record, it works!

If I go ahead and clear the cache on that 2016 internal server and try the name resolution again, it’s broken as before.

So the issue is that the 2016 DNS server is able to follow the delegation for rakhesh2.rockylabs.zero to the external DNS server and resolve it to rakhesh.com, but it doesn't then go ahead and look up rakhesh.com to get its A record. If the A record for rakhesh.com is already in its cache, though, it is sensible enough to return that address.

I dug a bit more into this by enabling debug logging on the 2016 server. Here’s what I found.

The 2016 server receives my query:

It passes this on to the external server (10.10.1.11 is data01 – the external DNS server to which rakhesh2.rockylabs.zero is delegated). FYI, I am truncating the output here:

It gets a reply with the CNAME record. So far so good. 

Now it queries the external DNS server (data01 – 10.10.1.11) asking for the A record of rakhesh.com! That's two wrong things: 1) why ask the external DNS server (which, as far as the internal DNS server knows, is only delegated the rakhesh2.rockylabs.zero zone and has nothing to do with rakhesh.com), and 2) why ask it for the A record instead of the NS record, so it could find the name servers for rakhesh.com and ask those for the IP address of rakhesh.com?

It’s pretty much downhill from there, coz as expected the external DNS server replies saying “doh I don’t know” and gives it a list of root servers to contact. FYI, I truncated the output:

The 2016 internal DNS server now replies to the client with a fail message. As expected. 

So now we know why the 2016 server is failing. 

Until this is fixed, one workaround would be to create a CNAME record directly in the internal DNS server pointing to whatever the external record points to. That is, don't delegate to external – just create the same record internally too. This is only needed for CNAME records; A records are fine. Here's an example of it working with a record called rakhesh3.rockylabs.zero, where I simply made a CNAME on the internal 2016 DNS server.
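
If you'd rather script the workaround than click through the DNS console, it would be something along these lines (zone and record names as in this post):

    # Create the CNAME directly in the internal copy of the zone
    Add-DnsServerResourceRecordCName -ZoneName "rockylabs.zero" -Name "rakhesh3" -HostNameAlias "rakhesh.com"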

That’s all for now!

[Aside] Windows Update tools

Wanted to link to these as I came across them while searching for something Windows Updates related.

  • ABC-Update – didn’t try it out but looks useful from a client side point of view. Free.
  • WuInstall – seems to be a client and server thing. Putting it here so I find it if ever needed in future. Paid.
  • Windows Update PowerShell Module – you had me at PowerShell! :0)
    • A blog post explaining this module. Just in case.

Add-DnsServerZoneDelegation with multiple nameservers

Only reason I am creating this post is coz I Googled for the above and didn't find any relevant hits.

I know I can use the Add-DnsServerZoneDelegation cmdlet to create a new delegated zone (basically a sub-domain of a zone hosted by our DNS server, wherein some other DNS server hosts that sub-domain and we merely delegate any requests for that sub-domain to this other DNS server). But I wasn't sure how I'd add multiple name servers. The switches give an option to specify an array of IP addresses, but that's just an array of IP addresses for a single name server. What I wanted was an array of name servers, each with their own IP.

Anyways, turns out all I had to do was run the command for each name server. 
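
Something along these lines, that is (the IP addresses here are made up):

    # One call per name server; each call adds another NS record to the delegation
    Add-DnsServerZoneDelegation -Name "parentzone.com" -ChildZoneName "subzone.parentzone.com" -NameServer "DNS01.somedomain" -IPAddress 10.0.0.1
    Add-DnsServerZoneDelegation -Name "parentzone.com" -ChildZoneName "subzone.parentzone.com" -NameServer "DNS02.somedomain" -IPAddress 10.0.0.2
    Add-DnsServerZoneDelegation -Name "parentzone.com" -ChildZoneName "subzone.parentzone.com" -NameServer "DNS03.somedomain" -IPAddress 10.0.0.3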

Above I create a delegation from my zone “parentzone.com” to 3 DNS servers “DNS0[1-3].somedomain” (also specified by their respective IPs) for the sub-domain “subzone.parentzone.com”.

OU delegation not working (contd.) – finding protected groups

Turns out I was mistaken in my previous post. A few minutes after enabling inheritance, I noticed it was disabled again. So that means the groups must be protected by AD.

I knew of the AdminSDHolder object and how it provides a template set of permissions that are applied to protected accounts (i.e. members of groups that are protected). I also knew that there were some groups that are protected by default. What I didn't know, however, was that the defaults can be changed.

Initially I did a Compare-Object -ReferenceObject (Get-ADPrincipalGroupMembership User1) -DifferenceObject (Get-ADPrincipalGroupMembership User2) -IncludeEqual to compare the memberships of two random accounts that seemed to be protected. These were accounts with totally different roles & group memberships, so the idea was to see if they had any common groups (none!) and, failing that, to see if the groups they were in had any common ancestors (none again!).

Then I Googled a bit :o) and came across a solution. 

Before moving on to that though, as a note to myself: 

  • The AdminSDHolder object is at CN=AdminSDHolder,CN=System,DC=domain,DC=com. Find that via ADSI Edit (replace the domain part accordingly). 
  • Right click the object and its Security tab lists the template permissions that will be applied to members of protected groups. You can make changes here. 
  • SDProp is a process that runs every 60 minutes on the DC holding the PDC Emulator role. The period can be changed via the registry key HKLM\SYSTEM\CurrentControlSet\Services\NTDS\Parameters\AdminSDProtectFrequency. (If it doesn’t exist, add it. DWORD). 
  • SDProp can be run manually if required. 

So back to my issue. Turns out if a group has its adminCount attribute set to 1 then it will be protected. So I ran the following against the OU containing my admin account groups:
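
It was a query along these lines (the OU distinguished name is a placeholder):

    # Find groups in the OU whose adminCount attribute is 1
    Get-ADGroup -SearchBase "OU=Admin Groups,DC=mydomain,DC=com" -Filter { adminCount -eq 1 } -Properties adminCount |
        Select-Object Name, adminCount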

Bingo! Most of my admin groups were protected, so most admin accounts were protected. All I have to do now is either un-protect these groups (my preferred solution), or change the template to delegate permissions there. 

Update: Simply un-protecting a group does not un-protect all its members (this is by design). The member objects too have their adminCount attribute set to 1, so apart from un-protecting the groups we must un-protect the members too.

Update 2: Found this good post with lots more details. How to run the process manually, what are the default protected groups, etc. Read that post in conjunction with this one and you are set!

Update 3: You can unprotect the following default groups via dsHeuristics: 1) Account Operators, 2) Backup Operators, 3) Server Operators, 4) Print Operators. But that still leaves groups such as Administrators (built-in), Domain Admins, Enterprise Admins, Domain Controllers, Schema Admins, Read-Only Domain Controllers, and the user Administrator (built-in). There’s no way to un-protect members of these.

Something I hadn't realized about adminCount. This attribute does not mean a group/ user will be protected. Instead, what it means is that if a group/ user is protected, and its ACLs have changed and are then reset to default, the adminCount attribute will be set. So yes, adminCount will let you find groups/ users that are protected; but merely setting adminCount on a group/ user does not protect it. I learnt this the hard way while testing my changes: I set adminCount to 1 for a group and saw that nothing happened.

Also, it is possible that a protected user/ group does not have adminCount set. This is because adminCount is only set if there is a difference in the ACLs between the user/ group and the AdminSDHolder object. If there’s no difference, a protected object will not have the adminCount attribute set. :)

OU delegation not working

Today I cracked a problem which had troubled us for a while but which I had never really sat down and actually tried to troubleshoot. We had an OU with 3rd level admin accounts that no one else had rights to, but we wanted to delegate certain password related tasks to our Service Desk admins. Basically let them reset passwords, unlock accounts, and enable/ disable accounts.

Here are some screenshots of the delegation wizard. Password reset is a common task and can be seen in the screenshot itself. Enable/ Disable can be delegated by giving rights to the userAccountControl attribute. Only force-password-change rights (i.e. no reset password) can be given via the pwdLastSet attribute. And unlock can be given via the lockoutTime attribute.

Problem was that in my case in spite of doing all this the delegated accounts had no rights!

Snooping around a bit I realized that all the admin accounts within the OU had inheritance disabled and so weren’t getting the delegated permissions from the OU (not sure why; and no these weren’t protected group members). 

Of course, enabling is easy. But I wanted to see if I could get a list of all the accounts in there with their inheritance status. Time for PowerShell. :)

The Get-ACL cmdlet can list access control lists. It can work with AD objects via the AD: drive. It needs a distinguished name, that's all. So all you have to do is take (Get-ADUser <accountname>).DistinguishedName, prefix an AD: to it, and pass that to Get-ACL. Something like this:
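
(The account name below is a placeholder.)

    Import-Module ActiveDirectory   # also makes the AD: drive available
    Get-Acl -Path ("AD:\" + (Get-ADUser "someadmin").DistinguishedName)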

The default result is useless. If you pipe and expand the Access property you will get a list of ACLs. 
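
That is, something like this (same placeholder account name):

    Get-Acl -Path ("AD:\" + (Get-ADUser "someadmin").DistinguishedName) |
        Select-Object -ExpandProperty Access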

The result is a series of entries like these:

The attribute names referred to by the GUIDs can be found in the AD Technical Specs.

Of interest to us is the AreAccessRulesProtected property. If this is True then inheritance is disabled; if False, inheritance is enabled. So it's straightforward to make a list of accounts and their inheritance status:
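
Roughly like this (again, the OU path is a placeholder):

    # One line per account: name plus whether inheritance is disabled on it
    Get-ADUser -SearchBase "OU=Admin Accounts,DC=mydomain,DC=com" -Filter * |
        ForEach-Object {
            $acl = Get-Acl -Path ("AD:\" + $_.DistinguishedName)
            [pscustomobject]@{
                Name                = $_.Name
                InheritanceDisabled = $acl.AreAccessRulesProtected
            }
        }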

So that's it. The next step would be to enable inheritance on the accounts. I won't be doing this now (as it's bed time!) but one can do it manually or script it via the SetAccessRuleProtection method. This method takes two parameters (enable/ disable inheritance; and, if disabling inheritance, whether to preserve the inherited ACEs). Only the first parameter is of significance in my case, but I have to pass the second parameter too anyways – SetAccessRuleProtection($False,$True).

Update: Here’s what I rolled out at work today to make the change.
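
The script itself isn't reproduced here, but it would have been close to this sketch (OU path made up):

    # Enable inheritance on every account in the OU that has it disabled
    Get-ADUser -SearchBase "OU=Admin Accounts,DC=mydomain,DC=com" -Filter * | ForEach-Object {
        $path = "AD:\" + $_.DistinguishedName
        $acl  = Get-Acl -Path $path
        if ($acl.AreAccessRulesProtected) {
            # $False enables inheritance; the second argument only matters
            # when disabling inheritance but must be passed anyway
            $acl.SetAccessRuleProtection($False, $True)
            Set-Acl -Path $path -AclObject $acl
        }
    }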

Update 2: Didn’t realize I had many users in the built-in protected groups (these are protected even though their adminCount is 0 – I hadn’t realized that). To unprotect these one must set the dsHeuristics flag. The built-in protected groups are 1) Account Operators, 2) Server Operators, 3) Print Operators, and 4) Backup Operators. See this post on instructions (actually, see the post below for even better instructions).

Update 3: Found this amazing page that goes into a hell of details on this topic. Be sure to read this before modifying dsHeuristics.

Useful offline Windows troubleshooting/ fixing tricks

Had a Windows Server 2008 R2 server that started giving a blank screen after the recent Windows update reboot. This was a VM, and it was the same result via VMware console or RDP. Safe Mode didn't help either. Bummer!

Since this is a VM I mounted its disk on another 2008 R2 VM and tried to fix the problem offline. Most of my attempts didn’t help but I thought of posting them here for reference. 

Note: In the following examples the broken VM’s disk is mounted to F: drive. 

Recent updates

I used dism to list recent updates and remove them. To list updates from this month (March 2017):
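
From memory, something like this (piping through findstr to eyeball the install dates is just one way to filter):

    dism /Image:F:\ /Get-Packages /Format:Table | findstr /i "2017"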

To remove an update:
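
Along these lines (the package name here is a made-up example; use the names from the /Get-Packages output):

    dism /Image:F:\ /Remove-Package /PackageName:Package_for_KB1234567~31bf3856ad364e35~amd64~~6.1.1.0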

I did this for each of the updates I had. That didn't help though. And oddly I found that one of the updates kept re-appearing with a slightly different name (a different number suffixed to it actually) each time I'd remove it. Not sure why that was the case, but I saw that F:\Windows\WinSxS had a file called pending.xml and figured this must be doing something to stop the update from being removed. I couldn't delete the file in spite of taking ownership and full control, so I opened it in Notepad and cleared all the contents. :o) After that the updates didn't return, but the machine was still broken.

SFC

I used sfc to check the integrity of all the system files:
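
The offline variant takes the boot and Windows directories of the mounted disk:

    sfc /scannow /offbootdir=F:\ /offwindir=F:\Windows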

No luck with that either!

Event Logs

Maybe the Event Logs have something? These can be found at F:\Windows\System32\Winevt\Logs. Double click the ones of interest to view. 

In my case the Event Logs had nothing! No record at all of the VM starting up or what was causing it to hang. Tough luck!

Bonus info: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Eventlog contains locations of the files backing the Event Logs. Just mentioning it here as I came across this.

Drivers

Could drivers cause any issue? Unlikely. You can’t use dism to query drivers as above but you can check via registry. See this post. Honestly, I didn’t read it much. I didn’t suspect drivers and it seemed too much work fiddling through registry keys and folders. 

Last Known Good Configuration

Whenever I’d boot up the VM I never got the Last Known Good (LKG) Configuration option. I tried pressing F8 a couple of times but it had no effect. So I wondered if I could tweak this via the registry. Turns out I can. And turns out I already knew this just that I had forgotten!

Your current configuration is HKLM\SYSTEM\CurrentControlSet. This is actually a link to HKLM\SYSTEM\ControlSet001 or HKLM\SYSTEM\ControlSet002 or HKLM\SYSTEM\ControlSet003 or … (you get the point). Each ControlSetXXX key is one of your previous configurations. The one that's actually in use can be found via HKLM\SYSTEM\Select. The entry Current points to the number of the ControlSetXXX key in use. The entry LastKnownGood points to the Last Known Good Configuration. Now we know what to do.

  1. Mount the HKLM\SYSTEM hive of the broken VM. All registry hives can be found under %windir%\System32\Config. In my case that translates to the file F:\Windows\System32\Config\SYSTEM.
  2. To mount this file open Registry Editor, select the HKLM hive, and go to File > Load Hive. (This is a good post with screenshots etc).  
  3. Go to the Select key above. Change Current to whatever LastKnownGood was. 
  4. That’s all. Now unload the hive and you are done.

This helped in my case! I was finally able to move past the blank screen and get a login prompt. Upon login I was also able to download and install all the patches and confirm that the VM is now working fine (took a snapshot of course, just in case!). I have no idea what went wrong, but at least I have the pleasure of being able to fix it. From the post I link to below, I’d say it looks like a registry hive corruption. 

Since I successfully logged in, my machine’s Last Known Good Configuration will be automatically updated by Windows with the current one. Here’s a blog post that explains this in more detail. 

That’s all! Hope this helps someone. 

Useful WMIC filters

I have these tabs open in my browser from last month when I was doing some WMI based GPO targeting. Meant to write a blog post but I kept getting sidetracked, and now it's been nearly a month so I have lost the flow. But I want to put these on the blog as a reference for my future self.
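
The links themselves are gone, but as a flavour of the sort of thing this was about, here's a typical WQL query for targeting a GPO at (say) only Server 2012 R2 member servers:

    # The WQL query is what goes into the GPO WMI filter; testing it from PowerShell:
    Get-WmiObject -Query 'SELECT * FROM Win32_OperatingSystem WHERE Version LIKE "6.3%" AND ProductType = "3"'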

That’s all.

Go through a group of servers and find whether a particular patch is installed

Patch Tuesday is upon us. Our pilot group of servers was patched via SCCM, but there were reports that 2012 R2 servers were not picking up one of the patches. I wanted to quickly identify the servers that were missing the patch.

Our pilot servers are in two groups. So I did the following:
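
(The group names and KB number below are placeholders.)

    $pilot1 = Get-ADGroupMember "Pilot-Servers-1"
    $pilot2 = Get-ADGroupMember "Pilot-Servers-2"

    (@($pilot1) + @($pilot2)) | Get-ADComputer -Properties OperatingSystemVersion |
        Where-Object { Test-Connection $_.DNSHostName -Count 1 -Quiet } |
        Where-Object { $_.OperatingSystemVersion -match "9600" } |
        ForEach-Object {
            $fix = Get-HotFix -Id "KB1234567" -ComputerName $_.DNSHostName -ErrorAction SilentlyContinue
            [pscustomobject]@{ Name = $_.Name; InstalledOn = $fix.InstalledOn }
        } |
        Format-Table @{ Label = "Name"; Expression = { $_.Name }; Width = 20 }, InstalledOn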

The first two lines basically enumerate the two groups. If it was just one group, I could have replaced them with a single Get-ADGroupMember "GroupName".

The remaining code checks whether each server is online, filters for 2012 R2 servers (version number 6.3.9600), and makes a list of the servers along with the install date of the hotfix I am interested in. If the hotfix is not installed, the date will be blank. Simple.

Oh, and I wanted to get the output as and when it comes, so I went with a Width=20 in the name field. I could have gone for -AutoSize instead, but that would mean I'd have to patiently wait for PowerShell to generate the entire output and then for Format-Table to do an autosize.

Update: While on the topic of the Win32_QuickFixEngineering WMI class, I wanted to point to these posts: [1], [2]

Worth keeping in mind that Win32_QuickFixEngineering (or QFE for short) only returns patches installed via CBS (Component Based Servicing) – which is what Windows Updates use anyway. What this means, however, is that it does not return patches installed via an MSI/ MSP/ MSU.

IE 11 update fails due to prerequisite updates (KB2729094)

IE 11 update requires the following prerequisite updates – link.

Even after installing those (most of which are already there) the IE 11 install will complain and fail. The log files are at C:\Windows\IE11_main.log.

In my case I was getting the following error (seems to be the same for others too):

Thing is, I already had this hotfix installed, so there was nothing more to do. Found this useful support post where someone suggested running the hotfix install and side-by-side launching the IE install. You might need to do it 2-3 times, but that seems to make a difference. So I tried that and sure enough it helped.

That post is worth a read for some other tricks, especially if you are sequencing this via SCCM. I found this article from Symantec too which seems helpful. Some day when I am in charge of SCCM too I can try such stuff out! :)

P2V a SQL cluster by breaking the cluster

Need to P2V a SQL cluster at work. Here are screenshots of what I did in a test environment to see if an idea of mine would work.

We have a SQL cluster with 2 physical nodes. The requirement was to convert this into a single virtual machine.

P2V-ing a single server is easy. Use VMware Converter. But P2V-ing a cluster like this is tricky. You could P2V each node and end up with a cluster of 2 virtual nodes, but that wasn't what we wanted. We didn't want to deal with RDMs and such for the cluster, so we wanted to get rid of the cluster itself. VMware can provide HA if anything happens to the single node.

My idea was to break the cluster and get one of the nodes of the cluster to assume the identity of the cluster. Have SQL running off that. Virtualize this single node. And since there’s no change as far as the outside world is concerned no one’s the wiser.

Found a blog post that pretty much does what I had in mind. Found one more which was useful but didn’t really pertain to my situation. Have a look at the latter post if your DTC is on the Quorum drive (wasn’t so in my case).

So here we go.

1) Make the node that I want to retain the active node of the cluster (so it owns all the disks and databases). Then shut down SQL Server.

[Screenshot: sqlshutdown]

2) Shut down the cluster.

[Screenshot: clustershutdown]

3) Remove the node we want to retain from the cluster.

We can't remove/ evict the node via the GUI as the cluster is offline. Nor can we remove the Failover Cluster feature from the node as it is still part of a cluster (even though the cluster is shut down). So we need to do a bit of "surgery". :)

Open PowerShell and do the following:
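
(The node name below is a placeholder.)

    # Clears cluster configuration from a node; intended for evicted nodes
    Clear-ClusterNode -Name "NODE1" -Force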

This simply clears any cluster related configuration from the node. It is meant to be used on evicted nodes.

Once that’s done remove the Failover Cluster feature and reboot the node. If you want to do this via PowerShell:
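
Something like:

    Import-Module ServerManager   # needed on 2008 R2; on 2012+ Uninstall-WindowsFeature is the current name
    Remove-WindowsFeature Failover-Clustering
    Restart-Computer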

4) Bring online the previously shared disks.

Once the node is up and running, open Disk Management and mark as online the shared disks that were previously part of the cluster.

[Screenshot: disksonline]

5) Change the IP and name of this node to that of the cluster.

Straight-forward. Add CNAME entries in DNS if required. Also, you will have to remove the cluster computer object from AD before renaming this node to that name.

6) Make some registry changes.

The SQL Server is still not running as it expects to be on a cluster. So make some registry changes.

First go to HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\Setup and open the entry called SQLCluster and change its value from 1 to 0.

Then take a backup (just in case; we don’t really need it) of the key called HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\Cluster and delete it.

Note that MSSQL10_50.MSSQLSERVER may vary if you have a different version of SQL than I do.
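
For the record, the same two changes via PowerShell would be roughly:

    # Tell SQL it is no longer clustered
    Set-ItemProperty -Path "HKLM:\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\Setup" -Name SQLCluster -Value 0

    # Back up the Cluster key to a file, then delete it
    reg export "HKLM\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\Cluster" C:\Temp\SQLCluster-backup.reg
    Remove-Item -Path "HKLM:\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\Cluster" -Recurse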

7) Start the SQL services and change their startup type to Automatic.

I had 3 services.
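
For a default instance the three would be something like this (service names may vary with your install):

    # Set the SQL services to start automatically and bring them up
    Get-Service MSSQLSERVER, SQLSERVERAGENT, SQLBrowser |
        Set-Service -StartupType Automatic -PassThru |
        Start-Service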

Now your SQL server should be working.

8) Restart the server – not needed, but I did so anyways.

Test?

If you are doing this in a test environment (like I was) and don’t have any SQL applications to test with, do the following.

Right click the desktop on any computer (or the SQL server computer itself) and create a new text file. Then rename that to blah.udl. The name doesn’t matter as long as the extension is .udl. Double click on that to get a window like this:

[Screenshot: udl]

Now you can fill in the SQL server name and test it.

One thing to keep in mind (if you are not a SQL person – I am not): the Windows NT Integrated security option is what you need if you want to authenticate against the server with an AD account. It is tempting to select the "Use a specific user name …" option and put an AD username/ password there, but that won't work. That option is for SQL authentication.

If you want to use a different AD account, you will have to launch the tool via run-as.

Also, on a fresh install of SQL Server, SQL authentication is disabled by default. You can create SQL accounts but authentication will fail. To enable SQL authentication, right click on the server in SQL Server Management Studio and go to Properties, then go to Security and enable SQL authentication.

[Screenshot: sqlauth]

That’s all!

Now one can P2V this node.

Installing a new license key in KMS

KMS is something you log in to once in a blue moon, and then you wonder how the heck you are supposed to install a license key and verify that it got added correctly. So, as a reminder to myself.

To install a license key:
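
From memory, it is something like this (the key below is a placeholder):

    cscript C:\Windows\System32\slmgr.vbs /ipk XXXXX-XXXXX-XXXXX-XXXXX-XXXXX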

Then activate it:
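
That would be:

    cscript C:\Windows\System32\slmgr.vbs /ato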

If you want to check that it was added correctly:
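
The detailed license view shows what's installed:

    cscript C:\Windows\System32\slmgr.vbs /dlv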

I use cscript so that the output comes in the command prompt itself and I can scroll up and down (or put into a text file) as opposed to a GUI window which I can’t navigate.

Very Brief Notes on Windows Memory etc

I have no time nowadays to update this blog but I wanted to dump these notes I made for myself today. Just so the blog has some update and I know I can find these notes here again when needed.

This is on the various types of memory etc in Windows (and other OSes in general). I was reading up on this in the context of Memory Mapped Files.

Virtual Memory

  • Amount of memory the OS can access. Not related to the physical memory. It is only related to the processor and OS – whether it is 32-bit or 64-bit.
  • 32-bit means 2^32 = 4GB; 64-bit means 2^64 = a lot! :)
  • On a default install of 32-bit Windows, the kernel reserves 2GB for itself and applications can use the remaining 2GB. Each application gets its own 2GB, coz virtual memory doesn't really exist and is not limited by the physical memory in the machine. The OS just lies to the applications that they have 2GB of virtual memory to themselves.

Physical Memory

  • This is physical. But not limited to the RAM modules. It is RAM modules plus paging file/ disk.

Committed Memory

  • When a virtual memory page is touched (read/ write/ committed) it becomes “real” – i.e. put into Physical Memory. This is Committed Memory. It is a mix of RAM modules and disk.

Commit Limit

  • The total amount of Committed Memory is obviously limited by your Physical Memory – i.e. the RAM modules plus disk space. This is the Commit Limit.

Working Set

  • Set of virtual memory pages that are committed and fully belong to that process.
  • These are memory pages that exist. They are backed by Physical Memory (RAM plus paging files). They are real, not virtual.
  • So a working set can be thought of as the subset of a process's Virtual Memory space that is valid; i.e. can be referenced without a page fault.
    • A page fault is when the process requests a virtual page that is not in Physical Memory, so it has to be loaded from disk (not even the page file), and the OS puts the process on hold and does this behind the scenes. Obviously this causes a performance impact, so you want to avoid page faults. Again note: this is not a RAM to page file fault; this is a Physical Memory to disk fault. The former is a "soft" page fault; the latter is a "hard" page fault.
    • Hard faults are bad and tied to insufficient RAM.

Life cycle of a Page

  • Pages in Working Set -> Modified Pages -> Standby Pages -> Free Pages -> Zeroed pages.
  • All of these are still in Physical RAM, just different lists on the way out.

From http://stackoverflow.com/a/22174816:

Memory can be reserved, committed, first accessed, and be part of the working set. When memory is reserved, a portion of address space is set aside, nothing else happens.

When memory is committed, the operating system guarantees that the corresponding pages could in principle exist either in physical RAM or on the page file. In other words, it counts toward its hard limit of total available pages on the system, and it formally creates pages. That is, it creates pages and pretends that they exist (when in reality they don’t exist yet).

When memory is accessed for the first time, the pages that formally exist are created so they truly exist. Either a zero page is supplied to the process, or data is read into a page from a mapping. The page is moved into the working set of the process (but will not necessarily remain in there forever).

Memory Mapped Files

Windows (and other OSes) have a feature called memory mapped files.

Typically your files are on a physical disk and there’s an I/O cost involved in using them. To improve performance what Windows can do is map a part of the virtual memory allocated to a process to the file(s) on disk.

This doesn’t copy the entire file(s) into RAM, but a part of the virtual memory address range allocated to the process is set aside as mapping to these files on disk. When the process tries to read/ write these files, the parts that are read/ written get copied into the virtual memory. The changes happen in virtual memory, and the process continues to access the data via virtual memory (for better performance) and behind the scenes the data is read/ written to disk if needed. This is what is known as memory mapped files.
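
Just to make this concrete, here's a quick way to play with memory mapped files from PowerShell via .NET (the file path is a placeholder; needs .NET 4 or later):

    # Map an existing file into the process's address space and touch a byte
    $mmf  = [System.IO.MemoryMappedFiles.MemoryMappedFile]::CreateFromFile("C:\Temp\test.bin")
    $view = $mmf.CreateViewAccessor()
    $firstByte = $view.ReadByte(0)   # this access is what actually pages data in
    $view.Dispose()
    $mmf.Dispose()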

My understanding is that even though I say “virtual memory” above, it is actually restricted to the Physical RAM and does not include page files (coz obviously there’s no advantage to using page files instead of the location where the file already is). So memory mapped files are mapped to Physical RAM. Memory mapped files are commonly used by Windows with binary images (EXE & DLL files).

In Task Manager the “Memory (Private Working Set)” column does not show memory mapped files. For this look to the “Commit Size” column.

Also, use tools like RAMMap (from SysInternals) or Performance Monitor.

More info

Solarwinds not seeing correct disk size; “Connection timeout. Job canceled by scheduler.” errors

Had this issue at work today. Notice the disk usage data below in Solarwinds –

[Screenshot: Disk Usage]

The ‘Logical Volumes’ section shows the correct info but the ‘Disk Volumes’ section shows 0 for everything.

Added to that all the Application Monitors had errors –

[Screenshot: Timeout]

I searched Google for the error message "Connection timeout. Job canceled by Scheduler." and found this Solarwinds KB article. Corrupt performance counters seemed to be a suspect. That KB article was a bit confusing to me in that it gives three resolutions, and I wasn't sure if I was to do all three or just pick and choose. :)

Event Logs on the target server did show corrupt performance counters.

[Screenshot: Initial Errors]

I tried to get the counters via PowerShell to double check and got an error as expected –
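
(The server name below is a placeholder; the counter is just an example of one that Solarwinds polls.)

    Get-Counter -ComputerName PROBLEMSERVER -Counter "\LogicalDisk(*)\% Free Space"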

[Screenshot: Broken Get-Counter]

Ok, so a performance counter issue indeed. Since the Solarwinds KB article didn't make much sense to me, I searched for the Event ID 3001 as in the screenshot and came across a TechNet article. The solution seemed simple – open up a command prompt as an admin and run the command lodctr /R. This command apparently rebuilds the performance counters from scratch based on current registry settings and backup INI files (that's what the help message says). The command completed straight-forwardly too.

[Screenshot: lodctr - 1]

With this the performance counters started working via PowerShell.

[Screenshot: Working Get-Counter]

The Event Logs still had some errors, but those were to do with the performance counters of ASP.Net and Oracle etc.

[Screenshot: More Errors]

The fix for those seemed to be a bit more involved and requires rebooting the server. I decided to skip it for now as I don't think these additional counters have much to do with Solarwinds. So I let those messages be and tried to see if Solarwinds was picking up the correct info. Initially I took the patient approach of waiting and trying to make it poll again; then I got impatient and did things like removing the node from monitoring and adding it back (and then waiting again for Solarwinds to poll it etc), but eventually it began working. Solarwinds now sees the disk space correctly and all the Application Monitors work without any errors too.

Here’s what I am guessing happened (based on that Solarwinds KB article I linked to above). The performance counters of the server got corrupt. Solarwinds uses counters to get the disk info etc. Due to this corruption the poller spent more time than usual when fetching info from the server. This resulted in the Application Monitor components not getting a chance to run as the poller had run out of time to poll the server. Thus the Application Monitors gave the timeout errors above. In reality the timeout was not from those components, it was from the corrupt performance counters.

Exchange DAG fails. Information Store service fails with error 2147221213.

Had an interesting issue at work today. When our Exchange servers (which are in a 2 node DAG) rebooted after patch weekend one of them had trouble starting the Information Store service. The System log had entries such as these (event ID 7024) –

The Microsoft Exchange Information Store service terminated with service-specific error %%-2147221213.

The Application log had entries such as these (event ID 5003) –

Unable to initialize the Information Store service because the clocks on the client and server are skewed. This may be caused by a time change either on the client or on the server, and may require a restart of that computer. Verify that your domain is correctly configured and  is currently online.

So it looked like time synchronization was an issue. Which is odd coz all our servers should be correctly syncing time from the Domain Controllers.

Our Exchange team fixed the issue by forcing a time sync from the DC –
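
I don't have the exact command they ran, but it would have been along these lines:

    w32tm /resync /rediscover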

I was curious as to why, so I went through the System logs in detail. What I saw was a sequence of entries such as these –

Notice how the time jumps ahead from 13:21 when the OS starts to 13:27 all of a sudden, then jumps back to 13:22 when the Windows Time service starts and begins syncing time from my DC. It looked like this jump of 6 mins was confusing the Exchange services (understandably so). But why was this happening?

I checked the time configuration of the server –
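
That is, via:

    w32tm /query /configuration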

Seems to be normal. It was set to pick time from the site DC via NTP (the first entry under TimeProviders) as well as from the ESXi host the VM is running on (the second entry – VM IC Time Provider). I didn’t think much of the second entry because I know all our VMs have the VMware Tools option to sync time from the host to VM unchecked (and I double checked it anyways).

Only one of the mailbox servers was having this jump though. The other mailbox server had a slight jump but not enough to cause any issues. While the problem server had a jump of 6 mins, the ok server had a jump of a few seconds.

I thought to check the ESXi hosts of both VMs anyways. Yes, they are not set to sync time from the host, but let's double check the host times anyways. And bingo! Turns out the ESXi hosts have NTP turned off and hence varying times. The host with the problem server was about 6 mins ahead of the DC time, while the host with the ok server was about a minute or less ahead – matching the time jumps of the VMs too closely to be a coincidence!

So it looked like the Exchange servers were syncing time from the ESXi hosts even though I thought they were not supposed to. I read a bit more about this and realized my understanding of host-VM time sync was wrong (at least with VMware). When you tick/ untick the option to synchronize VM time with the ESX host, all you are controlling is a periodic synchronization from host to VM. This does not control other scenarios where a VM could synchronize time with the host – such as when it moves to a different host via vMotion, has a snapshot taken, is restored from a snapshot, has its disk shrunk, or (tada!) when the VMware Tools service is restarted (like when the VM is rebooted, as was the case here). Interesting.

So that explains what was happening here. When the problem server was rebooted it synced time with the ESXi host, which was 6 mins ahead of the domain time. This was before the Windows Time service kicked in. Once the Windows Time service started, it noticed the incorrect time and set it right. This time jump confused Exchange – am thinking it didn't confuse Exchange directly, rather one of the AD services running on the server most likely, and due to this the Information Store was unable to start.

The fix for this is to either disable VMs from synchronizing time from the ESXi host or setup NTP on all the ESXi hosts so they have the correct time going forward. I decided to go ahead with the latter.

Update: Found this and this blog post. They have more screenshots and a better explanation, so worth checking out. :)

Using SolarWinds to highlight servers in a pending reboot status

Had a request to use SolarWinds to highlight servers in a pending reboot status. Here’s what I did.

Sorry, this is currently broken. After implementing this I realized I need to enable PowerShell remoting on all servers for it to work, else the script just returns the result from the SolarWinds server. Will update this post after I fix it at my workplace. If you come across this post before that, all you need to do is enable PowerShell remoting across all your servers and change the script execution to “Remote Host”.

SolarWinds has a built in application monitor called “Windows Update Monitoring”. It does a lot more than what I want so I disabled all the components I am not interested in. (I could have also just created a new application monitor, I know, just was lazy).

[Screenshot: winupdatemon-1]

The part I am interested in is the PowerShell Monitor component. By default it checks for the reboot required status by checking a registry key: HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\WindowsUpdate\Auto Update\RebootRequired. Here’s the default script –
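
I no longer have the stock script handy to paste, but it boils down to this (SAM PowerShell components report their result via Statistic/ Message lines and the exit code):

    # Report Down (exit 1) if the Windows Update RebootRequired key exists
    $key = "HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\WindowsUpdate\Auto Update\RebootRequired"
    if (Test-Path $key) {
        Write-Host "Message: Reboot required"
        Write-Host "Statistic: 1"
        exit 1
    } else {
        Write-Host "Message: No reboot required"
        Write-Host "Statistic: 0"
        exit 0
    }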

Inspired by this blog post which monitors three more registry keys and also queries ConfigMgr, I replaced the default PowerShell script with the following –
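
It was roughly the following (a reconstruction, not the verbatim script):

    $pending = $false

    # Registry keys whose presence indicates a pending reboot
    if (Test-Path "HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\Component Based Servicing\RebootPending") { $pending = $true }
    if (Test-Path "HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\WindowsUpdate\Auto Update\RebootRequired") { $pending = $true }

    # Pending file rename operations
    if (Get-ItemProperty "HKLM:\SYSTEM\CurrentControlSet\Control\Session Manager" -Name PendingFileRenameOperations -ErrorAction SilentlyContinue) { $pending = $true }

    # Pending computer rename
    $active  = (Get-ItemProperty "HKLM:\SYSTEM\CurrentControlSet\Control\ComputerName\ActiveComputerName").ComputerName
    $planned = (Get-ItemProperty "HKLM:\SYSTEM\CurrentControlSet\Control\ComputerName\ComputerName").ComputerName
    if ($active -ne $planned) { $pending = $true }

    # Ask the ConfigMgr client too, if it is present
    try {
        $ccm = Invoke-WmiMethod -Namespace "ROOT\ccm\ClientSDK" -Class CCM_ClientUtilities -Name DetermineIfRebootPending -ErrorAction Stop
        if ($ccm.RebootPending -or $ccm.IsHardRebootPending) { $pending = $true }
    } catch { }

    if ($pending) {
        Write-Host "Message: Machine restart pending"
        Write-Host "Statistic: 1"
        exit 1
    } else {
        Write-Host "Message: No restart pending"
        Write-Host "Statistic: 0"
        exit 0
    }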

Then I added the application monitor to all my Windows servers. The result is that I can see the following information on every node –

[Screenshot: winupdatemon-2]

Following this I created alerts to send me an email whenever the status of the above component (“Machine restart status …”) went down for any node. And I also created a SolarWinds report to capture all nodes for which the above component was down.

[Screenshot: winupdatemon-3]

Then I assigned this to a schedule to run once a month after our patching window, emailing me a list of nodes that require reboots.