Contact

Subscribe via Email

Subscribe via RSS

Categories

Creative Commons Attribution 4.0 International License
© Rakhesh Sasidharan

Elsewhere

Bug in Server 2016 and DNS zone delegation with CNAME records

A colleague at work mentioned a Server 2016 DNS zone delegation bug he had found. I found just one post on the Internet when I searched for this. According to my colleague Microsoft has now confirmed this as a bug in the support call he raised. 

DNS being an area of interest I wanted to replicate the issue and post it – so here goes. Hopefully it makes sense. :)

Imagine a split-DNS scenario for a zone rockylabs.zero

  • This zone is hosted externally by a DNS server (doesn’t matter what OS/ software) called data01
  • This zone is hosted internally by two DNS servers: a Server 2012R2 (called DC2012-01), and a Server 2016 (called DC2016-01). 

Now say there’s a record rakhesh.rockylabs.zero that is the same both internally and externally. As in, we want both internal and external users to get the same (external) IP address for this record. 

What you would typically do is add this record to your external DNS server and create a delegation from your two internal DNS servers, for this record, to the external DNS server. Here’s some screenshots:

The zone on my external DNS server. Notice I have an A record for rakhesh.rockylabs.zero.  

Ignore the rakhesh2.rockylabs.zero record for now. That comes in later. :)

Here’s a look at the delegation from my Server 2012R2 internal DNS server to the external DNS server for the rakhesh.rockylabs.zero record. Basically I create a delegation within the rockylabs.zero zone on the internal server, for the rakhesh domain, and point it to the external DNS server. On the external DNS server rakhesh.rockylabs.zero is defined as an A record so that will be returned as an answer when this delegation chain is followed. 

In my case both the internal DNS servers are also DCs, and the rockylabs.zero zone is AD integrated, so a similar delegation is automatically created on the other DNS server too. 

As would be expected, I am able to resolve this A record correctly from both internal DNS servers.

Now for the fun part!

Notice the rakhesh2.rockylabs.zero record on my external DNS server? Unlike rakhesh.rockylabs.zero this one is a CNAME record. This too should be common for both internal and external users. Shouldn’t be a problem really as it should work similarly to the A record. Following the chain of delegation when I resolve rakhesh2.rockylabs.zero to a CNAME record called rakhesh.com, my DNS server should automatically resolve the A record for rakhesh.com and give me its address as the answer for rakhesh2.rockylabs.zero.  It works with the Server 2012R2 internal DNS server as expected – 

But breaks for the 2016 internal DNS server!

And that’s it! That’s the bug basically. 

Here’s the odd bit though. If I were to query rakhesh.com (the domain to which the CNAME record points to), and then try to resolve the delegated record, it works!

If I go ahead and clear the cache on that 2016 internal server and try the name resolution again, it’s broken as before.

So the issue is that the 2016 DNS Server is able to follow the delegation for rakhesh2.rockylabs.zero to the external DNS server and resolve it to rakhesh.com, but it is doesn’t then go ahead and lookup rakhesh.com to get its A record. But if the A record for rakhesh.com is already cached with it, it is sensible enough to return that address. 

I dug a bit more into this by enabling debug logging on the 2016 server. Here’s what I found.

The 2016 server receives my query:

It passes this on to the external server (10.10.1.11 is data01 – external DNS server where rakhesh2.rockylabs.zero is delegated to). FYI, I am truncating the output here:

It gets a reply with the CNAME record. So far so good. 

Now it queries the external DNS server (data01 – 10.10.1.11) asking for the A record of rakhesh.com! That’s two wrong things: 1) Why ask the external DNS server (who as far as the internal DNS server knows is only delegated the rakhesh2.rockylabs.zero zone and has nothing to do with rakhesh.com) and 2) why ask it for the A record instead of the NS record so it can find the name servers for rakhesh.com and ask those for the IP address of rakhesh.com

It’s pretty much downhill from there, coz as expected the external DNS server replies saying “doh I don’t know” and gives it a list of root servers to contact. FYI, I truncated the output:

The 2016 internal DNS server now replies to the client with a fail message. As expected. 

So now we know why the 2016 server is failing. 

Until this is fixed one workaround would be to create a CNAME record directly in the internal DNS server to whatever the external DNS server points to. That is, don’t delegate to external – just create the same record internally too. Only for CNAME records; A is fine. Here’s an example of it working with a record called rakhesh3.rockylabs.zero where I simply made a CNAME on the internal 2016 DNS server. 

That’s all for now!

Add-DnsServerZoneDelegation with multiple nameservers

Only reason I am creating this post is coz I Googled for the above and didn’t find any relevant hits

I know I can use the Add-DnsServerZoneDelegation cmdlet to create a new delegated zone (basically a sub-domain of a zone hosted by our DNS server, wherein some other DNS server hosts that sub-domain and we merely delegate any requests for that sub-domain to this other DNS server). But I wasn’t sure how I’d add multiple name servers. The switches give an option to give an array of IP addresses, but that’s just any array of IP addresses for a single name server. What I wanted was to have an array of name servers each with their own IP.

Anyways, turns out all I had to do was run the command for each name server. 

Above I create a delegation from my zone “parentzone.com” to 3 DNS servers “DNS0[1-3].somedomain” (also specified by their respective IPs) for the sub-domain “subzone.parentzone.com”.

Installing a new license key in KMS

KMS is something you login to once in a blue moon and then you wonder how the heck are you supposed to install a license key and verify that it got added correctly. So as a reminder to myself.

To install a license key:

Then activate it:

If you want to check that it was added correctly:

I use cscript so that the output comes in the command prompt itself and I can scroll up and down (or put into a text file) as opposed to a GUI window which I can’t navigate.

Solarwinds not seeing correct disk size; “Connection timeout. Job canceled by scheduler.” errors

Had this issue at work today. Notice the disk usage data below in Solarwinds –

Disk Usage

The ‘Logical Volumes’ section shows the correct info but the ‘Disk Volumes’ section shows 0 for everything.

Added to that all the Application Monitors had errors –

Timeout

I searched Google on the error message “Connection timeout. Job canceled by Scheduler.” and found this Solarwinds KB article. Corrupt performance counters seemed to be a suspect. That KB article was a bit confusing me to in that it gives three resolutions and I wasn’t sure if I am to do all three or just pick and choose. :)

Event Logs on the target server did show corrupt performance counters.

Initial Errors

I tried to get the counters via PowerShell to double check and got an error as expected –

Broken Get-Counter

Ok, so performance counter issue indeed. Since the Solarwinds KB article didn’t make much sense to me I searched for the Event ID 3001 as in the screenshot and came across a TechNet article. Solution seemed simple – open up command prompt as an admin, run the command lodctr /R. This command apparently rebuilds the performance counters from scratch based on currently registry settings adn backup INI files (that’s what the help message says). The command completed straight-forwardly too.

lodctr - 1

With this the performance counters started working via PowerShell.

Working Get-Counter

Event Logs still had some error but those were to do with the performance counters of ASP.Net and Oracle etc.

More Errors

The fix for this seemed to be a bit more involved and requires rebooting the server. I decided to skip it for now as I don’t these additional counters have much to do with Solarwinds. So I let those messages be and tried to see if Solarwinds was picking up the correct info. Initially I took a more patient approach of waiting and trying to make it poll again; then I got impatient and did things like removing the node from monitoring and adding it back (and then wait again for Solarwinds to poll it etc) but eventually it began working. Solarwinds now sees the disk space correctly and all the Application Monitors work without any errors too.

Here’s what I am guessing happened (based on that Solarwinds KB article I linked to above). The performance counters of the server got corrupt. Solarwinds uses counters to get the disk info etc. Due to this corruption the poller spent more time than usual when fetching info from the server. This resulted in the Application Monitor components not getting a chance to run as the poller had run out of time to poll the server. Thus the Application Monitors gave the timeout errors above. In reality the timeout was not from those components, it was from the corrupt performance counters.

Find which bay an HP blade server is in

So here’s the situation. We have a bunch of HP rack enclosures. Some blade servers were moved from one rack to another but the person doing the move forgot to note down the new location. I knew the iLO IP address but didn’t know which enclosure it was in. Rather than login to each enclosure OA, expand the device bays and iLO info and find the blade I was interested in, I wrote this batch file that makes use of SSH/ PLINK to quickly find the enclosure the blade was in.

Put this in a batch file in the same folder as PLINK and run it.

Note that this does depend on SSH access being allowed to your enclosures.

Update: An alternative way if you want to use PowerShell for the looping –

 

Using SolarWinds to highlight servers in a pending reboot status

Had a request to use SolarWinds to highlight servers in a pending reboot status. Here’s what I did.

Sorry, this is currently broken. After implementing this I realized I need to enable PowerShell remoting on all servers for it to work, else the script just returns the result from the SolarWinds server. Will update this post after I fix it at my workplace. If you come across this post before that, all you need to do is enable PowerShell remoting across all your servers and change the script execution to “Remote Host”.

SolarWinds has a built in application monitor called “Windows Update Monitoring”. It does a lot more than what I want so I disabled all the components I am not interested in. (I could have also just created a new application monitor, I know, just was lazy).

winupdatemon-1

The part I am interested in is the PowerShell Monitor component. By default it checks for the reboot required status by checking a registry key: HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\WindowsUpdate\Auto Update\RebootRequired. Here’s the default script –

Inspired by this blog post which monitors three more registry keys and also queries ConfigMgr, I replaced the default PowerShell script with the following –

Then I added the application monitor to all my Windows servers. The result is that I can see the following information on every node –

winupdatemon-2

Following this I created alerts to send me an email whenever the status of the above component (“Machine restart status …”) went down for any node. And I also created a SolarWinds report to capture all nodes for which the above component was down.

winupdatemon-3

Then I assigned this to a schedule to run once in a month after our patching window to email me a list of nodes that require reboots.

 

Solarwinds AppInsight for IIS – doing a manual install – and hopefully fixing invalid signature (error code: 16007)

AppInsight from Solarwinds is pretty cool. At least the one for Exchange is. Trying out the one for IIS now. Got it configured on a few of our servers easily but it failed on one. Got the following error –

appinsight-error

Bummer!

Manual install it is then. (Or maybe not! Read on and you’ll see a hopeful fix that worked for me).

First step in that is to install PowerShell (easy) and the IIS PowerShell snap-in. The latter can be downloaded from here. This downloads the Web Platform Installer (a.k.a. “webpi” for short) and that connects to the Internet to download the goods. In theory it should be easy, in practice the server doesn’t have connectivity to the Internet except via a proxy so I have to feed it that information first. Go to C:\Program Files\Microsoft\Web Platform Installer for that, find a file called WebPlatformInstaller.exe.config, open it in Notepad or similar, and add the following lines to it –

This should be within the <configuration> -- </configuration> block. Didn’t help though, same error.

webpi-error

Time to look at the logs. Go to %localappdata%\Microsoft\Web Platform Installer\logs\webpi for those.

From the logs it looked like the connection was going through –

But the problem was this –

If I go to the link – https://www.microsoft.com/web/webpi/5.0/webproductlist.xml – via IE on that server I get the following –

untrusted-cert

 

However, when I visit the same link on a different server there’s no error.

Interesting. I viewed the untrusted certificate from IE on the problem server and compared it with the certificate from the non-problem server.

Certificate on the problem server

Certificate on the problem server

Certificate on a non-problem server

Certificate on a non-problem server

Comparing the two I can see that the non-problem server has a VeriSign certificate in the root of the path, because of which there’s a chain of trust.

verisign - g5

If I open Certificate Manager on both servers (open mmc > Add/ Remove Snap-Ins > Certificates > Add > Computer account) and navigate to the “Trusted Root Certification Authorities” store) on both servers I can see that the problem server doesn’t have the VeriSign certificate in its store while the other server has.

cert manager - g5

So here’s what I did. :) I exported the certificate from the server that had it and imported it into the “Trusted Root Certification Authorities” store of the problem server. Then I closed and opened IE and went to the link again, and bingo! the website opens without any issues. Then I tried the Web Platform Installer again and this time it loads. Bam!

The problem though is that it can’t find the IIS PowerShell snap-in. Grr!

no snap-in

no snap-in 2

That sucks!

However, at this point I had an idea. The SolarWinds error message was about an invalid signature, and what do we know of that can cause an invalid signature? Certificate issues! So now that I have installed the required CA certificate for the Web Platform Installer, maybe it sorts out SolarWinds too? So I went back and clicked “Configure Server” again and bingo! it worked this time. :)

Hope this helps someone.

Solarwinds – “The WinRM client cannot process the request”

Added the Exchange 2010 Database Availability Group application monitor to couple of our Exchange 2010 servers and got the following error –

error1

Clicking “More” gives the following –

error2

This is because Solarwinds is trying to run a PowerShell script on the remote server and the script is unable to run due to authentication errors. That’s because Solarwinds is trying to connect to the server using its IP address, and so instead of using Kerberos authentication it resorts to Negotiate authentication (which is disabled). The error message too says the same but you can verify it for yourself from the Solarwinds server too. Try the following command

This is what’s happening behind the scenes and as you will see it fails. Now replace “Negotiate” with “Kerberos” and it succeeds –

So, how to fix this? Logon to the remote server and launch IIS Manager. It’s under “Administrative Tools” and may not be there by default (my server only had “Internet Information Services (IIS) 6.0 Manager”), in which case add it via Server Manager/ PowerShell –

Then open IIS Manager, go to Sites > PowerShell and double click “Authentication”.

iis-1

Select “Windows Authentication” and click “Enable”.

iis-2

Now Solarwinds will work.

Create Solarwinds account limitations based on Custom Properties

I wanted to create account limitations in Solarwinds based on Custom Properties but the web console by default doesn’t give an option to do that.

default limitationsThen I came across this helpful video.

The trick is to logon to your Solarwinds server and find the “Account Limitation Builder”. Then click “Add” and create a new limitation similar to this –

new limitation

Now the limitation you create will come in the list –

new limitation2

Select that, and in the next screen you can choose the value you want to limit to –

new limitation3

Removing a monitored resource from multiple nodes in Solarwinds

Had to remove a drive from being monitored from multiple servers in Solarwinds. Rather than go to each node, edit its properties and untick the drive, I figured that if you do a search for the nodes and expand each of them it’s possible to tick the ones you don’t need and click “Delete”.

multiple nodes

Nice!

Using Solarwinds to monitor Windows Services

This is similar to how I monitored performance counters with Solarwinds.

I want to monitor a bunch of AppSense services.

Similar to the performance counters where I created an application monitor template so I could define the threshold values, here I need to create an application monitor so that the service appears in the alert manager. This part is easy (and similar to what I did for the performance counters) so I’ll skip and just put a screenshot like below –

applimonitor

I created a separate application monitor template but you could very well add it to one of the standard templates (if it’s a service that’s to be monitored on all nodes for instance).

Now for the part where you create alerts.

Initially I thought this would be a case of creating triggers when the above application monitor goes down. Something like this – alert1

And create an alert message like this –

alert1a

With this I was hoping to get a one or more alert messages only for the services that actually went down. Instead, what happened is that whenever any one service went down I’d get an alert for the service that went down and also a message for the services that were up. Am guessing since I was triggering on the application monitor, Solarwinds helpfully sent the status for each of its components – up or down.

alert2

The solution is to modify your trigger such that you target each component.

alert3

Now I get alerts the way I want.

Hope this helps!

Mute Solarwinds alerts during reboots/ maintenance windows

I wanted to mute Solarwinds alerts during our patch weekends when all servers are rebooted because they have to be and our mailboxes get flooded with Solarwinds alerts. I decided to use custom properties for this purpose. Here’s what I did.

Login to the Solarwinds web console. Go to the “Settings” page, and then “Manage Custom Properties” under “Node *& Group Management”.

Click “Add Custom Property”, select the default of “Nodes” from the drop down, and create something along the following lines –

customproperties

Select the nodes you’d like to apply this custom property to. I chose to apply it on all my Windows and VMware nodes. Set the value to be “No”.

customvalues

Now login to Orion Alerts Manager and pick an alert you’d like to mute during patch weekends. Go to its “Alert Suppression” tab and add a condition on the custom property we created earlier.

alert custom properties

alert custom properties2

And that’s it, really!

Update: Not sure why, but the above didn’t seem to work for me. So I added the Mute_Alerts check as part of the trigger condition itself.

new trigger

Note: If you don’t get the custom property in Orion, close and restart it as an administrator (i.e. right click and do “Run as Administrator” even if you are already running it with an admin account). Not sure why, but until I did that the custom property didn’t get picked up. You only need to do it one time; after that you can launch Orion normally.

Next time your server estate is being rebooted/ undergoing maintenance, login to Solarwinds webconsole and change the “Mute_Alerts” custom property to “Y” for a node/ nodes that you want to mute alerts for. Below I show how I will mute the alerts for all my Windows nodes.

Go to “Manage Nodes”. Group by “Vendor” and select Windows. Then select all nodes. (The checkbox to select all nodes got blanked out in the screenshot below but it’s easy to find).

apply custom property

Then click on “Custom Property Editor” to get to the screen below.

apply custom property

Here too select all the nodes and click “Edit multiple values”.

From the drop down, change the value for “Mute_Alerts” to “true”. Then save changes and that’s it. :)

 

Using Solarwinds to monitor Windows Performance Monitor (perfmon) Counters

Had a request from our Exchange admin to setup Solarwinds alerts for some of our Exchange servers based on Performance Monitor counters.

MSExchangeTransport Queues(_total)\Active Remote Delivery Queue Length       (above 200)
MSExchangeTransport Queues(_total)\Largest Delivery Queue Length                 (above 200)
MSExchangeTransport Queues(_total)\Messages Queued For Delivery                (above 200)
MSExchangeTransport Queues(_total)\Retry Remote Delivery Queue Length        (above 20)

Before setting up alerts I need to add them to Solarwinds first. Here’s how you do that.

First, open up the Solarwinds web console, go to Applications, and then SAM Settings.

applicationssam settings

Then go to Component Monitor Wizard.

component monitor

 

Select Windows Performance Counter Monitor.

perfmon

Notice that it says the data is collected using RPC. This means (1) the server must be monitored by Solarwinds using WMI and not SNMP. In case of the latter, switch to monitoring via WMI. And (2) RPC ports must be open between the Solarwinds server and the target server. If not, monitoring will fail.

Enter the name of a server you wish to target. This server would be one that contains the perfmon counters you are interested in. You use this server to setup monitoring for the counters you are interested in. Change to 64bit if 32bit doesn’t work.

target

Change the “Choose Credential” drop down according to your environment. To select the server it’s better to click “Browse” and find the server you are interested in if Solarwinds complains that it cannot find the name you type in.

Note: The next step will fail if you have not opened the required RPC ports.

Select the counters you are interested in. First select the object you want to monitor (MSExchangeTransport Queues, in the screenshot below) and then the counters.

select counters

The next screen will list all the counters you selected and give you a chance to set warning and critical thresholds. Customize these.

 

properties

Select where you would like these counters added to – a new application monitor/ monitor template, or an existing application monitor/ monitor template. I am going with a new application monitor template. Easier to make changes to templates than individual application monitors.

whereadd

 

Choose more nodes you would like to assign this application monitor to. Am skipping this screenshot. This step is optional as you can assign the application monitor to nodes later too.

An optional step – I also went to Manage Application Templates screen after the above steps, selected the template I created, and assigned it some tags and set a custom view.

defineview

A custom view lets you define what details are shown when anyone clicks this application monitor template on a particular node in the Solarwinds web console. You can customize the view by going to Settings (of Solarwinds) and selecting Manage Views.

Next step is to create an alert. For that you have to logon to the Solarwinds server itself, go to Alert Manager, create a new alert (skipping screenshots for all these) and create a new alert whose condition is as follows:solarwinds trigger

Note that the type of property to monitor is “APM: Component”. This is important for the correct variables to be visible in the alert message. Also, note that I am triggering for each of the component (with an “any” condition) and not for the application monitor itself. This lets me get alerts for individual components; if I don’t do this, and instead trigger on the application monitor itself, I will get alert emails for each component including the ones that don’t have an issue.

Here’s the alert message:

solarwinds message

Power cycle/ Reset an HP blade server

Was getting the following error on one of our servers. It’s from ESXi. None of the NICs were working for the server (the NICs seemed to be working, just that the driver wasn’t loading). 

error

Power cycle required. 

I switched off and switched on the server but that didn’t help. Turns out that doesn’t actually power cycle the server (because the server still has power – doh!). What you need to do is do something called an e-fuse reset. This power cycles the blade. You have to do this by opening an SSH session to the Onboard Administrator, finding the bay number of the blade you want to power cycle, and typing the command reset server <bay number>

Good to know!

Note: The command does not appear when you type help, but it’s there:

Also, to get a list of your bays and servers use the show server list command. To do the same for interconnects use the show interconnect list command.

Unable to login to vSphere because the admin@system-domain password cannot be reset

vSphere 5.1 has admin@system-domain as the default admin account. vSphere 5.5 changes that to administrator@vsphere.local. However, if you upgrade from 5.1 to 5.5 the default admin account remains admin@system-domain. Which is fine and dandy until the password for this account expires. Then you are unable to reset or login! See below. :)

Trying to login as usual

1 - login

Password has expired, needs a reset

2 - reset

Reset fails though coz you can only reset for the vsphere.local domain

3 - reset fails

Missed out on taking a screenshot but if you were to try and login with administrator@vsphere.local instead you get an error that the credentials are invalid (because that account doesn’t exist!). So you are stuck!

What do you do?

Solution is to reset the admin password

When you do this vSphere automatically creates the administrator@vsphere.local account. Follow the steps in this KB article.

4 - reset password

Now you can login with administrator@vsphere.local and the generated password.