Contact

Subscribe via Email

Subscribe via RSS

Categories

Creative Commons Attribution 4.0 International License
© Rakhesh Sasidharan

Windows DNS server subnet prioritization and round-robin

Consider the following multiple A records for a DNS record proxy.mydomain.com:

  • proxy.mydomain.com IN A 192.168.10.5
  • proxy.mydomain.com IN A 10.136.53.5
  • proxy.mydomain.com IN A 10.136.52.5
  • proxy.mydomain.com IN A 10.136.33.5
  • proxy.mydomain.com IN A 192.168.15.5

These records are defined on a DNS server. When a client queries the DNS server for the address to proxy.mydomain.com, the DNS server returns all the addresses above. However, the order of answers returned keeps varying. The first client asking for answers could get them in the following order for instance:

  • proxy.mydomain.com IN A 192.168.10.5
  • proxy.mydomain.com IN A 10.136.53.5
  • proxy.mydomain.com IN A 10.136.52.5
  • proxy.mydomain.com IN A 10.136.33.5
  • proxy.mydomain.com IN A 192.168.15.5

The second client could get them in the following order:

  • proxy.mydomain.com IN A 10.136.53.5
  • proxy.mydomain.com IN A 10.136.52.5
  • proxy.mydomain.com IN A 10.136.33.5
  • proxy.mydomain.com IN A 192.168.15.5
  • proxy.mydomain.com IN A 192.168.10.5

The third client could get:

  • proxy.mydomain.com IN A 10.136.52.5
  • proxy.mydomain.com IN A 10.136.33.5
  • proxy.mydomain.com IN A 192.168.15.5
  • proxy.mydomain.com IN A 192.168.10.5
  • proxy.mydomain.com IN A 10.136.53.5

This is called round-robin. Basically it rotates between the various IP addresses. All IP addresses are offered as answers, but the order is rotated so that as long as clients choose the first answer in the list every client chooses a different IP address.

Notice I said clients choose the first answer in the list. This needn’t always be the case though. When I said clients above, I meant the client computer that is querying the DNS server for an answer. But that’s not really who’s querying the server. Instead, an application on the client (e.g. Chrome, Internet Explorer) or the client OS itself is the one looking for an answer. These ask the DNS resolver which is usually a part of the OS for an answer, and it’s the resolver that actually queries the server and gets the list of answers above.

The DNS resolver can then return the list as it is to the requesting application, or it can apply a re-ordering of its own. For instance, if the client is from the 192.168.10.0 network, the resolver may re-order the answers such that the 192.168.10.5 answer is always first. This is called Subnet prioritization. Basically, the resolver prioritizes answers that are from the same subnet as the client. The idea being that client applications would prefer reaching out to a server in their same subnet (it’s closer to them, no need to go over the WAN link for instance) than one on a different subnet.

Subnet prioritization can be disabled on the resolver side by adding a registry key PrioritizeRecordData (link) with value 0 (REG_DWORD) at HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\DnsCache\Parameters. By default this key does not exist and has a default value of 1 (subnet prioritization enabled).

Subnet prioritization can also be set on the server side so it orders the responses based on the client network. This is controlled by the registry key LocalNetPriority (link) under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\DNS\Parameters\ on the DNS server. By default this is 0, so the server doesn’t do any subnet prioritization. Change this to 1 and the server will order its responses according to the client subnet.

By default the server also does round-robin for the results it returns. This can be turned off via the DNS Management tool (under server properties > advanced tab). If round-robin is off the server returns records in the order they were added.

More on subnet prioritization at this link.

That’s is not the end though. :)

Consider a server who has round-robin and subnet prioritization enabled. Now consider the DNS records from above:

  • proxy.mydomain.com IN A 192.168.10.5
  • proxy.mydomain.com IN A 10.136.53.5
  • proxy.mydomain.com IN A 10.136.52.5
  • proxy.mydomain.com IN A 10.136.33.5
  • proxy.mydomain.com IN A 192.168.15.5

The first and last records are from class C networks. The other three are from Class A networks. In reality though thanks to CIDR these are all class C addresses.

Now say there’s a client with IP address 10.136.50.2/24 asking the server for answers. On the face of it the client network does not match any of the answer record networks so the server will simply return answers as per round-robin, without any re-ordering. But in reality though the client 10.136.50.2/24 is in the same network as 10.136.52.5/24 and both are part of a larger 10.136.48.0/20 network that’s simply been broken into multiple /24 networks (to denote clients, servers, etc). What can we do so the server correctly identifies the proxy record for this client?

This is where the LocalNetPriorityNetMask registry key under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\DNS\Parameters\ on the DNS server comes into play. This key – which does not exist by default – tells the server what subnet mask to assume when it’s trying to subnet prioritize. By default the server assumes a /24 subnet, but by tweaking this key we can tell the server to use a different subnet in its calculations and thus correctly return an answer.

The LocalNetPriorityNetMask key takes a REG_DWORD value in a hex format. Check out this KB article for more info, but a quick run through:

A netmask can be written as xxx.xxx.xxx.xxx. 4 pairs of numbers. The LocalNetPriorityNetMask key is of format 0xaabbccdd – again, 4 pairs of hex numbers. This is a mask that’s applied on the mask of 255.255.255.255 so to calculate this number you subtract the mask you want from 255.255.255.255 and convert the resulting numbers into hex.

For example: you want a /8 netmask. That is 255.0.0.0. Subtracting this from 255.255.255.255 leaves you with 0.255.255.255.255. What’s that in hex? 00ffffff. So LocalNetPriorityNetMask will be 0x00ffffff. Easy?

So in the example above I want a /20 netmask. That is, I am telling the server to assume the clients and the record IPs it has to be in a /20 network, so subnet prioritize accordingly. A /20 netmask is 255.255.240.0. Subtract from 255.255.255.255 to get 0.0.15.255. Which in hex is 00000fff (15 decimal is F hex). So all I have to do is put this value as LocalNetPriorityNetMask on the DNS server, restart the service, and now the server will correctly return subnet prioritized answers for my /20 network.

Update: Some more links as I did some more reading on this topic later.

  • Ace Fekay’s post – a must read!
  • A subnet calculator (also gives you the wildcard, which you can use for calculating the LocalNetPriorityNetMask key)
  • I am not very clear on what happens if you disable RoundRobin but there are multiple entries from the same subnet. What order are they returned in? Here’s a link to the RoundRobin setting, doesn’t explain much but just linking it in case it helps in the future.
  • More as a note to myself (and any others if they wonder the same) – the subnet mask you specify is applied on the client. That is to say if you client has an IP address of say 10.136.20.10, by default the DNS server will assume a subnet mask of /24 (Class C is the default) and assume the client is in a 10.136.20.0/24 network. So any records from that range are prioritized. If you want to include other records, you specify a larger subnet mask. Thus, for example, if you specify a /20 then the client is assumed to have an IP address 10.136.20.10/20, so its network range is considered to be 10.136.16.1 – 10.136.31.254 (don’t wrack your brain – use the subnet calculator for this). So any record in this range is prioritized over records not in this range.
  • The Windows calculator can be used to find the LocalNetPriorityNetMask key value. Say you want a subnet mask of /19. The subnet calculator will tell you this has a wildcard of 0.0.31.255 – i.e. 00011111.11111111. Put this (13 1’s) into the Windows calculator to get the hex value 3FFF.  

vMotion is using the Management Network (and failing)

Was migrating one of our offices to a new IP scheme the other day and vMotion started failing. I had a good idea what the problem could be (coz I encountered something similar a few days ago in another context) so here’s a blog post detailing what I did.

For simplicity let’s say the hosts have two VMkernel NICs – vmk0 and vmk1. vmk0 is connected to the Management Network. vmk1 is for vMotion. Both are on separate VLANs.

When our Network admins gave out the new IPs they gave IPs from the same range for both functions. That is, for example, vmk0 had an IP 10.20.1.2/24 (and 10.20.1.3/24 and 10.20.4/24 on the other hosts) and vmk1 had an IP of 10.20.12/24 (and 10.20.1.13/24 and 10.20.1.14/24 on the other hosts).

Since both interfaces are on separate VLANs (basically separate LANs) the above setup won’t work. That’s because as far as the hosts are concerned both interfaces are on the same network yet physically they are on separate networks. Here’s the routing table on the hosts:

Notice that any traffic to the 10.20.1.0/24 network goes via vmk0. And that includes the vMotion traffic because that too is in the same network! And since the network that vmk0 is on is physically a separate network (because it is a VLAN) this traffic will never reach the vMotion interfaces of the other hosts because they don’t know of it.

So even though you have specific vmk1 as your vMotion traffic NIC, it never gets used because of the default routes.

If you could force the outgoing traffic to specifically use vmk1 it will work. Below are the results of vmkping using the default route vs explicitly using vmk1:

The solution here is to either remove the VLANs and continue with the existing IP scheme, or to keep using VLANs but assign a different IP network for the vMotion interfaces.

Update: Came across the following from this blog post while searching for something else:

If the management network (actually the first VMkernel NIC) and the vMotion network share the same subnet (same IP-range) vMotion sends traffic across the network attached to first VMkernel NIC. It does not matter if you create a vMotion network on a different standard switch or distributed switch or assign different NICs to it, vMotion will default to the first VMkernel NIC if same IP-range/subnet is detected.

Please be aware that this behavior is only applicable to traffic that is sent by the source host. The destination host receives incoming vMotion traffic on the vMotion network!

That answered another question I had but didn’t blog about in my post above. You see, my network admins had also set the iSCSI networks to be in the same subnet as the management network – but separate VLANs – yet the iSCSI traffic was correctly flowing over that VLAN instead of defaulting to the management VMkernel NIC. Now I understand why! It’s only vMotion that defaults to the first VMkernel NIC in the same IP range/ subnet as vMotion. 

 

How to fail-over the primary VC module to another

Probably obvious to most people but it was new to me. To failover a VC module from primary to backup login to the VC Manager, then go to “Tools > Reset Virtual Connect Manager”.

toolsClick “Yes”, but before that tick the box that says “Force Failover”.

force failoverThat’s all. Now you will be logged out and after a few minutes you can login to the module that is now primary. It takes a few minutes so be patient.

This has no impact on the network or FC connectivity.

Some more ways of doing this can be found at this HP link.

Brief notes on Windows Time

The w32time service provides time for Windows. Since Windows XP NTP (Network Time Protocol) is supported. Prior to that it was only SNTP (Simple NTP).

Non domain joined computers (including servers) use SNTP.

This is a good article that explains the Windows Time service and its configurations. Covers both registry keys and GPOs. This is another good article that goes into even more detail.

Any Windows machine can be set up to sync time in one of four ways: (1) no syncing! (2) sync from specified NTP servers (3) sync via domain hierarchy (i.e. members sync from a DC in the domain; DCs sync from PDC of the parent domain/ forest root domain) (4) use either of the above (i.e. NTP and domain hierarchy). Default mechanism on domain joined computers is domain hierarchy (the setting is called NT5DS). Stand-alone machines have the default as NTP servers (the setting is called NTP; the default server is time.windows.com though you can change it (and probably recommended that you change it?)).

For machines that are off and on the domain – e.g. laptops – it is better to set their time sync mechanism as any. They needn’t always have contact with the DC to sync time.

When specifying NTP time servers you also specify flags. Check this post for an explanation of the flags. There are four possible flags: 0x01 SpecialInterval; 0x02 UseAsFallbackOnly; 0x04 SymmetricActive; 0x08 Client.

  • Flag UseAsFallbackOnly means the server is only used if the others are unavailable. Check out this post for an example of this.
  • Flag SpecialInterval lets you change how often the NTP server is polled. By default the interval is determined by Windows based on the quality of time samples, but you can use the above flag and set a registry key HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\W32Time\TimeProviders\NtpClient\SpecialPollInterval to change the polling interval.
  • I am not sure what the other two flags do. The Client flag seems to be a commonly used one. Some posts/ articles use it, others don’t. The default time.windows.com setting uses this flag as well as the SpecialInterval.

p.s. To turn on w32tm debugging check out this link.

The dig command

Just a quick post on how to use the dig command. Had to troubleshooting some DNS issues today and nslookup just wasn’t cutting it so I downloaded BIND for Windows and ran dig from there.  (For anyone else that’s interested – if you download BIND (it’s a zip file), extract the contents, and copy dig.exe and all the .dll files elsewhere that’s all you need to run dig).

Here’s the simple way of running dig:

A bit confusing when you come from nslookup where you type the name server first and then the name and type. Here’s the nslookup syntax for reference:

Although all documentation shows the dig syntax as above I think the order can be changed too. Since the name server is identified by the @ sign you can specify it later on too:

I will use this syntax everywhere as it’s easier for me to remember.

You can complicate the simple syntax by adding options to dig. There’s two sort of options: query options and domain options (I made the names up). These are usually placed after the type but again the order doesn’t really matter:

  • Query options start with a - (dash) and look like the switches you have in most command line tools. These are useful to control the behavior of dig when making the query.
    • For instance: say I want to for dig to only use IPv4. I can add the option -4. Similarly for only IPv6 it is -6.
    • If I want to specify the port of the name server (i.e. apart from 53) the -p option is for that.
    • (Very useful) If I want to specify a different source port for the client (i.e. the port from which the outgoing request is made and to which the reply is sent, in case you want to eliminate any issues with that) the -b switch is for that. It takes as argument address#port, if you don’t care about the address leave it as 0.0.0.0#port.
    • Apart from specifying the name and type as above it is also possible to pass these as query options via -n and -t.
    • There are some more query options. Above are just the common ones I can remember.
  • Domain options start with a + (plus). These pertain to the name lookup you are doing.
    • For instance: typically DNS queries are performed using UDP. To use TCP instead do +tcp. Most of these options can be prefixed with a “no”, so if you explicitly want to not use TCP then you can do +notcp instead.
    • To specify the default domain name do +domain=xxx.
    • To turn off or on recursive mode do +norecurse or +recurse. (This is similar to the -recurse or -norecurse switch in nslookup).
      • Note: By default recursion is on.
      • A quick note on recursion: When recursion is on dig asks the name server you supply (or that used by the OS) for an answer and if that name server doesn’t have an answer the name server is expected ask other name servers and get back with an answer. Hence the term “recursive”. The name server will send a recursive query to the name server it knows of, who will send a recursive query to the name server it knows of, until an answer is got down the chain of delegation. 
      • A quick note on iterative (non-recursive queries): When recursion is off dig gets an answer from the first name server, then it queries the second name server, gets an answer and queries the third name server, and so on … until dig itself gets an answer from a server who knows.
    • The output from dig is usually verbose and shows all the sections. To keep it short use the +short option.
      • This doesn’t show the name of the server that dig got an answer from. To print this too use the +identify option.
    • To list all the authoritative name servers of a domain along with their SOA records use the +nssearch option. This disables recursion.
    • To list a trace of the delegation path use the +trace option. This too disables recursion. The delegation path is listed by dig contacting each name server down the chain iteratively.
    • Couple of output modifying options. Most of these require no changing:
      • By default the first line of the output shows the version of  dig, the query, and some extra bits of info. To hide that use +nocmd.
      • By default the answer section of replies is shown. To hide it do +noanswer.
      • By default the additional sections of a reply are shown. To hide it do +noadditional.
      • By default the authority section of a reply is shown. To hide it do +noauthority.
      • By default the question asked is shown as a comment (this is useful because you might type something like dig rakhesh.com which dig interprets as dig rakhesh.com A so this makes the question explicit). To hide this do +noquestion.
        • Note that this is not same as the query that is sent. The query is the question plus other flags like whether we want a recursive query etc. By default the query that is sent is not shown. To show that do +qr.
      • By default comments are printed. To hide it do +nocomments.
      • By default some stats are printed. To hide these do +nostats.
    • To change the timeout use +time=X. Default is 5 seconds. This is equivalent to nslookup -timeout=X.
    • To change the number of retries use +retry=X. Default is 3. This is equivalent to nslookup -retry=X.
    • There are some more domain options. Above are just the common ones I can remember.

An interesting thing about dig (since BIND 9) is that it can do multiple queries (the brackets below are not part of the command, I put them to show the multiple queries):

Each of these queries can take its own options but in addition to that you can also provide global options. So the full command when you have multiple queries and global options is as below (again, without the brackets):

To add to the fun you can even have one name server to be used for all queries and over-ride these via name servers for each query! Thus a fuller syntax is:

This is also why the options can be at the beginning of the query rather than at the end of the query. When you specify global options you can over-ride these per query (except the +[no]cmd option).

Exporting and Importing Windows DNS zones

Exporting a DNS zone is easy. Use the dnscmd.

Importing too is easy but the commands aren’t so obvious. Again you use dnscmd, with the /zoneadd switch as though you are creating a new zone. The help page for this misses out on an important switch though – /load – which lets you load the zone from an exported or pre-existing file.

You can find this switch in the dnscmd help:

So the way to import a zone is as follows: first, copy the exported file into the c:\windows\system32\dns folder of the DNS server and preferably rename it so the extension is a .dns (not required, just a nice thing to do). Then run a command similar to below:

That’s it. This will create a primary zone called “blah.com” and use the zone file that’s already in the location.

Note that you can’t use this technique for AD integrated zones. But that’s no issue. Simply import as above and then convert the zone to AD integrated via the GUI.

A problem occurred while trying to add the conditional forwarder

I was trying to create a conditional forwarder on my work DNS servers and kept hitting this cryptic error message:

forwarder error

I was trying to create a conditional forwarder called some.sub.zone.com.  At first I thought maybe I had this as an existing zone or perhaps as a stub zone – but nope, I don’t have that. Some forum posts mentioned the lack of root hints could lead to this error, but that doesn’t make sense to me – why would I need root hints for this? Next I created a test conditional forwarder to some random domain name and that worked – so surely it wasn’t a server issue.

I recreated this in my test lab and found the problem. The issue is that I am trying to create a forwarder to some.sub.zone.com while zone.com already exists on the DNS server. I was under the impression you could have conditional forwarders even for zones you host, but nope that’s a no can do. From the official docs here’s a para of interest:

A DNS server cannot forward queries for the domain names in the zones it hosts. For example, the authoritative DNS server for the zone microsoft.com cannot forward queries according to the domain name microsoft.com. The DNS server authoritative for microsoft.com can forward queries for DNS names that end with example.microsoft.com, if example.microsoft.com is delegated to another DNS server.

The emphasis is mine and that’s the work-around to use here. You have two options – either delete the zone.com zone from your DNS servers and then create a conditional forwarder for some.sub.zone.com; or create a delegation for some.sub.zone.com – you could do that to yourself too – and then create the conditional forwarder.

Here’s a screenshot from my test lab –

delegationThe some.sub delegation is to my server itself. You don’t need to create a zone for the delegation to succeed. The delegation is just a one way pointer of sorts telling the server to ask the delegated server for any queries concerning this sub zone – it basically tells the server hosting zone.com that it is no longer responsible for some.sub.zone.com (even though the delegation points back to itself!). Once that is done the server will allow you to create a conditional forwarder for some.sub.zone.com.

Notes on Virtual Connect and firmware upgrading without network outage

I am yet to read this but in case you didn’t know there’s a book by HP on Virtual Connect. I haven’t used Virtual Connect at all except briefly see it for the first time when my colleagues showed it to me last month. I have to update the Virtual Connect firmware for our enclosures now so am looking into how I can do that. Here are some more documents I am yet to read; linking them here as a bookmark to myself:

  • A PDF giving an overview of Virtual Connect
  • A page with all the documentation HP has on Virtual Connect and related
  • A page with many whitepapers and manuals on how Virtual Connect works

Virtual Connect firmware can be done via HP SUM/ SPP. It can also be done independently via the Virtual Connect Support Utility (VCSU).

  • This PDF (which can be found via the second bullet point above) is very useful. It is a document outlining the steps involved in upgrading the Virtual Connect firmware. It’s from 2013 but I couldn’t find anything newer on HP’s website.
  • The above PDF is also linked to from this excellent blog post that talks about how to upgrade the firmware without any downtime. 
  • VCSU can be downloaded from this page.
  • Here’s a page with some of its more useful commands.
  • Finally, this page has the latest version of the firmware. You can download the version of Windows and extract the binary image of the firmware.

Upgrading the Virtual Connect firmware seems very straightforward. As I said you can do it via the SUM/ SPP too. Recommended order is to first upgrade the OA (easily done via SUM/ SPP – no reboot required); then the ROM, iLO, and any other firmware for the blades (again easily done via SUM/ SPP – ROM & iLO don’t require any reboot); and finally the VC.  For me the big question was whether I can do the VC upgrade without any network impact.

The PDF I mentioned above (this one) is a must read on the upgrade process. Page 10 onwards talks about the upgrade process.

One thing to note is that before upgrade VCSU (which is what SUM/ SPP too use behind the scenes I suppose) takes a backup of the configuration and does health checks. If the VCs don’t pass health checks the upgrade doesn’t happen. Each Ethernet module of the VC takes about 20 minute to upgrade; each FC module takes about 5 mins. An overview of the upgrade process can be found on page 11 – in short, here’s what happens:

  1. Via SFTP the new firmware is copied in parallel to all modules.
  2. Firmware is upgraded on all modules in parallel. This can be thought of as the update phase.
  3. Then the firmware are activated. The default order is odd-even in which modules on the odd side of the enclosure are activated, then those on the even side.
    1. It is also possible to do serial activation (one after the other), or parallel (everything at the same time), or manual.
  4. Post activation the module is rebooted.
    1. I am not very clear here but it seems the modules on the backup VC side of things (including the backup VC) get rebooted first.
    2. Then the modules on the primary VC side of things (except the primary VC) get rebooted.
    3. Failover VC Manager (VCM) to the backup VC module, and then the primary VC module is rebooted.
    4. Post-reboot the VCM fails over back to the primary VC module (this is only for the Ethernet modules I think, not FC).

Notice the bit about the reboots above? That’s when network connectivity can be lost. On page 12 the document talks about how network outages can be avoided via redundant configuration and NIC bonding but then on page 13 it clarifies that because the reboot is a graceful one there is a possibility that there could be a 20 second network outage because the blade hardware (and the OS running on it) might not be notified that the VC module is down. You see, something called the SmartLink and DCC protocol are responsible for informing the blades that the VC modules are down and so the NICs they map to are down – and so they should fail over to another NIC using the backup VC – but because the firmware is being upgraded the SmartLink and DCC protocol are unavailable, no one informs the blades. So it only when the OS in the blades realize that it has lost network connectivity and must take corrective action, does the OS fail over to using the backup NIC – leading to a potential 20 second outage.

(What I said above is also what this blog post mentions. To give credit I came across the blog post first and through it the guide).

The workaround to the above outage is to set the activation order as manual. And then reset the VC modules through the OA. Since that’s a reset – as opposed to a graceful reboot – the blades will get a notification immediately that the module is down.

Here’s how I updated the VC firmware on my servers without any network outage. First I used VCSU (in update mode) to update & activate the VC modules. Note I select “manual” as the option in two places below. 

I set a time of 5 mins to wait between activation of each VC module. That’s generally recommended.

After that I got the screens below – the whole process took about 40 minutes:

That completes the updating and activation but the firmware isn’t activated yet because I chose not to reboot. Because of that there’s no network downtime so far.

After that I logged into the OA, went over to the Interconnect Bays section > selected the first VC module > Virtual Buttons tab > and clicked Reset.

vc module reset

This resets the VC module. Again no network outage (I was continually pinging some of the hosts and the VMs – one of the VMs had 3 packet drops, that’s it; the hosts I pinged had no drops). Post resetting (which is instant on the UI) I waited some 5 mins, then checked the Information tab to see the firmware level. It was showing the new firmware:

firmware infoAfter that did the same (reset) for the second VC module. Waited 5-6 mins and then I ran VCSU again (in healthcheck mode) to confirm the state of the modules. (To make the output smaller I used input switches to VCSU. Could have done the same above too).

As can be seen the modules are in sync and both the latest firmware version. All done without any network outage! :)

Update Jan 2016: Chris Lynch (from HPE) wrote to me three months ago clarifying some misinformation in my post above. Turns out you no longer have the 20 second outage and all that I wrote above is more or less incorrect. :) Rather than copy paste his email here I’ve printed it to a PDF and you can read it here – Chris Lynch update. Thanks Chris!

HP SUM/ SPP configuration location

Before I forget – HP SUM & HP SPP store their configuration stuff in the following folder – C:\Users\(Username)\AppData\Local\Temp\2\HPSUM. I spent a while today trying to discover where this information is stored. Thought it would be in the same folder as HP SUM or perhaps in the registry, but no – it’s stored as above!

In my case HP SUM was acting weird and not talking to all my nodes properly. It did so correctly in the beginning, and even updated a few, but after that it kept hanging at the inventory stage and would complain about the username/ password being wrong. Figured I’d nuke it and start again but couldn’t make much progress until I figured the above.

Pause a DNS zone on all DNS servers

Here’s how I paused a zone on all the DNS servers hosting that zone:

This looks up the name servers for the zone and suspends the zone on each of those servers. If there are any servers that host this zone but aren’t specified as name servers for the zone (for example it could be an AD integrated zone but the NS records are incomplete) it misses out those servers. So it’s not a great script, there’s probably better ways to do this.

In my case the zone in question was being replicated to all DCs in the domain. So I got a list of all DCs in the domain and targeted those instead:

 

Stub zones do not need zone transfer (with screenshots!)

I had to write an email about this and so take the trouble to set up a test zone and create screenshots. Figured I might as well put it in a blog post too.

Exhibit A: An AD integrated zone called some.zone.com.

somezoneDoesn’t matter that it’s AD integrated or what NS records it holds. I just created an AD integrated zone to simulate our work environment.

Note that this zone doesn’t have zone transfers enabled.

nozonetransferExhibit B: A regular Windows Server 2012 machine called WIN-SVR01. Not domain joined (just in case anyone points out that could make a difference). It has access to the master server and Name Servers and that’s it. Create a stub zone as usual, pointing it to the master servers (in the screenshot below I point to just one master server).

new stub zone

Exhibit C: And that’s it! As soon as I do the above, the zone loads and I am able to query records in it.

stub zone works

That’s it!

One source of confusion seems to be the Get-DnsServerZone cmdlet. Here’s the cmdlet output once the stub zone has loaded:

Note the attributes LastZoneTransferAttempt and LastZoneTransferResult – these give the impression a zone transfer is being carried out.

Now watch the same output after I recreated the stub zone but this time I blocked it from accessing the master servers (so the stub zone doesn’t load):

Even though the zone hasn’t loaded LastSuccessfulZoneTransfer gives the impression it has succeeded. LastZoneTransferResult gives an error code though. Best to ignore these attributes for stub zones – as I showed above stub zones don’t require a zone transfer.

Stub zones do not need zone transfer

At work there was some confusion that creating stub zones requires the master servers to allow zone transfers to the servers holding the stub zones. That’s not correct and oddly I couldn’t find any direct hits when I typed this query into Google so I could show some blog posts/ articles for support.

I mentioned this briefly in one of my earlier blog posts. But don’t take my word for it here are two blog posts and a book except mentioning the same:

If stub zones don’t require any zone transfers what’s the difference between them and conditional forwarders? Again, check the second link above but the long and short of it is that stub zones query the master server IPs you give and asks these servers for a list of NS records and their addresses, and then queries these name servers for whatever record you want; while conditional forwarders have a predefined list of name servers and so always query these for the record you want. Stub zones are more resilient to changes. If the remote end adds/ removes a name server the stub zone will automatically pick it up as long as the master servers are up-to-date and reachable. A conditional forwarder won’t do this automatically – the remote end admins will have to communicate the new name server details to the conditional forwarder admins and they will have to update at their end.

Hope that clarifies!

p.s. See my next post.

p.p.s. Hadn’t thought of this. Good point (via):

Stub zones: will use whatever is in the NS records of the zone (or descendants of the zone, if not otherwise defined) to resolve queries  which are below a zone cut.

Forward zones: will always use the configured forwarders, which must support recursion, even for names which are known to be deeper in the delegation hierarchy and whose delegated/authoritative nameservers might respond more quickly than the forwarders, if asked.

The “Administrators” group

Note to self: the “Domain Admins” and “Enterprise Admins” groups aren’t the primary groups in a domain. The primary group is the “Administrators” group, present in the “Builtin” folder. The other two groups are members of this group and thus get rights over the domain. The “Enterprise Admins” group is also a member of the “Administrator” group in all other domains/ child-domains of that forest, hence its members get rights over those domains too.

So if you want to create a separate group in your domain and want to give its members domain admin rights over (say) a child domain, all you need to do is create the group (must be Global or Universal) an add this group to the “Administrators” group in the child domain. That’s it!

Cannot ping an address but nslookup works (contd)

Earlier today I had blogged about nslookup working but ping and other methods not resolving names to IP addresses. That problem started again, later in the evening.

Today morning though as a precaution I had enabled the DNS Client logs on my computer. (To do that open Event Viewer with an admin account, go down to Applications and Services Logs > Microsoft > Windows > DNS Client Events > Operational – and click “Enable log” in the “Actions” pane on the right).

That showed me an error along the following lines:

A name not found error was returned for the name vcenter01.rakhesh.local. Check to ensure that the name is correct. The response was sent by the server at 10.50.1.21:53.

Interesting. So it looked like a particular DC was the culprit. Most likely when I restarted the DNS Client service it just chose a different DC and the problem temporarily went away. And sure enough nslookup for this record against this DNS server returned no answers.

I fired up DNS Manager and looked at this server. It seemed quite outdated with many missing records. This is my simulated branch office DC so I don’t always keep it on/ online. Looks like that was coming back to bite me now.

The DNS logs in Event Manager on that server had errors like this:

The DNS server was unable to complete directory service enumeration of zone TrustAnchors.  This DNS server is configured to use information obtained from Active Directory for this zone and is unable to load the zone without it.  Check that the Active Directory is functioning properly and repeat enumeration of the zone. The extended error debug information (which may be empty) is “”. The event data contains the error.

So Active Directory is the culprit (not surprising as these zones are AD integrated so the fact that they weren’t up to date indicated AD issues to me). I ran repadmin /showrepl and that had many errors:

Naming Context: CN=Configuration,DC=rakhesh,DC=local

Source: COCHIN\WIN-DC03
******* WARNING: KCC could not add this REPLICA LINK due to error.

Naming Context: DC=DomainDnsZones,DC=rakhesh,DC=local
Source: COCHIN\WIN-DC03
******* WARNING: KCC could not add this REPLICA LINK due to error.

Naming Context: DC=ForestDnsZones,DC=rakhesh,DC=local
Source: COCHIN\WIN-DC03
******* WARNING: KCC could not add this REPLICA LINK due to error.

Naming Context: DC=rakhesh,DC=local
Source: COCHIN\WIN-DC03
******* WARNING: KCC could not add this REPLICA LINK due to error.

Naming Context: CN=Configuration,DC=rakhesh,DC=local
Source: COCHIN\WIN-DC01
******* WARNING: KCC could not add this REPLICA LINK due to error.

Naming Context: DC=DomainDnsZones,DC=rakhesh,DC=local
Source: COCHIN\WIN-DC01
******* WARNING: KCC could not add this REPLICA LINK due to error.

Naming Context: DC=ForestDnsZones,DC=rakhesh,DC=local
Source: COCHIN\WIN-DC01
******* WARNING: KCC could not add this REPLICA LINK due to error.

Naming Context: DC=rakhesh,DC=local
Source: COCHIN\WIN-DC01
******* WARNING: KCC could not add this REPLICA LINK due to error.

Great! I fired up AD Sites and Services and the links seemed ok. Moreover I could ping the DCs from each other. Event Logs on the problem DC (WIN-DC02) had many entries like this though:

The Kerberos client received a KRB_AP_ERR_MODIFIED error from the server win-dc01$. The target name used was Rpcss/WIN-DC01. This indicates that the target server failed to decrypt the ticket provided by the client. This can occur when the target server principal name (SPN) is registered on an account other than the account the target service is using. Please ensure that the target SPN is registered on, and only registered on, the account used by the server. This error can also happen when the target service is using a different password for the target service account than what the Kerberos Key Distribution Center (KDC) has for the target service account. Please ensure that the service on the server and the KDC are both updated to use the current password. If the server name is not fully qualified, and the target domain (RAKHESH.LOCAL) is different from the client domain (RAKHESH.LOCAL), check if there are identically named server accounts in these two domains, or use the fully-qualified name to identify the server.

Hmm, secure channel issues? I tried resetting it but that too failed:

(Ignore the above though. Later I realized that this was because I wasn’t running command prompt as an admin. Because of UAC even though I was logged in as admin I should have right clicked and ran command prompt as admin).

Since I know my environment it looks likely to be a case of this DC losing trust with other DCs. The KRB_AP_ERR_MODIFIED error also seems to be related to Windows Server 2003 and Windows Server 2012 R2 but mine wasn’t a Windows Server 2003. That blog post confirmed my suspicions that this was password related.

Time to check the last password set attribute for this DC object on all my other DCs. Time to run repadmin /showobjmeta.

The above output gives metadata of the WIN-DC02 object on WIN-DC01. I am interested in the pwdLastSet attribute and its timestamp.  Here’s a comparison of this across my three DCs:

That confirms the problem. WIN-DC02 thinks its password last changed on 9th May whereas WIN-DC01 changed it on 25th July and replicated it to WIN-DC03.

Interestingly that date of 25th July is when I first started having problems in my test lab. I thought I had sorted them but apparently they were only lurking beneath. The solution here is to reset the WIN-DC02 password on itself and WIN-DC01 and replicate it across. The steps are in this KB article, here’s what I did:

  1. On WIN-DC02 (the problem DC) I stopped the KDC service and set it to start Manual.
  2. Purge the Kerberos ticket cache. You can view the ticket cache by typing the command: klist.  To purge, do: klist purge.
  3. Open a command prompt as administrator (i.e. right click and do a “Run as administrator”) then type the following command: netdom resetpwd /server WIN-DC01.rakhesh.local /UserD MyAdminAccount /PasswordD *
  4. Restart WIN-DC02.
  5. After logon start the KDC service and set it to Automatic.

Checked the Event Logs to see if there are any errors. None initially but after I forced a sync via repadmin /syncall /e I got a few. All of them had the following as an error:

2148074274 The target principal name is incorrect.

Odd. But at least it was different from the previous errors and we seemed to be making progress.

After a bit of trial and error I noticed that whenever the KDC service on the DC was stopped it seemed to work fine.

I could access other servers (file shares), connect to them via DNS Manager, etc. But start KDC and all these would break with errors indicating the target name was wrong or that “a security package specific error occurred”. Eventually I left KDC stay off, let the domain sync via repadmin /syncall, and waited a fair amount of time (about 15-20 mins) for things to settle. I kept an eye on repadmin /replsummary to see the deltas between WIN-DC02 and the rest, and also kept an eye on the DNS zones to see if WIN-DC02 was picking up newer entries from the others. Once these two looked positive, I started KDC. And finally things were working!

 

Brief notes on HP SUM and SPP

HP SUM (Smart Update Manager) can be downloaded from http://h17007.www1.hp.com/us/en/enterprise/servers/products/service_pack/hpsum/index.aspx. This is just the tool. Its home page is http://www8.hp.com/us/en/products/server-software/product-detail.html?oid=5182020. As of this post date the home page says the latest version is 7.3.0 but the download page only has 7.1.0. Not sure why.

I am on Windows so I downloaded the ISO and the ZIP file (which can be found later on in the page). The ISO file is bootable. You can add firmware and drivers to this and boot up. The ZIP file has the HP SUM tool for Windows and Linux and can be extracted to these OSes and run from there. It’s not meant for booting up and deploying.

From Windows computers you can run HP SUM and update Windows, Linux, VMware, HP-UX, iLO, Virtual Connect, etc. From Linux computers you can do all these except Windows.

Documentation can be found at http://h17007.www1.hp.com/us/en/enterprise/servers/solutions/info-library/index.aspx?cat=smartupdate&subcat=hp_sum.

An SPP (Service Pack for Proliant) is the SUM along with a set of firmware and drivers. As of a certain date. These have been tested to ensure they work well together.

HP SUM only works with VMware if you are using the HP customized version of VMware. These can be found at http://www8.hp.com/us/en/products/servers/solutions.html?compURI=1499005#tab=TAB4. If your installation of VMware is not an HP customized version then the inventory step will fail with an error that the username/ password is incorrect.

A baseline is a set of updates that you want all the nodes added into SUM to be at. If you run SUM from an SPP then the baseline that of the SPP – for example 2015.04 if you are running the 2015.04 SPP. SUM creates a baseline from the packages you add to it the first time it runs. In addition to a baseline you can also add extra components (I am not too sure about that, haven’t played with it).

So you create a baseline (or it happens implicitly). You add nodes and do an inventory of the nodes. That tells you what’s present on the system. Then in the next screen you review what needs to be done and deploy accordingly. On this scren you can choose whether reboots happen or should be delayed. You can also see which updates will cause a reboot. In some cases you can even downgrade via this screen.

Some of the components will appear as “Force” or “Disabled”. This means no update is required. If you click on the details link for these components you will usually see that the installed component is already at the version with SUM. If you want you can re-install/ overwrite some of these components. The ones you can overwrite are shown as “Force”; the ones you cannot are shown as “Disabled”. If you toggle “Force” it becomes “Forced”.

SUM can be run via GUI. In this case the GUI is actually run via a web server you have to point to. Or you can run via command-line. The latter gives you more fine control over the process I think.