Creative Commons Attribution 4.0 International License
© Rakhesh Sasidharan

DNS request timed out

Whenever I’d do an nslookup I noticed these timeout messages as below:

Everything seemed to be working but I was curious about the timeouts.

Then I realized the problem was that I had missed the root domain at the end of the name. You see, a name such as “www.msftncsi.com” isn’t an absolute name. For all the DNS resolver knows it could just be a hostname – like say “eu.mail” in a full name such as “eu.mail.somedomain.com”. To tell the resolver this is the full name one must terminate it with a dot, like so: “www.msftncsi.com.”. When we omit the dot, as is common practice, the DNS resolver usually puts it in implicitly and so we don’t notice.

The dot is what tells the resolver about the root of the domain name hierarchy. A name such as “rakhesh.com.” actually means the “rakhesh” domain in the “com” domain in the “.” domain. It is “.” that knows of all the sub-domains such as “com”, “io”, “net”, “pl”, etc.

In the case above since I had omitted the dot my resolver was trying to append my DNS search suffixes to the name “www.msftncsi.com” to come up with a complete name. I have search suffixes “rakhesh.local.” and “dyn.rakhesh.local.” (I didn’t put the dot while specifying the search suffixes but the resolver puts it in because I am telling it these are the absolute domain names) so the resolver was actually expanding “www.msftncsi.com” to “www.msftncsi.com.rakhesh.local.” and “www.msftncsi.com.dyn.rakhesh.local.” and searching for these. That fails and so I get these “DNS request timed out” messages.
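The expansion described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how the Windows resolver or nslookup is actually implemented; the function name `candidate_names` is made up, and the order (search suffixes first, then the bare name) is an assumption based on the behaviour described in this post.

```python
from __future__ import annotations


def candidate_names(name: str, search_suffixes: list[str]) -> list[str]:
    """Return the absolute names a resolver might try, in order (simplified model)."""
    if name.endswith("."):
        # A trailing dot marks the name as absolute, so nothing is appended.
        return [name]
    candidates = []
    for suffix in search_suffixes:
        # Search suffixes are treated as absolute domain names,
        # so normalise each one to end in exactly one dot.
        suffix = suffix.rstrip(".") + "."
        candidates.append(f"{name}.{suffix}")
    # Assumption: the name itself is tried last, as if it were absolute.
    candidates.append(name + ".")
    return candidates


if __name__ == "__main__":
    for candidate in candidate_names(
        "www.msftncsi.com", ["rakhesh.local", "dyn.rakhesh.local"]
    ):
        print(candidate)
```

With the trailing dot in place the list collapses to the single name that was typed, which is why the timeouts disappear.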

If I re-try with the proper name the query goes through fine:

Just to eliminate any questions of whether a larger timeout is the solution, no it doesn’t help:

If I replace my existing DNS suffixes search list with the dot domain, that too helps (not practical because then I can’t query by just the hostname, I will always have to put in the fully qualified name):

Or I could tell nslookup to not use the DNS suffix search lists at all (a more practical solution):

So there you go. That’s why I was getting those timeout errors.

Updating Windows DNS Server root hints

Somehow I came upon the root hints of my Windows DNS Server today and had a thought to update it. Never done that before so why not give it a shot?

You can find the root hints by right clicking on the server and going to the ‘Root Hints’ tab.

[Screenshot: root hints]

Or you could click the server name in DNS Manager and select ‘Root Hints’ in the right pane. Either way you get to the screen above. From here you can add/ remove/ edit root server names and IP addresses. If you want to update this list you can do so entry by entry, or click the ‘Copy from Server’ button to update the list with a new bunch of entries. Note that ‘Copy from Server’ does not over-write the list, so you are better off removing all the entries first and then doing ‘Copy from Server’.

The ‘Copy from Server’ option had me stumped though. You can find the root hints on the IANA website – there’s an FTP link to the file containing root hints, as well as an HTTP link (http://www.internic.net/domain/named.root). I thought simply entering this in the ‘Copy from Server’ window should suffice but it doesn’t. Notice the OK  button is grayed out.

[Screenshot: copy from server]

The window says it wants a server name or IP address so I removed everything above except the server name and then clicked OK. That looked like it was doing something but then failed with a message that it couldn’t get the root hints. The message said the specified DNS server could not be contacted so that gave me the idea it was looking for a DNS server which had the root hints.

[Screenshot: searching for root hints]

[Screenshot: search fails]

So I tried inputting the name of one of my DNS servers. This DNS server knows of the root servers because it has them already. (You can verify that a server knows of the root hints via nslookup as below).

My DNS server doesn’t have an authoritative answer (notice the output above) because it only has the info that’s present with it by default. The real answers could have changed by now (and they often do – the root hints list that these servers come with can have outdated entries) but that’s fine because it has some answers at least. If I were to input this server’s name or IP address to the ‘Copy from Server’ dialog above, the server being updated gets the root hints from this DNS server and updates itself.

Even better though would be to put the IP address of one of the root servers returned above. Like a.root-servers.net which has an IPv4 address of 198.41.0.4. (Don’t go by the output above, you can get the latest IP addresses from IANA). If I query that address for the root servers it has an authoritative answer:
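Whether an answer is authoritative is signalled by the AA (“Authoritative Answer”) bit in the 12-byte header of the DNS response, as laid out in RFC 1035. Here is a minimal Python sketch of reading that bit from a raw message. The helper name `is_authoritative` and the sample headers are made up for illustration; nothing here talks to a real server.

```python
import struct

# Position of the AA bit inside the 16-bit flags field:
# QR (0x8000), Opcode (0x7800), AA (0x0400), TC (0x0200), RD (0x0100), RA (0x0080), ...
AA_MASK = 0x0400


def is_authoritative(message: bytes) -> bool:
    """Return True if the AA bit is set in a raw DNS message."""
    if len(message) < 12:
        raise ValueError("a DNS message header is 12 bytes long")
    # The header starts with the 16-bit query ID followed by the 16-bit flags.
    _query_id, flags = struct.unpack("!HH", message[:4])
    return bool(flags & AA_MASK)


if __name__ == "__main__":
    # Hand-built headers (ID, flags, and four section counts); hypothetical values.
    from_root_server = struct.pack("!HHHHHH", 0x1234, 0x8580, 1, 13, 0, 0)
    from_cache = struct.pack("!HHHHHH", 0x1234, 0x8180, 1, 13, 0, 0)
    print(is_authoritative(from_root_server))
    print(is_authoritative(from_cache))
```

nslookup prints “Non-authoritative answer” when this bit is clear, which is the difference between the two queries in this post.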

So there you have it. I put in this IPv4 address into the ‘Copy from Server’ window and my server updated itself with the IP addresses. I noticed that it had missed some of the IPv6 addresses (not sure why, maybe coz it can’t validate these?) but when I did a ‘Copy from Server’ again without removing any existing entries and input in the same IPv4 address and did an update, this time it picked up all the addresses.

(Note to self: The %WINDIR%\System32\dns\cache.dns file seems to contain root hints. I replaced this file with the root hints from IANA but that did not update the server. When I updated the server as above and checked this file it hadn’t changed either. Restarting the DNS service didn’t update the file/ root hints either, so am not sure how this file comes into play).

 

Fixing “The DNS server was unable to open Active Directory” errors

For no apparent reason my home testlab went wonky today! Not entirely surprising. The DCs in there are not always on/ connected; and I keep hibernating the entire lab as it runs off my laptop so there’s bound to be errors lurking behind the scenes.

Anyways, after a reboot my main DC was acting weird. For one it took a long time to start up – indicating DNS issues, but that shouldn’t be the case as I had another DC/ DNS server running – and after boot up DNS refused to work. Gave the above error message. The Event Logs were filled with two errors:

  • Event ID 4000: The DNS server was unable to open Active Directory.  This DNS server is configured to obtain and use information from the directory for this zone and is unable to load the zone without it. Check that the Active Directory is functioning properly and reload the zone. The event data is the error code.
  • Event id 4007: The DNS server was unable to open zone <zone> in the Active Directory from the application directory partition <partition name>. This DNS server is configured to obtain and use information from the directory for this zone and is unable to load the zone without it. Check that the Active Directory is functioning properly and reload the zone. The event data is the error code.

A quick Google search brought up this Microsoft KB. Looks like the DC has either lost its secure channel with the PDC, or it holds all the FSMO roles and is pointing to itself as a DNS server. Either of these could be the culprit in my case as this DC indeed had all the FSMO roles (and hence was also the PDC), and so maybe it lost trust with itself? Pretty bad state to be in, having no trust in oneself … ;-)

The KB article is worth reading for possible resolutions. In my case since I suspected DNS issues in the first place, and the slow loading usually indicates the server is looking to itself for DNS, I checked that out and sure enough it was pointing to itself as the first nameserver. So I changed the order, gave the DC a reboot, and all was well!

In case the DC had lost trust with itself the solution (according to the KB article) was to reset the DC password. Not sure how that would reset trust, but apparently it does. This involves using the netdom command, which is installed on Server 2008 and up (it is also available on Windows 8, or wherever RSAT is installed, and can be downloaded for 2003 as part of the Support Tools package). The command has to be run on the computer whose password you want to reset (so you must login with an account whose credentials are cached, or use a local account). Then run the command thus:

Of course the computer must have access to the PDC. And if you are running it on a DC the KDC service must be stopped first.

I have used netdom in the past to reset my testlab computer passwords. A lot of the machines are usually offline for many days, and after a while AD changes the computer account password but the machine still has the old password. When I later boot up the machine it usually gives an error like this: “The trust relationship between this workstation and the primary domain failed.”

A common suggestion for such messages is to dis-join the machine from the domain and re-join it, effectively getting it a new password. That’s a PITA though – I just use netdom and reset the password as above. :)

 

DNS zone and domain

Once upon a time I used to play with DNS zones & domains for breakfast, but it’s been a while and I find myself to be a bit rusty.

Anyways, something I realized / remembered today – a DNS domain is not equal to a DNS zone. When creating a DNS domain under Windows, using the GUI, it is easy to equate the domain to the zone; but if you come from a *nix background then you know a zone is the zone file whereas domains are different from that.

For example here’s a domain called “domain.com” and its sub-domain “sub.domain.com”.

[Screenshot: domain]

You would think there wouldn’t be much difference between the two but the fact is that “domain.com” is also the zone here and “sub.domain.com” is a part of that zone. The domain “sub.domain.com” is not independent of the main domain “domain.com”. It can’t have its own name servers. And when it comes to zone transfers “sub.domain.com” follows whatever is set for “domain.com”. You can’t, for instance, have “domain.com” be denied zone transfers while allowing zone transfers for “sub.domain.com” – it’s simply not possible, and if you think about it that makes sense too because after all “sub.domain.com” doesn’t have its own name servers.

In this case the zone “domain.com” consists of both the domain “domain.com” and its sub-domain “sub.domain.com”.

In contrast below is an example where there are two zones, one for “domain.com” and another for “sub.domain.com”. Both domain and sub-domain have their own zones (and hence name servers) in this case.

[Screenshot: subdomain]

When creating a new domain / zone the GUI makes this clear too but it’s easy to miss the distinction.

[Screenshot: New domain]

[Screenshot: New Zone]

Stub zones

We use stub zones at work and initially I had a domain “sub.domain.com” which I wanted to create as a stub zone on another server. That failed with an error that the zone transfer failed.

[Screenshot: transfer]

Initially I took this to mean the stub zone was failing because the zone wasn’t getting transferred from the main server. That was correct – sort of. Because “sub.domain.com” isn’t a zone of its own, it doesn’t have any name servers. And the way stub zones work is that the stub server contacts the name servers of “sub.domain.com” to get a list of name servers for the stub zone but that fails because “sub.domain.com” doesn’t have any name servers! It is not a zone, and only zones have name servers, not (sub-)domains.

So the error message was misleading. Yes, the zone transfer failed, but not because of a problem with the transfer itself; it failed because there were no servers hosting a “sub.domain.com” zone. What I have to do is convert “sub.domain.com” to a zone of its own. (Create a zone called “sub.domain.com”, create new records in that zone, then delete the “sub.domain.com” domain).

Worth noting: Stub zones don’t need zone transfers allowed. Stub zones work via the stub server contacting the name servers of the stub zone and asking for a list of NS, A, and SOA records. These are available without any zone transfer required.

In our case we wanted to create a stub host record. We had an A record “host.sub.domain.com” and wanted to create a stub to that from another server. The solution is very simple – create a new zone called “host.sub.domain.com”, create a blank A record in that with the IP address you want (same IP that was in the “host.sub.domain.com” A record), then delete the previous “host.sub.domain.com” A record. 

Now create a stub zone for that record:

[Screenshot: stubzone]

And that’s it.

Just to recap: zones contain domains. A domain can be spread (as sub-domains) among multiple zones. For zone transfers and stub zones you need the domain in question to be in a zone of its own. 
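One way to picture the recap: given the zones a server hosts, a name belongs to the zone with the longest matching suffix. Below is a small Python sketch of that lookup. The function name `zone_for` is hypothetical, and this models only the idea, not how the Windows DNS server stores zones.

```python
from __future__ import annotations


def zone_for(name: str, zones: list[str]) -> str | None:
    """Return the hosted zone that contains `name`, or None if no zone does."""
    labels = name.rstrip(".").lower().split(".")
    hosted = {zone.rstrip(".").lower() for zone in zones}
    # Strip labels from the left one at a time, so the longest
    # (most specific) matching zone is found first.
    for start in range(len(labels)):
        candidate = ".".join(labels[start:])
        if candidate in hosted:
            return candidate
    return None


if __name__ == "__main__":
    # Only one zone: the sub-domain is just a part of "domain.com".
    print(zone_for("host.sub.domain.com", ["domain.com"]))
    # Two zones: the sub-domain now lives in a zone of its own.
    print(zone_for("host.sub.domain.com", ["domain.com", "sub.domain.com"]))
```

In the first case there is nothing called “sub.domain.com” to transfer or to stub; in the second there is.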

Notes on Dynamic DNS updates, DNS Scavenging, etc.

Dynamic DNS updates can be set to one of these (per zone):

  • None => No dynamic updates are allowed for the zone on this server
  • Secure => Only secure updates are allowed.
    • Note: This is only applicable to AD integrated zones.
    • By default only domain members (domain joined computers & domain users) are allowed to update the zone for secure updates. This is controlled by the ACLs on the zone (which can be viewed via the Security tab of the zone – check out the ACE for “Authenticated Users“). See this link for more
  • Nonsecure and secure => Both secure and nonsecure updates are allowed.

Scavenging

Dynamic DNS updates result in records being added and deleted to DNS. But while records are correctly added, it is not always the case that the record is also correctly removed. For instance, a client could have got an IP address from DHCP and dynamically registered its A record. Maybe the client then crashed so it never removed the dynamically registered record. The address, however, is removed from DHCP after the lease expires and could later be assigned to another client – who also dynamically registers itself in DNS – resulting in two A records, both to the same IP address, but one of them incorrect. To prevent such issues DNS scavenging is required. This removes stale DNS records after a pre-defined period. 

  • Scavenging is set at 3 places on a Windows server and all three must coincide for a record to be scavenged. These places are:
    • an individual record; 
    • the zone; and
    • the server performing the scavenging. 
  • The scavenging setting on an individual record can be viewed only after selecting View > Advanced in the DNS MMC and then viewing the properties of a record. 
    • When a dynamic DNS record is created it has a timestamp (rounded down to the nearest hour when the record was created).
      • When a record is first created it is considered an “Update”.
      • When an existing record is updated with the same IP address it is considered a “Refresh”.
      • When an existing record is updated with a new IP address it is considered an “Update”.
    • Every 24 hours Windows clients will attempt to dynamically update the DNS record. The attempt could be considered an update or a refresh depending on whether the IP changes as above.
    • If a record is enabled for scavenging its properties window will have a tick next to “Delete this record when it becomes stale”. 
      • Static records don’t have this ticked by default (because they are not meant to be scavenged). 
      • If this is manually ticked (for a static or dynamic record) then a timestamp will be set to when it was ticked (rounded down to the nearest hour). 
    • It is possible to set scavenging on a zone and all its records via the following command: dnscmd /ageallrecords
      • This is not recommended as it enables scavenging on all records – even static. Do not use this command on zones with static records. 
  • The scavenging setting for a zone can be viewed via the Aging button in the zone properties (by default the setting is off).
    • The aging values for a zone are replicated to all DNS servers hosting the zone.
    • Two intervals are in play here:
      • No-refresh interval: Once a record is refreshed, it is not refreshed again until this time period has passed. 
        • The purpose of this seems to be to reduce replication traffic. If a client refreshes its DNS record every 24 hours, those are ignored by the DNS server for the no-refresh interval, and not replicated to other DNS servers.  
      • Refresh interval: How much time to wait once a record is refreshed before it can be scavenged?
        • So this interval specifies how long the server should wait after a record has been refreshed before it can consider it a candidate for scavenging.
        • The default value is 7 days. This means, if a record is refreshed today, the server will wait for 7 days to see if it’s refreshed again. If it is not, the server considers this record ready for scavenging. 
    • Both intervals must be passed for a record to be expired. By default both are 7 days, so what this means is:
      • If a record is created/ updated/ refreshed today, for the next 7 days the record is considered current – irrespective of whether any refreshes happen or not (because remember: during the no-refresh interval refreshes, if they happen, are ignored so the server considers the record as current for this period). 
      • After those 7 days have passed, the server checks if there are any refreshes.
        • If there are, the timestamp is accordingly updated and it goes back into waiting the no-refresh interval again. 
        • If there are no refreshes, the server now waits 7 days of refresh interval to see if any refreshes happen. If they do, the record goes back into the no-refresh interval; if there aren’t any, the record is ready for scavenging. 
    • As an aside, the default lease duration for Windows server DHCP leases is 8 days. Which is why the no-refresh interval is set to 7 days by default. During these 7 days the address won’t be allocated to any other client, nor will it change with the client, so chances of a refresh are minimal. 
      • DHCP leases and Dynamic DNS updates can conflict if clients are responsible for updating DNS with their addresses (which is usually the default).
      • Say a client got an IP address from DHCP (leased to it for 8 days, remember). The client will update that in DNS. For the next 7 days any refreshes from the client are ignored (no-refresh interval, as expected). From the 7th day any refreshes/ updates will be considered.
      • Say our client went offline on the 3rd day. So on day 7 it doesn’t send a refresh – no problemo, DNS will not scavenge the record yet, it will simply wait for another 7 days.
      • On the 8th day, however, DHCP will release that IP address for others to use. Any new client that comes up will now get this address. This new client will send a Dynamic DNS update to the DNS server – creating a new A record to the same address, but with this new client’s name. Thus there are two DNS entries now to the same IP address!
      • Only after the refresh interval expires (7 days) can the old record be actually scavenged by the server (and even then there could be a delay based on the server setting – see below). 
      • For this reason it is recommended that the DHCP lease duration match the “no-refresh+refresh” interval of DNS scavenging. In the default case, either increase the DHCP lease to 14 days (7+7 days) or decrease the no-refresh and refresh intervals to 4 days (so the sum is 8 days, the DHCP lease).
        • Alternatively, allow the DHCP server to make updates on behalf of clients and disable (via GPO?) clients from registering updates with the DNS server.
          • Read this post on DHCP servers and Dynamic DNS updates.
          • Typical solution is to put all DHCP servers in a group called DnsUpdateProxy, but that’s not recommended – especially for DCs – because if a server is in this group the dynamic DNS records it creates have no security (so in the case of a DC this means the SRV records written by netlogon can be changed by anyone!)
          • It is better to create a low privilege AD user and get all DHCP servers to use that account to register records. This way the dynamic DNS records are secured to that user.
          • Also note: if dynamic DNS records are written by a server in the DnsUpdateProxy group – i.e. with no security – if any other machine (even one not in this group) changes this record (because the records are open to all) the ACLs of that record will be changed to only grant that machine permissions to the record. Thus the original DHCP server will lose rights to that record. DnsUpdateProxy is not a good idea. 
        • It is important that clients be disabled from registering dynamic DNS updates in this case. Else the ACLs on the DNS record created by the client will prevent updates/ deletions from the DHCP server to the DNS server for those records.  
    • When scavenging is enabled for a zone, the “Date and time” value after which the zone can be scavenged is set to the time the setting was enabled (rounded down to the nearest hour) plus the refresh interval period. 
    • It is also possible to right click a DNS server and set scavenging values for all zones on that server. These only apply to zones created after this setting was changed (unless the setting to modify existing zones is explicitly selected). 
  • The scavenging setting on the server can be enabled via the “Advanced” tab of the server properties in the DNS MMC (by default the setting is off).
    • When this setting is enabled, the scavenging period is set to a default of 7 days. The scavenging period defines how often the server will try to scavenge records. 
      • Does this mean every time the server starts scavenging all records are immediately deleted? No – because you have to also consider the no-refresh and refresh intervals of above. When a server runs its scavenging task, if a record to be scavenged has not crossed the refresh interval, it will not be removed. Similarly, if a record has crossed its refresh interval and is ready to be scavenged, if the server’s scavenging period isn’t due for a few more days nothing will happen. It’s only when the scavenging period is due that this record will be scavenged. 
    • When the server scavenges records it logs an event ID 2501 indicating how many records were scavenged. If no records were scavenged, an event ID 2502 will be logged. 
    • Note: You needn’t enable the scavenging setting on all servers hosting a zone. As long as any one server scavenges, the changes will propagate to others. In fact, it’s preferred to have only one server (or a set of servers) scavenge a zone as that will make it easier to troubleshoot. If all servers hosting a zone have scavenging on and the zone records are not being scavenged, we will have to check all these servers to see why scavenging isn’t happening. 
    • In practice, it is likely that all servers have scavenging turned on (because they are hosting multiple zones and could be responsible for scavenging one of those zones). But once a server has scavenging turned on it will scavenge any zone that has scavenging turned on. It is possible to restrict the servers that are allowed to scavenge a zone – even if the server and zone have scavenging turned on – via the dnscmd command. The syntax is as follows:

      The IP addresses are optional. If no address is given, all servers are allowed to scavenge it. Example:

      To see what servers have permissions to scavenge a zone the same command with a different switch can be used:

      Resetting this is simple – just don’t specify any IP addresses, that’s all:

The latter part of this blog post gives an example of how scavenging works with all the intervals above. In fact the whole blog post is worth a read.
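To make the interval arithmetic in the notes above concrete, here is a rough Python sketch. It models only the timestamps: the record’s timestamp is rounded down to the hour, refreshes are ignored during the no-refresh interval, and a record becomes a scavenging candidate once both intervals have passed. The function names are made up, and the server’s own scavenging period (which can delay the actual deletion further) is deliberately left out.

```python
from datetime import datetime, timedelta


def round_down_to_hour(ts: datetime) -> datetime:
    """Record timestamps are rounded down to the nearest hour."""
    return ts.replace(minute=0, second=0, microsecond=0)


def refresh_accepted(last_refresh: datetime, now: datetime,
                     no_refresh_days: int = 7) -> bool:
    """True once the no-refresh interval has passed and refreshes count again."""
    stamp = round_down_to_hour(last_refresh)
    return now >= stamp + timedelta(days=no_refresh_days)


def scavenge_eligible_at(last_refresh: datetime, no_refresh_days: int = 7,
                         refresh_days: int = 7) -> datetime:
    """Earliest time an un-refreshed record becomes a scavenging candidate."""
    stamp = round_down_to_hour(last_refresh)
    return stamp + timedelta(days=no_refresh_days + refresh_days)


def stale_window_days(lease_days: int = 8, no_refresh_days: int = 7,
                      refresh_days: int = 7) -> int:
    """Days between a DHCP lease expiring and the old record becoming a
    scavenging candidate (simplified: both clocks start at the same moment)."""
    return max(0, no_refresh_days + refresh_days - lease_days)


if __name__ == "__main__":
    registered = datetime(2014, 1, 1, 10, 42)
    print(scavenge_eligible_at(registered))   # defaults: 7 + 7 days
    print(stale_window_days())                # 8-day lease with default intervals
    print(stale_window_days(lease_days=14))   # lease matched to 7 + 7 days
```

With the defaults there is a window of several days in which the address can be handed to a new client while the old A record still exists, which is the reason for matching the lease duration to the sum of the two intervals.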

SOA records (and dynamic DNS in Windows)

I am on the DNS section of my notes from the AD WorkshopPLUS I attended a few months back. That’s why the recent posts are about DNS …

The SOA (Start of Authority) record is something DNS administrators are familiar with. It specifies details about the zone such as the serial number (which can be used by secondary name servers to know the zone has changed), the preferred refresh periods for secondary name servers to sync the zone, the time between retries, whom to contact, the primary name server, and so on. Here’s the SOA record for my rakhesh.com domain:

(In the example above the results also include all the name server records of the zone, but that needn’t be the case always).
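For reference, the data part of an SOA record carries seven fields in a fixed order (RFC 1035): primary name server, contact mailbox, serial, refresh, retry, expire, and minimum TTL. Below is a small Python sketch that splits such a record’s data into named fields. The sample values are invented and are not the real record for rakhesh.com; the helper names are hypothetical too.

```python
from dataclasses import dataclass


@dataclass
class SOA:
    primary_ns: str   # MNAME: the primary name server
    contact: str      # RNAME: mailbox of the responsible person, in dotted form
    serial: int       # bumped whenever the zone changes
    refresh: int      # seconds between secondaries checking for changes
    retry: int        # seconds to wait before retrying a failed refresh
    expire: int       # seconds after which secondaries stop answering
    minimum: int      # minimum TTL (used for negative caching)


def parse_soa_rdata(rdata: str) -> SOA:
    """Parse 'mname rname serial refresh retry expire minimum'."""
    fields = rdata.split()
    if len(fields) != 7:
        raise ValueError("expected 7 whitespace-separated SOA fields")
    numbers = [int(value) for value in fields[2:]]
    return SOA(fields[0], fields[1], *numbers)


def contact_email(rname: str) -> str:
    """Turn the dotted mailbox into an email address.

    Simplification: assumes the local part contains no escaped dots.
    """
    local, _, domain = rname.rstrip(".").partition(".")
    return f"{local}@{domain}"


if __name__ == "__main__":
    soa = parse_soa_rdata(
        "ns1.example.com. hostmaster.example.com. 2014010101 7200 900 1209600 3600"
    )
    print(soa.primary_ns, soa.serial, contact_email(soa.contact))
```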

In traditional zones you have one primary name server and many secondaries. So you can set one server as the primary in the record above. But what about AD-integrated zones? Since each DNS server is also a primary in that case, things are a bit different. 

What happens is that the primary name server is set to the name of whichever DNS server you ask. Thus, if you query WIN-DC01 for the SOA record to rakhesh.local, it will give itself as the primary, while if you query WIN-DC02 it will return itself as the primary. 

In Windows the name server returned by the SOA record is also used by clients for dynamic DNS updates. Clients query DNS for the SOA record. Whichever server they get a response from will return an SOA record containing itself as the primary name server. Clients then use that name server to dynamically register their A and PTR records. 

An exception to the above is Read-Only DCs (RODCs). These point to another server as the SOA for the zone. A new server is selected every 20 mins. When clients contact a RODC DNS server, they thus get another server as primary in the SOA record and send their dynamic updates to this other server. 

SRV records and AD

Example of an SRV record:

Format is:

Where:

  • class is always IN
  • TTL is in seconds 
  • service is the name of the service (example: LDAP, GC, DC, Kerberos). It is preceded by an underscore (prevents collisions with any existing names). 
  • protocol is the protocol over which the service is offered (example: TCP, UDP). It is preceded by an underscore (prevents collisions with any existing names). 
  • domain is the DNS name of the domain for which the service is offered. 
  • host is the machine that will provide this service. 
  • SRV is the text SRV. Indicates that this is an SRV record. 
  • priority is the priority of the host, similar to SMTP MX record priorities: the host with the lower number has higher preference. 
  • weight allows for load balancing between hosts of the same priority. Host with higher number has higher preference (intuitive: higher weight wins).

Similar to MX records, the host of an SRV record must be a name with an A or AAAA record. It cannot point to a CNAME record. 
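Here is a short Python sketch of pulling those fields out of a record and ordering hosts by preference. It assumes the standard presentation format from RFC 2782 (owner name, TTL, class, the text SRV, priority, weight, port, target host); the port field is part of that format even though it is not in the list above. The record values and function names are made up, and the ordering is a simplification: real clients choose randomly among hosts of the same priority in proportion to their weights rather than always taking the heaviest first.

```python
from __future__ import annotations

from typing import NamedTuple


class SrvRecord(NamedTuple):
    service: str
    protocol: str
    domain: str
    ttl: int
    priority: int
    weight: int
    port: int
    host: str


def parse_srv(line: str) -> SrvRecord:
    """Parse '_service._protocol.domain TTL IN SRV priority weight port host'."""
    owner, ttl, rr_class, rr_type, priority, weight, port, host = line.split()
    if rr_class.upper() != "IN" or rr_type.upper() != "SRV":
        raise ValueError("not an IN SRV record")
    service, protocol, domain = owner.split(".", 2)
    if not (service.startswith("_") and protocol.startswith("_")):
        raise ValueError("service and protocol must start with an underscore")
    return SrvRecord(service[1:], protocol[1:], domain, int(ttl),
                     int(priority), int(weight), int(port), host)


def preferred_order(records: list[SrvRecord]) -> list[SrvRecord]:
    """Lower priority first; within a priority, higher weight first (simplified)."""
    return sorted(records, key=lambda r: (r.priority, -r.weight))


if __name__ == "__main__":
    lines = [
        "_ldap._tcp.example.local. 600 IN SRV 10 50 389 dc1.example.local.",
        "_ldap._tcp.example.local. 600 IN SRV 0 10 389 dc2.example.local.",
        "_ldap._tcp.example.local. 600 IN SRV 0 90 389 dc3.example.local.",
    ]
    for record in preferred_order([parse_srv(line) for line in lines]):
        print(record.host, record.priority, record.weight)
```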

A machine starting up in a domain can easily query for DCs of that domain via DNS. DCs provide the LDAP service, so a query for _ldap._tcp.dnsdomain will return a list of DCs the machine can use. 

On DCs, the netlogon service registers many records relevant to the services offered by the DC. The A records of the DC are registered by the DNS client service (for Server 2008 and above) but the other records are taken care of by the netlogon service. A copy of the registered records is also stored in the %systemroot%\system32\config\netlogon.dns file in case it needs to be imported manually or compared for troubleshooting. Note that each DC only adds the records for itself. That is, WIN-DC01 for instance, will add a record like this:

It will not add records for WIN-DC02 and others as it does not know of them. The records added by each DC will be combined together in the DNS database as it replicates and gets records from all the DCs, so when a client queries for the above record, for instance, it will get records added by all DCs. 

AD creates a sub-domain called _msdcs.domainname to hold items pertaining to AD for that domain. (MSDCS = Microsoft Domain Controller Services) This sub-domain is created for each domain (including child-domains) of the forest. The _msdcs sub-domain belonging to the forest root domain is special in that it is a separate zone that is replicated to all DNS servers in the forest. The other _msdcs sub-domains are part of the parent zone itself. For instance below are the _msdcs sub-domains for a forest root domain (rakhesh.local) and the _msdcs sub-domain for a child domain (anoushka.rakhesh.local). Notice how the former is a separate zone with a delegation to it, while the latter is a part of the parent zone. 

[Screenshot: For forest-root domain]

[Screenshot: For child-domain]

Under the _msdcs sub-domain a convention such as the following is used:

Here DcType is one of dc (domain controller), gc (global catalog), pdc (primary domain controller), or domains (GUIDs of the domains).
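As an illustration, here is a tiny Python helper that builds a name under the _msdcs sub-domain. It assumes the convention has the commonly documented shape `_service._protocol.DcType._msdcs.domain`; the function name is made up, and the sketch does not model the extra GUID label that real names under “domains” carry.

```python
DC_TYPES = {"dc", "gc", "pdc", "domains"}


def msdcs_owner_name(service: str, protocol: str, dc_type: str,
                     dns_domain: str) -> str:
    """Build '_service._protocol.DcType._msdcs.domain.' (assumed convention)."""
    if dc_type not in DC_TYPES:
        raise ValueError(f"unknown DcType: {dc_type!r}")
    # Normalise the domain so the result ends in exactly one dot.
    domain = dns_domain.rstrip(".")
    return f"_{service}._{protocol}.{dc_type}._msdcs.{domain}."


if __name__ == "__main__":
    print(msdcs_owner_name("ldap", "tcp", "dc", "anoushka.rakhesh.local"))
    print(msdcs_owner_name("ldap", "tcp", "gc", "rakhesh.local"))
```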

The various SRV records registered by netlogon can be found at this link. Note that SRV records are created under both the domain/ child-domain and the forest root domain (the table in the link marks these accordingly). Below are some of the entries added by netlogon for DCs – WIN-DC04 and WIN-DC05 – in a site called KOTTAYAM for my domain anoushka.rakhesh.local. Of these WIN-DC05 is also a GC. 

Advertise that the DCs offer the LDAP service over TCP protocol for the anoushka.rakhesh.local domain:

Advertise that the DCs offer the LDAP service over TCP protocol for the anoushka.rakhesh.local domain for both sites of the forest (even though the DCs themselves are only in a single site – this way clients in any site can get these DC names when querying DNS SRV records):

WIN-DC05 has a few additional records compared to WIN-DC04 because it is a GC. 

Notice all of them are specific to its site and are created in the forest root domain zone/ _msdcs sub-domain of the forest root domain. This is because GCs are used forest-wide also. In contrast, similar records advertising the DC service are created for both sites and in the _msdcs sub-domain of the child zone:

To re-register SRV records, either restart the netlogon service of a DC or use nltest as below:

 

Switching to Route 53

Started using Amazon’s Route 53 (cool name!) today to serve DNS queries for this domain (rakhesh.com). Previously I was using DNS services from my excellent email provider FastMail but I thought I must try other options just to know what else is there. I was testing another domain of mine with Route 53 this past month, today I switched this main domain over. 

For my simple DNS requirements the two things that attracted me to Route 53 were the cost-effective pricing and the fact that it uses Anycast (see this CloudFlare post for an intro to Anycast).  It also has two other features that I am interested in exploring – Health Checks and the fact that it has an API. Need to explore these when I get some time. 

A cool thing about the AWS documentation is that it lets you download as a PDF or send to Kindle. Wish the TechNet/ MSDN documentation had a similar feature. Currently it’s a hit and miss when I send pages from TechNet/ MSDN to Kindle (via the Chrome extension). Sometimes bits and pieces are missed out, which I only realize later when reading the article. InstaPaper never manages to get any of the text, while Pocket is generally good at getting it (but I don’t use Pocket as it doesn’t have highlights, while both Kindle and InstaPaper have that). 

Active Directory: Troubleshooting with DcDiag (part 2)

Continuing from here

LocatorCheck

  • Checks whether DCs have certain required knowledge/ ability. Specifically, whether the DC that’s tested knows of, or can act as, each of the following:
    • The Global Catalog (GC)
    • The Primary Domain Controller (PDC)
    • Kerberos Key Distribution Centre (KDC)
    • Time Server
    • Preferred Time Server
  • By itself the test doesn’t output much info:

    To get more details one has to use the /v switch. Then output similar to the following will be returned:

    Note that the DC itself needn’t be offering one of these services. But it must know who else offers them and be able to refer. For instance, in the case of my domain WIN-DC03 (the server I am testing against) isn’t a GC or PDC so it returns WIN-DC01 as these. It is a time server, but is not a preferred time server (as that’s the forest root domain PDC), and the output reflects that.

Intersite

  • Checks for failures that could affect Intersite replication.
  • Warning: By default the test silently skips doing anything and simply returns a success! Note the output below:

    As you can see from the verbose output the test actually does nothing.

  • To make the test do something one must specify the /a or /e switches (all DCs in the site or all DCs in the enterprise, respectively).

    Now WIN-DC02 is flagged as having issues. The /e will throw even more light:

    (In this case the router between the two sites was shut down and so Intersite replication was failing. Hence the errors above.)

  • This test doesn’t seem to force an Intersite replication. It only connects to the servers and checks for errors, I think. For instance, I turned on the router above and verified the two DCs could see each other, forced an enterprise-wide replication (repadmin /syncall win-dc01 /e /A, which tells WIN-DC01 to ask all its partners to replicate, enterprise-wide, all NCs), and double checked the replication status (repadmin.exe /showrepl WIN-DC01). Everything was working fine, but the Intersite test still complained. Not the same errors as above, but different ones: the test passed but there were warnings that each site didn’t have a Bridgehead yet because of errors. After about 15 mins the errors cleared.
  • Intersite replication, Bridgeheads, and InterSite Topology Generators (ISTG) are part of later posts.

KccEvent

  • Checks whether the Knowledge Consistency Checker (KCC) has any errors. 
  • This test only checks the “Directory Services” event log of the specified server for any errors in the last 15 mins. (If you run the test with the /v switch it even says so). 

KnowsOfRoleHolders

  • Checks whether the DC knows of various Flexible Single Master Operations (FSMO) role holders in the domain. (FSMO is part of a later post so I won’t elaborate it here). 
  • By default the answer is just a pass or fail. 
  • Use with the /v switch to know what the DC thinks it knows: 
  • Good test to run after a role change to see whether all DCs in the domain/ enterprise know of the new role holder.

MachineAccount

  • Checks whether the DC’s machine account exists, is in the Domain Controllers OU, and Service Principal Names (SPNs) are correctly registered.
  • This is yet another test that only returns a pass or fail by default. Use with the /v switch to get a list of the registered SPNs.
  • Notice that the CheckSecurityError test also checks SPNs. CheckSecurityError is only run on demand, however.
  • Add the /RecreateMachineAccount switch to recreate the machine account if missing. Note: this does not recreate missing SPNs.
  • Add the /FixMachineAccount switch to fix if the machine account flags are incorrect (am not sure what flags these are …).
  • SPNs can be added/ modified/ deleted using the Setspn command.

NCSecDesc

  • Checks whether all the Naming Contexts on the DC have correct security permissions for replication.

NetLogons

  • Checks whether the Netlogon and SYSVOL shares are available and can be accessed.
  • I pointed out this test previously under the SysVolCheck test. The latter gives the impression it actually checks the SYSVOL shares, but it doesn’t. NetLogons is the one that checks.

ObjectsReplicated

  • Checks whether the DC’s machine account and DSA objects have replicated. The DC machine account object is CN=<DC name>,OU=Domain Controllers,... in the domain NC; the DSA object is CN=NTDS Settings,CN=<DC name>,CN=Servers,CN=<site name>,... in the configuration NC.
  • This test is better run with the /a or /e switches. Without these switches it only checks the DC you test against to see whether it has its own objects. With the switches it checks all the objects for all DCs in the site/ enterprise on all DCs in the site/ enterprise. Which is what you really want.
  • It is also possible to check a specific object via the /objectdn: or limit to DCs holding a specific NC via the /n: switch.

    For example:

    Check all DCs holding the default naming context (rakhesh.local) across all sites:

    Check all DCs holding a specified application NC across all sites:

    I had created the SomeApp2 NC previously. It is only replicated to the WIN-DC01 and WIN-DC03 servers so the test above will only check those servers. (To recap: you can find the DCs an NC is replicated to from the ms-DS-NC-Replica-Locations attribute of its object in the Partitions container). Note that I had to specify a server above. That’s because without specifying a DC name there’s no way to identify which DCs know of this NC (Note: “know of”; it’s not necessary that they hold the NC, they should only know where to point to). Unlike a domain NC, which has DNS entries to help identify the DCs holding it, other NCs have no such mechanism. Below is the error you get if you don’t specify a DC name as above:

    Lastly, it’s also possible to check for the replication status of a specific object. Very useful for testing purposes. Make a test object on one DC, force a replication, wait some time, then test whether that object has replicated to all DCs in your site/ enterprise. (Sure you could connect to each DC via ADUC or ADSIEdit, but this is way more convenient!)

    Below command checks whether the specified user account has replicated to all DCs in the domain:

    I specify a NC above (the /n switch) because I am running DCDiag from a client so I must specify either a server to use (the /s switch) or a NC based on which a DC can be found. If run from a DC then the NC can be omitted.
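
For reference, the two objects this test looks for have predictable DNs. Below is a minimal Python sketch that builds them from a DC name, site name, and domain/forest DN. The function names are hypothetical, and the layout is inferred from the DN patterns quoted in these posts; it does not talk to AD.

```python
def machine_account_dn(dc_name: str, domain_dn: str) -> str:
    """DN of the DC's machine account, which lives in the domain NC."""
    return f"CN={dc_name},OU=Domain Controllers,{domain_dn}"


def dsa_object_dn(dc_name: str, site_name: str, forest_root_dn: str) -> str:
    """DN of the DC's NTDS Settings (DSA) object, which lives in the configuration NC."""
    return (
        f"CN=NTDS Settings,CN={dc_name},CN=Servers,"
        f"CN={site_name},CN=Sites,CN=Configuration,{forest_root_dn}"
    )


if __name__ == "__main__":
    print(machine_account_dn("WIN-DC01", "DC=rakhesh,DC=local"))
    print(dsa_object_dn("WIN-DC01", "COCHIN", "DC=rakhesh,DC=local"))
```

Either DN could then be checked by hand (in ADSI Edit, say) on each DC when the test flags a problem.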

OutboundSecureChannels

  • Checks whether all DCs in the domain (by default only those in the current site) have a secure channel to DCs in the trusted domain specified by the /testdomain: switch.
  • There seems to be a misunderstanding that this test checks secure channels between DCs of the same domain. That’s not the case, it’s between DCs of two trusted domains.
  • Use the /nositerestriction switch to not limit the test to all DCs in the same site.
  • This test is not run by default. It must be explicitly specified.

RegisterInDNS

  • Checks whether the server being tested can register “A” DNS records. The DNS domain name must be specified via the /DnsDomain: switch.
  • This test is similar to the DcPromo test mentioned previously.
  • This test isn’t run by default.

Replications

  • Checks whether all of the DC’s replication partners are able to replicate to it. By default only those in the same site are tested.
  • It contacts each of the partners to get a status update from them. The test also checks whether there’s a replication latency of more than 12 hours.
  • Output from WIN-DC01 in my domain when I disconnected its partner WIN-DC03. WIN-DC02 is not checked as it’s in a different site.

RidManager

  • Checks whether the DC with the RID Master FSMO role is accessible and contains proper information. Use with the /v to get more details on the findings (allocation pool, next available, etc).
  • Example output:

Services

  • Checks whether various services required by AD are running on the DC.
  • Following services are tested:

    This list is similar (not same!) to the DC critical services list. Notably it doesn’t check if the “DNS Server” and “AD WS” services are running.

SystemLog

  • Checks the System Log for any errors in the last 60 mins (or less if the server uptime is less than 60 mins).

Topology

  • Checks whether the server has a fully connected topology for replication of each of its NCs.
  • Note that the test does not actually check if the servers in the topology are online/ connected. For that use the Replications and CutOffServers tests. This test only checks if the topology is logically fully connected.
  • This test is not run by default. It must be explicitly specified.

VerifyEnterpriseReferences

and

VerifyReferences

  • Checks whether system references required for the FRS and replication infrastructure are present on each DC. The “Enterprise” variant tests whether references for replication to all DCs in the enterprise are present.
  • Note: I am not very clear what this test does (but feel free to look at Ned’s blog post for more info) and I have been writing this post over many days so I am too lazy to research further either. :) I’ll update this post later if I find more info on the test.
  • This test is not run by default. It must be explicitly specified.

VerifyReplicas

  • Checks whether all the application NCs have replicated to the DCs that should contain a copy.
  • Seems to be similar to the CheckSDRefDom test but more concerned with whether the DCs host a copy or not.
  • This test is not run by default. It must be explicitly specified.

That’s all! Phew! :)

Active Directory: Troubleshooting with DcDiag (part 1)

This post originally began as notes on troubleshooting Domain Controller critical services. But when I started talking about DcDiag I went into a tangent explaining each of its tests. That took much longer than I expected – both in terms of effort and post length – so I decided to split it into a post of its own. My notes aren’t complete yet, what follows below is only the first part.

While writing this post I discovered a similar one from the Directory Services Team. It’s by NedPyle, who’s just great when it comes to writing cool posts that explain things, so you should definitely check it out.

DcDiag is your best friend when it comes to troubleshooting AD issues. It is a command-line tool that can identify issues with AD. By default DcDiag will run a series of “default” tests on the DC it is invoked on, but it can be asked to run more tests and also test multiple DCs in the site (the /a switch) or across all sites (the /e switch). A quick glance at the DcDiag output is usually enough to tell me where to look further.

For instance, while typing this post I ran DcDiag to check all DCs in one of my sites:

I ran the above from WIN-DC01 and you can see I was straight away alerted that WIN-DC03 could be having issues. I say “could be” because the errors only say that DcDiag cannot contact the RPC server on WIN-DC03 to check for those particular tests – this doesn’t necessarily mean WIN-DC03 is failing those tests, just that maybe there’s a firewall blocking communication or perhaps the RPC service is down. To confirm this I ran the same tests on WIN-DC03 and they succeeded, indicating that WIN-DC03 itself is fine and that there’s a communication problem between DcDiag on WIN-DC01 and WIN-DC03. Moreover, DcDiag from WIN-DC03 can query WIN-DC01, so the issue is on the WIN-DC03 side. (In this particular case it was the firewall on WIN-DC03).

Here’s a list of the tests DcDiag can perform:

Advertising

  • Checks whether the Directory System Agent (DSA) is advertising itself. The DSA is a set of services and processes running on every DC. The DSA is what allows clients to access the Active Directory data store. Clients talk to the DSA using LDAP (used by Windows XP and above), SAM (used by Windows NT), MAPI RPC (used by Exchange server and other MAPI clients), or RPC (used by DCs/DSAs to talk to each other and replicate AD information). More info on the DSA can be found in this Microsoft document.
  • You can think of the DSA as the kernel of the DC – the DSA is what lets a DC behave like a DC, the DSA is what we are really talking about when referring to DCs.
  • Although DNS is used by domain members (and other DCs) to locate DCs in the domain, for a DC to be actually used by others the DSA must be advertising the roles it provides. The nltest command can be used to view what roles a DSA is advertising. For example:

    Notice the flags section. Among other things the DSA advertises that this DC holds the PDC FSMO role (PDC), is a Global Catalog (GC), and that it is a reliable time source (GTIMESERV). Compare the above output with another DC:

    The PDC, GC, and GTIMESERV flags advertised by WIN-DC01 are missing here because this DC does not provide any of those roles. Being a DC it can act as a time source for domain members, hence the TIMESERV flag is present.

  • When DCs replicate they refer to each other via the DSA name rather than the DC name (further reinforcing my point from before that the DSA can be thought of as the kernel of the DC – it is what really matters).

    That is why even though a DC in my domain may have the DNS name WIN-DC01.rakhesh.local, in the additional structure that’s used by AD (which I’ll come to later) there’s an entry such as bdb02ab9-5103-4254-9403-a7687ba91488._msdcs.rakhesh.local which is a CNAME to the regular name. These CNAME entries are created by the Netlogon service and are of the format DsaGuid._msdcs.DNSForestName – the CNAME hostname is actually the GUID of the DSA.

  • If you open Active Directory Sites and Services, drill down to a site, then Servers, then expand a particular server – you’ll see the “NTDS Settings” object. This is the DSA. If you right click this object, go to Properties, and select the “Attribute Editor” tab, you will find an attribute called objectGUID. That is the GUID of the DSA – the same GUID that’s there in the CNAME entry.
    ntds-settings
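
The CNAME naming rule above (DsaGuid._msdcs.DNSForestName) is easy to reproduce. Here is a minimal Python sketch, assuming you have already copied the objectGUID of the NTDS Settings object as text; the function name dsa_cname is my own and nothing here queries DNS or AD.

```python
import uuid


def dsa_cname(dsa_guid: str, forest_dns_name: str) -> str:
    """Build the DsaGuid._msdcs.DNSForestName alias for a DSA."""
    # uuid.UUID() validates the GUID text; str() renders it lowercase with hyphens.
    guid = str(uuid.UUID(dsa_guid))
    # Drop any trailing root dot so the result has a single, predictable form.
    return f"{guid}._msdcs.{forest_dns_name.strip('.')}"


if __name__ == "__main__":
    print(dsa_cname("bdb02ab9-5103-4254-9403-a7687ba91488", "rakhesh.local"))
```

The resulting name can then be looked up as a CNAME to confirm it points at the DC’s regular name.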

CheckSDRefDom

Before talking about CheckSDRefDom it’s worth talking about directory partitions (also called Naming Contexts (NC)).

An AD domain is part of a forest. A forest can contain many domains. All these domains share the same schema and configuration, but different domain data. Every DC in the forest thus has some data that’s particular to the domain it belongs to and is replicated with other DCs in the domain; and other data that’s common to the forest and replicated with all DCs in the forest. These are what’s referred to as directory partitions / naming contexts.

Every DC can hold four types of directory partitions. These can be viewed using the ADSI Edit (adsiedit.msc) tool.

  • “Default naming context” (also known as “Domain”) which contains the domain specific data;
  • “Configuration” (CN=Configuration,DC=forestRootDomain) which contains the configuration objects for the entire forest;
  • “Schema” (CN=Schema,CN=Configuration,DC=forestRootDomain) which contains class and attribute definitions for all existing and possible objects of the forest. Even though the Schema partition hierarchically looks like it is under the Configuration partition, it is a separate partition; and
  • “Application” (DC=...,DC=forestRootDomain – there can be many such partitions) which was introduced in Server 2003. These are user/ application defined partitions that can contain any object except security principals. The replication of these partitions is not bound by domain boundaries – they can be replicated to selected DCs in the forest even if they are in different domains.
    • Common examples of Application partitions are DC=ForestDnsZones,DC=forestRootDomain and DC=DomainDnsZones,DC=domain which hold DNS zones replicated to all DNS servers in the forest and domain respectively (note that they are not replicated to all DCs in the forest and domain respectively, only a subset of the DCs – the ones that are also DNS servers).

 

If you open ADSI Edit and connect to the RootDSE “context”, then right click the RootDSE container and check its namingContexts attribute you’ll find a list of all directory partitions, including the multiple Application partitions.

rootDSE

Here you’ll also find other attributes such as:

  • defaultNamingContext (DN of the Domain directory partition the DC you are connected to is authoritative for),
  • configurationNamingContext (DN of the Configuration directory partition),
  • schemaNamingContext (DN of the Schema directory partition), and
  • rootDomainNamingContext (DN of the Domain directory partition for the Forest Root domain)

The Configuration partition has a container called Partitions (CN=Partitions,CN=Configuration,DC=forestRootDomain) which contains cross-references to every directory partition in the forest – i.e. Application, Schema, and Configuration directory partitions, as well as all Domain directory partitions. The beauty of cross-references is that they are present in the Configuration partition and hence replicated to all DCs in the forest. Thus even if a DC doesn’t hold a particular NC it can check these cross-references and identify which DC/ domain might hold more information. This makes it possible to refer clients asking for more info to other domains.
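
To make the DN layout above concrete, here is a small Python sketch that derives the Configuration, Schema, and Partitions container DNs from a forest root DNS name. It is only string manipulation under the naming patterns quoted above; the function names are hypothetical and it does not read RootDSE.

```python
def dns_to_dn(dns_name: str) -> str:
    """Turn a DNS name such as rakhesh.local into DC=rakhesh,DC=local."""
    return ",".join(f"DC={label}" for label in dns_name.strip(".").split("."))


def forest_partition_dns(forest_root_dns: str) -> dict:
    """Return the well-known forest-wide DNs for a given forest root domain."""
    root_dn = dns_to_dn(forest_root_dns)
    config_dn = f"CN=Configuration,{root_dn}"
    return {
        "rootDomain": root_dn,
        "configuration": config_dn,
        # Schema sits under Configuration in the hierarchy but is its own partition.
        "schema": f"CN=Schema,{config_dn}",
        # Container holding the crossRef objects for every partition in the forest.
        "partitions": f"CN=Partitions,{config_dn}",
    }


if __name__ == "__main__":
    for name, dn in forest_partition_dns("rakhesh.local").items():
        print(f"{name}: {dn}")
```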

The cross-references are actually objects of a class called crossRef.

  • What the CheckSDRefDom test does is that it checks whether the cross-references have an attribute called msDS-SDReferenceDomain set.
  • What does this mean?
    • An Application NC, by definition, isn’t tied to a particular domain. That makes it tricky from a security point of view because if its ACL has security descriptor referring to groups/ users that could belong to any domain (e.g. “Domain Admins”, “Administrator”) there’s no way to identify which domain must be used as the reference.
    • To avoid such situations, cross references to Application directory partitions contain an msDS-SDReferenceDomain attribute which specifies the reference domain.
  • So what the CheckSDRefDom test really does is that it verifies all the Application directory partitions have a reference domain set.
    • In case a reference domain isn’t set, you can always set it using ADSI Edit or other techniques. You can also delegate this.

CheckSecurityError

  • Checks for any security related errors on the DC that might be causing replication issues.
  • Some of the tests done are:
    1. Verify that KDC is working (not necessarily on the target DC, the test only checks that a KDC server is reachable anywhere in the domain, preferably in the same site; even if the target DC KDC service is down but some other KDC server is reachable the test passes)
    2. Verify that the DC’s computer object exists and is within the “Domain Controllers” OU and replicated to other DCs
    3. Check for any KDC packet fragmentation that could cause issues
    4. Check KDC time skew (remember I mentioned previously the 5 minute tolerance)
    5. Check Service Principal Name (SPN) registration (I’ll talk about SPNs in a later post; check this link for a quick look at what they are and the errors they can cause)
  • This test is not run by default. It must be explicitly specified.
  • Can specify an optional parameter /replsource:... to perform similar tests on that DC and also check the ability to create a replication link between that DC and the DC we are testing against.
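
The time skew check in item 4 boils down to comparing two clocks against a tolerance. A minimal Python sketch of that idea, assuming the 5 minute tolerance mentioned above (the function name and the sample timestamps are made up for illustration; this is not how DcDiag itself is implemented):

```python
from datetime import datetime, timedelta


def within_kerberos_skew(
    client_time: datetime,
    kdc_time: datetime,
    tolerance: timedelta = timedelta(minutes=5),
) -> bool:
    """Return True if the two clocks differ by no more than the tolerance."""
    # abs() makes the check symmetric: it doesn't matter which clock is ahead.
    return abs(client_time - kdc_time) <= tolerance


if __name__ == "__main__":
    kdc = datetime(2014, 1, 1, 12, 0, 0)
    print(within_kerberos_skew(datetime(2014, 1, 1, 12, 4, 0), kdc))
    print(within_kerberos_skew(datetime(2014, 1, 1, 11, 50, 0), kdc))
```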

Connectivity

  • This is the only DcDiag test that you cannot skip. It runs by default, and is also run even if you perform a specific test.
  • It tests whether the DSAs are registered in DNS, whether they are ping-able, and have LDAP/ RPC connectivity.

CrossRefValidation

Before talking about CrossRefValidation it’s worth revisiting cross-references and application NCs.

Application NCs are actually objects of a class domainDNS with an instanceType attribute value of 5 (DS_INSTANCETYPE_IS_NC_HEAD | DS_INSTANCETYPE_NC_IS_WRITEABLE).
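
The value 5 is simply the two flags OR-ed together. Here is a short Python sketch that decodes it, assuming the bit values 0x1 for DS_INSTANCETYPE_IS_NC_HEAD and 0x4 for DS_INSTANCETYPE_NC_IS_WRITEABLE (these are from my reading of the instanceType documentation, and only these two flags are modelled):

```python
from enum import IntFlag


class InstanceType(IntFlag):
    # Assumed bit values; other instanceType bits are deliberately left out.
    IS_NC_HEAD = 0x1        # object is the head of a naming context
    NC_IS_WRITEABLE = 0x4   # the NC is writeable on this DC


def decode_instance_type(value: int) -> list:
    """List the names of the modelled flags that are set in value."""
    return [flag.name for flag in InstanceType if value & flag]


if __name__ == "__main__":
    print(decode_instance_type(5))
```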

You can create an application NC, for instance, by opening up ADSI Edit and going to the Domain NC, right click, new object, of type domainDNS, enter a Domain Component (DC) value of your choice, click Next, then click “More Attributes”, select to view Mandatory/ Both type of properties, find instanceType from the property drop list, and enter a value of 5.

dnsDomain

dnsDomain2

The above can be done anywhere in the domain NC. It is also possible to nest application NCs within other application NCs.

Here’s what happens behind the scenes when you make an application NC as above:

  • The application NC isn’t created straight away.
  • First, the DSA will check the cross-references in CN=Partitions,CN=Configuration,DC=forestRootDomain to see if one already exists to an Application NC with the same name as you specified.
    • If a cross-reference is found and the NC it points to actually exists then an error will be thrown.
    • If a cross-reference is found but the NC it points to doesn’t exist, then that cross-reference will be used for the new Application NC.
    • If a cross-reference cannot be found, a new one will be created.
  • Cross references (objects of class crossRef) have some important attributes:
    1. CN – the CN of this cross-reference (could be a name such as “CN=SomeApp” or a random GUID “CN=a97a34e3-f751-489d-b1d7-1041366c2b32”)
    2. nCName – the DN of the application NC (e.g. DC=SomeApp,DC=rakhesh,DC=local)
    3. dnsRoot – the DNS domain name where servers that contain this NC can be found (e.g. SomeApp.rakhesh.local).

      (Note this as it’s quite brilliant!) When a new application NC is created, the DSA also creates a corresponding zone in DNS. This zone lists all the servers that carry the NC. In the screenshot below, for instance, note the zones DomainDnsZones, ForestDnsZones, and SomeApp2 (which belongs to an application NC I created). Note that by querying for all SRV records of name _ldap in _tcp.SomeApp2.rakhesh.local one can easily find the DCs carrying this partition:

      dnsRoot

      For the example above, dnsRoot would be “SomeApp2.rakhesh.local” as that’s the DNS domain name.

    4. msDS-NC-Replica-Locations – a list of Distinguished Names (DNs) of DSAs where this application NC is replicated to (e.g. CN=NTDS Settings,CN=WIN-DC01,CN=Servers,CN=COCHIN,CN=Sites,CN=Configuration,DC=rakhesh,DC=local, CN=NTDS Settings,CN=WIN-DC03,CN=Servers,CN=COCHIN,CN=Sites,CN=Configuration,DC=rakhesh,DC=local).

      replica-locations

      Initially this attribute has only one entry – the DC where the NC was first created. Other entries can be added later.
    5. Enabled – usually not set, but if it’s set to FALSE it indicates the cross-reference is not in use
  • Once a cross-reference is identified (an existing or a new one) the Configuration NC is replicated through the forest. Then the Application NC is actually created (an object of class domainDNS, as mentioned earlier, with an instanceType attribute value of 5 (DS_INSTANCETYPE_IS_NC_HEAD | DS_INSTANCETYPE_NC_IS_WRITEABLE)).
  • Lastly, all DCs that hold a copy of this Application NC have their ms-DS-Has-Master-NCs attribute in the DSA object modified to include a DN of this NC.

    masterNCs
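
The relationship between nCName, dnsRoot, and the SRV records described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the DN parsing is naive (it splits on commas and would not cope with escaped commas), and no DNS query is made.

```python
def ncname_to_dns_root(nc_dn: str) -> str:
    """Convert an NC DN such as DC=SomeApp,DC=rakhesh,DC=local to its DNS name."""
    labels = []
    for rdn in nc_dn.split(","):
        key, _, value = rdn.strip().partition("=")
        if key.upper() != "DC" or not value:
            raise ValueError(f"expected only DC= components, got {rdn!r}")
        labels.append(value)
    return ".".join(labels)


def ldap_srv_name(dns_root: str) -> str:
    """Name to query for SRV records listing the DCs that carry the NC."""
    return f"_ldap._tcp.{dns_root}"


if __name__ == "__main__":
    root = ncname_to_dns_root("DC=SomeApp2,DC=rakhesh,DC=local")
    print(root)
    print(ldap_srv_name(root))
```

Feeding the second name to a DNS lookup tool as an SRV query should return the DCs holding that partition, as described above.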

Back to the CrossRefValidation test, it validates the cross-references and the NCs they point to. For instance:

  • Ensure dnsRoot is valid (see here and here for some error messages)
  • Ensure nCName and other attributes are valid
  • Ensure the DN (and CN) are not mangled (in case of conflicts AD can “mangle” the names to reflect that there’s a conflict) (see here for an example of mangled entries)

CutoffServers

If you open AD Sites and Services, expand down to each site, the servers within them, and the NTDS Settings object under each server (which is basically the DSA), you can see the replication partners of each server. For instance here are the partners for two of my servers in one site:

partners1

partners2

The reason WIN-DC01 has links to both WIN-DC03 (in the same site as it) and WIN-DC02 (in a different site) while WIN-DC03 only has links to WIN-DC01 (and not WIN-DC02 which is in a different site) is because WIN-DC01 is acting as the bridgehead server. The bridgehead server is the server that’s automatically chosen by AD to replicate changes between sites. Each site has a bridgehead server and these servers talk to each other for replication across the site link. All other DCs in the site only get inter-site changes via the bridgehead server of that site. More on it later when I talk about bridgehead servers some other day … for now this is a good post to give an intro on bridgehead servers.

partners3

WIN-DC02, which is my DC in the other site, similarly has only one replication partner, WIN-DC01. So WIN-DC01 is kind of like the link between WIN-DC02 and WIN-DC03. If WIN-DC01 were to be offline then WIN-DC02 and WIN-DC03 would be cut off from each other (for a period, until the mechanism that creates the topology between sites kicks in and makes WIN-DC03 the bridgehead server between sites; or even forever if I “pin” WIN-DC01 as my preferred bridgehead server, in which case when it goes down no one else can take over). Or if the link that connects the two sites to each other were to fail, again they’d be cut off from each other.

  • So what the CutoffServers test does is that it tells you if any servers are cut-off from each other in the domain.
  • This test is not run by default. It must be explicitly specified.
  • This test is best run with the /e switch – which tells DcDiag to test all servers in the enterprise, across sites. In my experience if it’s run against a specific server it usually passes the test even if replication is down.
  • Also in my experience, if a server is up and running but only LDAP is down (maybe the AD DS service is stopped for instance) – and so it can’t replicate with partners and they are cut-off – the test doesn’t identify the servers as being cut-off. If the server/ link is down then the other servers are highlighted as cut-off.
  • For example I set WIN-DC01 as the preferred bridgehead in my scenario above. Then I disconnect it from the network, leaving WIN-DC02 and WIN-DC03 cut-off.

    If I test WIN-DC03 only it passes the test:

    That’s clearly misleading because replication isn’t happening:

    However if I run CutoffServers for the enterprise both WIN-DC02 and WIN-DC03 are correctly flagged:

    Not only is WIN-DC01 flagged in the Connectivity tests but the CutoffServers test also fails WIN-DC02 and WIN-DC03.

  • The /v switch (verbose) is also useful with this test. It will also show which NCs are failing due to the server being cut-off.
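
The cut-off scenario above can be modelled as a simple graph reachability problem. Below is a Python sketch of that idea using my three DCs. It is only a model for reasoning about the topology: links are treated as bidirectional for simplicity, the function names are hypothetical, and it says nothing about how DcDiag actually implements the test.

```python
from collections import deque


def reachable(links, start, offline=frozenset()):
    """Servers reachable from start over replication links, skipping offline ones."""
    if start in offline:
        return set()
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for a, b in links:
            # Treat each link as usable in both directions (a simplification).
            for src, dst in ((a, b), (b, a)):
                if src == node and dst not in seen and dst not in offline:
                    seen.add(dst)
                    queue.append(dst)
    return seen


def cut_off_servers(links, servers, offline=frozenset()):
    """Online servers that cannot reach every other online server."""
    online = {s for s in servers if s not in offline}
    return {s for s in online if reachable(links, s, offline) != online}


if __name__ == "__main__":
    servers = ["WIN-DC01", "WIN-DC02", "WIN-DC03"]
    links = [("WIN-DC01", "WIN-DC03"), ("WIN-DC01", "WIN-DC02")]
    print(sorted(cut_off_servers(links, servers, offline={"WIN-DC01"})))
```

With WIN-DC01 offline, both remaining DCs come back as cut off, which matches what the enterprise-wide run of the test reports.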

DcPromo

  • Checks whether, from a DNS point of view, the target server can be made a Domain Controller. If the test fails, suggestions are given.
  • The test has some mandatory switches:
    • /dnsdomain:...
    • /NewForest (a new forest) or /NewTree (a new domain in the forest you specify via /ForestRoot:...) or /ChildDomain (a new child domain) or /ReplicaDC (another DC in the same domain)
  • Needless to say this test isn’t run by default.

DNS

  • Checks the DNS health of the whole enterprise. It has many sub-tests. By default all sub-tests except one are run, but you can do specific sub-tests too.
  • This TechNet page is a comprehensive source of info on what the DNS test does. Tests include checking for zones, network connectivity, client configuration, delegations, dynamic updates, name resolution, and so on.
  • This test is not run by default.
  • Since it is an enterprise-wide test DcDiag requires Enterprise Admin credentials to run tests.

FrsEvent

  • Checks for any errors with the File Replication System (FRS).
  • It doesn’t seem to do an active test. It only checks the FRS Event Logs for any messages in the last 24 hours. If FRS is not used in the domain the test is silently skipped. (Specifying the /v switch will show that it’s being skipped).
  • Take the results with a pinch of salt. Chances are you had some errors but they are now resolved, but since the last 24 hours’ worth of logs are checked the test will flag previous error messages. Also, FRS may be used for non-SYSVOL replication and this might have errors, but that doesn’t really matter as far as the DCs are concerned.
  • There may also be spurious errors if a server’s Event Log is not accessible remotely and so the test fails.

DFSREvent

  • Checks for any errors with the Distributed File System Replication (DFSR).
  • Similar to the FrsEvent test. Same caveats apply.

SysVolCheck

  • Checks whether the SYSVOL share is ready.
  • In my experience this test doesn’t seem to actually check whether the SYSVOL share is accessible. For example, consider the following:

    Notice SYSVOL exists. Now I delete it.

    But SysVolCheck will happily clear the DC as passing the test:

    So take these test results with a pinch of salt!

  • As an aside, in the case above the Netlogons test will flag the share as missing:
  • There is a registry key HKLM\System\CurrentControlSet\Services\Netlogon\Parameters\SysvolReady which has a value of 1 when SYSVOL is ready and a value of 0 when SYSVOL is not ready. Even if I turn this value to 0 – thus disabling SYSVOL, the SYSVOL and NETLOGON shares stop being shared – the SysvolCheck test still passes. NetLogons flags an error though.

Rest of the tests will be covered in a later post. Stay tuned!

Active Directory: Domain Controller critical services

The first of my (hopefully!) many posts on Active Directory, based on the WorkshopPLUS sessions I attended last month. Progress is slow as I don’t have much time, plus I am going through the slides and my notes and adding more information from the Internet and such. 

This one’s on the services that are critical for Domain Controllers to function properly. 

DHCP Client

  • In Server 2003 and before the DHCP Client service registers A, AAAA, and PTR records for the DC with DNS
  • In Server 2008 and above this is done by the DNS Client
  • Note that only the A, AAAA, and PTR records are registered. Other records are registered by the Netlogon service.

File Replication Services (FRS)

  • Replicates SYSVOL amongst DCs.
  • Starting with Server 2008 it is now in maintenance mode. DFSR replaces it.
    • To check whether your domain is still using FRS for SYSVOL replication, open the DFS Management console and see whether the “Domain System Volume” entry is present under “Replication” (if it is not, see whether it is available for adding to the display). If it is present then your domain is using DFSR for SYSVOL replication.
    • Alternatively, type the following command on your DC. If the output says “Eliminated” as below, your domain is using DFSR for SYSVOL. (Note this only works with domain functional level 2008 and above).
  • Stopping FRS for long periods can result in Group Policy distribution errors as SYSVOL isn’t replicated. Event ID 13568 in FRS log.

Distributed File System Replication (DFSR)

  • Replicates SYSVOL amongst DCs. Replaced functionality previously provided by FRS. 
  • DFSR was introduced with Server 2003 R2.
  • If the domain was born at functional level 2008 – meaning all DCs are Server 2008 or higher – DFSR is used for SYSVOL replication.
    •  Once all pre-Server 2008 DCs are removed FRS can be migrated to DFSR. 
    • Apart from the dfsrmig command mentioned in the FRS section, the HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\DFSR\Parameters\SysVols\Migrating Sysvols\LocalState registry key can also be checked to see if DFSR is in use (a value of 3 means it is in use). 
  • If a DC is offline/ disconnected from its peers for a long time and Content Freshness Protection is turned on, when the DC is online/ reconnected DFSR might block SYSVOL replications to & from this DC – resulting in Group Policy distribution errors.
    • Content Freshness Protection is off by default. It needs to be manually turned on for each server.
    • Run the following command on each server to turn it on:

      Replace 60 with the maximum number of days it is acceptable for a DC or DFSR member to be offline. The recommended value is 60 days. And to turn off:

      To view the current setting:

    • Content Freshness Protection exists because of the way deletions work.
      • DFSR is multi-master, like AD, which means changes can be made on any server.
      • When you delete an item on one server, it can’t simply be deleted because then the item won’t exist any more and there’s no way for other servers to know if that’s the case because the item was deleted or because it wasn’t replicated to that server in the first place.
      • So what happens is that a deleted item is “tombstoned“. The item is removed from disk but a record for it remains the in DFSR database for 60 days (this period is called the “tombstone lifetime”) indicating this item as being deleted.
      • During these 60 days other DFSR servers can learn that the item is marked as deleted and thus act upon their copy of the item. After 60 days the record is removed from the database too.
      • In such a context, say we have a DC that is offline for more than 60 days, and say files were removed from SYSVOL (replicated via DFSR) on the other DCs in the meantime. The other DCs no longer have a copy of the file, nor a record that it was deleted, as the 60 days have passed and the file is gone for good.
      • When the previously offline DC replicates, it still has a copy of the file and will pass this on to the other DCs. The other DCs don’t remember that this file was deleted (they no longer have a record of its deletion, as the 60 days have passed) and so will happily replicate this file to their folders, resulting in a deleted file reappearing and causing corruption.
      • It is to avoid such situations that Content Freshness Protection was invented and is recommended to be turned on.
    • Here’s a good blog post from the Directory Services team explaining Content Freshness Protection.
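The tombstone reasoning above can be sketched as a toy model in Python. This is only an illustration of the logic, not how DFSR is implemented: the `Member` class, the `replicate` function, and the file name `policy.pol` are all made up for the example, and the 60-day lifetime is the figure quoted above.

```python
from dataclasses import dataclass, field

TOMBSTONE_LIFETIME_DAYS = 60  # tombstone lifetime quoted in the text


@dataclass
class Member:
    """A toy replication member: live files plus tombstones (name -> day deleted)."""
    files: set = field(default_factory=set)
    tombstones: dict = field(default_factory=dict)

    def delete(self, name, today):
        # Remove the file from disk but remember that it was deleted.
        self.files.discard(name)
        self.tombstones[name] = today

    def purge_tombstones(self, today):
        # Forget deletions that are older than the tombstone lifetime.
        self.tombstones = {
            name: day for name, day in self.tombstones.items()
            if today - day <= TOMBSTONE_LIFETIME_DAYS
        }


def replicate(source, target, today):
    """One-way sync from source to target, honouring target's tombstones."""
    target.purge_tombstones(today)
    for name in source.files:
        if name not in target.tombstones:
            target.files.add(name)  # target has no record of a deletion


def file_resurrected(days_offline):
    """Return True if a file deleted on day 0 reappears when a stale member returns."""
    healthy = Member(files={"policy.pol"})
    stale = Member(files={"policy.pol"})  # goes offline on day 0
    healthy.delete("policy.pol", today=0)
    replicate(stale, healthy, today=days_offline)
    return "policy.pol" in healthy.files
```

Calling `file_resurrected(30)` models a member that returns while the tombstone still exists, so the deletion wins. Calling `file_resurrected(90)` models a member that returns after the tombstone has been purged, which is the situation Content Freshness Protection is meant to block.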

DNS Client

  • On Server 2008 and above, this service registers the A, AAAA, and PTR records for the DC with DNS (notice that when you change the DC IP address you do not have to update DNS manually; the DNS Client service updates it automatically).
  • Note that only the A, AAAA, and PTR records are registered. Other records are registered by the Netlogon service.

DNS Server

  •  The glue for Active Directory. DNS is what domain controllers use to locate each other. DNS is what client computers use to find domain controllers. If this service is down both these functions fail.  

Kerberos Distribution Center (KDC)

  • Required for Kerberos 5.0 authentication. AD domains use Kerberos for authentication. If the KDC service is stopped Kerberos authentication fails. 
  • NTLM is not affected by this service. 

Netlogon

  • Maintains the secure channel between DCs and domain members (including other DCs). This secure channel is used for authentication (NTLM and Kerberos) and DC replication.
  • Writes the SRV and other records to DNS. These records are what domain members use to find DCs.
    • The records are also written to a file %systemroot%\system32\config\Netlogon.DNS. If the DNS server doesn’t support dynamic updates then the records in this text file must be manually created on the DNS server. 

Windows Time

  • Acts as an NTP client and server to keep time in sync across the domain. If this service is down and time is not in sync then Kerberos authentication and AD replication will fail (see resolution #5 in this link).
    • Kerberos authentication may not necessarily break for newer versions of Windows. But AD replication is still sensitive to time.  
  • The PDC of the forest root domain is the master time keeper of the forest. All other DCs in the forest will sync time from it.
    • The Windows Time service on every domain member looks to the DC that authenticates it for time updates.
    • DCs in the domain look to the domain PDC for time updates. 
    • Domain PDCs look to the PDC of the domain above/ sibling to them. The exception is the forest root domain PDC, which gets time from an external source (hardware source, Internet, etc.).
  • From this link: there are two registry values, HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\W32Time\Config\MaxPosPhaseCorrection and HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\W32Time\Config\MaxNegPhaseCorrection, that cap the time corrections the Windows Time service will accept, in seconds (the largest forward and the largest backward adjustment respectively). These can be set directly in the registry or via a GPO. The recommended value is 172800 (i.e. 48 hours).
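To make the two phase-correction limits concrete, here is a minimal Python sketch of the check they imply. The function name `accepts_correction` is made up for illustration, and treating the limit itself as still acceptable (an inclusive comparison) is my assumption rather than documented behaviour.

```python
RECOMMENDED_MAX_PHASE_CORRECTION = 172800  # seconds, i.e. 48 hours


def accepts_correction(offset_seconds,
                       max_pos=RECOMMENDED_MAX_PHASE_CORRECTION,
                       max_neg=RECOMMENDED_MAX_PHASE_CORRECTION):
    """Return True if a proposed clock change falls inside the allowed window.

    offset_seconds > 0 means the clock would move forward,
    offset_seconds < 0 means it would move backward.
    The inclusive boundary is an assumption made for this sketch.
    """
    if offset_seconds >= 0:
        return offset_seconds <= max_pos
    return -offset_seconds <= max_neg
```

With the recommended values, a source that is an hour off is accepted, while a source that is several days off in either direction is rejected rather than blindly followed.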

w32tm

The w32tm command can be used to manage time. For instance:

  • To get an idea of the time situation in the domain (who is the master time keeper, what is the offset of each of the DCs from this time keeper):
  • To ask the Windows Time service to resync as soon as possible (the command can target a remote computer too via the /computer: switch)

    • Same as above but before resyncing redetect any network configuration changes and rediscover the sources:
  • To get the status of the local computer (use the /computer: switch to target a different computer)
  • To show what time sources are being used:
  • To show who the peers are:
  • To show the current time zone:

    • You can’t change the time zone using this command; you have to do:

On the PDC in the forest root domain you would typically run a command like this if you want it to get time from an NTP pool on the Internet:

Here’s what the command does:

  • specify a list of peers to sync time from (in this example the NTP Pool servers on the Internet);
  • the /update switch tells w32tm to update the Windows Time service with this configuration change;
  • the /syncfromflags:MANUAL switch tells the Windows Time service that it must only sync from these sources (other options such as “DOMHIER” tell it to sync from the domain peers only, “NO” tells it to sync from none, “ALL” tells it to sync from both the domain peers and this manual list);
  • the /reliable:YES switch marks this machine as special in that it is a reliable source of time for the domain (read this link on what gets set when you set a machine as RELIABLE).

Note: You must manually configure the time source on the PDC in the forest root domain and mark it as reliable. If that server were to fail and you transfer the role to another DC, be sure to repeat the step. 

On other machines in the domain you would run a command like this:

This tells those DCs to follow the domain hierarchy (and only the domain hierarchy) and that they are not reliable time sources (this switch is not really needed if these other DCs are not PDCs).

Active Directory Domain Services (AD DS)

  • Provides the DC services. If this service is stopped the DC stops acting as a DC. 
  • Pre-Server 2008 this service could not be stopped while the OS was online. But since Server 2008 it can be stopped and started. 

Active Directory Web Services (AD WS)

  • Introduced in Windows Server 2008 R2 to provide a web service interface to Active Directory Domain Services (AD DS), Active Directory Lightweight Directory Services (AD LDS), and Active Directory Database Mounting Tool instances running on the DC.
    • The Active Directory Database Mounting Tool was new to me so here’s a link to what it does. It’s a pretty cool tool. Starting from Server 2008 you can take AD DS and AD LDS snapshots via the Volume Shadow Copy Service (VSS) (I am writing a post on VSS side by side so expect to see one soon). This makes use of the NTDS VSS writer which ensures that consistent snapshots of the AD databases can be taken. The AD snapshots can be taken manually via the ntdsutil snapshot command, via backup software, or even via images of the whole system. Either way, once you have such snapshots you can mount the one(s) you want via ntdsutil and point the Active Directory Database Mounting Tool to it. As the tool name says, it “mounts” the AD database in the snapshot and exposes it as an LDAP server. You can then use tools such as ldp.exe or AD Users and Computers to go through this instance of the AD database. More info on this tool can be found at this and this link.
  • AD WS is what the PowerShell Active Directory module connects to. 
  • It is also what the new Active Directory Administrative Center (which in turn uses PowerShell) connects to.
  • AD WS is installed automatically when the AD DS or AD LDS roles are installed. It is only activated once the server is promoted to a DC or if an AD LDS instance is created on it. 

DNS zone expired before it could be updated

Earlier this week Outlook for all our users stopped working. The Exchange server was fine but Outlook showed disconnected.

Checked the server. The name exchange.mydomain could not be resolved but the IP address itself was ping-able.

As a quick fix I wrote a PowerShell script that added a hosts file entry mapping exchange.mydomain to the IP address on all machines. This got all computers working while I investigated further. It is very easy to do this via PowerShell. All you need is the Remote Server Admin Tools, via which you install the Active Directory module for PowerShell. This gives you cmdlets such as Get-ADComputer through which you can get a list of all computers in the OU. Pipe this through a ForEach-Object and put in a Test-Connection to only target computers that are online. I was lazy and made a fresh hosts file on my computer with this mapping, and copied that to all the computers in our network. I could do this because I know all machines have the same hosts file, but it’s always possible to just insert the mapping into each computer’s file rather than copy a fresh one.
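For the "insert the mapping rather than copy a fresh file" variant, the core step is adding a line only when the name isn't already mapped. Here is a minimal Python sketch of just that text manipulation (my actual script was PowerShell and copied a whole file). The function name `add_hosts_entry` and the address 10.0.0.5 are made up for illustration.

```python
def add_hosts_entry(hosts_text, ip, name):
    """Return hosts_text with an 'ip name' mapping, added only if missing."""
    for line in hosts_text.splitlines():
        # Ignore comments, then split the line into address and host names.
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and name.lower() in (f.lower() for f in fields[1:]):
            return hosts_text  # name already mapped; leave the file alone
    if hosts_text and not hosts_text.endswith("\n"):
        hosts_text += "\n"
    return hosts_text + f"{ip}\t{name}\n"


# Example with a hypothetical address for the Exchange server.
original = "127.0.0.1 localhost\n"
updated = add_hosts_entry(original, "10.0.0.5", "exchange.mydomain")
```

Running the function a second time on `updated` returns it unchanged, so the same script can be pushed to machines repeatedly without piling up duplicate lines.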

Anyhow, after that I checked the mydomain DNS server and noticed that the zone had unloaded. This was a secondary zone that refreshed itself from two masters in our US office. Tried pinging the servers – both were unreachable. Oops! Then I remembered that our firewall does not permit ICMP packets to these servers. So I tried telnetting to port 53 of the servers. First one worked, second did not. Ah ha! So one of the servers is definitely down. Still – the zone should have refreshed from the first server, so why didn’t it?

Next I checked the event logs of my DNS server. Found an entry stating the zone name expired before either server could be contacted and so the zone is disabled. Interesting. So I right clicked the zone and did a manual reload and sure enough it worked (the first server is reachable after all).

It’s odd the zone failed in the first place though! I checked its settings and noticed the expiry period is set to one day. So it looks like when the expiry period came about both servers must have been unreachable. Contacted our US team and sure enough they had some maintenance work going on – so it’s likely both servers were unreachable from our offices. Cool!
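The expiry behaviour is easy to reason about with a small Python sketch: a secondary zone stays valid only while some refresh has succeeded within the expiry period. The function name and the hourly refresh schedule below are assumptions made for illustration; only the one-day expiry period comes from our zone’s settings.

```python
ONE_DAY = 24 * 60 * 60  # the one-day expiry period from the post, in seconds


def zone_expired(attempts, now, expire_interval=ONE_DAY, loaded_at=0):
    """Return True if no refresh has succeeded within expire_interval.

    attempts is a list of (timestamp_seconds, succeeded) tuples, one per
    refresh attempt against any master. loaded_at is when the zone was last
    known to be good before these attempts.
    """
    good = [t for t, ok in attempts if ok and t <= now]
    last_good = max(good, default=loaded_at)
    return now - last_good >= expire_interval


# Hypothetical hourly refresh attempts that all fail during maintenance.
all_failed = [(hour * 3600, False) for hour in range(1, 25)]
# Same schedule, but one master answered at the 12-hour mark.
one_success = [(hour * 3600, hour == 12) for hour in range(1, 25)]
```

Under these assumptions `zone_expired(all_failed, now=ONE_DAY)` is true while `zone_expired(one_success, now=ONE_DAY)` is false, which matches what I saw: a single reachable master at the right moment would have kept the zone loaded.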

NSLookup doesn’t query the alternate name servers

It’s obvious when I think about it, but easy to forget I guess. If you use nslookup (on Windows) to resolve a name, and if the first name server in your list of servers is down, nslookup doesn’t automatically query the next one in the list. Instead it just returns an error.

Notice how it says default server. That’s it. Just the first server in the list, and if that’s down then the rest aren’t queried. If you want to query the rest, you have to explicitly pass the server name to nslookup.

As a result of this, nslookup could report a name as non-resolvable while other commands such as ping work just fine, because they use the in-built resolver, which queries the other servers in the list if the first one is down.

Also, just coz it’s good to know: once the in-built resolver finds a DNS server as not responding, it doesn’t query it again for the next 15 mins. So if you have two DNS servers – ns1 and ns2 – and ns1 is currently down, the in-built resolver won’t waste time trying to query ns1, rather it will straight away go to ns2. Every 15 mins this is reset and so after that ns1 will be tried again.
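The difference between the two behaviours can be sketched in Python as a toy model. This is not how Windows implements either of them: the class and function names are made up, and `is_up` stands in for actually sending a query. The 15-minute window is the figure mentioned above.

```python
COOLDOWN_SECONDS = 15 * 60  # the 15-minute window described above


def nslookup_style(servers, is_up):
    """Query only the first (default) server; no fallback."""
    first = servers[0]
    return first if is_up(first) else None


class BuiltInResolver:
    """Tries servers in order and skips ones recently seen as unresponsive."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.down_since = {}  # server -> time it was last found unresponsive
        self.attempts = []    # servers actually contacted, for inspection

    def resolve(self, is_up, now):
        for server in self.servers:
            marked = self.down_since.get(server)
            if marked is not None and now - marked < COOLDOWN_SECONDS:
                continue  # still in the cool-down window, do not retry yet
            self.attempts.append(server)
            if is_up(server):
                self.down_since.pop(server, None)
                return server
            self.down_since[server] = now
        return None
```

With `servers = ["ns1", "ns2"]` and ns1 down, `nslookup_style` gives up immediately, whereas the resolver object answers from ns2, skips ns1 on lookups during the next 15 minutes, and tries ns1 again once that window has passed.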

Missing AD SRV records


In an Active Directory domain the domain controllers register their service records (SRV records) with DNS when they are promoted to become domain controllers. These SRV records are how other machines on the network figure out who the domain controllers for a domain are. They are used, for instance, when a new machine is to be joined to the domain, or when existing domain machines are starting up and need to get a list of the domain controllers.

These SRV records are published at standard locations and have standard names. In the screenshot, for example, the domain in question is contoso.local and you can see there exists a sub-domain called dc._msdcs.contoso.local which contains records showing that the machine dc1-2008.contoso.local provides the TCP based LDAP service for this domain (as in: it is the domain controller for this domain).
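Since the names are standard, they can be built mechanically from the domain name. Here is a small Python sketch; the helper name `dc_locator_srv_name` is made up for illustration, and it only constructs the name, it does not query DNS.

```python
def dc_locator_srv_name(domain, service="ldap", protocol="tcp"):
    """Build the fully qualified SRV owner name under dc._msdcs for a domain."""
    # Strip any trailing dot first so the result ends with exactly one.
    return f"_{service}._{protocol}.dc._msdcs.{domain.rstrip('.')}."


ldap_name = dc_locator_srv_name("contoso.local")
```

For contoso.local this gives `_ldap._tcp.dc._msdcs.contoso.local.`, which is the name to look up (as an SRV query) when checking whether a domain controller has registered itself.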


Turns out that when you promote a server to be a domain controller, if in the server’s network adapter settings you have set it to not register its address in DNS (the default is to register the address in DNS so chances are you changed it while fiddling around!) then the server’s SRV records are not created while it is being promoted to domain controller status. And if this server is the only domain controller in your domain, it pretty much means no one else can join the domain until this situation is fixed.

So what do you do to get the records back? First off, you go back to the network adapter settings and re-tick the box so that the server registers its address in DNS. And then you restart the NetLogon service. The NetLogon service is an important one for domain controllers, and amongst other things it ensures that the SRV records for a domain controller are registered in DNS. By default it registers the records every 24 hours, but obviously you are in a hurry in this case and so restarting the service ensures that the records are registered when the service starts up.

Hope this helps!