Active Directory: Troubleshooting with DcDiag (part 1)

This post originally began as notes on troubleshooting Domain Controller critical services. But when I started talking about DcDiag I went off on a tangent explaining each of its tests. That took much longer than I expected – both in terms of effort and post length – so I decided to split it into a post of its own. My notes aren’t complete yet; what follows is only the first part.

While writing this post I discovered a similar one from the Directory Services Team. It’s by Ned Pyle, who’s just great when it comes to writing cool posts that explain things, so you should definitely check it out.

DcDiag is your best friend when it comes to troubleshooting AD issues. It is a command-line tool that can identify issues with AD. By default DcDiag runs a series of “default” tests against the DC on which it is invoked, but it can be asked to run more tests and also to test all DCs in the site (the /a switch) or across all sites (the /e switch). A quick glance at the DcDiag output is usually enough to tell me where to look further.

For instance, while typing this post I ran DcDiag to check all DCs in one of my sites:

I ran the above from WIN-DC01 and you can see I was alerted straight away that WIN-DC03 could be having issues. I say “could be” because the errors only say that DcDiag cannot contact the RPC server on WIN-DC03 to run those particular tests – this doesn’t necessarily mean WIN-DC03 is failing those tests, just that maybe there’s a firewall blocking communication or perhaps the RPC service is down. To confirm this I ran the same tests on WIN-DC03 itself and they succeeded, indicating that WIN-DC03 is fine and that the problem is in the communication between DcDiag on WIN-DC01 and WIN-DC03. Moreover, DcDiag on WIN-DC03 can query WIN-DC01, so the issue is on WIN-DC03’s side. (In this particular case it was the firewall on WIN-DC03.)

Here’s a list of the tests DcDiag can perform:

Advertising

  • Checks whether the Directory System Agent (DSA) is advertising itself. The DSA is a set of services and processes running on every DC; it is what allows clients to access the Active Directory data store. Clients talk to the DSA using LDAP (used by Windows XP and above), SAM (used by Windows NT), MAPI RPC (used by Exchange server and other MAPI clients), or RPC (used by DCs/DSAs to talk to each other and replicate AD information). More info on the DSA can be found in this Microsoft document.
  • You can think of the DSA as the kernel of the DC – the DSA is what lets a DC behave like a DC, the DSA is what we are really talking about when referring to DCs.
  • Although DNS is used by domain members (and other DCs) to locate DCs in the domain, for a DC to be actually used by others the DSA must be advertising the roles it provides. The nltest command can be used to view what roles a DSA is advertising. For example:

    Notice the flags section. Among other things the DSA advertises that this DC holds the PDC FSMO role (PDC), is a Global Catalog (GC), and that it is a reliable time source (GTIMESERV). Compare the above output with another DC:

    The PDC, GC, and GTIMESERV flags advertised by WIN-DC01 are missing here because this DC does not provide any of those roles. Being a DC it can act as a time source for domain members, hence the TIMESERV flag is present.

  • When DCs replicate they refer to each other via the DSA name rather than the DC name (further reinforcing my point from before that the DSA can be thought of as the kernel of the DC – it is what really matters).

    That is why, even though a DC in my domain may have the DNS name WIN-DC01.rakhesh.local, in the additional DNS structure that’s used by AD (which I’ll come to later) there’s an entry such as bdb02ab9-5103-4254-9403-a7687ba91488._msdcs.rakhesh.local which is a CNAME to the regular name. These CNAME entries are created by the Netlogon service and are of the format DsaGuid._msdcs.DNSForestName – the CNAME hostname is actually the GUID of the DSA.

  • If you open Active Directory Sites and Services, drill down to a site, then Servers, then expand a particular server – you’ll see the “NTDS Settings” object. This is the DSA. If you right click this object, go to Properties, and select the “Attribute Editor” tab, you will find an attribute called objectGUID. That is the GUID of the DSA – the same GUID that’s there in the CNAME entry.
    [Screenshot: ntds-settings]
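To make the two ideas above concrete, here is a minimal Python sketch. It assumes the advertised flags arrive as a whitespace-separated line of names (an assumption about the output format; the sample line below is typed by hand, not captured from a DC), and it builds the CNAME using the DsaGuid._msdcs.DNSForestName format described above. The function names are made up for illustration.

```python
def dsa_cname(dsa_guid: str, forest_dns_name: str) -> str:
    """Build the DsaGuid._msdcs.DnsForestName alias for a DSA."""
    return f"{dsa_guid.lower()}._msdcs.{forest_dns_name.rstrip('.')}"


def advertised_roles(flags_line: str) -> dict:
    """Report which of the roles discussed above appear in a flags line.

    Assumes the flags are whitespace-separated tokens, optionally
    preceded by a "Flags:" label.
    """
    flags = set(flags_line.replace("Flags:", "").split())
    return {
        "pdc": "PDC" in flags,
        "global_catalog": "GC" in flags,
        "time_source": "TIMESERV" in flags,
        "reliable_time_source": "GTIMESERV" in flags,
    }


# GUID and forest name are the ones used in this post.
print(dsa_cname("bdb02ab9-5103-4254-9403-a7687ba91488", "rakhesh.local"))

# Hand-typed, illustrative flags line.
print(advertised_roles("Flags: PDC GC TIMESERV GTIMESERV"))
```

Because the flags are compared as whole tokens, a DC that only advertises TIMESERV is not mistaken for a reliable time source (GTIMESERV).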

CheckSDRefDom

Before talking about CheckSDRefDom it’s worth talking about directory partitions (also called Naming Contexts (NC)).

An AD domain is part of a forest. A forest can contain many domains. All these domains share the same schema and configuration, but each has its own domain data. Every DC in the forest thus has some data that’s particular to the domain it belongs to and is replicated with other DCs in the domain, and other data that’s common to the forest and replicated with all DCs in the forest. These are what’s referred to as directory partitions / naming contexts.

There are four types of directory partition. These can be viewed using the ADSI Edit (adsiedit.msc) tool.

  • “Default naming context” (also known as “Domain”) which contains the domain specific data;
  • “Configuration” (CN=Configuration,DC=forestRootDomain) which contains the configuration objects for the entire forest; and
  • “Schema” (CN=Schema,CN=Configuration,DC=forestRootDomain) which contains class and attribute definitions for all existing and possible objects of the forest. Even though the Schema partition hierarchically looks like it is under the Configuration partition, it is a separate partition.
  • “Application” (DC=...,DC=forestRootDomain – there can be many such partitions), introduced in Server 2003. These are user/application-defined partitions that can contain any object except security principals. The replication of these partitions is not bound by domain boundaries – they can be replicated to selected DCs in the forest even if those DCs are in different domains.
    • Common examples of Application partitions are DC=ForestDnsZones,DC=forestRootDomain and DC=DomainDnsZones,DC=forestRootDomain, which hold DNS zones replicated to all DNS servers in the forest and domain respectively (note that they are not replicated to all DCs in the forest and domain respectively, only to a subset of the DCs – the ones that are also DNS servers).

 

If you open ADSI Edit and connect to the RootDSE “context”, then right click the RootDSE container and check its namingContexts attribute you’ll find a list of all directory partitions, including the multiple Application partitions.

[Screenshot: rootDSE]

Here you’ll also find other attributes such as:

  • defaultNamingContext (DN of the Domain directory partition the DC you are connected to is authoritative for),
  • configurationNamingContext (DN of the Configuration directory partition),
  • schemaNamingContext (DN of the Schema directory partition), and
  • rootDomainNamingContext (DN of the Domain directory partition for the Forest Root domain)
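As a rough illustration of how these RootDSE attributes relate to each other, the Python sketch below labels each entry of namingContexts by comparing it against the three well-known attributes and treats everything else as an Application partition. That last step is a simplifying assumption, the sample values are hypothetical (modelled on the rakhesh.local domain used in this post), and the DN parsing is naive (it does not handle escaped commas).

```python
def dn_to_dns_name(dn: str) -> str:
    """Join the DC= components of a DN into a DNS name."""
    parts = [rdn.strip() for rdn in dn.split(",")]
    labels = [p[3:] for p in parts if p.upper().startswith("DC=")]
    return ".".join(labels)


def classify_partitions(root_dse: dict) -> dict:
    """Map each naming context DN to a partition type."""
    known = {
        root_dse["defaultNamingContext"]: "Domain",
        root_dse["configurationNamingContext"]: "Configuration",
        root_dse["schemaNamingContext"]: "Schema",
    }
    # Anything not matching a well-known attribute is assumed to be
    # an Application partition.
    return {nc: known.get(nc, "Application") for nc in root_dse["namingContexts"]}


# Hypothetical RootDSE values for a single-domain forest.
sample_root_dse = {
    "defaultNamingContext": "DC=rakhesh,DC=local",
    "configurationNamingContext": "CN=Configuration,DC=rakhesh,DC=local",
    "schemaNamingContext": "CN=Schema,CN=Configuration,DC=rakhesh,DC=local",
    "namingContexts": [
        "DC=rakhesh,DC=local",
        "CN=Configuration,DC=rakhesh,DC=local",
        "CN=Schema,CN=Configuration,DC=rakhesh,DC=local",
        "DC=DomainDnsZones,DC=rakhesh,DC=local",
        "DC=ForestDnsZones,DC=rakhesh,DC=local",
    ],
}

for dn, kind in classify_partitions(sample_root_dse).items():
    print(f"{kind:13} {dn}")
```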

The Configuration partition has a container called Partitions (CN=Partitions,CN=Configuration,DC=forestRootDomain) which contains cross-references to every directory partition in the forest – i.e. Application, Schema, and Configuration directory partitions, as well as all Domain directory partitions. The beauty of cross-references is that they are present in the Configuration partition and hence replicated to all DCs in the forest. Thus even if a DC doesn’t hold a particular NC it can check these cross-references and identify which DC/ domain might hold more information. This makes it possible to refer clients asking for more info to other domains.

The cross-references are actually objects of a class called crossRef.

  • The CheckSDRefDom test checks whether the cross-references have an attribute called msDS-SDReferenceDomain set.
  • What does this mean?
    • An Application NC, by definition, isn’t tied to a particular domain. That makes it tricky from a security point of view: if its security descriptor refers to groups/users that could belong to any domain (e.g. “Domain Admins”, “Administrator”), there’s no way to identify which domain must be used as the reference.
    • To avoid such situations, cross-references to Application directory partitions contain an msDS-SDReferenceDomain attribute which specifies the reference domain.
  • So what the CheckSDRefDom test really does is verify that all the Application directory partitions have a reference domain set.
    • In case a reference domain isn’t set, you can always set it using ADSI Edit or other techniques. You can also delegate this.
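The check described above can be modelled in a few lines of Python. This is only a sketch of the idea, not how DcDiag implements it: the crossRef objects are plain dictionaries, and the "is_application" key is a hypothetical marker I made up to say which cross-references point to Application NCs.

```python
def missing_reference_domain(cross_refs):
    """Return the nCName of every Application cross-reference that has
    no msDS-SDReferenceDomain value."""
    return [
        cr["nCName"]
        for cr in cross_refs
        if cr.get("is_application") and not cr.get("msDS-SDReferenceDomain")
    ]


# Hypothetical cross-references; only the second one lacks a reference domain.
sample_cross_refs = [
    {
        "nCName": "DC=ForestDnsZones,DC=rakhesh,DC=local",
        "is_application": True,
        "msDS-SDReferenceDomain": "DC=rakhesh,DC=local",
    },
    {
        "nCName": "DC=SomeApp,DC=rakhesh,DC=local",
        "is_application": True,
    },
    {
        "nCName": "DC=rakhesh,DC=local",
        "is_application": False,
    },
]

print(missing_reference_domain(sample_cross_refs))
```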

CheckSecurityError

  • Checks for any security related errors on the DC that might be causing replication issues.
  • Some of the tests done are:
    1. Verify that KDC is working (not necessarily on the target DC, the test only checks that a KDC server is reachable anywhere in the domain, preferably in the same site; even if the target DC KDC service is down but some other KDC server is reachable the test passes)
    2. Verify that the DC’s computer object exists, is within the “Domain Controllers” OU, and is replicated to other DCs
    3. Check for any KDC packet fragmentation that could cause issues
    4. Check KDC time skew (remember the 5 minute tolerance I mentioned previously)
    5. Check Service Principal Name (SPN) registration (I’ll talk about SPNs in a later post; check this link for a quick look at what they are and the errors they can cause)
  • This test is not run by default. It must be explicitly specified.
  • You can specify an optional parameter /replsource:... to perform similar tests on that DC and also check the ability to create a replication link between that DC and the DC being tested.
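The time skew check in particular is easy to picture. Here is a minimal Python sketch that compares two clock readings against the 5 minute tolerance mentioned above. The tolerance is hard-coded as a default here, the timestamps are invented for illustration, and this is not a description of how DcDiag itself measures skew.

```python
from datetime import datetime, timedelta

# The 5 minute tolerance mentioned above, used as the default.
MAX_SKEW = timedelta(minutes=5)


def within_skew(local: datetime, remote: datetime,
                tolerance: timedelta = MAX_SKEW) -> bool:
    """Return True if the two clocks differ by no more than the tolerance."""
    return abs(local - remote) <= tolerance


# Invented clock readings from two DCs.
dc01_clock = datetime(2013, 10, 1, 12, 0, 0)
dc03_clock = datetime(2013, 10, 1, 12, 7, 30)

print(within_skew(dc01_clock, dc03_clock))
```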

Connectivity

  • This is the only DcDiag test that you cannot skip. It runs by default, and is also run even if you perform a specific test.
  • It tests whether the DSAs are registered in DNS, whether they are ping-able, and have LDAP/ RPC connectivity.

CrossRefValidation

Before talking about CrossRefValidation it’s worth revisiting cross-references and application NCs.

Application NCs are actually objects of a class domainDNS with an instanceType attribute value of 5 (DS_INSTANCETYPE_IS_NC_HEAD | DS_INSTANCETYPE_NC_IS_WRITEABLE).
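The value 5 is simply the two flags OR-ed together. A short Python sketch, using the flag values 0x1 (NC head) and 0x4 (writeable), shows the arithmetic; the class and function names are my own.

```python
from enum import IntFlag


class InstanceType(IntFlag):
    """The two instanceType bits discussed above."""
    IS_NC_HEAD = 0x1       # DS_INSTANCETYPE_IS_NC_HEAD
    NC_IS_WRITEABLE = 0x4  # DS_INSTANCETYPE_NC_IS_WRITEABLE


WRITEABLE_NC_HEAD = InstanceType.IS_NC_HEAD | InstanceType.NC_IS_WRITEABLE


def is_writeable_nc_head(instance_type: int) -> bool:
    """True if both the NC-head and writeable bits are set."""
    mask = int(WRITEABLE_NC_HEAD)
    return (instance_type & mask) == mask


print(int(WRITEABLE_NC_HEAD))
print(is_writeable_nc_head(5))
```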

You can create an application NC, for instance, by opening ADSI Edit, going to the Domain NC, right-clicking and choosing to create a new object of type domainDNS, entering whatever Domain Component (DC) value you want, clicking Next, then clicking “More Attributes”, choosing to view Mandatory/Both types of properties, finding instanceType in the property drop-down list, and entering a value of 5.
[Screenshot: dnsDomain]
[Screenshot: dnsDomain2]

The above can be done anywhere in the domain NC. It is also possible to nest application NCs within other application NCs.

Here’s what happens behind the scenes when you make an application NC as above:

  • The application NC isn’t created straight away.
  • First, the DSA will check the cross-references in CN=Partitions,CN=Configuration,DC=forestRootDomain to see whether one already exists for an Application NC with the same name as you specified.
    • If a cross-reference is found and the NC it points to actually exists then an error will be thrown.
    • If a cross-reference is found but the NC it points to doesn’t exist, then that cross-reference will be used for the new Application NC.
    • If a cross-reference cannot be found, a new one will be created.
  • Cross references (objects of class crossRef) have some important attributes:
    1. CN – the CN of this cross-reference (could be a name such as “CN=SomeApp” or a random GUID “CN=a97a34e3-f751-489d-b1d7-1041366c2b32”)
    2. nCName – the DN of the application NC (e.g. DC=SomeApp,DC=rakhesh,DC=local)
    3. dnsRoot – the DNS domain name where servers that contain this NC can be found (e.g. SomeApp.rakhesh.local).

      (Note this as it’s quite brilliant!) When a new application NC is created, the DSA also creates a corresponding zone in DNS. This zone contains all the servers that carry the NC. In the screenshot below, for instance, note the zones DomainDnsZones, ForestDnsZones, and SomeApp2 (which belongs to an application NC I created). Note that by querying for all SRV records of name _ldap in _tcp.SomeApp2.rakhesh.local one can easily find the DCs carrying this partition.

      [Screenshot: dnsRoot]

      For the example above, dnsRoot would be “SomeApp2.rakhesh.local” as that’s the DNS domain name.

    4. msDS-NC-Replica-Locations – a list of Distinguished Names (DNs) of the DSAs to which this application NC is replicated (e.g. CN=NTDS Settings,CN=WIN-DC01,CN=Servers,CN=COCHIN,CN=Sites,CN=Configuration,DC=rakhesh,DC=local, CN=NTDS Settings,CN=WIN-DC03,CN=Servers,CN=COCHIN,CN=Sites,CN=Configuration,DC=rakhesh,DC=local). Initially this attribute has only one entry – the DC where the NC was first created. Other entries can be added later.

       [Screenshot: replica-locations]
    5. Enabled – usually not set, but if it’s set to FALSE it indicates the cross-reference is not in use
  • Once a cross-reference is identified (an existing one or a new one) the Configuration NC is replicated through the forest. Then the Application NC is actually created (an object of class domainDNS, as mentioned earlier, with an instanceType attribute value of 5, i.e. DS_INSTANCETYPE_IS_NC_HEAD | DS_INSTANCETYPE_NC_IS_WRITEABLE).
  • Lastly, all DCs that hold a copy of this Application NC have the ms-DS-Has-Master-NCs attribute of their DSA object modified to include the DN of this NC.
    [Screenshot: masterNCs]
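The cross-reference lookup described in the steps above boils down to a three-way decision. The Python sketch below models just that decision with plain dictionaries and sets (hypothetical stand-ins for the Partitions container and the NCs that really exist), and also builds the SRV query name mentioned under dnsRoot. It is an illustration of the logic, not the DSA’s actual implementation.

```python
def plan_cross_ref(nc_dn, cross_refs, existing_ncs):
    """Decide what happens to the cross-reference for a new application NC.

    cross_refs   -- dict mapping nCName (DN of the NC) to the crossRef CN
    existing_ncs -- set of DNs of NCs that actually exist
    Returns ("reuse", cn) or ("create", None); raises ValueError if the
    cross-reference exists and so does the NC it points to.
    """
    # DNs are compared case-insensitively.
    key = nc_dn.lower()
    refs = {dn.lower(): cn for dn, cn in cross_refs.items()}
    ncs = {dn.lower() for dn in existing_ncs}

    if key in refs:
        if key in ncs:
            raise ValueError(f"application NC {nc_dn} already exists")
        return ("reuse", refs[key])
    return ("create", None)


def srv_query_name(dns_root: str) -> str:
    """Name to query for SRV records listing the DCs that carry the NC."""
    return f"_ldap._tcp.{dns_root}"


# Hypothetical state: a cross-reference exists but its NC does not.
refs = {"DC=SomeApp,DC=rakhesh,DC=local": "CN=SomeApp"}
print(plan_cross_ref("DC=SomeApp,DC=rakhesh,DC=local", refs, set()))
print(srv_query_name("SomeApp2.rakhesh.local"))
```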

Back to the CrossRefValidation test: it validates the cross-references and the NCs they point to. For instance:

  • Ensure dnsRoot is valid (see here and here for some error messages)
  • Ensure nCName and other attributes are valid
  • Ensure the DN (and CN) are not mangled (in case of conflicts AD can “mangle” the names to reflect that there’s a conflict) (see here for an example of mangled entries)
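To give a feel for what a mangled name looks like, here is a small Python sketch that flags DNs containing a conflict or deletion marker. It assumes the mangled RDN shows up in its escaped string form as \0ACNF:<GUID> (conflict) or \0ADEL:<GUID> (deleted object); that format is my assumption about how the DN is rendered, and the sample DNs are invented.

```python
import re

# Escaped line feed (\0A), then CNF or DEL, a colon, and a 36-character GUID.
MANGLED_RDN = re.compile(r"\\0A(CNF|DEL):[0-9a-fA-F-]{36}")


def is_mangled(dn: str) -> bool:
    """Return True if the DN contains a conflict/deletion-mangled RDN."""
    return MANGLED_RDN.search(dn) is not None


# Invented examples: one clean DN and one conflict-mangled DN.
clean_dn = "CN=SomeApp,CN=Partitions,CN=Configuration,DC=rakhesh,DC=local"
mangled_dn = (
    "CN=SomeApp\\0ACNF:a97a34e3-f751-489d-b1d7-1041366c2b32,"
    "CN=Partitions,CN=Configuration,DC=rakhesh,DC=local"
)

print(is_mangled(clean_dn), is_mangled(mangled_dn))
```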

CutoffServers

If you open AD Sites and Services, expand down to each site, the servers within them, and the NTDS Settings object under each server (which is basically the DSA), you can see the replication partners of each server. For instance here are the partners for two of my servers in one site:

[Screenshot: partners1]

[Screenshot: partners2]

The reason WIN-DC01 has links to both WIN-DC03 (in the same site) and WIN-DC02 (in a different site), while WIN-DC03 only has a link to WIN-DC01 (and not to WIN-DC02, which is in a different site), is that WIN-DC01 is acting as the bridgehead server. The bridgehead server is the server that’s automatically chosen by AD to replicate changes between sites. Each site has a bridgehead server and these servers talk to each other for replication across the site link. All other DCs in the site only get inter-site changes via the bridgehead server of that site. More on this later when I talk about bridgehead servers some other day … for now this is a good post to give an intro on bridgehead servers.

[Screenshot: partners3]

WIN-DC02, which is my DC in the other site, similarly has only one replication partner: WIN-DC01. So WIN-DC01 is the link between WIN-DC02 and WIN-DC03. If WIN-DC01 were to go offline then WIN-DC02 and WIN-DC03 would be cut off from each other (for a period, until the mechanism that creates the topology between sites kicks in and makes WIN-DC03 the bridgehead server for its site; or even forever if I “pin” WIN-DC01 as my preferred bridgehead server, in which case no one else can take over when it goes down). Likewise, if the link that connects the two sites were to fail, they’d be cut off from each other.

  • So the CutoffServers test tells you whether any servers in the domain are cut off from each other.
  • This test is not run by default. It must be explicitly specified.
  • This test is best run with the /e switch – which tells DcDiag to test all servers in the enterprise, across sites. In my experience, if it’s run against a specific server it usually passes the test even if replication is down.
  • Also in my experience, if a server is up and running but only LDAP is down (maybe the AD DS service is stopped, for instance), so that it can’t replicate with its partners and they are cut off, the test doesn’t identify the servers as being cut off. If the server/link itself is down then the other servers are highlighted as cut off.
  • For example, I set WIN-DC01 as the preferred bridgehead in my scenario above, then disconnected it from the network, leaving WIN-DC02 and WIN-DC03 cut off.

    If I test WIN-DC03 only it passes the test:

    That’s clearly misleading because replication isn’t happening:

    However if I run CutoffServers for the enterprise both WIN-DC02 and WIN-DC03 are correctly flagged:

    Not only is WIN-DC01 flagged in the Connectivity tests but the CutoffServers test also fails WIN-DC02 and WIN-DC03.

  • The /v switch (verbose) is also useful with this test. It will also show which NCs are failing due to the server being cut-off.
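The idea of servers being cut off can be modelled as a graph problem: treat replication links as edges, remove the servers that are down, and see whether the remaining servers still form a single group. The Python sketch below does that for the three DCs in my scenario. It is only a way to reason about the topology and makes no claim about how DcDiag computes its result; the function name is made up.

```python
from collections import deque


def replication_islands(links, down=()):
    """Group the servers that are still up into islands that can reach
    each other over replication links.

    links -- iterable of (server_a, server_b) replication partner pairs
    down  -- servers that are unreachable
    More than one island means some servers are cut off from each other.
    """
    down = set(down)
    graph = {}
    for a, b in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    seen = set()
    islands = []
    for start in sorted(graph):
        if start in down or start in seen:
            continue
        # Breadth-first search over servers that are up.
        island = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for peer in graph[node]:
                if peer not in down and peer not in island:
                    island.add(peer)
                    queue.append(peer)
        seen |= island
        islands.append(sorted(island))
    return sorted(islands)


# Topology from the scenario above: WIN-DC01 links the other two DCs.
links = [("WIN-DC01", "WIN-DC02"), ("WIN-DC01", "WIN-DC03")]

print(replication_islands(links))
print(replication_islands(links, down={"WIN-DC01"}))
```

With WIN-DC01 removed, WIN-DC02 and WIN-DC03 each end up in an island of their own, which matches both of them being flagged in the enterprise-wide run.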

DcPromo

  • Checks whether, from a DNS point of view, the target server can be made a Domain Controller. If the test fails, suggestions are given.
  • The test has some mandatory switches:
    • /dnsdomain:...
    • /NewForest (a new forest) or /NewTree (a new domain in the forest you specify via /ForestRoot:...) or /ChildDomain (a new child domain) or /ReplicaDC (another DC in the same domain)
  • Needless to say this test isn’t run by default.
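Since the switches depend on each other, a small helper that assembles the command line can catch mistakes early. The Python sketch below only builds the argument list from the switches listed above (plus /test: to select the test); it does not run DcDiag, the helper name is made up, and the domain names in the example are placeholders.

```python
VALID_MODES = ("NewForest", "NewTree", "ChildDomain", "ReplicaDC")


def dcpromo_test_args(dns_domain, mode, forest_root=None):
    """Build the argument list for running the DcPromo test.

    mode must be one of VALID_MODES; forest_root is required for NewTree.
    """
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}")
    if mode == "NewTree" and not forest_root:
        raise ValueError("/NewTree needs a forest root (/ForestRoot:...)")

    args = ["dcdiag", "/test:DcPromo", f"/dnsdomain:{dns_domain}", f"/{mode}"]
    if mode == "NewTree":
        args.append(f"/ForestRoot:{forest_root}")
    return args


# Placeholder domain names, for illustration only.
print(" ".join(dcpromo_test_args("rakhesh.local", "ReplicaDC")))
print(" ".join(dcpromo_test_args("example.local", "NewTree", forest_root="rakhesh.local")))
```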

DNS

  • Checks the DNS health of the whole enterprise. It has many sub-tests. By default all sub-tests except one are run, but you can run specific sub-tests too.
  • This TechNet page is a comprehensive source of info on what the DNS test does. Tests include checking for zones, network connectivity, client configuration, delegations, dynamic updates, name resolution, and so on.
  • This test is not run by default.
  • Since it is an enterprise-wide test DcDiag requires Enterprise Admin credentials to run tests.

FrsEvent

  • Checks for any errors with the File Replication Service (FRS).
  • It doesn’t seem to do an active test. It only checks the FRS Event Logs for any messages in the last 24 hours. If FRS is not used in the domain the test is silently skipped. (Specifying the /v switch will show that it’s being skipped).
  • Take the results with a pinch of salt. Chances are you had some errors that are now resolved, but since the last 24 hours’ worth of logs are checked the test will still flag the old error messages. Also, FRS may be in use for non-SYSVOL replication and that might have errors, but those don’t really matter as far as the DCs are concerned.
  • There may also be spurious failures if a server’s Event Log is not accessible remotely.
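The 24 hour window explains why already-resolved errors keep showing up. A minimal Python sketch of that window follows; the event records are invented dictionaries with hypothetical "time" and "level" keys, and nothing here reads a real event log.

```python
from datetime import datetime, timedelta


def errors_in_window(events, now, window=timedelta(hours=24)):
    """Return error events logged within the window ending at 'now'."""
    return [
        e for e in events
        if e["level"] == "Error" and timedelta(0) <= now - e["time"] <= window
    ]


now = datetime(2013, 10, 2, 9, 0)
# Invented events: the first error is 20 hours old and may well be resolved
# already, yet it still falls inside the window.
events = [
    {"time": datetime(2013, 10, 1, 13, 0), "level": "Error"},
    {"time": datetime(2013, 9, 30, 8, 0), "level": "Error"},
    {"time": datetime(2013, 10, 2, 8, 30), "level": "Information"},
]

print(len(errors_in_window(events, now)))
```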

DFSREvent

  • Checks for any errors with the Distributed File System Replication (DFSR).
  • Similar to the FrsEvent test. Same caveats apply.

SysVolCheck

  • Checks whether the SYSVOL share is ready.
  • In my experience this test doesn’t seem to actually check whether the SYSVOL share is accessible. For example, consider the following:

    Notice SYSVOL exists. Now I delete it.

    But SysVolCheck will happily clear the DC as passing the test:

    So take these test results with a pinch of salt!

  • As an aside, in the case above the NetLogons test will flag the share as missing:
  • There is a registry value HKLM\System\CurrentControlSet\Services\Netlogon\Parameters\SysvolReady which is 1 when SYSVOL is ready and 0 when SYSVOL is not ready. Even if I set this value to 0 – thus disabling SYSVOL, so that the SYSVOL and NETLOGON shares stop being shared – the SysVolCheck test still passes. NetLogons flags an error though.
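Given that SysVolCheck can pass when the share is gone, it may be worth cross-checking the two things yourself. The Python sketch below is a pure function that takes the SysvolReady value and a list of share names (gathered however you like; it does not read the registry or enumerate shares itself) and reports what looks wrong. The function name and messages are made up.

```python
def sysvol_findings(sysvol_ready, shares):
    """Return a list of problems based on the SysvolReady value and the
    share names currently published by the DC."""
    findings = []
    if sysvol_ready != 1:
        findings.append(f"SysvolReady is {sysvol_ready}, expected 1")

    published = {name.upper() for name in shares}
    for required in ("SYSVOL", "NETLOGON"):
        if required not in published:
            findings.append(f"{required} share is missing")
    return findings


# Healthy DC versus one where SysvolReady was set to 0.
print(sysvol_findings(1, ["ADMIN$", "C$", "NETLOGON", "SYSVOL"]))
print(sysvol_findings(0, ["ADMIN$", "C$"]))
```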

The rest of the tests will be covered in a later post. Stay tuned!