Contact

Subscribe via Email

Subscribe via RSS

Categories

Creative Commons Attribution 4.0 International License
© Rakhesh Sasidharan

Notes on AD Replication, Updates, Attributes, USN, High-Watermark Vector, Up-to-dateness Vector, Metadata, etc.

Reading a couple of AD Replication posts from TechNet/ MSDN. Below are my notes on them. 

Note

This is a super long post! It began as notes to the TechNet/ MSDN posts but it quickly grew into more. Lots of additional stuff and examples from my side/ other blog posts. Perhaps I should have split these into separate posts but I didn’t feel like it.

What is the AD Replication Model?

(source)

  • AD makes use of pull replication and not push. That is, each DC pulls in changes from other DCs (as opposed to the DC that has changes pushing these to targets). Reason for pulling instead of pushing is that only the destination DC knows what changes it needs. It is possible it has got some changes from another DC – remember AD is multimaster – so no point pushing all changes to a destination DC.
  • Initial replication to a new DC in the domain is via LDAP. Subsequent replications are via RPC.
  • All DCs do not replicate with all other DCs. They only replicate with a subset of DCs, as determined by the replication topology. (I am not sure, but I think this is only in the case of InterSite replication …). This is known as store and forward replication. DCs store updates they get from replication partners ams forward these to other DCs.
  • There are two sorts of replication – state-based and log-based.
    • State-based means DCs apply updates to their replicas (their copies of the partitions) as and when they arrive.
    • In log-based replication each DC maintains a log of the changes and sends this log to all other DCs. Once a destination DC receives this log it applies it to its replica and becomes up-to-date. This also means the destination DC will receive a list of all changes (from the source DC perspective), not just the changes it wants. 
    • Active Directory (actually called Active Directory Domain Services since Server 2008) uses state-based replication. 
    • Each DC maintains a list of objects by last modification. This way it can easily examine the last few objects in the list to identify how much of the replication source’s changes have already been processed. 

Linked & Multivalued attributes

Before the next post it’s worth going into Linked attributes and Multivalued attributes.

  • Linked attributes are pairs of attributes. AD calculates the value of one of the attributes (called the “back link”) based on the value of the other attribute (called the “forward link”). For example: group membership. Every user object has an attribute called memberOf. You can’t see this by default in ADUC in the “Attribute Editor” tab. But if you click “Filter” and select to show “Backlinks” too then you can view this attribute. backlinks The memberOf attribute is an example of a back link. And it’s generated by AD, which is why it’s a read-only attribute. It’s counterpart, the forward link, is the member attribute of a group object. Linked attributes are linked together in the schema. In the schema definitions of those attributes – which are instances of an attributeSchema class – there is a linkID attribute that defines the back and forward links. The attribute with an even linkID is the forward link (the write-able attribute that we can set) while the attribute with a linkID that’s one more than the linkID of the forward link is the back link (the read-only, AD generated attribute). For instance, using ADSI Edit I found the CN=Member,CN=Schema,CN=Configuration,DC=mydomain object in the Schema partition. This is the schema definition of the member attribute. Being a forward link it has an even linkID. linkid Its counterpart can be easily found using ldp.exe. Search the Schema partition for all objects with linkID of 3. linkid-2 The result is CN=Is-Member-Of-DL,CN=Schema,CN=Configuration,DC=mydomain, which is the attributeSchema object defining the memberOf attribute (notice the lDAPDisplayName attribute of this object in the above screenshot).
    • It’s worth taking a moment to understand the above example. When one uses ADUC to change a user/ group’s group membership, all one does is go to the user/ group object, the “Member Of” tab, and add a group. This gives the impression that the actual action happens in the user/ group object. But as we see above that’s not the case. Although ADUC may give such an impression, when we add a user/ group to a group, it is the group that gets modified with the new members and this reflects in the user/ group object that we see. ADUC makes the change on the user/ group, but the DSA makes the real change behind the scenes. The cause and effect is backwards here …
    • Another obvious but not so obvious point is that when you add a user/ group to a group, it is the group’s whenChanged attribute that gets changed and it is the change to the group that is replicated throughout the domain. The user object has no change and nothing regarding it is replicated. Obvious again, because the change really happens on the group and the effect we see on the user/ group is what AD calculates and shows us. 
  • Multivalued attributes, as the name suggests, are attributes that can hold multiple values. Again, a good example would be group membership. The member and memberOf attributes mentioned above are also multivalued attributes. Obvious, because a group can contain multiple members, and a user/ group can be a member of multiple groups. Multivalued attributes can be identified by an attribute isSingleValued in their attributeSchema definition. If this value is TRUE it’s a single valued attribute, else its a multivalued attribute.

How the AD Replication Model Works

(source)

A must read post if you are interested in the details! 

On replication

  • AD objects have attributes. AD replicates data at the attribute level – i.e. only changes to the attributes are replicated, not the entire object itself. 
    • Attributes that cannot be changed (for e.g. back links, administrative attributes) are never replicated. 
  • Updates from the same directory partition are replicated as a unit to the destination DC – i.e. over the same network connection – to optimize network traffic.
  • Updates are always specific a single directory partition. 
  • Replication of data between DCs consist of replication cycles (i.e. the DCs keep replicating until all the data has replicated). 
    • There’s no limit to the number of values that can be transferred in a replication cycle. 
    • The transfer of data happens via replication packets. Approximately 100 values can be transferred per packet. 
      • Replication packets encode the data using ASN.1 (Abstract Syntax Notation One) BER (Basic Encoding Rules). Check out Wikipedia and this post for more information on ASN.1 BER (note: ASN.1 defines the syntax; BER defines the encoding. There are many other encodings such as CER, DER, XER but BER is what LDAP uses). See this blog post for an example LDAP message (scroll down to the end). 
    • Once a replication cycle has begun, all updates on the source DC – including updates that occur during the replication cycle – are send to the destination DC. 
    • If an attribute changes multiple times between replication cycles only the latest value is replicated. 
    • As soon as an attribute is written to the AD database of a DC, it is available for replication. 
  • The number of values that can be written in a single database transaction (i.e. the destination DC has all the updates and needs to commit them to the AD database) is 5000. 
    • All updates to a single object are guaranteed to be written as a single database transaction. This way the state of an object is always consistent with the DC and there are no partially updated objects. 
      • However, for updates to a large number of values to multivalued attributes (e.g. the member attribute from the examples above) the update may not happen in the same transaction. In such cases the values are still guaranteed to be written in the same replication cycle (i.e. all updates from the source DC to that object will be written before the DC performs another replication cycle). (This exception for multivalued attributes is from Server 2003 and above domains, I think).
    • Prior to Windows Server 2003, when a multivalued attribute had a change the entire attribute was replicated (i.e. the smallest change that could be replicated was an attribute). 
      • For example, if a multivalued attribute such as member (referring to the members of a group) had 300 entries and now an additional entry was added, all 301 entries were replicated rather than just the new entry. 
      • So if a group had 5000 members and you added an extra member, 5001 entries were replicated as updates to its member attribute. Since that is larger than the number of values that can be committed in a single transaction, it would fail. Hence the maximum number of members in a group in a Windows 2000 forest functional level domain was 5000. 
    • Starting from Windows Server 2003, when a multivalued attribute has a change only the changed value is replicated (i.e. the smallest change that can be replicated is a value). Thus groups in Windows Server 2003 or Windows Server 2003 interim functional level domains don’t have the limitation of 5000 members. 
      • This feature is known as Linked Value Replication (LVR). I mentioned this in an earlier post of mine in the context of having multiple domains vs a single domain. Not only does LVR remove limitations such as the above, it also leads to efficient replication by reducing the network traffic.
      • When a Windows 2000 forest level domain is raised to Windows 2003/ 2003 interim forest level, existing multivalued attributes are not affected in any way. (If they were, there’d be huge network traffic as these changes propagate through the domain). Only when the multivalued attribute is modified does it convert to a replicate as single values. 
  • Earlier I mentioned how all attributes are actually objects of the class attributeSchema.  The AD schema is what defines attributes and the relations between them.
    • Before replication between two DCs can happen, the replication system (DRS – see below) checks whether the schema of both DCs match. If the schemas do not match, replication is rescheduled until the schemas match (makes sense – if replication happens between DCs with different schemas, the attributes may have different meanings and relationships, and replication could mess things up). 
    • If the forest schema is changed, schema replication takes precedence over all other replication. In fact, all other replication is blocked until the schema replicates. 

Architecture

  • On each DC, replication operates within the Directory System Agent (DSA) component – ntdsa.dll
    • DSA is the interface through which clients and other DCs gain access to the AD database of that DC. 
    • DSA is also the LDAP server. Directory-aware applications (basically, LDAP-aware applications) access DSA via the LDAP protocol or LDAP C API, through the wldap32.dll component  (am not very clear about this bit). LDAPv3 is used. 
  • A sub-component of the DSA is DRS (Directory Replication System) using the DRS RPC protocol (Microsoft’s specification of this protocol can be found at this MSDN link). Client & server components of the DRS interact with each other to transfer and apply updates between DCs. 
    • Updates happen in two ways: via RPC or SMTP (the latter is only for non-domain updates, which I infer to mean Schema or Configuration partition updates) 
    • Domain joined computers have a component called ntdsapi.dll which lets them communicate with DCs over RPC.
      • This is what domain joined computers use to communicate with the DC.
      • This is also what tools such as repadmin.exe or AD Sites and Services (dssites.msc) use to communicate with the DC (even if these tools are running on the DC). 
    • DCs additionally have a private version of ntdsapi.dll which lets them replicate with other DCs over RPC. (I infer the word “private” here to mean it’s meant only for DCs).
      • As mentioned earlier, DC to DC replication can also bypass RPC and use SMTP instead. Then the ntdsapi.dll component is not used. Other components, such as ISMServ.exe and the CDO library are used in that case. Remember: this is only for non-domain updates. 
    • The RPC interface used by DRS is called drsuapi. It provides the functions (also see this link) for replication and management. Check this section of the TechNet post for an architecture diagram and description of the components. 
      • The TechNet post also mentions DRS.idl. This is a file that contains the interface and type library definitions. 
  • The DSA also has a sub-component for performing low-level operations on the AD database. 
  • The DSA also has a sub-component that provides an API for applications to access the AD database. 
    • The AD database is called ntds.dit (DIT == Directory Information Tree; see this link for more info). The database engine is Extensible Storage Engine (ESE; esent.dll)

Characteristics of AD replication

  • Multimaster
    • Changes can be made to any DCs (as long as they are authoritative for the objects).
  • Pull
    • When an update occurs on a DC it notifies its partners. The partners then requests (pulls) changes from the source DC. 
  • Store-and-forward
    • When a DC receives a change from its partners, it stores the change and in-turn forwards on to others (i.e. informs others so they can issue pull requests from itself). Thus, the DC where a change is made does not have to update every other DCs in the domain. 
    • Because of the store-and-forward mechanism replication must be thought of as sequentially moving through the domain. 
  • State-based
    • Each DC has metadata to know the “state” of its replicas. This state is compared with that of its partner to identify the changes required. (This is in contrast to log-based replication where each DC keeps a log of changes it made and sends that log to its partners so they can replay the log and update themselves). 
    • This makes uses of metadata such as Update Sequence Number (USN), Up-to-dateness vector, and High-watermark vector. Synchronized time is not primarily used or required to keep track of this (it is used, but in addition to other mechanisms). 

Updates

LDAP/ AD supports the following four types of update requests:

  • Add an object to the directory.
  • Delete an object from the directory. 
  • Modify (add, delete, remove) attribute values of an existing object in the directory. 
  • Move an object by changing the name or parent of the object. 

Each of the above update requests generates a separate write transaction (because remember, all updates to a single object happen in a single transaction). Updates are an all-or-nothing event. If multiple attributes of an object are updated, and even one of those updates fail when writing to the database, all attributes fail and are not updated. 

Once an update request is committed to the database it is called an originating update. When a DC receives an originating update, writes it to its database, and sends out updates to other DCs these are called replication updates. There is no one-to-one relation between originating updates and replication updates. For instance, a DC could receive multiple originating updates to an object – possible even from different DCs – and then send a replication update with the new state of that object. (Remember: AD is state-based, not log-based. It is the state that matters, not the steps taken to reach that state). 

Adding request:

  • Creates a new object with a unique objectGUID attribute. Values of all replicated attributes are set to Version = 1. (The Version attribute will be discussed later, below). 
  • The add request fails if the parent object does not exist or the DC does not contain a writeable replica of the parent object’s directory partition. 

Modify request:

  • If an attribute is deleted then its value is set with NULL and Version is incremented. 
  • If values are to be added/ removed from an attribute, it is done so and the attribute Version is incremented. The Modify request compares the existing and new values. If there are no changes then nothing happens – the request is ignored, the Version does not change. 

Move request:

  • This is a special case of the Modify request in that only the name attribute is changed. 

Delete request:

  • The isDeleted attribute is set to TRUE. This marks the object as tombstoned (i.e. deleted but not removed from AD). 
  • The DN of the object is set to a value that cannot be set by an LDAP application (i.e. an impossible value). 
  • Except a few important attributes, the rest are stripped from the object (“stripped”, not removed as one would do above). 
  • The object itself is moved to the Deleted Objects container. 

Certain objects are protected from deletion:

  • Cross-references (class crossRef) and RID Object for the DC. Any delete request for these objects are rejected. And any originating update deletion of these objects is rejected, and all attributes of the object are updated (Version incremented) so the object replicates to other DCs and is reanimated wherever it was deleted.  
    • Cross-references are present in the Configuration partition: CN=Partitions,CN=Configuration,DC=(domain)
    • RID Objects are present under each DC at CN=RID Set,CN=(DC),OU=Domain Controllers,DC=(domain)
  • The NTDS Settings (class nTDSDSA) object.
    • This object represents the DSA (Directory System Agent). It is present in the Configuration partition under the site and DC: CN=NTDS Settings,CN=(DC),CN=Servers,CN=(SITE),CN=Sites,CN=Configuration,DC=(domain)
    • Remember, the DSA is the LDAP server within the DC. The DSA’s GUID is what we see when DCs replicate or in their DNS names (DSAGUID._msdcs.domain is a CNAME to the DC name).
    • The objectGUID of the DC object is not the same as the objectGUID of its DSA object. The latter is what matters. When a DC computer object is opened in ADUC, it is possible to view the corresponding DSA object by clicking on the “NTDS Settings” button in the first tab. 
    • Trivia: to find all the DCs in a forest one can query for objects of class nTDSDSA. Below query, for instance, finds all objects of that class and returns their DN and objectGUID

      nTDSDSA

    • When a DC is demoted, its DSA object is deleted but not really deleted. It is protected and disabled from receiving replication requests. Thus, a query such as the above will return DSA objects of DCs that may no longer exist in the forest.

      A better way to structure the above query then would be to also filter out objects whose isDeleted attribute is not TRUE.

      nTDSDSA-2

Conflicts

AD does not depend on timestamps to resolve conflicts. In case of conflicts the “volatility” of changes is what primarily matters. That is, if an attribute was updated twice on one DC and thrice on another DC, even if the timestamps of changes from the first DC are later than the second DC, the change from the second DC will win because the attribute was updated thrice there. More on this in my next post

Database GUID

I mentioned above that each DSA has its own objectGUID. This GUID is created with the server is promoted to a DC and deleted (sort of) when the DC is demoted. Thus the GUID is present for the lifetime of the DC and doesn’t change even if the DC name changes. 

The AD database (ntds.dit) on each DC has its own GUID. This GUID is stored in the invocationId attribute of the DSA. Unlike the DSA GUID, the database GUID can change. This happens when (1) the DC is demoted and re-promoted (so the database changes), (2) when an application NC is added/ removed to the DC, (3) when the DC/ database is restored from a backup, or (4) (only in Server 2012 and above) whenever a DC running in a VM is snapshotted or copied.

The invocationId attribute can be viewed via ldp.exe as above, or in the “NTDS Settings” of the DC computer object in AD Users & Computers, or in the “NTDS Settings” in AD Sites & Services. It can also be viewed in the output of repadmin.exe.

The invocationID is a part of replication requests (more later). When the invocationID changes other know that they have to replicate changes to this DC so it can catch up. The first DC in a domain will have its invocationID and objectGUID as same (until the invocationID changes). All other DCs will have different values for these both. 

Update Sequence Numbers (USNs)

(I was planning on covering this separately as part of my AD Troubleshooting WorkshopPLUS posts, but USNs are mentioned in this TechNet post so I might as well cover them now). 

USNs are 64-bit counters maintained on each DC. The number is different for each DC and is stored in the highestCommittedUsn attribute of the rootDSE object.

rootDSE is an imaginary object. It is the root of the directory namespace. Under it we have the various domain partitions, configuration partition, schema partition, and application partitions. rootDSE can be queried to find the various partitions known to the DC. It can be viewed through ADSI Edit or ldp.exe (the latter makes it easier to view all the attributes together).

rootDSE

As can be seen in the screenshot above, one of the attributes is highestCommittedUSN.

Each time a DC commits an update (originating update/ replication update), it updates the highestCommittedUSN attribute. Thus the USN is associated with each successful transaction to the AD database. Remember: all updates to a single object are written in a single transaction, so a USN is essentially associated with each successful update to an object – be it a single attribute, or multiple updates to the same/ multiple attributes (but all written in the same transaction). 

USNs are update incrementally for each write transaction. So, for example, when I took the above screenshot the USN for WIN-DC02 was 77102. When I am writing this paragraph (the next day) the USN is 77230. This means between yesterday and today WIN-DC02 has written 128 transactions (77230-77102) to its database.  

Every attribute of an object has a USN associated with it. This is a part of the replication metadata for that object, and can be viewed through the repadmin.exe command. For instance, using the /showobjmeta switch (note that here we specify DC first and then object):

Notice how the attribute USNs vary between the two DCs. Also notice the metadata stored – the Local USN, the originating DSA, the Originating DSA's USN, the timestamp of the update, and Version. If you are running the above command it is best to set the command prompt window width to 160, else the output doesn’t make much sense. I will talk about Local USN and Originating USN in a moment. 

Another switch is /showmeta (here we specify the object first and then the DC):

Both switches seem to produce the same output. The important items are the Local USN and Originating USN, and the Version

  • Version starts from 1 and is incremented for each update. Version will be same on all DCs – when an DC commits an update request (originating update or replication update) it will increment the Version. Since all attributes start with the same Version = 1, the current value will be the same on all DCs. 
  • Local USN is the USN value for that attribute on the DC we are looking at. It is the value of the highestCommittedUSN for the transaction that committed the update to this attribute/ set of attributes
  • Originating USN is the USN for that attribute on the DSA where the originating update was sent from.

For instance: the attribute description. WIN-DC01 has local USN 46335, WIN-DC02 has 19340, and WIN-DC03 has 8371. The latter two DCs got their update for this attribute from WIN-DC01 so they show the originating DSA as WIN-DC01 and the Originating USN as 46335.

Every object has an attribute called uSNChanged. This is the highest Local USN among all the attributes of that object. (What this means is that from the USN of an object we can see which of its attributes have the same local USN and so easily determine the attributes changed last). Unlike the attribute USNs which are metadata, uSNChanged is an object attribute and thus can be viewed via ADSI Edit or ADUC. From the command line, repadmin.exe with the /showattr switch can be used to view all attributes of an object (this switch takes the DC name first, then the object). 

Above output tells me that on WIN-DC01 the uSNChanged for this object is 46335.  From the earlier /showobjmeta output I know that only the description attribute has that local USN, so now I know this was the last attribute changed for this object. 

So, to recap: DCs have a USN counter stored in highestCommitedUSN attribute. This counter is updated for each successful write transaction to the database. Since writes happen per object, it means this USN is updated every time an object is updated in the database. Each object attribute has its own USN – this too is per DC, but is not an attribute, it is metadata. Finally, each object has a uSNChanged attribute which is simply the highest attribute USN of that object. This too is per DC as the attribute USN is per DC. The DC’s highestCommittedUSN attribute and an object’s uSNChanged attribute are related thus: 

  • When an attribute update is committed to the database the DC’s highestCommittedUSN is incremented by 1.
  • The Local USN of the attribute/ attributes is set to this new highestCommittedUSN
  • This in turns updates the object’s uSNChanged to be the new highestCommittedUSN (because that is the highest attribute Local USN now)

Thus, the highestCommittedUSN is the highest uSNChanged attribute among all the replicas held by the DC.  

Here’s an example. First I note the DC’s highestCommitedUSN (I have removed the other attributes from the output). 

Then I note an object’s uSNChanged:

Now I connect via ADUC and change the description field. Note the new USN. 

And note the DC’s USN:

It has increased, but due to the other changes happening on the DC it is about 10 higher than the uSNChanged of the object I updated. I searched the other replicas on my DC for changes but couldn’t find anything. (I used a filter like (uSNChanged>=125979) in ldp.exe and searched every replica but couldn’t find any other changes. Maybe I missed some replica – dunno!) This behavior is observed by others too. (From the previous linked forum post I came across this blog post. Good read).  

Finally, I must point out that even though I said the attribute metadata (Local USN, Originating USN, Originating DSA, Originating Time/ Date, Version) are not stored as attributes, that is not entirely correct. Each object has an attribute called replPropertyMetaData. This is an hexadecimal value that contains the metadata stored as a binary value. In fact, if we right click on an object in ldp.exe it is possible to view the replication metadata because ldp.exe will read the above attribute and output its contents in a readable form. 

ldp-metadata

Bear in mind this attribute is not replicated. It is an attribute that cannot be administratively changed, so can never be updated by end-users and doesn’t need replicating. It is only calculated and maintained by each DC. uSNChanged is a similar attribute – not replicated, only administratively changed. 

Note to self: I need to investigate further but it looks like uSNChanged cannot be viewed by ldp.exe for all objects. For instance, ldp.exe shows this attribute for one user object in my domain but not for the other! I think that’s because the attribute is generated by the AD tools when we view it. repadmin.exe and the GUI tools show it always. Similarly, the attribute level metadata attribute too cannot be viewed by ldp.exe for all objects. For some objects it gives an error that the replPropertyMetaData attribute cannot be found and so cannot show the replication metadata. This could also be why there was a gap between the uSNChanged and highestCommittedUSN above. 

High-watermark Vector (HWMV) & Up-to-dateness Vector (UTDV)

To recap once more, what we know so far is:

  • Every attribute has replication metadata that contains its Local USN, the originating DSA's USN, the originating DSA, the timestamp, and Version
  • The highest Local USN of an attribute is stored as the object’s uSNChanged attribute. 
  • The highest uSNChanged attribute among all objects on all replicas held by the DSA is stored as its highestCommittedUSN attribute in the rootDSA
  • All these USN counters are local to the DC (except for the Originating USN which is local to that the originating DSA). 

How can DCs replicate with each other using the above info? Two things are used for this: High-watermark vector and Up-to-dateness vector. 

The HWMV is a table maintained on each DC.

  • There is one table for each directory partition the DC holds replicas for (so at minimum there will be three HWMV tables – for the domain partition, the Schema partition, and the Configuration partition). 
  • The HWMV table contains the highest USN of the updates the DC has received from each of its replication partners for that replica. Note: the table contains the USN that’s local to the replication partner.
  • Thus the HWMV can be thought of as a table contain the highest uSNChanged value for each directory partition of its replication partners when they last replicated.

Got it? Now whenever a DC needs to replicate with another DC, all it needs to do is ask it for each of the changes since the last uSNChanged value the destination is aware of! Neat, right! The process looks roughly like this:

  1. The originating DC will have some changes. These changes will cause uSNChanged values of the affected objects to change. The highestCommittedUSN of the DC too changes. All these are local changes.
  2. Now the DC will inform all its replication partners that there are changes. 
  3. Each of these replication partners will send their HWMV for the partition in question.
  4. From the received HWMV table, the originating DC can work what changes need to be sent to the destination DC. 
    1. The originating DC now knows what USNs it needs to look for. From this it knows what objects, and in turn which of their attributes, were actually changed. So it sends just these changed attributes to the destination DC. 

To make the process of replication faster, all DCs have one more table. That is the Up-to-dateness Vector (UTDV).

  • Like the HWMV, this too is for every replica the DC holds. 
  • Unlike the HWMV this contains the highest USN of the updates the DC has received from every other DC in the domain/ forest for that replica. 
  • The table also contains the timestamp of last successful replication with each of those DCs for that replica.

The UTDV table has a different purpose to the HWMV. This table is sent by the destination DC to the originating DC when it requests for changes – i.e. in step 3 above.

When the originating DC gets the UTDV table for its destination DC, it can look at the table and note the destination DC’s highest received USNs for that partition from other DCs. Maybe the destination DC has asked for changes from USN number x upwards (the USN number being of the originating DC). But these changes were already received by the destination DC from another DC, under USN number y and upwards (the USN number being of that other DC). The destination DC does not know that the changes requested by USN x and upwards are already with it from another DC, but by looking at the UTDV table the originating DC can see that USNs y and above already contain the changes the destination DC is requesting, so it can filter out those updates when sending. (This feature is called “propagation dampening”). 

  • In practice, once the originating DC compiles a list of USNs that need to be sent to the destination DC – at step 4 above – it goes through its replication metadata to check each attribute and the originating DSA and originating USN associated with that attribute.
  • The originating DC then looks at the UTDV table of the destination DC, specifically at the entry for the DC that had sent it an update for the changed attribute. (This DC could be same as the originating DC). From the table it can see what USNs from this DC are already present at the destination DC. If the USN value in the table is higher than the USN value of the attribute, it means the destination DC already has the change with it. So the originating DC can filter out these and similar attributes from sending to the destination DC.  

Thus the UTDV table works along with the HWMV table to speed up replication (and also avoid loops wherein one DC updates another DC who updates the first DC and thus they keep looping). And that is how replications happen behind the scenes! 

Once a destination DC updates itself from an originating DC – i.e. the replication cycle completes – the source DC sends its UTDV table to the destination DC. The destination DC then updates its UTDV table with the info from the received UTDV table. Each entry in the received table is compared with the one it has and one of the following happens:

  • If the received table has an entry that the destination DC’s UTDV table does not have – meaning there’s another DC for this replica that it isn’t aware of, this DC has replicated successfully with the originating DC and so all the info it has is now also present with the destination DC, and so it is as good as saying this new DC has replicated with the destination DC and we are aware of it the same way the originating DC is aware – so a new entry is added to the destination DC’s UTDV table with the name of this unknown DC and the corresponding info from the received UTDV table. 
  • If the received table has an entry that the destination DC’s UTDV table already has, and its USN value is higher than what the destination DC’s table notes – meaning whatever changes this known DC had for this partition has already replicated with the originating DC and thus the destination DC – and so its entry in the UTDV can actually be updated, the UTDV table for that server is updated with the value from the received UTDV table.  

The UTDV table also records timestamps along with the USN value. This way DCs can quickly identify other DCs that are not replicating. These timestamps record the time the DC last replicated with the other DC – either directly or indirectly. 

Both HWMV and UTDV tables also include the invocationID (the database GUID) of the DCs. Thus, if a DC’s database is restored and its invocationID changes, other DCs can take this into account and replicate any changes they might have already replicated in the past.  

From the book I am reading side-by-side (excellent book, highly recommended!) I learnt that apart from the HWMV and UTDV tables and the naming context it wants to replicate, the destination DC also sends two other pieces of info to originating DC. These are: (1) the maximum number of object updates the destination DC wishes to receive during that replication cycle, and (2) the maximum number of values the destination DC wishes to receive during that replication cycle. I am not entirely clear what these two do. Once a replication cycle begins all object updates and values are sent to the destination DC, so the two pieces above seems to be about whether all the updates are sent in one replication packet or whether they are split and sent in multiple packets. The maximum number of values in a single packet is about 100, so I guess these two numbers are useful if you can only accept less than 100 values per packet – possibly due to network constraints. 

More later …

Unfortunately I have to take a break with this post here. I am about halfway down the TechNet post but I am pressed for time at the moment so rather than keep delaying this post I am going to publish it now and continue with the rest in another post (hopefully I get around to writing it!). As a note to myself, I am currently at the Active Directory Data Updates section, in the sub-section called “Multimaster Conflict Resolution Policy”.

[Aside] Multiple domain forests

Was reading about multiple domains/ child domains in a forest and came across these interesting posts. They talk pretty much the same stuff. 

Key points are:

  • In the past a domain was considered to be the security boundary. But since Windows 2000 a domain is no longer considered a security boundary, a forest is.
  • Domain Admins from child domain can gain access to control the forest. The posts don’t explain how but allude that it is possible and widely known.
  • Another reason for multiple domains was password policies. In Windows Server 2000 and 2003 password policies were per domain. But since Windows Server 2008 it is possible to define Fine-Grain Password Policies (FGPPs) that can override the default domain password policy. 
  • Multiple domains were also used when security was a concern. Maybe a remote location had poor security and IT Admins weren’t comfortable with having all the domain usernames and password replicating to DCs in such locations. Solution was the create a separate domain with just the users of that domain. But since Windows Server 2008 we have Read-Only Domain Controllers (RODCs) that do not store any password and can be set to cache passwords of only specified users.
  • Yet another reason for multiple domains was DNS replication. In Windows Server 2000 AD integrated DNS zones replicated to all DCs of the domain – that is, even DCs not holding the DNS role. To avoid such replication traffic multiple domains were created so the DNS replication was limited to only DCs of those domains. Again, starting Windows Server 2003 we have Application Partitions which can be set to replicate to specific DCs. In fact, Server 2003 introduced two Application Partitions specifically for DNS – a Forest DNS Zone partition, and a Domain DNS Zone partition (per domain). These replicate to all DCs that are also DNS servers in the forest and domain respectively, thus reducing DNS replication traffic. 
  • Something I wasn’t aware of until I read these articles was the Linked Value Replication (LVR). In Server 2000 whenever an attribute changed the entire attribute was replicated – for example, if a user is added to a group, the list of all group members is replicated – obviously too much traffic, and yet another reason for multiple domains (to contain the replication traffic). But since Server 2003 we have LVR which only replicates the change – thus, if a user is added to the group, only the addition is replicated. 

One recommendation (itself a matter of debate and recommended against in the above two posts) is to have two domains in the forest with one of them being a placeholder:

  1. A root domain, which will be the forest root domain and will contain the forest FSMO roles as well as Schema Admins and Enterprise Admins; and 
  2. A child domain, which will be the regular domain and will contain everything else (all the OUs, users, Domain Admins)

The root domain will not contain any objects except the Enterprise & Schema admins and the DCs. Check out this article for a nice picture and more details on this model. It’s worth pointing out that such a model is only recommended for medium to large domains, not small domains (because of overhead of maintaining two domains with the additional DCs).

Also check out this post on domain models in general. It is a great post and mentions the “placeholder forest root domain” model of above and how it is often used. From the comments I learnt why it’s better to create child domains rather than peer domains in case of the placeholder forest root domain model. If you create peers there’s no way to indicate a specific domain is the forest root – from the name they all appear the same – while if you create child domains you can easily identify who the forest root is. Also, with child domains you know that the parent forest root domain is important because you can’t remove that domain (without realizing its role) because the child domain namespace depends on it. Note that creating a domain as child to another does not give Domain Admins of the latter administrative rights to it (except of course if these Domain Admins are also Enterprise Admins). The domain is a child only in that its namespace is a child. The two domains have a two way trust relationship – be it peers or parent/ child – so users can logon to each domain using credentials from their domain, but they have no other rights unless explicitly granted. 

The authors of the “Active Directory (5th ed.)” book (a great book!) recommend keeping things simple and avoiding placeholder forest root domains.

Active Directory: Operations Master Roles (contd.)

Continuation to my previous post on FSMO roles.

If the Schema Master role is down it isn’t crucial that it be brought up immediately. This role is only used when making schema changes. 

If the Domain Master role is down it isn’t crucial that it be brought up immediately. This role is only used when adding/ removing domains and new partitions. 

If the Infrastructure Master role is down two things could be affected – groups containing users from other accounts, and you can’t run adprep /domainprep. The former only runs every two days, so it isn’t that crucial. Moreover, if your domain has Recycle Bin enabled, or all your DCs are GCs, the Infrastructure Master role doesn’t matter any more. The latter is only run when you add a new DC of a later Windows version than the ones you already have (see this link for what Adprep does for each version of Windows) – this doesn’t happen often, so isn’t crucial. 

If the RID Master role is down RID pools can’t be issued to DCs. But RIDs are handed out in blocks of 500 per DC, and when a DC reaches 250 of these it makes a request for new RIDs. So again, unless the domain is having a large number of security objects being created suddenly – thus exhausting the RID pool with all the DCs – this role isn’t crucial either. 

Note

When I keep saying a role isn’t crucial, what I mean is that you have a few days before seizing the role to another DC or putting yourself under pressure to bring the DC back up. All these roles are important and matter, but they don’t always need to be up and running for the domain to function properly. Also, if a DC holding a role is down and you seize the role to another DC, it’s recommended not to bring the old DC up – to avoid clashes. Better to recreate it as a new DC. 

Lastly, the PDC Emulator role. This is an important role because it is used in so many areas – for password chaining, for talking to older DC, for keeping time in the domain, for GPO manipulation, DFS (if you have enabled optimize for consistency), etc. Things won’t break, and you can transfer many of the functions to other DCs, but it’s better they all be on the PDC Emulator. Thus for all the roles the PDC Emulator is the one you should try to bring back up first. Again, it’s not crucial, and things will function well while the PDC Emulator, but it is the most important role of all. 

Active Directory: Troubleshooting

This is intended to be a “running post” with bits and pieces I find on AD troubleshooting. If I bookmark these I’ll forget them. But if I put them here I can search easily and also put some notes alongside. 

DCDiag switches and other commands

From Paul Bergson:

  • dcdiag /v /c /d /e /s:dcname > c:\dcdiag.log
    • /v tells it to be verbose
    • /d tells it to also show debug out – i.e. even more verbosity
    • /c tells it to be comprehensive – do all the non-default tests too (except DCPromo and RegisterInDNS)
    • /e tells it to test all servers in the enterprise – i.e. across site links

This prompted me to make a table with the list of DcDiag tests that are run by default and in comprehensive mode. 

Test NameBy default?Comprehensive?
AdvertisingYY
CheckSDRefDomYY
CheckSecurityErrorNY
ConnectivityYY
CrossRefValidationYY
CutOffServersNY
DcPromoN/AN/A
DNSNY
FrsEventYY
DFSREventYY
SysVolCheckYY
LocatorCheckYY
IntersiteYY
KccEventYY
KnowsOfRoleHoldersYY
MachineAccountYY
NCSecDescYY
NetLogosYY
ObjectsReplicatedYY
OutboundSecureChannelsYY
RegisterInDNSN/AN/A
ReplicationsYY
RidManagerYY
ServicesYY
SystemLogYY
TopologyNY
VerifyEnterpriseReferencesNY
VerifyReferences YY
VerifyReplicasNY

Replication error 1722 The RPC server is unavailable

Came across this after I setup a new child domain. Other DCs in the forest were unable to replicate to this for about 2 hours. The error was due to DNS – the CNAME records for the new DC hadn’t replicated yet. 

This TechNet post was a good read. Gives a few commands worth keeping in mind, and shows a logical way of troubleshooting.

Replication error 8524 The DSA operation is unable to proceed because of a DNS lookup failure

Another TechNet post came across in relation to the above DNS issue. 

This command is worth remembering:

Shows all the replication partners and a summary of last replication. Seems to be similar to:

 Especially useful is the fact that both commands give the DSA GUIDs of the target DC and its partners:

It is possible to specify a DC by giving its name. Have the GUIDs is useful when you suspect DNS issues. Check that the CNAMEs can be resolved from both source and destination DCs.  

Active Directory: Operations Master Roles (contd.)

This is a continuation to my previous post on AD FSMO roles. I had to wrap that up in a hurry as I was distracted by other things. 

Identifying the FSMO role holders

Easiest would be to use netdom or dcdiag

You could also use ntdsutil (if you can remember the long commands!)

The overall idea with all these commands is the same. You connect to a specified server or any sever in a specified domain. Then you query that server for the FSMO roles it knows of. 

If PowerShell is your friend, then the Get-ADDomain and Get-ADForest cmdlets are your friends. The output of these cmdlets show you the current FSMO role holders:

Finally, if a GUI is your weapon of choice, there’s three different places you will have to look:

  1. In AD Users and Computers, right click the domain and select “Operations Masters”. This will show the RID Master, PDC, and Infrastructure Master – the three domain specific FSMOs.
  2. In AD Domains and Trusts, right click on AD Domains and Trusts and select “Operations Masters”. This will show the Domain Naming Master. 
  3. Open a command prompt or run window, type regsvr32 schmmgmt.dll, then open MMC, add a snap-in called AD Schema and open it, and then right click on AD Schema and select “Operations Masters”. This will show the Schema Master. 

Transferring FSMO role holders

With a GUI transferring role holders is easy. In each of the screens above when you view the role holder there’s also an option to change it. 

To transfer/ seize a role with ntdsutil you first connect to the DC that will now hold the role, and issue a transfer/ seize command. As the names imply transfer is a “clean” way of moving the role, while seize is a seizing of the role. You seize roles when the DC that currently has the role is down/ unreachable and you can’t wait for a graceful transfer. For example: 

When you attempt a seize, ntdsutil attempts a transfer first and if that succeeds then a transfer is done instead of a seize.  

Things are much easier with PowerShell. One cmdlet, and you are done:

A neat thing about this cmdlet is that you don’t have to necessarily specify the role name. Each role has a number associated with it so you could simply specify the number instead of the role name as a parameter. 

  • PDCEmulator is 0
  • RIDMaster is 1
  • InfrastructureMaster is 2
  • SchemaMaster is 3
  • DomainNamingMaster is 4

This is quite convenient when you want to specify multiple roles to be moved. Much easier typing -OperationMasterRole 1,2,3 than -OperationMasterRole RIDMaster,InfrastructureMaster,SchemaMaster

I prefer ntdsutil over PowerShell as it gives confirmation and so I know the transfer/ seize has succeeded. 

The fsMORoleOwner attribute

The commands above show how to view and/ or transfer FSMO roles. But where exactly is the information on various roles stored in AD? Many places actually …

There is an attribute called fsMORoleOwner. Different locations have this attribute present, and the value of this attribute at these locations indicate which DC holds a particular role. (The attribute can be viewed using ADSI Edit or PowerShell).

  • In the domain partition, the DC=domainName container has this attribute. The value there points to the DC with the PDC Emulator role. 
  • In the domain partition, the CN=Infrastructure,DC=domainName container has this attribute. The value there points to the DC with the Infrastructure Master role. 
  • In the domain partition, the CN=RID Manager$,CN=System,DC=domainName container has this attribute. The value there points to the DC with the RID Master role.
  • In the configuration partition, the CN=Schema,CN=Configuration,DC=forestRootDomain container has this attribute. The value there points to the DC with the Schema Master role.
  • Lastly, in the configuration partition, the CN=Partitions,CN=Configuration,DC=forestRootDomain container has this attribute. The value there points to the DC with the Domain Naming Master role.

What this means is that you can change this attribute and effectively transfer the FSMO role from one DC to another. For instance: 

That said I wouldn’t generally transfer/ seize roles this way. For one, I am not sure whether ntdsutil and/ or PowerShell does anything else behind the scenes (maybe replicate the change with priority?). For another, occasionally I have got errors like “The role owner attribute could not be read” when trying to change the attribute. These errors seem to be related to corrupt DC entries in FSMO roles, so don’t push your luck. (Or maybe I wasn’t connecting to the DC currently holding that role – not sure. You have to be connected to the DC holding the role to change this attribute). 

Another thing to keep in mind is that after you transfer/ seize a role the change has to update through out the domain. Once you think of it in terms of the attributes above that makes sense. When a change is made the attribute is updated and that update as to replicate throughout the domain for other DCs to know.

Lastly, I hadn’t realized this before, but FSMO roles apply to each application partition too. They are not only for the domain partitions (as I had previously thought). To transfer/ seize FSMO roles of application partitions one must update the attribute directly as above. 

That’s all for now!

Update: When a DC holding one of the FSMO roles comes online after a reboot it will do an initial sync with its partners before advertising itself as a DC. During this initial sync it will check the fsMORoleOwner attribute to verify that it is still the FSMO role holder. If it still it, it will start advertising its services; if not, it will stop advertising that role. More here …

Active Directory: Operations Master Roles

Active Directory is a multimaster database. Which means it has no single master – any of the domain controllers (the read-write ones) can make changes to the Active Directory database. However, there are some tasks that necessarily need a single domain controller to be the one in charge. You can still make changes from any domain controller, but they will check with a select domain controller to ensure there’s no conflicts in making the change or perhaps ask this Domain Controller to actually make the change.

There are five such tasks where Active Directory behaves as a singlemaster database. For these tasks only a designated Domain Controller can update the database. Mind you, not all tasks need to be performed by the same Domain Controller. A Domain Controller that can carry out a particular task is one that holds the role to carry out that task. All five roles can be in a single Domain Controller, or they can be in separate Domain Controllers. Any Domain Controller holding a particular role is said to be the Flexible Single Master Operator (FSMO) for that task.

As you know a domain is part of a forest. A single forest can contain multiple domains. Two of these roles are held by a single Domain Controllers in the entire forest. Three of these roles are held by single Domain Controllers in the particular domain. Thus a forest with a single domain has 5 roles, while a forest with two domains has 8 roles (2 for the forest and 3×2 for each domain). The roles are automatically assigned when the forest/ domain is created. The first DC in the forest has the two forest roles assigned to it; the first DC in a domain has the three domain roles assigned to it. Administrators who have appropriate rights can then move these roles to other DCs.

Forest-wide roles

These are roles held by a Domain Controller/ Domain Controllers across all domains in the forest. These roles can be on any DCs in the forest, not necessarily the forest root domain.

Schema Master

The DC holding the Schema Master role is the only one that can update the AD schema. The schema is Active Directory’s blueprint. It is what defines the sort of objects the directory can contain and what attributes can be set for these objects. The schema is set at the forest level and shared by all domains in the forest. Administrators rarely need to update the schema except when installing programs that add new attributes to the objects. For instance, Exchange installs require a schema update as the objects now contain additional attributes such as the email address.

The schema itself is stored in Active Directory in a separate partition and replicated to all DCs. The schema partition is an instance of a dMD (Directory Management Domain) class object. All DCs in the forest thus have a copy of the schema and can read it, but only the DC holding the Schema Master role can write to it. This way there can be no conflicts if multiple DCs try and update the schema.

The current version of the schema can be found using ADSI Edit, connect to the Schema context and check the objectVersion attribute.

objectVersion

Via command-line the schema can be checked using repadmin /showattr:

Or PowerShell:

My domain is at the schema version of Windows Server 2012 R2. If I were to install Exchange the schema version will change (each version of Exchange has its own schema version). When I extend the schema or run adprep /forestprep the DC with the Schema Master role is the one that’s responsible. In fact, the adprep /forestprep command must be run on the DC with the Schema Master role.

The “Change Schema Master” right is required to transfer/ seize the Schema Master role to a different DC. By default only the Schema Admins group members have this right.

More about schema and how it works can be found in this TechNet article.

Domain Naming Master

Remember my post on DCDiag where I introduced the Partitions container (CN=Partitions,CN=Configuration,DC=forestRootDomain) in the Configuration NC? This container has objects of class crossRef that are cross-references to all domain partitions/ naming contexts in the forest. Well, the DC holding the Domain Naming Master is the only one that can make changes to this container – which means it is the only one that can add/ remove/ rename/ move domains in the forest and authorize creation/ deletion of application NCs. This way there’s one DC in the forest who is responsible for the forest-wide namespace. Conflicts are avoided as multiple DCs can not make changes here. 

The “Change Domain Master” right is required to transfer/ seize the Domain Naming Master role to a different DC. By default only the Enterprise Admins group members have this right.

Domain-wide roles

These are roles that are held by Domain Controllers in each domain.

PDC Emulator

The PDC Emulator – so named because it emulates a Primary Domain Controller (PDC) of Windows NT domains – has many functions:

  • It tries to maintain the latest password for any account by having all other DCs forward password changes to the PDC. (This can be avoided for PDCs over WAN links via a registry key).
  • If a user authenticates with a DC and fails, before informing the user so the DC will check with the PDC whether password is valid. This avoids situations where the password was changed on a different DC and the one the user is authenticating against isn’t aware of the change.
  • Account lockouts are processed on the PDC.
  • Group Policy Management tools default to the PDC to make changes. (You can choose a different DC however).
  • The PDC in each domain is the primary source of time for that domain. The PDC in the forest-root domain is the primary time source for all these other PDCs – and hence the primary time source for the forest.
  • The AdminSDHolder process I blogged about earlier runs on the PDC.
  • Since Windows Server 2012 DC cloning is supported. This requires the PDC to be online, available, and running Windows Server 2012 or higher.

The “Change PDC” right is required to transfer/ seize the PDC Emulator role to a different DC. By default the Domain Admins and Enterprise Admins group members have this right.

RID Master

Before talking about the RID Master it’s important to talk about RIDs.

Every security principal (i.e. objects to which one can assign security rights – for e.g. users, computers, security groups; but not distribution groups) in Windows has a Security IDentifier (SID). These are unique identifiers used by Windows internally when referring to objects. Both domain and standalone objects have SIDs. 

SIDs have a format like this: S-V-X-(48 bits of domain or standalone machine IDentifier)-(32 bits of Relative IDentifier (RID) value)

  • The underlined part can be thought of as a base SID. It is common for all objects in the same domain or standalone machine. 
  • The “S” is the letter “S”. It identifies what follows as a SID.
  • The “V” stands for the version of the SID specification.
  • The “X” is a number from 0-9. It defines the identifier authority value. For example, some objects (like Everyone) has the same SID everywhere. These are issued by a “World Authority” whose X number is 1. Most other objects – domain and standalone – have a value of 5 which stands for “NT Authority”. This Wikipedia page lists the values and authority.
  • The 48 bits of domain/ standalone machine ID are either assigned by the domain of which the object is a member (in which case it is generated at random when the domain was first created) or are assigned by the standalone computer of which the object is a member (in which case it is generated at random by Windows Setup when the OS was installed). All objects of the same domain/ standalone computer have this part common.Every domain in a forest has a unique ID (this domain ID is actually the machine SID of the first DC of the domain). 
    • There are some exceptions where the ID isn’t unique. For instance, if the object is a built-in user or group, the domain ID is 32 (irrespective of what domain it belongs to). That’s because these are built-in objects and so the domain/ standalone machine part doesn’t really matter. 
  • All the values mentioned above are common to all objects of the same domain/ standalone computer. What follows next – the 32 bits RID – is unique for each object. This is generated as follows:
    • For objects that are part of a standalone machine RIDs are assigned by the machine itself. Some accounts have a standard RID. For example the Administrator account always has RID 500. All user & group accounts start from RID 1000. RIDs are unique within the context of the machine and are assigned by the Local Security Authority (LSA) of the machine. 
    • For objects that are part of a domain the RIDs are assigned by the domain controller where the object was created. Some accounts have a standard RID. For example the Administrator account always has RID 500; Guest account always has RID 501; built-in Administrators group has RID 544; built-in Users group has RID 545; and so on (see this TechNet page for an exhaustive list). RIDs are unique within the context of the domain

And thus we come to the RID Master role. For domain objects the base SID is common for all objects. What varies is the RID. This needs to be unique in the domain. If every Domain Controller could assign a RID of its own there’s no guarantees of uniqueness. So what is required is some way of assigning each Domain Controller a block of RIDs it can assign to objects created on it. And in turn we need one DC that can hand out these block of RIDs and keeps track of what’s free for giving out next. The DC that performs this role is known to have the RID Master role. This DC hands out blocks of 500 RIDs to other DCs in the domain (the value 500 can be modified via registry).

This blog post by Mark Russinovich is a good intro to SIDs (as is this clarification by another blogger – must read if you read Mark’s post). This TechNet page is a good intro to SIDs and RIDs and definitely worth a read. From the latter I learnt that even though SIDs are used by Windows and Active Directory to grant/ deny permissions, Active Directory actually uses its own Globally Unique ID (GUID) to identify objects. These are globally unique (i.e. across the world), and although Active Directory can use GUIDs instead of SIDs it continues to use SIDs for backward compatibility. An object’s SID is stored in the objectSID property; an object’s GUID is stored in the objectGUID property. And, while the GUID is unique for life, the SID changes if the object is moved to a different domain (as that domain has its own domain ID and RID assignments). In case of such SID changes the past SIDs are stored in a property called sIDHistory

Some more bits and pieces on the RID Master:

  • Although there are 32 bits allocated for a RID, prior to Windows Server 2012 only 30 bits could be used. Thus the maximum RID value was 2^30 = 1073741823 (roughly a billion). 
  • The DCDiag command can be used to see RID allocation. The command is: dcdiag /test:RidManager /v – the /v switch is required to see the additional details. 
  • Starting from Windows Server 2012 it is possible to unlock the 31st bit for RID allocation. This requires modifying a hidden attribute. See this TechNet page for more info. Note, however, that Windows Server 2003 and 2008 cannot use these RIDs (Windows Server 2008 R2 can use these RIDs if a hotfix is applied). 
  • Server 2012 also warns for every 10% of the RID space usage (i.e. for every 100 million RIDs allocated). Also, it applies an artificial ceiling of 10% to the RID block – i.e. you can only allocate up to 90% of the roughly 1 billions RIDs. Once this ceiling is reached it has to be manually removed for further RIDs to be allocated (this gives administrators a chance to identify why their RID pool could be nearing exhaustion).  

The “Change RID Master” right is required to transfer/ seize the RID Master role to a different DC. By default the Domain Admins group members have this right.

Update: This is an interesting post worth reading. 

Infrastructure Master

A domain can contain references to objects in other domains. For instance, group members could be users in other trusted domains. If those objects have changes made to them (renames, deletions) in their domain, the domain which contain references to these objects wouldn’t know about these changes. So what is required is for someone to regularly check these references and update the other local DCs accordingly – that’s where the DC with the Infrastructure Master role comes in. 

When a group has members from another trusted domain, the group contains “phantom objects” in place of the actual object of the other domain. These phantom objects cannot be seen in LDAP or ADSI and they contain the Distinguished Name (DN), GUID, and SID of the referenced object in the other domain. When the remote object is added to a group the local DC where it is added creates the phantom object. Every 2 days (the period can be changed via a registry key HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\NTDS\Parameters) the DC holding the Infrastructure Master role goes through all the phantom objects in its domain and checks them against the Global Catalog (GC) (because the Global Catalog contains partial information of all objects across all domains in the forest). If there are changes or deletions it informs the other DCs of this. 

Here’s how the changes are passed on to the other DCs: 

  • The DC with the Infrastructure Master role creates an object of class Infrastructure-Update in the CN=Infrastructure,DC=DomainName container. 
    • If the original object was renamed, then the DNReferenceUpdate attribute of this object contains the new value. 
    • If the original object was deleted, then the DN is updated with a suffix (esc)DEL:GUID. (This is what happens when an object is usually deleted – even in the local domain. It is not really deleted, only “tombstoned” – wherein the object is moved to a special container, its DN is updated like above, and all its other attributes are removed. This way other DCs know the object is now deleted. Only after a certain period is this tombstoned object really removed from the database. Hopefully by this time the information has replicated to all other DCs and they know the object is to be deleted). See this blog post for a screenshot of how the DN looks.  
  • The DC now deletes the object it created. This tombstones the object as I described above (i.e. the DN of this object now has its DN suffixed with (esc)DEL:GUID and all other attributes – except the ones added above – are removed).  
  • The tombstoned object is now replicated to all other DCs in the domain. 
  • The other DCs see this deleted object of class Infrastructure-Update and update their copies of the phantom object accordingly. 

A side effect of the above process is that the Infrastructure Master role cannot be on a DC that’s a GC. If the Infrastructure Master were on a GC, it does not store phantom objects because it already knows of the remote objects (by virtue of being a GC). So there’s nothing to compare, and other DCs won’t be updated with any changes. 

That said, if all DCs in the domain are also GCs, then the placement of the Infrastructure Master role doesn’t matter (as all DCs will all have up-to-date info on remote objects). 

Also, if the Recycle Bin feature is enabled in the forest (for this all DCs must be Windows 2008 R2 and the forest functional level should be Windows Server 2008 R2 or above (as part of raising the functional level the schema is upgraded with some new attributes)) objects aren’t deleted via tombstoning as I described above. Instead, when an object is deleted it is only “logically deleted“. The object is moved to a special container and DN changed as before, but now its other attributes are not wiped, and a flag is set indicating the object that it is deleted and some additional attributes are set indicating how long the object will be kept in the logically deleted state (during which period it can be restored from the Recycle Bin without losing any of the attributes). Moreover, links to the “logically deleted” object are still maintained (because the object can be un-deleted any time). Because of these changes every DC is now responsible for updating references to objects in other domains by itself (I am not sure why!). Thus the Infrastructure Master role is no longer relevant once the Recycle Bin feature is enabled.  

The “Change Infrastructure Master” right is required to transfer/ seize the Infrastructure Master role to a different DC. By default the Domain Admins group members have this right.

The DC with the Infrastructure Master role is where you usually run the adprep /domainprep command. This command prepares the domain for any new DCs of a later version (for example installing a Windows Server 2008 DC in a Windows Server 2003 domain).

There’s some more stuff I wanted to write in this blog post. If I get a chance I’ll make another post with those …

Notes on Windows LSA, Secure Channel, NTLM, etc.

These are some notes to myself on the Windows security subsystem. 

On Windows, the Local Security Authority (LSA) is a subsystem that is responsible for security of the system. The LSA runs as a process called the LSA Subsystem Service (LSASS; you can find it as c:\Windows\System32\lsass.exe) and takes care of two tasks: (1) authentication and (2) enforcing local security policies on system.

For authentication the LSA makes uses of Security Support Providers (SSPs) that provide various authentication protocols. SSPs are Dynamic Link Libraries (DLLs) that offer the authentication protocol to applications that wish to make use of it. They expose a Security Service Provider Interface (SSPI) API which applications can make use of without knowing about the underlying protocol. (Generic Security Service Application Program Interface (GSSAPI or GSS-API) is an IETF standard that defines an API for programs to access security services. SSPI is is a proprietary variant of GSSAPI). 

In a way this post ties in with other things I have been reading about and posted recently. Stuff like encryption ciphers and Active Directory. On domain joined machines for instance, LSA uses Active Directory, while on non-domain joined machines LSA uses Security Accounts Manager (SAM). Either case the LSA is a critical component. 

It is possible to create custom SSPs to support new protocols. Microsoft includes the following SSPs (may not be an exhaustive list).

Kerberos (Kerberos.dll)

  • Provides the Kerberos authentication protocol. This is the protocol of choice in Windows. 
  • Kerberos cannot be used with non-domain joined systems. 
  • More about Kerberos in a later post. I plan to cover it as part of Active Directory. 

NTLM — LM, NTLM, and NTLMv2 (Msv1_0.dll)

  • Provides the NTLM authentication protocol.
    • LM == LAN Manager (also called as LAN MAN). It’s an old way of authentication – from pre Windows NT days. Not recommended any more. 
    • NTLM == NT LAN Manager. Is a successor to LM. Introduced with Windows NT. Is backward compatible to LAN MAN. It too is not recommended any more. It’s worth pointing out that NTLM uses RC4 for encryption (which is insecure as I previously pointed out).
    • NTLMv2 == NT LAN Manager version 2. Is a successor to NTLM. Introduced in Windows 2000 (and in Windows NT as part of SP4). It is the current recommended alternative to LM and NTLM and is the default since Windows Vista.
  • Although Kerberos is the preferred protocol NTLM is still supported by Windows.
  • Also, NTLM must be used on standalone systems as these don’t support Kerberos. 
  • NTLM is a challenge/ response type of authentication protocol. Here’s how it works roughly:
    • The client sends its username to the server. This could be a domain user or a local user (i.e. stored in the server SAM database). Notice that the password isn’t sent. 
    • To authenticate, the server sends some random data to the client – the challenge
    • The client encrypts this data with a hash of its password – the response. Notice that the hash of the password is used as a key to encrypt the data. 
    • If the username is stored in the server SAM database, the hash of the password will be present with the username. The server simply uses this hash to encrypt its challenge, compares the result with the response from the client, and if the two match authenticates the client. 
    • If the username is not stored in the server SAM database, it sends the username, the challenge, and response to a Domain Controller. The Domain Controller will have the password hash along with the username, so it looks these up and performs similar steps as above, compares the two results, and if they match authenticates the client.  
  • Here are some interesting blog posts on NTLM security:
    • NTLM Challenge Response is 100% Broken – talks about vulnerabilities in NTLM & LM, and why it’s better to use NTLMv2.
    • The Most Misunderstood Windows Security Setting of All Time – about the HKLM\SYSTEM\CurrentControlSet\Control\Lsa\LmCompatibilityLevel registry key (mentioned in the above blog post too) which affects how Windows uses NTLMv2. By default Vista and above only send NTLMv2 reponses but accept LM, NTLM, and NTLMv2 challenges. This post also goes into how NTLMv2 performs the challenge/ response I mention above differently.
      • NTLMv2 uses a different hash function (HMAC-MD5 instead of MD4).
      • NTLMv2 also includes a challenge from the client to the server. 
      • It is also possible to use NTLM for authentication with NTLMv2 for session security.
    • Rehashing Pass the Hash – a blog post of Pass the Hash (PtH) which is about stealing the stored password hash (in memory) from the client and using that to authenticate as the client elsewhere (since the hash is equivalent to the password, getting hold of the hash is sufficient). This post also made me realize that LM/ NTLM/ NTLMv2 hashes are unsalted – i.e. the password is hashed and stored, there’s no extra bit added to the password before salting just to make it difficult for attackers to guess the password. (Very briefly: if my password is “Password” and its hashed as it is to “12345”, all any attacker needs to do is try a large number of passwords and compare their hash with “12345”. Whichever one matches is what my password would be! Attackers can create “hash tables” that contain words and their hashes, so they don’t even have to compute the hash to guess my password. To work around this most systems salt the hash. That is, the add some random text – which varies for each user – to the password, so instead of hashing “Password” the system would hash “xxxPassword”. Now an attacker can’t simply reuse any existing hashtables, thus improving security).
      • A good blog post that illustrates Pass the Hash.
      • A PDF presentation that talks about Pass the Hash.
      • Windows 8.1 makes it difficult to do Pass-the-Hash. As the post says, you cannot eliminate Pass-the-Hash attacks as long as the hash is not in some way tied to the hardware machine.
    • When you login to the domain, your computer caches a hash of the password so that you can login even if your Domain Controller is down/ unreachable. This cache stores an MD4 hash of the “MD4 hash of the password + plus the username”.
    • If all the above isn’t enough and you want to know even more about how NTLM works look no further than this page by Eric Glass. :)

Negotiate (secur32.dll)

  • This is a psuedo-SSP. Also called the Simple and Protected GSS-API Negotiation Mechanism (SPNEGO). 
  • It lets clients and servers negotiate a protocol to use for further authentication – NTLM or Kerberos. That’s why it is a psuedo-SSP, it doesn’t provide any authentication of its own. 
  • Kerberos is always selected unless one of the parties cannot use it. 
  • Also, if an SPN (Service Principal Name), NetBIOS name, or UPN (User Principal Name) is not given then Kerberos will not be used. Thus if you connect to a server via IP address then NTLM will be used. 

CredSSP (credssp.dll)

  • Provides the Credential Security Support Provider (CredSSP) protocol. 
  • This allows for user credentials from a client to be delegated to a server for remote authentication from there on. CredSSP was introduced in Windows Vista.
  • Some day I’ll write a blog post on CredSSP and PowerShell :) but for now I’ll point to this Scripting Guy blog post that gives an example of how CredSSP is used. If I connect remotely to some machine – meaning I have authenticated with it – and now I want to connect to some other machine from this machine (maybe I want to open a shared folder), there must be some way for my credentials to be passed to this first machine I am connected to so it can authenticate me with the second machine. That’s where CredSSP comes into play. 
  • CredSSP uses TLS/SSL and the Negotiate/SPNGO SSP to delegate credentials. 
  • More about CredSSP at this MSDN article

Digest (Wdigest.dll)

  • Provides the Digest protocol. See these TechNet articles for more on Digest authentication and how it works
  • Like NTLM, this is a challenge/ response type of authentication protocol. Mostly used with HTTP or LDAP. 
  • There is no encryption involved, only hashing. Can be used in conjunction with SSL/TLS for encryption.

SChannel (Schannel.dll)

  • Provides the SSL/ TLS authentication protocols and support for Public Key Infrastructure (PKI). 
  • Different versions of Windows have different support for TLS/SSL/ DTLS because the SChannel SSP in that version of Windows only supports certain features. For instance:
  • More about SChannel at this TechNet page
  • Used when visiting websites via HTTPS.
  • Used by domain joined machines when talking to Domain Controllers – for validation, changing machine account password, NTLM authentication pass-through, SID look-up, group policies etc.
    • Used between domain machines and Domain Controllers, as well as between Domain Controllers. In case of the latter secure channel is also used for replication. Secure channels also exist between DCs in different trusted domain. 
    • Upon boot up every domain machine will discover a DC, authenticate its machine password with the DC, and create a secure channel to the DC. The Netlogon service maintains the secure channel. 
    • Every machine account in the domain has a password. This password is used to create the secure channel with the domain. 
      • Upon boot up every domain machine will discover a DC, authenticate its machine password with the DC, and create a secure channel to the DC. 
    • This is a good post to read on how to find if secure channel is broken. It shows three methods to identify a problem – NLTest, PowerShell, WMI – and if the secure channel is broken because the machine password is different from what AD has then the NLTest /SC_RESET:<DomainName> command can reset it.
      • A note on the machine password: machine account passwords do not expire in AD (unlike user account passwords). Every 30 days (configurable via a registry key) the Netlogon service of the machine will initiate a password change. Before changing the password it will test whether a secure channel exists. Only after creating a secure channel will it change the password. This post is worth reading for more info. These password changes can be disable via a registry key/ group policy
  • More on how SSL/TLS is implemented in SChannel can be found at this TechNet page.

Get-ADRootDSE

Just a note to myself. A quick way to find the domain functional level as well as other details is the Get-ADRootDSE cmdlet. By default it connects to the DC you are on (or authenticated with if you are running it from a client) but you can specify a different server via the -Server <servername> switch.

Example default output:

Example stuff you can do with it:

 

Active Directory: Troubleshooting Domain Controller critical services

These are notes from the AD Troubleshooting WorkshopPLUS session I attended. The notes are on troubleshooting Domain Controller critical services. I am mostly following what was discussed in class here rather than add anything new (except in the section of SC where I talk a bit about it).

Before moving on let’s recap the DC critical services from my previous post:

  • DHCP client / DNS client – registers the DCs A and PTR records
    • DHCP client for Server 2003 and prior
    • DNS client for Server 2008 and later
  • FRS / DFSR – responsible for SYSVOL replication between DCs
    • FRS is now deprecated, may or may not be used in the domain. DFSR is the replacement.
    • If the domain was born in functional level 2008 (i.e. all DCs are Server 2008 or later) then DFRS is used.
    • Else FRS could be in use unless it was migrated.  
  • DNS server – used by DCs to locate each other, clients to locate DCs
  • KDC – used for Kerberos authentication in the domain
  • Netlogon – maintains secure channel between DCs and other DCs and clients; also updates DNS with the SRV records
    • Secure channel is used for Kerberos authentication and AD replication
    • DNS records are also written to %systemroot%\system32\config\Netlogon.DNS in case manual updating of DNS server is required.
  • Windows Time – maintains correct time in the domain, required for Kerberos authentication and AD replication
  • AD DS – provides AD
  • AD WDS – provides a web interface to AD

Event Viewer

In case of issues the Event Viewer is the best place to start troubleshooting from. Bear in mind merely looking at the System and Application logs as most admins do is not enough. AD specific events are usually logged under the Custom Views > Server Roles section. 

ad-events

Event IDs for some of the common problems can be found at this link. Some more event IDs and their resolution can be found at this link. The previous two links are worth a read in that they also give a high level overview of AD and troubleshooting.  

DcDiag

This has a separate post of its own now.

Service Controller (SC)

This is a command I haven’t used much except in the context of checking for drivers. Try the following if you want to get a list of all active drivers on your system:

Omit the pipe and findstr after that if you want more details. SC is cool in that it can do remote computers too:

But drivers are just one type of objects SC can query. If you omit the type= driver SC returns services (and if you set type= All SC returns both drivers and services).

For example, to get a list of all services on the machine

An example entry in the output looks like this:

Too much info, so to output just the Service Name, Display Name, and State use findstr:

Services can be stopped and started using the following commands:

 

SC has its limitations though, in that you can’t stop a service if it has other services dependent on it. To my knowledge SC doesn’t have a way of enumerate services that depend on a particular service either, so there’s no way to manually stop all those services via a batch file or something. That said, SC can find which services a particular service depends upon via the sc qc command. For example:

Given a service you can also get its description. For example:

Like I said, I don’t use SC much except to query drivers. What I typically use for querying services is PowerShell.

PowerShell

  • Start-Service
  • Stop-Service
  • Restart-Service
  • Get-Service

I have noticed that sometimes the results from Get-Service and sc query vary. A recent example was when I did Get-Service NTDS on a Server 2008 R2 machine and it returned nothing while sc query NTDS returned results as expected.

Even WMIC is able to find NTDS above, but Get-Service doesn’t. Go figure!

Be mindful of the symptoms

One thing that was emphasized in class a lot is that while troubleshooting start with the symptoms (doh!). As in, think of the symptoms you are experiencing and work backwards from them as to what critical services could be down/ broken which might be leading to these symptoms. That will give you a good starting point to troubleshoot and then you can use the tools above to dig deeper and identify the problem. AD is a complex system made up of many moving parts, so a good understanding of the underlying structure and how they tie in together is important.

Active Directory: Troubleshooting with DcDiag (part 2)

Continuing from here

LocatorCheck

  • Checks whether DCs have certain required knowledge/ ability. Specifically, whether the DC that’s tested knows of or can be a:
    • The Global Catalog (GC)
    • The Primary Domain Controller (PDC)
    • Kerberos Key Distribution Centre (KDC)
    • Time Server
    • Preferred Time Server
  • By itself the test doesn’t output much info:

    To get more details one has to use the /v switch. Then output similar to the following will be returned:

    Note that the DC itself needn’t be offering one of the servers. But it must know who else offers these and be able to refer. For instance, in the case of my domain WIN-DC03 (the server I am testing against) isn’t a GC or PDC so it returns WIN-DC01 as these. It is a time server, but is not a preferred time server (as that’s the forest root domain PDC), so the output is accordingly.

Intersite

  • Checks for failures that could affect Intersite replication.
  • Warning: By default the test silently skips doing anything and simply returns a success! Note the output below:

    As you can see from the verbose output the test actually does nothing.

  • To make the test do something one must specify the /a or /e switches (all DCs in the site or all DCs in the enterprise, respectively).

    Now WIN-DC02 is flagged as having issues. The /e will throw even more light:

    (In this case the router between the two sites was shutdown and so Intersite replication was failing. Hence the errors above.

  • This test doesn’t seem to force an Intersite replication. It only connects to the servers and checks for errors, I think. For instance, when I turned on the router above and verified the two DCs can see each other, forced an enterprise wide replication (repadmin /syncall win-dc01 /e /A) (tell WIN-DC01 to ask all its partners to replication, enterprise-wide, all NCs), and double checked the replication status (repadmin.exe /showrepl WIN-DC01) – everything was working fine, but the Intersite test still complains. Not the same errors as above, but different errors. The test passes but there are warnings that each site doesn’t have a Bridgehead yet because of errors. After about 15 mins the errors clears.
  • Intersite replication, Bridgeheads, and InterSite Topology Generators (ISTG) are part of later posts.

KccEvent

  • Checks whether the Knowledge Consistency Checker (KCC) has any errors. 
  • This test only checks the “Directory Services” event log of the specified server for any errors in the last 15 mins. (If you run the test with the /v switch it even says so). 

KnowsOfRoleHolders

  • Checks whether the DC knows of various Flexible Single Master Operations (FSMO) role holders in the domain. (FSMO is part of a later post so I won’t elaborate it here). 
  • By default the answer is just a pass or fail. 
  • Use with the /v switch to know what the DC thinks it knows: 
  • Good test to run after a role change to see whether all DCs in the domain/ enterprise know of the new role holder.

MachineAccount

  • Checks whether the DC’s machine account exists, is in the Domain Controllers OU, and Service Principal Names (SPNs) are correctly registered.
  • This is yet another test that only returns a pass or fail by default. Use with the /v switch to get a list of the registered SPNs.
  • Notice that the CheckSecurityError test also checks SPNs. CheckSecurityError is only run on demand, however.
  • Add the /RecreateMachineAccount switch to recreate the machine account if missing. Note: this does not recreate missing SPNs.
  • Add the /FixMachineAccount switch to fix if the machine account flags are incorrect (am not sure what flags these are …).
  • SPNs can be added/ modified/ deleted using the Setspn command.

NCSecDesc

  • Checks whether all the Naming Contexts on the DC have correct security permissions for replication.

NetLogons

  • Checks whether the Netlogon and SYSVOL shares are available and can be accessed.
  • I pointed out this test previously under the SysVolCheck test. The latter gives the impression it actually checks the SYSVOL shares, but it doesn’t. NetLogons is the one that checks.

ObjectsReplicated

  • Checks whether the DCs machine account and DSA objects have replicated. The DC machine account object is CN=,OU=Domain Controllers,... in the domain NC; the DSA object is CN=NTDS Settings,CN=,CN=Servers,CN=,... in the configuration NC.
  • This test is better run with the /a or /e switches. Without these switches it only checks the DC you test against to see whether it has its own objects. With the switches it checks all the objects for all DCs in the site/ enterprise on all DCs in the site/ enterprise. Which is what you really want.
  • It is also possible to check a specific object via the /objectdn: or limit to DCs holding a specific NC via the /n: switch.

    For example:

    Check all DCs holding the default naming context (rakhesh.local) across all sites:

    Check al DCs holding a specified application NC across all sites:

    I had created the SomeApp2 previously. It is only replicated to the WIN-DC01 and WIN-DC03 servers so the test above will only check those servers. (To recap: you can find the DCs a NC is replicated to from the ms-DS-NC-Replica-Locations attribute of its object in the Partitions container). Note that I had to specify a server above. That’s because without specifying a DC name there’s no way to identify which DCs know of this NC (Note: “know of”, its not necessary they hold the NC, they should only know where to point to). Unlike a domain NC which has DNS entries to help identify the DCs holding it, other NCs have no such mechanism. Below is the error you get if you don’t specify a DC name as above:

    Lastly, it’s also possible to check for the replication status of a specific object. Very useful for testing purposes. Make a test object on one DC, force a replication, wait some time, then test whether that object has replicated to all DCs in your site/ enterprise. (Sure you could connect to each DC via ADUC or ADSIEdit, but this is way more convenient!)

    Below command checks whether the specified user account has replicated to all DCs in the domain:

    I specify a NC above (the /n switch) because I am running DCDiag from a client so I must specify either a server to use (the /s switch) or a NC based on which a DC can be found. If run from a DC then the NC can be omitted.

OutboundSecureChannels

  • Checks whether all DCs in the domain (by default only those in the current site) have a secure channel to DCs in the trusted domain specified by the /testdomain: switch.
  • There seems to be a misunderstanding that this test checks secure channels between DCs of the same domain. That’s not the case, it’s between DCs of two trusted domains.
  • Use the /nositerestriction switch to not limit the test to all DCs in the same site.
  • This test is not run by default. It must be explicitly specified.

RegisterInDNS

  • Checks whether the server being tested can register “A” DNS records. The DNS domain name must be specified via the /DnsDomain: switch.
  • This test is similar to the DcPromo test mentioned previously.
  • This test isn’t run by default.

Replications

  • Checks whether all of the DCs replication partners are able to replicate to it. By default only those in the same site are tested.
  • It contacts each of the partners to get a status update from them. The test also checks whether there’s a replication latency of more than 12 hours.
  • Output from WIN-DC01 in my domain when I disconnected its partner WIN-DC03. WIN-DC02 is not checked as it’s in a different site.

RidManager

  • Checks whether the DC with the RID Master FSMO role is accessible and contains proper information. Use with the /v to get more details on the findings (allocation pool, next available, etc).
  • Example output:

Services

  • Checks whether various AD required services are running the DC.
  • Following services are tested:

    This list is similar (not same!) to the DC critical services list. Notably it doesn’t check if the “DNS Server” and “AD WS” services are running.

SystemLog

  • Checks the System Log for any errors in the last 60 mins (or less if the server uptime is less than 60 mins).

Topology

  • Checks whether the server has a fully connected topology for replication of each of its NCs.
  • Note that the test does not actually check if the servers in the topology are online/ connected. For that use the Replications and CutOffServers tests. This test only checks if the topology is logically fully connected.
  • This test is not run by default. It must be explicitly specified.

VerifyEnterpriseReferences

and

VerifyReferences

  • Checks whether system references required for the FRS and replication infrastructure are present on each DCs. The “Enterprise” variant tests whether references for replication to all DCs in the enterprise are present.
  • Note: I am not very clear what this test does (but feel free to look at Ned’s blog post for more info) and I have been writing this post over many days so I am too lazy to research further either. :) I’ll update this post later if I find more info on the test.
  • This test is not run by default. It must be explicitly specified.

VerifyReplicas

  • Checks whether all the application NCs have replicated to the DCs that should contain a copy.
  • Seems to be similar to the CheckSDRefDom test but more concerned with whether the DCs host a copy or not.
  • This test is not run by default. It must be explicitly specified.

That’s all! Phew! :)

Crazy day!

Today has been a crazy day! For one I have been up till 2 AM today and yesterday morning because I am attending the Azure Iaas sessions and they run from 21:00 to 01:00 my time! I sleep by 02:00, then wake up around 06:45, and two days of doing that has taken a toll on my I think. Today after waking up I went back to bed and tried to sleep till around 09:00 but didn’t make much progress. So my head feels a bit woozy and I have been living on loads of coffee. :)

None of that matters too much really but today has been a crazy day. There’s so many things I want to do but I seem to keep getting distracted. My laptop went a bit crazy today (my fault, updating drivers! never do that when u have other stuff to do) and I am torn between playing with Azure or continuing my AD posts. Eventually I ended up playing a bit with Azure and am now on to the AD posts. I don’t want to lose steam of writing the AD posts, but at the same time I want to explore Azure too so it make sense to me and is fresh in the moment. Yesterday’s sessions were great, for instance, and I was helped by the fact that I had spent the morning reading about storage blobs and such and created a VM on Azure just for the heck of it. So in the evening, during the sessions, it made more sense to me and I could try and do stuff in the Azure portal as the speakers were explaining. The sessions too were superb! Except the last one, which was superb of course, but I couldn’t relate much to it as it was about Disaster Recovery (DR) and I haven’t used SCVMM (System Centre Virtual Machine Manager) which is what you use for DR and Azure. Moreover that session had a lot more demo bits and my Internet link isn’t that great so I get a very fuzzy demo which means I can barely make out what’s being shown!

Anyhoo, so there’s Azure and AD on one hand. And laptop troubles on the other. Added to that Xmarks on my browsers is playing up so my bookmarks aren’t being kept in sync and I am having to spend time manually syncing them. All of this is in the context of a sleepy brain. Oh, and I tried to use VPN to Private Internet Access on my new phone (so I could listen to Songza) and that doesn’t work coz my ISP is blocking UDP access to the Private Internet Access server names. TCP is working fine and streaming isn’t affected thankfully, but now I have this itch to update my OpenVPN config files for Private Internet Access with IP address versions and import that into the phone. Gotta do that but I don’t want to go off on a tangent with that now! Ideally I should be working on the AD post – which I did for a bit – but here I am writing a post about my crazy day. See, distractions all around! :)

Active Directory: Troubleshooting with DcDiag (part 1)

This post originally began as notes on troubleshooting Domain Controller critical services. But when I started talking about DcDiag I went into a tangent explaining each of its tests. That took much longer than I expected – both in terms of effort and post length – so I decided to split it into a post of its own. My notes aren’t complete yet, what follows below is only the first part.

While writing this post I discovered a similar one from the Directory Services Team. It’s by NedPyle, who’s just great when it comes to writing cool posts that explain things, so you should definitely check it out.

DcDiag is your best friend when it comes to troubleshooting AD issues. It is a command-line tool that can identify issues with AD. By default DcDiag will run a series of “default” tests on the DC it is invoked, but it can be asked to run more tests and also test multiple DCs in the site (the /a switch) or across all sites (the /e switch). A quick glance at the DcDiag output is usually enough to tell me where to look further.

For instance, while typing this post I ran DcDiag to check all DCs in one of my sites:

I ran the above from WIN-DC01 and you can see I was straight-away alerted that WIN-DC03 could be having issues. I say “could be” because the errors only say that DcDiag cannot contact the RPC server on WIN-DC03 to check for those particular tests – this doesn’t necessarily mean WIN-DC03 is failing those tests, just that maybe there’s a firewall blocking communication or perhaps the RPC service is down. To confirm this I ran the same test on WIN-DC03 and they succeeded, indicating that WIN-DC03 itself is fine so there’s a communication problem between DcDiag on WIN-DC01 and WIN-DC03. Moreover, DcDiag from WIN-DC03 can query WIN-DC01 so the issue is with WIN-DC03. (In this particular case it was the firewall on WIN-DC03).

Here’s a list of the tests DcDiag can perform:

Advertising

  • Checks whether the Directory System Agent (DSA) is advertising itself. The DSA is a set of services and processes running on every DC. The DSA is what allows clients to access the Active Directory data store. Clients talk to DSA using LDAP (used by Window XP and above), SAM (used by Windows NT), MAPI RPC (used by Exchange server and other MAPI clients), or RPC (used by DCs/DSAs to talk to each other and replicate AD information). More info on the DSA can be found in this Microsoft document.
  • You can think of the DSA as the kernel of the DC – the DSA is what lets a DC behave like a DC, the DSA is what we are really talking about when referring to DCs.
  • Although DNS is used by domain members (and other DCs) to locate DCs in the domain, for a DC to be actually used by others the DSA must be advertising the roles it provides. The nltest command can be used to view what roles a DSA is advertising. For example:

    Notice the flags section. Among other things the DSA advertises that this DC holds the PDC FSMO role (PDC), is a Global Catalog (GC), and that it is a reliable time source (GTIMESERV). Compare the above output with another DC:

    The PDC, GC, and GTIMESERV flags advertised by WIN-DC01 are missing here because this DC does not provide any of those roles. Being a DC it can act as a time source for domain member, hence the TIMESERV flag is present.

  • When DCs replicate they refer to each other via the DSA name rather than the DC name (further enforcing my point from before that the DSA can be thought of as the kernel of the DC – it is what really matters).

    That is why even though a DC in my domain may have the DNS name WIN-DC01.rakhesh.local, in the additional structure that’s used by AD (which I’ll come to later) there’s an entry such as bdb02ab9-5103-4254-9403-a7687ba91488._msdcs.rakhesh.local which is a CNAME to the regular name. These CNAME entries are created by the Netlogon service and are of the format DsaGuid._msdcs.DNSForestName – the CNAME hostname is actually the GUID of the DSA.

  • If you open Active Directory Sites and Services, drill down to a site, then Servers, then expand a particular server – you’ll see the “NTDS Settings” object. This is the DSA. If you right click this object, go to Properties, and select the “Attribute Editor” tab, you will find an attribute called objectGUID. That is the GUID of the DSA – the same GUID that’s there in the CNAME entry.
    ntds-settings

CheckSDRefDom

Before talking about CheckSDRefDom it’s worth talking about directory partitions (also called as Naming Contexts (NC)).

An AD domain is part of a forest. A forest can contain many domains. All these domains share the same schema and configuration, but different domain data. Every DC in the forest thus has some data that’s particular to the domain it belongs to and is replicated with other DCs in the domain; and other data that’s common to the forest and replicated with all DCs in the forest. These are what’s referred to as directory partitions / naming contexts.

Every DC has four directory partitions. These can be viewed using ADSI Edit (adsiedit.msc) tool.

  • “Default naming context” (also known as “Domain”) which contains the domain specific data;
  • “Configuration” (CN=Configuration,DC=forestRootDomain) which contains the configuration objects for the entire forest; and
  • “Schema” (CN=Schema,CN=Configuration,DC=forestRootDomain) which contains class and attribute definitions for all existing and possible objects of the forest. Even though the Schema partition hierarchically looks like it is under the Configuration partition, it is a separate partition.
  • “Application” (CN=...,CN=forestRootDomain – there can be many such partitions) which was introduced in Server 2003 and are user/ application defined partitions that can contain any object except security principals. The replication of these partitions is not bound by domain boundaries – they can be replicated to selected DCs in the forest even if they are in different domains.
    • A common example of Application partitions are CN=ForestDnsZones,CN=forestRootDomain and CN=DomainDnsZones,CN=forestRootDomain which hold DNS zones replicated to all DNS servers in the forest and domain respectively (note that it is not replicated to all DCs in the forest and domain respectively, only a subset of the DCs – the ones that are also DNS servers).

 

If you open ADSI Edit and connect to the RootDSE “context”, then right click the RootDSE container and check its namingContexts attribute you’ll find a list of all directory partitions, including the multiple Application partitions.

rootDSE

Here you’ll also find other attributes such as:

  • defaultNamingContext (DN of the Domain directory partition the DC you are connected to is authoritative for),
  • configurationNamingContext (DN of the Configuration directory partition),
  • schemaNamingContext (DN of the Schema directory partition), and
  • rootNamingContext (DN of the Domain directory partition for the Forest Root domain)

The Configuration partition has a container called Partitions (CN=Partitions,CN=Configuration,DC=forestRootDomain) which contains cross-references to every directory partition in the forest – i.e. Application, Schema, and Configuration directory partitions, as well as all Domain directory partitions. The beauty of cross-references is that they are present in the Configuration partition and hence replicated to all DCs in the forest. Thus even if a DC doesn’t hold a particular NC it can check these cross-references and identify which DC/ domain might hold more information. This makes it possible to refer clients asking for more info to other domains.

The cross-references are actually objects of a class called crossRef.

  • What the CheckSDRefDom test does is that it checks whether the cross-references have an attribute called msDS-SDReferenceDomain set.
  • What does this mean?
    • An Application NC, by definition, isn’t tied to a particular domain. That makes it tricky from a security point of view because if its ACL has security descriptor referring to groups/ users that could belong to any domain (e.g. “Domain Admins”, “Administrator”) there’s no way to identify which domain must be used as the reference.
    • To avoid such situations, cross references to Application directory partitions contain an msDS-SDReferenceDomain attribute which specifies the reference domain.
  • So what the CheckSDRefDom test really does is that it verifies all the Application directory partitions have a reference domain set.
    • In case a reference domain isn’t set, you can always set it using ADSI Edit or other techniques. You can also delegate this.

CheckSecurityError

  • Checks for any security related errors on the DC that might be causing replication issues.
  • Some of the tests done are:
    1. Verify that KDC is working (not necessarily on the target DC, the test only checks that a KDC server is reachable anywhere in the domain, preferably in the same site; even if the target DC KDC service is down but some other KDC server is reachable the test passes)
    2. Verify that the DC”s computer object exists and is within the “Domain Controllers” OU and replicated to other DCs
    3. Check for any KDC packet fragmentation that could cause issues
    4. Check KDC time skew (remember I mentioned previously of the 5 minute tolerance)
    5. Check Service Principle Name (SPN) registration (I’ll talk about SPNs in a later post; check this link for a quick look at what they are and the errors they can cause)
  • This test is not run by default. It must be explicitly specified.
  • Can specify an optional parameter /replsource:... to perform similar tests on that DC and also check the ability to create a replication link between that DC and the DC we are testing against.

Connectivity

  • This is the only DcDiag test that you cannot skip. It runs by default, and is also run even if you perform a specific test.
  • It tests whether the DSAs are registered in DNS, whether they are ping-able, and have LDAP/ RPC connectivity.

CrossRefValidation

Before talking about CheckRefValidation it’s worth revisiting cross-references and application NCs.

Application NCs are actually objects of a class domainDNS with an instanceType attribute value of 5 (DS_INSTANCETYPE_IS_NC_HEAD | DS_INSTANCETYPE_NC_IS_WRITEABLE).

You can create an application NC, for instance, by opening up ADSI Edit and going to the Domain NC, right click, new object, of type domainDNS, enter a Domain Component (DC) value what you want, click Next, then click “More Attributes”, select to view Mandatory/ Both type of properties, find instanceType from the property drop list, and enter a value of 5.
dnsDomain
dnsDomain2The above can be done anywhere in the domain NC. It is also possible to nest application NCs within other application NCs.

Here’s what happens behind the scenes when you make an application NC as above:

  • The application NC isn’t created straight-way.
  • First, the the DSA will check the cross-references in CN=Partitions,CN=Configuration,DC=forestRootDomain to see if one already exists to an Application NC with the same name as you specified.
    • If a cross-reference is found and the NC it points to actually exists then an error will be thrown.
    • If a cross-reference is found but the NC it points to doesn’t exist, then that cross-reference will be used for the new Application NC.
    • If a cross-reference cannot be found, a new one will be created.
  • Cross references (objects of class crossRef) have some important attributes:
    1. CN – the CN of this cross-reference (could be a name such as “CN=SomeApp” or a random GUID “CN=a97a34e3-f751-489d-b1d7-1041366c2b32”)
    2. nCName – the DN of the application NC (e.g. DC=SomeApp,DC=rakhesh,DC=local)
    3. dnsRoot – the DNS domain name where servers that contain this NC can be found (e.g. SomeApp.rakhesh.local).

      (Note this as it’s quite brilliant!) When a new application NC is created, DSA also creates a corresponding zone in DNS. This zone contains all the servers that carry this zone. In the screenshot below, for instance, note the zones DomainDnsZones, ForestDnsZones, and SomeApp2 (which belongs to a zone I created). Note that by querying for all SRV records of name _ldap in _tcp.SomeApp2.rakhesh.local one can easily find the DCs carrying this partition: dnsRoot For the example above, dnsRoot would be “SomeApp2.rakhesh.local” as that’s the DNS domain name.

    4. msDS-NC-Replica-Locations – a list of Distinguished Names (DNs) of DSAs where this application NC is replicated to (e.g. CN=NTDS Settings,CN=WIN-DC01,CN=Servers,CN=COCHIN,CN=Sites,CN=Configuration,DC=rakhesh,DC=local, CN=NTDS Settings,CN=WIN-DC03,CN=Servers,CN=COCHIN,CN=Sites,CN=Configuration,DC=rakhesh,DC=local). replica-locations Initially this attribute has only one entry – the DC where the NC was first created. Other entries can be added later.
    5. Enabled – usually not set, but if it’s set to FALSE it indicates the cross-reference is not in use
  • Once a cross-reference is identified (an existing or a new one) the Configuration NC is replicated through the forest. Then the Application NC is actually created (an object of class domainDNS object as mentioned earlier with an instanceType attribute value of 5 (DS_INSTANCETYPE_IS_NC_HEAD | DS_INSTANCETYPE_NC_IS_WRITEABLE).
  • Lastly, all DCs that hold a copy of this Application NC have their ms-DS-Has-Master-NCs attribute in the DSA object modified to include a DN of this NC. masterNCs

Back to the CrossRefValidation test, it validates the cross-references and the NCs they point to. For instance:

  • Ensure dnsRoot is valid (see here and here for some error messages)
  • Ensure nCName and other attributes are valid
  • Ensure the DN (and CN) are not mangled (in case of conflicts AD can “mangle” the names to reflect that there’s a conflict) (see here for an example of mangled entries)

CutoffServers

If you open AD Sites and Services, expand down to each site, the servers within them, and the NTDS Settings object under each server (which is basically the DSA), you can see the replication partners of each server. For instance here are the partners for two of my servers in one site:

partners1

partners2

Reason WIN-DC01 has links to both WIN-DC03 (in the same site as it) and WIN-DC02 (in a different site) while WIN-DC03 only has links to WIN-DC01 (and not WIN-DC02 which is in a different site) is because WIN-DC01 is acting as a the bridgehead server. The bridgehead server is the server that’s automatically chosen by AD to replicate changes between sites. Each site has a bridgehead server and these servers talk to each other for replication across the site link. All other DCs in the site only get inter-site changes via the bridgehead server of that site. More on it later when I talk about bridgehead servers some other day … for now this is a good post to give an intro on bridgehead servers.

partners3

WIN-DC02, which is my DC in the other site, similarly has only one replication partner WIN-DC01. So WIN-DC01 is kind of link the link between WIN-DC02 and WIN-DC03. If WIN-DC01 were to be offline then WIN-DC02 and WIN-DC03 would be cut off from each other (for a period until the mechanism that creates the topology between sites kicks in and makes WIN-DC03 the bridgehead server between site; or even forever if I “pin” WIN-DC01 as my preferred bridgehead server in which case when it goes down no one else can takeover). Or if the link that connects the two sites to each other were to fail again they’d be cut-off from each other.

  • So what the CutoffServers test does is that it tells you if any servers are cut-off from each other in the domain.
  • This test is not run by default. It must be explicitly specified.
  • This test is best run with the /e switch – which tells DcDiag to test all servers in the enterprise, across sites. In my experience is it’s run against a specific server it usually passes the test even if replication is down.
  • Also in my experience a server is up and running but only LDAP is down (maybe the AD DS service is stopped for instance) – and so it can’t replicate with partners and they are cut-off – the test doesn’t identify the servers as being cut-off. If the server/ link is down then the other servers are highlighted as cut-off.
  • For example I set WIN-DC01 as the preferred bridgehead in my scenario above. Then I disconnect it from the network, leaving WIN-DC02 and WIN-DC03 cut-off.

    If I test WINDC-03 only it passes the test:

    That’s clearly misleading because replication isn’t happening:

    However if I run CutoffServers for the enterprise both WIN-DC02 and WIN-DC03 are correctly flagged:

    Not only is WIN-DC01 flagged in the Connectivity tests but the CutoffServers test also fails WIN-DC02 and WIN-DC03.

  • The /v switch (verbose) is also useful with this test. It will also show which NCs are failing due to the server being cut-off.

DcPromo

  • Checks whether from a DNS point of view the target server can be made a Domain Controller. If the test fails suggestions given.
  • The test has some mandatory switches:
    • /dnsdomain:...
    • /NewForest (a new forest) or /NewTree (a new domain in the forest you specify via /ForestRoot:...)or /ChildDomain (a new child domain) or /ReplicaDC (another DC in the same domain)
  • Needless to say this test isn’t run by default.

DNS

  • Checks the DNS health of the whole enterprise. It has many sub-tests. By default all sub-tests except one are run, but you can do specific sub-tests too.
  • This TechNet page is a comprehensive source of info on what the DNS test does. Tests include checking for zones, network connectivity, client configuration, delegations, dynamic updates, name resolution, and so on.
  • This test is not run by default.
  • Since it is an enterprise-wide test DcDiag requires Enterprise Admin credentials to run tests.

FrsEvent

  • Checks for any errors with the File Replication System (FRS).
  • It doesn’t seem to do an active test. It only checks the FRS Event Logs for any messages in the last 24 hours. If FRS is not used in the domain the test is silently skipped. (Specifying the /v switch will show that it’s being skipped).
  • Take the results with a pinch of salt. Chances are you had some errors but they are now resolved, but since the last 24 hours worth of logs are checked the test will flag previous error messages. Also, FRS may being used for non-SYSVOL replication and these might have errors but that doesn’t really matter as far as the DCs are concerned.
  • There may also be spurious errors a server’s Event Log is not accessible remotely and so the test fails.

DFSREvent

  • Checks for any errors with the Distributed File System Replication (DFSR).
  • Similar to the FrsEvent test. Same caveats apply.

SysVolCheck

  • Checks whether the SYSVOL share is ready.
  • In my experience this doesn’t test doesn’t seem to actually check whether the SYSVOL share is accessible. For example, consider the following:

    Notice SYSVOL exists. Now I delete it.

    But SysVolCheck will happy clear the DC as passing the test:

    So take these test results with a pinch of salt!

  • As an aside, in the case above the Netlogons test will flag the share as missing:
  • There is a registry key HKLM\System\CurrentControlSet\Services\Netlogon\Parameters\SysvolReady which has a value of 1 when SYSVOL is ready and a value of 0 when SYSVOL is not ready. Even if I turn this value to 0 – thus disabling SYSVOL, the SYSVOL and NETLOGON shares stop being shared – the SysvolCheck test still passes. NetLogons flags an error though.

Rest of the tests will be covered in a later post. Stay tuned!

Active Directory: Domain Controller critical services

The first of my (hopefully!) many posts on Active Directory, based on the WorkshopPLUS sessions I attended last month. Progress is slow as I don’t have much time, plus I am going through the slides and my notes and adding more information from the Internet and such. 

This one’s on the services that are critical for Domain Controllers to function properly. 

DHCP Client

  • In Server 2003 and before the DHCP Client service registers A, AAAA, and PTR records for the DC with DNS
  • In Server 2008 and above this is done by the DNS Client
  • Note that only the A and PTR records are registered. Other records are by the Netlogon service.

File Replication Services (FRS)

  • Replicates SVSVOL amongst DCs.
  • Starting with Server 2008 it is now in maintenance mode. DFSR replaces it.
    • To check whether your domain is still using FRS for SYSVOL replication, open the DFS Management console and see whether the “Domain System Volume” entry is present under “Replication” (if it is not, see whether it is available for adding to the display). If it is present then your domain is using DFSR for SYSVOL replication.
    • Alternatively, type the following command on your DC. If the output says “Eliminated” as below, your domain is using DFSR for SYSVOL. (Note this only works with domain functional level 2008 and above).
  • Stopping FRS for long periods can result in Group Policy distribution errors as SYSVOL isn’t replicated. Event ID 13568 in FRS log.

Distributed File System Replication (DFSR)

  • Replicates SYSVOL amongst DCs. Replaced functionality previously provided by FRS. 
  • DFSR was introduced with Server 2003 R2.
  • If the domain was born functional level 2008 – meaning all DCs are Server 2008 or higher – DFSR is used for SYSVOL replication.
    •  Once all pre-Server 2008 DCs are removed FRS can be migrated to DFSR. 
    • Apart from the dfsrmig command mentioned in the FRS section, the HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\DFSR\Parameters\SysVols\Migrating Sysvols\LocalState registry key can also be checked to see if DFSR is in use (a value of 3 means it is in use). 
  • If a DC is offline/ disconnected from its peers for a long time and Content Freshness Protection is turned on, when the DC is online/ reconnected DFSR might block SYSVOL replications to & from this DC – resuling in Group Policy distribution errors.
    • Content Freshness Protection is off by default. It needs to be manually turned on for each server.
    • Run the following command on each server to turn it on:

      Replace 60 with the maximum number of days it is acceptable for a DC or DFSR member to be offline. The recommended value is 60 days. And to turn off:

      To view the current setting:

    • Content Freshness Protection exists because of the way deletions work.
      • DFSR is multi-master, like AD, which means changes can be made on any server.
      • When you delete an item on one server, it can’t simply be deleted because then the item won’t exist any more and there’s no way for other servers to know if that’s the case because the item was deleted or because it wasn’t replicated to that server in the first place.
      • So what happens is that a deleted item is “tombstoned“. The item is removed from disk but a record for it remains the in DFSR database for 60 days (this period is called the “tombstone lifetime”) indicating this item as being deleted.
      • During these 60 days other DFSR servers can learn that the item is marked as deleted and thus act upon their copy of the item. After 60 days the record is removed from the database too.
      • In such a context, say we have DC that is offline for more than 60 days and say we have other DCs where files were removed from SYSVOL (replicated via DFSR). All the other DCs no longer have a copy of the file nor a record that it is deleted as 60 days has past and the file is removed for good.
      • When the previously offline DC replicates, it still has a copy of the file and it will pass this on to the other DCs. The other DCs don’t remember that this file was deleted (because they don’t have a record of its deletion any more as as 60 days has past) and so will happily replicate this file to their folders – resulting in a deleted file now appearing and causing corruption.
      • It is to avoid such situations that Content Freshness Protection was invented and is recommended to be turned on.
    • Here’s a good blog post from the Directory Services team explaining Content Freshness Protection.

DNS Client

  • For Server 2008 and above registers the A, AAAA, and PTR records for the DC with DNS (notice that when you change the DC IP address you do not have to update DNS manually – it is updated automatically. This is because of the DNS Client service).
  • Note that only the A, AAAA, and PTR records are registered. Other records are by the Netlogon service.  

DNS Server

  •  The glue for Active Directory. DNS is what domain controllers use to locate each other. DNS is what client computers use to find domain controllers. If this service is down both these functions fail.  

Kerberos Distribution Center (KDC)

  • Required for Kerberos 5.0 authentication. AD domains use Kerberos for authentication. If the KDC service is stopped Kerberos authentication fails. 
  • NTLM is not affected by this service. 

Netlogon

  • Maintains the secure channel between DCs and domain members (including other DCs). This secure channel is used for authentication (NTLS and Kerberos) and DC replication.
  • Writes the SRV and other records to DNS. These records are what domain members use to find DCs.
    • The records are also written to a file %systemroot%\system32\config\Netlogon.DNS. If the DNS server doesn’t support dynamic updates then the records in this text file must be manually created on the DNS server. 

Windows Time

  • Acts as an NTP client and server to keep time in sync across the domain. If this service is down and time is not in sync then Kerberos authentication and AD replication will fail (see resolution #5 in this link).
    • Kerberos authentication may not necessarily break for newer versions of Windows. But AD replication is still sensitive to time.  
  • The PDC of the forest root domain is the master time keeper of the forest. All other DCs in the forest will sync time from it.
    • The Windows Time service on every domain member looks to the DC that authenticates them for time time updates.
    • DCs in the domain look to the domain PDC for time updates. 
    • Domain PDCs look to the domain PDC of the domain above/ sibling to them. Except the forest root domain PDC who gets time from an external source (hardware source, Internet, etc).
  • From this link: there are two registry keys HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\W32Time\Config\MaxPosPhaseCorrection and HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\W32Time\Config\MaxNegPhaseCorrection that restrict the time updates accepted by the Windows Time service to the number of seconds defined by these values (the maximum and minimum range). This can be set directly in the registry or via a GPO. The recommended value is 172800 (i.e. 48 hours).

w32tm

The w32tm command can be used to manage time. For instance:

  • To get an idea of the time situation in the domain (who is the master time keeper, what is the offset of each of the DCs from this time keeper):
  • To ask the Windows Time service to resync as soon as possible (the command can target a remote computer too via the /computer: switch)

    • Same as above but before resyncing redetect any network configuration changes and rediscover the sources:
  • To get the status of the local computer (use the /computer: switch to target a different computer)
  • To show what time sources are being used:
  • To show who the peers are:
  • To show the current time zone:

    • You can’t change the time zone using this command; you have to do:

On the PDC in the forest root domain you would typically run a command like this if you want it to get time from an NTP pool on the Internet:

Here’s what the command does:

  • specify a list of peers to sync time from (in this example the NTP Pool servers on the Internet);
  • the /update switch tells w32tm to update the Windows Time service with this configuration change;
  • the /syncfromflags:MANUAL tells the Windows Time service that it must only sync from these sources (other options such as “DOMHIER” tells it to sync from the domain peers only, “NO” tells it sync from none, “ALL” tells it to sync from both the domain peers and this manual list);
  • the /reliable:YES switch marks this machine as special in that it is a reliable source of time for the domain (read this link on what gets set when you set a machine as RELIABLE).

Note: You must manually configure the time source on the PDC in the forest root domain and mark it as reliable. If that server were to fail and you transfer the role to another DC, be sure to repeat the step. 

On other machines in the domain you would run a command like this:

This tells those DCs to follow the domain hierarchy (and only the domain hierarchy) and that they are not reliable time sources (this switch is not really needed if these other DCs are not PDCs).

Active Directory Domain Services (AD DS)

  • Provides the DC services. If this service is stopped the DC stops acting as a DC. 
  • Pre-Server 2008 this service could not be stopped while the OS was online. But since Server 2008 it can be stopped and started. 

Active Directory Web Services (AD WS)

  • Introduced in Windows Server 2008 R2 to provide a web service interface to Active Directory Domain Services (AD DS), Active Directory Lightweight Domain Services (AD LDS), and Active Directory Database Mounting Tool instances running on the DC.
    • The Active Directory Database Mounting Tool was new to me so here’s a link to what it does. It’s a pretty cool tool. Starting from Server 2008 you can take AD DS and AD LDS snapshots via the Volume Snapshots Service (VSS) (I am writing a post on VSS side by side so expect to see one soon). This makes use of the NTDS VSS writer which ensures that consistent snapshots of the AD databases can be taken. The AD snapshots can be taken manually via the ntdsutil snapshot command or via backup software  or even via images of the whole system. Either ways, once you have such snapshots you can mount the one(s) you want via ntdsutil and point Active Directory Database Mounting Tool to it. As the tool name says it “mounts” the AD database in the snapshot and exposes it as an LDAP server. You can then use tools such as ldp.exe of the AD Users and Computers to go through this instance of the AD database. More info on this tool can be found at this and this link.
  • AD WS is what the PowerShell Active Directory module connects to. 
  • It is also what the new Active Directory Administrative Center (which in turn uses PowerShell) too connects to.
  • AD WS is installed automatically when the AD DS or AD LDS roles are installed. It is only activated once the server is promoted to a DC or if and AD LDS instance is created on it. 

Coming soon … fingers crossed!

October was a good month. I had the good fortune to attend a Microsoft workshop on Active Directory troubleshooting last month. And before that, I was at our Amman office for an upgrade from Windows XP (yeah we still had that!) to Windows 7 and I got to build a standard Windows 7 image with all our software and updates and create a bootable USB key that lets users install the OS and apps to a fresh machine. I want to write some posts on both of these – especially the Active Directory workshop, which was ah-maa-zing! – but also on the Windows 7 USB stuff (which is nothing novel but I’d like to write a post nevertheless).

I don’t know if I’ll manage to. I have ambitious plans on the Active Directory posts. It was a 4 day course and we covered many interesting topics such as replication, Kerberos, DNS, as well as a lot of troubleshooting. Many of these were familiar concepts to me but this was the first time I was presented with all of them together and that too someone was teaching the concepts rather than me Googling and/ or reading. I have already forgotten most of what I learnt, I think, but before I forget the rest I want to write multiple posts about the topics of each day, supplemented with more reading and notes from the Internet sort of as a revision to myself. Like I said – ambitious! – and the more ambitious I aim the less likely I am to achieve it (going by my track record). For starters, ever since the training finished I have been down with a stomach bug and so been too wasted to sit at the computer and write a blog post, let alone collect all my thoughts together. This post itself is sort of a last ditch attempt at getting the ball rolling by putting something out there, just so I have a commitment out in the open to get this done.

Fingers crossed, there’ll be more technical posts appearing soon! :)

GPO errors due to SYSVOL replication issues

So the other day at my test lab I noticed many of my GPOs were giving errors. Ran gpupdate /force and it gave the following message:

User policy could not be updated successfully. The following errors were encountered:

The processing of Group Policy failed. Windows attempted to read the file \\rakhesh.local\SysVol\rakhesh.local\Policies\{F28486EC-7C9D-40D6-A243-F1F733979D5C}\gpt.ini from a domain controller and was not successful. Group Policy settings may not be applied until this event is resolved. This issue may be transient and could be caused by one or more of the following:
a) Name Resolution/Network Connectivity to the current domain controller.
b) File Replication Service Latency (a file created on another domain controller has not replicated to the current domain controller).
c) The Distributed File System (DFS) client has been disabled.
Computer Policy update has completed successfully.

To diagnose the failure, review the event log or run GPRESULT /H GPReport.html from the command line to access information about Group Policy results.

Looked like a temporary thing so I didn’t give it much thought first. But when policies were still not being read after half an hour or so I figured something is broken.

Running gpresult /h didn’t have much to add. I noticed the there were errors similar to the one above. I also noticed that another policy had a version mismatch to the one on the server (this wasn’t highlighted anywhere, I just happened to notice).

I went to the path in the error message on each of my DCs (replace rakhesh.local with your DC name) to find out which DC had a missing file. As the message indicated, it was a replication issue. I had updated the GPO on one of the DCs but looks like it didn’t replicate to some of the other DCs; and my clients were trying to find the policy files on this DC where it didn’t replicate to, thus throwing errors.

Event Logs on the clients didn’t have much to add except for error messages similar to the above.

Event Logs on the DCs were more useful. “Applications and Services Logs” > “DFS Replication” is the place to look here. Interestingly, the DC where I couldn’t find the GPO above reported no errors. It had an entry like this:

The DFS Replication service successfully established an inbound connection with partner WIN-DC01 for replication group Domain System Volume.

Additional Information:
Connection Address Used: WIN-DC01.rakhesh.local
Connection ID: 11C8A7A0-F8A5-4867-B277-78DDC66E3ED3
Replication Group ID: 7C0BF99B-677B-4EDA-9B47-944D532DF7CB

This gives you the impression that all is fine. The other DC though had many errors. One of them was event ID 4012:

The DFS Replication service stopped replication on the folder with the following local path: C:\Windows\SYSVOL\domain. This server has been disconnected from other partners for 118 days, which is longer than the time allowed by the MaxOfflineTimeInDays parameter (60). DFS Replication considers the data in this folder to be stale, and this server will not replicate the folder until this error is corrected.

To resume replication of this folder, use the DFS Management snap-in to remove this server from the replication group, and then add it back to the group. This causes the server to perform an initial synchronization task, which replaces the stale data with fresh data from other members of the replication group.

Additional Information:
Error: 9061 (The replicated folder has been offline for too long.)
Replicated Folder Name: SYSVOL Share
Replicated Folder ID: 0546D0D8-E779-4384-87CA-3D4ABCF1FA56
Replication Group Name: Domain System Volume
Replication Group ID: 7C0BF99B-677B-4EDA-9B47-944D532DF7CB
Member ID: 93D960C2-DE50-443F-B8A1-20B04C714E16

Good! So that explains it. Being a test lab, all my DCs aren’t usually online. In this case, WIN-DC01 is what I always have online but WIN-DC02 is only powered on when I want to test anything specific. Since the two weren’t in touch for more than 60 days, WIN-DC01 had stopped replicating with WIN-DC02. Makes sense – you wouldn’t want stale data from WIN-DC02 to overwrite fresh data on WIN-DC01 after all.

The steps in the Event Log entry didn’t work though. Being a SYSVOL folder there’s no option to remove a server from the replication group in DFS Management.

I searched online for any suggestions related to event ID 4012 and SYSVOL, adn found a Microsoft KB article. This article suggested running two commands which I’d like to post here as a reminder to myself as they are generally useful to quickly pinpoint the source of an error. Being a test lab I have just 2-3 DCs so I can go through them manually; but if I had many DCs it is useful to know how to find the faulty ones fast.

First command simply checks whether the SYSVOL folder is shared on each DC:

Next command uses WMIC to get the replication state of this folder on each DC.

A state of 5 indicates the group is in error. Interestingly, the “Access denied” message is one I had noticed earlier too when running dcdiag on the faulty DC.

I didn’t think much of it that time as there were no errors when running dcdiag from the working DC and also because I ran this command while I thought the error was a temporary thing. I still don’t think this is a permissions issue. My guess is that it is to do with the content freshness error I found in the Event Logs earlier. Possibly due to that the second DC is being denied access to replicate and that’s why I get the “access denied” error above.

From the KB article above I came across another one that has steps on how to force a of SYSVOL from one server to all the others. (In the linked article, the section ‘How to perform an authoritative synchronization of DFSR-replicated SYSVOL (like “D4” for FRS)’ is what I am talking about). Here’s what I did:

  1. I launched adsiedit.msc and connected with the default settings
  2. Then I opened CN=SYSVOL Subscription,CN=Domain System Volume,CN=DFSR-LocalSettings,CN=WIN-DC01,OU=Domain Controllers,DC=rakhesh,DC=local (replace the DC name and domain name with yours; in my case WIN-DC01 is the DC I want to set as authoritative) and …
  3. … I set msDFSR-Enabled=FALSE and msDFSR-options=1

    adsiedit

  4. Next, I opened CN=SYSVOL Subscription,CN=Domain System Volume,CN=DFSR-LocalSettings,CN=WIN-DC02,OU=Domain Controllers,DC=rakhesh,DC=local – WIN-DC02 being the DC that is out of sync – and set msDFSR-Enabled=FALSE.
  5. For AD replication through out the domain:

  6. The result of the above steps will be that SYSVOL replication is disabled through out the domain. I confirmed that by going to the “DFS Replication” logs and noting an event with ID 4114:

    The replicated folder at local path C:\Windows\SYSVOL\domain has been disabled. The replicated folder will not participate in replication until it is enabled. All data in the replicated folder will be treated as pre-existing data when this replicated folder is enabled.

    Additional Information:
    Replicated Folder Name: SYSVOL Share
    Replicated Folder ID: 0546D0D8-E779-4384-87CA-3D4ABCF1FA56
    Replication Group Name: Domain System Volume
    Replication Group ID: 7C0BF99B-677B-4EDA-9B47-944D532DF7CB
    Member ID: 09E14860-47F6-49EE-820E-43F7ED015FEB

  7. Then I went back to CN=SYSVOL Subscription,CN=Domain System Volume,CN=DFSR-LocalSettings,CN=WIN-DC01,OU=Domain Controllers,DC=rakhesh,DC=local but this time I set msDFSR-Enabled=TRUE.
  8. Again, I forced AD replication through out the domain.
  9. Next, I ran dfsrdiag pollad on WIN-DC01 (the DC with the correct copy) to perform an initial sync with AD. (Note: My DC was a Server 2012 Core install. This didn’t have dfsrdiag by default. I installed it via Install-WindowsFeature RSAT-DFS-Mgmt-Con in PowerShell).
  10. I checked the Event Logs of WIN-DC01 to confirm SYSVOL is now replicating (event ID 4602 should be present):

    The DFS Replication service successfully initialized the SYSVOL replicated folder at local path C:\Windows\SYSVOL\domain. This member is the designated primary member for this replicated folder. No user action is required. To check for the presence of the SYSVOL share, open a command prompt window and then type “net share”.

    Additional Information:
    Replicated Folder Name: SYSVOL Share
    Replicated Folder ID: 0546D0D8-E779-4384-87CA-3D4ABCF1FA56
    Replication Group Name: Domain System Volume
    Replication Group ID: 7C0BF99B-677B-4EDA-9B47-944D532DF7CB
    Member ID: 93D960C2-DE50-443F-B8A1-20B04C714E16
    Read-Only: 0

  11. Finally, I went back to CN=SYSVOL Subscription,CN=Domain System Volume,CN=DFSR-LocalSettings,CN=WIN-DC01,OU=Domain Controllers,DC=rakhesh,DC=local and set msDFSR-Enabled=TRUE. You can see what’s happening, don’t you? First we disabled SYSVOL replication on all the DCs. Then we enabled it on the authoritative DC and got it to sync to AD. And now we are re-enabling it on all the other DCs so they get a fresh copy from the authoritative DC.
  12. Lastly, I did a dfsrdiag pollad on WIN-DC02 … and while it should have worked without any issues, I got the following error:

    C:\Users>dfsrdiag pollad
    [ERROR] Access is denied when connecting to WMI services on computer: WIN-DC02.r
    akhesh.local

    Operation Failed

However, Event Logs on WIN-DC02 showed that SYSVOL was now replicating successfully and clients are now able to download GPOs successfully. So it looks like the “access denied” error is something else altogether (as I suspected initially, yaay!). I’ll have to look into that later – I know in the past I’ve fiddled with the WMI settings on this machine so maybe I made some change that I forgot to reset.

Good stuff!