Contact

Subscribe via Email

Subscribe via RSS/JSON

Categories

Creative Commons Attribution 4.0 International License
© Rakhesh Sasidharan

Elsewhere

Notes on ADFS

I have been trying to read on ADFS nowadays. It’s my new area of interest! :) Wrote a document at work sort of explaining it to others, so here’s bits and pieces from that.

What does Active Directory Federation Services (ADFS) do?

Typically when you visit a website you’d need to login to that website with a username/ password stored on their servers, and then the website will give you access to whatever you are authorized to. The website does two things basically – one, it verifies your identity; and two, it grants you access to resources.

It makes sense for the website to control access, as these are resources with the website. But there’s no need for the website to control identity too. There’s really no need for everyone who needs access to a website to have user accounts and passwords stored on that website. The two steps – identity and access control – can be decoupled. That’s what ADFS lets us do.

With ADFS in place, a website trusts someone else to verify the identity of users. The website itself is only concerned with access control. Thus, for example, a website could have trusts with (say) Microsoft, Google, Contoso, etc. and if a user is able to successfully authenticate with any of these services and let the website know so, they are granted access. The website itself doesn’t receive the username or password. All it receives are “claims” from a user.

What are Claims?

A claim is a statement about “something”. Example: my username is ___, my email address is ___, my XYZ attribute is ___, my phone number is ____, etc.

When a website trusts our ADFS for federation, users authenticate against the ADFS server (which in turn uses AD or some other pool to authenticate users) and passes a set of claims to the website. Thus the website has no info on the (internal) AD username, password, etc. All the website sees are the claims, using which it can decide what to do with the user.

Claims are per trust. Multiple applications can use the same trust, or you could have a trust per application (latter more likely).

All the claims pertaining to a user are packaged together into a secure token.

What is a Secure Token?

A secure token is a signed package containing claims. It is what an ADFS server sends to a website – basically a list of claims, signed with the token signing certificate of the ADFS server. We would have sent the public key part of this certificate to the website while setting up the trust with them; thus the website can verify our signature and know the tokens came from us.

Relying Party (RP) / Service Provider (SP)

Refers to the website/ service who is relying on us. They trust us to verify the identity of our users and have allowed access for our users to their services.

I keep saying “website” above, but really I should have been more generic and said Relying Party. A Relying Party is not limited to a website, though that’s how we commonly encounter it.

Note: Relying Party is the Microsoft terminology.

ADFS cannot be used for access to the following:

  • File shares or print servers
  • Active Directory resources
  • Exchange (O365 excepted)
  • Connect to servers using RDP
  • Authenticate to “older” web applications (it needs to be claims aware)

A Relying Party can be another ADFS server too. Thus you could have a setup where a Replying Party trusts an ADFS service (who is the Claims Provider in this relationship), and the ADFS service in turn trusts a bunch of other ADFS servers depending on (say) the user’s location (so the trusting ADFS service is a Relying Party in this relationship).

Claims Provider (CP) / Identity Provider (IdP)

The service that actually validates users and then issues tokens. ADFS, basically.

Note: Claims Party is the Microsoft terminology.

Secure Token Service (STS)

The service within ADFS that accepts requests and creates and issues security tokens containing claims.

Relying Party Trust

Refers to the trust between a Relying Party and Identity Provider. Tokens from the Identity Provider will be signed with the Identity Provider’s token signing key – so the Relying Party knows it is authentic. Similarly requests from the Relying Party will be signed with their certificate (which we can import on our end when setting up the trust).

Web Application Proxy (WAP)

Access to an ADFS server over the Internet is via a Web Application Proxy. This is a role in Server 2012 and above – think of it as a reverse proxy for ADFS. The ADFS server is within the network; the WAP server is on the DMZ and exposed to the Internet (at least port 443). The WAP server doesn’t need to be domain joined. All it has is a reference to the ADFS server – either via DNS, or even just a hosts file entry. The WAP server too contains the public certificates of the ADFS server.

Miscellaneous

  • ADFS Federation Metadata – this is a cool link that is published by the ADFS server (unless we have disabled it). It is https://<your-adfs-fqdn>/FederationMetadata/2007-06/FederationMetadata.xml and contains all the info required by a Replying Party to add the ADFS server as a Claims Provider.
    • This also includes Base64 encoded versions of the token signing certificate and token decrypting certificates.
  • SAML Entity ID – not sure of the significance of this yet, but this too can be found in the Federation Metadata file. It is usually of the form http://<your-adfs-fqdn>/adfs/services/trust and is required by the Relying Party to setup a trust to the ADFS server.
  • SAML endpoint URL – this is the URL where users are sent to for authentication. Usually of the form http://<your-adfs-fqdn>/adfs/ls.  This information too can be found in the Federation Metadata file.
  • Link to my post on ADFS Certificates.

[Aside] AD Sites, Subnets, Trusts, etc.

  • How Domain Controllers are Located Across Trusts – this is a delightful article. I don’t know why, but I simply loved the way the author presented the information. Very logically written. Wish I could write blog posts with such clarity.
    • Praise aside, it is a good article on how subnet and site definitions are used to find a Domain Controller closest to you, and especially how it works across forest trusts.
  • Using Catch-All subnets in AD – Wanted to know how catch-all subnets in AD Sites will interact with specific ones. This one explained it. The specific one takes precedence. Which is exactly what you want. :)

Notes on ADFS Certificates

Was trying to wrap my head around ADFS and Certificates today morning. Before I close all my links etc I thought I should make a note of them here. Whatever I say below is more or less based on these links (plus my understanding):

There are three types of certificates in ADFS. 

The “Service communications” certificate is also referred to as “SSL certification” or “Server Authentication Certificate”. This is the certificate of the ADFS server/ service itself. 

  • If there’s a farm of ADFS servers, each must have the same certificate
  • We have the private key too for this certificate and can export it if this needs to be added to other ADFS servers in the farm. 
  • The Subject Name must contain the federation service name. 
  • This is the certificate that end users will encounter when they are redirected to the ADFS page to sign-on, so this must be a public CA issued certificate. 

The “Token-signing” certificate is the crucial one

  • This is the certificate used by the ADFS server to sign SAML tokens.
  • We have the private key too this certificate too but it cannot be exported. There’s no option in the GUI to export the private key. What we can do is export the public key/ certificate. 
    • ADFS servers in a farm can share the private key.
    • If the certificate is from a public or enterprise CA, however, then the key can be exported (I think).
  • The exported public certificate is usually loaded on the service provider (or relying party; basically the service where we can authenticate using our ADFS). They use it to verify our signature. 
  • The ADFS server signs tokens using this certificate (i.e. uses its private key to encrypt the token or a hash of the token – am not sure). The service provider using the ADFS server for authentication can verify the signature via the public certificate (i.e. decrypt the token or its hash using the public key and thus verify that it was signed by the ADFS server). This doesn’t provide any protection against anyone viewing the SAML tokens (as it can be decrypted with the public key) but does provide protection against any tampering (and verifies that the ADFS server has signed it). 
  • This can be a self-signed certificate. Or from enterprise or public CA.
    • By default this certificate expires 1 year after the ADFS server was setup. A new certificate will be automatically generated 20 days prior to expiration. This new certificate will be promoted to primary 15 days prior to expiration. 
    • This auto-renewal can be disabled. PowerShell cmdlet (Get-AdfsProperties).AutoCertificateRollover tells you current setting. 
    • The lifetime of the certificate can be changed to 10 years to avoid this yearly renewal. 
    • No need to worry about this on the WAP server. 
    • There can be multiple such certificates on an ADFS server. By default all certificates in the list are published, but only the primary one is used for signing.

The “Token-decrypting” certificate.

  • This one’s a bit confusing to me. Mainly coz I haven’t used it in practice I think. 
  • This is the certificate used if a service provider wants to send us encrypted SAML tokens. 
    • Take note: 1) Sending us. 2) Not signed; encrypted
  • As with “Token-signing” we export the public part of this certificate, upload it with the 3rd party, and they can use that to encrypt and send us tokens. Since only we have the private key, no one can decrypt en route. 
  • This can be a self-signed certificate. 
    • Same renewal rules etc. as the “Token-signing” certificate. 
  • Important thing to remember (as it confused me until I got my head around it): This is not the certificate used by the service provider to send us signed SAML tokens. The certificate for that can be found in the signature tab of that 3rd party’s relaying party trust (see screenshot below). 
  • Another thing to take note of: A different certificate is used if we want to send encrypted info to the 3rd party. This is in the encryption tab of the same screenshot. 

So – just to put it all in a table.

Clients accessing ADFS server Service communications certificate
ADFS server signing data and sending to 3rd party Token-signing certificate
3rd party encrypting data and sending ADFS server Token-decrypting certificate
3rd party signing data and sending ADFS server Certificate in Signature tab of Trust
ADFS server encrypting data and sending to 3rd party Certificate in Encryption tab of Trust

Update: Linking to a later post of mine on ADFS in general.

OU delegation not working (contd.) – finding protected groups

Turns out I was mistaken in my previous post. A few minutes after enabling inheritance, I noticed it was disabled again. So that means the groups must be protected by AD.

I knew of the AdminSDHolder object and how it provides a template set of permissions that are applied to protected accounts (i.e. members of groups that are protected). I also knew that there were some groups that are protected by default. What I didn’t know, however, what that the defaults can be changed. 

Initially I did a Compare-Object -ReferenceObject (Get-ADPrincipalGroupMembership User1) -DifferenceObject (Get-ADPrincipalGroupMembership User2) -IncludeEqual to compare the memberships of two random accounts that seemed to be protected. These were accounts with totally different roles & group memberships so the idea was to see if they had any common groups (none!) and failing that to see if the groups they were in had any common ancestors (none again!)

Then I Googled a bit :o) and came across a solution. 

Before moving on to that though, as a note to myself: 

  • The AdminSDHolder object is at CN=AdminSDHolder,CN=System,DC=domain,DC=com. Find that via ADSI Edit (replace the domain part accordingly). 
  • Right click the object and its Security tab lists the template permissions that will be applied to members of protected groups. You can make changes here. 
  • SDProp is a process that runs every 60 minutes on the DC holding the PDC Emulator role. The period can be changed via the registry key HKLM\SYSTEM\CurrentControlSet\Services\NTDS\Parameters\AdminSDProtectFrequency. (If it doesn’t exist, add it. DWORD). 
  • SDProp can be run manually if required. 

So back to my issue. Turns out if a group as its adminCount attribute set to 1 then it will be protected. So I ran the following against the OU containing my admin  account groups:

Bingo! Most of my admin groups were protected, so most admin accounts were protected. All I have to do now is either un-protect these groups (my preferred solution), or change the template to delegate permissions there. 

Update: Simply un-protecting a group does not un-protect all its members (this is by design). The member objects too have their adminCount attribute set to 1, so apart from fun-protecting the groups we must un-protect the members too. 

Update 2: Found this good post with lots more details. How to run the process manually, what are the default protected groups, etc. Read that post in conjunction with this one and you are set!

Update 3: You can unprotect the following default groups via dsHeuristics: 1) Account Operators, 2) Backup Operators, 3) Server Operators, 4) Print Operators. But that still leaves groups such as Administrators (built-in), Domain Admins, Enterprise Admins, Domain Controllers, Schema Admins, Read-Only Domain Controllers, and the user Administrator (built-in). There’s no way to un-protect members of these.

Something I hadn’t realized about adminCount. This attribute does not mean a group/ user will be protected. Instead, what it means is that if a group/ user is protected, and its ACLs have changed and are now reset to default, then the adminCount attribute will be set. So yes, adminCount will let you find groups/ users that are protected; but merely setting adminCount on a group/ user does not protect it. I learnt this the hard way while I was testing my changes. Set adminCount to 1 for a group and saw that nothing was happening.

Also, it is possible that a protected user/ group does not have adminCount set. This is because adminCount is only set if there is a difference in the ACLs between the user/ group and the AdminSDHolder object. If there’s no difference, a protected object will not have the adminCount attribute set. :)

OU delegation not working

Today I cracked a problem which had troubled us for a while but which I never really sat down and actually tried to troubleshoot. We had an OU with 3rd level admin accounts that no one else had rights to but wanted to delegate certain password related tasks to our Service Desk admins. Basically let them reset password, unlock the account, and enable/ disable. 

Here’s some screenshots for the delegation wizard. Password reset is a common task and can be seen in the screenshot itself. Enable/ Disable can be delegated by giving rights to the userAccountControl attribute. Only force password change rights (i.e. no reset password) can be given via the pwdLastSet attribute. And unlock can be given via the lockoutTime attribute

Problem was that in my case in spite of doing all this the delegated accounts had no rights!

Snooping around a bit I realized that all the admin accounts within the OU had inheritance disabled and so weren’t getting the delegated permissions from the OU (not sure why; and no these weren’t protected group members). 

Of course, enabling is easy. But I wanted to see if I could get a list of all the accounts in there with their inheritance status. Time for PowerShell. :)

The Get-ACL cmdlet can list access control lists. It can work with AD objects via the AD: drive. Needs a distinguished name, that’s all. So all you have to do is (Get-ADUser <accountname>).DistinguishedName) – prefix an AD: to this, and pass it to Get-ACL. Something like this:

The default result is useless. If you pipe and expand the Access property you will get a list of ACLs. 

The result is a series of entries like these:

The attribute names referred to by the GUIDs can be found in the AD Technical Specs

Of interest to us is the AreAccessRulesProtected property. If this is True then inheritance is disabled; if False inheritance is enabled. So it’s straight forward to make a list of accounts and their inheritance status:

So that’s it. Next step would be to enable inheritance on the accounts. I won’t be doing this now (as it’s bed time!) but one can do it manually or script it via the SetAccessRuleProtection method. This method takes two parameters (enable/ disable inheritance; and if disable then should we add/ remove existing ACEs). Only the first parameter is of significance in my case, but I have to pass the second parameter too anyways – SetAccessRuleProtection($False,$True).

Update: Here’s what I rolled out at work today to make the change.

Update 2: Didn’t realize I had many users in the built-in protected groups (these are protected even though their adminCount is 0 – I hadn’t realized that). To unprotect these one must set the dsHeuristics flag. The built-in protected groups are 1) Account Operators, 2) Server Operators, 3) Print Operators, and 4) Backup Operators. See this post on instructions (actually, see the post below for even better instructions).

Update 3: Found this amazing page that goes into a hell of details on this topic. Be sure to read this before modifying dsHeuristics.

Useful WMIC filters

I have these tabs open in my browser from last month when I was doing some WMI based GPO targeting. Meant to write a blog post but I keep getting side tracked and now it’s been nearly a month so I have lost the flow. But I want to put these in the blog as a reference to my future self. 

That’s all.

Get a list of users in an OU along with last logged on date

Trivial stuff. Wanted to note it down someplace for future reference –

 

The “Administrators” group

Note to self: the “Domain Admins” and “Enterprise Admins” groups aren’t the primary groups in a domain. The primary group is the “Administrators” group, present in the “Builtin” folder. The other two groups are members of this group and thus get rights over the domain. The “Enterprise Admins” group is also a member of the “Administrator” group in all other domains/ child-domains of that forest, hence its members get rights over those domains too.

So if you want to create a separate group in your domain and want to give its members domain admin rights over (say) a child domain, all you need to do is create the group (must be Global or Universal) an add this group to the “Administrators” group in the child domain. That’s it!

Cannot ping an address but nslookup works (contd)

Earlier today I had blogged about nslookup working but ping and other methods not resolving names to IP addresses. That problem started again, later in the evening.

Today morning though as a precaution I had enabled the DNS Client logs on my computer. (To do that open Event Viewer with an admin account, go down to Applications and Services Logs > Microsoft > Windows > DNS Client Events > Operational – and click “Enable log” in the “Actions” pane on the right).

That showed me an error along the following lines:

A name not found error was returned for the name vcenter01.rakhesh.local. Check to ensure that the name is correct. The response was sent by the server at 10.50.1.21:53.

Interesting. So it looked like a particular DC was the culprit. Most likely when I restarted the DNS Client service it just chose a different DC and the problem temporarily went away. And sure enough nslookup for this record against this DNS server returned no answers.

I fired up DNS Manager and looked at this server. It seemed quite outdated with many missing records. This is my simulated branch office DC so I don’t always keep it on/ online. Looks like that was coming back to bite me now.

The DNS logs in Event Manager on that server had errors like this:

The DNS server was unable to complete directory service enumeration of zone TrustAnchors.  This DNS server is configured to use information obtained from Active Directory for this zone and is unable to load the zone without it.  Check that the Active Directory is functioning properly and repeat enumeration of the zone. The extended error debug information (which may be empty) is “”. The event data contains the error.

So Active Directory is the culprit (not surprising as these zones are AD integrated so the fact that they weren’t up to date indicated AD issues to me). I ran repadmin /showrepl and that had many errors:

Naming Context: CN=Configuration,DC=rakhesh,DC=local

Source: COCHIN\WIN-DC03
******* WARNING: KCC could not add this REPLICA LINK due to error.

Naming Context: DC=DomainDnsZones,DC=rakhesh,DC=local
Source: COCHIN\WIN-DC03
******* WARNING: KCC could not add this REPLICA LINK due to error.

Naming Context: DC=ForestDnsZones,DC=rakhesh,DC=local
Source: COCHIN\WIN-DC03
******* WARNING: KCC could not add this REPLICA LINK due to error.

Naming Context: DC=rakhesh,DC=local
Source: COCHIN\WIN-DC03
******* WARNING: KCC could not add this REPLICA LINK due to error.

Naming Context: CN=Configuration,DC=rakhesh,DC=local
Source: COCHIN\WIN-DC01
******* WARNING: KCC could not add this REPLICA LINK due to error.

Naming Context: DC=DomainDnsZones,DC=rakhesh,DC=local
Source: COCHIN\WIN-DC01
******* WARNING: KCC could not add this REPLICA LINK due to error.

Naming Context: DC=ForestDnsZones,DC=rakhesh,DC=local
Source: COCHIN\WIN-DC01
******* WARNING: KCC could not add this REPLICA LINK due to error.

Naming Context: DC=rakhesh,DC=local
Source: COCHIN\WIN-DC01
******* WARNING: KCC could not add this REPLICA LINK due to error.

Great! I fired up AD Sites and Services and the links seemed ok. Moreover I could ping the DCs from each other. Event Logs on the problem DC (WIN-DC02) had many entries like this though:

The Kerberos client received a KRB_AP_ERR_MODIFIED error from the server win-dc01$. The target name used was Rpcss/WIN-DC01. This indicates that the target server failed to decrypt the ticket provided by the client. This can occur when the target server principal name (SPN) is registered on an account other than the account the target service is using. Please ensure that the target SPN is registered on, and only registered on, the account used by the server. This error can also happen when the target service is using a different password for the target service account than what the Kerberos Key Distribution Center (KDC) has for the target service account. Please ensure that the service on the server and the KDC are both updated to use the current password. If the server name is not fully qualified, and the target domain (RAKHESH.LOCAL) is different from the client domain (RAKHESH.LOCAL), check if there are identically named server accounts in these two domains, or use the fully-qualified name to identify the server.

Hmm, secure channel issues? I tried resetting it but that too failed:

(Ignore the above though. Later I realized that this was because I wasn’t running command prompt as an admin. Because of UAC even though I was logged in as admin I should have right clicked and ran command prompt as admin).

Since I know my environment it looks likely to be a case of this DC losing trust with other DCs. The KRB_AP_ERR_MODIFIED error also seems to be related to Windows Server 2003 and Windows Server 2012 R2 but mine wasn’t a Windows Server 2003. That blog post confirmed my suspicions that this was password related.

Time to check the last password set attribute for this DC object on all my other DCs. Time to run repadmin /showobjmeta.

The above output gives metadata of the WIN-DC02 object on WIN-DC01. I am interested in the pwdLastSet attribute and its timestamp.  Here’s a comparison of this across my three DCs:

That confirms the problem. WIN-DC02 thinks its password last changed on 9th May whereas WIN-DC01 changed it on 25th July and replicated it to WIN-DC03.

Interestingly that date of 25th July is when I first started having problems in my test lab. I thought I had sorted them but apparently they were only lurking beneath. The solution here is to reset the WIN-DC02 password on itself and WIN-DC01 and replicate it across. The steps are in this KB article, here’s what I did:

  1. On WIN-DC02 (the problem DC) I stopped the KDC service and set it to start Manual.
  2. Purge the Kerberos ticket cache. You can view the ticket cache by typing the command: klist.  To purge, do: klist purge.
  3. Open a command prompt as administrator (i.e. right click and do a “Run as administrator”) then type the following command: netdom resetpwd /server WIN-DC01.rakhesh.local /UserD MyAdminAccount /PasswordD *
  4. Restart WIN-DC02.
  5. After logon start the KDC service and set it to Automatic.

Checked the Event Logs to see if there are any errors. None initially but after I forced a sync via repadmin /syncall /e I got a few. All of them had the following as an error:

2148074274 The target principal name is incorrect.

Odd. But at least it was different from the previous errors and we seemed to be making progress.

After a bit of trial and error I noticed that whenever the KDC service on the DC was stopped it seemed to work fine.

I could access other servers (file shares), connect to them via DNS Manager, etc. But start KDC and all these would break with errors indicating the target name was wrong or that “a security package specific error occurred”. Eventually I left KDC stay off, let the domain sync via repadmin /syncall, and waited a fair amount of time (about 15-20 mins) for things to settle. I kept an eye on repadmin /replsummary to see the deltas between WIN-DC02 and the rest, and also kept an eye on the DNS zones to see if WIN-DC02 was picking up newer entries from the others. Once these two looked positive, I started KDC. And finally things were working!

 

vCenter – Cannot load the users for the selected domain

I spent the better part of today evening trying to sort this issue. But didn’t get any where. I don’t want to forget the stuff I learnt while troubleshooting so here’s a blog post.

Today evening I added one of my ESXi hosts to my domain. The other two wouldn’t add, until I discovered that the time on those two hosts were out of sync. I spent some time trying to troubleshoot that but didn’t get anywhere. The NTP client on these hosts was running, the ports were open, the DC (which was also the forest PDC and hence the time keeper) was reachable – but time was still out of sync.

Found an informative VMware KB article. The ntpq command (short for “NTP query”) can be used to see the status of NTP daemon on the client side. Like thus:

The command has an interactive mode (which you get into if run without any switches; read the manpage for more info). The -p switch tells ntpq to output a list of peers and their state. The KB article above suggests running this command every 2 seconds using the watch command but you don’t really need to do that.

Important points about the output of this command:

  • If it says “No association ID's returned” it means the ESXi host cannot reach the NTP server. Considering I didn’t get that, it means I have no connectivity issue.
  • If it says “***Request timed out” it means the response from the NTP server didn’t get through. That’s not my problem either.
  • If there’s an asterisk before the remote server name (like so) it means there is a huge gap between the time on the host and the time given by the NTP server. Because of the huge gap NTP is not changing the time (to avoid any issues caused by a sudden jump in the OS time). Manually restarting the NTP daemon (/etc/init.d/ntpd restart) should sort it out.
    • The output above doesn’t show it but one of my problem hosts had an asterisk. Restarting the daemon didn’t help.

The refid field shows the time stream to which the client is syncing. For instance here’s the w3tm output from my domain:

Notice the PDC has a refid of LOCL (indicating it is its own time source) while the rest have a refid of the PDC name. My ESXi host has a refid of .INIT. which means it has not received any response from the NTP server (shouldn’t the error message have been something else!?). So that’s the problem in my case.

Obviously the PDC is working because all my Windows machines are keeping correct time from it. So is vCenter. But some my ESXi hosts aren’t.

I have no idea what’s wrong. After some troubleshooting I left it because that’s when I discovered my domain had some inconsistencies. Fixing those took a while, after which I hit upon a new problem – vCenter clients wouldn’t show me vCenter or any hosts when I login with my domain accounts. Everything appears as expected under the administrator@vsphere.local account but the domain accounts return a blank.

While double-checking that the domain admin accounts still have permissions to vCenter and SSO I came across the following error:

Cannot load the users

Great! (The message is “Cannot load the users for the selected domain“).

I am using the vCenter appliance. Digging through the /var/log/messages on this I found the following entries:

Searched Google a bit but couldn’t find any resolutions. Many blog posts suggested removing vCenter from the domain and re-adding but that didn’t help. Some blog posts (and a VMware KB article) talk about ensuring reverse PTR records exist for the DCs – they do in my case. So I am drawing a blank here.

Odd thing is the appliance is correctly connected to the domain and can read the DCs and get a list of users. The appliance uses Likewise (now called PowerBroker Open) to join itself to the domain and authenticate with it. The /opt/likewise/bin directory has a bunch of commands which I used to verify domain connectivity:

All looks well! In fact, I added a user to my domain and re-ran the lw-enum-users command it correctly picked up the new user. So the appliance can definitely see my domain and get a list of users from it. The problem appears to be in the upper layers.

In /var/log/vmware/sso/ssoAdminServer.log I found the following each time I’d query the domain for users via the SSO section in the web client:

Makes no sense to me but the problem looks to be in Java/ SSO.

I tried removing AD from the list of identity sources in SSO (in the web client) and re-added it. No luck.

Tried re-adding AD but this time I used an SPN account instead of the machine account. No luck!

Finally I tried adding AD as an LDAP Server just to see if I can get it working somehow – and that clicked! :)

AD as LDAP

So while I didn’t really solve the problem I managed to work around it …

Update: Added the rest of my DCs as time sources to the ESXi hosts and restarted the ntpd service. Maybe that helped, now NTP is working on the hosts.

 

Fixing “The DNS server was unable to open Active Directory” errors

For no apparent reason my home testlab went wonky today! Not entirely surprising. The DCs in there are not always on/ connected; and I keep hibernating the entire lab as it runs off my laptop so there’s bound to be errors lurking behind the scenes.

Anyways, after a reboot my main DC was acting weird. For one it took a long time to start up – indicating DNS issues, but that shouldn’t be the case as I had another DC/ DNS server running – and after boot up DNS refused to work. Gave the above error message. The Event Logs were filled with two errors:

  • Event ID 4000: The DNS server was unable to open Active Directory.  This DNS server is configured to obtain and use information from the directory for this zone and is unable to load the zone without it. Check that the Active Directory is functioning properly and reload the zone. The event data is the error code.
  • Event id 4007: The DNS server was unable to open zone <zone> in the Active Directory from the application directory partition <partition name>. This DNS server is configured to obtain and use information from the directory for this zone and is unable to load the zone without it. Check that the Active Directory is functioning properly and reload the zone. The event data is the error code.

A quick Google search brought up this Microsoft KB. Looks like the DC has either lost its secure channel with the PDC, or it holds all the FSMO roles and is pointing to itself as a DNS server. Either of these could be the culprit in my case as this DC indeed had all the FSMO roles (and hence was also the PDC), and so maybe it lost trust with itself? Pretty bad state to be in, having no trust in oneself … ;-)

The KB article is worth reading for possible resolutions. In my case since I suspected DNS issues in the first place, and the slow loading usually indicates the server is looking to itself for DNS, I checked that out and sure enough it was pointing to itself as the first nameserver. So I changed the order, gave the DC a reboot, and all was well!

In case the DC had lost trust with itself the solution (according to the KB article) was to reset the DC password. Not sure how that would reset trust, but apparently it does. This involves using the netdom command which is installed on Server 2008 and up (as well as on Windows 8 or if RSAT is installed and can be downloaded for 2003 from the Support Tools package). The command has to be run on the computer whose password you want to reset (so you must login with an account whose initials are cached, or use a local account). Then run the command thus:

Of course the computer must have access to the PDC. And if you are running it on a DC the KDC service must be stopped first.

I have used netdom in the past to reset my testlab computer passwords. Since a lot of the machines are usually offline for many days, and after a while AD changes the computer account password but the machine still has the old password, when I later boot up the machine it usually gives are error like this: “The trust relationship between this workstation and the primary domain failed.”

A common suggestion for such messages is to dis-join the machine from the domain and re-join it, effectively getting it a new password. That’s a PITA though – I just use netdom and reset the password as above. :)

 

[Aside] Bridgehead Server Selection improvements in Server 2008 R2

Came across this blog post when researching for something. Long time since I read anything AD related (since I am more focused on VMware and HP servers) at work nowadays. Was a good read.

Summary of the post:

  • When you have a domain spread over multiple sites, there is a designated Bridgehead server in each site that replicates changes with/ to the Bridgehead server in other sites.
    • Bridgehead servers talk to each other via IP or SMTP.
    • Bridgehead servers are per partition of the domain. A single Bridgehead server can replicate for multiple partitions and transports.
    • Since Server 2003 there can be multiple Bridgehead servers per partition in the domain and connections can be load-balanced amongst these. The connections will be load-balanced to Server 2000 DCs as well.
  • Bridgehead servers are automatically selected (by default). The selection is made by a DC that holds the Inter-Site Topology Generator (ISTG) role.
    • The DC holding the ISTG role is usually the first DC in the site (the role will failover to another DC if this one fails; also, the role can be manually moved to another DC).
    • It is possible designate certain DCs are preferred Bridgehead servers. In this case the ISTG will choose a Bridgehead server from this list.
  • It is also possible to manually create connections from one DC to another for each site and partition, avoiding ISTG altogether.
  • On each DC there is a process called the Knowledge Consistency Checker (KCC). This process is what actually creates the replication topology for the domain.
  • The KCC process running on the DC holding the ISTG role is what selects the Bridgehead servers.

The above was just background. Now on to the improvements in Server 2008 R2:

  • As mentioned above, in Server 2000 you had one Bridgehead server per partition per site.
  • In Server 2003 you could have multiple Bridgehead servers per partition per site. There was no automatic load-balancing though – you had to use a tool such as Adlb.exe to manually load-balance among the multiple Bridgehead servers.
  • In Server 2008 you had automatic load-balancing. But only for Read-Only Domain Controllers (RODCs).
    • So if Site A had 5 DCs, the RODCs in other sites would load-balance their incoming connections (remember RODCs only have incoming connections) across these 5 DCs. If a 6th DC was added to Site A, the RODCs would automatically load-balance with that new DC.
    • Regular DCs (Read-Write DCs) too would load-balance their incoming connections across these 5 DCs. But if a 6th DC was added they wouldn’t automatically load-balance with that new DC. You would still need to run a tool like Aldb.exe to load-balance (or delete the inbound connection objects on these regular DCs and run KCC again?).
    • Regular DCs would sort of load-balance their outbound connections to Site A. The majority of incoming connections to Site A would still hit a single DC.
  • In Server 2008 R2 you have complete automatic load-balancing. Even for regular DCs.
    • In the above example: not only would the regular DCs automatically load-balance their incoming connections with the new 6th DC, but they would also load-balance their outbound connections with the DCs in Site A (and when the new DC is added automatically load-balance with that too). 

To view Bridgeheads connected to a DC run the following command:

The KCC runs every 15 minutes (can be changed via registry). The following command runs it manually:

Also, the KCC prefers DCs that are more stable / readily available than DCs that are intermittently available. Thus DCs that are offline for an extended period do not get rebalanced automatically when they become online (at least not immediately.

DFS Namespaces missing on some servers / sites

This is something I had sorted out for one of our offices earlier. Came across it for another office today and thought I should make a blog post of it. I don’t remember if I made a a blog post the last time I fixed it (and although it’s far easier to just search my blog and then decide whether to make a blog post or not, I am lazy that way :)).

Here’s the problem. We use AppV to stream some applications to our users. One of our offices started complaining that AppV was no longer working for them. Here’s the error they got:

cderror

I logged on to the user machine to find out where the streaming server is:

Checked the Event Logs on that server and there were errors from the “Application Virtualization” source along these lines:

So I checked the DFS path and it was empty. In fact, there was no folder called “Content” under “\\domain.com\dfs” – which was odd!

I switched DFS servers to that of another location and all the folders appeared. So the problem definitely was with the DFS server of this site.

From the DFS Management tool I found the namespace server for this site (it was the DC) and the folder target (it was one of the data servers). Checked the folder target share and that was accessible, so no issues there. It was looking to be a DFS Namespace issue.

Hopped on to the DFS and checked “C:\DFSRoots\DFS”. It didn’t have the “Content” folder above – so definitely a DFS Namespace issue!

I ran dfsdiag and it gave no errors:

So I checked the Event Logs on the DC. No errors there. Next I restarted the “DFS Namespace” service and voila! all the namespaces appeared correctly.

Ok, so that fixed the problem but why did it happen in the first place? And what’s there to ensure it doesn’t happen again? The site was restarted yesterday as part of some upgrade work so did that cause the namespaces to fail?

I checked the timestamps of the DFS Namespace entries (these are from source “DfsSvc” in “Custom Views” > “Server Roles” > “File Server”). Once the namespaces were ready there was an entry along the following lines at 08:06:43 (when the DC came back up from the reboot):

No errors there. But where does the DC get its name spaces from? This was a domain DFS so it would be getting it from Active Directory. So let’s look at the AD logs under “Applications and Services Logs” > “Directory Service”. When the domain services are up and running there’s an entry from the “ActiveDirectory_DomainService” source along these lines:

On this server the entry was made at 08:06:47. Notice the timestamp – it occurs after the “DFS Namespace” service has finished building namespace! And therein lies the problem. Because the Directory Services took a while to be ready, when DFS was being initialized it went ahead with stale data. When I restarted the service later, it picked up the correct data and sorted the namespaces.

I rebooted the server a couple more times but couldn’t replicate the problem as the Directory Service always initialized before DFS Namespace. To be on the safe side though I set the startup type of “DFS Namespace” to be “Automatic (Delayed)” so it always starts after the “Directory Service”.

 

AD Domain Password policies

Password policies are set in the “Computer Configuration” section at Computer Configuration\Windows Settings\Security Settings\Account Policies\Password Policy.

The policy applies to the computer as user objects are stored in the database of the computer. A Domain Controller is a special computer in that its database holds user objects for the entire domain, so it takes its password policy settings from whatever policy wins at the domain level. What this means is that you can have multiple policies in a domain – each containing different password policies – but the only one that matters for domain users is the policy that wins at the domain level. Any policy that applies to OUs will only apply to local user objects that reside in the SAM database of computers in that OU. 

A quick recap on how GPO precedence works: OU linked GPOs override Domain linked GPOs which override Site linked GPOs which override local GPOs (i.e. OU > Domain > Site > local). Within each of these there can be multiple GPOs. The link order matters here. The GPOs with lower link order win over GPOs with higher link order (i.e. link order 1 > link order 2 > …). Of course all these precedence order can be subverted if a GPO is set as enforced. In this case the parent GPO cannot be overridden by a child GPO. 

By default the Default Domain Policy is set at the Domain level and has link order 1. If you want to change the password policy for the domain you can either modify this GPO, or create a new GPO and apply it at the Domain level (but remember to set it at a lower link order than the Default Domain Policy – i.e. link order 1 for instance). 

For a list of the password policy settings check out this TechNet page. 

Since Windows Server 2008 it is possible to have Fine Grained Password Policies (FGPP) that can apply to OUs. These are not GPOs, so you can’t set them via GPMC. For Server 2008 this TechNet page has instructions (you have to use PowerShell or ADSI Edit), for Server 2012 check out this blog post (you can use ADAC; obviously Server 2012 makes it easier). Check out this article too for Server 2008 (it is better than the TechNet page which is … dense on details). 

Replicate with repadmin

The following command replicates the specified partition from the source DC to the destination DC. You can use this command to force a replication. Note that these three arguments are mandatory. 

An optional switch /full will HWMV and UTDV tables to be reset, replicating all changes from the source DC to the destination DC.

The following command synchronizes the specified DC with all its replication partners. You can specify a partition to replicate; if nothing is specified, the Configuration partition is used by default

Instead of specifying a partition the /A switch can be used to sync all partitions held by the DC. By default only partner DCs in the same site are replicated with, but the /e switch will cause replication to happen with all partners across all sites. Also, changes can be pushed from the DC to others rather than pulled (the default) using the /P switch.