[Aside] Exporting Exchange mailboxes to PST (with the ‘ContentFilter’ date format fixed)

Quick shoutout to this blog post by Tony (cached copy, as I can’t access the original). The New-MailboxExportRequest cmdlet has a -ContentFilter parameter where you can specify the date and other criteria by which to filter what is exported. Very irritatingly, the date is expected to be in US format. And even if you specify it in US format to begin with, the cmdlet converts it to US format again, switching the day and month and complaining if the result is an invalid date. Thanks to this blog post which explained this to me and this blog post for a super cool trick to work around it. Thank you all! 
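
The trick, roughly (a sketch, not necessarily the exact code from the linked posts; the mailbox name, date and paths are made up):

    # Build the date string yourself with Get-Date so you control the format explicitly,
    # rather than typing a literal date that Exchange re-parses US-style.
    $cutoff = (Get-Date '2018-12-31').ToString('MM/dd/yyyy', [Globalization.CultureInfo]::InvariantCulture)
    New-MailboxExportRequest -Mailbox 'some.user' `
        -ContentFilter "(Received -lt '$cutoff')" `
        -FilePath '\\server\share\some.user.pst'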

Exchange 2013 Restore and Operation terminated with error -1216 errors

You know how when you are nervous about something and so try to do a perfect job and hope for the best, things inevitably go wrong? Yeah, they don’t lie about that. :) 

I had to do an Exchange 2013 restore. My first time, to be honest, if you don’t count the one time I restored a folder in my home lab. And things didn’t go as smoothly as one would have hoped. But hey, it’s not a bad thing coz I got to learn a lot in the process. So no complaints, though it has been a nervous experience. 

Initially I had a lot of DPM troubles restoring. These were solved by attaching my Azure recovery vault to a different server and doing a restore from there. In the process though I didn’t get DPM to clean up and mount the database directly on Exchange; rather I asked it to restore as files to a location and I decided to do the restore from there. 

It is pretty straightforward according to the docs:

  1. Restore the database files. 
  2. Bring the files of the restored database to a clean shutdown state. 
  3. Create a new recovery database, pointing to the previously cleaned up database files. 
  4. Mount database. 
  5. Do a restore request. 

That’s all!

Here are some more useful details I picked up along the way, including an error that caused some delays for me. 

1 Restore all files to a location

Now, this is important. After restoring *do not* rename the database file. I did that and it got me an “Operation terminated with error -1216 (JET_errAttachedDatabaseMismatch)” error. You don’t want that! Keep the database name as is. You can keep the folder names as is too, though those don’t really matter. 

2.1 Check the database state

You run the eseutil /mh “\path\to\edb\file” command for this.  

Two things to note. First, the state – “Dirty Shutdown” – so I need to clean up the database. And second, the logs required. If I check the restored folder I should see files with these names there. In my case I did have files along these lines:

2.2 Confirm that the logs are in a good state

You run the eseutil /ml “\path\to\log files\EXX” command for this (where XX is the log prefix). 

All looks good. 

2.3 Bring the database files to a clean shutdown state

For this you run eseutil /R EXX /l “\path\to\log\files” /s “\path\to\system\files” /d “\path\to\edb\file” (where XX is the log prefix). I remember the /lsd switches coz they look like LSD. :) They specify the path to the log files, the system files (this can be skipped, in which case the current directory is used; I prefer to point this to the recovered folder), and the database file. 

Looks good!

The first few times I ran this command I got the following error –

I went on a tangent due to this. :( Found many outdated blog posts (e.g. this one) that suggest adding a “/i” switch because apparently the error is caused by a mismatch with the STM database. I should have remembered STM databases don’t exist in 2013 (they were discontinued in 2010 I think, but I could be wrong; they used to be the database where attachments were stored). When I added the “/i” switch the operation succeeded but the database remained in a “dirty shutdown” state, and subsequent runs of the eseutil command without this switch didn’t make any difference. I had to start from scratch (luckily I was working on a copy of the restored files, so I didn’t have to do a fresh restore!). 

Later I had the smart idea of checking the event logs to see why this was happening. And there I saw entries like this:

You see, I had the “smart” idea of renaming my database file to something else. It was a stupid idea because the database still had references to the old name, and that’s why this was failing. None of the blog posts or articles I read did a database rename, but none of them said don’t do it either … so how was I to know? :) 

Anyways, if you haven’t renamed your database you should be fine. You could still get this or some other error … but again, don’t be an idiot like me and go Googling. Check your event logs. That’s what I usually do when I am sensible; being nervous and all here, I skipped my first principles. 

2.4 Confirm the database state is clean

Run the eseutil /mh command of step 2.1 and now it should say “Clean Shutdown”. 

At this point your recovered database files are in a clean state. Now to the next step!

3 Create a new recovery database, pointing to the restored database files. 

Do this in the Exchange Management Shell. 
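
Presumably something along these lines (a sketch; the server name, database name and paths are made up, and the EdbFilePath must point at the restored EDB with its original file name):

    New-MailboxDatabase -Recovery -Name RDB01 -Server EXCH01 `
        -EdbFilePath 'D:\Recovered\DB01.edb' -LogFolderPath 'D:\Recovered'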

You will get a warning about restarting the Information Store service. Don’t do it! Restarting the Information Store will cause all Outlook connections to that server to disconnect and reconnect – you definitely don’t want that in a production environment! :) Restarting is only required if you are adding a new production database and Exchange has to allocate memory for this database. Exchange 2013 does this when the Information Store is (re)started. We are dealing with a Recovery Database here and it will be deleted soon – so don’t bother. 

Also, remember that Recovery databases don’t appear in EAC so don’t worry if you don’t see it there. 

If you want to read more on Recovery databases this is a good article.

4 Mount the recovery database

Again in EMS:
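
Something like this (database name as created above):

    Mount-Database RDB01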

5 Do a restore request

Still in EMS:
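
A sketch of what that looks like (the names, per the description below, are placeholders):

    New-MailboxRestoreRequest -SourceDatabase RDB01 -SourceStoreMailbox 'User Mailbox' `
        -TargetMailbox 'User Mailbox' -IncludeFolders '#Contacts#' `
        -TargetRootFolder 'RecoveredContacts'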

I am restoring a folder called “Contacts” from “User Mailbox” back to “User Mailbox” under a folder called “RecoveredContacts”. 

You can keep an eye on the restore request thus:
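
For example (the statistics cmdlet gives you a percentage complete):

    Get-MailboxRestoreRequest | Get-MailboxRestoreRequestStatistics |
        Format-Table Name,Status,PercentComplete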

5+ Cleanup tasks

And once done you can remove all completed requests:
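
Along these lines:

    Get-MailboxRestoreRequest -Status Completed | Remove-MailboxRestoreRequest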

That’s it! 

You can also remove the Recovery database now. Dismount it first. 
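
Something like (add -Confirm:$false if you don’t want the prompts):

    Dismount-Database RDB01
    Remove-MailboxDatabase RDB01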

This doesn’t remove the EDB files or logs, so remove them manually.

Some links that I came across while reading up on this:

My first baby step production Exchange 2013 restore. Finally! :)

[Aside] Exchange Mutual TLS

For future reference, based on this article.

For incoming connections: a) specify the domain as being secure (i.e. requires TLS) via something like this – 
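
Presumably something along these lines on the Mailbox server (the partner domain is made up; these are the documented Set-TransportConfig domain-secured lists):

    Set-TransportConfig -TLSReceiveDomainSecureList partner.example.com `
        -TLSSendDomainSecureList partner.example.com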

Do the above on the Mailbox server. You can force a sync to edge after that via Start-EdgeSynchronization on the Mailbox server. 

Then b) on the Edge server enable domain secured and TLS (they are likely to be already enabled by default). 
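
Something like this (the connector name is made up):

    # Tls must already be among the connector's AuthMechanism values (it is by default)
    Set-ReceiveConnector 'Default internal receive connector EDGE01' -DomainSecureEnabled $true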

Etisalat blocks outgoing port 25

There, I’ve said it. Hopefully anyone else Googling for this will find it now!

I spent the better part of yesterday checking all my firewalls because I couldn’t connect to port 25 of public mail servers from my VM. As a SysAdmin I have an Exchange VM etc. in my home lab that I wanted to set up for outgoing email … and when I couldn’t connect to port 25 I wondered whether there was something wrong with my network configuration or firewalls (being a VM there are layers of networking and firewalls involved). Turns out none of that was the problem. My host itself cannot connect to port 25. The ISP is blocking it – simple!
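
A quick way to check this sort of thing from PowerShell (the target host is just an example):

    Test-NetConnection -ComputerName smtp.example.com -Port 25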

This is Etisalat at home. Things might be different for business connections of course. And, this is nothing against Etisalat – many ISPs apparently do this to block spam, so it is standard practice. First time I noticed it, that’s all. 

Edge Subscription process (and some extra info)

It’s documented well, I know. This is just for my quick reference. 

To create a new edge subscription, type the following on edge – 
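
Presumably the usual (file path made up):

    New-EdgeSubscription -FileName 'C:\Temp\EdgeSubscription.xml'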

Copy the file to a Mailbox server (just one is fine, the rest will sync from it). Then – 
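
Along the lines of the documented example (path and site name made up):

    New-EdgeSubscription -FileData ([byte[]]$(Get-Content -Path 'C:\Temp\EdgeSubscription.xml' -Encoding Byte -ReadCount 0)) -Site 'Default-First-Site-Name'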

Change the file paths & site etc. as required. To remove edge subscriptions, either on Edge or on Mailbox – 
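
Something like:

    Get-EdgeSubscription | Remove-EdgeSubscription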

To set the FQDN on the outgoing (send) connector in response to EHLO do this on the Mailbox server – 
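
A sketch (the connector name and FQDN are made up):

    Set-SendConnector 'EdgeSync - Default-First-Site-Name to Internet' -Fqdn mail.example.com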

If you want to change the incoming (receive) connector name run this on the Edge server – 
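
Assuming “name” here means the FQDN the connector presents, a sketch (names made up):

    Set-ReceiveConnector 'Default internal receive connector EDGE01' -Fqdn mail.example.com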

You can find the certificates on the Edge server via Get-ExchangeCertificate. To use one of them as the certificate for the SMTP service do – 
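
Something like (picking the certificate by subject is just one way to do it):

    $cert = Get-ExchangeCertificate | Where-Object { $_.Subject -like '*mail.example.com*' }
    Enable-ExchangeCertificate -Thumbprint $cert.Thumbprint -Services SMTP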

That’s about it I think. 

Outlook auto-discover & DNS weirdness

It’s 2am and I spent the last 2-3 hours chasing a shitty problem in my home lab to which I haven’t yet found a satisfactory answer. What a waste of time (sort of)!

It all began when I enabled MAPI/HTTP on my home Exchange 2013 and noticed that Outlook (2016 & 2019) was still connecting to it via RPC/HTTP. To troubleshoot that I deleted my Outlook profile to see if a fresh auto-discovery would get it to connect using MAPI/HTTP. That ended up being a long rabbit hole!

Background: all my user accounts are of the form firstname.lastname@raxnet.global. I have auto-discovery set up via SCP (here’s a good blog post to read if you want an overview of auto-discovery) and in the past when I launched Outlook it used to give me a prompt like this – 

[screenshot: Outlook profile setup prompt]

– followed by this one, where I choose “Exchange” and am done. 

[screenshot: choosing “Exchange” as the account type]

Today, however, I was getting prompted for a password after the first screen, and when I entered the password it complained that it was wrong. Moreover, I noticed that it was trying to connect via IMAP, picking up an external IMAP server for this domain (the domain raxnet.global is set up externally too, with IMAP auto-discovery in the form of SRV records for _imaps._tcp.raxnet.global). That didn’t make sense. All my internal clients were pointing to the internal DC for DNS, and this DC hosted the raxnet.global zone itself, so there was no reason why queries should be going to the Internet.

[screenshots: Outlook account setup picking up imap.fastmail.com]

(Notice the imap.fastmail.com? That should not be happening.)

Moreover, if I did an nslookup -type=SRV _imaps._tcp.raxnet.global. from my clients they weren’t resolving the name either. How was Outlook able to resolve it then?! (I still don’t have an answer, so don’t get your hopes up, dear reader.) Clients could resolve the name only if they queried an external DNS server, but that wasn’t happening in my case. 


Here’s an article on Outlook 2016’s implementation of auto-discover. It too tells me that SCP is the default mechanism for auto-discovery when Outlook discovers it’s on a domain-joined machine. Starting with Outlook 2016, however, there is a special case: if Outlook feels the account is an Office 365 one it will try to auto-discover using the Office 365 endpoints (typically https://autodiscover-s.outlook.com/autodiscover/autodiscover.xml or https://autodiscover-s.partner.outlook.cn/autodiscover/autodiscover.xml). My domain isn’t on Office 365, but what the heck, I disabled that via a registry key anyway – no luck, it still went to IMAP. 
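
If memory serves, the key in question is the documented ExcludeExplicitO365Endpoint value; a sketch (16.0 covers Outlook 2016/2019):

    $path = 'HKCU:\Software\Microsoft\Office\16.0\Outlook\AutoDiscover'
    New-Item -Path $path -Force | Out-Null
    New-ItemProperty -Path $path -Name ExcludeExplicitO365Endpoint `
        -PropertyType DWord -Value 1 -Force | Out-Null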

What I don’t understand is why it keeps ignoring SCP. It is as if Outlook thinks I am not domain joined and is magically able to query external DNS. The Windows VM on which Outlook runs can’t query the outside world, so how is Outlook doing it?

I verified the SCP endpoints in Exchange (this is a good article to read) and also came across this script which you can run on the client side to make it query AD for the SCP endpoint via LDAP. (I couldn’t access the blog so had to get a cached copy via Google, so I am going to note down the gist of the script here for my future reference. All credit to the author of the blog.) 
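
Not the original script, but a sketch of the same idea, assuming the documented Autodiscover SCP keyword GUID:

    # Search the AD configuration partition for Autodiscover service connection points
    $root = [ADSI]'LDAP://RootDSE'
    $cfg  = $root.configurationNamingContext
    $searcher = New-Object System.DirectoryServices.DirectorySearcher([ADSI]"LDAP://$cfg")
    $searcher.Filter = '(&(objectClass=serviceConnectionPoint)(keywords=77378F46-2C66-4aa9-A6A6-3E7A48B19596))'
    $searcher.FindAll() | ForEach-Object {
        $_.Properties['servicebindinginformation']   # the Autodiscover URL(s)
    }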

This correctly returned my client machine’s site as well as the autodiscover URL. 

I have no idea why auto-discover was failing. My hunch at this point was that Outlook was querying DNS some other way, because of which it was thinking it is not domain joined, and so all the usual auto-discovery was failing until it hit upon the Internet-facing IMAP auto-discovery. So what was my DNS setup like? 

All my clients were pointing to the DC via DHCP for DNS. The DC in turn was forwarding to my OPNSense router, which had Unbound DNS running on it. I verified again, from the client and the DC, that I was unable to resolve the external records of raxnet.global. I suspected OPNSense at this point so I turned off Unbound DNS to see if that impacted the auto-discovery. Sure enough, it did! Now Outlook was behaving as it used to. 

Unbound DNS was set to listen only on the servers’ subnet. So there was no way a query could come from my clients’ subnet to Unbound (my bad, I should have checked this by querying Unbound directly – too sleepy now to enable Unbound again and see what happens). I couldn’t see any other settings either, so I stopped Unbound and added root hints to the DC so it can talk to the Internet directly. (Here yet another thing wasted my time. I always set DNS servers to listen on “All IP addresses” – including IPv6. With that setting my DC was unable to resolve external names using the root hints. If I set it to listen only on IPv4, the DC is able to resolve as expected! Something on the IPv6 side was causing an issue. Funnily a colleague had mentioned this in the past but I never believed him.) 


That done, my clients would now talk to the DC for DNS queries (as before), and the DC would go and look things up on its own (unlike before). No more forwarder. I tried Outlook again with this setting and auto-discover worked correctly. 

The issue isn’t Unbound or OPNSense. The issue is that when my DC uses a forwarder (any forwarder, even 8.8.8.8 for instance) Outlook thinks it is not domain joined and can somehow see my external DNS records – breaking auto-discovery. I don’t know why this happens, especially considering the OS on which Outlook is running can’t resolve the external records and can in fact resolve the internal LDAP endpoints correctly. Something in the way Outlook sends DNS queries causes it to behave differently. (Or maybe I am just too sleepy and can’t think of a simple explanation now!)

Exchange 2013 Restore

I need to restore the Contacts folder for a user. I will restore it to an admin mailbox and then email the required contact to the user from there. Pretty straightforward.

1) Create a recovery DB. The recovery DB is called “RDB”. I store it on the D: drive in a folder called “Recovery”.
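
The cmdlet for this, presumably (server name made up):

    New-MailboxDatabase -Recovery -Name RDB -Server EXCH01 `
        -EdbFilePath 'D:\Recovery\RDB.edb' -LogFolderPath 'D:\Recovery'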

2) Open Services and restart the Exchange Information Store service.

3) Set the recovery DB as over-writeable.
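
Along the lines of:

    Set-MailboxDatabase RDB -AllowFileRestore $true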

4) I backup using DPM/ Azure Backup Server. Restore as per here.

5) Make a restore request from the recovery DB for the Contacts folder. I will restore this to a folder called Recovery in the administrator user. The switches below are for that:
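
A sketch (mailbox names are placeholders):

    New-MailboxRestoreRequest -SourceDatabase RDB -SourceStoreMailbox 'User Mailbox' `
        -TargetMailbox administrator -TargetRootFolder Recovery `
        -IncludeFolders '#Contacts#' -AllowLegacyDNMismatch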

The -AllowLegacyDNMismatch is required because I am restoring to a different mailbox.

6) Keep an eye on the progress via the Get-MailboxRestoreRequest cmdlet. You can get stats this way if required:
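
For example:

    Get-MailboxRestoreRequest | Get-MailboxRestoreRequestStatistics |
        Format-List Name,Status,PercentComplete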

That’s it!

If I hadn’t restored to a recovery DB via DPM and brought it to a clean shutdown state, I can follow instructions along these lines to do that.

Mail enable an existing distribution group
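
A minimal sketch of the task, assuming the Enable-DistributionGroup cmdlet (group name made up; the group must be a universal group):

    # mail-enables an existing universal distribution group
    Enable-DistributionGroup -Identity 'Sales Team'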

FET ports

FET == Front End Transport service in Exchange 2013. It is part of the Client Access Server (CAS) role and it accepts emails. Receives emails, sends emails. Stateless – no inspection, queueing, etc. Just a simple dude!

Here are the default receive connectors on an Exchange server. 3 belong to FET, 2 to HT (Hub Transport).

[screenshot: default receive connectors]

HubTransport is the Transport service on the Mailbox server role. 

Here are the ports these various connectors listen on:

  • Default Frontend <servername> (FET) – port 25
  • Client Frontend <servername> (FET) – port 587
  • Outbound Proxy Frontend <servername> (FET) – port 717
  • Default <servername> (HubTransport) – port 2525, or 25 if CAS & MBX are on separate servers
  • Client Proxy <servername> (HubTransport) – port 465

These connectors are stored in AD under the configuration partition. 

[screenshot: receive connector objects under the AD configuration partition]

FET listens on various ports:

  • Port 25 – for receiving emails from other servers. The receive connector for this is called Default Frontend <servername>. It accepts anonymous connections from external SMTP servers for the accepted domains of this server.
    • You can create additional receive connectors on port 25 if you want to accept anonymous connections for non-accepted domains too (i.e. set up an anonymous relay). See this and this for more info on how to do that. Basically you set up a new receive connector on port 25, restrict it to certain source IPs (the IPs from which you want to accept anonymous connections), and then you modify the AD security properties of the connector to allow NT AUTHORITY\ANONYMOUS LOGON to accept any recipient when sending emails. (An alternate way detailed in the same article is to set the receive connector as Externally secured.) There’s a sketch of this just after this list. I didn’t know receive connectors were AD objects you could assign ACLs to, so I was pleased to learn that today. :)
  • Port 587 – for receiving emails from clients (such as IMAP4 & POP3 clients that expect an SMTP server to submit emails to). The receive connector for this is called Client Frontend <servername>. This port does not accept anonymous connections; clients have to authenticate.
    • To clarify for myself because I have forgotten this from my past: port 465 is for mail submission where a TLS handshake begins immediately (implicit TLS), while port 587 is where TLS starts only after the client sends a STARTTLS verb. What this means for Exchange is that port 587 can accept TLS connections, but chances are you will get a certificate error when trying to use TLS. You must therefore assign a certificate to this receive connector if you want TLS. Thanks to RFC 8314 for reminding me of port 465 vs 587 and this FastMail page for some history. 
  • Port 717 – for receiving emails from the Mailbox server role. The receive connector for this is called Outbound Proxy Frontend <servername>. This is only used if a send connector has the “proxy through client access server” option ticked. If this option is not ticked the Mailbox server sends external emails directly to external SMTP hosts. (The reason for proxying is that all outgoing email then appears to come from the CAS server rather than the Mailbox server – better from a security point of view, as the CAS is what is usually Internet exposed.) 
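
As promised, a sketch of the anonymous relay setup from the sub-bullet above (server name, connector name and IPs are made up):

    # New FET connector on port 25, restricted to the source IPs we want to relay for
    New-ReceiveConnector -Name 'Anonymous Relay' -Server EXCH01 `
        -TransportRole FrontendTransport -Usage Custom -Bindings 0.0.0.0:25 `
        -RemoteIPRanges 10.0.0.10,10.0.0.11 -PermissionGroups AnonymousUsers

    # Let anonymous connections send to any recipient (this is the AD ACL bit)
    Get-ReceiveConnector 'EXCH01\Anonymous Relay' | Add-ADPermission `
        -User 'NT AUTHORITY\ANONYMOUS LOGON' `
        -ExtendedRights 'ms-Exch-SMTP-Accept-Any-Recipient'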

As you can see the FET is for emails from other servers or SMTP clients. It is not used when someone sends an email via Outlook MAPI, for instance. This picture from the Microsoft docs on Exchange 2013 mail flow is worth putting here for future reference. As you can see, when Outlook MAPI is used to send an email it is stored in the mailbox database and the Mailbox Transport service picks it up from there. Outlook doesn’t send email via SMTP to Exchange. 

[diagram: Exchange 2013 mail flow, from the Microsoft docs]

The Mailbox server has two “mail flow” services running on it. 

  • The first of these is what we saw as HubTransport in the screenshots above. This is known as the Transport service in Exchange 2013 (though confusingly referred to by its Exchange 2010 name of Hub Transport in the GUI!) and is a stateful SMTP server – i.e. does things like categorizing etc. as in the immediate screenshot above.
  • The second of these is the Mailbox Transport service (which in turn can be thought of as two components, but I am going to ignore that here) and this one’s stateless like the FET. Its role is to take mail from the mailbox database and/or inject mail into the mailbox database. It also interfaces with the Transport service to send & receive emails. 

At this point it’s worth pointing to this older post of mine with some screenshots of these components to get a better picture of the components and flow. One more screenshot I’d like to add here is this one from the Mastering Exchange 2013 book.

[diagram from the Mastering Exchange 2013 book]

The Transport service (aka HubTransport) on the Mailbox server is pretty important (not that the Mailbox Transport service is any less important). The Transport service does content inspection etc. and has two connectors of its own. It listens on these ports:

  • Port 2525 (if the Mailbox & CAS roles are on the same server – in this case CAS has already taken port 25 for itself) or Port 25 (if the Mailbox & CAS are on separate servers) – for many things (read on). The receive connector for this is called Default <servername>
    • This port is used to receive emails from the FET for internal mailboxes, from the Transport service of other Mailbox servers, and from the Mailbox Transport service on itself or other servers.
    • Looking at the diagram from earlier (repeated below), the three incoming arrows are what it accepts emails from. The yellow “SMTP Receive” is the Transport service listening on port 2525 or 25. It accepts emails from the FET (arrow from the top), the Transport or Mailbox Transport of other servers (arrow from the left), and the Mailbox Transport on itself (arrow from below). 

[diagram: Exchange 2013 mail flow (repeated)]

  • Port 465 – this is used to proxy client connections (IMAP4 & POP3 email clients) to port 587 of the FET. The receive connector for this is called Client Proxy <servername>.

To summarize, the important receive connectors to keep in mind for a typical Outlook/Exchange environment are 1) the FET receive connector on port 25 and 2) the HubTransport receive connector on port 2525 or 25. If you have IMAP4 & POP3 clients then 3) the FET receive connector on port 587 and 4) the HubTransport receive connector on port 465 also matter. And lastly, if you proxy outbound connections via the CAS then 5) the outbound proxy connector on port 717 also matters. 

ADFS with Exchange OWA & ECP (contd.)

This is a continuation of my post from yesterday.

While OWA works fine following my post yesterday, I learnt today that ECP does not work for users in the second domain. (To use correct terminology: users in the resource forest are able to log in via ADFS and access both OWA & ECP, but users in the organization forest are only able to access OWA. With ECP they get an error as below.) 

[screenshot: ECP error]

Yeah, sucks!

With some trial and error here’s what I learnt:

  • ECP expects the SID passed along to be that of the disabled account in the resource forest (i.e. where Exchange & ADFS are hosted). Since I am passing along the SID from the organization forest it is denying access. 
  • OWA on the other hand expects the SID passed along to be that of the active account in the organization forest (which is what we are currently passing). If one tries to make ECP happy and thus somehow pass the SID of the disabled account in the resource forest, OWA complains as below. 

[screenshot: OWA error when passing the resource forest SID]

I managed to work around this a bit. It is not a complete fix and I wouldn’t implement this in a production environment, but since my intention here is more to pick up ADFS than anything else I am happy I came up with this workaround. 

First off I changed my second ADFS server (the one in the organization forest) to not pass along all claims to the relying party trust with the resource forest. This is not really needed, but I had to do this for one more change I wanted to implement, and figured it’s best to keep control of the claims I pass along. As I alluded to in another post from today, you can end up with duplicate claims if you pass along all claims and make changes along the way. Better to be picky. Anyways, on the second ADFS server I decided to pass along just 4 claims:

[screenshot: the four claims]

At the risk of digressing, the first claim is why I initially decided to stop passing along all claims. I have admin accounts for myself with the same username in both forests, but differing UPNs, and I wanted it to be such that when I visit OWA or ECP with the admin account of the organization forest I am signed in to OWA or ECP as the admin account of the resource forest. That is, when I sign in as admin@two.raxnet.global (organization forest; this account has no mailbox or Exchange rights) ADFS will tell Exchange that this is actually admin@raxnet.global and let me log in as the latter. That’s the point of ADFS and claims and all that, after all – Exchange doesn’t do any authentication, it simply listens to what the ADFS server tells it, and I can do all this fun stuff on the ADFS side. 

Thus I created a claim rule like this:
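
A sketch of such a rule in the ADFS claim rule language (the exact rule text is mine, not a copy):

    c:[Type == "http://schemas.xmlsoap.org/ws/2005/05/identity/claims/upn"]
     => issue(Type = "http://schemas.xmlsoap.org/ws/2005/05/identity/claims/upn",
              Value = regexreplace(c.Value, "two\.raxnet\.global$", "raxnet.global"));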

It takes an incoming UPN claim and does a regex replace to substitute two.raxnet.global with raxnet.global. Easy peasy. Exchange is none the wiser and thus lets me log in as admin@two.raxnet.global and access the mailbox & ECP of admin@raxnet.global. 

Going back to the original issue. I thus have 4 claims as above. 

The relying party trust from the ADFS server in the resource forest to OWA works fine as it is happy with the SID it gets from the organization forest. What I need is a way to translate the SID of the organization forest to a SID in the resource forest, and pass that to the ECP trust. I don’t need to translate SID -> SID as such; what I really need to do is look up the account name I am getting from the organization forest and use that to find the SID in the resource forest. When I create the linked mailbox in the resource forest I use the same username as in the organization forest, so what I have to do is extract this username from the incoming claims and do a lookup using that. So far so good?

To do this I removed the claim rule on the ECP relying party trust that was passing through SID. (I let the UPN one stay as it is as that’s fine).

Then I added a custom rule like this:

Sanitize Windows account name

This takes an incoming windowsaccountname claim and makes a new claim (note, I add, not issue – so this claim is never output) called “http://raxnet.global/windowsaccountname” (just a dummy URI I made up, it doesn’t matter), and the value of this claim is the incoming windowsaccountname but with the domain name replaced (i.e. remove the domain name of the organization forest, replace it with the domain name of the resource forest). 
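
A sketch of what that rule could look like; the regex swaps everything before the backslash for the resource forest’s NetBIOS name (RAXNET here is made up):

    c:[Type == "http://schemas.microsoft.com/ws/2008/06/identity/claims/windowsaccountname"]
     => add(Type = "http://raxnet.global/windowsaccountname",
            Value = regexreplace(c.Value, "^[^\\]+", "RAXNET"));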

Now I added another custom rule:

Lookup SID

This one takes the claim I created above (the windowsaccountname with the domain name changed) and queries AD to find the objectSID attribute of this user account. This gives me the SID of this account in the resource forest. I now return it as a primarysid claim. 
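
Again a sketch, using the standard Active Directory attribute store query syntax:

    c:[Type == "http://raxnet.global/windowsaccountname"]
     => issue(store = "Active Directory",
              types = ("http://schemas.microsoft.com/ws/2008/06/identity/claims/primarysid"),
              query = ";objectSID;{0}", param = c.Value);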

To recap here is what I have:

[screenshot: the final claim rules]

Do this and now ECP works via ADFS! :) Unfortunately if you now try to go to OWA it fails with the error I was getting earlier. :(

[screenshot: OWA error]

The problem here is that both OWA and ECP are in the same domain https://exchange.fqdn/ and so when I switch from OWA to ECP or vice versa I don’t hit ADFS again to re-authenticate. Since my browser already has the cookie from a previously signed-in session it tries to access the new URL … and fails! And this is where I hit a wall and couldn’t proceed further. I figure if there’s some way to make the browser re-authenticate I can make it go to ADFS again … but I don’t know how to do that. I fiddled around with the cookie settings in IIS on the CAS server but didn’t make any progress. Exchange doesn’t let you use different URLs for OWA and ECP, so there’s no way to use a different domain for either of these and thus bypass the cookies either. Am stuck. 

So that’s it. If anyone comes across this problem and has a better idea I’d be interested in hearing. :)

Update: A workaround.

ADFS with Exchange OWA & ECP

Follow the instructions here to set up OWA & ECP authentication via ADFS rather than the default forms-based authentication. Here’s another, more concise article. Both articles talk about setting up WAP too, which I didn’t do in my home lab. 

I wanted to add something extra to these articles.

In my home lab I have my primary ADFS server, which has a relying party trust set up with OWA & ECP as in these articles. Just to keep things interesting :) I also have a second domain, in a trust relationship with the first domain, with its own ADFS server and users. The users in this second domain don’t have any Exchange mailboxes – those are hosted in the first domain. The second domain’s ADFS server has the first domain’s ADFS server as its relying party trust; and the first domain’s ADFS server has the second domain’s ADFS server as a claims provider trust, apart from the default Active Directory. 

The upshot of the situation is that everyone authenticates with the ADFS server of the primary domain. They are given the choice: 

[screenshot: ADFS home realm discovery page]

Selecting “Branch Office” takes them to the second domain ADFS server where they can authenticate and claims are sent to the primary domain ADFS server from where it is passed on to the application. Selecting “Head Office” results in authentication against the Domain Controller of the primary domain itself. Easy peasy.

Now on to the claims setup in the primary domain as part of setting up OWA & ECP. We set up two claim rules: one takes the WindowsAccountName claim issued by AD, gets the user SID, and sends that. 

The other claim rule takes the WindowsAccountName claim, gets the user UPN, and sends that.
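
These are the two rules from the Microsoft article, or something very close (custom rules against the Active Directory attribute store):

    c:[Type == "http://schemas.microsoft.com/ws/2008/06/identity/claims/windowsaccountname", Issuer == "AD AUTHORITY"]
     => issue(store = "Active Directory",
              types = ("http://schemas.microsoft.com/ws/2008/06/identity/claims/primarysid"),
              query = ";objectSID;{0}", param = c.Value);

    c:[Type == "http://schemas.microsoft.com/ws/2008/06/identity/claims/windowsaccountname", Issuer == "AD AUTHORITY"]
     => issue(store = "Active Directory",
              types = ("http://schemas.xmlsoap.org/ws/2005/05/identity/claims/upn"),
              query = ";userPrincipalName;{0}", param = c.Value);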

So both these claims essentially query AD and send the SID & UPN. They don’t work well with claims passed from another claims provider (such as the ADFS server in my second domain). So what I did is:

  1. On the second ADFS server I already had a rule that passed along all claims to the relying party trust of the primary ADFS server, so I did nothing additional. (If I didn’t have this in place I would have made two rules like I did in the next step). 
  2. On the trust between the primary ADFS server and OWA & ECP I removed the two rules above and made two rules that simply send UPN & SID through as UPN & SID (see the sketch after this list). The idea being that the default rules specifically query AD, but I don’t want to limit things to that. I will anyway get the UPN & SID claims either from the AD claims provider or from the second ADFS server, so all I need to do is pass these on to Exchange. 
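
The pass-through versions are one-liners (a sketch):

    c:[Type == "http://schemas.xmlsoap.org/ws/2005/05/identity/claims/upn"]
     => issue(claim = c);

    c:[Type == "http://schemas.microsoft.com/ws/2008/06/identity/claims/primarysid"]
     => issue(claim = c);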

[screenshot: UPN pass-through claim rule]

That’s it. Now users in the second domain too should be able to use OWA & ECP via ADFS. 

Update: This breaks ECP. Continued in another post.

Disabling Exchange 2013 Managed Availability monitors

Check out this blog post from Microsoft first. Mine’s mostly based on that but tailored to my specific situation.

We have a CAS server that’s purely for internal admin ECP functions. Managed Availability was running some ActiveSync tests on it and failing (because those components don’t exist on this server) with errors like these in SCOM:

So my mission, which I’ve chosen to accept (I saw “Mission Impossible: Fallout” this weekend!) is to disable this. :)

Managed Availability has the concept of health sets. From this page which lists all the health sets in Exchange 2013:

In Managed Availability, each component in Exchange 2013 monitors itself using probes, monitors and responders. Each Exchange 2013 component that implements Managed Availability is referred to as a health set.

So what are the unhealthy health sets on my server?
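
Something along these lines (server name made up):

    Get-HealthReport -Identity EXCH01 | Where-Object { $_.AlertValue -eq 'Unhealthy' }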

The result of this in my case was the following:

So how do I find which monitors in these health sets are failing? The following cmdlet can help:
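
That would be Get-ServerHealth, scoped to a health set; for example:

    Get-ServerHealth -Identity EXCH01 -HealthSet ActiveSync.Proxy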

I’ll just pipe both cmdlets to get a list of monitors across all unhealthy health sets:
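
For example:

    Get-HealthReport -Identity EXCH01 | Where-Object { $_.AlertValue -eq 'Unhealthy' } |
        ForEach-Object { Get-ServerHealth -Identity EXCH01 -HealthSet $_.HealthSet }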

In my case I get:

Should probably have filtered to just the unhealthy monitors:
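
i.e. tack another filter on the end:

    Get-HealthReport -Identity EXCH01 | Where-Object { $_.AlertValue -eq 'Unhealthy' } |
        ForEach-Object { Get-ServerHealth -Identity EXCH01 -HealthSet $_.HealthSet } |
        Where-Object { $_.AlertValue -eq 'Unhealthy' }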

Update: Re-reading this later, I realize this is all complicated and wrong. All I really need to do is Get-ServerHealth -Identity <server name>. I don’t need to loop over each health set. 

Anyways, the SCOM error referred to an EAS component, but I don’t see anything with that name. ActiveSyncProxy is probably the one it was referring to?

As an aside, if I want to see the components of a health set (i.e. the monitors, probes, responders) I can do the following:
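
Via Get-MonitoringItemIdentity; for example:

    Get-MonitoringItemIdentity -Identity ActiveSync.Proxy -Server EXCH01 |
        Format-Table Name,ItemType,TargetResource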

In the case of the ActiveSync.Proxy health set (which has the ActiveSyncProxy component) I can see:

Note that the ActiveSyncProxyTestMonitor monitor is what was showing as unhealthy earlier.

To disable a monitor I need to use the Add-ServerMonitoringOverride cmdlet. This is of the format:
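
The format being (placeholders, per the docs):

    Add-ServerMonitoringOverride -Server <ServerName> `
        -Identity <HealthSet>\<MonitoringItemName> `
        -ItemType <Probe | Monitor | Responder> `
        -PropertyName <Property> -PropertyValue <Value> `
        -Duration <dd.hh:mm:ss>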

In my case, to disable ActiveSync.Proxy (health set) ActiveSyncProxyTestMonitor (monitoring item – you can see this in the list of unhealthy monitors as well as in the list above) I do:
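
Something like (server name and duration made up):

    Add-ServerMonitoringOverride -Server EXCH01 `
        -Identity 'ActiveSync.Proxy\ActiveSyncProxyTestMonitor' `
        -ItemType Monitor -PropertyName Enabled -PropertyValue 0 -Duration 30.00:00:00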

That’s it. Wait a while and now it will appear as disabled.

Next thing is how do I find out why the ActiveSync health set is unhealthy? Let’s take a look at the probes in it:
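
For example:

    Get-MonitoringItemIdentity -Identity ActiveSync -Server EXCH01 |
        Where-Object { $_.ItemType -eq 'Probe' } | Format-Table Name,TargetResource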

I can invoke the probe manually using the following cmdlet:

Pipe this out as a list to read better. Here’s what I did:
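
Something like this (the probe name is a guess; pick one from the listing above):

    Invoke-MonitoringProbe -Identity 'ActiveSync\ActiveSyncSelfTestProbe' -Server EXCH01 |
        Format-List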

The output gives the errors encountered by the tests. I could see that it was related to EAS so decided to disable it too.

Lastly, if you are curious as to what overrides exist the following cmdlet will help:
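
Namely:

    Get-ServerMonitoringOverride -Server EXCH01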

Also, if you want to double check that a particular component on the Exchange server is inactive (and that’s why monitors are failing) the following cmdlet will help (I sort it by state for easy reading but that’s optional):
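
For example:

    Get-ServerComponentState -Identity EXCH01 | Sort-Object State |
        Format-Table Component,State -AutoSize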

The last section of the article I referred to at the beginning of this post talks about editing C:\Program Files\Microsoft\Exchange Server\V15\Bin\Monitoring\Config\ClientAccessProxyTest.xml to disable certain probes. Not sure why they suggest that instead of disabling probes via the cmdlet – I think it’s because the cmdlet way is more of a temporary thing (for a certain duration) while modifying the config file is a permanent fix. I should probably do the config file in my case.

[Aside] Under the Hood with DAGs

Watching this Ignite 2015 video: Under the Hood with DAGs, by Tim McMichael.

Adding some links here to supplement the video:

  • Tuning Failover Cluster Network thresholds – useful when you have stretched DAGs
  • The mystery of the 9223372036854775766 copy queue… – never had this one but good info.
    • Basically, the cluster registry keeps track of the last log number per database and also the timestamp. When a node wants to see its copy queue length (i.e. how far behind it is in terms of processing the logs) it can compare this log number with the log number it has actually processed. Sometimes, however, a node might have issues updating or reading the cluster registry and so it falls behind in terms of receiving updates. In such cases the last log number will match what it has processed, but this is actually outdated info, and if the Exchange Replication service on the server hosting the passive copy notices that the timestamp is 12 minutes old it puts its database copy into self-protection mode. This is done by setting the copy queue length (a.k.a. CQL) manually to 9 quintillion (the maximum a 64-bit integer can hold). No one can actually have such a large copy queue length so it’s as good a number as any to choose.
    • The video suggests rebooting each node until you find one which might be holding updates. But the link above suggests a different method.

DAC

Came across some Datacenter Activation Coordination (DAC) info from Tim’s blog: part 1, followed by a series of posts you can see at the end of part 1.

DAC mode works by using a bit stored in memory by Active Manager called the Datacenter Activation Coordination Protocol (DACP). DACP is simply a bit in memory set to either a 1 or a 0. A value of 1 means Active Manager can issue mount requests, and a value of 0 means it cannot.

The starting bit is always 0, and because the bit is held in memory, any time the Microsoft Exchange Replication service (MSExchangeRepl.exe) is stopped and restarted, the bit reverts to 0. In order to change its DACP bit to 1 and be able to mount databases, a starting DAG member needs to either:

  • Be able to communicate with any other DAG member that has a DACP bit set to 1; or
  • Be able to communicate with all DAG members that are listed on the StartedMailboxServers list.

The bit I italicized is important. If you read his blog post you’ll see why. If DAC is enabled and you are starting up a previously shut down DAG, even though the DAG might have quorum it will not start up if some members are still offline. (I had missed that when reading about DAC earlier.) To summarize it succinctly, from part 2 of his series:

Remember, with DAC mode enabled, different rules apply for mounting databases on startup. The starting DAG member must be able to participate in a cluster that has quorum, and it must be able to communicate with another DAG member that has a DACP value of 1 or be able to communicate with all DAG members listed on the StartedMailboxServers list.

Here’s highlights from some of the interesting posts in Tim’s series:

  • Part 4 has info on the steps to do a datacenter switchover and the cmdlets available when DAC is enabled. Essentially you 1) Stop-DatabaseAvailabilityGroup with the -ConfigurationOnly:$TRUE switch for the site that is down – this marks the servers in the down site as down, 2) Stop-Service clussvc on the nodes in the site that is up, and finally 3) Restore-DatabaseAvailabilityGroup specifying the site that is up. This Microsoft doc on datacenter switchovers is worth reading side-by-side. It contains info on both DAC and non-DAC scenarios so watch out for that.
  • Part 5 has info on how to use the Start-DatabaseAvailabilityGroup cmdlet to set the DACP bit to 1 on a specified server, thus bringing up the DAG by forcing a consensus.
  • Part 6 is an interesting story. A nice edge case of DAC being enabled and graceful shutdown.
  • Part 8 is another interesting story on what happens due to a typo in a cmdlet.

Very briefly, the DAC cmdlets (a usage sketch follows the list):

  • Stop-DatabaseAvailabilityGroup – marks a specified server, or all servers in a specified AD site, as down. Use the -ConfigurationOnly switch to mark the server as down in AD only but not actually do anything on the server(s). You need this switch if the servers are already offline but AD is up and accessible in that site. This cmdlet also forces a sync of AD across sites so the information is propagated.
  • Start-DatabaseAvailabilityGroup – same as above, but mark as up. Can use the -ConfigurationOnly switch to not really do anything but only mark in AD.
  • Restore-DatabaseAvailabilityGroup – evicts any stopped servers, can configure the DAG to use an alternate witness server, and brings up the DAG after doing this. This cmdlet can only be used against a DAG with DAC enabled.
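
A sketch of the Part 4 switchover sequence using these cmdlets (DAG, site and witness names are made up):

    # 1) Mark the failed site's servers as down:
    Stop-DatabaseAvailabilityGroup DAG1 -ActiveDirectorySite SiteA -ConfigurationOnly:$true
    # 2) On each DAG member in the surviving site:
    Stop-Service clussvc
    # 3) Evict the stopped servers and bring the DAG up in the surviving site:
    Restore-DatabaseAvailabilityGroup DAG1 -ActiveDirectorySite SiteB -AlternateWitnessServer FS01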

Dynamic Quorum

Came across dynamic quorum from the videos (wasn’t previously aware of it). Am being lazy and will put in some screenshots from the video:

The highlighted part is the key thing.

Remember that quorum is defined as “(the number of votes)/2 + 1“. Each node (or witness) typically has a single vote, and (number of votes)/2 is rounded down (i.e. 7/2 = 3.5, rounded down to 3).

With dynamic quorum once a node (or set of nodes) fail, and if the remaining set of nodes form a quorum (note – they have to form quorum), then the required quorum of the cluster is adjusted to reflect the remaining number of nodes.

Take a look at the scenario below:

We have two data centers. 6 nodes + a witness, so initially the quorum was 7/2 + 1 = 4.

The link between the two data centers goes down. Data center B has 3 nodes, which is below the quorum of 4 so all 3 nodes shutdown. Data center A has 3 nodes + witness, thus meeting the quorum and it stays up.

At this point if any further node in data center A goes down, they will fall below the quorum and the cluster will shutdown. To avoid such a situation is where dynamic quorum comes in. With dynamic quorum (introduced since Server 2012) when the nodes in data center A form quorum, the new quorum requirements is 4/2 + 1 = 3.

If a node goes down in data center A, leaving 2 nodes + a witness, since they meet the new quorum of 3 the cluster stays up. The quorum then gets revised to be 2/2 + 1 = 2. If yet another node goes down, the remaining node + witness still meets the new quorum of 2 and so the cluster continues to stay up.

Another slide:

Two data centers, 2 nodes + 1 node, no witness; the quorum is therefore 3/2 + 1 = 2.

One of the nodes in data center A goes down. Since the number of remaining nodes meets quorum, the cluster can stay up. But since there is no file share witness each node cannot be given an equal vote (I wasn’t aware of this). Thus the cluster service picks one of the nodes (the one with the lowest node ID) and gives it a vote of 0. The node in data center A has a vote of 0. The new quorum is thus 1/2 + 1 = 1.

If the link between the two data centers goes down, the node in data center B stays up even though the node in data center A too could have formed quorum! Nothing wrong with it, just an edge case to keep in mind as chances are you probably wanted data center A to remain up as that’s why you provisioned two nodes there in the first place.

Now for a variant in which there is a witness:

So two data centers, 2 nodes + 2 nodes, 1 witness in data center A; the quorum is therefore 5/2 + 1 = 3.

As before, one of the nodes in data center A goes down. Since there are 3 nodes + witness remaining, they meet quorum and the cluster continues. The new quorum is 4/2 + 1 = 3. Again, the data center link goes down. Everything goes down! :) Why? Coz no one has a clear majority. Each side has 2 votes, not the 3 required.

Interestingly I have this setup at work. So a critical thing to keep in mind is that if I were to update & reboot the witness or one of the nodes in data center A (my preferred data center), and the WAN link were to go down – I could lose the cluster! No such problems if I update & reboot a node in data center B and the link goes down, as data center A has the majority. Funny, it’s like you must keep the witness in the less preferred data center.

Windows Server 2012R2 improves upon dynamic quorum by adding dynamic witness.

So if the number of votes is odd, the witness vote is removed. And if the witness is offline or failed, then too it is removed (that includes reboots too, right?). 

Now things get tricky.

Going back to the previous example: so two data centers, 2 nodes + 2 nodes, 1 witness in data center A; the quorum is therefore 5/2 + 1 = 3.

As before a node in data center A goes down (the picture is a bit incorrect as I skipped some intermediate slides); the remaining nodes have quorum so the cluster stays put. The new quorum is 4/2 + 1 = 3. But since the number of nodes is now ODD, the cluster service removes the witness from the vote calculations. So the new quorum turns out to be 3/2 + 1 = 2. At this point if the link goes down, the nodes in data center B have quorum and so they form a cluster, while the remaining node in data center A is shut down. So unlike the Server 2012 case, which had no dynamic witness, the whole cluster does not go down!

Now, going back to the case where one of the nodes (not the witness) had its vote removed as there was only one node in each data center: I mentioned that the node with the lowest ID gets its vote removed. The next two slides talk about that, including how to select a node in a cluster whose vote we’d preferentially like to remove in such situations.

At this point I’d also like to link to this Microsoft doc on cluster quorum. Am going to quote some parts from there as they explain things well and I’d like to keep them here as a reference to myself.

How cluster quorum works

When nodes fail, or when some subset of nodes loses contact with another subset, surviving nodes need to verify that they constitute the majority of the cluster to remain online. If they can’t verify that, they’ll go offline.

But the concept of majority only works cleanly when the total number of nodes in the cluster is odd (for example, three nodes in a five node cluster). So, what about clusters with an even number of nodes (say, a four node cluster)?

There are two ways the cluster can make the total number of votes odd:

  1. First, it can go up one by adding a witness with an extra vote. This requires user set-up.
  2. Or, it can go down one by zeroing one unlucky node’s vote (happens automatically as needed).

I didn’t know about point 2 until watching this video.

Worth bearing in mind that this also applies in the case of the witness being lost. So any time your witness is offline the cluster service automatically zeroes the vote of one of the nodes. If you have 2 nodes in each data center + a witness in one data center, and you reboot the witness – that is fine. One of the nodes will have its vote zeroed out, but there’s no impact and when the witness returns the zeroed out node gets its vote back. But if during the time your witness is rebooting you also have a network outage between the two data centers, then the data center with majority nodes (i.e. not the data center containing the node whose vote was zeroed) wins and the cluster fails over there.

Some more:

Dynamic witness

Dynamic witness toggles the vote of the witness to make sure that the total number of votes is odd. If there are an odd number of votes, the witness doesn’t have a vote. If there is an even number of votes, the witness has a vote. Dynamic witness significantly reduces the risk that the cluster will go down because of witness failure. The cluster decides whether to use the witness vote based on the number of voting nodes that are available in the cluster.

Dynamic quorum works with Dynamic witness in the way described below.

Dynamic quorum behavior

  • If you have an even number of nodes and no witness, one node gets its vote zeroed. For example, only three of the four nodes get votes, so the total number of votes is three, and two survivors with votes are considered a majority.
  • If you have an odd number of nodes and no witness, they all get votes.
  • If you have an even number of nodes plus witness, the witness votes, so the total is odd.
  • If you have an odd number of nodes plus witness, the witness doesn’t vote.

Am pretty sure I am going to forget all this a few days from now, so I’ll re-link to the docs again as they go into more detail and have examples etc.

[Aside] Various Exchange 2013 links

I am reading up on Exchange 2013 nowadays (yes, I know, bit late in the day to be doing that considering it is going out of support :) and these are some links I want to put here as a bookmark to myself. Some excellent blog posts and videos that detail the changes in Exchange 2013.

(By way of background: I am not an Exchange admin. I am Exchange 2010 certified as I have a huge interest in Exchange, and as part of preparing for the certification I had attended a course, set up a lab on my laptop, and even begun this blog to start posting about my adventures with it. I never got to work with Exchange 2010 at work – except as a helpdesk administrator one could say – but I am familiar with the concepts even though I have forgotten more than I remember. I have dabbled with Exchange 2000 before that. Going through these links and videos is like a trip down memory lane – seeing concepts that I was once familiar with but have since changed for the better. Hopefully this time around I get to do more Exchange 2013 work! Fingers crossed.)

If you don’t like reading, start with this video.

Alternatively, start with these links but I’d strongly recommend watching the above video once you finish reading.

Preferred Architecture

From the preferred architecture link I’d like to highlight this point about DAG design, as I wasn’t aware of it (PA == Preferred Architecture; this is also discussed in the video):

Data resiliency is achieved by deploying multiple database copies. In the PA, database copies are distributed across the site resilient datacenter pair, thereby ensuring that mailbox data is protected from software, hardware and even datacenter failures.

Each database has four copies, with two copies in each datacenter, which means at a minimum, the PA requires four servers. Out of these four copies, three of them are configured as highly available. The fourth copy (the copy with the highest Activation Preference) is configured as a lagged database copy. Due to the server design, each copy of a database is isolated from its other copies, thereby reducing failure domains and increasing the overall availability of the solution as discussed in DAG: Beyond the “A”.

The purpose of the lagged database copy is to provide a recovery mechanism for the rare event of system-wide, catastrophic logical corruption. It is not intended for individual mailbox recovery or mailbox item recovery.

The lagged database copy is configured with a seven day ReplayLagTime. In addition, the Replay Lag Manager is also enabled to provide dynamic log file play down for lagged copies. This feature ensures that the lagged database copy can be automatically played down and made highly available in the following scenarios:

  • When a low disk space threshold is reached
  • When the lagged copy has physical corruption and needs to be page patched
  • When there are fewer than three available healthy copies (active or passive) for more than 24 hours

By using the lagged database copy in this manner, it is important to understand that the lagged database copy is not a guaranteed point-in-time backup. The lagged database copy will have an availability threshold, typically around 90%, due to periods where the disk containing a lagged copy is lost due to disk failure, the lagged copy becoming an HA copy (due to automatic play down), as well as, the periods where the lagged database copy is re-building the replay queue.

With all of these technologies in play, traditional backups are unnecessary; as a result, the PA leverages Exchange Native Data Protection.

The last line made me smile. Never thought I’d read someplace that backups for Exchange are unnecessary! :) If you have a lagged copy database, then you can enable circular logging on the database (this only affects the non-lagged copies) and skip taking backups – or at least not worry about the database dismounting because your backups are failing and logs are filling up disk space!

So what’s a lagged database copy? Basically it’s a copy of the database (in a DAG) that lags behind the other members by a specified duration (the maximum is 14 days). So if the other servers in your DAG have some issue, rather than restore the database from backup you can simply “play down” the lagged database copy (i.e. tell that copy to process all the transaction logs it already has and thus become up-to-date) and activate it. Neat, huh. I want to delve a bit more into this, so check out the “Lagged copy enhancements” section from the Exchange 2013 HA improvements page. 

First there’s Safety Net (it’s not related to lagged copies, but it plays along well with it in a cool way so worth pointing out):

Safety Net is a feature of transport that replaces the Exchange 2010 feature known as transport dumpster. Safety Net is similar to transport dumpster, in that it’s a delivery queue that’s associated with the Transport service on a Mailbox server. This queue stores copies of messages that were successfully delivered to the active mailbox database on the Mailbox server. Each active mailbox database on the Mailbox server has its own queue that stores copies of the delivered messages. You can specify how long Safety Net stores copies of the successfully delivered messages before they expire and are automatically deleted.

Ok – so each mailbox server has a queue for each of its active databases (remember lagged copies are active too, just that they have a higher activation preference number and hence are not preferred). This queue contains messages that were delivered. Even after a message is delivered to a user, Safety Net keeps it around. You get to specify how long a message is kept for. Cool! Next up is this cool integration:

With the introduction of Safety Net, activating a lagged database copy becomes significantly easier. For example, consider a lagged copy that has a 2-day replay lag. In that case, you would configure Safety Net for a period of 2 days. If you encounter a situation in which you need to use your lagged copy, you can suspend replication to it, and copy it twice (to preserve the lagged nature of the database and to create an extra copy in case you need it). Then, take a copy and discard all the log files, except for those in the required range. Mount the copy, which triggers an automatic request to Safety Net to redeliver the last two days of mail. With Safety Net, you don’t need to hunt for where the point of corruption was introduced. You get the last two days mail, minus the data ordinarily lost on a lossy failover.

Whoa! So when a lagged copy is mounted, it asks Safety Net to redeliver all messages in the specified period – so as long as your Safety Net and lagged database copy have the same period, if you mount the lagged copy from the specified period ago, Safety Net will redeliver all the messages since then. (It’s cool, but yeah, I can imagine users complaining about a whole bunch of unread messages, missing Sent Items, etc. – but it’s cool, I like it for the geek factor.) :)

To re-emphasize something that was mentioned earlier:

Lagged copies can now care for themselves by invoking automatic log replay to play down the log files in certain scenarios:

  • When a low disk space threshold is reached
  • When the lagged copy has physical corruption and needs to be page patched
  • When there are fewer than three available healthy copies (active or passive only; lagged database copies are not counted) for more than 24 hours

Lagged copy play down behavior is disabled by default, and can be enabled by running the following command.
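
The command in question, from the same doc (DAG name made up):

    Set-DatabaseAvailabilityGroup DAG1 -ReplayLagManagerEnabled $true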

After being enabled, play down occurs when there are fewer than three copies. You can change the default value of 3, by modifying the following DWORD registry value.

HKLM\Software\Microsoft\ExchangeServer\v15\Replay\Parameters\ReplayLagManagerNumAvailableCopies

To enable play down for low disk space thresholds, you must configure the following registry entry.

HKLM\Software\Microsoft\ExchangeServer\v15\Replay\Parameters\ReplayLagLowSpacePlaydownThresholdInMB

After configuring either of these registry settings, restart the Microsoft Exchange DAG Management service for the changes to take effect.

As an example, consider an environment where a given database has 4 copies (3 highly available copies and 1 lagged copy), and the default setting is used for ReplayLagManagerNumAvailableCopies. If a non-lagged copy is out-of-service for any reason (for example, it is suspended, etc.) then the lagged copy will automatically play down its log files in 24 hours.

For future reference this doc has steps on how to mount a lagged database copy – i.e. if you are not doing the automatic play down behavior. You can manually play down via the Move-ActiveMailboxDatabase cmdlet with the -SkipLagChecks switch.

However, it is recommended you first suspend the copy (i.e. make it not “active”) and make a copy of the database and logs just in case.
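
A sketch of that (database and server names made up; in practice you may also need -MountDialOverride and to resume the copy afterwards):

    Suspend-MailboxDatabaseCopy 'DB01\EXCH02' -SuspendComment 'Activating lagged copy' -Confirm:$false
    # ... take a file-level copy of the EDB and log files somewhere safe, then:
    Move-ActiveMailboxDatabase DB01 -ActivateOnServer EXCH02 -SkipLagChecks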

Optionally, if you want to recover to a specific point in time you’d 1) suspend the database copy, 2) make a copy just in case, 3) move away all log files after the point in time you want to recover to, 4) delete the checkpoint file, 5) run eseutil to recover the database – this is what replays the remaining logs and brings the database up to the point in time you want, and 6) move the database elsewhere to use as a recovery database for a restore. After this you move back the log files previously moved away, and resume the database copy. This blog post has a bit more detail but is more or less the same as the Microsoft doc. Note: I’ve never done this, so all this is more info for future me. :)

Lastly, that doc also has info on how to activate a lagged copy using Safety Net. Step 4 of the instructions made no sense to me.

Moving on … (but pointing to this HA TechNet link again coz it has a lot of other info that I skipped here).

Outlook Anywhere & OWA behind a WAP server

Some links around publishing Exchange namespaces such as OWA and Outlook Anywhere externally via a WAP server:

The easiest thing to do is pass-through everything via the WAP to the internal URL. But if you want, you can setup OWA authentication via ADFS claims. A step-by-step official guide is here, but the two links above cover the same stuff.

Healthcheck.htm

Exchange 2013 has a new monitoring architecture. When monitoring via a load balancer one can use a “healthcheck.htm” URL to test the health of each virtual directory (corresponding to each of the user-consumed services). This URL exists per virtual directory; here’s an example from Citrix on how to add monitors for each service in NetScaler:

If the service is up the URL returns an HTTP 200 OK.
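
A quick way to check one of these by hand (URL made up):

    Invoke-WebRequest -Uri 'https://mail.example.com/owa/healthcheck.htm' -UseBasicParsing |
        Select-Object StatusCode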

Virtual Directory cmdlets

Speaking of virtual directories, if any of the PowerShell Get- cmdlets for virtual directories are slow this blog post tells you why and what to do about it. These are the cmdlets, and the workaround is to add the -ADPropertiesOnly switch (this makes the cmdlet query the AD Config partition for the same info rather than query each server’s IIS Metabase, which is slower; an example follows the list):

  • Get-WebServicesVirtualDirectory
  • Get-OwaVirtualDirectory
  • Get-ActiveSyncVirtualDirectory
  • Get-AutodiscoverVirtualDirectory
  • Get-EcpVirtualDirectory
  • Get-PowerShellVirtualDirectory
  • Get-OABvirtualDirectory
  • Get-OutlookAnywhere
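
For example:

    Get-OwaVirtualDirectory -ADPropertiesOnly | Format-Table Server,Name,InternalUrl -AutoSize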

Update: Thought I’d add more videos and links to this post rather than make separate posts.

Transport Architecture

Check out this talk: https://channel9.msdn.com/events/TechEd/2013/OUC-B319. Slides available online here. I wanted to put a screenshot of the transport components as a quick reference to myself in this post:

So the CAS has a stateless SMTP service. Affectionately called FET, or Front-End Transport.

The MBX has a stateful and stateless SMTP service. Called Transport and Mailbox Transport respectively. (Transport replaces the Hub Transport role of Exchange 2010).

There’s no longer a Hub Transport role. Previously the Hub Transport role on one server could directly talk to the store of another server – thus there were no clear layers. Everything was messy and tied up using RPC. Now there are clear layers as below, and communication between servers happens at the protocol layer. Within a server communication goes up & down the layers; across servers it uses protocols like SMTP, EWS, and MRS proxy. Everything is clean.

Some slides on all three components:

Outbound email from the transport component on a MBX server can go out directly to an external SMTP server, or it can be delivered to the FET on any CAS server in the same site. This delivery happens on port 717 and needs to be specifically enabled.

The Transport component listens on port 25 if MBX and CAS are on separate servers; else it listens on port 2525, as the CAS is already listening on 25. These ports are for accepting messages from the FET. For accepting messages from the Mailbox Transport component, it listens on port 465.

Remember that Transport is stateful.

Destination can be a CAS server or another transport component (on another MBX server). The Transport component is what does the lookup of the mailbox database.

Last component: Mailbox Transport. This is the component that actually talks to the next layer in the mailbox server. This talks MAPI and receives emails from the Transport component. This is also the component that does message conversion (TNEF to MIME and vice versa). No extensibility at this component, as all that is at the Transport component. Once a message reaches Mailbox Transport there are no changes happening to it!

Azure and DAG IP – not pingable on all nodes

If you are running an Exchange DAG in Azure you won’t always be able to ping the DAG IP. In fact, an IP-less DAG seems to be the recommendation.

Crazy thing is I can’t find any install guide or advice for Exchange on Azure. There’s plenty of documentation on setting up an Exchange 2013 DAG witness in Azure for use with on-prem Exchange, but nothing on actually setting up a DAG in Azure (like what you’d find for SQL AlwaysOn in Azure, for example). This is a good article though for a non-techie introduction.

Another thing to bear in mind with Azure is that all communication between VMs – including those in the same subnet – happens via a gateway. If you check the arp output of your Azure VMs you will see that all IPs are intercepted by a gateway. So if the gateway doesn’t know of the IP it won’t route it. This is why for SQL AlwaysOn you need to set up the availability group IPs on the Azure load balancer, thus making Azure aware of the IP.

In my case we have a two-site setup in Azure and I noticed that I was able to ping the DAG IP when the PAM was in the DR site but not when it was in the main site. First I suspected some routing issue. Then I realized that, hang on, the DAG IP was configured in each site, but since it was manually assigned rather than via a load balancer it was assigned to a single server NIC in each site. Thus, for instance, node01 in both sites had the DAG IP assigned to it while node02 in both sites did not. It so happened that when my PAM failed over to the DR site it moved to node01, which had the DAG IP assigned (of that site’s subnet), while when it failed over to the primary site it happened to land on node02, which did not have the DAG IP. Simple! Elementary, but I didn’t realize this was what was happening as I didn’t make the PAM role go to each node to see if it behaved differently.

Sometimes you got to let go of the big picture and see the small stuff. :)