Disabling Exchange 2013 Managed Availability monitors

Check out this blog post from Microsoft first. Mine’s mostly based on that but tailored to my specific situation.

We have a CAS server that’s purely for internal admin ECP functions. Managed Availability was running some ActiveSync tests on it and failing (because ActiveSync doesn’t exist on that server), with errors like these in SCOM:

So my mission, which I’ve chosen to accept (I saw “Mission Impossible: Fallout” this weekend!) is to disable this. :)

Managed Availability has the concept of health sets. From this page which lists all the health sets in Exchange 2013:

In Managed Availability, each component in Exchange 2013 monitors itself using probes, monitors and responders. Each Exchange 2013 component that implements Managed Availability is referred to as a health set.

So what are the unhealthy health sets on my server?
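
Something along these lines does the trick (a minimal sketch; the server name is a placeholder, and AlertValue is the property to filter on):

    Get-HealthReport -Identity "EXCH01" | Where-Object { $_.AlertValue -eq "Unhealthy" }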

The results of this in my case were the following:

So how do I find which monitors in these health sets are failing? The following cmdlet can help:
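
Along these lines (server and health set names are examples):

    Get-ServerHealth -Identity "EXCH01" -HealthSet "ActiveSync.Proxy"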

I’ll just pipe both cmdlets to get a list of monitors across all unhealthy health sets:
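
Roughly like this (again with a placeholder server name):

    Get-HealthReport -Identity "EXCH01" | Where-Object { $_.AlertValue -eq "Unhealthy" } | ForEach-Object { Get-ServerHealth -Identity "EXCH01" -HealthSet $_.HealthSet }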

In my case I get:

Should probably have filtered to just the unhealthy monitors:
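
That would just be a matter of tacking another filter onto the same pipeline:

    Get-HealthReport -Identity "EXCH01" | Where-Object { $_.AlertValue -eq "Unhealthy" } | ForEach-Object { Get-ServerHealth -Identity "EXCH01" -HealthSet $_.HealthSet } | Where-Object { $_.AlertValue -eq "Unhealthy" }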

Anyways, the SCOM error referred to an EAS component, but I don’t see anything with that name. ActiveSyncProxy is probably the one it was referring to?

As an aside, if I want to see the components of a health set (i.e. the monitors, probes, responders) I can do the following:
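
A sketch of that – Get-MonitoringItemIdentity takes the health set name (server name is a placeholder):

    Get-MonitoringItemIdentity -Identity "ActiveSync.Proxy" -Server "EXCH01" | Format-Table Name, ItemType, TargetResource -AutoSize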

In the case of the ActiveSync.Proxy health set (which has the ActiveSyncProxy component) I can see:

Note that the ActiveSyncProxyTestMonitor monitor is what was showing as unhealthy earlier.

To disable a monitor I need to use the Add-ServerMonitoringOverride cmdlet. This is of the format:
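
Something like this – the angle-bracket bits are placeholders, and an override needs either a Duration or an ApplyVersion:

    Add-ServerMonitoringOverride -Identity "<HealthSet>\<MonitoringItemName>" -Server "<ServerName>" -ItemType Monitor -PropertyName Enabled -PropertyValue 0 -Duration "<dd.hh:mm:ss>"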

In my case, to disable ActiveSync.Proxy (health set) ActiveSyncProxyTestMonitor (monitoring item – you can see this in the list of unhealthy monitors as well as in the list above) I do:
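
Along these lines (the server name is a placeholder and 45 days is an arbitrary duration I picked for the example):

    Add-ServerMonitoringOverride -Identity "ActiveSync.Proxy\ActiveSyncProxyTestMonitor" -Server "EXCH01" -ItemType Monitor -PropertyName Enabled -PropertyValue 0 -Duration 45.00:00:00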

That’s it. Wait a while and it will show up as disabled.

Next thing is how do I find out why the ActiveSync health set is unhealthy? Let’s take a look at the probes in that:
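
Same cmdlet as before, just filtered down to probes:

    Get-MonitoringItemIdentity -Identity "ActiveSync" -Server "EXCH01" | Where-Object { $_.ItemType -eq "Probe" }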

I can invoke the probe manually using the following cmdlet:
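
For example (the probe identity – health set\probe name – is hypothetical here; use one from the list you got above):

    Invoke-MonitoringProbe -Identity "ActiveSync.Protocol\ActiveSyncSelfTestProbe" -Server "EXCH01"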

Pipe this out as a list to read better. Here’s what I did:
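
i.e. something like:

    Invoke-MonitoringProbe -Identity "ActiveSync.Protocol\ActiveSyncSelfTestProbe" -Server "EXCH01" | Format-List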

The output gives the errors encountered by the tests. I could see that it was related to EAS, so I decided to disable that too.

Lastly, if you are curious as to what overrides exist the following cmdlet will help:
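
That is (server name again a placeholder):

    Get-ServerMonitoringOverride -Server "EXCH01"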

Also, if you want to double check that a particular component on the Exchange server is inactive (and that’s why monitors are failing) the following cmdlet will help (I sort it by state for easy reading but that’s optional):
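
That is, something like:

    Get-ServerComponentState -Identity "EXCH01" | Sort-Object State | Format-Table Component, State -AutoSize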

The last section of the article I referred to at the beginning of this post talks about editing C:\Program Files\Microsoft\Exchange Server\V15\Bin\Monitoring\Config\ClientAccessProxyTest.xml to disable certain probes. Not sure why they suggest that instead of disabling probes via the cmdlet – I think it’s because the cmdlet way is more of a temporary thing (for a certain duration) while modifying the config file is a permanent fix. I should probably do the config file in my case.

[Aside] Under the Hood with DAGs

Watching this Ignite 2015 video: Under the Hood with DAGs, by Tim McMichael.

Adding some links here to supplement the video:

  • Tuning Failover Cluster Network thresholds – useful when you have stretched DAGs
  • The mystery of the 9223372036854775766 copy queue… – never had this one but good info.
    • Basically, the cluster registry keeps track of the last log number per database, along with a timestamp. When a node wants to see its copy queue length (i.e. how far behind it is in terms of processing the logs) it can compare this log number with the log number it has actually processed. Sometimes, however, a node might have issues updating or reading the cluster registry and so it falls behind in terms of receiving updates. In such cases the last log number will match what it has processed, but this is actually outdated info, and so if the Exchange Replication service on the server hosting the passive copy notices that the timestamp is 12 minutes old it puts its database copy into self-protection mode. This is done by manually setting the copy queue length (a.k.a. CQL) to about 9 quintillion (close to the maximum value of a signed 64-bit integer). No one can actually have such a large copy queue length, so it’s as good a number as any to choose.
    • The video suggests rebooting each node until you find one which might be holding updates. But the link above suggests a different method.

DAC

Came across Datacenter Activation Coordination (DAC) via Tim’s blog: part 1, followed by a series of posts you can see at the end of part 1.

DAC mode works by using a bit stored in memory by Active Manager called the Datacenter Activation Coordination Protocol (DACP). DACP is simply a bit in memory set to either a 1 or a 0. A value of 1 means Active Manager can issue mount requests, and a value of 0 means it cannot.

The starting bit is always 0, and because the bit is held in memory, any time the Microsoft Exchange Replication service (MSExchangeRepl.exe) is stopped and restarted, the bit reverts to 0. In order to change its DACP bit to 1 and be able to mount databases, a starting DAG member needs to either:

  • Be able to communicate with any other DAG member that has a DACP bit set to 1; or
  • Be able to communicate with all DAG members that are listed on the StartedMailboxServers list.

The second bullet – communicating with all DAG members – is the important bit. If you read his blog post you’ll see why. If DAC is activated and you are starting up a previously shut down DAG, even though the DAG might have quorum it will not start up if some members are still offline. (I had missed that when reading about DAC earlier.) To summarize it succinctly, from part 2 of his series:

Remember, with DAC mode enabled, different rules apply for mounting databases on startup. The starting DAG member must be able to participate in a cluster that has quorum, and it must be able to communicate with another DAG member that has a DACP value of 1 or be able to communicate with all DAG members listed on the StartedMailboxServers list.

Here are highlights from some of the interesting posts in Tim’s series:

  • Part 4 has info on the steps to do a datacenter switchover and the cmdlets available when DAC is enabled. Essentially you 1) run Stop-DatabaseAvailabilityGroup with the -ConfigurationOnly:$TRUE switch for the site that is down – this marks the servers in that site as down, 2) run Stop-Service CLUSSVC on the nodes in the site that is up, and finally 3) run Restore-DatabaseAvailabilityGroup specifying the site that is up (see the sketch after this list). This Microsoft doc on datacenter switchovers is worth reading side-by-side. It contains info on both DAC and non-DAC scenarios so watch out for that.
  • Part 5 has info on how to use the Start-DatabaseAvailabilityGroup cmdlet to set the DACP bit as 1 on a specified server thus bringing up the DAG by forcing a consensus.
  • Part 6 is an interesting story. A nice edge case of DAC being enabled and graceful shutdown.
  • Part 8 is another interesting story on what happens due to a typo in a cmdlet.
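
To make Part 4’s switchover steps concrete, here’s a minimal sketch – the DAG and site names are made up, and this is the general shape rather than a runbook:

    # 1. Mark the servers in the failed site as down (an AD-only change, since those servers are offline)
    Stop-DatabaseAvailabilityGroup -Identity "DAG1" -ActiveDirectorySite "SiteA" -ConfigurationOnly:$true

    # 2. Stop the Cluster service on each surviving DAG member in the site that is up
    Stop-Service ClusSvc

    # 3. Evict the stopped servers and bring the DAG up in the surviving site
    Restore-DatabaseAvailabilityGroup -Identity "DAG1" -ActiveDirectorySite "SiteB"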

Very briefly, the DAC cmdlets:

  • Stop-DatabaseAvailabilityGroup – mark a specified server, or all servers in a specified AD site, as down. Use the -ConfigurationOnly switch to mark the server as down in AD only but not actually do anything on the server(s). You need to use this switch if the servers are already offline but AD is up and accessible in that site. This cmdlet also forces a sync of AD across sites so the information is propagated.
  • Start-DatabaseAvailabilityGroup – same as above, but mark as up. Can use the -ConfigurationOnly switch to not really do anything but only mark in AD.
  • Restore-DatabaseAvailabilityGroup – it evicts any stopped servers, it can configure the DAG to use an alternate witness server, and it brings up the DAG after doing this. This cmdlet can only be used against a DAG with DAC enabled.

Dynamic Quorum

Came across dynamic quorum from the videos (wasn’t previously aware of it). Am being lazy and will put in some screenshots from the video:

The highlighted part is the key thing.

Remember that quorum is defined as “(the number of votes)/2 + 1“. Each node (or witness) typically has a single vote, and (number of votes)/2 is rounded down (i.e. 7/2 = 3.5, rounded down to 3).

With dynamic quorum once a node (or set of nodes) fail, and if the remaining set of nodes form a quorum (note – they have to form quorum), then the required quorum of the cluster is adjusted to reflect the remaining number of nodes.

Take a look at the scenario below:

We have two data centers. 6 nodes + a witness, so initially the quorum was 7/2 + 1 = 4.

The link between the two data centers goes down. Data center B has 3 nodes, which is below the quorum of 4, so all 3 nodes shut down. Data center A has 3 nodes + witness, thus meeting the quorum, and it stays up.

At this point if any further node in data center A goes down, the cluster will fall below the quorum and shut down. Avoiding such a situation is where dynamic quorum comes in. With dynamic quorum (introduced in Server 2012), when the nodes in data center A form quorum, the new quorum requirement is 4/2 + 1 = 3.

If a node goes down in data center A, leaving 2 nodes + a witness, since they meet the new quorum of 3 the cluster stays up. The quorum then gets revised to 3/2 + 1 = 2 (2 nodes + witness = 3 votes). If yet another node goes down, the remaining node + witness still meets the new quorum of 2 and so the cluster continues to stay up.

Another slide:

Two data centers, 2 nodes + 1 node, no witness; the quorum is therefore 3/2 + 1 = 2.

One of the nodes in data center A goes down. Since the number of remaining nodes meets quorum, the cluster can stay up. But since there is no file share witness each node cannot be given an equal vote (I wasn’t aware of this). Thus the cluster service picks one of the nodes (the one with the lowest node ID) and gives it a vote of 0. The node in data center A has a vote of 0. The new quorum is thus 1/2 + 1 = 1.

If the link between the two data centers goes down, the node in data center B stays up even though the node in data center A too could have formed quorum! Nothing wrong with it, just an edge case to keep in mind as chances are you probably wanted data center A to remain up as that’s why you provisioned two nodes there in the first place.

Now for a variant in which there is a witness:

So two data centers, 2 nodes + 2 nodes, 1 witness in data center A; the quorum is therefore 5/2 + 1 = 3.

As before, one of the nodes in data center A goes down. Since there are 3 nodes + witness remaining, they meet quorum and the cluster continues. The new quorum is 4/2 + 1 = 3. Again, the data center link goes down. Everything goes down! :) Why? Coz no one has a clear majority. Each side has 2 votes, not the 3 required.

Interestingly I have this setup at work. So a critical thing to keep in mind is that if I were to update & reboot the witness or one of the nodes in data center A (my preferred data center), and the WAN link were to go down – I could lose the cluster! No such problems if I update & reboot a node in data center B and the link goes down, as data center A has the majority. Funny, it’s like you must keep the witness in the less preferred data center.

Windows Server 2012R2 improves upon dynamic quorum by adding dynamic witness.

So if the number of votes is odd, the witness vote is removed. And if the witness is offline or failed, its vote is removed then too (that includes reboots, right?).
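
You can see both behaviors from PowerShell on a Server 2012 R2 cluster; a quick check along these lines:

    # DynamicWeight shows the vote each node currently has
    Get-ClusterNode | Format-Table Name, State, NodeWeight, DynamicWeight -AutoSize

    # 1 = the witness currently has a vote, 0 = it doesn't
    (Get-Cluster).WitnessDynamicWeight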

Now things get tricky.

Going back to the previous example: so two data centers, 2 nodes + 2 nodes, 1 witness in data center A; the quorum is therefore 5/2 + 1 = 3.

As before, a node in data center A goes down (the picture is a bit incorrect as I skipped some intermediate slides); the remaining nodes have quorum so the cluster stays put. The new quorum is 4/2 + 1 = 3. But since the number of nodes is now ODD, the cluster service removes the witness from the vote calculations. So the new quorum turns out to be 3/2 + 1 = 2. At this point if the link goes down, the nodes in data center B have quorum and so they form a cluster while the remaining node in data center A is shut down. So unlike the Server 2012 case, which had no dynamic witness, the whole cluster does not go down!

Now, going back to the case where one of the nodes (not the witness) had its vote removed as there was only one node in each data center, I mentioned that the node with the lowest ID gets its vote zeroed. The next two slides talk about that, including how to select a node in the cluster whose vote we’d preferentially like to remove in such situations.
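
For reference, that selection is exposed as the LowerQuorumPriorityNodeID cluster property in Server 2012 R2; a sketch (node ID 2 is just an example):

    # Prefer to zero the vote of node ID 2 whenever a vote has to be removed
    (Get-Cluster).LowerQuorumPriorityNodeID = 2
    Get-ClusterNode | Format-Table Name, Id, DynamicWeight -AutoSize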

At this point I’d also like to link to this Microsoft doc on cluster quorum. Am going to quote some parts from there as they explain things well and I’d like to keep it here as a reference for myself.

How cluster quorum works

When nodes fail, or when some subset of nodes loses contact with another subset, surviving nodes need to verify that they constitute the majority of the cluster to remain online. If they can’t verify that, they’ll go offline.

But the concept of majority only works cleanly when the total number of nodes in the cluster is odd (for example, three nodes in a five node cluster). So, what about clusters with an even number of nodes (say, a four node cluster)?

There are two ways the cluster can make the total number of votes odd:

  1. First, it can go up one by adding a witness with an extra vote. This requires user set-up.
  2. Or, it can go down one by zeroing one unlucky node’s vote (happens automatically as needed).

I didn’t know about point 2 until watching this video.

Worth bearing in mind that this also applies in the case of the witness being lost. So any time your witness is offline the cluster service automatically zeroes the vote of one of the nodes. If you have 2 nodes in each data center + a witness in one data center, and you reboot the witness – that is fine. One of the nodes will have its vote zeroed out, but there’s no impact and when the witness returns the zeroed out node gets its vote back. But if during the time your witness is rebooting you also have a network outage between the two data centers, then the data center with majority nodes (i.e. not the data center containing the node whose vote was zeroed) wins and the cluster fails over there.

Some more:

Dynamic witness

Dynamic witness toggles the vote of the witness to make sure that the total number of votes is odd. If there are an odd number of votes, the witness doesn’t have a vote. If there is an even number of votes, the witness has a vote. Dynamic witness significantly reduces the risk that the cluster will go down because of witness failure. The cluster decides whether to use the witness vote based on the number of voting nodes that are available in the cluster.

Dynamic quorum works with Dynamic witness in the way described below.

Dynamic quorum behavior

  • If you have an even number of nodes and no witness, one node gets its vote zeroed. For example, only three of the four nodes get votes, so the total number of votes is three, and two survivors with votes are considered a majority.
  • If you have an odd number of nodes and no witness, they all get votes.
  • If you have an even number of nodes plus witness, the witness votes, so the total is odd.
  • If you have an odd number of nodes plus witness, the witness doesn’t vote.

Am pretty sure I’m going to forget all this a few days from today so I’ll re-link to the docs again as it goes into more detail and has examples etc.

[Aside] Various Exchange 2013 links

I am reading up on Exchange 2013 nowadays (yes, I know, bit late in the day to be doing that considering it is going out of support :) and these are some links I want to put here as a bookmark to myself. Some excellent blog posts and videos that detail the changes in Exchange 2013.

(By way of background: I am not an Exchange admin. I am Exchange 2010 certified as I have a huge interest in Exchange, and as part of preparing for the certification I had attended a course, set up a lab on my laptop, and even begun this blog to start posting about my adventures with it. I never got to work with Exchange 2010 at work – except as a helpdesk administrator one could say – but I am familiar with the concepts even though I have forgotten more than I remember. I have dabbled with Exchange 2000 before that. Going through these links and videos is like a trip down memory lane – seeing concepts that I was once familiar with but have since changed for the better. Hopefully this time around I get to do more Exchange 2013 work! Fingers crossed).

If you don’t like reading, start with this video.

Alternatively, start with these links but I’d strongly recommend watching the above video once you finish reading.

Preferred Architecture

From the preferred architecture link I’d like to highlight this point about DAG design as I wasn’t aware of it (PA == Preferred Architecture; this is also discussed in the video):

Data resiliency is achieved by deploying multiple database copies. In the PA, database copies are distributed across the site resilient datacenter pair, thereby ensuring that mailbox data is protected from software, hardware and even datacenter failures.

Each database has four copies, with two copies in each datacenter, which means at a minimum, the PA requires four servers. Out of these four copies, three of them are configured as highly available. The fourth copy (the copy with the highest Activation Preference) is configured as a lagged database copy. Due to the server design, each copy of a database is isolated from its other copies, thereby reducing failure domains and increasing the overall availability of the solution as discussed in DAG: Beyond the “A”.

The purpose of the lagged database copy is to provide a recovery mechanism for the rare event of system-wide, catastrophic logical corruption. It is not intended for individual mailbox recovery or mailbox item recovery.

The lagged database copy is configured with a seven day ReplayLagTime. In addition, the Replay Lag Manager is also enabled to provide dynamic log file play down for lagged copies. This feature ensures that the lagged database copy can be automatically played down and made highly available in the following scenarios:

  • When a low disk space threshold is reached
  • When the lagged copy has physical corruption and needs to be page patched
  • When there are fewer than three available healthy copies (active or passive) for more than 24 hours

By using the lagged database copy in this manner, it is important to understand that the lagged database copy is not a guaranteed point-in-time backup. The lagged database copy will have an availability threshold, typically around 90%, due to periods where the disk containing a lagged copy is lost due to disk failure, the lagged copy becoming an HA copy (due to automatic play down), as well as, the periods where the lagged database copy is re-building the replay queue.

With all of these technologies in play, traditional backups are unnecessary; as a result, the PA leverages Exchange Native Data Protection.

The last line made me smile. Never thought I’d read someplace that backups for Exchange are unnecessary! :) If you have a lagged copy database, then you can enable circular logging on the database (this only affects the non-lagged copies) and skip taking backups – or at least not worry about the database dismounting because your backups are failing and logs are filling up disk space!

So what’s a lagged database copy? Basically it’s a copy of the database (in a DAG) that lags behind the other members by a specified duration (the maximum is 14 days). So if the other servers in your DAG have some issue, rather than restore the database from backup you can simply “play down” the lagged database copy (i.e. tell that copy to process all the transaction logs it already has and thus become up-to-date) and activate it. Neat, huh. I want to delve a bit more into this, so check out this “Lagged copy enhancements” section from the Exchange 2013 HA improvements page.

First there’s Safety Net (it’s not related to lagged copies, but it plays along well with it in a cool way so worth pointing out):

Safety Net is a feature of transport that replaces the Exchange 2010 feature known as transport dumpster. Safety Net is similar to transport dumpster, in that it’s a delivery queue that’s associated with the Transport service on a Mailbox server. This queue stores copies of messages that were successfully delivered to the active mailbox database on the Mailbox server. Each active mailbox database on the Mailbox server has its own queue that stores copies of the delivered messages. You can specify how long Safety Net stores copies of the successfully delivered messages before they expire and are automatically deleted.

Ok – so each mailbox server has a queue for each of its active databases (remember lagged copies are active too, just that they have a higher activation preference number and hence are not preferred). This queue contains messages that were delivered. Even after a message is delivered to a user, Safety Net can keep it around. You get to specify how long a message is kept for. Cool! Next up is this cool integration:

With the introduction of Safety Net, activating a lagged database copy becomes significantly easier. For example, consider a lagged copy that has a 2-day replay lag. In that case, you would configure Safety Net for a period of 2 days. If you encounter a situation in which you need to use your lagged copy, you can suspend replication to it, and copy it twice (to preserve the lagged nature of the database and to create an extra copy in case you need it). Then, take a copy and discard all the log files, except for those in the required range. Mount the copy, which triggers an automatic request to Safety Net to redeliver the last two days of mail. With Safety Net, you don’t need to hunt for where the point of corruption was introduced. You get the last two days mail, minus the data ordinarily lost on a lossy failover.

Whoa! So when a lagged copy is mounted, it asks Safety Net to redeliver all messages in the specified period – so as long as your Safety Net and lagged database copy have the same period, if you mount the lagged copy from the specified period ago, Safety Net will deliver all the messages since then. (It’s cool, but yeah I can imagine users complaining about a whole bunch of unread messages now, and missing Sent Items etc. – but it’s cool, I like it for the geek factor). :)

To re-emphasize something that was mentioned earlier:

Lagged copies can now care for themselves by invoking automatic log replay to play down the log files in certain scenarios:

  • When a low disk space threshold is reached
  • When the lagged copy has physical corruption and needs to be page patched
  • When there are fewer than three available healthy copies (active or passive only; lagged database copies are not counted) for more than 24 hours

Lagged copy play down behavior is disabled by default, and can be enabled by running the following command.
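
That command is Set-DatabaseAvailabilityGroup with the ReplayLagManagerEnabled parameter (the DAG name is a placeholder):

    Set-DatabaseAvailabilityGroup -Identity "DAG1" -ReplayLagManagerEnabled $true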

After being enabled, play down occurs when there are fewer than three copies. You can change the default value of 3, by modifying the following DWORD registry value.

HKLM\Software\Microsoft\ExchangeServer\v15\Replay\Parameters\ReplayLagManagerNumAvailableCopies

To enable play down for low disk space thresholds, you must configure the following registry entry.

HKLM\Software\Microsoft\ExchangeServer\v15\Replay\Parameters\ReplayLagLowSpacePlaydownThresholdInMB

After configuring either of these registry settings, restart the Microsoft Exchange DAG Management service for the changes to take effect.

As an example, consider an environment where a given database has 4 copies (3 highly available copies and 1 lagged copy), and the default setting is used for ReplayLagManagerNumAvailableCopies. If a non-lagged copy is out-of-service for any reason (for example, it is suspended, etc.) then the lagged copy will automatically play down its log files in 24 hours.

For future reference this doc has steps on how to mount a lagged database copy – i.e. if you are not doing the automatic play down behavior. You can manually play down via the Move-ActiveMailboxDatabase cmdlet with the -SkipLagChecks switch.

However, it is recommended you first suspend the copy (i.e. make it not “active”) and make a copy of the database and logs just in case.
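
A sketch of the manual route, with hypothetical database and server names:

    # Suspend the lagged copy first (and take a file-level copy of the database and logs before going further)
    Suspend-MailboxDatabaseCopy -Identity "DB01\MBX02" -SuspendComment "Activating lagged copy" -Confirm:$false

    # Then activate it, skipping the lag checks
    Move-ActiveMailboxDatabase -Identity "DB01" -ActivateOnServer "MBX02" -SkipLagChecks -MountDialOverride:BestEffort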

Optionally, if you want to recover to a specific point in time you’d 1) suspend the database copy, 2) make a copy just in case, 3) move elsewhere all log files after the time you want to recover to, 4) delete the checkpoint file, 5) run eseutil to recover the database – this is what replays the remaining logs and brings the database up to the point in time you want, and 6) move the database elsewhere to use as a recovery database for a restore. After this you move back the log files you previously moved away, and resume the database copy. This blog post has a bit more details but it is more or less the same as the Microsoft doc. Note: I’ve never done this, so all this is more info for future me. :)
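
For step 5, the soft recovery would be something along these lines (the log prefix E00 and the paths are examples):

    eseutil /r E00 /l "D:\DB01\Logs" /d "D:\DB01"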

Lastly, that doc also has info on how to activate a lagged copy using Safety Net. Step 4 of the instructions made no sense to me.

Moving on … (but pointing to this HA TechNet link again coz it has a lot of other info that I skipped here).

Outlook Anywhere & OWA behind a WAP server

Some links around publishing Exchange namespaces such as OWA and Outlook Anywhere externally via a WAP server:

The easiest thing to do is pass everything through the WAP to the internal URL. But if you want, you can set up OWA authentication via ADFS claims. A step-by-step official guide is here, but the two links above cover the same stuff.

Healthcheck.htm

Exchange 2013 has a new monitoring architecture. When monitoring via a load balancer one can use a “healthcheck.htm” URL to test the health of each virtual directory (corresponding to each of the user consumed services). This URL is per virtual directory; here’s an example from Citrix on how to add monitors for each service in NetScaler:

If the service is up the URL returns an HTTP 200 OK.
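
Each protocol’s virtual directory has its own healthcheck.htm, so a quick manual check against a hypothetical namespace looks like:

    Invoke-WebRequest -Uri "https://mail.example.com/owa/healthcheck.htm" -UseBasicParsing
    # A StatusCode of 200 means the OWA health set on the responding server is up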

Virtual Directory cmdlets

Speaking of virtual directories, if any of the PowerShell Get- cmdlets for virtual directories are slow this blog post tells you why and what to do about it. These are the cmdlets, and the workaround is to add the switch -ADPropertiesOnly, which makes the cmdlet query the AD Config partition for the same info rather than each server’s IIS Metabase, which is slower (see the example after this list):

  • Get-WebServicesVirtualDirectory
  • Get-OwaVirtualDirectory
  • Get-ActiveSyncVirtualDirectory
  • Get-AutodiscoverVirtualDirectory
  • Get-EcpVirtualDirectory
  • Get-PowerShellVirtualDirectory
  • Get-OABvirtualDirectory
  • Get-OutlookAnywhere
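
For example (server name is a placeholder):

    Get-OwaVirtualDirectory -Server "EXCH01" -ADPropertiesOnly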

Update: Thought I’d add more videos and links to this post rather than make separate posts.

Transport Architecture

Check out this talk: https://channel9.msdn.com/events/TechEd/2013/OUC-B319. Slides available online here. I wanted to put a screenshot of the transport components as a quick reference to myself in this post:

So the CAS has a stateless SMTP service. Affectionately called FET, or Front-End Transport.

The MBX has a stateful and stateless SMTP service. Called Transport and Mailbox Transport respectively. (Transport replaces the Hub Transport role of Exchange 2010).

There’s no longer a Hub Transport role. Previously the Hub Transport role on one server could directly talk to the store of another server – thus there were no clear layers. Everything was messy and tied up using RPC. Now there are clear layers as below and communication between servers happens at the protocol layer. Within a server communication goes up & down the layers; across servers it uses protocols like SMTP, EWS, and MRS proxy. Everything is clean.

Some slides on all three components:

Outbound email from the transport component on a MBX server can go out directly to an external SMTP server, or it can be delivered to the FET on any CAS server in the same site. This delivery happens on port 717 and needs to be specifically enabled.

Transport component listens on port 25 if MBX and CAS are on separate servers. Else it listens on port 2525 as CAS is already listening on 25. These ports are for accepting messages from the FET. For accepting messages from the Mailbox Transport component, it listens on port 465.

Remember that Transport is stateful.

Destination can be a CAS server or another transport component (on another MBX server). The Transport component is what does the lookup of the mailbox database.

Last component: Mailbox Transport. This is the component that actually talks to the next layer in the mailbox server. This talks MAPI and receives emails from the Transport component. This is also the component that does the message conversion (TNEF to MIME and vice versa). No extensibility at this component as all that is at the Transport component. Once a message reaches Mailbox Transport there are no changes happening to it!

Azure and DAG IP – not pingable on all nodes

If you are running an Exchange DAG in Azure you won’t always be able to ping the DAG IP. In fact, an IP-less DAG seems to be the recommendation.

Crazy thing is I can’t find any install guide or advice for Exchange on Azure. There’s plenty of documentation on setting up an Exchange 2013 DAG witness in Azure for use with on-prem Exchange, but nothing on actually setting up a DAG in Azure (like what you’d find for SQL AlwaysOn in Azure, for example). This is a good article though for a non-techie introduction.

Another thing to bear in mind with Azure is that all communication between VMs – including those in the same subnet – happens via a gateway. If you check the arp output of your Azure VMs you will see that all IPs are being intercepted by a gateway. So if the gateway doesn’t know of the IP it won’t route it. This is why for SQL AlwaysOn you need to set up the availability group IPs on the Azure load balancer, thus making Azure aware of the IP.

In my case we have a two site setup in Azure and I noticed that I was able to ping the DAG IP when the PAM was in the DR site but not when it was in the main site. First I suspected some routing issue. Then I realized that, hang on, the DAG IP was configured in each site, but since it was manually assigned rather than assigned via a load balancer it lived on a single server NIC in each site. Thus, for instance, node01 in both sites had the DAG IP assigned to it while node02 in both sites did not. It so happened that when my PAM failed over to the DR site it landed on node01, which had the DAG IP (of that site’s subnet) assigned, while when it failed back to the primary site it happened to choose node02, which did not have the DAG IP. Simple! Elementary, but I didn’t realize this was what was happening as I hadn’t made the PAM role go to each node to see if it behaved differently.

Sometimes you got to let go of the big picture and see the small stuff. :)

TIL: The Exchange PAM can move to another node when the FSW reboots

Didn’t know this. The Exchange PAM (Primary Active Manager) can move to another node when the FSW (File Share Witness) reboots or is offline. There’s no impact, but it’s worth being aware of if you are wondering what could be affected when the FSW server is rebooted.

From this blog post (via this forum post where I found it):

When the File Share Witness host server becomes unavailable, the File Share Witness resource will still fail in cluster and cause the Cluster Core Resources to move between nodes. In this case assuming the File Share Witness host server is still not available, the resource remains in a failed state. If it becomes necessary to utilize the File Share Witness to maintain Quorum, and the witness resource is in a failed state, cluster will attempt to online the witness resource. If the online is successful the witness share is alive and accessible – quorum is maintained. If the online is not successful, the witness share is not alive and accessible – a lost quorum condition is encountered.

Brief background on PAM (via this blog post):

At any given time, in every database availability group (DAG), there is one member that is responsible for the coordination of database actions across the DAG. This member is known as the Primary Active Manager (PAM). The current PAM can be determined by using Get-DatabaseAvailabilityGroup –Status. The cluster group may contain several cluster resources. The PAM does not depend on the state of any of the resources in this group, and the PAM role will always be assigned to the node that owns the Cluster Group. The Cluster Group can be moved between members using the cluster management tools.

Each DAG member that does not own the Cluster Group is a Standby Active Manager (SAM). When the Cluster Group is moved between nodes, a notification process detects that the Cluster Group owner has changed. This triggers detection logic to determine the new PAM. 

Automatic arbitration may occur for a number of reasons including: 

  • The failure of a member
  • The failure of a resource contained within the Cluster Group

In most cases, Exchange administrators should not be concerned with the owner of the Cluster Group or the node designated as the PAM.  This is true even for DAGs that span multiple sites where the PAM may be a node in a distant datacenter.

Some more details regarding PAM (via this blog post):

Active manager is a role that runs on a mailbox server. A single active manager role “Standalone Active Manager” which runs on a mailbox server that has no high-availability configured. Two active manager roles will be in use when the mailbox server is a member of a DAG; Primary Active Manager (PAM) and Standby Active Manager (SAM). 

PAM is the Active Manager in a DAG that decides which database copies will be active and passive. PAM is responsible for getting topology change notifications and reacting to server failures. 

SAM provides information on which server hosts the active copy of a mailbox database to other components of Exchange that are running an Active Manager client component (for example, RPC Client Access service or Hub Transport server). The SAM detects failures of local databases and the local Information Store. It reacts to failures by asking the PAM to initiate a failover. A SAM does not determine the target of failover, nor does it update a database’s location state in the PAM. SAM runs on all mailbox servers in a DAG except on the one where PAM is running.
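
As an aside, a quick way to see which node currently holds the PAM (DAG name is a placeholder):

    Get-DatabaseAvailabilityGroup -Identity "DAG1" -Status | Format-List Name, PrimaryActiveManager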

NetScaler/ Exchange RPC – TCP syn sent, reset received

At work one of my colleagues is setting up NetScalers as load balancers for our new Exchange environment. He is replicating the existing setup but found that the RPC 60001 & 60002 Service Groups on the NetScalers were being marked as down. Curious, I took a look.

After SSH-ing into the NetScaler I could see the following via show serviceGroup <serviceGroupName>:

My colleague too had seen this and pointed me to a good blog post from Citrix on what the reset codes mean. That blog post is a good one (that’s why I am linking it here, as a reference to myself) but we weren’t looking at a NetScaler trace so we had no idea of the codes. (Speaking of which, here’s a good post on NetScaler and Wireshark. Here’s a KB article on how to collect traces from NetScaler. And here’s a KB article on how to collect traces from the CLI. Whilst I have briefly read them, I haven’t tried them out yet.)

Back to the issue at hand. I could see that the individual servers (Exchange 2010 Client Access) were up on RPC 135 and HTTPS, but only RPC 60001 & 60002 were down. I decided to do a portQry against a server in the older environment and compare against the new. Here are the relevant bits from an older server:
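
The query itself goes against the RPC endpoint mapper on port 135, something like this (server name is a placeholder):

    portqry -n OLDCAS01 -p tcp -e 135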

As expected, something is listening on ports 60001 and 60002. When I tried the same against the new server, however, there was nothing listening on either of these ports. I searched the output based on the UUIDs and found the port numbers were different:

So that’s why the NetScalers were getting a reset. Nothing was listening on those ports! Solution is simple. Configure these RPC ports as static.
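
For Exchange 2010 the static ports are set in the registry on each CAS server; a sketch along these lines, using the 60001/60002 ports from our setup (restart the RPC Client Access and Address Book services afterwards):

    # RPC Client Access: a DWORD value
    New-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Services\MSExchangeRPC\ParametersSystem" -Name "TCP/IP Port" -PropertyType DWord -Value 60001

    # Address Book service: note this one is a string value
    New-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Services\MSExchangeAB\Parameters" -Name "RpcTcpPort" -PropertyType String -Value "60002"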

That’s all! :)

Exchange DAG fails. Information Store service fails with error 2147221213.

Had an interesting issue at work today. When our Exchange servers (which are in a 2 node DAG) rebooted after patch weekend one of them had trouble starting the Information Store service. The System log had entries such as these (event ID 7024) –

The Microsoft Exchange Information Store service terminated with service-specific error %%-2147221213.

The Application log had entries such as these (event ID 5003) –

Unable to initialize the Information Store service because the clocks on the client and server are skewed. This may be caused by a time change either on the client or on the server, and may require a restart of that computer. Verify that your domain is correctly configured and is currently online.

So it looked like time synchronization was an issue. Which is odd coz all our servers should be correctly syncing time from the Domain Controllers.

Our Exchange team fixed the issue by forcing a time sync from the DC –
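
Something along these lines (I didn’t note the exact commands they ran):

    w32tm /config /syncfromflags:domhier /update
    w32tm /resync /rediscover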

I was curious as to why, so went through the System logs in detail. What I saw was a sequence of entries such as these –

Notice how time suddenly jumps ahead from 13:21 when the OS starts to 13:27, then jumps back to 13:22 when the Windows Time service starts and begins syncing time from my DC. It looked like this jump of 6 mins was confusing the Exchange services (understandably so). But why was this happening?

I checked the time configuration of the server –
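
That check is via w32tm, i.e.:

    w32tm /query /configuration
    w32tm /query /status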

Seems to be normal. It was set to pick time from the site DC via NTP (the first entry under TimeProviders) as well as from the ESXi host the VM is running on (the second entry – VM IC Time Provider). I didn’t think much of the second entry because I know all our VMs have the VMware Tools option to sync time from the host to VM unchecked (and I double checked it anyways).

Only one of the mailbox servers was having this jump though. The other mailbox server had a slight jump but not enough to cause any issues. While the problem server had a jump of 6 mins, the ok server had a jump of a few seconds.

I thought to check the ESXi hosts of both VMs anyways. Yes, they are not set to sync time from the host, but let’s double check the host times anyways. And bingo! Turns out the ESXi hosts have NTP turned off and hence varying times. The host with the problem server was about 6 mins ahead of the DC’s time, while the host with the OK server was about a minute or less ahead – the differences matched the time jumps of the VMs too well to be a coincidence!

So it looked like the Exchange servers were syncing time from the ESXi hosts even though I thought they were not supposed to. I read a bit more about this and realized my understanding of host-VM time sync was wrong (at least with VMware). When you tick/ untick the option to synchronize VM time with the ESX host, all you are controlling is a periodic synchronization from host to VM. This does not control other scenarios where a VM could synchronize time with the host – such as when it moves to a different host via vMotion, has a snapshot taken, is restored from a snapshot, its disk is shrunk, or (tada!) when the VMware Tools service is restarted (like when the VM is rebooted, as was the case here). Interesting.

So that explains what was happening here. When the problem server was rebooted it synced time with the ESXi host, which was 6 mins ahead of the domain time. This was before the Windows Time service kicked in. Once the Windows Time service started, it noticed the incorrect time and set it correct. This time jump confused Exchange – am thinking it didn’t confuse Exchange directly, rather one of the AD services running on the server most likely, and due to this the Information Store was unable to start.

The fix for this is to either disable VMs from synchronizing time from the ESXi host or setup NTP on all the ESXi hosts so they have the correct time going forward. I decided to go ahead with the latter.

Update: Found this and this blog post. They have more screenshots and a better explanation, so worth checking out. :)

Using Solarwinds to monitor Windows Performance Monitor (perfmon) Counters

Had a request from our Exchange admin to setup Solarwinds alerts for some of our Exchange servers based on Performance Monitor counters.

  • MSExchangeTransport Queues(_total)\Active Remote Delivery Queue Length (above 200)
  • MSExchangeTransport Queues(_total)\Largest Delivery Queue Length (above 200)
  • MSExchangeTransport Queues(_total)\Messages Queued For Delivery (above 200)
  • MSExchangeTransport Queues(_total)\Retry Remote Delivery Queue Length (above 20)

Before setting up alerts I need to add them to Solarwinds first. Here’s how you do that.

First, open up the Solarwinds web console, go to Applications, and then SAM Settings.


Then go to Component Monitor Wizard.

Select Windows Performance Counter Monitor.

Notice that it says the data is collected using RPC. This means (1) the server must be monitored by Solarwinds using WMI and not SNMP. In case of the latter, switch to monitoring via WMI. And (2) RPC ports must be open between the Solarwinds server and the target server. If not, monitoring will fail.

Enter the name of a server you wish to target. This server would be one that contains the perfmon counters you are interested in; you use it to set up monitoring for those counters. Change to 64bit if 32bit doesn’t work.

Change the “Choose Credential” drop-down according to your environment. If Solarwinds complains that it cannot find the server name you type in, it’s better to click “Browse” and find the server you are interested in.

Note: The next step will fail if you have not opened the required RPC ports.

Select the counters you are interested in. First select the object you want to monitor (MSExchangeTransport Queues, in the screenshot below) and then the counters.

The next screen will list all the counters you selected and give you a chance to set warning and critical thresholds. Customize these.

Select where you would like these counters added – a new application monitor/ monitor template, or an existing application monitor/ monitor template. I am going with a new application monitor template. It’s easier to make changes to templates than to individual application monitors.

Choose more nodes you would like to assign this application monitor to. Am skipping this screenshot. This step is optional as you can assign the application monitor to nodes later too.

An optional step – I also went to Manage Application Templates screen after the above steps, selected the template I created, and assigned it some tags and set a custom view.

A custom view lets you define what details are shown when anyone clicks this application monitor template on a particular node in the Solarwinds web console. You can customize the view by going to Settings (of Solarwinds) and selecting Manage Views.

Next step is to create an alert. For that you have to log on to the Solarwinds server itself, go to Alert Manager, and create a new alert (skipping screenshots for all these) whose condition is as follows:

Note that the type of property to monitor is “APM: Component”. This is important for the correct variables to be visible in the alert message. Also, note that I am triggering on each of the components (with an “any” condition) and not on the application monitor itself. This lets me get alerts for individual components; if I instead trigger on the application monitor itself, I will get alert emails for every component, including the ones that don’t have an issue.

Here’s the alert message:

Find Outlook rules that are deleting a message

As part of troubleshooting something I needed to quickly find what Outlook rules the user had for deleting messages. So I came up with this one-liner.
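
It went something like this (the mailbox name is a placeholder; DeleteMessage is the property that marks a delete action):

    Get-InboxRule -Mailbox "someuser" | Where-Object { $_.DeleteMessage } | Select-Object Name, Description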

The result is a list of rule names and a friendly description of what the rule does.

Run this from the EMS of course.

Delays with Send As permissions, Quotas, Mailbox permissions in Exchange

Send As permissions prompted this post so I’ll stick to that. But what I describe below also affects mailbox permissions, quotas, and other Exchange information stored in AD. 

Assigning someone Send As permissions in Exchange is easy. Use the EMC, or if you love PowerShell do something like this:
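
For instance (names are placeholders):

    Add-ADPermission -Identity "Some User" -User "MYDOMAIN\AnotherUser" -ExtendedRights "Send As"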

However, there is a catch in that the permission isn’t always granted immediately. It could be instant, but it could also take up to 2 hours.

This is because when we make a change to a user object – even to a mail-related attribute such as Send As – the change isn’t immediately read by Exchange. The change is in fact made in Active Directory (notice the cmdlet above) and Exchange has to query AD to get the changed value. If Exchange were to query AD each time it needs to do something with a mail enabled user object, that would mean poor performance for both Exchange and Domain Controllers. Instead, Exchange servers have a component called Directory Service Access (DSAccess) (renamed to Active Directory Access (ADAccess) since Exchange 2007) that periodically queries AD and caches the results for a period of time. When you grant someone Send As permissions as above, if Exchange’s cache isn’t updated with the latest value it will reject the user trying to send email as the other user until such time as the cache is updated.

This cache is commonly referred to as Mail Box Information (MBI) Cache. 

The items in the MBI Cache are controlled by something called the Mailbox Cache Age Limit. There is a registry key HKLM\System\CurrentControlSet\Services\MSExchangeIS\ParametersSystem which has two values:

  • Mailbox Cache Age Limit – a value of type DWORD (decimal) that controls the age of the items in the cache.
    • The higher the age limit, the longer the item can stay in cache (and so won’t be refreshed). 
    • The default for this is 120 minutes (i.e. 2 hours). This is the default even if the value does not exist.
    • So this means any item present in the cache is not updated with new information for up to 2 hours – even if DSAccess/ ADAccess queries AD in between.
  • Reread Logon Quotas Interval – a value of type DWORD (decimal) that controls how often the Exchange information store reads the mailbox quota information from the cache.
    • This is not relevant for the current topic but I thought to mention as it is present in the same place, and quotas are another area which commonly affect the user / administrator expectation. 
    • The default for this 7200 seconds (i.e. 2 hours). This is the default even if the value does not exist. 
    • The recommended value seems to be 1200 seconds (i.e. 20 mins).
    • Note that this controls how often information is read from the cache. You could read from the cache every 20 mins, but if you only update the cache every 2 hours (the default for the previous registry value) you will still get stale information! So any changes to this registry value must be done in conjunction with the previous registry value. 

To get changes made to mail objects in AD propagate faster to Exchange one must change the two values above (or at least the first one if you don’t care about quotas). 

Bear in mind though, that the above values only control how often the cache items expire or are referred to. They do not control how often DSAccess/ ADAccess queries AD for new information. This is controlled by a value at another registry key HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\MSExchange ADAccess\Instance0:

  • CacheTTLUser – a value of type DWORD (decimal) that controls how often DSAccess/ ADAccess queries a DC for new information. 
    • The default for this is 300 seconds (i.e. 5 mins). This is the default even if the value does not exist. 
    • Lowering this value is not recommended as it will affect Exchange performance. 

Once any of these values are changed the Information Store service (or the Exchange server itself) must be restarted. 
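
As a sketch, shortening the cache age limit to 30 minutes might look like this (30 is an arbitrary example; Set-ItemProperty creates the value if it doesn’t exist):

    Set-ItemProperty -Path "HKLM:\System\CurrentControlSet\Services\MSExchangeIS\ParametersSystem" -Name "Mailbox Cache Age Limit" -Value 30 -Type DWord
    Restart-Service MSExchangeIS   # the Information Store must be restarted for the change to apply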

In addition to the registry values I mention above, there’s one more worth mentioning. It isn’t too important because its value can never exceed that of the Mailbox Cache Age Limit, but it’s worth knowing about. In the same registry key as Mailbox Cache Age Limit – i.e. at HKLM\System\CurrentControlSet\Services\MSExchangeIS\ParametersSystem – there is another value:

  • Mailbox Cache Idle Limit – a value of type DWORD (decimal) that controls how long an item can be idle in the cache before it is removed from the cache.  
    • The default for this is 900 seconds (i.e. 15 mins). This is the default even if the value does not exist. 
    • The maximum for this value can be up to the value of Mailbox Cache Age Limit (for obvious reasons – that’s when the item is removed anyway!).
    • This is an interesting setting. It seems to be there to introduce some randomness in how cache items are refreshed, and also to stir up the cache a bit?
      • Without it all items in the cache will expire at the same time (every 2 hours by default) and will need to be refreshed together. But with this setting some items will be periodically discarded (every 15 mins by default) if they were not accessed by the Exchange server during that interval and a fresh copy cached the next time DSAccess/ ADAccess contacts a DC. 
      • This way any object that wasn’t accessed recently will have an up-to-date copy in the cache, but objects that were accessed recently continue to be cached (and potentially have stale information). 
      • So at any given point of time the cache will have items of varying expiry time (i.e. not all will expire together in 2 hours). And the first time an item is accessed in a long time, it will possibly have the latest information. 

With this in mind one can appreciate why updates to a mail enabled object may not immediately take effect. There are so many factors! It will depend on whether the object has passed its idle time. If not, it will depend on whether the object has passed its cache age. Further, if it’s a quota setting, it will also depend on the quota reread interval. Also, how often ADAccess/ DSAccess contacts a DC matters. And on top of all this, AD replication too plays a role because the change has to replicate from the DC you made it on to the DC that ADAccess/ DSAccess recently contacted to cache information.

You could be lucky in that the object you changed has just passed its idle time and so ADAccess/ DSAccess will get an up-to-date copy soon. You could be lucky in that even though the object hasn’t passed its idle time it’s about to expire in a few minutes and so will be refreshed. Or you could be unlucky and make a change just after the object was refreshed, when it isn’t going idle any time soon, so you’ll have to wait a whole 2 hours (by default) before your change takes effect!

As Yoda would say, patience you must have with Exchange and permissions. :)

p.s. While writing this post I learnt that Update Rollup 4 for Exchange 2010 SP 2 introduces the ability to automatically copy emails you send on behalf of someone to that person’s Sent Items folder too. Good to know! That’s a useful feature. 

Notes on X.400, X.500, IMCEA, and Exchange

Was reading about X.400, X.500, IMCEA, and Exchange today and yesterday. Here are my notes on these.

SMTP

We all know of SMTP email addresses. These are email addresses of the form user@domain.com and are defined by the SMTP standard. SMTP is the protocol for transferring emails between users with these addresses. It runs on the IP stack and makes use of DNS to look up domain names and mail servers.

SMTP was defined by the IETF for sending emails over the Internet. SMTP grew out of standards developed in the 1970s and became widely used during the 1980s.

X.400

Before SMTP became popular, during the initial days of email, each email system had its own way of defining email addresses and transferring emails.

To standardize the situation the ITU-T came up with a set of recommendations. These were known as the X.400 recommendations, and although they were designed to be a new standard they didn’t catch on as expected. Mainly because the ITU-T charged for X.400 and systems had to be certified as X.400 compliant by the ITU-T; but also because X.400 was more complex (the X.400 recommendations were published in two large books known as the Red Book and Blue Book, with an additional book added later called the White Book) and its email addresses were complicated. X.400 is still used in many areas though, where some of its advantageous features over SMTP are required. For instance, in military, financial (EDI), and aviation systems. Two articles that provide a good comparison of X.400 and SMTP are this MSExchange article and this Microsoft KB page.

X.400 email addresses are different and more complicated compared to SMTP email addresses. For instance the equivalent of an SMTP email address such as some.one@somedomain.com in X.400 could be C=US;ADMD=;PRMD=SomeDomain;O=First Organization;G=Some;S=One.

From the X.400 email address above one gets an idea of how the address is defined. An X.400 address uses a hierarchical naming system and consists of a series of elements that are put together to form the address. Elements such as C (country name), ADMD (administrative management domain, usually the ISP), O (organization name), G (given name), and so on. The original X.400 standard only specified the elements but not the way they could be written. Later, RFC 1685 defined one way of writing an X.400 address. The format defined by RFC 1685 is just one way though. Other standards too exist, so there is no one way of writing an X.400 address. This adds to the complexity of X.400.

X.400 was originally designed to run on the OSI stack but has since been adapted to run on the TCP/IP stack.

X.500

The ITU-T, in association with the ISO, designed another standard called the X.500. This was meant to be a global directory service and was developed to support the requirements of X.400 email messaging and name lookup. X.500 defined many protocols running on the OSI stack. One of these protocols was DAP (Directory Access Protocol), specified in X.511, and it defined how clients accessed the X.500 directory. A popular implementation of DAP for the TCP/IP stack is LDAP (Lightweight Directory Access Protocol). Microsoft’s Active Directory is based on LDAP.

(As an aside, the X.509 standard for a Public Key Infrastructure (PKI) began as part of the X.500 standards).

The X.500 directory structure is familiar to those who have worked with Active Directory. This is not surprising as Active Directory is based on X.500 (via LDAP). Like Active Directory and X.400, X.500 defines a hierarchical namespace, with elements having Distinguished Names (DNs) and Relative Distinguished Names (RDNs). As with X.400 though, X.500 does not define how to represent the names (i.e. X.500 does not define a representation like CN=SomeUser,OU=SomeOU,OU=AnotherOU,DC=SomeDomain,DC=Com; it only specifies what the elements are). X.500 does not even require the elements to be strings as Active Directory and LDAP do. The elements are represented using an ASN.1 notation and each X.500 implementation is free to use its own form of representation. The idea being that whilst different X.500 implementations can use their own representation, since the elements used are specified by the X.500 standard these implementations can communicate with each other by passing the elements without structure.

Active Directory and LDAP use a string representation.

  • LDAP uses a comma-delimited list arranged from the object up to the root – the object referred to appears leftmost (first), the root appears rightmost (last). For example: CN=SomeUser,OU=SomeOU,OU=AnotherOU,DC=SomeDomain,DC=Com.
  • Active Directory uses a forward slash (/) and arranges the list the other way, from the root down to the object – the root appears leftmost (first), the object referred to appears rightmost (last). For example: /DC=Com/DC=SomeDomain/OU=AnotherOU/OU=SomeOU/CN=SomeUser. Unlike LDAP, Active Directory does not support the C (Country name) element and it uses DC (Domain Component) instead of O (Organization).

Some examples of X.500 elements in non-string representations can be found on this page from IBM and this page from Microsoft.

Exchange

Now on to how all these tie together. Since I am mostly concerned with Exchange that’s the context I will look at these in.

Exchange 4.0, the first version of Exchange, used X.400 (via RPC) internally for email messaging. It had its own X.500/LDAP based directory service (Windows didn't have Active Directory yet). User objects had an X.400 email address and were referred to in the X.500/LDAP directory via their obj-Dist-Name (Distinguished Name, or DN) attribute.

For example, the obj-Dist-Name attribute could be /O=SomeDomain/OU=SomeOU/OU=AnotherOU/CN=SomeUser. This object might have an X.400 email address of C=US;ADMD=;PRMD=SomeDomain;O=First Organization;OU1=SomeOU;G=UserFirstName;S=UserLastName.

The obj-Dist-Name attribute is important because this is what uniquely represents a user. In the previous example, for instance, the user with obj-Dist-Name attribute /O=SomeDomain/OU=SomeOU/OU=AnotherOU/CN=SomeUser could have two X.400 addresses: C=US;ADMD=;PRMD=SomeDomain;O=First Organization;OU1=SomeOU;G=UserFirstName;S=UserLastName and C=US;ADMD=;PRMD=SomeDomain;O=First Organization;OU1=SomeOU1;G=UserFirstName;S=UserNewLastName (the last name differs between the two). Emails addressed to either address should be delivered to the same user, so Exchange resolves the X.400 address to the obj-Dist-Name attribute and routes emails to the correct user object.

This approach has a catch, however. If a user object is moved to a different OU or renamed, its obj-Dist-Name attribute changes too. The Distinguished Name (DN) is made up of the Common Name (CN) and the path to the object, after all, so any change to these results in a new Distinguished Name, and even though the object may have the same X.400 email addresses as before, as far as the X.500/LDAP directory is concerned it is a new user. Exchange itself is fine with this because it will resolve an email sent to one of the X.400 addresses to the new object. The catch is that clients (such as Outlook) cache the From: address of an email as the obj-Dist-Name attribute rather than the X.400 email address (because that's what Exchange uses, after all), so any reply sent to such an email will bounce with an NDR: as far as Exchange is concerned the object referenced by the older obj-Dist-Name attribute does not exist, so it doesn't know what to do with the email. The same catch bites when Outlook auto-completes cached email addresses. Again, Outlook caches the obj-Dist-Name attribute of the object rather than the X.400 address, so if the obj-Dist-Name attribute changes the older one is no longer valid and emails to it bounce.

To work around this, apart from the X.400 address and the obj-Dist-Name attribute of an object, a new address type called the X.500 address is created whenever the obj-Dist-Name attribute changes. This X.500 address simply holds the previous obj-Dist-Name attribute value (it's called an X.500 address because its format is that of the obj-Dist-Name attribute: it holds the previous X.500 location of the object). Each time the obj-Dist-Name attribute changes, a new X.500 address is created with the older value. When a client sends an email to a non-existent obj-Dist-Name attribute, Exchange finds that it does not exist and so searches through the X.500 addresses it knows. If the X.500 address was created correctly, Exchange finds a match and the email is delivered to the user.
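Exchange 4.0 predates PowerShell, of course, but the same mechanism survives to this day: you can add an X500 proxy address by hand on a modern Exchange shell. A minimal sketch, assuming a hypothetical mailbox someUser and the DN from the example above:

# Hypothetical mailbox; preserve the old DN as an X500 proxy address
Set-Mailbox someUser -EmailAddresses @{add = "X500:/O=SomeDomain/OU=SomeOU/OU=AnotherOU/CN=SomeUser"}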

Exchange 5.0 was similar to Exchange 4.0 but it had an add-on called the Internet Mail Connector which let it communicate directly with SMTP servers on the Internet. This was a big step because Exchange 5.0 users could now email SMTP addresses on the Internet. These users didn't have an SMTP address themselves, so Exchange 5.0 needed some way of providing them with one automatically. Thus was born the idea of encapsulating addresses using the Internet Mail Connector Encapsulated Address (IMCEA) method.

IMCEA lets Exchange 5.0 send emails from a user with an X.400 address to an external SMTP address by encapsulating the X.400 address within an IMCEA encapsulated SMTP address. It also lets the user receive SMTP emails at this IMCEA encapsulated SMTP address. IMCEA is a straightforward way of converting a non-SMTP email address to an IMCEA encapsulated SMTP address: the string "IMCEA", followed by the type of the non-SMTP address, followed by a dash, is prefixed to the address; within the address, letters and numbers are left as they are, slashes are converted to underscores, and everything else is converted to its ASCII code in hex with a plus put before the code; lastly the Exchange server's primary domain is appended with an @ symbol. Thus every non-SMTP email address has a corresponding IMCEA encapsulated address and vice versa.

Consider the following X.400 address: C=US;ADMD=;PRMD=SomeDomain;O=First Organization;G=UserFirstName;S=UserLastName.

It could be encapsulated in IMCEA as IMCEAX400-C+3DUS+3BADMD+3D+3BPRMD+3DSomeDomain+3BO+3DFirst+20Organization+3BG+3DUserFirstName+3BS+3DUserLastName@somedomain.com (the address type here is X400; an encapsulated legacy Exchange DN, which we'll meet later, gets the prefix IMCEAEX- instead). As you can see the IMCEA address looks like a regular SMTP email address (albeit a long and weird looking one).
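To make the conversion rules concrete, here is a rough sketch of the encapsulation in PowerShell. This is my own illustration of the rules above, not Microsoft's actual code, and the function name is made up:

# Sketch of IMCEA encapsulation (illustration only; ConvertTo-ImceaAddress is a made-up name)
function ConvertTo-ImceaAddress {
    param(
        [string]$Address,   # the non-SMTP address, e.g. an X.400 address
        [string]$Type,      # the address type, e.g. "X400" or "EX"
        [string]$Domain     # the Exchange server's primary domain
    )
    $sb = New-Object System.Text.StringBuilder
    foreach ($ch in $Address.ToCharArray()) {
        if ($ch -match '[A-Za-z0-9]') { [void]$sb.Append($ch) }        # letters and numbers stay as-is
        elseif ($ch -eq '/') { [void]$sb.Append('_') }                 # slashes become underscores
        else { [void]$sb.Append('+' + ('{0:X2}' -f [int][char]$ch)) }  # everything else: '+' plus the hex ASCII code
    }
    "IMCEA$Type-$($sb.ToString())@$Domain"
}

Running it against the X.400 address above produces the IMCEA address shown earlier:

ConvertTo-ImceaAddress -Address 'C=US;ADMD=;PRMD=SomeDomain;O=First Organization;G=UserFirstName;S=UserLastName' -Type 'X400' -Domain 'somedomain.com'
# IMCEAX400-C+3DUS+3BADMD+3D+3BPRMD+3DSomeDomain+3BO+3DFirst+20Organization+3BG+3DUserFirstName+3BS+3DUserLastName@somedomain.com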

Internally Exchange 5.0 continued using X.400 (via RPC) for email messaging and stored objects in its X.500/LDAP directory.

Exchange 5.5 was similar to Exchange 5.0. It added an X.400 connector to let it communicate with other X.400 email systems. Internally it continued using X.400 (via RPC) for email messaging and stored objects in its X.500/LDAP directory.

Exchange 2000 took another big step in that it no longer had its own directory service. Instead it made use of Active Directory, which had launched with Windows 2000 Server. Exchange 2000 thus required a Windows 2000 Active Directory environment.

Internally, Exchange 2000 switched to SMTP for email messaging. This meant new accounts created on Exchange 2000 would have both SMTP and X.400 email addresses by default. Exchange 2000 still supported X.400 via its MTA (Message Transfer Agent), and the X.400 connector let it talk to external X.400 systems, but SMTP was what the server's core engine used.

The switch to Active Directory meant that Exchange 2000 started using the Active Directory attribute distinguishedName instead of obj-Dist-Name to identify objects. The distinguishedName attribute in Active Directory has a different form to the obj-Dist-Name in the X.500/LDAP directory. To give an example, an object with an obj-Dist-Name attribute of /O=SomeDomain/OU=SomeOU/OU=AnotherOU/CN=SomeUser could have the equivalent distinguishedName attribute CN=SomeUser,OU=AnotherOU,OU=SomeOU,DC=SomeDomain,DC=Com.

The distinguishedName attribute is dynamically updated by Active Directory based on the location of the object and its name (i.e. moving the object to a different OU or renaming it will change this attribute).

To provide backward compatibility with Exchange 5.5, older clients, and 3rd party software, Exchange 2000 added a new attribute called LegacyExchangeDN to Active Directory. This attribute holds the info that previously used to be in the obj-Dist-Name attribute (in the same format used by that attribute). Thus in the example above, the LegacyExchangeDN attribute would be /O=SomeDomain/OU=SomeOU/OU=AnotherOU/CN=SomeUser. A new service called the Exchange Recipient Update Service (RUS) was introduced to keep the LegacyExchangeDN attribute up to date.

Although Exchange 2000 switched to SMTP for email messaging, it continued using the X.500/LDAP style addressing scheme (via the LegacyExchangeDN attribute) to route internal emails. This means the LegacyExchangeDN attribute must be kept up-to-date and correctly formatted for internal email delivery to work. When the Exchange server receives an internal email, it looks up the SMTP email address in Active Directory and finds the LegacyExchangeDN to get the object. If the LegacyExchangeDN attribute cannot be found, it searches the X.500 addresses (as before) to find a match. Thus Exchange 2000 behaves much like its predecessors, except that it uses the LegacyExchangeDN attribute.
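On a newer version of Exchange you can see the two attributes side by side. A quick sketch, assuming a hypothetical mailbox someUser:

# Compare the AD distinguished name with the legacy (X.500 style) Exchange DN
Get-Mailbox someUser | Format-List DistinguishedName, LegacyExchangeDN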

Exchange 2003 was similar to Exchange 2000.

Exchange 2007 dropped support for X.400. This also meant Exchange 5.5 couldn’t be upgraded to Exchange 2007 (as Exchange 5.5 users would have X.400 addresses and support for that is dropped in Exchange 2007). New accounts created on Exchange 2007 have only SMTP addresses. Migrated accounts will have X.400 addresses too.

Exchange 2007 dropped both the X.400 connector and MTA. To connect to an external X.400 system Exchange 2007 thus requires use of a foreign connector that will send the message to an Exchange 2003 server hosting an X.400 connector.

Exchange 2007 also got rid of the concept of administrative groups. This affects the format of the LegacyExchangeDN attribute in the sense that previously a typical LegacyExchangeDN attribute would be of the format /o=MyDomain/ou=First Administrative Group/cn=Recipients/cn=UserA (the administrative group name could change) but now it would be of the format /o=MyDomain/ou=Exchange Administrative Group (FYDIBOHF23SPDLT)/cn=Recipients/cn=UserA (notice the administrative group name is fixed as there's only one default group; "FYDIBOHF23SPDLT" is "EXCHANGE12ROCKS" with every character shifted up by one).
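You can verify the in-joke from any PowerShell prompt by shifting each character back down by one:

# Decode the fixed administrative group name
-join ('FYDIBOHF23SPDLT'.ToCharArray() | ForEach-Object { [char]([int]$_ - 1) })
# EXCHANGE12ROCKS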

Exchange 2010 was similar to Exchange 2007 in this context.

Exchange 2010 SP1 Rollup 6 made a change to the LegacyExchangeDN attribute to add three random hex characters at the end for uniqueness (previously Exchange would append an 8 digit random number only if there was a clash).

Exchange 2013 is similar to Exchange 2010 to the best of my knowledge (haven’t worked much with it so I could be mistaken).

Implications

It’s worth stepping back a bit to look at the implications of the above.

Say I am in an Exchange 2010 SP2 environment and I have a user account testUser1 with SMTP email address testUser1@mydomain.com. The Active Directory object associated with this account could have a LegacyExchangeDN attribute of the form /o=MyDomain/ou=Exchange Administrative Group (FYDIBOHF23SPDLT)/cn=Recipients/cn=testUser1abc. If I have previously sent emails from this SMTP email address to another internal user, that user’s Outlook will cache the address of testUser1 as /o=MyDomain/ou=Exchange Administrative Group (FYDIBOHF23SPDLT)/cn=Recipients/cn=testUser1abc and NOT testUser1@mydomain.com.

Next, I create a new user testUser2, remove the testUser1@mydomain.com SMTP email address from testUser1, and assign it to testUser2. This new account will have its own LegacyExchangeDN attribute.

Now, suppose Outlook were to reply to one of the emails it previously received. One would expect the reply to go to testUser2, as that's where the testUser1@mydomain.com SMTP address now points (and Exchange 2010 uses SMTP internally, so nothing else should matter!), but that is not what happens. Instead the reply goes to testUser1: Outlook has stored the LegacyExchangeDN attribute of that object in the received email, so even though we typed an SMTP address that is now assigned to a different user, Exchange only goes by the LegacyExchangeDN attribute and happily routes the email to the first user.

On the other hand, suppose another user (who has never written to the testUser1@mydomain.com address before) were to compose a fresh email to the testUser1@mydomain.com address: it will correctly go to the newly created account. This is because Outlook doesn't have a cached LegacyExchangeDN entry for this address, so it is up to the Exchange server to resolve the SMTP address to a LegacyExchangeDN, and it correctly picks up the new account.

A variant of this recently affected a colleague of mine. We use Outlook 2010 in cached mode at work. This colleague made a new user account and email address. Later that evening she was asked to delete that account and recreate it afresh. She did so, then noted that the new account had the same SMTP email address as the previous one, and so assumed everything would work fine. It did, sort of.

The user was able to receive external emails fine, and also emails sent to Distribution Lists she was a member of. But if anyone internally composed a fresh email to this user, it bounced with an NDR:

Generating server: mail.domain.com

IMCEAEX-_O=EXAMPLE_OU=EXCHANGE+20ADMINISTRATIVE+20GROUP+20+28FYDIBOHF23SPDLT+29_CN=RECIPIENTS_CN=LastName+2C+20FirstName264@domain.com
#550 5.1.1 RESOLVER.ADR.ExRecipNotFound; not found ##

Original message headers:

Received: from …

From the NDR we can see the email was sent to the LegacyExchangeDN attribute address (encapsulated in an IMCEA address, as Exchange uses SMTP for delivering messages). Since this was the first time any user was emailing this address, it should have worked: there's no cached LegacyExchangeDN attribute value to mess things up. And it would have worked, but for two things:

  1. Because we are on Exchange 2010 SP1 Rollup 6 or above (notice the three hex digits in the LegacyExchangeDN address encapsulated in the IMCEA address above), the LegacyExchangeDN value has changed even though the user account details are exactly the same as before. If we were on an earlier version of Exchange the LegacyExchangeDN attribute would have stayed the same and the email would happily have been delivered. But the random digits make the new LegacyExchangeDN attribute a different one, thus bouncing the email. The email is being sent to a LegacyExchangeDN attribute value ending with the digits “264”, while the actual digits are “013” (found via the Get-Mailbox userName | fl LegacyExch* cmdlet).
  2. That still doesn’t explain why Outlook has outdated info. Since no one has emailed this new user before, Outlook shouldn’t have cached info. This is where the fact that Outlook was in cached mode comes into the picture. Cached mode implies Outlook has an Offline Address Book (OAB). The Exchange server generates the OAB periodically and clients download it periodically, so it’s possible one of these stages still had the stale info. The best fix here is to manually regenerate the OAB, then get Outlook to download it (see the sketch after this list). We did that at work, and users were then able to email the new user correctly (of course, if they had already tried emailing the user before and got an NDR, they had to remove the name from the auto-complete list and re-select it from the GAL).
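Regenerating the OAB is quick from the shell. A minimal sketch; I'm regenerating every OAB here rather than naming a specific one:

# Regenerate all OABs (or pass a specific one, e.g. Update-OfflineAddressBook "Default Offline Address Book")
Get-OfflineAddressBook | Update-OfflineAddressBook

After that, have Outlook pull a fresh copy (Send/Receive > Send/Receive Groups > Download Address Book).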

Recap

  • X.400 addresses are of the form: C=US;ADMD=;PRMD=SomeDomain;O=First Organization;G=Some;S=One. They are hierarchical and consist of elements. There is no true standard for the representation of an X.400 address.
  • X.500 is a directory service specification. It is not an email address (just clarifying!). The representation of an X.500 object is up to the implementation. 
    • An LDAP example would be: CN=SomeUser,OU=AnotherOU,OU=SomeOU,DC=SomeDomain,DC=Com.
    • An AD canonical name example would be: SomeDomain.Com/SomeOU/AnotherOU/SomeUser.
  • Exchange 4.0, 5.0, 5.5 used X.400 internally. Exchange 2000 onwards use SMTP internally.
    • All versions of Exchange identify objects via their obj-Dist-Name attribute (Exchange 4.0 – 5.5) or distinguishedName attribute (Exchange 2000 and upwards) in the directory.
    • Example obj-Dist-Name attribute: /O=SomeDomain/OU=SomeOU/OU=AnotherOU/CN=SomeUser.
    • Example distinguishedName attribute: CN=SomeUser,OU=AnotherOU,OU=SomeOU,DC=SomeDomain,DC=Com.
    • Notice the format is different!
  • For backwards compatibility Exchange uses the LegacyExchangeDN attribute. It holds the same value, in the same format, as the obj-Dist-Name attribute did.
    • This attribute is kept up-to-date by Exchange.
    • For all intents and purposes it is fair to say Exchange uses this attribute to identify objects. Users may think they are emailing SMTP addresses internally, but they are really emailing LegacyExchangeDN attributes.
  • Since the LegacyExchangeDN attribute changes when the location or name of an object changes, and this could break mail-flow even though the X.400 or SMTP address may not have changed, Exchange adds an X.500 “address” for each previous LegacyExchangeDN attribute value an object may have had. That is, when the LegacyExchangeDN attribute changes, the previous value is stored as an X.500 address. This way Exchange can fall back on X.500 addresses if it is unable to find an object by LegacyExchangeDN.
  • IMCEA is a way of encapsulating a non-SMTP address as an SMTP address. Exchange uses this internally when sending email to an X.500 address or LegacyExchangeDN attribute. Example IMCEA address: IMCEAEX-_O=EXAMPLE_OU=EXCHANGE+20ADMINISTRATIVE+20GROUP+20+28FYDIBOHF23SPDLT+29_CN=RECIPIENTS_CN=LastName+2C+20FirstName264@domain.com

That’s all for now!

Managing Exchange 2010 from Windows XP

Although the Exchange Management Console (EMC) and Exchange Management Shell (EMS) are 64-bit tools that can only be installed on Vista SP2 or higher, it is possible to manage Exchange 2010 from a 32-bit Windows XP machine. Exchange 2010 relies on Windows PowerShell for its management tasks, and the EMC and EMS are only different ways of running these PowerShell commands. Both tools do remote PowerShell into an IIS PowerShell virtual directory published on the Exchange 2010 server, so one can install PowerShell 2.0 on Windows XP and do the same. There is one catch, but it doesn't hamper usage much.

Here’s what you’ve got to do:

First, download and install the Windows Management Framework for Windows XP from here. This installs PowerShell 2.0 and Windows Remote Management (WinRM) 2.0. Launch PowerShell and type the following:
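Something along these lines; I'm assuming the default PowerShell virtual directory and Kerberos authentication here:

$session = New-PSSession -ConfigurationName Microsoft.Exchange -ConnectionUri http://<exchangeserver>/PowerShell/ -Authentication Kerberos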

Replace <exchangeserver> with the name of your Exchange 2010 server.

If you are running PowerShell as a user who is authorized to connect to the Exchange 2010 server the above command is fine. But if you need to specify credentials, then tack on a -Credential switch thus:
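$session = New-PSSession -ConfigurationName Microsoft.Exchange -ConnectionUri http://<exchangeserver>/PowerShell/ -Credential (Get-Credential)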

This will prompt you to enter the username/password and then connect to Exchange 2010 remote PowerShell. You can also replace (Get-Credential) with a username and you will get prompted only for the password.

Next, import the session into your existing one thus:
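Import-PSSession $session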

This could take a while. Once it completes all the usual EMS commands will work from your PowerShell window on Windows XP.

I have a snippet like this in my $PROFILE file:
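Something like this (the variable name is just my convention):

# Create the remote session at startup but don't import it yet
$exSession = New-PSSession -ConfigurationName Microsoft.Exchange -ConnectionUri http://<exchangeserver>/PowerShell/
$exSession | Format-Table Name, Id

# When I want to run Exchange cmdlets: Import-PSSession $exSession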

What this does is create the remote session and show me the session name and ID, but without importing it. Whenever I want to do any Exchange related tasks I can import the session.

Now on to the one catch with this method. Because the objects are serialized to XML and back when passed from server to client, some of the member types get converted to strings. This is usually not a problem, but it does make a difference with certain operations. For instance, when it comes to sizes you can't easily convert a size to MB or KB, as the member is no longer a size type but a String. Similarly, when sorting you would expect an order such as 1, 2, 3, 100, 200, 1000, etc., but because the "numbers" are now strings the result you get is 1, 100, 1000, 2, 200, 3, etc.
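You can see the string-sorting behaviour for yourself:

# Once the "numbers" are strings they sort lexically, not numerically
"1","2","3","100","200","1000" | Sort-Object
# 1, 100, 1000, 2, 200, 3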

Update: I detail a workaround for this in a later post.

Redirect HTTP to HTTPS with IIS

While deploying OWA in Exchange 2010, by default the OWA website is set to be accessible only via HTTPS. Connecting to the website via HTTP results in an error page, and while this is fine and dandy it would be convenient if users were silently redirected to the HTTPS page.

Visiting the main URL of the domain where OWA is hosted (as in visiting http://mail.contoso.local/ instead of the OWA URL http://mail.contoso.local/owa I was talking about above) also results in an error page. Again, it would be convenient if users were silently redirected to the HTTPS OWA page.

There are two ways of doing this. The first approach is to do a plain redirect. Open up IIS Manager, go to the default website, double-click “HTTP Redirect”, and fill it up similar to the screenshot below:

[screenshot]

Here you are telling IIS that incoming requests to the default website should be redirected to the OWA link you specify. Also, only requests to the default website link itself are to be redirected, not anything to a subdirectory of that website (as in redirect http://mail.contoso.local/ but not http://mail.contoso.local/blah). This is very important (for instance: failing to do so will break EMC and EMS, as Exchange uses Remote PowerShell for its management tasks, and these connect to a PowerShell Virtual Directory in IIS).

[screenshot]

You can set up a similar redirect in the OWA Virtual Directory too so anyone accessing http://mail.contoso.local/owa is redirected to the HTTPS version.

Apart from this, go to the SSL Settings in both places (default website and OWA Virtual Directory) and un-tick the checkbox that says Require SSL. This is important – otherwise the server will not apply the redirect, since we'd be telling it that any access to this website requires SSL.

While this setup is fine, I don’t find it elegant as we are redirecting all links to a specific host. It’s possible the web server is accessible over two addresses – http://mail.contoso.local and http://mail.fabrikam.local – but now we are redirecting even the fabrikam.local addresses to https://mail.contoso.local. Not good.

Hence the second approach: URL rewriting. This requires the URL Rewrite module, which can be downloaded from IIS.net. A handy reference guide can be found at the same site. Once you install the module and close and reopen IIS Manager, you will see an icon for “URL Rewrite”. This gives a GUI to manage rewrite rules.

To create a new rule using the GUI click on “Add Rule(s)” at the top right corner in the actions pane. Select “Blank Rule”, give it a name (I called mine “Redirect HTTP to HTTPS for default site”), and fill it in thus:

[screenshot]

The Match URL section looks at the URL and performs matches on it. You can specify matches in terms of regular expressions, wildcards, or an exact string. The pattern is matched against the portion after the ‘/’ following the domain name. For instance if a URL is of the form http://mail.contoso.local/owa/blah/blah, the pattern is matched against ‘owa/blah/blah’. You can set up a rule to perform a certain action if the pattern matches or does not.

In my case, I wanted to target URLs of the form http://mail.contoso.local/ with nothing following the ‘/’ after the domain name. So I set the pattern to ‘^$’, which in regular expression language stands for an empty pattern (the ‘^’ denotes the start of the pattern and the ‘$’ denotes the end; essentially we are matching a pattern with nothing in it).

Note that since URL matching only looks at the bits after the ‘/’ in the URL, we can’t use it to distinguish between HTTP and HTTPS links. This matters because without such a distinction, a rule that redirects http://mail.contoso.local/ to https://mail.contoso.local/ would also match the redirected-to link, which would be redirected again (and again and again …) in a perpetual loop. So we want the above pattern to be matched, but only if certain conditions are met. This is what the next section of the UI lets you specify.

From the configuration reference we can note that the HTTPS server variable lets you determine whether the link was HTTP or HTTPS: if it was HTTP, this variable has the value OFF. We can use this to skip matching for HTTPS links.

Click “Add” in the Conditions section, fill it up thus, and click OK:

[screenshots]

So now the rule will match all HTTP (non-HTTPS) links with an empty pattern. Good!

Next step is to tell IIS what to do in case of such matching URLs. This is specified at the Action section:

[screenshot]

We specify the action to be taken as Redirect, and specify the Redirect URL as https://{HTTP_HOST}/owa. HTTP_HOST is a server variable that contains the domain name in the URL, so we can use it to redirect the URL to whatever domain it was meant for, but with HTTPS prefixed. (This is why we are doing the rewriting in the first place instead of a blanket redirect as in the first approach.) In this case, I want the URL to redirect to OWA, so I set the redirect URL as https://{HTTP_HOST}/owa.

Next I create a similar rule for OWA so IIS redirects http://mail.contoso.local/owa to https://mail.contoso.local/owa. The Conditions section for this rule is the same as above so I am skipping that in the screenshot:

[screenshot]

I set the Match URL to ^owa/?$, which stands for URLs starting with the word ‘owa’, followed by an optional slash (that’s what the /? stands for), and ending there. Such URLs – non-HTTPS ones; don’t forget to add the Conditions section for that – are redirected to https://{HTTP_HOST}/owa.

It is also possible to create this second rule under the URL Rewrite section of the OWA Virtual Directory, in which case you’d skip the “owa” in the rule, as the matching happens only after the “owa” bit of the URL (since we created the rule within the OWA Virtual Directory).

Also: in this case I was only concerned with redirecting OWA URLs. But if I wanted to do a similar thing for, say, the ECP URL too, I would tweak the rule above instead of making a fresh one: set the Match URL to ^(owa|ecp)/?$ and the Redirect URL to https://{HTTP_HOST}/{R:0}. Putting the two options – “owa” and “ecp” – in brackets with a pipe between them means we match either word, and {R:0} in the Redirect URL is replaced with whichever one matched.

Here are the two rules in the summary window:

[screenshot]

For those who prefer non-GUI approaches, you can edit the C:\inetpub\wwwroot\web.config file directly and manage rules there. The GUI is only a fancy way of making changes to this file. For the two rules we made above, here’s what the web.config file contains:
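It's along these lines (rule names are whatever you called them in the GUI):

<configuration>
  <system.webServer>
    <rewrite>
      <rules>
        <rule name="Redirect HTTP to HTTPS for default site" stopProcessing="true">
          <match url="^$" />
          <conditions>
            <add input="{HTTPS}" pattern="OFF" />
          </conditions>
          <action type="Redirect" url="https://{HTTP_HOST}/owa" />
        </rule>
        <rule name="Redirect HTTP to HTTPS for OWA" stopProcessing="true">
          <match url="^owa/?$" />
          <conditions>
            <add input="{HTTPS}" pattern="OFF" />
          </conditions>
          <action type="Redirect" url="https://{HTTP_HOST}/owa" />
        </rule>
      </rules>
    </rewrite>
  </system.webServer>
</configuration>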

As with the plain HTTP redirect approach, redirecting HTTP to HTTPS via URL Rewrite rules also requires the Require SSL checkbox to be un-ticked on both the default website and the OWA Virtual Directory.

That’s all for now. I am no IIS expert so there could be errors/inefficiencies in what I posted above. Feel free to correct me in the comments!

EMC sections elaborated

I keep clicking the wrong section in the EMC when trying to remember whether an item I want is in the Organization Configuration section or the Server Configuration section. So here’s a note to myself so I remember it for the future:

Organization Configuration

As is obvious from the name, this section contains items that deal with the organization as such. The main level of the section lets you configure the Federation Trust and Organization Relationships.

Ignoring Unified Messaging, there are three sub-sections here:

Mailbox deals with mailbox related items that concern the organization: stuff such as Database Management and DAGs (yes, Database Management concerns the whole organization and not a specific server as you might expect, because with DAGs a database can be spread across multiple servers, making it an organization-level item), Sharing Policies, Address Lists, Offline Address Book, Managed Default Folders, Managed Custom Folders, and Managed Folder Mailbox Policies (again, all of these concern the whole organization rather than a specific server, which is why they come under this section).

Client Access deals with policies concerning OWA and EAS. Note: policies.

Hub Transport deals with Remote Domains, Accepted Domains, Email Address Policies, Transport Rules, Journal Rules, Send Connectors, Edge Subscriptions, and Global Settings. As you can see, it’s about the “plumbing” of the organization. All these are again policy-type items; they don’t concern configuring a specific server as such.

Server Configuration

This section contains items that deal with specific servers. All the sub-sections let you select a particular server and then configure it. The main level of the section lets you select a server and manage some of its properties such as the DNS servers used, the DCs used, as well as configure Exchange certificates.

Ignoring Unified Messaging, there are three sub-sections here:

Mailbox lets you view properties of the database copies on the server. This sub-section is bare, as databases are configured in the Organization Configuration section.

Client Access lets you configure OWA, ECP, EAS, OAB distribution, and POP3 & IMAP4. It’s worth remembering that all these are per-server items. An OWA URL is specific to a server. You can define a URL that’s shared across many servers in an array or via DNS round-robin, but at the end of the day it is defined on each server separately.

Hub Transport lets you define receive connectors for a server to accept email (there are two by default: one for receiving emails from clients and one for receiving emails from other Exchange servers). Note that this sub-section only concerns itself with receive connectors, as those are connectors you define per server. Send Connectors, on the other hand, are defined for the whole organization and so come under Organization Configuration. Counter-intuitively, while Send Connectors have some server specific configuration (which Hub Transport servers can use a particular Send Connector), since that is tied to the configuration of a Send Connector it is defined in Organization Configuration with the rest of the properties.

There you go, that classifies everything logically!

Exchange 2010 email address policy

Notes to self on the Exchange 2010 email address policy.

One – if you have multiple email address policies, only the one with the highest priority applies. This is obvious in retrospect but easy to forget. For instance, you may have a default policy setting a certain email address template, and another policy applying just to a particular OU, with the latter policy having a higher priority than the default policy. You might expect the primary address template set by the OU specific policy to win while the address templates set by the default policy are still present. Nope! Whichever policy wins, only that one is applied. Email address templates from the other policies are ignored.

As a corollary to this, if you want the default address templates too to be used when another policy wins, be sure to add the default email address templates in this other policy.

Two – removing an email address policy, or moving a user to another OU so that it falls out of scope of a particular policy, does not mean the email addresses defined by that policy are removed. Nope! They continue to exist; just the primary address changes (to the one set by whichever policy now wins). Again, in retrospect this is sensible: it ensures that users continue to receive emails at their previous addresses. Without this behavior, when users move OUs and fall out of scope of a particular policy, their existing email addresses would get removed and they’d stop receiving emails at them – probably not something you expect or want.

If you want the existing email addresses to be deleted, use PowerShell.
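A minimal sketch, assuming a hypothetical mailbox someUser with a leftover address old.alias@mydomain.com:

# Hypothetical address; remove a secondary SMTP address left behind by an old policy
Set-Mailbox someUser -EmailAddresses @{remove = "smtp:old.alias@mydomain.com"}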