
[Aside] Stretchly – Break time reminder

Via the always resourceful How-To Geek, I came across Stretchly and Big Stretch Reminder, two programs that remind you to take breaks and micro-breaks. I have previously used WorkRave but wanted to try something different now. This time I'll try Stretchly, as I like its UI and it is open source.

Update: Switched to a Mac recently and started using Time Out instead. Stretchly doesn’t properly detect natural breaks (on Mac or Windows) so I began searching for alternatives and came across this.

[Aside] Various Exchange 2013 links

I am reading up on Exchange 2013 nowadays (yes, I know, a bit late in the day to be doing that considering it is going out of support :) and these are some links I want to put here as a bookmark to myself. Some excellent blog posts and videos that detail the changes in Exchange 2013.

(By way of background: I am not an Exchange admin. I am Exchange 2010 certified as I have a huge interest in Exchange, and as part of preparing for the certification I attended a course, set up a lab on my laptop, and even began this blog to start posting about my adventures with it. I never got to work with Exchange 2010 at work – except as a helpdesk administrator, one could say – but I am familiar with the concepts even though I have forgotten more than I remember. I have dabbled with Exchange 2000 before that. Going through these links and videos is like a trip down memory lane – seeing concepts that I was once familiar with but have since changed for the better. Hopefully this time around I get to do more Exchange 2013 work! Fingers crossed.)

If you don’t like reading, start with this video.

Alternatively, start with these links but I’d strongly recommend watching the above video once you finish reading.

Preferred Architecture

From the preferred architecture link, I'd like to highlight this point about DAG design as I wasn't aware of it (PA == Preferred Architecture; this is also discussed in the video):

Data resiliency is achieved by deploying multiple database copies. In the PA, database copies are distributed across the site resilient datacenter pair, thereby ensuring that mailbox data is protected from software, hardware and even datacenter failures.

Each database has four copies, with two copies in each datacenter, which means at a minimum, the PA requires four servers. Out of these four copies, three of them are configured as highly available. The fourth copy (the copy with the highest Activation Preference) is configured as a lagged database copy. Due to the server design, each copy of a database is isolated from its other copies, thereby reducing failure domains and increasing the overall availability of the solution as discussed in DAG: Beyond the “A”.

The purpose of the lagged database copy is to provide a recovery mechanism for the rare event of system-wide, catastrophic logical corruption. It is not intended for individual mailbox recovery or mailbox item recovery.

The lagged database copy is configured with a seven day ReplayLagTime. In addition, the Replay Lag Manager is also enabled to provide dynamic log file play down for lagged copies. This feature ensures that the lagged database copy can be automatically played down and made highly available in the following scenarios:

  • When a low disk space threshold is reached
  • When the lagged copy has physical corruption and needs to be page patched
  • When there are fewer than three available healthy copies (active or passive) for more than 24 hours

By using the lagged database copy in this manner, it is important to understand that the lagged database copy is not a guaranteed point-in-time backup. The lagged database copy will have an availability threshold, typically around 90%, due to periods where the disk containing a lagged copy is lost due to disk failure, the lagged copy becoming an HA copy (due to automatic play down), as well as, the periods where the lagged database copy is re-building the replay queue.

With all of these technologies in play, traditional backups are unnecessary; as a result, the PA leverages Exchange Native Data Protection.

The last line made me smile. Never thought I’d read someplace that backups for Exchange are unnecessary! :) If you have a lagged copy database, then you can enable circular logging on the database (this only affects the non-lagged copies) and skip taking backups – or at least not worry about the database dismounting because your backups are failing and logs are filling up disk space!

So what's a lagged database copy? Basically it's a copy of the database (in a DAG) that lags behind the other members by a specified duration (maximum is 14 days). So if the other servers in your DAG have some issue, rather than restore the database from backup you can simply "play down" the lagged database copy (i.e. tell that copy to process all the transaction logs it already has and thus become up-to-date) and activate it. Neat, huh. I want to delve a bit more into this, so check out this "Lagged copy enhancements" section from the Exchange 2013 HA improvements page.
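
For context, the lag itself is a property of the database copy. A minimal sketch of setting a seven-day lag (database and server names are made up; this isn't a command from the linked article):

    Set-MailboxDatabaseCopy -Identity "DB1\EX4" -ReplayLagTime 7.00:00:00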

First there's Safety Net (it's not related to lagged copies, but it plays along with them in a cool way so it's worth pointing out):

Safety Net is a feature of transport that replaces the Exchange 2010 feature known as transport dumpster. Safety Net is similar to transport dumpster, in that it’s a delivery queue that’s associated with the Transport service on a Mailbox server. This queue stores copies of messages that were successfully delivered to the active mailbox database on the Mailbox server. Each active mailbox database on the Mailbox server has its own queue that stores copies of the delivered messages. You can specify how long Safety Net stores copies of the successfully delivered messages before they expire and are automatically deleted.

Ok – so each mailbox server has a queue for each of its active databases (remember lagged copies are active too, just that they have a higher activation preference number and hence aren't preferred). This queue contains messages that were delivered. Even after a message is delivered to a user, Safety Net can keep it around. You get to specify how long a message is kept for. Cool! Next up is this cool integration:

With the introduction of Safety Net, activating a lagged database copy becomes significantly easier. For example, consider a lagged copy that has a 2-day replay lag. In that case, you would configure Safety Net for a period of 2 days. If you encounter a situation in which you need to use your lagged copy, you can suspend replication to it, and copy it twice (to preserve the lagged nature of the database and to create an extra copy in case you need it). Then, take a copy and discard all the log files, except for those in the required range. Mount the copy, which triggers an automatic request to Safety Net to redeliver the last two days of mail. With Safety Net, you don’t need to hunt for where the point of corruption was introduced. You get the last two days mail, minus the data ordinarily lost on a lossy failover.

Whoa! So when a lagged copy is mounted, it asks Safety Net to redeliver all messages from the specified period – so as long as your Safety Net and lagged database copy are configured with the same period, if you mount the lagged copy, Safety Net will redeliver all the messages from that period onwards. (It's cool, but yeah, I can imagine users complaining about a whole bunch of unread messages now, and missing Sent Items etc. – but it's cool, I like it for the geek factor). :)

To re-emphasize something that was mentioned earlier:

Lagged copies can now care for themselves by invoking automatic log replay to play down the log files in certain scenarios:

  • When a low disk space threshold is reached
  • When the lagged copy has physical corruption and needs to be page patched
  • When there are fewer than three available healthy copies (active or passive only; lagged database copies are not counted) for more than 24 hours

Lagged copy play down behavior is disabled by default, and can be enabled by running the following command.
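
The command itself isn't reproduced in the quote above. Going by the DAG cmdlets, it should be something along these lines (DAG name made up, so double-check against the docs):

    Set-DatabaseAvailabilityGroup DAG1 -ReplayLagManagerEnabled $true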

After being enabled, play down occurs when there are fewer than three copies. You can change the default value of 3, by modifying the following DWORD registry value.

HKLM\Software\Microsoft\ExchangeServer\v15\Replay\Parameters\ReplayLagManagerNumAvailableCopies

To enable play down for low disk space thresholds, you must configure the following registry entry.

HKLM\Software\Microsoft\ExchangeServer\v15\Replay\Parameters\ReplayLagLowSpacePlaydownThresholdInMB

After configuring either of these registry settings, restart the Microsoft Exchange DAG Management service for the changes to take effect.
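
Just as a note to future me, setting those registry values from PowerShell would look roughly like this. The threshold numbers below are examples only, and I believe the service short name is MSExchangeDagMgmt; verify both before using:

    New-ItemProperty -Path "HKLM:\Software\Microsoft\ExchangeServer\v15\Replay\Parameters" -Name ReplayLagManagerNumAvailableCopies -Value 2 -PropertyType DWord
    New-ItemProperty -Path "HKLM:\Software\Microsoft\ExchangeServer\v15\Replay\Parameters" -Name ReplayLagLowSpacePlaydownThresholdInMB -Value 10240 -PropertyType DWord
    Restart-Service MSExchangeDagMgmt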

As an example, consider an environment where a given database has 4 copies (3 highly available copies and 1 lagged copy), and the default setting is used for ReplayLagManagerNumAvailableCopies. If a non-lagged copy is out-of-service for any reason (for example, it is suspended, etc.) then the lagged copy will automatically play down its log files in 24 hours.

For future reference this doc has steps on how to mount a lagged database copy – i.e. if you are not doing the automatic play down behavior. You can manually play down via the Move-ActiveMailboxDatabase cmdlet with the -SkipLagChecks switch.

However, it is recommended you first suspend the copy (i.e. make it not “active”) and make a copy of the database and logs just in case.
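
A rough sketch of what that could look like (names made up, and I haven't run this myself):

    # Suspend the lagged copy first, and take a file-level copy of the database and logs
    Suspend-MailboxDatabaseCopy "DB1\EX4" -SuspendComment "Activating lagged copy" -Confirm:$false
    # Then activate it, skipping the lag checks
    Move-ActiveMailboxDatabase DB1 -ActivateOnServer EX4 -SkipLagChecks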

Optionally, if you want to recover to a specific point in time you'd 1) suspend the database copy, 2) make a copy just in case, 3) move elsewhere all log files after the point in time you want to recover to, 4) delete the checkpoint file, 5) run eseutil to recover the database – this is what replays the remaining logs and brings the database up to the point in time you want, and 6) move the database elsewhere to use as a recovery database for a restore. After this you move back the log files previously moved away, and resume the database copy. This blog post has a bit more details but it is more or less the same as the Microsoft doc. Note: I've never ever done this, so all this is more info for future me. :)
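
To put rough commands against those steps (paraphrased and untested; the database, server and the E03 log prefix are made up, so treat this as a sketch only):

    # 1-2) suspend the copy and take a file-level backup of the database and logs
    Suspend-MailboxDatabaseCopy "DB1\EX4" -SuspendComment "Point-in-time recovery" -Confirm:$false
    # 3-4) move away the log files newer than the recovery point, delete the checkpoint file (e.g. E03.chk)
    # 5) replay the remaining logs into the database
    eseutil /r E03 /a
    # 6) copy the recovered database off to use as a recovery database, move the logs back, then resume
    Resume-MailboxDatabaseCopy "DB1\EX4"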

Lastly, that doc also has info on how to activate a lagged copy using Safety Net. Step 4 of the instructions made no sense to me.

Moving on … (but pointing to this HA TechNet link again coz it has a lot of other info that I skipped here).

Outlook Anywhere & OWA behind a WAP server

Some links around publishing Exchange namespaces such as OWA and Outlook Anywhere externally via a WAP server:

The easiest thing to do is pass everything through via the WAP to the internal URL. But if you want, you can set up OWA authentication via ADFS claims. A step-by-step official guide is here, but the two links above cover the same stuff.

Healthcheck.htm

Exchange 2013 has a new monitoring architecture. When monitoring via a load balancer one can use a "healthcheck.htm" URL to test the health of each virtual directory (corresponding to each of the user-consumed services). This URL is per virtual directory; here's an example from Citrix on how to add monitors for each service in NetScaler:

If the service is up, the URL returns an HTTP 200 OK.
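
A quick way to eyeball this from PowerShell (URL made up; substitute your namespace and the virtual directory you care about):

    Invoke-WebRequest https://mail.contoso.com/owa/healthcheck.htm -UseBasicParsing | Select-Object StatusCode, StatusDescription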

Virtual Directory cmdlets

Speaking of virtual directories, if any of the PowerShell Get- cmdlets for virtual directories are slow, this blog post tells you why and what to do about it. These are the cmdlets, and the workaround is to add the -ADPropertiesOnly switch (this makes the cmdlet query the AD configuration partition for the same info rather than query each server's IIS Metabase, which is slower); there's a quick example after the list:

  • Get-WebServicesVirtualDirectory
  • Get-OwaVirtualDirectory
  • Get-ActiveSyncVirtualDirectory
  • Get-AutodiscoverVirtualDirectory
  • Get-EcpVirtualDirectory
  • Get-PowerShellVirtualDirectory
  • Get-OabVirtualDirectory
  • Get-OutlookAnywhere
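
For example (server name made up):

    Get-OwaVirtualDirectory -Server EX01 -ADPropertiesOnly | Format-List Name, InternalUrl, ExternalUrl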

Update: Thought I'd add more videos and links to this post rather than make separate posts.

Transport Architecture

Check out this talk: https://channel9.msdn.com/events/TechEd/2013/OUC-B319. Slides available online here. I wanted to put a screenshot of the transport components as a quick reference to myself in this post:

So the CAS has a stateless SMTP service, affectionately called FET, or Front-End Transport.

The MBX has a stateful and a stateless SMTP service, called Transport and Mailbox Transport respectively. (Transport replaces the Hub Transport role of Exchange 2010.)

There's no longer a Hub Transport role. Previously the Hub Transport role on one server could directly talk to the store of another server – thus there were no clear layers. Everything was messy and tied up using RPC. Now there are clear layers as below, and communication between servers happens at the protocol layer. Within a server, communication goes up and down the layers; across servers it happens via protocols like SMTP, EWS, and MRS proxy. Everything is clean.

Some slides on all three components:

Outbound email from the transport component on an MBX server can go out directly to an external SMTP server, or it can be delivered to the FET on any CAS server in the same site. This delivery happens on port 717 and needs to be specifically enabled.

The Transport component listens on port 25 if MBX and CAS are on separate servers; else it listens on port 2525, as the CAS is already listening on 25. These ports are for accepting messages from the FET. For accepting messages from the Mailbox Transport component, it listens on port 465.

Remember that Transport is stateful.

The destination can be a CAS server or another Transport component (on another MBX server). The Transport component is what does the lookup of the mailbox database.

Last component: Mailbox Transport. This is the component that actually talks to the next layer in the mailbox server. It talks MAPI and receives emails from the Transport component. This is also the component that does the message conversion (TNEF to MIME and vice versa). There's no extensibility at this component, as all of that is at the Transport component. Once a message reaches Mailbox Transport, no further changes happen to it!

[Aside] NameCheap CSR generator

I have previously mentioned the DigiCert CSR utility and how I generate CSRs via it and also export the private key. Today I came across a site from NameCheap that does CSR generation and also key conversion etc. Nice one! Also worth reading this article from them on extracting private keys.

For completeness' sake, here's how to do the same via OpenSSL.
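
The commands from that page aren't reproduced here, but the usual OpenSSL incantations look like this (file names made up):

    # generate a new 2048-bit key and a CSR
    openssl req -new -newkey rsa:2048 -nodes -keyout example.com.key -out example.com.csr
    # extract the private key from a PFX/PKCS#12 bundle
    openssl pkcs12 -in example.com.pfx -nocerts -nodes -out example.com.key.pem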

[Aside] Various DPM 2016 links

Reading up on (and trying to work with) DPM 2016 nowadays, so here are some links to myself before I close them in the browser:

Copy the path and save it in a notepad. It'll look like the following:

E:\ on DPM2016TP5-01.contoso.local C:\Program Files\Microsoft System Center 2016\DPM\DPM\Volumes\Replica\31d8e7d7-8aff-4d54-9a45-a2425986e24c\d6b82768-738a-4f4e-b878-bc34afe189ea\Full\E-Vol\

The first part of the copied string is the source. The second part, separated by a whitespace, is the destination. The destination contains the following information:

DPM Install Folder:  C:\Program Files\[..]\DPM\Volumes\Replica\
Physical ReplicaID:  31d8e7d7-8aff-4d54-9a45-a2425986e24c\
Datasource ID:       d6b82768-738a-4f4e-b878-bc34afe189ea\
Path:                Full\E-Vol\


TIL: Network access: Restrict clients allowed to make remote calls to SAM

Today I learnt of this setting. I was seeing messages like the following on a couple of my servers and read the link:

1 remote calls to the SAM database have been denied in the past 900 seconds throttling window.
For more information please see http://go.microsoft.com/fwlink/?LinkId=787651.

This part gives you a gist of the matter:

The SAMRPC protocol makes it possible for a low privileged user to query a machine on a network for data. For example, a user can use SAMRPC to enumerate users, including privileged accounts such as local or domain administrators, or to enumerate groups and group memberships from the local SAM and Active Directory. This information can provide important context and serve as a starting point for an attacker to compromise a domain or networking environment.

To mitigate this risk, you can configure the Network access: Restrict clients allowed to make remote calls to SAM security policy setting to force the security accounts manager (SAM) to do an access check against remote calls. The access check allows or denies remote RPC connections to SAM and Active Directory for users and groups that you define.

By default, the Network access: Restrict clients allowed to make remote calls to SAM security policy setting is not defined. If you define it, you can edit the default Security Descriptor Definition Language (SDDL) string to explicitly allow or deny users and groups to make remote calls to the SAM. If the policy setting is left blank after the policy is defined, the policy is not enforced.

The default security descriptor on computers beginning with Windows 10 version 1607 and Windows Server 2016 allows only the local (built-in) Administrators group remote access to SAM on non-domain controllers, and allows Everyone access on domain controllers. You can edit the default security descriptor to allow or deny other users and groups, including the built-in Administrators.

So it looks like in my case some remote computer was trying to access this server's SAM database (this is a Server 2016 machine, BTW) and the account making the call wasn't in the local Administrators group of this server.
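
For reference, the policy ends up as an SDDL string in the registry. The Server 2016 default, which allows only the built-in Administrators group (BA), looks like the following; normally you'd set this via Group Policy, so this is just to illustrate:

    New-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Control\Lsa" -Name RestrictRemoteSAM -PropertyType String -Value "O:BAG:BAD:(A;;RC;;;BA)"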

[Aside] Query remote RDP sessions and kill them

If you want to query the remote RDP sessions on a machine:
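
The command didn't make it into this note, but the built-in qwinsta tool does this (server name made up):

    qwinsta /server:SERVER01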

And to disconnect:
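
Again, the original command isn't captured here; to log off a session using the ID that qwinsta reports, it's something like:

    logoff 2 /server:SERVER01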

[Aside] Easily switch between multiple audio outputs using SoundSwitch

Via the always helpful How-To Geek – if you have multiple audio output devices on Windows 10 (e.g. HDMI, regular headphones via the headphone jack, a couple of Bluetooth headphones) like I do, and always right-click the volume icon to change the default device and wish there were an easier & faster way to do this, look no further! Check out SoundSwitch. :) Open source and actively developed too.

TIL: The Exchange PAM can move to another node when the FSW reboots

Didn't know this. The Exchange PAM (Primary Active Manager) can move to another node when the FSW (File Share Witness) reboots or is offline. There's no impact, but it's worth being aware of if you are wondering what could be affected when the FSW server is rebooted.

From this blog post (via this forum post where I found it):

When the File Share Witness host server becomes unavailable, the File Share Witness resource will still fail in cluster and cause the Cluster Core Resources to move between nodes. In this case assuming the File Share Witness host server is still not available, the resource remains in a failed state. If it becomes necessary to utilize the File Share Witness to maintain Quorum, and the witness resource is in a failed state, cluster will attempt to online the witness resource. If the online is successful the witness share is alive and accessible – quorum is maintained. If the online is not successful, the witness share is not alive and accessible – a lost quorum condition is encountered.

Brief background on PAM (via this blog post):

At any given time, in every database availability group (DAG), there is one member that is responsible for the coordination of database actions across the DAG. This member is known as the Primary Active Manager (PAM). The current PAM can be determined by using Get-DatabaseAvailabilityGroup -Status. The cluster group may contain several cluster resources. The PAM does not depend on the state of any of the resources in this group, and the PAM role will always be assigned to the node that owns the Cluster Group. The Cluster Group can be moved between members using the cluster management tools.

Each DAG member that does not own the Cluster Group is a Standby Active Manager (SAM). When the Cluster Group is moved between nodes, a notification process detects that the Cluster Group owner has changed. This triggers detection logic to determine the new PAM. 

Automatic arbitration may occur for a number of reasons including: 

  • The failure of a member
  • The failure of a resource contained within the Cluster Group

In most cases, Exchange administrators should not be concerned with the owner of the Cluster Group or the node designated as the PAM.  This is true even for DAGs that span multiple sites where the PAM may be a node in a distant datacenter.

Some more details regarding PAM  (via this blog post):

Active manager is a role that runs on a mailbox server. A single active manager role “Standalone Active Manager” which runs on a mailbox server that has no high-availability configured. Two active manager roles will be in use when the mailbox server is a member of a DAG; Primary Active Manager (PAM) and Standby Active Manager (SAM). 

PAM is the Active Manager in a DAG that decides which database copies will be active and passive. PAM is responsible for getting topology change notifications and reacting to server failures. 

SAM provides information on which server hosts the active copy of a mailbox database to other components of Exchange that are running an Active Manager client component (for example, RPC Client Access service or Hub Transport server). The SAM detects failures of local databases and the local Information Store. It reacts to failures by asking the PAM to initiate a failover. A SAM does not determine the target of failover, nor does it update a database’s location state in the PAM. SAM runs on all mailbox servers in a DAG except on the one where PAM is running.
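
To tie this back to the cmdlet mentioned above, checking which node currently holds the PAM would be something along these lines (DAG name made up):

    Get-DatabaseAvailabilityGroup DAG1 -Status | Format-List Name, PrimaryActiveManager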

[Aside] Quote

Listening to “The End of the Affair” narrated by the amazing Colin Firth (a pleasure so far to listen to! wow). This sentence caught my attention:

How twisted we humans are, and yet they say a God made us; but I find it hard to conceive of any God who is not as simple as a perfect equation, as clear as air.

[Aside] Random Stuff

Changing the colors in Vim so it looks better in PuTTY. I usually live with the defaults (as I don't spend much time in Linux nowadays), but I Googled today and found an easy fix. Thanks to this post: ":color desert" (where desert is an example color scheme).
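
To make it stick across sessions, the same scheme can go into ~/.vimrc:

    " desert is just an example; pick any scheme that reads well on PuTTY's palette
    colorscheme desert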

Testing SSL in SMTP (thanks to):
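
The exact command from that link isn't reproduced here, but testing STARTTLS on an SMTP server with OpenSSL usually looks like this (hostname made up):

    openssl s_client -connect mail.example.com:25 -starttls smtp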

That link is a good reference on Postfix SSL too.

[Aside] Various ADFS links

No biggie, just as a reference to myself:

Update 16 July 2018: Needed to make a claim rule yesterday that converted the email address from an incoming claim to the Name ID of an outgoing claim. The default GUI-provided rule didn't work, so I made a custom one:
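
It was something along these lines; I'm reconstructing it from memory, so treat it as a sketch rather than the exact rule I used:

    c:[Type == "http://schemas.xmlsoap.org/ws/2005/05/identity/claims/emailaddress"]
     => issue(Type = "http://schemas.xmlsoap.org/ws/2005/05/identity/claims/nameidentifier", Issuer = c.Issuer, OriginalIssuer = c.OriginalIssuer, Value = c.Value, ValueType = c.ValueType, Properties["http://schemas.xmlsoap.org/ws/2005/05/identity/claimproperties/format"] = "urn:oasis:names:tc:SAML:1.1:nameid-format:emailAddress");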

I think I’ll add more such snippets here later.

Btw note to self: custom claim rules are useful if you want to combine multiple incoming claims – i.e. for an AND operation. If you don’t want to combine – i.e. you want to OR multiple claims – just add them as separate rules.

[Aside] Quote

Came across this when listening to “Fahrenheit 451”:

We cannot tell the precise moment when friendship is formed. As in filling a vessel drop by drop, there is at last a drop which makes it run over; so in a series of kindnesses there is at last one which makes the heart run over.

It’s from a book by James Boswell and the full paragraph is worth reading.

[Aside] Various Azure links

My blog posting has taken a turn for the worse. Mainly coz I have been out of the country, and since returning I am busy reading up on Azure monitoring.

Anyways, some quick links to tabs I want to close now but which will be useful for me later –

  • A funny thing with Azure monitoring (OMS/ Log Analytics) is that it can't just do simple WMI queries against your VMs to check if a service is running. Crazy, right! So you have to resort to tricks like monitoring the event logs for status messages. Came across this blog post with a neat idea of using performance counters. I came across that in turn from this blog post that has a different way of using the event logs.
  • We use load balancers in Azure and I was thinking I could tap into their monitoring signals (from the health probes) to know if a particular server/ service is up or down. In a way it doesn't matter if a particular server/ service is down, coz there won't be a user impact thanks to the load balancer, so what I am really interested in knowing is whether a particular monitored entity (from the load balancer's point of view) is down or not. But it turns out the basic load balancer cannot log monitoring signals if it is for internal use only (i.e. doesn't have a public IP). You either need to assign it a public IP or use the newer standard load balancer.
  • Using OMS to monitor and send alert for BSOD.
  • Using OMS to track shutdown events.
  • A bit dated, but using OMS to monitor agent health (has some queries in the older query language).
  • A useful list of log analytics query syntax (it’s a translation from old to new style queries actually but I found it a good reference)

Now for some non-Azure stuff which I am too lazy to put in a separate blog post:

  • A blog post on the difference between application consistent and crash consistent backups.
  • At work we noticed that ADFS seemed to break for our Windows 10 machines. I am not too clear on the details as it seemed to break with just one application (ZScaler). By way of fixing it we came across this forum post which described the same symptoms as ours, and the fix suggested there (Set-ADFSProperties -IgnoreTokenBinding $True) did the trick for us. So what is this token binding thing?
    • Token Binding seems to be like cookies for HTTPS. I found this presentation to be a good explanation of it. Basically token binding binds your security token (like cookies or ADFS tokens) to the TLS session you have with a server, such that if anyone were to get hold of your cookie and try to use it in another session it will fail. Your tokens are bound to that TLS session only. I also found this medium post to be a good techie explanation of it (but I didn’t read it properly*). 
    • It seems to be enabled on the client side from Windows 10 1511 and upwards.
    • I saw the same recommendation in these Microsoft Docs on setting up Azure stack.

Some excerpts from the medium post (but please go and read the full one to get a proper understanding). The excerpt is mostly for my reference:

Most of the OAuth 2.0 deployments do rely upon bearer tokens. A bearer token is like ‘cash’. If I steal 10 bucks from you, I can use it at a Starbucks to buy a cup of coffee — no questions asked. I do not want to prove that I own the ten dollar note.

OAuth 2.0 recommends using TLS (Transport Layer Security) for all the interactions between the client, authorization server and resource server. This makes the OAuth 2.0 model quite simple with no complex cryptography involved — but at the same time it carries all the risks associated with a bearer token. There is no second level of defense.

OAuth 2.0 token binding proposal cryptographically binds security tokens to the TLS layer, preventing token export and replay attacks. It relies on TLS — but since it binds the tokens to the TLS connection itself, anyone who steals a token cannot use it over a different channel.

Lastly, I came across this awesome blog post (which too I didn't read properly* – sorry to myself!) but which I liked a lot, so here's a link for my future self – principles of token validation.


* I didn't read these posts properly coz I was in a "troubleshooting mode" trying to find out why ADFS broke with token binding. If I took more time to read them I know I'd get sidetracked. I still don't know why ADFS broke, but I have an idea.

[Aside] Quote from Mythos

Listening to Stephen Fry’s Mythos and I loved this epitaph from one of the stories. That of Phaëthon, son of Phoebus Apollo the sun God, who rode his father’s sun chariot for a day but lost control and ended up scorching Africa in the process (thus creating the Sahara desert). This epitaph was offered by the American classicist Edith Hamilton.

Here Phaëthon lies who in the sun-god's chariot fared.
And though greatly he failed, more greatly he dared.

[Aside] Sherlock Holmes quote

Was listening to “The Mystery of the Cardboard Box” in the bus after a particularly shitty day and loved this ending paragraph. It resonated with me.

“What is the meaning of it, Watson?” said Holmes solemnly as he laid down the paper. “What object is served by this circle of misery and violence and fear? It must tend to some end, or else our universe is ruled by chance, which is unthinkable. But what end? There is the great standing perennial problem to which human reason is as far from an answer as ever.”