Creative Commons Attribution 4.0 International License
© Rakhesh Sasidharan

What is esx.problem.hyperthreading.unmitigated?

Upgraded one of our ESXi hosts with the latest patches released today that are aimed at fixing the L1 Terminal Fault issues. After that the host started giving this warning: esx.problem.hyperthreading.unmitigated. No idea what it’s supposed to mean!

Went to Configure > Settings > Advanced System Settings and searched for anything with “hyperthread” in it. Found VMkernel.Boot.hyperthreadingMitigation, which was set to “false” but sounded suspiciously similar to the warning I had. Changed it to “true”, rebooted the host, and Googled on this setting to come across this KB article. It’s a good read but here’s some excerpts if you are interested in only the highlights:

Like Meltdown, Rogue System Register Read, and “Lazy FP state restore”, the “L1 Terminal Fault” vulnerability can occur when affected Intel microprocessors speculate beyond an unpermitted data access. By continuing the speculation in these cases, the affected Intel microprocessors expose a new side-channel for attack. (Note, however, that architectural correctness is still provided as the speculative operations will be later nullified at instruction retirement.)

CVE-2018-3646 is one of these Intel microprocessor vulnerabilities and impacts hypervisors. It may allow a malicious VM running on a given CPU core to effectively infer contents of the hypervisor’s or another VM’s privileged information residing at the same time in the same core’s L1 Data cache. Because current Intel processors share the physically-addressed L1 Data Cache across both logical processors of a Hyperthreading (HT) enabled core, indiscriminate simultaneous scheduling of software threads on both logical processors creates the potential for further information leakage. CVE-2018-3646 has two currently known attack vectors which will be referred to here as “Sequential-Context” and “Concurrent-Context.” Both attack vectors must be addressed to mitigate CVE-2018-3646.

Attack Vector Summary

  • Sequential-context attack vector: a malicious VM can potentially infer recently accessed L1 data of a previous context (hypervisor thread or other VM thread) on either logical processor of a processor core.
  • Concurrent-context attack vector: a malicious VM can potentially infer recently accessed L1 data of a concurrently executing context (hypervisor thread or other VM thread) on the other logical processor of the hyperthreading-enabled processor core.

Mitigation Summary

  • Mitigation of the Sequential-Context attack vector is achieved by vSphere updates and patches. This mitigation is enabled by default and does not impose a significant performance impact. Please see resolution section for details.
  • Mitigation of the Concurrent-context attack vector requires enablement of a new feature known as the ESXi Side-Channel-Aware Scheduler. The initial version of this feature will only schedule the hypervisor and VMs on one logical processor of an Intel Hyperthreading-enabled core. This feature may impose a non-trivial performance impact and is not enabled by default.

So that’s what the warning was about. To enable the ESXi Side Channel Aware scheduler we need to set the key above to “true”. More excerpts:

The Concurrent-context attack vector is mitigated through enablement of the ESXi Side-Channel-Aware Scheduler which is included in the updates and patches listed in VMSA-2018-0020. This scheduler is not enabled by default. Enablement of this scheduler may impose a non-trivial performance impact on applications running in a vSphere environment. The goal of the Planning Phase is to understand if your current environment has sufficient CPU capacity to enable the scheduler without operational impact.

The following list summarizes potential problem areas after enabling the ESXi Side-Channel-Aware Scheduler:

  • VMs configured with vCPUs greater than the physical cores available on the ESXi host
  • VMs configured with custom affinity or NUMA settings
  • VMs with latency-sensitive configuration
  • ESXi hosts with Average CPU Usage greater than 70%
  • Hosts with custom CPU resource management options enabled
  • HA Clusters where a rolling upgrade will increase Average CPU Usage above 100%

Note: It may be necessary to acquire additional hardware, or rebalance existing workloads, before enablement of the ESXi Side-Channel-Aware Scheduler. Organizations can choose not to enable the ESXi Side-Channel-Aware Scheduler after performing a risk assessment and accepting the risk posed by the Concurrent-context attack vector. This is NOT RECOMMENDED and VMware cannot make this decision on behalf of an organization.

So to fix the second issue we need to enable the new scheduler. That can have a performance hit, so best to enable it manually so you are aware and can keep an eye on the load and performance hits. Also, if you are not in a shared environment and don’t care, you don’t need to enable it either. Makes sense.
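
To get a rough feel for whether a host falls into the problem areas from the KB excerpt above, something like the following could help. It is only a sketch: the Host class, its fields, and the sample numbers are all made up for illustration, and it only covers two of the listed items (VMs with more vCPUs than physical cores, and average CPU usage above 70%). It does not talk to vCenter or ESXi at all.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Host:
    """Hypothetical inventory record; not a VMware API object."""
    name: str
    physical_cores: int        # physical cores, not hyperthreads
    avg_cpu_usage_pct: float   # average CPU usage, 0-100
    vm_vcpus: List[int]        # vCPU count of each VM on the host


def scheduler_risks(host: Host, usage_threshold_pct: float = 70.0) -> List[str]:
    """Return warnings for two of the problem areas listed in the KB."""
    warnings = []
    # VMs configured with more vCPUs than the host has physical cores
    for vcpus in host.vm_vcpus:
        if vcpus > host.physical_cores:
            warnings.append(
                f"{host.name}: VM with {vcpus} vCPUs exceeds "
                f"{host.physical_cores} physical cores"
            )
    # Hosts whose average CPU usage is above the threshold (70% per the KB)
    if host.avg_cpu_usage_pct > usage_threshold_pct:
        warnings.append(
            f"{host.name}: average CPU usage {host.avg_cpu_usage_pct}% "
            f"is above {usage_threshold_pct}%"
        )
    return warnings


if __name__ == "__main__":
    for line in scheduler_risks(Host("esx01", 8, 82.5, [4, 16])):
        print(line)
```

The other items in the list (affinity, NUMA, latency sensitivity, HA headroom) need actual configuration data, so I have left them out rather than guess at how to detect them.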

That warning message could have been a bit more verbose though! :)

Interesting podcast episodes

Quick shoutout to some interesting podcast episodes I listened to lately. Sorry, they are Overcast links rather than links to the podcast sites. I am being lazy here.

  • The Tradeoffs of Information Hiding in the Control Plane – this one’s from the Packet Pushers network and while the title sounds very techie it is actually a discussion about a book written by the podcast host and the person he is talking to. The book seems interesting, I must buy it sometime to read (or at least add to my library).
  • Episode 221 of The Committed podcast – again an interview, with the author of a productivity book. It’s less of an interview (as both podcasts are) and more of a discussion. Both host and author share a lot of their workflow and apps they use. The apps are mostly Mac or iOS based but it’s a good listen.
  • Episode 222 of The Committed podcast – listening to this currently. I liked the discussion. It’s about books and reading and I resonated with a lot of the discussion. Especially a bit where one of the hosts mentions that he has cut down on his audiobook and podcast listening recently as they were taking up all his time, and started listening to more music. Same here. In my case audiobooks were taking up all my ear time so I have cut them down over the month to listen to more podcasts and also a lot more music than I usually do. Hope that pattern sticks! It’s difficult because my huge Audible library of unheard books make me feel guilty and so I tend to subconsciously prioritize audiobooks unless I actively counter this tendency. :)

New MacBook Air

So I finally dipped my feet into the Mac ecosystem and bought myself a MacBook Air. Yes, I know it’s 3 years old but what the heck – it was the cheapest Mac I could buy! Went for the 8GB/ 256GB i5 version as that’s the one I found on a deal with our local online shopping provider. Might have gone with a different spec if I decided to go with the version available officially with Apple but a) that had a UK English keyboard and b) the same model there was about 33% more expensive so if I were to get a better spec’d one I’d be spending a lot lot more (bringing the costs up to the MacBook range). 

One thing about MacBook purchases though – it isn’t easy. I mean, with an iPhone you only have to choose the color & size, and then pick the capacity you want. But with Macs I have to worry about size, CPU (i5 or i7), RAM, and storage; and each choice ups the price by so much! And more than the price the choices just exhaust. It’s the paradox of choice concept (I’ve read the book) and the feeling is similar to Windows laptops where there’s so many choices and you just get bogged down trying to pick what you want and eventually let go of the idea itself. Which is what I had done here (let go of the idea) until my wife suggested this MacBook Air model that was on a deal and I thought what the heck and just purchased it. My focus here is to get something that will get me a toehold in the Mac ecosystem and probably settling on price as the criterion rather than anything else was what was needed.

Oh, and the MacBook Air is the only one with a decent set of ports. Yikes! All the other MacBooks have just USB-C ports so there’s the additional cost of dongles and the hassle of having to carry them around. If it wasn’t for the dongles and the fact that the MacBook has a 2nd generation butterfly keyboard which is known for problems (which is fixed in the MacBook Pro’s 3rd generation keyboard) I might have gone for the MacBook. It has more colors too. 

Anyways, back to the MacBook Air. I’ve had it for less than a day now so these are just initial thoughts. 

  • I love the keyboard and size. There’s a lot of room for the hand, and the keys feel good to type on. It’s a very “lapable” laptop. 
  • I thought I’d be put off by the 1440×900 screen as I am so used to full HD nowadays and when I had recently tried using a 1440×900 external monitor I didn’t like it at all, but no I don’t mind this screen. Yes I notice the difference but I don’t mind it. 
  • I like the feel of the OS. I had various people tell me it is complicated and unintuitive etc. but I don’t see that. I love the two finger way of scrolling up and down pages and going back and forward, and the three finger way of moving across apps. That feels very intuitive and much better than having a touch screen. There’s a lot more gestures but I am yet to get the hang of that. I tried to memorize those initially but then figured I’ll pick them up as I go. I think I know the main ones that I am interested in at least. 
  • It’s a jarring experience going to the App Store and seeing all the prices! Boy. It’s like the pre-iPhone days when software used to be expensive. Pretty much everything is US$10 and above, and if something is free it is bound to have an in-app purchase. Even the same app which for iPhone & iPad is (say) US$5 would be US$50 or above here! I imagine it is because the code base is different and so there’s more effort? I don’t know. That’s something I am having trouble getting my head around. The Windows OS store apps are much cheaper (but yeah there aren’t many). Anyways, the App Store is like a trip back in time to expensive software. I don’t think I’ll be buying many apps. Or I hope I won’t be buying many apps – it is not a sustainable option.
  • The laptop came with MacOS High Sierra 10.13.1 and I couldn’t update to the latest 10.13.6 via the App Store. I downloaded it and tried to upgrade manually, but that failed saying the volume doesn’t meet some pre-requisites. I downloaded 10.13.2 and 10.13.3 and was able to upgrade to them manually, but 10.13.4 fails with the same error. That’s when I came across the macOS Recovery options, especially the Internet Recovery option which you get to by pressing Option-Command (⌘)-R (instead of just Command (⌘)-R for regular recovery). Internet Recovery actually connects to the Internet (it prompts you for Wi-Fi details etc) and can download the latest version and do a fresh install. When I tried this it complained my disk was still being encrypted and so it cannot upgrade. Am guessing that is why the update previously failed so I’ll wait for the encryption to finish and try again. That is so cool though, being able to connect to the Internet and do a recovery! Windows recovery options are nothing compared to this. Even the Recovery screen has a good GUI etc. (of course, that’s easy for Apple to do as it controls the hardware; versus Microsoft which can’t cater for every single display where Windows might be installed on). 
    • Update: After encryption completed I was able to install 10.13.4 successfully. I tried to jump to 10.13.6 directly but that failed. I realized that these updates are deltas so I’ll just have to install 10.13.5 and then 10.13.6. Tried that and now my system is finally up to date. Yay! Pity MacOS doesn’t do cumulative updates.
  • What else? The Finder is good, the uniform way in which each app shows a menubar where you can go and find its options etc. is good. I love the UI as expected for its consistency and sleekness. I also loved how I could just click on the Apple icon and go to “About this Mac” to quickly find its OS version, free storage etc. I don’t know why I liked that, but I found it incredibly thoughtful of Apple to present this information via this option. 
  • There’s still (obviously) a lot to pick up. Keyboard shortcuts and gestures etc. 
  • Oh, forgot. Installing apps from outside the App Store is cute in the way you download the DMG file and then (in most cases) just drag and drop the application to the Applications folder. I remember reading somewhere that in the Mac each application is sandboxed to its own hierarchy or something so it’s not like Windows or Linux where everything just writes to a common place and there’s dependencies and DLL hell etc. 
  • I love how the MacOS restores all my previously open apps after a reboot/ shutdown. It’s just the other day I was wishing Windows could do something similar (my laptop crashed and I had to restore all my windows) and it was pleasant to see the MacOS do exactly this whenever I’d reboot. Such a user friendly and useful thing to do!

More later!

… forcefulness (personality) of the magician’s character

A paragraph from “Jonathan Strange and Mr. Norrell”, which I am still reading.

“But in the end,” added Dr John, “it is by the imposition of his will upon his patient that the doctor effects his cure. It is the forcefulness of the doctor’s own character which determines his success or failure. It was observed by many people that our father could subdue lunatics merely by fixing them with his eye.”

“Really?” said Strange, becoming interested in spite of himself. “I had never thought of it before, but something of the sort is certainly true of magic. There are all sorts of occasions when the success of a piece of magic depends upon the forcefulness of the magician’s character.”

So true!

[Aside] OS/2 Museum

Oh, this is lovely. This OS/2 Museum blog. Such a trip down memory lane! :)

I came across the blog via a post from it (“How fast is a PS/2 keyboard“). OS/2 is an OS I wanted to try when I was a kid but never got a chance. Just seeing the floppy disk image in the blog header makes me smile with nostalgia!

DNS SRV records used by AD

Just thought I’d put these here for my own easy reference. I keep forgetting these records and when there’s an issue I end up Googling and trying to find them! These are DNS records you can query to see if clients are able to look up the PDC, GC, KDC, and DC of the domain you specify via DNS. If this is broken nothing else will work. :)

PDC _ldap._tcp.pdc._msdcs.<DnsDomainName>
GC _ldap._tcp.gc._msdcs.<DnsDomainName>
KDC _kerberos._tcp.dc._msdcs.<DnsDomainName>
DC _ldap._tcp.dc._msdcs.<DnsDomainName>

You would look this up using nslookup -type=SRV <Record>.

As a refresher, SRV records are of the form _Service._Proto.Name TTL Class SRV Priority Weight Port Target. The _Service._Proto.Name is what we are looking up above, just that our name space is _msdcs.<DnsDomainName>.
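
Since I keep forgetting these, here is a small sketch that spits out the four record names (and the matching nslookup commands) for a given domain. The function names and the example.com domain are just placeholders I made up; the record prefixes are the ones listed above. It only builds strings and does not do any DNS queries itself.

```python
# Record prefixes from the list above, keyed by the role they locate.
SRV_PREFIXES = {
    "PDC": "_ldap._tcp.pdc._msdcs",
    "GC": "_ldap._tcp.gc._msdcs",
    "KDC": "_kerberos._tcp.dc._msdcs",
    "DC": "_ldap._tcp.dc._msdcs",
}


def ad_srv_names(dns_domain):
    """Return {role: SRV record name} for the given AD DNS domain."""
    domain = dns_domain.strip().strip(".")  # tolerate a trailing dot
    return {role: f"{prefix}.{domain}" for role, prefix in SRV_PREFIXES.items()}


def nslookup_commands(dns_domain):
    """Return the nslookup command line for each record, in the order above."""
    return [f"nslookup -type=SRV {name}" for name in ad_srv_names(dns_domain).values()]


if __name__ == "__main__":
    # example.com is a placeholder; substitute your own <DnsDomainName>
    for cmd in nslookup_commands("example.com"):
        print(cmd)
```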

Vocal Harmonizing

A few days ago I was listening to “Agar Tum Saath Ho” from the excellent movie “Tamashaa” and noticed for the first time (yeah after nearly 2-3 years of regularly listening to that song coz it is one of my favorites!) that Arijith has someone else singing along with him in the background. I had previously seen A.R. Rahman employ this in other favorites of mine like “Piya Haji Ali” (from the otherwise unremarkable “Fiza”) and also “Noon-Un-Ala-Noor” (from the artsy-but-worth-a-watch “Meenaxi”). But in both these cases I knew who the background singer was – it was obvious from the artists section of the song. But with “Agar Tum Saath Ho” I never noticed this other singer until a few days ago when I kind of slept in my bus ride home listening to this song on loop, and I think my mind just relaxed and stopped thinking other stuff … it just soaked in the song, was in the moment so to say, and I heard the other singer as obvious as anything else.

Turns out this other singer was Arijith himself, but in a different pitch (thanks Quora) and this technique is called vocal harmonizing. Nice, I didn’t know of this.

While typing this post I was listening to “Aanandhame” from the movie “Aravindante Athithikal” (which I previously mentioned, I love its songs) and noticed that it too employs something similar. While Anne Amie is the primary voice, you can also hear Vineeth Sreenivasan lightly in the background singing along with her. Adds a lot to the feel of the song.

Speaking of “Aravindante Athithikal”, a lovely first half but a very draggy second half. Wish the movie had just stuck with the theme of the first half or concluded there if it had nothing more to say. The second half would even have been fine if it didn’t drag so much towards the end about finding the mother!

New AirPods

So I finally purchased a pair of Apple AirPods. There was a deal going on and I got a good additional 20% as there was an offer on my credit card.

  • I can’t control the volume with it (except using Siri).
  • I have to choose between whether I want to be able to pause the music via double tap or go to previous or next tracks. I can customize the double tap on either side AirPod so I only have two choices really.
  • Good thing though is that I can pause by removing either of the AirPods.
  • The fit is good too. I expected it to fall out as Apple EarPods have never fit me; but no, this one stays. Good job!
  • Audio quality is ok as expected. No large sound stage. No bass (I don’t mind that). Good for podcasts and audiobooks which is my use case.
  • The lack of much controls customization irks me though. No other vendor would have been able to get away with that in my opinion.
  • Update after using it for a day: I love the fact that I can use just one AirPod at a time. That’s super handy. That alone plus the small size and that it’s light and that it fits in my ear and I barely notice it makes it a very useful gadget.

Lovin’ iPhone portrait mode

I started using a work provided iPhone 8 recently, side by side with my personal iPhone 7 Plus. I opted for a golden iPhone 8 and I love that look on the glass back. In terms of prettiness I so much prefer the iPhone 8 to the 7 Plus. It’s much less of a fingerprint magnet too. I think it’s my fingers – they sweat – so the back of the iPhone 7 Plus gets all sweaty after a while of use. But no such issues with the iPhone 8.

I don’t think I’ll ever buy a personal non Plus size phone though. I don’t use the portrait mode much but I miss the dual lens on the Plus when I take pics with the iPhone 8. And the size of the Plus is convenient for typing and watching movies. I notice that I tend to use the iPhone 8 more as a phone or for checking work emails or browsing something quickly, but long term I prefer the iPhone 7 Plus for the size.

Here’s a nice (personal opinion!) pic I took with the iPhone 7 Plus in portrait mode now. That’s what prompted this post.

In other news I have purchased a TORRO case for the iPhone 7 Plus. They look so good! It was an impulse purchase and I hope to get it tomorrow.

Disabling Exchange 2013 Managed Availability monitors

Check out this blog post from Microsoft first. Mine’s mostly based on that but tailored to my specific situation.

We have a CAS server that’s purely for internal admin ECP functions. Managed Availability was running some ActiveSync tests on it and failing (because they don’t exist) with errors like these in SCOM:

So my mission, which I’ve chosen to accept (I saw “Mission Impossible: Fallout” this weekend!) is to disable this. :)

Managed Availability has the concept of health sets. From this page which lists all the health sets in Exchange 2013:

In Managed Availability, each component in Exchange 2013 monitors itself using probes, monitors and responders. Each Exchange 2013 component that implements Managed Availability is referred to as a health set.

So what are the unhealthy health sets on my server?

The result of this in my case are the following:

So how do I find which monitors in these health sets are failing? The following cmdlet can help:

I’ll just pipe both cmdlets to get a list of monitors across all unhealthy health sets:

In my case I get:

Should probably have filtered to just the unhealthy monitors:

Anyways, the SCOM error referred to an EAS component, but I don’t see anything with that name. ActiveSyncProxy is probably the one it was referring to?

As an aside, if I want to see the components of a health set (i.e. the monitors, probes, responders) I can do the following:

In the case of the ActiveSync.Proxy health set (which has the ActiveSyncProxy component) I can see:

Note that the ActiveSyncProxyTestMonitor monitor is what was showing as unhealthy earlier.

To disable a monitor I need to use the Add-ServerMonitoringOverride cmdlet. This is of the format:

In my case, to disable ActiveSync.Proxy (health set) ActiveSyncProxyTestMonitor (monitoring item – you can see this in the list of unhealthy monitors as well as in the list above) I do:

That’s it. Wait a while and now it will appear as disabled.

Next thing is how do I find out why the ActiveSync health set is unhealthy? Let’s take a look at the probes in that:

I can invoke the probe manually using the following cmdlet:

Pipe this out as a list to read better. Here’s what I did:

The output gives the errors encountered by the tests. I could see that it was related to EAS so decided to disable it too.

Lastly, if you are curious as to what overrides exist the following cmdlet will help:

Also, if you want to double check that a particular component on the Exchange server is inactive (and that’s why monitors are failing) the following cmdlet will help (I sort it by state for easy reading but that’s optional):

The last section of the article I referred to at the beginning of this post is on editing C:\Program Files\Microsoft\Exchange Server\V15\Bin\Monitoring\Config\ClientAccessProxyTest.xml to disable certain probes. Not sure why they suggest that instead of disabling probes via the cmdlet – I think that’s because the cmdlet way is more of a temporary thing (for a certain duration) while modifying the config file is a permanent fix. I should probably do the config file in my case.

[Aside] Printer Objects in AD

I knew printer objects were present in AD but had no idea where to go look for them. Today I had a need to, and this post helped.

Reading & Listening Updates

  • Started reading (on my Kindle) “Jonathan Strange & Mr. Norrell” by Susanna Clarke. Wow, never imagined I’d read a book like this and love it. I am hooked to the olden English used by the author and the way she writes – the long descriptions, details, foot notes, etc. Reads like a children’s novel from a long ago age. I am about 25% done. Looking forward to finishing it.
  • Going to start listening to, in Audible, “The Dead Mountaineer’s Inn: One More Last Rite for the Detective Genre” by the Strugatsky brothers. I listened to the introduction. Sounds like an interesting book, fingers crossed.
  • Listened to this episode of the Vector podcast. It’s an interview by Rene Ritchie of Ashraf Eassa who is an expert in CPUs, and is a good listen.
  • Speaking of podcasts, I came across (and loved the first episode of) a new podcast from Microsoft. It’s called Behind the Tech with Kevin Scott. Going to listen to the second episode next.

Life and all that…

I went traveling to a different country recently. The day after reaching there I realized that I have loose motions. Not sure what I ate that disagreed with me as I was fine all night long and only had this sudden urge to rush to the toilet an hour after I woke up (during which I ate nothing). Coincidentally, on the day that I was leaving for this place I had also packed my Imodium tablets (they put an end to loose motions, they are amazing!). I hadn’t packed them for the trip but when I woke up that day I had a headache and so decided to pack some Panadol just in case. Looking for extra supplies in my medicine drawer I chanced upon a pack of Imodium which was expiring this month and so took it along just in case. And the very next day they turned out to be useful. “How lucky!” my mind thought.

But is that luck? Wouldn’t luck have been not having loose motions in the first place? I am “lucky” in that I generally don’t get loose motions when traveling (in fact the last time I remember getting loose motions on a trip was maybe 4-5 years ago) so considering I had to get loose motions, yes I was lucky that events conspired such that I took the tablets along; but I would have been luckier if I just didn’t get loose motions at all.

I suppose I should be thanking God for the stroke of luck. But then again couldn’t God just have prevented me from getting loose motions (either through a more resilient stomach or just pointing me away from whatever food triggered the loose motions). If God is all seeing and thus all this is predestined then I was meant to get loose motions and also meant to carry tablets along – so this is all just pointless no?

I don’t get a God who is all seeing and does pointless things just for self-amusement or something. A better explanation is that there is no predestination and God isn’t all seeing. Things just happen but God is someone who is more aware of the probabilities and multiple futures. I too as a Human can calculate these but I am nothing compared to God who is able to calculate things infinitely better and with more nuance perhaps. He can then nudge me along such that I am better prepared for things. This is the God in M. Night Shyamalan’s “Signs”. The God who arranged for things such that the family was able to kill the aliens and the preacher’s brother narrowly escaped having his kiss messed up forever by the girl puking into his mouth.

So what makes things happen? I dunno. Random chance I guess. Is God able to influence things? I dunno. If he can influence things directly – like send a thought to my head to take medicines or avoid a certain food – then it doesn’t make much sense as he can avoid a lot of the drama of life by just making sure certain things don’t happen. So direct influencing doesn’t make sense to me, it has to be indirect. He can arrange for things to sway the probabilities this way or the other perhaps, but what eventually happens is not His direct doing. Thus He can arrange for me to wake up with a headache (make sure I have a bad sleep and I will wake up thus) then arrange for things to be such that I open my medicine cabinet to look for headache medicines and see the stomach upset medicines; but whether I actually take the medicines along or not depends on me. That bit is my decision, he can only arrange things so I have a decision to make.

Thus God is not responsible for my actions or those of others. But He can influence actions – both mine and others. Because I wake up with a headache chances are I might go pick up a fight with someone or be unnecessarily rude, but that choice still rests with me and what I do is my responsibility. The cards are stacked against me but it is my free will to act out. A lot of times I will fall into His trap and act poorly, a few times I will act better. This too has His influence written all over it because the person I am today is a result of my past and the events there, and if He so influenced my past events to be one which has left me full of negative thoughts and a depressed nature chances are I will react very poorly to the events in my life (making my future prospects poor in turn). He has slowly nudged the probabilities to be better able to nudge me the way He wants. I was probably a clean slate when born (sort of, coz my environment and parents etc too matter of course) but over time He is able to influence my actions more.

One could split God into the God and Devil I guess. One tries to do good by you, the other bad. Or there could even be a host of Beings I guess. I mean who knows. Even if there’s just one or a whole lot, the question is why should any one or more of these care for me. Why should they try to do good by me (take medicines for instance) or bad (eat food that causes a stomach upset). What’s their stake in it? Is it because I pray and so God wants to do good by me and the Devil wants to hurt me; or am I just a pawn in a game between them where things have to happen for the sake of the drama? I dunno. I don’t like to think in terms of black and white so this concept of God and Devil sounds rubbish to me. I prefer to think of things this way: one, there’s Life which is the random happening of events; two, there’s Beings that can influence things one way or the other because they can see more of the big picture in space and time; and three, there’s us Humans (and other Animals?) who can make individual choices which may not necessarily be easy to make because the Beings I mentioned could influence our decisions but we nevertheless have free will and so the decisions are ours in the end and these in turn feed into Life and affect others and have an interplay with everything. Life + Beings is what one would refer to as the Tao I guess.

Why do these Beings do what they do? Are they just impartial beings or themselves influenced by other things? Maybe they are influenced by Humans too via their deeds and prayers (or lack thereof)? Maybe Beings influence each other, maybe the random events of Life influence these Beings too? My “guardian angel” Being (for lack of a better term) had some interest in ensuring I don’t have a shitty trip (literally haha!) so ensured I took tablets along. Other Beings have negative interests towards me for whatever reasons so They ensure a lot of other things don’t go well with me. Or maybe it’s the same Being who both didn’t want to spoil my trip but otherwise has it out for me in certain matters so spoils them and helps out in other matters however He can. Who knows.

Part of me knows that this is me anthropomorphizing things. Things happened, I was lucky to take tablets along, now I am trying to find a reason to explain things. Big deal! But I don’t think so. I don’t feel life is entirely random. A lot of times things seem to have a pattern. It’s like a fractal. Seems very complex and varied but there’s a kernel of a pattern which influences the overall structure. I find life to be like that. It’s a mix of random (the capital L Life) and some non-random ordering (God; Beings) working together. A sensible philosophy for mental peace would be to accept things and find understanding but it doesn’t work that way. It is irritating when things don’t work out even though you may put in a lot of effort, or things always seem to go a certain way as if there’s some bad luck or jinxing involved. What does one do here? Keep trying? Work harder? Pray!? :) I don’t know. I don’t think praying or pointlessly trying is an answer. But I do feel that one must try as much as one can, without getting frustrated. Try because that’s in one’s nature, but couple it with an understanding perhaps that there is a non-random element at work too which for whatever reason nudges things around so you may not always get what you want (and will sometimes get things when you least expect it). If there are some non-random sequences of events which work to ensure that you are never stuck without loose motion pills in case you have to get loose motions during a trip, then there are also non-random sequences of events which will work to ensure the probabilities are always stacked against you in certain aspects of life and how much ever you try things will always seem to work against you. There’s nothing you can do in the latter, but that doesn’t mean you give up. The same way these Beings can’t make you win by taking tablets, They can’t make you fail either. The only time you really fail is when you fall down and stay down (I am paraphrasing this from something I read). As long as you get up, even though your legs might be broken, you haven’t decided to fail. At the end you might have failed because the cards were stacked against you but you (the capital H Human in this game of life) haven’t given up.

Before wrapping up, something I want to add as a reminder to my future self reading this. If things have conspired to make it a bad day and are nudging you to make bad decisions, remember the final decision is still yours. It is hard to resist because of all the environmental factors, but remember you have a choice and try to exert it.

Ps. Typing this post on my iPhone from the beautiful (but hot, wrong time to visit!) Armenia. Maybe all these thoughts are thanks to the monastery visits or the long hours spent half asleep half dreaming in the car rides to these monasteries. I didn’t know Armenia had such a rich heritage until I visited the place. I knew of Armenia somewhere in the back of my mind but didn’t realize how old and historically rich it was.

[Aside] Under the Hood with DAGs

Watching this Ignite 2015 video: Under the Hood with DAGs, by Tim McMichael.

Adding some links here to supplement the video:

  • Tuning Failover Cluster Network thresholds – useful when you have stretched DAGs
  • The mystery of the 9223372036854775766 copy queue… – never had this one but good info.
    • Basically, the cluster registry keeps track of the last log number per database and also the timestamp. When a node wants to see its copy queue length (i.e. how far behind it is in processing the logs) it can compare this log number with the log number it has actually processed. Sometimes, however, a node might have issues updating or reading the cluster registry and so falls behind in receiving updates. In such cases the last log number will match what it has processed, but this is actually outdated info, and so if the Exchange Replication service on the server hosting the passive copy notices that the timestamp is 12 minutes old it puts its database copy into self-protection mode. This is done by manually setting the copy queue length (a.k.a. CQL) to roughly 9 quintillion (9223372036854775766, just short of the maximum a signed 64-bit integer can hold). No one can actually have such a large copy queue length so it’s a good number to choose.
    • The video suggests rebooting each node until you find one which might be holding updates. But the link above suggests a different method.
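To make the self-protection idea concrete for myself, here is a rough Python sketch of the logic as I understand it from the description above. It is not Exchange code: the function name, its parameters, and the exact comparison (I assume "12 minutes or older") are all made up for illustration. The only real values are the 12 minute threshold and the number from the post's title.

```python
from datetime import datetime, timedelta

# Value reported as the copy queue length in self-protection mode.
# Taken from the title of the linked post; it sits just below the
# signed 64-bit maximum (2**63 - 1).
SELF_PROTECTION_CQL = 9223372036854775766

# Staleness threshold mentioned above. Whether the real check is
# "at least" or "more than" 12 minutes is an assumption on my part.
STALE_AFTER = timedelta(minutes=12)


def copy_queue_length(last_log_generated, last_log_processed, last_update, now):
    """Return the copy queue length a passive copy would report (illustrative)."""
    if now - last_update >= STALE_AFTER:
        # The registry data is too old to trust, so report the sentinel
        # instead of a misleadingly small (or zero) queue length.
        return SELF_PROTECTION_CQL
    return max(last_log_generated - last_log_processed, 0)


now = datetime(2018, 8, 1, 12, 0)
fresh = copy_queue_length(1050, 1000, now - timedelta(minutes=1), now)
stale = copy_queue_length(1000, 1000, now - timedelta(minutes=15), now)

print(fresh)                               # 50
print(stale)                               # 9223372036854775766
print(2**63 - 1 - SELF_PROTECTION_CQL)     # 41
```

The second call is the interesting one: the log numbers match, so the copy looks fully caught up, but the timestamp is stale and the sentinel is reported instead.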

DAC

Came across some Datacenter Activation Coordination (DAC) info on Tim’s blog: part 1, followed by a series of posts you can see at the end of part 1.

DAC mode works by using a bit stored in memory by Active Manager called the Datacenter Activation Coordination Protocol (DACP). DACP is simply a bit in memory set to either a 1 or a 0. A value of 1 means Active Manager can issue mount requests, and a value of 0 means it cannot.

The starting bit is always 0, and because the bit is held in memory, any time the Microsoft Exchange Replication service (MSExchangeRepl.exe) is stopped and restarted, the bit reverts to 0. In order to change its DACP bit to 1 and be able to mount databases, a starting DAG member needs to either:

  • Be able to communicate with any other DAG member that has a DACP bit set to 1; or
  • Be able to communicate with all DAG members that are listed on the StartedMailboxServers list.

The bit I italicized is important. If you read his blog post you’ll see why. If DAC is activated and you are starting up a previously shut down DAG, even though the DAG might have quorum it will not start up if some members are still offline. (I had missed that when reading about DAC earlier.) To summarize it succinctly from part 2 of his series:

Remember, with DAC mode enabled, different rules apply for mounting databases on startup. The starting DAG member must be able to participate in a cluster that has quorum, and it must be able to communicate with another DAG member that has a DACP value of 1 or be able to communicate with all DAG members listed on the StartedMailboxServers list.
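The mounting rule quoted above can be written down as a small Python sketch. This is only my reading of the rule, not how Active Manager is implemented; the function names, the server names, and the idea of passing in a set of reachable members are assumptions for illustration.

```python
def can_set_dacp_bit(me, reachable, dacp_bits, started_mailbox_servers):
    """Decide whether a starting DAG member may flip its DACP bit to 1.

    me: name of the starting member (hypothetical names throughout)
    reachable: set of other members this node can communicate with
    dacp_bits: dict of member name -> DACP bit (0 or 1) held in memory
    started_mailbox_servers: the StartedMailboxServers list
    """
    # Rule 1: any reachable member already has its bit set to 1.
    sees_a_one = any(dacp_bits.get(m, 0) == 1 for m in reachable)
    # Rule 2: every member on the StartedMailboxServers list is reachable.
    others = [m for m in started_mailbox_servers if m != me]
    sees_all_started = all(m in reachable for m in others)
    return sees_a_one or sees_all_started


def can_mount(in_quorate_cluster, me, reachable, dacp_bits, started):
    """Quorum is necessary but, with DAC enabled, not sufficient."""
    return in_quorate_cluster and can_set_dacp_bit(me, reachable, dacp_bits, started)


started = ["EX1", "EX2", "EX3", "EX4"]
all_zero = {name: 0 for name in started}   # cold start: every bit is 0

# EX1 starts, has quorum with EX2 and EX3, but EX4 is still offline.
print(can_mount(True, "EX1", {"EX2", "EX3"}, all_zero, started))         # False
# EX4 comes back, so all listed members are reachable.
print(can_mount(True, "EX1", {"EX2", "EX3", "EX4"}, all_zero, started))  # True
# Or: EX2 already has its bit set to 1, even though EX4 is offline.
print(can_mount(True, "EX1", {"EX2", "EX3"}, {**all_zero, "EX2": 1}, started))  # True
```

The first call is the case I had missed: the DAG has quorum, yet nothing mounts because one member on the list is still offline and nobody has a bit of 1.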

Here are highlights from some of the interesting posts in Tim’s series:

  • Part 4 has info on the steps to do a datacenter switchover and the cmdlets available when DAC is enabled. Essentially: you 1) Stop-DatabaseAvailabilityGroup with the -ConfigurationOnly:$true switch for the site that is down – this marks the servers in the site that is down as down, 2) Stop-Service CLUSSVC on the nodes in the site that is up, and finally 3) Restore-DatabaseAvailabilityGroup specifying the site that is up. This Microsoft doc on datacenter switchovers is worth reading side-by-side. It contains info on both DAC and non-DAC scenarios so watch out for that.
  • Part 5 has info on how to use the Start-DatabaseAvailabilityGroup cmdlet to set the DACP bit as 1 on a specified server thus bringing up the DAG by forcing a consensus.
  • Part 6 is an interesting story. A nice edge case of DAC being enabled and graceful shutdown.
  • Part 8 is another interesting story on what happens due to a typo in a cmdlet.

Very briefly, the DAC cmdlets:

  • Stop-DatabaseAvailabilityGroup – mark a specified server, or all servers in a specified AD site, as down. Use the -ConfigurationOnly switch to mark the server as down in AD only but not actually do anything on the server(s). Need to use this switch if the servers are already offline but AD is up and accessible in that site. This cmdlet also forces a sync of AD across sites so the information is propagated.
  • Start-DatabaseAvailabilityGroup – same as above, but mark as up. Can use the -ConfigurationOnly switch to not really do anything but only mark in AD.
  • Restore-DatabaseAvailabilityGroup – it evicts any stopped servers, it can configure the DAG to use an alternate witness server, and it brings up the DAG after doing this. This cmdlet can only be used against a DAG with DAC enabled.

Dynamic Quorum

Came across dynamic quorum from the videos (wasn’t previously aware of it). Am being lazy and will put in some screenshots from the video:

The highlighted part is the key thing.

Remember that quorum is defined as “(the number of votes)/2 + 1“. Each node (or witness) typically has a single vote, and (number of votes)/2 is rounded down (i.e. 7/2 = 3.5, rounded down to 3).

With dynamic quorum, once a node (or set of nodes) fails, and if the remaining set of nodes forms a quorum (note – they have to form quorum), then the required quorum of the cluster is adjusted to reflect the remaining number of nodes.

Take a look at the scenario below:

We have two data centers. 6 nodes + a witness, so initially the quorum was 7/2 + 1 = 4.

The link between the two data centers goes down. Data center B has 3 nodes, which is below the quorum of 4 so all 3 nodes shut down. Data center A has 3 nodes + witness, thus meeting the quorum and it stays up.

At this point if any further node in data center A goes down, they will fall below the quorum and the cluster will shut down. Avoiding such a situation is where dynamic quorum comes in. With dynamic quorum (introduced in Server 2012) when the nodes in data center A form quorum, the new quorum requirement is 4/2 + 1 = 3.

If a node goes down in data center A, leaving 2 nodes + a witness, since they meet the new quorum of 3 the cluster stays up. The quorum then gets revised to be 3/2 + 1 = 2 (3 votes remain, and 3/2 is rounded down). If yet another node goes down, the remaining node + witness still meets the new quorum of 2 and so the cluster continues to stay up.
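The arithmetic in this scenario is easy to replay in a few lines of Python. This is just a sketch of the "(number of votes)/2 + 1" rule applied step by step, with function names I made up; it does not model the real cluster service.

```python
def quorum(total_votes):
    """Votes needed for a majority: floor(total / 2) + 1."""
    return total_votes // 2 + 1


def after_failure(total_votes, surviving_votes):
    """Apply one failure under dynamic quorum.

    Returns the new vote total if the survivors meet the current quorum,
    or 0 if they do not (the cluster shuts down).
    """
    if surviving_votes >= quorum(total_votes):
        return surviving_votes
    return 0


# 6 nodes + witness = 7 votes. Then: the link drops (4 votes left in
# data center A), a node fails (3 left), another node fails (2 left).
total = 7
history = [(total, quorum(total))]
for survivors in (4, 3, 2):
    total = after_failure(total, survivors)
    history.append((total, quorum(total)))

print(history)            # [(7, 4), (4, 3), (3, 2), (2, 2)]

# Without dynamic quorum the requirement stays at 4, so losing one
# more node in data center A (3 votes left) takes the cluster down.
print(3 >= quorum(7))     # False
```

Each pair is (votes remaining, quorum required), so the cluster stays up all the way down to one node + witness.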

Another slide:

Two data centers, 2 nodes + 1 node, no witness; the quorum is therefore 3/2 + 1 = 2.

One of the nodes in data center A goes down. Since the number of remaining nodes meets quorum, the cluster can stay up. But since there is no file share witness each node cannot be given an equal vote (I wasn’t aware of this). Thus the cluster service picks one of the nodes (the one with the lowest node ID) and gives it a vote of 0. In this example the node in data center A gets the vote of 0. The new quorum is thus 1/2 + 1 = 1.

If the link between the two data centers goes down, the node in data center B stays up even though the node in data center A too could have formed quorum! Nothing wrong with it, just an edge case to keep in mind as chances are you probably wanted data center A to remain up as that’s why you provisioned two nodes there in the first place.
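Here is a small Python sketch of that edge case. The node IDs (1 and 2 in data center A, 3 in data center B) are invented so that the lowest ID lands in data center A, matching the slide; the function names are mine too.

```python
def quorum(total_votes):
    """Votes needed for a majority: floor(total / 2) + 1."""
    return total_votes // 2 + 1


def assign_votes(node_ids):
    """No witness: with an even number of nodes, zero the lowest node ID's vote."""
    votes = {node: 1 for node in node_ids}
    if len(node_ids) % 2 == 0:
        votes[min(node_ids)] = 0
    return votes


# Hypothetical IDs: nodes 1 and 2 in data center A, node 3 in data center B.
# Node 2 has failed, leaving one node on each side.
votes = assign_votes([1, 3])
needed = quorum(sum(votes.values()))

print(votes)                 # {1: 0, 3: 1}
print(needed)                # 1
print(votes[1] >= needed)    # False -> data center A goes down on a link failure
print(votes[3] >= needed)    # True  -> data center B stays up
```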

Now for a variant in which there is a witness:

So two data centers, 2 nodes + 2 nodes, 1 witness in data center A; the quorum is therefore 5/2 + 1 = 3.

As before, one of the nodes in data center A goes down. Since there are 3 nodes + witness remaining, they meet quorum and the cluster continues. The new quorum is 4/2 + 1 = 3. Again, the data center link goes down. Everything goes down! :) Why? Coz no one has a clear majority. Each side has 2 votes, not the 3 required.

Interestingly I have this setup at work. So a critical thing to keep in mind is that if I were to update & reboot the witness or one of the nodes in data center A (my preferred data center), and the WAN link were to go down – I could lose the cluster! No such problems if I update & reboot a node in data center B and the link goes down, as data center A has the majority. Funny, it’s like you must keep the witness in the less preferred data center.

Windows Server 2012 R2 improves upon dynamic quorum by adding dynamic witness.

So if the number of votes is odd, the witness vote is removed. And if the witness is offline or failed, then too it is removed (that includes reboots too, right?). 

Now things get tricky.

Going back to the previous example: so two data centers, 2 nodes + 2 nodes, 1 witness in data center A; the quorum is therefore 5/2 + 1 = 3.

As before, a node in data center A goes down (the picture is a bit incorrect as I skipped some intermediate slides); the remaining nodes have quorum so the cluster stays up. The new quorum is 4/2 + 1 = 3. But since the number of nodes is now ODD, the cluster service removes the witness from the vote calculations. So the new quorum turns out to be 3/2 + 1 = 2. At this point if the link goes down, the nodes in data center B have quorum and so they form a cluster while the remaining node in data center A is shut down. So unlike the Server 2012 case, which had no dynamic witness, the whole cluster does not go down!
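To compare the two behaviours side by side, here is a Python sketch of the link failure with and without dynamic witness. It assumes the witness is online and lives in data center A, as in the example; the function names and the `dynamic_witness` flag are mine, not anything from the cluster service.

```python
def quorum(total_votes):
    """Votes needed for a majority: floor(total / 2) + 1."""
    return total_votes // 2 + 1


def witness_vote(voting_nodes, dynamic_witness):
    """Return 1 if the witness vote counts, else 0 (witness assumed online)."""
    if not dynamic_witness:
        return 1
    # Dynamic witness: the witness only votes to make the total odd.
    return 1 if voting_nodes % 2 == 0 else 0


def survivors_after_link_loss(nodes_a, nodes_b, dynamic_witness):
    """Witness lives in data center A. Returns (a_stays_up, b_stays_up)."""
    w = witness_vote(nodes_a + nodes_b, dynamic_witness)
    needed = quorum(nodes_a + nodes_b + w)
    return (nodes_a + w >= needed, nodes_b >= needed)


# One node left in data center A (plus the witness), two nodes in B.
print(survivors_after_link_loss(1, 2, dynamic_witness=False))  # (False, False)
print(survivors_after_link_loss(1, 2, dynamic_witness=True))   # (False, True)
# For reference: all four nodes up when the link fails.
print(survivors_after_link_loss(2, 2, dynamic_witness=True))   # (True, False)
```

The first line is the Server 2012 case where everything goes down; the second is the 2012 R2 case where data center B carries on.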

Now, going back to the case where one of the nodes (not witness) had its vote removed as there was only one node in each data center, I mentioned that the node with the lowest ID gets removed. The next two slides talk about that, including how to select a node in a cluster that we’d preferentially like to remove the vote of in such situations.

At this point I’d also like to link to this Microsoft doc on cluster quorum. Am going to quote some parts from there as they explain things well and I’d like to keep them here as a reference for myself.

How cluster quorum works

When nodes fail, or when some subset of nodes loses contact with another subset, surviving nodes need to verify that they constitute the majority of the cluster to remain online. If they can’t verify that, they’ll go offline.

But the concept of majority only works cleanly when the total number of nodes in the cluster is odd (for example, three nodes in a five node cluster). So, what about clusters with an even number of nodes (say, a four node cluster)?

There are two ways the cluster can make the total number of votes odd:

  1. First, it can go up one by adding a witness with an extra vote. This requires user set-up.
  2. Or, it can go down one by zeroing one unlucky node’s vote (happens automatically as needed).

I didn’t know about point 2 until watching this video.

Worth bearing in mind that this also applies in the case of the witness being lost. So any time your witness is offline the cluster service automatically zeroes the vote of one of the nodes. If you have 2 nodes in each data center + a witness in one data center, and you reboot the witness – that is fine. One of the nodes will have its vote zeroed out, but there’s no impact and when the witness returns the zeroed out node gets its vote back. But if during the time your witness is rebooting you also have a network outage between the two data centers, then the data center with majority nodes (i.e. not the data center containing the node whose vote was zeroed) wins and the cluster fails over there.

Some more:

Dynamic witness

Dynamic witness toggles the vote of the witness to make sure that the total number of votes is odd. If there are an odd number of votes, the witness doesn’t have a vote. If there is an even number of votes, the witness has a vote. Dynamic witness significantly reduces the risk that the cluster will go down because of witness failure. The cluster decides whether to use the witness vote based on the number of voting nodes that are available in the cluster.

Dynamic quorum works with Dynamic witness in the way described below.

Dynamic quorum behavior

  • If you have an even number of nodes and no witness, one node gets its vote zeroed. For example, only three of the four nodes get votes, so the total number of votes is three, and two survivors with votes are considered a majority.
  • If you have an odd number of nodes and no witness, they all get votes.
  • If you have an even number of nodes plus witness, the witness votes, so the total is odd.
  • If you have an odd number of nodes plus witness, the witness doesn’t vote.
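The four rules above boil down to "keep the total number of votes odd". A tiny Python sketch of them, with a made-up function name, to check that they really do always give an odd total:

```python
def vote_layout(nodes, has_witness):
    """Return (voting_nodes, witness_votes) following the four rules above."""
    if has_witness:
        # Even nodes + witness: the witness votes. Odd nodes: it does not.
        return nodes, 1 if nodes % 2 == 0 else 0
    if nodes % 2 == 0:
        # Even nodes, no witness: one node gets its vote zeroed.
        return nodes - 1, 0
    # Odd nodes, no witness: everyone votes.
    return nodes, 0


for nodes in (4, 5):
    for has_witness in (False, True):
        voting_nodes, witness_votes = vote_layout(nodes, has_witness)
        total = voting_nodes + witness_votes
        print(nodes, has_witness, total, total // 2 + 1)
# 4 False 3 2
# 4 True 5 3
# 5 False 5 3
# 5 True 5 3
```

The columns are: nodes, witness configured, total votes, quorum. The first row is the example from the docs: four nodes, no witness, three votes, and two survivors with votes are a majority.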

Am pretty sure am going to forget all this a few days from today so I’ll re-link to the docs again as it goes into more detail and has examples etc.

[Aside] Stretchly – Break time reminder

Via the always resourceful How-To Geek. Came across Stretchly and Big Stretch Reminder, two pieces of software that remind you to take breaks and micro-breaks. I have previously used WorkRave and wanted to try something different now. This time I’ll try Stretchly as I like its UI and it is open-source.