Creative Commons Attribution 4.0 International License
© Rakhesh Sasidharan

[Aside] How to quickly get ESXi logs from a web browser (without SSH, vSphere client, etc)

This post made my work easy yesterday – https://www.vladan.fr/check-esxi-logs-from-web-browser/

tl;dr version:  go to https://IP_of_Your_ESXi/host

NSX Firewall not working on Layer 3; OpenBSD VMware Tools; IP Discovery, etc.

I have two security groups. Network 1 VMs (a group that contains my VMs in the 192.168.1.0/24) and Network 2 VMs (similar, for 192.168.2.0/24 network). 

Both are dynamic groups. I select members based on whether the VM name contains -n1 or -n2. (The whole exercise is just for fun/ getting to know this stuff). 

I have two firewall rules making use of these groups: one at Layer 2 and one at Layer 3. 

The Layer 2 rule works but the Layer 3 one does not! Weird. 

I decided to troubleshoot this via the command line. Figured it would be a good opportunity.

To troubleshoot I have to check the rules on the hosts (because remember, that’s where the firewall is; it’s a kernel module in each host). For that I need to get the host-id. For which I need to get the cluster-id. Sadly there’s no command to list all hosts (or at least I don’t know of any). 

So now I have my host-ids.

Let’s also take a look at my VMs (thankfully it’s a short list! I wonder how admins do this in real life):

We can see the filters applying to each VM.  To summarize:

And are these filters applying on the hosts themselves?

Hmm, that too looks fine. 

Next I picked up one of the rule sets and explored it further:

The Layer 3 & Layer 2 rules are in separate rule sets. I have marked the ones which I am interested in. One works, the other doesn’t. So I checked the address sets used by both:

Tada! And there we have the problem. The address set for the Layer 3 rule is empty. 

I checked this for the other rules too – same situation. I modified my Layer 3 rule to specifically target the subnets:

And the address set for that rule is not empty:

And because of this the firewall rules do work as expected. Hmm.

I modified this rule to be a group with my OpenBSD VMs from each network explicitly added to it (i.e. not dynamic membership in case that was causing an issue). But nope, same result – empty address set!

But the address set is now empty. :o)

So now I have an idea of the problem. I am not too surprised by this because I vaguely remember reading something about VMware Tools and IP detection inside a VM (i.e. NSX makes use of VMware Tools to know the IP address of a VM) and also because I am aware OpenBSD does not use the official VMware Tools package (it has its own and that only provides a subset of functions).

Googling a bit on this topic I came across the IP Address Discovery section in the NSX Admin guide – prior to NSX 6.2, if VMware Tools wasn’t installed (or was stopped), NSX couldn’t detect the IP address of the VM. From NSX 6.2 onwards it can do DHCP & ARP snooping to work around a missing/ stopped VMware Tools. We configure the latter in the host installation page:

I am going to go ahead and enable both on all my clusters. 

That helped. But it needs time. Initially the address set was empty. I started pings from one VM to another and the source VM IP was discovered and put in the address set; but since the destination VM wasn’t in the list, traffic was still being allowed. I stopped pings, started pings, waited a while … tried again … and by then the second VM’s IP too was discovered and put in the address set – effectively blocking communication between them. 
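
A simplified way to picture why the half-populated address set still let traffic through. This is only a toy model in Python, not how the NSX kernel module is implemented; the rule name, the IP addresses, and the default-allow fallback are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    """A toy block rule: it matches only if both address sets contain the packet's IPs."""
    name: str
    sources: set = field(default_factory=set)
    destinations: set = field(default_factory=set)
    action: str = "block"


def evaluate(rules, src_ip, dst_ip, default="allow"):
    """Return the action of the first matching rule, else the default action."""
    for rule in rules:
        if src_ip in rule.sources and dst_ip in rule.destinations:
            return rule.action
    return default


# Hypothetical addresses for one VM in each network.
VM_N1 = "192.168.1.10"
VM_N2 = "192.168.2.10"

rule = Rule(name="block-n1-to-n2")

# 1. Nothing discovered yet: both address sets are empty, so the rule never matches.
before = evaluate([rule], VM_N1, VM_N2)

# 2. Only the source VM has been discovered (e.g. after it starts pinging).
rule.sources.add(VM_N1)
half = evaluate([rule], VM_N1, VM_N2)

# 3. Both VMs discovered: the rule finally matches and blocks the traffic.
rule.destinations.add(VM_N2)
after = evaluate([rule], VM_N1, VM_N2)

print(before, half, after)
```

In this model the rule only takes effect once both ends are known, which matches what I saw: pings kept working until the second IP landed in the address set.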

Side by side I installed a Windows 8.1 VM with VMware Tools etc and tested to see if it was being automatically picked up (I did this before enabling the snooping above). It was. In fact its IPv6 address too was discovered via VMware Tools and added to the list:

Nice! Picked up something interesting today. 

Useful offline Windows troubleshooting/ fixing tricks

Had a Windows Server 2008 R2 server that started giving a blank screen since the recent Windows update reboot. This was a VM and it was the same result via VMware console or RDP. Safe Mode didn’t help either. Bummer!

Since this is a VM I mounted its disk on another 2008 R2 VM and tried to fix the problem offline. Most of my attempts didn’t help but I thought of posting them here for reference. 

Note: In the following examples the broken VM’s disk is mounted to F: drive. 

Recent updates

I used dism to list recent updates and remove them. To list updates from this month (March 2017):

To remove an update:

I did this for each of the updates I had. That didn’t help though. And oddly I found that one of the updates kept re-appearing with a slightly different name (a different number suffixed to it actually) each time I’d remove it. Not sure why that was the case but I saw that F:\Windows\WinSxS had a file called pending.xml and figured this must be doing something to stop the update from being removed. I couldn’t delete the file in spite of taking ownership and full control, so I opened it in Notepad and cleared all the contents. :o) After that the updates didn’t return but the machine was still broken. 
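
For reference, a small Python sketch of how offline DISM command lines against a mounted disk can be put together. /Image, /Get-Packages, /Remove-Package and /PackageName are standard DISM switches, but these are not necessarily the exact commands I ran, and the package name below is a made-up placeholder.

```python
def dism_list_packages(image_root):
    """Arguments that list the packages installed in an offline Windows image."""
    return ["dism", f"/Image:{image_root}", "/Get-Packages"]


def dism_remove_package(image_root, package_name):
    """Arguments that remove one package from an offline Windows image."""
    return [
        "dism",
        f"/Image:{image_root}",
        "/Remove-Package",
        f"/PackageName:{package_name}",
    ]


# The broken VM's disk is mounted as F: in this post.
IMAGE = "F:\\"
# Placeholder only; use a real name taken from the /Get-Packages output.
PLACEHOLDER_PACKAGE = "Package_for_KB0000000"

list_cmd = dism_list_packages(IMAGE)
remove_cmd = dism_remove_package(IMAGE, PLACEHOLDER_PACKAGE)

# Print the command lines instead of running them, so nothing is changed by accident.
print(" ".join(list_cmd))
print(" ".join(remove_cmd))
```

The functions return argument lists rather than strings so they could be handed to subprocess.run on a Windows machine as-is.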

SFC

I used sfc to check the integrity of all the system files:

No luck with that either!

Event Logs

Maybe the Event Logs have something? These can be found at F:\Windows\System32\Winevt\Logs. Double click the ones of interest to view. 

In my case the Event Logs had nothing! No record at all of the VM starting up or what was causing it to hang. Tough luck!

Bonus info: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Eventlog contains locations of the files backing the Event Logs. Just mentioning it here as I came across this.

Drivers

Could drivers cause any issue? Unlikely. You can’t use dism to query drivers as above but you can check via registry. See this post. Honestly, I didn’t read it much. I didn’t suspect drivers and it seemed too much work fiddling through registry keys and folders. 

Last Known Good Configuration

Whenever I’d boot up the VM I never got the Last Known Good (LKG) Configuration option. I tried pressing F8 a couple of times but it had no effect. So I wondered if I could tweak this via the registry. Turns out I can. And turns out I already knew this; I had just forgotten!

Your current configuration is HKLM\System\CurrentControlSet. This is actually a link to HKLM\System\ControlSet001 or HKLM\System\ControlSet002 or HKLM\System\ControlSet003 or … (you get the point). Each of the ControlSetXXX keys is one of your previous configurations. The one that’s actually used can be found via HKLM\System\Select. The entry Current points to the number of the ControlSetXXX key in use. The entry LastKnownGood points to the Last Known Good Configuration. Now we know what to do. 

  1. Mount the HKLM\SYSTEM hive of the broken VM. All registry hives can be found under %windir%\System32\Config. In my case that translates to the file F:\Windows\System32\Config\SYSTEM.
  2. To mount this file open Registry Editor, select the HKLM hive, and go to File > Load Hive. (This is a good post with screenshots etc).  
  3. Go to the Select key above. Change Current to whatever LastKnownGood was. 
  4. That’s all. Now unload the hive and you are done.
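
The change made in step 3 can be sketched in a few lines of Python. This is only a model of the Select values as a dictionary, not code that touches a real registry hive, and the numbers in it are hypothetical.

```python
def control_set_key(number):
    """Registry key name for a control set number, e.g. 1 -> 'ControlSet001'."""
    return f"ControlSet{number:03d}"


def switch_to_last_known_good(select_values):
    """Return a copy of the Select values with Current pointed at LastKnownGood."""
    updated = dict(select_values)
    updated["Current"] = select_values["LastKnownGood"]
    return updated


# Hypothetical values; only the two entries discussed above are modelled.
select = {"Current": 1, "LastKnownGood": 2}

fixed = switch_to_last_known_good(select)

print("Current pointed to:    ", control_set_key(select["Current"]))
print("Current now points to: ", control_set_key(fixed["Current"]))
```

The original dictionary is left untouched, which mirrors a sensible habit when doing this for real: note down the old value of Current before changing it.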

This helped in my case! I was finally able to move past the blank screen and get a login prompt. Upon login I was also able to download and install all the patches and confirm that the VM is now working fine (took a snapshot of course, just in case!). I have no idea what went wrong, but at least I have the pleasure of being able to fix it. From the post I link to below, I’d say it looks like a registry hive corruption. 

Since I successfully logged in, my machine’s Last Known Good Configuration will be automatically updated by Windows with the current one. Here’s a blog post that explains this in more detail. 

That’s all! Hope this helps someone. 

Active Directory: Troubleshooting

This is intended to be a “running post” with bits and pieces I find on AD troubleshooting. If I bookmark these I’ll forget them. But if I put them here I can search easily and also put some notes alongside. 

DCDiag switches and other commands

From Paul Bergson:

  • dcdiag /v /c /d /e /s:dcname > c:\dcdiag.log
    • /v tells it to be verbose
    • /d tells it to also show debug output – i.e. even more verbosity
    • /c tells it to be comprehensive – do all the non-default tests too (except DCPromo and RegisterInDNS)
    • /e tells it to test all servers in the enterprise – i.e. across site links
    • /s:dcname tells it which DC to run against

This prompted me to make a table with the list of DcDiag tests that are run by default and in comprehensive mode. 

Test Name                    By default?   Comprehensive?
Advertising                  Y             Y
CheckSDRefDom                Y             Y
CheckSecurityError           N             Y
Connectivity                 Y             Y
CrossRefValidation           Y             Y
CutOffServers                N             Y
DcPromo                      N/A           N/A
DNS                          N             Y
FrsEvent                     Y             Y
DFSREvent                    Y             Y
SysVolCheck                  Y             Y
LocatorCheck                 Y             Y
Intersite                    Y             Y
KccEvent                     Y             Y
KnowsOfRoleHolders           Y             Y
MachineAccount               Y             Y
NCSecDesc                    Y             Y
NetLogons                    Y             Y
ObjectsReplicated            Y             Y
OutboundSecureChannels       Y             Y
RegisterInDNS                N/A           N/A
Replications                 Y             Y
RidManager                   Y             Y
Services                     Y             Y
SystemLog                    Y             Y
Topology                     N             Y
VerifyEnterpriseReferences   N             Y
VerifyReferences             Y             Y
VerifyReplicas               N             Y
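
To pull out just the tests that /c adds on top of the defaults, the table can be turned into a small Python lookup. The data is copied from the table above (DcPromo and RegisterInDNS are left out as they are N/A); the DC name dc01 is a placeholder, and /test: is dcdiag's switch for running a single named test.

```python
# (run by default?, run in comprehensive mode?) for each test in the table.
DCDIAG_TESTS = {
    "Advertising": (True, True),
    "CheckSDRefDom": (True, True),
    "CheckSecurityError": (False, True),
    "Connectivity": (True, True),
    "CrossRefValidation": (True, True),
    "CutOffServers": (False, True),
    "DNS": (False, True),
    "FrsEvent": (True, True),
    "DFSREvent": (True, True),
    "SysVolCheck": (True, True),
    "LocatorCheck": (True, True),
    "Intersite": (True, True),
    "KccEvent": (True, True),
    "KnowsOfRoleHolders": (True, True),
    "MachineAccount": (True, True),
    "NCSecDesc": (True, True),
    "NetLogons": (True, True),
    "ObjectsReplicated": (True, True),
    "OutboundSecureChannels": (True, True),
    "Replications": (True, True),
    "RidManager": (True, True),
    "Services": (True, True),
    "SystemLog": (True, True),
    "Topology": (False, True),
    "VerifyEnterpriseReferences": (False, True),
    "VerifyReferences": (True, True),
    "VerifyReplicas": (False, True),
}


def comprehensive_only(tests):
    """Names of tests that run with /c but not by default, sorted alphabetically."""
    return sorted(
        name
        for name, (by_default, comprehensive) in tests.items()
        if comprehensive and not by_default
    )


def dcdiag_test_command(test_name, dc_name):
    """Command line that runs a single named test against one DC."""
    return f"dcdiag /test:{test_name} /s:{dc_name}"


extra_tests = comprehensive_only(DCDIAG_TESTS)
for test in extra_tests:
    print(dcdiag_test_command(test, "dc01"))  # dc01 is a placeholder DC name
```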

Replication error 1722 The RPC server is unavailable

Came across this after I set up a new child domain. Other DCs in the forest were unable to replicate to it for about 2 hours. The error was due to DNS – the CNAME records for the new DC hadn’t replicated yet. 

This TechNet post was a good read. Gives a few commands worth keeping in mind, and shows a logical way of troubleshooting.

Replication error 8524 The DSA operation is unable to proceed because of a DNS lookup failure

Another TechNet post I came across in relation to the above DNS issue. 

This command is worth remembering:

Shows all the replication partners and a summary of last replication. Seems to be similar to:

Especially useful is the fact that both commands give the DSA GUIDs of the target DC and its partners:

It is possible to specify a DC by giving its name. Having the GUIDs is useful when you suspect DNS issues. Check that the CNAMEs can be resolved from both source and destination DCs.  
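
A minimal Python sketch of that CNAME check. Each DC registers an alias of the form <DSA GUID>._msdcs.<forest root domain>; the GUID and forest name below are made up for illustration, and resolves() is just a thin helper around the local resolver.

```python
import socket


def dsa_cname(dsa_guid, forest_root_domain):
    """Build the alias a DC registers: <DSA GUID>._msdcs.<forest root domain>."""
    guid = dsa_guid.strip().strip("{}").lower()
    return f"{guid}._msdcs.{forest_root_domain.strip('.')}"


def resolves(name):
    """True if the local resolver can turn the name into an address."""
    try:
        socket.gethostbyname(name)
        return True
    except OSError:
        return False


# Hypothetical GUID and forest name, for illustration only.
alias = dsa_cname("{8A7BAEE5-CD81-4C8C-9C0F-B10030574016}", "example.com.")
print(alias)
```

Running resolves(alias) on both the source and the destination DC shows quickly whether one side is missing the record.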

Active Directory: Troubleshooting Domain Controller critical services

These are notes from the AD Troubleshooting WorkshopPLUS session I attended. The notes are on troubleshooting Domain Controller critical services. I am mostly following what was discussed in class here rather than add anything new (except in the section of SC where I talk a bit about it).

Before moving on let’s recap the DC critical services from my previous post:

  • DHCP client / DNS client – registers the DC’s A and PTR records
    • DHCP client for Server 2003 and prior
    • DNS client for Server 2008 and later
  • FRS / DFSR – responsible for SYSVOL replication between DCs
    • FRS is now deprecated, may or may not be used in the domain. DFSR is the replacement.
    • If the domain was born in functional level 2008 (i.e. all DCs are Server 2008 or later) then DFSR is used.
    • Else FRS could be in use unless it was migrated.  
  • DNS server – used by DCs to locate each other, clients to locate DCs
  • KDC – used for Kerberos authentication in the domain
  • Netlogon – maintains secure channel between DCs and other DCs and clients; also updates DNS with the SRV records
    • Secure channel is used for Kerberos authentication and AD replication
    • DNS records are also written to %systemroot%\system32\config\Netlogon.DNS in case manual updating of DNS server is required.
  • Windows Time – maintains correct time in the domain, required for Kerberos authentication and AD replication
  • AD DS – provides AD
  • AD WS – provides a web service interface to AD

Event Viewer

In case of issues the Event Viewer is the best place to start troubleshooting from. Bear in mind that merely looking at the System and Application logs, as most admins do, is not enough. AD specific events are usually logged under the Custom Views > Server Roles section. 

[Screenshot: AD events under Custom Views > Server Roles in Event Viewer]

Event IDs for some of the common problems can be found at this link. Some more event IDs and their resolution can be found at this link. The previous two links are worth a read in that they also give a high level overview of AD and troubleshooting.  

DcDiag

This has a separate post of its own now.

Service Controller (SC)

This is a command I haven’t used much except in the context of checking for drivers. Try the following if you want to get a list of all active drivers on your system:

Omit the pipe and findstr after that if you want more details. SC is cool in that it can do remote computers too:

But drivers are just one type of object SC can query. If you omit the type= driver SC returns services (and if you set type= All SC returns both drivers and services).

For example, to get a list of all services on the machine

An example entry in the output looks like this:

Too much info, so to output just the Service Name, Display Name, and State use findstr:
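
The same filtering can be sketched in Python for cases where the output has been saved to a file. The sample text below is an illustrative, hand-written sample in the general shape of sc query output, not something captured from a real machine.

```python
SAMPLE = """\
SERVICE_NAME: Dnscache
DISPLAY_NAME: DNS Client
        TYPE               : 20  WIN32_SHARE_PROCESS
        STATE              : 4  RUNNING
                                (STOPPABLE, NOT_PAUSABLE, IGNORES_SHUTDOWN)
        WIN32_EXIT_CODE    : 0  (0x0)

SERVICE_NAME: NTDS
DISPLAY_NAME: Active Directory Domain Services
        TYPE               : 20  WIN32_SHARE_PROCESS
        STATE              : 4  RUNNING
                                (STOPPABLE, NOT_PAUSABLE, ACCEPTS_SHUTDOWN)
        WIN32_EXIT_CODE    : 0  (0x0)
"""


def parse_sc_query(text):
    """Extract service name, display name and state from sc query style output."""
    services = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("SERVICE_NAME:"):
            # A new entry starts at every SERVICE_NAME line.
            current = {"name": line.split(":", 1)[1].strip()}
            services.append(current)
        elif current is not None and line.startswith("DISPLAY_NAME:"):
            current["display_name"] = line.split(":", 1)[1].strip()
        elif current is not None and line.startswith("STATE"):
            # The value looks like "4  RUNNING"; keep the textual part.
            parts = line.split(":", 1)[1].split()
            if parts:
                current["state"] = parts[-1]
    return services


for svc in parse_sc_query(SAMPLE):
    print(svc["name"], "|", svc["display_name"], "|", svc["state"])
```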

Services can be stopped and started using sc stop and sc start, each followed by the service name.

SC has its limitations though, in that you can’t stop a service if it has other services dependent on it; those have to be stopped first. The sc enumdepend command lists the services that depend on a particular service, so a batch file could stop them one by one. In the other direction, SC can find which services a particular service depends upon via the sc qc command. For example:

Given a service you can also get its description. For example:

Like I said, I don’t use SC much except to query drivers. What I typically use for querying services is PowerShell.

PowerShell

  • Start-Service
  • Stop-Service
  • Restart-Service
  • Get-Service

I have noticed that sometimes the results from Get-Service and sc query vary. A recent example was when I did Get-Service NTDS on a Server 2008 R2 machine and it returned nothing while sc query NTDS returned results as expected.

Even WMIC is able to find NTDS above, but Get-Service doesn’t. Go figure!

Be mindful of the symptoms

One thing that was emphasized in class a lot is that while troubleshooting start with the symptoms (doh!). As in, think of the symptoms you are experiencing and work backwards from them as to what critical services could be down/ broken which might be leading to these symptoms. That will give you a good starting point to troubleshoot and then you can use the tools above to dig deeper and identify the problem. AD is a complex system made up of many moving parts, so a good understanding of the underlying structure and how they tie in together is important.

Down the rabbit hole

Ever had this feeling that when you want to do one particular thing, a whole lot of other things keep coming into the picture leading you to other distracting paths?

For about a week now I’ve been meaning to write some posts about my Active Directory workshop. In a typical me fashion, I thought I’d set up some VMs and stuff on my laptop. This being a different laptop to my usual one, I thought of using Hyper-V. And then I thought why not use differencing VHDs to save space. And then I thought why not use a Gen 2 VM. Which doesn’t work so I went on a tangent reading about UEFI’s boot process and writing a blog post on that. Then I went into making an answer file to use while installing, went into refreshing myself on the PowerShell cmdlets I can use to do the initial configuring of Server Core 2012, made a little script to take care of that for multiple servers, and so on …

Finally I got around to installing a member server yesterday. Thought this would be easy – I know all the steps from before, just that I have to use a Server 2012 GUI WIM instead of a Core WIM. But nope! Now the ReAgentC.exe command on my computer doesn’t work! It worked till about 3 days ago but has now suddenly stopped working – so irritating! Of course, I could skip the WinRE partition – not that I use it anyways! – or just use a Gen 1 VM, but that just isn’t me. I don’t like to give up or backtrack from a problem. Every one of these is a learning opportunity, because now I am reading about Component Based Servicing, the Windows Recovery Environment, and learning about new DISM cleanup options that I wasn’t even aware of. But the problem is one of balance. I can’t afford to lose myself too much in learning new things because I’ll soon lose sight of the original goal of making Active Directory related posts.

It’s exciting though! And this is what I like and dislike about embarking on a project like this (writing Active Directory related posts). I like stumbling upon new issues and learning new things and working through them; but I dislike having to be on guard so I don’t go too deep down the hole and lose sight of what I had set out to do.

Here’s a snapshot of where I am now:

[Screenshot: my WorkFlowy outline]

It’s from WorkFlowy, a tool that I use to keep track of such stuff. I could write a blog post raving about it but I’ll just point you to this excellent review by Farhad Manjoo instead.

Downloading Trace32 and CMTrace for easy log file reading

I was working with a log file recently (C:\Windows\Logs\cbs\CBS.log to be precise, to troubleshoot an issue I am having on my laptop, which I hope to sort soon and write a blog post about). Initially I was opening the file in Notepad but that isn’t a great way of going through log files. Then I remembered that at work I use Trace32 from the SCCM 2007 Toolkit. So I downloaded it from Microsoft. Then I learnt Trace32 has been replaced by a tool called CMTrace in SCCM 2012 R2.

Here are links to both toolkits:

For the 2007 toolkit when installing choose the option to only install the Common Tools and skip the rest. That will install only Trace32 at C:\Program Files (x86)\ConfigMgr 2007 Toolkit V2 (add this to your PATH variable for ease of access).

[Screenshot: 2007 toolkit installer with only Common Tools selected]

For the 2012 R2 toolkit choose the option to install only the Client Tools and skip the rest. That will install CMTrace and a few other tools at C:\Program Files (x86)\ConfigMgr 2012 Toolkit R2\ClientTools (add this too to your PATH variable).

[Screenshot: 2012 R2 toolkit installer with only Client Tools selected]

That’s all! Happy troubleshooting!