Creative Commons Attribution 4.0 International License
© Rakhesh Sasidharan

Useful offline Windows troubleshooting/ fixing tricks

Had a Windows Server 2008 R2 server that started giving a blank screen since the recent Windows update reboot. This was a VM and it was the same result via VMware console or RDP. Safe Mode didn’t help either. Bummer!

Since this is a VM I mounted its disk on another 2008 R2 VM and tried to fix the problem offline. Most of my attempts didn’t help but I thought of posting them here for reference. 

Note: In the following examples the broken VM’s disk is mounted to F: drive. 

Recent updates

I used dism to list recent updates and remove them. To list updates from this month (March 2017):
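Something along these lines (the date format in the output depends on your locale, so adjust the findstr pattern accordingly):

```
dism /image:F:\ /get-packages /format:table | findstr "3/2017"
```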

To remove an update:
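Something like this, where the package name is whatever /get-packages reported (the KB number below is a made-up example):

```
dism /image:F:\ /remove-package /packagename:Package_for_KB1234567~31bf3856ad364e35~amd64~~6.1.1.0
```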

I did this for each of the updates I had. That didn’t help though. And oddly I found that one of the updates kept re-appearing with a slightly different name (a different number suffixed to it actually) each time I’d remove it. Not sure why that was the case but I saw that F:\Windows\WinSxS had a file called pending.xml and figured this must be doing something to stop the update from being removed. I couldn’t delete the file in spite of taking ownership and full control, so I opened it in Notepad and cleared all the contents. :o) After that the updates didn’t return but the machine was still broken. 

SFC

I used sfc to check the integrity of all the system files:
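The offline variant of sfc takes the boot and Windows directories of the mounted disk:

```
sfc /scannow /offbootdir=F:\ /offwindir=F:\Windows
```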

No luck with that either!

Event Logs

Maybe the Event Logs have something? These can be found at F:\Windows\System32\Winevt\Logs. Double click the ones of interest to view. 

In my case the Event Logs had nothing! No record at all of the VM starting up or what was causing it to hang. Tough luck!

Bonus info: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Eventlog contains locations of the files backing the Event Logs. Just mentioning it here as I came across this.

Drivers

Could drivers cause any issue? Unlikely. You can’t use dism to query drivers as above but you can check via registry. See this post. Honestly, I didn’t read it much. I didn’t suspect drivers and it seemed too much work fiddling through registry keys and folders. 

Last Known Good Configuration

Whenever I’d boot up the VM I never got the Last Known Good (LKG) Configuration option. I tried pressing F8 a couple of times but it had no effect. So I wondered if I could tweak this via the registry. Turns out I can. And it turns out I already knew this – just that I had forgotten!

Your current configuration is HKLM\System\CurrentControlSet. This is actually a link to HKLM\System\ControlSet001 or HKLM\System\ControlSet002 or HKLM\System\ControlSet003 or … (you get the point). Each ControlSet00X key is one of your previous configurations. The one that’s actually used can be found via HKLM\System\Select. The entry Current points to the number of the ControlSet00X key in use. The entry LastKnownGood points to the Last Known Good Configuration. Now we know what to do. 

  1. Mount the HKLM\SYSTEM hive of the broken VM. All registry hives can be found under %windir%\System32\Config. In my case that translates to the file F:\Windows\System32\Config\SYSTEM.
  2. To mount this file open Registry Editor, select the HKLM hive, and go to File > Load Hive. (This is a good post with screenshots etc).  
  3. Go to the Select key above. Change Current to whatever LastKnownGood was. 
  4. That’s all. Now unload the hive and you are done.
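If you prefer the command line, the same steps look roughly like this (the hive name BrokenSystem and the value 2 are just examples; use whatever LastKnownGood reports on your machine):

```
:: Load the broken VM's SYSTEM hive under a temporary name
reg load HKLM\BrokenSystem F:\Windows\System32\Config\SYSTEM

:: Note the Current and LastKnownGood values
reg query HKLM\BrokenSystem\Select

:: Point Current at the LastKnownGood control set (2 in this example)
reg add HKLM\BrokenSystem\Select /v Current /t REG_DWORD /d 2 /f

:: Unload the hive
reg unload HKLM\BrokenSystem
```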

This helped in my case! I was finally able to move past the blank screen and get a login prompt. Upon login I was also able to download and install all the patches and confirm that the VM is now working fine (took a snapshot of course, just in case!). I have no idea what went wrong, but at least I have the pleasure of being able to fix it. From the post I link to below, I’d say it looks like a registry hive corruption. 

Since I successfully logged in, my machine’s Last Known Good Configuration will be automatically updated by Windows with the current one. Here’s a blog post that explains this in more detail. 

That’s all! Hope this helps someone. 

Useful WMIC filters

I have these tabs open in my browser from last month when I was doing some WMI-based GPO targeting. Meant to write a blog post but I keep getting sidetracked and now it’s been nearly a month so I have lost the flow. But I want to put these in the blog as a reference to my future self. 
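I no longer have the exact queries, but as an example of the kind of WMI filter involved, a GPO WMI filter that targets only 2012 R2 servers would be something like:

```sql
SELECT * FROM Win32_OperatingSystem WHERE Version LIKE "6.3%" AND ProductType = "3"
```

(ProductType 1 is a client OS; 2 is a domain controller; 3 is a member server.)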

That’s all.

Go through a group of servers and find whether a particular patch is installed

Patch Tuesday is upon us. Our pilot group of servers was patched via SCCM but there were reports that 2012 R2 servers were not picking up one of the patches. I wanted to quickly identify the servers that were missing patches. 

Our pilot servers are in two groups. So I did the following:
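The code block itself is missing here; a reconstruction of what it did (the group names and KB number are placeholders):

```powershell
Import-Module ActiveDirectory

# The first two lines enumerate the two pilot groups
$group1 = Get-ADGroupMember "Pilot Servers Group 1"
$group2 = Get-ADGroupMember "Pilot Servers Group 2"

($group1 + $group2) | ForEach-Object {
    # Skip servers that are offline
    if (Test-Connection $_.Name -Count 1 -Quiet) {
        $os = Get-WmiObject Win32_OperatingSystem -ComputerName $_.Name
        # Only 2012 R2 servers (version 6.3.9600)
        if ($os.Version -eq "6.3.9600") {
            # Blank InstalledOn means the hotfix is not installed
            $qfe = Get-WmiObject Win32_QuickFixEngineering -ComputerName $_.Name -Filter "HotFixID = 'KB4012213'"
            [PSCustomObject]@{ Name = $_.Name; InstalledOn = $qfe.InstalledOn }
        }
    }
} | Format-Table @{ Label = "Name"; Expression = { $_.Name }; Width = 20 }, InstalledOn
```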

The first two lines basically enumerate the two groups. If it was just one group I could have replaced it with Get-ADGroupMember "GroupName"

The remaining code checks whether the server is online, filters for 2012 R2 servers (version number 6.3.9600), and makes a list of the servers along with the installed date of the hotfix I am interested in. If the hotfix is not installed, the date will be blank. Simple. 

Oh, and I wanted to get the output as and when it comes, so I went with a Width=20 in the name field. I could have avoided that and gone for -AutoSize, but that would mean I’d have to patiently wait for PowerShell to generate the entire output and then for Format-Table to do an autosize. 

Update: While on the Win32_QuickFixEngineering WMI class I wanted to point out these posts: [1], [2]

Worth keeping in mind that Win32_QuickFixEngineering (or QFE for short) only returns patches installed via the CBS (Component Based Servicing) – which is what Windows Updates do anyway. What this means, however, is that it does not return patches installed via an MSI/ MSP/ MSU. 

IE 11 update fails due to prerequisite updates (KB2729094)

IE 11 update requires the following prerequisite updates – link.

Even after installing those (most of which are already there) the IE 11 install will complain and fail. The log file is C:\Windows\IE11_main.log.

In my case I was getting the following error (seems to be the same for others too):

Thing is I already had this hotfix installed, so there was nothing more to do. Found this useful support post where someone suggested running the hotfix install and side-by-side launching the IE install. Might need to do it 2-3 times but that seems to make a difference. So I tried that and sure enough it helped.

That post is worth a read for some other tricks, especially if you are sequencing this via SCCM. I found this article from Symantec too which seems helpful. Some day when I am in charge of SCCM too I can try such stuff out! :)

P2V a SQL cluster by breaking the cluster

Need to P2V a SQL cluster at work. Here are screenshots of what I did in a test environment to see if an idea of mine would work.

We have a SQL cluster made up of 2 physical nodes. The requirement was to convert this into a single virtual machine.

P2V-ing a single server is easy. Use VMware Converter. But P2V-ing a cluster like this is tricky. You could P2V each node and end up with a cluster of 2 virtual-nodes but that wasn’t what we wanted. We didn’t want to deal with RDMs and such for the cluster, so we wanted to get rid of the cluster itself. VMware can provide HA if anything happens to the single node.

My idea was to break the cluster and get one of the nodes of the cluster to assume the identity of the cluster. Have SQL running off that. Virtualize this single node. And since there’s no change as far as the outside world is concerned no one’s the wiser.

Found a blog post that pretty much does what I had in mind. Found one more which was useful but didn’t really pertain to my situation. Have a look at the latter post if your DTC is on the Quorum drive (wasn’t so in my case).

So here we go.

1) Make the node that I want to retain the active node of the cluster (so it owns all the disks and databases). Then shut down SQL Server.

sqlshutdown

2) Shutdown the cluster.

clustershutdown

3) Remove the node we want to retain, from the cluster.

We can’t remove/ evict the node via GUI as the cluster is offline. Nor can we remove the Failover Cluster feature from the node as it is still part of a cluster (even though the cluster is shutdown). So we need to do a bit of “surgery”. :)

Open PowerShell and do the following:
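The cmdlet in question (a reconstruction, but this is the standard way to clear cluster configuration from a node):

```powershell
Import-Module FailoverClusters

# Forcibly clear any cluster configuration from this node.
# Meant to be used on nodes that have been evicted from a cluster.
Clear-ClusterNode -Force
```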

This simply clears any cluster related configuration from the node. It is meant to be used on evicted nodes.

Once that’s done remove the Failover Cluster feature and reboot the node. If you want to do this via PowerShell:
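Presumably something like:

```powershell
Import-Module ServerManager

# Remove the Failover Clustering feature, then reboot
Remove-WindowsFeature Failover-Clustering
Restart-Computer
```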

4) Bring online the previously shared disks.

Once the node is up and running, open Disk Management and mark as online the shared disks that were previously part of the cluster.

disksonline

5) Change the IP and name of this node to that of the cluster.

Straight-forward. Add CNAME entries in DNS if required. Also, you will have to remove the cluster computer object from AD before renaming this node to that name.

6) Make some registry changes.

The SQL Server is still not running as it expects to be on a cluster. So make some registry changes.

First go to HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\Setup and open the entry called SQLCluster and change its value from 1 to 0.

Then take a backup (just in case; we don’t really need it) of the key called HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\Cluster and delete it.

Note that MSSQL10_50.MSSQLSERVER will vary if you have a different version of SQL Server than in my case.
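If you’d rather script these registry changes than click through Regedit, something along these lines should be equivalent (the backup path is an example):

```powershell
$base = "HKLM:\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER"

# Tell SQL Server it is no longer clustered
Set-ItemProperty -Path "$base\Setup" -Name SQLCluster -Value 0

# Back up the Cluster key (just in case), then delete it
reg export "HKLM\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\Cluster" C:\Temp\SQLCluster.reg
Remove-Item -Path "$base\Cluster" -Recurse
```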

7) Start the SQL services and change their startup type to Automatic.

I had 3 services.

Now your SQL server should be working.

8) Restart the server – not needed, but I did so anyways.

Test?

If you are doing this in a test environment (like I was) and don’t have any SQL applications to test with, do the following.

Right click the desktop on any computer (or the SQL server computer itself) and create a new text file. Then rename that to blah.udl. The name doesn’t matter as long as the extension is .udl. Double click on that to get a window like this:

udl

Now you can fill in the SQL server name and test it.

One thing to keep in mind (if you are not a SQL person – I am not). The Windows NT Integrated security is what you need to use if you want to authenticate against the server with an AD account. It is tempting to select the “Use a specific user name …” option and put in an AD username/ password there, but that won’t work. That option is for using SQL authentication.

If you want to use a different AD account you will have to do a run as of the tool.

Also, on a fresh install of SQL server SQL authentication is disabled by default. You can create SQL accounts but authentication will fail. To enable SQL authentication right click on the server in SQL Server Management Studio and go to Properties, then go to Security and enable SQL authentication.

sqlauth

That’s all!

Now one can P2V this node.

Installing a new license key in KMS

KMS is something you login to once in a blue moon and then you wonder how the heck are you supposed to install a license key and verify that it got added correctly. So as a reminder to myself.

To install a license key:
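The key itself is elided below, obviously:

```
cscript slmgr.vbs /ipk XXXXX-XXXXX-XXXXX-XXXXX-XXXXX
```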

Then activate it:
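That’s just:

```
cscript slmgr.vbs /ato
```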

If you want to check that it was added correctly:
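The /dlv switch dumps detailed license info:

```
cscript slmgr.vbs /dlv
```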

I use cscript so that the output comes in the command prompt itself and I can scroll up and down (or put into a text file) as opposed to a GUI window which I can’t navigate.

Very Brief Notes on Windows Memory etc

I have no time nowadays to update this blog but I wanted to dump these notes I made for myself today. Just so the blog has some update and I know I can find these notes here again when needed.

This is on the various types of memory etc in Windows (and other OSes in general). I was reading up on this in the context of Memory Mapped Files.

Virtual Memory

  • Amount of memory the OS can access. Not related to the physical memory. It is only related to processor and OS – is it 32-bit or 64-bit.
  • 32-bit means 2^32 = 4GB; 64-bit means 2^64 = a lot! :)
  • On a default install of 32-bit Windows the kernel reserves 2GB for itself and applications can use the remaining 2GB. Each application gets its own 2GB because virtual memory doesn’t really exist and is not limited by the physical memory in the machine. The OS just lies to each application that it has 2GB of virtual memory to itself.

Physical Memory

  • This is physical. But not limited to the RAM modules. It is RAM modules plus paging file/ disk.

Committed Memory

  • When a virtual memory page is touched (read/ write/ committed) it becomes “real” – i.e. put into Physical Memory. This is Committed Memory. It is a mix of RAM modules and disk.

Commit Limit

  • The total amount of Committed Memory is obviously limited by your Physical Memory – i.e. the RAM modules plus disk space. This is the Commit Limit.

Working Set

  • Set of virtual memory pages that are committed and fully belong to that process.
  • These are memory pages that exist. They are backed by Physical Memory (RAM plus paging files). They are real, not virtual.
  • So a working set can be thought of as the subset of a process’s Virtual Memory space that is valid; i.e. can be referenced without a page fault.
    • A page fault happens when the process requests a virtual page that is not in Physical Memory and so has to be loaded from disk (not the page file even); the OS will put that process on hold and do this behind the scenes. Obviously this causes a performance impact so you want to avoid page faults. Again note: this is not a RAM to page file fault; this is a Physical Memory to disk fault. The former is a “soft” page fault; the latter is a “hard” page fault.
    • Hard faults are bad and tied to insufficient RAM.

Life cycle of a Page

  • Pages in Working Set -> Modified Pages -> Standby Pages -> Free Pages -> Zeroed pages.
  • All of these are still in Physical RAM, just different lists on the way out.

From http://stackoverflow.com/a/22174816:

Memory can be reserved, committed, first accessed, and be part of the working set. When memory is reserved, a portion of address space is set aside, nothing else happens.

When memory is committed, the operating system guarantees that the corresponding pages could in principle exist either in physical RAM or on the page file. In other words, it counts toward its hard limit of total available pages on the system, and it formally creates pages. That is, it creates pages and pretends that they exist (when in reality they don’t exist yet).

When memory is accessed for the first time, the pages that formally exist are created so they truly exist. Either a zero page is supplied to the process, or data is read into a page from a mapping. The page is moved into the working set of the process (but will not necessarily remain in there forever).

Memory Mapped Files

Windows (and other OSes) have a feature called memory mapped files.

Typically your files are on a physical disk and there’s an I/O cost involved in using them. To improve performance what Windows can do is map a part of the virtual memory allocated to a process to the file(s) on disk.

This doesn’t copy the entire file(s) into RAM, but a part of the virtual memory address range allocated to the process is set aside as mapping to these files on disk. When the process tries to read/ write these files, the parts that are read/ written get copied into the virtual memory. The changes happen in virtual memory, and the process continues to access the data via virtual memory (for better performance) and behind the scenes the data is read/ written to disk if needed. This is what is known as memory mapped files.

My understanding is that even though I say “virtual memory” above, it is actually restricted to the Physical RAM and does not include page files (coz obviously there’s no advantage to using page files instead of the location where the file already is). So memory mapped files are mapped to Physical RAM. Memory mapped files are commonly used by Windows with binary images (EXE & DLL files).

In Task Manager the “Memory (Private Working Set)” column does not show memory mapped files. For this look to the “Commit Size” column.

Also, use tools like RAMMap (from SysInternals) or Performance Monitor.

More info

Solarwinds not seeing correct disk size; “Connection timeout. Job canceled by scheduler.” errors

Had this issue at work today. Notice the disk usage data below in Solarwinds –

Disk Usage

The ‘Logical Volumes’ section shows the correct info but the ‘Disk Volumes’ section shows 0 for everything.

Added to that all the Application Monitors had errors –

Timeout

I searched Google for the error message “Connection timeout. Job canceled by Scheduler.” and found this Solarwinds KB article. Corrupt performance counters seemed to be a suspect. That KB article was a bit confusing to me in that it gives three resolutions and I wasn’t sure if I was to do all three or just pick and choose. :)

Event Logs on the target server did show corrupt performance counters.

Initial Errors

I tried to get the counters via PowerShell to double check and got an error as expected –

Broken Get-Counter

Ok, so a performance counter issue indeed. Since the Solarwinds KB article didn’t make much sense to me I searched for the Event ID 3001 as in the screenshot and came across a TechNet article. The solution seemed simple – open up command prompt as an admin, run the command lodctr /R. This command apparently rebuilds the performance counters from scratch based on current registry settings and backup INI files (that’s what the help message says). The command completed straight-forwardly too.

lodctr - 1

With this the performance counters started working via PowerShell.

Working Get-Counter

The Event Logs still had some errors but those were to do with the performance counters of ASP.Net and Oracle etc.

More Errors

The fix for this seemed to be a bit more involved and requires rebooting the server. I decided to skip it for now as I don’t think these additional counters have much to do with Solarwinds. So I let those messages be and tried to see if Solarwinds was picking up the correct info. Initially I took a more patient approach of waiting and trying to make it poll again; then I got impatient and did things like removing the node from monitoring and adding it back (and then waiting again for Solarwinds to poll it etc) but eventually it began working. Solarwinds now sees the disk space correctly and all the Application Monitors work without any errors too.

Here’s what I am guessing happened (based on that Solarwinds KB article I linked to above). The performance counters of the server got corrupt. Solarwinds uses counters to get the disk info etc. Due to this corruption the poller spent more time than usual when fetching info from the server. This resulted in the Application Monitor components not getting a chance to run as the poller had run out of time to poll the server. Thus the Application Monitors gave the timeout errors above. In reality the timeout was not from those components, it was from the corrupt performance counters.

Exchange DAG fails. Information Store service fails with error 2147221213.

Had an interesting issue at work today. When our Exchange servers (which are in a 2 node DAG) rebooted after patch weekend one of them had trouble starting the Information Store service. The System log had entries such as these (event ID 7024) –

The Microsoft Exchange Information Store service terminated with service-specific error %%-2147221213.

The Application log had entries such as these (event ID 5003) –

Unable to initialize the Information Store service because the clocks on the client and server are skewed. This may be caused by a time change either on the client or on the server, and may require a restart of that computer. Verify that your domain is correctly configured and  is currently online.

So it looked like time synchronization was an issue. Which is odd coz all our servers should be correctly syncing time from the Domain Controllers.

Our Exchange team fixed the issue by forcing a time sync from the DC –
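From memory the command would have been along these lines:

```
w32tm /resync /rediscover
```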

I was curious as to why, so I went through the System logs in detail. What I saw was a sequence of entries such as these –

Notice how the time suddenly jumps ahead from 13:21 when the OS starts to 13:27, then jumps back to 13:22 when the Windows Time service starts and begins syncing time from my DC. It looked like this jump of 6 mins was confusing the Exchange services (understandably so). But why was this happening?

I checked the time configuration of the server –
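That is, presumably:

```
w32tm /query /configuration
```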

Seems to be normal. It was set to pick time from the site DC via NTP (the first entry under TimeProviders) as well as from the ESXi host the VM is running on (the second entry – VM IC Time Provider). I didn’t think much of the second entry because I know all our VMs have the VMware Tools option to sync time from the host to VM unchecked (and I double checked it anyways).

Only one of the mailbox servers was having this jump though. The other mailbox server had a slight jump but not enough to cause any issues. While the problem server had a jump of 6 mins, the ok server had a jump of a few seconds.

I thought to check the ESXi hosts of both VMs anyways. Yes, they are not set to sync time from the host, but let’s double check the host times anyways. And bingo! turns out the ESXi hosts have NTP turned off and hence varying times. The host with the problem server was about 6 mins ahead of the DC in terms of time, while the host with the ok server was about a minute or less ahead – matching the time jumps of the VMs too closely to be a coincidence!

So it looked like the Exchange servers were syncing time from the ESXi hosts even though I thought they were not supposed to. I read a bit more about this and realized my understanding of host-VM time sync was wrong (at least with VMware). When you tick/ untick the option to synchronize VM time with the ESX host, all you are controlling is a periodic synchronization from host to VM. This does not control other scenarios where a VM could synchronize time with the host – such as when it moves to a different host via vMotion, has a snapshot taken, is restored from a snapshot, has its disk shrunk, or (tada!) when the VMware Tools service is restarted (like when the VM is rebooted, as was the case here). Interesting.

So that explains what was happening here. When the problem server was rebooted it synced time with the ESXi host, which was 6 mins ahead of the domain time. This was before the Windows Time service kicked in. Once the Windows Time service started, it noticed the incorrect time and set it right. This time jump confused Exchange – am thinking it didn’t confuse Exchange directly, rather one of the AD services running on the server most likely, and due to this the Information Store was unable to start.

The fix for this is to either disable VMs from synchronizing time from the ESXi host or setup NTP on all the ESXi hosts so they have the correct time going forward. I decided to go ahead with the latter.

Update: Found this and this blog post. They have more screenshots and a better explanation, so worth checking out. :)

Using SolarWinds to highlight servers in a pending reboot status

Had a request to use SolarWinds to highlight servers in a pending reboot status. Here’s what I did.

Sorry, this is currently broken. After implementing this I realized I need to enable PowerShell remoting on all servers for it to work, else the script just returns the result from the SolarWinds server. Will update this post after I fix it at my workplace. If you come across this post before that, all you need to do is enable PowerShell remoting across all your servers and change the script execution to “Remote Host”.

SolarWinds has a built in application monitor called “Windows Update Monitoring”. It does a lot more than what I want so I disabled all the components I am not interested in. (I could have also just created a new application monitor, I know, just was lazy).

winupdatemon-1

The part I am interested in is the PowerShell Monitor component. By default it checks for the reboot required status by checking a registry key: HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\WindowsUpdate\Auto Update\RebootRequired. Here’s the default script –
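From memory it is roughly this (SolarWinds PowerShell monitors report results via “Message:” and “Statistic:” output lines):

```powershell
# Default check: does the Windows Update RebootRequired key exist?
$key = "HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\WindowsUpdate\Auto Update\RebootRequired"

if (Test-Path $key) {
    Write-Host "Message: Reboot required"
    Write-Host "Statistic: 1"
} else {
    Write-Host "Message: No reboot required"
    Write-Host "Statistic: 0"
}
```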

Inspired by this blog post which monitors three more registry keys and also queries ConfigMgr, I replaced the default PowerShell script with the following –
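A reconstruction of that script (not the exact code; the registry paths are the commonly checked reboot-pending indicators):

```powershell
$pending = $false

# Windows Update
if (Test-Path "HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\WindowsUpdate\Auto Update\RebootRequired") { $pending = $true }

# Component Based Servicing
if (Test-Path "HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\Component Based Servicing\RebootPending") { $pending = $true }

# Pending file rename operations
$sm = Get-ItemProperty "HKLM:\SYSTEM\CurrentControlSet\Control\Session Manager"
if ($sm.PendingFileRenameOperations) { $pending = $true }

# Ask the ConfigMgr client too, if it is present
try {
    $ccm = Invoke-WmiMethod -Namespace root\ccm\ClientSDK -Class CCM_ClientUtilities -Name DetermineIfRebootPending -ErrorAction Stop
    if ($ccm.RebootPending -or $ccm.IsHardRebootPending) { $pending = $true }
} catch { }

if ($pending) {
    Write-Host "Message: Machine restart pending"
    Write-Host "Statistic: 1"
} else {
    Write-Host "Message: No restart pending"
    Write-Host "Statistic: 0"
}
```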

Then I added the application monitor to all my Windows servers. The result is that I can see the following information on every node –

winupdatemon-2

Following this I created alerts to send me an email whenever the status of the above component (“Machine restart status …”) went down for any node. And I also created a SolarWinds report to capture all nodes for which the above component was down.

winupdatemon-3

Then I assigned this to a schedule to run once in a month after our patching window to email me a list of nodes that require reboots.

 

Solarwinds AppInsight for IIS – doing a manual install – and hopefully fixing invalid signature (error code: 16007)

AppInsight from Solarwinds is pretty cool. At least the one for Exchange is. Trying out the one for IIS now. Got it configured on a few of our servers easily but it failed on one. Got the following error –

appinsight-error

Bummer!

Manual install it is then. (Or maybe not! Read on and you’ll see a hopeful fix that worked for me).

First step in that is to install PowerShell (easy) and the IIS PowerShell snap-in. The latter can be downloaded from here. This downloads the Web Platform Installer (a.k.a. “webpi” for short) and that connects to the Internet to download the goods. In theory it should be easy, in practice the server doesn’t have connectivity to the Internet except via a proxy so I have to feed it that information first. Go to C:\Program Files\Microsoft\Web Platform Installer for that, find a file called WebPlatformInstaller.exe.config, open it in Notepad or similar, and add the following lines to it –
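The lines in question set up the proxy for webpi; the proxy address below is a placeholder for your own:

```xml
<system.net>
  <defaultProxy>
    <proxy
      usesystemdefault="false"
      proxyaddress="http://proxy.mydomain.com:8080"
      bypassonlocal="true" />
  </defaultProxy>
</system.net>
```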

This should be within the <configuration> -- </configuration> block. Didn’t help though, same error.

webpi-error

Time to look at the logs. Go to %localappdata%\Microsoft\Web Platform Installer\logs\webpi for those.

From the logs it looked like the connection was going through –

But the problem was this –

If I go to the link – https://www.microsoft.com/web/webpi/5.0/webproductlist.xml – via IE on that server I get the following –

untrusted-cert

 

However, when I visit the same link on a different server there’s no error.

Interesting. I viewed the untrusted certificate from IE on the problem server and compared it with the certificate from the non-problem server.

Certificate on the problem server

Certificate on a non-problem server

Comparing the two I can see that the non-problem server has a VeriSign certificate in the root of the path, because of which there’s a chain of trust.

verisign - g5

If I open Certificate Manager on both servers (open mmc > Add/ Remove Snap-Ins > Certificates > Add > Computer account) and navigate to the “Trusted Root Certification Authorities” store, I can see that the problem server doesn’t have the VeriSign certificate in its store while the other server does.

cert manager - g5

So here’s what I did. :) I exported the certificate from the server that had it and imported it into the “Trusted Root Certification Authorities” store of the problem server. Then I closed and opened IE and went to the link again, and bingo! the website opens without any issues. Then I tried the Web Platform Installer again and this time it loads. Bam!

The problem though is that it can’t find the IIS PowerShell snap-in. Grr!

no snap-in

no snap-in 2

That sucks!

However, at this point I had an idea. The SolarWinds error message was about an invalid signature, and what do we know of that can cause an invalid signature? Certificate issues! So now that I have installed the required CA certificate for the Web Platform Installer, maybe it sorts out SolarWinds too? So I went back and clicked “Configure Server” again and bingo! it worked this time. :)

Hope this helps someone.

Solarwinds – “The WinRM client cannot process the request”

Added the Exchange 2010 Database Availability Group application monitor to a couple of our Exchange 2010 servers and got the following error –

error1

Clicking “More” gives the following –

error2

This is because Solarwinds is trying to run a PowerShell script on the remote server and the script is unable to run due to authentication errors. That’s because Solarwinds is trying to connect to the server using its IP address, and so instead of using Kerberos authentication it resorts to Negotiate authentication (which is disabled). The error message too says the same but you can verify it for yourself from the Solarwinds server too. Try the following command
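Not necessarily the exact command from the post, but Test-WSMan shows the same behaviour (the server name is a placeholder):

```powershell
# Fails: Negotiate authentication is disabled on the PowerShell site
Test-WSMan -ComputerName exchange01 -Authentication Negotiate -Credential (Get-Credential)

# Succeeds: Kerberos authentication
Test-WSMan -ComputerName exchange01 -Authentication Kerberos -Credential (Get-Credential)
```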

This is what’s happening behind the scenes and as you will see it fails. Now replace “Negotiate” with “Kerberos” and it succeeds –

So, how to fix this? Log on to the remote server and launch IIS Manager. It’s under “Administrative Tools” and may not be there by default (my server only had “Internet Information Services (IIS) 6.0 Manager”), in which case add it via Server Manager/ PowerShell –
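For instance:

```powershell
Import-Module ServerManager

# Add the IIS Management Console feature
Add-WindowsFeature Web-Mgmt-Console
```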

Then open IIS Manager, go to Sites > PowerShell and double click “Authentication”.

iis-1

Select “Windows Authentication” and click “Enable”.

iis-2

Now Solarwinds will work.

Using Solarwinds to monitor Windows Services

This is similar to how I monitored performance counters with Solarwinds.

I want to monitor a bunch of AppSense services.

Similar to the performance counters where I created an application monitor template so I could define the threshold values, here I need to create an application monitor so that the service appears in the alert manager. This part is easy (and similar to what I did for the performance counters) so I’ll skip and just put a screenshot like below –

applimonitor

I created a separate application monitor template but you could very well add it to one of the standard templates (if it’s a service that’s to be monitored on all nodes for instance).

Now for the part where you create alerts.

Initially I thought this would be a case of creating triggers when the above application monitor goes down. Something like this –

alert1

And create an alert message like this –

alert1a

With this I was hoping to get a one or more alert messages only for the services that actually went down. Instead, what happened is that whenever any one service went down I’d get an alert for the service that went down and also a message for the services that were up. Am guessing since I was triggering on the application monitor, Solarwinds helpfully sent the status for each of its components – up or down.

alert2

The solution is to modify your trigger such that you target each component.

alert3

Now I get alerts the way I want.

Hope this helps!

Fixing a Windows Server that was stuck on “Preparing to configure Windows”

This is something that I fixed a few months ago at work but didn’t get a chance to blog about then. Because of the gap I might not post about it as verbosely as I usually do.

The situation was that we had a Windows Server 2012 R2 that was stuck on a “Preparing to configure Windows” loop. I didn’t take a screenshot of it but you can find an example of it for Windows 7 here. All the usual troubleshooting steps like automatic repairs and last known good configuration etc didn’t make a dent. The problem began after the server was rebooted following Windows Updates so I focused on that. Here’s what I did to fix the server:

  1. I rebooted the server and pressed F8 before the Windows logo screen.
  2. This got me to the recovery options screen, where I selected the option to get a command prompt.
  3. Next I found the drive letter that corresponds to the C: drive of the server. This was through trial and error: typing each drive letter and finding the one with the Windows folder.
    1. In my case the drive letter was E:, so keep that in mind while viewing the screenshots. Replace it with the drive letter you find.
  4. I entered the following command to get a list of all Windows updates, both installed and pending.
  5. The output was as below.
  6. I noted the names of the updates that were in a “Staged” state and uninstalled them one by one.
  7. Then I exited the command prompt. This rebooted the server and it was stuck at the following screen for a while.
  8. I stayed on this screen for about 15 mins. The server rebooted itself and stayed on the screen again, for about 10 mins. After this it proceeded to the login screen and I could login as usual. :)
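The dism commands from the screenshots, reconstructed (the KB number is a made-up example; use the names /get-packages reports):

```
:: List all packages in the offline image, installed and pending
dism /image:E:\ /get-packages

:: Remove a package that is in the "Staged" (install pending) state
dism /image:E:\ /remove-package /packagename:Package_for_KB1234567~31bf3856ad364e35~amd64~~6.3.1.0
```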

Find out which DCs in your domain have the DHCP service enabled

Use PowerShell –
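Something along these lines (a sketch; the service name of the DHCP Server role is DHCPServer):

```powershell
Import-Module ActiveDirectory

# All DCs in the domain, with the state of the "DHCP Server" service on each
Get-ADDomainController -Filter * |
    Select-Object Name, @{ Label = "DHCP Server"; Expression = {
        (Get-Service -ComputerName $_.Name -Name DHCPServer -ErrorAction SilentlyContinue).Status
    } } |
    Format-Table -AutoSize
```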

Result is a table of DC names and the status of the “DHCP Server” service. If the service isn’t installed (i.e. the feature isn’t enabled) you get a blank.