Solarwinds not seeing correct disk size; “Connection timeout. Job canceled by scheduler.” errors

Had this issue at work today. Notice the disk usage data below in Solarwinds –

Disk Usage

The ‘Logical Volumes’ section shows the correct info but the ‘Disk Volumes’ section shows 0 for everything.

Added to that all the Application Monitors had errors –

Timeout

I searched Google for the error message “Connection timeout. Job canceled by Scheduler.” and found this Solarwinds KB article. Corrupt performance counters seemed to be a suspect. That KB article was a bit confusing to me in that it gives three resolutions and I wasn’t sure if I was meant to do all three or just pick and choose. :)

Event Logs on the target server did show corrupt performance counters.

Initial Errors

I tried to get the counters via PowerShell to double check and got an error as expected –

Broken Get-Counter

Ok, so a performance counter issue indeed. Since the Solarwinds KB article didn’t make much sense to me, I searched for Event ID 3001 (as in the screenshot) and came across a TechNet article. The solution seemed simple: open a command prompt as an admin and run the command lodctr /R. This command apparently rebuilds the performance counters from scratch based on current registry settings and backup INI files (that’s what the help message says). The command completed without any fuss too.

lodctr - 1

With this the performance counters started working via PowerShell.

Working Get-Counter

Event Logs still had some errors, but those were to do with the performance counters of ASP.Net, Oracle, etc.

More Errors

The fix for these seemed to be a bit more involved and required rebooting the server. I decided to skip it for now as I don’t think these additional counters have much to do with Solarwinds. So I let those messages be and tried to see if Solarwinds was picking up the correct info. Initially I took the patient approach of waiting and trying to make it poll again; then I got impatient and did things like removing the node from monitoring and adding it back (and then waiting again for Solarwinds to poll it). Eventually it began working. Solarwinds now sees the disk space correctly and all the Application Monitors work without any errors too.

Here’s what I am guessing happened (based on that Solarwinds KB article I linked to above). The performance counters of the server got corrupted. Solarwinds uses these counters to get the disk info etc. Due to the corruption, the poller spent more time than usual fetching info from the server. As a result the poller ran out of time before the Application Monitor components got a chance to run, so the Application Monitors gave the timeout errors above. In reality the timeout was not caused by those components; it was caused by the corrupt performance counters.
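To make that guess a bit more concrete, here is a toy Python model of it. This is only a sketch of my assumption, not how the Solarwinds scheduler actually works: the run_poll function, the step names, the 60 second budget and all the durations are made up for illustration. The idea is that every step of a poll draws from one shared time budget, so a single slow step (the counter query) can leave nothing for the steps after it.

```python
def run_poll(steps, budget_s):
    """Run steps in order against one shared time budget (hypothetical model).

    steps is a list of (name, seconds_needed) pairs. A step that does not
    fit in the remaining budget is marked "timeout" and uses up whatever
    budget was left, so every later step times out as well.
    """
    remaining = budget_s
    results = {}
    for name, needed_s in steps:
        if needed_s <= remaining:
            remaining -= needed_s
            results[name] = "ok"
        else:
            remaining = 0.0
            results[name] = "timeout"
    return results


# Made-up durations: healthy counters answer quickly, corrupt ones hang.
healthy = [("disk counters", 2.0), ("app monitor A", 3.0), ("app monitor B", 3.0)]
corrupt = [("disk counters", 120.0), ("app monitor A", 3.0), ("app monitor B", 3.0)]

print(run_poll(healthy, budget_s=60.0))
print(run_poll(corrupt, budget_s=60.0))
```

In the corrupt case the application monitors are reported as timed out even though they would only need a few seconds each, which matches what I saw: the errors showed up on the monitors, but the time was lost earlier in the poll.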