ESXi host – cannot install HA – no space left on device

These are less notes and more a collection of links and what I did when I encountered this issue. Just for my future self.

At work we had a host that was giving HA errors. The message was along the lines of vCenter not being able to contact the HA agent. So I tried reconfiguring it for HA (right-click the host and select “Reconfigure for vSphere HA”), upon which I got a new error: “Cannot install the vCenter Server agent service. Cannot upload agent.”

[Screenshot: the HA error]

Initially I thought it must just be a permissions issue. But it wasn’t so.

To investigate further I tried logging on to the server. I couldn’t enable SSH and ESXi Shell from the Configuration tab – it gave me an error. So I iLO’d into the server’s DCUI and enabled SSH and ESXi Shell. SSH still refused to let me in, and when I’d press Alt+F1 on the console to get the login prompt, it was filled with messages like these: /bin/sh: can't fork. Initially I thought it might be to do with the HP AMS memory leak (see this and this) but it wasn’t.

I pressed Alt+F12 to see the on-screen logs. It was filled with messages like these:

[Screenshot: Alt+F12 console logs]

Blimey!

There was nothing more I could do here, basically. I couldn’t log in to the server at all; heck, I couldn’t even shut down/restart it gracefully via F12 in the DCUI (nothing would happen). So I cold booted it and that got it working.

It’s been about 2 hours since I did that and the server seems stable, so maybe it was a one-off thing. I looked at more logs though, and here’s what I found.

/var/log/syslog.log

(Contains: Management service initialization, watchdogs, scheduled tasks and DCUI use)

/var/log/vmkwarning.log

(Contains: A summary of Warning and Alert log messages excerpted from the VMkernel logs)

/var/log/vob.log

(Contains: VMkernel Observation events)

/var/log/vmkernel.log

(Contains: Core VMkernel logs, including device discovery, storage and networking device and driver events, and virtual machine startup)

/var/log/hostd.log

(Contains: Host management service logs, including virtual machine and host Task and Events, communication with the vSphere Client and vCenter Server vpxa agent, and SDK connections.)
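For quickly eyeballing these from the ESXi shell, plain old busybox tail and grep do the job. The grep patterns below are just examples of the sort of thing I was hunting for, not exact log text:

    tail -n 100 /var/log/vmkwarning.log
    grep -i inode /var/log/vmkernel.log
    grep -i "no space" /var/log/hostd.log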

From these logs one thing was clear: the ESXi RAMdisk hosting the root filesystem had run out of inodes, possibly caused by the SFCB service. Because of this the root filesystem was effectively out of space and everything was failing. Great!

In Linux I am used to the df command to check filesystem usage. But in ESXi, df only seems to give info on the mounted filesystems, whereas vdf covers the local filesystems, like RAMdisks and tardisks (whatever those are).
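For reference, the two commands are along these lines (I believe vdf takes the same -h human-readable flag as df, but double-check on your build):

    df -h     # mounted filesystems: VMFS datastores, vfat partitions, etc.
    vdf -h    # visorfs view: RAMdisks and tardisks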

On this host (after the reboot) that output all looked fine. To check the inode usage, use the stat command.
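Something like this against the root filesystem; the -f flag makes stat report on the filesystem rather than a file, and the Inodes line in its output shows the total and free counts (from memory, so verify on your build):

    stat -f /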

Or use esxcli. It gives you the free space as well as the inode counts!
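The command lives in the system visorfs namespace; it lists each RAMdisk along with its size, free space and inode figures:

    esxcli system visorfs ramdisk list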

Note to self: make a habit of using the esxcli command, as that seems to be the VMware-preferred way of doing things. Plus it’s one command with various namespaces you can use for networking and other info.
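A few examples of what I mean by namespaces (all standard esxcli commands, though availability of some varies by ESXi version):

    esxcli system version get              # host version and build
    esxcli network ip interface ipv4 get   # vmkernel port IP config
    esxcli storage filesystem list         # mounted filesystems/datastores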

In my case things look to be fine now.

KB 2037798 talks about this problem. Apparently it is fixed via a patch released in 2013, and as far as I can tell we are properly patched, so we shouldn’t have been hit by this issue. If it happens again though, the same KB article talks about creating a separate RAMdisk for SFCB, so even if it eats up all its inodes your root filesystem isn’t affected. This involves creating a new RAMdisk at boot time by modifying rc.local (nice!). The esxcli command can be used to create the RAMdisk and mount it at the mount point SFCB needs:
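I haven’t actually done this, so the following is only a sketch of what I recall from the KB; the name, sizes, permissions and mount point below should be verified against the KB before use. The idea is to add the line to /etc/rc.local.d/local.sh (plain /etc/rc.local on older builds) so the RAMdisk is recreated on every boot, and SFCB then writes its ticket files there instead of into the root RAMdisk:

    esxcli system visorfs ramdisk add -m 0 -M 250 -n sfcbtickets -p 0755 -t /var/run/sfcb

Afterwards, esxcli system visorfs ramdisk list should show the new RAMdisk alongside root and the others.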

Turns out such an issue can also occur because of SNMP, or, if you have an HP Gen8 blade server, because of the hpHelper.log file, which is fixed via a patch from HP (this server was a Gen8 blade but it didn’t have this log file). KB 2040707 also talks about this. It didn’t help much in my case, as that didn’t seem to be my issue.
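If it recurs, a couple of quick checks along these lines would tell me whether either of those applied (the hpHelper.log path is my guess at where the HP agent writes it, and the snmp namespace is only there on 5.1 and later):

    ls -lh /var/log/hpHelper.log   # does the runaway HP log even exist?
    esxcli system snmp get         # is SNMP enabled on this host?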

Two useful links for future reference are:

That’s all for now.

p.s. I keep talking about SFCB above but had no idea what it is. Turns out it is the CIM server for ESXi. Found this blog post on it.