Creative Commons Attribution 4.0 International License
© Rakhesh Sasidharan

Notes on Virtual Connect and firmware upgrading without network outage

I have yet to read it, but in case you didn’t know, there’s a book by HP on Virtual Connect. I haven’t used Virtual Connect at all, apart from briefly seeing it for the first time when my colleagues showed it to me last month. I now have to update the Virtual Connect firmware for our enclosures, so I am looking into how to do that. Here are some more documents I have yet to read; I’m linking them here as bookmarks for myself:

  • A PDF giving an overview of Virtual Connect
  • A page with all the documentation HP has on Virtual Connect and related
  • A page with many whitepapers and manuals on how Virtual Connect works

Virtual Connect firmware upgrades can be done via HP SUM/ SPP. They can also be done independently via the Virtual Connect Support Utility (VCSU).

  • This PDF (which can be found via the second bullet point above) is very useful. It is a document outlining the steps involved in upgrading the Virtual Connect firmware. It’s from 2013 but I couldn’t find anything newer on HP’s website.
  • The above PDF is also linked to from this excellent blog post that talks about how to upgrade the firmware without any downtime. 
  • VCSU can be downloaded from this page.
  • Here’s a page with some of its more useful commands.
  • Finally, this page has the latest version of the firmware. You can download the version for Windows and extract the binary image of the firmware from it.

Upgrading the Virtual Connect firmware seems very straightforward. As I said, you can do it via SUM/ SPP too. The recommended order is to first upgrade the OA (easily done via SUM/ SPP – no reboot required); then the ROM, iLO, and any other firmware for the blades (again easily done via SUM/ SPP – ROM & iLO don’t require any reboot); and finally the VC. For me the big question was whether I could do the VC upgrade without any network impact.

The PDF I mentioned above (this one) is a must read on the upgrade process. Page 10 onwards talks about the upgrade process.

One thing to note is that before the upgrade VCSU (which is what SUM/ SPP also use behind the scenes, I suppose) takes a backup of the configuration and does health checks. If the VCs don’t pass the health checks the upgrade doesn’t happen. Each Ethernet module of the VC takes about 20 minutes to upgrade; each FC module takes about 5 minutes. An overview of the upgrade process can be found on page 11 – in short, here’s what happens:

  1. Via SFTP the new firmware is copied in parallel to all modules.
  2. Firmware is upgraded on all modules in parallel. This can be thought of as the update phase.
  3. Then the firmware is activated. The default order is odd-even, in which modules on the odd side of the enclosure are activated first, then those on the even side.
    1. It is also possible to do serial activation (one after the other), or parallel (everything at the same time), or manual.
  4. Post activation the module is rebooted.
    1. I am not very clear here but it seems the modules on the backup VC side of things (including the backup VC) get rebooted first.
    2. Then the modules on the primary VC side of things (except the primary VC) get rebooted.
    3. VC Manager (VCM) fails over to the backup VC module, and then the primary VC module is rebooted.
    4. Post-reboot the VCM fails over back to the primary VC module (this is only for the Ethernet modules I think, not FC).
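The activation orders in step 3 can be sketched in a few lines of Python. This is only my illustration of the ordering, not anything VCSU exposes: the function name is made up, and I am assuming that the "odd side" of the enclosure means the odd-numbered interconnect bays.

```python
def activation_groups(bays, order="odd-even"):
    """Return the groups of interconnect bays that get activated together, in sequence.

    Hypothetical helper for illustration only. Assumes "odd side" means
    odd-numbered bays and "even side" means even-numbered bays.
    """
    bays = sorted(bays)
    if order == "odd-even":
        odd = [b for b in bays if b % 2 == 1]
        even = [b for b in bays if b % 2 == 0]
        # Odd side first, then even side; skip a side with no modules.
        return [group for group in (odd, even) if group]
    if order == "serial":
        # One module after the other.
        return [[b] for b in bays]
    if order == "parallel":
        # Everything at the same time.
        return [bays] if bays else []
    if order == "manual":
        # Nothing is activated automatically; the operator resets modules later.
        return []
    raise ValueError(f"unknown activation order: {order}")


if __name__ == "__main__":
    for mode in ("odd-even", "serial", "parallel", "manual"):
        print(mode, activation_groups([1, 2, 3, 4], mode))
```

With manual activation the list is empty, which is the point of the workaround described further down: nothing reboots until you reset each module yourself.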

Notice the bit about the reboots above? That’s when network connectivity can be lost. On page 12 the document talks about how network outages can be avoided via redundant configuration and NIC bonding, but on page 13 it clarifies that, because the reboot is a graceful one, there could still be a 20 second network outage: the blade hardware (and the OS running on it) might not be notified that the VC module is down. You see, SmartLink and the DCC protocol are responsible for informing the blades that a VC module is down, and so the NICs that map to it are down, so that they fail over to another NIC using the backup VC. But because the firmware is being upgraded, SmartLink and DCC are unavailable and no one informs the blades. It is only when the OS on a blade realizes that it has lost network connectivity and must take corrective action that it fails over to the backup NIC, leading to a potential 20 second outage.

(What I said above is also what this blog post mentions. To give credit I came across the blog post first and through it the guide).

The workaround to the above outage is to set the activation order as manual. And then reset the VC modules through the OA. Since that’s a reset – as opposed to a graceful reboot – the blades will get a notification immediately that the module is down.

Here’s how I updated the VC firmware on my servers without any network outage. First I used VCSU (in update mode) to update & activate the VC modules. Note that I selected “manual” as the option in two places below.

I set a time of 5 mins to wait between activation of each VC module. That’s generally recommended.

After that I got the screens below – the whole process took about 40 minutes:

That completes the update step in VCSU, but the new firmware isn’t active yet because the modules haven’t been rebooted (I chose manual activation). Because of that there’s no network downtime so far.

After that I logged into the OA, went over to the Interconnect Bays section > selected the first VC module > Virtual Buttons tab > and clicked Reset.

(Screenshot: VC module reset)

This resets the VC module. Again no network outage (I was continually pinging some of the hosts and the VMs – one of the VMs had 3 packet drops, that’s it; the hosts I pinged had no drops). Post resetting (which is instant on the UI) I waited some 5 mins, then checked the Information tab to see the firmware level. It was showing the new firmware:

(Screenshot: firmware info)

After that I did the same (reset) for the second VC module. I waited 5-6 mins and then ran VCSU again (in healthcheck mode) to confirm the state of the modules. (To make the output smaller I passed the inputs to VCSU as command-line switches. I could have done the same above too.)

As can be seen, the modules are in sync and both are on the latest firmware version. All done without any network outage! :)
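I kept an eye on the outage by pinging hosts and VMs and counting the drops by hand. A minimal sketch of doing that tally in Python is below. The function name and the one-ping-per-second interval are my own assumptions, and the input is simply a list of True/False values (reply received or not) that you would collect from whatever ping tool you use.

```python
def summarize_pings(replies, interval_s=1.0):
    """Summarize a sequence of ping results (True = reply, False = drop).

    Hypothetical helper for illustration. interval_s is the assumed time
    between pings, used to turn the longest run of drops into seconds.
    """
    dropped = 0
    longest_run = 0
    current_run = 0
    for ok in replies:
        if ok:
            current_run = 0
        else:
            dropped += 1
            current_run += 1
            longest_run = max(longest_run, current_run)
    return {
        "sent": len(replies),
        "dropped": dropped,
        "longest_gap_s": longest_run * interval_s,
    }


if __name__ == "__main__":
    # Example: five replies, three consecutive drops, two replies.
    sample = [True] * 5 + [False] * 3 + [True] * 2
    print(summarize_pings(sample))
```

The longest gap is the number to watch: a handful of scattered drops is very different from a single 20 second hole.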

Update Jan 2016: Chris Lynch (from HPE) wrote to me three months ago clarifying some misinformation in my post above. Turns out you no longer have the 20 second outage and all that I wrote above is more or less incorrect. :) Rather than copy paste his email here I’ve printed it to a PDF and you can read it here – Chris Lynch update. Thanks Chris!

HP SUM/ SPP configuration location

Before I forget – HP SUM & HP SPP store their configuration stuff in the following folder – C:\Users\(Username)\AppData\Local\Temp\2\HPSUM. I spent a while today trying to discover where this information is stored. Thought it would be in the same folder as HP SUM or perhaps in the registry, but no – it’s stored as above!

In my case HP SUM was acting weird and not talking to all my nodes properly. It did so correctly in the beginning, and even updated a few, but after that it kept hanging at the inventory stage and would complain about the username/ password being wrong. Figured I’d nuke it and start again but couldn’t make much progress until I figured the above.
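If you want to locate that folder from a script, here is a small sketch that builds the path for a given username. The function name is made up, and the "2" sub-folder is simply what I saw on my machine; I am assuming it could differ elsewhere, so treat it as a parameter.

```python
from pathlib import PureWindowsPath


def hpsum_config_dir(username, temp_subdir="2"):
    """Build the HP SUM configuration folder path described above.

    Hypothetical helper. temp_subdir defaults to "2" because that is what
    the folder looked like on my system; this may vary.
    """
    return (
        PureWindowsPath("C:/Users")
        / username
        / "AppData"
        / "Local"
        / "Temp"
        / temp_subdir
        / "HPSUM"
    )


if __name__ == "__main__":
    print(hpsum_config_dir("alice"))
```

PureWindowsPath only manipulates the path text, so this runs on any OS; it does not check whether the folder exists.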

Brief notes on HP SUM and SPP

HP SUM (Smart Update Manager) can be downloaded from http://h17007.www1.hp.com/us/en/enterprise/servers/products/service_pack/hpsum/index.aspx. This is just the tool. Its home page is http://www8.hp.com/us/en/products/server-software/product-detail.html?oid=5182020. As of this post date the home page says the latest version is 7.3.0 but the download page only has 7.1.0. Not sure why.

I am on Windows so I downloaded the ISO and the ZIP file (which can be found later on in the page). The ISO file is bootable. You can add firmware and drivers to this and boot up. The ZIP file has the HP SUM tool for Windows and Linux and can be extracted to these OSes and run from there. It’s not meant for booting up and deploying.

From Windows computers you can run HP SUM and update Windows, Linux, VMware, HP-UX, iLO, Virtual Connect, etc. From Linux computers you can do all these except Windows.

Documentation can be found at http://h17007.www1.hp.com/us/en/enterprise/servers/solutions/info-library/index.aspx?cat=smartupdate&subcat=hp_sum.

An SPP (Service Pack for ProLiant) is SUM bundled with a set of firmware and drivers as of a certain date. These have been tested to ensure they work well together.

HP SUM only works with VMware if you are using the HP customized version of VMware. These can be found at http://www8.hp.com/us/en/products/servers/solutions.html?compURI=1499005#tab=TAB4. If your installation of VMware is not an HP customized version then the inventory step will fail with an error that the username/ password is incorrect.

A baseline is a set of updates that you want all the nodes added into SUM to be at. If you run SUM from an SPP then the baseline is that of the SPP – for example 2015.04 if you are running the 2015.04 SPP. SUM creates a baseline from the packages you add to it the first time it runs. In addition to a baseline you can also add extra components (I am not too sure about that, I haven’t played with it).

So you create a baseline (or it happens implicitly). You add nodes and do an inventory of the nodes. That tells you what’s present on the system. Then in the next screen you review what needs to be done and deploy accordingly. On this screen you can choose whether reboots happen or should be delayed. You can also see which updates will cause a reboot. In some cases you can even downgrade via this screen.

Some of the components will appear as “Force” or “Disabled”. This means no update is required. If you click on the details link for these components you will usually see that the installed component is already at the version with SUM. If you want you can re-install/ overwrite some of these components. The ones you can overwrite are shown as “Force”; the ones you cannot are shown as “Disabled”. If you toggle “Force” it becomes “Forced”.
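To keep the three labels straight, here is how I understand them, written as a tiny Python function. This is only my reading of the screen and not SUM's actual logic; the function and parameter names are made up.

```python
def component_label(update_required, can_overwrite, force_toggled=False):
    """Return the label I would expect SUM to show for a component.

    Hypothetical sketch of the behaviour described above:
    - no update required, cannot overwrite         -> "Disabled"
    - no update required, can overwrite            -> "Force"
    - same, after toggling the "Force" button      -> "Forced"
    Returns None when an update is required, since that case is not
    covered by the description above.
    """
    if update_required:
        return None
    if not can_overwrite:
        return "Disabled"
    return "Forced" if force_toggled else "Force"


if __name__ == "__main__":
    print(component_label(False, True))        # already current, overwritable
    print(component_label(False, True, True))  # same, after toggling
    print(component_label(False, False))       # already current, not overwritable
```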

SUM can be run via a GUI. In this case the GUI is actually served by a web server that you point your browser to. Or you can run it via the command-line. The latter gives you finer control over the process, I think.

Upgrading iLO firmware manually (working around a stuck HP logo screen when updating)

For the past two weeks I have been upgrading the iLO and ROM of all our servers (a bunch of HP DL 360s basically – Gen6 to Gen8), following which I upgrade them from ESXi 4.1 to 5.5. Side by side I have also been upgrading the iLO and ROM of our LeftHand/ StoreVirtual boxes, following which I upgrade them from LeftHand OS 8.5 to 12.0. Yes, I’ve been busy!

The interesting thing about the firmware upgrades is that even between servers of the same model, when upgrading with the same Service Pack for ProLiant (SPP) CD version, I get different errors. Some odd ones really. For instance some servers simply power off once the SPP CD boots, others give me a Pink Screen of Death, and yet others simply hang with the pulsating HP logo.

(Screenshots: pink screen; pulsating logo)

I couldn’t find any solutions for the servers that power off (I used SPP versions 2015.04, 2014.09, 2014.06 and 2013.02 – same results for all). I was able to work around the pink screen by using an older version (for instance, I was using 2015.04 and that failed but 2014.09 worked). And I sort of worked around the pulsating logo problem.

For the pulsating logo issue apparently the fix is to upgrade iLO first and then run the SPP. In my case the servers had really ancient versions of iLO – “1.87 06/03/2009” – so I upgraded them via the iLO webpage. The blog post I linked to earlier (and also this one) shows a way of updating iLO via SSH but that didn’t work for me sadly. (It could just be the web server I was running. I used TinyWeb to run a small web server off my desktop machine.)

Before upgrading iLO via SSH or the webpage, you need to get the iLO firmware first. That should be easy but I had trouble getting it. For anyone else looking for the latest and greatest version of iLO 2, this HP page is what you want (and the “Revision History” tab on that page gives you older versions too). That page lets you download versions of the firmware for flashing via Linux or Windows. I downloaded the Windows version, right clicked on it (it’s an EXE file), opened it with 7-Zip (any other zip tool should do), and extracted the contents. The result is a file with a name like “ilo2_225.bin”. This is the binary image of the iLO 2 firmware that you can flash via SSH or the webpage.

Flashing via the webpage is easy. Go to the “Administration” tab, click Browse to select this file, and click “Send firmware image”.

(Screenshot: GUI firmware)

Use a modern browser if you can. :-) I used the ancient version of IE on my server and that didn’t do anything, but when I used Firefox I was able to see a progress bar and the firmware actually got updated.

(Screenshot: GUI firmware flashing)

After doing this I was able to run the SPP without any issue.

Another thing I learnt is that for the LeftHand/ StoreVirtual servers, simply upgrading the OS or patching it is enough to upgrade the ROM too. So I could have saved some time for myself with the LeftHand/ StoreVirtual servers by updating the iLO (as above) and upgrading the OS. No need to run the SPP.

On a related note, I had some servers with an “Internal Health LED failed” error even though everything seemed to be alright with them. Upgrading the iLO sorted that out!

And while on the topic of iLO, I had some servers whose iLO was not responsive. I couldn’t ping the iLO IP address nor could I connect to it. I was able to fix some of those servers by completely powering off the server, removing the power cables, removing the iLO cable, waiting a few minutes, putting back the power cables and powering on the server, and, once it had loaded the OS, putting the iLO cable back in. (I have also read reports on the Internet where there was no need to remove/ re-insert the iLO cable, so YMMV.)

One server though had no luck – its iLO chip was faulty I guess. I tried to upgrade its iLO firmware and ROM by physically being in front of the server but it would hang at the pulsating logo as above. I think the faulty iLO was causing SPP to fail. Because of the faulty iLO, ESXi would also hang at “loading module ipmi_si_drv” for about 30 minutes each time it booted (or when I ran the installer to upgrade to 5.5). The solution is as detailed in this blog post. (Note: the argument is noipmiEnabled – I was mistakenly typing noipmiEnable the first few times and nothing happened.) Post-install I set the VMkernel.Boot.ipmiEnabled advanced configuration option to 0 (I unchecked it). This way I don’t have to enter the boot options each time.

That’s all!

Run HP USB Key Utility from Windows 7

The HP USB Key Utility can be used to write an SPP ISO to a bootable USB drive. If you try to install it on a desktop version of Windows, though, it refuses to run because it requires a Windows Server OS.

To work around this extract the files instead of installing it, and run the hpusbkey.exe file from the extracted files. That will work.

Also worth keeping in mind is the fact that the latest version of this utility – version 2.0.0.0 – is 64 bit only. You only need to use this version if you want to write the latest (2015.04 as of this post) SPP to a USB. That’s because the ISO for this version is greater than 4GB and older versions of the utility don’t support it. If you are on 32 bit Windows 7, be sure to use an older version such as 1.8.0.0 or 1.7.0.0.
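The choice of utility version boils down to a small decision, sketched below in Python. The function name is made up, and I am assuming "4GB" means 4 GiB (4 × 1024³ bytes); the version numbers are the ones mentioned above.

```python
FOUR_GB = 4 * 1024 ** 3  # assumption: "4GB" read as 4 GiB


def pick_usb_key_utility(iso_size_bytes, windows_is_64bit):
    """Pick an HP USB Key Utility version for a given SPP ISO.

    Hypothetical helper based on the notes above:
    - ISOs larger than 4GB need version 2.0.0.0, which is 64 bit only.
    - Smaller ISOs work with an older version such as 1.8.0.0.
    """
    if iso_size_bytes > FOUR_GB:
        if not windows_is_64bit:
            raise ValueError(
                "ISO is larger than 4GB and needs utility 2.0.0.0, "
                "which is 64 bit only"
            )
        return "2.0.0.0"
    return "1.8.0.0"


if __name__ == "__main__":
    print(pick_usb_key_utility(5 * 1024 ** 3, windows_is_64bit=True))
    print(pick_usb_key_utility(3 * 1024 ** 3, windows_is_64bit=False))
```

On 32 bit Windows 7 with a larger-than-4GB ISO there is no utility version from the above that fits, which is why the sketch raises an error in that case rather than guessing.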

Hope this helps someone!