Notes on Virtual Connect and firmware upgrading without network outage

I am yet to read this but in case you didn’t know there’s a book by HP on Virtual Connect. I haven’t used Virtual Connect at all except briefly see it for the first time when my colleagues showed it to me last month. I have to update the Virtual Connect firmware for our enclosures now so am looking into how I can do that. Here are some more documents I am yet to read; linking them here as a bookmark to myself:

  • A PDF giving an overview of Virtual Connect
  • A page with all the documentation HP has on Virtual Connect and related
  • A page with many whitepapers and manuals on how Virtual Connect works

Virtual Connect firmware can be done via HP SUM/ SPP. It can also be done independently via the Virtual Connect Support Utility (VCSU).

  • This PDF (which can be found via the second bullet point above) is very useful. It is a document outlining the steps involved in upgrading the Virtual Connect firmware. It’s from 2013 but I couldn’t find anything newer on HP’s website.
  • The above PDF is also linked to from this excellent blog post that talks about how to upgrade the firmware without any downtime. 
  • VCSU can be downloaded from this page.
  • Here’s a page with some of its more useful commands.
  • Finally, this page has the latest version of the firmware. You can download the version of Windows and extract the binary image of the firmware.

Upgrading the Virtual Connect firmware seems very straightforward. As I said you can do it via the SUM/ SPP too. Recommended order is to first upgrade the OA (easily done via SUM/ SPP – no reboot required); then the ROM, iLO, and any other firmware for the blades (again easily done via SUM/ SPP – ROM & iLO don’t require any reboot); and finally the VC.  For me the big question was whether I can do the VC upgrade without any network impact.

The PDF I mentioned above (this one) is a must read on the upgrade process. Page 10 onwards talks about the upgrade process.

One thing to note is that before upgrade VCSU (which is what SUM/ SPP too use behind the scenes I suppose) takes a backup of the configuration and does health checks. If the VCs don’t pass health checks the upgrade doesn’t happen. Each Ethernet module of the VC takes about 20 minute to upgrade; each FC module takes about 5 mins. An overview of the upgrade process can be found on page 11 – in short, here’s what happens:

  1. Via SFTP the new firmware is copied in parallel to all modules.
  2. Firmware is upgraded on all modules in parallel. This can be thought of as the update phase.
  3. Then the firmware are activated. The default order is odd-even in which modules on the odd side of the enclosure are activated, then those on the even side.
    1. It is also possible to do serial activation (one after the other), or parallel (everything at the same time), or manual.
  4. Post activation the module is rebooted.
    1. I am not very clear here but it seems the modules on the backup VC side of things (including the backup VC) get rebooted first.
    2. Then the modules on the primary VC side of things (except the primary VC) get rebooted.
    3. Failover VC Manager (VCM) to the backup VC module, and then the primary VC module is rebooted.
    4. Post-reboot the VCM fails over back to the primary VC module (this is only for the Ethernet modules I think, not FC).

Notice the bit about the reboots above? That’s when network connectivity can be lost. On page 12 the document talks about how network outages can be avoided via redundant configuration and NIC bonding but then on page 13 it clarifies that because the reboot is a graceful one there is a possibility that there could be a 20 second network outage because the blade hardware (and the OS running on it) might not be notified that the VC module is down. You see, something called the SmartLink and DCC protocol are responsible for informing the blades that the VC modules are down and so the NICs they map to are down – and so they should fail over to another NIC using the backup VC – but because the firmware is being upgraded the SmartLink and DCC protocol are unavailable, no one informs the blades. So it only when the OS in the blades realize that it has lost network connectivity and must take corrective action, does the OS fail over to using the backup NIC – leading to a potential 20 second outage.

(What I said above is also what this blog post mentions. To give credit I came across the blog post first and through it the guide).

The workaround to the above outage is to set the activation order as manual. And then reset the VC modules through the OA. Since that’s a reset – as opposed to a graceful reboot – the blades will get a notification immediately that the module is down.

Here’s how I updated the VC firmware on my servers without any network outage. First I used VCSU (in update mode) to update & activate the VC modules. Note I select “manual” as the option in two places below. 

I set a time of 5 mins to wait between activation of each VC module. That’s generally recommended.

After that I got the screens below – the whole process took about 40 minutes:

That completes the updating and activation but the firmware isn’t activated yet because I chose not to reboot. Because of that there’s no network downtime so far.

After that I logged into the OA, went over to the Interconnect Bays section > selected the first VC module > Virtual Buttons tab > and clicked Reset.

vc module reset

This resets the VC module. Again no network outage (I was continually pinging some of the hosts and the VMs – one of the VMs had 3 packet drops, that’s it; the hosts I pinged had no drops). Post resetting (which is instant on the UI) I waited some 5 mins, then checked the Information tab to see the firmware level. It was showing the new firmware:

firmware infoAfter that did the same (reset) for the second VC module. Waited 5-6 mins and then I ran VCSU again (in healthcheck mode) to confirm the state of the modules. (To make the output smaller I used input switches to VCSU. Could have done the same above too).

As can be seen the modules are in sync and both the latest firmware version. All done without any network outage! :)

Update Jan 2016: Chris Lynch (from HPE) wrote to me three months ago clarifying some misinformation in my post above. Turns out you no longer have the 20 second outage and all that I wrote above is more or less incorrect. :) Rather than copy paste his email here I’ve printed it to a PDF and you can read it here – Chris Lynch update. Thanks Chris!