Creative Commons Attribution 4.0 International License
© Rakhesh Sasidharan

Notes on Virtual Connect and firmware upgrading without network outage

I have yet to read it, but in case you didn’t know, there’s a book by HP on Virtual Connect. I haven’t used Virtual Connect at all, apart from briefly seeing it for the first time when my colleagues showed it to me last month. I now have to update the Virtual Connect firmware for our enclosures, so I am looking into how to do that. Here are some more documents I have yet to read; I’m linking them here as a bookmark for myself:

  • A PDF giving an overview of Virtual Connect
  • A page with all the documentation HP has on Virtual Connect and related
  • A page with many whitepapers and manuals on how Virtual Connect works

Virtual Connect firmware upgrades can be done via HP SUM/ SPP. They can also be done independently via the Virtual Connect Support Utility (VCSU).

  • This PDF (which can be found via the second bullet point above) is very useful. It is a document outlining the steps involved in upgrading the Virtual Connect firmware. It’s from 2013 but I couldn’t find anything newer on HP’s website.
  • The above PDF is also linked to from this excellent blog post that talks about how to upgrade the firmware without any downtime. 
  • VCSU can be downloaded from this page.
  • Here’s a page with some of its more useful commands.
  • Finally, this page has the latest version of the firmware. You can download the Windows version and extract the binary image of the firmware.

Upgrading the Virtual Connect firmware seems very straightforward. As I said you can do it via the SUM/ SPP too. Recommended order is to first upgrade the OA (easily done via SUM/ SPP – no reboot required); then the ROM, iLO, and any other firmware for the blades (again easily done via SUM/ SPP – ROM & iLO don’t require any reboot); and finally the VC.  For me the big question was whether I can do the VC upgrade without any network impact.

The PDF I mentioned above (this one) is a must read on the upgrade process. Page 10 onwards talks about the upgrade process.

One thing to note is that before the upgrade VCSU (which is what SUM/ SPP also use behind the scenes, I suppose) takes a backup of the configuration and does health checks. If the VCs don’t pass the health checks the upgrade doesn’t happen. Each Ethernet module of the VC takes about 20 minutes to upgrade; each FC module takes about 5 minutes. An overview of the upgrade process can be found on page 11. In short, here’s what happens:

  1. Via SFTP the new firmware is copied in parallel to all modules.
  2. Firmware is upgraded on all modules in parallel. This can be thought of as the update phase.
  3. Then the firmware is activated. The default order is odd-even, in which modules on the odd side of the enclosure are activated first, then those on the even side.
    1. It is also possible to do serial activation (one after the other), or parallel (everything at the same time), or manual.
  4. Post activation the module is rebooted.
    1. I am not very clear here but it seems the modules on the backup VC side of things (including the backup VC) get rebooted first.
    2. Then the modules on the primary VC side of things (except the primary VC) get rebooted.
    3. VC Manager (VCM) fails over to the backup VC module, and then the primary VC module is rebooted.
    4. Post-reboot the VCM fails over back to the primary VC module (this is only for the Ethernet modules I think, not FC).
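To make the activation orders concrete, here is a minimal Python sketch of how the modules might be grouped into activation batches. It is only an illustration, not how VCSU is implemented: the function name is made up, and I am assuming that the "odd side" of the enclosure simply means the odd-numbered interconnect bays.

```python
from typing import List


def activation_batches(bays: List[int], order: str = "odd-even") -> List[List[int]]:
    """Group interconnect bay numbers into batches that activate together.

    Hypothetical helper: assumes "odd side" means odd-numbered bays.
    """
    bays = sorted(bays)
    if order == "odd-even":
        odd = [b for b in bays if b % 2 == 1]
        even = [b for b in bays if b % 2 == 0]
        # Odd-side modules first, then even-side; skip a side with no modules.
        return [batch for batch in (odd, even) if batch]
    if order == "serial":
        # One module after the other.
        return [[b] for b in bays]
    if order == "parallel":
        # Everything at the same time.
        return [bays] if bays else []
    if order == "manual":
        # Nothing is activated automatically; the admin resets modules by hand.
        return []
    raise ValueError(f"unknown activation order: {order}")


if __name__ == "__main__":
    for order in ("odd-even", "serial", "parallel", "manual"):
        print(order, activation_batches([1, 2, 3, 4], order))
```

With odd-even or serial there is always at least one module still up while another reboots, which is what a redundant configuration relies on.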

Notice the bit about the reboots above? That’s when network connectivity can be lost. On page 12 the document talks about how network outages can be avoided via redundant configuration and NIC bonding, but on page 13 it clarifies that because the reboot is a graceful one, there could be a 20 second network outage: the blade hardware (and the OS running on it) might not be notified that the VC module is down. You see, SmartLink and the DCC protocol are responsible for informing the blades that a VC module is down, and therefore that the NICs mapped to it are down, so that they can fail over to another NIC using the backup VC. But while the firmware is being upgraded SmartLink and DCC are unavailable, so no one informs the blades. It is only when the OS on a blade realizes it has lost network connectivity and must take corrective action that it fails over to the backup NIC, leading to a potential 20 second outage.

(What I said above is also what this blog post mentions. To give credit I came across the blog post first and through it the guide).

The workaround to the above outage is to set the activation order to manual and then reset the VC modules through the OA. Since that’s a reset, as opposed to a graceful reboot, the blades get a notification immediately that the module is down.

Here’s how I updated the VC firmware on my servers without any network outage. First I used VCSU (in update mode) to update & activate the VC modules. Note that I selected “manual” as the option in two places.

I set a time of 5 mins to wait between activation of each VC module. That’s generally recommended.
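For reference, the same choices can probably be passed to VCSU as command-line switches instead of answering its prompts. The Python sketch below only builds and prints such a command line. The switch names (-a, -i, -u, -l, -oe, -of, -we, -wf) are what I recall from the VCSU user guide, so treat them as assumptions and check them against VCSU's built-in help first. The IP address, user name and firmware path are made-up placeholders.

```python
from typing import List


def build_vcsu_update_args(oa_ip: str, oa_user: str, firmware_path: str,
                           wait_minutes: int = 5) -> List[str]:
    """Build a VCSU 'update' command line with manual activation.

    All switch names are assumptions recalled from the VCSU user guide;
    verify them before running anything.
    """
    return [
        "vcsu", "-a", "update",
        "-i", oa_ip,               # OA address of the enclosure (assumed switch)
        "-u", oa_user,             # OA user name (assumed switch)
        "-l", firmware_path,       # firmware image extracted from the download
        "-oe", "manual",           # Ethernet module activation order
        "-of", "manual",           # FC module activation order
        "-we", str(wait_minutes),  # minutes to wait between Ethernet activations
        "-wf", str(wait_minutes),  # minutes to wait between FC activations
    ]


if __name__ == "__main__":
    # Placeholder values only; the password is deliberately not part of the list.
    args = build_vcsu_update_args("192.0.2.10", "Administrator", r"C:\firmware\vcfw.bin")
    print(" ".join(args))
```

The two "manual" values would correspond to the two places where manual is selected, one for the Ethernet modules and one for the FC modules, if my reading of the prompts is right.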

The whole process then took about 40 minutes.

That completes the update step in VCSU, but the new firmware isn’t running yet because I chose not to reboot the modules. Because of that there’s no network downtime so far.

After that I logged into the OA, went over to the Interconnect Bays section > selected the first VC module > Virtual Buttons tab > and clicked Reset.

This resets the VC module. Again no network outage (I was continually pinging some of the hosts and the VMs; one of the VMs had 3 packet drops, that’s it, and the hosts I pinged had no drops). After the reset (which is instant in the UI) I waited some 5 mins, then checked the Information tab to see the firmware level. It was showing the new firmware.
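If you would rather not eyeball ping windows during the reset, a small script can count the drops for you. This is a rough Python sketch, not what I actually ran: it shells out to the system ping command (using Windows or Linux style flags), and the function names are made up for illustration.

```python
import platform
import subprocess
import time
from typing import Iterable, List, Tuple


def ping_once(host: str, timeout_s: int = 1) -> bool:
    """Send one echo request via the system ping command.

    Returns True when ping exits with code 0. Note that Windows ping can
    exit 0 for some "unreachable" replies, so treat this as approximate.
    """
    if platform.system() == "Windows":
        cmd = ["ping", "-n", "1", "-w", str(timeout_s * 1000), host]
    else:  # Linux-style flags
        cmd = ["ping", "-c", "1", "-W", str(timeout_s), host]
    result = subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0


def summarize(results: Iterable[bool]) -> Tuple[int, int]:
    """Return (total drops, longest run of consecutive drops)."""
    drops = longest = run = 0
    for ok in results:
        if ok:
            run = 0
        else:
            drops += 1
            run += 1
            longest = max(longest, run)
    return drops, longest


def monitor(host: str, count: int, interval_s: float = 1.0) -> Tuple[int, int]:
    """Ping a host `count` times, roughly once per interval, and summarize."""
    results: List[bool] = []
    for _ in range(count):
        results.append(ping_once(host))
        time.sleep(interval_s)
    return summarize(results)
```

Something like `monitor("esx01.example.com", 600)` (a made-up host name) started just before the reset gives the total drops and the longest consecutive gap. The longest gap is the number to compare against the 20 second figure.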

After that I did the same (reset) for the second VC module. I waited 5-6 mins and then ran VCSU again (in healthcheck mode) to confirm the state of the modules. (To make the output smaller I passed input switches to VCSU. I could have done the same above too.)

The health check showed the modules in sync and both on the latest firmware version. All done without any network outage! :)

Update Jan 2016: Chris Lynch (from HPE) wrote to me three months ago clarifying some misinformation in my post above. Turns out you no longer have the 20 second outage and all that I wrote above is more or less incorrect. :) Rather than copy paste his email here I’ve printed it to a PDF and you can read it here – Chris Lynch update. Thanks Chris!

HP SUM/ SPP configuration location

Before I forget – HP SUM & HP SPP store their configuration stuff in the following folder – C:\Users\(Username)\AppData\Local\Temp\2\HPSUM. I spent a while today trying to discover where this information is stored. Thought it would be in the same folder as HP SUM or perhaps in the registry, but no – it’s stored as above!

In my case HP SUM was acting weird and not talking to all my nodes properly. It did so correctly in the beginning, and even updated a few, but after that it kept hanging at the inventory stage and would complain about the username/ password being wrong. I figured I’d nuke it and start again but couldn’t make much progress until I figured out the above.
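As a small aid, here is a Python sketch that builds that folder path for a given user. The function name is made up, and I am guessing that the "2" is a per-session temp subfolder that may differ on your machine, which is why it is a parameter rather than hard-coded.

```python
from pathlib import PureWindowsPath


def hpsum_config_dir(username: str, session_dir: str = "2") -> PureWindowsPath:
    """Return the folder where HP SUM keeps its configuration for a user.

    `session_dir` is the "2" from the path above; assumed to vary per session.
    """
    return (PureWindowsPath("C:/Users") / username / "AppData" / "Local"
            / "Temp" / session_dir / "HPSUM")


if __name__ == "__main__":
    # "alice" is a placeholder user name.
    print(hpsum_config_dir("alice"))
```

Renaming that folder rather than deleting it keeps a way back in case SUM stored something you still need.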

Brief notes on HP SUM and SPP

HP SUM (Smart Update Manager) can be downloaded from http://h17007.www1.hp.com/us/en/enterprise/servers/products/service_pack/hpsum/index.aspx. This is just the tool. Its home page is http://www8.hp.com/us/en/products/server-software/product-detail.html?oid=5182020. As of this post date the home page says the latest version is 7.3.0 but the download page only has 7.1.0. Not sure why.

I am on Windows so I downloaded the ISO and the ZIP file (which can be found later on in the page). The ISO file is bootable. You can add firmware and drivers to this and boot up. The ZIP file has the HP SUM tool for Windows and Linux and can be extracted to these OSes and run from there. It’s not meant for booting up and deploying.

From Windows computers you can run HP SUM and update Windows, Linux, VMware, HP-UX, iLO, Virtual Connect, etc. From Linux computers you can do all these except Windows.

Documentation can be found at http://h17007.www1.hp.com/us/en/enterprise/servers/solutions/info-library/index.aspx?cat=smartupdate&subcat=hp_sum.

An SPP (Service Pack for ProLiant) is SUM along with a set of firmware and drivers as of a certain date. These have been tested to ensure they work well together.

HP SUM only works with VMware if you are using the HP customized version of VMware. These can be found at http://www8.hp.com/us/en/products/servers/solutions.html?compURI=1499005#tab=TAB4. If your installation of VMware is not an HP customized version then the inventory step will fail with an error that the username/ password is incorrect.

A baseline is a set of updates that you want all the nodes added into SUM to be at. If you run SUM from an SPP then the baseline is that of the SPP, for example 2015.04 if you are running the 2015.04 SPP. SUM creates a baseline from the packages you add to it the first time it runs. In addition to a baseline you can also add extra components (I am not too sure about that, haven’t played with it).

So you create a baseline (or it happens implicitly). You add nodes and do an inventory of the nodes. That tells you what’s present on the system. Then in the next screen you review what needs to be done and deploy accordingly. On this screen you can choose whether reboots happen or should be delayed. You can also see which updates will cause a reboot. In some cases you can even downgrade via this screen.

Some of the components will appear as “Force” or “Disabled”. This means no update is required. If you click on the details link for these components you will usually see that the installed component is already at the version with SUM. If you want you can re-install/ overwrite some of these components. The ones you can overwrite are shown as “Force”; the ones you cannot are shown as “Disabled”. If you toggle “Force” it becomes “Forced”.
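The baseline, inventory and review steps boil down to comparing what is installed against what SUM has. Here is a toy Python sketch of that comparison, including the “Force”/“Disabled” labelling. It is my own illustration and not SUM's actual logic: the function names, the “Update” label for components that do need updating, and the component names and versions are all made up.

```python
from typing import Dict, List, Tuple


def component_status(installed: str, in_sum: str, can_overwrite: bool) -> str:
    """Label one component the way the review screen does (roughly)."""
    if installed != in_sum:
        # Hypothetical label; this sketch does not tell upgrades from downgrades.
        return "Update"
    # Already at the version SUM has: overwritable ones show as "Force",
    # the rest as "Disabled".
    return "Force" if can_overwrite else "Disabled"


def review(inventory: Dict[str, Tuple[str, bool]],
           baseline: Dict[str, str]) -> List[Tuple[str, str]]:
    """Compare a node's inventory against the baseline, component by component."""
    rows = []
    for name, (installed, can_overwrite) in sorted(inventory.items()):
        if name in baseline:
            rows.append((name, component_status(installed, baseline[name], can_overwrite)))
    return rows


if __name__ == "__main__":
    # Made-up component names and versions, for illustration only.
    baseline = {"iLO": "2.20", "System ROM": "2015.05", "NIC driver": "7.10"}
    inventory = {
        "iLO": ("2.20", True),
        "System ROM": ("2014.08", True),
        "NIC driver": ("7.10", False),
    }
    for name, status in review(inventory, baseline):
        print(f"{name}: {status}")
```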

SUM can be run via a GUI. In this case the GUI is actually served by a web server that you point your browser at. Or you can run it via the command line. The latter gives you finer control over the process, I think.