Creative Commons Attribution 4.0 International License
© Rakhesh Sasidharan

Notes on Virtual Connect and firmware upgrading without network outage

I am yet to read this, but in case you didn’t know, there’s a book by HP on Virtual Connect. I haven’t used Virtual Connect at all, except for briefly seeing it for the first time when my colleagues showed it to me last month. I have to update the Virtual Connect firmware for our enclosures now, so I am looking into how I can do that. Here are some more documents I am yet to read; linking them here as a bookmark to myself:

  • A PDF giving an overview of Virtual Connect
  • A page with all the documentation HP has on Virtual Connect and related
  • A page with many whitepapers and manuals on how Virtual Connect works

Virtual Connect firmware upgrades can be done via HP SUM/ SPP. They can also be done independently via the Virtual Connect Support Utility (VCSU).

  • This PDF (which can be found via the second bullet point above) is very useful. It is a document outlining the steps involved in upgrading the Virtual Connect firmware. It’s from 2013 but I couldn’t find anything newer on HP’s website.
  • The above PDF is also linked to from this excellent blog post that talks about how to upgrade the firmware without any downtime. 
  • VCSU can be downloaded from this page.
  • Here’s a page with some of its more useful commands.
  • Finally, this page has the latest version of the firmware. You can download the Windows version and extract the binary image of the firmware.

Upgrading the Virtual Connect firmware seems very straightforward. As I said you can do it via the SUM/ SPP too. Recommended order is to first upgrade the OA (easily done via SUM/ SPP – no reboot required); then the ROM, iLO, and any other firmware for the blades (again easily done via SUM/ SPP – ROM & iLO don’t require any reboot); and finally the VC.  For me the big question was whether I can do the VC upgrade without any network impact.

The PDF I mentioned above (this one) is a must read on the upgrade process. Page 10 onwards talks about the upgrade process.

One thing to note is that before the upgrade VCSU (which is what SUM/ SPP also use behind the scenes, I suppose) takes a backup of the configuration and does health checks. If the VCs don’t pass the health checks the upgrade doesn’t happen. Each Ethernet module of the VC takes about 20 minutes to upgrade; each FC module takes about 5 mins. An overview of the upgrade process can be found on page 11 – in short, here’s what happens:

  1. Via SFTP the new firmware is copied in parallel to all modules.
  2. Firmware is upgraded on all modules in parallel. This can be thought of as the update phase.
  3. Then the firmware is activated. The default order is odd-even, in which modules on the odd side of the enclosure are activated first, then those on the even side.
    1. It is also possible to do serial activation (one after the other), or parallel (everything at the same time), or manual.
  4. Post activation the module is rebooted.
    1. I am not very clear here but it seems the modules on the backup VC side of things (including the backup VC) get rebooted first.
    2. Then the modules on the primary VC side of things (except the primary VC) get rebooted.
    3. Failover VC Manager (VCM) to the backup VC module, and then the primary VC module is rebooted.
    4. Post-reboot the VCM fails over back to the primary VC module (this is only for the Ethernet modules I think, not FC).
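To make the default odd-even activation order concrete, here is a minimal Python sketch. It assumes that "odd side" simply means the odd-numbered interconnect bays; the function name and the bay numbers are hypothetical and only model the ordering described above, not anything VCSU itself exposes.

```python
def odd_even_activation_order(bays):
    """Return bay numbers with odd-numbered bays first, then even-numbered bays.

    Assumption: "odd side" of the enclosure means odd-numbered bays.
    """
    odd = sorted(b for b in bays if b % 2 == 1)
    even = sorted(b for b in bays if b % 2 == 0)
    return odd + even


if __name__ == "__main__":
    # Hypothetical enclosure with VC modules in bays 1 to 4.
    print(odd_even_activation_order([1, 2, 3, 4]))
```

Serial activation would simply be the sorted bay list, and parallel activation has no order at all.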

Notice the bit about the reboots above? That’s when network connectivity can be lost. On page 12 the document talks about how network outages can be avoided via redundant configuration and NIC bonding, but on page 13 it clarifies that because the reboot is a graceful one there could be a 20 second network outage: the blade hardware (and the OS running on it) might not be notified that the VC module is down. You see, SmartLink and the DCC protocol are responsible for informing the blades that a VC module is down, and so the NICs mapped to it are down, so that they fail over to another NIC using the backup VC. But because the firmware is being upgraded, SmartLink and DCC are unavailable and no one informs the blades. So it is only when the OS on a blade realizes that it has lost network connectivity and must take corrective action that it fails over to the backup NIC, leading to a potential 20 second outage.

(What I said above is also what this blog post mentions. To give credit I came across the blog post first and through it the guide).

The workaround to the above outage is to set the activation order as manual. And then reset the VC modules through the OA. Since that’s a reset – as opposed to a graceful reboot – the blades will get a notification immediately that the module is down.

Here’s how I updated the VC firmware on my servers without any network outage. First I used VCSU (in update mode) to update & activate the VC modules. Note that I selected “manual” as the option in two places below.

I set a time of 5 mins to wait between activation of each VC module. That’s generally recommended.

After that I got the screens below – the whole process took about 40 minutes:

That completes the update, but the new firmware isn’t active yet because I chose not to reboot the modules. Because of that there’s no network downtime so far.

After that I logged into the OA, went over to the Interconnect Bays section > selected the first VC module > Virtual Buttons tab > and clicked Reset.

[Screenshot: VC module reset]

This resets the VC module. Again no network outage (I was continually pinging some of the hosts and the VMs – one of the VMs had 3 packet drops, that’s it; the hosts I pinged had no drops). Post resetting (which is instant on the UI) I waited some 5 mins, then checked the Information tab to see the firmware level. It was showing the new firmware:

[Screenshot: firmware info]

After that I did the same (reset) for the second VC module. I waited 5-6 mins and then ran VCSU again (in healthcheck mode) to confirm the state of the modules. (To make the output smaller I used input switches to VCSU. I could have done the same above too).

As can be seen the modules are in sync and both are on the latest firmware version. All done without any network outage! :)

Update Jan 2016: Chris Lynch (from HPE) wrote to me three months ago clarifying some misinformation in my post above. Turns out you no longer have the 20 second outage and all that I wrote above is more or less incorrect. :) Rather than copy paste his email here I’ve printed it to a PDF and you can read it here – Chris Lynch update. Thanks Chris!

HP SUM/ SPP configuration location

Before I forget – HP SUM & HP SPP store their configuration stuff in the following folder – C:\Users\(Username)\AppData\Local\Temp\2\HPSUM. I spent a while today trying to discover where this information is stored. Thought it would be in the same folder as HP SUM or perhaps in the registry, but no – it’s stored as above!
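As a small aid for finding that folder for a given user, here is a Python sketch that builds the path. The `2` under `Temp` looks like a per-session subfolder, so it is kept as a parameter; that interpretation, the function name, and the example username are assumptions rather than anything HP documents.

```python
import ntpath


def hpsum_config_dir(username, temp_subdir="2"):
    """Build the HP SUM configuration folder path described in the post.

    Assumption: the "2" under Temp may differ per session, so it is a parameter.
    """
    return ntpath.join(
        "C:\\Users", username, "AppData", "Local", "Temp", temp_subdir, "HPSUM"
    )


if __name__ == "__main__":
    # "alice" is a hypothetical username.
    print(hpsum_config_dir("alice"))
```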

In my case HP SUM was acting weird and not talking to all my nodes properly. It did so correctly in the beginning, and even updated a few, but after that it kept hanging at the inventory stage and would complain about the username/ password being wrong. Figured I’d nuke it and start again but couldn’t make much progress until I figured the above.

Pause a DNS zone on all DNS servers

Here’s how I paused a zone on all the DNS servers hosting that zone:

This looks up the name servers for the zone and suspends the zone on each of those servers. If there are any servers that host this zone but aren’t specified as name servers for the zone (for example it could be an AD integrated zone but the NS records are incomplete) it misses out those servers. So it’s not a great script, there’s probably better ways to do this.
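The logic can be sketched roughly as follows. This is not the script from the post; it is a minimal Python illustration that takes a list of server names (however you obtained them, for example from the zone's NS records) and builds one `Suspend-DnsServerZone` command line per unique server. The zone and server names are hypothetical.

```python
def suspend_zone_commands(zone, servers):
    """Return one Suspend-DnsServerZone command line per unique server.

    Server names are normalised (trailing dot removed, lower-cased) so the
    same server listed twice is only targeted once.
    """
    seen = set()
    commands = []
    for server in servers:
        name = server.rstrip(".").lower()
        if not name or name in seen:
            continue
        seen.add(name)
        commands.append(
            f'Suspend-DnsServerZone -Name "{zone}" -ComputerName "{name}"'
        )
    return commands


if __name__ == "__main__":
    # Hypothetical NS records for a hypothetical zone.
    for line in suspend_zone_commands(
        "some.zone.com", ["DC01.example.local.", "dc02.example.local"]
    ):
        print(line)
```

Feeding it a list of domain controllers instead of NS records covers servers that host the zone without being listed as name servers.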

In my case the zone in question was being replicated to all DCs in the domain. So I got a list of all DCs in the domain and targeted those instead:

 

Stub zones do not need zone transfer (with screenshots!)

I had to write an email about this and so took the trouble to set up a test zone and create screenshots. Figured I might as well put it in a blog post too.

Exhibit A: An AD integrated zone called some.zone.com.

[Screenshot: somezone]

It doesn’t matter that it’s AD integrated or what NS records it holds. I just created an AD integrated zone to simulate our work environment.

Note that this zone doesn’t have zone transfers enabled.

[Screenshot: no zone transfer]

Exhibit B: A regular Windows Server 2012 machine called WIN-SVR01. Not domain joined (just in case anyone points out that could make a difference). It has access to the master server and Name Servers and that’s it. Create a stub zone as usual, pointing it to the master servers (in the screenshot below I point to just one master server).

[Screenshot: new stub zone]

Exhibit C: And that’s it! As soon as I do the above, the zone loads and I am able to query records in it.

[Screenshot: stub zone works]

That’s it!

One source of confusion seems to be the Get-DnsServerZone cmdlet. Here’s the cmdlet output once the stub zone has loaded:

Note the attributes LastZoneTransferAttempt and LastZoneTransferResult – these give the impression a zone transfer is being carried out.

Now watch the same output after I recreated the stub zone but this time I blocked it from accessing the master servers (so the stub zone doesn’t load):

Even though the zone hasn’t loaded LastSuccessfulZoneTransfer gives the impression it has succeeded. LastZoneTransferResult gives an error code though. Best to ignore these attributes for stub zones – as I showed above stub zones don’t require a zone transfer.

Stub zones do not need zone transfer

At work there was some confusion that creating stub zones requires the master servers to allow zone transfers to the servers holding the stub zones. That’s not correct and oddly I couldn’t find any direct hits when I typed this query into Google so I could show some blog posts/ articles for support.

I mentioned this briefly in one of my earlier blog posts. But don’t take my word for it; here are two blog posts and a book excerpt mentioning the same:

If stub zones don’t require any zone transfers what’s the difference between them and conditional forwarders? Again, check the second link above, but the long and short of it is that stub zones query the master server IPs you give, ask these servers for a list of NS records and their addresses, and then query these name servers for whatever record you want; conditional forwarders have a predefined list of name servers and so always query these for the record you want. Stub zones are more resilient to changes. If the remote end adds/ removes a name server the stub zone will automatically pick it up as long as the master servers are up-to-date and reachable. A conditional forwarder won’t do this automatically: the remote end admins will have to communicate the new name server details to the conditional forwarder admins, who will then have to update it at their end.
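The difference can be illustrated with a toy model in Python. It involves no real DNS traffic; the class names and server names are hypothetical and only mimic the behaviour described above, where a stub zone re-learns the name server list from the master while a conditional forwarder keeps the list it was configured with.

```python
class RemoteZone:
    """Stands in for the master server's view of the zone's NS records."""

    def __init__(self, name_servers):
        self.name_servers = list(name_servers)


class StubZone:
    """Asks the master for the current NS list whenever it needs one."""

    def __init__(self, master):
        self.master = master

    def servers_to_query(self):
        return list(self.master.name_servers)


class ConditionalForwarder:
    """Always uses the list of servers it was configured with."""

    def __init__(self, servers):
        self.servers = list(servers)

    def servers_to_query(self):
        return list(self.servers)


if __name__ == "__main__":
    remote = RemoteZone(["ns1.example.com", "ns2.example.com"])
    stub = StubZone(remote)
    forwarder = ConditionalForwarder(remote.name_servers)

    # The remote admins add a name server without telling anyone.
    remote.name_servers.append("ns3.example.com")

    print(stub.servers_to_query())       # includes ns3
    print(forwarder.servers_to_query())  # still only ns1 and ns2
```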

Hope that clarifies!

p.s. See my next post.

p.p.s. Hadn’t thought of this. Good point (via):

Stub zones: will use whatever is in the NS records of the zone (or descendants of the zone, if not otherwise defined) to resolve queries  which are below a zone cut.

Forward zones: will always use the configured forwarders, which must support recursion, even for names which are known to be deeper in the delegation hierarchy and whose delegated/authoritative nameservers might respond more quickly than the forwarders, if asked.

The “Administrators” group

Note to self: the “Domain Admins” and “Enterprise Admins” groups aren’t the primary groups in a domain. The primary group is the “Administrators” group, present in the “Builtin” folder. The other two groups are members of this group and thus get rights over the domain. The “Enterprise Admins” group is also a member of the “Administrators” group in all other domains/ child-domains of that forest, hence its members get rights over those domains too.

So if you want to create a separate group in your domain and want to give its members domain admin rights over (say) a child domain, all you need to do is create the group (must be Global or Universal) and add this group to the “Administrators” group in the child domain. That’s it!
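The nesting described above can be pictured with a small Python sketch that expands group membership recursively. It is a toy model with made-up users and a hypothetical custom group, not a query against a real directory.

```python
def effective_members(group, membership, _seen=None):
    """Return all users who are members of `group`, directly or via nesting.

    `membership` maps a group name to its direct members; any member that is
    itself a key in `membership` is treated as a nested group.
    """
    seen = set() if _seen is None else _seen
    users = set()
    for member in membership.get(group, []):
        if member in membership:
            # Nested group: expand it once, guarding against loops.
            if member not in seen:
                seen.add(member)
                users |= effective_members(member, membership, seen)
        else:
            users.add(member)
    return users


if __name__ == "__main__":
    # Hypothetical child-domain layout; "Child Domain Operators" is made up.
    child_domain = {
        "Administrators": ["Domain Admins", "Enterprise Admins", "Child Domain Operators"],
        "Domain Admins": ["alice"],
        "Enterprise Admins": ["bob"],
        "Child Domain Operators": ["carol"],
    }
    print(sorted(effective_members("Administrators", child_domain)))
```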

Cannot ping an address but nslookup works (contd)

Earlier today I had blogged about nslookup working but ping and other methods not resolving names to IP addresses. That problem started again, later in the evening.

This morning though, as a precaution, I had enabled the DNS Client logs on my computer. (To do that open Event Viewer with an admin account, go down to Applications and Services Logs > Microsoft > Windows > DNS Client Events > Operational, and click “Enable log” in the “Actions” pane on the right).

That showed me an error along the following lines:

A name not found error was returned for the name vcenter01.rakhesh.local. Check to ensure that the name is correct. The response was sent by the server at 10.50.1.21:53.

Interesting. So it looked like a particular DC was the culprit. Most likely when I restarted the DNS Client service it just chose a different DC and the problem temporarily went away. And sure enough nslookup for this record against this DNS server returned no answers.

I fired up DNS Manager and looked at this server. It seemed quite outdated with many missing records. This is my simulated branch office DC so I don’t always keep it on/ online. Looks like that was coming back to bite me now.

The DNS logs in Event Viewer on that server had errors like this:

The DNS server was unable to complete directory service enumeration of zone TrustAnchors.  This DNS server is configured to use information obtained from Active Directory for this zone and is unable to load the zone without it.  Check that the Active Directory is functioning properly and repeat enumeration of the zone. The extended error debug information (which may be empty) is “”. The event data contains the error.

So Active Directory is the culprit (not surprising as these zones are AD integrated so the fact that they weren’t up to date indicated AD issues to me). I ran repadmin /showrepl and that had many errors:

Naming Context: CN=Configuration,DC=rakhesh,DC=local

Source: COCHIN\WIN-DC03
******* WARNING: KCC could not add this REPLICA LINK due to error.

Naming Context: DC=DomainDnsZones,DC=rakhesh,DC=local
Source: COCHIN\WIN-DC03
******* WARNING: KCC could not add this REPLICA LINK due to error.

Naming Context: DC=ForestDnsZones,DC=rakhesh,DC=local
Source: COCHIN\WIN-DC03
******* WARNING: KCC could not add this REPLICA LINK due to error.

Naming Context: DC=rakhesh,DC=local
Source: COCHIN\WIN-DC03
******* WARNING: KCC could not add this REPLICA LINK due to error.

Naming Context: CN=Configuration,DC=rakhesh,DC=local
Source: COCHIN\WIN-DC01
******* WARNING: KCC could not add this REPLICA LINK due to error.

Naming Context: DC=DomainDnsZones,DC=rakhesh,DC=local
Source: COCHIN\WIN-DC01
******* WARNING: KCC could not add this REPLICA LINK due to error.

Naming Context: DC=ForestDnsZones,DC=rakhesh,DC=local
Source: COCHIN\WIN-DC01
******* WARNING: KCC could not add this REPLICA LINK due to error.

Naming Context: DC=rakhesh,DC=local
Source: COCHIN\WIN-DC01
******* WARNING: KCC could not add this REPLICA LINK due to error.

Great! I fired up AD Sites and Services and the links seemed ok. Moreover I could ping the DCs from each other. Event Logs on the problem DC (WIN-DC02) had many entries like this though:

The Kerberos client received a KRB_AP_ERR_MODIFIED error from the server win-dc01$. The target name used was Rpcss/WIN-DC01. This indicates that the target server failed to decrypt the ticket provided by the client. This can occur when the target server principal name (SPN) is registered on an account other than the account the target service is using. Please ensure that the target SPN is registered on, and only registered on, the account used by the server. This error can also happen when the target service is using a different password for the target service account than what the Kerberos Key Distribution Center (KDC) has for the target service account. Please ensure that the service on the server and the KDC are both updated to use the current password. If the server name is not fully qualified, and the target domain (RAKHESH.LOCAL) is different from the client domain (RAKHESH.LOCAL), check if there are identically named server accounts in these two domains, or use the fully-qualified name to identify the server.

Hmm, secure channel issues? I tried resetting it but that too failed:

(Ignore the above though. Later I realized that this was because I wasn’t running command prompt as an admin. Because of UAC, even though I was logged in as admin I should have right clicked and run command prompt as admin).

Since I know my environment it looks likely to be a case of this DC losing trust with other DCs. The KRB_AP_ERR_MODIFIED error also seems to be related to Windows Server 2003 and Windows Server 2012 R2 but mine wasn’t a Windows Server 2003. That blog post confirmed my suspicions that this was password related.

Time to check the last password set attribute for this DC object on all my other DCs. Time to run repadmin /showobjmeta.

The above output gives metadata of the WIN-DC02 object on WIN-DC01. I am interested in the pwdLastSet attribute and its timestamp.  Here’s a comparison of this across my three DCs:

That confirms the problem. WIN-DC02 thinks its password last changed on 9th May whereas WIN-DC01 changed it on 25th July and replicated it to WIN-DC03.
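The comparison boils down to finding which DC holds an older value than the newest one seen anywhere. Here is a minimal Python sketch of that check, using the dates mentioned above as (month, day) pairs; the year is left out because the post doesn’t state it, so the sketch assumes all dates fall in the same year.

```python
def stale_copies(pwd_last_set_by_dc):
    """Return the DCs whose pwdLastSet value is older than the newest one.

    Values are (month, day) tuples; assumption: all within the same year.
    """
    newest = max(pwd_last_set_by_dc.values())
    return sorted(
        dc for dc, stamp in pwd_last_set_by_dc.items() if stamp < newest
    )


if __name__ == "__main__":
    views = {
        "WIN-DC01": (7, 25),  # 25th July
        "WIN-DC02": (5, 9),   # 9th May
        "WIN-DC03": (7, 25),  # 25th July, replicated from WIN-DC01
    }
    print(stale_copies(views))
```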

Interestingly that date of 25th July is when I first started having problems in my test lab. I thought I had sorted them but apparently they were only lurking beneath. The solution here is to reset the WIN-DC02 password on itself and WIN-DC01 and replicate it across. The steps are in this KB article, here’s what I did:

  1. On WIN-DC02 (the problem DC) I stopped the KDC service and set it to start Manual.
  2. Purge the Kerberos ticket cache. You can view the ticket cache by typing the command: klist.  To purge, do: klist purge.
  3. Open a command prompt as administrator (i.e. right click and do a “Run as administrator”) then type the following command: netdom resetpwd /server:WIN-DC01.rakhesh.local /UserD:MyAdminAccount /PasswordD:*
  4. Restart WIN-DC02.
  5. After logon start the KDC service and set it to Automatic.

Checked the Event Logs to see if there are any errors. None initially but after I forced a sync via repadmin /syncall /e I got a few. All of them had the following as an error:

2148074274 The target principal name is incorrect.

Odd. But at least it was different from the previous errors and we seemed to be making progress.

After a bit of trial and error I noticed that whenever the KDC service on the DC was stopped it seemed to work fine.

I could access other servers (file shares), connect to them via DNS Manager, etc. But start KDC and all these would break with errors indicating the target name was wrong or that “a security package specific error occurred”. Eventually I let KDC stay off, let the domain sync via repadmin /syncall, and waited a fair amount of time (about 15-20 mins) for things to settle. I kept an eye on repadmin /replsummary to see the deltas between WIN-DC02 and the rest, and also kept an eye on the DNS zones to see if WIN-DC02 was picking up newer entries from the others. Once these two looked positive, I started KDC. And finally things were working!

 

Brief notes on HP SUM and SPP

HP SUM (Smart Update Manager) can be downloaded from http://h17007.www1.hp.com/us/en/enterprise/servers/products/service_pack/hpsum/index.aspx. This is just the tool. Its home page is http://www8.hp.com/us/en/products/server-software/product-detail.html?oid=5182020. As of this post date the home page says the latest version is 7.3.0 but the download page only has 7.1.0. Not sure why.

I am on Windows so I downloaded the ISO and the ZIP file (which can be found later on in the page). The ISO file is bootable. You can add firmware and drivers to this and boot up. The ZIP file has the HP SUM tool for Windows and Linux and can be extracted to these OSes and run from there. It’s not meant for booting up and deploying.

From Windows computers you can run HP SUM and update Windows, Linux, VMware, HP-UX, iLO, Virtual Connect, etc. From Linux computers you can do all these except Windows.

Documentation can be found at http://h17007.www1.hp.com/us/en/enterprise/servers/solutions/info-library/index.aspx?cat=smartupdate&subcat=hp_sum.

An SPP (Service Pack for ProLiant) is the SUM along with a set of firmware and drivers as of a certain date. These have been tested to ensure they work well together.

HP SUM only works with VMware if you are using the HP customized version of VMware. These can be found at http://www8.hp.com/us/en/products/servers/solutions.html?compURI=1499005#tab=TAB4. If your installation of VMware is not an HP customized version then the inventory step will fail with an error that the username/ password is incorrect.

A baseline is a set of updates that you want all the nodes added into SUM to be at. If you run SUM from an SPP then the baseline is that of the SPP, for example 2015.04 if you are running the 2015.04 SPP. SUM creates a baseline from the packages you add to it the first time it runs. In addition to a baseline you can also add extra components (I am not too sure about that, haven’t played with it).

So you create a baseline (or it happens implicitly). You add nodes and do an inventory of the nodes. That tells you what’s present on the system. Then in the next screen you review what needs to be done and deploy accordingly. On this screen you can choose whether reboots happen or should be delayed. You can also see which updates will cause a reboot. In some cases you can even downgrade via this screen.

Some of the components will appear as “Force” or “Disabled”. This means no update is required. If you click on the details link for these components you will usually see that the installed component is already at the version with SUM. If you want you can re-install/ overwrite some of these components. The ones you can overwrite are shown as “Force”; the ones you cannot are shown as “Disabled”. If you toggle “Force” it becomes “Forced”.

SUM can be run via a GUI. In this case the GUI is actually served by a web server you have to point your browser to. Or you can run it via the command-line. The latter gives you finer control over the process, I think.

Extract secret keys from Two-Factor Authentication (TFA) QR codes

Got my Pebble Time yesterday! Yay. Found a cool app for Two-Factor Authentication codes called QuickAuth (it’s open source too, amazing!).

The app requires you to enter the secret keys for your Two-Factor Authentication sites. Unfortunately I never saved these when I set up TFA on my devices. I was smart enough to save the QR code for each site and this way I was always able to add new devices by just scanning the saved QR code, but now I had to enter the secret key and I was stuck. 

Enter another open source project: Zebra Crossing (zxing). This is a library for processing QR codes and they have an Android app called Barcode Scanner. Get this app, scan the QR code, and you get an output that starts with otpauth://. The secret key you want is the value of the secret parameter in that URI. Enter this into QuickAuth.
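If you would rather not pick the key out of the decoded text by eye, a few lines of Python can do it. This sketch assumes the decoded text follows the common `otpauth://TYPE/LABEL?secret=...` key URI layout; the account name, issuer, and secret in the example are made up.

```python
from urllib.parse import urlparse, parse_qs


def secret_from_otpauth(uri):
    """Extract the value of the `secret` parameter from an otpauth:// URI."""
    parsed = urlparse(uri)
    if parsed.scheme != "otpauth":
        raise ValueError("not an otpauth:// URI")
    secrets = parse_qs(parsed.query).get("secret")
    if not secrets:
        raise ValueError("no secret parameter found")
    return secrets[0]


if __name__ == "__main__":
    # Made-up example URI; paste your own decoded QR text here.
    example = "otpauth://totp/Example:alice@example.com?secret=JBSWY3DPEHPK3PXP&issuer=Example"
    print(secret_from_otpauth(example))
```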

If you don’t want to download the app there’s also an online interface to upload a QR code and decode. Nice!

p.s. In case it helps anyone: on the face of it there seems to be no easy way to delete a key/ site once you enter it into QuickAuth. Later I realized that if I long press the select button on the Pebble when it shows a code I get many options. One of these lets you delete the key/ site.

Cannot ping an address but nslookup works

Had an odd thing happen yesterday. I could nslookup names from my computer but couldn’t ping them. Ping would give an error as below:

Very odd.

This is not just ping, nothing in my system could connect to these addresses.

Nslookup works by querying the DNS servers directly. Ping and the OS work by telling the Windows DNS Resolver to query the DNS servers. Both should be querying the same servers (unless I specifically tell nslookup to check with someone else) so in theory both should work the same, but oddly they don’t.

I restarted the DNS Client service (called dnscache) and then everything began working as expected. Not sure what was really wrong …

DNS request timed out

Whenever I’d do an nslookup I noticed these timeout messages as below:

Everything seemed to be working but I was curious about the timeouts.

Then I realized the problem was that I had missed the root domain at the end of the name. You see, a name such as “www.msftncsi.com” isn’t an absolute name. For all the DNS resolver knows it could just be a hostname – like say “eu.mail” in a full name such as “eu.mail.somedomain.com”. To tell the resolver this is the full name one must terminate it with a dot, like so: “www.msftncsi.com.”. When we omit the dot in common practice the DNS resolver puts it in implicitly and so we don’t usually notice it.

The dot is what tells the resolver about the root of the domain name hierarchy. A name such as “rakhesh.com.” actually means the “rakhesh” domain in the “com” domain in the “.” domain. It is “.” that knows of all the sub-domains such as “com”, “io”, “net”, “pl”, etc.

In the case above since I had omitted the dot my resolver was trying to append my DNS search suffixes to the name “www.msftncsi.com” to come up with a complete name. I have search suffixes “rakhesh.local.” and “dyn.rakhesh.local.” (I didn’t put the dot while specifying the search suffixes but the resolver puts it in because I am telling it these are the absolute domain names) so the resolver was actually expanding “www.msftncsi.com” to “www.msftncsi.com.rakhesh.local.” and “www.msftncsi.com.dyn.rakhesh.local.” and searching for these. That fails and so I get these “DNS request timed out” messages.
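The expansion can be sketched in a few lines of Python. This is a simplified model of what is described above, not nslookup’s actual implementation: a name ending in a dot is used as-is, otherwise each search suffix is tried first and the bare name last. The function name is hypothetical.

```python
def candidate_names(name, search_suffixes):
    """List the absolute names tried for `name`, in order.

    Simplified model: suffixes are tried first, the name itself last.
    """
    if name.endswith("."):
        # Already absolute, so no suffixes are appended.
        return [name]
    candidates = [f"{name}.{suffix.rstrip('.')}." for suffix in search_suffixes]
    candidates.append(name + ".")
    return candidates


if __name__ == "__main__":
    suffixes = ["rakhesh.local", "dyn.rakhesh.local"]
    for candidate in candidate_names("www.msftncsi.com", suffixes):
        print(candidate)
```

With the trailing dot supplied, the list collapses to the single absolute name, which is why the timeouts disappear.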

If I re-try with the proper name the query goes through fine:

Just to eliminate any questions of whether a larger timeout is the solution, no it doesn’t help:

If I replace my existing DNS suffixes search list with the dot domain, that too helps (not practical because then I can’t query by just the hostname, I will always have to put in the fully qualified name):

Or I could tell nslookup to not use the DNS suffix search lists at all (a more practical solution):

So there you go. That’s why I was getting those timeout errors.

Updating Windows DNS Server root hints

Somehow I came upon the root hints of my Windows DNS Server today and had a thought to update it. Never done that before so why not give it a shot?

You can find the root hints by right clicking on the server and going to the ‘Root Hints’ tab.

[Screenshot: root hints]

Or you could click the server name in DNS Manager and select ‘Root Hints’ in the right pane. Either way you get to the screen above. From here you can add/ remove/ edit root server names and IP addresses. If you want to update this list you can do so entry by entry, or click the ‘Copy from Server’ button to update the list with a new bunch of entries. Note that ‘Copy from Server’ does not over-write the list, so you are better off removing all the entries first and then doing ‘Copy from Server’.

The ‘Copy from Server’ option had me stumped though. You can find the root hints on the IANA website – there’s an FTP link to the file containing root hints, as well as an HTTP link (http://www.internic.net/domain/named.root). I thought simply entering this in the ‘Copy from Server’ window should suffice but it doesn’t. Notice the OK  button is grayed out.

[Screenshot: copy from server]

The window says it wants a server name or IP address so I removed everything above except the server name and then clicked OK. That looked like it was doing something but then failed with a message that it couldn’t get the root hints. The message said the specified DNS server could not be contacted so that gave me the idea it was looking for a DNS server which had the root hints.

[Screenshot: searching for root hints]

[Screenshot: search fails]

So I tried inputting the name of one of my DNS servers. This DNS server knows of the root servers because it has them already. (You can verify that a server knows of the root hints via nslookup as below).

My DNS server doesn’t have an authoritative answer (notice the output above) because it only has the info that’s present with it by default. The real answers could have changed by now (and they often do; the root hints list that these servers come with can have outdated entries) but that’s fine because it has some answers at least. If I were to input this server’s name or IP address into the ‘Copy from Server’ dialog above, that DNS server gets the root hints from this DNS server and updates itself.

Even better though would be to put the IP address of one of the root servers returned above. Like a.root-servers.net which has an IPv4 address of 198.41.0.4. (Don’t go by the output above, you can get the latest IP addresses from IANA). If I query that address for the root servers it has an authoritative answer:

So there you have it. I put this IPv4 address into the ‘Copy from Server’ window and my server updated itself with the IP addresses. I noticed that it had missed some of the IPv6 addresses (not sure why, maybe because it can’t validate these?) but when I did a ‘Copy from Server’ again with the same IPv4 address, without removing any existing entries, this time it picked up all the addresses.
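The authoritative versus non-authoritative distinction above comes down to the AA bit in the DNS response header. As a rough illustration, this Python sketch builds the raw query for the NS records of the root (“.”) and checks the AA bit of a response. It follows the standard DNS message layout, but the function names and the query ID are hypothetical, and it does no network I/O on its own.

```python
import struct


def build_root_ns_query(query_id=0x1234, recursion_desired=False):
    """Build a DNS query asking for the NS records of the root zone."""
    flags = 0x0100 if recursion_desired else 0x0000
    # Header: ID, flags, QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0.
    header = struct.pack("!HHHHHH", query_id, flags, 1, 0, 0, 0)
    # Question: root name (a single zero byte), QTYPE=NS (2), QCLASS=IN (1).
    question = b"\x00" + struct.pack("!HH", 2, 1)
    return header + question


def is_authoritative(response):
    """Return True if the AA bit is set in a DNS response."""
    if len(response) < 12:
        raise ValueError("response shorter than a DNS header")
    return bool(response[2] & 0x04)


if __name__ == "__main__":
    print(build_root_ns_query().hex())
```

To try it for real you would send the query bytes over UDP to port 53 of a root server or of your own DNS server and pass the reply to `is_authoritative`.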

(Note to self: The %WINDIR%\System32\dns\cache.dns file seems to contain root hints. I replaced this file with the root hints from IANA but that did not update the server. When I updated the server as above and checked this file it hadn’t changed either. Restarting the DNS service didn’t update the file/ root hints either, so am not sure how this file comes into play).