Contact

Subscribe via Email

Subscribe via RSS/JSON

Categories

Creative Commons Attribution 4.0 International License
© Rakhesh Sasidharan

Elsewhere

Removing a monitored resource from multiple nodes in Solarwinds

Had to remove a drive from being monitored from multiple servers in Solarwinds. Rather than go to each node, edit its properties and untick the drive, I figured that if you do a search for the nodes and expand each of them it’s possible to tick the ones you don’t need and click “Delete”.

multiple nodes

Nice!

Using Solarwinds to monitor Windows Services

This is similar to how I monitored performance counters with Solarwinds.

I want to monitor a bunch of AppSense services.

Similar to the performance counters where I created an application monitor template so I could define the threshold values, here I need to create an application monitor so that the service appears in the alert manager. This part is easy (and similar to what I did for the performance counters) so I’ll skip and just put a screenshot like below –

applimonitor

I created a separate application monitor template but you could very well add it to one of the standard templates (if it’s a service that’s to be monitored on all nodes for instance).

Now for the part where you create alerts.

Initially I thought this would be a case of creating triggers when the above application monitor goes down. Something like this – alert1

And create an alert message like this –

alert1a

With this I was hoping to get a one or more alert messages only for the services that actually went down. Instead, what happened is that whenever any one service went down I’d get an alert for the service that went down and also a message for the services that were up. Am guessing since I was triggering on the application monitor, Solarwinds helpfully sent the status for each of its components – up or down.

alert2

The solution is to modify your trigger such that you target each component.

alert3

Now I get alerts the way I want.

Hope this helps!

Mute Solarwinds alerts during reboots/ maintenance windows

I wanted to mute Solarwinds alerts during our patch weekends when all servers are rebooted because they have to be and our mailboxes get flooded with Solarwinds alerts. I decided to use custom properties for this purpose. Here’s what I did.

Login to the Solarwinds web console. Go to the “Settings” page, and then “Manage Custom Properties” under “Node *& Group Management”.

Click “Add Custom Property”, select the default of “Nodes” from the drop down, and create something along the following lines –

customproperties

Select the nodes you’d like to apply this custom property to. I chose to apply it on all my Windows and VMware nodes. Set the value to be “No”.

customvalues

Now login to Orion Alerts Manager and pick an alert you’d like to mute during patch weekends. Go to its “Alert Suppression” tab and add a condition on the custom property we created earlier.

alert custom properties

alert custom properties2

And that’s it, really!

Update: Not sure why, but the above didn’t seem to work for me. So I added the Mute_Alerts check as part of the trigger condition itself.

new trigger

Note: If you don’t get the custom property in Orion, close and restart it as an administrator (i.e. right click and do “Run as Administrator” even if you are already running it with an admin account). Not sure why, but until I did that the custom property didn’t get picked up. You only need to do it one time; after that you can launch Orion normally.

Next time your server estate is being rebooted/ undergoing maintenance, login to Solarwinds webconsole and change the “Mute_Alerts” custom property to “Y” for a node/ nodes that you want to mute alerts for. Below I show how I will mute the alerts for all my Windows nodes.

Go to “Manage Nodes”. Group by “Vendor” and select Windows. Then select all nodes. (The checkbox to select all nodes got blanked out in the screenshot below but it’s easy to find).

apply custom property

Then click on “Custom Property Editor” to get to the screen below.

apply custom property

Here too select all the nodes and click “Edit multiple values”.

From the drop down, change the value for “Mute_Alerts” to “true”. Then save changes and that’s it. :)

 

Using Solarwinds to monitor Windows Performance Monitor (perfmon) Counters

Had a request from our Exchange admin to setup Solarwinds alerts for some of our Exchange servers based on Performance Monitor counters.

MSExchangeTransport Queues(_total)\Active Remote Delivery Queue Length       (above 200)
MSExchangeTransport Queues(_total)\Largest Delivery Queue Length                 (above 200)
MSExchangeTransport Queues(_total)\Messages Queued For Delivery                (above 200)
MSExchangeTransport Queues(_total)\Retry Remote Delivery Queue Length        (above 20)

Before setting up alerts I need to add them to Solarwinds first. Here’s how you do that.

First, open up the Solarwinds web console, go to Applications, and then SAM Settings.

applicationssam settings

Then go to Component Monitor Wizard.

component monitor

 

Select Windows Performance Counter Monitor.

perfmon

Notice that it says the data is collected using RPC. This means (1) the server must be monitored by Solarwinds using WMI and not SNMP. In case of the latter, switch to monitoring via WMI. And (2) RPC ports must be open between the Solarwinds server and the target server. If not, monitoring will fail.

Enter the name of a server you wish to target. This server would be one that contains the perfmon counters you are interested in. You use this server to setup monitoring for the counters you are interested in. Change to 64bit if 32bit doesn’t work.

target

Change the “Choose Credential” drop down according to your environment. To select the server it’s better to click “Browse” and find the server you are interested in if Solarwinds complains that it cannot find the name you type in.

Note: The next step will fail if you have not opened the required RPC ports.

Select the counters you are interested in. First select the object you want to monitor (MSExchangeTransport Queues, in the screenshot below) and then the counters.

select counters

The next screen will list all the counters you selected and give you a chance to set warning and critical thresholds. Customize these.

 

properties

Select where you would like these counters added to – a new application monitor/ monitor template, or an existing application monitor/ monitor template. I am going with a new application monitor template. Easier to make changes to templates than individual application monitors.

whereadd

 

Choose more nodes you would like to assign this application monitor to. Am skipping this screenshot. This step is optional as you can assign the application monitor to nodes later too.

An optional step – I also went to Manage Application Templates screen after the above steps, selected the template I created, and assigned it some tags and set a custom view.

defineview

A custom view lets you define what details are shown when anyone clicks this application monitor template on a particular node in the Solarwinds web console. You can customize the view by going to Settings (of Solarwinds) and selecting Manage Views.

Next step is to create an alert. For that you have to logon to the Solarwinds server itself, go to Alert Manager, create a new alert (skipping screenshots for all these) and create a new alert whose condition is as follows:solarwinds trigger

Note that the type of property to monitor is “APM: Component”. This is important for the correct variables to be visible in the alert message. Also, note that I am triggering for each of the component (with an “any” condition) and not for the application monitor itself. This lets me get alerts for individual components; if I don’t do this, and instead trigger on the application monitor itself, I will get alert emails for each component including the ones that don’t have an issue.

Here’s the alert message:

solarwinds message

Power cycle/ Reset an HP blade server

Was getting the following error on one of our servers. It’s from ESXi. None of the NICs were working for the server (the NICs seemed to be working, just that the driver wasn’t loading). 

error

Power cycle required. 

I switched off and switched on the server but that didn’t help. Turns out that doesn’t actually power cycle the server (because the server still has power – doh!). What you need to do is do something called an e-fuse reset. This power cycles the blade. You have to do this by opening an SSH session to the Onboard Administrator, finding the bay number of the blade you want to power cycle, and typing the command reset server <bay number>

Good to know!

Note: The command does not appear when you type help, but it’s there:

Also, to get a list of your bays and servers use the show server list command. To do the same for interconnects use the show interconnect list command.

Unable to login to vSphere because the admin@system-domain password cannot be reset

vSphere 5.1 has admin@system-domain as the default admin account. vSphere 5.5 changes that to administrator@vsphere.local. However, if you upgrade from 5.1 to 5.5 the default admin account remains admin@system-domain. Which is fine and dandy until the password for this account expires. Then you are unable to reset or login! See below. :)

Trying to login as usual

1 - login

Password has expired, needs a reset

2 - reset

Reset fails though coz you can only reset for the vsphere.local domain

3 - reset fails

Missed out on taking a screenshot but if you were to try and login with administrator@vsphere.local instead you get an error that the credentials are invalid (because that account doesn’t exist!). So you are stuck!

What do you do?

Solution is to reset the admin password

When you do this vSphere automatically creates the administrator@vsphere.local account. Follow the steps in this KB article.

4 - reset password

Now you can login with administrator@vsphere.local and the generated password.

Notes on NLB, VMware, etc

Just some notes to myself so I am clear about it while reading about it. In the context of this VMware KB article – Microsoft NLB not working properly in Unicast mode.

Before I get to the article I better talk about a regular scenario. Say you have a switch and it’s got a couple of devices connected to it. A switch is a layer 2 device – meaning, it has no knowledge of IP addresses and networks etc. All devices connected to a switch are in the same network. The devices on a switch use MAC addresses to communicate with each other. Yes, the devices have IPv4 (or IPv6) addresses but how they communicate to each other is via MAC addresses.

Say Server A (IPv4 address 10.136.21.12) wants to communicate with Server B (IPv4 address 10.136.21.22). Both are connected to the same switch, hence on the same LAN. Communication between them happens in layer 2. Here the machines identify each other via MAC addresses, so first Server A checks whether it knows the MAC address of Server B. If it knows (usually coz Server A has communicated with Server B recently and the MAC address is cached in its ARP table) then there’s nothing to do; but if it does not, then Server A finds the MAC address via something called ARP (Address Resolution Protocol). The way this works is that Server A broadcasts to the whole network that it wants the MAC address of the machine with IPv4 address 10.136.21.22 (the address of Server B). This message goes to the switch, the switch sends it to all the devices connected to it, Server B replies with its MAC address and that is sent to Server A. The two now communicate – I’ll come to that in a moment.

When it’s communication from devices in a different network to Server A or Server B, the idea is similar except that you have a router connected to the switch. The router receives traffic for a device on this network – it knows the IPv4 address – so it finds the MAC address similar to above and passes it to that device. Simple.

Now, how does the switch know which port a particular device is connected to. Say the switch gets traffic addresses to MAC address 00:eb:24:b2:05:ac – how does the switch know which port that is on? Here’s how that happens –

  • First the switch checks if it already has this information cached. Switches have a table called the CAM (Content Addressable Memory) table which holds this cached info.
  • Assuming the CAM table doesn’t have this info the switch will send the frame (containing the packets for the destination device) to all ports. Note, this is not like ARP where a question is sent asking for the device to respond; instead the frame is simply sent to all ports. It is broadcast to the whole network.
  • When a switch receives frames from a port it notes the source MAC address and port and that’s how it keeps the CAM table up to date. Thus when Server A sends data to Server B, the MAC address and switch port of Server A are stored in the switch’s CAM table.  This entry is only stored for a brief period.

Now let’s talk about NLB (Network Load Balancing).

Consider two machines – 10.136.21.11 with MAC address 00:eb:24:b2:05:ac and 10.136.21.12 with MAC address 00:eb:24:b2:05:ad. NLB is a form of load balancing wherein you create a Virtual IP (VIP) such as 10.136.21.10 such that any traffic to 10.136.21.10 is sent to either of 10.136.21.11 or 10.136.21.12. Thus you have the traffic being load balanced between the two machines; and not only that if any one of the machines go down, nothing is affected because the other machine can continue handling the traffic.

But now we have a problem. If we want a VIP 10.136.21.10 that should send traffic to either host, how will this work when it comes to MAC addresses? That depends on the type of NLB. There’s two sorts – Unicast and Multicast.

In Unicast the NIC that is used for clustering on each server has its MAC address changed to a new Unicast MAC address that’s the same for all hosts. Thus for example, the NIC that holds the NLB IP address 10.136.21.10 in the scenario above will have its MAC address changed from 00:eb:24:b2:05:ac and 00:eb:24:b2:05:ad respectively to (say) 00:eb:24:b2:05:af. Note that the MAC address is a Unicast MAC (which basically means the MAC address looks like a regular MAC address, such as that assigned to a single machine). Since this is a Unicast MAC address, and by definition it can only be assigned to one machine/ switch port, the NLB driver on each machines cheats a bit and changes the source MAC address address to whatever the original NIC MAC address was. That is to say –

  • Server IP 10.136.21.11
    • Has MAC address 00:eb:24:b2:05:ac
    • Which is changed to a MAC address of 00:eb:24:b2:05:af as part of the Unicast IP/ enabling NLB
    • However when traffic is sent out from this machine the MAC address is changed back to 00:eb:24:b2:05:ac
  • Same for Server 10.136.21.12

Why does this happen? This is because –

  • When a device wants to send data to the VIP address, it will try find the MAC address using ARP. That is, it sends a broadcast over the network asking for the device with this IP address to respond. Since both servers now have the same MAC address for their NLB NIC either server will respond with this common MAC address.
  • Now the switch receives frames for this MAC address. The switch does not have this in its CAM table so it will broadcast the frame to all ports – reaching either of the servers.
  • But why does outgoing traffic from either server change the MAC address of outgoing traffic? That’s because if outgoing frames have the common MAC address, then the switch will associate this common MAC address with that port – resulting in all future traffic to the common MAC address only going to one of the servers. By changing the outgoing frame MAC address back to the server’s original MAC address, the switch never gets to store the common MAC address in its CAM table and all frames for the common MAC address are always broadcast.

In the context of VMware what this means is that (a) the port group to which the NLB NICs connect to must allow changes to the MAC address and allow forged transmits; and (b) when a VM is powered on the port group by default notifies the physical switch of the VMs MAC address, since we want to avoid this because this will expose the cluster MAC address to the switch this notification too must be disabled. Without these changes NLB will not work in Unicast mode with VMware.

(This is a good post to read more about NLB).

Apart from Unicast NLB there’s also Multicast NLB. In this form the NLB NIC’s MAC address is not changed. Instead, a new Multicast MAC address is assigned to the NLB NIC. This is in addition to the regular MAC address of the NIC. The advantage of this method is that since each host retains its existing MAC address the communication between hosts is unaffected. However, since the new MAC address is a Multicast MAC address – and switches by default are set to ignore such address – some changes need to be done on the switch side to get Multicast NLB working.

One thing to keep in mind is that it’s important to add a default gateway address to your NLB NIC. At work, for instance, the NLB IPv4 address was reachable within the network but from across networks it wasn’t. Turns out that’s coz Windows 2008 onwards have a strong host behavior – traffic coming in via one NIC does not go out via a different NIC, even if both are in the same subnet and the second NIC has a default gateway set. In our case I added the same default gateway to the NLB NIC too and it was then reachable across networks. 

HP DL360 Gen9 with HP FlexFabric 534 adapter and HP Ethernet 530 adapter and ESXi

That’s a very vague subject line, I know, but I couldn’t think of anything concise. Just wanted to put some keywords so that if anyone else comes across the same problem and types something similar into Google hopefully they stumble upon this post.

At work we got some HP DL360 Gen9s to use as ESXi hosts. To these servers we added additional network cards –

  • HP FlexFabric 10Gb 2-port 534FLR-SFP+ Adapter; and
  • HP Ethernet 10Gb 2-port 530SFP+ Adapter.

Each of these adapters have two NICs each. Here’s a picture of the adapters in the server and the vmnic numbers ESXi assigns to them.

serverIn this picture –

  • vmnic5 & vmnic4 are the HP FlexFabric 10Gb 2-port 534FLR-SFP+ Adapter;
  • vmnic6 & vmnic7 are the HP Ethernet 10Gb 2-port 530SFP+ Adapter; and
  • vmnic0 – vmnic3 are HP Ethernet 1Gb 4-port 331i Adapter (which come in-built into the server);
  • iLO is the iLO port (which I’ll ignore for now).

We didn’t want to use vmnic0 – vmnic3 as they are only 1Gb. So the idea was the use vmnic4 – vmnic7. Two NICs would be for Management+vMotion (connecting to two different switches); two NICs would be for iSCSI (again connecting to different switches).

We came across two issues. First was that the FlexFabric NICs didn’t seem to support iSCSI. ESXi showed two iSCSI adapters but the NICs mapped to them were the regular Ethernet 10Gb ones, not the FlexFabric 10Gb ones. Second issue was that we wanted to use vmnic4 and vmnic6 for Management+vMotion and vmnic5 and vmnic7 for iSCSI – basically a NIC from each adapter such that even if an adapter were to fail there’s a NIC from another adapter for resiliency. This didn’t work for some reason. The Ethernet 10Gb NICs weren’t “connecting” to the network switch for some reason. They would connect in the sense that the link status appears as connected and the LEDs on the switch and NICs blink, but something was missing. There was no real connectivity.

Here’s what we did to fix these.

But first, for both these fixes you have to reboot the server and go into the System Utilities menu.

f9 system utils

Change 1: Enable iSCSI on the FlexFabric adapter (vmnic4 and vmnic5)

Once in the System Utilities menu select “System Configuration”.

system configurationSelect the first FlexFabric NIC (port1).

select flexfabricThen select the Device Hardware Configuration menu.

select device hardwareYou will see that the storage personality is FCoE.

current flex personalityThat’s the problem. This is why the FlexFabric adapters don’t show up as iSCSI adapters. Select the FCoE entry and change it to iSCSI.

new flex personalityNow press Esc to go back to the previous menus (you will be prompted to save the changes – do so). Then repeat the above steps for the second FlexFabric NIC (port 2).

With this change the FlexFabric NICs will appear as iSCSI adapters. Now for the second change.

Change 2: Enable DCB for the Ethernet adapters

From the System Configuration menu now select the first Ethernet NIC (port 1).

select ethernetThen select its Device Hardware Configuration menu.

select device hardware (ethernet)Notice the entry for “DCB Protocol”. Most likely it is “Disabled” (which is why the NICs don’t work for you).

current DCBChange that to “Enabled” and now the NICs will work.

new DCBThat’s it. Once again press Esc (choosing to save the changes when prompted) and then reboot the system. Now all the NICs will work as expected and appear as iSCSI adapters too.

rebootI have no idea what DCB does. From what I can glean via Google it seems to be a set of extensions to Ethernet that provide “hardware-based bandwidth allocation to a specific type of traffic and enhances Ethernet transport reliability with the use of priority-based flow control” (via TechNet) (also check out this Cisco whitepaper for more info). I didn’t read much into it because I couldn’t find anything that mentioned why DCB mattered in this case – as in why were the NICs not working when DCB was disabled? The NICs are connected to an HP 5920AF switch but I couldn’t find anything that suggested the switch requires DCB enabled for the ports to work. This switch supports DCB but that doesn’t imply it requires DCB.

Anyhow, the FlexFabric adapters have DCB enabled by default which is probably why they worked. That’s how I got the idea to enable DCB on the Ethernet adapters to see if it makes a difference – and it did! The only thing I can think of is that DCB also seems to include a DCBX (Data Centre Bridging Exchange) protocol which is about discovering peers, discovering mismatched configuration etc – so maybe the fact that DCB was disabled on these adapters made the switch not “see” these NICs and soft-disable them somehow. That’s my guess at least.

Windows DNS server subnet prioritization and round-robin

Consider the following multiple A records for a DNS record proxy.mydomain.com:

  • proxy.mydomain.com IN A 192.168.10.5
  • proxy.mydomain.com IN A 10.136.53.5
  • proxy.mydomain.com IN A 10.136.52.5
  • proxy.mydomain.com IN A 10.136.33.5
  • proxy.mydomain.com IN A 192.168.15.5

These records are defined on a DNS server. When a client queries the DNS server for the address to proxy.mydomain.com, the DNS server returns all the addresses above. However, the order of answers returned keeps varying. The first client asking for answers could get them in the following order for instance:

  • proxy.mydomain.com IN A 192.168.10.5
  • proxy.mydomain.com IN A 10.136.53.5
  • proxy.mydomain.com IN A 10.136.52.5
  • proxy.mydomain.com IN A 10.136.33.5
  • proxy.mydomain.com IN A 192.168.15.5

The second client could get them in the following order:

  • proxy.mydomain.com IN A 10.136.53.5
  • proxy.mydomain.com IN A 10.136.52.5
  • proxy.mydomain.com IN A 10.136.33.5
  • proxy.mydomain.com IN A 192.168.15.5
  • proxy.mydomain.com IN A 192.168.10.5

The third client could get:

  • proxy.mydomain.com IN A 10.136.52.5
  • proxy.mydomain.com IN A 10.136.33.5
  • proxy.mydomain.com IN A 192.168.15.5
  • proxy.mydomain.com IN A 192.168.10.5
  • proxy.mydomain.com IN A 10.136.53.5

This is called round-robin. Basically it rotates between the various IP addresses. All IP addresses are offered as answers, but the order is rotated so that as long as clients choose the first answer in the list every client chooses a different IP address.

Notice I said clients choose the first answer in the list. This needn’t always be the case though. When I said clients above, I meant the client computer that is querying the DNS server for an answer. But that’s not really who’s querying the server. Instead, an application on the client (e.g. Chrome, Internet Explorer) or the client OS itself is the one looking for an answer. These ask the DNS resolver which is usually a part of the OS for an answer, and it’s the resolver that actually queries the server and gets the list of answers above.

The DNS resolver can then return the list as it is to the requesting application, or it can apply a re-ordering of its own. For instance, if the client is from the 192.168.10.0 network, the resolver may re-order the answers such that the 192.168.10.5 answer is always first. This is called Subnet prioritization. Basically, the resolver prioritizes answers that are from the same subnet as the client. The idea being that client applications would prefer reaching out to a server in their same subnet (it’s closer to them, no need to go over the WAN link for instance) than one on a different subnet.

Subnet prioritization can be disabled on the resolver side by adding a registry key PrioritizeRecordData (link) with value 0 (REG_DWORD) at HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\DnsCache\Parameters. By default this key does not exist and has a default value of 1 (subnet prioritization enabled).

Subnet prioritization can also be set on the server side so it orders the responses based on the client network. This is controlled by the registry key LocalNetPriority (link) under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\DNS\Parameters\ on the DNS server. By default this is 0, so the server doesn’t do any subnet prioritization. Change this to 1 and the server will order its responses according to the client subnet.

By default the server also does round-robin for the results it returns. This can be turned off via the DNS Management tool (under server properties > advanced tab). If round-robin is off the server returns records in the order they were added.

More on subnet prioritization at this link.

That’s is not the end though. :)

Consider a server who has round-robin and subnet prioritization enabled. Now consider the DNS records from above:

  • proxy.mydomain.com IN A 192.168.10.5
  • proxy.mydomain.com IN A 10.136.53.5
  • proxy.mydomain.com IN A 10.136.52.5
  • proxy.mydomain.com IN A 10.136.33.5
  • proxy.mydomain.com IN A 192.168.15.5

The first and last records are from class C networks. The other three are from Class A networks. In reality though thanks to CIDR these are all class C addresses.

Now say there’s a client with IP address 10.136.50.2/24 asking the server for answers. On the face of it the client network does not match any of the answer record networks so the server will simply return answers as per round-robin, without any re-ordering. But in reality though the client 10.136.50.2/24 is in the same network as 10.136.52.5/24 and both are part of a larger 10.136.48.0/20 network that’s simply been broken into multiple /24 networks (to denote clients, servers, etc). What can we do so the server correctly identifies the proxy record for this client?

This is where the LocalNetPriorityNetMask registry key under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\DNS\Parameters\ on the DNS server comes into play. This key – which does not exist by default – tells the server what subnet mask to assume when it’s trying to subnet prioritize. By default the server assumes a /24 subnet, but by tweaking this key we can tell the server to use a different subnet in its calculations and thus correctly return an answer.

The LocalNetPriorityNetMask key takes a REG_DWORD value in a hex format. Check out this KB article for more info, but a quick run through:

A netmask can be written as xxx.xxx.xxx.xxx. 4 pairs of numbers. The LocalNetPriorityNetMask key is of format 0xaabbccdd – again, 4 pairs of hex numbers. This is a mask that’s applied on the mask of 255.255.255.255 so to calculate this number you subtract the mask you want from 255.255.255.255 and convert the resulting numbers into hex.

For example: you want a /8 netmask. That is 255.0.0.0. Subtracting this from 255.255.255.255 leaves you with 0.255.255.255.255. What’s that in hex? 00ffffff. So LocalNetPriorityNetMask will be 0x00ffffff. Easy?

So in the example above I want a /20 netmask. That is, I am telling the server to assume the clients and the record IPs it has to be in a /20 network, so subnet prioritize accordingly. A /20 netmask is 255.255.240.0. Subtract from 255.255.255.255 to get 0.0.15.255. Which in hex is 00000fff (15 decimal is F hex). So all I have to do is put this value as LocalNetPriorityNetMask on the DNS server, restart the service, and now the server will correctly return subnet prioritized answers for my /20 network.

Update: Some more links as I did some more reading on this topic later.

  • Ace Fekay’s post – a must read!
  • A subnet calculator (also gives you the wildcard, which you can use for calculating the LocalNetPriorityNetMask key)
  • I am not very clear on what happens if you disable RoundRobin but there are multiple entries from the same subnet. What order are they returned in? Here’s a link to the RoundRobin setting, doesn’t explain much but just linking it in case it helps in the future.
  • More as a note to myself (and any others if they wonder the same) – the subnet mask you specify is applied on the client. That is to say if you client has an IP address of say 10.136.20.10, by default the DNS server will assume a subnet mask of /24 (Class C is the default) and assume the client is in a 10.136.20.0/24 network. So any records from that range are prioritized. If you want to include other records, you specify a larger subnet mask. Thus, for example, if you specify a /20 then the client is assumed to have an IP address 10.136.20.10/20, so its network range is considered to be 10.136.16.1 – 10.136.31.254 (don’t wrack your brain – use the subnet calculator for this). So any record in this range is prioritized over records not in this range.
  • The Windows calculator can be used to find the LocalNetPriorityNetMask key value. Say you want a subnet mask of /19. The subnet calculator will tell you this has a wildcard of 0.0.31.255 – i.e. 00011111.11111111. Put this (13 1’s) into the Windows calculator to get the hex value 3FFF.  

vMotion is using the Management Network (and failing)

Was migrating one of our offices to a new IP scheme the other day and vMotion started failing. I had a good idea what the problem could be (coz I encountered something similar a few days ago in another context) so here’s a blog post detailing what I did.

For simplicity let’s say the hosts have two VMkernel NICs – vmk0 and vmk1. vmk0 is connected to the Management Network. vmk1 is for vMotion. Both are on separate VLANs.

When our Network admins gave out the new IPs they gave IPs from the same range for both functions. That is, for example, vmk0 had an IP 10.20.1.2/24 (and 10.20.1.3/24 and 10.20.4/24 on the other hosts) and vmk1 had an IP of 10.20.12/24 (and 10.20.1.13/24 and 10.20.1.14/24 on the other hosts).

Since both interfaces are on separate VLANs (basically separate LANs) the above setup won’t work. That’s because as far as the hosts are concerned both interfaces are on the same network yet physically they are on separate networks. Here’s the routing table on the hosts:

Notice that any traffic to the 10.20.1.0/24 network goes via vmk0. And that includes the vMotion traffic because that too is in the same network! And since the network that vmk0 is on is physically a separate network (because it is a VLAN) this traffic will never reach the vMotion interfaces of the other hosts because they don’t know of it.

So even though you have specific vmk1 as your vMotion traffic NIC, it never gets used because of the default routes.

If you could force the outgoing traffic to specifically use vmk1 it will work. Below are the results of vmkping using the default route vs explicitly using vmk1:

The solution here is to either remove the VLANs and continue with the existing IP scheme, or to keep using VLANs but assign a different IP network for the vMotion interfaces.

Update: Came across the following from this blog post while searching for something else:

If the management network (actually the first VMkernel NIC) and the vMotion network share the same subnet (same IP-range) vMotion sends traffic across the network attached to first VMkernel NIC. It does not matter if you create a vMotion network on a different standard switch or distributed switch or assign different NICs to it, vMotion will default to the first VMkernel NIC if same IP-range/subnet is detected.

Please be aware that this behavior is only applicable to traffic that is sent by the source host. The destination host receives incoming vMotion traffic on the vMotion network!

That answered another question I had but didn’t blog about in my post above. You see, my network admins had also set the iSCSI networks to be in the same subnet as the management network – but separate VLANs – yet the iSCSI traffic was correctly flowing over that VLAN instead of defaulting to the management VMkernel NIC. Now I understand why! It’s only vMotion that defaults to the first VMkernel NIC in the same IP range/ subnet as vMotion. 

 

How to fail-over the primary VC module to another

Probably obvious to most people but it was new to me. To failover a VC module from primary to backup login to the VC Manager, then go to “Tools > Reset Virtual Connect Manager”.

toolsClick “Yes”, but before that tick the box that says “Force Failover”.

force failoverThat’s all. Now you will be logged out and after a few minutes you can login to the module that is now primary. It takes a few minutes so be patient.

This has no impact on the network or FC connectivity.

Some more ways of doing this can be found at this HP link.

Brief notes on Windows Time

The w32time service provides time for Windows. Since Windows XP NTP (Network Time Protocol) is supported. Prior to that it was only SNTP (Simple NTP).

Non domain joined computers (including servers) use SNTP.

This is a good article that explains the Windows Time service and its configurations. Covers both registry keys and GPOs. This is another good article that goes into even more detail.

Any Windows machine can be set up to sync time in one of four ways: (1) no syncing! (2) sync from specified NTP servers (3) sync via domain hierarchy (i.e. members sync from a DC in the domain; DCs sync from PDC of the parent domain/ forest root domain) (4) use either of the above (i.e. NTP and domain hierarchy). Default mechanism on domain joined computers is domain hierarchy (the setting is called NT5DS). Stand-alone machines have the default as NTP servers (the setting is called NTP; the default server is time.windows.com though you can change it (and probably recommended that you change it?)).

For machines that are off and on the domain – e.g. laptops – it is better to set their time sync mechanism as any. They needn’t always have contact with the DC to sync time.

When specifying NTP time servers you also specify flags. Check this post for an explanation of the flags. There are four possible flags: 0x01 SpecialInterval; 0x02 UseAsFallbackOnly; 0x04 SymmetricActive; 0x08 Client.

  • Flag UseAsFallbackOnly means the server is only used if the others are unavailable. Check out this post for an example of this.
  • Flag SpecialInterval lets you change how often the NTP server is polled. By default the interval is determined by Windows based on the quality of time samples, but you can use the above flag and set a registry key HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\W32Time\TimeProviders\NtpClient\SpecialPollInterval to change the polling interval.
  • I am not sure what the other two flags do. The Client flag seems to be a commonly used one. Some posts/ articles use it, others don’t. The default time.windows.com setting uses this flag as well as the SpecialInterval.

p.s. To turn on w32tm debugging check out this link.

The dig command

Just a quick post on how to use the dig command. Had to troubleshooting some DNS issues today and nslookup just wasn’t cutting it so I downloaded BIND for Windows and ran dig from there.  (For anyone else that’s interested – if you download BIND (it’s a zip file), extract the contents, and copy dig.exe and all the .dll files elsewhere that’s all you need to run dig).

Here’s the simple way of running dig:

A bit confusing when you come from nslookup where you type the name server first and then the name and type. Here’s the nslookup syntax for reference:

Although all documentation shows the dig syntax as above I think the order can be changed too. Since the name server is identified by the @ sign you can specify it later on too:

I will use this syntax everywhere as it’s easier for me to remember.

You can complicate the simple syntax by adding options to dig. There’s two sort of options: query options and domain options (I made the names up). These are usually placed after the type but again the order doesn’t really matter:

  • Query options start with a - (dash) and look like the switches you have in most command line tools. These are useful to control the behavior of dig when making the query.
    • For instance: say I want to for dig to only use IPv4. I can add the option -4. Similarly for only IPv6 it is -6.
    • If I want to specify the port of the name server (i.e. apart from 53) the -p option is for that.
    • (Very useful) If I want to specify a different source port for the client (i.e. the port from which the outgoing request is made and to which the reply is sent, in case you want to eliminate any issues with that) the -b switch is for that. It takes as argument address#port, if you don’t care about the address leave it as 0.0.0.0#port.
    • Apart from specifying the name and type as above it is also possible to pass these as query options via -n and -t.
    • There are some more query options. Above are just the common ones I can remember.
  • Domain options start with a + (plus). These pertain to the name lookup you are doing.
    • For instance: typically DNS queries are performed using UDP. To use TCP instead do +tcp. Most of these options can be prefixed with a “no”, so if you explicitly want to not use TCP then you can do +notcp instead.
    • To specify the default domain name do +domain=xxx.
    • To turn off or on recursive mode do +norecurse or +recurse. (This is similar to the -recurse or -norecurse switch in nslookup).
      • Note: By default recursion is on.
      • A quick note on recursion: When recursion is on dig asks the name server you supply (or that used by the OS) for an answer and if that name server doesn’t have an answer the name server is expected ask other name servers and get back with an answer. Hence the term “recursive”. The name server will send a recursive query to the name server it knows of, who will send a recursive query to the name server it knows of, until an answer is got down the chain of delegation. 
      • A quick note on iterative (non-recursive queries): When recursion is off dig gets an answer from the first name server, then it queries the second name server, gets an answer and queries the third name server, and so on … until dig itself gets an answer from a server who knows.
    • The output from dig is usually verbose and shows all the sections. To keep it short use the +short option.
      • This doesn’t show the name of the server that dig got an answer from. To print this too use the +identify option.
    • To list all the authoritative name servers of a domain along with their SOA records use the +nssearch option. This disables recursion.
    • To list a trace of the delegation path use the +trace option. This too disables recursion. The delegation path is listed by dig contacting each name server down the chain iteratively.
    • Couple of output modifying options. Most of these require no changing:
      • By default the first line of the output shows the version of  dig, the query, and some extra bits of info. To hide that use +nocmd.
      • By default the answer section of replies is shown. To hide it do +noanswer.
      • By default the additional sections of a reply are shown. To hide it do +noadditional.
      • By default the authority section of a reply is shown. To hide it do +noauthority.
      • By default the question asked is shown as a comment (this is useful because you might type something like dig rakhesh.com which dig interprets as dig rakhesh.com A so this makes the question explicit). To hide this do +noquestion.
        • Note that this is not same as the query that is sent. The query is the question plus other flags like whether we want a recursive query etc. By default the query that is sent is not shown. To show that do +qr.
      • By default comments are printed. To hide it do +nocomments.
      • By default some stats are printed. To hide these do +nostats.
    • To change the timeout use +time=X. Default is 5 seconds. This is equivalent to nslookup -timeout=X.
    • To change the number of retries use +retry=X. Default is 3. This is equivalent to nslookup -retry=X.
    • There are some more domain options. Above are just the common ones I can remember.

An interesting thing about dig (since BIND 9) is that it can do multiple queries (the brackets below are not part of the command, I put them to show the multiple queries):

Each of these queries can take its own options but in addition to that you can also provide global options. So the full command when you have multiple queries and global options is as below (again, without the brackets):

To add to the fun you can even have one name server to be used for all queries and over-ride these via name servers for each query! Thus a fuller syntax is:

This is also why the options can be at the beginning of the query rather than at the end of the query. When you specify global options you can over-ride these per query (except the +[no]cmd option).

Exporting and Importing Windows DNS zones

Exporting a DNS zone is easy. Use the dnscmd.

Importing too is easy but the commands aren’t so obvious. Again you use dnscmd, with the /zoneadd switch as though you are creating a new zone. The help page for this misses out on an important switch though – /load – which lets you load the zone from an exported or pre-existing file.

You can find this switch in the dnscmd help:

So the way to import a zone is as follows: first, copy the exported file into the c:\windows\system32\dns folder of the DNS server and preferably rename it so the extension is a .dns (not required, just a nice thing to do). Then run a command similar to below:

That’s it. This will create a primary zone called “blah.com” and use the zone file that’s already in the location.

Note that you can’t use this technique for AD integrated zones. But that’s no issue. Simply import as above and then convert the zone to AD integrated via the GUI.

A problem occurred while trying to add the conditional forwarder

I was trying to create a conditional forwarder on my work DNS servers and kept hitting this cryptic error message:

forwarder error

I was trying to create a conditional forwarder called some.sub.zone.com.  At first I thought maybe I had this as an existing zone or perhaps as a stub zone – but nope, I don’t have that. Some forum posts mentioned the lack of root hints could lead to this error, but that doesn’t make sense to me – why would I need root hints for this? Next I created a test conditional forwarder to some random domain name and that worked – so surely it wasn’t a server issue.

I recreated this in my test lab and found the problem. The issue is that I am trying to create a forwarder to some.sub.zone.com while zone.com already exists on the DNS server. I was under the impression you could have conditional forwarders even for zones you host, but nope that’s a no can do. From the official docs here’s a para of interest:

A DNS server cannot forward queries for the domain names in the zones it hosts. For example, the authoritative DNS server for the zone microsoft.com cannot forward queries according to the domain name microsoft.com. The DNS server authoritative for microsoft.com can forward queries for DNS names that end with example.microsoft.com, if example.microsoft.com is delegated to another DNS server.

The emphasis is mine and that’s the work-around to use here. You have two options – either delete the zone.com zone from your DNS servers and then create a conditional forwarder for some.sub.zone.com; or create a delegation for some.sub.zone.com – you could do that to yourself too – and then create the conditional forwarder.

Here’s a screenshot from my test lab –

delegationThe some.sub delegation is to my server itself. You don’t need to create a zone for the delegation to succeed. The delegation is just a one way pointer of sorts telling the server to ask the delegated server for any queries concerning this sub zone – it basically tells the server hosting zone.com that it is no longer responsible for some.sub.zone.com (even though the delegation points back to itself!). Once that is done the server will allow you to create a conditional forwarder for some.sub.zone.com.