
Get a list of users in an OU along with last logged on date

Trivial stuff. Wanted to note it down someplace for future reference –
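Something along these lines (the OU path is just an example; LastLogonDate is derived from lastLogonTimestamp, so it can lag by up to a couple of weeks):

    # Needs the ActiveDirectory module (RSAT)
    Get-ADUser -Filter * -SearchBase "OU=Staff,DC=example,DC=com" -Properties LastLogonDate |
        Select-Object Name, SamAccountName, LastLogonDate |
        Sort-Object LastLogonDate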

 

Start-BitsTransfer does nothing

I had to copy some VMware templates from our head office to the branch offices. Thought I’d copy them out of the datastores manually, then do a BITS transfer to the remote offices. This way I can do the transfer during normal hours but with minimal user impact.

Since PowerShell 2.0 you had the Start-BitsTransfer cmdlet to do BITS transfers.

Oddly however, the command would just exit without any error for me. And it didn’t seem to be doing anything. Then I realized I was pointing the cmdlet to my source folder and that’s why it was failing! Start-BitsTransfer only takes files. You can specify wildcards to select multiple files but you can’t point it to a folder.

So the following doesn’t work:
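For example (the paths here are made up):

    # Fails silently because the source is a folder, not a file
    Start-BitsTransfer -Source "D:\Templates" -Destination "\\BRANCH01\Templates"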

But this works:
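Same copy, but with a wildcard on the source (again, made-up paths):

    # Wildcards select the files inside the folder, so this works
    Start-BitsTransfer -Source "D:\Templates\*" -Destination "\\BRANCH01\Templates"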

Since Start-BitsTransfer supports wildcards this is mostly fine, unless your folder contains sub-folders and you want to copy those too and/or preserve the folder structure. Fortunately this is PowerShell, so it’s just a matter of creating a wrapper around the cmdlet to support folders. Like this one for instance. Or use CSV files like in this MSDN article.

One thing to keep in mind: Start-BitsTransfer has a default priority (specified via the -Priority switch) of Foreground. This competes with other applications for bandwidth, so it is probably not what you want. Your alternatives are High, Normal, or Low – each of which does the transfer in the background using the idle bandwidth of the client (the priority only determines which transfer job gets precedence over other background BITS transfers).
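For instance, to keep the transfer in the background on idle bandwidth (same made-up paths as above):

    Start-BitsTransfer -Source "D:\Templates\*" -Destination "\\BRANCH01\Templates" -Priority Normal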

Another thing to keep in mind is that BITS only really looks at the bandwidth availability of the client (when I say “client” I am not sure whether that’s the sender or the receiver – I didn’t read much into this). In a LAN environment it could be the case that the WAN side is saturated but the particular client you are targeting is idle – in that case BITS will use the full client bandwidth available to it even though the network itself doesn’t have any spare bandwidth. (This was the case prior to BITS 2.0; since then BITS can use an Internet Gateway Device to try and assess the bandwidth availability on the WAN side. This requires that an IGD be discoverable via UPnP and that the Internet Gateway Device support such reporting, so I am not sure how well it works in practice.) You can also use GPOs to control BITS bandwidth usage.

Anyhoo, this is not a BITS intro so I’ll leave it at that. :)

Once you start a BITS transfer you can also pause it via Suspend-BitsTransfer or cancel it via Remove-BitsTransfer. The latter also deletes any files that have already been transferred, so if you just want to stop the job but leave the transferred files as they are, use Complete-BitsTransfer instead.
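A rough sketch of the job lifecycle (the job has to be started with -Asynchronous for you to manage it afterwards; paths made up as before):

    Import-Module BitsTransfer
    $job = Start-BitsTransfer -Source "D:\Templates\*" -Destination "\\BRANCH01\Templates" -Asynchronous
    Suspend-BitsTransfer -BitsJob $job      # pause the transfer
    Resume-BitsTransfer -BitsJob $job       # resume it
    Complete-BitsTransfer -BitsJob $job     # finish the job and keep the transferred files
    # Remove-BitsTransfer -BitsJob $job     # or cancel it and delete whatever was transferred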

That’s all for now!

p.s. Almost forgot. You can also use BITS to download HTTP files. Like in this post for instance.

vMotion NIC load balancing fails even though there is an active link

The other day I blogged about how I had a host whose vMotion VMkernel interface seemed to be broken. Any vMotion attempts to it would hang at 14%.

At that time I logged on to the destination host, then used vmkping with the -I switch (to explicitly specify the vMotion VMkernel interface of the destination host), and found that I couldn’t ping the VMkernel interface of the other hosts. These hosts could ping each other but couldn’t ping the destination host.

The VMkernel interface is backed by two physical NICs. I found that if I removed one of the physical NICs from the VMkernel it worked. Interestingly this link wasn’t showing any CDP info either, so it looked like something was wrong with it (the physical NIC shows as unclaimed coz the screenshot was taken after I moved it to unclaimed).

[Image: Missing CDP info]

So the first question is why did the VMkernel fail when only one of the physical NICs failed? Since the other physical NIC backing the VMkernel NIC is still active, shouldn’t it have continued working?

The reason it failed is that by default network failover detection is via “Link status only”. This only detects failures of the link itself – like say the cable is broken, the switch is down, or the NIC has failed – while failures such as the link being connected but blocked by the switch are not detected. In my case, as you can see from the screenshot above, the link status is connected – so the host doesn’t consider the link failed even though it isn’t actually working, and thus continues to use it.

Next I discovered that other hosts too had their second vMotion physical NIC in a failed state as above, yet they weren’t failing like this host. The simple explanation for this is that the host above somehow selected the faulty physical NIC as the one to use, didn’t detect it as failed, and so continued to use it; whereas the other hosts were luckier and chose the physical NIC that was working fine, so they didn’t have any issues.

I am not sure that’s the entire answer though. For one, the host that failed was ESXi 5.5 and using a distributed switch, while the other two hosts were ESXi 4.0 and using standard switches. Did that make a difference?

The default load balancing method for both standard and distributed switches is the same. (For a standard switch you check this under the vSwitch properties on the host. For a distributed switch you check this under the portgroup in the Networking section of vSphere (web) client).

[Image: Default load balancing]

Load balancing is what I am concerned about here, because that’s what the host uses to decide which of the two physical NICs a particular traffic flow goes over. The load balancing method is the same between standard and distributed switches, so why were the distributed switch/ESXi 5.5 hosts behaving differently?

I am still not sure of an answer but I have my theory. My theory is that since a distributed switch is across multiple hosts the load balancing method (above) of choosing a route based on virtual port ID comes into play. Here’s screenshots from two of my hosts connected to the same distributed switch port group for instance:

[Image: Port number]

As you can see the virtual port number is different for the VMkernel NIC of each host. So each host could potentially use a different underlying physical NIC depending on how the load balancing algorithm maps it.

But what about a standard switch? Since the standard switch exists only on the host, and the only VMkernel NIC connected to it (in the case of vMotion) is the single VMkernel NIC I have assigned for vMotion, there is no load balancing algorithm coming into play! If, instead of a VMkernel NIC, I had a Virtual Machine network, then the virtual port number would matter because there are multiple VMs connecting to the various port numbers; but that doesn’t matter for VMkernel NICs as there is only one of them.

So my theory is that for a VMkernel NIC (such as vMotion) backed by multiple physical NICs and using the default load balancing algorithm of virtual port ID, all traffic by default goes over one of the physical NICs and the other physical NIC is never used unless the chosen one fails. And that is why my hosts using standard switches were always using the same physical NIC (I am guessing the lower numbered one, as that’s what both hosts chose), while hosts using distributed switches could have chosen different physical NICs per host.

That’s all! Just thought I’d put this out there in case anyone else has the same question.

AppV – Empty package map for package content root

Had an interesting problem at work yesterday about which I wish I could write a long and interesting blog post, but truthfully it was such a simple thing once I identified the cause.

We use AppV for streaming applications. We have many branch offices, so there’s a DFS share which points to targets in each office. AppV installations in each office point to this DFS share and, thanks to the magic of DFS referrals, correctly pick up the local Content folder. From the day before, however, one of our offices started getting errors with AppV apps (same as in this post), and when I checked the AppV server I found errors similar to this in the Event Logs:

The DFS share seemed to be working OK. I could open it via File Explorer and its contents seemed correct. I checked the number of files and the size of the share and they matched across offices. If I pointed the DFS share to use a different target (open the share in File Explorer, right click, Properties, go to the DFS tab and select a different location target) AppV works. So the problem definitely looked like something to do with the local target, but what was wrong?

I tried forcing a replication. And checked permissions and used tools like dfsrdiag to confirm things were alright. No issues anywhere. Restarting the DFS Replication service on the server threw up some errors in the Event Logs about some AD objects, so I spent some time chasing up that tree (looks like older replication groups that were still hanging around in AD with missing info but not present in the DFS Management console any more), until I realized all the replication servers were throwing similar errors. Moreover, adding a test folder to the source DFS share correctly resulted in it appearing on the local target immediately – so obviously replication was working correctly.

I also used robocopy to compare the local target and another one and saw that they were identical.

Bummer. Looked like a dead end and I left it for a while.

Later, while sitting through a boring conference call, I had a brainwave: maybe the AppV service runs in a different user context, and that context may not be seeing the DFS share? As in, maybe the error message above is literally what is happening: AppV really is seeing an empty content root, and it’s not a case of a corrupt content root or just some missing files.

So I checked the AppV service and saw that it runs as NT AUTHORITY\NETWORK SERVICE. Ah ha! That means it authenticates with the remote server using the machine account of the server AppV is running on. I thought I’d verify what happens by launching File Explorer or a Command Prompt as NT AUTHORITY\NETWORK SERVICE, but this was a Server 2003 and apparently there’s no straightforward way to do that. (You can use psexec to launch something as .\LOCALSYSTEM, and starting from Server 2008 you can create a scheduled task that runs as NT AUTHORITY\NETWORK SERVICE and launch that to get what you want, but I couldn’t use that here; also, I think you need to first run as the .\LOCALSYSTEM account and then run as the NT AUTHORITY\NETWORK SERVICE account.) So I checked the Audit logs of the server hosting the DFS target and sure enough found errors showing that the machine account of the AppV server was indeed being denied login:

Awesome! Now we are getting somewhere.

I fired up the Local Security Policy console on the server hosting the DFS target (it’s under the Administrative Tools folder, or just type secpol.msc). Then went down to “Local Policies” > “User Rights Assignment” > “Access this computer from the Network”:

[Image: secpol]

Sure enough this was limited to a set of computers which didn’t include the AppV server. When I compared this with our DFS servers I saw that they were still on the default values (which include “Everyone”, as in the screenshot above) and that’s why those targets worked.

To dig further I used gpresult and compared the GPOs that affected the above policy between both servers. The server that was affected had this policy modified via a GPO, while the server that wasn’t affected showed that GPO as inaccessible. Both servers were in the same OU, but upon examining the GPO I saw that it was limited to a certain group only. Nice! And when I checked that group, our problem server was a member of it while the rest weren’t! :)
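For reference, a quick way of dumping the computer-side GPO data on each server for comparison (the output path is arbitrary):

    gpresult /scope computer /v > C:\temp\gpresult.txt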

Turns out the server had been added to the group by mistake two days ago. Removed the server from this group, waited a while for the change to replicate across the domain, did a gpupdate on the server, and tada! Now the AppV server is able to access the DFS share on this local target again. Yay!

Moral of the story: if one of your services is unable to access a shared folder, check what user account the service runs as.

vMotion is using the Management Network (and failing)

Was migrating one of our offices to a new IP scheme the other day and vMotion started failing. I had a good idea what the problem could be (coz I encountered something similar a few days ago in another context) so here’s a blog post detailing what I did.

For simplicity let’s say the hosts have two VMkernel NICs – vmk0 and vmk1. vmk0 is connected to the Management Network. vmk1 is for vMotion. Both are on separate VLANs.

When our Network admins gave out the new IPs they gave IPs from the same range for both functions. That is, for example, vmk0 had an IP of 10.20.1.2/24 (and 10.20.1.3/24 and 10.20.1.4/24 on the other hosts) and vmk1 had an IP of 10.20.1.12/24 (and 10.20.1.13/24 and 10.20.1.14/24 on the other hosts).

Since both interfaces are on separate VLANs (basically separate LANs) the above setup won’t work. That’s because as far as the hosts are concerned both interfaces are on the same network yet physically they are on separate networks. Here’s the routing table on the hosts:
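The routing table itself isn’t reproduced here, but it can be listed on a host with either of these (the esxcli variant needs a newer ESXi build, I believe):

    esxcfg-route -l
    esxcli network ip route ipv4 list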

Notice that any traffic to the 10.20.1.0/24 network goes via vmk0. And that includes the vMotion traffic because that too is in the same network! And since the network that vmk0 is on is physically a separate network (because it is a VLAN) this traffic will never reach the vMotion interfaces of the other hosts because they don’t know of it.

So even though you have specified vmk1 as your vMotion traffic NIC, it never gets used because of the default routes.

If you could force the outgoing traffic to specifically use vmk1 it will work. Below are the results of vmkping using the default route vs explicitly using vmk1:
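As a sketch, using the example IPs from above (10.20.1.13 being the vMotion interface of another host):

    # Uses the default route, i.e. goes out via vmk0 - fails
    vmkping 10.20.1.13
    # Forces the vMotion VMkernel interface - works
    vmkping -I vmk1 10.20.1.13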

The solution here is to either remove the VLANs and continue with the existing IP scheme, or to keep using VLANs but assign a different IP network for the vMotion interfaces.

Update: Came across the following from this blog post while searching for something else:

If the management network (actually the first VMkernel NIC) and the vMotion network share the same subnet (same IP-range) vMotion sends traffic across the network attached to first VMkernel NIC. It does not matter if you create a vMotion network on a different standard switch or distributed switch or assign different NICs to it, vMotion will default to the first VMkernel NIC if same IP-range/subnet is detected.

Please be aware that this behavior is only applicable to traffic that is sent by the source host. The destination host receives incoming vMotion traffic on the vMotion network!

That answered another question I had but didn’t blog about in my post above. You see, my network admins had also set the iSCSI networks to be in the same subnet as the management network – but on separate VLANs – yet the iSCSI traffic was correctly flowing over that VLAN instead of defaulting to the management VMkernel NIC. Now I understand why! It’s only vMotion that defaults to the first VMkernel NIC when the vMotion network is in the same IP range/subnet.

 

Find Outlook rules that are deleting a message

As part of troubleshooting something I needed to quickly find what Outlook rules the user had for deleting messages. So I came up with this one-liner.
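It was something along these lines (the mailbox name is a placeholder):

    Get-InboxRule -Mailbox "some.user" | Where-Object { $_.DeleteMessage -eq $true } | Select-Object Name, Description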

The result is a list of rule names and a friendly description of what the rule does.

Run this from the EMS of course.

ESXi host – cannot install HA – no space left on device

These are less notes and more links, plus what I did when I encountered this issue. Just for my future self.

At work we had a host which was giving HA errors. The message was along the lines that vCenter could not contact HA. So I tried reconfiguring it for HA (right click the host and select “Reconfigure for vSphere HA”) upon which I got a new error: Cannot install the vCenter Server agent service. Cannot upload agent.

[Image: HA-error]

Initially I thought it must just be a permissions issue. But it wasn’t so.

To investigate further I tried logging on to the server. I couldn’t enable SSH and ESXi Shell from the Configuration tab – it gave me an error. So I iLO’d into the server’s DCUI and enabled SSH and ESXi Shell there. SSH still refused to let me in, and when I’d press Alt+F1 on the console to get the login prompt it was filled with messages like these: “/bin/sh: can’t fork”. Initially I thought it might be to do with the HP AMS memory leak (see this and this) but it wasn’t.

I pressed Alt+F12 to see the on-screen logs. It was filled with messages like these:

[Image: Alt+F12 logs]

Blimey!

There was nothing more I could do here, basically. I couldn’t log in to the server at all; heck, I couldn’t even shut down or restart it gracefully via F12 in the DCUI (nothing would happen). So I cold booted it and that got it working.

It’s been about 2 hours since I did that and the server seems stable, so maybe it was a one-off thing. I looked at more logs though and here’s what I found.

  • /var/log/syslog.log – Management service initialization, watchdogs, scheduled tasks and DCUI use
  • /var/log/vmkwarning.log – A summary of Warning and Alert log messages excerpted from the VMkernel logs
  • /var/log/vob.log – VMkernel Observation events
  • /var/log/vmkernel.log – Core VMkernel logs, including device discovery, storage and networking device and driver events, and virtual machine startup
  • /var/log/hostd.log – Host management service logs, including virtual machine and host Task and Events, communication with the vSphere Client and vCenter Server vpxa agent, and SDK connections

From these logs one thing was clear. The ESXi RAMdisk hosting the root filesystem had run out of inodes. Possibly caused by the SFCB service. Because of this the root filesystem had run out of space and everything was failing. Great!

In Linux I am used to the df command to check filesystem usage. But in ESXi df only seems to give info on the mounted filesystems, whereas vdf also gives the local filesystems (like RAMdisks and Tardisks (whatever those are)).
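The commands, at least (the original output isn’t reproduced here):

    df -h     # mounted datastores
    vdf -h    # RAMdisks and tardisks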

Above output is after a reboot and all seems fine. To check the inode usage use the stat command.
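Something like this for the root filesystem, I believe:

    stat -f /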

Or use esxcli. It gives you the free space as well as the inode count!
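I believe this is the one, as it lists each RAMdisk with its size, free space and inode usage:

    esxcli system visorfs ramdisk list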

Note to self: Make a habit of using the esxcli command as that seems to be the VMware preferred way of doing things. Plus it’s one command with various namespaces you can use for networking and other info.

In my case things look to be fine now.

KB 2037798 talks about this problem. Apparently it is fixed via a patch released in 2013, and as far as I can tell we are properly patched, so we shouldn’t have been hit by this issue. If it happens again though, the same KB article talks about creating a separate RAMdisk for SFCB so that even if it eats up all the inodes your root file system isn’t affected. This involves creating a new RAMdisk at boot time by modifying rc.local (nice!). The esxcli command can be used to create a new RAMdisk and mount it at the mount point required by SFCB:
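A sketch of the sort of line that goes into /etc/rc.local.d/local.sh. The parameter names and sizes here are from memory, so double-check them against KB 2037798 and the esxcli help before using:

    # create a dedicated RAMdisk and mount it where SFCB writes its files
    esxcli system visorfs ramdisk add --name sfcbtickets --min-size 0 --max-size 256 --permissions 0755 --target /var/run/sfcb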

Turns out such an issue can also occur because of SNMP. Or if you have an HP Gen8 blade server then coz of the hpHelper.log file, which is fixed via a patch from HP (this server was a Gen8 blade but it didn’t have this log file). KB 2040707 too talks about this. Didn’t help much in my case as that didn’t seem to be my issue.

Two useful links for future reference are:

That’s all for now.

p.s. I keep talking about SFCB above but have no idea what it is. Turns out it is the CIM server for ESXi. Found this blog post on it. 

How to fail-over the primary VC module to another

Probably obvious to most people but it was new to me. To fail over a VC module from primary to backup, log in to the VC Manager, then go to “Tools > Reset Virtual Connect Manager”.

[Image: Tools]

Click “Yes”, but before that tick the box that says “Force Failover”.

[Image: Force failover]

That’s all. Now you will be logged out, and after a few minutes you can log in to the module that is now primary. It takes a few minutes so be patient.

This has no impact on the network or FC connectivity.

Some more ways of doing this can be found at this HP link.

vMotion fails at 14% – ESX hosts failed to connect over the VMotion network

On a newly built host today vMotion was failing while migrating VMs to this host. vMotion would get stuck at 14% and then fail with the above error. Found this VMware KB article – it was informative but didn’t help. From this article I learnt though that I can use vmkping with the -I switch to specify an interface to use while pinging. This is handy when you want to ping a remote address via a specific interface – say, the vMotion IP address of a remote host, via the vMotion VMKernel of this host. Usually vmkping automatically selects an interface on the network you are trying to ping but it’s possible you are using the same subnet for vMotion and many other services.

Anyhow, in my case I noticed that if I removed one of the underlying physical adapters I was able to vmkping. So add that to the list of things to try if you too are in a similar situation. Odd that it failed though! I would have thought a failed physical adapter means it will just try a different one? Clearly in my case the other adapter was working.

I don’t know more details but it could be that the physical NIC was up but the switch was blocking? Not sure.

Brief notes on Windows Time

The w32time service provides time for Windows. Since Windows XP, NTP (Network Time Protocol) has been supported; prior to that it was only SNTP (Simple NTP).

Non domain joined computers (including servers) use SNTP.

This is a good article that explains the Windows Time service and its configurations. Covers both registry keys and GPOs. This is another good article that goes into even more detail.

Any Windows machine can be set up to sync time in one of four ways: (1) no syncing at all; (2) sync from specified NTP servers; (3) sync via domain hierarchy (i.e. members sync from a DC in their domain; DCs sync from the PDC of the parent domain/forest root domain); or (4) either of the above (i.e. NTP and domain hierarchy). The default mechanism on domain joined computers is domain hierarchy (the setting is called NT5DS). Stand-alone machines default to NTP servers (the setting is called NTP; the default server is time.windows.com, though you can change it – and it’s probably recommended that you do).

For machines that are sometimes on and sometimes off the domain network – e.g. laptops – it is better to set their time sync mechanism to the “either of the above” option. They needn’t always have contact with a DC to sync time.

When specifying NTP time servers you also specify flags. Check this post for an explanation of the flags. There are four possible flags: 0x01 SpecialInterval; 0x02 UseAsFallbackOnly; 0x04 SymmetricActive; 0x08 Client.

  • Flag UseAsFallbackOnly means the server is only used if the others are unavailable. Check out this post for an example of this.
  • Flag SpecialInterval lets you change how often the NTP server is polled. By default the interval is determined by Windows based on the quality of time samples, but you can use the above flag and set a registry key HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\W32Time\TimeProviders\NtpClient\SpecialPollInterval to change the polling interval.
  • I am not sure what the other two flags do. The Client flag seems to be a commonly used one. Some posts/articles use it, others don’t. The default time.windows.com setting uses this flag as well as SpecialInterval (i.e. 0x9 – there’s an example of this right after this list).
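For instance, to point a machine at a manual NTP server list using those flags (0x9 = SpecialInterval + Client; the server name is just an example):

    w32tm /config /manualpeerlist:"pool.ntp.org,0x9" /syncfromflags:manual /update
    w32tm /resync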

p.s. To turn on w32tm debugging check out this link.

The dig command

Just a quick post on how to use the dig command. Had to troubleshoot some DNS issues today and nslookup just wasn’t cutting it, so I downloaded BIND for Windows and ran dig from there. (For anyone else that’s interested – if you download BIND (it’s a zip file), extract the contents, and copy dig.exe and all the .dll files elsewhere, that’s all you need to run dig.)

Here’s the simple way of running dig:
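The name server goes first (with an @), then the name, then the record type. For instance (8.8.8.8 here is just a stand-in for whichever name server you want to ask):

    dig @8.8.8.8 rakhesh.com MX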

A bit confusing when you come from nslookup, because with dig you type the name server first and then the name and type. Here’s the nslookup syntax for reference:
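Roughly, for the same query as above:

    nslookup -type=MX rakhesh.com 8.8.8.8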

Although all documentation shows the dig syntax as above I think the order can be changed too. Since the name server is identified by the @ sign you can specify it later on too:
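That is:

    dig rakhesh.com MX @8.8.8.8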

I will use this syntax everywhere as it’s easier for me to remember.

You can complicate the simple syntax by adding options to dig. There are two sorts of options: query options and domain options (I made the names up). These are usually placed after the type, but again the order doesn’t really matter (a few worked examples follow the list below):

  • Query options start with a - (dash) and look like the switches you have in most command line tools. These are useful to control the behavior of dig when making the query.
    • For instance: say I want dig to only use IPv4. I can add the option -4. Similarly, for IPv6 only, it is -6.
    • If I want to specify the port of the name server (i.e. apart from 53) the -p option is for that.
    • (Very useful) If I want to specify a different source port for the client (i.e. the port from which the outgoing request is made and to which the reply is sent, in case you want to eliminate any issues with that) the -b switch is for that. It takes as argument address#port, if you don’t care about the address leave it as 0.0.0.0#port.
    • Apart from specifying the name and type as above it is also possible to pass these as query options via -n and -t.
    • There are some more query options. Above are just the common ones I can remember.
  • Domain options start with a + (plus). These pertain to the name lookup you are doing.
    • For instance: typically DNS queries are performed using UDP. To use TCP instead do +tcp. Most of these options can be prefixed with a “no”, so if you explicitly want to not use TCP then you can do +notcp instead.
    • To specify the default domain name do +domain=xxx.
    • To turn off or on recursive mode do +norecurse or +recurse. (This is similar to the -recurse or -norecurse switch in nslookup).
      • Note: By default recursion is on.
      • A quick note on recursion: When recursion is on, dig asks the name server you supply (or the one used by the OS) for an answer, and if that name server doesn’t have an answer it is expected to ask other name servers and come back with one. Hence the term “recursive”. The name server sends a recursive query to a name server it knows of, which sends a recursive query to a name server it knows of, until an answer is obtained down the chain of delegation.
      • A quick note on iterative (non-recursive) queries: When recursion is off, dig gets a referral from the first name server, then queries the name server it was referred to, gets another referral and queries the next one, and so on … until dig itself gets an answer from a server that knows.
    • The output from dig is usually verbose and shows all the sections. To keep it short use the +short option.
      • This doesn’t show the name of the server that dig got an answer from. To print this too use the +identify option.
    • To list all the authoritative name servers of a domain along with their SOA records use the +nssearch option. This disables recursion.
    • To list a trace of the delegation path use the +trace option. This too disables recursion. The delegation path is listed by dig contacting each name server down the chain iteratively.
    • Couple of output modifying options. Most of these require no changing:
      • By default the first line of the output shows the version of  dig, the query, and some extra bits of info. To hide that use +nocmd.
      • By default the answer section of replies is shown. To hide it do +noanswer.
      • By default the additional sections of a reply are shown. To hide it do +noadditional.
      • By default the authority section of a reply is shown. To hide it do +noauthority.
      • By default the question asked is shown as a comment (this is useful because you might type something like dig rakhesh.com which dig interprets as dig rakhesh.com A so this makes the question explicit). To hide this do +noquestion.
        • Note that this is not same as the query that is sent. The query is the question plus other flags like whether we want a recursive query etc. By default the query that is sent is not shown. To show that do +qr.
      • By default comments are printed. To hide it do +nocomments.
      • By default some stats are printed. To hide these do +nostats.
    • To change the timeout use +time=X. Default is 5 seconds. This is equivalent to nslookup -timeout=X.
    • To change the number of retries use +retry=X. Default is 3. This is equivalent to nslookup -retry=X.
    • There are some more domain options. Above are just the common ones I can remember.
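The worked examples promised above, with 8.8.8.8 again standing in for a real name server: a terse answer that also identifies the responding server; the SOA from each authoritative server; an IPv4-only query from a fixed client source port; and a TCP query with trimmed output.

    dig rakhesh.com A @8.8.8.8 +short +identify
    dig rakhesh.com +nssearch
    dig -4 -b 0.0.0.0#5353 rakhesh.com A @8.8.8.8
    dig rakhesh.com A @8.8.8.8 +tcp +nostats +nocmd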

An interesting thing about dig (since BIND 9) is that it can do multiple queries (the brackets below are not part of the command, I put them to show the multiple queries):
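Something like this, with the brackets only marking where one query ends and the next begins:

    dig [rakhesh.com A +short] [rakhesh.com MX @8.8.8.8]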

Each of these queries can take its own options but in addition to that you can also provide global options. So the full command when you have multiple queries and global options is as below (again, without the brackets):
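Schematically, with the global options first and each bracketed query carrying its own options:

    dig +nostats +noquestion [rakhesh.com A +short] [rakhesh.com MX]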

To add to the fun you can even have one name server to be used for all queries and over-ride these via name servers for each query! Thus a fuller syntax is:
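Along these lines (schematic again; 8.8.4.4 overrides the global 8.8.8.8 for the second query):

    dig @8.8.8.8 +nostats [rakhesh.com A] [rakhesh.com MX @8.8.4.4]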

This is also why the options can be at the beginning of the query rather than at the end of the query. When you specify global options you can over-ride these per query (except the +[no]cmd option).

Windows Update on remote machines

I mentioned yesterday that one can Windows Update a machine via the script located at c:\Windows\System32\en-US\WUA_SearchDownloadInstall.vbs. It’s easy, just run the following command on a machine:
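That is, something like:

    cscript //nologo C:\Windows\System32\en-US\WUA_SearchDownloadInstall.vbs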

I thought of taking it one step ahead and running it on remote computers via PsExec. So I copied the script to the C:\ of all my servers (it’s only present in Server Core by default) and executed it via PsExec:
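Roughly like this (server name made up):

    psexec \\SERVER01 cscript //nologo C:\WUA_SearchDownloadInstall.vbs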

That worked – sort of. I got a list of updates and I selected to download and install them all, but it just seemed to hang after that. I know the script (and even Windows Update GUI) is slow in general so I gave PsExec a long long time to complete, but that didn’t help.

Side by side I was searching for any PowerShell alternative to this script and came across this one. Compared to the VBScript technique it has an advantage (apart from being in PowerShell!) that I can control the “install updates” and “reboot” behaviors via switches. So all I needed to do was run something like this from a command-prompt window to install all available updates on a machine and then reboot:
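The link to the script isn’t preserved in this copy of the post, so the script path and switch names below are placeholders. The idea was a single command along these lines:

    rem script path and switch names are placeholders, not the actual script's
    powershell.exe -ExecutionPolicy Bypass -File C:\Scripts\Install-WindowsUpdates.ps1 -AcceptAll -AutoReboot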

Neato!

Thought I’d try run this remotely via PsExec but this time I got a lot of errors:

Looking more into this I came across an MSDN article about using the Windows Update Agent (WUA) remotely. Turns out the CreateUpdateDownloader method that’s erroring above – which from its name sounds like the method responsible for downloading updates – is not allowed to be called remotely. Looking at the VBScript too, it has a section like the below where it hangs, so that explains why I couldn’t run that script remotely either.
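The VBScript essentially does the equivalent of the following (shown here as a PowerShell sketch of the same WUA COM calls; CreateUpdateDownloader is the call that isn’t allowed remotely):

    $session  = New-Object -ComObject Microsoft.Update.Session
    $searcher = $session.CreateUpdateSearcher()
    $results  = $searcher.Search("IsInstalled=0 and Type='Software'")
    $downloader = $session.CreateUpdateDownloader()    # this is the call that fails when invoked remotely
    $downloader.Updates = $results.Updates
    $downloader.Download()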

I found some more PowerShell scripts that update Windows machines – for example this and this. All of them use the same methods and so don’t work remotely. The blog post talking about the last script goes into more detail on an alternative method though. The trick is to create a scheduled task with the PowerShell script and run that on demand remotely. Since it runs locally, the PowerShell script will then succeed! I am yet to try it out but it seems like a reasonable workaround. I can deploy it via a GPO, after all, to all my machines.

From that post though I noticed the author creates the scheduled task as the LOCALSYSTEM account. So I re-ran PsExec, but this time told it to execute the command as the remote LOCALSYSTEM account. And that worked! So now I can run a command like this:
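For the VBScript that’s something like:

    psexec \\SERVER01 -s cscript //nologo C:\WUA_SearchDownloadInstall.vbs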

Or this:
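And for the PowerShell script (again with the placeholder script name from earlier):

    psexec \\SERVER01 -s powershell.exe -ExecutionPolicy Bypass -File C:\Scripts\Install-WindowsUpdates.ps1 -AcceptAll -AutoReboot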

And am able to update a machine remotely. Nice! I prefer the PowerShell method as it lets me reboot the machine too without any prompt.

Notes of Windows Update (wuauclt)

Had to update some of my Windows Server Core servers. Just writing these as a note to my future self.

The Windows Update command is wuauclt. I can never remember that command name (except that it starts with “wu”, short for “Windows Update”), so I always go into c:\windows\system32 and type “wu” followed by a couple of TABs.

The command doesn’t have any output or help switches. Here’s a post with a list of switches. In my experience none of the switches return any output, even if you enter the wrong switch. Some of the legit switches like /showWindowsUpdate and /showWUAutoScan return an error on Server Core – possibly because the UI doesn’t exist.

To check for new updates the following switch works: /detectNow.

To update the WSUS server with the client’s status the following switch supposedly works: /r /ReportNow.

Windows Update has a log file located at c:\Windows\WindowsUpdate.log. It’s a useful file. For instance, after I applied a policy to change all my domain servers to point to the new WSUS server I could browse this log file to see the results. I could also see on my Server Core installs an error along these lines: “Can not perform non-interactive scan if AU is interactive-only”. This error is because I had set the Windows Update GPOs to be interactive but Server Core didn’t have a GUI for interactive operations.

For Server Core the easiest way to check for updates is via SConfig. Open it and select option 6 (Download and Install Updates). This just runs another script – located in c:\Windows\System32\en-US – called WUA_SearchDownloadInstall.vbs. So one could really run a command like this on Server Core:
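That is, something along the lines of:

    cscript //nologo C:\Windows\System32\en-US\WUA_SearchDownloadInstall.vbs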

That’s all for now!

 

 

KiTTY with Fedora 22/ OpenSSH 6.9 gives a blank screen and hangs

The KiTTY SSH client for Windows gives a blank screen and hangs when connecting to my Fedora 22 box (which runs OpenSSH 6.9). KiTTY is a fork of PuTTY, which I am happy with, so there’s no particular reason for me to use KiTTY except that I am checking it out.

[Image: KiTTY hung]

After a while it gives an error that the server closed the connection.

[Image: KiTTY fatal]

I have an Ubuntu Server 14.10 box and a CentOS 6.7 box, and KiTTY works fine with both. So it seems to be related to the version of OpenSSH in Fedora 22.

Fedora 22 logs show entries like these:

Looks like key-exchange was failing?

/*

Note to self: An alternative command syntax to the above is this:
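Presumably the original command was journalctl -u sshd.service; the alternative syntax, matching on the unit field explicitly, would be:

    journalctl _SYSTEMD_UNIT=sshd.service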

I also noticed that including SELinux in the output gave a bit more info. Not that it helped, but it’s worth mentioning here as a reference to myself. For that I include the sshd.service unit and also the SELinux context for sshd:
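Something along these lines (the sshd_t context shown is the standard one on Fedora):

    journalctl _SYSTEMD_UNIT=sshd.service _SELINUX_CONTEXT=system_u:system_r:sshd_t:s0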

Tip: After typing journalctl it is possible to keep pressing TAB to get the fields and their values. That’s how I discovered the SSHD context above.

On the topic of journalctl: this intro post by its creator and this DigitalOcean tutorial are worth a read.

Another useful tip with journalctl: by default it doesn’t wrap the output, but you can scroll left and right with the keyboard. This isn’t useful when you want to copy-paste the logs somewhere. The work-around is to use the --no-pager switch so everything’s dumped out together, and then pipe it into less, which wraps the text:
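For example:

    journalctl -u sshd.service --no-pager | less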

*/

I turned on SSH packet and raw data logging in KiTTY.

[Image: KiTTY logging]

From the log files I could see a sequence of events like this:

And with that it stops. So it looked like the DH Key Exchange was failing. Which confirms what I saw on the server side too.

SSH protocol 2 supports DSA, ECDSA, Ed25519 and RSA keys when clients establish a connection to the server. However, the first step is a Key Exchange for forward-secrecy and for this the DH algorithm is used.  From the sshd(8) manpage:

For protocol 2, forward security is provided through a Diffie-Hellman key agreement.  This key agreement results in a shared session key.  The rest of the session is encrypted using a symmetric cipher, currently 128-bit AES, Blowfish, 3DES, CAST128, Arcfour, 192-bit AES, or 256-bit AES.  The client selects the encryption algorithm to use from those offered by the server.  Additionally, session integrity is provided through a cryptographic message authentication code (hmac-md5, hmac-sha1, umac-64, umac-128, hmac-ripemd160, hmac-sha2-256 or hmac-sha2-512).

From the sshd_config(5) manpage:

The supported algorithms are:

curve25519-sha256@libssh.org
diffie-hellman-group1-sha1
diffie-hellman-group14-sha1
diffie-hellman-group-exchange-sha1
diffie-hellman-group-exchange-sha256
ecdh-sha2-nistp256
ecdh-sha2-nistp384
ecdh-sha2-nistp521

The default is:

curve25519-sha256@libssh.org,
ecdh-sha2-nistp256,ecdh-sha2-nistp384,ecdh-sha2-nistp521,
diffie-hellman-group-exchange-sha256,
diffie-hellman-group14-sha1

The KexAlgorithms option can be used to change this but the point is all these are variants of the DH algorithm.

KiTTY has the following algorithm selection options:

[Image: KiTTY options]

For protocol 2 only the first three (Diffie-Hellman options) apply. The fourth one (RSA-based key exchange) only applies for protocol 1. Comparing this with the preferred order for OpenSSH only the first and second options are common:

  • Diffie-Hellman group exchange (a.k.a. diffie-hellman-group-exchange-sha256)
  • Diffie-Hellman group 14 (a.k.a. diffie-hellman-group14-sha1)

Of these the second one (diffie-hellman-group14-sha1) is the older one. It is defined in RFC 4253. This RFC defines two key exchange algorithms: diffie-hellman-group14-sha1 and diffie-hellman-group1-sha1. (The group number refers to a fixed Diffie-Hellman group of a certain size – group 14 is 2048 bits; larger is better.) (Correction: Turns out diffie-hellman-group1-sha1 actually uses Oakley Group 2, which is 1024 bits, not Group 1.)

The key exchange algorithms of RFC 4253 are updated via RFC 4419, which defines two additional algorithms: diffie-hellman-group-exchange-sha1 and diffie-hellman-group-exchange-sha256. I am not very clear about these, but it looks like they remove any weaknesses to do with groups of a fixed number of bits and introduce the ability for the client and server to negotiate a custom group (of 1024 – 8192 bits). From RFC 4419:

Currently, SSH performs the initial key exchange using the “diffie-hellman-group1-sha1” method [RFC4253].  This method prescribes a fixed group on which all operations are performed.

The security of the Diffie-Hellman key exchange is based on the difficulty of solving the Discrete Logarithm Problem (DLP). Since we expect that the SSH protocol will be in use for many years in the future, we fear that extensive precomputation and more efficient algorithms to compute the discrete logarithm over a fixed group might pose a security threat to the SSH protocol.

The ability to propose new groups will reduce the incentive to use precomputation for more efficient calculation of the discrete logarithm.  The server can constantly compute new groups in the background.

So, to summarize: the common algorithms between KiTTY and OpenSSH are diffie-hellman-group-exchange-sha256 and diffie-hellman-group14-sha1. By default the preferred order between KiTTY client and OpenSSH server are in the order I specified above. In terms of security diffie-hellman-group-exchange-sha256 is preferred over diffie-hellman-group14-sha1. For some reason, however, KiTTY breaks with the newer version of OpenSSH when it comes to this stage.

Since diffie-hellman-group-exchange-sha256 is what both would be using (and that is where things were breaking), I changed the preferred order on KiTTY so that diffie-hellman-group14-sha1 is selected instead.

[Image: KiTTY order]

Once I did that, KiTTY connected successfully with OpenSSH – so that worked around the problem for now, albeit by using a weaker algorithm – but at least it works. I wonder why diffie-hellman-group-exchange-sha256 breaks though.

I downloaded the latest version of PuTTY and gave that a shot with the default options (which are the same as KiTTY’s). That worked fine! From the PuTTY logs I could see that at the point where KiTTY failed, PuTTY manages to negotiate a key successfully:

Comparing the configuration settings of PuTTY and KiTTY I found the following:

[Image: PuTTY fix]

If I change the setting from “Auto” to “Yes”, PuTTY also hangs like KiTTY. So that narrows down the problem. RFC 4419 is the one I mentioned above, which introduced the new algorithms. Looks like that has been updated and KiTTY doesn’t support the new algorithms whereas PuTTY is able to.

PuTTY’s changelog didn’t have anything but its wishlist contained mention of what was going on. Apparently it isn’t a case of an update to RFC 4419 or any new algorithms, it is just a case of OpenSSH now strictly implementing the message format of RFC 4419 and thus breaking clients who do not implement this yet. To understand this have a look at this bit from the log entries of PuTTY and KiTTY at the point where KiTTY fails:

Pretty identical except for the type number. KiTTY uses 30 while PuTTY now uses 34. If you look at RFC 4419 these numbers are defined thus:

The following message numbers have been defined in this document.

They are in a name space private to this document and not assigned by IANA.

     #define SSH_MSG_KEX_DH_GEX_REQUEST_OLD  30
     #define SSH_MSG_KEX_DH_GEX_REQUEST      34
     #define SSH_MSG_KEX_DH_GEX_GROUP        31
     #define SSH_MSG_KEX_DH_GEX_INIT         32
     #define SSH_MSG_KEX_DH_GEX_REPLY        33

SSH_MSG_KEX_DH_GEX_REQUEST_OLD is used for backward compatibility. Instead of sending “min || n || max”, the client only sends “n”.  In addition, the hash is calculated using only “n” instead of “min || n || max”.

So RFC 4419 has updated the type number to 34 and set aside 30 for backward compatibility. Ideally clients should be sending type number 34 in their request. From the PuTTY wishlist link I saw the OpenSSH commit that stopped the server from accepting type 30 any more. That’s why KiTTY was breaking, while PuTTY had updated its message type to the correct one and was successfully connecting! :)

(While digging around I saw that WinSCP too is (was?) affected.)

Exporting and Importing Windows DNS zones

Exporting a DNS zone is easy. Use dnscmd:
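For example, to export the zone blah.com (the file lands in c:\windows\system32\dns on the DNS server):

    dnscmd /zoneexport blah.com blah.com.dns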

Importing too is easy but the commands aren’t so obvious. Again you use dnscmd, with the /zoneadd switch as though you are creating a new zone. The help page for this misses out on an important switch though – /load – which lets you load the zone from an exported or pre-existing file.

You can find this switch in the dnscmd help:

So the way to import a zone is as follows: first, copy the exported file into the c:\windows\system32\dns folder of the DNS server and preferably rename it so the extension is .dns (not required, just a nice thing to do). Then run a command similar to the one below:
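Along these lines:

    dnscmd /zoneadd blah.com /primary /file blah.com.dns /load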

That’s it. This will create a primary zone called “blah.com” and use the zone file that’s already in the location.

Note that you can’t use this technique for AD integrated zones. But that’s no issue. Simply import as above and then convert the zone to AD integrated via the GUI.