What is esx.problem.hyperthreading.unmitigated?

I upgraded one of our ESXi hosts with the patches released today that address the L1 Terminal Fault vulnerabilities. After that, the host started showing this warning: esx.problem.hyperthreading.unmitigated. I had no idea what it was supposed to mean!

I went to Configure > Settings > Advanced System Settings and searched for anything with “hyperthread” in it. That turned up VMkernel.Boot.hyperthreadingMitigation, which was set to “false” and sounded suspiciously similar to the warning. I changed it to “true”, rebooted the host, and Googled the setting, which led me to this KB article. It’s a good read, but here are some excerpts if you only want the highlights:

Like Meltdown, Rogue System Register Read, and “Lazy FP state restore”, the “L1 Terminal Fault” vulnerability can occur when affected Intel microprocessors speculate beyond an unpermitted data access. By continuing the speculation in these cases, the affected Intel microprocessors expose a new side-channel for attack. (Note, however, that architectural correctness is still provided as the speculative operations will be later nullified at instruction retirement.)

CVE-2018-3646 is one of these Intel microprocessor vulnerabilities and impacts hypervisors. It may allow a malicious VM running on a given CPU core to effectively infer contents of the hypervisor’s or another VM’s privileged information residing at the same time in the same core’s L1 Data cache. Because current Intel processors share the physically-addressed L1 Data Cache across both logical processors of a Hyperthreading (HT) enabled core, indiscriminate simultaneous scheduling of software threads on both logical processors creates the potential for further information leakage. CVE-2018-3646 has two currently known attack vectors which will be referred to here as “Sequential-Context” and “Concurrent-Context.” Both attack vectors must be addressed to mitigate CVE-2018-3646.

Attack Vector Summary

Sequential-context attack vector: a malicious VM can potentially infer recently accessed L1 data of a previous context (hypervisor thread or other VM thread) on either logical processor of a processor core.

Concurrent-context attack vector: a malicious VM can potentially infer recently accessed L1 data of a concurrently executing context (hypervisor thread or other VM thread) on the other logical processor of the hyperthreading-enabled processor core.

Mitigation Summary

Mitigation of the Sequential-Context attack vector is achieved by vSphere updates and patches. This mitigation is enabled by default and does not impose a significant performance impact. Please see resolution section for details.

Mitigation of the Concurrent-context attack vector requires enablement of a new feature known as the ESXi Side-Channel-Aware Scheduler. The initial version of this feature will only schedule the hypervisor and VMs on one logical processor of an Intel Hyperthreading-enabled core. This feature may impose a non-trivial performance impact and is not enabled by default.
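To get a feel for what “only one logical processor per core” means for capacity, here is a rough sketch in Python. It is my own simplified model, not anything from the KB: the function name and parameters are made up for illustration, and it only counts logical processors the scheduler can place work on. It says nothing about real throughput, since Hyperthreading never doubled that in the first place.

```python
def schedulable_logical_cpus(physical_cores: int,
                             hyperthreading: bool,
                             side_channel_aware: bool) -> int:
    """Count logical processors the scheduler may use (simplified model)."""
    # With Hyperthreading, each core exposes two logical processors.
    threads_per_core = 2 if hyperthreading else 1
    if side_channel_aware:
        # Per the KB excerpt: the initial scheduler version uses only
        # one logical processor of each Hyperthreading-enabled core.
        threads_per_core = 1
    return physical_cores * threads_per_core


# Hypothetical 16-core host with Hyperthreading enabled.
print(schedulable_logical_cpus(16, True, False))  # 32
print(schedulable_logical_cpus(16, True, True))   # 16
```

In this model, turning the scheduler on halves the number of logical processors available on a Hyperthreading host, which is why the KB talks about a non-trivial performance impact.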

So that’s what the warning was about. To enable the ESXi Side-Channel-Aware Scheduler we need to set VMkernel.Boot.hyperthreadingMitigation to “true”. More excerpts:

The Concurrent-context attack vector is mitigated through enablement of the ESXi Side-Channel-Aware Scheduler which is included in the updates and patches listed in VMSA-2018-0020. This scheduler is not enabled by default. Enablement of this scheduler may impose a non-trivial performance impact on applications running in a vSphere environment. The goal of the Planning Phase is to understand if your current environment has sufficient CPU capacity to enable the scheduler without operational impact.

The following list summarizes potential problem areas after enabling the ESXi Side-Channel-Aware Scheduler:

VMs configured with vCPUs greater than the physical cores available on the ESXi host

VMs configured with custom affinity or NUMA settings

VMs with latency-sensitive configuration

ESXi hosts with Average CPU Usage greater than 70%

Hosts with custom CPU resource management options enabled

HA Clusters where a rolling upgrade will increase Average CPU Usage above 100%

Note: It may be necessary to acquire additional hardware, or rebalance existing workloads, before enablement of the ESXi Side-Channel-Aware Scheduler. Organizations can choose not to enable the ESXi Side-Channel-Aware Scheduler after performing a risk assessment and accepting the risk posed by the Concurrent-context attack vector. This is NOT RECOMMENDED and VMware cannot make this decision on behalf of an organization.
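The host-level items in that list are easy to turn into a quick pre-flight check. The sketch below is only an illustration under my own assumptions: the HostSnapshot fields are hypothetical names that you would have to fill in from your own inventory or monitoring data, and the cluster-level HA item is left out because it needs data from more than one host.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HostSnapshot:
    """Hypothetical summary of one ESXi host; all field names are illustrative."""
    physical_cores: int
    avg_cpu_usage_pct: float
    largest_vm_vcpus: int
    has_custom_affinity_or_numa: bool = False
    has_latency_sensitive_vms: bool = False
    has_custom_cpu_resource_mgmt: bool = False


def planning_warnings(host: HostSnapshot) -> List[str]:
    """Return the KB's potential problem areas that apply to this host."""
    warnings = []
    if host.largest_vm_vcpus > host.physical_cores:
        warnings.append("a VM has more vCPUs than the host has physical cores")
    if host.has_custom_affinity_or_numa:
        warnings.append("VMs with custom affinity or NUMA settings")
    if host.has_latency_sensitive_vms:
        warnings.append("VMs with latency-sensitive configuration")
    if host.avg_cpu_usage_pct > 70:
        warnings.append("average CPU usage is above 70%")
    if host.has_custom_cpu_resource_mgmt:
        warnings.append("custom CPU resource management options enabled")
    return warnings


# Made-up example host: 16 cores, 75% average load, one 24-vCPU VM.
host = HostSnapshot(physical_cores=16, avg_cpu_usage_pct=75.0, largest_vm_vcpus=24)
for warning in planning_warnings(host):
    print("check before enabling the scheduler:", warning)
```

An empty list does not mean the host is safe to switch over; it only means none of the listed red flags showed up in the numbers you fed in.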

So to fix the second issue (the Concurrent-context attack vector) we need to enable the new scheduler. That can cost performance, so it makes sense that it has to be enabled manually: you know you have done it and can keep an eye on the load afterwards. Also, if you are not in a shared environment and are willing to accept the risk, you can leave it off. Makes sense.
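If you would rather flip the setting from the ESXi shell than from the UI, the esxcli kernel settings namespace should do it. The sketch below only builds and prints the command lines; it does not run anything. The option name hyperthreadingMitigation and the flags are what I understand the KB to use, so treat them as an assumption and check them against the KB for your build. The helper names are my own. A reboot is still needed afterwards, same as with the UI change.

```python
import shlex

# Assumption: the kernel setting behind VMkernel.Boot.hyperthreadingMitigation.
KERNEL_SETTING = "hyperthreadingMitigation"


def list_setting_cmd() -> list:
    """Command that shows the configured and runtime value of the setting."""
    return ["esxcli", "system", "settings", "kernel", "list",
            "-o", KERNEL_SETTING]


def enable_setting_cmd() -> list:
    """Command that enables the Side-Channel-Aware Scheduler at next boot."""
    return ["esxcli", "system", "settings", "kernel", "set",
            "-s", KERNEL_SETTING, "-v", "TRUE"]


# Print the commands so they can be reviewed before running them on a host.
for cmd in (list_setting_cmd(), enable_setting_cmd()):
    print(shlex.join(cmd))
```

Printing instead of executing is deliberate: you can paste the lines into an SSH session on one host, confirm the warning goes away after the reboot, and only then roll it out further.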

That warning message could have been a bit more verbose though! :)