KVM HA: fence by confirming host power state (fix host stuck in Fencing when already powered off) #13377
KVMHAProvider.fence() declared a host fenced only when the out-of-band power-off command reported success. Against an already-off chassis the BMC rejects the power-off (e.g. Redfish returns HTTP 409), so fence() failed and the host stayed stuck in the Fencing HA state, which maps to Disconnected (not Down). VM-HA therefore never restarted the VMs until the dead host was powered back on.
Fencing now succeeds based on the actual chassis power state:
- if the host is already powered off (OOBM STATUS == Off), treat it as fenced;
- otherwise issue a best-effort power-off and confirm via OOBM STATUS;
- only a confirmed Off state counts as success; if the state cannot be confirmed (e.g. unreachable BMC) the fence fails and is retried, to avoid split-brain.
Also map Redfish PowerOperation.OFF to ForceOff (hard power-off) instead of GracefulShutdown, consistent with the ipmitool driver and appropriate for fencing an unresponsive host (SOFT remains the graceful ACPI shutdown).
Fixes apache#13376
Codecov Report❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## 4.22 #13377 +/- ##
=========================================
Coverage 17.67% 17.68%
- Complexity 15792 15798 +6
=========================================
Files 5922 5922
Lines 533165 533184 +19
Branches 65208 65211 +3
=========================================
+ Hits 94242 94273 +31
+ Misses 428276 428264 -12
Partials 10647 10647
|
@blueorangutan package kvm |
|
@andrijapanicsb a [SL] Jenkins job has been kicked to build packages. It will be bundled with kvm SystemVM template(s). I'll keep you posted as I make progress. |
|
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 18199 |
|
@blueorangutan test ol9 kvm-ol9 keepEnv |
|
@andrijapanicsb a [SL] Trillian-Jenkins test job (ol9 mgmt + kvm-ol9) has been kicked to run smoke tests |
|
@blueorangutan test ol9 kvm-ol9 |
|
@andrijapanicsb a [SL] Trillian-Jenkins test job (ol9 mgmt + kvm-ol9) has been kicked to run smoke tests |
TL;DR --> hw/sw setup + what was tested + clock-measured timing results for VM HA to kick in:
What was measured/tested:
KVM.ha global configs changed:
The last setting ensures that CloudStack skips any attempt to "recover" the host via the BMC POWER RESET command (i.e. it tries 0 times) and instead fences it immediately via the BMC POWER OFF command, since the host has already reached the "Degraded" state and needs help (kill or fix).
Host HA fencing improvement: handle already-powered-off hosts and reduce HA VM restart delay
This PR addresses a Host HA fencing scenario observed during testing on a physical environment using HPE iLO5 / BMC-based out-of-band management with the IPMI driver (yet to test RedFish, which
The test environment was based on Apache CloudStack 4.22.1 with KVM. Primary storage was configured as CloudStack shared mount point storage backed by an OCFS2 clustered filesystem, which is now supported for Host HA. Host HA was enabled only on a single selected host for this test. On that host, we placed two VMs:
A fat jar was produced from a branch based directly on the CloudStack 4.22.1 tag. The jar was extracted from the built RPM package and used for testing.
Scenario being tested
The test intentionally simulated a somewhat unusual but important failure scenario: the host was manually powered off through the BMC / IPMI / iLO interface before CloudStack completed its Host HA fencing flow. This scenario matters because, depending on the out-of-band driver implementation, sending a power-off command to a chassis that is already powered off may return an error (Redfish does this; IPMI is not affected) or otherwise be interpreted as a failed fencing operation. The important point is that CloudStack should not treat “the host is already powered off” as a fencing failure. If the final power state is off, the host is effectively fenced and VM HA can safely proceed.
Logic introduced by the patch
The patched logic changes the fencing flow to be state-driven instead of relying only on the return status of the BMC power-off command. The intended behavior (after the host reaches the Degraded state) is:
In short: the final observed power state is what matters. If the chassis is off, the host is fenced.
Test results
The test confirmed the expected VM HA behavior:
Before the patch, the host reached Alert state after approximately 2 minutes and 30 seconds, but it was not marked Down / fenced until 16:07:55. VM-HA then fired a VM start (for the HA-enabled VM only), i.e. the VM.START event was observed one second later, at 16:07:56. This means the HA-enabled VM experienced roughly 8 minutes of downtime before the restart began.
With the patched logic (replacing the fat jar), the same type of test was repeated. The chassis was manually powered off at 16:16:00. The host was marked Down / fenced at 16:18:39, and the HA-enabled VM start event was observed one second later, at 16:18:40. This reduced the time before HA restart from roughly 8 minutes to roughly 2 minutes and 40 seconds. The non-HA VM was not restarted in either case, which is the expected behavior.
Result
The patch reduced the observed HA VM restart delay by approximately 5 minutes and 16 seconds in this test scenario. More importantly, it makes the fencing logic safer and more deterministic: if the host is already powered off, CloudStack should recognize that condition as a successful fencing state rather than waiting longer or treating the operation as failed because the power-off command itself did not behave as expected (Redfish protocol). This allows Host HA to proceed much sooner while still preserving the important safety rule: VM HA should only be triggered after the host has been confirmed powered off / fenced. |
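The quoted figures can be cross-checked with a little arithmetic. The sketch below uses only the patched-run timestamps (16:16:00 and 16:18:40) and the stated reduction of 5 minutes 16 seconds; the class and method names are illustrative and not part of the patch.

```java
import java.time.Duration;
import java.time.LocalTime;

/** Illustrative arithmetic only; names are hypothetical. */
final class HaTimingSketch {
    private HaTimingSketch() { }

    /** Delay from manual chassis power-off to the HA VM start event in the patched run. */
    static Duration patchedDelay() {
        LocalTime poweredOff = LocalTime.of(16, 16, 0);
        LocalTime vmStartEvent = LocalTime.of(16, 18, 40);
        return Duration.between(poweredOff, vmStartEvent);
    }

    /** Unpatched delay implied by the reported 5 min 16 s improvement. */
    static Duration impliedUnpatchedDelay() {
        Duration reportedReduction = Duration.ofMinutes(5).plusSeconds(16);
        return patchedDelay().plus(reportedReduction);
    }
}
```

The patched delay works out to 160 seconds (2 min 40 s) and the implied unpatched delay to 476 seconds (7 min 56 s), which agrees with the "roughly 8 minutes" reported for the unpatched run.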
|
[SF] Trillian test result (tid-16285)
|
Description
When a KVM host with host-HA + out-of-band management (OOBM) enabled is hard powered off (forced chassis-off from the BMC, or a real power/cable failure), CloudStack never transitions the host to Down and therefore never restarts its VMs on other hosts; the host stays in Alert/Disconnected indefinitely.
Root cause: the host-HA state machine declares a host dead (HAState.Fenced → investigator Status.Down) only after a successful OOBM power-off. Against an already-off chassis the BMC rejects the power-off (the Redfish driver maps OFF to GracefulShutdown, which returns HTTP 409 when the system is already off), so KVMHAProvider.fence() reports failure and the host stays stuck in the Fencing state, which HAManagerImpl.getHostStatusFromHAConfig() maps to Status.Disconnected, not Status.Down. VM-HA is therefore never invoked, and the VMs are only recovered once the original (dead) host is powered back on, at which point the pending power-off finally succeeds.
Observed in production with Redfish/iDRAC. Full root-cause analysis and management-server log evidence are in #13376.
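To make the root cause concrete, here is a minimal sketch of the state mapping described above. It models only the two HA states named in the description; the enum and class names are hypothetical stand-ins, not the actual HAManagerImpl code.

```java
/** Hypothetical subset of HA states: only the two discussed in the description. */
enum HaState { FENCING, FENCED }

/** Hypothetical subset of host statuses reported to the rest of the system. */
enum HostStatus { DISCONNECTED, DOWN }

final class HaStatusSketch {
    private HaStatusSketch() { }

    /** Fencing maps to Disconnected; only Fenced maps to Down. */
    static HostStatus hostStatusFor(HaState state) {
        switch (state) {
            case FENCED:
                return HostStatus.DOWN;
            case FENCING:
                return HostStatus.DISCONNECTED;
            default:
                throw new IllegalArgumentException("unmodelled state: " + state);
        }
    }

    /** VM-HA restarts VMs only once the host is reported Down. */
    static boolean vmHaMayRestart(HaState state) {
        return hostStatusFor(state) == HostStatus.DOWN;
    }
}
```

Under this model, a host whose fence operation keeps failing never leaves FENCING, so vmHaMayRestart stays false for as long as the power-off is rejected.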
Fix
Fencing now succeeds based on the actual chassis power state, not the power-off command's return code:
- if the host is already powered off (OOBM STATUS == Off) → treat it as fenced (no power-off issued);
- otherwise issue a best-effort power-off and confirm via OOBM STATUS;
- only a confirmed Off state counts as a successful fence; if the state cannot be confirmed (e.g. an unreachable BMC) the fence fails and is retried, to avoid split-brain.
This is OOBM-driver-agnostic (works for ipmitool, Redfish and nested-cloudstack drivers).
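The state-driven flow can be sketched roughly as follows. This is a simplified illustration, not the code in KVMHAProvider: OobmClient, PowerState and FenceSketch are hypothetical names, and an unreachable BMC is modelled as an empty Optional.

```java
import java.util.Optional;

/** Chassis power state as reported by an out-of-band STATUS query (hypothetical type). */
enum PowerState { ON, OFF }

/** Hypothetical stand-in for an OOBM driver; not the CloudStack interface. */
interface OobmClient {
    /** Returns the chassis power state, or empty if the BMC could not be queried. */
    Optional<PowerState> status();

    /** Issues a hard power-off; returns false if the BMC rejects the command. */
    boolean powerOff();
}

final class FenceSketch {
    private FenceSketch() { }

    /** Returns true only when the chassis is confirmed Off. */
    static boolean fence(OobmClient oobm) {
        // 1. Already off: treat as fenced without issuing a power-off.
        if (isConfirmedOff(oobm.status())) {
            return true;
        }
        // 2. Best-effort power-off: the command result is deliberately ignored.
        try {
            oobm.powerOff();
        } catch (RuntimeException e) {
            // Fall through; the STATUS check below decides the outcome.
        }
        // 3. Only a confirmed Off state counts; an unknown state fails the fence.
        return isConfirmedOff(oobm.status());
    }

    private static boolean isConfirmedOff(Optional<PowerState> state) {
        return state.isPresent() && state.get() == PowerState.OFF;
    }
}
```

Returning false when the state cannot be read leaves the retry decision to the caller, which matches the stated goal of never declaring a host fenced on an unconfirmed power state.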
Additionally, the Redfish driver now maps PowerOperation.OFF to ForceOff (a hard power-off) instead of GracefulShutdown, consistent with the ipmitool driver and appropriate for fencing an unresponsive host; SOFT remains the graceful ACPI shutdown. Also fixes a latent String.format argument-count bug on the Redfish STATUS branch.
Fixes: #13376
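As an illustration of the new mapping, the sketch below covers only the two operations discussed here. PowerOp and RedfishResetSketch are hypothetical names; ForceOff and GracefulShutdown are the Redfish ResetType values named in the description.

```java
/** Hypothetical subset of power operations: only the two discussed here. */
enum PowerOp { OFF, SOFT }

final class RedfishResetSketch {
    private RedfishResetSketch() { }

    /** Maps a power operation to the Redfish ResetType string sent to the BMC. */
    static String resetTypeFor(PowerOp op) {
        switch (op) {
            case OFF:
                // Hard power-off: suitable for fencing an unresponsive host.
                return "ForceOff";
            case SOFT:
                // Graceful ACPI shutdown: relies on the guest OS cooperating.
                return "GracefulShutdown";
            default:
                throw new IllegalArgumentException("unmodelled operation: " + op);
        }
    }
}
```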
Types of changes
Bug Severity
How Has This Been Tested?
Unit tests added to KVMHostHATest (all green) covering the fence behaviour: Off → fenced; Off → still fenced (the regression for this issue);
Note on reproduction: the original symptom reproduces on real Redfish hardware (power-off-when-off → HTTP 409). Software/nested OOBM drivers whose power-off is idempotent (e.g. the nested-cloudstack driver's stopVirtualMachine, which is a no-op on an already-stopped VM) do not exhibit the bug, so the deterministic coverage is provided by the unit tests above.
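The reproduction note can be illustrated with a toy model. Everything below is hypothetical: it contrasts a BMC that rejects power-off on an already-off chassis with an idempotent driver, under the pre-patch rule that a host is fenced only if the command reported success.

```java
/** Toy model of the reproduction note; all names are hypothetical. */
final class ReproSketch {
    private ReproSketch() { }

    /** Models a BMC that rejects power-off when the chassis is already off (the HTTP 409 case). */
    static boolean strictPowerOff(boolean chassisOn) {
        return chassisOn;
    }

    /** Models an idempotent driver: powering off an already-off target succeeds as a no-op. */
    static boolean idempotentPowerOff(boolean chassisOn) {
        return true;
    }

    /** Pre-patch rule as described in the text: fenced only if the command reported success. */
    static boolean legacyFence(boolean powerOffSucceeded) {
        return powerOffSucceeded;
    }
}
```

With the chassis already off, legacyFence(strictPowerOff(false)) is false while legacyFence(idempotentPowerOff(false)) is true, which is why the symptom appears only with drivers of the first kind.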