Skip to content

wslc: idle-terminate per-user session VMs when inactive#40781

Draft
benhillis wants to merge 43 commits into
masterfrom
user/benhill/wslc-idle-terminate-vm
Draft

wslc: idle-terminate per-user session VMs when inactive#40781
benhillis wants to merge 43 commits into
masterfrom
user/benhill/wslc-idle-terminate-vm

Conversation

@benhillis

Copy link
Copy Markdown
Member

Summary

Idle-terminates a per-user WSLC session's backing VM when it has been inactive, freeing memory while the session object (and its persistent storage) lives on. The VM is transparently recreated on the next operation.

Builds on #40770 (IWSLCVirtualMachineFactory).

Behavior

  • Only sessions with persistent storage (StoragePath set) idle-terminate.
  • An idle worker thread tears the VM down after a grace period (currently 30s) once there is no in-flight activity and no active container lock.
  • In-flight work holds an activity reference so the VM cannot be torn down mid-operation:
    • VmLease wraps CLI/container operations.
    • BeginContainerOperation hands clients an activity token (IFastRundown so a client crash reclaims it promptly).
    • Long-lived root-namespace processes (e.g. plugin hosts) created via CreateRootNamespaceProcess hold a keep-alive token for their lifetime.
  • Activity bookkeeping (count + wake event) lives in a shared IdleState held via shared_ptr, decoupled from the session's lifetime, so a held token suppresses idle teardown without extending the session object's lifetime (preserving the explicit-reset-invalidates-held-processes invariant from Add WSLC (WSL Containers) feature #40366).

Testing

  • New WSLCE2EVmIdleTests E2E suite (5 tests) including WSLCE2E_VmIdle_RootProcessKeepsVmAlive.
  • WSLCTests::CreateRootNamespaceProcess still passes.
  • Full x64 Debug build clean.

Notes / follow-ups (deferred)

  • Grace period is a hardcoded constexpr; making it injectable would enable deterministic race tests.
  • No crash-path (client dies holding token) automated coverage yet.

Note

Draft for early review.

Copilot AI review requested due to automatic review settings June 11, 2026 19:50

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds on-demand creation and idle-termination of per-user WSLC session VMs (for sessions with persistent storage), so memory can be reclaimed while keeping the session object and storage intact. It also introduces VM-liveness/activity bookkeeping to prevent teardown during in-flight operations and adds new E2E coverage around VM lifecycle behavior.

Changes:

  • Implement lazy VM bring-up and idle shutdown in wslcsession via an idle worker, activity counting/tokens, and a VmLease used by VM-requiring operations.
  • Add client-side “operation keep-alive” usage in wslc.exe container operations to prevent VM teardown between OpenContainer and subsequent calls/streaming.
  • Add a new E2E test suite validating lazy start, idle stop, persistence across restarts, keep-alive for root-namespace processes, and teardown/recreate races.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
test/windows/wslc/e2e/WSLCE2EVmIdleTests.cpp New E2E tests covering lazy VM start, idle stop, persistence, keep-alive, and race scenarios.
test/windows/wslc/e2e/WSLCE2EHelpers.h Exposes the underlying IWSLCSession* for diagnostics/test-only calls.
src/windows/wslcsession/WSLCSession.h Adds VM lifecycle state, idle worker/tokens/lease declarations, and new session methods.
src/windows/wslcsession/WSLCSession.cpp Implements lazy VM creation, idle teardown, activity tokens, and VM diagnostics reporting.
src/windows/wslcsession/WSLCProcessControl.cpp Preserves a real exit code when signaling container release, only synthesizing SIGKILL when needed.
src/windows/wslcsession/WSLCProcess.h Stores a keep-alive token on root-namespace processes to keep the VM alive for their lifetime.
src/windows/wslcsession/WSLCContainer.cpp Signals idle re-checks on terminal container transitions; holds a VM lease during delete.
src/windows/wslcsession/IORelay.h Adds IsRelayThread() to safely avoid destroying the relay on its own thread.
src/windows/wslcsession/IORelay.cpp Co-initializes the relay thread into the MTA; implements IsRelayThread().
src/windows/wslc/services/SessionModel.h Adds a helper to acquire/hold a keep-alive token for client-side container operations.
src/windows/wslc/services/ContainerService.cpp Uses the keep-alive token across container operations (attach/start/stop/kill/delete/exec/etc.).
src/windows/service/inc/wslc.idl Adds VM diagnostics type + new session methods for diagnostics and operation keep-alive.
src/windows/service/exe/WSLCSessionManager.cpp Updates comments to reflect on-demand VM creation and recreation after idle termination.

Comment thread src/windows/wslcsession/WSLCSession.h
Comment thread src/windows/service/inc/wslc.idl Outdated
Copilot AI review requested due to automatic review settings June 11, 2026 20:10

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.

Comment thread src/windows/service/inc/wslc.idl Outdated
Comment thread src/windows/wslcsession/WSLCSession.cpp
Comment thread src/windows/wslcsession/WSLCSession.cpp Outdated
@benhillis benhillis force-pushed the user/benhill/wslc-idle-terminate-vm branch from c12d7e1 to fa2eb47 Compare June 12, 2026 01:28
Copilot AI review requested due to automatic review settings June 12, 2026 17:46
@benhillis benhillis force-pushed the user/benhill/wslc-idle-terminate-vm branch from fa2eb47 to ea2254c Compare June 12, 2026 17:46

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 32 out of 32 changed files in this pull request and generated 3 comments.

Comment thread src/windows/service/inc/wslc.idl
Comment thread src/windows/service/inc/wslc.idl
Comment thread src/windows/service/inc/wslc.idl Outdated
Copilot AI review requested due to automatic review settings June 12, 2026 20:55

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 32 out of 32 changed files in this pull request and generated 4 comments.

Comment thread src/windows/service/inc/wslc.idl
Comment thread src/windows/service/inc/wslc.idl
Comment thread src/windows/WslcSDK/wslcsdk.def
Comment thread src/windows/WslcSDK/winrt/Session.cpp
@benhillis benhillis force-pushed the user/benhill/wslc-idle-terminate-vm branch from b870044 to 4bcd87f Compare June 17, 2026 00:22
kvega005 and others added 15 commits June 17, 2026 09:17
Summary
-------
Decouples a wslc session's VM lifetime from the wslcsession.exe process so the
VM can be torn down when nothing needs it and recreated on demand later, while
the per-user session and its bookkeeping survive across VM restarts.

Previously a VM was created 1:1 with a session: the SYSTEM service eagerly built
an HcsVirtualMachine and handed the COM pointer to wslcsession.exe, and any VM
exit permanently terminated the (single-shot) session.

Detailed description
--------------------
* New service-side IWSLCVirtualMachineFactory lets the per-user process mint a
  fresh VM at any time. IWSLCSessionFactory::CreateSession / IWSLCSession::Initialize
  now take the factory instead of an eager IWSLCVirtualMachine. WSLCVirtualMachineFactory
  deep-copies the settings and duplicates the dmesg handle per creation.
* WSLCSession Initialize is now lightweight (persists settings, starts the idle
  worker). VM bring-up/teardown is split into re-runnable StartVmLockHeld /
  StopVmLockHeld / TearDownVmLockHeld driven by a VmState machine
  (None/Starting/Running/Stopping).
* On-demand bring-up + idle teardown: every VM-requiring operation takes a VmLease
  RAII (EnsureVmRunning + activity count); when activity drops to zero and no
  container is in the Created or Running state, the idle worker tears the VM down
  immediately. Idle termination is enabled only for persistent-storage sessions.
* VM-exit disambiguation: intentional stops (m_vmStopRequested) keep the session
  alive; unexpected exits still Terminate() permanently.
* New IWSLCSession::GetVmDiagnostics (Running + StartCount) exposes VM lifecycle
  for tests/diagnostics without bringing the VM up or counting as activity.

Concurrency fixes folded in (compile-validated; flagged for runtime stress):
* IORelay self-join: TearDownVmLockHeld no longer destroys the IO relay from its
  own thread (added IORelay::IsRelayThread); the stopped relay is left for
  ~WSLCSession on a non-relay thread.
* Lease-vs-idle-stop race: VmLease retries instead of throwing ERROR_INVALID_STATE
  when the idle worker tears down in the bring-up window.
* Idle-worker-vs-crash deadlock: IdleWorker bails when the VM exit event is already
  signaled, letting the relay-thread Terminate path own teardown.

Validation steps
-----------------
* Full solution build (x64 Debug) green, including wsltests.dll.
* Copyright-header validation: no new violations.
* Added E2E tests (test/windows/wslc/e2e/WSLCE2EVmIdleTests.cpp): lazy start +
  idle stop, recreate-on-demand + state persistence, Created container keeps VM
  alive, and concurrent recreate stress (lease/idle race).
* NOT runtime-validated here (requires deploy + Administrator + container runtime);
  run bin\x64\Debug\test.bat /name:*VmIdle* and stress the two race fixes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…deadlock)

Runtime fixes for the idle-terminate feature so all four WSLCE2EVmIdleTests pass:

- Register the IWSLCVirtualMachineFactory proxy in the process Global Interface
  Table and re-fetch it per VM creation, and MTA-init the IdleWorker / IORelay
  threads, so on-demand VM creation no longer fails with RPC_E_WRONG_THREAD.
- Register the factory IID in the MSIX so cross-proc marshalling resolves.
- Preserve a container's real exit code in DockerContainerProcessControl::
  OnContainerReleased: only synthesize 128+SIGKILL when no exit code was ever
  recorded, so --rm 'container run' returns 0 instead of 137 when the VM
  idle-terminates immediately after the container exits.
- Fix an AB-BA deadlock between 'container rm' and idle teardown: hold a VmLease
  across WSLCContainer::Delete (keeps the VM up and blocks teardown) and drop the
  now-redundant shared-lock re-acquire in OnContainerDeleted (which would deadlock
  behind the idle worker's pending exclusive lock).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…Operation

The wslc CLI performs each container mutation as two COM round-trips
(OpenContainer to resolve a wrapper, then the operation), and may stream
output afterwards. With on-demand VM idle-termination enabled (any
persistent-storage session), the VM could idle-stop in the gap between the
calls when the target container is not Created/Running: TearDownVmLockHeld
clears m_containers, disconnecting the client-held wrapper, so the second
call failed with RPC_E_DISCONNECTED. This regressed the container CRUD E2E
suite (rm of stopped containers, and cleanup helpers).

Add IWSLCSession::BeginContainerOperation, returning an activity token that
holds m_activityCount > 0 for as long as the client holds it. The CLI now
holds the token across the whole operation (resolve + operate + streamed
relay), so the idle worker cannot tear the VM down mid-operation. Releasing
the token (or the client exiting, via fast rundown) lets the VM idle again.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Tearing the VM down the instant it went idle could thrash it (repeated
teardown/recreate) when containers are created and destroyed, or operations
issued, in quick succession. Keep an otherwise-idle VM running for a short
grace period and only tear it down once it has stayed idle for the whole
window.

The idle worker now waits with a timeout derived from a grace deadline. The
deadline is armed when the VM is first observed idle and reset on any non-idle
observation or explicit idle-check signal (raised on every lease/token release
and terminal container state change), so teardown occurs a full grace period
after the last activity. A WAIT_TIMEOUT wake means the VM has been continuously
idle for the grace period and is torn down.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
A root-namespace process created via CreateRootNamespaceProcess is not
tracked as a container, so it did not contribute to the session activity
count or HasActiveContainerLockHeld(). The VmLease taken during creation
was released when the call returned, leaving the process eligible for
idle teardown: once the grace period elapsed the idle worker could stop
the VM and kill a long-lived root process (e.g. a plugin host) out from
under the client.

Bind an activity token to the returned WSLCProcess so the VM stays alive
for as long as the client holds the process proxy. Factor the existing
BeginContainerOperation token logic into CreateActivityToken() and reuse
it here.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…VM lifecycle

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Kevin's PR #40767 (event-model termination) moved m_vmExitEvent.SetEvent()
in OnExit() to after a new std::lock_guard(m_lock) that caches the
termination reason. ~HcsVirtualMachine already holds m_lock across the 5s
exit-event wait and HcsCloseComputeSystem (which drains in-flight HCS
callbacks). An in-flight OnExit() therefore blocks acquiring m_lock, so it
never signals the exit event nor drains, and the close never completes:
a hard deadlock that StuckVmTermination reliably reproduces.

Drop the broad lock from the dtor. By the time the compute system is
closed no further callbacks can run, so the remaining teardown is safe
unguarded. Flag to Kevin to fold into #40767.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
In the lazy-VM model, container/volume/network recovery runs at first VM
start rather than during CreateSession, so the create-time WarningCallback
was out of scope by the time recovery emitted warnings.

Park the session's WarningCallback in the GIT and have WSLCExecutionContext
fall back to it when an operation has no explicit callback, so lazy-recovery
warnings still reach the user. CLI-side, keep the create/enter callback alive
for the whole command by storing it in the Session model.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ConfigureStorage validates the storage path lazily (via AttachDisk at first
VM start), so with a lazily-created VM a WSLCSessionStorageFlagsNoCreate
session pointing at a missing path no longer failed at CreateSession.
Validate the storage VHD existence eagerly in Initialize so misconfiguration
is reported up front.

The idle worker acquired m_lock exclusively (blocking) on every wake. Because
SRW locks favor a waiting writer, that pending acquire stalled all new
shared-lock operations behind it, so a long-running operation holding its
shared VmLease (e.g. a blocking SaveImage/Export) serialized every concurrent
operation until it completed. Use try_lock_exclusive and treat contention as
activity, re-evaluating on the next idle-check signal.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Ben Hillis and others added 2 commits June 18, 2026 12:44
Fix 1 (deadlock): OnIdleTimer now publishes m_vmStopRequested before the
m_vmExitedEvent crash probe so a concurrent OnVmExited on the relay thread
treats the exit as expected and does not drive Terminate() into a spin for
the exclusive lock we hold while StopVmLockHeld() joins that same relay
thread. StopVmLockHeld no longer manages the flag; the caller owns it.

Fix 2 (missing leases): Attach, Exec, Start, Inspect, Stats, Export, Logs,
ConnectToNetwork and DisconnectFromNetwork now acquire a VmLease. Exited
containers drop m_activityHold, so without a lease these guest-touching ops
could race idle teardown (a regression from removing the wrapper
AddRef/Release activity hold). The lease also lazily restarts the VM.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
OnIdleTimer is invoked by IdleState::TimerCallback, which is noexcept and
calls the callback bare, so any escaping throw terminates the process. The
wil::CoInitializeEx call sat outside the try block, so a COM init failure
would throw past the handler. Convert OnIdleTimer to a function-try-block
(matching IORelay::Run) so CoInitializeEx and everything else are covered
by CATCH_LOG.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 18, 2026 19:55

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 2 comments.

Comment thread src/windows/service/inc/wslc.idl
Comment thread src/windows/wslcsession/WSLCSession.cpp Outdated
Ben Hillis and others added 2 commits June 18, 2026 14:26
The function-try-block conversion left the original function-closing brace
in place, so OnIdleTimer had an extra '}' after CATCH_LOG(). That unbalanced
the file and broke the build (cascade syntax errors at StartProcess and
beyond). Remove the orphaned brace.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The previous m_vmStopRequested handshake could leave a session permanently
wedged. OnIdleTimer published the flag, then bailed if it saw the VM-exited
event signalled; OnVmExited (lock-free, on the relay thread) bailed if it
saw the flag set. A spontaneous VM crash landing in the window between the
flag store and the is_signaled() probe made BOTH defer: OnVmExited declined
(flag set) and consumed the one-shot exit handle, then OnIdleTimer declined
(event signalled). The VM was dead but m_vmState stayed Running, no teardown
ran, and EnsureVmRunning never restarted it - every later op failed against
a dead docker client.

Replace the two-flag inference with a single VmExitDisposition claimed by
compare_exchange. OnIdleTimer claims Active->StopRequested (owns a soft
stop); OnVmExited claims Active->ExitClaimed (owns a permanent Terminate).
The CAS loser defers. This makes ownership atomic with no gap, so teardown
happens exactly once and never deadlocks (the relay is always joined before
the VM is killed in TearDownVmLockHeld, and a deferring OnIdleTimer releases
the lock for OnVmExited's Terminate, which runs on the relay thread and so
skips the self-join).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 19, 2026 05:15

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 1 comment.

Comment thread src/windows/service/inc/wslc.idl
Ben Hillis and others added 2 commits June 19, 2026 07:44
…dle-terminate-vm

# Conflicts:
#	src/windows/wslcsession/WSLCSession.cpp
… exit

If the VM exits spontaneously after the OnVmExited handle is registered in
StartVmLockHeld but before the VM reaches Running, OnVmExited (relay thread)
claimed VmExitDisposition::ExitClaimed and drove a permanent Terminate() that
spins for the exclusive session lock held by the starting thread. The start
cleanup path left the disposition Active and unconditionally called
TearDownVmLockHeld(), which joins the relay thread, deadlocking the session.

Have the start cleanup arbitrate through the same compare_exchange the idle
path uses: claim Active -> StopRequested to own teardown (OnVmExited then
declines, so joining the relay is safe), or, if OnVmExited already claimed
ExitClaimed, skip teardown and surrender the lock so its Terminate() finishes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 22, 2026 14:52
The single-owner VM-exit arbitration was open-coded as three near-identical
compare_exchange blocks (OnIdleTimer, the bring-up cleanup in StartVmLockHeld,
and OnVmExited), each carrying its own copy of the rationale. Extract the two
claims into TryClaimExpectedStop() (Active -> StopRequested, for a lock-holding
soft stop) and TryClaimSpontaneousExit() (Active -> ExitClaimed, for OnVmExited)
so the call sites read as intent and the deadlock/missed-teardown rationale
lives in one place on the VmExitDisposition enum. No behavior change.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 7 comments.

Comment thread src/windows/service/inc/wslc.idl
Comment thread src/windows/wslcsession/WSLCContainer.cpp
Comment thread src/windows/wslcsession/WSLCContainer.cpp
Comment thread src/windows/wslcsession/WSLCContainer.cpp
Comment thread src/windows/wslcsession/WSLCContainer.cpp
Comment thread src/windows/wslcsession/WSLCContainer.cpp
Comment thread src/windows/wslcsession/WSLCContainer.cpp
The claim helpers and the VmExitDisposition enum already carry the full
rationale, so trim the call-site comments (StartVmLockHeld cleanup, OnIdleTimer,
OnVmExited, StopVmLockHeld) to their local point instead of restating it.
Comment-only.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 22, 2026 15:10

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 1 comment.

Comment thread src/windows/service/inc/wslc.idl
Ben Hillis and others added 2 commits June 22, 2026 08:17
…dle-terminate-vm

# Conflicts:
#	src/windows/wslc/services/SessionModel.h
#	src/windows/wslc/services/SessionService.cpp
Attach, Exec, Stop, Kill, Inspect, and Export call AcquireVmLease(),
which throws when the session is terminating or the VM fails to start.
These methods were not function-try-blocks, so the exception could
escape across the COM boundary. Wrap each in try/CATCH_RETURN(), matching
their already-guarded siblings (Start, Stats, Delete, Logs, etc.).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 22, 2026 15:27

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 1 comment.

Comment thread src/windows/service/inc/wslc.idl
Ben Hillis and others added 2 commits June 22, 2026 09:05
The VmLease class comment already documents what the lease does. Drop the
nine boilerplate per-call comments and keep only the three that flag a
non-obvious hazard (Stop/Kill --rm self-delete, Delete lock-order).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Condense the six-line N.B. block to three lines stating the same rationale:
don't hold m_lock because the exit-event wait and compute-system close block
on callbacks that may need it.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 22, 2026 20:23

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 1 comment.

Comment thread src/windows/service/inc/wslc.idl
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants