Isolate plugins in an out-of-process COM host#40120
Conversation
There was a problem hiding this comment.
Pull request overview
This PR moves WSL plugins from in-process LoadLibrary inside wslservice.exe to isolated out-of-process wslpluginhost.exe COM local servers, aiming to keep the service alive even if a plugin crashes.
Changes:
- Introduces
wslpluginhost.exe(COM local server) that loads one plugin DLL and forwards lifecycle events, while proxying plugin API callbacks back to the service. - Adds new COM contracts (
IWslPluginHost,IWslPluginHostCallback) and consolidates proxy/stub generation intowslserviceproxystub.dll. - Updates service-side plugin management and adds a new
shared_mutexpath intended to avoid re-entrancy/deadlock on COM RPC callback threads.
Reviewed changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| src/windows/wslpluginhost/exe/resource.h | Adds resource IDs for new host executable. |
| src/windows/wslpluginhost/exe/PluginHost.h | Declares COM class implementing IWslPluginHost plus API callback stubs. |
| src/windows/wslpluginhost/exe/PluginHost.cpp | Implements plugin DLL loading, lifecycle dispatch, and callback forwarding to service. |
| src/windows/wslpluginhost/exe/main.rc | Adds version/icon resources for wslpluginhost.exe. |
| src/windows/wslpluginhost/exe/main.cpp | Implements COM local server entrypoint and class factory registration. |
| src/windows/wslpluginhost/exe/CMakeLists.txt | Adds build target for wslpluginhost.exe. |
| src/windows/wslpluginhost/CMakeLists.txt | Wires new subdirectory into build. |
| src/windows/service/stub/CMakeLists.txt | Adds MIDL proxy/stub sources for WslPluginHost.idl into wslserviceproxystub. |
| src/windows/service/inc/WslPluginHost.idl | Defines out-of-proc COM interfaces for plugin hosting + callbacks. |
| src/windows/service/inc/CMakeLists.txt | Adds new wslpluginhostidl MIDL generation target. |
| src/windows/service/exe/PluginManager.h | Refactors plugin manager to track out-of-proc hosts and adds callback implementation type. |
| src/windows/service/exe/PluginManager.cpp | Implements COM activation of hosts, job object assignment, and service-side callback handlers. |
| src/windows/service/exe/LxssUserSession.h | Adds shared_mutex and makes plugin-callback methods private/friend-only. |
| src/windows/service/exe/LxssUserSession.cpp | Switches plugin callback locking from m_instanceLock to m_callbackLock and gates VM teardown. |
| src/windows/service/exe/CMakeLists.txt | Adds dependency on wslpluginhostidl. |
| src/windows/common/precomp.h | Adds <shared_mutex> include for new locking. |
| msipackage/package.wix.in | Installs wslpluginhost.exe and registers COM AppID/CLSID/interfaces for activation and proxy/stub. |
| msipackage/CMakeLists.txt | Adds wslpluginhost.exe to packaged binaries and build dependencies. |
| CMakeLists.txt | Adds subdirectory for wslpluginhost and adjusts global include directories. |
Comments suppressed due to low confidence (2)
src/windows/service/exe/LxssUserSession.cpp:3644
- CreateLinuxProcess now only takes m_callbackLock, but it calls _RunningInstance(), which is annotated Requires_lock_held(m_instanceLock) and reads m_runningInstances. This is both a locking-contract violation and can race with writers that still use m_instanceLock only. Refactor so callback code can safely read the running-instance map (e.g., provide a callback-safe lookup guarded by m_callbackLock, and ensure all writes to m_runningInstances/m_utilityVm also take m_callbackLock exclusively after m_instanceLock per the stated lock ordering).
// Shared lock prevents _VmTerminate from destroying the VM or instances
// while we use them. See MountRootNamespaceFolder for rationale.
std::shared_lock lock(m_callbackLock);
RETURN_HR_IF(E_NOT_VALID_STATE, !m_utilityVm);
if (Distro == nullptr)
{
*Socket = m_utilityVm->CreateRootNamespaceProcess(Path, Arguments).release();
}
else
{
const auto distro = _RunningInstance(Distro);
THROW_HR_IF(WSL_E_VM_MODE_INVALID_STATE, !distro);
const auto wsl2Distro = dynamic_cast<WslCoreInstance*>(distro.get());
src/windows/service/exe/LxssUserSession.cpp:2614
- m_runningInstances is updated here without taking m_callbackLock, but plugin callbacks now read m_runningInstances under m_callbackLock (and intentionally do not take m_instanceLock). Because these are different locks, this doesn’t provide synchronization and can lead to data races/UB when a callback runs concurrently with instance creation/termination. Writers that mutate m_runningInstances (and m_utilityVm if accessed by callbacks) need to also take m_callbackLock (exclusive) in the documented order m_instanceLock → m_callbackLock.
// This needs to be done before plugins are notified because they might try to run a command inside the distribution.
m_runningInstances[registration.Id()] = instance;
if (version == LXSS_WSL_VERSION_2)
{
auto cleanupOnFailure =
wil::scope_exit_log(WI_DIAGNOSTICS_INFO, [&]() { m_runningInstances.erase(registration.Id()); });
m_pluginManager.OnDistributionStarted(&m_session, instance->DistributionInformation());
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 23 out of 23 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
src/windows/service/exe/LxssUserSession.cpp:3643
- CreateLinuxProcess now only takes m_callbackLock, but it reads m_runningInstances via _RunningInstance(), which is annotated Requires_lock_held(m_instanceLock) and the map itself is Guarded_by(m_instanceLock). This is a real race/contract violation (and can also break static analysis). You’ll need a consistent locking strategy for callback threads (e.g., make all accesses/mutations of m_runningInstances + m_utilityVm also take m_callbackLock in the documented order, or refactor callbacks to avoid touching m_runningInstances without m_instanceLock).
// Shared lock prevents _VmTerminate from destroying the VM or instances
// while we use them. See MountRootNamespaceFolder for rationale.
std::shared_lock lock(m_callbackLock);
RETURN_HR_IF(E_NOT_VALID_STATE, !m_utilityVm);
if (Distro == nullptr)
{
*Socket = m_utilityVm->CreateRootNamespaceProcess(Path, Arguments).release();
}
else
{
const auto distro = _RunningInstance(Distro);
THROW_HR_IF(WSL_E_VM_MODE_INVALID_STATE, !distro);
3418f7e to
3f03d4f
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 23 out of 23 changed files in this pull request and generated 9 comments.
Comments suppressed due to low confidence (1)
src/windows/service/exe/LxssUserSession.cpp:3653
CreateLinuxProcessnow only takesm_callbackLockbut calls_RunningInstance(Distro)(which is annotated_Requires_lock_held_(m_instanceLock)and touchesm_lockedDistributions). This violates the locking contract and can race with instance state changes or deadlock if someone later adds the required lock. Consider adding a callback-safe lookup that is guarded bym_callbackLockonly (and does not call_EnsureNotLocked), or refactoring_RunningInstance/ the guarded annotations so callback code never needsm_instanceLock.
if (Distro == nullptr)
{
*Socket = m_utilityVm->CreateRootNamespaceProcess(Path, Arguments).release();
}
else
{
const auto distro = _RunningInstance(Distro);
THROW_HR_IF(WSL_E_VM_MODE_INVALID_STATE, !distro);
const auto wsl2Distro = dynamic_cast<WslCoreInstance*>(distro.get());
THROW_HR_IF(WSL_E_WSL2_NEEDED, !wsl2Distro);
|
Hey @benhillis 👋 — Following up on this PR. It currently has merge conflicts that need to be resolved, and the CI build didn't run (shows action_required). There are also 20 unresolved review threads, including some security-relevant findings (TOCTOU on DLL signature validation, |
f32e629 to
5641289
Compare
|
CI is all green now and conflicts are resolved. Would love to get some eyes on this when someone has a chance — the main design change is moving plugin loading out of wslservice into separate COM hosts so a bad plugin can't take down the service. |
3492d6b to
57501d9
Compare
ed98200 to
c66d3d7
Compare
8706282 to
c13fe64
Compare
c13fe64 to
e59fe72
Compare
e59fe72 to
40a7e2d
Compare
40a7e2d to
8861ece
Compare
OneBlue
left a comment
There was a problem hiding this comment.
Reading through this, I think the COM interface & sub-process creation logic look good.
I think it would be worth exploring trying to run plugin callbacks in separate threads and do something like this:
class ThreadedPluginAPI
{
struct PluginCall
{
std::function<HRESULT()> Call;
std::future<HRESULT> Result;
}
HRESULT CreateProcess(...)
{
// Add plugin call to the queue
auto future = create_future();
auto it = m_calls.emplace_back(PluginCall([&](){ g_pluginManager.Createprocess(...), future);
// Wait for the result and return it to the caller
return it->second->get();
}
std::vector<PluginCall> m_calls;
}
Then, when we call a plugin callback, we could do something like:
void PluginManager::OnSessionCreated(...)
{
// Create a new thread to call the plugin callbacks
ThreadedPluginAPI api;
std::thread callbackThread([&]() { plugin->OnSessionCreated(&api, ...);});
// Process plugin API calls in the current thread
while (...)
{
WaitForSingleObject(api.CallAvailable(), callbackThread);
if (callAvailable) // Run plugin call
{
api.RunPluginCall();
}
else
{
break; // Callback thread has exited,
}
}
}
This approach would allow us to merge this without changes to LxssUserSession / WSLCSessionManager, because this would allow us to keep using recursive locks.
I'm proposing this because this change is quite big, and brings a lot of synchronization changes, which I'm a bit worried about, especially long term given that we're adding new mutexes, and code paths where resources are protected by mutex A or B
8861ece to
542de0f
Compare
| std::vector<BYTE> PluginManager::SerializeSid(PSID Sid) | ||
| { | ||
| const DWORD sidLength = GetLengthSid(Sid); | ||
| std::vector<BYTE> buffer(sidLength); | ||
| CopySid(sidLength, buffer.data(), Sid); | ||
| return buffer; |
| // Build a WSLSessionInformation struct from the marshaled parameters. | ||
| // The returned struct and its SID allocation are valid for the lifetime of the wil::unique_handle. | ||
| struct SessionContext |
| /// <summary> | ||
| /// Contains the currently running utility VM's. | ||
| /// Contains the currently running instances. | ||
| /// Reads guarded by m_instanceLock OR m_callbackLock (shared). | ||
| /// Mutations require BOTH m_instanceLock AND m_callbackLock (exclusive). | ||
| /// </summary> | ||
| _Guarded_by_(m_instanceLock) std::map<GUID, std::shared_ptr<LxssRunningInstance>, wsl::windows::common::helpers::GuidLess> m_runningInstances; |
| /// <summary> | ||
| /// The running utility vm for WSL2 distributions. | ||
| /// | ||
| /// Reads guarded by m_instanceLock OR m_callbackLock (shared). | ||
| /// Mutations require BOTH m_instanceLock AND m_callbackLock (exclusive). | ||
| /// </summary> | ||
| _Guarded_by_(m_instanceLock) std::unique_ptr<WslCoreVm> m_utilityVm; |
542de0f to
2cae48e
Compare
…rocess plugin host Alternative to the m_callbackLock design in PR #40120, per @OneBlue's review. Instead of guarding plugin callbacks with a new shared_mutex (m_callbackLock) and a session-ref registry, this runs each outbound notification (host->On...) on a worker thread and pumps the plugin's service-side API calls back onto the original notifying thread, which already holds the session's recursive m_instanceLock. This reproduces in-process re-entrancy and lets us delete the synchronization the PR added. - New PluginCallPump primitive (Run/Invoke): the worker makes the COM notification call; the notifying thread pumps queued callback work and runs it under m_instanceLock. - PluginManager: WSL notifications (OnVm*/OnDistribution*) route via RunHostNotification; WSL callbacks (MountFolder/ExecuteBinary[InDistribution]) route via a hybrid InvokeOnWslPump - pump when a hook is in flight, direct (own m_instanceLock) when not, preserving async out-of-hook callbacks. - Revert m_callbackLock in LxssUserSessionImpl; MountRootNamespaceFolder/CreateLinuxProcess go back to the recursive m_instanceLock. Add LxssUserSessionImpl::TryInvokeUnderInstanceLock (timed try_lock_for on m_instanceLock) used by the direct path. - Deadlock fix: a callback racing the pre-notification window (lock held, pump not yet registered) must not block on m_instanceLock - InvokeOnWslPump uses a timed acquire and re-checks for a pump, so the racing callback is routed onto the pump instead of dead-blocking. - Hardening (from multi-model review): PluginCallPump::Invoke reports ran-vs-stopped out-of-band (no RPC_E_DISCONNECTED sentinel overload, avoids double-execution); the pump loop polls workerDone each iteration so callback traffic cannot starve the stop path; Run is exception-safe so std::thread creation failure never escapes into teardown hooks. - WSLC session-ref registry left as-is (documented follow-up): its lock is non-recursive, notifications already fire outside it, and it is entangled with the create-veto protocol. - Update test/plugin comments to describe the pump model (assertions unchanged and still pass). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…rocess plugin host Alternative to the m_callbackLock design in PR #40120, per @OneBlue's review. Instead of guarding plugin callbacks with a new shared_mutex (m_callbackLock) and a session-ref registry, this runs each outbound notification (host->On...) on a worker thread and pumps the plugin's service-side API calls back onto the original notifying thread, which already holds the session's recursive m_instanceLock. This reproduces in-process re-entrancy and lets us delete the synchronization the PR added. - New PluginCallPump primitive (Run/Invoke): the worker makes the COM notification call; the notifying thread pumps queued callback work and runs it under m_instanceLock. - PluginManager: WSL notifications (OnVm*/OnDistribution*) route via RunHostNotification; WSL callbacks (MountFolder/ExecuteBinary[InDistribution]) route via a hybrid InvokeOnWslPump - pump when a hook is in flight, direct (own m_instanceLock) when not, preserving async out-of-hook callbacks. - Revert m_callbackLock in LxssUserSessionImpl; MountRootNamespaceFolder/CreateLinuxProcess go back to the recursive m_instanceLock. Add LxssUserSessionImpl::TryInvokeUnderInstanceLock (timed try_lock_for on m_instanceLock) used by the direct path. - Deadlock fix: a callback racing the pre-notification window (lock held, pump not yet registered) must not block on m_instanceLock - InvokeOnWslPump uses a timed acquire and re-checks for a pump, so the racing callback is routed onto the pump instead of dead-blocking. - Hardening (from multi-model review): PluginCallPump::Invoke reports ran-vs-stopped out-of-band (no RPC_E_DISCONNECTED sentinel overload, avoids double-execution); the pump loop polls workerDone each iteration so callback traffic cannot starve the stop path; Run is exception-safe so std::thread creation failure never escapes into teardown hooks. - WSLC session-ref registry left as-is (documented follow-up): its lock is non-recursive, notifications already fire outside it, and it is entangled with the create-veto protocol. - Update test/plugin comments to describe the pump model (assertions unchanged and still pass). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
WSL plugin DLLs are moved out of wslservice.exe into a separate wslpluginhost.exe COM server so plugin code can no longer crash or destabilize the service. Each plugin is activated in its own host process (CLSCTX_LOCAL_SERVER, SYSTEM-only via AppID) and reached through a versioned COM interface defined in WslPluginHost.idl. All hosts are tied to a service-owned job object and terminate when wslservice exits. The plugin API is unchanged; existing plugins run unmodified. A crashing or disconnected host is classified by IsHostCrash (RPC_E_DISCONNECTED, RPC_E_SERVER_DIED[_DNE], CO_E_OBJNOTCONNECTED, RPC_S_SERVER_UNAVAILABLE, RPC_S_CALL_FAILED[_DNE]); the service logs it and continues instead of treating it as a fatal plugin error. RPC_E_CALL_REJECTED is intentionally excluded as a transient busy state rather than a dead host. Plugin->service callbacks (MountFolder, ExecuteBinary, and the WSLC session APIs) arrive on a different COM thread than the outbound hook, so they cannot re-enter the lock held during the hook: - VM path: LxssUserSessionImpl guards callbacks with a shared_mutex (shared for callbacks, exclusive in _VmTerminate after OnVmStopping drains in-flight callbacks before the utility VM is destroyed). - WSLC path: PluginManager resolves sessions through its own reference map under a dedicated lock, and WSLCSessionManager releases its session lock before any plugin notification fires, so callbacks never re-enter the session lock. A session is registered in the reference map but not published until OnWslcSessionCreated succeeds, so a vetoed or race-lost session is never handed out. Proxy/stub is consolidated into wslserviceproxystub.dll. One new exe, no new DLLs. Tests - HostCrashIsolation: kills wslpluginhost.exe mid-OnVmStarted and verifies the service survives and m_initOnce stays sticky. - ConcurrentCallbacks: four plugin threads hammer MountFolder and ExecuteBinary, exercising the shared callback lock. - AsyncApiCallFromWorker: a plugin worker thread calls into the service post-hook (cross-apartment, non-COM-initialized). - CallbacksDuringTerminationDoNotCrash: worker threads race _VmTerminate's exclusive lock and VM teardown, then wind down deterministically after OnVmStopping signals them and are joined on the next session start. - Existing WSL1 plugin tests broadened alongside the refactor. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2cae48e to
f8e4df2
Compare
| HRESULT Result() const | ||
| { | ||
| return (initHr == RPC_E_CHANGED_MODE) ? S_OK : initHr; | ||
| } |
…rocess plugin host Alternative to the m_callbackLock design in PR #40120, per @OneBlue's review. Instead of guarding plugin callbacks with a new shared_mutex (m_callbackLock) and a session-ref registry, this runs each outbound notification (host->On...) on a worker thread and pumps the plugin's service-side API calls back onto the original notifying thread, which already holds the session's recursive m_instanceLock. This reproduces in-process re-entrancy and lets us delete the synchronization the PR added. - New PluginCallPump primitive (Run/Invoke): the worker makes the COM notification call; the notifying thread pumps queued callback work and runs it under m_instanceLock. - PluginManager: WSL notifications (OnVm*/OnDistribution*) route via RunHostNotification; WSL callbacks (MountFolder/ExecuteBinary[InDistribution]) route via a hybrid InvokeOnWslPump - pump when a hook is in flight, direct (own m_instanceLock) when not, preserving async out-of-hook callbacks. - Revert m_callbackLock in LxssUserSessionImpl; MountRootNamespaceFolder/CreateLinuxProcess go back to the recursive m_instanceLock. Add LxssUserSessionImpl::TryInvokeUnderInstanceLock (timed try_lock_for on m_instanceLock) used by the direct path. - Deadlock fix: a callback racing the pre-notification window (lock held, pump not yet registered) must not block on m_instanceLock - InvokeOnWslPump uses a timed acquire and re-checks for a pump, so the racing callback is routed onto the pump instead of dead-blocking. - Hardening (from multi-model review): PluginCallPump::Invoke reports ran-vs-stopped out-of-band (no RPC_E_DISCONNECTED sentinel overload, avoids double-execution); the pump loop polls workerDone each iteration so callback traffic cannot starve the stop path; Run is exception-safe so std::thread creation failure never escapes into teardown hooks. - WSLC session-ref registry left as-is (documented follow-up): its lock is non-recursive, notifications already fire outside it, and it is entangled with the create-veto protocol. - Update test/plugin comments to describe the pump model (assertions unchanged and still pass). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
f8e4df2 to
13b4be8
Compare
- Pass the kill-on-close job object into IWslPluginHost::Initialize and drop GetProcessHandle so the host assigns itself to the job before running any plugin code. - Return an IWSLCProcess reference from WslcCreateProcess instead of opaque process cookies; remove the cookie bookkeeping methods. - Make plugin host crashes fatal: a crash during a start/veto hook (or at load) blocks the WSL operation with a recorded, plugin-named error, matching pre-refactor behavior. Latch the first crash so subsequent operations fail fast with one consistent error instead of repeatedly driving a dead host (m_pluginError guarded by m_pluginErrorLock). Teardown hooks latch but stay non-fatal. Re-activation/non-fatal resilience is a follow-up. - Drop the sidData empty-guards so WSL and WSLC hooks pass the serialized SID consistently. - Rework the HostCrash test to assert the new fatal behavior. - Fix a use-after-free in the plugin host API stubs: a plugin worker thread could load g_pluginHost and clear the null-check just as ~PluginHost cleared the pointer and freed m_callback / unloaded the plugin DLL, then dereference freed memory. Guard host lifetime with a process-global wil::srwlock: stubs hold it shared for the call (via a ScopedHostRef helper) and the destructor takes it exclusive before clearing g_pluginHost, so teardown drains in-flight stubs and any later stub observes null and returns E_UNEXPECTED. - Drop the misleading [unique] from the WSLC hooks' scalar HANDLE UserToken in the IDL. [unique] is a no-op on a scalar (MIDL ignores it, so it compiled byte-identically), and it falsely implied an optional-token contract that was never on the wire. The WSL hooks already declare the token without it; this makes the two consistent. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
| DeadSessions.reserve(DeadSessions.size() + 1); | ||
| try | ||
| { | ||
| DeadSessions.push_back(std::move(entry)); | ||
| } | ||
| catch (...) | ||
| { | ||
| // Defensive: if queuing for the deferred stopping dispatch | ||
| // still fails, keep the session tracked and reap it on a | ||
| // later pass rather than dropping it without unregistering. | ||
| LOG_CAUGHT_EXCEPTION(); | ||
| return false; | ||
| } |
| std::vector<BYTE> PluginManager::SerializeSid(PSID Sid) | ||
| { | ||
| const DWORD sidLength = GetLengthSid(Sid); | ||
| std::vector<BYTE> buffer(sidLength); | ||
| CopySid(sidLength, buffer.data(), Sid); | ||
| return buffer; | ||
| } |
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Summary
WSL plugin DLLs are moved out of
wslservice.exeinto a separatewslpluginhost.exeCOM server so plugin code can no longer crash or destabilize the service. Each plugin is activated in its own host process (CLSCTX_LOCAL_SERVER, SYSTEM-only via AppID) and reached through a versioned COM interface defined inWslPluginHost.idl. All hosts are tied to a service-owned job object and terminate whenwslserviceexits.The plugin API is unchanged and existing plugins run unmodified. Proxy/stub is consolidated into
wslserviceproxystub.dll— one new exe, no new DLLs.Detailed Description
Host-crash classification
A crashing or disconnected host is classified by
IsHostCrashand surfaced as "host died, log and continue" rather than a fatal plugin error:RPC_E_DISCONNECTED,RPC_E_SERVER_DIED,RPC_E_SERVER_DIED_DNE,CO_E_OBJNOTCONNECTED,RPC_S_SERVER_UNAVAILABLE,RPC_S_CALL_FAILED,RPC_S_CALL_FAILED_DNE.RPC_E_CALL_REJECTEDis intentionally not classified as a host crash: it is a transient COM busy state (an STA message filter rejecting a call), not a "server process died" signal. The plugin host is MTA with no message filter, so it should not occur here; treating it as a crash would silently skip future legitimate calls.Callback re-entrancy
Plugin→service callbacks (
MountFolder,ExecuteBinary, and the WSLC session APIs) arrive on a different COM thread than the outbound hook, so they cannot re-enter the lock held during the hook:LxssUserSessionImplguards callbacks with ashared_mutex: shared for callbacks, exclusive in_VmTerminateafterOnVmStoppingdrains in-flight callbacks before the utility VM is destroyed.PluginManagerresolves sessions through its own reference map under a dedicated lock, andWSLCSessionManagerreleases its session lock before any plugin notification fires, so callbacks never re-enter the session lock.New tests
wslpluginhost.exemid-OnVmStartedand verifies the service survives andm_initOncestays sticky.MountFolder+ExecuteBinary, exercising the shared callback lock._VmTerminate's exclusive lock acquire / VM teardown, exiting deterministically via anOnVmStopping-set stop signal.Existing WSL1 plugin tests were broadened alongside the refactor.
Validation
bin\x64\debug\test.bat -f /name:*Plugin*— all plugin tests pass.FormatSource.ps1clean on changed files.