Fix STL wait-timer strand: restore pre-#975 backend (close #973) by rgomez391 · Pull Request #998 · microsoft/libHttpClient

rgomez391 · 2026-06-23T02:43:25Z

Summary

This rolls back the STL wait-timer backend rewrite from #975 (WaitTimer_stl.cpp) while keeping all of the timer-arming improvements from #975, #980, and #983 in TaskQueue.cpp. The #975 backend rewrite introduced a delayed-callback strand that manifests as a permanent hang on platforms that compile the STL backend (Linux, Apple, and other non-Windows platforms), reproduced on-device as an infinite sign-in / inventory hang in a shipping title.

The change is scoped to a single source file (WaitTimer_stl.cpp) plus a regression test, and is validated on-device.

Background: the #973 lineage

This is a continuation of the work on issue #973 ("a delayed callback becomes runnable only after unrelated later work wakes the queue"):

Fix XTaskQueue delayed-callback timing and wait-timer teardown races #975 — Fix XTaskQueue delayed-callback timing and wait-timer teardown races. Reworked both the TaskQueue.cpp arming logic (wall-clock → monotonic due times) and rewrote the STL timer backend (WaitTimer_stl.cpp), replacing the O(N) pointer-keyed cancellation scan with an O(log N) per-timer generation scheme.
Update TaskQueue with changes from Windows repo #980 — Update TaskQueue with changes from Windows repo. Added post-Start() verification in SubmitPendingCallbacks.
Centralize timer-arming to close TOCTOU race #983 — Centralize timer-arming to close TOCTOU race. Centralized the CAS+Start+verify logic into three helpers (ArmTimerIfEarlier, ArmTimerForNextPendingDueTime, RearmTimerIfDueTimeUnchanged).

The TaskQueue.cpp arming work in all three PRs is correct and is retained by this change. The regression is specifically in the #975 STL backend rewrite.

The bug

The #975 STL backend's generation scheme can drop a due callback:

TimerQueue::Set (called by Start()) bumps the timer's generation and pushes a new heap entry; the worker is expected to discard the superseded entry on pop when entry.Generation != state->Generation().
But a Start() that races the worker between its Peek() of the earliest entry and its dispatch of that entry bumps the generation first.
When the worker then pops that entry, it reads a now-stale generation and discards it instead of firing it. The re-pushed entry carries a new deadline, so the original due callback is never delivered.
With the entry gone, the worker sleeps on the next (later or absent) deadline and goes idle.

On a single-port manual XTaskQueue this is unrecoverable. The platform HTTP provider re-arms its poll via XTaskQueueSubmitDelayedCallback(... Work ...); once that re-poll is dropped, there is no independent traffic to wake the queue and sweep the pending list again. The call freezes permanently.

Observed on-device: the delayed re-poll sits in m_pendingList while the timer thread waits in _Cnd_timedwait and m_timerDue == UINT64_MAX (idle) — i.e. a pending entry with no armed timer, which should be impossible.

Why the arming-layer fixes (#975/#980/#983) could not address it

All three hardened the CAS → Start bookkeeping on m_timerDue in TaskQueuePortImpl. That logic is correct, but it rests on the assumption that Start() reliably arms the backend for the published deadline. The generation backend violates that assumption by discarding the entry after Start(), so no amount of arming-layer post-Start verification can recover it.
The defect only affects WaitTimer_stl.cpp. Win32/UWP use WaitTimer_win32.cpp (OS CreateThreadpoolTimer, no generation scheme) and are unaffected. Critically, the libHttpClient Windows unit-test project compiles only the Win32 backend, so the Windows suite stayed green when Fix XTaskQueue delayed-callback timing and wait-timer teardown races #975/Update TaskQueue with changes from Windows repo #980/Centralize timer-arming to close TOCTOU race #983 merged. The Linux CI lane that does exercise the STL backend (taskqueue-starvation-linux) was added after these PRs, and its existing scenario has rescue traffic that masks this particular strand (see Follow-up). This test-coverage gap is the systemic reason the regression shipped.

The fix

Restore the proven pre-#975 STL backend algorithm (libHttpClient 2.3.1, commit 0fa5f24):

Pointer-keyed cancellation and an unconditional notify_all on every Set, which cannot discard a due entry.
Bridged to the post-Fix XTaskQueue delayed-callback timing and wait-timer teardown races #975 public WaitTimer API (GetCurrentTime / GetDueTime / Start(dueTime)) so TaskQueue.cpp is unchanged.
Clock pinned to steady_clock (the 2.3.1 code used high_resolution_clock, which is steady_clock on libc++ but wall-clock system_clock on libstdc++). Pinning keeps delayed-callback ordering monotonic on every platform, preserving Fix XTaskQueue delayed-callback timing and wait-timer teardown races #975's monotonic-time intent.

Only the STL timer backend is reverted; the TaskQueue.cpp arming improvements from #975/#980/#983 are kept unchanged.

Validation

On-device, in a shipping title: the reproducing infinite sign-in / inventory hang no longer occurs with this backend combined with the current TaskQueue.cpp arming logic. Confirmed by the partner across multiple runs.
Root cause was isolated by a controlled A/B: new TaskQueue.cpp + new backend hangs; new TaskQueue.cpp + restored backend does not. The only variable is WaitTimer_stl.cpp.

Regression test (included in this PR)

A Linux CI test lane that exercises the STL backend already exists — taskqueue-starvation-linux (TaskQueueStarvationTests.Linux, added alongside the composite-queue starvation guard), which runs a shared scenario in Tests/Shared/ against WaitTimer_stl.cpp. That existing scenario keeps many concurrent submitters and terminate races running for the whole measurement window, so it always has independent traffic that can rescue a momentarily-stranded timer — which is exactly why it does not catch this particular strand.

This PR adds a complementary scenario that reproduces the failure shape directly:

Tests/Shared/SingleQueueDelayedPollStrandScenario.h — a single self-resubmitting delayed-poll loop on one manual Work port (mirroring a platform HTTP provider that re-arms its poll every few ms via XTaskQueueSubmitDelayedCallback). A short burst of unrelated one-shot delayed callbacks runs concurrently to create timer-arming contention, then stops. After the burst drains, the poll loop is the only remaining work — with no rescue traffic, a dropped re-arm stalls it permanently, and the test fails deterministically because it requires the loop to make measurable progress after the burst ends.
Tests/UnitTests/Tests/TaskQueueTests.cpp — VerifySingleQueueDelayedPollStrand runs it on the Win32 backend.
Tests/TaskQueueStarvation/SingleQueuePollStrandRepro.cpp + the new taskqueue-poll-strand-linux CTest target run it on the STL backend (where the regression manifests), wired into the Linux CI lane.

The scenario uses only the public XTaskQueue C API plus the standard library, so the same translation unit compiles on every platform. Like the existing starvation guard, the strand is a probabilistic race; the test is a stress loop (many iterations) that maximizes exposure rather than a deterministic single-shot.

#975 rewrote the shared STL wait-timer backend (WaitTimer_stl.cpp) to replace the O(N) pointer-keyed cancellation scan with an O(log N) per-timer generation scheme: every Start() bumps a generation counter and pushes a fresh heap entry, and the worker discards any popped entry whose generation no longer matches the timer's current generation. That scheme can drop a due callback. A Start() that races the worker between its Peek() of the earliest entry and its dispatch of that entry bumps the generation, so when the worker pops the entry it now reads a stale generation and discards it instead of firing it. The re-pushed entry carries a new deadline, so the original due callback is never delivered. With the entry gone, the worker sleeps on the next (later or absent) deadline and goes idle. On a single-port manual XTaskQueue this is unrecoverable: the platform HTTP provider re-arms its poll via XTaskQueueSubmitDelayedCallback, and once that re-poll is dropped there is no independent traffic to wake the queue and sweep the pending list again. The symptom is an infinite sign-in / inventory hang (the delayed re-poll sits in the pending list while the timer thread waits in _Cnd_timedwait, never re-armed). Why the arming-layer fixes did not catch or fix it: - #975, #980 and #983 (#983 = "Centralize timer-arming to close TOCTOU race") all hardened the CAS->Start bookkeeping in TaskQueuePortImpl (m_timerDue) in TaskQueue.cpp. Those fixes are correct and are retained here. But they assume Start() reliably arms the backend for the published deadline. The generation backend violates that assumption by dropping the entry after Start(), so no amount of arming-layer verification can recover it. - The defect only affects WaitTimer_stl.cpp. Win32/UWP use WaitTimer_win32.cpp (OS CreateThreadpoolTimer, no generation scheme) and are unaffected. The libHttpClient unit-test project compiles the Win32 backend, so the Windows suite stayed green and none of #975/#980/#983 exercised the STL backend. Fix: restore the proven pre-#975 STL backend algorithm (libHttpClient 2.3.1, commit 0fa5f24) - pointer-keyed cancellation and an unconditional notify on every Set, which cannot discard a due entry - bridged to the post-#975 public WaitTimer API (GetCurrentTime / GetDueTime / Start(dueTime)). The clock is pinned to steady_clock (the pre-#975 code used high_resolution_clock, which is steady_clock on libc++ but wall-clock system_clock on libstdc++); pinning keeps the delayed-callback ordering monotonic on every platform that compiles this backend, preserving #975's monotonic-time intent. The arming-layer improvements in TaskQueue.cpp from #975/#980/#983 are kept unchanged; only the STL timer backend is reverted. Validated on-device in a shipping title: the reproducing infinite sign-in / inventory hang no longer occurs with this backend and the current TaskQueue arming logic. Follow-up: the STL backend has no unit-test coverage (the test project compiles only the Win32 backend). A dedicated STL-backend test lane is needed so a future re-land of the generation optimization cannot reintroduce this class of strand.

Adds a regression guard for the strand fixed by the previous commit and wires it into both timer backends, complementing the existing composite-queue starvation guard. The existing CompositeQueueStarvationScenario keeps many concurrent submitters and terminate races running for the whole measurement window, so there is always independent traffic whose timer fires sweep the pending list and rescue a momentarily-stranded entry. That continuous rescue traffic masks the single-port strand, which is why the existing lane did not catch it. The new scenario reproduces the field failure shape directly: a single self-resubmitting delayed-poll loop on one manual Work port (mirroring a platform HTTP provider that re-arms its poll every few ms via XTaskQueueSubmitDelayedCallback). A short burst of unrelated one-shot delayed callbacks runs concurrently to create timer-arming contention, then stops. After the burst drains, the poll loop is the only remaining work: if a re-arm was dropped during the burst, nothing sweeps the pending list to rescue it, the poll loop stops advancing, and the test fails deterministically (it requires the loop to make measurable progress after the burst ends). Wiring (same pattern as CompositeQueueStarvationScenario): - Tests/Shared/SingleQueueDelayedPollStrandScenario.h - shared, platform-neutral scenario using only the public XTaskQueue API + the standard library. - Tests/UnitTests/Tests/TaskQueueTests.cpp - VerifySingleQueueDelayedPollStrand exercises it on the Win32 WaitTimer backend. - Tests/TaskQueueStarvation/SingleQueuePollStrandRepro.cpp + the taskqueue-poll-strand-linux CTest target in Build/libHttpClient.Linux exercise it on the STL WaitTimer backend (where the regression manifests and the Windows lanes cannot reach).

The Linux test step filtered ctest to -R taskqueue-starvation-linux, so the new taskqueue-poll-strand-linux target was built but never executed. Widen the filter to run both taskqueue regression tests.

Label both TaskQueue regression ctests with LABELS taskqueue and run them via 'ctest -L taskqueue' instead of an explicit -R name list. Future regression scenarios run automatically by adding the label in CMake, with no pipeline change required.

raulalbertog added 2 commits June 22, 2026 19:41

rgomez391 requested a review from brianpepin June 23, 2026 03:05

raulalbertog added 2 commits June 23, 2026 08:32

CI: run the single-port poll-strand test in the Linux ctest lane

8c802e3

The Linux test step filtered ctest to -R taskqueue-starvation-linux, so the new taskqueue-poll-strand-linux target was built but never executed. Widen the filter to run both taskqueue regression tests.

rgomez391 changed the title ~~Fix STL wait-timer strand: restore pre-#975 backend (close #973 on PS/Linux/Apple)~~ Fix STL wait-timer strand: restore pre-#975 backend (close #973) Jun 23, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix STL wait-timer strand: restore pre-#975 backend (close #973)#998

Fix STL wait-timer strand: restore pre-#975 backend (close #973)#998
rgomez391 wants to merge 4 commits into
mainfrom
raulgomez/fix-stl-waittimer-strand

rgomez391 commented Jun 23, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

rgomez391 commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Background: the #973 lineage

The bug

Why the arming-layer fixes (#975/#980/#983) could not address it

The fix

Validation

Regression test (included in this PR)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

rgomez391 commented Jun 23, 2026 •

edited

Loading