Skip to content

Fix STL wait-timer strand: restore pre-#975 backend (close #973)#998

Open
rgomez391 wants to merge 4 commits into
mainfrom
raulgomez/fix-stl-waittimer-strand
Open

Fix STL wait-timer strand: restore pre-#975 backend (close #973)#998
rgomez391 wants to merge 4 commits into
mainfrom
raulgomez/fix-stl-waittimer-strand

Conversation

@rgomez391

@rgomez391 rgomez391 commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Summary

This rolls back the STL wait-timer backend rewrite from #975 (WaitTimer_stl.cpp) while keeping all of the timer-arming improvements from #975, #980, and #983 in TaskQueue.cpp. The #975 backend rewrite introduced a delayed-callback strand that manifests as a permanent hang on platforms that compile the STL backend (Linux, Apple, and other non-Windows platforms), reproduced on-device as an infinite sign-in / inventory hang in a shipping title.

The change is scoped to a single source file (WaitTimer_stl.cpp) plus a regression test, and is validated on-device.

Background: the #973 lineage

This is a continuation of the work on issue #973 ("a delayed callback becomes runnable only after unrelated later work wakes the queue"):

The TaskQueue.cpp arming work in all three PRs is correct and is retained by this change. The regression is specifically in the #975 STL backend rewrite.

The bug

The #975 STL backend's generation scheme can drop a due callback:

  1. TimerQueue::Set (called by Start()) bumps the timer's generation and pushes a new heap entry; the worker is expected to discard the superseded entry on pop when entry.Generation != state->Generation().
  2. But a Start() that races the worker between its Peek() of the earliest entry and its dispatch of that entry bumps the generation first.
  3. When the worker then pops that entry, it reads a now-stale generation and discards it instead of firing it. The re-pushed entry carries a new deadline, so the original due callback is never delivered.
  4. With the entry gone, the worker sleeps on the next (later or absent) deadline and goes idle.

On a single-port manual XTaskQueue this is unrecoverable. The platform HTTP provider re-arms its poll via XTaskQueueSubmitDelayedCallback(... Work ...); once that re-poll is dropped, there is no independent traffic to wake the queue and sweep the pending list again. The call freezes permanently.

Observed on-device: the delayed re-poll sits in m_pendingList while the timer thread waits in _Cnd_timedwait and m_timerDue == UINT64_MAX (idle) — i.e. a pending entry with no armed timer, which should be impossible.

Why the arming-layer fixes (#975/#980/#983) could not address it

  • All three hardened the CAS → Start bookkeeping on m_timerDue in TaskQueuePortImpl. That logic is correct, but it rests on the assumption that Start() reliably arms the backend for the published deadline. The generation backend violates that assumption by discarding the entry after Start(), so no amount of arming-layer post-Start verification can recover it.
  • The defect only affects WaitTimer_stl.cpp. Win32/UWP use WaitTimer_win32.cpp (OS CreateThreadpoolTimer, no generation scheme) and are unaffected. Critically, the libHttpClient Windows unit-test project compiles only the Win32 backend, so the Windows suite stayed green when Fix XTaskQueue delayed-callback timing and wait-timer teardown races #975/Update TaskQueue with changes from Windows repo #980/Centralize timer-arming to close TOCTOU race #983 merged. The Linux CI lane that does exercise the STL backend (taskqueue-starvation-linux) was added after these PRs, and its existing scenario has rescue traffic that masks this particular strand (see Follow-up). This test-coverage gap is the systemic reason the regression shipped.

The fix

Restore the proven pre-#975 STL backend algorithm (libHttpClient 2.3.1, commit 0fa5f24):

Only the STL timer backend is reverted; the TaskQueue.cpp arming improvements from #975/#980/#983 are kept unchanged.

Validation

  • On-device, in a shipping title: the reproducing infinite sign-in / inventory hang no longer occurs with this backend combined with the current TaskQueue.cpp arming logic. Confirmed by the partner across multiple runs.
  • Root cause was isolated by a controlled A/B: new TaskQueue.cpp + new backend hangs; new TaskQueue.cpp + restored backend does not. The only variable is WaitTimer_stl.cpp.

Regression test (included in this PR)

A Linux CI test lane that exercises the STL backend already existstaskqueue-starvation-linux (TaskQueueStarvationTests.Linux, added alongside the composite-queue starvation guard), which runs a shared scenario in Tests/Shared/ against WaitTimer_stl.cpp. That existing scenario keeps many concurrent submitters and terminate races running for the whole measurement window, so it always has independent traffic that can rescue a momentarily-stranded timer — which is exactly why it does not catch this particular strand.

This PR adds a complementary scenario that reproduces the failure shape directly:

  • Tests/Shared/SingleQueueDelayedPollStrandScenario.h — a single self-resubmitting delayed-poll loop on one manual Work port (mirroring a platform HTTP provider that re-arms its poll every few ms via XTaskQueueSubmitDelayedCallback). A short burst of unrelated one-shot delayed callbacks runs concurrently to create timer-arming contention, then stops. After the burst drains, the poll loop is the only remaining work — with no rescue traffic, a dropped re-arm stalls it permanently, and the test fails deterministically because it requires the loop to make measurable progress after the burst ends.
  • Tests/UnitTests/Tests/TaskQueueTests.cppVerifySingleQueueDelayedPollStrand runs it on the Win32 backend.
  • Tests/TaskQueueStarvation/SingleQueuePollStrandRepro.cpp + the new taskqueue-poll-strand-linux CTest target run it on the STL backend (where the regression manifests), wired into the Linux CI lane.

The scenario uses only the public XTaskQueue C API plus the standard library, so the same translation unit compiles on every platform. Like the existing starvation guard, the strand is a probabilistic race; the test is a stress loop (many iterations) that maximizes exposure rather than a deterministic single-shot.

#975 rewrote the shared STL wait-timer backend (WaitTimer_stl.cpp) to replace
the O(N) pointer-keyed cancellation scan with an O(log N) per-timer generation
scheme: every Start() bumps a generation counter and pushes a fresh heap entry,
and the worker discards any popped entry whose generation no longer matches the
timer's current generation.

That scheme can drop a due callback. A Start() that races the worker between
its Peek() of the earliest entry and its dispatch of that entry bumps the
generation, so when the worker pops the entry it now reads a stale generation
and discards it instead of firing it. The re-pushed entry carries a new
deadline, so the original due callback is never delivered. With the entry gone,
the worker sleeps on the next (later or absent) deadline and goes idle.

On a single-port manual XTaskQueue this is unrecoverable: the platform HTTP
provider re-arms its poll via XTaskQueueSubmitDelayedCallback, and once that
re-poll is dropped there is no independent traffic to wake the queue and sweep
the pending list again. The symptom is an infinite sign-in / inventory hang
(the delayed re-poll sits in the pending list while the timer thread waits in
_Cnd_timedwait, never re-armed).

Why the arming-layer fixes did not catch or fix it:
- #975, #980 and #983 (#983 = "Centralize timer-arming to close TOCTOU race")
  all hardened the CAS->Start bookkeeping in TaskQueuePortImpl (m_timerDue) in
  TaskQueue.cpp. Those fixes are correct and are retained here. But they assume
  Start() reliably arms the backend for the published deadline. The generation
  backend violates that assumption by dropping the entry after Start(), so no
  amount of arming-layer verification can recover it.
- The defect only affects WaitTimer_stl.cpp. Win32/UWP use WaitTimer_win32.cpp
  (OS CreateThreadpoolTimer, no generation scheme) and are unaffected. The
  libHttpClient unit-test project compiles the Win32 backend, so the Windows
  suite stayed green and none of #975/#980/#983 exercised the STL backend.

Fix: restore the proven pre-#975 STL backend algorithm (libHttpClient 2.3.1,
commit 0fa5f24) - pointer-keyed cancellation and an unconditional notify on
every Set, which cannot discard a due entry - bridged to the post-#975 public
WaitTimer API (GetCurrentTime / GetDueTime / Start(dueTime)). The clock is
pinned to steady_clock (the pre-#975 code used high_resolution_clock, which is
steady_clock on libc++ but wall-clock system_clock on libstdc++); pinning keeps
the delayed-callback ordering monotonic on every platform that compiles this
backend, preserving #975's monotonic-time intent.

The arming-layer improvements in TaskQueue.cpp from #975/#980/#983 are kept
unchanged; only the STL timer backend is reverted.

Validated on-device in a shipping title: the reproducing infinite sign-in /
inventory hang no longer occurs with this backend and the current TaskQueue
arming logic.

Follow-up: the STL backend has no unit-test coverage (the test project compiles
only the Win32 backend). A dedicated STL-backend test lane is needed so a future
re-land of the generation optimization cannot reintroduce this class of strand.
Adds a regression guard for the strand fixed by the previous commit and wires
it into both timer backends, complementing the existing composite-queue
starvation guard.

The existing CompositeQueueStarvationScenario keeps many concurrent submitters
and terminate races running for the whole measurement window, so there is
always independent traffic whose timer fires sweep the pending list and rescue
a momentarily-stranded entry. That continuous rescue traffic masks the
single-port strand, which is why the existing lane did not catch it.

The new scenario reproduces the field failure shape directly: a single
self-resubmitting delayed-poll loop on one manual Work port (mirroring a
platform HTTP provider that re-arms its poll every few ms via
XTaskQueueSubmitDelayedCallback). A short burst of unrelated one-shot delayed
callbacks runs concurrently to create timer-arming contention, then stops.
After the burst drains, the poll loop is the only remaining work: if a re-arm
was dropped during the burst, nothing sweeps the pending list to rescue it, the
poll loop stops advancing, and the test fails deterministically (it requires
the loop to make measurable progress after the burst ends).

Wiring (same pattern as CompositeQueueStarvationScenario):
- Tests/Shared/SingleQueueDelayedPollStrandScenario.h - shared, platform-neutral
  scenario using only the public XTaskQueue API + the standard library.
- Tests/UnitTests/Tests/TaskQueueTests.cpp - VerifySingleQueueDelayedPollStrand
  exercises it on the Win32 WaitTimer backend.
- Tests/TaskQueueStarvation/SingleQueuePollStrandRepro.cpp + the
  taskqueue-poll-strand-linux CTest target in Build/libHttpClient.Linux exercise
  it on the STL WaitTimer backend (where the regression manifests and the
  Windows lanes cannot reach).
@rgomez391 rgomez391 requested a review from brianpepin June 23, 2026 03:05
The Linux test step filtered ctest to -R taskqueue-starvation-linux, so the
new taskqueue-poll-strand-linux target was built but never executed. Widen
the filter to run both taskqueue regression tests.
Label both TaskQueue regression ctests with LABELS taskqueue and run them via
'ctest -L taskqueue' instead of an explicit -R name list. Future regression
scenarios run automatically by adding the label in CMake, with no pipeline
change required.
@rgomez391 rgomez391 changed the title Fix STL wait-timer strand: restore pre-#975 backend (close #973 on PS/Linux/Apple) Fix STL wait-timer strand: restore pre-#975 backend (close #973) Jun 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants