Fix STL wait-timer strand: restore pre-#975 backend (close #973)#998
Open
rgomez391 wants to merge 4 commits into
Open
Fix STL wait-timer strand: restore pre-#975 backend (close #973)#998rgomez391 wants to merge 4 commits into
rgomez391 wants to merge 4 commits into
Conversation
#975 rewrote the shared STL wait-timer backend (WaitTimer_stl.cpp) to replace the O(N) pointer-keyed cancellation scan with an O(log N) per-timer generation scheme: every Start() bumps a generation counter and pushes a fresh heap entry, and the worker discards any popped entry whose generation no longer matches the timer's current generation. That scheme can drop a due callback. A Start() that races the worker between its Peek() of the earliest entry and its dispatch of that entry bumps the generation, so when the worker pops the entry it now reads a stale generation and discards it instead of firing it. The re-pushed entry carries a new deadline, so the original due callback is never delivered. With the entry gone, the worker sleeps on the next (later or absent) deadline and goes idle. On a single-port manual XTaskQueue this is unrecoverable: the platform HTTP provider re-arms its poll via XTaskQueueSubmitDelayedCallback, and once that re-poll is dropped there is no independent traffic to wake the queue and sweep the pending list again. The symptom is an infinite sign-in / inventory hang (the delayed re-poll sits in the pending list while the timer thread waits in _Cnd_timedwait, never re-armed). Why the arming-layer fixes did not catch or fix it: - #975, #980 and #983 (#983 = "Centralize timer-arming to close TOCTOU race") all hardened the CAS->Start bookkeeping in TaskQueuePortImpl (m_timerDue) in TaskQueue.cpp. Those fixes are correct and are retained here. But they assume Start() reliably arms the backend for the published deadline. The generation backend violates that assumption by dropping the entry after Start(), so no amount of arming-layer verification can recover it. - The defect only affects WaitTimer_stl.cpp. Win32/UWP use WaitTimer_win32.cpp (OS CreateThreadpoolTimer, no generation scheme) and are unaffected. The libHttpClient unit-test project compiles the Win32 backend, so the Windows suite stayed green and none of #975/#980/#983 exercised the STL backend. Fix: restore the proven pre-#975 STL backend algorithm (libHttpClient 2.3.1, commit 0fa5f24) - pointer-keyed cancellation and an unconditional notify on every Set, which cannot discard a due entry - bridged to the post-#975 public WaitTimer API (GetCurrentTime / GetDueTime / Start(dueTime)). The clock is pinned to steady_clock (the pre-#975 code used high_resolution_clock, which is steady_clock on libc++ but wall-clock system_clock on libstdc++); pinning keeps the delayed-callback ordering monotonic on every platform that compiles this backend, preserving #975's monotonic-time intent. The arming-layer improvements in TaskQueue.cpp from #975/#980/#983 are kept unchanged; only the STL timer backend is reverted. Validated on-device in a shipping title: the reproducing infinite sign-in / inventory hang no longer occurs with this backend and the current TaskQueue arming logic. Follow-up: the STL backend has no unit-test coverage (the test project compiles only the Win32 backend). A dedicated STL-backend test lane is needed so a future re-land of the generation optimization cannot reintroduce this class of strand.
Adds a regression guard for the strand fixed by the previous commit and wires it into both timer backends, complementing the existing composite-queue starvation guard. The existing CompositeQueueStarvationScenario keeps many concurrent submitters and terminate races running for the whole measurement window, so there is always independent traffic whose timer fires sweep the pending list and rescue a momentarily-stranded entry. That continuous rescue traffic masks the single-port strand, which is why the existing lane did not catch it. The new scenario reproduces the field failure shape directly: a single self-resubmitting delayed-poll loop on one manual Work port (mirroring a platform HTTP provider that re-arms its poll every few ms via XTaskQueueSubmitDelayedCallback). A short burst of unrelated one-shot delayed callbacks runs concurrently to create timer-arming contention, then stops. After the burst drains, the poll loop is the only remaining work: if a re-arm was dropped during the burst, nothing sweeps the pending list to rescue it, the poll loop stops advancing, and the test fails deterministically (it requires the loop to make measurable progress after the burst ends). Wiring (same pattern as CompositeQueueStarvationScenario): - Tests/Shared/SingleQueueDelayedPollStrandScenario.h - shared, platform-neutral scenario using only the public XTaskQueue API + the standard library. - Tests/UnitTests/Tests/TaskQueueTests.cpp - VerifySingleQueueDelayedPollStrand exercises it on the Win32 WaitTimer backend. - Tests/TaskQueueStarvation/SingleQueuePollStrandRepro.cpp + the taskqueue-poll-strand-linux CTest target in Build/libHttpClient.Linux exercise it on the STL WaitTimer backend (where the regression manifests and the Windows lanes cannot reach).
The Linux test step filtered ctest to -R taskqueue-starvation-linux, so the new taskqueue-poll-strand-linux target was built but never executed. Widen the filter to run both taskqueue regression tests.
Label both TaskQueue regression ctests with LABELS taskqueue and run them via 'ctest -L taskqueue' instead of an explicit -R name list. Future regression scenarios run automatically by adding the label in CMake, with no pipeline change required.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This rolls back the STL wait-timer backend rewrite from #975 (WaitTimer_stl.cpp) while keeping all of the timer-arming improvements from #975, #980, and #983 in TaskQueue.cpp. The #975 backend rewrite introduced a delayed-callback strand that manifests as a permanent hang on platforms that compile the STL backend (Linux, Apple, and other non-Windows platforms), reproduced on-device as an infinite sign-in / inventory hang in a shipping title.
The change is scoped to a single source file (
WaitTimer_stl.cpp) plus a regression test, and is validated on-device.Background: the #973 lineage
This is a continuation of the work on issue #973 ("a delayed callback becomes runnable only after unrelated later work wakes the queue"):
Start()verification inSubmitPendingCallbacks.ArmTimerIfEarlier,ArmTimerForNextPendingDueTime,RearmTimerIfDueTimeUnchanged).The TaskQueue.cpp arming work in all three PRs is correct and is retained by this change. The regression is specifically in the #975 STL backend rewrite.
The bug
The #975 STL backend's generation scheme can drop a due callback:
TimerQueue::Set(called byStart()) bumps the timer's generation and pushes a new heap entry; the worker is expected to discard the superseded entry on pop whenentry.Generation != state->Generation().Start()that races the worker between itsPeek()of the earliest entry and its dispatch of that entry bumps the generation first.On a single-port manual
XTaskQueuethis is unrecoverable. The platform HTTP provider re-arms its poll viaXTaskQueueSubmitDelayedCallback(... Work ...); once that re-poll is dropped, there is no independent traffic to wake the queue and sweep the pending list again. The call freezes permanently.Observed on-device: the delayed re-poll sits in
m_pendingListwhile the timer thread waits in_Cnd_timedwaitandm_timerDue == UINT64_MAX(idle) — i.e. a pending entry with no armed timer, which should be impossible.Why the arming-layer fixes (#975/#980/#983) could not address it
CAS → Startbookkeeping onm_timerDueinTaskQueuePortImpl. That logic is correct, but it rests on the assumption thatStart()reliably arms the backend for the published deadline. The generation backend violates that assumption by discarding the entry afterStart(), so no amount of arming-layer post-Startverification can recover it.WaitTimer_win32.cpp(OSCreateThreadpoolTimer, no generation scheme) and are unaffected. Critically, the libHttpClient Windows unit-test project compiles only the Win32 backend, so the Windows suite stayed green when Fix XTaskQueue delayed-callback timing and wait-timer teardown races #975/Update TaskQueue with changes from Windows repo #980/Centralize timer-arming to close TOCTOU race #983 merged. The Linux CI lane that does exercise the STL backend (taskqueue-starvation-linux) was added after these PRs, and its existing scenario has rescue traffic that masks this particular strand (see Follow-up). This test-coverage gap is the systemic reason the regression shipped.The fix
Restore the proven pre-#975 STL backend algorithm (libHttpClient 2.3.1, commit
0fa5f24):notify_allon everySet, which cannot discard a due entry.WaitTimerAPI (GetCurrentTime/GetDueTime/Start(dueTime)) so TaskQueue.cpp is unchanged.steady_clock(the 2.3.1 code usedhigh_resolution_clock, which issteady_clockon libc++ but wall-clocksystem_clockon libstdc++). Pinning keeps delayed-callback ordering monotonic on every platform, preserving Fix XTaskQueue delayed-callback timing and wait-timer teardown races #975's monotonic-time intent.Only the STL timer backend is reverted; the TaskQueue.cpp arming improvements from #975/#980/#983 are kept unchanged.
Validation
new TaskQueue.cpp + new backendhangs;new TaskQueue.cpp + restored backenddoes not. The only variable is WaitTimer_stl.cpp.Regression test (included in this PR)
A Linux CI test lane that exercises the STL backend already exists —
taskqueue-starvation-linux(TaskQueueStarvationTests.Linux, added alongside the composite-queue starvation guard), which runs a shared scenario inTests/Shared/againstWaitTimer_stl.cpp. That existing scenario keeps many concurrent submitters and terminate races running for the whole measurement window, so it always has independent traffic that can rescue a momentarily-stranded timer — which is exactly why it does not catch this particular strand.This PR adds a complementary scenario that reproduces the failure shape directly:
Tests/Shared/SingleQueueDelayedPollStrandScenario.h— a single self-resubmitting delayed-poll loop on one manualWorkport (mirroring a platform HTTP provider that re-arms its poll every few ms viaXTaskQueueSubmitDelayedCallback). A short burst of unrelated one-shot delayed callbacks runs concurrently to create timer-arming contention, then stops. After the burst drains, the poll loop is the only remaining work — with no rescue traffic, a dropped re-arm stalls it permanently, and the test fails deterministically because it requires the loop to make measurable progress after the burst ends.Tests/UnitTests/Tests/TaskQueueTests.cpp—VerifySingleQueueDelayedPollStrandruns it on the Win32 backend.Tests/TaskQueueStarvation/SingleQueuePollStrandRepro.cpp+ the newtaskqueue-poll-strand-linuxCTest target run it on the STL backend (where the regression manifests), wired into the Linux CI lane.The scenario uses only the public
XTaskQueueC API plus the standard library, so the same translation unit compiles on every platform. Like the existing starvation guard, the strand is a probabilistic race; the test is a stress loop (many iterations) that maximizes exposure rather than a deterministic single-shot.