Summary
_mi_heap_delayed_free_all() busy-spins (while (!_mi_heap_delayed_free_partial(heap)) mi_atomic_yield();) until every thread-delayed block can be freed. A block cannot be freed while its page is transiently in MI_DELAYED_FREEING (set by a foreign thread in mi_free_block_delayed_mt). On a platform with strict priority-based, per-core preemptive scheduling and threads pinned to cores, if the spinning thread is higher priority than the foreign thread that parked the page and they share a core, the foreign thread never gets the core, the flag is never cleared, and the spin never ends. mi_atomic_yield() cannot break this — it does not make the higher-priority spinner non-runnable.
This reproduces as an intermittent full freeze in a long-running test; the spinning thread is mi_heap_collect → _mi_heap_delayed_free_all reached from the periodic collect inside mi_malloc_generic.
Environment
- mimalloc: 2.2.7 (
MI_MALLOC_VERSION 227).
- Build: compiled as C++. On our AArch64 target,
mi_atomic_yield() resolves to the aarch64 branch __asm__ volatile("wfe") (include/mimalloc/atomic.h). The point below holds for the C++ std::this_thread::yield() branch as well — see "Why yield can't fix it."
- Platform: an AArch64 game console with a strict priority-based preemptive scheduler. Relevant scheduler semantics:
- Per core, the highest-priority runnable thread always runs; a lower-priority thread cannot run on that core while a higher-priority thread is runnable there.
yield hands the core only to another runnable thread of the same priority.
- Round-robin time-slicing applies only within the lowest priority band.
- A lower-priority thread may still run on a different core.
- Threads are affinitized to specific cores.
Observed stack (the spinning thread)
_mi_page_try_use_delayed_free page.c:152 <- spinning (load/CAS retry; page is MI_DELAYED_FREEING)
_mi_free_delayed_block free.c:199
_mi_heap_delayed_free_partial page.c:331
_mi_heap_delayed_free_all page.c:315 <- unbounded `while (!partial) mi_atomic_yield()`
mi_heap_collect_ex heap.c:157 (collect == MI_NORMAL)
mi_heap_collect heap.c:187
mi_malloc_generic page.c:1013 (periodic collect, every `generic_collect` allocs)
... application allocation (zlib deflateInit during a normal allocation) ...
collect == MI_NORMAL, i.e. this is the best-effort periodic collect, not teardown.
Root cause
Page delayed-free state (xthread_free low bits, types.h):
MI_USE_DELAYED_FREE = 0 // full page: first foreign free goes to the owning heap's thread_delayed_free
MI_DELAYED_FREEING = 1 // transient: a foreign thread is mid-push onto the owning heap
MI_NO_DELAYED_FREE = 2 // a delayed free is already queued; further foreign frees go page-local
MI_NEVER_DELAYED_FREE = 3 // sticky (abandoned / force-abandon)
Producer — foreign thread B frees a block in a full page owned by thread A. mi_free_block_delayed_mt (free.c:218):
- CAS
MI_USE_DELAYED_FREE → MI_DELAYED_FREEING (free.c:229)
- push the block onto A's
heap->thread_delayed_free (free.c:245-248)
- CAS
MI_DELAYED_FREEING → MI_NO_DELAYED_FREE (free.c:252-257)
Between (1) and (3) the page is MI_DELAYED_FREEING. B holds no lock here, so the window is short only while B keeps getting CPU.
Consumer — owner A drains its list: _mi_heap_delayed_free_all (page.c:314) → _mi_heap_delayed_free_partial (page.c:321) → _mi_free_delayed_block (free.c:186) → _mi_page_try_use_delayed_free(page, MI_USE_DELAYED_FREE, false) (page.c:146). If the page is MI_DELAYED_FREEING, that inner call yields up to 4 times and returns false (page.c:155-156); _mi_free_delayed_block returns false; _mi_heap_delayed_free_partial re-queues the block and returns false; _mi_heap_delayed_free_all yields and loops again — with no bound (page.c:315).
So the wait ends only when every referenced page has left MI_DELAYED_FREEING, i.e. when every producer that touched A's full pages has reached step (3). If any such producer is not making forward progress, A spins indefinitely.
Why mi_atomic_yield() cannot fix it here
Under the scheduler above, a higher-priority runnable thread keeps the core; a lower-priority thread on that core does not run. _mi_heap_delayed_free_all keeps A runnable the whole time:
- aarch64 branch —
wfe: a CPU wait, not an OS call. It never enters the scheduler, never requeues the thread, and (with no paired LDXR/SEV) just naps until the next interrupt/event and resumes the same thread. It cannot let B run.
- C++ branch —
std::this_thread::yield(): hands off only to equal-or-higher priority threads; it will not run the lower-priority B, and never makes A non-runnable.
Either way, if B is lower priority and pinned to A's core, B never reaches step (3) and the loop never terminates. If B happens to be on a different core it completes in parallel and A recovers — which is exactly why this is intermittent (it depends on where B is scheduled). Consistent with this: pausing the spinning thread in the debugger and resuming lets it proceed (the debugger perturbs scheduling so B runs).
The inner bounded retry does not help: _mi_page_try_use_delayed_free gives up after 4 yields (page.c:156), but _mi_heap_delayed_free_all re-arms it unconditionally. Forcing _mi_page_try_use_delayed_free to return false makes _mi_heap_delayed_free_all loop forever deterministically — showing the outer loop has no escape.
Note mi_malloc_generic already avoids this on the hot path: page.c:1007 uses the non-looping _mi_heap_delayed_free_partial with the comment "free delayed frees from other threads (but skip contended ones)". It is the mi_heap_collect_ex path (heap.c:157) that still loops.
Questions
-
Is _mi_heap_delayed_free_all's unbounded spin meant to be safe when a producer may be unscheduled for an unbounded time? On a priority-scheduled, core-pinned platform, the (1)→(3) window in mi_free_block_delayed_mt is no longer bounded by "B is briefly busy," and the outer loop has no give-up.
-
Would you accept making the non-teardown collect use the partial (non-blocking) form, reserving the full drain for the paths that actually require thread_delayed_free == NULL (MI_ABANDON — assert heap.c:164; mi_heap_absorb — assert heap.c:463; _mi_page_force_abandon — page.c:415)? This mirrors what mi_malloc_generic already does. Proposed patch:
// heap.c, mi_heap_collect_ex(), replacing the unconditional _mi_heap_delayed_free_all(heap):
if (collect == MI_ABANDON) {
_mi_heap_delayed_free_all(heap); // teardown: must fully drain (assert below)
} else {
_mi_heap_delayed_free_partial(heap); // best-effort; contended blocks are retried next time
}
-
Should the remaining must-drain spin loops escalate mi_atomic_yield() to a real descheduling wait (e.g. a short sleep) after a few spins, so a lower-priority producer on the same core can complete step (3)? On platforms where yield cannot deschedule a higher-priority waiter, a pure spin (whether wfe, std::this_thread::yield, YieldProcessor, or pause) cannot break this inversion.
-
Is the aarch64 mi_atomic_yield = __asm__ volatile("wfe") intended as a scheduling yield? As a bare WFE (no LDXR-armed monitor, no SEV from the releaser) it is a low-power CPU pause that never reaches the OS scheduler, so it provides no scheduling fairness at all.
What we did locally
We shipped the patch in Q2 (partial drain on non-MI_ABANDON collect) and added a spin counter in _mi_heap_delayed_free_all that warns every ~2048 iterations to surface the remaining must-drain paths. Happy to open a PR for the collect change and/or a yield-escalation if you're open to either approach.
Possibly related
Summary
_mi_heap_delayed_free_all()busy-spins (while (!_mi_heap_delayed_free_partial(heap)) mi_atomic_yield();) until every thread-delayed block can be freed. A block cannot be freed while its page is transiently inMI_DELAYED_FREEING(set by a foreign thread inmi_free_block_delayed_mt). On a platform with strict priority-based, per-core preemptive scheduling and threads pinned to cores, if the spinning thread is higher priority than the foreign thread that parked the page and they share a core, the foreign thread never gets the core, the flag is never cleared, and the spin never ends.mi_atomic_yield()cannot break this — it does not make the higher-priority spinner non-runnable.This reproduces as an intermittent full freeze in a long-running test; the spinning thread is
mi_heap_collect→_mi_heap_delayed_free_allreached from the periodic collect insidemi_malloc_generic.Environment
MI_MALLOC_VERSION 227).mi_atomic_yield()resolves to the aarch64 branch__asm__ volatile("wfe")(include/mimalloc/atomic.h). The point below holds for the C++std::this_thread::yield()branch as well — see "Why yield can't fix it."yieldhands the core only to another runnable thread of the same priority.Observed stack (the spinning thread)
collect == MI_NORMAL, i.e. this is the best-effort periodic collect, not teardown.Root cause
Page delayed-free state (
xthread_freelow bits,types.h):Producer — foreign thread B frees a block in a full page owned by thread A.
mi_free_block_delayed_mt(free.c:218):MI_USE_DELAYED_FREE → MI_DELAYED_FREEING(free.c:229)heap->thread_delayed_free(free.c:245-248)MI_DELAYED_FREEING → MI_NO_DELAYED_FREE(free.c:252-257)Between (1) and (3) the page is
MI_DELAYED_FREEING. B holds no lock here, so the window is short only while B keeps getting CPU.Consumer — owner A drains its list:
_mi_heap_delayed_free_all(page.c:314) →_mi_heap_delayed_free_partial(page.c:321) →_mi_free_delayed_block(free.c:186) →_mi_page_try_use_delayed_free(page, MI_USE_DELAYED_FREE, false)(page.c:146). If the page isMI_DELAYED_FREEING, that inner call yields up to 4 times and returns false (page.c:155-156);_mi_free_delayed_blockreturns false;_mi_heap_delayed_free_partialre-queues the block and returns false;_mi_heap_delayed_free_allyields and loops again — with no bound (page.c:315).So the wait ends only when every referenced page has left
MI_DELAYED_FREEING, i.e. when every producer that touched A's full pages has reached step (3). If any such producer is not making forward progress, A spins indefinitely.Why
mi_atomic_yield()cannot fix it hereUnder the scheduler above, a higher-priority runnable thread keeps the core; a lower-priority thread on that core does not run.
_mi_heap_delayed_free_allkeeps A runnable the whole time:wfe: a CPU wait, not an OS call. It never enters the scheduler, never requeues the thread, and (with no pairedLDXR/SEV) just naps until the next interrupt/event and resumes the same thread. It cannot let B run.std::this_thread::yield(): hands off only to equal-or-higher priority threads; it will not run the lower-priority B, and never makes A non-runnable.Either way, if B is lower priority and pinned to A's core, B never reaches step (3) and the loop never terminates. If B happens to be on a different core it completes in parallel and A recovers — which is exactly why this is intermittent (it depends on where B is scheduled). Consistent with this: pausing the spinning thread in the debugger and resuming lets it proceed (the debugger perturbs scheduling so B runs).
The inner bounded retry does not help:
_mi_page_try_use_delayed_freegives up after 4 yields (page.c:156), but_mi_heap_delayed_free_allre-arms it unconditionally. Forcing_mi_page_try_use_delayed_freeto returnfalsemakes_mi_heap_delayed_free_allloop forever deterministically — showing the outer loop has no escape.Note
mi_malloc_genericalready avoids this on the hot path:page.c:1007uses the non-looping_mi_heap_delayed_free_partialwith the comment "free delayed frees from other threads (but skip contended ones)". It is themi_heap_collect_expath (heap.c:157) that still loops.Questions
Is
_mi_heap_delayed_free_all's unbounded spin meant to be safe when a producer may be unscheduled for an unbounded time? On a priority-scheduled, core-pinned platform, the(1)→(3)window inmi_free_block_delayed_mtis no longer bounded by "B is briefly busy," and the outer loop has no give-up.Would you accept making the non-teardown collect use the partial (non-blocking) form, reserving the full drain for the paths that actually require
thread_delayed_free == NULL(MI_ABANDON— assertheap.c:164;mi_heap_absorb— assertheap.c:463;_mi_page_force_abandon—page.c:415)? This mirrors whatmi_malloc_genericalready does. Proposed patch:Should the remaining must-drain spin loops escalate
mi_atomic_yield()to a real descheduling wait (e.g. a short sleep) after a few spins, so a lower-priority producer on the same core can complete step (3)? On platforms whereyieldcannot deschedule a higher-priority waiter, a pure spin (whetherwfe,std::this_thread::yield,YieldProcessor, orpause) cannot break this inversion.Is the aarch64
mi_atomic_yield = __asm__ volatile("wfe")intended as a scheduling yield? As a bareWFE(noLDXR-armed monitor, noSEVfrom the releaser) it is a low-power CPU pause that never reaches the OS scheduler, so it provides no scheduling fairness at all.What we did locally
We shipped the patch in Q2 (partial drain on non-
MI_ABANDONcollect) and added a spin counter in_mi_heap_delayed_free_allthat warns every ~2048 iterations to surface the remaining must-drain paths. Happy to open a PR for the collect change and/or ayield-escalation if you're open to either approach.Possibly related
from->thread_delayed_free == NULLduring heap deletion (the full-drain post-condition the looping form exists to satisfy; relevant to Q2).mi_heap_tcreated on different threads in thread pool #1055 — other reports in the cross-thread delayed-free path.