Skip to content

_mi_heap_delayed_free_all can spin forever when a page is stuck in MI_DELAYED_FREEING (priority-inversion livelock on a core-pinned, priority-scheduled platform) #1317

Description

@emrahsungu

Summary

_mi_heap_delayed_free_all() busy-spins (while (!_mi_heap_delayed_free_partial(heap)) mi_atomic_yield();) until every thread-delayed block can be freed. A block cannot be freed while its page is transiently in MI_DELAYED_FREEING (set by a foreign thread in mi_free_block_delayed_mt). On a platform with strict priority-based, per-core preemptive scheduling and threads pinned to cores, if the spinning thread is higher priority than the foreign thread that parked the page and they share a core, the foreign thread never gets the core, the flag is never cleared, and the spin never ends. mi_atomic_yield() cannot break this — it does not make the higher-priority spinner non-runnable.

This reproduces as an intermittent full freeze in a long-running test; the spinning thread is mi_heap_collect_mi_heap_delayed_free_all reached from the periodic collect inside mi_malloc_generic.

Environment

  • mimalloc: 2.2.7 (MI_MALLOC_VERSION 227).
  • Build: compiled as C++. On our AArch64 target, mi_atomic_yield() resolves to the aarch64 branch __asm__ volatile("wfe") (include/mimalloc/atomic.h). The point below holds for the C++ std::this_thread::yield() branch as well — see "Why yield can't fix it."
  • Platform: an AArch64 game console with a strict priority-based preemptive scheduler. Relevant scheduler semantics:
    • Per core, the highest-priority runnable thread always runs; a lower-priority thread cannot run on that core while a higher-priority thread is runnable there.
    • yield hands the core only to another runnable thread of the same priority.
    • Round-robin time-slicing applies only within the lowest priority band.
    • A lower-priority thread may still run on a different core.
    • Threads are affinitized to specific cores.

Observed stack (the spinning thread)

_mi_page_try_use_delayed_free   page.c:152    <- spinning (load/CAS retry; page is MI_DELAYED_FREEING)
_mi_free_delayed_block          free.c:199
_mi_heap_delayed_free_partial   page.c:331
_mi_heap_delayed_free_all       page.c:315    <- unbounded `while (!partial) mi_atomic_yield()`
mi_heap_collect_ex              heap.c:157    (collect == MI_NORMAL)
mi_heap_collect                 heap.c:187
mi_malloc_generic               page.c:1013   (periodic collect, every `generic_collect` allocs)
... application allocation (zlib deflateInit during a normal allocation) ...

collect == MI_NORMAL, i.e. this is the best-effort periodic collect, not teardown.

Root cause

Page delayed-free state (xthread_free low bits, types.h):

MI_USE_DELAYED_FREE   = 0   // full page: first foreign free goes to the owning heap's thread_delayed_free
MI_DELAYED_FREEING    = 1   // transient: a foreign thread is mid-push onto the owning heap
MI_NO_DELAYED_FREE    = 2   // a delayed free is already queued; further foreign frees go page-local
MI_NEVER_DELAYED_FREE = 3   // sticky (abandoned / force-abandon)

Producer — foreign thread B frees a block in a full page owned by thread A. mi_free_block_delayed_mt (free.c:218):

  1. CAS MI_USE_DELAYED_FREE → MI_DELAYED_FREEING (free.c:229)
  2. push the block onto A's heap->thread_delayed_free (free.c:245-248)
  3. CAS MI_DELAYED_FREEING → MI_NO_DELAYED_FREE (free.c:252-257)

Between (1) and (3) the page is MI_DELAYED_FREEING. B holds no lock here, so the window is short only while B keeps getting CPU.

Consumer — owner A drains its list: _mi_heap_delayed_free_all (page.c:314) → _mi_heap_delayed_free_partial (page.c:321) → _mi_free_delayed_block (free.c:186) → _mi_page_try_use_delayed_free(page, MI_USE_DELAYED_FREE, false) (page.c:146). If the page is MI_DELAYED_FREEING, that inner call yields up to 4 times and returns false (page.c:155-156); _mi_free_delayed_block returns false; _mi_heap_delayed_free_partial re-queues the block and returns false; _mi_heap_delayed_free_all yields and loops again — with no bound (page.c:315).

So the wait ends only when every referenced page has left MI_DELAYED_FREEING, i.e. when every producer that touched A's full pages has reached step (3). If any such producer is not making forward progress, A spins indefinitely.

Why mi_atomic_yield() cannot fix it here

Under the scheduler above, a higher-priority runnable thread keeps the core; a lower-priority thread on that core does not run. _mi_heap_delayed_free_all keeps A runnable the whole time:

  • aarch64 branch — wfe: a CPU wait, not an OS call. It never enters the scheduler, never requeues the thread, and (with no paired LDXR/SEV) just naps until the next interrupt/event and resumes the same thread. It cannot let B run.
  • C++ branch — std::this_thread::yield(): hands off only to equal-or-higher priority threads; it will not run the lower-priority B, and never makes A non-runnable.

Either way, if B is lower priority and pinned to A's core, B never reaches step (3) and the loop never terminates. If B happens to be on a different core it completes in parallel and A recovers — which is exactly why this is intermittent (it depends on where B is scheduled). Consistent with this: pausing the spinning thread in the debugger and resuming lets it proceed (the debugger perturbs scheduling so B runs).

The inner bounded retry does not help: _mi_page_try_use_delayed_free gives up after 4 yields (page.c:156), but _mi_heap_delayed_free_all re-arms it unconditionally. Forcing _mi_page_try_use_delayed_free to return false makes _mi_heap_delayed_free_all loop forever deterministically — showing the outer loop has no escape.

Note mi_malloc_generic already avoids this on the hot path: page.c:1007 uses the non-looping _mi_heap_delayed_free_partial with the comment "free delayed frees from other threads (but skip contended ones)". It is the mi_heap_collect_ex path (heap.c:157) that still loops.

Questions

  1. Is _mi_heap_delayed_free_all's unbounded spin meant to be safe when a producer may be unscheduled for an unbounded time? On a priority-scheduled, core-pinned platform, the (1)→(3) window in mi_free_block_delayed_mt is no longer bounded by "B is briefly busy," and the outer loop has no give-up.

  2. Would you accept making the non-teardown collect use the partial (non-blocking) form, reserving the full drain for the paths that actually require thread_delayed_free == NULL (MI_ABANDON — assert heap.c:164; mi_heap_absorb — assert heap.c:463; _mi_page_force_abandonpage.c:415)? This mirrors what mi_malloc_generic already does. Proposed patch:

    // heap.c, mi_heap_collect_ex(), replacing the unconditional _mi_heap_delayed_free_all(heap):
    if (collect == MI_ABANDON) {
      _mi_heap_delayed_free_all(heap);      // teardown: must fully drain (assert below)
    } else {
      _mi_heap_delayed_free_partial(heap);  // best-effort; contended blocks are retried next time
    }
  3. Should the remaining must-drain spin loops escalate mi_atomic_yield() to a real descheduling wait (e.g. a short sleep) after a few spins, so a lower-priority producer on the same core can complete step (3)? On platforms where yield cannot deschedule a higher-priority waiter, a pure spin (whether wfe, std::this_thread::yield, YieldProcessor, or pause) cannot break this inversion.

  4. Is the aarch64 mi_atomic_yield = __asm__ volatile("wfe") intended as a scheduling yield? As a bare WFE (no LDXR-armed monitor, no SEV from the releaser) it is a low-power CPU pause that never reaches the OS scheduler, so it provides no scheduling fairness at all.

What we did locally

We shipped the patch in Q2 (partial drain on non-MI_ABANDON collect) and added a spin counter in _mi_heap_delayed_free_all that warns every ~2048 iterations to surface the remaining must-drain paths. Happy to open a PR for the collect change and/or a yield-escalation if you're open to either approach.

Possibly related

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions