Skip to content

Deterministic cleanup for collaborative editing documents#4840

Open
stuartc wants to merge 15 commits into
mainfrom
2026-06-05-flaky-tests
Open

Deterministic cleanup for collaborative editing documents#4840
stuartc wants to merge 15 commits into
mainfrom
2026-06-05-flaky-tests

Conversation

@stuartc

@stuartc stuartc commented Jun 8, 2026

Copy link
Copy Markdown
Member

Description

This PR fixes a class of intermittent test failures in the collaborative editing
suites, and adds the deterministic teardown that was missing to cause them.

Collaborative documents start under a global DynamicSupervisor and, until now,
had no clean way to be stopped — they outlived the process that started them. In
tests that meant a document's processes leaked past the test that created them
and kept writing to the database after its SQL sandbox had checked in, crashing
the next test with an owner ... exited error. Run order made it look random.

The fix gives documents a deterministic shutdown:

  • Lightning.Collaborate.stop_document/1 — synchronous, idempotent teardown
    that runs the final persistence flush. The symmetric partner to
    start_document/2.
  • start_document/3 now takes an optional owner: pid. The document tree
    monitors it and stops itself (:normal, so the flush runs and the transient
    child isn't restarted) when the owner exits. It defaults to nil — no monitor
    — so production documents still outlive the LiveView that starts them.
  • Tests start documents through a start_collaboration_document/2 helper that
    passes owner: self(), tying each document's lifetime to its test, with a
    blanket on_exit sweep as a safety net.

It also adds a developer guideline,
.claude/guidelines/testable-supervision-trees.md, on writing supervision trees
that stay addressable in tests without forcing the suite to run serially — the
pattern this fix follows.

Validation steps

  1. Run the collaboration suites repeatedly — they're consistently green now,
    where before they failed intermittently with post-teardown DB-owner errors:
    mix test test/lightning/collaborate_test.exs test/lightning/collaboration/session_test.exs test/lightning/collaboration/document_supervisor_test.exs
  2. Confirm production is untouched by default: start_document/2 with no owner
    registers no monitor and the document behaves exactly as before.

Additional notes for the reviewer

  1. Production behaviour is unchanged by default — owner-monitoring is opt-in.
  2. Validated against a repeat harness: 0/40 normal runs and 0/30 under simulated
    CI timing (+S 4:4, tight assert_receive timeout), with zero post-teardown
    DB-owner errors. Baseline before the fix was 11/40.

AI Usage

  • I have used Claude Code
  • I have used another model
  • I have not used AI

Pre-submission checklist

  • I have performed an AI review of my code
  • I have implemented and tested all related authorization policies (N/A — no auth surface changed)
  • I have updated the changelog.
  • I have ticked a box in "AI usage" in this PR

stuartc added 9 commits June 8, 2026 12:09
Collaborator processes (DocumentSupervisor/SharedDoc/PersistenceWriter)
could outlive a test's SQL-sandbox owner and crash on DB writes after
teardown, poisoning the next test's start_document and producing flaky
collaboration suites.

Fix the lifecycle product-side rather than babysitting tests:

- Add Collaborate.stop_document/1, a synchronous, idempotent teardown
  that flushes and stops the whole document tree. The symmetric partner
  to start_document/2.
- Add DocumentSupervisor.stop/2 (GenServer.stop(:normal)) which
  guarantees terminate/2 runs the flush, unlike DynamicSupervisor's
  :shutdown.
- Collaborate.start/1 now tears down a document it started if the
  session fails to attach, closing the zero-observer orphan leak.
- Registry.doc_supervisor_names/0 owns the doc-supervisor match spec.
- Tests drop per-test polling for a uniform stop_all_collaboration_documents/0
  on_exit net, and a PID-reuse-safe restart assertion.
Tests that start a document via the production entrypoint
(Collaborate.start_document/2) place it under the global DocSupervisor,
which ExUnit does not own, so the doc outlives the test unless stopped.
Previously a single blanket on_exit net swept these up after the fact.

Add CollaborationHelpers.start_collaboration_document/2, which registers
an on_exit stopping THAT specific document before delegating to
start_document/2 — so each doc's lifetime is bound to the test that
created it, mirroring start_supervised. Migrate the 7 call sites in
session_test.exs.

The blanket stop_all_collaboration_documents/0 net stays as
belt-and-braces: every site is now individually bound, but a single
leaked global doc would still corrupt the next serial (async: false)
test, so the cheap idempotent sweep is kept as insurance.
Close three gaps surfaced by the collaboration document lifecycle fix
(commits a474204, 759ea81):

- §0: frame name-isolation, deterministic teardown and async-safety as
  three payoffs of one injectable ownership seam.
- §1: constructor sub-rule — start_link-style functions put name/owner in
  trailing opts, while call/lookup lead with the subject; the two rules
  govern mutually exclusive function shapes and never collide.
- §3: lifetime/ownership recipe for dynamic populations — owner-monitored
  self-cleanup (preferred) vs public symmetric stop + on_exit test binding.

Updates the §4 anti-patterns checklist and References accordingly.
Light editorial pass removing cross-section restatement after the
lifetime/ownership additions, without cutting teaching content:

- §0: let §3 carry the collaboration-fix detail; trim the duplicated
  test-support-vs-API-seam contrast and the Gray & Tate chapter title.
- §1: tighten the identity-vs-configuration restatement.
- §3 Option 1: reference the eventstore owner-monitor stated just above
  rather than re-narrate it.
- References: drop prose from the arg-order bullet (the rule lives in §1).
The collaboration document tree (DocumentSupervisor + SharedDoc +
PersistenceWriter) is started under a global DynamicSupervisor and is
meant to outlive the caller that started it. In tests this is a hazard:
a document outlives its test, then writes to the DB after the test's
Ecto-sandbox owner has exited, crashing with "owner ... exited" and
poisoning the next test.

Give start_document an optional owner pid that the DocumentSupervisor
monitors; when that owner goes :DOWN the supervisor stops :normal, so
terminate/2 runs the flush and the transient child is not restarted.
Any caller now gets deterministic teardown by passing owner: self(),
with no wrapper. Default owner: nil means no monitor, so production
documents outlive the LiveView that starts them exactly as before.

The test helper start_collaboration_document/2 now passes owner: self()
and drops its on_exit; the blanket stop_all_collaboration_documents/0
net stays as belt-and-braces for the serial suite. Validated against a
repeat harness over the three collaboration test files: 0/40 runs under
normal timing and 0/30 under constrained scheduling (+S 4:4,
ASSERT_RECEIVE_TIMEOUT=1000), with zero post-teardown owner crashes.
@github-project-automation github-project-automation Bot moved this to New Issues in Core Jun 8, 2026
@stuartc stuartc requested a review from elias-ba June 8, 2026 10:10
@github-actions

github-actions Bot commented Jun 8, 2026

Copy link
Copy Markdown

Security Review ✅

  • S0 (project scoping): N/A — changes only adjust collaboration supervision-tree lifecycle (owner-monitored teardown, rollback on session-start failure); no new queries or web entrypoints touching project-scoped data.
  • S1 (authorization): N/A — start_document/3 and stop_document/1 are internal plumbing called only from the existing collaborate/1 flow; no new web-layer actions or handle_events introduced.
  • S2 (audit trail): N/A — no writes to workflows, credentials, project settings, OAuth clients, or other config resources; PersistenceWriter flush path is unchanged.

@codecov

codecov Bot commented Jun 8, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 92.00000% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 90.3%. Comparing base (1f514b8) to head (2ca9ac3).

Files with missing lines Patch % Lines
lib/lightning/collaboration.ex 87.5% 2 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##            main   #4840     +/-   ##
=======================================
- Coverage   90.3%   90.3%   -0.0%     
=======================================
  Files        444     444             
  Lines      22607   22626     +19     
=======================================
+ Hits       20419   20426      +7     
- Misses      2188    2200     +12     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

stuartc added 6 commits June 8, 2026 14:33
Introduce a Collaboration.Instance struct describing the registry,
dynamic supervisor and :pg scope a collaboration tree runs under, and
thread it through the supervisor, registry, document supervisor, session
and persistence writer. All new instance parameters default to the
current global names, so production behaviour is unchanged.

This is the seam that lets each test own an isolated collaboration tree.
DocumentSupervisor.init/1 now invokes a configurable callback for the
PersistenceWriter and SharedDoc pids it spawns, passing the owner pid.
The callback defaults to a no-op and is only invoked when an owner pid
is present, so production is unchanged. In the test environment it
grants the spawned processes access to the owner's sandbox connection
and Mox expectations, which is what lets these tests run without a
shared-mode sandbox.
Drop Mox.set_mox_global from the collaboration DataCase tests and rely
on private-mode Mox: expectations are set in the test process and
propagated to the spawned collaboration processes via the owner-anchored
allow hook (owner: self()) and a new allow_collaboration_process helper.

Also add collaboration helpers for driving an isolated supervisor
instance per test (per-instance Registry, DynamicSupervisor and :pg
scope), threaded through the document start/stop helpers. Default-instance
arities are preserved for tests that have not been migrated.

The channel-based tests keep their global Mox usage for now; converting
them is a separate effort.
Flip the first group of collaboration test modules to async: true, each
test owning an isolated supervisor instance (its own Registry, dynamic
supervisor and :pg scope) rather than sharing the application-wide
singletons.

Two production seam fixes the isolation exposed:
- Instance.derive/1 now names the :pg scope distinctly (base.PG) instead
  of reusing the base module, which collided with the supervisor's own
  registration. The production base still resolves to the existing
  :workflow_collaboration scope, unchanged.
- Thread the owner pid into SharedDoc's persistence state so its
  init-time read can reach the owner's DB connection:  is not
  propagated across GenServer.start_link, so it is set inside the
  process that performs the read. No-op in production (no owner).

Tests start their collaboration processes under start_supervised! (or
stop them synchronously) so every DB-touching process is torn down
before its owner exits, avoiding a sandbox-connection teardown race.
Flip the Collaborate and Persistence test modules to async: true, each
test owning an isolated supervisor instance and starting its documents
under start_supervised! with owner: self(), so the DB-writing children
are flushed and stopped before the owner exits.

One test that asserted the no-owner document does not set up a monitor
is reframed: an async sandbox cannot start a document without an owner
(the SharedDoc reads the database during init and needs the owner's
connection), so the test now verifies the monitor is keyed to the
explicit owner and that re-requesting an existing document is idempotent.
The no-owner production-default path remains exercised by the default
arities elsewhere.
Flip the Session and WorkflowReconciler test modules to async: true.
Session tests each drive an isolated supervisor instance and drain their
document trees synchronously (a :normal stop that runs the flush to
completion) before the test returns. WorkflowReconciler resolves its
SharedDoc through the default :pg scope exactly as the production commit
hook does, so those tests run against the default instance and isolate
on unique per-test workflow ids.

Two production seam fixes the isolation exposed:
- Persistence.update_v1 resolved the PersistenceWriter through the
  hard-coded global registry, so a document running under an isolated
  registry silently dropped every update. It now reads the registry from
  the persistence state, defaulting to the global registry (production
  unchanged).
- DocumentSupervisor threads its registry into the persistence state and
  gains an auto_exit option (default true, production unchanged) so a
  test can keep the SharedDoc alive until it stops the tree deterministically.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: New Issues

Development

Successfully merging this pull request may close these issues.

1 participant