Deterministic cleanup for collaborative editing documents#4840
Open
stuartc wants to merge 15 commits into
Open
Deterministic cleanup for collaborative editing documents#4840stuartc wants to merge 15 commits into
stuartc wants to merge 15 commits into
Conversation
Collaborator processes (DocumentSupervisor/SharedDoc/PersistenceWriter) could outlive a test's SQL-sandbox owner and crash on DB writes after teardown, poisoning the next test's start_document and producing flaky collaboration suites. Fix the lifecycle product-side rather than babysitting tests: - Add Collaborate.stop_document/1, a synchronous, idempotent teardown that flushes and stops the whole document tree. The symmetric partner to start_document/2. - Add DocumentSupervisor.stop/2 (GenServer.stop(:normal)) which guarantees terminate/2 runs the flush, unlike DynamicSupervisor's :shutdown. - Collaborate.start/1 now tears down a document it started if the session fails to attach, closing the zero-observer orphan leak. - Registry.doc_supervisor_names/0 owns the doc-supervisor match spec. - Tests drop per-test polling for a uniform stop_all_collaboration_documents/0 on_exit net, and a PID-reuse-safe restart assertion.
Tests that start a document via the production entrypoint (Collaborate.start_document/2) place it under the global DocSupervisor, which ExUnit does not own, so the doc outlives the test unless stopped. Previously a single blanket on_exit net swept these up after the fact. Add CollaborationHelpers.start_collaboration_document/2, which registers an on_exit stopping THAT specific document before delegating to start_document/2 — so each doc's lifetime is bound to the test that created it, mirroring start_supervised. Migrate the 7 call sites in session_test.exs. The blanket stop_all_collaboration_documents/0 net stays as belt-and-braces: every site is now individually bound, but a single leaked global doc would still corrupt the next serial (async: false) test, so the cheap idempotent sweep is kept as insurance.
Close three gaps surfaced by the collaboration document lifecycle fix (commits a474204, 759ea81): - §0: frame name-isolation, deterministic teardown and async-safety as three payoffs of one injectable ownership seam. - §1: constructor sub-rule — start_link-style functions put name/owner in trailing opts, while call/lookup lead with the subject; the two rules govern mutually exclusive function shapes and never collide. - §3: lifetime/ownership recipe for dynamic populations — owner-monitored self-cleanup (preferred) vs public symmetric stop + on_exit test binding. Updates the §4 anti-patterns checklist and References accordingly.
Light editorial pass removing cross-section restatement after the lifetime/ownership additions, without cutting teaching content: - §0: let §3 carry the collaboration-fix detail; trim the duplicated test-support-vs-API-seam contrast and the Gray & Tate chapter title. - §1: tighten the identity-vs-configuration restatement. - §3 Option 1: reference the eventstore owner-monitor stated just above rather than re-narrate it. - References: drop prose from the arg-order bullet (the rule lives in §1).
The collaboration document tree (DocumentSupervisor + SharedDoc + PersistenceWriter) is started under a global DynamicSupervisor and is meant to outlive the caller that started it. In tests this is a hazard: a document outlives its test, then writes to the DB after the test's Ecto-sandbox owner has exited, crashing with "owner ... exited" and poisoning the next test. Give start_document an optional owner pid that the DocumentSupervisor monitors; when that owner goes :DOWN the supervisor stops :normal, so terminate/2 runs the flush and the transient child is not restarted. Any caller now gets deterministic teardown by passing owner: self(), with no wrapper. Default owner: nil means no monitor, so production documents outlive the LiveView that starts them exactly as before. The test helper start_collaboration_document/2 now passes owner: self() and drops its on_exit; the blanket stop_all_collaboration_documents/0 net stays as belt-and-braces for the serial suite. Validated against a repeat harness over the three collaboration test files: 0/40 runs under normal timing and 0/30 under constrained scheduling (+S 4:4, ASSERT_RECEIVE_TIMEOUT=1000), with zero post-teardown owner crashes.
Security Review ✅
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #4840 +/- ##
=======================================
- Coverage 90.3% 90.3% -0.0%
=======================================
Files 444 444
Lines 22607 22626 +19
=======================================
+ Hits 20419 20426 +7
- Misses 2188 2200 +12 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
Introduce a Collaboration.Instance struct describing the registry, dynamic supervisor and :pg scope a collaboration tree runs under, and thread it through the supervisor, registry, document supervisor, session and persistence writer. All new instance parameters default to the current global names, so production behaviour is unchanged. This is the seam that lets each test own an isolated collaboration tree.
DocumentSupervisor.init/1 now invokes a configurable callback for the PersistenceWriter and SharedDoc pids it spawns, passing the owner pid. The callback defaults to a no-op and is only invoked when an owner pid is present, so production is unchanged. In the test environment it grants the spawned processes access to the owner's sandbox connection and Mox expectations, which is what lets these tests run without a shared-mode sandbox.
Drop Mox.set_mox_global from the collaboration DataCase tests and rely on private-mode Mox: expectations are set in the test process and propagated to the spawned collaboration processes via the owner-anchored allow hook (owner: self()) and a new allow_collaboration_process helper. Also add collaboration helpers for driving an isolated supervisor instance per test (per-instance Registry, DynamicSupervisor and :pg scope), threaded through the document start/stop helpers. Default-instance arities are preserved for tests that have not been migrated. The channel-based tests keep their global Mox usage for now; converting them is a separate effort.
Flip the first group of collaboration test modules to async: true, each test owning an isolated supervisor instance (its own Registry, dynamic supervisor and :pg scope) rather than sharing the application-wide singletons. Two production seam fixes the isolation exposed: - Instance.derive/1 now names the :pg scope distinctly (base.PG) instead of reusing the base module, which collided with the supervisor's own registration. The production base still resolves to the existing :workflow_collaboration scope, unchanged. - Thread the owner pid into SharedDoc's persistence state so its init-time read can reach the owner's DB connection: is not propagated across GenServer.start_link, so it is set inside the process that performs the read. No-op in production (no owner). Tests start their collaboration processes under start_supervised! (or stop them synchronously) so every DB-touching process is torn down before its owner exits, avoiding a sandbox-connection teardown race.
Flip the Collaborate and Persistence test modules to async: true, each test owning an isolated supervisor instance and starting its documents under start_supervised! with owner: self(), so the DB-writing children are flushed and stopped before the owner exits. One test that asserted the no-owner document does not set up a monitor is reframed: an async sandbox cannot start a document without an owner (the SharedDoc reads the database during init and needs the owner's connection), so the test now verifies the monitor is keyed to the explicit owner and that re-requesting an existing document is idempotent. The no-owner production-default path remains exercised by the default arities elsewhere.
Flip the Session and WorkflowReconciler test modules to async: true. Session tests each drive an isolated supervisor instance and drain their document trees synchronously (a :normal stop that runs the flush to completion) before the test returns. WorkflowReconciler resolves its SharedDoc through the default :pg scope exactly as the production commit hook does, so those tests run against the default instance and isolate on unique per-test workflow ids. Two production seam fixes the isolation exposed: - Persistence.update_v1 resolved the PersistenceWriter through the hard-coded global registry, so a document running under an isolated registry silently dropped every update. It now reads the registry from the persistence state, defaulting to the global registry (production unchanged). - DocumentSupervisor threads its registry into the persistence state and gains an auto_exit option (default true, production unchanged) so a test can keep the SharedDoc alive until it stops the tree deterministically.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
This PR fixes a class of intermittent test failures in the collaborative editing
suites, and adds the deterministic teardown that was missing to cause them.
Collaborative documents start under a global DynamicSupervisor and, until now,
had no clean way to be stopped — they outlived the process that started them. In
tests that meant a document's processes leaked past the test that created them
and kept writing to the database after its SQL sandbox had checked in, crashing
the next test with an
owner ... exitederror. Run order made it look random.The fix gives documents a deterministic shutdown:
Lightning.Collaborate.stop_document/1— synchronous, idempotent teardownthat runs the final persistence flush. The symmetric partner to
start_document/2.start_document/3now takes an optionalowner:pid. The document treemonitors it and stops itself (
:normal, so the flush runs and the transientchild isn't restarted) when the owner exits. It defaults to
nil— no monitor— so production documents still outlive the LiveView that starts them.
start_collaboration_document/2helper thatpasses
owner: self(), tying each document's lifetime to its test, with ablanket
on_exitsweep as a safety net.It also adds a developer guideline,
.claude/guidelines/testable-supervision-trees.md, on writing supervision treesthat stay addressable in tests without forcing the suite to run serially — the
pattern this fix follows.
Validation steps
where before they failed intermittently with post-teardown DB-owner errors:
mix test test/lightning/collaborate_test.exs test/lightning/collaboration/session_test.exs test/lightning/collaboration/document_supervisor_test.exsstart_document/2with noownerregisters no monitor and the document behaves exactly as before.
Additional notes for the reviewer
CI timing (
+S 4:4, tightassert_receivetimeout), with zero post-teardownDB-owner errors. Baseline before the fix was 11/40.
AI Usage
Pre-submission checklist