Allow table-diff/update to run while mtree listen is active#131
Conversation
…ve (ACE-199) Previously `mtree table-diff` and `mtree update` failed with SQLSTATE 55006 when `mtree listen` was running: both consumed the same logical replication slot, and PostgreSQL allows only one active consumer per slot. Now, when a node's slot is already held by a running listener, the bounded CDC drain returns a sentinel error and UpdateMtree skips that node's drain (best-effort), comparing against the tree listen keeps warm, instead of aborting. Nodes without an active listener still get a full, guaranteed drain. - cdc: add ErrSlotBusy sentinel + 55006 classifier; the bounded drain detects a busy slot via an up-front active-pid check plus a StartReplication backstop - queries: extract a reusable GetActiveSlotPID helper - mtree: skip the CDC drain per-node on a busy slot with a warning, and surface cdc_skipped_nodes on both the update and table-diff paths - tests: integration test asserting diff/update succeed (and that the skip actually occurs) while listen holds the slot - docs: document the concurrent behavior and best-effort freshness caveat
|
Warning Review limit reached
Next review available in: 35 minutes Enable usage-based reviews in Billing to review now. Otherwise, wait until the next included review is available. How can I continue?After more reviews become available, a review can be triggered using the To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based reviews. How do review limits work?CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability. For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window. Please refer docs for additional details. Review details⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (4)
📝 WalkthroughWalkthroughAdds a ChangesSlot Busy Detection and CDC-Skip Diff Behavior
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Up to standards ✅🟢 Issues
|
| Metric | Results |
|---|---|
| Complexity | 21 |
| Duplication | 0 |
NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
internal/consistency/mtree/merkle.go (1)
1648-1657: 🗄️ Data Integrity & Integration | 🔵 Trivial | ⚡ Quick winReset
CDCSkippedNodeson every update run.If one call hits the busy-slot path, a later
UpdateMtreeon the sameMerkleTreeTaskwith no skips leavesm.CDCSkippedNodesunchanged. Line 2036 then reuses that stale value, so a later diff can reportcdc_skipped_nodeseven after a fully drained run. Clear the field at the start ofUpdateMtree, or assign it unconditionally fromskippedNodes(includingnil).Proposed fix
func (m *MerkleTreeTask) UpdateMtree(skipAllChecks bool) (err error) { //nolint:gocyclo // orchestrates per-node CDC drain + tree rehash/rebalance; cyclomatic complexity is inherent to the staged flow resultCtx := map[string]any{ "rebalance": m.Rebalance, "skip_all_checks": skipAllChecks, } + m.CDCSkippedNodes = nil @@ - if len(skippedNodes) > 0 { - resultCtx["cdc_skipped_nodes"] = skippedNodes - m.CDCSkippedNodes = skippedNodes - } + m.CDCSkippedNodes = skippedNodes + if len(skippedNodes) > 0 { + resultCtx["cdc_skipped_nodes"] = skippedNodes + } }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/consistency/mtree/merkle.go` around lines 1648 - 1657, Reset the stale CDC skip state in UpdateMtree: `MerkleTreeTask.CDCSkippedNodes` is only set when `skippedNodes` is non-empty, so a later run can reuse the previous value. Update the `UpdateMtree` flow in `merkle.go` so `CDCSkippedNodes` is cleared at the start of each run or assigned unconditionally from `skippedNodes` (including nil), and keep the `resultCtx["cdc_skipped_nodes"]` population in sync with that field.tests/integration/merkle_tree_test.go (1)
809-819: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winAssert the skip marker before the second update call.
Right now this still passes if
DiffMtree()stops surfacing the busy-slot path and only the explicitUpdateMtree(true)setsCDCSkippedNodes. Check it immediately after Line 809, then clear it before Line 813 if you want to verify both paths independently.Suggested test tightening
err = mtreeTask.DiffMtree() require.NoError(t, err, "table-diff should succeed while listen is active") + require.NotEmpty(t, mtreeTask.CDCSkippedNodes, + "expected table-diff to record a skipped CDC node while listen holds the slot") + + mtreeTask.CDCSkippedNodes = nil // update must likewise not fail with a slot-active error. require.NoError(t, mtreeTask.UpdateMtree(true), "update should succeed while listen is active")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/integration/merkle_tree_test.go` around lines 809 - 819, In the merkle tree integration test, the busy-slot skip assertion is too late and can be satisfied by UpdateMtree(true) alone. In the test around DiffMtree, UpdateMtree, and CDCSkippedNodes, assert the skip marker immediately after DiffMtree() succeeds, then clear CDCSkippedNodes before calling UpdateMtree(true) if you want to verify both paths independently; this keeps the test tied to the DiffMtree busy-slot path instead of the later update call.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/infra/cdc/listen.go`:
- Around line 229-250: The active-slot check in listen startup is over-broad: it
returns ErrSlotBusy for any consumer, which lets non-listener flows like bounded
update/table-diff skip incorrectly. Update the logic in listen.go around the
GetActiveSlotPID path and the StartReplication/ isSlotBusyErr handling so only
the actual mtree listen ownership case maps to ErrSlotBusy. Keep generic
busy-slot or PID-occupied failures fatal, or introduce a listener-specific
marker before returning the skip sentinel.
---
Nitpick comments:
In `@internal/consistency/mtree/merkle.go`:
- Around line 1648-1657: Reset the stale CDC skip state in UpdateMtree:
`MerkleTreeTask.CDCSkippedNodes` is only set when `skippedNodes` is non-empty,
so a later run can reuse the previous value. Update the `UpdateMtree` flow in
`merkle.go` so `CDCSkippedNodes` is cleared at the start of each run or assigned
unconditionally from `skippedNodes` (including nil), and keep the
`resultCtx["cdc_skipped_nodes"]` population in sync with that field.
In `@tests/integration/merkle_tree_test.go`:
- Around line 809-819: In the merkle tree integration test, the busy-slot skip
assertion is too late and can be satisfied by UpdateMtree(true) alone. In the
test around DiffMtree, UpdateMtree, and CDCSkippedNodes, assert the skip marker
immediately after DiffMtree() succeeds, then clear CDCSkippedNodes before
calling UpdateMtree(true) if you want to verify both paths independently; this
keeps the test tied to the DiffMtree busy-slot path instead of the later update
call.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 7a5bbc15-2d5c-4955-8f49-d74dc954a922
📒 Files selected for processing (8)
db/queries/queries.godocs/commands/mtree/index.mddocs/commands/mtree/mtree-listen.mddocs/commands/mtree/mtree-table-diff.mdinternal/consistency/mtree/merkle.gointernal/infra/cdc/listen.gointernal/infra/cdc/listen_test.gotests/integration/merkle_tree_test.go
A busy slot can be held by any consumer (including a concurrent bounded mtree op sharing the node's slot), not only mtree listen. Make the skip visible instead of implying a guaranteed comparison: - add cdc_skipped_nodes to DiffSummary so it appears in the diff report - warn at end of diff when nodes were skipped (fires even if trees match) - reword the per-node warning to not assert the holder is listen and to note divergence can be under-reported - docs: shared per-node slot, concurrent-op trigger, best-effort caveat
Previously
mtree table-diffandmtree updatefailed with SQLSTATE 55006 whenmtree listenwas running: both consumed the same logical replication slot, and PostgreSQL allows only one active consumer per slot.Now, when a node's slot is already held by a running listener, the bounded CDC drain returns a sentinel error and UpdateMtree skips that node's drain (best-effort), comparing against the tree listen keeps warm, instead of aborting. Nodes without an active listener still get a full, guaranteed drain.