Skip to content

Fix mtree table-diff silently reporting a diverged cluster as identical (ACE-190)#130

Merged
mason-sharp merged 5 commits into
mainfrom
fix/ACE-190/cdc-drain-premature-stop
Jun 30, 2026
Merged

Fix mtree table-diff silently reporting a diverged cluster as identical (ACE-190)#130
mason-sharp merged 5 commits into
mainfrom
fix/ACE-190/cdc-drain-premature-stop

Conversation

@danolivo

Copy link
Copy Markdown
Contributor

Problem

On a cluster where one node has a large, un-replicated divergence (e.g. n1 has ~9.9M rows that n2 does not), mtree table-diff reports Merkle trees are identical — zero differences — while plain table-diff correctly finds the divergence. The mtree result is a silent false negative that can mislead table-repair. Reproduced exactly from the field report: build the tree while both nodes are in sync, bulk-insert ~9.9M rows on one node in a single transaction, then run mtree table-diff → "No updates needed" → "Merkle trees are identical". Small divergences (≲100k) work; large ones do not.

Root cause

mtree table-diff (and mtree update) compare cached Merkle trees, so they first reconcile each node's tree by draining its CDC stream. The bounded (non-continuous) drain in internal/infra/cdc/listen.go treated the 1-second idle receive-timeout as "drain complete" and returned success.

Under proto_version '1' the walsender streams nothing until a transaction's commit is decoded. A ~9.9M-row insert is a single transaction (~1.8 GB of WAL) whose decode takes well over a second, during which the consumer receives no data. The drain hit its 1s idle timeout, declared itself drained, and left the tree frozen at its previous state — so the comparison saw two unchanged trees and called them identical. This is why a small backlog (decodes in <1s, streams before the timeout) works and a large single transaction does not.

A real bound already existed but never engaged: UpdateMtree wraps the per-node drain in context.WithTimeout(cdc_processing_timeout) and a parent-context cancellation already returns an error — the 1s "drained" shortcut simply pre-empted it.

The fix

All changes are in internal/infra/cdc/listen.go, scoped to the bounded (!continuous) drain:

  • The 1s idle receive-timeout is now a poll, not completion. The drain stops only once it has actually received data up to targetFlushLSN (the call-time snapshot), or a keepalive confirms the decoder reached it. Otherwise it keeps waiting; cdc_processing_timeout bounds the wait and turns an unreachable target into a clear, actionable error.
  • reachedTarget flag + post-loop backstop. A bounded drain never checkpoints and returns success unless it actually reached the snapshot. It now fails loud rather than report a possibly-stale tree as current.
  • MaxBufferedChanges (default 1,000,000). Caps in-memory buffering of a single huge transaction (proto_version '1' buffers a whole transaction before its commit), erroring with a recommendation to rebuild rather than risking OOM or a silent partial drain.

Continuous mode (ListenForChanges) is unchanged. DiffMtree, the CLI, and the HTTP handler already propagate the drain error, so the new failure surfaces end to end.

Behaviour after the fix

  • Small / moderate backlog → drains fully, tree updates, divergence detected (unchanged for the common case; fixed for backlogs that take >1s to decode and were previously dropped).
  • Huge backlog (over the buffer cap, or can't reach the snapshot within cdc_processing_timeout) → hard error recommending ace mtree build. This is the data-safe trade-off: incremental CDC drain of ~10M rows is the wrong tool; rebuild is.
  • The 30s cdc_processing_timeout now actually bites — before, the 1s shortcut pre-empted it, so it was effectively dead.

The default mtree table-diff is now either authoritative (drained to the snapshot) or it errors — never a silent under-report. This restores the premise the bounded-drain proposal (--quick / --max-drain) assumed but which was actually broken; that opt-in mode can now layer on a correct default.

🤖 Generated with Claude Code

The bounded (non-continuous) CDC drain behind `mtree table-diff` and
`mtree update` treated the 1-second idle receive-timeout as "drain
complete" and returned success. Under proto_version '1' the walsender
streams nothing until a transaction's commit is decoded, so when a node
carries a large backlog (e.g. a bulk-loaded divergence) the drain saw no
data within 1s, declared itself drained, and left the Merkle tree frozen
at its previous state. The subsequent comparison then reported the trees
as identical even though millions of rows differed (ACE-190) -- a silent
under-report that could mislead table-repair.

A real bound already exists: UpdateMtree wraps the per-node drain in
context.WithTimeout(cdc_processing_timeout), and a parent-context
cancellation already returns an error. The 1s "drained" shortcut simply
pre-empted it.

Make the bounded drain honest about completion:

  - The 1s idle receive-timeout is now a poll, not completion. The drain
    stops only once it has actually received data up to targetFlushLSN
    (the call-time snapshot), or a keepalive confirms the decoder reached
    it. Otherwise it keeps waiting; cdc_processing_timeout bounds the wait
    and turns an unreachable target into a clear, actionable error.

  - A reachedTarget flag and a post-loop backstop guarantee a bounded
    drain never checkpoints and returns success unless it reached the
    snapshot. It now fails loud rather than report a possibly-stale tree
    as current.

  - MaxBufferedChanges (default 1,000,000) caps in-memory buffering of a
    single huge transaction -- proto_version '1' buffers a whole
    transaction before its commit -- erroring with a recommendation to
    rebuild rather than risking OOM or a silent partial drain.

Continuous mode (ListenForChanges) is unchanged. DiffMtree, the CLI, and
the HTTP handler already propagate the drain error, so the new failure
surfaces end to end.

Validated: the 9.9M-row reproduction now errors with an actionable
"rebuild" message instead of reporting the trees identical; the mtree/CDC
integration suite passes, including `go test -race` on the
drain-and-update tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@danolivo danolivo added the bug Something isn't working label Jun 29, 2026
@coderabbitai

coderabbitai Bot commented Jun 29, 2026

Copy link
Copy Markdown

Review Change Stack

Warning

Review limit reached

@danolivo, you've reached your PR review limit, so we couldn't start this review.

Next review available in: 51 minutes

Enable usage-based reviews in Billing to review now. Otherwise, wait until the next included review is available.
You're only billed for reviews past your plan's rate limits ($0.25/file).

How can I continue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based reviews.

How do review limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window.

Please refer docs for additional details.

Review details
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 6cd5d876-4c83-4113-9b73-54a67555cbb5

📥 Commits

Reviewing files that changed from the base of the PR and between 4493cec and 011dd3f.

📒 Files selected for processing (2)
  • ace.yaml
  • internal/cli/default_config.yaml
📝 Walkthrough

Walkthrough

The PR adds CDC timeout and flush-batch configuration wiring, updates bounded CDC draining to buffer primary keys only and flush synchronously in sub-batches, and tightens bounded-drain completion, timeout, and validation behavior. It also adds documentation and an integration test for oversized transactions.

Changes

Bounded CDC drain correctness and configuration

Layer / File(s) Summary
CDC timeout and flush batch configuration
ace.yaml, internal/cli/default_config.yaml, pkg/config/config.go, internal/consistency/mtree/merkle.go, docs/configuration.md, docs/CHANGELOG.md
cdc_processing_timeout is raised to 300, cdc_flush_batch_size is added to CDC config, CDCTimeoutSec is added to MerkleTreeTask, and UpdateMtree uses the per-invocation timeout override before config/default selection. The release notes also add the new CDC timeout and flush-batch settings.
CDC timeout CLI wiring
internal/cli/cli.go, docs/commands/mtree/mtree-update.md, docs/commands/mtree/mtree-table-diff.md
--cdc-timeout is added to update and diff commands, assigned into task.CDCTimeoutSec, and documented in the command reference.
PK-only bounded buffering and flushes
internal/infra/cdc/listen.go, docs/CHANGELOG.md
CDC messages now buffer extracted primary keys instead of full tuples, bounded mode uses FlushBatchSize and synchronous applyChanges, flushActiveTx applies and clears sub-batches, and processChanges memoizes type-name lookups while failing when no PK can be extracted. Release notes also record the bounded synchronous sub-batching behavior and timeout increase.
Bounded drain completion and validation
internal/infra/cdc/listen.go, tests/integration/cdc_busy_table_test.go, docs/CHANGELOG.md
Bounded drains now mark completion only when WAL progress reaches the target, return a specific timeout error when progress stalls early, refuse success if reachedTarget is never set, and an integration test covers a large single transaction split across multiple flush sub-batches. Release notes also record the no-PK failure case.

🐇 I thumped by the WAL in a tidy line,
Kept only the keys and let the rest align.
If the target’s missed, I call it out true,
And one big transaction still makes it through.
✨ Hop, flush, repeat!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 70.59% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly matches the main fix: preventing mtree table-diff from falsely reporting divergence as identical.
Description check ✅ Passed The description is directly about the CDC drain bug and the related mtree table-diff/update changes in this PR.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/ACE-190/cdc-drain-premature-stop

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@codacy-production

codacy-production Bot commented Jun 29, 2026

Copy link
Copy Markdown

Up to standards ✅

🟢 Issues 6 medium

Results:
6 new issues

Category Results
Complexity 6 medium

View in Codacy

🟢 Metrics 9 duplication

Metric Results
Duplication 9

View in Codacy

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/infra/cdc/listen.go`:
- Around line 368-371: The keepalive-only drain path in listen.go marks the
target as reached but leaves lastLSN unchanged, so the final status update can
checkpoint the stale start position. Update the replication state in the branch
that detects ServerWALEnd >= targetFlushLSN so lastLSN advances to the reached
WAL end (or target LSN) before the final standby status/metadata write, using
the existing reachedTarget and stopStreaming flow in the keepalive handling
logic.
- Around line 304-309: The timeout branch in listenForChanges is returning early
and skipping the normal teardown path, which can leave started processChanges
workers running past function exit. Update the timeout handling to set
processingErr and then flow through the common cleanup logic so the existing
wg.Wait() and deferred pool close still run; use the listenForChanges loop and
the processingErr/wg teardown path as the main references.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: bc101f1c-c25f-4871-b701-43140588d857

📥 Commits

Reviewing files that changed from the base of the PR and between 5fb45b6 and ed0406b.

📒 Files selected for processing (1)
  • internal/infra/cdc/listen.go

Comment thread internal/infra/cdc/listen.go
Comment thread internal/infra/cdc/listen.go
@mason-sharp

Copy link
Copy Markdown
Member

Attaching a suggested patch

Makes the bounded CDC drain behind mtree update / mtree table-diff handle large transactions instead of bailing to a full rebuild.

  • Sub-batch flushing: applies buffered changes to the Merkle tree every cdc_flush_batch_size (default 10k) instead of holding a whole transaction until commit — memory is now bounded regardless of transaction size. Removes the old MaxBufferedChanges cap that forced a rebuild (which didn't work — build is per-table and doesn't advance the publication's start_lsn).
  • Buffers only PKs, not full row tuples (only PKs needed), so memory scales with key size, not row width.
  • Synchronous apply in bounded mode so the slot's confirmed LSN never gets ahead of durably-applied work → a crash mid-drain recovers by re-streaming the in-flight transaction. (Continuous listener unchanged.)
  • Timeout: default cdc_processing_timeout 30s → 300s for large transactions, plus a new --cdc-timeout flag; a timeout now means "re-run / raise," not "rebuild."
  • Smaller fixes: keepalive before each flush (avoid wal_sender_timeout drops on busy nodes), memoized PK-type lookups, and a fail-loud error when a table's REPLICA IDENTITY exposes no PK (was a panic).

mtree-batch.patch

mason-sharp and others added 2 commits June 30, 2026 12:56
A bounded CDC drain previously buffered a whole transaction in memory until
its commit was decoded (proto_version '1' streams nothing before commit), and
refused to continue once buffering exceeded a hard cap -- erroring with a
recommendation to rebuild the tree. That made a routine large transaction
(e.g. a bulk load) force a full Merkle-tree rebuild, and the advice was wrong:
`ace mtree build` is per-table and leaves the publication's start_lsn and slot
untouched, so it does not advance the drain baseline and the next drain re-hits
the same backlog.

Make the bounded drain absorb large transactions instead:

  - Apply changes in sub-batches: once the in-flight transaction has accumulated
    FlushBatchSize changes, apply them to the Merkle tree and free the buffer
    rather than waiting for the commit. Peak memory is bounded by FlushBatchSize
    regardless of transaction size. Configurable via cdc_flush_batch_size
    (default 10000); the old MaxBufferedChanges cap is removed.

  - Buffer only the primary key of each change, not the whole row tuple, since
    the tree update needs only the PK (the row is recomputed from the live table
    at hash time). Per-change memory scales with key size, not row width.

  - Apply synchronously in bounded mode (continuous mode keeps its worker
    goroutines). lastLSN -- which drives the slot's confirmed_flush_lsn and the
    metadata checkpoint -- then never advances ahead of durably-applied work, so
    a crash mid-drain is recovered by re-streaming the in-flight transaction on
    the next run. The only cost is re-applying already-flushed sub-batches,
    which transiently over-counts per-block split heuristics (reset on the next
    tree update) and never misses a change.

  - Send a standby status update before each flush so a slow apply on a busy
    node cannot let wal_sender_timeout elapse while the receive loop is blocked.

  - Memoize PK type-name lookups per drain so a large multi-flush transaction
    does not re-query pg_type on every sub-batch.

Timeout and recovery:

  - Raise the cdc_processing_timeout default from 30s to 300s. A timeout now
    means "re-run or raise" (drain progress is durable), not "rebuild", so the
    budget favours absorbing large backlogs and busy-server slowdowns.

  - Add a --cdc-timeout flag to `mtree update` and `mtree table-diff` to override
    the drain budget per invocation.

  - Reword the drain-timeout error to drop the misleading "rebuild" advice:
    re-run `ace mtree update`, or raise cdc_processing_timeout to drain in one
    pass.

  - Fail loud with an actionable error when a tracked table's REPLICA IDENTITY
    exposes no primary key (FULL/NOTHING), instead of panicking on an empty PK.

Docs (configuration, mtree-update, mtree-table-diff, CHANGELOG) updated.

Validated: TestCDCDrainHandlesLargeSingleTransaction drains a single
transaction far larger than FlushBatchSize across many flush batches and
confirms every change lands in the leaf counters; the CDC, mtree-update, and
composite/simple-PK integration tests pass under `go test -race`.
…docstrings

- Keepalive-driven drain stop now advances lastLSN to the target before the
  final standby status update and metadata write. Previously a bounded drain
  that caught up via a keepalive (the no-change / other-tables-only case) left
  lastLSN at the start LSN, so the checkpoint regressed to a stale position, the
  slot never advanced, and every subsequent drain re-scanned the same span (and
  the server retained that WAL). Safe because the synchronous bounded apply means
  received == applied, and the branch only fires once the decoder has read past
  the target, so no tracked change <= target is still pending.

- Document all top-level declarations in internal/infra/cdc/listen.go to satisfy
  the docstring-coverage check.

CodeRabbit's other comment -- the drain-timeout branch leaking processChanges
workers past wg.Wait() -- is moot under the synchronous bounded-drain apply: a
bounded drain spawns no worker goroutines, so the timeout early-return has
nothing in flight to wait for.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/cli/default_config.yaml`:
- Around line 33-37: The generated default config is missing the new
cdc_flush_batch_size setting, so update the cdc block in the default config
template to include it alongside cdc_processing_timeout and
cdc_metadata_flush_seconds. Use the existing cdc section in the default config
generation path to keep ace config init aligned with the settings supported by
pkg/config/config.go and internal/infra/cdc/listen.go.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 800de308-6e14-4873-b4e9-34dfc614e451

📥 Commits

Reviewing files that changed from the base of the PR and between ed0406b and 0b242e1.

📒 Files selected for processing (11)
  • ace.yaml
  • docs/CHANGELOG.md
  • docs/commands/mtree/mtree-table-diff.md
  • docs/commands/mtree/mtree-update.md
  • docs/configuration.md
  • internal/cli/cli.go
  • internal/cli/default_config.yaml
  • internal/consistency/mtree/merkle.go
  • internal/infra/cdc/listen.go
  • pkg/config/config.go
  • tests/integration/cdc_busy_table_test.go
✅ Files skipped from review due to trivial changes (4)
  • docs/commands/mtree/mtree-table-diff.md
  • docs/commands/mtree/mtree-update.md
  • ace.yaml
  • docs/CHANGELOG.md

Comment thread internal/cli/default_config.yaml
Andrei Lepikhov and others added 2 commits June 30, 2026 14:50
…nt complexity

- tests/integration/cdc_busy_table_test.go: mark the seeding DELETE in
  TestCDCDrainHandlesLargeSingleTransaction `// nosemgrep`. Codacy's critical
  SQL-injection finding is a false positive -- the table name is a sanitized
  constant and the bound is a constant int -- and the sibling query already
  carries the same marker (matching the suite-wide convention).

- listen.go (processReplicationStream) and merkle.go (UpdateMtree): add
  //nolint:gocyclo. Both functions were already far over the cyclomatic
  threshold before this PR (106 and 51); the bounded-drain changes only touched
  them, so Codacy re-reports their pre-existing complexity as new. The branching
  is inherent to the CDC drain protocol loop and the update orchestration;
  restructuring validated drain code to satisfy the metric is deferred to a
  dedicated refactor rather than risked under this fix.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Mason's patch added the cdc_flush_batch_size setting (field in
pkg/config/config.go, read in internal/infra/cdc/listen.go, documented in
docs/configuration.md) but did not list it in the YAML config templates, so
`ace config init` emitted a config missing a supported setting.

Add it with the code/doc default (10000, matching cdc.FlushBatchSize) to
internal/cli/default_config.yaml, and to ace.yaml so the example config stays
in sync.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mason-sharp mason-sharp merged commit 8dc2f4c into main Jun 30, 2026
3 checks passed
mason-sharp pushed a commit that referenced this pull request Jun 30, 2026
…docstrings

- Keepalive-driven drain stop now advances lastLSN to the target before the
  final standby status update and metadata write. Previously a bounded drain
  that caught up via a keepalive (the no-change / other-tables-only case) left
  lastLSN at the start LSN, so the checkpoint regressed to a stale position, the
  slot never advanced, and every subsequent drain re-scanned the same span (and
  the server retained that WAL). Safe because the synchronous bounded apply means
  received == applied, and the branch only fires once the decoder has read past
  the target, so no tracked change <= target is still pending.

- Document all top-level declarations in internal/infra/cdc/listen.go to satisfy
  the docstring-coverage check.

CodeRabbit's other comment -- the drain-timeout branch leaking processChanges
workers past wg.Wait() -- is moot under the synchronous bounded-drain apply: a
bounded drain spawns no worker goroutines, so the timeout early-return has
nothing in flight to wait for.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
mason-sharp pushed a commit that referenced this pull request Jun 30, 2026
…nt complexity

- tests/integration/cdc_busy_table_test.go: mark the seeding DELETE in
  TestCDCDrainHandlesLargeSingleTransaction `// nosemgrep`. Codacy's critical
  SQL-injection finding is a false positive -- the table name is a sanitized
  constant and the bound is a constant int -- and the sibling query already
  carries the same marker (matching the suite-wide convention).

- listen.go (processReplicationStream) and merkle.go (UpdateMtree): add
  //nolint:gocyclo. Both functions were already far over the cyclomatic
  threshold before this PR (106 and 51); the bounded-drain changes only touched
  them, so Codacy re-reports their pre-existing complexity as new. The branching
  is inherent to the CDC drain protocol loop and the update orchestration;
  restructuring validated drain code to satisfy the metric is deferred to a
  dedicated refactor rather than risked under this fix.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants