Fix `mtree table-diff` silently reporting a diverged cluster as identical (ACE-190) by danolivo · Pull Request #130 · pgEdge/ace

danolivo · 2026-06-29T14:26:55Z

Problem

On a cluster where one node has a large, un-replicated divergence (e.g. n1 has ~9.9M rows that n2 does not), mtree table-diff reports Merkle trees are identical — zero differences — while plain table-diff correctly finds the divergence. The mtree result is a silent false negative that can mislead table-repair. Reproduced exactly from the field report: build the tree while both nodes are in sync, bulk-insert ~9.9M rows on one node in a single transaction, then run mtree table-diff → "No updates needed" → "Merkle trees are identical". Small divergences (≲100k) work; large ones do not.

Root cause

mtree table-diff (and mtree update) compare cached Merkle trees, so they first reconcile each node's tree by draining its CDC stream. The bounded (non-continuous) drain in internal/infra/cdc/listen.go treated the 1-second idle receive-timeout as "drain complete" and returned success.

Under proto_version '1' the walsender streams nothing until a transaction's commit is decoded. A ~9.9M-row insert is a single transaction (~1.8 GB of WAL) whose decode takes well over a second, during which the consumer receives no data. The drain hit its 1s idle timeout, declared itself drained, and left the tree frozen at its previous state — so the comparison saw two unchanged trees and called them identical. This is why a small backlog (decodes in <1s, streams before the timeout) works and a large single transaction does not.

A real bound already existed but never engaged: UpdateMtree wraps the per-node drain in context.WithTimeout(cdc_processing_timeout) and a parent-context cancellation already returns an error — the 1s "drained" shortcut simply pre-empted it.

The fix

All changes are in internal/infra/cdc/listen.go, scoped to the bounded (!continuous) drain:

The 1s idle receive-timeout is now a poll, not completion. The drain stops only once it has actually received data up to targetFlushLSN (the call-time snapshot), or a keepalive confirms the decoder reached it. Otherwise it keeps waiting; cdc_processing_timeout bounds the wait and turns an unreachable target into a clear, actionable error.
reachedTarget flag + post-loop backstop. A bounded drain never checkpoints and returns success unless it actually reached the snapshot. It now fails loud rather than report a possibly-stale tree as current.
MaxBufferedChanges (default 1,000,000). Caps in-memory buffering of a single huge transaction (proto_version '1' buffers a whole transaction before its commit), erroring with a recommendation to rebuild rather than risking OOM or a silent partial drain.

Continuous mode (ListenForChanges) is unchanged. DiffMtree, the CLI, and the HTTP handler already propagate the drain error, so the new failure surfaces end to end.

Behaviour after the fix

Small / moderate backlog → drains fully, tree updates, divergence detected (unchanged for the common case; fixed for backlogs that take >1s to decode and were previously dropped).
Huge backlog (over the buffer cap, or can't reach the snapshot within cdc_processing_timeout) → hard error recommending ace mtree build. This is the data-safe trade-off: incremental CDC drain of ~10M rows is the wrong tool; rebuild is.
The 30s cdc_processing_timeout now actually bites — before, the 1s shortcut pre-empted it, so it was effectively dead.

The default mtree table-diff is now either authoritative (drained to the snapshot) or it errors — never a silent under-report. This restores the premise the bounded-drain proposal (--quick / --max-drain) assumed but which was actually broken; that opt-in mode can now layer on a correct default.

🤖 Generated with Claude Code

The bounded (non-continuous) CDC drain behind `mtree table-diff` and `mtree update` treated the 1-second idle receive-timeout as "drain complete" and returned success. Under proto_version '1' the walsender streams nothing until a transaction's commit is decoded, so when a node carries a large backlog (e.g. a bulk-loaded divergence) the drain saw no data within 1s, declared itself drained, and left the Merkle tree frozen at its previous state. The subsequent comparison then reported the trees as identical even though millions of rows differed (ACE-190) -- a silent under-report that could mislead table-repair. A real bound already exists: UpdateMtree wraps the per-node drain in context.WithTimeout(cdc_processing_timeout), and a parent-context cancellation already returns an error. The 1s "drained" shortcut simply pre-empted it. Make the bounded drain honest about completion: - The 1s idle receive-timeout is now a poll, not completion. The drain stops only once it has actually received data up to targetFlushLSN (the call-time snapshot), or a keepalive confirms the decoder reached it. Otherwise it keeps waiting; cdc_processing_timeout bounds the wait and turns an unreachable target into a clear, actionable error. - A reachedTarget flag and a post-loop backstop guarantee a bounded drain never checkpoints and returns success unless it reached the snapshot. It now fails loud rather than report a possibly-stale tree as current. - MaxBufferedChanges (default 1,000,000) caps in-memory buffering of a single huge transaction -- proto_version '1' buffers a whole transaction before its commit -- erroring with a recommendation to rebuild rather than risking OOM or a silent partial drain. Continuous mode (ListenForChanges) is unchanged. DiffMtree, the CLI, and the HTTP handler already propagate the drain error, so the new failure surfaces end to end. Validated: the 9.9M-row reproduction now errors with an actionable "rebuild" message instead of reporting the trees identical; the mtree/CDC integration suite passes, including `go test -race` on the drain-and-update tests. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-06-29T14:27:11Z

Warning

Review limit reached

@danolivo, you've reached your PR review limit, so we couldn't start this review.

Next review available in: 51 minutes

Enable usage-based reviews in Billing to review now. Otherwise, wait until the next included review is available.
You're only billed for reviews past your plan's rate limits ($0.25/file).

How can I continue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based reviews.

How do review limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window.

Please refer docs for additional details.

Review details

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 6cd5d876-4c83-4113-9b73-54a67555cbb5

📥 Commits

Reviewing files that changed from the base of the PR and between 4493cec and 011dd3f.

📒 Files selected for processing (2)

ace.yaml
internal/cli/default_config.yaml

📝 Walkthrough

Walkthrough

The PR adds CDC timeout and flush-batch configuration wiring, updates bounded CDC draining to buffer primary keys only and flush synchronously in sub-batches, and tightens bounded-drain completion, timeout, and validation behavior. It also adds documentation and an integration test for oversized transactions.

Changes

Bounded CDC drain correctness and configuration

Layer / File(s)	Summary
CDC timeout and flush batch configuration `ace.yaml`, `internal/cli/default_config.yaml`, `pkg/config/config.go`, `internal/consistency/mtree/merkle.go`, `docs/configuration.md`, `docs/CHANGELOG.md`	`cdc_processing_timeout` is raised to 300, `cdc_flush_batch_size` is added to CDC config, `CDCTimeoutSec` is added to `MerkleTreeTask`, and `UpdateMtree` uses the per-invocation timeout override before config/default selection. The release notes also add the new CDC timeout and flush-batch settings.
CDC timeout CLI wiring `internal/cli/cli.go`, `docs/commands/mtree/mtree-update.md`, `docs/commands/mtree/mtree-table-diff.md`	`--cdc-timeout` is added to update and diff commands, assigned into `task.CDCTimeoutSec`, and documented in the command reference.
PK-only bounded buffering and flushes `internal/infra/cdc/listen.go`, `docs/CHANGELOG.md`	CDC messages now buffer extracted primary keys instead of full tuples, bounded mode uses `FlushBatchSize` and synchronous `applyChanges`, `flushActiveTx` applies and clears sub-batches, and `processChanges` memoizes type-name lookups while failing when no PK can be extracted. Release notes also record the bounded synchronous sub-batching behavior and timeout increase.
Bounded drain completion and validation `internal/infra/cdc/listen.go`, `tests/integration/cdc_busy_table_test.go`, `docs/CHANGELOG.md`	Bounded drains now mark completion only when WAL progress reaches the target, return a specific timeout error when progress stalls early, refuse success if `reachedTarget` is never set, and an integration test covers a large single transaction split across multiple flush sub-batches. Release notes also record the no-PK failure case.

🐇 I thumped by the WAL in a tidy line,
Kept only the keys and let the rest align.
If the target’s missed, I call it out true,
And one big transaction still makes it through.
✨ Hop, flush, repeat!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 70.59% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly matches the main fix: preventing mtree table-diff from falsely reporting divergence as identical.
Description check	✅ Passed	The description is directly about the CDC drain bug and the related mtree table-diff/update changes in this PR.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/ACE-190/cdc-drain-premature-stop

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands.}

codacy-production · 2026-06-29T14:28:01Z

Up to standards ✅

🟢 Issues 6 medium

Results:
6 new issues

Category Results

Complexity 6 medium

View in Codacy

🟢 Metrics 9 duplication

Metric Results

Duplication 9

View in Codacy

_{NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer}
_{TIP This summary will be updated as you push new changes.}

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/infra/cdc/listen.go`:
- Around line 368-371: The keepalive-only drain path in listen.go marks the
target as reached but leaves lastLSN unchanged, so the final status update can
checkpoint the stale start position. Update the replication state in the branch
that detects ServerWALEnd >= targetFlushLSN so lastLSN advances to the reached
WAL end (or target LSN) before the final standby status/metadata write, using
the existing reachedTarget and stopStreaming flow in the keepalive handling
logic.
- Around line 304-309: The timeout branch in listenForChanges is returning early
and skipping the normal teardown path, which can leave started processChanges
workers running past function exit. Update the timeout handling to set
processingErr and then flow through the common cleanup logic so the existing
wg.Wait() and deferred pool close still run; use the listenForChanges loop and
the processingErr/wg teardown path as the main references.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: bc101f1c-c25f-4871-b701-43140588d857

📥 Commits

Reviewing files that changed from the base of the PR and between 5fb45b6 and ed0406b.

📒 Files selected for processing (1)

internal/infra/cdc/listen.go

mason-sharp · 2026-06-30T01:05:34Z

Attaching a suggested patch

Makes the bounded CDC drain behind mtree update / mtree table-diff handle large transactions instead of bailing to a full rebuild.

Sub-batch flushing: applies buffered changes to the Merkle tree every cdc_flush_batch_size (default 10k) instead of holding a whole transaction until commit — memory is now bounded regardless of transaction size. Removes the old MaxBufferedChanges cap that forced a rebuild (which didn't work — build is per-table and doesn't advance the publication's start_lsn).
Buffers only PKs, not full row tuples (only PKs needed), so memory scales with key size, not row width.
Synchronous apply in bounded mode so the slot's confirmed LSN never gets ahead of durably-applied work → a crash mid-drain recovers by re-streaming the in-flight transaction. (Continuous listener unchanged.)
Timeout: default cdc_processing_timeout 30s → 300s for large transactions, plus a new --cdc-timeout flag; a timeout now means "re-run / raise," not "rebuild."
Smaller fixes: keepalive before each flush (avoid wal_sender_timeout drops on busy nodes), memoized PK-type lookups, and a fail-loud error when a table's REPLICA IDENTITY exposes no PK (was a panic).

mtree-batch.patch

A bounded CDC drain previously buffered a whole transaction in memory until its commit was decoded (proto_version '1' streams nothing before commit), and refused to continue once buffering exceeded a hard cap -- erroring with a recommendation to rebuild the tree. That made a routine large transaction (e.g. a bulk load) force a full Merkle-tree rebuild, and the advice was wrong: `ace mtree build` is per-table and leaves the publication's start_lsn and slot untouched, so it does not advance the drain baseline and the next drain re-hits the same backlog. Make the bounded drain absorb large transactions instead: - Apply changes in sub-batches: once the in-flight transaction has accumulated FlushBatchSize changes, apply them to the Merkle tree and free the buffer rather than waiting for the commit. Peak memory is bounded by FlushBatchSize regardless of transaction size. Configurable via cdc_flush_batch_size (default 10000); the old MaxBufferedChanges cap is removed. - Buffer only the primary key of each change, not the whole row tuple, since the tree update needs only the PK (the row is recomputed from the live table at hash time). Per-change memory scales with key size, not row width. - Apply synchronously in bounded mode (continuous mode keeps its worker goroutines). lastLSN -- which drives the slot's confirmed_flush_lsn and the metadata checkpoint -- then never advances ahead of durably-applied work, so a crash mid-drain is recovered by re-streaming the in-flight transaction on the next run. The only cost is re-applying already-flushed sub-batches, which transiently over-counts per-block split heuristics (reset on the next tree update) and never misses a change. - Send a standby status update before each flush so a slow apply on a busy node cannot let wal_sender_timeout elapse while the receive loop is blocked. - Memoize PK type-name lookups per drain so a large multi-flush transaction does not re-query pg_type on every sub-batch. Timeout and recovery: - Raise the cdc_processing_timeout default from 30s to 300s. A timeout now means "re-run or raise" (drain progress is durable), not "rebuild", so the budget favours absorbing large backlogs and busy-server slowdowns. - Add a --cdc-timeout flag to `mtree update` and `mtree table-diff` to override the drain budget per invocation. - Reword the drain-timeout error to drop the misleading "rebuild" advice: re-run `ace mtree update`, or raise cdc_processing_timeout to drain in one pass. - Fail loud with an actionable error when a tracked table's REPLICA IDENTITY exposes no primary key (FULL/NOTHING), instead of panicking on an empty PK. Docs (configuration, mtree-update, mtree-table-diff, CHANGELOG) updated. Validated: TestCDCDrainHandlesLargeSingleTransaction drains a single transaction far larger than FlushBatchSize across many flush batches and confirms every change lands in the leaf counters; the CDC, mtree-update, and composite/simple-PK integration tests pass under `go test -race`.

…docstrings - Keepalive-driven drain stop now advances lastLSN to the target before the final standby status update and metadata write. Previously a bounded drain that caught up via a keepalive (the no-change / other-tables-only case) left lastLSN at the start LSN, so the checkpoint regressed to a stale position, the slot never advanced, and every subsequent drain re-scanned the same span (and the server retained that WAL). Safe because the synchronous bounded apply means received == applied, and the branch only fires once the decoder has read past the target, so no tracked change <= target is still pending. - Document all top-level declarations in internal/infra/cdc/listen.go to satisfy the docstring-coverage check. CodeRabbit's other comment -- the drain-timeout branch leaking processChanges workers past wg.Wait() -- is moot under the synchronous bounded-drain apply: a bounded drain spawns no worker goroutines, so the timeout early-return has nothing in flight to wait for. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/cli/default_config.yaml`:
- Around line 33-37: The generated default config is missing the new
cdc_flush_batch_size setting, so update the cdc block in the default config
template to include it alongside cdc_processing_timeout and
cdc_metadata_flush_seconds. Use the existing cdc section in the default config
generation path to keep ace config init aligned with the settings supported by
pkg/config/config.go and internal/infra/cdc/listen.go.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 800de308-6e14-4873-b4e9-34dfc614e451

📥 Commits

Reviewing files that changed from the base of the PR and between ed0406b and 0b242e1.

📒 Files selected for processing (11)

ace.yaml
docs/CHANGELOG.md
docs/commands/mtree/mtree-table-diff.md
docs/commands/mtree/mtree-update.md
docs/configuration.md
internal/cli/cli.go
internal/cli/default_config.yaml
internal/consistency/mtree/merkle.go
internal/infra/cdc/listen.go
pkg/config/config.go
tests/integration/cdc_busy_table_test.go

✅ Files skipped from review due to trivial changes (4)

docs/commands/mtree/mtree-table-diff.md
docs/commands/mtree/mtree-update.md
ace.yaml
docs/CHANGELOG.md

…nt complexity - tests/integration/cdc_busy_table_test.go: mark the seeding DELETE in TestCDCDrainHandlesLargeSingleTransaction `// nosemgrep`. Codacy's critical SQL-injection finding is a false positive -- the table name is a sanitized constant and the bound is a constant int -- and the sibling query already carries the same marker (matching the suite-wide convention). - listen.go (processReplicationStream) and merkle.go (UpdateMtree): add //nolint:gocyclo. Both functions were already far over the cyclomatic threshold before this PR (106 and 51); the bounded-drain changes only touched them, so Codacy re-reports their pre-existing complexity as new. The branching is inherent to the CDC drain protocol loop and the update orchestration; restructuring validated drain code to satisfy the metric is deferred to a dedicated refactor rather than risked under this fix. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Mason's patch added the cdc_flush_batch_size setting (field in pkg/config/config.go, read in internal/infra/cdc/listen.go, documented in docs/configuration.md) but did not list it in the YAML config templates, so `ace config init` emitted a config missing a supported setting. Add it with the code/doc default (10000, matching cdc.FlushBatchSize) to internal/cli/default_config.yaml, and to ace.yaml so the example config stays in sync. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…docstrings - Keepalive-driven drain stop now advances lastLSN to the target before the final standby status update and metadata write. Previously a bounded drain that caught up via a keepalive (the no-change / other-tables-only case) left lastLSN at the start LSN, so the checkpoint regressed to a stale position, the slot never advanced, and every subsequent drain re-scanned the same span (and the server retained that WAL). Safe because the synchronous bounded apply means received == applied, and the branch only fires once the decoder has read past the target, so no tracked change <= target is still pending. - Document all top-level declarations in internal/infra/cdc/listen.go to satisfy the docstring-coverage check. CodeRabbit's other comment -- the drain-timeout branch leaking processChanges workers past wg.Wait() -- is moot under the synchronous bounded-drain apply: a bounded drain spawns no worker goroutines, so the timeout early-return has nothing in flight to wait for. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…nt complexity - tests/integration/cdc_busy_table_test.go: mark the seeding DELETE in TestCDCDrainHandlesLargeSingleTransaction `// nosemgrep`. Codacy's critical SQL-injection finding is a false positive -- the table name is a sanitized constant and the bound is a constant int -- and the sibling query already carries the same marker (matching the suite-wide convention). - listen.go (processReplicationStream) and merkle.go (UpdateMtree): add //nolint:gocyclo. Both functions were already far over the cyclomatic threshold before this PR (106 and 51); the bounded-drain changes only touched them, so Codacy re-reports their pre-existing complexity as new. The branching is inherent to the CDC drain protocol loop and the update orchestration; restructuring validated drain code to satisfy the metric is deferred to a dedicated refactor rather than risked under this fix. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

danolivo added the bug Something isn't working label Jun 29, 2026

coderabbitai Bot reviewed Jun 29, 2026

View reviewed changes

Comment thread internal/infra/cdc/listen.go

Comment thread internal/infra/cdc/listen.go

mason-sharp and others added 2 commits June 30, 2026 12:56

coderabbitai Bot reviewed Jun 30, 2026

View reviewed changes

Comment thread internal/cli/default_config.yaml

Andrei Lepikhov and others added 2 commits June 30, 2026 14:50

mason-sharp approved these changes Jun 30, 2026

View reviewed changes

mason-sharp merged commit 8dc2f4c into main Jun 30, 2026
3 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix `mtree table-diff` silently reporting a diverged cluster as identical (ACE-190)#130

Fix `mtree table-diff` silently reporting a diverged cluster as identical (ACE-190)#130
mason-sharp merged 5 commits into
mainfrom
fix/ACE-190/cdc-drain-premature-stop

danolivo commented Jun 29, 2026

Uh oh!

coderabbitai Bot commented Jun 29, 2026 •

edited

Loading

Review limit reached

Walkthrough

Changes

❌ Failed checks (1 warning)

Uh oh!

codacy-production Bot commented Jun 29, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

mason-sharp commented Jun 30, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

danolivo commented Jun 29, 2026

Problem

Root cause

The fix

Behaviour after the fix

Uh oh!

coderabbitai Bot commented Jun 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review limit reached

Walkthrough

Changes

❌ Failed checks (1 warning)

Uh oh!

codacy-production Bot commented Jun 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Up to standards ✅

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

mason-sharp commented Jun 30, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai Bot commented Jun 29, 2026 •

edited

Loading

codacy-production Bot commented Jun 29, 2026 •

edited

Loading