[spark] Reject ALTER TABLE REPLACE COLUMNS to avoid silent data corruption #8246

Open
huangxiaopingRD wants to merge 1 commit into apache:master from huangxiaopingRD:spark-reject-replace-columns

Conversation

@huangxiaopingRD

Contributor

Summary

Spark translates ALTER TABLE ... REPLACE COLUMNS into a batch that drops every existing column and re-adds the new set (a combination of DeleteColumn + AddColumn). For Paimon this is unsafe: re-adding a column assigns it a brand-new field id while existing data files keep the old ids, so a re-added column with the same name is treated as a new column and reads back as null from existing data files. The result is silent data corruption.
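To make the field-id problem concrete, here is a minimal toy model of it. It is only a sketch: `FieldIdSketch` and its methods are hypothetical and are not Paimon's real schema or reader classes. It assumes values in a data file are keyed by the field id that was current at write time, and that every added column gets a fresh id.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model (hypothetical, not Paimon's classes): the schema maps column names
// to field ids, and a data file stores values keyed by field id.
class FieldIdSketch {
    private final Map<String, Integer> nameToId = new LinkedHashMap<>();
    private int nextId = 0;

    // Every add hands out a fresh id, even if the name was used before.
    void addColumn(String name) {
        nameToId.put(name, nextId++);
    }

    void dropColumn(String name) {
        nameToId.remove(name);
    }

    // Store a row the way an existing data file would hold it: keyed by field id.
    Map<Integer, Object> writeRow(Map<String, Object> row) {
        Map<Integer, Object> file = new HashMap<>();
        for (Map.Entry<String, Object> e : row.entrySet()) {
            file.put(nameToId.get(e.getKey()), e.getValue());
        }
        return file;
    }

    // Resolve the column's current id; an id the file never saw reads as null.
    Object read(Map<Integer, Object> file, String column) {
        return file.get(nameToId.get(column));
    }

    public static void main(String[] args) {
        FieldIdSketch schema = new FieldIdSketch();
        schema.addColumn("id");
        schema.addColumn("name");

        Map<String, Object> row = new HashMap<>();
        row.put("id", 1);
        row.put("name", "a");
        Map<Integer, Object> oldFile = schema.writeRow(row);

        System.out.println("before: " + schema.read(oldFile, "name")); // before: a

        // What REPLACE COLUMNS amounts to: drop everything, then re-add.
        schema.dropColumn("id");
        schema.dropColumn("name");
        schema.addColumn("id");
        schema.addColumn("name");

        System.out.println("after: " + schema.read(oldFile, "name")); // after: null
    }
}
```

The column name is unchanged, but the lookup now goes through a new id that the old file does not contain, so the read returns null without any error.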

This PR detects that change pattern in SparkCatalog.alterTable and throws an UnsupportedOperationException with a clear message pointing users to RENAME COLUMN / ALTER COLUMN TYPE / DROP COLUMN / ADD COLUMN instead.

The detection matches exclusively on DeleteColumn + AddColumn so a legitimate mixed batch (e.g. a programmatic DROP + RENAME) is not mistaken for a replace.
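The check described above could look roughly like the following. This is a sketch under assumptions, not the code in this PR: `Kind` and `Change` are hypothetical stand-ins for Spark's TableChange subtypes so the example runs without Spark on the classpath, and the exception message is illustrative.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for Spark's TableChange types (DeleteColumn, AddColumn, ...).
class ReplaceDetectionSketch {
    enum Kind { DELETE_COLUMN, ADD_COLUMN, RENAME_COLUMN }

    static final class Change {
        final Kind kind;
        final String column;

        Change(Kind kind, String column) {
            this.kind = kind;
            this.column = column;
        }
    }

    // True only when the batch consists solely of deletes and adds,
    // with at least one of each.
    static boolean looksLikeReplaceColumns(List<Change> changes) {
        boolean hasDeleteColumn = false;
        boolean hasAddColumn = false;
        for (Change change : changes) {
            if (change.kind == Kind.DELETE_COLUMN) {
                hasDeleteColumn = true;
            } else if (change.kind == Kind.ADD_COLUMN) {
                hasAddColumn = true;
            } else {
                return false; // any other change type makes this a mixed batch
            }
        }
        return hasDeleteColumn && hasAddColumn;
    }

    // Guard placed in front of the real schema update (omitted here).
    static void alterTable(List<Change> changes) {
        if (looksLikeReplaceColumns(changes)) {
            // Illustrative message, not the PR's exact wording.
            throw new UnsupportedOperationException(
                    "ALTER TABLE ... REPLACE COLUMNS is not supported; use RENAME COLUMN, "
                            + "ALTER COLUMN TYPE, DROP COLUMN or ADD COLUMN instead.");
        }
    }

    public static void main(String[] args) {
        List<Change> replace = Arrays.asList(
                new Change(Kind.DELETE_COLUMN, "a"), new Change(Kind.ADD_COLUMN, "a"));
        List<Change> mixed = Arrays.asList(
                new Change(Kind.DELETE_COLUMN, "a"), new Change(Kind.RENAME_COLUMN, "b"));
        System.out.println(looksLikeReplaceColumns(replace)); // true
        System.out.println(looksLikeReplaceColumns(mixed));   // false
    }
}
```

A check of this shape also returns true for a batch that drops one column and adds an unrelated one, because it looks only at the kinds of changes and not at which columns they touch.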

Tests

Added SparkSchemaEvolutionITCase#testReplaceColumnsUnsupported verifying the operation is rejected with the expected exception.

@huangxiaopingRD force-pushed the spark-reject-replace-columns branch 3 times, most recently from 0bba7f8 to 69e7fd9 (June 16, 2026 04:04)
…ption

Spark translates REPLACE COLUMNS into a DeleteColumn + AddColumn batch.
Re-adding columns assigns new field ids while existing data files keep the
old ids, so same-named columns are read back as null. Detect this pattern
and throw UnsupportedOperationException instead.
@huangxiaopingRD force-pushed the spark-reject-replace-columns branch from 69e7fd9 to d1d7c96 (June 16, 2026 04:55)
}
}

return hasDeleteColumn && hasAddColumn;

@JingsongLi JingsongLi Jun 16, 2026

Contributor

This heuristic also rejects any programmatic TableCatalog.alterTable call that batches a supported drop and add together, for example deleteColumn("b") plus addColumn("d", ...). That is not necessarily ALTER TABLE ... REPLACE COLUMNS and it used to be a valid combination of existing schema changes. Can we make the detection narrower, e.g. only reject the Spark replace pattern where all current top-level columns are deleted before the new columns are added, or otherwise avoid blocking ordinary drop+add batches?
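One possible shape for the narrower check is sketched below, as an illustration only. `Kind`, `Change` and `isReplaceColumns` are hypothetical names standing in for Spark's TableChange types, and the sketch assumes the replace batch arrives as a leading run of top-level deletes covering every current column, followed only by adds.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-ins for Spark's TableChange types; fieldNames mirrors a column path.
class NarrowReplaceDetectionSketch {
    enum Kind { DELETE_COLUMN, ADD_COLUMN }

    static final class Change {
        final Kind kind;
        final String[] fieldNames;

        Change(Kind kind, String... fieldNames) {
            this.kind = kind;
            this.fieldNames = fieldNames;
        }
    }

    // True only if the batch deletes exactly the current top-level columns first
    // and then contains nothing but adds.
    static boolean isReplaceColumns(List<String> currentTopLevelColumns, List<Change> changes) {
        Set<String> deleted = new HashSet<>();
        int i = 0;
        // Collect the leading run of deletes.
        while (i < changes.size() && changes.get(i).kind == Kind.DELETE_COLUMN) {
            String[] names = changes.get(i).fieldNames;
            if (names.length != 1) {
                return false; // a nested-field delete is not part of the replace pattern
            }
            deleted.add(names[0]);
            i++;
        }
        // The deletes must cover every current top-level column and nothing else.
        if (deleted.isEmpty() || !deleted.equals(new HashSet<>(currentTopLevelColumns))) {
            return false;
        }
        // At least one add must follow, and only adds.
        if (i == changes.size()) {
            return false;
        }
        for (; i < changes.size(); i++) {
            if (changes.get(i).kind != Kind.ADD_COLUMN) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> current = Arrays.asList("a", "b", "c");
        List<Change> dropAndAdd = Arrays.asList(
                new Change(Kind.DELETE_COLUMN, "b"), new Change(Kind.ADD_COLUMN, "d"));
        List<Change> replace = Arrays.asList(
                new Change(Kind.DELETE_COLUMN, "a"), new Change(Kind.DELETE_COLUMN, "b"),
                new Change(Kind.DELETE_COLUMN, "c"), new Change(Kind.ADD_COLUMN, "a"));
        System.out.println(isReplaceColumns(current, dropAndAdd)); // false
        System.out.println(isReplaceColumns(current, replace));    // true
    }
}
```

With this shape, deleteColumn("b") plus addColumn("d", ...) on a three-column table passes through. A batch that drops every current column and then adds new ones is still indistinguishable from REPLACE COLUMNS and would be rejected.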

2 participants