[spark] Reject ALTER TABLE REPLACE COLUMNS to avoid silent data corruption by huangxiaopingRD · Pull Request #8246 · apache/paimon

huangxiaopingRD · 2026-06-16T03:49:03Z

Summary

Spark translates ALTER TABLE ... REPLACE COLUMNS into a batch that drops every existing column and re-adds the new set (a combination of DeleteColumn + AddColumn changes). For Paimon this is unsafe: re-adding a column assigns it a brand-new field id while existing data files keep the old ids, so same-named columns are treated as new columns and read back as null. The result is silent data corruption.
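
To make the field-id mismatch concrete, here is a minimal, self-contained sketch. It is a toy model and not Paimon's reader: the `Field` class, `dropAllThenAdd`, and `read` are hypothetical names, and the id assignment (next id above the current maximum) is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of id-based column resolution. Illustrative only, not Paimon's reader. */
final class FieldIdDemo {

    /** A schema field: stored data is keyed by id, the name is only a label. */
    static final class Field {
        final int id;
        final String name;

        Field(int id, String name) {
            this.id = id;
            this.name = name;
        }
    }

    /** Hypothetical helper: drop everything, then re-add the names under fresh ids. */
    static List<Field> dropAllThenAdd(List<Field> current, List<String> newNames) {
        int nextId = 0;
        for (Field f : current) {
            nextId = Math.max(nextId, f.id + 1); // assumption: ids are never reused
        }
        List<Field> replaced = new ArrayList<>();
        for (String name : newNames) {
            replaced.add(new Field(nextId++, name));
        }
        return replaced;
    }

    /** Reads one stored row against a schema, resolving each column by field id. */
    static Map<String, Object> read(Map<Integer, Object> storedRow, List<Field> schema) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Field f : schema) {
            out.put(f.name, storedRow.get(f.id)); // id absent from the stored row -> null
        }
        return out;
    }

    public static void main(String[] args) {
        List<Field> before = Arrays.asList(new Field(0, "a"), new Field(1, "b"));

        Map<Integer, Object> storedRow = new HashMap<>(); // written under ids 0 and 1
        storedRow.put(0, 1);
        storedRow.put(1, "x");

        // Same column names, but the re-added fields get ids 2 and 3.
        List<Field> after = dropAllThenAdd(before, Arrays.asList("a", "b"));

        System.out.println(read(storedRow, before)); // {a=1, b=x}
        System.out.println(read(storedRow, after));  // {a=null, b=null}
    }
}
```

The stored values are still present in this model, but no field in the new schema points at them, which is why the loss is silent.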

This PR detects that change pattern in SparkCatalog.alterTable and throws an UnsupportedOperationException with a clear message pointing users to RENAME COLUMN / ALTER COLUMN TYPE / DROP COLUMN / ADD COLUMN instead.

The detection matches only batches that consist exclusively of DeleteColumn and AddColumn changes, so a legitimate mixed batch (e.g. a programmatic DROP + RENAME) is not mistaken for a replace.
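
A rough sketch of how such a check could look is given below. It is not the PR's code: the `Change` types are self-contained stand-ins for Spark's `TableChange.DeleteColumn`, `TableChange.AddColumn`, and `TableChange.RenameColumn`, and the method names and exception message are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

/** Stand-in change types; the real ones are nested in Spark's TableChange interface. */
final class ReplaceColumnsHeuristic {

    interface Change {}

    static final class DeleteColumn implements Change {
        final String name;
        DeleteColumn(String name) { this.name = name; }
    }

    static final class AddColumn implements Change {
        final String name;
        AddColumn(String name) { this.name = name; }
    }

    static final class RenameColumn implements Change {
        final String from;
        final String to;
        RenameColumn(String from, String to) { this.from = from; this.to = to; }
    }

    /** True when the batch holds only deletes and adds, with at least one of each. */
    static boolean looksLikeReplaceColumns(List<Change> changes) {
        boolean hasDeleteColumn = false;
        boolean hasAddColumn = false;
        for (Change change : changes) {
            if (change instanceof DeleteColumn) {
                hasDeleteColumn = true;
            } else if (change instanceof AddColumn) {
                hasAddColumn = true;
            } else {
                return false; // any other change type means a mixed batch
            }
        }
        return hasDeleteColumn && hasAddColumn;
    }

    /** Hypothetical guard, meant to run before any change is applied. */
    static void rejectReplaceColumns(List<Change> changes) {
        if (looksLikeReplaceColumns(changes)) {
            throw new UnsupportedOperationException(
                    "ALTER TABLE ... REPLACE COLUMNS is not supported; use RENAME COLUMN, "
                            + "ALTER COLUMN TYPE, DROP COLUMN or ADD COLUMN instead.");
        }
    }

    public static void main(String[] args) {
        List<Change> replace =
                Arrays.<Change>asList(new DeleteColumn("a"), new AddColumn("a"));
        List<Change> dropAndRename =
                Arrays.<Change>asList(new DeleteColumn("a"), new RenameColumn("b", "c"));

        System.out.println(looksLikeReplaceColumns(replace));       // true
        System.out.println(looksLikeReplaceColumns(dropAndRename)); // false
    }
}
```

Note that under this rule any batch of one drop plus one add is also classified as a replace, whatever columns it touches.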

Tests

Added SparkSchemaEvolutionITCase#testReplaceColumnsUnsupported, which verifies that the operation is rejected with the expected exception.

…ption Spark translates REPLACE COLUMNS into a DeleteColumn + AddColumn batch. Re-adding columns assigns new field ids while existing data files keep the old ids, so same-named columns are read back as null. Detect this pattern and throw UnsupportedOperationException instead.

JingsongLi · 2026-06-16T08:54:11Z

+            }
+        }
+
+        return hasDeleteColumn && hasAddColumn;


This heuristic also rejects any programmatic TableCatalog.alterTable call that batches a supported drop and add together, for example deleteColumn("b") plus addColumn("d", ...). That is not necessarily ALTER TABLE ... REPLACE COLUMNS and it used to be a valid combination of existing schema changes. Can we make the detection narrower, e.g. only reject the Spark replace pattern where all current top-level columns are deleted before the new columns are added, or otherwise avoid blocking ordinary drop+add batches?
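
One possible shape for the narrower check suggested here is sketched below, under stated assumptions: the `Change` types are stand-ins for Spark's `TableChange` classes, `isReplaceColumns` is a hypothetical name, column names are compared case-sensitively, and only top-level columns are modeled. It treats a batch as a replace only if every current column is deleted and all deletes come before the first add.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Sketch of a narrower replace detection; stand-in types, not Spark's or Paimon's API. */
final class NarrowReplaceDetection {

    interface Change {}

    static final class DeleteColumn implements Change {
        final String name;
        DeleteColumn(String name) { this.name = name; }
    }

    static final class AddColumn implements Change {
        final String name;
        AddColumn(String name) { this.name = name; }
    }

    /**
     * True only when the batch deletes every current top-level column first
     * and then adds at least one column, with no other change types mixed in.
     */
    static boolean isReplaceColumns(List<String> currentColumns, List<Change> changes) {
        Set<String> deleted = new HashSet<>();
        boolean seenAdd = false;
        for (Change change : changes) {
            if (change instanceof DeleteColumn) {
                if (seenAdd) {
                    return false; // a delete after an add is not the replace pattern
                }
                deleted.add(((DeleteColumn) change).name);
            } else if (change instanceof AddColumn) {
                seenAdd = true;
            } else {
                return false; // mixed batch
            }
        }
        return seenAdd && !deleted.isEmpty() && deleted.containsAll(currentColumns);
    }

    public static void main(String[] args) {
        List<String> current = Arrays.asList("a", "b", "c");

        // Ordinary programmatic batch: drop one column, add another.
        List<Change> dropAndAdd =
                Arrays.<Change>asList(new DeleteColumn("b"), new AddColumn("d"));

        // Replace pattern: every current column dropped, then the new set added.
        List<Change> replace = Arrays.<Change>asList(
                new DeleteColumn("a"), new DeleteColumn("b"), new DeleteColumn("c"),
                new AddColumn("a"), new AddColumn("b"));

        System.out.println(isReplaceColumns(current, dropAndAdd)); // false
        System.out.println(isReplaceColumns(current, replace));    // true
    }
}
```

With this rule the deleteColumn("b") plus addColumn("d", ...) batch passes through, while a batch that drops the whole schema before adding is still rejected.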

huangxiaopingRD force-pushed the spark-reject-replace-columns branch 3 times, most recently from 0bba7f8 to 69e7fd9 · June 16, 2026 04:04

huangxiaopingRD force-pushed the spark-reject-replace-columns branch from 69e7fd9 to d1d7c96 · June 16, 2026 04:55

JingsongLi reviewed Jun 16, 2026

huangxiaopingRD wants to merge 1 commit into apache:master from huangxiaopingRD:spark-reject-replace-columns
