Fix Iceberg read optimization returning NULLs for stats-less manifests #1895
il9ue wants to merge 1 commit into
When an Iceberg manifest's per-file column statistics are absent or empty (common for non-Spark writers like pyiceberg with default settings), DataFileMetaInfo::columns_info is empty. The optimization in StorageObjectStorageSource::createReader misread this as "all columns are absent from the file" and returned constant NULLs for every row while still returning the correct row count. Result: silent data loss on icebergLocal, icebergS3, icebergAzure, icebergHDFS, and all *Cluster variants.

Gate the optimization's absent-NULL loop directly on columns_info.empty() instead of introducing a separate stats-presence flag. When no usable per-column stats were parsed -- whether the manifest omitted the stats fields entirely or declared them but left them empty -- fall through to the Parquet reader, which correctly handles physically-present columns (read normally) and schema-evolved-absent columns (handled by IcebergMetadata::getInitialSchemaByPath setting the file's own schema as initial_header).

columns_info is already serialized to workers in the cluster JSON path, so this changes no serialization format and keeps the fork's DataFileMetaInfo serde identical to upstream.

Closes #1545.
Mirror of #1688 (antalya-25.8 fix).

Signed-off-by: Daniel Q. Kim <daniel.kim@altinity.com>
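The gating described in the commit message can be illustrated with a small toy model. This is only a sketch in Python, not the C++ code from the PR: the function `plan_read`, its `gate_on_empty_stats` parameter, and the placeholder stats dictionaries are hypothetical names introduced for illustration. It shows how an empty `columns_info` map turns every requested column into a constant NULL without the gate, and how the gate makes the reader fall through to a normal read.

```python
def plan_read(requested, columns_info, gate_on_empty_stats):
    """Toy model: return (columns_to_read, constant_null_columns) for one data file.

    Hypothetical helper for illustration; it does not mirror the real C++ signatures.
    """
    if gate_on_empty_stats and not columns_info:
        # No usable per-column stats were parsed: skip the optimization
        # and let the file reader handle every requested column.
        return list(requested), []
    # Columns without a stats entry are treated as absent from the file.
    to_read = [c for c in requested if c in columns_info]
    constant_nulls = [c for c in requested if c not in columns_info]
    return to_read, constant_nulls


requested = ["id", "name"]

# Stats-less manifest, no gate: every column becomes a constant NULL.
buggy = plan_read(requested, {}, gate_on_empty_stats=False)
# Stats-less manifest, gated: every column is read from the file.
fixed = plan_read(requested, {}, gate_on_empty_stats=True)
# Manifest with stats for "id" only: the absent-column logic still applies.
with_stats = plan_read(requested, {"id": {}}, gate_on_empty_stats=True)

print(buggy)       # ([], ['id', 'name'])
print(fixed)       # (['id', 'name'], [])
print(with_stats)  # (['id'], ['name'])
```

The third case is why the gate only covers the fully empty map: when some stats exist, a missing entry is still taken to mean the column is absent from that file.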
Audit update for PR #1895
Mirror of: Altinity/ClickHouse#1688 (antalya-25.8 fix). Closes Altinity/ClickHouse#1545. Re-open of the closed fork-based PR Altinity/ClickHouse#1814 (same commit
Confirmed defects
No confirmed defects in reviewed scope.
PR #1895 CI Verification — Fix Iceberg read optimization returning NULLs for stats-less manifests
Verdict
No PR-caused regressions identified. All red checks are infrastructure timeouts, pre-existing flaky tests, or a pre-existing UBSan bug in unrelated code (
Job status summary (current SHA)
Failing checks on current SHA
| # | Check | Failure | Category |
|---|---|---|---|
| 1 | Stress test (arm_asan, s3) | Cannot start clickhouse-server | Pre-existing flaky |
| 2 | AST fuzzer (amd_ubsan) | UBSan integer overflow in ToStartOfInterval<Year> | Pre-existing upstream bug, unrelated |
| 3 | Integration tests (amd_msan, 4/6) | Job-level error, 0 test fails | Harness timeout |
| 4 | Integration tests (arm_binary, distributed plan, 2/4) | Job-level error, 0 test fails | Harness timeout |
| 5 | SQLLogic test | Job-level error, 0 test fails | Harness timeout (3h limit) |
| 6 | RegressionTestsRelease / S3Export (part) / s3_export_part | could not bring up docker-compose cluster | Infrastructure |
1. Stress test (arm_asan, s3) — pre-existing flake
- Test name `Cannot start clickhouse-server`: 88 failures across 34 unrelated PRs in the last 60 days (since 2026-04-30). Recurring infrastructure / startup flake.
2. AST fuzzer (amd_ubsan) — pre-existing upstream UBSan bug
- Stack trace from `stderr.log`:
/ClickHouse/src/Functions/DateTimeTransforms.h:817:79: runtime error:
signed integer overflow: 9223372036854775806 * 12 cannot be represented in type 'Int64'
#0 DB::ToStartOfInterval<(IntervalKind::Kind)10>::execute(...) Functions/DateTimeTransforms.h:817
#1 DB::FunctionToStartOfInterval::execute<DataTypeDate32, ...> Functions/toStartOfInterval.cpp:437
...
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior
- The bug is in `Functions/DateTimeTransforms.h` and `Functions/toStartOfInterval.cpp`, completely unrelated to the Iceberg / object storage code touched by this PR.
- `AST fuzzer (amd_ubsan)` "Unknown error" hits in last 60 days: PR=0 (master) ×4, plus PRs 1645, 1568, 1803, 1895. This is a generic recurring fuzzer bucket.
- This is a fuzzer-discovered upstream UBSan issue, not a regression introduced by PR #1895.
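The arithmetic in the sanitizer message can be checked by hand. The sketch below uses Python's arbitrary-precision integers to show that the reported product cannot fit in a signed 64-bit value. The variable names are illustrative only, and nothing here reproduces the ClickHouse code path itself.

```python
INT64_MAX = 2**63 - 1  # largest value representable in a signed 64-bit integer

operand = 9223372036854775806  # left operand quoted in the UBSan message
product = operand * 12         # Python ints do not wrap, so this is the exact product

overflows_int64 = product > INT64_MAX
print(overflows_int64)  # True

# Largest operand whose product with 12 still fits in Int64.
max_safe_operand = INT64_MAX // 12
```

Any fuzzer-generated operand above `max_safe_operand` triggers the same report, which is consistent with the finding recurring across unrelated PRs.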
3–5. Integration tests (amd_msan 4/6, arm_binary distributed plan 2/4) and SQLLogic test — harness timeouts
All three jobs are reported as error in the database with 0 test failures — meaning every test that completed passed; the job itself was killed by the harness timeout.
From Integration tests (amd_msan, 4/6) job log:
[2026-06-08 22:32:51] PASSED test_multiple_disks/test.py::test_concurrent_alter_modify[mt]
[2026-06-08 22:36:10] PASSED test_multiple_disks/test.py::test_concurrent_alter_modify[replicated]
[2026-06-08 22:37:34] WARNING: Timeout exceeded [11400], sending SIGTERM to process group
[2026-06-08 22:49:53] PASSED test_ttl_replicated/test.py::test_ttl_drop_parts_limit
[2026-06-08 22:49:53] === 3 passed in 1218.94s ===
[2026-06-08 22:49:54] Test execution was interrupted (exit status: 2)
From SQLLogic test job log:
[2026-06-08 22:51:52] WARNING: Timeout exceeded [10800] for [3952]
[2026-06-08 22:51:55] WARNING: Job timed out: [SQLLogic test], timeout [10800], exit code [137]
All three are runner-level timeouts (3h / 11400s limits), not regressions. None of these jobs touch the Iceberg/StorageObjectStorageSource code path.
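As a quick check on the limits quoted above, the timeout values can be pulled from the two log lines and converted to hours and minutes. This is a minimal sketch: the helper names `timeout_seconds` and `as_h_mm` are made up for illustration, and the regular expression only assumes the `Timeout exceeded [N]` shape seen in the excerpts.

```python
import re

# The two timeout warnings quoted from the job logs above.
LOG_LINES = [
    "[2026-06-08 22:37:34] WARNING: Timeout exceeded [11400], sending SIGTERM to process group",
    "[2026-06-08 22:51:52] WARNING: Timeout exceeded [10800] for [3952]",
]

TIMEOUT_RE = re.compile(r"Timeout exceeded \[(\d+)\]")


def timeout_seconds(line):
    """Return the timeout in seconds from a warning line, or None if absent."""
    match = TIMEOUT_RE.search(line)
    return int(match.group(1)) if match else None


def as_h_mm(seconds):
    """Format a duration in seconds as e.g. '3h10m'."""
    hours, rem = divmod(seconds, 3600)
    return f"{hours}h{rem // 60:02d}m"


limits = [as_h_mm(timeout_seconds(line)) for line in LOG_LINES]
print(limits)  # ['3h10m', '3h00m']
```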
6. S3Export (part) regression suite — infrastructure failure
From the report:
/s3 Fail 52s could not bring up docker-compose cluster
/s3/minio Fail 52s could not bring up docker-compose cluster
The test environment failed to start in 52 seconds (no test ever ran). Pure infrastructure issue — completely independent of the PR's source change to Iceberg read optimization.
PR's own new tests — all passing
tests/integration/test_storage_iceberg_no_spark/test_iceberg_read_optimization_empty_stats.py includes 4 test cases:
- test_iceberg_local_full_stats_manifest_reads_correctly
- test_iceberg_local_partial_stats_manifest_reads_correctly
- test_iceberg_local_returns_correct_rows_when_optimization_disabled
- test_iceberg_local_returns_actual_rows_with_stats_less_manifest
All 4 tests passed in all 5 integration jobs that scheduled them on this SHA:
| Check | Status (4/4 cases) |
|---|---|
| Integration tests (amd_tsan, 5/6) | OK |
| Integration tests (amd_msan, 5/6) | OK |
| Integration tests (amd_binary, 5/5) | OK |
| Integration tests (amd_asan, db disk, old analyzer, 5/6) | OK |
| Integration tests (arm_binary, distributed plan, 4/4) | OK |
That is 20 / 20 PASS for the PR's own coverage — the fix's behavior is consistently exercised across debug, release, sanitizer, alternative-analyzer, and ARM builds.
Recommendations
- Rerun the failing jobs to clear the harness-timeout reds:
  - `Integration tests (amd_msan, 4/6)`
  - `Integration tests (arm_binary, distributed plan, 2/4)`
  - `SQLLogic test`
- Rerun `Stress test (arm_asan, s3)` to clear the known startup flake.
- Rerun `RegressionTestsRelease / S3Export (part)` (infra docker-compose failure).
- The `AST fuzzer (amd_ubsan)` UBSan finding in `ToStartOfInterval` is an upstream bug worth tracking separately, but does not block this PR; the touched code is unrelated.
- Merge once jobs are green; no PR-introduced regressions detected.
Re-opened from `Altinity/ClickHouse` (instead of my fork) so CI publishes direct `.deb` package URLs for clickhouse-regression. Same commit (8b597ed) as #1814. Fork PRs don't receive repo secrets, so the S3 upload step was skipped on #1814 and CI only emitted the `DEB_ARM_RELEASE` artifact zip.
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Fix Iceberg read optimization returning NULL for every column when reading from manifests written without per-file column statistics (typical of non-Spark writers like pyiceberg with default settings). Affects `icebergLocal`, `icebergS3`, `icebergAzure`, `icebergHDFS`, and all `*Cluster` variants. Antalya 26.3 fix for #1545.
Documentation entry for user-facing changes
Antalya-specific bug fix on `antalya-26.3`. No upstream cherry-pick: this bug exists only on Antalya, introduced by #1069 ("Read optimization using Iceberg metadata"). Mirror of the 25.8 fix in #1688.
Why this fires
When reading an Iceberg table written by a non-Spark writer that omits per-file column statistics from the manifest's Avro schema (pyiceberg with default settings, format v1 writers, and others), the `allow_experimental_iceberg_read_optimization` path produces silent data loss: correct row counts, every column value `NULL`. Confirmed on `icebergLocal`; the same code path fires for `icebergS3`, `icebergAzure`, `icebergHDFS`, and all `*Cluster` variants.
Root cause
`IcebergIterator` always populates `file_meta_info` before yielding objects, so the `file_meta_data.has_value()` check in the optimization passes. The problem is what's inside the populated `DataFileMetaInfo`: when the manifest's `data_file.value_counts` / `column_sizes` / `null_value_counts` Avro fields are all absent (all three are optional per the Iceberg spec), `DataFileMetaInfo::columns_info` stays empty.

The optimization's second loop in `StorageObjectStorageSource::createReader` then iterates every requested column, finds none in the empty `columns_info` map, and adds them all to `constant_columns_with_values` with `Field()` (NULL). `requested_columns_copy` is cleared, `need_only_count = true`, the Parquet reader returns row count only, and `generate()` injects every column as a constant-NULL column at the correct row count. The optimization conflates "no stats were written" with "all columns are absent", but absent stats tell us nothing about which columns are physically present.
The fix
Add `any_stats_field_present` (bool) to `DataFileMetaInfo`, populated during manifest parsing in `AvroForIcebergDeserializer.cpp`: `true` if any of `value_counts`, `column_sizes`, or `null_value_counts` were emitted. Gate the optimization's absent-NULL loop on this flag: when no stats were emitted, skip the loop and fall through to the Parquet reader, which correctly handles both physically-present columns (read normally) and schema-evolved-absent columns (handled upstream by `IcebergMetadata::getInitialSchemaByPath` setting the file's own schema as `initial_header`).

A per-column presence set was considered but is unnecessary: schema evolution is already handled upstream of the optimization, so the boolean is sufficient. JSON serialization (cluster reads via `toJson()` / JSON-ptr constructor) round-trips the new field; missing-on-deserialization defaults to `false`, matching pre-fix behavior.
Files changed
- `src/Storages/ObjectStorage/DataLakes/IDataLakeMetadata.h`: added `any_stats_field_present` to `DataFileMetaInfo`; constructor signature updated.
- `src/Storages/ObjectStorage/DataLakes/IDataLakeMetadata.cpp`: JSON serde round-trips the new field; defaults to `false` on missing.
- `src/Storages/ObjectStorage/DataLakes/Iceberg/ManifestFile.h`: header updates for `ParsedManifestFileEntry`.
- `src/Storages/ObjectStorage/DataLakes/Common/AvroForIcebergDeserializer.cpp`: tracks whether any stats Avro field was present during manifest parsing on 26.3.
- `src/Storages/ObjectStorage/DataLakes/Iceberg/IcebergIterator.cpp`: forwards the new bool when constructing `DataFileMetaInfo`.
- `src/Storages/ObjectStorage/StorageObjectStorageSource.cpp`: the absent-NULL loop now skips when `any_stats_field_present` is `false`.
Tested
Integration test `tests/integration/test_storage_iceberg_no_spark/test_iceberg_read_optimization_empty_stats.py`, ported from the 25.8 PR. Four scenarios:
- `test_iceberg_local_returns_actual_rows_with_stats_less_manifest`: reproducer, fails without the fix.
- `test_iceberg_local_returns_correct_rows_when_optimization_disabled`: control.
- `test_iceberg_local_partial_stats_manifest_reads_correctly`: manifest with `value_counts` only.
- `test_iceberg_local_full_stats_manifest_reads_correctly`: full Spark-style stats regression guard.

Closes #1545
Mirror of #1688 (antalya-25.8 fix).
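The serialization behavior described under "The fix" (the new field round-trips through JSON, and a payload without it deserializes to `false`) can be sketched as follows. This is an illustration in Python only: the helpers `to_json` and `from_json` and the JSON key names are assumptions made for this sketch, not the actual `DataFileMetaInfo` serde in the C++ sources.

```python
import json


def to_json(columns_info, any_stats_field_present):
    """Serialize a toy metadata record (hypothetical key names)."""
    return json.dumps(
        {
            "columns_info": columns_info,
            "any_stats_field_present": any_stats_field_present,
        }
    )


def from_json(payload):
    """Deserialize; a missing flag falls back to False."""
    obj = json.loads(payload)
    return (
        obj.get("columns_info", {}),
        bool(obj.get("any_stats_field_present", False)),
    )


# Round trip keeps the flag.
round_tripped = from_json(to_json({"id": {}}, True))

# A payload written without the field still parses, with the flag defaulting to False.
legacy = from_json(json.dumps({"columns_info": {"id": {}}}))

print(round_tripped)  # ({'id': {}}, True)
print(legacy)         # ({'id': {}}, False)
```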
CI/CD Options
Exclude tests:
Regression jobs to run: