[parquet] Add regression test for TIMESTAMP(n<=3) reading MICROS-annotated INT64 files #8238
Draft
q8webmaster wants to merge 9 commits into
added 6 commits
June 14, 2026 02:27
…v2 compatibility)

Paimon emits TIMESTAMP(MILLIS) for precision <= 3 columns. The Iceberg v2 spec requires INT64 MICROS for timestamp/timestamptz; MILLIS is only valid under Iceberg v3. This causes Iceberg-aware engines (Athena, Trino, Spark) to reject Parquet files with a schema compatibility error.

- ParquetSchemaConverter.createTimestampWithLogicalType: emit MICROS for precision <= 3 instead of MILLIS.
- ParquetRowDataWriter.TimestampMillsWriter.writeTimestamp: call value.toMicros() so the stored INT64 matches the MICROS annotation unit.

The reader path (MILLIS -> precision=3, MICROS -> precision=6) is left unchanged so files written by older versions remain readable. Existing tables with precision <= 3 columns should be rebuilt after upgrading.

Tests: testLowPrecisionTimestampUseMicrosAnnotation verifies the MICROS annotation for precision 0-3; testPaimonParquetSchemaConvert is updated for the widened round-trip precision.
…of micros

ParquetSimpleStatsExtractor.toTimestampStats called fromEpochMillis for precision <= 3, but footer statistics for those columns now contain INT64 microseconds (matching the MICROS annotation). Switch to fromMicros so that Parquet column bounds are decoded correctly.

VectorizedColumnReader has a lazy dictionary fast path for INT64/LongColumnVector: the raw Parquet dictionary is stored on the vector directly, bypassing LongTimestampUpdater.longTimestamp(), which normalises on-disk microseconds to the milliseconds that ParquetTimestampVector.getTimestamp expects. The result is timestamps ~1000x too far in the future for any dictionary-encoded page (triggered when rowGroupSize is large enough to activate dictionary encoding). Exclude precision <= 3 timestamp types from lazy decoding via a new isLowPrecisionTimestamp helper so the eager path (decodeDictionaryIds) is always taken, applying the correct /1000 normalisation.

…vs epoch_µs

After the MICROS annotation change, ParquetRowDataWriter stores TIMESTAMP(n<=3) values as epoch microseconds. ParquetFilters.convertLiteral was still using getMillisecond() (epoch_ms) for those columns, so the Parquet row-group statistics comparison always failed against the new epoch_µs statistics, causing WHERE predicates on low-precision timestamp columns to filter out all row groups and return empty results.

Fix: use toMicros() for all INT64 timestamp precisions (0-6) in ParquetFilters.convertLiteral, matching the storage unit written by the writer. Update ParquetFiltersTest assertions accordingly.
…poch-milliseconds
…tated Parquet files
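The predicate mismatch described in the ParquetFilters commit above can be illustrated with a simplified min/max check. This is only a sketch: `RowGroupPruningDemo` and `mightContain` are hypothetical names, not Paimon or Parquet APIs, and real row-group filtering handles many more predicate types than the equality case shown here.

```java
final class RowGroupPruningDemo {
    /**
     * Simplified equality pruning: a row group can only contain `literal`
     * if the literal falls inside the column's [min, max] statistics.
     */
    static boolean mightContain(long min, long max, long literal) {
        return literal >= min && literal <= max;
    }

    public static void main(String[] args) {
        long tsMillis = 1_700_000_000_000L;            // the value the user filters on, in epoch ms
        long minMicros = (tsMillis - 1_000L) * 1_000L; // statistics are stored in epoch µs
        long maxMicros = (tsMillis + 1_000L) * 1_000L;

        // Literal left in epoch ms: outside the µs range, so the row group is skipped.
        System.out.println(mightContain(minMicros, maxMicros, tsMillis));
        // Literal converted to epoch µs: inside the range, so the row group is read.
        System.out.println(mightContain(minMicros, maxMicros, tsMillis * 1_000L));
    }
}
```

The point of the sketch is only that the literal and the statistics must share one unit; which side is converted is a design choice, and the commit above converts the literal.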
Problem
After PR #8230 lands, TIMESTAMP(n<=3) columns will be written as INT64 with a MICROS Parquet annotation and epoch-microsecond values. If the vectorized reader fails to respect the annotation's time unit when decoding those columns, it returns timestamps ~1000× too large (year ~58xxx) or throws ArithmeticException: Millis overflow. This scenario had no test coverage.
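To make the size of the error concrete, here is a minimal sketch using only `java.time` rather than Paimon's own `Timestamp` class. The class and method names are hypothetical, and the sample instant is arbitrary.

```java
import java.time.Instant;
import java.time.ZoneOffset;

final class MicrosMisreadDemo {
    /** UTC year obtained when the stored INT64 is wrongly treated as epoch milliseconds. */
    static int yearIfReadAsMillis(long stored) {
        return Instant.ofEpochMilli(stored).atZone(ZoneOffset.UTC).getYear();
    }

    /** UTC year obtained when the stored INT64 is treated as epoch microseconds. */
    static int yearIfReadAsMicros(long stored) {
        long millis = Math.floorDiv(stored, 1_000L); // µs -> ms, rounding toward negative infinity
        return Instant.ofEpochMilli(millis).atZone(ZoneOffset.UTC).getYear();
    }

    public static void main(String[] args) {
        // An epoch-microsecond value, as a MICROS-annotated INT64 column would store it.
        long epochMicros = Instant.parse("2026-06-14T02:27:00Z").toEpochMilli() * 1_000L;

        System.out.println("read as micros: year " + yearIfReadAsMicros(epochMicros));
        System.out.println("read as millis: year " + yearIfReadAsMillis(epochMicros));
    }
}
```

Reading the same number as milliseconds lands tens of thousands of years in the future, which matches the "year ~58xxx" symptom described above.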
Root cause
The fix is already present on master: LongTimestampUpdater.timestampUnit() (introduced in #7845 for NANOS support) reads the actual Parquet annotation and normalises the stored value to epoch-milliseconds before it reaches ParquetTimestampVector.getTimestamp(). Without that normalisation (e.g. Paimon 1.4.1, which lacks timestampUnit()), the raw epoch-µs value is passed to Timestamp.fromEpochMillis(), which yields a value 1000× too large.

Fix
No production code change: the fix is already in LongTimestampUpdater. This PR adds a regression test that would catch any future regression in this code path.
The test mirrors testReadTimestampNanosWrittenByParquet: it writes a Parquet file externally with INT64 MICROS annotation for a TIMESTAMP(3) column, reads it back via ParquetReaderFactory with a TimestampType(3) row type, and asserts the decoded values match Timestamp.fromMicros().

Prior art
PR #8230 introduced the MICROS writer path this test covers. PR #7845 introduced timestampUnit(), which is the reader fix this test guards.

Changes
ParquetReadWriteTest.java: testReadTimestampMicrosWrittenByParquetForLowPrecision reads externally-written INT64 MICROS Parquet with a TIMESTAMP(3) schema and verifies correct decoding
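
The unit-aware normalisation described under "Root cause" can be sketched roughly as follows. This is not the code in LongTimestampUpdater: the `StoredUnit` enum and `toEpochMillis` method are hypothetical names chosen for illustration, and the sketch assumes the reader has already extracted the time unit from the Parquet annotation.

```java
final class TimestampNormalizeDemo {
    /** Hypothetical stand-in for the time unit read from the Parquet logical-type annotation. */
    enum StoredUnit { MILLIS, MICROS, NANOS }

    /**
     * Converts a raw INT64 timestamp to epoch milliseconds according to its unit.
     * floorDiv keeps pre-epoch (negative) values rounding in one direction.
     */
    static long toEpochMillis(long stored, StoredUnit unit) {
        switch (unit) {
            case MILLIS:
                return stored;
            case MICROS:
                return Math.floorDiv(stored, 1_000L);
            case NANOS:
                return Math.floorDiv(stored, 1_000_000L);
            default:
                throw new IllegalArgumentException("Unknown unit: " + unit);
        }
    }

    public static void main(String[] args) {
        long rawMicros = 1_700_000_000_123_000L; // epoch µs as written for a TIMESTAMP(3) column
        System.out.println(toEpochMillis(rawMicros, StoredUnit.MICROS));
    }
}
```

A regression test in the spirit of this PR would fail if the MICROS branch were skipped, because the value would then reach the millisecond-based decoder unchanged.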