[parquet] Add regression test for TIMESTAMP(n<=3) reading MICROS-annotated INT64 files#8238

Draft
q8webmaster wants to merge 9 commits into apache:master from q8webmaster:fix/parquet-timestamp-vector-micros-reader

@q8webmaster q8webmaster commented Jun 15, 2026

Problem

After PR #8230 lands, TIMESTAMP(n<=3) columns will be written as INT64 with a MICROS Parquet annotation and epoch-microsecond values. If the vectorized reader fails to respect the annotation's time unit when decoding those columns, it returns timestamps ~1000× too large (year ~58xxx) or throws ArithmeticException: Millis overflow.
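To make the size of the error concrete, here is a minimal sketch using only `java.time` (not Paimon's `Timestamp` class). The class and method names are hypothetical and exist only for illustration; the sample date is arbitrary.

```java
import java.time.Instant;
import java.time.ZoneOffset;

final class MicrosAsMillisDemo {

    // Year seen when an epoch-microsecond value is wrongly treated as epoch-milliseconds.
    public static int misreadYear(long epochMicros) {
        return Instant.ofEpochMilli(epochMicros).atZone(ZoneOffset.UTC).getYear();
    }

    // Year seen when the value is first scaled from microseconds to milliseconds.
    public static int correctYear(long epochMicros) {
        long epochMillis = Math.floorDiv(epochMicros, 1_000L);
        return Instant.ofEpochMilli(epochMillis).atZone(ZoneOffset.UTC).getYear();
    }

    public static void main(String[] args) {
        // An arbitrary sample instant, stored the way a MICROS-annotated INT64 column stores it.
        long epochMicros = Instant.parse("2026-06-15T00:00:00Z").toEpochMilli() * 1_000L;
        System.out.println("correct year: " + correctYear(epochMicros));
        System.out.println("misread year: " + misreadYear(epochMicros));
    }
}
```

The misread value lands in the five-digit year range described above, because the elapsed time since the epoch is inflated by a factor of 1000.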

This scenario had no test coverage.

Root cause

The fix is already present on master: LongTimestampUpdater.timestampUnit() (introduced in #7845 for NANOS support) reads the actual Parquet annotation and normalises the stored value to epoch-milliseconds before it reaches ParquetTimestampVector.getTimestamp(). Without that normalisation (e.g. Paimon 1.4.1, which lacks timestampUnit()), the raw epoch-µs value is passed to Timestamp.fromEpochMillis(), which yields a timestamp 1000× too large.
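The idea of normalising by the annotated unit can be sketched as follows. This is not the code in LongTimestampUpdater; the `Unit` enum, the class name, and the choice of floor division for pre-epoch values are all assumptions made for illustration.

```java
final class TimestampUnitNormalizer {

    // Hypothetical stand-in for the time unit carried by the Parquet annotation.
    enum Unit { MILLIS, MICROS, NANOS }

    // Converts a raw INT64 value to epoch-milliseconds according to the annotated unit.
    // floorDiv keeps pre-1970 (negative) values rounding toward negative infinity.
    public static long toEpochMillis(long raw, Unit unit) {
        switch (unit) {
            case MILLIS:
                return raw;
            case MICROS:
                return Math.floorDiv(raw, 1_000L);
            case NANOS:
                return Math.floorDiv(raw, 1_000_000L);
            default:
                throw new IllegalArgumentException("Unknown unit: " + unit);
        }
    }

    public static void main(String[] args) {
        long rawMicros = 1_700_000_000_123_456L;
        System.out.println(toEpochMillis(rawMicros, Unit.MICROS));
    }
}
```

A reader that skips this step and treats every INT64 timestamp as MILLIS behaves like the `MILLIS` branch regardless of the annotation, which is the failure mode described above.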

Fix

No production code change: the fix is already in LongTimestampUpdater. This PR adds a regression test that guards this code path against future breakage.

The test mirrors testReadTimestampNanosWrittenByParquet: it writes a Parquet file externally with INT64 MICROS annotation for a TIMESTAMP(3) column, reads it back via ParquetReaderFactory with a TimestampType(3) row type, and asserts the decoded values match Timestamp.fromMicros().

Prior art

PR #8230 introduced the MICROS writer path this test covers. PR #7845 introduced timestampUnit(), which is the reader fix this test guards.

Changes

  • ParquetReadWriteTest.java: testReadTimestampMicrosWrittenByParquetForLowPrecision — reads externally-written INT64 MICROS Parquet with a TIMESTAMP(3) schema and verifies correct decoding

Q8Webmaster added 6 commits June 14, 2026 02:27
…v2 compatibility)

Paimon emits TIMESTAMP(MILLIS) for precision <= 3 columns. The Iceberg v2
spec requires INT64 MICROS for timestamp/timestamptz; MILLIS is only valid
under Iceberg v3. This causes Iceberg-aware engines (Athena, Trino, Spark)
to reject Parquet files with a schema compatibility error.

- ParquetSchemaConverter.createTimestampWithLogicalType: emit MICROS for
  precision <= 3 instead of MILLIS.
- ParquetRowDataWriter.TimestampMillsWriter.writeTimestamp: call
  value.toMicros() so the stored INT64 matches the MICROS annotation unit.

The reader path (MILLIS -> precision=3, MICROS -> precision=6) is left
unchanged so files written by older versions remain readable. Existing
tables with precision<=3 columns should be rebuilt after upgrading.

Tests: testLowPrecisionTimestampUseMicrosAnnotation verifies MICROS
annotation for precision 0-3; testPaimonParquetSchemaConvert updated for
the widened round-trip precision.
…of micros

ParquetSimpleStatsExtractor.toTimestampStats called fromEpochMillis for
precision <= 3, but footer statistics for those columns now contain INT64
microseconds (matching the MICROS annotation). Switch to fromMicros so
that Parquet column bounds are decoded correctly.
VectorizedColumnReader has a lazy dictionary fast path for INT64/
LongColumnVector: the raw Parquet dictionary is stored on the vector
directly, bypassing LongTimestampUpdater.longTimestamp() which normalises
on-disk microseconds to the milliseconds that ParquetTimestampVector.
getTimestamp expects. The result is timestamps ~1000x too far in the
future for any dictionary-encoded page (triggered when rowGroupSize is
large enough to activate dictionary encoding).

Exclude precision <= 3 timestamp types from lazy decoding via a new
isLowPrecisionTimestamp helper so the eager path (decodeDictionaryIds)
is always taken, applying the correct /1000 normalisation.
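The difference between the two paths can be illustrated with a simplified model. The sketch below does not use VectorizedColumnReader or any Paimon class; the arrays stand in for a Parquet dictionary page and its id stream, and all names are hypothetical.

```java
final class DictionaryDecodeDemo {

    // Eager path: resolve each dictionary id and normalise micros -> millis while decoding.
    public static long[] decodeEager(long[] dictionaryMicros, int[] ids) {
        long[] millis = new long[ids.length];
        for (int i = 0; i < ids.length; i++) {
            millis[i] = Math.floorDiv(dictionaryMicros[ids[i]], 1_000L);
        }
        return millis;
    }

    // Lazy path: the raw dictionary is consulted on access, so no normalisation is applied.
    public static long readLazy(long[] dictionaryMicros, int[] ids, int row) {
        return dictionaryMicros[ids[row]];
    }

    public static void main(String[] args) {
        long[] dictionaryMicros = {1_000_000L, 2_000_000L}; // on-disk values in microseconds
        int[] ids = {0, 1, 0};                              // one dictionary id per row

        long[] eager = decodeEager(dictionaryMicros, ids);
        for (int row = 0; row < ids.length; row++) {
            long lazy = readLazy(dictionaryMicros, ids, row);
            // A consumer expecting milliseconds sees a value 1000x too large on the lazy path.
            System.out.println("row " + row + ": eager=" + eager[row] + " lazy=" + lazy);
        }
    }
}
```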
…vs epoch_µs

After the MICROS annotation change, ParquetRowDataWriter stores
TIMESTAMP(n<=3) values as epoch microseconds. ParquetFilters.convertLiteral
was still using getMillisecond() (epoch_ms) for those columns, so the
Parquet row-group statistics comparison always failed against the new
epoch_µs statistics — causing WHERE predicates on low-precision timestamp
columns to filter out all row groups and return empty results.

Fix: use toMicros() for all INT64 timestamp precisions (0-6) in
ParquetFilters.convertLiteral, matching the storage unit written by the writer.

Update ParquetFiltersTest assertions accordingly.
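Why the unit mismatch empties the result set can be shown with a simplified min/max check. This sketch does not use ParquetFilters or the Parquet statistics API; `mightContain` is a hypothetical stand-in for a row-group pruning test on an equality predicate.

```java
final class RowGroupPruningDemo {

    // Returns true when a row group may contain rows equal to the literal, judged by min/max statistics.
    public static boolean mightContain(long statsMin, long statsMax, long literal) {
        return literal >= statsMin && literal <= statsMax;
    }

    public static void main(String[] args) {
        long targetMillis = 1_700_000_000_000L;

        // Row-group statistics as the writer now stores them: epoch microseconds.
        long statsMin = (targetMillis - 1_000L) * 1_000L;
        long statsMax = (targetMillis + 1_000L) * 1_000L;

        // Literal left in epoch milliseconds: far below statsMin, so the row group is skipped.
        System.out.println("millis literal kept: " + mightContain(statsMin, statsMax, targetMillis));

        // Literal converted to epoch microseconds: falls inside the range, so the row group is read.
        System.out.println("micros literal kept: " + mightContain(statsMin, statsMax, targetMillis * 1_000L));
    }
}
```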
@q8webmaster q8webmaster marked this pull request as draft June 15, 2026 00:27
@q8webmaster q8webmaster reopened this Jun 15, 2026
@q8webmaster q8webmaster changed the title from "[parquet] Fix TIMESTAMP(n<=3) reader decoding epoch-microseconds as epoch-milliseconds" to "[parquet] Add regression test for TIMESTAMP(n<=3) reading MICROS-annotated INT64 files" Jun 15, 2026