[python] Add schema short-circuit to SplitRead and FileScanner read paths by MgjLLL · Pull Request #8217 · apache/paimon

MgjLLL · 2026-06-12T06:45:16Z

Purpose

Fix redundant filesystem I/O in SplitRead and FileScanner when reading schema.

SplitRead has 3 call sites that unconditionally call schema_manager.get_schema(schema_id), even when schema_id == table.table_schema.id and the schema is therefore already in memory. This causes unnecessary filesystem reads in the common case (no schema evolution).

Java equivalent (RawFileSplitRead.createFileReader()) short-circuits with:

schemaId == schema.id() ? schema : schemaManager.schema(schemaId)

Changes

split_read.py: Add _resolve_schema() method that returns in-memory schema when id matches, replacing 3 direct get_schema() calls in raw_reader_supplier, _get_fields_and_predicate, and _file_read_fields
file_scanner.py: Add _schema_fields() method with same short-circuit pattern for SimpleStatsEvolutions
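
A minimal sketch of the short-circuit described above, using stand-in classes rather than the real pypaimon types. `Schema`, `FakeSchemaManager`, `FakeTable`, and `Reader` are hypothetical names for illustration; only `_resolve_schema`, `get_schema`, `schema_manager`, and `table.table_schema.id` come from this PR. The read counter models the filesystem I/O that the change is meant to avoid.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Schema:
    """Hypothetical stand-in for a table schema."""
    id: int
    fields: List[str] = field(default_factory=list)


class FakeSchemaManager:
    """Hypothetical stand-in that counts simulated filesystem reads."""

    def __init__(self, schemas: Dict[int, Schema]):
        self._schemas = schemas
        self.reads = 0

    def get_schema(self, schema_id: int) -> Schema:
        self.reads += 1  # each call models one filesystem read
        return self._schemas[schema_id]


@dataclass
class FakeTable:
    table_schema: Schema
    schema_manager: FakeSchemaManager


class Reader:
    """Hypothetical holder for the helper; not the real SplitRead."""

    def __init__(self, table: FakeTable):
        self.table = table

    def _resolve_schema(self, schema_id: int) -> Schema:
        current = self.table.table_schema
        # Compare ids explicitly so that schema id 0 is handled like any other.
        if schema_id == current.id:
            return current
        return self.table.schema_manager.get_schema(schema_id)


old, current = Schema(0, ["a"]), Schema(1, ["a", "b"])
manager = FakeSchemaManager({0: old, 1: current})
reader = Reader(FakeTable(table_schema=current, schema_manager=manager))

same = reader._resolve_schema(1)      # current schema: served from memory
reads_after_current = manager.reads
older = reader._resolve_schema(0)     # older schema: delegated to the manager
reads_after_old = manager.reads
```

In this sketch only the lookup for the older schema id reaches the manager, which mirrors the Java expression quoted in the Purpose section.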

Tests

Added file_scanner_schema_fields_test.py with 3 test cases covering short-circuit, delegation, and zero-id edge case
All existing tests pass (106 passed)
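
The three cases could look roughly like the sketch below. It exercises a hypothetical free function `schema_fields(table, schema_id)` that mirrors the behavior described for `_schema_fields()`; it is not the content of file_scanner_schema_fields_test.py. Reading the zero-id case as a guard against a truthiness check on the id is an assumption.

```python
from unittest.mock import Mock


def schema_fields(table, schema_id):
    """Hypothetical mirror of FileScanner._schema_fields()."""
    if schema_id == table.table_schema.id:
        return table.table_schema.fields
    return table.schema_manager.get_schema(schema_id).fields


def make_table(current_id, current_fields):
    # Hypothetical fixture helper: a mocked table with a complete current schema.
    table = Mock()
    table.table_schema = Mock(id=current_id, fields=current_fields)
    return table


def test_short_circuit():
    table = make_table(3, ["f0", "f1"])
    assert schema_fields(table, 3) == ["f0", "f1"]
    table.schema_manager.get_schema.assert_not_called()


def test_delegation():
    table = make_table(3, ["f0", "f1"])
    table.schema_manager.get_schema.return_value = Mock(fields=["f0"])
    assert schema_fields(table, 2) == ["f0"]
    table.schema_manager.get_schema.assert_called_once_with(2)


def test_zero_id():
    # Schema id 0 is falsy, so an `if schema_id:` style check would miss it.
    table = make_table(0, ["f0"])
    assert schema_fields(table, 0) == ["f0"]
    table.schema_manager.get_schema.assert_not_called()


for case in (test_short_circuit, test_delegation, test_zero_id):
    case()
```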

This closes #8216

…er docstring

The schema short-circuit in FileScanner._schema_fields() returns table.table_schema.fields when schema_id matches the current schema id. The test fixture only mocked Mock(id=0) without .fields, causing the short-circuit path to return a Mock auto-attribute that is not iterable when used by SimpleStatsEvolutions._create_index_cast_mapping.
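
The failure mode can be reproduced with `unittest.mock` alone. The sketch below shows that an attribute never configured on a `Mock` is itself a `Mock`, which does not support iteration. Supplying `fields=[]` is one way to complete the fixture and is an assumption here, not necessarily the change made in the commit.

```python
from unittest.mock import Mock

# Fixture as described: only `id` is configured, so `.fields` is auto-created.
incomplete = Mock(id=0)
auto_attribute_is_mock = isinstance(incomplete.fields, Mock)

try:
    list(incomplete.fields)  # a plain Mock defines no __iter__
    iterable = True
except TypeError:
    iterable = False

# Assumed repair: give the mocked schema a real (possibly empty) field list.
complete = Mock(id=0, fields=[])
fields_ok = list(complete.fields)
```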

JingsongLi · 2026-06-16T03:54:20Z

-
        self.simple_stats_evolutions = SimpleStatsEvolutions(
-            schema_fields_func,
+            self._schema_fields,


This only protects the SimpleStatsEvolutions callback, but a normal scan still resolves the file schema while decoding manifest entries before _filter_manifest_entry() can use this helper. ManifestFileManager.read() still calls table.schema_manager.get_schema(file_dict["_SCHEMA_ID"]).fields for every entry, so current-schema files from a REST catalog can still hit the same 403 before this short-circuit is reached. Please apply the same current-schema short-circuit in the manifest decode path as well, or pass a resolver into ManifestFileManager.
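
As a rough illustration of the second option (passing a resolver into the manifest reader), the sketch below decodes entries through an injected callable. `make_field_resolver`, `decode_entries`, `forbidden_loader`, and the `_FILE_NAME` key are hypothetical; only the `_SCHEMA_ID` key and the 403 scenario come from the comment above. It is not the actual `ManifestFileManager.read()` signature.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple

FieldResolver = Callable[[int], List[str]]


def make_field_resolver(current_id: int,
                        current_fields: List[str],
                        load_fields: FieldResolver) -> FieldResolver:
    """Return a resolver that serves the current schema from memory."""
    def resolve(schema_id: int) -> List[str]:
        if schema_id == current_id:
            return current_fields
        return load_fields(schema_id)
    return resolve


def decode_entries(file_dicts: Iterable[Dict[str, Any]],
                   resolve_fields: FieldResolver) -> List[Tuple[str, List[str]]]:
    """Hypothetical decode step: pair each entry with its schema fields."""
    return [(d["_FILE_NAME"], resolve_fields(d["_SCHEMA_ID"])) for d in file_dicts]


def forbidden_loader(schema_id: int) -> List[str]:
    # Models a catalog where reading schema files directly is rejected.
    raise PermissionError("403: schema files are not readable")


resolver = make_field_resolver(2, ["id", "name"], forbidden_loader)
entries = [
    {"_FILE_NAME": "data-0.parquet", "_SCHEMA_ID": 2},
    {"_FILE_NAME": "data-1.parquet", "_SCHEMA_ID": 2},
]
decoded = decode_entries(entries, resolver)
```

Because every entry in this example uses the current schema id, the loader that would raise is never invoked. An entry with an older schema id would still reach it, which matches the scope of the short-circuit discussed in this PR.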

JingsongLi · 2026-06-16T03:54:30Z

    def _nested_path_by_name(self) -> Optional[Dict[str, List[str]]]:
        return self._cached_nested_path_by_name

+    def _resolve_schema(self, schema_id: int):


There is still another direct schema-manager access in DataEvolutionSplitRead._create_union_reader(): when a regular file has no write_cols, it calls self.table.schema_manager.get_schema(first_file.schema_id) to derive field ids. If first_file.schema_id is the current schema id, this bypasses _resolve_schema() and can trigger the same REST-catalog 403 on data-evolution reads. Please switch that remaining call to self._resolve_schema(first_file.schema_id) too.

MgjLLL added 4 commits June 12, 2026 11:13

[python] Fix schema fields callback to short-circuit current schema id

6f56c7c

[python] Add schema short-circuit to SplitRead and simplify FileScann…

e9f1267

…er docstring

[python] Trigger CI rerun

8817e84

JingsongLi reviewed Jun 16, 2026

MgjLLL wants to merge 4 commits into apache:master from MgjLLL:python-fix-stats-evolutions-eager-schema-read
