[python] Add schema short-circuit to SplitRead and FileScanner read paths#8217
[python] Add schema short-circuit to SplitRead and FileScanner read paths#8217MgjLLL wants to merge 4 commits into
Conversation
The schema short-circuit in FileScanner._schema_fields() returns table.table_schema.fields when schema_id matches the current schema id. The test fixture only mocked Mock(id=0) without .fields, causing the short-circuit path to return a Mock auto-attribute that is not iterable when used by SimpleStatsEvolutions._create_index_cast_mapping.
|
|
||
| self.simple_stats_evolutions = SimpleStatsEvolutions( | ||
| schema_fields_func, | ||
| self._schema_fields, |
There was a problem hiding this comment.
This only protects the SimpleStatsEvolutions callback, but a normal scan still resolves the file schema while decoding manifest entries before _filter_manifest_entry() can use this helper. ManifestFileManager.read() still calls table.schema_manager.get_schema(file_dict["_SCHEMA_ID"]).fields for every entry, so current-schema files from a REST catalog can still hit the same 403 before this short-circuit is reached. Please apply the same current-schema short-circuit in the manifest decode path as well, or pass a resolver into ManifestFileManager.
| def _nested_path_by_name(self) -> Optional[Dict[str, List[str]]]: | ||
| return self._cached_nested_path_by_name | ||
|
|
||
| def _resolve_schema(self, schema_id: int): |
There was a problem hiding this comment.
There is still another direct schema-manager access in DataEvolutionSplitRead._create_union_reader(): when a regular file has no write_cols, it calls self.table.schema_manager.get_schema(first_file.schema_id) to derive field ids. If first_file.schema_id is the current schema id, this bypasses _resolve_schema() and can trigger the same REST-catalog 403 on data-evolution reads. Please switch that remaining call to self._resolve_schema(first_file.schema_id) too.
Purpose
Fix redundant filesystem I/O in
SplitReadandFileScannerwhen reading schema.SplitReadhas 3 call sites that unconditionally callschema_manager.get_schema(schema_id)even whenschema_id == table.table_schema.id— the schema is already in memory. This causes unnecessary filesystem reads in the common case (no schema evolution).Java equivalent (
RawFileSplitRead.createFileReader()) short-circuits with:Changes
split_read.py: Add_resolve_schema()method that returns in-memory schema when id matches, replacing 3 directget_schema()calls inraw_reader_supplier,_get_fields_and_predicate, and_file_read_fieldsfile_scanner.py: Add_schema_fields()method with same short-circuit pattern forSimpleStatsEvolutionsTests
file_scanner_schema_fields_test.pywith 3 test cases covering short-circuit, delegation, and zero-id edge caseThis closes #8216