Remove the internal TOMLChar wrapper#492

Open
tfoutrein wants to merge 1 commit into python-poetry:master from AstekGroup:perf/remove-tomlchar

Conversation

@tfoutrein

Contributor

Stacked on #489, #490 and #491 — the capstone of that series; best reviewed/merged after them.
This also supersedes #488 (interning TOMLChar): per @dimbleby's suggestion on that PR, removing the wrapper entirely is the better end-state, so I'd close #488 in favour of this.

What

After the bulk run-scans (#490/#491), the parser only constructs a TOMLChar (a str subclass) at run boundaries and uses a handful of its is_*() helpers. This removes the class entirely:

  • Source yields plain str characters; inc() / advance_* read self[i] directly.
  • End-of-input is detected positionally (_idx >= len / Source.end()) instead of an identity sentinel.
  • The remaining character-class checks use module-level frozensets.

A real NUL byte is still rejected as an invalid control char and is never mistaken for end-of-input, since EOF is now positional rather than a value/identity comparison.

Benchmarks

Median, interleaved A/B vs master (includes #489–#491):

| Document | Speedup |
| --- | --- |
| large flat, single-line strings (~90 KB) | 5.8× |
| poetry.lock-like (~64 KB) | 2.4× |
| pyproject.toml | 1.9× |
| typical mixed (~4 KB) | 1.6× |

The removal itself adds ~1.1–1.18× over #491. No regression on any shape.
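
For readers unfamiliar with the method, here is one way a "median, interleaved A/B" measurement could be set up. This is an illustrative sketch, not the harness used for the numbers above; `interleaved_median_speedup` and its `rounds` parameter are hypothetical.

```python
import statistics
import time
from typing import Callable


def interleaved_median_speedup(
    baseline: Callable[[], object],
    candidate: Callable[[], object],
    rounds: int = 21,
) -> float:
    """Return median(baseline time) / median(candidate time)."""
    base_times = []
    cand_times = []
    for _ in range(rounds):
        # Run A and B back to back in every round so that slow drift
        # (CPU frequency, caches, GC pressure) affects both sides alike.
        t0 = time.perf_counter()
        baseline()
        t1 = time.perf_counter()
        candidate()
        t2 = time.perf_counter()
        base_times.append(t1 - t0)
        cand_times.append(t2 - t1)
    # The median is less sensitive to occasional outlier rounds than the mean.
    return statistics.median(base_times) / statistics.median(cand_times)
```

A value above 1.0 means the candidate is faster; in this setting `baseline` and `candidate` would each parse the same document with the old and new parser respectively.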

Tests

Full suite passes (972, incl. the toml-test conformance submodule). On top of that, an 11.5k-input adversarial differential — EOF/truncation at every prefix length, real-NUL placement in every position, empty/whitespace/BOM, and structural fuzz — is byte-identical in output and exception type to master. No public API change (TOMLChar was not exported).
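
The differential check can be pictured roughly as follows. This is a sketch under assumptions, not the actual test script: the helper names are hypothetical, and `repr()` of the result stands in for whatever serialized output the real comparison used.

```python
from typing import Callable, List, Tuple


def outcome(parse: Callable[[str], object], text: str) -> Tuple[str, str]:
    """Summarise a parse as ("ok", repr(result)) or ("err", exception type name)."""
    try:
        return ("ok", repr(parse(text)))
    except Exception as exc:
        return ("err", type(exc).__name__)


def differential(old_parse, new_parse, inputs) -> List[str]:
    """Return the inputs on which the two parsers disagree."""
    return [t for t in inputs if outcome(old_parse, t) != outcome(new_parse, t)]


def prefixes(doc: str) -> List[str]:
    """Every truncation of ``doc``, from the empty string to the full text."""
    return [doc[:i] for i in range(len(doc) + 1)]


def with_nul_everywhere(doc: str) -> List[str]:
    """``doc`` with a real NUL character inserted at each position."""
    return [doc[:i] + "\x00" + doc[i:] for i in range(len(doc) + 1)]
```

An empty list from `differential(old, new, prefixes(doc) + with_nul_everywhere(doc))` means both implementations produced the same result or raised the same exception type on every input.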

After the bulk run-scans, the parser only built a `TOMLChar` (a `str`
subclass) at run boundaries and used a handful of its `is_*()` helpers.
Drop the class entirely: `Source` now yields plain `str` characters and
detects end-of-input positionally (`_idx >= len` / `Source.end()`) instead
of an identity sentinel, and the remaining character-class checks use
module-level frozensets.

A real NUL byte is still rejected as an invalid control char and is never
mistaken for end-of-input, since EOF is now positional rather than a
sentinel comparison.

No behaviour change (972 tests incl. the toml-test conformance submodule;
plus an 11.5k-input adversarial differential over EOF/truncation, real-NUL
placement, empty/whitespace and structural fuzz — output and error-type
byte-identical to master). Removes the per-character object construction
and method dispatch (~1.1-1.18x over the previous step).
@tfoutrein force-pushed the perf/remove-tomlchar branch from 1c43d4d to d92e0a0 on June 10, 2026 10:08
@tfoutrein

Contributor Author

Rebased onto master now that #491 has merged — single commit on top of current master, no conflicts (composes cleanly with the recent escape/bare-key fixes #493/#497/#501). CI is green. This removes TOMLChar entirely, so it supersedes #488 (interning) — I'll close #488 once this lands. Ready for squash-merge. Thanks!

Comment thread tomlkit/parser.py
Comment on lines +61 to +65
_SPACES = " \t"
_NL = "\n\r"
_WS = _SPACES + _NL
_BARE = string.ascii_letters + string.digits + "-_"
_KV = "= \t"

Contributor

Can we keep only the *_SET variants? All the `__contains__` checks should work with them, too.
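
A quick sketch of how membership behaves on the two container types, using one of the constants from the diff above. `_KV_SET` is assumed here to be `frozenset(_KV)`; the PR's actual definition is not shown in this excerpt.

```python
_KV = "= \t"
_KV_SET = frozenset(_KV)  # assumed definition, for illustration only

# For any single character, both containers give the same answer.
single_chars = [chr(c) for c in range(128)]
agree = all((c in _KV) == (c in _KV_SET) for c in single_chars)

# They differ when the operand is not exactly one character,
# because `in` on a str is a substring search.
empty_in_str = "" in _KV      # True: "" is a substring of every str
empty_in_set = "" in _KV_SET  # False: no element equals ""
pair_in_str = "= " in _KV     # True: substring match
pair_in_set = "= " in _KV_SET # False: the set holds single characters only
```

So the set variants are a drop-in replacement wherever the left operand is always exactly one character; a call site that can pass an empty string (for example, an end-of-input marker) would change behaviour and needs a separate look.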