TextDecoder: accept all WHATWG utf-8 encoding labels#198
Conversation
There was a problem hiding this comment.
Pull request overview
This PR updates the TextDecoder polyfill to accept all UTF-8 encoding labels recognized by the WHATWG Encoding Standard (e.g., utf8, unicode11utf8), fixing real-world incompatibilities (notably Babylon.js glTF/Draco loader usage) and adds unit tests to prevent regressions.
Changes:
- Normalize the constructor’s encoding label (ASCII trim + ASCII lowercase) and accept all WHATWG UTF-8 labels.
- Preserve rejection behavior for unsupported encodings (e.g.,
utf-16). - Add unit tests covering accepted aliases and normalization behavior.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| Tests/UnitTests/Scripts/tests.ts | Adds unit tests for UTF-8 label aliases, case/whitespace normalization, and unsupported encodings. |
| Polyfills/TextDecoder/Source/TextDecoder.cpp | Implements WHATWG-style label normalization and accepts the full set of UTF-8 labels. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
Thanks for the review. Addressing the one inline comment, plus a heads-up on additional commits:
@copilot a fresh review of be54234 (the |
bghgary
left a comment
There was a problem hiding this comment.
[Reviewed by Copilot on behalf of @bghgary]
Please split the Chakra N-API ObjectWrap finalizer-after-throw fix into its own PR. It's an independent heap-corruption fix, and a dedicated PR makes it far easier to find/bisect later than buried under a TextDecoder label change.
be54234 to
d3e2b7a
Compare
|
Done — split the Chakra N-API ObjectWrap finalizer-after-throw fix into its own PR: #199 (the |
|
Correction to my earlier note about merge order: these are not independent on Chakra — #198 depends on #199. CI confirms it: #198's So the order is: merge #199 first, then #198 (rebased on main) goes green on Chakra. The two Chakra failures here are expected until #199 lands. |
… throws (#199) ## What Fixes a heap corruption in the ChakraCore Node-API implementation when an `ObjectWrap`-based constructor throws. Split out of #198 as an independent fix, per review. ## Problem When a wrapped (`napi_wrap`) constructor throws, the C++ `ObjectWrap` instance is destroyed by stack unwinding inside the callback, but the wrap finalizer registered by `napi_wrap()` stays attached to `this`. `~ObjectWrap()` cannot detach it, because `napi_get_reference_value()` returns null for the wrap's weak (refcount 0) reference. A later GC then runs the finalizer on the freed native instance — a use-after-free that surfaces as non-deterministic `STATUS_HEAP_CORRUPTION` (`0xC0000374`). ## Fix `Core/Node-API/Source/js_native_api_chakra.cc`: - In `ExternalCallback`, after a construct call that left a pending JS exception, detach the wrap (`napi_remove_wrap`) before returning, then re-set the exception so it is preserved across the cleanup. - Guard `napi_remove_wrap` against a `nullptr` `result` out-param (the finalizer-cleanup call above passes `nullptr`), so it no longer dereferences a null pointer. ## Test Adds a regression test (`Tests/UnitTests/Scripts/tests.ts`) that throws from a wrapped constructor 100× to create many dangling wraps, then allocates/decodes to exercise the heap and surface any corruption within the run. It uses `TextDecoder` as a convenient throwing wrapped constructor; the test passes on `main`'s existing `TextDecoder` behavior and is independent of the TextDecoder label change in #198. ## Verification Reliably reproduced as `STATUS_HEAP_CORRUPTION` on Chakra (Win32 x64/x86) before the fix; passes after. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The TextDecoder constructor only accepted the exact labels "utf-8"/"UTF-8"
and threw for every other spelling. Per the WHATWG Encoding Standard, an
encoding label is matched after stripping leading/trailing ASCII whitespace
and ASCII-lowercasing, and several labels ("utf8", "unicode-1-1-utf-8",
"unicode11utf8", "unicode20utf8", "x-unicode20utf8") all map to UTF-8.
Consumers such as the Babylon.js glTF/Draco loader construct
`new TextDecoder("utf8")`; the throw aborted decoding mid-load and (in
Babylon Native) left the loader in a state that drove a native out-of-bounds
write, observed as non-deterministic heap corruption on the Draco
validation tests.
Normalize the label per spec and accept all UTF-8 labels. Adds regression
tests for "utf8", case/whitespace variants, the other aliases, and a
still-rejected unsupported encoding.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The constructor now accepts every WHATWG UTF-8 label, so the rejection message should say "only UTF-8 is supported" rather than implying the single exact spelling 'utf-8'. Addresses review feedback. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ngLabel helper Per review: extract the WHATWG trim+lowercase normalization out of the constructor into a named helper so the constructor reads as intent rather than inline character arithmetic. No behavior change.
d3e2b7a to
b221b7b
Compare
|
Done — the Chakra N-API ObjectWrap finalizer-after-throw fix is split out and landed separately as #199, so this PR is now TextDecoder-only: the three commits are the WHATWG utf-8 label change plus the two review fixes (clearer "only UTF-8 is supported" message, and the extracted NormalizeEncodingLabel helper). I rebased onto current Could you re-review to clear the change request? The diff is now just |
What
TextDecoder's constructor only accepted the exact labels"utf-8"and"UTF-8", throwing for every other spelling. This PR makes it accept all WHATWG-spec UTF-8 labels.Why
Per the WHATWG Encoding Standard, an encoding label is matched after stripping leading/trailing ASCII whitespace and ASCII-lowercasing, and several labels all decode as UTF-8:
utf-8,utf8,unicode-1-1-utf-8,unicode11utf8,unicode20utf8,x-unicode20utf8.Real consumers rely on this. The Babylon.js glTF/Draco loader constructs
new TextDecoder("utf8")(no hyphen). With the old check that threw, decoding aborted mid-load. In Babylon Native the aborted load left the loader in a state that drove a native out-of-bounds write, which surfaced as non-deterministic heap corruption (STATUS_HEAP_CORRUPTION,0xC0000374) on the Draco mesh-compression validation tests.Fix
Normalize the label per the spec (trim ASCII whitespace + ASCII-lowercase) and accept the full set of UTF-8 labels; still throw for genuinely unsupported encodings.
Verification
GLTF Serializer KHR draco mesh compression,GLTF Buggy with Draco Mesh Compression) now pass (3/3 runs each) with this fix vendored in; a third (GLTF Box with bad Draco normalized flag) no longer crashes.utf8label, case/whitespace variants, the other UTF-8 aliases, and a still-rejected encoding (utf-16).Update: also fixes a latent ChakraCore N-API heap-corruption bug
CI surfaced that the new "throws for an unsupported encoding" test reliably heap-corrupts on Chakra (Win32 x64/x86) —
STATUS_HEAP_CORRUPTION(0xC0000374) / access violations, crashing later in unrelated tests. This is a pre-existing latent bug that this PR's first constructor-throw test merely exposed (it does not reproduce onmainonly because no existing test throws from a wrapped constructor; V8/JSC/JSI and the sanitizer jobs are unaffected since the bug is in the Chakra N-API shim).Root cause: when a napi
ObjectWrapconstructor throws, the C++ instance is destroyed during stack unwinding, but the wrap finalizer registered bynapi_wrap()stays attached tothis. A later GC then runs the finalizer on the freed instance → use-after-free / double-free. The addon-api~ObjectWrap()is meant to detach the wrap on failure, but it obtainsthisvianapi_get_reference_value()on the wrap's weak (refcount-0) reference — and the Chakra backend returnsnullfor every refcount-0 reference (the in-boxjsrtAPI can't track weak-reference liveness), so the detach is skipped.Fix (
Core/Node-API/Source/js_native_api_chakra.cc):thisvianapi_remove_wrap()(preserving the pending exception) so the finalizer can't run on the freed instance.napi_remove_wrap()against a nullresultout-param (now called withnullptr, matching upstream).Verification: built the Chakra UnitTests locally; the constructor-throw scenario went from heap corruption every run to clean (8/8 runs, all tests passing). Added a regression test that throws from a wrapped constructor 100× then exercises the heap.