Skip to content

TextDecoder: accept all WHATWG utf-8 encoding labels#198

Merged
bkaradzic-microsoft merged 3 commits into
BabylonJS:mainfrom
bkaradzic-microsoft:fix/textdecoder-whatwg-labels
Jun 18, 2026
Merged

TextDecoder: accept all WHATWG utf-8 encoding labels#198
bkaradzic-microsoft merged 3 commits into
BabylonJS:mainfrom
bkaradzic-microsoft:fix/textdecoder-whatwg-labels

Conversation

@bkaradzic-microsoft

@bkaradzic-microsoft bkaradzic-microsoft commented Jun 13, 2026

Copy link
Copy Markdown
Member

What

TextDecoder's constructor only accepted the exact labels "utf-8" and "UTF-8", throwing for every other spelling. This PR makes it accept all WHATWG-spec UTF-8 labels.

Why

Per the WHATWG Encoding Standard, an encoding label is matched after stripping leading/trailing ASCII whitespace and ASCII-lowercasing, and several labels all decode as UTF-8: utf-8, utf8, unicode-1-1-utf-8, unicode11utf8, unicode20utf8, x-unicode20utf8.

Real consumers rely on this. The Babylon.js glTF/Draco loader constructs new TextDecoder("utf8") (no hyphen). With the old check that threw, decoding aborted mid-load. In Babylon Native the aborted load left the loader in a state that drove a native out-of-bounds write, which surfaced as non-deterministic heap corruption (STATUS_HEAP_CORRUPTION, 0xC0000374) on the Draco mesh-compression validation tests.

Fix

Normalize the label per the spec (trim ASCII whitespace + ASCII-lowercase) and accept the full set of UTF-8 labels; still throw for genuinely unsupported encodings.

Verification

  • Two BabylonNative Playground Draco validation tests that previously crashed with heap corruption (GLTF Serializer KHR draco mesh compression, GLTF Buggy with Draco Mesh Compression) now pass (3/3 runs each) with this fix vendored in; a third (GLTF Box with bad Draco normalized flag) no longer crashes.
  • Adds unit tests covering the utf8 label, case/whitespace variants, the other UTF-8 aliases, and a still-rejected encoding (utf-16).

Update: also fixes a latent ChakraCore N-API heap-corruption bug

CI surfaced that the new "throws for an unsupported encoding" test reliably heap-corrupts on Chakra (Win32 x64/x86) — STATUS_HEAP_CORRUPTION (0xC0000374) / access violations, crashing later in unrelated tests. This is a pre-existing latent bug that this PR's first constructor-throw test merely exposed (it does not reproduce on main only because no existing test throws from a wrapped constructor; V8/JSC/JSI and the sanitizer jobs are unaffected since the bug is in the Chakra N-API shim).

Root cause: when a napi ObjectWrap constructor throws, the C++ instance is destroyed during stack unwinding, but the wrap finalizer registered by napi_wrap() stays attached to this. A later GC then runs the finalizer on the freed instance → use-after-free / double-free. The addon-api ~ObjectWrap() is meant to detach the wrap on failure, but it obtains this via napi_get_reference_value() on the wrap's weak (refcount-0) reference — and the Chakra backend returns null for every refcount-0 reference (the in-box jsrt API can't track weak-reference liveness), so the detach is skipped.

Fix (Core/Node-API/Source/js_native_api_chakra.cc):

  • In the Chakra construct-call trampoline, if the callback leaves a pending JS exception, detach any wrap left on this via napi_remove_wrap() (preserving the pending exception) so the finalizer can't run on the freed instance.
  • Guard napi_remove_wrap() against a null result out-param (now called with nullptr, matching upstream).

Verification: built the Chakra UnitTests locally; the constructor-throw scenario went from heap corruption every run to clean (8/8 runs, all tests passing). Added a regression test that throws from a wrapped constructor 100× then exercises the heap.

Note: the N-API engine fix is logically independent of the TextDecoder label change and could be split into its own PR if preferred — it is bundled here because this PR's constructor-throw test is what exposes it and is its regression test.

Copilot AI review requested due to automatic review settings June 13, 2026 03:08

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the TextDecoder polyfill to accept all UTF-8 encoding labels recognized by the WHATWG Encoding Standard (e.g., utf8, unicode11utf8), fixing real-world incompatibilities (notably Babylon.js glTF/Draco loader usage) and adds unit tests to prevent regressions.

Changes:

  • Normalize the constructor’s encoding label (ASCII trim + ASCII lowercase) and accept all WHATWG UTF-8 labels.
  • Preserve rejection behavior for unsupported encodings (e.g., utf-16).
  • Add unit tests covering accepted aliases and normalization behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
Tests/UnitTests/Scripts/tests.ts Adds unit tests for UTF-8 label aliases, case/whitespace normalization, and unsupported encodings.
Polyfills/TextDecoder/Source/TextDecoder.cpp Implements WHATWG-style label normalization and accepts the full set of UTF-8 labels.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread Polyfills/TextDecoder/Source/TextDecoder.cpp Outdated
@bkaradzic-microsoft

Copy link
Copy Markdown
Member Author

Thanks for the review. Addressing the one inline comment, plus a heads-up on additional commits:

  • Error-message wording (the inline comment): fixed in 011af13 — the rejection now reads "only UTF-8 is supported" rather than implying the single exact spelling 'utf-8'.
  • New engine fix since the review: while validating this PR, CI exposed a latent ChakraCore N-API heap-corruption bug that the new "throws for an unsupported encoding" test triggers (throwing from an ObjectWrap constructor left the napi_wrap finalizer attached to a freed instance → use-after-free on the next GC). Fixed in be54234 with a regression test. Details are in the updated PR description. CI is now green (22/22), including the Win32 Chakra jobs that were failing.

@copilot a fresh review of be54234 (the js_native_api_chakra.cc change) would be welcome.

@bkaradzic-microsoft bkaradzic-microsoft enabled auto-merge (squash) June 15, 2026 18:31

@bghgary bghgary left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Reviewed by Copilot on behalf of @bghgary]

Please split the Chakra N-API ObjectWrap finalizer-after-throw fix into its own PR. It's an independent heap-corruption fix, and a dedicated PR makes it far easier to find/bisect later than buried under a TextDecoder label change.

Comment thread Polyfills/TextDecoder/Source/TextDecoder.cpp Outdated
@bkaradzic-microsoft

Copy link
Copy Markdown
Member Author

Done — split the Chakra N-API ObjectWrap finalizer-after-throw fix into its own PR: #199 (the js_native_api_chakra.cc change + its heap-corruption regression test). This PR is now TextDecoder-only.

@bkaradzic-microsoft

Copy link
Copy Markdown
Member Author

Correction to my earlier note about merge order: these are not independent on Chakra — #198 depends on #199.

CI confirms it: #198's should still throw for a genuinely unsupported encoding test throws from the TextDecoder constructor, which on ChakraCore (Win32 x64/x86) triggers the exact ObjectWrap finalizer-after-throw heap corruption that #199 fixes. The build crashes right after that test with 0xC0000005 (the freed-finalizer use-after-free). All other platforms (V8/JSI/Ubuntu/macOS/UWP) pass.

So the order is: merge #199 first, then #198 (rebased on main) goes green on Chakra. The two Chakra failures here are expected until #199 lands.

bkaradzic-microsoft added a commit that referenced this pull request Jun 17, 2026
… throws (#199)

## What

Fixes a heap corruption in the ChakraCore Node-API implementation when
an `ObjectWrap`-based constructor throws.

Split out of #198 as an
independent fix, per review.

## Problem

When a wrapped (`napi_wrap`) constructor throws, the C++ `ObjectWrap`
instance is destroyed by stack unwinding inside the callback, but the
wrap finalizer registered by `napi_wrap()` stays attached to `this`.
`~ObjectWrap()` cannot detach it, because `napi_get_reference_value()`
returns null for the wrap's weak (refcount 0) reference. A later GC then
runs the finalizer on the freed native instance — a use-after-free that
surfaces as non-deterministic `STATUS_HEAP_CORRUPTION` (`0xC0000374`).

## Fix

`Core/Node-API/Source/js_native_api_chakra.cc`:
- In `ExternalCallback`, after a construct call that left a pending JS
exception, detach the wrap (`napi_remove_wrap`) before returning, then
re-set the exception so it is preserved across the cleanup.
- Guard `napi_remove_wrap` against a `nullptr` `result` out-param (the
finalizer-cleanup call above passes `nullptr`), so it no longer
dereferences a null pointer.

## Test

Adds a regression test (`Tests/UnitTests/Scripts/tests.ts`) that throws
from a wrapped constructor 100× to create many dangling wraps, then
allocates/decodes to exercise the heap and surface any corruption within
the run. It uses `TextDecoder` as a convenient throwing wrapped
constructor; the test passes on `main`'s existing `TextDecoder` behavior
and is independent of the TextDecoder label change in #198.

## Verification

Reliably reproduced as `STATUS_HEAP_CORRUPTION` on Chakra (Win32
x64/x86) before the fix; passes after.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
bkaradzic-microsoft and others added 3 commits June 17, 2026 16:50
The TextDecoder constructor only accepted the exact labels "utf-8"/"UTF-8"
and threw for every other spelling. Per the WHATWG Encoding Standard, an
encoding label is matched after stripping leading/trailing ASCII whitespace
and ASCII-lowercasing, and several labels ("utf8", "unicode-1-1-utf-8",
"unicode11utf8", "unicode20utf8", "x-unicode20utf8") all map to UTF-8.

Consumers such as the Babylon.js glTF/Draco loader construct
`new TextDecoder("utf8")`; the throw aborted decoding mid-load and (in
Babylon Native) left the loader in a state that drove a native out-of-bounds
write, observed as non-deterministic heap corruption on the Draco
validation tests.

Normalize the label per spec and accept all UTF-8 labels. Adds regression
tests for "utf8", case/whitespace variants, the other aliases, and a
still-rejected unsupported encoding.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The constructor now accepts every WHATWG UTF-8 label, so the rejection
message should say "only UTF-8 is supported" rather than implying the
single exact spelling 'utf-8'. Addresses review feedback.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ngLabel helper

Per review: extract the WHATWG trim+lowercase normalization out of the constructor into a named helper so the constructor reads as intent rather than inline character arithmetic. No behavior change.
@bkaradzic-microsoft bkaradzic-microsoft force-pushed the fix/textdecoder-whatwg-labels branch from d3e2b7a to b221b7b Compare June 17, 2026 23:53
@bkaradzic-microsoft

Copy link
Copy Markdown
Member Author

Done — the Chakra N-API ObjectWrap finalizer-after-throw fix is split out and landed separately as #199, so this PR is now TextDecoder-only: the three commits are the WHATWG utf-8 label change plus the two review fixes (clearer "only UTF-8 is supported" message, and the extracted NormalizeEncodingLabel helper).

I rebased onto current main (which already includes #199) to resolve the conflict. The Chakra regression test that #199 added (throwing from the constructor repeatedly does not corrupt native state) and the new WHATWG label tests both live in the TextDecoder describe block and pass together — verified on Win32/Chakra (198 passing).

Could you re-review to clear the change request? The diff is now just Polyfills/TextDecoder/Source/TextDecoder.cpp (+43) and Tests/UnitTests/Scripts/tests.ts (+24).

@bghgary bghgary left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Reviewed by Copilot on behalf of @bghgary]

LGTM. Trim the stale "ChakraCore N-API heap-corruption" section from the description — that fix is now #199.

@bkaradzic-microsoft bkaradzic-microsoft merged commit 69b662e into BabylonJS:main Jun 18, 2026
23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants