fix: treat non-breaking space as a separator in links (#66) #113
Open
patchwright wants to merge 1 commit into
Problem
Fixes #66. A non-breaking space (U+00A0) next to an e-mail address or URL is swallowed into the link instead of separating it. For example, test@example.com\u{a0} (NBSP after) extends the e-mail link into the following text, and https://example.com\u{a0}now pulls the NBSP and "now" into the URL. Confirmed on main HEAD b663d4e.
Root cause
All three scanners admit U+00A0 as a valid character because it sits in the non-ASCII range they accept for internationalized text: local_atom_allowed (_ => c >= '\u{80}') in email.rs, the ALPHA arm '\u{80}'..=char::MAX in domains.rs, and the URL delimiter set in url.rs, which stops at \u{9F}, one code point below NBSP. So NBSP never reaches a separator/break path even though it is whitespace per Unicode.
Fix
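As a rough illustration of the catch-all arm described under Root cause and of the kind of exclusion proposed here, the following sketch uses two hypothetical predicates, `allowed_before` and `allowed_after`. They are simplified stand-ins written for this description, not the actual code in email.rs, domains.rs, or url.rs.

```rust
/// Hypothetical stand-in for a pre-fix predicate: the catch-all arm
/// accepts every non-ASCII character, which includes U+00A0 (NBSP).
fn allowed_before(c: char) -> bool {
    match c {
        'a'..='z' | 'A'..='Z' | '0'..='9' => true,
        _ => c >= '\u{80}',
    }
}

/// The same predicate with U+00A0 excluded before the catch-all arm,
/// so a scanner using it would stop at NBSP.
fn allowed_after(c: char) -> bool {
    match c {
        'a'..='z' | 'A'..='Z' | '0'..='9' => true,
        '\u{a0}' => false,
        _ => c >= '\u{80}',
    }
}
```

With these stand-ins, `allowed_before('\u{a0}')` is true and `allowed_after('\u{a0}')` is false, while another non-ASCII character such as U+03F8 is accepted by both.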
Treat U+00A0 as a separator in all three scanners: exclude it from local_atom_allowed, break the authority scan on it, and add it to the URL "never part of a URL" set. Three source lines changed, no API change; other non-ASCII characters (e.g. U+03F8) are unaffected.
How to test
Both new tests fail on main and pass with this change; the full suite stays green (91 passed).
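The new tests themselves are not reproduced in this description. As a minimal sketch of the behavior they are meant to pin down, the toy scanner below (`link_prefix`, a hypothetical helper that is not part of the crate's API) stops at ASCII whitespace or U+00A0, and `nbsp_regression_checks` asserts the two examples from the Problem section under that assumption.

```rust
/// Hypothetical toy scanner: returns the prefix of `input` that would be
/// treated as link text, stopping at ASCII whitespace or U+00A0 (NBSP).
/// It ignores every other rule a real e-mail or URL scanner applies.
fn link_prefix(input: &str) -> &str {
    let end = input
        .find(|c: char| c.is_ascii_whitespace() || c == '\u{a0}')
        .unwrap_or(input.len());
    &input[..end]
}

/// Checks modeled on the two examples from the Problem section.
fn nbsp_regression_checks() {
    // The e-mail link should end before the NBSP.
    assert_eq!(link_prefix("test@example.com\u{a0}(NBSP after)"), "test@example.com");
    // The URL should not absorb the NBSP or the word after it.
    assert_eq!(link_prefix("https://example.com\u{a0}now"), "https://example.com");
    // Other non-ASCII characters, e.g. U+03F8, remain part of the link.
    assert_eq!(link_prefix("https://example.com/\u{3f8}"), "https://example.com/\u{3f8}");
}
```

Calling `nbsp_regression_checks()` passes with this toy scanner; the real tests would exercise the crate's own link finder instead.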
Backward compatibility
No breaking changes. NBSP is whitespace and is not a valid character in e-mail local-parts, domain labels, or URLs (RFC 3986/3987, RFC 5321/6531), so no previously valid link changes; links that previously ran through an NBSP now terminate at it.
Assisted-by: Claude (code generation, reviewed and tested locally)