Fix is_external_url misclassifying sibling domains as internal #2031

Open
jichaowang02-lang wants to merge 1 commit into unclecode:develop from jichaowang02-lang:fix/is-external-url-domain-boundary
@jichaowang02-lang

Summary

is_external_url decides same-site vs external with:

return not url_domain.endswith(base)

str.endswith is a raw string-suffix test with no domain-label boundary, so
any host whose name merely ends with the base-domain string is treated as
internal:

is_external_url("https://notexample.com/landing", "example.com")
#   -> False   ❌ (classified as internal/same-site)

notexample.com, myexample.com, evilexample.com are all different
registrable domains from example.com, but each ends with the string
example.com, so they are wrongly bucketed as internal.
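
The suffix behaviour can be reproduced with plain strings, independent of the library. The following is a minimal sketch; `suffix_is_internal` is a hypothetical stand-in for the old check, not the project's actual function:

```python
def suffix_is_internal(host: str, base: str) -> bool:
    """Hypothetical stand-in for the old check: a raw string-suffix test."""
    return host.endswith(base)


base = "example.com"
hosts = ["example.com", "sub.example.com", "notexample.com", "evilexample.com", "other.com"]

# Print how the raw suffix test classifies each host.
for host in hosts:
    print(f"{host:<18} internal={suffix_is_internal(host, base)}")
```

Only other.com is rejected here; the two look-alike hosts pass the suffix test because nothing forces the match to begin at a label boundary.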

This matters because quick_extract_links() calls get_base_domain(base_url)
then is_external_url(...) to split every <a href> into the internal / external
lists that drive deep-crawl scoping and link reporting, so a deep crawl can
follow a look-alike/sibling domain as if it were the same site, and
internal/external link sets are simply wrong.
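
To illustrate the downstream effect, here is a rough sketch of link bucketing driven by a suffix-only host check. `bucket_links` and `old_check` are hypothetical names for illustration and do not mirror the real quick_extract_links() implementation:

```python
from urllib.parse import urlparse


def old_check(host: str, base: str) -> bool:
    # Hypothetical stand-in for the pre-fix logic: raw suffix match.
    return host.endswith(base)


def bucket_links(hrefs, base, is_internal):
    """Split absolute URLs into (internal, external) using a host predicate."""
    internal, external = [], []
    for href in hrefs:
        host = urlparse(href).hostname or ""
        (internal if is_internal(host, base) else external).append(href)
    return internal, external


hrefs = [
    "https://example.com/docs",
    "https://notexample.com/landing",
    "https://other.com/",
]
internal, external = bucket_links(hrefs, "example.com", old_check)
print("internal:", internal)
print("external:", external)
```

With the suffix-only predicate, the notexample.com link lands in the internal bucket, which is the bucket a scoped crawl would go on to follow.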

Fix

Require a label boundary — a host is same-site only if it equals the base
domain or is a true sub-domain of it:

return not (url_domain == base or url_domain.endswith("." + base))

| host (base = example.com) | before      | after       |
|---------------------------|-------------|-------------|
| example.com               | internal    | internal ✅ |
| www.example.com           | internal    | internal ✅ |
| sub.example.com           | internal    | internal ✅ |
| notexample.com            | internal ❌ | external ✅ |
| evilexample.com           | internal ❌ | external ✅ |
| other.com                 | external    | external ✅ |
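
The boundary rule can be checked in isolation against the hosts listed above. This is a minimal sketch under the assumption that both arguments are already bare, lowercased hostnames; `is_same_site` is a hypothetical helper, and the real function returns the negation of this condition:

```python
def is_same_site(host: str, base: str) -> bool:
    """Hypothetical helper: exact match, or a true sub-domain of base."""
    return host == base or host.endswith("." + base)


# Expected classification for base = example.com (True means internal).
expected = {
    "example.com": True,
    "www.example.com": True,
    "sub.example.com": True,
    "notexample.com": False,
    "evilexample.com": False,
    "other.com": False,
}

for host, internal in expected.items():
    assert is_same_site(host, "example.com") is internal, host
print("all hosts classified as expected")
```

Prefixing the base with a dot forces the suffix match to start at a label boundary, so notexample.com no longer matches while sub.example.com still does. The separate equality test is needed because the bare base domain does not end with ".example.com".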

Testing

Adds TestIsExternalUrl to tests/regression/test_reg_utils.py:

$ pytest tests/regression/test_reg_utils.py::TestIsExternalUrl -q
9 passed

The sibling/look-alike cases fail on the current code (3 failed) and pass
with this fix (9 passed); same-domain, www, real-subdomain, unrelated,
special-scheme, and relative-URL cases are all preserved.

is_external_url decided same-site vs external with `not url_domain.endswith(base)`,
a raw string-suffix test with no domain-label boundary. Any host that merely
ends with the base string was treated as internal, so look-alike / sibling
domains were mislabeled:

    is_external_url("https://notexample.com/x", "example.com")  -> False (internal)

even though notexample.com is a different registrable domain. This corrupts the
internal/external link buckets that quick_extract_links() and deep-crawl scoping
rely on (and lets a phishing-style look-alike host pass as same-site).

Require a label boundary: same site iff the domain equals base or is a true
sub-domain (`url_domain == base or url_domain.endswith("." + base)`). Real
subdomains (sub.example.com) and the www variant stay internal; unrelated and
look-alike domains are correctly external.

Adds TestIsExternalUrl in tests/regression/test_reg_utils.py; the sibling-domain
cases fail on the old code and pass with this fix.
Copilot AI review requested due to automatic review settings June 21, 2026 16:57

Copilot AI left a comment

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.
