Fix is_external_url misclassifying sibling domains as internal#2031
Open
jichaowang02-lang wants to merge 1 commit into
Open
Fix is_external_url misclassifying sibling domains as internal#2031jichaowang02-lang wants to merge 1 commit into
jichaowang02-lang wants to merge 1 commit into
Conversation
is_external_url decided same-site vs external with `not url_domain.endswith(base)`,
a raw string-suffix test with no domain-label boundary. Any host that merely
ends with the base string was treated as internal, so look-alike / sibling
domains were mislabeled:
is_external_url("https://notexample.com/x", "example.com") -> False (internal)
even though notexample.com is a different registrable domain. This corrupts the
internal/external link buckets that quick_extract_links() and deep-crawl scoping
rely on (and lets a phishing-style look-alike host pass as same-site).
Require a label boundary: same site iff the domain equals base or is a true
sub-domain (`url_domain == base or url_domain.endswith("." + base)`). Real
subdomains (sub.example.com) and the www variant stay internal; unrelated and
look-alike domains are correctly external.
Adds TestIsExternalUrl in tests/regression/test_reg_utils.py; the sibling-domain
cases fail on the old code and pass with this fix.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
is_external_urldecides same-site vs external with:str.endswithis a raw string-suffix test with no domain-label boundary, soany host whose name merely ends with the base-domain string is treated as
internal:
notexample.com,myexample.com,evilexample.comare all differentregistrable domains from
example.com, but each ends with the stringexample.com, so they are wrongly bucketed as internal.This matters because
quick_extract_links()callsget_base_domain(base_url)then
is_external_url(...)to split every<a href>into the internal /external lists that drive deep-crawl scoping and link reporting — so a
deep crawl can follow a look-alike/sibling domain as if it were the same site,
and internal/external link sets are simply wrong.
Fix
Require a label boundary — a host is same-site only if it equals the base
domain or is a true sub-domain of it:
example.com)example.comwww.example.comsub.example.comnotexample.comevilexample.comother.comTesting
Adds
TestIsExternalUrltotests/regression/test_reg_utils.py:The sibling/look-alike cases fail on the current code (3 failed) and pass
with this fix (9 passed); same-domain,
www, real-subdomain, unrelated,special-scheme, and relative-URL cases are all preserved.