Overcrawling due to sitemap links acting like transclusions #469

Closed
ato opened this issue Mar 9, 2022 · 0 comments · Fixed by #470

ato commented Mar 9, 2022

If I crawl site a.com, which includes an image from b.com, and b.com has a sitemap, then Heritrix starts crawling every page listed in b.com's sitemap, even though those pages should be out of scope for the crawl. This seems to be because sitemap links ('M') are currently treated as transclusions by TransclusionDecideRule, which is incorrect.
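To make the failure mode concrete, here is a minimal, self-contained sketch of how a hop-path check can end up accepting sitemap links. It is a simplified model and not Heritrix's actual TransclusionDecideRule code: the class name, the method names, the `maxTransHops` value, and the hop paths "E" and "EM" are all assumptions chosen for illustration. The only details taken from the report are that sitemap links carry the hop letter 'M' and that they were being treated as transclusions.

```java
public class TransclusionSketch {
    // Hop-path letters used in this sketch:
    // 'L' = navigational link, 'E' = embedded resource, 'M' = sitemap link.
    static final char NAV_LINK = 'L';
    static final char SITEMAP = 'M';

    /** Counts the run of non-'L' hops at the end of the hop path. */
    static int trailingNonNavHops(String hopPath) {
        int count = 0;
        for (int i = hopPath.length() - 1;
             i >= 0 && hopPath.charAt(i) != NAV_LINK; i--) {
            count++;
        }
        return count;
    }

    /** Loose check: every trailing non-'L' hop counts, so 'M' behaves like an embed. */
    static boolean looseTransclusion(String hopPath, int maxTransHops) {
        int n = trailingNonNavHops(hopPath);
        return n > 0 && n <= maxTransHops;
    }

    /** Stricter check (hypothetical): a trailing run containing 'M' is never a transclusion. */
    static boolean strictTransclusion(String hopPath, int maxTransHops) {
        int n = trailingNonNavHops(hopPath);
        String run = hopPath.substring(hopPath.length() - n);
        return n > 0 && n <= maxTransHops && run.indexOf(SITEMAP) < 0;
    }

    public static void main(String[] args) {
        int maxTransHops = 2;      // assumed limit for this sketch
        String image = "E";        // illustrative: image on b.com embedded by a.com
        String sitemapPage = "EM"; // illustrative: page reached via b.com's sitemap

        System.out.println("loose  E:  " + looseTransclusion(image, maxTransHops));
        System.out.println("loose  EM: " + looseTransclusion(sitemapPage, maxTransHops));
        System.out.println("strict E:  " + strictTransclusion(image, maxTransHops));
        System.out.println("strict EM: " + strictTransclusion(sitemapPage, maxTransHops));
    }
}
```

In this model the loose check accepts both the embedded image and the sitemap-listed page, which mirrors the overcrawling described above, while the stricter variant still accepts the image but leaves the sitemap-listed page to the ordinary scope rules.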

ato added the bug label Mar 9, 2022
ato added a commit that referenced this issue Mar 9, 2022
We should only crawl sitemaps and sitemap links if they are in the
primary SURT scope. Otherwise we'll start crawling the sitemaps of
every site an embedded resource is pulled in from even when they
should be out of scope.

Fixes #469
ato closed this as completed in #470 Mar 28, 2022
ato added a commit that referenced this issue Mar 28, 2022
We should only crawl sitemaps and sitemap links if they are in the
primary SURT scope. Otherwise we'll start crawling the sitemaps of
every site an embedded resource is pulled in from even when they
should be out of scope.

Fixes #469