If I crawl site a.com, which includes an image from b.com, and b.com has a sitemap, then Heritrix will start crawling all the pages listed in b.com's sitemap, even though they should be out of scope for the crawl. This seems to be because sitemap links ('M') are currently incorrectly treated as transclusions by TransclusionDecideRule.
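To make the suspected mechanism concrete, here is a minimal, self-contained sketch. It is not Heritrix's actual code: the class `SitemapHopSketch`, its methods, the example hop paths, and the limit of 2 are all hypothetical and chosen only for illustration. It assumes a transclusion check that counts the trailing run of non-link (`L`) hops, so an `M` hop gets counted like an embed (`E`) hop.

```java
// Hypothetical sketch, not Heritrix code: names and hop paths are illustrative only.
class SitemapHopSketch {

    // Count the hops at the end of the path that are not navigational links ('L').
    static int trailingNonLinkHops(String hopPath) {
        int count = 0;
        for (int i = hopPath.length() - 1; i >= 0 && hopPath.charAt(i) != 'L'; i--) {
            count++;
        }
        return count;
    }

    // Assumed naive check: any short trailing run of non-link hops is a transclusion,
    // so a sitemap hop ('M') is accepted just like an embed hop ('E').
    static boolean naiveTransclusion(String hopPath, int maxTransHops) {
        int n = trailingNonLinkHops(hopPath);
        return n > 0 && n <= maxTransHops;
    }

    // One possible adjustment (an assumption, not the committed fix): a trailing run
    // that contains a sitemap hop never qualifies, leaving sitemap URLs to be
    // accepted only by the primary scope rule.
    static boolean transclusionIgnoringSitemaps(String hopPath, int maxTransHops) {
        int n = trailingNonLinkHops(hopPath);
        String trailingRun = hopPath.substring(hopPath.length() - n);
        return n > 0 && n <= maxTransHops && trailingRun.indexOf('M') < 0;
    }

    public static void main(String[] args) {
        int maxTransHops = 2; // illustrative limit
        for (String path : new String[] {"E", "EM", "LL"}) {
            System.out.println(path
                    + " naive=" + naiveTransclusion(path, maxTransHops)
                    + " ignoringSitemaps=" + transclusionIgnoringSitemaps(path, maxTransHops));
        }
    }
}
```

Under these assumptions, a path such as `EM` (an embed followed by a sitemap hop) passes the naive check but is rejected once sitemap hops are excluded, while a plain embed `E` is accepted by both.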
We should only crawl sitemaps and sitemap links if they are in the
primary SURT scope. Otherwise we'll start crawling the sitemaps of
every site an embedded resource is pulled in from even when they
should be out of scope.
Fixes #469
ato added a commit that referenced this issue on Mar 9, 2022