treat a failed fetch (e.g. socket timeout) of robots.txt the same way as http errors: full allow #187

Merged: 1 commit into internetarchive:master on Oct 13, 2017

Conversation

@nlevitt (Contributor) commented Oct 6, 2017


Archive-It has been having problems with Twitter images hosted at pbs.twimg.com. Sometimes fetching https://pbs.twimg.com/robots.txt results in a socket timeout, while fetching content from the domain does not. With this change, we are able to capture the content.

Note that if the crawl is configured to ignore robots using the "calculate robots only" setting, as recommended at https://groups.yahoo.com/neo/groups/archive-crawler/conversations/messages/8005, a failed fetch of robots.txt currently still precludes any further fetches from the domain.
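
To make the described behavior concrete, here is a minimal, hypothetical sketch. It is not the actual Heritrix code; the names `RobotsFetchOutcome`, `RobotsDirectives`, and `directivesFor` are invented for illustration. It only shows the policy the PR describes: any failed robots.txt fetch is handled the same way as an HTTP error, i.e. full allow.

```java
// Hypothetical sketch only; names and structure do not come from the Heritrix codebase.
public final class RobotsFetchPolicySketch {

    /** How the attempt to fetch robots.txt ended (assumed categories). */
    enum RobotsFetchOutcome { SUCCESS, HTTP_ERROR, SOCKET_TIMEOUT, CONNECT_FAILURE }

    /** Minimal stand-in for a parsed set of robots directives. */
    static final class RobotsDirectives {
        static final RobotsDirectives FULL_ALLOW = new RobotsDirectives(true);
        private final boolean allowAll;
        private RobotsDirectives(boolean allowAll) { this.allowAll = allowAll; }
        boolean allows(String path) { return allowAll; }
    }

    /**
     * Old behavior (per the PR description): only HTTP errors fell through to
     * full allow, so a socket timeout left the host effectively blocked.
     * New behavior: any failed fetch is treated like an HTTP error, i.e. full allow.
     */
    static RobotsDirectives directivesFor(RobotsFetchOutcome outcome, String robotsTxtBody) {
        if (outcome == RobotsFetchOutcome.SUCCESS) {
            return parse(robotsTxtBody); // honor whatever the site actually served
        }
        return RobotsDirectives.FULL_ALLOW; // http error, timeout, etc.: crawling is not blocked
    }

    private static RobotsDirectives parse(String body) {
        // Real User-agent/Disallow parsing omitted from this sketch.
        return RobotsDirectives.FULL_ALLOW;
    }
}
```

The key point is that only the SUCCESS branch parses the response; every failure mode collapses to the same permissive result instead of blocking further fetches from the host.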

@anjackson (Collaborator) commented:

The policy looks fine to me. The differences between this and GoogleBot seem reasonable given the difference between the indexing and archiving use cases.

That said, this is one of the areas of the H3 codebase I'm less familiar with, so I can't really look at the code and know that it precisely implements the policy.

@galgeek galgeek merged commit f0d1c04 into internetarchive:master Oct 13, 2017