treat a failed fetch (e.g. socket timeout) of robots.txt the same way as http errors: full allow #187

Merged: 1 commit into internetarchive:master on Oct 13, 2017

Conversation

@nlevitt (Contributor) commented Oct 6, 2017


Archive-It has been having problems with Twitter images hosted at pbs.twimg.com. Sometimes fetching https://pbs.twimg.com/robots.txt results in a socket timeout, while fetching content from the domain does not. With this change, we are able to capture the content.

Note that if the crawl is configured to ignore robots using the "calculate robots only" setting, as recommended at https://groups.yahoo.com/neo/groups/archive-crawler/conversations/messages/8005, a failed fetch of robots.txt currently still precludes any further fetches from the domain.
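
To make the described behavior concrete, here is a minimal, hypothetical sketch. It is not the actual Heritrix code; the names `RobotsFetchOutcome`, `RobotsDirectives`, and `directivesFor` are invented for illustration. It only shows the policy the PR describes: any failed robots.txt fetch is handled the same way as an HTTP error, i.e. full allow.

```java
// Hypothetical sketch only; names and structure do not come from the Heritrix codebase.
public final class RobotsFetchPolicySketch {

    /** How the attempt to fetch robots.txt ended (assumed categories). */
    enum RobotsFetchOutcome { SUCCESS, HTTP_ERROR, SOCKET_TIMEOUT, CONNECT_FAILURE }

    /** Minimal stand-in for a parsed set of robots directives. */
    static final class RobotsDirectives {
        static final RobotsDirectives FULL_ALLOW = new RobotsDirectives(true);
        private final boolean allowAll;
        private RobotsDirectives(boolean allowAll) { this.allowAll = allowAll; }
        boolean allows(String path) { return allowAll; }
    }

    /**
     * Old behavior (per the PR description): only HTTP errors fell through to
     * full allow, so a socket timeout left the host effectively blocked.
     * New behavior: any failed fetch is treated like an HTTP error, i.e. full allow.
     */
    static RobotsDirectives directivesFor(RobotsFetchOutcome outcome, String robotsTxtBody) {
        if (outcome == RobotsFetchOutcome.SUCCESS) {
            return parse(robotsTxtBody); // honor whatever the site actually served
        }
        return RobotsDirectives.FULL_ALLOW; // http error, timeout, etc.: crawling is not blocked
    }

    private static RobotsDirectives parse(String body) {
        // Real User-agent/Disallow parsing omitted from this sketch.
        return RobotsDirectives.FULL_ALLOW;
    }
}
```

The key point is that only the SUCCESS branch parses the response; every failure mode collapses to the same permissive result instead of blocking further fetches from the host.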

@anjackson (Collaborator) commented:

The policy looks fine to me. The differences between this and GoogleBot seem reasonable given the difference between the indexing and archiving use cases.

That said, this is one of the areas of the H3 codebase I'm less familiar with, so I can't really look at the code and know that it precisely implements the policy.

@galgeek galgeek merged commit f0d1c04 into internetarchive:master Oct 13, 2017