NUTCH-2573 Suspend crawling if robots.txt fails to fetch with 5xx status #724
Conversation
- add properties:
  - `http.robots.503.defer.visits`: enable/disable the feature (default: enabled)
  - `http.robots.503.defer.visits.delay`: delay to wait before the next attempt to fetch the deferred URL and the corresponding robots.txt (default: wait 5 minutes)
  - `http.robots.503.defer.visits.retries`: max. number of retries before giving up and dropping all URLs from the given host / queue (default: give up after the 3rd retry, i.e. after 4 attempts)
- handle HTTP 5xx in robots.txt parser
- handle delay, retries and dropping queues in Fetcher
- stop adding fetch items if the timelimit is reached, whether stemming from
  - redirects (`http.redirect.max` > 0) or
  - outlinks (`fetcher.follow.outlinks.depth` > 0)
I like this patch. It's actually something which will help us loop back in with website administrators to notify them of service issues. Thanks for including the metrics; this is really useful. Some minor suggestions from me, @sebastian-nagel.
@@ -263,6 +283,10 @@ public synchronized int checkExceptionThreshold(String queueid) {
    return 0;
  }

  public int checkExceptionThreshold(String queueid) {
Same here. Basic Javadoc?
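Something minimal along these lines would do; this is only a sketch, and describing the behavior as dropping the queue once `fetcher.max.exceptions.per.queue` is exceeded is my reading of the method, not the author's wording:

```java
/**
 * Check whether the number of exceptions logged for the given queue has
 * reached the configured threshold (fetcher.max.exceptions.per.queue)
 * and, if so, purge the queue.
 *
 * @param queueid
 *          ID of the fetch queue to check
 * @return number of fetch items dropped from the queue, or 0 if the
 *         threshold was not yet reached
 */
public int checkExceptionThreshold(String queueid) {
```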
- rename counter to follow naming scheme of other robots.txt related counters: `robots_defer_visits_dropped` - rename method timelimitReached -> timelimitExceeded - add Javadoc
Hi @lewismc, done: updated the metrics wiki page (hitByTimeLimit is already documented), added Javadocs, and renamed the counter to follow the naming convention of the other robots_* counters. Also renamed the method ("timelimitReached" -> "timelimitExceeded").
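For context, the renamed counter would be incremented like the other per-status Fetcher counters via the Hadoop counter API; a sketch, where the group name "FetcherStatus" is an assumption based on Nutch's other fetcher counters:

```java
// context is the Hadoop MapReduce task context available in the Fetcher;
// the counter group name "FetcherStatus" is an assumption
context.getCounter("FetcherStatus", "robots_defer_visits_dropped")
    .increment(1);
```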
Looks like it failed on Javadoc generation @sebastian-nagel
Once the Javadoc is fixed, +1 from me.
- fix javadoc error
Force-pushed 170e3fc to 7f944c1
- `http.robots.503.defer.visits`: enable/disable the feature (default: enabled)
- `http.robots.503.defer.visits.delay`: delay to wait before the next attempt to fetch the deferred URL and the corresponding robots.txt (default: wait 5 minutes)
- `http.robots.503.defer.visits.retries`: max. number of retries before giving up and dropping all URLs from the given host / queue (default: give up after the 3rd retry, i.e. after 4 attempts)
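A minimal sketch of reading these settings from the Nutch configuration; the class and variable names are illustrative, and treating the delay as milliseconds is an assumption:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.util.NutchConfiguration;

public class Robots503DeferSettings {
  public static void main(String[] args) {
    Configuration conf = NutchConfiguration.create();
    // enable/disable the feature (default: enabled)
    boolean defer = conf.getBoolean("http.robots.503.defer.visits", true);
    // delay before the next attempt, assumed in milliseconds (default: 5 min.)
    long delay = conf.getLong("http.robots.503.defer.visits.delay",
        5 * 60 * 1000L);
    // max. retries before dropping all URLs of the host / queue (default: 3)
    int retries = conf.getInt("http.robots.503.defer.visits.retries", 3);
    System.out.println("defer=" + defer + ", delay=" + delay + "ms, retries="
        + retries);
  }
}
```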
Stop queuing fetch items if timelimit is reached
In a first version, I forgot to verify whether the Fetcher timelimit (`fetcher.timelimit.mins`) was already reached before re-queuing the fetch item. This caused a very few fetcher tasks to end up in an infinite loop: the robots.txt fetch and the re-queuing of the fetch item (steps 1 and 3) were retried until the max. number of retries was reached. This is now fixed, and I've also made sure that redirects and outlinks are not queued if the timelimit is reached.
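A sketch of the kind of guard that breaks the loop, assuming `timelimit` holds the absolute end time in milliseconds derived from `fetcher.timelimit.mins` (-1 if unset); the queue-handling names are illustrative, not the actual Fetcher code:

```java
// true once the configured timelimit has passed; -1 means no limit is set
private boolean timelimitExceeded(long timelimit) {
  return timelimit != -1 && System.currentTimeMillis() >= timelimit;
}

// before re-queuing a fetch item whose robots.txt fetch returned a 5xx:
if (timelimitExceeded(timelimit)) {
  dropFetchItem(fit); // illustrative helper: drop so the task can terminate
} else {
  queues.addFetchItem(fit); // retry after the configured delay
}
```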