Inspired by #1054. Similar to the Sample pipeline, we could force the spider to stop once a threshold of, say, 5 duplicates of the same item is reached. The Kingfisher extension should check the `close_spider` reason and leave the collection open if the reason is 'duplicate'. That way, the data registry will not complete the job and auto-publish a bad crawl. Sample code: https://docs.scrapy.org/en/latest/topics/item-pipeline.html#duplicates-filter
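A minimal sketch of that idea, adapted from the Scrapy duplicates-filter example linked above. It assumes items carry a `file_name` field (the field the Validate pipeline already checks on) and hard-codes the threshold of 5; neither detail is settled here.

```python
from scrapy.exceptions import DropItem


class DuplicateThresholdPipeline:
    """Drop duplicate items and stop the crawl once too many are seen.

    Sketch only: `file_name` and the threshold of 5 are assumptions taken
    from this issue, not existing Kingfisher Collect settings.
    """

    def open_spider(self, spider):
        self.seen = set()
        self.duplicates = 0

    def process_item(self, item, spider):
        file_name = item['file_name']
        if file_name in self.seen:
            self.duplicates += 1
            if self.duplicates >= 5:
                # Close the spider with a 'duplicate' reason, so the Kingfisher
                # extension can check it and leave the collection open.
                spider.crawler.engine.close_spider(spider, 'duplicate')
            raise DropItem(f'Duplicate file_name: {file_name}')
        self.seen.add(file_name)
        return item
```

On the extension side, the `spider_closed(spider, reason)` handler would then skip closing the collection whenever `reason == 'duplicate'`.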
We actually already have duplicate checking (on filename) in the Validate pipeline.
Like in #1058, it may be hard to set a threshold, since:

- The number of files downloaded varies widely (from 1 to millions), so we can't make the threshold a fixed number.
- We don't always know the total number of files that will be downloaded, so it would be hard to set a percentage threshold. We could do a sort of rolling percentage (see the sketch after this list), but that can still lead to cases where the first share of requests errors while the majority at the end succeed, etc.
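To illustrate the rolling-percentage idea and its weakness, here is a hypothetical tracker; the 100-item minimum and 50% cut-off are made up for the example.

```python
class RollingDuplicateRatio:
    """Track the share of duplicates among items processed so far.

    Illustrative only: an early burst of duplicates can trip the cut-off
    even if the rest of the crawl would have succeeded, which is the
    weakness noted above.
    """

    def __init__(self, minimum=100, cutoff=0.5):
        self.minimum = minimum
        self.cutoff = cutoff
        self.total = 0
        self.duplicates = 0

    def add(self, is_duplicate):
        self.total += 1
        self.duplicates += is_duplicate
        # Only report a problem once enough items have been seen.
        return self.total >= self.minimum and self.duplicates / self.total >= self.cutoff
```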
Since we haven't encountered this issue often, and since it's just an optimization over reading the log file of the full collection (#531), I will close.
Also, in Collect, we try not to parse the response content where possible. So, we aren't currently considering a duplicate checker at the data level (package, release or record).
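For context, a data-level checker would have to parse every response. A rough sketch of what that would mean for a release package (keying on `ocid` and `id` is just one possible choice; none of this exists in Collect):

```python
import json


def is_duplicate_release_package(response_body, seen):
    """Return True if every release in the package was already seen.

    Illustrative only: it assumes a release package, keys on (ocid, id),
    and requires parsing the full response, which Collect avoids.
    """
    package = json.loads(response_body)
    keys = {(release['ocid'], release['id']) for release in package.get('releases', [])}
    if keys and keys <= seen:
        return True
    seen.update(keys)
    return False
```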