Add duplicate-checking pipeline #1055

Closed
jpmckinney opened this issue Feb 7, 2024 · 1 comment
Labels
framework Relating to other common functionality

@jpmckinney
Member

Inspired by #1054

Similar to the Sample pipeline, we could force the spider to stop once a threshold is reached, say, 5 duplicates of the same item. The Kingfisher extension should check the close_spider reason and leave the collection open if the reason is 'duplicate'. That way, the data registry will not complete the job and auto-publish a bad crawl.

Sample code: https://docs.scrapy.org/en/latest/topics/item-pipeline.html#duplicates-filter
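A minimal sketch of the idea above, not an existing Kingfisher Collect pipeline. The class name `DuplicateThreshold`, the helper `should_close_collection`, the `file_name` key and the dict-shaped items are all assumptions for illustration. `DropItem` is a local stand-in for `scrapy.exceptions.DropItem` so that the sketch runs without Scrapy installed; the `close_spider(spider, reason)` call mirrors the method on Scrapy's engine.

```python
from collections import Counter
from types import SimpleNamespace


class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem, so the sketch runs without Scrapy."""


class DuplicateThreshold:
    """Hypothetical item pipeline: stop the crawl after too many duplicates of one item."""

    def __init__(self, threshold=5):
        self.threshold = threshold  # assumed default, from "say, 5 duplicates"
        self.seen = set()
        self.duplicates = Counter()

    def process_item(self, item, spider):
        key = item["file_name"]  # assumed identity key
        if key not in self.seen:
            self.seen.add(key)
            return item
        self.duplicates[key] += 1
        # Compare with == so the close request is sent only once per key.
        if self.duplicates[key] == self.threshold:
            spider.crawler.engine.close_spider(spider, "duplicate")
        raise DropItem(f"Duplicate item: {key}")


def should_close_collection(reason):
    """Hypothetical check for the extension's spider_closed handler."""
    return reason != "duplicate"


# Usage with a fake spider that records the close reason.
closed = []
engine = SimpleNamespace(close_spider=lambda spider, reason: closed.append(reason))
spider = SimpleNamespace(crawler=SimpleNamespace(engine=engine))
pipeline = DuplicateThreshold(threshold=2)
for _ in range(3):
    try:
        pipeline.process_item({"file_name": "a.json"}, spider)
    except DropItem:
        pass
```

In this sketch the first item passes through, and the second duplicate of it asks the engine to close the spider with the reason 'duplicate'. An extension handling the `spider_closed` signal, which receives the spider and the reason, could then call something like `should_close_collection(reason)` before closing the collection.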

@jpmckinney
Member Author

jpmckinney commented Apr 9, 2024

We actually already have duplicate checking (on filename) in the Validate pipeline.

As in #1058, it may be hard to set a threshold, since:

  1. The number of files downloaded varies widely (e.g. from 1 to millions). So, we can't make the threshold a fixed number.
  2. We don't always know the total number of files that will be downloaded. So, it would be hard to set a percentage threshold. We could use a sort of rolling percentage, but that can still lead to cases where the first share of requests error but the majority at the end succeed, etc.
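To illustrate the second point, here is a rough sketch of a rolling percentage. The class `RollingRate`, the window size and the limit are made up for the example, and nothing here is taken from Kingfisher Collect. The stream puts all the flagged events (errors or duplicates) at the start, so the rolling check trips early even though the overall rate is low.

```python
from collections import deque


class RollingRate:
    """Track the share of flagged events over the last `window` events."""

    def __init__(self, window, limit):
        self.events = deque(maxlen=window)
        self.limit = limit

    def add(self, flagged):
        """Record one event and return True if the rolling share exceeds the limit."""
        self.events.append(bool(flagged))
        # Only judge once the window is full, to avoid tripping on the first event.
        full = len(self.events) == self.events.maxlen
        return full and sum(self.events) / len(self.events) > self.limit


# 10 flagged events first, then 990 clean ones.
stream = [True] * 10 + [False] * 990
rate = RollingRate(window=10, limit=0.5)

# Position (1-based) of the first event at which the rolling check trips, if any.
tripped_at = next((i for i, flagged in enumerate(stream, 1) if rate.add(flagged)), None)
overall = sum(stream) / len(stream)
```

With these made-up numbers, the check trips as soon as the first window fills, while only 1% of the whole stream is flagged. A larger window delays the decision but does not remove the problem when the total is unknown.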

Since we haven't encountered this issue often, and since this would only be an optimization over reading the log file of the full collection (#531), I will close this issue.

Also, in Collect, we try not to parse the response content where possible. So, we aren't currently considering a duplicate checker at the data level (package, release or record).
