Add duplicate-checking pipeline #1055

jpmckinney · 2024-02-07T00:57:55Z

Inspired by #1054

Similar to the Sample pipeline, we could perhaps force the spider to stop once a threshold is reached, say 5 duplicates of the same item. The Kingfisher extension should check the close_spider reason and leave the collection open if the reason is 'duplicate'. That way, the data registry will not complete the job and auto-publish a bad crawl.

Sample code: https://docs.scrapy.org/en/latest/topics/item-pipeline.html#duplicates-filter
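
A minimal sketch of the idea, adapted from the Scrapy duplicates filter linked above. It is not the project's implementation: the class name, the `threshold` attribute, and the use of `file_name` as the deduplication key on dict-like items are all assumptions made for illustration. The guarded import only exists so the sketch can be tried without Scrapy installed.

```python
from collections import Counter

try:
    from scrapy.exceptions import DropItem
except ImportError:  # fallback so the sketch runs without Scrapy installed
    class DropItem(Exception):
        pass


class DuplicateThresholdPipeline:
    """Hypothetical pipeline: drop repeated items, stop the crawl after too many repeats."""

    # The "say 5 duplicates" from the proposal; an assumption, not a tuned value.
    threshold = 5

    def __init__(self):
        # How many times each key has been seen so far.
        self.seen = Counter()

    def process_item(self, item, spider):
        key = item["file_name"]  # assumed deduplication key
        self.seen[key] += 1
        duplicates = self.seen[key] - 1  # the first occurrence is not a duplicate

        if duplicates >= self.threshold:
            # The reason string is free-form; an extension listening for the
            # spider_closed signal can inspect it.
            spider.crawler.engine.close_spider(spider, reason="duplicate")
        if duplicates:
            raise DropItem(f"Duplicate item: {key}")
        return item
```

With this shape, an extension's `spider_closed(spider, reason)` handler would see `reason == "duplicate"` and could decide to leave the collection open.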


jpmckinney · 2024-04-09T19:45:40Z

We actually already have duplicate checking (on filename) in the Validate pipeline.

Like in #1058, it's maybe hard to set a threshold since:

- The number of files downloaded ranges widely (e.g. from 1 to millions), so we can't make the threshold a fixed number.
- We don't always know the total number of files that will be downloaded, so it would be hard to set a percentage threshold. We could use a sort of rolling percentage, but that can still misfire, e.g. when the first share of requests fail but the majority at the end succeed.
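
To illustrate the rolling-percentage problem, here is a small standalone sketch. The function name, the window size, and the 50% limit are hypothetical choices for the example, not values anyone proposed.

```python
from collections import deque


def first_trip(outcomes, window=10, max_rate=0.5):
    """Return the 1-based index of the request at which the error rate over
    the last `window` outcomes first exceeds `max_rate`, or None if never."""
    recent = deque(maxlen=window)
    for index, succeeded in enumerate(outcomes, start=1):
        recent.append(succeeded)
        # Only judge once the window is full.
        if len(recent) == window and recent.count(False) / window > max_rate:
            return index
    return None


# The first 10 requests fail, the remaining 990 succeed.
outcomes = [False] * 10 + [True] * 990

print(first_trip(outcomes))                   # 10
print(outcomes.count(False) / len(outcomes))  # 0.01
```

The rolling check would stop the crawl at the 10th request, even though only 1% of all requests fail over the full crawl.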

Since we haven't encountered this issue often, and since it's just an optimization over reading the log file of the full collection #531, I will close.

Also, in Collect, we try not to parse the response content where possible. So, we aren't currently considering a duplicate checker at the data level (package, release or record).

jpmckinney added the framework Relating to other common functionality label Feb 7, 2024

jpmckinney added this to the Priority milestone Feb 7, 2024

This was referenced Feb 7, 2024

New command: logreport #531

Open

Filter out invalid and incomplete JSON #1058

Closed

jpmckinney mentioned this issue Apr 9, 2024

Acceptance criteria - Kingfisher Collect open-contracting/data-registry#29

Open

jpmckinney closed this as completed Apr 9, 2024

yolile mentioned this issue Apr 24, 2024

Add new features open-contracting/scrapy-log-analyzer#11

Closed
