duplicate dataset comes back after dedupe #5016

FuhuXia · 2024-12-16T17:42:21Z

After dedupe, a duplicate of https://catalog.data.gov/dataset/national-settlement-service-data keeps coming back after the next harvest job.

How to reproduce

Harvest https://catalog.data.gov/harvest/federal-reserve
Run the dedupe script on org https://catalog.data.gov/organization/board-of-governors-of-the-federal-reserve-system
Reharvest

Expected behavior

The number of datasets should not change.

Actual behavior

One duplicate is created for dataset https://catalog.data.gov/dataset/national-settlement-service-data

Sketch

Three approaches to fix the issue

Clear the harvest source then reharvest. This will lose tracking stats for all datasets in this source.
Examine the state of the affected dataset in the DB and SOLR and figure out why the duplicate occurs. Could be a new bug in ckanext-datajson.
Could be a bug in the dedupe process where an edge case is not handled well.


FuhuXia · 2024-12-18T16:56:07Z

Cleared the harvest source and reharvested. The issue is back, but the duplicate dataset changed to:
international-summary-statistics
international-summary-statistics-2ddc8

Trying to replicate it in other environments.
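
When comparing environments, it can help to list candidate duplicates from the dataset slugs alone. The following is only a rough sketch: it assumes a duplicate slug is the original slug plus a hyphen and five lowercase alphanumeric characters, as in the pair reported above. The function name `find_suffixed_duplicates` is hypothetical and not part of the dedupe script.

```python
import re
from collections import defaultdict

# Assumption: a duplicate slug looks like "<original>-<5 chars>", modeled on
# international-summary-statistics-2ddc8 from this issue.
SUFFIX = re.compile(r"^(?P<base>.+)-[0-9a-z]{5}$")


def find_suffixed_duplicates(slugs):
    """Map each base slug to the suffixed slugs that look like copies of it."""
    known = set(slugs)
    duplicates = defaultdict(list)
    for slug in sorted(known):
        match = SUFFIX.match(slug)
        # Only report a candidate when the un-suffixed slug also exists.
        if match and match.group("base") in known:
            duplicates[match.group("base")].append(slug)
    return dict(duplicates)
```

Slug matching can produce false positives (two distinct datasets whose names happen to differ by a five-character last word), so any hit is a candidate to check by hand, not proof of a duplicate.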

FuhuXia · 2024-12-18T17:23:03Z

Could not replicate on develop or staging.

FuhuXia · 2024-12-30T17:58:13Z

The root cause is that the package international-summary-statistics has a harvest object associated with it in the UI and SOLR, but that harvest object has no package_id in the DB. This discrepancy makes the duplicate dataset keep coming back after the dedupe process.

Manual API calls have fixed the packages for the harvest source federal-reserve.

To address the issue, we need to improve the dedupe process to detect and delete packages associated with harvest objects that have no package_id.
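
A minimal sketch of the detection step, assuming the package records (as seen through the UI/SOLR) and the harvest object rows (as stored in the DB) have already been loaded into plain dicts. The function name `find_unlinked_packages` and the dict shapes are hypothetical; this is not the actual dedupe code, and deletion is deliberately left out.

```python
def find_unlinked_packages(packages, harvest_objects):
    """Return names of packages whose harvest object has no package_id.

    packages: iterable of dicts with "name" and "harvest_object_id",
        representing what the UI / SOLR reports (assumed shape).
    harvest_objects: dict mapping harvest object id to a dict with
        "package_id", representing the DB rows (assumed shape).
    """
    unlinked = []
    for pkg in packages:
        obj = harvest_objects.get(pkg.get("harvest_object_id"))
        if obj is None:
            # No harvest object found at all; a different problem, skipped here.
            continue
        if not obj.get("package_id"):
            # UI/SOLR link the package to this harvest object,
            # but the DB row does not point back to any package.
            unlinked.append(pkg["name"])
    return unlinked
```

The returned names would be the candidates for the dedupe process to review and delete.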

FuhuXia added the bug Software defect or bug label Dec 16, 2024

github-project-automation bot added this to data.gov team board Dec 16, 2024

FuhuXia added the O&M Operations and maintenance tasks for the Data.gov platform label Dec 18, 2024

FuhuXia moved this to 🏗 In Progress [8] in data.gov team board Dec 18, 2024

FuhuXia self-assigned this Dec 18, 2024
