Re-harvesting ICPSR datasets #63
Error setting up harvesting client on UNC Dataverse and Demo Dataverse (but not Harvard Dataverse)

Documentation of ICPSR's OAI-PMH feeds is at https://www.icpsr.umich.edu/web/pages/membership/or/metdata/oai.html. Thu-Mai at Odum/UNC let us know today (see RT support email), and I've confirmed on Demo Dataverse, that Dataverse shows the following error when we try the first step of creating a harvesting client using the server URL https://www.icpsr.umich.edu/icpsrweb/neutral/oai/studies:
Demo Dataverse reports the same error when I try to use the second of ICPSR's two documented server URLs, https://www.icpsr.umich.edu/icpsrweb/neutral/oai/citations.

UNC is running Dataverse version 4.16. Demo Dataverse is running 5.13. (Both show this error.) On Harvard Dataverse, also running 5.13, I don't get this error and I am able to get through all 4 steps for creating a harvesting client. I haven't started a harvesting run on Harvard Dataverse because:
Dataverse misses 3,573 of the 10,890 records

Every time I've been able to harvest ICPSR's metadata (using the oai_ddi25 metadata format and the ICPSR "archive type"), more recently by creating a Dataverse instance on AWS, Dataverse has failed to harvest 3,573 records. It gets the other 7,317 records.

I compared the DDI 2.5 metadata of a few records that Dataverse could and could not harvest. The only difference I could see is that the records Dataverse failed to harvest have file-level metadata (in DDI's fileDscr element), while the records it was able to harvest do not. Also, when ICPSR's DDI 2.5 metadata does include file-level metadata, it doesn't include the fileDscr element's ID or URI attributes. So maybe the Dataverse code that imports DDI metadata from ICPSR's OAI-PMH feed doesn't like seeing file-level metadata, or expects to see more metadata about the files (like the ID and/or URI)?

Questions
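One way to check the fileDscr hypothesis on more than a handful of records would be to summarize each DDI record's file-level metadata outside Dataverse. The following is a minimal sketch, not Dataverse's import code: the function name `file_dscr_summary` and the `SAMPLE` record are made up for illustration, and the sample is a hand-written stand-in rather than real ICPSR output.

```python
import xml.etree.ElementTree as ET


def file_dscr_summary(ddi_xml: str) -> dict:
    """Count fileDscr elements and how many carry ID / URI attributes."""
    root = ET.fromstring(ddi_xml)
    # "{*}" matches any (or no) namespace (Python 3.8+), so this does not
    # depend on which DDI namespace the record declares.
    file_dscrs = root.findall(".//{*}fileDscr")
    return {
        "fileDscr_count": len(file_dscrs),
        "with_ID": sum(1 for el in file_dscrs if el.get("ID") is not None),
        "with_URI": sum(1 for el in file_dscrs if el.get("URI") is not None),
    }


# Hypothetical record for illustration only: it has file-level metadata
# but no ID or URI attribute on fileDscr.
SAMPLE = """<codeBook xmlns="ddi:codebook:2_5">
  <stdyDscr><citation><titlStmt><titl>Example study</titl></titlStmt></citation></stdyDscr>
  <fileDscr><fileTxt><fileName>data.tsv</fileName></fileTxt></fileDscr>
</codeBook>"""

print(file_dscr_summary(SAMPLE))
```

Running this over saved copies of harvested and non-harvested records would show whether the presence of fileDscr (or the absence of its ID/URI attributes) lines up with the 3,573 failures.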
Because we've tried to harvest ICPSR's DDI Codebook exports, and I wrote that harvesting failures might be related to file-level metadata, I wonder if the recently opened GitHub issue at IQSS/dataverse#8629 is related.

Also, folks from ICPSR are doing research to improve how ICPSR exports the metadata of their datasets, including exports in their OAI-PMH feeds. An archivist at ICPSR asked me to consider taking a survey on behalf of the Harvard Dataverse Repository (or to share it with those who can take the survey). Others in the community, such as the repository managed at UNC, have also reported issues with harvesting from ICPSR, so I thought I'd share the survey more widely, including in this issue. The survey deadline is May 31, 2022. The email and the survey link are below.
So I think it would be great if some of the responses to that survey could be informed by future attempts to re-harvest ICPSR metadata as the Dataverse team works on the related harvesting issues that @pdurbin and @landreev have been listing and looking into. Maybe we won't know enough by the survey deadline, which is basically a month from now. But in case that's possible, I'm going to create a Google Doc with the survey's questions so that interested folks can collaborate on answers.
A Google Doc with the survey is at https://docs.google.com/document/d/1_qmnh_acoTX_MF8zDnOW4HW9BtO7smyjksqMb1XlbRY. In a post in the Dataverse Community Forums and in the metadata interest group channel in the Dataverse Community Slack, I wrote about the survey, the Google Doc, and how it might be helpful for any Dataverse community members who answer the survey to share their answers.
OK, I'll take a look at the doc.
What should we do with the current harvested ICPSR dataverse in prod., with all its stale records, etc.? Should we just wipe it clean until we can re-harvest from scratch?
I'm not sure about wiping before we can re-harvest. Maybe in the meantime there's still some value in having the stale records discoverable in the Harvard repo. I think a lot of records are missing, but at least some of the records that are in https://dataverse.harvard.edu/dataverse/icpsr still point to datasets.
Now that IQSS/dataverse#4964 is technically resolved, Harvard Dataverse should be able to harvest ICPSR dataset metadata over OAI-PMH (to keep the records up to date) and the dataset title links will point to the ICPSR dataset pages.
Can someone confirm that it's okay to delete the existing harvesting client listed on Harvard Dataverse's "Manage Harvesting Clients" page, to remove the 8k+ stale records? I don't know how that client was created, but the delete button on the "Manage Harvesting Clients" page looks nice and clickable...
After that's done, I'd like to create a new client and schedule a harvesting job for fresh ICPSR datasets.
There's an Archive Type called "ICPSR"
It looks like the Archive Type called "ICPSR" should be used when harvesting from ICPSR. Otherwise, when choosing oai_ddi25 as the metadata format, the dataset title links point to the ICPSR homepage (see this dataverse on Demo Dataverse), and when choosing oai_dc as the metadata format, no records are harvested (all 10k+ records fail to harvest). I think this should be documented.
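To compare what the feed returns for the two metadata formats outside of Dataverse, the standard OAI-PMH request URLs can be built by hand. This is only a sketch: the helper name `oai_request_url` is made up here, it builds URLs without fetching them, and I haven't verified how ICPSR's server responds to each one. The Archive Type setting is a Dataverse-side option and isn't part of these requests.

```python
from urllib.parse import urlencode

# Server URL quoted earlier in this issue.
BASE_URL = "https://www.icpsr.umich.edu/icpsrweb/neutral/oai/studies"


def oai_request_url(base_url: str, verb: str, **params: str) -> str:
    """Build an OAI-PMH request URL for the given verb and arguments."""
    query = urlencode({"verb": verb, **params})
    return f"{base_url}?{query}"


# ListMetadataFormats asks the server which metadataPrefix values it offers.
print(oai_request_url(BASE_URL, "ListMetadataFormats"))

# ListRecords for each of the two formats discussed above.
for prefix in ("oai_ddi25", "oai_dc"):
    print(oai_request_url(BASE_URL, "ListRecords", metadataPrefix=prefix))
```

Opening the printed URLs in a browser or with curl would show the raw records for each format, which might help narrow down why oai_dc records fail to harvest.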