6650 export import mismatch #6669

Merged: 13 commits, Mar 24, 2020


lubitchv
Contributor

What this PR does / why we need it: Fixes the existing mismatch between import and export DDI functions.

Which issue(s) this PR closes: #6650

Special notes for your reviewer: The "Astronomy and Astrophysics", "Life Sciences" and "Journal" metadata do not have DDI-compliant fields and therefore cannot be exported/imported using DDI export/import, so they are not included in this PR.
Some fields that exist in Dataverse (JSON) do not exist in DDI 2.5 and cannot be exported/imported. These fields are:
Alternative URL
otherId fields
authorIdentifierScheme
authorIdentifier
subject (does not exist in the DDI standard and is transformed into keywords)
contributor (does not exist in 2.5 but exists in 3.1)
For geospatial:
DDI does not have state and country. They go under geogCover.
For social sciences:
datasetLevelErrorNotes went into stdyDscr notes

Suggestions on how to test this: There is a file in src/test/java/edu/harvard/iq/dataverse/export/ddi/dataset-create-new-all-ddi-fields.json that has all the fields that are supposed to be exported/imported. One may import it using
curl -H "X-Dataverse-key: $API_TOKEN" --upload-file dataset-create-new-all-ddi-fields.json -X POST $SERVER_URL/api/dataverses/$DV_ALIAS/datasets
Then one will need to publish it, and then export the metadata using either the UI ("Export Metadata" -> "DDI") or curl.
Then one can import exported xml back using
curl -H "X-Dataverse-key: $API_TOKEN" -X POST --upload-file export.xml "$SERVER_URL/api/dataverses/$DV_ALIAS/datasets/:importddi?pid=new_pid&release=no"
Then one can compare metadata fields in UI.

The example of proper xml export with all the fields is in src/test/java/edu/harvard/iq/dataverse/export/ddi/exportfull.xml
In DdiExportUtilTest.java there is a unit test, testExportDDI(), that converts JSON to DDI and compares it to exportfull.xml
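
A comparison of this kind can be made insensitive to indentation. The following is a minimal Python sketch of that idea, not the project's Java test: `same_xml` is a hypothetical helper, and the XML fragments are made up for illustration (real DDI exports also carry a namespace, which is omitted here).

```python
import xml.etree.ElementTree as ET


def same_xml(left: str, right: str) -> bool:
    """Compare two XML strings after C14N canonicalization.

    strip_text=True drops leading/trailing whitespace in text content,
    so indentation differences do not count as mismatches.
    """
    return (ET.canonicalize(left, strip_text=True)
            == ET.canonicalize(right, strip_text=True))


# Hypothetical fragments: same content, different indentation.
expected = (
    '<titlStmt><titl>Title</titl>'
    '<IDNo agency="DOI">doi:10.5072/FK2/WKUKGV</IDNo></titlStmt>'
)
actual = """<titlStmt>
  <titl>Title</titl>
  <IDNo agency="DOI">doi:10.5072/FK2/WKUKGV</IDNo>
</titlStmt>"""
changed = actual.replace("Title", "Other title")

print(same_xml(expected, actual))
print(same_xml(expected, changed))
```

`ET.canonicalize` requires Python 3.8 or later. A round-trip check on a live installation would compare the first export against a second export of the re-imported dataset in the same way.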

@coveralls

coveralls commented Feb 21, 2020

Coverage Status

Coverage increased (+0.3%) to 19.648% when pulling 4e62dbb on lubitchv:6650-export-import-mismatch into 2b8c4ee on IQSS:develop.

@djbrooke
Contributor

Thanks @lubitchv for the PR (and for including/updating unit tests!). We'll review this.

@jggautier - do you want to take a quick look at the implementation here, either before a developer takes a look or at the same time?

@jggautier
Contributor

jggautier commented Feb 22, 2020

Thanks, yes, taking a quick look while a developer takes a look would be helpful. I looked at the xml export example and have some questions, but I could use help getting this PR into a working instance of Dataverse to see how Dataverse maps the DDI metadata on import. (When I tried spinning up this branch on AWS it failed, and I think I got it running locally (with Vagrant), but I think I need to publish datasets to see the DDI exports, and I can't publish datasets without asking for and configuring in my local instance a test DataCite account.)

I have questions about the approach, too (i.e. using DDI or resolving the issues with using the Dataverse_JSON), but it sounds like that wouldn't be quick.

@donsizemore
Contributor

@jggautier i’m happy to spin up the branch if you have trouble. will the default FAKE DOI provider be sufficient for your testing? Odum has “test” credentials we can use temporarily.

@jggautier
Contributor

@donsizemore That would be great! The default fake DOI would be fine. Thanks!

@donsizemore
Contributor

@jggautier http://ec2-52-87-250-239.compute-1.amazonaws.com/ =)

Sampledata going in at the moment. Credentials coming in Slack.

@jggautier
Contributor

jggautier commented Feb 25, 2020

@lubitchv I was worried at first about the amount of work required if you needed to re-add the metadata that was lost by using DDI to migrate dataset metadata, e.g. from fields in other metadatablocks, but @djbrooke let me know that you're only really concerned with the metadata that can be mapped to DDI from the citation, social science and geospatial blocks.

Your changes bring the DDI export to closer alignment with the codebook schema, so I'm very excited.

I couldn't think of any integrations or applications outside of Dataverse that would be affected by the changes to the DDI export (which also affects the DDI metadata available over OAI-PMH), since the integrations/applications I know of are all using the dataverse_json export.

As you pointed out, there will always be some information loss, but for the migration you're working on, would the loss be lessened if we considered how to add additional metadata to the DDI export? These are the fields I think it's possible to add, especially if the effort would be worth it for your use case:

  • OtherId (which has actually been in Dataverse's DDI exports, although in the wrong place)

  • Contributor (I think othID element and its role attribute could be used)

  • Fields in the Terms of Access panel, like Data Access Place and Terms of Access (which isn't a field in DDI Codebook but I think could be handled in a Notes field, the way you've handled the Terms of Use field)

  • Citation Requirements in the Terms of Use panel (all of the other fields in the Terms of Use panel were imported, although the CC0 waiver button isn't triggered)

  • Geographic Coverage, specifically state/province and city (country field was imported). As you found, too, I can't find a way to specify state/province and city using Codebook elements. But if this is important for your migration, maybe we could ask Wendy Thompson who leads DDI metadata development. Last week I was in a call with her and I think she would be happy to try to find a way, or provide extra confirmation that there is no way.

  • Related publication. It might be possible to include the ID and ID Type fields:

    Example
    <relPubl>
    	<citation>
    		<titlStmt>
    			<titl/>
    			<IDNo agency="DOI">identifier</IDNo>
    		</titlStmt>
    		<biblCit>RelatedPublicationCitation</biblCit>
    	</citation>
    </relPubl>
    
  • Subjects. Like you wrote, on a round trip the subject values are added to Dataverse's keywords field. This means that datasets will be published without a Subject chosen (which has always been possible using the API). We've often seen people use Subject as a facet when browsing for data, so datasets that don't have Subjects might be less discoverable during browsing. The code for creating datasets using the SWORD API v1.1 solves this by checking if any dc:subject values in the passed metadata document are in Dataverse's Subjects controlled vocabulary. If they are, they're mapped to Dataverse's Subject field. Otherwise, they're mapped to the keyword field. (http://guides.dataverse.org/en/4.19/api/sword.html#new-features-as-of-v1-1)
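
The SWORD-style subject check described in the last bullet can be sketched as follows. This is a minimal Python illustration under stated assumptions, not Dataverse's Java code: `split_subjects` is a hypothetical helper, and `SUBJECT_VOCABULARY` is a small illustrative subset standing in for the installation's real controlled vocabulary.

```python
# Illustrative subset only (assumption): the real Subject vocabulary is
# defined by the installation's citation metadata block.
SUBJECT_VOCABULARY = {"Chemistry", "Physics", "Social Sciences", "Law"}


def split_subjects(values, vocabulary=SUBJECT_VOCABULARY):
    """Return (subjects, keywords).

    Values found in the controlled vocabulary go to subjects;
    everything else falls back to keywords.
    """
    subjects, keywords = [], []
    for value in values:
        term = value.strip()
        if not term:
            continue  # skip empty entries
        if term in vocabulary:
            subjects.append(term)
        else:
            keywords.append(term)
    return subjects, keywords


subjects, keywords = split_subjects(["Social Sciences", "survey data", "Physics"])
print(subjects)
print(keywords)
```

The match here is exact and case-sensitive; whether an import should normalize case or whitespace more aggressively is a design choice this sketch leaves open.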

Lastly, this PR properly organizes the distributor elements in the exported XML, but also removes the Dataverse-based repository itself as a distributor of the study. To point out what I mean using this XML as an example, currently Dataverse adds the repository as both a distributor of the metadata document (line 9) and of the study (line 47 (ignore the use of ExtLink, which as you've probably seen isn't used now)). This PR removes the second instance under the stdyDscr element. Was this intentional? Should the repository not be declared as a distributor of the study? Should it only be considered a distributor of the study's metadata?

For reference, the dataset created while I tested this import and export code is at http://ec2-52-87-250-239.compute-1.amazonaws.com/dataset.xhtml?persistentId=doi:10.5072/FK2/ZNDIWD

@lubitchv
Contributor Author

lubitchv commented Feb 25, 2020

Thank you @jggautier for the detailed review and suggestions. Yes, it would be useful for us to have the export be closer to DDI standards. Let me write out the fields that should be added to make sure that I understood you correctly.

  1. otherId fields would look like this:
<stdyDscr>
 <citation>
      <titlStmt>
        <titl>Replication Data for: Title</titl>
        <IDNo agency="DOI">doi:10.5072/FK2/WKUKGV</IDNo>
        <IDNo agency="OtherIDAgency1">OtherIDIdentifier1</IDNo>
        <IDNo agency="OtherIDAgency2">OtherIDIdentifier2</IDNo>
      </titlStmt>
      ....
</citation>
...
</stdyDscr>
  2. Contributor (with role):
<stdyDscr>
    <rspStmt>
       <AuthEnty affiliation="AuthorAffiliation1">LastAuthor1, FirstAuthor1</AuthEnty>
       <AuthEnty affiliation="AuthorAffiliation2">LastAuthor2, FirstAuthor2</AuthEnty>
       <othId role="Data Collector">LastContributor1, FirstContributor1</othId>
       <othId role="Data Curator">LastContributor2, FirstContributor2</othId>
     </rspStmt>
...
</stdyDscr>
  3. Terms of Access (EDIT: Data Access Place already exists; it should be moved into the setAvail element)
<dataAccs>
      <notes type="DVN:TOU" level="dv">CC0 Waiver</notes>
      <notes type="DVN:TOA" level="dv">Terms of Access</notes>
    ...
</dataAccs>
  4. Terms of Use (citation)
<useStmt>
          <citReq>Citation Requirements</citReq>
          ...
</useStmt>
  5. Related Publications, instead of
<relPubl>RelatedPublicationCitation1, ark, RelatedPublicationIDNumber1, http://RelatedPublicationURL1.org</relPubl>
<relPubl>RelatedPublicationCitation2, arXiv, RelatedPublicationIDNumber2, http://RelatedPublicationURL2.org</relPubl>

should be:

     <relPubl>
        <citation>
          <titlStmt>
            <titl/>
            <IDNo agency="ark">RelatedPublicationIDNumber1</IDNo>
          </titlStmt>
          <biblCit>RelatedPublicationCitation1</biblCit>
        </citation>
        <ExtLink URI="http://RelatedPublicationURL1.org"></ExtLink>
      </relPubl>
      <relPubl>
        <citation>
          <titlStmt>
            <titl/>
            <IDNo agency="arXiv">RelatedPublicationIDNumber2</IDNo>
          </titlStmt>
          <biblCit>RelatedPublicationCitation2</biblCit>
        </citation>
        <ExtLink URI="http://RelatedPublicationURL2.org"></ExtLink>
      </relPubl>
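
The nested related-publication layout above could be generated along these lines. This is a hedged Python sketch rather than the exporter's actual Java code: `build_rel_publ` and its parameters are hypothetical names, and whether `ExtLink` is schema-valid at this position is not checked here.

```python
import xml.etree.ElementTree as ET


def build_rel_publ(citation, id_type=None, id_number=None, url=None):
    """Build one <relPubl> element in the nested layout shown above."""
    rel_publ = ET.Element("relPubl")
    cit = ET.SubElement(rel_publ, "citation")
    titl_stmt = ET.SubElement(cit, "titlStmt")
    ET.SubElement(titl_stmt, "titl")  # empty title, as in the example
    if id_number:
        id_no = ET.SubElement(titl_stmt, "IDNo")
        if id_type:
            id_no.set("agency", id_type)
        id_no.text = id_number
    ET.SubElement(cit, "biblCit").text = citation
    if url:
        # The serializer always quotes attribute values.
        ET.SubElement(rel_publ, "ExtLink", {"URI": url})
    return rel_publ


element = build_rel_publ(
    "RelatedPublicationCitation1",
    id_type="ark",
    id_number="RelatedPublicationIDNumber1",
    url="http://RelatedPublicationURL1.org",
)
xml_text = ET.tostring(element, encoding="unicode")
print(xml_text)
```

Building the element tree instead of concatenating strings keeps the ID type, ID number, citation and URL in separate elements, which is what lets an import map them back to separate fields.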

Let me know if I made a mistake or misunderstood. I can and will add these fields to export and import for this PR.

Regarding dataverse distributor, it is exported. You can see it in export:

<distStmt>
        <distrbtr>Root</distrbtr>
        <distDate>2020-02-25</distDate>
</distStmt>

I do not know how one could import it, since the distributor is the name of the dataverse and the date is the date of publication in Dataverse.

Regarding Geographic Coverage, it is not critical for us. It would be nice to know if it is possible at all, but it is not urgent.

@jggautier
Contributor

Hi @lubitchv. This is great. I'd like to clarify what I meant in a few places, but might not be able to until this Friday or over the weekend. Is that okay?

@lubitchv
Contributor Author

Hi @jggautier. Yes, it is fine.

@jggautier
Contributor

I agree with your points in 1-5. I didn't even know about the ExtLink element you're using for the related publication URL! That's great!

About the Terms metadata, you caught what I meant to write about Data Access. My sentence construction was a little awkward. =) So would all of the Terms metadata look like this, including the logic for Terms of Use described in the comment next to the first "Notes" element?:

<dataAccs>
	<setAvail>
		<accsPlac>Data Access Place</accsPlac>
		<origArch>Original Archive</origArch>
		<avlStatus>Availability Statement</avlStatus>
		<collSize>Collection Size</collSize>
		<complete>Study Completion</complete>
	</setAvail>
	<useStmt>
		<confDec>Confidentiality Declaration</confDec>
		<specPerm>Special Permissions</specPerm>
		<restrctn>Restrictions</restrctn>
		<contact>Contact for Access</contact>
		<citReq>Citation Requirements</citReq>
		<deposReq>Depositor Requirements</deposReq>
		<conditions>Conditions</conditions>
		<disclaimer>Disclaimer</disclaimer>
	</useStmt>
	<notes type="DVN:TOU" level="dv">See comment</notes><!--If CC0, value is CC0 waiver. If CC0 is waived, and Terms of Use is entered, value is what's entered in Terms of Use field. If CC0 is waived, and no Terms of Use is entered, this element won't be included-->
	<notes type="DVN:TOA" level="dv">Terms of Access</notes>
</dataAccs>
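
The logic in the XML comment next to the first notes element can be written out as a small function. This is a sketch in Python based on one reading of that comment (an assumption): "CC0 is waived" is taken to mean that CC0 is not applied to the dataset. `terms_of_use_note` is a hypothetical name.

```python
from typing import Optional


def terms_of_use_note(is_cc0: bool, terms_of_use: Optional[str]) -> Optional[str]:
    """Return the text for <notes type="DVN:TOU">, or None to omit the element."""
    if is_cc0:
        return "CC0 Waiver"
    # CC0 not applied: use the custom Terms of Use text if there is any.
    text = (terms_of_use or "").strip()
    return text or None


print(terms_of_use_note(True, None))
print(terms_of_use_note(False, "Custom terms"))
print(terms_of_use_note(False, ""))
```

Returning `None` rather than an empty string makes the "element won't be included" case explicit for the caller.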

@jggautier
Contributor

More incoming about the distributor issue. It's just taking a while to write this clearly :)

@lubitchv
Contributor Author

Yes, you are right. The Terms metadata should look as you described.

@jggautier
Contributor

jggautier commented Feb 28, 2020

I just realized that in Dataverse 4.19.1 (and maybe from the start of Dataverse 4) the date entered in Dataverse's distribution date field (in the citation block) isn't mapped to anything in the DDI exports. In both the docDscr and stdyDscr sections of the DDI exports, the distDate is the date when the dataset was first published in the Dataverse installation. I can't find any GitHub issues or email threads about this (clarifying this in the current crosswalk shortly). I think your PR fixes it, so that this metadata, under the docDscr section, will look like this:

<docDscr>
    <citation>
        ...
        <distStmt>
            <distrbtr>Dataverse installation name</distrbtr>
            <distDate>2019-08-01</distDate><!--Date when the dataset was first published in the Dataverse installation-->
        </distStmt>
    ...
    </citation>
</docDscr>

The metadata under the stdyDscr section will look like this:

<stdyDscr>
	...
	<distStmt>
		<distrbtr affiliation="DistributorAffiliation1" abbr="DistributorAbbreviation1" URI="http://DistributorURL1.org">LastDistributor1, FirstDistributor1</distrbtr>
		...
		<distDate>2017-03-22</distDate><!--Date value entered in distribution date metadata field-->
	</distStmt>
	...
</stdyDscr>

But it also removes the line <distrbtr>Dataverse installation name</distrbtr> from the stdyDscr section of Dataverse's current DDI export.

I've always assumed that that line was included in the first two sections of the DDI export (docDscr and stdyDscr) because it was important to state that the repository (at the Dataverse installation level) is the distributor of the DDI document and also of the study itself. I'd like to know if that assumption is right. But I haven't found any years-old metadata design documentation with that level of detail, and it's in the weeds so I wouldn't expect anyone to remember. That's why I asked if it's always appropriate for the repository to be declared as a distributor of the study, or if it should only always be considered a distributor of the study's DDI metadata document.

If it is appropriate to always declare the repository as a distributor of the study, and <distrbtr>Dataverse installation name</distrbtr> was added under stdyDscr > distStmt, I wouldn't expect that metadata to be mapped to any field or displayed in the UI, just like how it isn't mapped now. But I suppose that the way the metadata is organized in the current Dataverse DDI exports, where that line is under a <distStmt> parent element, separate from the distributor metadata in the citation metadata block, which is under its own <distrbtr> element (which is invalid), makes it easier to ignore that line (<distrbtr>Dataverse installation name</distrbtr>) when mapping DDI elements to Dataverse fields during DDI import.

I hope I've written this clearly (I'd be happy to have a call) and that it doesn't continue to unnecessarily hold up your migration. I think the data sharing community's different interpretations of fields like "distributor" and "producer" are behind discussions around "more flexible" dataset citations (#2297), and won't be resolved soon. So I'm wondering if we could keep this metadata designed as is for now, so that the distStmt under stdyDscr looks like this:

<stdyDscr>
	...
	<distStmt>
		<distrbtr>Dataverse installation name</distrbtr>
		<distrbtr affiliation="DistributorAffiliation1" abbr="DistributorAbbreviation1" URI="http://DistributorURL1.org">LastDistributor1, FirstDistributor1</distrbtr>
		...
		<distDate>2017-03-22</distDate><!--Date value entered in distribution date metadata field-->
	</distStmt>
	...
</stdyDscr>

And when importing DDI, the value in <distrbtr>Dataverse installation name</distrbtr> is not mapped to any Dataverse fields (and is not displayed in the UI).

And discussion about whether or not it's always appropriate for the repository to be declared as a distributor of the study as well as of the DDI document can happen outside of this issue (and your migration).

Thanks again for this! It's contributing to resolving the issue about DDI exports not being valid against the schema (#3648), which I plan to update after your great work.

@lubitchv
Contributor Author

lubitchv commented Mar 2, 2020

@jggautier I do not know the answer to your question of whether Dataverse should be in the distrbtr section at the study level. I asked librarians; maybe they will come up with the answer.

The problem I have with putting Dataverse as a distributor at the study level is that I do not know how to differentiate between distributors on import, especially from different Dataverse instances with different names. I guess I can remember the Dataverse distributor name from the docDscr section and compare it to the distributor names in the stdyDscr section. If a name is the same, then ignore it and only map distributors with different names. I can do that.
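
The name-comparison approach described here could look roughly like this. It is a minimal Python sketch, not the importer's Java code: `study_distributors` is a hypothetical helper, the sample document is invented, and it assumes an un-namespaced DDI document with `distStmt` under `citation` in both sections.

```python
import xml.etree.ElementTree as ET

# Invented sample (assumption): real exports carry a DDI namespace.
SAMPLE = """<codeBook>
  <docDscr><citation><distStmt>
    <distrbtr>Root</distrbtr>
  </distStmt></citation></docDscr>
  <stdyDscr><citation><distStmt>
    <distrbtr>Root</distrbtr>
    <distrbtr affiliation="DistributorAffiliation1">LastDistributor1, FirstDistributor1</distrbtr>
  </distStmt></citation></stdyDscr>
</codeBook>"""


def study_distributors(ddi_xml: str):
    """Return study-level distributor names, skipping the installation's own entry."""
    root = ET.fromstring(ddi_xml)
    # Names declared at document level identify the exporting installation.
    installation = {
        (el.text or "").strip()
        for el in root.findall("docDscr/citation/distStmt/distrbtr")
    }
    names = []
    for el in root.findall("stdyDscr/citation/distStmt/distrbtr"):
        name = (el.text or "").strip()
        if name and name not in installation:
            names.append(name)
    return names


print(study_distributors(SAMPLE))
```

One limitation of matching on names: a depositor-entered distributor that happens to share the installation's name would also be dropped.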

@jggautier
Contributor

jggautier commented Mar 2, 2020

> I asked librarians, maybe they will come up with the answer.

Thanks! :) Hopefully getting an answer isn't difficult and doesn't hold up your migration.

> The problem that I have with putting dataverse as distributor in study level is that I do not know how to differentiate between distributors on import, especially from different dataverse instances with different names.

The "source" attribute could be used to distinguish metadata added by the archive/repository versus metadata added by the producer/depositor. Would using the source attribute be a simpler method?:
<distrbtr source="archive">Dataverse installation name</distrbtr>
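
Filtering on the source attribute could be sketched as below. This is an illustrative Python fragment under assumptions: `depositor_distributors` is a hypothetical helper, the `distStmt` sample is invented, and elements without a source attribute are treated as depositor-entered.

```python
import xml.etree.ElementTree as ET

# Invented sample (assumption), without the DDI namespace.
DIST_STMT = """<distStmt>
  <distrbtr source="archive">Dataverse installation name</distrbtr>
  <distrbtr abbr="DistributorAbbreviation1">LastDistributor1, FirstDistributor1</distrbtr>
</distStmt>"""


def depositor_distributors(dist_stmt_xml: str):
    """Return distributor names that were not added by the archive."""
    root = ET.fromstring(dist_stmt_xml)
    return [
        (el.text or "").strip()
        for el in root.findall("distrbtr")
        if el.get("source") != "archive"
    ]


print(depositor_distributors(DIST_STMT))
```

Compared with matching on names, this does not depend on the installation being called the same thing in both sections of the document.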

@djbrooke djbrooke assigned landreev and unassigned jggautier Mar 5, 2020
Contributor

@landreev landreev left a comment


Since @jggautier has reviewed and approved the metadata mapping, the code changes look solid to me, so I'm moving this along.

@kcondon kcondon self-assigned this Mar 10, 2020
@djbrooke
Contributor

djbrooke commented Mar 10, 2020

@lubitchv @landreev will a reExportAll be required for these changes to take effect? If so, I'll add a release note.

@lubitchv
Contributor Author

@djbrooke Since the DDI XML file is created on publishing, datasets that were already published will need to have their old cached DDI XML cleared and the export rerun. So yes, I think reExportAll will be needed.

@djbrooke
Contributor

@lubitchv ok, thanks, I'll add it

@djbrooke
Contributor

Added in 770194d

@kcondon
Contributor

kcondon commented Mar 10, 2020

@lubitchv Would you refresh this branch from develop? I'm getting a flyway db error on deployment that I think is due to some outdated code that was removed.

Update from develop IQSS
@lubitchv
Contributor Author

@kcondon I updated the branch from develop.

Update from IQSS develop
@kcondon
Contributor

kcondon commented Mar 23, 2020

@lubitchv Hi Victoria, hope things are well with you. Would you mind syncing this branch once more with develop? There were some export/multistore changes that would help. Also, apologies for this taking so long, have been a bit distracted lately :(

Update from IQSS develop
@lubitchv
Contributor Author

No problem @kcondon I just updated the branch from IQSS develop.

@kcondon
Contributor

kcondon commented Mar 24, 2020

@lubitchv Thanks! Testing now

@kcondon
Contributor

kcondon commented Mar 24, 2020

@lubitchv I was able to create, export, import according to your instructions, with some minor difficulty due to a bug in the api. When I compared the metadata of the exported and imported datasets, I saw all the metadata was preserved with just a few differences that I hope you can comment on.

Update: I realized you mentioned the export limitation of Subject above so we can ignore that one.
Subject in the exported dataset became keywords in the imported dataset.

Notes field value in exported dataset became field name: value in Notes field in imported dataset, ie. Notes1 became Notes: Notes1

What do you think? I'll also check with @jggautier to see what he thinks.

@lubitchv
Contributor Author

@kcondon Yes, I think that is how it is supposed to work. There is no subject in DDI XML, so it is moved to keywords. That is also how it worked before.
Regarding notes, that is how import works for all notes in the XML.

@kcondon kcondon merged commit c85c542 into IQSS:develop Mar 24, 2020
@kcondon
Contributor

kcondon commented Mar 24, 2020

@lubitchv Thanks, have merged this pr.


8 participants