6650 export import mismatch #6669

lubitchv · 2020-02-21T20:02:25Z

What this PR does / why we need it: Fixes the existing mismatch between import and export DDI functions.

Which issue(s) this PR closes: #6650

Special notes for your reviewer: The "Astronomy and Astrophysics", "Life Sciences" and "Journal" metadata blocks do not have DDI-compliant fields and therefore cannot be exported/imported using DDI export/import, so they were not included in this PR.
Some fields that exist in Dataverse (JSON) do not exist in DDI 2.5 and cannot be exported/imported. These are the fields:
Alternative URL
otherId fields
authorIdentifierScheme
authorIdentifier
subject does not exist in the DDI standard and is transformed into keywords.
contributor (does not exist in 2.5 but exists in 3.1)
For geospatial:
DDI does not have state and country. They go under geogCover.
For social sciences:
datasetLevelErrorNotes went into stdyDscr notes

Suggestions on how to test this: The file src/test/java/edu/harvard/iq/dataverse/export/ddi/dataset-create-new-all-ddi-fields.json has all the fields that are supposed to be exported/imported. One may import it using
curl -H "X-Dataverse-key: $API_TOKEN" --upload-file dataset-create-new-all-ddi-fields.json -X POST $SERVER_URL/api/dataverses/$DV_ALIAS/datasets
Then one will need to publish it, and then export the metadata using either the UI ("Export Metadata" -> "DDI") or curl.
Then one can import the exported XML back using
curl -H "X-Dataverse-key: $API_TOKEN" -X POST --upload-file export.xml "$SERVER_URL/api/dataverses/$DV_ALIAS/datasets/:importddi?pid=new_pid&release=no"
Then one can compare the metadata fields in the UI.
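
The final comparison step can also be approximated outside the UI. The following is a minimal Python sketch, not part of this PR, that assumes two DDI exports are available as strings (for example the original export and an export of the re-imported dataset, which would itself need to be published first). The function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag: str) -> str:
    """Strip an XML namespace such as '{ddi:codebook:2_5}' from a tag name."""
    return tag.rsplit("}", 1)[-1]


def leaf_values(xml_text: str) -> Counter:
    """Count (element path, text) pairs for every element that carries text."""
    counts: Counter = Counter()

    def walk(elem, path):
        here = f"{path}/{local_name(elem.tag)}"
        text = (elem.text or "").strip()
        if text:
            counts[(here, text)] += 1
        for child in elem:
            walk(child, here)

    walk(ET.fromstring(xml_text), "")
    return counts


def round_trip_diff(first_export: str, second_export: str):
    """Return (lost, gained): values only in the first or only in the second export."""
    first, second = leaf_values(first_export), leaf_values(second_export)
    return first - second, second - first
```

Attributes are ignored in this sketch, so differences such as a changed `agency` value would not be reported.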

An example of a proper XML export with all the fields is in src/test/java/edu/harvard/iq/dataverse/export/ddi/exportfull.xml
In DdiExportUtilTest.java there is a unit test, testExportDDI(), that converts the JSON to DDI and compares it to exportfull.xml

coveralls · 2020-02-21T20:12:19Z

Coverage increased (+0.3%) to 19.648% when pulling 4e62dbb on lubitchv:6650-export-import-mismatch into 2b8c4ee on IQSS:develop.

djbrooke · 2020-02-21T20:17:14Z

Thanks @lubitchv for the PR (and for including/updating unit tests!). We'll review this.

@jggautier - do you want to take a quick look at the implementation here, either before a developer takes a look or at the same time?

jggautier · 2020-02-22T21:39:25Z

Thanks, yes, taking a quick look while a developer takes a look would be helpful. I looked at the xml export example and have some questions, but I could use help getting this PR into a working instance of Dataverse to see how Dataverse maps the DDI metadata on import. (When I tried spinning up this branch on AWS it failed, and I think I got it running locally (with Vagrant), but I think I need to publish datasets to see the DDI exports, and I can't publish datasets without asking for and configuring in my local instance a test DataCite account.)

I have questions about the approach, too (i.e. using DDI or resolving the issues with using the Dataverse_JSON), but it sounds like that wouldn't be quick.

donsizemore · 2020-02-22T23:59:07Z

@jggautier i’m happy to spin up the branch if you have trouble. will the default FAKE DOI provider be sufficient for your testing? Odum has “test” credentials we can use temporarily.

jggautier · 2020-02-23T00:01:52Z

@donsizemore That would be great! The default fake DOI would be fine. Thanks!

donsizemore · 2020-02-23T18:27:46Z

@jggautier http://ec2-52-87-250-239.compute-1.amazonaws.com/ =)

Sample data going in at the moment. Credentials coming in Slack.

jggautier · 2020-02-25T03:50:23Z

@lubitchv I was worried at first about the amount of work required if you needed to re-add the metadata that was lost by using DDI to migrate dataset metadata, e.g. from fields in other metadatablocks, but @djbrooke let me know that you're only really concerned with the metadata that can be mapped to DDI from the citation, social science and geospatial blocks.

Your changes bring the DDI export into closer alignment with the codebook schema, so I'm very excited.

I couldn't think of any integrations or applications outside of Dataverse that would be affected by the changes to the DDI export (which also affects the DDI metadata available over OAI-PMH), since the integrations/applications I know of are all using the dataverse_json export.

As you pointed out, there will always be some information loss, but for the migration you're working on, would the loss be lessened if we considered how to add additional metadata to the DDI export? These are the fields I think it's possible to add, especially if the effort would be worth it for your use case:

OtherId (which has actually been in Dataverse's DDI exports, although in the wrong place)
Contributor (I think othID element and its role attribute could be used)
Fields in the Terms of Access panel, like Data Access Place and Terms of Access (which isn't a field in DDI Codebook but I think could be handled in a Notes field, the way you've handled the Terms of Use field)
Citation Requirements in the Terms of Use panel (all of the other fields in the Terms of Use panel were imported, although the CC0 waiver button isn't triggered)
Geographic Coverage, specifically state/province and city (country field was imported). As you found, too, I can't find a way to specify state/province and city using Codebook elements. But if this is important for your migration, maybe we could ask Wendy Thompson who leads DDI metadata development. Last week I was in a call with her and I think she would be happy to try to find a way, or provide extra confirmation that there is no way.

Related publication. It might be possible to include the ID and ID Type fields:

Example

<relPubl>
	<citation>
		<titlStmt>
			<titl/>
			<IDNo agency="DOI">identifier</IDNo>
		</titlStmt>
		<biblCit>RelatedPublicationCitation</biblCit>
	</citation>
</relPubl>

Subjects. Like you wrote, on a round trip the subject values are added to Dataverse's keywords field. This means that datasets will be published without a Subject chosen (which has always been possible using the API). We've often seen people use Subject as a facet when browsing for data, so datasets that don't have Subjects might be less discoverable during browsing. The code for creating datasets using the SWORD API v1.1 solves this by checking if any dc:subject values in the passed metadata document are in Dataverse's Subjects controlled vocabulary. If they are, they're mapped to Dataverse's Subject field. Otherwise, they're mapped to the keyword field. (http://guides.dataverse.org/en/4.19/api/sword.html#new-features-as-of-v1-1)
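
The SWORD-style check described above could be sketched roughly as follows. This is an illustration in Python, not the actual Dataverse code: `split_subjects` is a hypothetical name, the vocabulary shown is only an assumed subset, and exact, case-sensitive matching is an assumption.

```python
# Assumed, illustrative subset of the Subject controlled vocabulary.
# A real installation defines the full list in its citation metadata block.
SUBJECT_VOCABULARY = {"Social Sciences", "Physics", "Chemistry", "Law", "Engineering"}


def split_subjects(values, vocabulary=SUBJECT_VOCABULARY):
    """Partition incoming subject strings into (subjects, keywords).

    A value that matches the controlled vocabulary exactly goes to the
    Subject field; everything else falls back to the keyword field.
    """
    subjects, keywords = [], []
    for value in values:
        if value in vocabulary:
            subjects.append(value)
        else:
            keywords.append(value)
    return subjects, keywords
```

With this approach a round-tripped dataset would keep its Subject facet whenever the DDI keyword happens to be a vocabulary term.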

Lastly, this PR properly organizes the distributor elements in the exported XML, but also removes the Dataverse-based repository itself as a distributor of the study. To point out what I mean using this XML as an example, currently Dataverse adds the repository as both a distributor of the metadata document (line 9) and of the study (line 47 (ignore the use of ExtLink, which as you've probably seen isn't used now)). This PR removes the second instance under the stdyDscr element. Was this intentional? Should the repository not be declared as a distributor of the study? Should it only be considered a distributor of the study's metadata?

For reference, the dataset created while I tested this import and export code is at http://ec2-52-87-250-239.compute-1.amazonaws.com/dataset.xhtml?persistentId=doi:10.5072/FK2/ZNDIWD

lubitchv · 2020-02-25T17:59:39Z

Thank you @jggautier for the detailed review and suggestions. Yes, it would be useful for us to have the export be closer to the DDI standard. Let me write out the fields that should be added to make sure that I understood you correctly.

otherId fields would look like this:

<stdyDscr>
 <citation>
      <titlStmt>
        <titl>Replication Data for: Title</titl>
        <IDNo agency="DOI">doi:10.5072/FK2/WKUKGV</IDNo>
        <IDNo agency="OtherIDAgency1">OtherIDIdentifier1</IDNo>
        <IDNo agency="OtherIDAgency2">OtherIDIdentifier2</IDNo>
      </titlStmt>
      ....
</citation>
...
</stdyDscr>

Contributor (with role):

<stdyDscr>
    <rspStmt>
       <AuthEnty affiliation="AuthorAffiliation1">LastAuthor1, FirstAuthor1</AuthEnty>
       <AuthEnty affiliation="AuthorAffiliation2">LastAuthor2, FirstAuthor2</AuthEnty>
       <othId role="Data Collector">LastContributor1, FirstContributor1</othId>
       <othId role="Data Curator">LastContributor2, FirstContributor2</othId>
     </rspStmt>
...
</stdyDscr>

Terms of Access EDIT: Data Access Place already exists, it should be moved into setAvail element

<dataAccs>
      <notes type="DVN:TOU" level="dv">CC0 Waiver</notes>
      <notes type="DVN:TOA" level="dv">Terms of Access</notes>
    ...
</dataAccs>

Terms of Use (citation)

<useStmt>
          <citReq>Citation Requirements</citReq>
          ...
</useStmt>

Related Publications, instead of

<relPubl>RelatedPublicationCitation1, ark, RelatedPublicationIDNumber1, http://RelatedPublicationURL1.org</relPubl>
<relPubl>RelatedPublicationCitation2, arXiv, RelatedPublicationIDNumber2, http://RelatedPublicationURL2.org</relPubl>

should be:

     <relPubl>
        <citation>
          <titlStmt>
            <titl/>
            <IDNo agency="ark">RelatedPublicationIDNumber1</IDNo>
          </titlStmt>
          <biblCit>RelatedPublicationCitation1</biblCit>
        </citation>
        <ExtLink URI="http://RelatedPublicationURL1.org"></ExtLink>
      </relPubl>
      <relPubl>
        <citation>
          <titlStmt>
            <titl/>
            <IDNo agency="arXiv">RelatedPublicationIDNumber2</IDNo>
          </titlStmt>
          <biblCit>RelatedPublicationCitation2</biblCit>
        </citation>
        <ExtLink URI="http://RelatedPublicationURL2.org"></ExtLink>
      </relPubl>
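
To make the proposed structure concrete, here is a minimal Python sketch that builds one such relPubl element from the four related-publication values. It is only an illustration of the target shape, not the exporter code in this PR, and `related_publication` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def related_publication(citation, id_type, id_number, url):
    """Build a relPubl element with the nested citation layout shown above."""
    rel_publ = ET.Element("relPubl")
    cit = ET.SubElement(rel_publ, "citation")
    titl_stmt = ET.SubElement(cit, "titlStmt")
    ET.SubElement(titl_stmt, "titl")  # left empty, as in the example
    id_no = ET.SubElement(titl_stmt, "IDNo", agency=id_type)
    id_no.text = id_number
    ET.SubElement(cit, "biblCit").text = citation
    # The serializer always quotes attribute values, so the URI is well-formed.
    ET.SubElement(rel_publ, "ExtLink", URI=url)
    return rel_publ


element = related_publication(
    "RelatedPublicationCitation1",
    "ark",
    "RelatedPublicationIDNumber1",
    "http://RelatedPublicationURL1.org",
)
print(ET.tostring(element, encoding="unicode"))
```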

Let me know if I made a mistake or misunderstood. I can and will add these fields to export and import for this PR.

Regarding dataverse distributor, it is exported. You can see it in export:

<distStmt>
        <distrbtr>Root</distrbtr>
        <distDate>2020-02-25</distDate>
</distStmt>

I do not know how one can import it, since the distributor is the name of the dataverse and the date is the date of publishing in that dataverse.

Regarding Geographic Coverage, it is not critical for us. It would be nice to know if it is possible at all, but it is not urgent.

jggautier · 2020-02-26T22:01:44Z

Hi @lubitchv. This is great. I'd like to clarify what I meant in a few places, but might not be able to until this Friday or over the weekend. Is that okay?

lubitchv · 2020-02-27T14:55:19Z

Hi @jggautier. Yes, it is fine.

jggautier · 2020-02-28T19:22:00Z

I agree with your points in 1-5. I didn't even know about the ExtLink element you're using for the related publication URL! That's great!

About the Terms metadata, you caught what I meant to write about Data Access. My sentence construction was a little awkward. =) So would all of the Terms metadata look like this, including the logic for Terms of Use described in the comment next to the first "Notes" element?:

<dataAccs>
	<setAvail>
		<accsPlac>Data Access Place</accsPlac>
		<origArch>Original Archive</origArch>
		<avlStatus>Availability Statement</avlStatus>
		<collSize>Collection Size</collSize>
		<complete>Study Completion</complete>
	</setAvail>
	<useStmt>
		<confDec>Confidentiality Declaration</confDec>
		<specPerm>Special Permissions</specPerm>
		<restrctn>Restrictions</restrctn>
		<contact>Contact for Access</contact>
		<citReq>Citation Requirements</citReq>
		<deposReq>Depositor Requirements</deposReq>
		<conditions>Conditions</conditions>
		<disclaimer>Disclaimer</disclaimer>
	</useStmt>
	<notes type="DVN:TOU" level="dv">See comment</notes><!--If CC0, value is CC0 waiver. If CC0 is waived, and Terms of Use is entered, value is what's entered in Terms of Use field. If CC0 is waived, and no Terms of Use is entered, this element won't be included-->
	<notes type="DVN:TOA" level="dv">Terms of Access</notes>
</dataAccs>
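
The Terms of Use rule in the XML comment above can be read as a small decision function. This is a minimal Python sketch under one assumption: "CC0 is waived" in the comment is taken to mean that CC0 is not applied to the dataset. `terms_of_use_note` and its parameters are hypothetical names, not Dataverse code.

```python
from typing import Optional


def terms_of_use_note(uses_cc0: bool, terms_of_use: Optional[str]) -> Optional[str]:
    """Return the text for the DVN:TOU notes element, or None to omit it."""
    if uses_cc0:
        return "CC0 Waiver"
    text = (terms_of_use or "").strip()
    # Custom terms are exported as entered; with no terms there is no element.
    return text or None
```

Returning None rather than an empty string makes it explicit that the exporter should skip the element entirely in the third case.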

jggautier · 2020-02-28T19:32:45Z

More incoming about the distributor issue. It's just taking a while to write this clearly :)

lubitchv · 2020-02-28T20:07:03Z

Yes, you are right. Terms metadata should look like you described.

jggautier · 2020-02-28T21:40:10Z

I just realized that in Dataverse 4.19.1 (and maybe from the start of Dataverse 4) the date entered in Dataverse's distribution date field (in the citation block) isn't mapped to anything in the DDI exports. In both the docDscr and stdyDscr sections of the DDI exports, the distDate is the date when the dataset was first published in the Dataverse installation. I can't find any GitHub issues or email threads about this (clarifying this in the current crosswalk shortly). I think your PR fixes it, so that this metadata, under the docDscr section, will look like this:

<docDscr>
    <citation>
        ...
        <distStmt>
            <distrbtr>Dataverse installation name</distrbtr>
            <distDate>2019-08-01</distDate><!--Date when the dataset was first published in the Dataverse installation-->
        </distStmt>
    ...
    </citation>
</docDscr>

The metadata under the stdyDscr section will look like this:

<stdyDscr>
	...
	<distStmt>
		<distrbtr affiliation="DistributorAffiliation1" abbr="DistributorAbbreviation1" URI="http://DistributorURL1.org">LastDistributor1, FirstDistributor1</distrbtr>
		...
		<distDate>2017-03-22</distDate><!--Date value entered in distribution date metadata field-->
	</distStmt>
	...
</stdyDscr>

But it also removes the line <distrbtr>Dataverse installation name</distrbtr> from the stdyDscr section of Dataverse's current DDI export.

I've always assumed that that line was included in the first two sections of the DDI export (docDscr and stdyDscr) because it was important to state that the repository (at the Dataverse installation level) is the distributor of the DDI document and also of the study itself. I'd like to know if that assumption is right. But I haven't found any years-old metadata design documentation with that level of detail, and it's in the weeds so I wouldn't expect anyone to remember. That's why I asked if it's always appropriate for the repository to be declared as a distributor of the study, or if it should only always be considered a distributor of the study's DDI metadata document.

If it is appropriate to always consider the repository to be declared as a distributor of the study, and <distrbtr>Dataverse installation name</distrbtr> was added under stdyDscr > distStmt, I wouldn't expect that metadata to be mapped to any field or displayed in the UI, just as it isn't mapped now. But I suppose the way the metadata is organized in the current Dataverse DDI exports makes it easier to ignore that line (<distrbtr>Dataverse installation name</distrbtr>) when mapping DDI elements to Dataverse fields during DDI import: that line is under a <distStmt> parent element, separate from the distributor metadata from the citation metadata block, which sits under its own <distrbtr> element (which is invalid).

I hope I've written this clearly (I'd be happy to have a call) and that it doesn't continue to unnecessarily hold up your migration. I think the data sharing community's different interpretations of fields like "distributor" and "producer" are behind discussions around "more flexible" dataset citations (#2297), and won't be resolved soon. So I'm wondering if we could keep this metadata designed as is for now, so that the distStmt under stdyDscr looks like this:

<stdyDscr>
	...
	<distStmt>
		<distrbtr>Dataverse installation name</distrbtr>
		<distrbtr affiliation="DistributorAffiliation1" abbr="DistributorAbbreviation1" URI="http://DistributorURL1.org">LastDistributor1, FirstDistributor1</distrbtr>
		...
		<distDate>2017-03-22</distDate><!--Date value entered in distribution date metadata field-->
	</distStmt>
	...
</stdyDscr>

And when importing DDI, the value in <distrbtr>Dataverse installation name</distrbtr> is not mapped to any Dataverse fields (and is not displayed in the UI).

And discussion about whether or not it's always appropriate for the repository to be declared as a distributor of the study as well as of the DDI document can happen outside of this issue (and your migration).

Thanks again for this! It's contributing to resolving the issue about DDI exports not being valid against the schema (#3648), which I plan to update after your great work.

lubitchv · 2020-03-02T17:19:17Z

@jggautier I do not know the answer to your question of whether Dataverse should be in the distrbtr section at the study level. I asked librarians, maybe they will come up with the answer.

The problem that I have with putting dataverse as distributor in study level is that I do not know how to differentiate between distributors on import, especially from different dataverse instances with different names. I guess I can remember the dataverse distributor name from the docDscr section and compare it to the distributor names in the stdyDscr section. If a name is the same, then ignore it and only map distributors with different names. I can do that.
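
The name-comparison idea could look roughly like the following. This is a minimal Python sketch, not the import code in this PR: the function names are hypothetical, and it assumes the distrbtr elements sit inside a distStmt somewhere below docDscr and stdyDscr.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip an XML namespace such as '{ddi:codebook:2_5}' from a tag name."""
    return tag.rsplit("}", 1)[-1]


def distributors(root, section_name: str):
    """Names of every distrbtr inside any distStmt below the named section."""
    names = []
    for section in root:
        if local_name(section.tag) != section_name:
            continue
        for stmt in section.iter():
            if local_name(stmt.tag) == "distStmt":
                names += [
                    (d.text or "").strip()
                    for d in stmt
                    if local_name(d.tag) == "distrbtr"
                ]
    return names


def study_distributors_to_import(ddi_xml: str):
    """Study-level distributors, minus any that match a document-level one."""
    root = ET.fromstring(ddi_xml)
    installation_names = set(distributors(root, "docDscr"))
    return [
        name
        for name in distributors(root, "stdyDscr")
        if name not in installation_names
    ]
```

One limitation of this approach: a depositor-entered distributor that happens to share the installation's name would also be dropped.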

jggautier · 2020-03-02T18:04:22Z

> I asked librarians, maybe they will come up with the answer.

Thanks! :) Hopefully getting an answer isn't difficult and doesn't hold up your migration.

> The problem that I have with putting dataverse as distributor in study level is that I do not know how to differentiate between distributors on import, especially from different dataverse instances with different names.

The "source" attribute could be used to distinguish metadata added by the archive/repository versus metadata added by the producer/depositor. Would using the source attribute be a simpler method?:
<distrbtr source="archive">Dataverse installation name</distrbtr>
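
On import, filtering on that attribute might look like this. It is a minimal Python sketch assuming the exporter writes source="archive" on the installation's own distrbtr element and leaves depositor-entered distributors without it; `depositor_distributors` is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def depositor_distributors(dist_stmt_xml: str):
    """Return distrbtr names that were not marked as added by the archive."""
    stmt = ET.fromstring(dist_stmt_xml)
    return [
        (elem.text or "").strip()
        for elem in stmt.iter()
        # Compare local names so the check also works with a DDI namespace.
        if elem.tag.rsplit("}", 1)[-1] == "distrbtr"
        and elem.get("source") != "archive"
    ]
```

Unlike comparing names between docDscr and stdyDscr, this does not depend on the installation name being unique among the distributors.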

landreev

Since @jggautier has reviewed and approved the metadata mapping, the code changes look solid to me, so I'm moving this along.

djbrooke · 2020-03-10T17:52:29Z

@lubitchv @landreev will a reExportAll be required for these changes to take effect? If so, I'll add a release note.

lubitchv · 2020-03-10T18:07:18Z

@djbrooke The DDI XML file is created on publishing, so datasets that were already published will need to have their old cached DDI XML cleared and the export rerun. So yes, I think reExportAll will be needed.

djbrooke · 2020-03-10T18:10:17Z

@lubitchv ok, thanks, I'll add it

djbrooke · 2020-03-10T18:32:56Z

Added in 770194d

kcondon · 2020-03-10T22:21:45Z

@lubitchv Would you refresh this branch from develop? I'm getting a flyway db error on deployment that I think is due to some outdated code that was removed.

lubitchv · 2020-03-11T14:11:09Z

@kcondon I updated the branch from develop.

kcondon · 2020-03-23T23:57:40Z

@lubitchv Hi Victoria, hope things are well with you. Would you mind syncing this branch once more with develop? There were some export/multistore changes that would help. Also, apologies for this taking so long, have been a bit distracted lately :(

lubitchv · 2020-03-24T14:40:19Z

No problem @kcondon I just updated the branch from IQSS develop.

kcondon · 2020-03-24T15:32:50Z

@lubitchv Thanks! Testing now

kcondon · 2020-03-24T17:35:09Z

@lubitchv I was able to create, export, import according to your instructions, with some minor difficulty due to a bug in the api. When I compared the metadata of the exported and imported datasets, I saw all the metadata was preserved with just a few differences that I hope you can comment on.

Update: I realized you mentioned the export limitation of Subject above so we can ignore that one.
Subject in the exported dataset became keywords in the imported dataset.

The Notes field value in the exported dataset became "field name: value" in the Notes field of the imported dataset, i.e. Notes1 became Notes: Notes1

What do you think? I'll also check with @jggautier to see what he thinks.

lubitchv · 2020-03-24T18:15:58Z

@kcondon Yes, I think that is how it is supposed to work. There is no subject in DDI XML, so it is moved to keywords. It is also how it worked before.
Regarding notes, that is how import works for all notes in the XML.

kcondon · 2020-03-24T18:42:50Z

@lubitchv Thanks, have merged this pr.

lubitchv and others added 8 commits January 24, 2020 15:15

Merge pull request #42 from IQSS/develop

b49617b

Update

Merge pull request #44 from IQSS/develop

02c3e38

Update

export

0b59f5e

Merge pull request #45 from IQSS/develop

7d7a258

Update

import export

a53c857

Merge branch 'export-import-mismatch' of https://github.com/lubitchv/…

a7aaabb

…dataverse into export-import-mismatch

export import ddi

9b98584

remove commented lines

03d48a8

djbrooke assigned jggautier Feb 21, 2020

jggautier mentioned this pull request Feb 28, 2020

Make Dataverse produce valid DDI codebook 2.5 XML #3648

Closed

jggautier mentioned this pull request Mar 3, 2020

Study Level Error Notes field maps in DDI to "anylInfo" instead of "anlyInfo" #6715

Closed

fix export import ddi

8446bfb

djbrooke assigned landreev and unassigned jggautier Mar 5, 2020

landreev approved these changes Mar 9, 2020

View reviewed changes

djbrooke unassigned landreev Mar 10, 2020

kcondon self-assigned this Mar 10, 2020

adding release note

770194d

Merge pull request #46 from IQSS/develop

65ac90e

Update from develop IQSS

Merge pull request #48 from IQSS/develop

a646eb2

Update from IQSS develop

Merge pull request #49 from IQSS/develop

4e62dbb

Update from IQSS develop

kcondon merged commit c85c542 into IQSS:develop Mar 24, 2020

djbrooke added this to the 4.20 milestone Mar 25, 2020

jggautier mentioned this pull request Apr 28, 2020

Mismatch between export and import DDI functions #6650

Closed

jggautier mentioned this pull request Jan 19, 2022

Feature Request/Idea: Add license info to DDI exports #8355

Closed

pdurbin added the Feature: Harvesting label Apr 13, 2022

pdurbin mentioned this pull request Apr 13, 2022

Spike: Inventory and prioritize all existing Harvesting related issues IQSS/dataverse-pm#24

Closed

3 tasks
