Commit 926ad11

Guide entries for the DataCite harvesting feature #10909
1 parent 75f0621 commit 926ad11

2 files changed, +42 -4 lines changed


doc/sphinx-guides/source/admin/harvestclients.rst

+9 -1
@@ -25,6 +25,14 @@ Please note that in some rare cases this GUI may fail to create a client because
2525

2626
Note that as of 5.13, a new entry "Custom HTTP Header" has been added to the Step 1. of Create or Edit form. This optional field can be used to configure this client with a specific HTTP header to be added to every OAI request. This is to accommodate a (rare) use case where the remote server may require a special token of some kind in order to offer some content not available to other clients. Most OAI servers offer the same publicly-available content to all clients, so few admins will have a use for this feature. It is however on the very first, Step 1. screen in case the OAI server requires this token even for the "ListSets" and "ListMetadataFormats" requests, which need to be sent in the Step 2. of creating or editing a client. Multiple headers can be supplied separated by `\\n` - actual "backslash" and "n" characters, not a single "new line" character.
2727

28+
Harvesting from DataCite
29+
~~~~~~~~~~~~~~~~~~~~~~~~
30+
31+
As of v6.6, it is possible to harvest metadata directly from DataCite. The DataCite OAI gateway (https://oai.datacite.org/oai) serves records for every DOI registered with DataCite, so metadata can be harvested from any participating institution even if it does not maintain an OAI server of its own. The DataCite OAI implementation also offers "dynamic sets": any query supported by the DataCite search API can be used as though it were a "set". This makes harvesting from DataCite especially flexible, allowing you to harvest virtually any subset of metadata records, potentially spanning multiple institutions and registration authorities.
32+
33+
As of this release, the functionality is offered as experimental. A beta version is nevertheless already in use at IQSS with seemingly satisfactory results.
34+
35+
To take advantage of this feature, harvesting clients must be created and edited via the ``/api/harvest/clients`` API. Once configured, however, harvests can be run from the Harvesting Clients control panel in the UI. See the :ref:`managing-harvesting-clients-api` section of the :doc:`/api/native-api` guide for more information.
2836

2937
How to Stop a Harvesting Run in Progress
3038
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -51,4 +59,4 @@ Note that you'll want to run a minimum of Dataverse Software 4.6, optimally 4.18
5159
Harvesting Non-OAI-PMH
5260
~~~~~~~~~~~~~~~~~~~~~~
5361

54-
`DOI2PMH <https://github.com/IQSS/doi2pmh-server>`__ is a community-driven project intended to allow OAI-PMH harvesting from non-OAI-PMH sources.
62+
`DOI2PMH <https://github.com/IQSS/doi2pmh-server>`__ is a community-driven project intended to allow OAI-PMH harvesting from non-OAI-PMH sources.

doc/sphinx-guides/source/api/native-api.rst

+33 -3
@@ -5254,7 +5254,8 @@ Shows a Harvesting Client with a defined nickname::
52545254
"dataverseAlias": "fooData",
52555255
"nickName": "myClient",
52565256
"set": "fooSet",
5257-
"useOaiIdentifiersAsPids": false
5257+
"useOaiIdentifiersAsPids": false,
5258+
"useListRecords": false,
52585259
"schedule": "none",
52595260
"status": "inActive",
52605261
"lastHarvest": "Thu Oct 13 14:48:57 EDT 2022",
@@ -5285,11 +5286,12 @@ You must supply a JSON file that describes the configuration, similarly to the o
52855286
The following optional fields are supported:
52865287
52875288
- archiveDescription: What the name suggests. If not supplied, will default to "This Dataset is harvested from our partners. Clicking the link will take you directly to the archival source of the data."
5288-
- set: The OAI set on the remote server. If not supplied, will default to none, i.e., "harvest everything".
5289+
- set: The OAI set on the remote server. If not supplied, will default to none, i.e., "harvest everything". (See the note below on using sets when harvesting from DataCite; this is new as of v6.6.)
52895290
- style: Defaults to "default" - a generic OAI archive. (Make sure to use "dataverse" when configuring harvesting from another Dataverse installation).
52905291
- customHeaders: This can be used to configure this client with a specific HTTP header that will be added to every OAI request. This is to accommodate a use case where the remote server requires this header to supply some form of a token in order to offer some content not available to other clients. See the example below. Multiple headers can be supplied separated by `\\n` - actual "backslash" and "n" characters, not a single "new line" character.
52915292
- allowHarvestingMissingCVV: Flag to allow datasets to be harvested with Controlled Vocabulary Values that existed in the originating Dataverse Project but are not in the harvesting Dataverse Project. (Default is false). Currently only settable using API.
5292-
- useOaiIdentifiersAsPids: Defaults to false; if set to true, the harvester will attempt to use the identifier from the OAI-PMH record header as the **first choice** for the persistent id of the harvested dataset. When set to false, Dataverse will still attempt to use this identifier, but only if none of the `<dc:identifier>` entries in the OAI_DC record contain a valid persistent id (this is new as of v6.5).
5293+
- useOaiIdentifiersAsPids: Defaults to false; if set to true, the harvester will attempt to use the identifier from the OAI-PMH record header as the **first choice** for the persistent id of the harvested dataset. When set to false, Dataverse will still attempt to use this identifier, but only if none of the `<dc:identifier>` entries in the OAI_DC record contain a valid persistent id (this is new as of v6.5).
5294+
- useListRecords: Defaults to false; if set to true, the harvester will attempt to retrieve multiple records in a single pass using the OAI-PMH verb ListRecords. By default, our harvester relies on a ListIdentifiers call followed by a separate GetRecord call for each individual record. Note that this option is required when configuring harvesting from DataCite (this is new as of v6.6).
52935295
52945296
Generally, the API will accept the output of the GET version of the API for an existing client as valid input, but some fields will be ignored. For example, as of writing this there is no way to configure a harvesting schedule via this API.
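As a rough illustration of the fields above, the following Python sketch assembles a client configuration and prepares (but does not send) a request to create the client. It is only a sketch under stated assumptions: the ``harvestUrl``, ``archiveUrl`` and ``metadataFormat`` field names, the ``/api/harvest/clients/<nickName>`` path, the ``X-Dataverse-key`` header, and the server, token and archive URL values are not taken from the text above and should be checked against the rest of this guide.

```python
import json
import urllib.request


def build_client_config(dataverse_alias, harvest_url, archive_url,
                        metadata_format="oai_dc", oai_set=None,
                        use_list_records=False,
                        use_oai_identifiers_as_pids=False):
    """Assemble the JSON body describing a harvesting client."""
    config = {
        "dataverseAlias": dataverse_alias,
        "harvestUrl": harvest_url,          # assumed field name
        "archiveUrl": archive_url,          # assumed field name
        "metadataFormat": metadata_format,  # assumed field name
        "useOaiIdentifiersAsPids": use_oai_identifiers_as_pids,
        "useListRecords": use_list_records,
    }
    # Leaving "set" out means "harvest everything".
    if oai_set is not None:
        config["set"] = oai_set
    return config


def build_create_request(server, nickname, api_token, config):
    """Prepare a POST request for the harvesting clients API (not sent here)."""
    # Assumed URL layout: the client nickname is the last path segment.
    url = f"{server.rstrip('/')}/api/harvest/clients/{nickname}"
    return urllib.request.Request(
        url,
        data=json.dumps(config).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "X-Dataverse-key": api_token,  # assumed authentication header
        },
        method="POST",
    )


# Hypothetical values, for illustration only.
config = build_client_config(
    "fooData",
    "https://oai.datacite.org/oai",
    "https://example.org",
    use_list_records=True,  # required when harvesting from DataCite
)
request = build_create_request("http://localhost:8080", "myClient",
                               "xxxxxxxx", config)
print(request.get_method(), request.full_url)
print(json.dumps(config, indent=2))
```

Passing ``request`` to ``urllib.request.urlopen`` would actually submit it; the sketch stops short of that so it can be run without a server.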
52955297
@@ -5365,8 +5367,36 @@ Self-explanatory:
53655367
53665368
Only users with superuser permissions may delete harvesting clients.
53675369
5370+
Harvesting from DataCite
5371+
~~~~~~~~~~~~~~~~~~~~~~~~
5372+
5373+
The following two options are **required** when harvesting from DataCite (https://oai.datacite.org/oai):
5374+
5375+
.. code-block:: bash
5376+
"useOaiIdentifiersAsPids": true,
5377+
"useListRecords": true,
5378+
5379+
There are two ways the ``set`` parameter can be used when harvesting from DataCite:
5380+
5381+
- DataCite maintains pre-configured OAI sets for every subscribing institution that registers DOIs with them. Such a set can be used to harvest all of the metadata registered by a given institution (this is identical to how the ``set`` parameter is used with any other standard OAI archive);
5382+
- As a DataCite-specific feature, the ``set`` parameter can also be used to harvest virtually any subset of records (potentially spanning different institutions and authorities). Any query that the DataCite search API understands can be used as an OAI "set" name. For example, the following search query finds one specific dataset:
5383+
5384+
.. code-block:: bash
5385+
https://api.datacite.org/dois?query=doi:10.7910/DVN/TJCLKP
53685386
5387+
You can create a single-record OAI set by using the base64-encoded form of the query string (``doi:10.7910/DVN/TJCLKP``) as the set name:
53695388
5389+
.. code-block:: bash
5390+
echo "doi:10.7910/DVN/TJCLKP" | base64
5391+
ZG9pOjEwLjc5MTAvRFZOL1RKQ0xLUAo=
5392+
5393+
Use the encoded string above, prefixed by the ``~`` character, in your harvesting client configuration:
5394+
5395+
.. code-block:: bash
5396+
"set": "~ZG9pOjEwLjc5MTAvRFZOL1RKQ0xLUAo="
5397+
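The same encoding can be sketched in Python. The helper name ``datacite_dynamic_set`` is hypothetical. By default it appends a newline before encoding, because the ``echo`` command above emits one, so the result matches the set name shown. Whether DataCite treats the form without the newline identically is not stated here, so that variant is offered only as an untested alternative.

```python
import base64


def datacite_dynamic_set(query, trailing_newline=True):
    """Return a DataCite dynamic set name for a search API query string.

    trailing_newline=True mimics `echo "<query>" | base64`, which encodes
    the query followed by a newline character.
    """
    raw = query + ("\n" if trailing_newline else "")
    encoded = base64.b64encode(raw.encode("utf-8")).decode("ascii")
    # The "~" prefix marks the set name as an encoded query.
    return "~" + encoded


print(datacite_dynamic_set("doi:10.7910/DVN/TJCLKP"))
# → ~ZG9pOjEwLjc5MTAvRFZOL1RKQ0xLUAo=
```

The returned string can be used as the value of ``set`` in the harvesting client configuration.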
5398+
5399+
53705400
.. _pids-api:
53715401
53725402
PIDs
