Example: Integrated Ocean Drilling Project DSpace
The Integrated Ocean Drilling Project has a prototype DSpace installation at http://data.oceandrilling.org/xmlui that can dump its contents as CSV. This example shows how to convert that dump to RDF.
Use git to check out https://github.com/timrdf/csv2rdf4lod-automation/tree/master/doc/examples/source/iodp-org/scientific-ocean-drilling-repository-dspace. Because this example ships in csv2rdf4lod-automation's examples directory, updating your copy of the converter is enough to get it.
$ cd $CSV2RDF4LOD_HOME/doc/examples/source/iodp-org/scientific-ocean-drilling-repository-dspace/version
(Note that this example is a special case - it is bundled as part of the converter directory. For your normal projects, your [data root](csv2rdf4lod-automation data root) should be outside of the converter installation.)
$ ./retrieve.sh
will [create a new version](Automated creation of a new Versioned Dataset) using the data file sandboxMeta.csv, which I cached from Doug's email. The script can be changed to retrieve from the DSpace URL that provides the dynamic CSV instead.
$ cd 2011-Oct-18/
puts you into the conversion cockpit for the version just created, which already has the conversion results:
$ ls -l automatic/
total 584
-rw-r--r-- 1 lebot staff 179406 Oct 18 10:14 sandboxMeta.csv.e1.ttl
-rw-r--r-- 1 lebot staff 43495 Oct 18 10:14 sandboxMeta.csv.e1.void.ttl
-rw-r--r-- 1 lebot staff 59505 Oct 18 10:14 sandboxMeta.csv.e1.sample.ttl
Running the publish script:
$ publish/bin/publish.sh
will aggregate the conversions (which is described [here](Conversion process phase: publish))
$ ls publish/
iodp-org-scientific-ocean-drilling-repository-dspace-2011-Oct-18.ttl
iodp-org-scientific-ocean-drilling-repository-dspace-2011-Oct-18.e1.ttl
iodp-org-scientific-ocean-drilling-repository-dspace-2011-Oct-18.e1.sample.ttl
iodp-org-scientific-ocean-drilling-repository-dspace-2011-Oct-18.void.ttl
iodp-org-scientific-ocean-drilling-repository-dspace-2011-Oct-18.pml.ttl
iodp-org-scientific-ocean-drilling-repository-dspace-2011-Oct-18.nt.graph
and create the following publishing options (also described [here](Conversion process phase: publish)):
$ ls publish/bin/
publish.sh
ln-to-www-root-iodp-org-scientific-ocean-drilling-repository-dspace-2011-Oct-18.sh
tdbloader-iodp-org-scientific-ocean-drilling-repository-dspace-2011-Oct-18.sh
virtuoso-delete-iodp-org-scientific-ocean-drilling-repository-dspace-2011-Oct-18.sh
virtuoso-load-iodp-org-scientific-ocean-drilling-repository-dspace-2011-Oct-18.sh
4store-iodp-org-scientific-ocean-drilling-repository-dspace-2011-Oct-18.sh
joseki-config-anterior-iodp-org-scientific-ocean-drilling-repository-dspace-2011-Oct-18.ttl
lod-materialize-apache-iodp-org-scientific-ocean-drilling-repository-dspace-2011-Oct-18.sh
lod-materialize-iodp-org-scientific-ocean-drilling-repository-dspace-2011-Oct-18-void.sh
lod-materialize-iodp-org-scientific-ocean-drilling-repository-dspace-2011-Oct-18.sh
and run them according to the values of your CSV2RDF4LOD environment variables.
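The file names in both listings appear to share one base name: the source, dataset, and version identifiers from the directory path, joined with hyphens. The sketch below only illustrates that observed pattern; `publish_basename` is a hypothetical helper and not part of csv2rdf4lod-automation.

```shell
# Hypothetical helper: join the source, dataset, and version identifiers
# into the base name seen in the listings above.
# $1: source identifier, $2: dataset identifier, $3: version identifier
publish_basename() {
  printf '%s-%s-%s\n' "$1" "$2" "$3"
}

base=$(publish_basename iodp-org scientific-ocean-drilling-repository-dspace 2011-Oct-18)

# Two of the names from the listings, rebuilt from the base name.
echo "publish/$base.ttl"
echo "publish/bin/virtuoso-load-$base.sh"
```

The two lines it prints match the aggregate Turtle file and the Virtuoso load script listed above.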
Looking at the complete RDF data file:
$ vi publish/iodp-org-scientific-ocean-drilling-repository-dspace-2011-Oct-18.e1.ttl
we can see the results of the first data row:
:dougDSpaceItem_2
dcterms:isReferencedBy
<http://localhost/source/iodp-org/dataset/scientific-ocean-drilling-repository-dspace/version/2011-Oct-18> ;
void:inDataset
<http://localhost/source/iodp-org/dataset/scientific-ocean-drilling-repository-dspace/version/2011-Oct-18> ;
a owl:Thing , local_vocab:DougDSpaceItem ;
dcterms:identifier "2" ;
e1:in_collection <http://hdl.handle.net/123456789_3> ;
e1:dc_date_accessioned "2011-09-22T07:46:08-04:00"^^xsd:dateTime ;
e1:dc_date_available "2011-09-22T07:46:08-04:00"^^xsd:dateTime ;
e1:dc_date_issued "2011-09-22"^^xsd:date ;
e1:dc_description_provenance_en """Submitted by Douglas Fil...ksum: 9a0e7e1fcbbebfe8342842fb10264b6b (MD5)""" ;
e1:dc_identifier_uri <http://hdl.handle.net/123456789/4> ;
e1:dc_subject_en_us "JR" ;
dcterms:subject "JR" ;
e1:dc_title_en_us "JR Cross Section" ;
dcterms:title "JR Cross Section" ;
e1:dc_type_en_us "Image" ;
dcterms:type "Image" ;
ov:csvRow "2"^^xsd:integer .
Comparing the enhancement parameters shows how little they needed to change from the defaults to create the "better" RDF above (it uses DCTerms, has typed dates and dateTimes, and its subjects are rdf:typed). The enhancement parameters can be changed to suit specific use cases; this enhancement was done without any specific use in mind (other than general RDF consumption).
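A quick way to spot-check those three properties is to grep the Turtle for them. The sketch below runs against a few lines copied from the first data row rather than the real file, so it is self-contained; it is a rough text match, not an RDF-aware check, and it assumes one predicate-object pair per line as in the output above.

```shell
# Sample lines copied from the first data row shown earlier.
sample='e1:dc_date_accessioned "2011-09-22T07:46:08-04:00"^^xsd:dateTime ;
e1:dc_date_available "2011-09-22T07:46:08-04:00"^^xsd:dateTime ;
e1:dc_date_issued "2011-09-22"^^xsd:date ;
e1:dc_title_en_us "JR Cross Section" ;
dcterms:title "JR Cross Section" ;'

# Count lines with literals typed as xsd:dateTime and as xsd:date.
# The trailing space keeps 'xsd:date ' from also matching xsd:dateTime.
datetimes=$(printf '%s\n' "$sample" | grep -c 'xsd:dateTime ')
dates=$(printf '%s\n' "$sample" | grep -c 'xsd:date ')

# Count lines whose predicate comes from DCTerms.
dcterms=$(printf '%s\n' "$sample" | grep -c '^[[:space:]]*dcterms:')

echo "dateTime=$datetimes date=$dates dcterms=$dcterms"
# prints: dateTime=2 date=1 dcterms=1
```

To try it on the real conversion, replace the `printf` pipelines with the path to the `.e1.ttl` file as grep's input.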
Since four unit tests are included in the data skeleton, we can check whether subsequent conversions of the same or similar data conform to the intended structures. Two tests fail, both to remind us that the current conversion isn't as good as we'd like and to give us concrete test cases for resolving the outstanding issues.
bash-3.2$ cr-test-conversion.sh --setup -v
publish/iodp-org-scientific-ocean-drilling-repository-dspace-2011-Oct-18.e1.sample.ttl
publish/iodp-org-scientific-ocean-drilling-repository-dspace-2011-Oct-18.e1.ttl
publish/iodp-org-scientific-ocean-drilling-repository-dspace-2011-Oct-18.pml.ttl
publish/iodp-org-scientific-ocean-drilling-repository-dspace-2011-Oct-18.ttl
publish/iodp-org-scientific-ocean-drilling-repository-dspace-2011-Oct-18.void.ttl
WARN [main] (FactoryGraphTDB.java:241) - No BGP optimizer
Load: publish/iodp-org-scientific-ocean-drilling-repository-dspace-2011-Oct-18.ttl
3,593 triples: loaded in 1.0 seconds [3,673.8 triples/s]
................................................................................
../../rq/test/ask/absent/huge-date.rq (Ask => No)
?item e1:dc_date_issued "2189-07-09"^^xsd:date
-\-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-!-/ \ \ \ FAIL / /
../../rq/test/ask/absent/squashing-uri-bad.rq (Ask => Yes)
?item e1:in_collection <http://hdl.handle.net/123456789_3> # should be http://hdl.handle.net/123456789/3
# see https://github.com/timrdf/csv2rdf4lod-automation/issues/246
................................................................................
../../rq/test/ask/present/dcterms-title.rq (Ask => Yes)
?item e1:dc_title_en_us "JR Cross Section"; # Default predicate
dcterms:title "JR Cross Section" # More commonly recognized predicate
-\-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-!-/ \ \ \ FAIL / /
../../rq/test/ask/present/reuse-subject-uri.rq (Ask => No)
<http://hdl.handle.net/123456789/4> e1:dc_identifier_uri <http://hdl.handle.net/123456789/4>
--------------------------------------------------------------------------------
2 of 4 passed
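The tally above is consistent with a simple convention: a query under `ask/present/` passes when the ASK answer is Yes, and a query under `ask/absent/` passes when the answer is No. The sketch below replays the four results under that assumption. The rule is inferred from the output, not read from cr-test-conversion.sh, and `ask_verdict` is a hypothetical helper.

```shell
# Hypothetical helper, not part of csv2rdf4lod-automation.
# $1: path to an ASK query; $2: the answer the query returned (Yes or No).
ask_verdict() {
  case "$1" in
    */ask/present/*) expected=Yes ;;  # the pattern should be in the data
    */ask/absent/*)  expected=No  ;;  # the pattern should not be in the data
    *) echo unknown; return 1 ;;
  esac
  if [ "$2" = "$expected" ]; then echo pass; else echo FAIL; fi
}

passed=0
total=0
# Query paths and answers copied from the test output above.
while read -r query answer; do
  total=$((total + 1))
  verdict=$(ask_verdict "$query" "$answer")
  if [ "$verdict" = pass ]; then
    passed=$((passed + 1))
  fi
  echo "$verdict $query"
done <<'EOF'
../../rq/test/ask/absent/huge-date.rq No
../../rq/test/ask/absent/squashing-uri-bad.rq Yes
../../rq/test/ask/present/dcterms-title.rq Yes
../../rq/test/ask/present/reuse-subject-uri.rq No
EOF

echo "$passed of $total passed"
```

Under this reading, the two failures are the squashed collection URI (present when it should be absent) and the item URI that is not yet reused as the subject (absent when it should be present).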