
Example: Integrated Ocean Drilling Project DSpace

Timothy Lebo edited this page Feb 14, 2012 · 11 revisions
csv2rdf4lod-automation is licensed under the [Apache License, Version 2.0](https://github.com/timrdf/csv2rdf4lod-automation/wiki/License)

The Integrated Ocean Drilling Project has a prototype DSpace installation at http://data.oceandrilling.org/xmlui, which dumps CSV of its contents. This example shows how to convert its dump to RDF.

Use git to check out https://github.com/timrdf/csv2rdf4lod-automation/tree/master/doc/examples/source/iodp-org/scientific-ocean-drilling-repository-dspace. Because this example ships with csv2rdf4lod-automation, updating your converter installation is enough to get it.

$ cd $CSV2RDF4LOD_HOME/doc/examples/source/iodp-org/scientific-ocean-drilling-repository-dspace/version

(Note that this example is a special case - it is bundled as part of the converter directory. For your normal projects, your [data root](csv2rdf4lod-automation data root) should be outside of the converter installation.)

$ ./retrieve.sh

will [create a new version](Automated creation of a new Versioned Dataset) using the data file sandboxMeta.csv, which I cached from Doug's email. This can be changed to the DSpace URL that provides the dynamic CSV.

$ cd 2011-Oct-18/

puts you into the conversion cockpit for the version just created, which already has the conversion results:

$ ls -l automatic/
total 584
-rw-r--r--  1 lebot  staff  179406 Oct 18 10:14 sandboxMeta.csv.e1.ttl
-rw-r--r--  1 lebot  staff   43495 Oct 18 10:14 sandboxMeta.csv.e1.void.ttl
-rw-r--r--  1 lebot  staff   59505 Oct 18 10:14 sandboxMeta.csv.e1.sample.ttl

Running the publish script:

$ publish/bin/publish.sh

will aggregate the conversions (which is described [here](Conversion process phase: publish))

$ ls publish/

iodp-org-scientific-ocean-drilling-repository-dspace-2011-Oct-18.ttl
iodp-org-scientific-ocean-drilling-repository-dspace-2011-Oct-18.e1.ttl
iodp-org-scientific-ocean-drilling-repository-dspace-2011-Oct-18.e1.sample.ttl
iodp-org-scientific-ocean-drilling-repository-dspace-2011-Oct-18.void.ttl
iodp-org-scientific-ocean-drilling-repository-dspace-2011-Oct-18.pml.ttl
iodp-org-scientific-ocean-drilling-repository-dspace-2011-Oct-18.nt.graph

and create the following publishing options (also described [here](Conversion process phase: publish)):

$ ls publish/bin/

publish.sh
ln-to-www-root-iodp-org-scientific-ocean-drilling-repository-dspace-2011-Oct-18.sh
tdbloader-iodp-org-scientific-ocean-drilling-repository-dspace-2011-Oct-18.sh
virtuoso-delete-iodp-org-scientific-ocean-drilling-repository-dspace-2011-Oct-18.sh
virtuoso-load-iodp-org-scientific-ocean-drilling-repository-dspace-2011-Oct-18.sh
4store-iodp-org-scientific-ocean-drilling-repository-dspace-2011-Oct-18.sh
joseki-config-anterior-iodp-org-scientific-ocean-drilling-repository-dspace-2011-Oct-18.ttl
lod-materialize-apache-iodp-org-scientific-ocean-drilling-repository-dspace-2011-Oct-18.sh
lod-materialize-iodp-org-scientific-ocean-drilling-repository-dspace-2011-Oct-18-void.sh
lod-materialize-iodp-org-scientific-ocean-drilling-repository-dspace-2011-Oct-18.sh

and run them according to the values of your CSV2RDF4LOD environment variables.

Looking at the complete RDF data file:

$ vi publish/iodp-org-scientific-ocean-drilling-repository-dspace-2011-Oct-18.e1.ttl

we can see the results of the first data row:

:dougDSpaceItem_2 
  dcterms:isReferencedBy 
  <http://localhost/source/iodp-org/dataset/scientific-ocean-drilling-repository-dspace/version/2011-Oct-18> ;
   void:inDataset 
  <http://localhost/source/iodp-org/dataset/scientific-ocean-drilling-repository-dspace/version/2011-Oct-18> ;
   a owl:Thing , local_vocab:DougDSpaceItem ;
   dcterms:identifier "2" ;
   e1:in_collection <http://hdl.handle.net/123456789_3> ;
   e1:dc_date_accessioned "2011-09-22T07:46:08-04:00"^^xsd:dateTime ;
   e1:dc_date_available "2011-09-22T07:46:08-04:00"^^xsd:dateTime ;
   e1:dc_date_issued "2011-09-22"^^xsd:date ;
   e1:dc_description_provenance_en """Submitted by Douglas Fil...ksum: 9a0e7e1fcbbebfe8342842fb10264b6b (MD5)""" ;
   e1:dc_identifier_uri <http://hdl.handle.net/123456789/4> ;
   e1:dc_subject_en_us "JR" ;
   dcterms:subject "JR" ;
   e1:dc_title_en_us "JR Cross Section" ;
   dcterms:title "JR Cross Section" ;
   e1:dc_type_en_us "Image" ;
   dcterms:type "Image" ;
   ov:csvRow "2"^^xsd:integer .

Comparing the enhancement parameters shows how little they needed to change from the defaults to create the "better" RDF above (it uses DCTerms, has typed dates and dateTimes, and its subjects are rdf:typed). The enhancement parameters can be changed to suit specific use cases; this enhancement was done without any specific use in mind (other than general RDF consumption).
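One detail in the row above deserves a second look: `e1:in_collection` points at `<http://hdl.handle.net/123456789_3>`, while `e1:dc_identifier_uri` keeps the slash in `<http://hdl.handle.net/123456789/4>`. A minimal sketch for spotting such handle URIs with plain `grep` follows. The function name `squashed_handles` is hypothetical and not part of csv2rdf4lod-automation, and the pattern assumes numeric handle prefixes and suffixes like the ones shown here.

```shell
#!/usr/bin/env bash
# Two lines copied from the first data row shown above.
sample='   e1:in_collection <http://hdl.handle.net/123456789_3> ;
   e1:dc_identifier_uri <http://hdl.handle.net/123456789/4> ;'

# Hypothetical helper: read Turtle on stdin and print each distinct handle URI
# whose prefix and suffix are joined by "_" instead of "/".
squashed_handles() {
  grep -oE '<http://hdl\.handle\.net/[0-9]+_[0-9]+>' | sort -u
}

printf '%s\n' "$sample" | squashed_handles
# prints: <http://hdl.handle.net/123456789_3>

# To scan the published file from this example instead:
# squashed_handles < publish/iodp-org-scientific-ocean-drilling-repository-dspace-2011-Oct-18.e1.ttl
```

An empty result means no handle URI matched the squashed pattern; it does not by itself show that every URI is correct.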

Since four unit tests are included in the data skeleton, we can check whether subsequent conversions of the same or similar data conform to the intended structure. Two tests fail, reminding us that the current conversion isn't as good as we'd like and giving us concrete test cases for resolving the outstanding issues.

bash-3.2$ cr-test-conversion.sh --setup -v
publish/iodp-org-scientific-ocean-drilling-repository-dspace-2011-Oct-18.e1.sample.ttl
publish/iodp-org-scientific-ocean-drilling-repository-dspace-2011-Oct-18.e1.ttl
publish/iodp-org-scientific-ocean-drilling-repository-dspace-2011-Oct-18.pml.ttl
publish/iodp-org-scientific-ocean-drilling-repository-dspace-2011-Oct-18.ttl
publish/iodp-org-scientific-ocean-drilling-repository-dspace-2011-Oct-18.void.ttl
 WARN [main] (FactoryGraphTDB.java:241) - No BGP optimizer
Load: publish/iodp-org-scientific-ocean-drilling-repository-dspace-2011-Oct-18.ttl
3,593 triples: loaded in 1.0 seconds [3,673.8 triples/s]
................................................................................
../../rq/test/ask/absent/huge-date.rq (Ask => No)

      ?item e1:dc_date_issued "2189-07-09"^^xsd:date

-\-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-!-/           \ \ \ FAIL / /
../../rq/test/ask/absent/squashing-uri-bad.rq (Ask => Yes)

      ?item e1:in_collection <http://hdl.handle.net/123456789_3> # should be http://hdl.handle.net/123456789/3
      # see https://github.com/timrdf/csv2rdf4lod-automation/issues/246

................................................................................
../../rq/test/ask/present/dcterms-title.rq (Ask => Yes)

      ?item e1:dc_title_en_us "JR Cross Section"; # Default predicate
            dcterms:title     "JR Cross Section"  # More commonly recognized predicate

-\-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-!-/           \ \ \ FAIL / /
../../rq/test/ask/present/reuse-subject-uri.rq (Ask => No)

      <http://hdl.handle.net/123456789/4> e1:dc_identifier_uri <http://hdl.handle.net/123456789/4>

--------------------------------------------------------------------------------
2 of 4 passed
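Reading the output above, each query sits under either `ask/present/` or `ask/absent/`, and the verdict is consistent with a simple rule: a `present` query passes when its ASK answer is Yes, and an `absent` query passes when its answer is No. The sketch below restates that rule as inferred from this output. It is not the logic of `cr-test-conversion.sh` itself, and the function name `verdict` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of csv2rdf4lod-automation): derive a verdict
# from a test query's path and the answer its ASK query returned.
verdict() {
  local query_path="$1" answer="$2"
  case "$query_path" in
    # The pattern must be present in the data: ASK should answer Yes.
    */ask/present/*) if [ "$answer" = "Yes" ]; then echo PASS; else echo FAIL; fi ;;
    # The pattern must be absent from the data: ASK should answer No.
    */ask/absent/*)  if [ "$answer" = "No" ];  then echo PASS; else echo FAIL; fi ;;
    *) echo UNKNOWN ;;
  esac
}

# The four queries and answers reported in the run above.
verdict ../../rq/test/ask/absent/huge-date.rq No            # PASS
verdict ../../rq/test/ask/absent/squashing-uri-bad.rq Yes   # FAIL
verdict ../../rq/test/ask/present/dcterms-title.rq Yes      # PASS
verdict ../../rq/test/ask/present/reuse-subject-uri.rq No   # FAIL
```

Applied to the four reported answers, the rule yields two passes and two failures, matching the "2 of 4 passed" summary.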