
Darrin Freshwater Institute detailed data conversion notes

apseyed edited this page Jun 25, 2013 · 18 revisions

In what follows we give example enhancements, and the RDF statements that result from them, for our conversion of the Darrin Freshwater Institute water chemistry data, which is available in tabular format. Briefly, enhancements are ways of modifying a naive conversion of tabular data to RDF so that the result states more explicitly what the data is about, in some cases by using established vocabularies and ontologies.

For our chemistry data tables, the first column represents samples:

 conversion:enhance [
         ov:csvCol          1;
         ov:csvHeader       "Accession Code (sample#)";
         conversion:label   "Accession Code (sample#)";
         a conversion:DataStartRow;
         conversion:comment "A unique numeric reference to a specific water sample taken from a lake";
         conversion:range   rdfs:Resource;
         conversion:range_name  "WaterSample";
         conversion:equivalent_property oboe:ofEntity;
      ];

Given the "equivalent_property" enhancement, each of the measurement columns that we later enhance as "cell-based" is in the oboe:ofEntity relationship with the URI generated for the sample. We'll see later that every measurement is taken of a certain sample, and that there are multiple measurements taken of each sample.

In the second column, we have a literal that represents the "human readable" name of a lake:

      conversion:enhance [
         ov:csvCol          2;
         ov:csvHeader       "Lake Name";
         conversion:label   "Lake Name";
         conversion:equivalent_property rdfs:label;
         conversion:comment "The name of the lake the sample is from";
         conversion:range   rdfs:Literal;
         conversion:bundled_by [ ov:csvCol 3 ];
      ];

This column is "enhanced" in the sense that it is "bundled by" column 3, which represents each distinct lake. Column 3 enhancements are:


 conversion:enhance [
         ov:csvCol          3;
         ov:csvHeader       "DEC Code";
         conversion:label   "DEC Code";
         conversion:comment "A numeric reference to a lake in the DEC registry, unique to a specific lake";
         conversion:range   rdfs:Resource;
         conversion:range_name  "Lake";
         conversion:equivalent_property oboe:hasContext;
         conversion:bundled_by [ ov:csvCol 1 ];
         conversion:range_template "[/sd]typed/lake/[.]";
      ];

Given the enhancements for column 2 and the specification of column 3 as a resource, RDF statements like the following are generated:

<http://purl.org/twc/semantgeo/source/aeap_nys/dataset/dfw_lake_samples/typed/lake/040750A> rdfs:label "Windfall" .
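
How the range_template for column 3 appears to expand can be sketched in Python. This is only an illustration inferred from the example triple above, not csv2rdf4lod's actual implementation: the function name `expand_template` is hypothetical, and the sketch assumes that `[/sd]` stands for the source/dataset base URI and `[.]` for the value of the current cell.

```python
# Hypothetical sketch of template expansion; not the converter's real code.
# The base URI is taken from the example triple on this page.
DATASET_BASE = "http://purl.org/twc/semantgeo/source/aeap_nys/dataset/dfw_lake_samples/"


def expand_template(template: str, cell_value: str,
                    dataset_base: str = DATASET_BASE) -> str:
    """Replace the [/sd] and [.] placeholders (assumed meanings) in a template."""
    return template.replace("[/sd]", dataset_base).replace("[.]", cell_value)


# Column 3 holds the DEC Code; column 2 (the lake name) is bundled by it.
lake_uri = expand_template("[/sd]typed/lake/[.]", "040750A")
print(f'<{lake_uri}> rdfs:label "Windfall" .')
```

Because the DEC Code is unique per lake, every row for the same lake yields the same URI under this reading, so the labels from column 2 all attach to one lake resource.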

Column 3 has as values DEC Codes, which are state-wide unique identifiers for NY State Lakes. We use these identifiers as the basis for URIs for lakes for our Darrin Freshwater Institute RDF dataset.

Please also note that our representation of lakes is in turn bundled by another column, column 1 for water samples, where the relationship between sample and lake is declared as oboe:hasContext. Here is an example set of triples that results from this enhancement:

<http://purl.org/twc/semantgeo/source/aeap_nys/dataset/dfw_lake_samples/aeap-nyserda-chem-94-12-v9-web/typed/watersample/10646116> dcterms:identifier "10646116" ;
        a dfw_lake_samples_vocab:WaterSample ;
        rdfs:label "10646116" ;
        oboe:hasContext <http://purl.org/twc/semantgeo/source/aeap_nys/dataset/dfw_lake_samples/typed/lake/040750A> .

Continuing left to right on the columns, we also have a conversion for the date column:

      conversion:enhance [
         ov:csvCol          4;
         ov:csvHeader       "Date";
         conversion:label   "Date";
         conversion:comment "The date the sample was taken from the lake";
         conversion:range   xsd:date;
         conversion:eg      "13-Aug-89";
         conversion:pattern "dd-MMM-yy"; # This is different from the current documentation
         conversion:equivalent_property time:inXSDDateTime;
      ];
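
To make the effect of the date pattern concrete, here is a minimal Python sketch that turns a value such as "13-Aug-89" into the ISO form used by xsd:date. It is an approximation rather than what the converter does internally: `to_xsd_date` is a hypothetical helper, `%d-%b-%y` is Python's closest equivalent of `dd-MMM-yy`, and the converter may resolve two-digit years differently from Python.

```python
from datetime import datetime


def to_xsd_date(cell: str) -> str:
    """Parse a 'dd-MMM-yy' style cell and return an ISO 8601 date string."""
    # %b expects English month abbreviations under the default locale.
    # Python maps two-digit years 69-99 to 1969-1999 and 00-68 to 2000-2068.
    return datetime.strptime(cell, "%d-%b-%y").date().isoformat()


print(to_xsd_date("13-Aug-89"))
```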

Further, we bundle columns 5-9 into an implicit bundle. We show just column 9 as an example:

conversion:enhance [
         ov:csvCol          9;
         ov:csvHeader       "Secchi (m)";
         conversion:label   "Secchi (m)";
         conversion:comment "TODO: has to do with how clear the water is";
         conversion:range   xsd:decimal;
         conversion:bundled_by :a_location_bundle;
      ];

The implicit bundle is defined as follows:

:a_location_bundle
   a conversion:ImplicitBundle;
   conversion:property_name oboe:hasContext;
   conversion:name_template "[/sd]sampleLocation/[r]";
   conversion:type_name     prov:Location .

So in the resulting RDF we get statements like the following:

<http://purl.org/twc/semantgeo/source/aeap_nys/dataset/dfw_lake_samples/sampleLocation/2008> a prov:Location ;
        e1:z_max_m "5.9"^^xsd:decimal ;
        e1:sample_type "G" ;
        e1:sample_z_m "5"^^xsd:decimal ;
        e1:sample_layer "Yh" ;
        e1:secchi_m "3.5"^^xsd:decimal .
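
The effect of the implicit bundle can be sketched as follows. This is an illustration under stated assumptions, not the converter's code: `location_bundle` is a hypothetical helper, `[/sd]` is assumed to be the dataset base URI, and `[r]` is assumed to be the CSV row number (which matches the sampleLocation/2008 URI above).

```python
# Base URI taken from the example statements on this page.
DATASET_BASE = "http://purl.org/twc/semantgeo/source/aeap_nys/dataset/dfw_lake_samples/"
NAME_TEMPLATE = "[/sd]sampleLocation/[r]"


def location_bundle(row_number: int, bundled: dict) -> str:
    """Render one implicit-bundle resource as Turtle-like text (illustrative)."""
    # Assumption: [/sd] is the dataset base and [r] is the CSV row number.
    uri = NAME_TEMPLATE.replace("[/sd]", DATASET_BASE).replace("[r]", str(row_number))
    lines = [f"<{uri}> a prov:Location"]
    # Each bundled column becomes a property of the location resource
    # instead of a property of the row's own subject.
    lines += [f"        e1:{name} {value}" for name, value in bundled.items()]
    return " ;\n".join(lines) + " ."


print(location_bundle(2008, {"sample_type": '"G"',
                             "secchi_m": '"3.5"^^xsd:decimal'}))
```

The row's subject is then linked to this location through the bundle's property_name, here oboe:hasContext.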

We established previously that a water sample has as its context (oboe:hasContext, i.e., is taken from) a lake. It is also the case that a measurement is of the entity (oboe:ofEntity) a water sample. This knowledge will come in handy for later SemantEco module development. Remember, then, that the path from a measurement to a lake is through its water sample. An example set of statements for a measurement (represented through a column) is given below:

:waterMeasurement_2_11 dcterms:isReferencedBy <http://purl.org/twc/semantgeo/source/aeap_nys/dataset/dfw_lake_samples/version/2013-April-24> ;
        void:inDataset <http://purl.org/twc/semantgeo/source/aeap_nys/dataset/dfw_lake_samples/version/2013-April-24> ;
        a water:WaterMeasurement , dfw_lake_samples_vocab:WaterMeasurement ;
        oboe:ofEntity <http://purl.org/twc/semantgeo/source/aeap_nys/dataset/dfw_lake_samples/aeap-nyserda-chem-94-12-v9-web/typed/watersample/9446846> ;
        time:inXSDDateTime "1994-06-30"^^xsd:date ;
        oboe:hasContext <http://purl.org/twc/semantgeo/source/aeap_nys/dataset/dfw_lake_samples/sampleLocation/2> ;
        pollution:hasCharacteristic pollution:PH ;
        pollution:hasUnit "SU" ;
        pollution:hasValue "5.1"^^xsd:decimal ;
        ov:csvRow "2"^^xsd:integer ;
        ov:csvCol "11"^^xsd:integer ;
        e1:hydrologic_type "TDL" ;
        e1:watershed "black" ;
        e1:watershed_area_ha "9643.7" ;
        e1:wa_sa_ratio "18.84" ;
        e1:surface_area_ha "511.9" ;
        e1:surface_area_m2 "5119000" ;
        ov:subjectDiscriminator dfw_lake_samples:aeap-nyserda-chem-94-12-v9-web ;
        dcterms:isReferencedBy dfw_lake_samples:aeap-nyserda-chem-94-12-v9-web .

For the sample:

<http://purl.org/twc/semantgeo/source/aeap_nys/dataset/dfw_lake_samples/aeap-nyserda-chem-94-12-v9-web/typed/watersample/9446846> dcterms:identifier "9446846" ;
        a dfw_lake_samples_vocab:WaterSample ;
        rdfs:label "9446846" .

And context being the lake:

<http://purl.org/twc/semantgeo/source/aeap_nys/dataset/dfw_lake_samples/aeap-nyserda-chem-94-12-v9-web/typed/watersample/9446846> 
   oboe:hasContext 
   <http://purl.org/twc/semantgeo/source/aeap_nys/dataset/dfw_lake_samples/typed/lake/040752> .

As you might have noticed, we are seeing many more triples than we are probably interested in. As a summary of sorts, I'll put the most crucial triples in one place, setting aside the full URIs and even CURIEs so that you, the reader, can focus on the relationships rather than on the features of the RDF specification (I take this one step further and write them in typical predicate logic form):

hasContext(waterSample01, LakeGeorge)
ofEntity(waterMeasurement01, waterSample01)
instanceOf(waterMeasurement01, WaterMeasurement)
hasCharacteristic(waterMeasurement01, pH)
hasValue(waterMeasurement01, "24.3")
hasUnit(waterMeasurement01, mg/L)
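
The path from a measurement to its lake can be followed mechanically over these facts. The Python sketch below is only an illustration: it stores the six facts as plain tuples, and the helper names `objects` and `lakes_of` are hypothetical.

```python
# The six summary facts as (predicate, subject, object) tuples.
FACTS = [
    ("hasContext", "waterSample01", "LakeGeorge"),
    ("ofEntity", "waterMeasurement01", "waterSample01"),
    ("instanceOf", "waterMeasurement01", "WaterMeasurement"),
    ("hasCharacteristic", "waterMeasurement01", "pH"),
    ("hasValue", "waterMeasurement01", "24.3"),
    ("hasUnit", "waterMeasurement01", "mg/L"),
]


def objects(predicate, subject, facts=FACTS):
    """Return every object o such that predicate(subject, o) holds."""
    return [o for p, s, o in facts if p == predicate and s == subject]


def lakes_of(measurement, facts=FACTS):
    """Follow ofEntity to the sample(s), then hasContext to the lake(s)."""
    return [lake
            for sample in objects("ofEntity", measurement, facts)
            for lake in objects("hasContext", sample, facts)]


print(lakes_of("waterMeasurement01"))
```

In SPARQL terms this corresponds to a two-step path: ofEntity from the measurement to the sample, then hasContext from the sample to the lake.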

What's important here is that a measurement is bound to a characteristic, a value, and a unit. There are other considerations, like the date the sample was taken and the geospatial coordinates in the lake where the sample was taken. For the latter, in this dataset there is a single sample point for each lake, based on the lake's center, and we use the lat/long of the lake for the Google Maps plotting in SemantEco.

Note that we have yet to tie watersheds to unique identifiers. Taking a step back, we note that at the global level the conversions are for water measurements:


 conversion:enhance [
          conversion:domain_name "WaterMeasurement";
      ];

which are subclasses of the class WaterMeasurement of the SemantEco water.owl ontology. This will be crucial for integration with SemantEco, which we will cover in detail in a later wiki page.

      conversion:enhance [
         conversion:class_name "WaterMeasurement";
         conversion:subclass_of <http://escience.rpi.edu/ontology/semanteco/2/0/water.owl#WaterMeasurement>;
      ];

They are water measurements in that some quality or characteristic of water is being measured.

Cell-based conversion, for those columns that hold the value of some measurement, is covered by the enhancement:

  conversion:enhance [ # Multi line n-ary assignment
         ov:csvCol 10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46;
         a qb:Observation;
         rdf:type           water:WaterMeasurement;
         prov:atLocation query:site;
         conversion:equivalent_property <http://escience.rpi.edu/ontology/semanteco/2/0/pollution.owl#hasCharacteristic>;
      ];

In a recent update to csv2rdf4lod, this can now be specified with a range:

conversion:enhance [
         conversion:fromCol        10;
         conversion:toCol          46;
         a qb:Observation;
         rdf:type           water:WaterMeasurement;
         prov:atLocation query:site;
         conversion:equivalent_property <http://escience.rpi.edu/ontology/semanteco/2/0/pollution.owl#hasCharacteristic>;
      ];
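
The equivalence between the range form and the explicit column list, and the idea that a cell-based conversion yields one measurement per cell, can be sketched in Python. The helper names are hypothetical, and the subject naming pattern is inferred from the `:waterMeasurement_2_11` example earlier on this page (ov:csvRow 2, ov:csvCol 11) rather than taken from the converter's documentation.

```python
def column_range(from_col: int, to_col: int) -> list:
    """Expand an inclusive fromCol/toCol pair into the explicit column list."""
    return list(range(from_col, to_col + 1))


def measurement_names(row: int, columns: list) -> list:
    """One measurement per cell, named by row and column (pattern assumed)."""
    return [f":waterMeasurement_{row}_{col}" for col in columns]


cols = column_range(10, 46)
# The explicit ov:csvCol list that the range replaces.
print(",".join(str(c) for c in cols))
# Subjects generated for the first two measurement cells of row 2.
print(measurement_names(2, cols)[:2])
```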

Edit: it looks like we use hasContext to relate a water measurement to the site in the lake where the measurement was taken, and also hasContext to relate a water sample to the lake it is taken from. We should also relate the lake site to the lake (part-of?), but note that, for the dataset we are working with, the site is the very same for all the samples for a lake. Also regarding the lake site, we have yet to use the values of z max m, sample z m, or secchi m in any of our SemantEco run queries. We'll come back to this point later.

Going forward, how do we use these enhancements to convert the Darrin Freshwater Institute's species data?
