Darrin Freshwater Institute detailed data conversion notes
In what follows we provide example enhancements, and the example RDF statements that result from them, for our conversion of the Darrin Fresh Water Institute water chemistry data, which is available in tabular format. Briefly, enhancements are ways in which a simple conversion of tabular data to RDF can be modified to represent more explicitly what the data is about, in some cases by using established vocabularies/ontologies.
For our chemistry data tables, the first column represents samples:
conversion:enhance [
ov:csvCol 1;
ov:csvHeader "Accession Code (sample#)";
conversion:label "Accession Code (sample#)";
a conversion:DataStartRow;
conversion:comment "A unique numeric reference to a specific water sample taken from a lake";
conversion:range rdfs:Resource;
conversion:range_name "WaterSample";
conversion:equivalent_property oboe:ofEntity;
];
Given the "equivalent_property" enhancement, each of the measurement columns that we later enhance as "cell-based" is in the oboe:ofEntity relationship with the URI generated for the sample. We'll see later that every measurement is taken of a certain sample, and that multiple measurements are taken of each sample.
In the second column, we have a literal that represents the "human readable" name of a lake:
conversion:enhance [
ov:csvCol 2;
ov:csvHeader "Lake Name";
conversion:label "Lake Name";
conversion:equivalent_property rdfs:label;
conversion:comment "The name of the lake the sample is from";
conversion:range rdfs:Literal;
conversion:bundled_by [ ov:csvCol 3 ];
];
This column is "enhanced" in the sense that it is "bundled by" column 3, which represents each distinct lake. Column 3 enhancements are:
conversion:enhance [
ov:csvCol 3;
ov:csvHeader "DEC Code";
conversion:label "DEC Code";
conversion:comment "A numeric reference to a lake in the DEC registry, unique to a specific lake";
conversion:range rdfs:Resource;
conversion:range_name "Lake";
conversion:equivalent_property oboe:hasContext;
conversion:bundled_by [ ov:csvCol 1 ];
conversion:range_template "[/sd]typed/lake/[.]";
];
Given the enhancements for column 2 and the specification of column 3 as a resource, RDF statements like the following are generated:
<http://purl.org/twc/semantgeo/source/aeap_nys/dataset/dfw_lake_samples/typed/lake/040750A>
rdfs:label "Windfall" .
Column 3 has as values DEC Codes, which are state-wide unique identifiers for NY State Lakes. We use these identifiers as the basis for URIs for lakes for our Darrin Freshwater Institute RDF dataset.
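The way the range_template for column 3 turns a DEC Code into a lake URI can be sketched as below. This is only an illustration, not csv2rdf4lod's implementation: the function name expand_range_template is hypothetical, and the value substituted for "[/sd]" is an assumption inferred from the example lake URI shown above.

```python
# Assumed expansion of "[/sd]" (the source/dataset base), inferred from the
# example lake URI on this page; it is not read from the converter itself.
SD_BASE = "http://purl.org/twc/semantgeo/source/aeap_nys/dataset/dfw_lake_samples/"


def expand_range_template(template: str, cell_value: str) -> str:
    """Hypothetical helper: fill "[/sd]" with the dataset base and "[.]" with the cell value."""
    # "[.]" is replaced last so that the cell value itself is never re-expanded.
    return template.replace("[/sd]", SD_BASE).replace("[.]", cell_value)


lake_uri = expand_range_template("[/sd]typed/lake/[.]", "040750A")
print(lake_uri)
```

Because the DEC Code is unique state-wide, the same code always yields the same lake URI, regardless of which row or file it appears in.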
Please also note that our representation of lakes is in turn bundled by another column, column 1 for water samples, where the relationship between sample and lake is declared as oboe:hasContext. Here is an example of the triples that result from this enhancement:
<http://purl.org/twc/semantgeo/source/aeap_nys/dataset/dfw_lake_samples/aeap-nyserda-chem-94-12-v9-web/typed/watersample/10646116> dcterms:identifier "10646116" ;
a dfw_lake_samples_vocab:WaterSample ;
rdfs:label "10646116" ;
oboe:hasContext <http://purl.org/twc/semantgeo/source/aeap_nys/dataset/dfw_lake_samples/typed/lake/040750A> .
Continuing left to right on the columns, we also have a conversion for the date column:
conversion:enhance [
ov:csvCol 4;
ov:csvHeader "Date";
conversion:label "Date";
conversion:comment "The date the sample was taken from the lake";
conversion:range xsd:date;
conversion:eg "13-Aug-89";
conversion:pattern "dd-MMM-yy"; # This is different from current documentation
conversion:equivalent_property time:inXSDDateTime;
];
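To make the effect of the "dd-MMM-yy" pattern concrete, here is a minimal sketch that turns the example value "13-Aug-89" into an xsd:date lexical form. The function name and the pivot for two-digit years are assumptions made for illustration; the converter's own rule for two-digit years may differ.

```python
from datetime import date

# English month abbreviations, spelled out so the sketch does not depend on the locale.
MONTHS = {"Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
          "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12}


def parse_dd_mmm_yy(text: str, pivot: int = 69) -> date:
    """Hypothetical helper: parse dates such as "13-Aug-89".

    Assumption: two-digit years >= pivot belong to the 1900s, the rest to the 2000s.
    """
    day, month, year = text.strip().split("-")
    yy = int(year)
    century = 1900 if yy >= pivot else 2000
    return date(century + yy, MONTHS[month], int(day))


print(parse_dd_mmm_yy("13-Aug-89").isoformat())  # 1989-08-13
```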
Further, we bundle columns 5-9 into an implicit bundle. We show just column 9 as an example:
conversion:enhance [
ov:csvCol 9;
ov:csvHeader "Secchi (m)";
conversion:label "Secchi (m)";
conversion:comment "TODO: has to do with how clear the water is";
conversion:range xsd:decimal;
conversion:bundled_by :a_location_bundle;
];
The implicit bundle is defined as follows:
:a_location_bundle
a conversion:ImplicitBundle;
conversion:property_name oboe:hasContext;
conversion:name_template "[/sd]sampleLocation/[r]";
conversion:type_name prov:Location
.
So in the resulting RDF we get statements like the following:
<http://purl.org/twc/semantgeo/source/aeap_nys/dataset/dfw_lake_samples/sampleLocation/2008> a prov:Location ;
e1:z_max_m "5.9"^^xsd:decimal ;
e1:sample_type "G" ;
e1:sample_z_m "5"^^xsd:decimal ;
e1:sample_layer "Yh" ;
e1:secchi_m "3.5"^^xsd:decimal .
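The implicit bundle can be read as: for each row, mint one location resource from the row number and attach the bundled columns to it instead of to the row subject. The sketch below illustrates that reading only. The function name bundle_row and the tuple representation of triples are hypothetical, and the base URI used for "[/sd]" is an assumption inferred from the example output.

```python
# Assumed expansion of "[/sd]", inferred from the example URIs on this page.
SD_BASE = "http://purl.org/twc/semantgeo/source/aeap_nys/dataset/dfw_lake_samples/"


def bundle_row(row_number: int, bundled_cells: dict) -> list:
    """Hypothetical helper: build the triples for one row's implicit location bundle.

    bundled_cells maps a predicate local name (e.g. "secchi_m") to the cell value.
    """
    # name_template "[/sd]sampleLocation/[r]", with "[r]" taken to be the row number
    subject = f"{SD_BASE}sampleLocation/{row_number}"
    triples = [(subject, "rdf:type", "prov:Location")]
    for name, value in bundled_cells.items():
        triples.append((subject, f"e1:{name}", value))
    return triples


# Values copied from the example statements above (row 2008).
location_triples = bundle_row(2008, {
    "z_max_m": "5.9",
    "sample_type": "G",
    "sample_z_m": "5",
    "sample_layer": "Yh",
    "secchi_m": "3.5",
})
for triple in location_triples:
    print(triple)
```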
We established previously that a water sample has Context (i.e., is taken from) a lake. It is also the case that a measurement is of Entity a water sample. This knowledge will come in handy for later SemantEco module development. Remember then, that the path from a measurement to a lake is through its water sample. An example set of statements for a measurement (represented through a column) is given below:
:waterMeasurement_2_11 dcterms:isReferencedBy <http://purl.org/twc/semantgeo/source/aeap_nys/dataset/dfw_lake_samples/version/2013-April-24> ;
void:inDataset <http://purl.org/twc/semantgeo/source/aeap_nys/dataset/dfw_lake_samples/version/2013-April-24> ;
a water:WaterMeasurement , dfw_lake_samples_vocab:WaterMeasurement ;
oboe:ofEntity <http://purl.org/twc/semantgeo/source/aeap_nys/dataset/dfw_lake_samples/aeap-nyserda-chem-94-12-v9-web/typed/watersample/9446846> ;
time:inXSDDateTime "1994-06-30"^^xsd:date ;
oboe:hasContext <http://purl.org/twc/semantgeo/source/aeap_nys/dataset/dfw_lake_samples/sampleLocation/2> ;
pollution:hasCharacteristic pollution:PH ;
pollution:hasUnit "SU" ;
pollution:hasValue "5.1"^^xsd:decimal ;
ov:csvRow "2"^^xsd:integer ;
ov:csvCol "11"^^xsd:integer ;
e1:hydrologic_type "TDL" ;
e1:watershed "black" ;
e1:watershed_area_ha "9643.7" ;
e1:wa_sa_ratio "18.84" ;
e1:surface_area_ha "511.9" ;
e1:surface_area_m2 "5119000" ;
ov:subjectDiscriminator dfw_lake_samples:aeap-nyserda-chem-94-12-v9-web ;
dcterms:isReferencedBy dfw_lake_samples:aeap-nyserda-chem-94-12-v9-web .
For the sample:
<http://purl.org/twc/semantgeo/source/aeap_nys/dataset/dfw_lake_samples/aeap-nyserda-chem-94-12-v9-web/typed/watersample/9446846> dcterms:identifier "9446846" ;
a dfw_lake_samples_vocab:WaterSample ;
rdfs:label "9446846" .
And context being the lake:
<http://purl.org/twc/semantgeo/source/aeap_nys/dataset/dfw_lake_samples/aeap-nyserda-chem-94-12-v9-web/typed/watersample/9446846>
oboe:hasContext
<http://purl.org/twc/semantgeo/source/aeap_nys/dataset/dfw_lake_samples/typed/lake/040752> .
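The path from a measurement to its lake can be sketched as a two-step lookup over the statements above: follow oboe:ofEntity to the sample, then oboe:hasContext from the sample to the lake. The triples are abbreviated to short names for readability, and the helper names objects and lake_of are hypothetical.

```python
# Abbreviated copies of the example statements above (full URIs shortened).
TRIPLES = [
    (":waterMeasurement_2_11", "oboe:ofEntity", "watersample/9446846"),
    (":waterMeasurement_2_11", "oboe:hasContext", "sampleLocation/2"),
    ("watersample/9446846", "oboe:hasContext", "lake/040752"),
]


def objects(subject: str, predicate: str) -> list:
    """Return every object of the given subject/predicate pair."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def lake_of(measurement: str):
    """Follow measurement -> sample (ofEntity) -> lake (hasContext)."""
    for sample in objects(measurement, "oboe:ofEntity"):
        for lake in objects(sample, "oboe:hasContext"):
            return lake
    return None


print(lake_of(":waterMeasurement_2_11"))  # lake/040752
```

Note that following oboe:hasContext directly from the measurement yields the sample location, not the lake, which is why the lookup goes through the sample.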
As you might have noticed, we are seeing many more triples than we are probably interested in. As a summary, I'll therefore collect the most crucial triples in one place, setting aside the full URIs and even CURIEs so that you as the reader can focus on the relationships rather than on RDF syntax. I take this one step further and write them in typical predicate logic form:
hasContext(waterSample01, LakeGeorge)
ofEntity(waterMeasurement01, waterSample01)
instanceOf(waterMeasurement01, WaterMeasurement)
hasCharacteristic(waterMeasurement01, pH)
hasValue(waterMeasurement01, "5.1")
hasUnit(waterMeasurement01, SU)
What's important here is that a measurement is bound to a characteristic, a value, and a unit. There are other considerations, such as the date the sample was taken and the geospatial coordinates of the point in the lake the sample was taken from. For the latter, in this dataset each lake has a single sample point, based on the center of the lake, and we use the lat/long of the lake for the Google Maps plotting in SemantEco.
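The requirement that every measurement carries a characteristic, a value, and a unit can be checked mechanically. The following is a minimal sketch under the assumption that triples are held as plain tuples; the helper name missing_parts is hypothetical, and the second measurement is made up to show an incomplete case.

```python
REQUIRED = ("pollution:hasCharacteristic", "pollution:hasValue", "pollution:hasUnit")


def missing_parts(triples: list, subject: str) -> list:
    """Return the required predicates that the given measurement lacks."""
    present = {p for s, p, _ in triples if s == subject}
    return [p for p in REQUIRED if p not in present]


triples = [
    # Complete measurement, values taken from the pH example above.
    (":waterMeasurement_2_11", "pollution:hasCharacteristic", "pollution:PH"),
    (":waterMeasurement_2_11", "pollution:hasValue", "5.1"),
    (":waterMeasurement_2_11", "pollution:hasUnit", "SU"),
    # Made-up incomplete measurement, for illustration only.
    (":waterMeasurement_2_12", "pollution:hasValue", "3.2"),
]

print(missing_parts(triples, ":waterMeasurement_2_11"))  # []
print(missing_parts(triples, ":waterMeasurement_2_12"))
```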
Note that we have yet to tie watersheds to unique identifiers. Taking a step back, we note that at the global level the conversions are for water measurements:
conversion:enhance [
conversion:domain_name "WaterMeasurement";
];
This will be crucial for integration with SemantEco, which we will cover in detail in a later wiki page. The domain class is declared a subclass of the WaterMeasurement class of the SemantEco water.owl ontology:
conversion:enhance [
conversion:class_name "WaterMeasurement";
conversion:subclass_of <http://escience.rpi.edu/ontology/semanteco/2/0/water.owl#WaterMeasurement>;
];
They are water measurements in that some quality or characteristic of water is being measured.
Cell-based conversion, for those columns that represent the value of some measurement, is covered by the enhancement:
conversion:enhance [ # Multi line n-ary assignment
ov:csvCol 10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46;
a qb:Observation;
rdf:type water:WaterMeasurement;
prov:atLocation query:site;
conversion:equivalent_property <http://escience.rpi.edu/ontology/semanteco/2/0/pollution.owl#hasCharacteristic>;
];
In a recent update to csv2rdf4lod, this can now be specified with a range:
conversion:enhance [
conversion:fromCol 10;
conversion:toCol 46;
a qb:Observation;
rdf:type water:WaterMeasurement;
prov:atLocation query:site;
conversion:equivalent_property <http://escience.rpi.edu/ontology/semanteco/2/0/pollution.owl#hasCharacteristic>;
];
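The range form covers the same columns as the enumerated list, and in a cell-based conversion each cell in that range becomes its own measurement subject. The sketch below illustrates both points. The helper names are hypothetical, and the subject naming pattern is an assumption inferred from the example subject :waterMeasurement_2_11 (csvRow 2, csvCol 11) shown earlier on this page.

```python
def column_range(from_col: int, to_col: int) -> list:
    """Columns covered by fromCol/toCol, inclusive at both ends."""
    return list(range(from_col, to_col + 1))


def cell_subjects(row_number: int, from_col: int, to_col: int) -> list:
    """One subject per cell, following the assumed row_column naming pattern."""
    return [f":waterMeasurement_{row_number}_{col}"
            for col in column_range(from_col, to_col)]


cols = column_range(10, 46)
print(len(cols), cols[0], cols[-1])   # 37 10 46
print(cell_subjects(2, 10, 46)[1])    # :waterMeasurement_2_11
```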
Edit: it looks like we use hasContext both to relate a water measurement to the site in the lake where the measurement was taken, and to relate a water sample to the lake it was taken from. We should also relate the lake site to the lake (part-of?), but note that in the dataset we are working with the site is the same for all samples from a lake. Also regarding the lake site, we have yet to use the values of z max m, sample z m, or secchi m in any of the queries we run in SemantEco. We'll come back to this point later.
Going forward, how do we use these enhancements to convert the Darrin Fresh Water Institute's species data?
Looking at our CSV data file for phytoplankton, it also calls for a cell-based conversion:
conversion:enhance [
conversion:fromCol 3;
conversion:toCol 539;
a qb:Observation;
conversion:equivalent_property <http://escience.rpi.edu/ontology/semanteco/2/0/pollution.owl#hasCharacteristic>;
];
Since units are not part of the dataset, we explicitly assert a statement about the units for each cell-based conversion:
(Here's where a screenshot of the table would be useful.)
conversion:enhance [
ov:csvCol 10;
conversion:fromCol 3;
conversion:toCol 539;
a conversion:SubjectAnnotation;
conversion:predicate <http://escience.rpi.edu/ontology/semanteco/2/0/pollution.owl#hasUnit>;
conversion:object <http://escience.rpi.edu/ontology/semanteco/2/0/pollution.owl#OrganismsPerCubicMeterofWater>;
];
Also, in this case, there is no specific column with identifiers for the samples, contrary to the chemistry data we described above. Therefore we create an implicit bundle for samples:
:a_sample_bundle
a conversion:ImplicitBundle;
conversion:property_name oboe:ofEntity; # Can also be a URI, e.g. dcterms:title.
conversion:name_template "[/sd]waterMeasurement[r]";
conversion:type_name <http://escience.rpi.edu/ontology/semanteco/2/0/water.owl#WaterMeasurement>
.
And we make sure to apply it to all measurement-based (cell-based) columns:
conversion:enhance [
conversion:fromCol 3;
conversion:toCol 539;
conversion:bundled_by :a_sample_bundle;
];
We might question though, is the measurement of phytoplankton within an amount of water a water measurement or a phytoplankton measurement? For now we assume the former, and apply the following enhancement for the row-based conversion:
conversion:enhance [
conversion:domain_name "WaterMeasurement";
];
conversion:enhance [
conversion:class_name "WaterMeasurement";
conversion:subclass_of <http://escience.rpi.edu/ontology/semanteco/2/0/water.owl#WaterMeasurement>;
];
Column 1 is the lake name, as in the chemistry dataset, and we maintain the following enhancement:
conversion:enhance [
ov:csvCol 1;
ov:csvHeader "Lake Name";
conversion:comment "";
conversion:range rdfs:Literal;
];
Next, as before, we provide an enhancement for the date column of the species data:
conversion:enhance [
ov:csvCol 2;
ov:csvHeader "Date";
conversion:comment "";
conversion:range xsd:date;
conversion:pattern "MM/dd/yyyy"; # This is different from current documentation
conversion:eg "7/25/1994";
];
We provide this pattern because the dates look like 7/25/1994, and we add that example to the enhancement as additional documentation. Because we keep the lake name column as a literal, we must "join" this literal with the lake identifiers from the other RDF dataset. (We would also "join" on an RDF named graph for a lake-specific dataset, which has not been covered yet.) Alternatively, we could populate a new column with the DEC identifiers. In the interest of time, for the present we take the first option.
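The "join" on the lake name can be sketched as a lookup from rdfs:label to lake URI, built from the label triples of the chemistry dataset. This is an illustration only: the helper names are hypothetical, the matching rule (trimmed, case-insensitive) is an assumption, and the sketch assumes lake names are unique, which may not hold and is one reason the DEC Code alternative is attractive.

```python
# Label triple copied from the chemistry dataset example on this page.
LABEL_TRIPLES = [
    ("http://purl.org/twc/semantgeo/source/aeap_nys/dataset/dfw_lake_samples/typed/lake/040750A",
     "rdfs:label", "Windfall"),
]


def normalize(name: str) -> str:
    """Assumed matching rule: ignore surrounding whitespace and letter case."""
    return name.strip().casefold()


# Index from normalized lake name to lake URI.
LAKE_BY_NAME = {normalize(label): uri
                for uri, predicate, label in LABEL_TRIPLES
                if predicate == "rdfs:label"}


def lake_for(name: str):
    """Return the lake URI for a name from the species file, or None if unmatched."""
    return LAKE_BY_NAME.get(normalize(name))


print(lake_for("Windfall"))
print(lake_for("No Such Lake"))  # None
```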
We also want to make sure not to generate triples when the values are null. Null is encoded in more than one way in our data file, so we need an enhancement that lists those encodings:
conversion:interpret [
conversion:symbol "0","0.00";
conversion:interpretation conversion:null;
];
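The effect of this interpretation can be sketched as a filter applied before any triples are produced: cells whose text is exactly one of the listed symbols are dropped. The helper name and the sample row are hypothetical, and exact string matching is an assumption about how the symbols are compared.

```python
# The two symbols listed in the enhancement above.
NULL_SYMBOLS = {"0", "0.00"}


def non_null_cells(row: dict) -> dict:
    """Keep only the cells that should produce triples (exact match on the symbol)."""
    return {col: value for col, value in row.items()
            if value.strip() not in NULL_SYMBOLS}


# Made-up row of species counts keyed by column number, for illustration only.
row = {3: "0", 4: "12.50", 5: "0.00", 6: "0.5"}
print(non_null_cells(row))  # {4: '12.50', 6: '0.5'}
```

Under exact matching, a value such as "0.5" is kept, and an encoding that is not listed (for example "0.0") would still generate a triple unless it is added to the symbols.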
What schema should we follow, similar to that of the existing species data (bird, fish)?
And what are the mandatory linkages to the SemantEco ontology that an RDF dataset must satisfy to integrate into SemantEco?