Darrin Fresh Water Institute detailed data conversion notes

Introduction

In this document we explain the process of taking tabular data and enhancing it into a compliant and useful Resource Description Framework (RDF) model. The process is explained through example enhancement code blocks and the corresponding RDF statements that result from these parameters.

The data we are working with is Darrin Fresh Water Institute (DFWI) water chemistry data available in a tabular format. The DFWI data is complex enough to warrant enhancements to help organize and shape the data, and is perfect for explaining the process of data conversion.

Briefly, enhancements are parameters which modify and direct a conversion of tabular data to RDF to provide more explicit representations of what the data is about and how it is organized. In many cases enhancements are referenced from established vocabularies or ontologies.

The Enhancement Pipeline

There is an iterative process when you want to move your data from tabular data to an RDF Model. See the flow diagram below:

Obtaining data means getting the data you want to enhance in front of you. This can be done by querying a public API, or by downloading the tabular data from the internet. Defining enhancements is the process of writing the enhancement parameters that will later be applied to your tabular data. Enhance data is the process of pulling your enhancement trigger, which generates an RDF data model from your tabular data using the enhancement parameters you just created. Publish data is the final step in the process. When you are happy with the generated RDF data model, you can store the RDF triples contained in that model in a triple store for use in the wild Semantic Web.

Syntax of Enhancement Parameters

It is crucial to understand the syntax and layout of enhancement parameters before we use them in this guide to describe various topics and ideas.

TODO: Write Syntax guide. I will do this soon, but patrice wants me to work on a subject lower so I will. :)

Data Overview | Layout

In the following documentation we will be working with a data file called AEAP_NYSERDA_Chem_94-12_v9_web from the DFWI data set. The context of the data contained in this data file is water samples taken from a particular water body. This file's contents are organized in a tabular format. There are columns with named headers denoting what data is contained in the column, and rows containing data for each of these columns.

There are many columns, and it will be the job of the conversion process to organize the numerous columns that make up a single row. As a starting point, the columns can be color-coded to highlight some initial grouping within the data. Let us review this color-coding now.

These four columns are grouped by an orange coloring. This data is grouped since it concerns which specific water body was sampled from for this row of data. The date denotes when the sample was taken. The DEC Code identifies the lake sampled from by an identifying code tied to a specific water body. DEC Codes are defined and maintained by the New York State Department of Environmental Conservation. The Lake Name is a plain-text common name associated with the water body sampled from. The Accession Code is a unique number within the DFWI data set assigned to a particular water sample.

These five columns are grouped by a green coloring. This data is grouped since it concerns where a sample was taken. The Z Max is. The Sample Type is. The Sample Z is. The Sample Layer is. The Secchi is.

These 37 columns (not all shown) are grouped by a blue coloring. This data is grouped since it concerns the presence of various chemicals in the sample taken from the water body. Each column in this grouping represents a single chemical that was tested for in the DFWI study, and each cell in the column holds a decimal value denoting the amount of that chemical detected in a sample.

These seven columns are grouped by a purple coloring. This data is grouped since it concerns various properties of the water body that was sampled. Each column in this grouping represents a single property of that water body. The Hydrologic Type is a shorthand for. The Watershed is a plain-text name for. The Watershed area is the area of the water body's watershed in hectares. The Lake Volume is the volume of the water body in meters cubed. The WA/SA Ratio is the ratio of the water body's watershed area to the water body's surface area. The Surface area (ha) is the surface area of the water body in hectares. The Surface area (m2) is the surface area of the water body in meters squared.

Organizing our thoughts; Preparing for enhancements

Before we start enhancing, we should think how we want to organize our data:

Is a particular column's data tied to another column in the data set?
   * How can we show that in our enhancements?
Is a group of columns more of a description of a larger idea?
   * How can we group them into such an abstract idea?
Is our data best organized by its current tabular structure, or can it better be visualized by some other structure?
   * Can we change that in our enhancements?

Let's tackle these questions by looking at our data.

Tackling Question 1: "Explicit Bundling"

TODO: Write this

Tackling Question 2: "Implicit Bundling"

Tackling Question 3: "Cell-Based vs. Row-Based"

TODO: Write this

Blah blah

For our chemistry data tables, the first column represents samples:

 conversion:enhance [
         ov:csvCol          1;
         ov:csvHeader       "Accession Code (sample#)";
         conversion:label   "Accession Code (sample#)";
         a conversion:DataStartRow;
         conversion:comment "A unique numeric reference to a specific water sample taken from a lake";
         conversion:range   rdfs:Resource;
         conversion:range_name  "WaterSample";
         conversion:equivalent_property oboe:ofEntity;
      ];

Given the "equivalent_property" enhancement, each measurement from the columns we later enhance as "cell-based" is in the oboe:ofEntity relationship with the URI generated for the sample. We'll see later that every measurement is taken of a certain sample, and that there are multiple measurements taken of a sample.

In the second column, we have a literal that represents the "human readable" name of a lake:

      conversion:enhance [
         ov:csvCol          2;
         ov:csvHeader       "Lake Name";
         conversion:label   "Lake Name";
         conversion:equivalent_property rdfs:label;
         conversion:comment "The name of the lake the sample is from";
         conversion:range   rdfs:Literal;
         conversion:bundled_by [ ov:csvCol 3 ];
      ];

This column is "enhanced" in the sense that it is "bundled by" column 3, which represents each distinct lake. Column 3 enhancements are:


 conversion:enhance [
         ov:csvCol          3;
         ov:csvHeader       "DEC Code";
         conversion:label   "DEC Code";
         conversion:comment "A numeric reference to a lake in the DEC registry, unique to a specific lake";
         conversion:range   rdfs:Resource;
         conversion:range_name  "Lake";
         conversion:equivalent_property oboe:hasContext;
         conversion:bundled_by [ ov:csvCol 1 ];
         conversion:range_template "[/sd]typed/lake/[.]";
      ];

Given the enhancements for column 2 and the specification of column 3 as a resource, RDF statements like the following are generated:

<http://purl.org/twc/semantgeo/source/aeap_nys/dataset/dfw_lake_samples/typed/lake/040750A> 
   rdfs:label "Windfall" .

Column 3 holds DEC Codes, which are state-wide unique identifiers for NY State lakes. We use these identifiers as the basis for the lake URIs in our Darrin Fresh Water Institute RDF dataset.
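To make the URI construction concrete, here is a minimal Python sketch of how a template such as "[/sd]typed/lake/[.]" could expand into the lake URI shown above. The function name is hypothetical, and the reading of the placeholders ([/sd] as the source/dataset base URI, [.] as the value of the current cell) is an assumption inferred from the example triple, not a statement of how the converter is implemented.

```python
# Base URI copied from the example lake triple above.
SOURCE_DATASET_BASE = (
    "http://purl.org/twc/semantgeo/source/aeap_nys/dataset/dfw_lake_samples/"
)


def expand_range_template(template: str, cell_value: str,
                          base: str = SOURCE_DATASET_BASE) -> str:
    """Expand a range template into a URI (hypothetical helper).

    Assumption: "[/sd]" stands for the source/dataset base URI and
    "[.]" stands for the value of the cell being converted.
    """
    return template.replace("[/sd]", base).replace("[.]", cell_value)


lake_uri = expand_range_template("[/sd]typed/lake/[.]", "040750A")
print(lake_uri)
```

Under these assumptions, the DEC Code 040750A yields the same lake URI that carries the rdfs:label "Windfall" in the example.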

Please also note that our representation of lakes is in turn bundled by another column, column 1 for water samples, where the relationship between sample and lake is declared as oboe:hasContext. Here is an example of the triples that result from this enhancement:

<http://purl.org/twc/semantgeo/source/aeap_nys/dataset/dfw_lake_samples/aeap-nyserda-chem-94-12-v9-web/typed/watersample/10646116> dcterms:identifier "10646116" ;
        a dfw_lake_samples_vocab:WaterSample ;
        rdfs:label "10646116" ;
        oboe:hasContext <http://purl.org/twc/semantgeo/source/aeap_nys/dataset/dfw_lake_samples/typed/lake/040750A> .

Continuing left to right on the columns, we also have a conversion for the date column:

      conversion:enhance [
         ov:csvCol          4;
         ov:csvHeader       "Date";
         conversion:label   "Date";
         conversion:comment "The date the sample was taken from the lake";
         conversion:range   xsd:date;
         conversion:eg      "13-Aug-89";
         conversion:pattern "dd-MMM-yy"; # This is different from current documentation
         conversion:equivalent_property time:inXSDDateTime;
      ];

Further, we bundle columns 5-9 into an implicit bundle. We show just column 9 as an example:

conversion:enhance [
         ov:csvCol          9;
         ov:csvHeader       "Secchi (m)";
         conversion:label   "Secchi (m)";
         conversion:comment "TODO: has to do with how clear the water is";
         conversion:range   xsd:decimal;
         conversion:bundled_by :a_location_bundle;
      ];

The implicit bundle is defined as follows:

:a_location_bundle
   a conversion:ImplicitBundle;
   conversion:property_name oboe:hasContext;
   conversion:name_template "[/sd]sampleLocation/[r]";
   conversion:type_name     prov:Location .

So in the resulting RDF we get statements like the following:

<http://purl.org/twc/semantgeo/source/aeap_nys/dataset/dfw_lake_samples/sampleLocation/2008> a prov:Location ;
        e1:z_max_m "5.9"^^xsd:decimal ;
        e1:sample_type "G" ;
        e1:sample_z_m "5"^^xsd:decimal ;
        e1:sample_layer "Yh" ;
        e1:secchi_m "3.5"^^xsd:decimal .
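The effect of the implicit bundle can be sketched in a few lines of Python. This is only an illustration under assumptions: the helper name is hypothetical, [r] is read as the row number (which matches the sampleLocation/2008 URI above), and the assignment of property names to columns 5-9 follows the order in which the columns were listed earlier.

```python
BASE = "http://purl.org/twc/semantgeo/source/aeap_nys/dataset/dfw_lake_samples/"

# Assumed column order for the location group (columns 5-9); the property
# names mirror the e1: predicates in the example statements.
LOCATION_COLUMNS = {
    5: "z_max_m",
    6: "sample_type",
    7: "sample_z_m",
    8: "sample_layer",
    9: "secchi_m",
}


def bundle_location(row_number: int, row: dict) -> tuple:
    """Group the location columns of one row under one new resource.

    Hypothetical helper: the subject URI follows the name template
    "[/sd]sampleLocation/[r]", with [r] assumed to be the row number.
    """
    subject = f"{BASE}sampleLocation/{row_number}"
    props = {name: row[col] for col, name in LOCATION_COLUMNS.items() if col in row}
    return subject, props


# Cell values taken from the example statements above.
subject, props = bundle_location(2008, {5: "5.9", 6: "G", 7: "5", 8: "Yh", 9: "3.5"})
print(subject)
print(props)
```

The point of the bundle is that the five values no longer hang directly off the row's subject; they describe a separate location resource, one per row.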

We established previously that a water sample has Context (i.e., is taken from) a lake. It is also the case that a measurement is of Entity a water sample. This knowledge will come in handy for later SemantEco module development. Remember then, that the path from a measurement to a lake is through its water sample. An example set of statements for a measurement (represented through a column) is given below:

:waterMeasurement_2_11 dcterms:isReferencedBy <http://purl.org/twc/semantgeo/source/aeap_nys/dataset/dfw_lake_samples/version/2013-April-24> ;
        void:inDataset <http://purl.org/twc/semantgeo/source/aeap_nys/dataset/dfw_lake_samples/version/2013-April-24> ;
        a water:WaterMeasurement , dfw_lake_samples_vocab:WaterMeasurement ;
        oboe:ofEntity <http://purl.org/twc/semantgeo/source/aeap_nys/dataset/dfw_lake_samples/aeap-nyserda-chem-94-12-v9-web/typed/watersample/9446846> ;
        time:inXSDDateTime "1994-06-30"^^xsd:date ;
        oboe:hasContext <http://purl.org/twc/semantgeo/source/aeap_nys/dataset/dfw_lake_samples/sampleLocation/2> ;
        pollution:hasCharacteristic pollution:PH ;
        pollution:hasUnit "SU" ;
        pollution:hasValue "5.1"^^xsd:decimal ;
        ov:csvRow "2"^^xsd:integer ;
        ov:csvCol "11"^^xsd:integer ;
        e1:hydrologic_type "TDL" ;
        e1:watershed "black" ;
        e1:watershed_area_ha "9643.7" ;
        e1:wa_sa_ratio "18.84" ;
        e1:surface_area_ha "511.9" ;
        e1:surface_area_m2 "5119000" ;
        ov:subjectDiscriminator dfw_lake_samples:aeap-nyserda-chem-94-12-v9-web ;
        dcterms:isReferencedBy dfw_lake_samples:aeap-nyserda-chem-94-12-v9-web .

For the sample:

<http://purl.org/twc/semantgeo/source/aeap_nys/dataset/dfw_lake_samples/aeap-nyserda-chem-94-12-v9-web/typed/watersample/9446846> dcterms:identifier "9446846" ;
        a dfw_lake_samples_vocab:WaterSample ;
        rdfs:label "9446846" .

And context being the lake:

<http://purl.org/twc/semantgeo/source/aeap_nys/dataset/dfw_lake_samples/aeap-nyserda-chem-94-12-v9-web/typed/watersample/9446846> 
   oboe:hasContext 
   <http://purl.org/twc/semantgeo/source/aeap_nys/dataset/dfw_lake_samples/typed/lake/040752> .

As you might have noticed, we are seeing many more triples than we are probably interested in. As a summary of sorts, I'll collect the most crucial triples in one place and set aside the full URIs and even the CURIEs, so that you as the reader can focus on the relationships instead of the features of the RDF specification. I take this one step further and represent them in typical predicate logic form:

hasContext(waterSample01, LakeGeorge)
ofEntity(waterMeasurement01, waterSample01)
instanceOf(waterMeasurement01, WaterMeasurement)
hasCharacteristic(waterMeasurement01, pH)
hasValue(waterMeasurement01, "5.1")
hasUnit(waterMeasurement01, SU)

What's important here is that a measurement is bound to a characteristic, a value, and a unit. There are other considerations, like the date the sample was taken and the geospatial coordinates in the lake the sample was taken from. For the latter, in this dataset the sample point for each lake is based on the center of the lake, and we use the lat/long of the lake for the Google Maps plotting in SemantEco.
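The path from a measurement to its lake runs through the water sample, and that traversal can be sketched with plain tuples in Python. The identifiers are the shorthand names from the summary above; the helper functions are hypothetical and only illustrate the two-step lookup.

```python
# Shorthand triples from the summary: (subject, predicate, object).
triples = [
    ("waterSample01", "hasContext", "LakeGeorge"),
    ("waterMeasurement01", "ofEntity", "waterSample01"),
    ("waterMeasurement01", "instanceOf", "WaterMeasurement"),
    ("waterMeasurement01", "hasCharacteristic", "pH"),
]


def objects(subject: str, predicate: str) -> list:
    """Return every object o such that (subject, predicate, o) is asserted."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def lakes_of(measurement: str) -> list:
    """Follow measurement -ofEntity-> sample -hasContext-> lake."""
    return [
        lake
        for sample in objects(measurement, "ofEntity")
        for lake in objects(sample, "hasContext")
    ]


print(lakes_of("waterMeasurement01"))
```

A query against the real dataset would follow the same two hops, oboe:ofEntity and then oboe:hasContext.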

Note that we are yet to tie watersheds to unique identifiers. Taking a step back, we note that at the global-level the conversions are for water measurements:


 conversion:enhance [
          conversion:domain_name "WaterMeasurement";
      ];

which are subclasses of the class WaterMeasurement of the SemantEco water.owl ontology. This will be crucial for integration with SemantEco, which we will cover in detail in a later wiki page.

      conversion:enhance [
         conversion:class_name "WaterMeasurement";
         conversion:subclass_of <http://escience.rpi.edu/ontology/semanteco/2/0/water.owl#WaterMeasurement>;
      ];

They are water measurements in that some quality or characteristic of water is being measured.

Cell-based conversion, for those columns that represent the value of some measurement, is covered by the enhancement:

  conversion:enhance [ # Multi line n-ary assignment
         ov:csvCol 10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46;
         a qb:Observation;
         rdf:type           water:WaterMeasurement;
         prov:atLocation query:site;
         conversion:equivalent_property <http://escience.rpi.edu/ontology/semanteco/2/0/pollution.owl#hasCharacteristic>;
      ];

In a recent update to csv2rdf4lod, this can now be specified with a range:

conversion:enhance [
         conversion:fromCol        10;
         conversion:toCol          46;
         a qb:Observation;
         rdf:type           water:WaterMeasurement;
         prov:atLocation query:site;
         conversion:equivalent_property <http://escience.rpi.edu/ontology/semanteco/2/0/pollution.owl#hasCharacteristic>;
      ];
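To show what "cell-based" means in practice, the following Python sketch turns every cell in a column range into its own measurement record. It is an illustration under assumptions, not the converter's implementation: the function is hypothetical, the header is assumed to occupy row 1 so that data starts at row 2, the identifier shape imitates the :waterMeasurement_2_11 subject seen earlier, and the toy table (including the "ANC" column and its value) is made up.

```python
def cell_based(rows, headers, from_col, to_col, first_data_row=2):
    """Yield one measurement record per cell in columns from_col..to_col.

    Columns are 1-indexed and the range is inclusive, mirroring
    conversion:fromCol / conversion:toCol. Hypothetical helper.
    """
    for offset, row in enumerate(rows):
        r = first_data_row + offset
        for c in range(from_col, to_col + 1):
            yield {
                "id": f"waterMeasurement_{r}_{c}",
                "csvRow": r,
                "csvCol": c,
                "hasCharacteristic": headers[c - 1],  # the column header names what was measured
                "hasValue": row[c - 1],               # the cell holds the measured value
            }


# Toy table: one sample column followed by two measurement columns.
headers = ["Accession Code (sample#)", "pH", "ANC"]
rows = [["9446846", "5.1", "12.0"]]

measurements = list(cell_based(rows, headers, from_col=2, to_col=3))
for m in measurements:
    print(m)
```

One row with two measurement columns produces two records, which is why a cell-based conversion generates many more subjects than a row-based one.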

Edit: it looks like we use hasContext to relate a water measurement to the site in the lake the measurement was taken from, and also hasContext to relate a water sample to the lake it is taken from. We should also relate the lake site to the lake (part-of?), but note that the site is the very same for all the samples for a lake in the dataset we are working with. Also, regarding the lake site, we have yet to use the values of z max m, sample z m, or secchi m in any of the queries we run in SemantEco. We'll come back to this point later.

Going forward, how do we use these enhancements to convert the Darrin Fresh Water Institute's species data?

Looking at our CSV data file for phytoplankton, it also calls for a cell-based conversion:

conversion:enhance [
         conversion:fromCol        3;
         conversion:toCol         539;
         a qb:Observation;
        conversion:equivalent_property <http://escience.rpi.edu/ontology/semanteco/2/0/pollution.owl#hasCharacteristic>;
      ];

Since units are not part of the dataset, we explicitly assert a statement about the units for each cell-based conversion:

(Here's where a screenshot of table is useful.)

conversion:enhance [
     ov:csvCol          10;
     conversion:fromCol        3;
     conversion:toCol         539;
     a conversion:SubjectAnnotation;
     conversion:predicate <http://escience.rpi.edu/ontology/semanteco/2/0/pollution.owl#hasUnit>;
     conversion:object    <http://escience.rpi.edu/ontology/semanteco/2/0/pollution.owl#OrganismsPerCubicMeterofWater>;
  ];

Also, in this case, there is no specific column with identifiers for the samples, contrary to the chemistry data we described above. Therefore we create an implicit bundle for samples:


:a_sample_bundle
   a conversion:ImplicitBundle;
   conversion:property_name oboe:ofEntity; # Can also be a URI, e.g. dcterms:title.
   conversion:name_template "[/sd]waterMeasurement[r]";
   conversion:type_name <http://escience.rpi.edu/ontology/semanteco/2/0/water.owl#WaterMeasurement>
.

And we make sure to apply it to all measurement-based (cell-based) columns:

conversion:enhance [
         conversion:fromCol        3;
         conversion:toCol         539;
         conversion:bundled_by :a_sample_bundle;
      ];

We might question though, is the measurement of phytoplankton within an amount of water a water measurement or a phytoplankton measurement? For now we assume the former, and apply the following enhancement for the row-based conversion:

conversion:enhance [
          conversion:domain_name "WaterMeasurement";
      ];
      conversion:enhance [
         conversion:class_name "WaterMeasurement";
         conversion:subclass_of <http://escience.rpi.edu/ontology/semanteco/2/0/water.owl#WaterMeasurement>;
      ];

Column 1 is the lake name, the same as a column in the chemistry dataset, and we maintain the following enhancement:

 conversion:enhance [
         ov:csvCol          1;
         ov:csvHeader       "Lake Name";
         conversion:comment "";
         conversion:range   todo:Literal;
      ];

We add this example into the enhancement as additional documentation. By maintaining this column as a literal, we must "join" this literal with the lake identifiers from the other RDF dataset. (We would also "join" on an RDF named graph for a lake-specific dataset, which has not been covered yet.) Alternatively, we could populate a new column with the DEC identifiers. In the interest of time, for the present we take the first option. Update: we need to link the samples with the lakes, so I think we need to add a column for the lake DEC Codes, as they existed in the chemistry data.

Next, as before, we provide an enhancement for the date column of the species data:

  conversion:enhance [
         ov:csvCol          2;
         ov:csvHeader       "Date";
         conversion:comment "";
         conversion:range   xsd:date;
         conversion:pattern "MM/dd/yyyy"; # This is different from current documentation
         conversion:eg "7/25/1994";
      ];

We provide this pattern because example dates include 7/25/1994.
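As a rough check that a pattern fits the data, the two date patterns used in this document can be tried against their example values in Python. This sketch assumes the patterns are Java-style date patterns and maps them by hand to strptime format codes; the mapping table and the function name are illustrative only.

```python
from datetime import datetime

# Assumed hand-written mapping from the patterns used in the enhancements
# to Python strptime format codes.
PATTERNS = {
    "dd-MMM-yy": "%d-%b-%y",    # chemistry data, e.g. "13-Aug-89"
    "MM/dd/yyyy": "%m/%d/%Y",   # species data, e.g. "7/25/1994"
}


def to_xsd_date(value: str, pattern: str) -> str:
    """Parse a cell value with the given pattern and return an xsd:date lexical form."""
    return datetime.strptime(value, PATTERNS[pattern]).date().isoformat()


# %b matches English month abbreviations in the default "C" locale.
print(to_xsd_date("13-Aug-89", "dd-MMM-yy"))
print(to_xsd_date("7/25/1994", "MM/dd/yyyy"))
```

Two-digit years are inherently ambiguous. Python reads 69-99 as 1969-1999 and 00-68 as 2000-2068, and other date libraries may choose a different pivot, so it is worth spot-checking converted dates from a "yy" pattern.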

We also want to make sure we do not generate triples when the values are null. Null can be specified in different ways, so we need an enhancement stating how null is encoded in our data file:

conversion:interpret [
         conversion:symbol        "0","0.00";
         conversion:interpretation conversion:null;
      ];
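The intended effect of this interpretation can be sketched as a filter over the cells of a row. This is a simplified illustration: the helper is hypothetical, the row values are made up, and it assumes the symbols are compared against the raw cell text exactly as written in the file.

```python
# Symbols declared as null in the enhancement above.
NULL_SYMBOLS = {"0", "0.00"}


def non_null_cells(row: dict) -> dict:
    """Keep only the cells that should produce triples.

    Assumption: a cell is null when its raw text equals one of the
    declared symbols exactly, so "0.0" would not count as null here.
    """
    return {col: value for col, value in row.items() if value not in NULL_SYMBOLS}


# Made-up species counts keyed by column number.
row = {3: "0", 4: "12.5", 5: "0.00", 6: "0.5"}
kept = non_null_cells(row)
print(kept)
```

With 537 species columns per row, most of which are likely empty for any given sample, skipping null cells keeps the number of generated measurements manageable.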

What schema should we follow, similar to that of existing species data (bird, fish)? And what are the mandatory linkages to the SemantEco ontology that an RDF dataset must satisfy to integrate into SemantEco?
