
Script: justify.sh

Timothy Lebo edited this page Feb 14, 2012 · 9 revisions
csv2rdf4lod-automation is licensed under the [Apache License, Version 2.0](https://github.com/timrdf/csv2rdf4lod-automation/wiki/License)

Example

http://logd.tw.rpi.edu/source/data-gov/dataset/1008 provides a zip file containing a PDF and a CSV.

Suppose we only want to work with the WATER subset of the CSV, which is 25 of its 46 lines (a real-world example of this is discussed in frbr: CSHALS 2011 tutorial, where we are only interested in human genes and not others):

bash-3.2$ wc -l source/STATE_SINGLE_PW.CSV 
      46 source/STATE_SINGLE_PW.CSV
bash-3.2$ cat source/STATE_SINGLE_PW.CSV | grep "WATER" | wc -l
      25

We write the subset to a new file in manual/ because it is a modified version of the original (from-government) file in source/:

bash-3.2$ cat source/STATE_SINGLE_PW.CSV | grep "WATER" > manual/STATE_SINGLE_PW.CSV
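
One thing to watch with a plain grep: if the CSV starts with a header row that does not contain the search term, the header is dropped from the subset. This page does not say whether STATE_SINGLE_PW.CSV has a header, so the following is only a sketch under that assumption. `select_rows` is a hypothetical helper name, not part of csv2rdf4lod-automation.

```shell
#!/bin/sh
# Hypothetical helper: keep the first line (ASSUMED to be a header row)
# plus those remaining lines that contain the literal string $1.
select_rows() {
  pattern=$1
  input=$2
  head -n 1 "$input"
  # -F treats the pattern as a literal string, not a regular expression.
  # grep exits 1 when nothing matches; "|| true" keeps that from being an error.
  tail -n +2 "$input" | grep -F -- "$pattern" || true
}

# Example usage (paths as in the session above):
#   select_rows WATER source/STATE_SINGLE_PW.CSV > manual/STATE_SINGLE_PW.CSV
```

If the file has no header row, the plain grep shown above is the simpler choice.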

We want to associate this new file with the file it came from:

bash-3.2$ justify.sh source/STATE_SINGLE_PW.CSV manual/STATE_SINGLE_PW.CSV 
usage: justify.sh /path/to/source/a.xls /path/to/destination/a.xls.csv <engine-name>
   engine-name: (URI-friendly) e.g.:
      xls2csv,   tab2comma,     redelimit,            file_rename,   escaping_commas_redelimit
      duplicate, google_refine, serialization_change, parse_field,   tabulating_fixed_width
      html_tidy, pretty_print,  xsl_html_scrape,      manual_csvify, uncompress
      select_subset, etc.

Because no engine name was given, justify.sh printed its usage, which suggests names for the method used to derive one file from the other. We choose select_subset by appending it to the command and rerunning:

bash-3.2$ justify.sh source/STATE_SINGLE_PW.CSV manual/STATE_SINGLE_PW.CSV select_subset

---------------------------------- justify ---------------------------------------
source/STATE_SINGLE_PW.CSV (a conv:Select_subset_Engine applying conv:select_subset_Method) -> manual/STATE_SINGLE_PW.CSV
manual/STATE_SINGLE_PW.CSV came from source/STATE_SINGLE_PW.CSV
source/STATE_SINGLE_PW.CSV -> manual/STATE_SINGLE_PW.CSV
--------------------------------------------------------------------------------
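
The output above names both a `conv:Select_subset_Engine` class and a `conv:select_subset_Method` for the engine name `select_subset`. The naming pattern can be sketched as follows. This is an assumption inferred from this single run, not taken from the justify.sh source, and `engine_terms` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: derive the conv: terms seen in the justify.sh output
# from an engine name.
# ASSUMPTION: the pattern (capitalize the first letter and append "_Engine";
# append "_Method" to the unchanged name) is inferred from one example only.
engine_terms() {
  name=$1
  first=$(printf '%s' "$name" | cut -c1 | tr '[:lower:]' '[:upper:]')
  rest=$(printf '%s' "$name" | cut -c2-)
  printf 'conv:%s%s_Engine conv:%s_Method\n' "$first" "$rest" "$name"
}

engine_terms select_subset
# prints: conv:Select_subset_Engine conv:select_subset_Method
```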

The captured provenance is stored in a file named after the resulting file, with .pml.ttl appended:

bash-3.2$ cat manual/STATE_SINGLE_PW.CSV.pml.ttl

@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .
@prefix foaf:    <http://xmlns.com/foaf/0.1/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix sioc:    <http://rdfs.org/sioc/ns#> .
@prefix pmlp:    <http://inference-web.org/2.0/pml-provenance.owl#> .
@prefix pmlj:    <http://inference-web.org/2.0/pml-justification.owl#> .
@prefix conv:    <http://purl.org/twc/vocab/conversion/> .

<STATE_SINGLE_PW.CSV>
   a pmlp:Information;
   pmlp:hasModificationDateTime "2011-03-02T13:31:53-05:00"^^xsd:dateTime;
.
<STATE_SINGLE_PW.CSV>
   a pmlp:Information;
   nfo:hasHash <md5_db0a34538e1441633ab05bd962af6d4c_time_1299090728>;
.
<md5_db0a34538e1441633ab05bd962af6d4c_time_1299090728>
   a nfo:FileHash; 
   dcterms:date "2011-03-02T13:32:08-05:00"^^xsd:dateTime;
   nfo:hashAlgorithm "md5";
   nfo:hashValue "db0a34538e1441633ab05bd962af6d4c";
.

<../source/STATE_SINGLE_PW.CSV>
   a pmlp:Information;
   pmlp:hasModificationDateTime "2011-03-02T13:31:51-05:00"^^xsd:dateTime;
.
<../source/STATE_SINGLE_PW.CSV>
   a pmlp:Information;
   nfo:hasHash <md5_2afc25f886dbe56fdd15007d47f0c4c5_time_1299090728>;
.
<md5_2afc25f886dbe56fdd15007d47f0c4c5_time_1299090728>
   a nfo:FileHash; 
   dcterms:date "2011-03-02T13:32:08-05:00"^^xsd:dateTime;
   nfo:hashAlgorithm "md5";
   nfo:hashValue "2afc25f886dbe56fdd15007d47f0c4c5";
.

<nodeSet_6111cfa9-0179-42c5-a03b-fa7be3fc92cc>
   a pmlj:NodeSet;
   pmlj:hasConclusion <STATE_SINGLE_PW.CSV>;
   pmlj:isConsequentOf [
      a pmlj:InferenceStep;
      pmlj:hasIndex 0;
      pmlj:hasAntecedentList ( <nodeSet_6111cfa9-0179-42c5-a03b-fa7be3fc92cc_antecedent> 
                               <nodeSet_6111cfa9-0179-42c5-a03b-fa7be3fc92cc_user> );
      pmlj:hasInferenceEngine <select_subset_6111cfa9-0179-42c5-a03b-fa7be3fc92cc>;
      pmlj:hasInferenceRule   conv:select_subset_Method;
   ];
.

<nodeSet_6111cfa9-0179-42c5-a03b-fa7be3fc92cc_antecedent>
   a pmlj:NodeSet;
   pmlj:hasConclusion <source/STATE_SINGLE_PW.CSV>;
.

<nodeSet_6111cfa9-0179-42c5-a03b-fa7be3fc92cc_user>
   a pmlj:NodeSet;
   pmlp:hasConclusion <user_6111cfa9-0179-42c5-a03b-fa7be3fc92cc>;
.

<user_6111cfa9-0179-42c5-a03b-fa7be3fc92cc>
   foaf:accountName "lebot";
.

<select_subset_6111cfa9-0179-42c5-a03b-fa7be3fc92cc>
   a pmlp:InferenceEngine, conv:Select_subset_Engine;
   dcterms:identifier "select_subset_6111cfa9-0179-42c5-a03b-fa7be3fc92cc";
.

conv:Select_subset_Engine rdfs:subClassOf pmlp:InferenceEngine .