Script: justify.sh
http://logd.tw.rpi.edu/source/data-gov/dataset/1008 provides a zip file containing a PDF and a CSV.
Let's say we only want to work with the WATER subset of the CSV, which is 25 of its 46 lines (a real-world example of this is discussed in frbr: CSHALS 2011 tutorial, where we are only interested in Human genes and not others):
bash-3.2$ wc -l source/STATE_SINGLE_PW.CSV
46 source/STATE_SINGLE_PW.CSV
bash-3.2$ cat source/STATE_SINGLE_PW.CSV | grep "WATER" | wc -l
25
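The same count can be taken without the extra cat process. The following is a minimal sketch, not part of csv2rdf4lod: count_subset is a hypothetical helper name, and it matches the keyword anywhere on the line, exactly as the grep above does.

```shell
#!/usr/bin/env bash
# Sketch: report how many lines of a file mention a keyword, out of the total.
# count_subset is a hypothetical helper, not part of csv2rdf4lod.
count_subset() {
    local keyword="$1" file="$2"
    local total matched
    # Redirecting into wc avoids printing the file name; tr strips BSD padding.
    total=$(wc -l < "$file" | tr -d ' ')
    # grep -c prints the number of matching lines. It exits 1 when there are
    # none, so "|| true" keeps the helper usable under "set -e".
    matched=$(grep -c -- "$keyword" "$file" || true)
    echo "$matched of $total"
}

# Usage, with the path from the walkthrough:
# count_subset WATER source/STATE_SINGLE_PW.CSV
```

Because the match is on the whole line, a row that mentions the keyword in any column is counted.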
We make a new file in manual/ because it is a modified version of the original (from-government) files in source/:
bash-3.2$ cat source/STATE_SINGLE_PW.CSV | grep "WATER" > manual/STATE_SINGLE_PW.CSV
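A plain grep keeps only lines that contain the keyword, so a header row without it would be dropped. Whether this CSV has a header row is not stated here, so the following is only a sketch under that assumption: subset_with_header is a hypothetical helper that always keeps the first line and creates the destination directory if needed.

```shell
#!/usr/bin/env bash
# Sketch: copy the first line (assumed to be a header) plus every line that
# contains the keyword. subset_with_header is a hypothetical helper name.
subset_with_header() {
    local keyword="$1" src="$2" dest="$3"
    mkdir -p "$(dirname "$dest")"
    # NR == 1 selects the first line; index() does a fixed-string search,
    # so the keyword is not interpreted as a regular expression.
    awk -v kw="$keyword" 'NR == 1 || index($0, kw)' "$src" > "$dest"
}

# Usage, with the paths from the walkthrough:
# subset_with_header WATER source/STATE_SINGLE_PW.CSV manual/STATE_SINGLE_PW.CSV
```

If the first line is itself a data row, the plain grep shown above is the right tool and this variant would add one unwanted line.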
We want to associate this new file with where it came from:
bash-3.2$ justify.sh source/STATE_SINGLE_PW.CSV manual/STATE_SINGLE_PW.CSV
usage: justify.sh /path/to/source/a.xls /path/to/destination/a.xls.csv <engine-name>
engine-name: (URI-friendly) e.g.:
xls2csv, tab2comma, redelimit, file_rename, escaping_commas_redelimit
duplicate, google_refine, serialization_change, parse_field, tabulating_fixed_width
html_tidy, pretty_print, xsl_html_scrape, manual_csvify, uncompress
select_subset, etc.
justify.sh just provided some suggestions for methods that could be applied to get from one file to the next. We'll choose select_subset by adding it to the end of the command and rerunning it:
bash-3.2$ justify.sh source/STATE_SINGLE_PW.CSV manual/STATE_SINGLE_PW.CSV select_subset
---------------------------------- justify ---------------------------------------
source/STATE_SINGLE_PW.CSV (a conv:Select_subset_Engine applying conv:select_subset_Method) -> manual/STATE_SINGLE_PW.CSV
manual/STATE_SINGLE_PW.CSV came from source/STATE_SINGLE_PW.CSV
source/STATE_SINGLE_PW.CSV -> manual/STATE_SINGLE_PW.CSV
--------------------------------------------------------------------------------
The captured provenance is stored in a file named after the resulting file, plus .pml.ttl:
bash-3.2$ cat manual/STATE_SINGLE_PW.CSV.pml.ttl
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix sioc: <http://rdfs.org/sioc/ns#> .
@prefix pmlp: <http://inference-web.org/2.0/pml-provenance.owl#> .
@prefix pmlj: <http://inference-web.org/2.0/pml-justification.owl#> .
@prefix conv: <http://purl.org/twc/vocab/conversion/> .
<STATE_SINGLE_PW.CSV>
a pmlp:Information;
pmlp:hasModificationDateTime "2011-03-02T13:31:53-05:00"^^xsd:dateTime;
.
<STATE_SINGLE_PW.CSV>
a pmlp:Information;
nfo:hasHash <md5_db0a34538e1441633ab05bd962af6d4c_time_1299090728>;
.
<md5_db0a34538e1441633ab05bd962af6d4c_time_1299090728>
a nfo:FileHash;
dcterms:date "2011-03-02T13:32:08-05:00"^^xsd:dateTime;
nfo:hashAlgorithm "md5";
nfo:hashValue "db0a34538e1441633ab05bd962af6d4c";
.
<../source/STATE_SINGLE_PW.CSV>
a pmlp:Information;
pmlp:hasModificationDateTime "2011-03-02T13:31:51-05:00"^^xsd:dateTime;
.
<../source/STATE_SINGLE_PW.CSV>
a pmlp:Information;
nfo:hasHash <md5_2afc25f886dbe56fdd15007d47f0c4c5_time_1299090728>;
.
<md5_2afc25f886dbe56fdd15007d47f0c4c5_time_1299090728>
a nfo:FileHash;
dcterms:date "2011-03-02T13:32:08-05:00"^^xsd:dateTime;
nfo:hashAlgorithm "md5";
nfo:hashValue "2afc25f886dbe56fdd15007d47f0c4c5";
.
<nodeSet_6111cfa9-0179-42c5-a03b-fa7be3fc92cc>
a pmlj:NodeSet;
pmlj:hasConclusion <STATE_SINGLE_PW.CSV>;
pmlj:isConsequentOf [
a pmlj:InferenceStep;
pmlj:hasIndex 0;
pmlj:hasAntecedentList ( <nodeSet_6111cfa9-0179-42c5-a03b-fa7be3fc92cc_antecedent>
<nodeSet_6111cfa9-0179-42c5-a03b-fa7be3fc92cc_user> );
pmlj:hasInferenceEngine <select_subset_6111cfa9-0179-42c5-a03b-fa7be3fc92cc>;
pmlj:hasInferenceRule conv:select_subset_Method;
];
.
<nodeSet_6111cfa9-0179-42c5-a03b-fa7be3fc92cc_antecedent>
a pmlj:NodeSet;
pmlj:hasConclusion <source/STATE_SINGLE_PW.CSV>;
.
<nodeSet_6111cfa9-0179-42c5-a03b-fa7be3fc92cc_user>
a pmlj:NodeSet;
pmlp:hasConclusion <user_6111cfa9-0179-42c5-a03b-fa7be3fc92cc>;
.
<user_6111cfa9-0179-42c5-a03b-fa7be3fc92cc>
foaf:accountName "lebot";
.
<select_subset_6111cfa9-0179-42c5-a03b-fa7be3fc92cc>
a pmlp:InferenceEngine, conv:Select_subset_Engine;
dcterms:identifier "select_subset_6111cfa9-0179-42c5-a03b-fa7be3fc92cc";
.
conv:Select_subset_Engine rdfs:subClassOf pmlp:InferenceEngine .
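The provenance above records an MD5 hash for both the source and the derived file, which makes it possible to check later whether either file has changed since justify.sh ran. The following is a minimal sketch of such a check, not a feature of justify.sh: file_md5 and hash_recorded are hypothetical helpers, the sketch assumes md5sum (Linux) or md5 (macOS/BSD) is installed, and it matches the nfo:hashValue literal as plain text rather than parsing the Turtle.

```shell
#!/usr/bin/env bash
# Sketch: check whether a file's current MD5 appears as an nfo:hashValue in a
# .pml.ttl provenance file. Both helpers are hypothetical, not part of csv2rdf4lod.

# Print the MD5 hex digest of a file using whichever tool is available.
file_md5() {
    if command -v md5sum >/dev/null 2>&1; then
        md5sum "$1" | cut -d ' ' -f 1
    else
        md5 -q "$1"   # macOS/BSD
    fi
}

# Succeed (exit 0) if the file's current hash is recorded in the provenance file.
hash_recorded() {
    local file="$1" pml="$2"
    # -F treats the pattern as a fixed string; -q suppresses output.
    grep -F -q "nfo:hashValue \"$(file_md5 "$file")\"" "$pml"
}

# Usage, with the paths from the walkthrough:
# hash_recorded manual/STATE_SINGLE_PW.CSV manual/STATE_SINGLE_PW.CSV.pml.ttl \
#     && echo "unchanged since justify.sh ran" \
#     || echo "file differs from recorded hash"
```

A text match is enough here because the hash value is a fixed 32-character literal. A stricter check would parse the Turtle and follow nfo:hasHash from the specific file resource.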