Script: justify.sh
timrdf edited this page Mar 2, 2011 · 9 revisions
http://logd.tw.rpi.edu/source/data-gov/dataset/1008 provides a zip file containing a PDF and a CSV.
Let's say we only want to work with the WATER subset of the CSV, which is 25 of its 46 lines:
bash-3.2$ wc -l source/STATE_SINGLE_PW.CSV
46 source/STATE_SINGLE_PW.CSV
bash-3.2$ cat source/STATE_SINGLE_PW.CSV | grep "WATER" | wc -l
25
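The same two counts can be taken without the extra `cat`, since `grep -c` counts matching lines directly. The following is a minimal sketch on a throwaway file: the sample rows and the `sample.csv` name are made up for illustration and are not the real dataset.

```shell
# Hypothetical sample data standing in for source/STATE_SINGLE_PW.CSV.
tmpdir=$(mktemp -d)
mkdir -p "$tmpdir/source"
printf '%s\n' \
  'STATE,TYPE,VALUE' \
  'NY,WATER,1' \
  'NY,AIR,2' \
  'VT,WATER,3' > "$tmpdir/source/sample.csv"

# Total number of lines (header included); tr strips the padding some wc versions add.
total=$(wc -l < "$tmpdir/source/sample.csv" | tr -d ' ')

# Number of lines containing WATER, counted by grep itself.
water=$(grep -c "WATER" "$tmpdir/source/sample.csv")

echo "$water of $total lines mention WATER"
```

Note that `wc -l` counts every line, so a header row (if the file has one) is part of the total.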
We make a new file in manual/ because it is a modified version of the original (from-government) files in source/:
bash-3.2$ cat source/STATE_SINGLE_PW.CSV | grep "WATER" > manual/STATE_SINGLE_PW.CSV
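A plain `grep` keeps only the lines that match, so a header row that does not contain WATER would be dropped from the new file. If the CSV has a header row and it should survive (an assumption, since the page does not show the file's contents), one possible variation copies the first line before filtering the rest. The sample rows and file names below are hypothetical.

```shell
# Hypothetical sample data; the real file is source/STATE_SINGLE_PW.CSV.
workdir=$(mktemp -d)
mkdir -p "$workdir/source" "$workdir/manual"
src="$workdir/source/sample.csv"
dest="$workdir/manual/sample.csv"
printf '%s\n' 'STATE,TYPE,VALUE' 'NY,WATER,1' 'NY,AIR,2' 'VT,WATER,3' > "$src"

# Copy the first line (assumed to be a header), then only the WATER rows.
{
  head -n 1 "$src"
  tail -n +2 "$src" | grep "WATER"
} > "$dest"

cat "$dest"
```

The original in source/ is left untouched either way; only the copy in manual/ is filtered.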
We want to associate this new file with the file it came from:
bash-3.2$ justify.sh source/STATE_SINGLE_PW.CSV manual/STATE_SINGLE_PW.CSV
usage: justify.sh /path/to/source/a.xls /path/to/destination/a.xls.csv <engine-name>
engine-name: (URI-friendly) e.g.:
xls2csv, tab2comma, redelimit, file_rename, escaping_commas_redelimit
duplicate, google_refine, serialization_change, parse_field, tabulating_fixed_width
html_tidy, pretty_print, xsl_html_scrape, manual_csvify, uncompress
select_subset, etc.
Run without an engine name, justify.sh only printed its usage, which suggests methods that could describe how one file was derived from the other. We'll choose select_subset by adding it to the end of the command and rerunning it:
bash-3.2$ justify.sh source/STATE_SINGLE_PW.CSV manual/STATE_SINGLE_PW.CSV select_subset
---------------------------------- justify ---------------------------------------
source/STATE_SINGLE_PW.CSV (a conv:Select_subset_Engine applying conv:select_subset_Method) -> manual/STATE_SINGLE_PW.CSV
manual/STATE_SINGLE_PW.CSV came from source/STATE_SINGLE_PW.CSV
source/STATE_SINGLE_PW.CSV -> manual/STATE_SINGLE_PW.CSV
--------------------------------------------------------------------------------
The captured provenance is stored in a file named after the resulting file, plus .pml.ttl:
bash-3.2$ cat manual/STATE_SINGLE_PW.CSV.pml.ttl
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix sioc: <http://rdfs.org/sioc/ns#> .
@prefix pmlp: <http://inference-web.org/2.0/pml-provenance.owl#> .
@prefix pmlj: <http://inference-web.org/2.0/pml-justification.owl#> .
@prefix conv: <http://purl.org/twc/vocab/conversion/> .
<STATE_SINGLE_PW.CSV>
a pmlp:Information;
pmlp:hasModificationDateTime "2011-03-02T13:31:53-05:00"^^xsd:dateTime;
.
<STATE_SINGLE_PW.CSV>
a pmlp:Information;
nfo:hasHash <md5_db0a34538e1441633ab05bd962af6d4c_time_1299090728>;
.
<md5_db0a34538e1441633ab05bd962af6d4c_time_1299090728>
a nfo:FileHash;
dcterms:date "2011-03-02T13:32:08-05:00"^^xsd:dateTime;
nfo:hashAlgorithm "md5";
nfo:hashValue "db0a34538e1441633ab05bd962af6d4c";
.
<../source/STATE_SINGLE_PW.CSV>
a pmlp:Information;
pmlp:hasModificationDateTime "2011-03-02T13:31:51-05:00"^^xsd:dateTime;
.
<../source/STATE_SINGLE_PW.CSV>
a pmlp:Information;
nfo:hasHash <md5_2afc25f886dbe56fdd15007d47f0c4c5_time_1299090728>;
.
<md5_2afc25f886dbe56fdd15007d47f0c4c5_time_1299090728>
a nfo:FileHash;
dcterms:date "2011-03-02T13:32:08-05:00"^^xsd:dateTime;
nfo:hashAlgorithm "md5";
nfo:hashValue "2afc25f886dbe56fdd15007d47f0c4c5";
.
<nodeSet_6111cfa9-0179-42c5-a03b-fa7be3fc92cc>
a pmlj:NodeSet;
pmlj:hasConclusion <STATE_SINGLE_PW.CSV>;
pmlj:isConsequentOf [
a pmlj:InferenceStep;
pmlj:hasIndex 0;
pmlj:hasAntecedentList ( <nodeSet_6111cfa9-0179-42c5-a03b-fa7be3fc92cc_antecedent> <nodeSet_6111cfa9-0179-42c5-a03b-fa7be3fc92cc_user> );
pmlj:hasInferenceEngine <select_subset_6111cfa9-0179-42c5-a03b-fa7be3fc92cc>;
pmlj:hasInferenceRule conv:select_subset_Method;
];
.
<nodeSet_6111cfa9-0179-42c5-a03b-fa7be3fc92cc_antecedent>
a pmlj:NodeSet;
pmlj:hasConclusion <source/STATE_SINGLE_PW.CSV>;
.
<nodeSet_6111cfa9-0179-42c5-a03b-fa7be3fc92cc_user>
a pmlj:NodeSet;
pmlp:hasConclusion <user_6111cfa9-0179-42c5-a03b-fa7be3fc92cc>;
.
<user_6111cfa9-0179-42c5-a03b-fa7be3fc92cc>
foaf:accountName "lebot";
.
<select_subset_6111cfa9-0179-42c5-a03b-fa7be3fc92cc>
a pmlp:InferenceEngine, conv:Select_subset_Engine;
dcterms:identifier "select_subset_6111cfa9-0179-42c5-a03b-fa7be3fc92cc";
.
conv:Select_subset_Engine rdfs:subClassOf pmlp:InferenceEngine .
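Because the record stores an md5 `nfo:hashValue` for each file, it can be used later to check whether a file still matches what was justified. The sketch below assumes the recorded value is the plain md5 hex digest of the file's contents. The helpers `md5_of` and `check_hash`, the sample data, and the cut-down stand-in record are all hypothetical and are not part of justify.sh.

```shell
# Print the md5 hex digest of a file, using md5sum (Linux) or md5 -q (macOS/BSD).
md5_of() {
  if command -v md5sum >/dev/null 2>&1; then
    md5sum "$1" | cut -d ' ' -f 1
  else
    md5 -q "$1"
  fi
}

# Succeed when the file's current md5 appears as an nfo:hashValue in the record.
check_hash() {
  grep -q "nfo:hashValue \"$(md5_of "$1")\"" "$2"
}

# Hypothetical data file and a stand-in record holding only the hash triple.
workdir=$(mktemp -d)
data="$workdir/sample.csv"
ttl="$data.pml.ttl"
printf '%s\n' 'NY,WATER,1' 'VT,WATER,3' > "$data"
printf '<md5_example>\n   nfo:hashValue "%s";\n.\n' "$(md5_of "$data")" > "$ttl"

# The untouched file matches the record.
if check_hash "$data" "$ttl"; then fresh=yes; else fresh=no; fi

# After a manual edit the digest changes and the check fails.
echo 'NY,AIR,2' >> "$data"
if check_hash "$data" "$ttl"; then after_edit=yes; else after_edit=no; fi

echo "before edit: $fresh, after edit: $after_edit"
```

A failed check suggests the file was changed after it was justified, in which case rerunning justify.sh would be the natural way to refresh the record.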