frbr: CSHALS 2011 tutorial
Jim McCusker used csv2rdf4lod to incorporate some data for his Semantic Healthcare and Life Sciences Tutorial. This (on-the-fly!) tutorial provides some more detail on how he did it. I am piecing it together from the provenance that csv2rdf4lod captured when Jim originally ran it for his demo.
You can get the source at:
https://github.com/timrdf/csv2rdf4lod-automation/tree/master/doc/examples/source/ncbi-nih-gov
Installing csv2rdf4lod automation
Data: ftp://ftp.ncbi.nih.gov/gene/DATA/gene2go.gz
Step a.1: [name](Conversion process phase: name) the data:
- base URI:
http://sparql.tw.rpi.edu/ontowiki/
- source:
ncbi-nih-gov
- dataset:
gene2go
- version:
2011-Feb-23
(see Conversion process phase: name)
Use the HTTP modification date to name the version:
bash-3.2$ curl -I ftp://ftp.ncbi.nih.gov/gene/DATA/gene2go.gz
Last-Modified: Wed, 23 Feb 2011 07:49:05 GMT
Content-Length: 12359614
Accept-ranges: bytes
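The version identifier can be derived mechanically from the `Last-Modified` header. This is a minimal sketch, assuming the header has the shape shown in the output above. The header is hard-coded rather than fetched, and the variable names `header` and `version` are only illustrative:

```shell
# Header line copied from the curl -I output above (hard-coded for illustration).
header='Last-Modified: Wed, 23 Feb 2011 07:49:05 GMT'

# Whitespace-separated fields: $1=Last-Modified:  $2=Wed,  $3=23  $4=Feb  $5=2011
version=$(printf '%s\n' "$header" | awk '{ print $5 "-" $4 "-" $3 }')

echo "$version"   # prints 2011-Feb-23
```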
Step a.2: [retrieve](Conversion process phase: retrieve) the data: Create the directory to keep a local copy of NIH's data:
mkdir -p ~/Desktop/cshals-2011-demo/source/ncbi-nih-gov/gene2go/version/2011-Feb-23/source
cd ~/Desktop/cshals-2011-demo/source/ncbi-nih-gov/gene2go/version/2011-Feb-23/source
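The directory path is assembled from the source, dataset, and version names chosen in step a.1. The sketch below shows that mapping with illustrative variable names (`source_id`, `dataset_id`, `version_id`, `root` are my own, not part of csv2rdf4lod) and a scratch directory standing in for `~/Desktop/cshals-2011-demo`:

```shell
# Name components from step a.1.
source_id='ncbi-nih-gov'
dataset_id='gene2go'
version_id='2011-Feb-23'

# Hypothetical root; the tutorial itself uses ~/Desktop/cshals-2011-demo.
root=$(mktemp -d)

# source/<source>/<dataset>/version/<version>/source holds the retrieved files.
version_dir="$root/source/$source_id/$dataset_id/version/$version_id"
mkdir -p "$version_dir/source"   # -p creates the missing parent directories

echo "$version_dir/source"
```

The same pattern applies to any other dataset: only the three name components change.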
Step a.3: Get the gzip archive, uncompress it, and log the provenance:
pcurl.sh ftp://ftp.ncbi.nih.gov/gene/DATA/gene2go.gz
gunzip -c gene2go.gz > gene2go
justify.sh gene2go.gz gene2go uncompress
cd ~/Desktop/cshals-2011-demo/source/ncbi-nih-gov/gene2go/version/2011-Feb-23/
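Step a.3 uses `gunzip -c` with a redirect rather than plain `gunzip`. This keeps the downloaded archive next to the uncompressed file, so both names passed to `justify.sh` still exist afterwards. A small sketch of that behavior, using a stand-in file in a scratch directory (the file name `example` and its content are placeholders, not NCBI data):

```shell
# Work in a scratch directory with a stand-in file.
scratch=$(mktemp -d)
printf 'placeholder\tcontent\n' > "$scratch/example"
gzip "$scratch/example"   # replaces example with example.gz

# -c writes the uncompressed bytes to stdout and leaves example.gz in place.
gunzip -c "$scratch/example.gz" > "$scratch/example"

ls "$scratch"   # lists both example and example.gz
```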
Step a.4: [csv-ify](Conversion process phase: csv-ify) the data: We only need Homo sapiens (taxonomy ID 9606), and the file is tab-delimited (we need comma-separated). We make a manual tweak and store it in manual/ (capturing the provenance):
mkdir -p manual
awk -F'\t' '$1 == "9606"' source/gene2go | perl -pe 's/^/"/; s/GO://; s/\t/","/g; s/$/"/' > manual/gene2go-9606.csv
Step a.5: Create verbatim interpretation of tabular literals ([create](Conversion process phase: create conversion trigger) and [pull](Conversion process phase: pull conversion trigger) the conversion trigger):
cr-create-convert-sh.sh -w manual/gene2go-9606.csv
./convert-gene2go.sh
Step a.6: Cheat and get Jim's enhanced interpretation parameters:
curl -L https://github.com/timrdf/csv2rdf4lod-automation/raw/master/doc/examples/source/ncbi-nih-gov/gene2go/version/2011-Feb-23/manual/gene2go-9606.csv.e1.params.ttl > manual/gene2go-9606.csv.e1.params.ttl
Step a.7: Create enhanced interpretation of tabular literals ([pull](Conversion process phase: pull conversion trigger) the conversion trigger again):
./convert-gene2go.sh
Step a.8: Check out automatic/gene2go-9606.csv.e1.ttl
Data: ftp://ftp.ncbi.nih.gov/gene/DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz
Step b.1: [name](Conversion process phase: name) the data:
- base URI:
http://sparql.tw.rpi.edu/ontowiki/
- source:
ncbi-nih-gov
- dataset:
gene-mammalia-homo-sapien
- version:
2011-Feb-23
Use the HTTP modification date to name the version:
bash-3.2$ curl -I ftp://ftp.ncbi.nih.gov/gene/DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz
Last-Modified: Wed, 23 Feb 2011 08:04:26 GMT
Content-Length: 2402004
Accept-ranges: bytes
Step b.2: [retrieve](Conversion process phase: retrieve) the data: Create the directory to keep a local copy of NIH's data:
mkdir -p ~/Desktop/cshals-2011-demo/source/ncbi-nih-gov/gene-mammalia-homo-sapien/version/2011-Feb-23/source
cd ~/Desktop/cshals-2011-demo/source/ncbi-nih-gov/gene-mammalia-homo-sapien/version/2011-Feb-23/source
Step b.3: Get the gzip archive, uncompress it, and log the provenance:
pcurl.sh ftp://ftp.ncbi.nih.gov/gene/DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz
gunzip -c Homo_sapiens.gene_info.gz > Homo_sapiens.gene_info
justify.sh Homo_sapiens.gene_info.gz Homo_sapiens.gene_info uncompress
cd ~/Desktop/cshals-2011-demo/source/ncbi-nih-gov/gene-mammalia-homo-sapien/version/2011-Feb-23/
Step b.4: This file already contains only Homo sapiens, but it is tab-delimited (we need comma-separated). We make a manual tweak and store it in manual/ (capturing the provenance):
mkdir -p manual
perl -pe 's/^/"/; s/\t/","/g; s/$/"/' source/Homo_sapiens.gene_info > manual/Homo_sapiens.gene_info.csv
justify.sh source/Homo_sapiens.gene_info manual/Homo_sapiens.gene_info.csv tab2comma
Step b.5: Create verbatim interpretation of tabular literals:
cr-create-convert-sh.sh -w manual/Homo_sapiens.gene_info.csv
./convert-gene-mammalia-homo-sapien.sh
Step b.6: Cheat and get Jim's enhanced interpretation parameters:
curl -L https://github.com/timrdf/csv2rdf4lod-automation/raw/master/doc/examples/source/ncbi-nih-gov/gene-mammalia-homo-sapien/version/2011-Feb-23/manual/Homo_sapiens.gene_info.csv.e1.params.ttl > manual/Homo_sapiens.gene_info.csv.e1.params.ttl
Step b.7: Create enhanced interpretation of tabular literals:
./convert-gene-mammalia-homo-sapien.sh