frbr: CSHALS 2011 tutorial

timrdf edited this page Feb 23, 2011 · 76 revisions

In-depth conversion for Jim's tutorial

Jim McCusker used csv2rdf4lod to incorporate some data for his Semantic Healthcare and Life Sciences Tutorial. This (on-the-fly!) tutorial provides more detail on how he did it. I am piecing it together from the provenance that csv2rdf4lod captured when Jim originally used it for his demo.

Source files on GitHub

You can get the source at:

https://github.com/timrdf/csv2rdf4lod-automation/tree/master/doc/examples/source/ncbi-nih-gov

Overview of csv2rdf4lod workflow

[Figure: diagram of provenance captured during csv2rdf4lod conversion]

Install csv2rdf4lod

Installing csv2rdf4lod automation

gene2go

Data: ftp://ftp.ncbi.nih.gov/gene/DATA/gene2go.gz

Step a.1: [name](Conversion process phase: name) the data:

Use the Last-Modified date that the server reports to name the version (here, 2011-Feb-23):

bash-3.2$ curl -I ftp://ftp.ncbi.nih.gov/gene/DATA/gene2go.gz
Last-Modified: Wed, 23 Feb 2011 07:49:05 GMT
Content-Length: 12359614
Accept-ranges: bytes
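
The version name can also be derived from the header mechanically. The following is a minimal sketch, not part of csv2rdf4lod: `version_from_last_modified` is a hypothetical helper name, and it assumes the header has the shape shown above (weekday, day, month, year, time, zone).

```shell
# Hypothetical helper (not a csv2rdf4lod script): read response headers on
# stdin and print a version identifier of the form YYYY-Mon-DD.
version_from_last_modified() {
  # Header fields: "Last-Modified:" "Wed," "23" "Feb" "2011" "07:49:05" "GMT"
  # tr strips the carriage returns that header lines may end with.
  tr -d '\r' | awk 'tolower($1) == "last-modified:" { print $5 "-" $4 "-" $3 }'
}

# Example using the header line captured above.
printf 'Last-Modified: Wed, 23 Feb 2011 07:49:05 GMT\r\n' | version_from_last_modified
```

Against the live server, the same function could be fed with `curl -sI ftp://ftp.ncbi.nih.gov/gene/DATA/gene2go.gz | version_from_last_modified`.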

Step a.2: [retrieve](Conversion process phase: retrieve) the data: Create the directory to keep a local copy of NIH's data:

mkdir -p ~/Desktop/cshals-2011-demo/source/ncbi-nih-gov/gene2go/version/2011-Feb-23/source
cd ~/Desktop/cshals-2011-demo/source/ncbi-nih-gov/gene2go/version/2011-Feb-23/source

Step a.3: Download the gzip archive, uncompress it, and log the provenance:

pcurl.sh ftp://ftp.ncbi.nih.gov/gene/DATA/gene2go.gz
gunzip -c gene2go.gz > gene2go
justify.sh gene2go.gz gene2go uncompress
cd ~/Desktop/cshals-2011-demo/source/ncbi-nih-gov/gene2go/version/2011-Feb-23/

Step a.4: [csv-ify](Conversion process phase: csv-ify) the data: We only need the Homo sapiens rows (taxonomy ID 9606), and the file is tab-delimited (we need comma-separated). As a manual tweak, we filter those rows, strip the GO: prefix, quote the fields, and store the result in manual/ (capturing the provenance):

grep "^9606" source/gene2go | perl -pe 's/^/"/; s/GO://; s/\t/","/g; s/$/"/' > manual/gene2go-9606.csv
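
Note that `grep "^9606"` also matches any line whose first field merely begins with 9606. The sketch below is an awk-only alternative (swapped in for the grep and perl pipeline, not the command Jim ran) that compares the first field exactly. `csvify_taxon` is a hypothetical helper name, and the sample rows are made up for illustration.

```shell
# Hypothetical helper: keep rows whose first tab-separated field equals the
# given taxonomy ID, drop the first "GO:" prefix, and emit quoted CSV.
csvify_taxon() {
  awk -F'\t' -v t="$1" '
    $1 == t {
      sub(/GO:/, "")        # remove the first "GO:" on the line
      gsub(/\t/, "\",\"")   # each tab becomes ","
      print "\"" $0 "\""    # add the opening and closing quotes
    }'
}

# Made-up sample rows (not real gene2go content): the second row has a
# different first field and is filtered out.
printf '9606\t1\tGO:0000001\n96060\t2\tGO:0000002\n' | csvify_taxon 9606
```

With the real file, this would be run as `csvify_taxon 9606 < source/gene2go > manual/gene2go-9606.csv`. Like the perl one-liner, it does not escape double quotes that appear inside a field.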

Step a.5: Create verbatim interpretation of tabular literals ([create](Conversion process phase: create conversion trigger) and [pull](Conversion process phase: pull conversion trigger) the conversion trigger):

cr-create-convert-sh.sh -w manual/gene2go-9606.csv
./convert-gene2go.sh

Step a.6: Cheat and get Jim's enhanced interpretation parameters:

curl -L https://github.com/timrdf/csv2rdf4lod-automation/raw/master/doc/examples/source/ncbi-nih-gov/gene2go/version/2011-Feb-23/manual/gene2go-9606.csv.e1.params.ttl > manual/gene2go-9606.csv.e1.params.ttl

Step a.7: Create enhanced interpretation of tabular literals ([pull](Conversion process phase: pull conversion trigger) the conversion trigger again):

./convert-gene2go.sh

Step a.8: Check out automatic/gene2go-9606.csv.e1.ttl

homo sapiens gene info

Data: ftp://ftp.ncbi.nih.gov/gene/DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz

Step b.1: [name](Conversion process phase: name) the data:

  • base URI: http://sparql.tw.rpi.edu/ontowiki/
  • source: ncbi-nih-gov
  • dataset: gene-mammalia-homo-sapien
  • version: 2011-Feb-23

Use the Last-Modified date that the server reports to name the version:

bash-3.2$ curl -I ftp://ftp.ncbi.nih.gov/gene/DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz
Last-Modified: Wed, 23 Feb 2011 08:04:26 GMT
Content-Length: 2402004
Accept-ranges: bytes

Step b.2: [retrieve](Conversion process phase: retrieve) the data: Create the directory to keep a local copy of NIH's data:

mkdir -p ~/Desktop/cshals-2011-demo/source/ncbi-nih-gov/gene-mammalia-homo-sapien/version/2011-Feb-23/source
cd ~/Desktop/cshals-2011-demo/source/ncbi-nih-gov/gene-mammalia-homo-sapien/version/2011-Feb-23/source
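
Both walkthroughs place the retrieved files under the same directory pattern, built from the source, dataset, and version identifiers. A small sketch of that pattern follows; `version_source_dir` is a hypothetical helper name and is not part of csv2rdf4lod.

```shell
# Hypothetical helper: compose the version's source/ directory from the
# identifiers chosen when naming the data.
version_source_dir() {
  root="$1"; source_id="$2"; dataset_id="$3"; version_id="$4"
  printf '%s/source/%s/%s/version/%s/source\n' \
    "$root" "$source_id" "$dataset_id" "$version_id"
}

dir=$(version_source_dir "$HOME/Desktop/cshals-2011-demo" \
        ncbi-nih-gov gene-mammalia-homo-sapien 2011-Feb-23)
echo "$dir"
# To create and enter the directory: mkdir -p "$dir" && cd "$dir"
```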

Step b.3: Download the gzip archive, uncompress it, and log the provenance:

pcurl.sh ftp://ftp.ncbi.nih.gov/gene/DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz
gunzip -c Homo_sapiens.gene_info.gz > Homo_sapiens.gene_info
justify.sh Homo_sapiens.gene_info.gz Homo_sapiens.gene_info uncompress
cd ~/Desktop/cshals-2011-demo/source/ncbi-nih-gov/gene-mammalia-homo-sapien/version/2011-Feb-23/

Step b.4: This file already contains only Homo sapiens, so no filtering is needed, but it is tab-delimited (we need comma-separated). We make a manual tweak and store it in manual/ (capturing the provenance):

cat source/Homo_sapiens.gene_info | perl -pe 's/^/"/; s/\t/","/g; s/$/"/' > manual/Homo_sapiens.gene_info.csv
justify.sh source/Homo_sapiens.gene_info manual/Homo_sapiens.gene_info.csv tab2comma
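
Because this step converts every line without filtering, the source file and the CSV should have the same number of lines. The following is a minimal sanity-check sketch under that assumption; `same_line_count` is a hypothetical helper, not a csv2rdf4lod script.

```shell
# Hypothetical check: succeed only if both files have the same line count.
same_line_count() {
  a=$(wc -l < "$1" | tr -d ' ')
  b=$(wc -l < "$2" | tr -d ' ')
  [ "$a" -eq "$b" ]
}

# Only run the comparison if both files from this step are present.
if [ -f source/Homo_sapiens.gene_info ] && [ -f manual/Homo_sapiens.gene_info.csv ]; then
  if same_line_count source/Homo_sapiens.gene_info manual/Homo_sapiens.gene_info.csv; then
    echo "line counts match"
  else
    echo "line counts differ" >&2
  fi
fi
```

The check does not apply to the gene2go conversion, where the 9606 filter intentionally drops rows.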

Step b.5: Create verbatim interpretation of tabular literals:

cr-create-convert-sh.sh -w manual/Homo_sapiens.gene_info.csv
./convert-gene-mammalia-homo-sapien.sh

Step b.6: Cheat and get Jim's enhanced interpretation parameters:

curl -L https://github.com/timrdf/csv2rdf4lod-automation/raw/master/doc/examples/source/ncbi-nih-gov/gene-mammalia-homo-sapien/version/2011-Feb-23/manual/Homo_sapiens.gene_info.csv.e1.params.ttl > manual/Homo_sapiens.gene_info.csv.e1.params.ttl

Step b.7: Create enhanced interpretation of tabular literals:

./convert-gene-mammalia-homo-sapien.sh