FairData
As an initial remark about data sharing: SARS-CoV-2 genomes are sequenced by a variety of institutions, which submit their results to GISAID.org. From there, the data are accessible only after creating a user account and then clicking through the UI to reach the record you want. Simply fetching all genomes (there are only a few hundred, at about 30k bases each, so it is not a huge set) is currently not possible at all, let alone via an API.
FAIRification Strategy Strawman for Monday 0900 CEST meeting
We have set up a Web location where you can deposit data that should be "FAIRified". Please submit your data to that folder as follows:
- Create a ZIP file containing:
- your data
- a metadata file explaining license and citation at a minimum (more is better!)
- a data dictionary to explain e.g. what the column headers mean
- contact info for you so that we can ask questions
- Upload it wherever you wish on the Web. If you want to put it into our repository (if it isn't HUGE!), you can do so as follows:
curl -v -L -X PUT -H "Accept: text/turtle" -H "Content-type: application/zip" -u hackathon:b**h******n --data-binary @sampledata.zip http://ldp.cbgp.upm.es:8890/DAV/coronavirus/To_Be_FAIRified/sampledata.zip
- indicate in the "To Be Fairified" section at the bottom of this page:
- your Name
- your slack ID
- The URL to your zip file
- we will add it to the FAIR transformation queue!
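As a rough illustration of the packaging step above, here is a minimal Python sketch that bundles the four requested parts into one ZIP. The function name and all file names in the usage note are hypothetical; only the list of required contents comes from the instructions above.

```python
import zipfile
from pathlib import Path


def build_submission_zip(zip_path, data_files, metadata_file, dictionary_file, contact_file):
    """Bundle data, metadata, data dictionary and contact info into one ZIP."""
    members = [*data_files, metadata_file, dictionary_file, contact_file]
    # Fail early if any of the requested parts is missing on disk.
    missing = [str(p) for p in members if not Path(p).is_file()]
    if missing:
        raise FileNotFoundError(f"missing submission files: {missing}")
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for p in members:
            # Store each file flat, under its base name only.
            zf.write(p, arcname=Path(p).name)
    return Path(zip_path)
```

For example, `build_submission_zip("sampledata.zip", ["cases.csv"], "metadata.txt", "dictionary.txt", "contact.txt")` would produce an archive that can then be uploaded with the curl command shown above.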
- Mark Wilkinson (coordinator)
- Michel Dumontier
- Stian Soiland-Reyes
- Philippe Rocca-Serra
- Evangelos Pafilis
- Susanna-Assunta Sansone
- Lynn Schriml
- https://www.ncbi.nlm.nih.gov/genbank/sars-cov-2-seqs/ (in progress by MarkW)
For instance describe/package as an RO-Crate: (MDW: note that I have spoken with the RO Crate team, and they think the use of LDP as the container system for Crates would be a good idea. that's what I plan to do...)
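To give a feel for what describing a dataset as an RO-Crate might involve, the sketch below builds a minimal metadata document. It assumes the RO-Crate 1.1 layout (a `ro-crate-metadata.json` descriptor plus a root `./` Dataset); the version actually used here may differ, and the function name and its parameters are hypothetical.

```python
import json


def minimal_ro_crate(name, description, license_url, date_published, files):
    """Return a minimal RO-Crate metadata document (assumed RO-Crate 1.1 layout)."""
    graph = [
        {
            # Metadata descriptor: points at the root dataset of the crate.
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {
            # Root dataset: carries the license and lists the payload files.
            "@id": "./",
            "@type": "Dataset",
            "name": name,
            "description": description,
            "license": {"@id": license_url},
            "datePublished": date_published,
            "hasPart": [{"@id": f} for f in files],
        },
    ]
    # One File entity per payload file.
    graph += [{"@id": f, "@type": "File"} for f in files]
    return {"@context": "https://w3id.org/ro/crate/1.1/context", "@graph": graph}


def write_ro_crate_metadata(path, crate):
    """Serialise the metadata document as pretty-printed JSON."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(crate, fh, indent=2)
```

The resulting JSON file would sit alongside the data files inside the crate's container.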
NOTE: I am looking for someone who knows how to configure Virtuoso for HTTPS... thanks!
I have created a Linked Data Platform endpoint on my institutional server in Madrid for us to use for back-end storage. It uses Virtuoso's LDP implementation (so we get SPARQL over the Linked Data submitted to that server):
- https://w3id.org/FAIR_COVID19/DAV/coronavirus/ro-crates/ (note the trailing slash!)
- For GET operations, you need no un/pw
- For POST and PUT operations your username is hackathon and pw is b**hac****on
- The endpoint for PUT/POST operations is: http://ldp.cbgp.upm.es:8890/DAV/coronavirus/
(see below for the curl command to push data into that Container. Please choose a unique identifier for your crate, make it an LDP container, and then push the crate into it.)
There is NOTHING on that server that is in any way valuable - it is entirely used for FAIR training - so we can make as many mistakes as we need to and I can wipe the DB and start again if necessary. Alternately, you can download the image linked above, and run it on localhost for your tests.
Please be "good citizens" and start by creating a sub-container inside of the /coronavirus/ container where you can store your information. Please remember that LDP Containers have a trailing slash! I believe that the Virtuoso implementation of LDP can ingest both Turtle and JSON-LD for the purposes of SPARQL, but I have only ever tried Turtle so I cannot promise the latter. The SPARQL endpoint is: https://w3id.org/FAIR_Training_LDP/sparql
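As a hedged sketch of how the SPARQL endpoint might be queried, the Python below builds a GET request following the standard SPARQL protocol (a `query` URL parameter). It assumes the endpoint accepts that form and can return JSON results; the example query, which lists things typed as `ldp:Container`, is purely illustrative and may return nothing, depending on what the server indexes.

```python
import urllib.parse
import urllib.request

SPARQL_ENDPOINT = "https://w3id.org/FAIR_Training_LDP/sparql"  # from the text above

# Illustrative query only; adjust to whatever was actually loaded.
LIST_CONTAINERS = """
PREFIX ldp: <http://www.w3.org/ns/ldp#>
SELECT ?container WHERE { ?container a ldp:Container } LIMIT 20
"""


def make_sparql_request(endpoint, query):
    """Build (but do not send) a SPARQL protocol GET request."""
    params = urllib.parse.urlencode({"query": query})
    req = urllib.request.Request(f"{endpoint}?{params}")
    # Ask for the standard SPARQL JSON results format.
    req.add_header("Accept", "application/sparql-results+json")
    return req
```

To run it, pass the request to `urllib.request.urlopen(make_sparql_request(SPARQL_ENDPOINT, LIST_CONTAINERS))` and parse the response body with `json.load`.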
To create your "home" or "unique crate" Container:
Create a file "container.ttl" that contains a small piece of turtle:
@prefix ldp: <http://www.w3.org/ns/ldp#> .
<> a ldp:Container .
To upload this to the server:
curl -v -H "Accept: text/turtle" -H "Content-type: text/turtle" -u hackathon:********** --data-binary @container.ttl -H "Slug: myCrateName" http://ldp.cbgp.upm.es:8890/DAV/coronavirus/ro-crates/
(note that the trailing slash is required for containers! If you miss it, you will get a 301 redirect)
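For anyone scripting this rather than using curl, the following Python sketch builds the equivalent POST request. It assumes the server accepts HTTP Basic authentication (which is what `curl -u` sends by default); the function name is hypothetical, and the password is deliberately left as a parameter rather than filled in.

```python
import base64
import urllib.request

# Parent container from the curl command above (note the trailing slash).
LDP_PARENT = "http://ldp.cbgp.upm.es:8890/DAV/coronavirus/ro-crates/"

# Same Turtle as container.ttl.
CONTAINER_TTL = b"@prefix ldp: <http://www.w3.org/ns/ldp#> .\n<> a ldp:Container .\n"


def make_container_request(parent_url, slug, user, password):
    """Build (but do not send) a POST that asks the server to create a container."""
    if not parent_url.endswith("/"):
        # Without the trailing slash the server answers with a 301 redirect.
        raise ValueError("LDP container URLs must end with a trailing slash")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req = urllib.request.Request(parent_url, data=CONTAINER_TTL, method="POST")
    req.add_header("Content-Type", "text/turtle")
    req.add_header("Accept", "text/turtle")
    req.add_header("Slug", slug)  # suggested name for the new container
    req.add_header("Authorization", f"Basic {token}")
    return req
```

Sending it would be `urllib.request.urlopen(make_container_request(LDP_PARENT, "myCrateName", "hackathon", password))`, with `password` supplied by you.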
To create an ldp:Resource, the RDF should have the rdf:type ldp:Resource .
For more complex interactions, see the options in the HTTP headers.
A page where people can deposit any properties/classes that are currently missing from existing ontologies. https://docs.google.com/document/d/1HWp2EvTRCn-lNSoN5RF_XLcbbT9j8IrGkMQXcgjdbTI/edit#heading=h.rbnwes4ofzsi (shared with the Ontology team)
- Sub-topic: Workflow Hub
Working with the ELIXIR effort, this project proposes to set up an early pre-production instance of the EOSC-Life Workflow Hub, covid19.workflowhub.eu, to be a registry that gathers the COVID-19 workflows and their metadata. Part of the task here is also to curate the existing workflows and help make them interoperable, reusable and reproducible.
The curated metadata will be in a FAIR format based on RO-Crate and BioSchemas annotations and where possible contributed back to the workflow's origin GitHub repositories.
For details, tasks and participants, see sub-topic Workflow Hub.
Mark Wilkinson @Mark Wilkinson Ministerio de Sanidad sobre coronavirus España (link to up-to-date data in bottom left)
Mark Wilkinson @Mark Wilkinson SARS-CoV-2 sequences GenBank
Philippe Rocca-Serra SARS-CoV-2 exposed CACO-2 cell - protein profiling - proteome analysis
- available from dedicated GitHub repository
- original data available from PRIDE with accession number PXD107710
- reannotated dataset available from Zenodo
- metadata available in ISA format (ISA-Tab and ISA-JSON)
- raw data available as mzML format, converted from raw MS files
- derived data available as an R-ready CSV file in long-table layout, ready for consumption by the ggplot2 R library
- bundled as a bdbag archive.
- release via Zenodo
- Johns Hopkins repo
- European Centre for Disease Prevention and Control
- Automated Data Collection: COVID-19/SARS-COV-2 Cases in EU by Country, State/Province/Local Authorities, and Date
- EBI Data
- nCoV sequences GISAID
- Please be aware of the licenses
- Kaggle (all COVID-19 Related challenges)
- Kaggle COVID-19 Open Research Dataset Challenge (CORD-19)
- COVID Epidemiology
- NY Times data
- NHS Covid19 symptom tracker