
PublicSequenceResource

LLTommy edited this page Apr 5, 2020 · 23 revisions

Deliverable: a public sequence resource

Coordinator: Thomas Liener

One recurring idea is to create an uploader: raw data from a sequencer (long reads and short reads) is loaded onto a backend and mapped using traditional tools as well as the variation graph/pangenome tools. Next, a visualization of the viral strain is generated in comparison with data we already have in the database. Phenotypes and metadata that we have can be presented at the same time, to show how this viral strain relates to other strains, geo info, clinical info, treatment info: anything that we have and that can be linked out. The uploaded data then becomes part of the whole.

The justification for such an uploader is easy: currently there is no system that handles ontologies well, and no system that allows for on-the-fly analysis of raw data.

Mind, this is a pretty large project! But if we split it into small parts, with each group owning a subsection, we should be able to put it together and make a working prototype. Once the full application works, we can improve it after the BioHackathon and encourage data providers to add their material. As a BioHackathon we could also get a high-impact paper out of such a project, though that is not the primary goal.

We can discuss subtasks here and ask for a group coordinator for each subtask to work out what needs to be done. Subtasks identified so far:

  1. Uploader with authentication, uploading fastq or BAM, adding known (clinical) phenotypes. Study the use of Phenopackets as a standard for phenotypic data submission. Going the other direction, this may be useful: omopomics
  2. Create workflow for traditional analysis (coordinator Michael R. Crusoe)
  3. Create workflow for vgtools (coordinator Michael R. Crusoe)
  4. Run workflows in cloud/HPC (coordinator Michael R. Crusoe)
  5. Store results in persistent storage (coordinator Michael R. Crusoe)
  6. Metadata and ontologies (interim? coordinator Thomas Liener)
  7. Define and query linked data (wikidata) (interim? coordinator Thomas Liener)
  8. Create visualization (coordinators Josiah Seaman and Simon Heumos)
  9. Create output website
  10. Coordinate with existing efforts (e.g. NextStrain, ELIXIR, others) to be able to port data back and forth! Ben Busby -- if anyone has contacts that they want to share -- please do so in slack and tag me!

All items (1-9) Vanessasaurus

Does that sound reasonable? Other tasks may be:

  1. Deploy graph store, database, IPFS (coordinator Pjotr Prins)
  2. Deploy cloud/HPC workflow runner (coordinator Pjotr Prins)
  3. Deploy web interfaces (coordinator Pjotr Prins)

The Galaxy team has already put some things in place, and we may be able to collaborate on this. Galaxy team, wdyt?
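As a starting point for the uploader subtask (uploading fastq or BAM), here is a minimal sketch of how the backend might check what kind of file it received before handing it to a workflow. It relies only on well-known file signatures: gzip/BGZF files start with the bytes `1f 8b`, a decompressed BAM stream starts with `BAM\1`, and FASTQ records start with `@`. The function name and the returned labels are made up for illustration, and the `@` check is only a heuristic (SAM headers also start with `@`), so a real uploader would need proper validation.

```python
import gzip
import io
import zlib

GZIP_MAGIC = b"\x1f\x8b"   # first two bytes of any gzip/BGZF file
BAM_MAGIC = b"BAM\x01"     # first four bytes of a decompressed BAM stream


def sniff_sequence_format(data: bytes) -> str:
    """Guess the format of an uploaded payload (hypothetical helper).

    Returns 'bam', 'fastq.gz', 'fastq', or 'unknown'.
    """
    if data.startswith(GZIP_MAGIC):
        # BAM uses BGZF, which is gzip-compatible, so peek at the inner bytes.
        try:
            with gzip.open(io.BytesIO(data), "rb") as handle:
                inner = handle.read(4)
        except (OSError, EOFError, zlib.error):
            return "unknown"
        if inner == BAM_MAGIC:
            return "bam"
        if inner.startswith(b"@"):
            return "fastq.gz"
        return "unknown"
    if data.startswith(b"@"):
        # Heuristic only: FASTQ records begin with '@', but so do SAM headers.
        return "fastq"
    return "unknown"


if __name__ == "__main__":
    print(sniff_sequence_format(b"@read1\nACGT\n+\nIIII\n"))
```

Anything that comes back as `unknown` could be rejected at upload time, before any workflow is started.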

Work plan revisit (proposal)

This should define the pieces and which BH groups might take on which topic. The change compared to the plan above is that we aim for a data repository as the goal, while leaving the output website to "consuming applications" if they wish. (A consuming application could also be a public resource, I guess.)

  1. (Yellow) Raw sequence uploader (Pjotr, Workflows) and sample metadata (Ontology, FAIR Data)
  2. (Pink) Workflows and workflow metadata (Workflow, FAIR Workflow Hub)
  3. (Blue) Workflow output (Workflow, VGGraph, ...)
  4. (Orange) Data repository: Searchable data access (Ontology, FAIR Data)
  5. (Green) Applications: Different applications and BH subgroups could consume the data (All interested Applications)
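To make the pieces above a bit more concrete, here is a rough sketch of how one repository entry could tie them together: raw files and sample metadata (1), workflow outputs (3), and a searchable repository (4) that consuming applications (5) query. All class names, field names, and the `EXAMPLE:` term IDs are hypothetical placeholders, not an agreed schema; the real design would come out of the Ontology and FAIR Data groups.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SampleRecord:
    """One uploaded sample (hypothetical schema, for discussion only)."""
    sample_id: str
    raw_files: list[str]                      # uploaded fastq/BAM file names
    metadata: dict[str, str]                  # attribute -> ontology term ID
    workflow_outputs: dict[str, str] = field(default_factory=dict)


class Repository:
    """In-memory stand-in for the searchable data repository."""

    def __init__(self) -> None:
        self._records: dict[str, SampleRecord] = {}

    def add(self, record: SampleRecord) -> None:
        self._records[record.sample_id] = record

    def attach_output(self, sample_id: str, workflow: str, location: str) -> None:
        # Record where a workflow wrote its result for this sample.
        self._records[sample_id].workflow_outputs[workflow] = location

    def search(self, attribute: str, term_id: str) -> list[SampleRecord]:
        # Exact match on an ontology term ID; no term hierarchy is considered.
        return [
            r for r in self._records.values()
            if r.metadata.get(attribute) == term_id
        ]


if __name__ == "__main__":
    repo = Repository()
    repo.add(SampleRecord("s1", ["s1.fastq"], {"location": "EXAMPLE:0001"}))
    repo.add(SampleRecord("s2", ["s2.bam"], {"location": "EXAMPLE:0002"}))
    repo.attach_output("s1", "pangenome", "results/s1.gfa")
    for rec in repo.search("location", "EXAMPLE:0001"):
        print(rec.sample_id, rec.workflow_outputs)
```

Searching by ontology term ID rather than free text is the point of the sketch: it is what would let different consuming applications ask the same question and get the same samples back.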

[diagram missing]
