Support for INSDC Sequence records (i.e., Genbank/EMBL format)? #1219

Closed
laserson opened this issue Oct 19, 2016 · 10 comments

@laserson
Contributor

Any thoughts about supporting the INSDC data model as well? This would be for the data in GenBank/EMBL format, and corresponds to the SeqRecord data type in Biopython. Apologies if this question is totally redundant/annoying.

@fnothaft
Member

I think that'd map well to @heuermh's bigdatagenomics/bdg-formats#83, no? If so, SGTM!

@laserson
Contributor Author

Yes, but it would extend a ways past it. What @heuermh currently implemented is similar to Biopython's Seq object. The SeqRecord object includes one of those, but also includes per-letter annotations (e.g., quality scores), dbxrefs, and a list of Feature objects. Basically, the sequence and all the annotations on top of it. Is there any appetite to develop that?
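
For readers following along, here is a minimal sketch of the shape being described: a sequence plus per-letter annotations, dbxrefs, and a list of features. It uses plain Python dataclasses rather than Biopython itself, and every class and field name (`SequenceRecord`, `Feature`, `letter_annotations`, and so on) is a hypothetical stand-in, not the ADAM or bdg-formats schema. The dbxref value is a made-up placeholder.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Feature:
    """A located annotation such as a gene or CDS (hypothetical, simplified)."""
    type: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive
    qualifiers: Dict[str, List[str]] = field(default_factory=dict)


@dataclass
class SequenceRecord:
    """A sequence plus the annotations layered on top of it (hypothetical)."""
    id: str
    sequence: str
    dbxrefs: List[str] = field(default_factory=list)
    letter_annotations: Dict[str, List[int]] = field(default_factory=dict)
    features: List[Feature] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Per-letter annotations need exactly one value per residue.
        for name, values in self.letter_annotations.items():
            if len(values) != len(self.sequence):
                raise ValueError(
                    f"{name!r} has {len(values)} values "
                    f"for {len(self.sequence)} letters"
                )

    def extract(self, feature: Feature) -> str:
        """Return the subsequence covered by a feature (forward strand only)."""
        return self.sequence[feature.start:feature.end]


record = SequenceRecord(
    id="example1",
    sequence="ATGGCCTAA",
    dbxrefs=["ExampleDB:0001"],  # placeholder cross-reference
    letter_annotations={"phred_quality": [30, 31, 32, 33, 34, 35, 36, 37, 38]},
    features=[Feature("CDS", 0, 9, qualifiers={"gene": ["demo"]})],
)
print(record.extract(record.features[0]))
```

The length check in `__post_init__` reflects the one constraint that per-letter annotations imply: each annotation track has to line up with the sequence it decorates.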

@fnothaft
Member

Oh, that's kinda cool. I might say that we could add the per-base annotations and the dbxrefs, but I would leave out the Features, since we can reconstruct that from a join. @heuermh thoughts?
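
To illustrate the "reconstruct from a join" idea, here is a small sketch that keeps sequences and features as separate collections and attaches features to their sequence by name. In ADAM this would presumably be a distributed join. The sketch uses plain Python dictionaries instead, and the field names (`name`, `sequence_name`) and the helper `join_features` are assumptions made for illustration.

```python
from collections import defaultdict

# Sequences and features stored independently, linked only by sequence name.
sequences = [
    {"name": "seq1", "sequence": "ATGGCCTAA"},
    {"name": "seq2", "sequence": "GGGCCC"},
    {"name": "seq3", "sequence": "TTTT"},
]
features = [
    {"sequence_name": "seq1", "type": "gene", "start": 0, "end": 9},
    {"sequence_name": "seq1", "type": "CDS", "start": 0, "end": 9},
    {"sequence_name": "seq2", "type": "misc_feature", "start": 2, "end": 5},
]


def join_features(sequences, features):
    """Left join: every sequence is kept, with its (possibly empty) feature list."""
    by_name = defaultdict(list)
    for feature in features:
        by_name[feature["sequence_name"]].append(feature)
    return [
        dict(seq, features=by_name.get(seq["name"], []))
        for seq in sequences
    ]


joined = join_features(sequences, features)
for row in joined:
    print(row["name"], [f["type"] for f in row["features"]])
```

Storing features separately avoids duplicating them inside each record, at the cost of a join whenever the full annotated record is needed.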

@heuermh
Member

heuermh commented Oct 19, 2016

I was thinking Sequence and Feature as currently defined, and then a SequenceAnnotation or similar for the list of features and other stuff that goes in a Biopython SeqRecord or Biojava RichSequence. Didn't the GA4GH recently propose something for this?

@fnothaft
Member

Nice! I like that proposal.

@heuermh
Member

heuermh commented Jan 10, 2017

@laserson What would you like to use as the specification/reference model for this?

I've written Biojava 1.x (biojava-legacy) to/from Sequence/Slice/Read conversions here; something similar could be done for Biojava 5.x on HEAD (biojava), and it wouldn't take much to use those libraries to support Genbank/EMBL formats in addition to FASTA and FASTQ.

Since Biojava is LGPL version 2.1, there is a question of where this code should live. There is also a question of the proposal at https://github.com/bigdatagenomics/bdg-convert.

@laserson
Contributor Author

I don't have experience with Biojava, but it's probably a good place to start?

@heuermh
Member

heuermh commented Jan 10, 2017

A good place to start, sure, but some of the code is 15 years old. Is there a more up-to-date specification or reference?

There's this for the feature table
http://www.insdc.org/files/feature_table.html

and an annotated example Genbank record
https://www.ncbi.nlm.nih.gov/genbank/samplerecord/

I think I'm with Chris Fields in that, well, good luck
https://www.biostars.org/p/165727/

Given a Genbank record, which bits would be important to you?

@laserson
Contributor Author

Yeah, I'm also aware of how unspecified the Genbank format is.

For me, the obvious things to add are Features on top of the sequence. The per-letter annotations would also be nice, as would the key-value pairs you can add (are these called "qualifiers" or something like that?).
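
For context, the key-value pairs on a feature are indeed called qualifiers in the INSDC feature table, and they appear as `/key="value"`, `/key=value`, or a bare `/key` flag. Below is a rough sketch of collecting them into a dictionary. The function name `parse_qualifiers` is hypothetical, and the sketch assumes each qualifier fits on one line, so wrapped multi-line values (common for `/translation`) are not handled.

```python
import re
from typing import Dict, Iterable, List

# Matches "/key", "/key=value", or '/key="value"' on a single line.
QUALIFIER = re.compile(r'^/(?P<key>[A-Za-z0-9_\-]+)(?:=(?P<value>.*))?$')


def parse_qualifiers(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Collect single-line feature qualifiers into key -> list of values."""
    qualifiers: Dict[str, List[str]] = {}
    for raw in lines:
        match = QUALIFIER.match(raw.strip())
        if match is None:
            continue  # sketch only: skip anything that is not a qualifier line
        value = match.group("value")
        # Flag-style qualifiers such as /pseudo carry no value.
        value = "" if value is None else value.strip('"')
        # A key may repeat (e.g. several /db_xref lines), so keep a list.
        qualifiers.setdefault(match.group("key"), []).append(value)
    return qualifiers


example = [
    '                     /gene="demo"',
    '                     /codon_start=1',
    '                     /pseudo',
    '                     /db_xref="ExampleDB:1"',
    '                     /db_xref="ExampleDB:2"',
]
parsed = parse_qualifiers(example)
print(parsed)
```

Keeping a list per key matters because qualifiers like `/db_xref` can legitimately occur more than once on the same feature.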

@fnothaft fnothaft added this to the 0.24.0 milestone Mar 3, 2017
@heuermh heuermh modified the milestones: 0.24.0, 0.25.0 Jan 9, 2018
@heuermh heuermh modified the milestones: 0.26.0, 0.27.0 Feb 18, 2019
@heuermh heuermh modified the milestones: 0.27.0, 0.28.0 May 7, 2019
@heuermh
Member

heuermh commented Jun 24, 2019

Fixed by https://github.com/heuermh/biojava-adam, which is in the process of being transferred to the Biojava organization.

@heuermh heuermh closed this as completed Jun 24, 2019