Support for INSDC Sequence records (i.e., Genbank/EMBL format)? #1219
Comments
I think that'd map well to @heuermh's bigdatagenomics/bdg-formats#83, no? If so, SGTM!
Yes, but it would extend a ways past it. What @heuermh currently implemented is similar to Biopython's
Oh, that's kinda cool. I might say that we could add the per-base annotations and the dbxrefs, but I would leave out the Features, since we can reconstruct that from a join. @heuermh thoughts?
I was thinking
Nice! I like that proposal.
@laserson What would you like to use as the specification/reference model for this? I've written Biojava 1.x (biojava-legacy) to/from Since Biojava is LGPL version 2.1, there is a question of where this code should live. There is also a question of the proposal at https://github.com/bigdatagenomics/bdg-convert.
I don't have experience with Biojava but probably a good place to start?
A good place to start, sure, but some of the code is 15 years old. Is there a more up-to-date specification or reference? There's this for the feature table and an annotated example Genbank record. I think I'm with Chris Fields in that, well, good luck. Given a Genbank record, which bits would be important to you?
Yeah, I'm also aware of how unspecified the Genbank format is. For me, the obvious things to add are Features on top of the sequence. Also nice would be the per-letter annotations. And also the key-value pairs you can add (is this called "qualifiers" or something like that?)
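To make the request above concrete, here is a minimal sketch of a record that carries features, key-value qualifiers, per-letter annotations, and dbxrefs on top of the sequence. It is only an illustration: the class and field names (SequenceRecord, Feature, letter_annotations, and so on) are hypothetical and are not the ADAM, bdg-formats, Biojava, or Biopython API, and the example values are made up.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Feature:
    """A located annotation on a sequence (hypothetical type for illustration)."""
    feature_type: str                  # e.g. "CDS" or "gene"
    start: int                         # 0-based, inclusive
    end: int                           # 0-based, exclusive
    strand: int = 1                    # +1 or -1
    # Key-value pairs attached to the feature; a key may repeat, so values are lists.
    qualifiers: Dict[str, List[str]] = field(default_factory=dict)


@dataclass
class SequenceRecord:
    """A sequence plus the record-level data discussed in this thread."""
    name: str
    sequence: str
    dbxrefs: List[str] = field(default_factory=list)
    features: List[Feature] = field(default_factory=list)
    # Per-letter annotations: each list must have one entry per base.
    letter_annotations: Dict[str, List[int]] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Reject per-letter annotations whose length differs from the sequence.
        for key, values in self.letter_annotations.items():
            if len(values) != len(self.sequence):
                raise ValueError(
                    f"letter annotation {key!r} has {len(values)} values, "
                    f"expected {len(self.sequence)}"
                )

    def extract(self, feature: Feature) -> str:
        """Return the forward-strand slice covered by the feature.

        Strand is ignored in this sketch (no reverse complement).
        """
        return self.sequence[feature.start:feature.end]


# Made-up example record; the dbxref and qualifier values are placeholders.
cds = Feature("CDS", 0, 9, qualifiers={"product": ["hypothetical protein"]})
record = SequenceRecord(
    name="example",
    sequence="ATGGCCTAA",
    dbxrefs=["ExampleDB:0001"],
    features=[cds],
    letter_annotations={"quality": [30] * 9},
)
print(record.extract(cds))
```

Keeping features as a list on the record mirrors what is asked for here; storing them separately and joining on the sequence name, as suggested earlier in the thread, would be an alternative layout for the same information.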
Fixed by https://github.com/heuermh/biojava-adam, which is in the process of being transferred to the Biojava organization.
Any thoughts about supporting the INSDC data model as well? This would be for the data in Genbank/EMBL format. This corresponds to the SeqRecord data type in Biopython. Apologies if this question is totally redundant/annoying.