implement a non-mafft-based / pairwise approach for genbank annotation #76

Closed
dpark01 opened this issue May 19, 2020 · 0 comments · Fixed by #79
dpark01 commented May 19, 2020

The current approach for preparing GenBank annotations in tbl format creates an MSA via mafft that contains every genome being prepared for submission plus an NCBI reference sequence. We then use ncbi.tbl_transfer_prealigned to transfer the gene annotations from the reference sequence to each new genome. For large submissions this is needlessly complicated, since the MSA step is not actually required.
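To make the transfer step concrete, here is a minimal sketch of how feature coordinates can be carried from a reference to a sample through two gapped rows of an alignment. It is only an illustration under stated assumptions: `build_coord_map` and `transfer_interval` are hypothetical names, and this is not the actual logic of ncbi.tbl_transfer_prealigned, which may treat partial features and gaps differently.

```python
def build_coord_map(ref_aln: str, alt_aln: str) -> dict[int, int]:
    """Map 1-based reference positions to 1-based sample positions.

    Both arguments are rows of the same alignment (equal length, '-' for gaps).
    Reference positions that face a gap in the sample are left out of the map.
    """
    if len(ref_aln) != len(alt_aln):
        raise ValueError("alignment rows must have the same length")
    mapping: dict[int, int] = {}
    ref_pos = 0
    alt_pos = 0
    for r, a in zip(ref_aln, alt_aln):
        if r != "-":
            ref_pos += 1
        if a != "-":
            alt_pos += 1
        if r != "-" and a != "-":
            mapping[ref_pos] = alt_pos
    return mapping


def transfer_interval(mapping: dict[int, int], start: int, end: int):
    """Return (start, end) in sample coordinates, or None if an endpoint is unmapped.

    Hypothetical simplification: a real tool would likely emit a partial
    feature instead of dropping it.
    """
    if start in mapping and end in mapping:
        return mapping[start], mapping[end]
    return None


ref_row = "ACGT-ACGT"   # reference row, one gap
alt_row = "A-GTTACGT"   # sample row, one gap
coord_map = build_coord_map(ref_row, alt_row)
feature_in_sample = transfer_interval(coord_map, 3, 8)
```

The same mapping works whether the two rows come out of a large MSA or a single pairwise alignment, which is why the MSA itself is not essential to the transfer.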

We should change the current genbank.wdl pipeline to compute pairwise alignments between the reference and each genome instead. These can be scattered by sample, but even if they are not, the result is an O(n) set of pairwise alignments instead of an O(n^2) multiple sequence alignment. It should also improve the mappability of features for good genomes when they are included in a batch with poor genomes.
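A rough sketch of the per-sample scatter idea, assuming a toy stand-in for the aligner: `pairwise_map` uses Python's `difflib.SequenceMatcher` purely for illustration (it is not a biological aligner and not what ncbi.tbl_transfer does), and `scatter_pairwise` is a hypothetical helper. The point is only that each sample is handled independently against the reference, so there are n alignments and a poor genome cannot degrade the mapping of a good one.

```python
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher


def pairwise_map(ref: str, sample: str) -> dict[int, int]:
    """Toy stand-in for a pairwise aligner: 0-based ref index -> sample index."""
    matcher = SequenceMatcher(None, ref, sample, autojunk=False)
    mapping: dict[int, int] = {}
    for block in matcher.get_matching_blocks():
        # Each block is a run of identical characters; the final block has size 0.
        for k in range(block.size):
            mapping[block.a + k] = block.b + k
    return mapping


def scatter_pairwise(ref: str, samples: dict[str, str], workers: int = 4):
    """Run one independent reference-vs-sample mapping per sample."""
    names = list(samples)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        maps = pool.map(lambda name: pairwise_map(ref, samples[name]), names)
        return dict(zip(names, maps))


reference = "ACGTACGTAC"
batch = {
    "good": "ACGTACGTAC",   # identical to the reference
    "poor": "NNNNNNNNNN",   # nothing alignable
}
per_sample = scatter_pairwise(reference, batch)
```

In a WDL workflow the thread pool would correspond to a scatter over samples; the "good" sample maps fully regardless of what else is in the batch.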

The viral-phylo docker container already contains ncbi.tbl_transfer, which performs the pairwise alignment on the fly and can be scattered per sample. We should revert to this approach. I think we shifted to the prealigned approach at some point in the past, when mafft alignments were assumed to be a standard output of the snakemake pipeline anyway, partly because they were needed for V-Phaser-based iSNV analyses. Those analyses are less relevant for assemble_refbased outputs. Even for denovo outputs, viral species that are diverse enough to require denovo assembly are also diverse enough to cause problems with a single mafft MSA approach.
