implement a non-mafft-based / pairwise approach for genbank annotation #76

dpark01 · 2020-05-19T17:20:50Z

The current approach for preparing genbank annotation in tbl format involves creating an MSA via mafft of all genomes being prepared for submission together with an NCBI reference sequence. We then use ncbi.tbl_transfer_prealigned to transfer the gene annotation from the reference sequence to each new genome. For large submissions this is unnecessarily complicated, because the MSA step is not actually needed.

We should change the current genbank.wdl pipeline to instead compute pairwise alignments between the reference and each genome. These can be scattered by sample, but even if they are not, this will result in an O(n) set of pairwise alignments instead of an O(n^2) multiple sequence alignment. It should also improve the mappability of features for good genomes when they are included in a batch with poor genomes.

The viral-phylo docker container already contains ncbi.tbl_transfer, which performs the pairwise alignment on the fly and can be scattered per-sample. We should revert to this approach. I think we shifted to the prealigned approach at some point in the past, when mafft alignments were assumed to be a standard output of the snakemake pipeline anyway, partly because they were needed for V-Phaser-based iSNV analyses. Those analyses are less relevant for assemble_refbased-based outputs. Even for denovo outputs, viral species that are diverse enough to require denovo assembly are also diverse enough to cause problems with a single mafft MSA approach.
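To make the per-sample idea concrete, here is a minimal sketch of how feature coordinates could be carried from the reference to one genome through a single pairwise alignment. This is not the code in ncbi.tbl_transfer; the function names (`build_coord_map`, `transfer_interval`, `transfer_batch`) and the input shape (two gapped strings per sample, produced by some aligner) are assumptions made for illustration only.

```python
from typing import Dict, List, Optional, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive


def build_coord_map(ref_aln: str, sample_aln: str) -> List[Optional[int]]:
    """For each reference base (in order), give the 1-based sample position
    aligned to it, or None if the sample has a gap at that column.
    Hypothetical helper: inputs are the two rows of one pairwise alignment."""
    if len(ref_aln) != len(sample_aln):
        raise ValueError("aligned sequences must have equal length")
    mapping: List[Optional[int]] = []
    sample_pos = 0
    for ref_char, sample_char in zip(ref_aln, sample_aln):
        if sample_char != "-":
            sample_pos += 1
        if ref_char != "-":
            mapping.append(sample_pos if sample_char != "-" else None)
    return mapping


def transfer_interval(mapping: List[Optional[int]], start: int, end: int) -> Optional[Interval]:
    """Map a reference interval onto the sample. Returns None when no base of
    the feature is present in the sample."""
    hits = [p for p in mapping[start - 1:end] if p is not None]
    if not hits:
        return None
    return hits[0], hits[-1]


def transfer_batch(
    ref_features: Dict[str, Interval],
    alignments: Dict[str, Tuple[str, str]],
) -> Dict[str, Dict[str, Optional[Interval]]]:
    """One independent pairwise mapping per sample: n samples, n alignments.
    Each loop iteration depends only on its own sample, so it could be
    scattered."""
    result = {}
    for sample, (ref_aln, sample_aln) in alignments.items():
        mapping = build_coord_map(ref_aln, sample_aln)
        result[sample] = {
            name: transfer_interval(mapping, start, end)
            for name, (start, end) in ref_features.items()
        }
    return result
```

Because each sample is mapped against the reference on its own, a poor genome in the batch only affects its own result, which is the behavior we want from the pairwise approach.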

dpark01 self-assigned this May 19, 2020

dpark01 mentioned this issue May 20, 2020

genbank workflow: switch from MSA to pairwise approach #79

Merged

dpark01 closed this as completed in #79 May 21, 2020
