The current approach for preparing GenBank annotations in tbl format involves building an MSA with mafft that contains every genome being prepared for submission plus an NCBI reference sequence. We then use ncbi.tbl_transfer_prealigned to transfer the gene annotations from the reference sequence to each new genome. For large submissions this is unnecessarily complicated, because the MSA step is not actually needed.
We should change the current genbank.wdl pipeline to instead compute a pairwise alignment between the reference and each genome. These can be scattered by sample, but even if they are not, this will result in an O(n) set of pairwise alignments instead of an O(n^2) multiple sequence alignment. It should also improve the mappability of features for good genomes when they are included in a batch with poor genomes.
The viral-phylo docker container already contains ncbi.tbl_transfer, which performs the pairwise alignment on the fly and can be scattered per sample. We should revert to this approach. I think we shifted to the prealigned approach at some point in the past, when mafft alignments were assumed to be a standard output of the snakemake pipeline anyway, partly because they were needed for V-Phaser-based iSNV analyses. Those analyses are less relevant for outputs based on assemble_refbased. Even for denovo outputs, viral species that are diverse enough to require denovo assembly are also diverse enough to cause problems with a single mafft MSA approach.
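To make the per-sample idea concrete, here is a rough Python sketch of how a feature interval on the reference could be carried over to one genome through its own pairwise alignment. This is not the ncbi.tbl_transfer implementation. The function names (`build_ref_to_sample_map`, `transfer_interval`), the toy sequences, and the rule that a feature is dropped when an endpoint falls in a sample gap are all assumptions made for illustration.

```python
from typing import Dict, Optional, Tuple


def build_ref_to_sample_map(ref_aln: str, sample_aln: str) -> Dict[int, Optional[int]]:
    """Map 1-based reference positions to 1-based sample positions.

    Both inputs are rows of one pairwise alignment ('-' marks a gap).
    Reference positions aligned to a sample gap map to None.
    """
    if len(ref_aln) != len(sample_aln):
        raise ValueError("aligned sequences must have equal length")
    mapping: Dict[int, Optional[int]] = {}
    ref_pos = 0
    sample_pos = 0
    for r, s in zip(ref_aln, sample_aln):
        if s != "-":
            sample_pos += 1
        if r != "-":
            ref_pos += 1
            mapping[ref_pos] = sample_pos if s != "-" else None
    return mapping


def transfer_interval(
    mapping: Dict[int, Optional[int]], start: int, end: int
) -> Optional[Tuple[int, int]]:
    """Transfer a 1-based inclusive interval; return None if an endpoint is unmapped.

    Dropping the feature in that case is an illustrative assumption only.
    """
    new_start = mapping.get(start)
    new_end = mapping.get(end)
    if new_start is None or new_end is None:
        return None
    return (new_start, new_end)


# Hypothetical toy data: each sample carries its OWN pairwise alignment to the
# reference, so a poor genome cannot distort the coordinates of a good one.
pairwise_alignments = {
    "sample_good": ("ATGAAACCC-GGGTAA", "ATGAAACCCTGGGTAA"),
    "sample_poor": ("ATGAAACCCGGGTAA", "ATGAAACCCGGG---"),
}
reference_features = {"gene_full": (1, 15), "gene_inner": (4, 9)}

# One independent unit of work per sample; this loop is what a scatter would split up.
transferred = {}
for sample_name, (ref_aln, sample_aln) in pairwise_alignments.items():
    mapping = build_ref_to_sample_map(ref_aln, sample_aln)
    transferred[sample_name] = {
        name: transfer_interval(mapping, start, end)
        for name, (start, end) in reference_features.items()
    }
```

Each iteration of the loop touches only one sample and the reference, which is why the work is linear in the number of genomes and can be scattered per sample.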