Proposal: break up `build_augur_tree` into two workflows (or rather, keep the original end-to-end workflow and provide two half workflows for typical use):

1. Genomes (unaligned FASTAs) -> min base count filter (not `augur filter`, not metadata-informed) -> mafft -> snp-sites. Emits an MSA FASTA and a VCF.
2. MSA or VCF -> `augur mask`, a re-positioned `augur filter`, and everything downstream in `build_augur_tree`.
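
For illustration, the metadata-free min base count filter at the start of step 1 could look roughly like the sketch below. It assumes "base count" means the number of unambiguous A/C/G/T characters per genome; the function names and the threshold are hypothetical and not part of the existing workflow.

```python
from typing import Iterable, Iterator, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            # Use the first whitespace-delimited token as the sequence name.
            name = header[0] if header else ""
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def count_unambiguous(seq: str) -> int:
    """Count A/C/G/T characters, ignoring case (assumed definition of base count)."""
    return sum(1 for base in seq.upper() if base in "ACGT")


def filter_min_bases(
    records: Iterable[Tuple[str, str]], min_bases: int
) -> Iterator[Tuple[str, str]]:
    """Keep only records with at least min_bases unambiguous bases.

    No metadata is consulted, so edits to metadata.tsv cannot affect the result.
    """
    for name, seq in records:
        if count_unambiguous(seq) >= min_bases:
            yield name, seq
```

Because the filter depends only on the sequences themselves, its output (and everything computed from it in step 1) stays valid when the metadata changes.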
In typical rapid iterative analyses, it would be nice to run step 1 once (2-3 hrs) and not have to redo it. In the current `build_augur_tree` workflow, any modification to metadata.tsv causes the `augur filter` (subsample sequences) task to re-run, since it takes metadata as input, which then forces mafft and everything else to re-run. Similarly, there are cases where we'd want to run mafft just once, or not at all: users can download full MSAs from GISAID and skip that step entirely.
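
The re-run cascade can be illustrated with a toy model of input-based call caching. This is a simplified sketch and not how the workflow engine actually computes its cache hashes: it assumes each task's cache key is a hash of its name and inputs, and that each downstream task is keyed on the upstream task's key. All function names are hypothetical.

```python
import hashlib
from typing import Dict, List


def cache_key(task: str, inputs: List[str]) -> str:
    """Toy cache key: SHA-256 over the task name and its inputs."""
    h = hashlib.sha256()
    h.update(task.encode())
    for item in inputs:
        h.update(b"\0")
        h.update(item.encode())
    return h.hexdigest()


def current_order(genomes: str, metadata: str) -> Dict[str, str]:
    """Current layout: augur filter (needs metadata) runs before mafft."""
    keys = {}
    keys["filter"] = cache_key("augur_filter", [genomes, metadata])
    keys["mafft"] = cache_key("mafft", [keys["filter"]])
    keys["tree"] = cache_key("augur_tree", [keys["mafft"]])
    return keys


def proposed_order(genomes: str, metadata: str) -> Dict[str, str]:
    """Proposed layout: mafft sees only genomes; augur filter moves downstream."""
    keys = {}
    keys["mafft"] = cache_key("mafft", [genomes])
    keys["filter"] = cache_key("augur_filter", [keys["mafft"], metadata])
    keys["tree"] = cache_key("augur_tree", [keys["filter"]])
    return keys


def rerun(before: Dict[str, str], after: Dict[str, str]) -> List[str]:
    """Tasks whose cache key changed and therefore must run again."""
    return sorted(task for task in before if before[task] != after[task])
```

In this model, changing only the metadata invalidates `filter`, `mafft`, and `tree` under the current order, but only `filter` and `tree` under the proposed order, so the alignment is reused.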
Users could then iterate more rapidly on changes to their metadata, subsampling strategies, ancestral inferences, and color/geoloc renderings by re-running only step 2. The slowest part of step 2 is treetime (`augur refine`), which can take a couple of hours; all other steps complete within minutes. We can separately look into optimizing the treetime step (outside the scope of this PR).
Additionally, users who want to work on non-viral genomes, where an MSA might not be appropriate or practical but VCF-based SNP calls are, could run all of step 2 from VCF inputs. Note that all augur steps that take MSAs (including `augur tree`, `refine`, `ancestral`, `filter`, `mask`, etc.) can also take VCFs.