nextstrain workflow split #101

dpark01 · 2020-06-09T13:06:53Z

Proposal: break up build_augur_tree into two workflows (or rather, keep the original end-to-end workflow and provide two half workflows for typical use):

1. genomes (unaligned fastas) -> min base count filter (not augur filter, not metadata-informed) -> mafft -> snp-sites. Emits an MSA fasta and VCF.
2. MSA or VCF -> augur mask, a re-positioned augur filter and everything downstream in build_augur_tree
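
To make the "min base count filter" in the first half concrete, here is a minimal sketch of a metadata-free filter that keeps only genomes with enough unambiguous bases. The function names (`parse_fasta`, `count_unambiguous`, `filter_min_bases`) and the choice to count only A/C/G/T are assumptions for illustration, not the workflow's actual task.

```python
from typing import Iterable, Iterator, List, Tuple

# Assumption: "base count" means unambiguous nucleotides only (no N or IUPAC codes).
UNAMBIGUOUS = frozenset("ACGTacgt")


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def count_unambiguous(seq: str) -> int:
    """Count A/C/G/T characters, ignoring N, gaps, and ambiguity codes."""
    return sum(1 for base in seq if base in UNAMBIGUOUS)


def filter_min_bases(
    records: Iterable[Tuple[str, str]], min_bases: int
) -> List[Tuple[str, str]]:
    """Keep records with at least min_bases unambiguous bases (no metadata needed)."""
    return [(h, s) for h, s in records if count_unambiguous(s) >= min_bases]
```

Because this filter looks only at the sequences, its result does not change when metadata.tsv changes, which is what lets the first half run once and be reused.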

In typical rapid iterative analyses, it would be nice to do step 1 once (2-3hrs) and not have to re-do it. In the current build_augur_tree workflow, any modification to the metadata.tsv will cause the augur filter (subsample sequences) task to re-run (since it takes metadata as input), which then forces mafft and everything else to re-run. Similarly, there are cases where we'd want to run mafft just once, or not at all (users can download full MSAs from GISAID and could skip that step entirely).
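
The re-run cascade can be illustrated with a toy model. This sketch assumes the workflow engine reuses a task's result only when all of that task's inputs are unchanged, and it models each task's output by a hash of its name and inputs. The task ordering is simplified and all function names are hypothetical.

```python
import hashlib
from typing import Dict, List


def task_key(name: str, *inputs: str) -> str:
    """Hash a task name with its inputs; stands in for the task's cached output."""
    h = hashlib.sha256()
    for part in (name, *inputs):
        h.update(part.encode())
        h.update(b"\0")  # separator so ("ab", "c") differs from ("a", "bc")
    return h.hexdigest()


def combined_keys(genomes: str, metadata: str) -> Dict[str, str]:
    """Current layout: the metadata-aware filter runs before alignment."""
    filt = task_key("augur_filter", genomes, metadata)
    align = task_key("mafft", filt)
    tree = task_key("augur_tree", align)
    return {"augur_filter": filt, "mafft": align, "augur_tree": tree}


def split_keys(genomes: str, metadata: str) -> Dict[str, str]:
    """Proposed layout: alignment depends on genomes only; filter moves after it."""
    base = task_key("min_base_filter", genomes)
    align = task_key("mafft", base)
    filt = task_key("augur_filter", align, metadata)
    tree = task_key("augur_tree", filt)
    return {"min_base_filter": base, "mafft": align,
            "augur_filter": filt, "augur_tree": tree}


def changed(before: Dict[str, str], after: Dict[str, str]) -> List[str]:
    """Names of tasks whose key changed, i.e. tasks that would have to re-run."""
    return sorted(k for k in before if before[k] != after[k])
```

In this model, editing only the metadata invalidates `mafft` in the combined layout but not in the split one: `changed(combined_keys("g", "m1"), combined_keys("g", "m2"))` includes `"mafft"`, while the same comparison with `split_keys` lists only `"augur_filter"` and `"augur_tree"`.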

Users could then more rapidly iterate on changes to their metadata, subsampling strategies, ancestral inferences, and color/geoloc renderings by operating on step 2 more often. The slowest part of step 2 is treetime (augur refine), which can take a couple of hours; all other steps complete within minutes. We can separately look into optimizations of the treetime step (outside the scope of this PR).

Additionally, users who want to work on non-viral genomes where an MSA might not be appropriate or practical, but VCF-based SNP-calls are, could run all of step 2 based on VCF inputs. Note that all augur steps that take MSAs (including augur tree, refine, ancestral, filter, mask, etc) can also take VCFs.


dpark01 mentioned this issue Jun 9, 2020

nextstrain: add augur mask step #99

Merged

dpark01 self-assigned this Jun 10, 2020

dpark01 mentioned this issue Jun 11, 2020

add snp-sites task to nextstrain workflows #115

Merged

dpark01 closed this as completed in #115 Jun 11, 2020
