nextstrain workflow split #101

Closed
dpark01 opened this issue Jun 9, 2020 · 0 comments · Fixed by #115
dpark01 commented Jun 9, 2020

Proposal: break up build_augur_tree into two workflows (or rather, keep the original end-to-end workflow and provide two half workflows for typical use):

  1. genomes (unaligned fastas) -> min base count filter (not augur filter, not metadata-informed) -> mafft -> snp-sites. Emits an MSA fasta and VCF.
  2. MSA or VCF -> augur mask, a re-positioned augur filter and everything downstream in build_augur_tree
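For illustration, the metadata-free filter in step 1 could look roughly like the sketch below. This is only a minimal stand-in, assuming the filter counts unambiguous bases (A/C/G/T) per genome; the function names and the threshold are hypothetical and not taken from the existing workflow.

```python
import io
from typing import Iterable, Iterator, TextIO, Tuple

# Assumption: "base count" means unambiguous nucleotides only.
UNAMBIGUOUS = set("ACGTacgt")


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an unaligned FASTA stream."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def filter_by_min_bases(
    records: Iterable[Tuple[str, str]], min_unambig: int
) -> Iterator[Tuple[str, str]]:
    """Keep genomes with at least min_unambig unambiguous bases.

    Note that no metadata is consulted, so edits to metadata.tsv
    cannot change the output of this step.
    """
    for name, seq in records:
        if sum(base in UNAMBIGUOUS for base in seq) >= min_unambig:
            yield name, seq


if __name__ == "__main__":
    demo = io.StringIO(">a\nACGTNNNN\nACGT\n>b\nNNNNNNAC\n")
    for name, seq in filter_by_min_bases(read_fasta(demo), min_unambig=8):
        print(f">{name}\n{seq}")
```

The survivors would then go to mafft and snp-sites unchanged.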

In typical rapid iterative analyses, it would be nice to do step 1 once (2-3 hrs) and not have to re-do it. In the current build_augur_tree workflow, any modification to the metadata.tsv will cause the augur filter (subsample sequences) task to re-run (since it takes metadata as input), which then forces mafft and everything else to re-run. Similarly, there are cases where we'd want to run mafft just once, or not at all (users can download full MSAs from GISAID and skip that step entirely).
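To make the re-run problem concrete, here is a toy model of input-keyed task reuse. It assumes a task is reused only when a hash of its inputs is unchanged, and it lets that hash stand in for the task's output as well; this is a simplification for illustration, not a description of how any particular workflow engine implements caching. The task names and function names are hypothetical.

```python
import hashlib
from typing import Dict


def digest(*parts: str) -> str:
    """Hash a task name plus its inputs into a reuse key."""
    h = hashlib.sha256()
    for part in parts:
        h.update(part.encode())
        h.update(b"\0")  # separator so ("ab", "c") != ("a", "bc")
    return h.hexdigest()


def current_order_keys(genomes: str, metadata: str) -> Dict[str, str]:
    """Current layout: metadata-informed filter runs BEFORE mafft."""
    filt = digest("augur_filter", genomes, metadata)
    mafft = digest("mafft", filt)
    tree = digest("tree", mafft, metadata)
    return {"filter": filt, "mafft": mafft, "tree": tree}


def split_order_keys(genomes: str, metadata: str) -> Dict[str, str]:
    """Proposed layout: mafft sits in step 1 and never sees metadata."""
    base_filter = digest("min_base_filter", genomes)
    mafft = digest("mafft", base_filter)
    filt = digest("augur_filter", mafft, metadata)
    tree = digest("tree", filt)
    return {"base_filter": base_filter, "mafft": mafft,
            "filter": filt, "tree": tree}


if __name__ == "__main__":
    before = current_order_keys("genomes-v1", "metadata-v1")
    after = current_order_keys("genomes-v1", "metadata-v2")
    print("current layout, mafft reusable:", before["mafft"] == after["mafft"])

    before = split_order_keys("genomes-v1", "metadata-v1")
    after = split_order_keys("genomes-v1", "metadata-v2")
    print("split layout, mafft reusable:", before["mafft"] == after["mafft"])
```

In this model, editing the metadata changes the mafft key only in the current layout; in the split layout only the step 2 tasks are invalidated.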

Users could then more rapidly iterate on changes to their metadata, subsampling strategies, ancestral inferences, and color/geoloc renderings by operating on step 2 more often. The slowest part of step 2 is treetime (augur refine) which can take a couple hours--all other steps complete within minutes. We can separately look into optimizations of the treetime step (outside the scope of this PR).

Additionally, users who want to work on non-viral genomes, where an MSA might not be appropriate or practical but VCF-based SNP calls are, could run all of step 2 based on VCF inputs. Note that all augur steps that take MSAs (including augur tree, refine, ancestral, filter, mask, etc.) can also take VCFs.
