Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nextalign rebased index seq pr #562

Merged
merged 38 commits into from
Feb 23, 2021
Merged

Conversation

rneher
Copy link
Member

@rneher rneher commented Feb 22, 2021

Description of proposed changes

Split the priorities calculation into a calculation of proximity (in batches to save memory) and priorities.

this should make the calculation of priorities more flexible and more transparent and limit the memory requirements to a few Gb instead of 50Gb.

Related issue(s)

Fixes #546

@rneher rneher requested a review from huddlej February 22, 2021 20:23
Copy link
Contributor

@huddlej huddlej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still testing this on our SLURM cluster, so I can't comment on whether everything works or not. Inline comments below cover minor or prospective discussion points.

workflow/snakemake_rules/main_workflow.smk Outdated Show resolved Hide resolved
workflow/snakemake_rules/main_workflow.smk Show resolved Hide resolved
workflow/snakemake_rules/main_workflow.smk Show resolved Hide resolved
workflow/snakemake_rules/core_workflow.smk Outdated Show resolved Hide resolved
workflow/envs/nextstrain.yaml Show resolved Hide resolved
nextstrain_profiles/nextstrain-scicore/cluster.json Outdated Show resolved Hide resolved
scripts/get_distance_to_focal_set.py Outdated Show resolved Hide resolved
scripts/get_distance_to_focal_set.py Outdated Show resolved Hide resolved
scripts/explicit_translation.py Show resolved Hide resolved
@rneher
Copy link
Member Author

rneher commented Feb 22, 2021

Thanks so much, @huddlej ! Super useful. I'll sign off for the day -- feel free to work on it!

rneher and others added 27 commits February 23, 2021 10:49
Builds on previous commit to 

* Remove unused upload/download rules related to `prefilter` and `refilter` steps, which have been removed
* Remove references in tutorials to those rules
* Restore functionality whereby one can opt-out of the `diagnostic` rule for a certain input.
The pipeline was broken for builds without nextalign (e.g. my_profile/example) due to a quirk or snakemake (make?). 

Issue: the outputs from `rule build_align` differ if we have nextalign or not - only in the former case does `rules.build_align.output.translations` exist.  This file is used as input by `rule aa_muts_explicit`, but we only ask for the output of this rule if we are running nextalign (as we know it only exists in this case). However snakemake still tries to build the entire (?) graph and thus won't work for profiles which don't use nextalign. Changing the input to a function fixes this!
rneher and others added 10 commits February 23, 2021 10:51
Replaces "nextalign" parameter with a string literal in all nextalign rules,
adds nextalign to the conda environment, and adds conda environments to rules
that needed them.
For Nextstrain builds on the rhino cluster, use more CPUs for the new nextalign
rules but request less memory. Also, specifies a more reasonable amount of time
for the alignment job to complete for more accurate scheduling.
Adds a command line argument for users to change the chunk size. This parameter
allows users to control how much memory the script uses with the trade-off of
increased run-time for lower memory usage.
Also, adds a `chunk_size` param to the proximity score rule to expose this
option and potentially allow users to modify it via the config eventually.
Use Snakemake's resources and threads interfaces to specify memory and CPU
requirements instead of defining these in the deprecated cluster configuration
file. Eventually, we'd like to replace all of the cluster config file contents
with the corresponding Snakemake resource definitions.
@huddlej huddlej force-pushed the nextalign-rebased-index-seq-pr branch from 4ee4b73 to b542bee Compare February 23, 2021 18:51
Copy link
Contributor

@huddlej huddlej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is working well for me, @rneher. Even though the nextalign package for OS X hasn't made it through Bioconda yet, we should probably merge this as is. We can always update the default config to use nextalign once it is available on all platforms.

Also, I rebased onto master and force-pushed, to resolve the conflict with the master branch.

@rneher
Copy link
Member Author

rneher commented Feb 23, 2021

Thanks, John! I am happy for this to be merged. MacOS binaries are available as one-line download. Do we need to adjust documentation?

@huddlej
Copy link
Contributor

huddlej commented Feb 23, 2021

MacOS binaries are available as one-line download.

Yeah, I'm mostly worried about people who manage their workflow environment with --use-conda who won't have a clean path to nextalign installation. Worst case, these folks (including myself) can copy the appropriate binary into their .snakemake/conda/xyx/bin/ directory.

Do we need to adjust documentation?

I'll review the docs with these changes in mind to see what needs updating.

Since nextalign isn't available in Bioconda for all platforms yet, don't
try to install it from the default conda environment. Instead, we can
use a custom Nextstrain conda environment for now.
@huddlej huddlej merged commit 3ddf737 into master Feb 23, 2021
@huddlej huddlej deleted the nextalign-rebased-index-seq-pr branch February 23, 2021 21:12
joverlee521 added a commit that referenced this pull request Oct 23, 2024
This commit removes the rules `mutation_summary` and
`build_mutation_summary`, both of which called the removed script
`scripts/mutation_summary.py`.

I was looking into updating the script to work with Nextclade v3 outputs
but it seems like the rules using the script are not actively used in
the Snakemake workflow!

In Feb 2021, the script and rule `mutation_summary` were added and the
only use of the output `results/mutation_summary_{origin}.tsv.xz`
was to be uploaded it as an intermediate file of the builds.¹
In Jan 2022, the upload of the intermediate file was removed² and I
cannot find any other uses of the file so the rule `mutation_summary`
seems unused.

In March 2021, the rule `build_mutation_summary` was added to support
the rule `add_mutation_counts`.³ However, `add_mutation_counts` was
later updated to use `augur distance`⁴ which does not require the
mutation summary output. I cannot find any other uses of the output file
`results/{build_name}/mutation_summary.tsv`, so the rule
`build_mutation_summary` seems unused.

¹ <#562>
² <#814>
³ <3b23e8f>
⁴ <d24d531>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

reserve more ram for priorities.py by default in snakefile
3 participants