Nextalign rebased index seq pr #562

rneher · 2021-02-22T19:20:11Z

Description of proposed changes

Split the priorities calculation into a calculation of proximity (in batches to save memory) and priorities.

This should make the calculation of priorities more flexible and more transparent, and limit the memory requirements to a few GB instead of 50 GB.
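As a rough illustration of the split described here, the sketch below computes a proximity value for each sequence in batches and then derives priorities from those values in a separate step. It is not the code from `scripts/get_distance_to_focal_set.py`: the function names, the use of a plain Hamming distance, and the "negative distance" scoring are all assumptions made for the example.

```python
from itertools import islice


def hamming(a, b):
    """Count mismatched positions between two equal-length aligned sequences."""
    return sum(x != y for x, y in zip(a, b))


def batched(iterable, chunk_size):
    """Yield lists of at most chunk_size items so only one batch is held at a time."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def proximity_to_focal_set(context, focal, chunk_size=2):
    """Step 1: minimum distance from each context sequence to any focal sequence."""
    proximities = {}
    for chunk in batched(context, chunk_size):
        for name, seq in chunk:
            proximities[name] = min(hamming(seq, f) for f in focal)
    return proximities


def priorities_from_proximity(proximities):
    """Step 2 (hypothetical scoring): closer sequences get a higher priority."""
    return {name: -dist for name, dist in proximities.items()}


# Toy data, invented for the example.
focal = ["ACGT", "ACGA"]
context = [("s1", "ACGT"), ("s2", "TCGA"), ("s3", "TTTT")]

prox = proximity_to_focal_set(iter(context), focal)
prio = priorities_from_proximity(prox)
```

Keeping the two steps separate means the scoring in step 2 can change without repeating the expensive distance calculation in step 1.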

Related issue(s)

Fixes #546

huddlej

I'm still testing this on our SLURM cluster, so I can't comment on whether everything works or not. Inline comments below cover minor or prospective discussion points.

workflow/snakemake_rules/main_workflow.smk

workflow/snakemake_rules/core_workflow.smk

workflow/envs/nextstrain.yaml

nextstrain_profiles/nextstrain-scicore/cluster.json

scripts/get_distance_to_focal_set.py

scripts/explicit_translation.py

rneher · 2021-02-22T20:57:41Z

Thanks so much, @huddlej ! Super useful. I'll sign off for the day -- feel free to work on it!

Builds on previous commit to:
* Remove unused upload/download rules related to `prefilter` and `refilter` steps, which have been removed
* Remove references in tutorials to those rules
* Restore functionality whereby one can opt out of the `diagnostic` rule for a certain input.

The pipeline was broken for builds without nextalign (e.g. my_profile/example) due to a quirk of snakemake (make?). Issue: the outputs from `rule build_align` differ depending on whether we have nextalign or not - only when we do does `rules.build_align.output.translations` exist. This file is used as input by `rule aa_muts_explicit`, but we only ask for the output of this rule if we are running nextalign (as we know it only exists in this case). However, snakemake still tries to build the entire (?) graph and thus won't work for profiles which don't use nextalign. Changing the input to a function fixes this!
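The idea behind this fix can be sketched in plain Python: a direct attribute reference is evaluated as soon as the rule is defined, while a function that takes `wildcards` defers the lookup until it is called for a job that was actually requested. The object and paths below are hypothetical stand-ins, not the workflow's real definitions.

```python
from types import SimpleNamespace

# Hypothetical stand-in for rules.build_align.output when nextalign is NOT used:
# an alignment output exists, but there is no `translations` attribute.
build_align_output = SimpleNamespace(alignment="results/example/aligned.fasta")

# A direct reference is evaluated immediately and fails,
# even if the consuming rule would never be run.
try:
    eager = build_align_output.translations
    eager_failed = False
except AttributeError:
    eager_failed = True


def deferred_input(wildcards):
    """Input function: the attribute lookup happens only when this is called."""
    return build_align_output.translations


# Defining deferred_input raised nothing. Once the output exists
# (the nextalign case), calling the function resolves normally.
build_align_output.translations = "results/example/translations.fasta"
resolved = deferred_input(SimpleNamespace(build_name="example"))
```

In a Snakefile the same function would be passed as the rule's input, where Snakemake calls it with the job's wildcards.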

Replaces "nextalign" parameter with a string literal in all nextalign rules, adds nextalign to the conda environment, and adds conda environments to rules that needed them.

For Nextstrain builds on the rhino cluster, use more CPUs for the new nextalign rules but request less memory. Also, specifies a more reasonable amount of time for the alignment job to complete for more accurate scheduling.

Adds a command line argument for users to change the chunk size. This parameter allows users to control how much memory the script uses with the trade-off of increased run-time for lower memory usage.

Also, adds a `chunk_size` param to the proximity score rule to expose this option and potentially allow users to modify it via the config eventually.
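A minimal sketch of how such an option could be exposed, assuming a flag named `--chunk-size` and a default of 10000 (both invented for this example, since the actual argument name and default are not given here). The helper shows the trade-off: a smaller chunk size means more passes over the data but fewer sequences in memory per pass.

```python
import argparse


def parse_args(argv=None):
    """Parse a hypothetical --chunk-size option for a chunked calculation."""
    parser = argparse.ArgumentParser(
        description="Sketch of a chunked distance calculation."
    )
    parser.add_argument(
        "--chunk-size",
        type=int,
        default=10000,  # assumed default, not taken from the script
        help="number of sequences to hold in memory at once",
    )
    return parser.parse_args(argv)


def count_chunks(n_items, chunk_size):
    """Number of passes needed to cover n_items (ceiling division)."""
    return -(-n_items // chunk_size)


args = parse_args(["--chunk-size", "2500"])
passes = count_chunks(10000, args.chunk_size)
```

A Snakemake rule could then forward a `chunk_size` param to this flag in its shell command, which is what would let users set it from the config later.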

Use Snakemake's resources and threads interfaces to specify memory and CPU requirements instead of defining these in the deprecated cluster configuration file. Eventually, we'd like to replace all of the cluster config file contents with the corresponding Snakemake resource definitions.

huddlej

This is working well for me, @rneher. Even though the nextalign package for OS X hasn't made it through Bioconda yet, we should probably merge this as is. We can always update the default config to use nextalign once it is available on all platforms.

Also, I rebased onto master and force-pushed, to resolve the conflict with the master branch.

rneher · 2021-02-23T19:20:54Z

Thanks, John! I am happy for this to be merged. MacOS binaries are available as a one-line download. Do we need to adjust documentation?

huddlej · 2021-02-23T20:28:17Z

> MacOS binaries are available as a one-line download.

Yeah, I'm mostly worried about people who manage their workflow environment with --use-conda who won't have a clean path to nextalign installation. Worst case, these folks (including myself) can copy the appropriate binary into their .snakemake/conda/xyx/bin/ directory.

> Do we need to adjust documentation?

I'll review the docs with these changes in mind to see what needs updating.

Since nextalign isn't available in Bioconda for all platforms yet, don't try to install it from the default conda environment. Instead, we can use a custom Nextstrain conda environment for now.

This commit removes the rules `mutation_summary` and `build_mutation_summary`, both of which called the removed script `scripts/mutation_summary.py`. I was looking into updating the script to work with Nextclade v3 outputs but it seems like the rules using the script are not actively used in the Snakemake workflow!

In Feb 2021, the script and rule `mutation_summary` were added and the only use of the output `results/mutation_summary_{origin}.tsv.xz` was to upload it as an intermediate file of the builds.¹ In Jan 2022, the upload of the intermediate file was removed² and I cannot find any other uses of the file, so the rule `mutation_summary` seems unused.

In March 2021, the rule `build_mutation_summary` was added to support the rule `add_mutation_counts`.³ However, `add_mutation_counts` was later updated to use `augur distance`⁴, which does not require the mutation summary output. I cannot find any other uses of the output file `results/{build_name}/mutation_summary.tsv`, so the rule `build_mutation_summary` seems unused.

¹ <#562>
² <#814>
³ <3b23e8f>
⁴ <d24d531>

rneher requested a review from huddlej February 22, 2021 20:23

huddlej reviewed Feb 22, 2021


rneher and others added 27 commits February 23, 2021 10:49

add nextalign option

bb0e90d

remove filter rules

b4a2839

extend filter time limit

f80b455

add translations

c14f832

add mutation_summary rule and script

a4244a3

fix paths

d259902

add annotation

50ceffa

fix argument name

b30f876

add alignment for build

5bb46bf

fix wild card

d4ca026

add explicit translation script

9816596

fix missing output error when origin is null

295309f

fix wildcard/basename

21bf857

don't restart failed jobs. we don't have stochastic failures anymore

1d06242

fix script name

773bbd4

increase memory req for proximity rule

b484186

reenable restarts, mainly for oom reasons

ad7b386

add index

040f675

fix

2a27d6b

use function to determine unified alignment

c1beeec

calculate priorities sequentially

5584c88

fix

e58d1f9

fix

4dee5d7

split priorities into proximities and priorities

8446693

remove metadata from proximity

fc0e7dd

rneher and others added 10 commits February 23, 2021 10:51

fix import

9adefea

fix

ac30ca0

remove accidentally committed file, upload mutation summary

b83c75c

Update conda environments for new rules

211aa0d


adjust docstrings

161fc55

adjust resources

7453c71

Parameterize chunk size for proximity score calculations

4578781


Specify memory resources for higher-memory rules

27b8856


huddlej force-pushed the nextalign-rebased-index-seq-pr branch from 4ee4b73 to b542bee Compare February 23, 2021 18:51

huddlej approved these changes Feb 23, 2021


Only install nextalign for nextstrain builds

4c2cb3f


huddlej merged commit 3ddf737 into master Feb 23, 2021

huddlej deleted the nextalign-rebased-index-seq-pr branch February 23, 2021 21:12

huddlej mentioned this pull request Feb 23, 2021

Use sequence index for augur filter rules #552

Closed
