-
Notifications
You must be signed in to change notification settings - Fork 403
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Nextalign rebased index seq pr #562
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still testing this on our SLURM cluster, so I can't comment on whether everything works or not. Inline comments below cover minor or prospective discussion points.
Thanks so much, @huddlej ! Super useful. I'll sign off for the day -- feel free to work on it! |
Builds on previous commit to * Remove unused upload/download rules related to `prefilter` and `refilter` steps, which have been removed * Remove references in tutorials to those rules * Restore functionality whereby one can opt-out of the `diagnostic` rule for a certain input.
The pipeline was broken for builds without nextalign (e.g. my_profile/example) due to a quirk or snakemake (make?). Issue: the outputs from `rule build_align` differ if we have nextalign or not - only in the former case does `rules.build_align.output.translations` exist. This file is used as input by `rule aa_muts_explicit`, but we only ask for the output of this rule if we are running nextalign (as we know it only exists in this case). However snakemake still tries to build the entire (?) graph and thus won't work for profiles which don't use nextalign. Changing the input to a function fixes this!
Replaces "nextalign" parameter with a string literal in all nextalign rules, adds nextalign to the conda environment, and adds conda environments to rules that needed them.
For Nextstrain builds on the rhino cluster, use more CPUs for the new nextalign rules but request less memory. Also, specifies a more reasonable amount of time for the alignment job to complete for more accurate scheduling.
Adds a command line argument for users to change the chunk size. This parameter allows users to control how much memory the script uses with the trade-off of increased run-time for lower memory usage.
Also, adds a `chunk_size` param to the proximity score rule to expose this option and potentially allow users to modify it via the config eventually.
Use Snakemake's resources and threads interfaces to specify memory and CPU requirements instead of defining these in the deprecated cluster configuration file. Eventually, we'd like to replace all of the cluster config file contents with the corresponding Snakemake resource definitions.
4ee4b73
to
b542bee
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is working well for me, @rneher. Even though the nextalign
package for OS X hasn't made it through Bioconda yet, we should probably merge this as is. We can always update the default config to use nextalign once it is available on all platforms.
Also, I rebased onto master and force-pushed, to resolve the conflict with the master branch.
Thanks, John! I am happy for this to be merged. MacOS binaries are available as one-line download. Do we need to adjust documentation? |
Yeah, I'm mostly worried about people who manage their workflow environment with
I'll review the docs with these changes in mind to see what needs updating. |
Since nextalign isn't available in Bioconda for all platforms yet, don't try to install it from the default conda environment. Instead, we can use a custom Nextstrain conda environment for now.
This commit removes the rules `mutation_summary` and `build_mutation_summary`, both of which called the removed script `scripts/mutation_summary.py`. I was looking into updating the script to work with Nextclade v3 outputs but it seems like the rules using the script are not actively used in the Snakemake workflow! In Feb 2021, the script and rule `mutation_summary` were added and the only use of the output `results/mutation_summary_{origin}.tsv.xz` was to be uploaded it as an intermediate file of the builds.¹ In Jan 2022, the upload of the intermediate file was removed² and I cannot find any other uses of the file so the rule `mutation_summary` seems unused. In March 2021, the rule `build_mutation_summary` was added to support the rule `add_mutation_counts`.³ However, `add_mutation_counts` was later updated to use `augur distance`⁴ which does not require the mutation summary output. I cannot find any other uses of the output file `results/{build_name}/mutation_summary.tsv`, so the rule `build_mutation_summary` seems unused. ¹ <#562> ² <#814> ³ <3b23e8f> ⁴ <d24d531>
Description of proposed changes
Split the priorities calculation into a calculation of proximity (in batches to save memory) and priorities.
this should make the calculation of priorities more flexible and more transparent and limit the memory requirements to a few Gb instead of 50Gb.
Related issue(s)
Fixes #546