-
Notifications
You must be signed in to change notification settings - Fork 403
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Drop backward compatibility #615
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this @huddlej! Enforcing the new style of inputs allows the snakefile to be much nicer. Apart from the comment on trimmed_origin
, everything looks good as is.
Given that this will be a "breaking change", I would favour adding a few more changes to this major version bump. (Equally some could go in the next one).
- Remove usage of "exposure" - removes the need for a lot of "getter" functions, removes the
rule incorporate_travel_history
entirely, and generally makes auspice easier to understand. Edit: this might be quite a lot of work... - Add nice error messages if one supplies an old-style sequence/metadata input.
Thank you for checking this out, @jameshadfield! I'll work on cutting out the
This seems really important, since we're dropping support for that interface in this PR. We also need to provide nice errors when no inputs are defined.
If this will be a lot of work, we could handle it in a separate PR. Some of the proposals in the Google Doc for making things simpler won't be simple to implement themselves. :) |
80bb8d2
to
6b32071
Compare
Commit 6b32071 adds helpful error messages for users with the old input config or no input configs and also removes all references to the trimmed origin, underscores in origins, etc. from the Snakemake files and multiple inputs tutorial. I think I caught every reference to the old config style and origin wildcards with underscores, but would you mind giving this another look, @jameshadfield, just to verify? |
my_profiles/example/builds.yaml
Outdated
inputs: | ||
- name: example-data | ||
metadata: data/example_metadata.tsv | ||
sequences: data/example_sequences.fasta.gz |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This currently fails, as the compressed fasta is handed to nextalign
which results in an empty results/aligned_example-data.fasta
(no error raised). The pipeline eventually fails at a downstream filter step (which is slightly confusing).
We could add a check to _get_path_for_input
which throws a nice error if the input looks like it's compressed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
e.g.
diff --git a/workflow/snakemake_rules/common.smk b/workflow/snakemake_rules/common.smk
index d4a09568..105d16b2 100644
--- a/workflow/snakemake_rules/common.smk
+++ b/workflow/snakemake_rules/common.smk
@@ -53,6 +53,10 @@ def _get_path_for_input(stage, origin_wildcard):
if path_or_url=="" and stage in ["metadata", "sequences"]:
raise Exception(f"ERROR: config->input->{origin_wildcard}->{stage} is not defined.")
+ ## Currently only metadata can be compressed!
+ if stage!="metadata" and (path_or_url.endswith(".xz") or path_or_url.endswith(".gz")):
+ raise Exception(f"Compressed sequence input is not yet supported. Please check {path_or_url}")
+
if stage=="metadata":
return f"data/downloaded_{origin_wildcard}.tsv" if remote else path_or_url
if stage=="sequences":
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(And we should update my_profiles/example/builds.yaml
to use data/example_sequences.fasta
. The docs already instruct you to decompress the sequences file before running.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! It turns out that the example profile has been broken on master
for a while now (since the proximity script started ignoring the reference strains), so we'll need to fix that in a separate PR.
In the meantime, I switched the starting sequences for the example profile back to the uncompressed file. This compressed vs uncompressed issue should be mostly handled by #605. I was hoping to merge that PR first, so this PR won't need to add any checks for compression state.
rule align: | ||
message: | ||
""" | ||
Aligning sequences to {input.reference} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good choice to move to nextalign
for everything. If nextalign
is missing from the environment we get the error
/usr/local/bin/bash: line 1: nextalign: command not found
which is probably good enough. I considered adding something simple to the Snakefile
like the following, but it may be more trouble than it's worth.
# Ensure `nextalign` is available, as we no longer support `mafft`
import shutil
if shutil.which("nextalign") is None:
logger.error("ERROR: `nextalign` does not seem to be installed, which is needed for this pipeline.")
sys.exit(1)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think we can't handle this much better than recommending users add the --use-conda
to their Snakemake commands or use the CLI and Docker image.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice one @huddlej! Apart from some minor fixes this looks good to go 👍
6b32071
to
1e68f6d
Compare
Makes nextalign the default and only tool for alignment, removing mafft rules and adding configuration details (genes) needed by nextalign to the default parameters.
Removes the haplotype status rule and script that is no longer required.
Removes experimental and unused rule to calculate mutation frequencies from multiple sequence alignments.
Replaces custom script with augur distance command that does the same thing.
Removes deprecated sequence and metadata inputs from the configuration file and removes Snakemake logic required to support these files. Also, removes references to this deprecated input format from the example profiles and the "multiple inputs" tutorial. Since we no longer support this old input format, we also no longer need to support empty origin wildcards. We drop support for empty origin wildcard and remove all references to trimming of origin wildcards that start with an underscore and update all rules to reference the origin wildcard with the underscore in the filename. We also now print helpful errors when inputs aren't defined properly through checks for configurations with old-style input definitions or without any inputs defined. These error messages provide recommendations about how to update the workflow configuration to fix the issues. Fixes #616
1e68f6d
to
074816e
Compare
one issue with dropping backwards compatibility is that users using previous versions can no longer pull in the updated exclude files etc... |
That's a good point. We've been treating these workflows and the data they use as parts of a single package, but the workflow is software and the data are data. If we want to avoid this issue generally in the future, we should probably think about how to decouple the software from the data it consumes. For example, we could continue to version the excludes and other curated files in this repo, but we could host the latest files for others to use on data.nextstrain.org. This approach would treat those data more like the files people already download from GISAID, but we could allow users to provide paths to data.nextstrain.org in their config files so they don't have to manually download anything. This solution could lead to files:
include: "https://data.nextstrain.org/ncov_include.txt"
exclude: "https://data.nextstrain.org/ncov_exclude.txt"
reference: "defaults/reference_seq.gb"
alignment_reference: "defaults/reference_seq.fasta"
annotation: "defaults/annotation.gff"
outgroup: "defaults/outgroup.fasta" # is this arg still used? file doesn't exist in gh repo
ordering: "defaults/color_ordering.tsv"
color_schemes: "defaults/color_schemes.tsv"
auspice_config: "defaults/auspice_config.json"
lat_longs: "https://data.nextstrain.org/ncov_lat_longs.tsv"
description: "defaults/description.md"
clades: "defaults/clades.tsv"
emerging_lineages: "defaults/emerging_lineages.tsv" |
Without a Conda environment, the distances rule tries to use whatever version of Augur is installed globally for the user which either leads to an error (when Augur is not installed) or a potential conflict in node data JSON versions during augur export v2 (when an older version of Augur than 12.0.0 is installed globally).
I successfully ran through the Nextstrain profile on both AWS (through a trial build) and our SLURM cluster, so this is ready to merge. If the nextalign change ends up being too problematic for folks, it will be easy to revert that specific commit and try something else like keeping mafft but making nextalign the default. |
This is very 'neat', thanks for sharing @huddlej! Looking at the changelog, I had a few questions (apologies if these are already documented elsewhere, happy to be redirected :)
What do you mean by "old...inputs"? I.e., what is the requirement and how is it different?
Is this now the only alignment algo available?
What does this last bullet mean? cc @ttung |
Thank you for these clarifying questions, @sidneymbell; I was a little too terse in the change log! I've proposed some revisions to the change log that should answer these questions. I want to include a link to the new configuration parameters docs from the revised change log, but these new docs haven't been merged yet so I can't quite yet. |
Thank you!! Super helpful.
…On Wed, May 12, 2021 at 12:03 PM John Huddleston ***@***.***> wrote:
Thank you for these clarifying questions, @sidneymbell
<https://github.com/sidneymbell>; I was a little too terse in the change
log! I've proposed some revisions to the change log that should answer
these questions <#634>. I want to
include a link to the new configuration parameters docs
<https://github.com/nextstrain/ncov/blob/add-config-docs/docs/configuration.md>
from the revised change log, but these new docs haven't been merged yet so
I can't quite yet.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#615 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/ADAIYX3VXPERF5IT4FMAUETTNLGITANCNFSM43LJNAFQ>
.
|
Description of proposed changes
This is a proof-of-concept PR showing components of the ncov workflow that I think could be removed or simplified.
Most of these changes rely on the idea of making a major backward incompatible release that removes much of our backward-compatibility that has survived at the expense of a more complicated workflow.
Each commit represents a single simplification step that could be cherry-picked or rebased onto master, if we choose.
Testing
Release checklist
If this pull request introduces backward incompatible changes, complete the following steps for a new release of the workflow:
v5
docs/change_log.md
in this pull request to document these changes and the new version number.