Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Drop backward compatibility #615

Merged
merged 7 commits into from
May 7, 2021
Merged

Drop backward compatibility #615

merged 7 commits into from
May 7, 2021

Conversation

huddlej
Copy link
Contributor

@huddlej huddlej commented Apr 21, 2021

Description of proposed changes

This is a proof-of-concept PR showing components of the ncov workflow that I think could be removed or simplified.

Most of these changes rely on the idea of making a major backward incompatible release that removes much of our backward-compatibility that has survived at the expense of a more complicated workflow.

Each commit represents a single simplification step that could be cherry-picked or rebased onto master, if we choose.

Testing

Release checklist

If this pull request introduces backward incompatible changes, complete the following steps for a new release of the workflow:

  • Determine the version number for the new release by incrementing the most recent release: v5
  • Update docs/change_log.md in this pull request to document these changes and the new version number.
  • After merging, create a new GitHub release with the new version number as the tag and release title.

@huddlej huddlej marked this pull request as ready for review April 27, 2021 04:41
Copy link
Member

@jameshadfield jameshadfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this @huddlej! Enforcing the new style of inputs allows the snakefile to be much nicer. Apart from the comment on trimmed_origin, everything looks good as is.

Given that this will be a "breaking change", I would favour adding a few more changes to this major version bump. (Equally some could go in the next one).

  • Remove usage of "exposure" - removes the need for a lot of "getter" functions, removes the rule incorporate_travel_history entirely, and generally makes auspice easier to understand. Edit: this might be quite a lot of work...
  • Add nice error messages if one supplies an old-style sequence/metadata input.

workflow/snakemake_rules/common.smk Outdated Show resolved Hide resolved
@huddlej huddlej mentioned this pull request Apr 30, 2021
8 tasks
@huddlej
Copy link
Contributor Author

huddlej commented Apr 30, 2021

Thank you for checking this out, @jameshadfield! I'll work on cutting out the trim_origin bits.

Add nice error messages if one supplies an old-style sequence/metadata input.

This seems really important, since we're dropping support for that interface in this PR. We also need to provide nice errors when no inputs are defined.

Remove usage of "exposure"

If this will be a lot of work, we could handle it in a separate PR. Some of the proposals in the Google Doc for making things simpler won't be simple to implement themselves. :)

@huddlej huddlej force-pushed the drop-backward-compatibility branch from 80bb8d2 to 6b32071 Compare May 3, 2021 17:01
@huddlej
Copy link
Contributor Author

huddlej commented May 3, 2021

Commit 6b32071 adds helpful error messages for users with the old input config or no input configs and also removes all references to the trimmed origin, underscores in origins, etc. from the Snakemake files and multiple inputs tutorial.

I think I caught every reference to the old config style and origin wildcards with underscores, but would you mind giving this another look, @jameshadfield, just to verify?

inputs:
- name: example-data
metadata: data/example_metadata.tsv
sequences: data/example_sequences.fasta.gz
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This currently fails, as the compressed fasta is handed to nextalign which results in an empty results/aligned_example-data.fasta (no error raised). The pipeline eventually fails at a downstream filter step (which is slightly confusing).

We could add a check to _get_path_for_input which throws a nice error if the input looks like it's compressed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

e.g.

diff --git a/workflow/snakemake_rules/common.smk b/workflow/snakemake_rules/common.smk
index d4a09568..105d16b2 100644
--- a/workflow/snakemake_rules/common.smk
+++ b/workflow/snakemake_rules/common.smk
@@ -53,6 +53,10 @@ def _get_path_for_input(stage, origin_wildcard):
     if path_or_url=="" and stage in ["metadata", "sequences"]:
         raise Exception(f"ERROR: config->input->{origin_wildcard}->{stage} is not defined.")
 
+    ## Currently only metadata can be compressed!
+    if stage!="metadata" and (path_or_url.endswith(".xz") or path_or_url.endswith(".gz")):
+        raise Exception(f"Compressed sequence input is not yet supported. Please check {path_or_url}")
+
     if stage=="metadata":
         return f"data/downloaded_{origin_wildcard}.tsv" if remote else path_or_url
     if stage=="sequences":

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(And we should update my_profiles/example/builds.yaml to use data/example_sequences.fasta. The docs already instruct you to decompress the sequences file before running.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! It turns out that the example profile has been broken on master for a while now (since the proximity script started ignoring the reference strains), so we'll need to fix that in a separate PR.

In the meantime, I switched the starting sequences for the example profile back to the uncompressed file. This compressed vs uncompressed issue should be mostly handled by #605. I was hoping to merge that PR first, so this PR won't need to add any checks for compression state.

Snakefile Outdated Show resolved Hide resolved
Snakefile Outdated Show resolved Hide resolved
rule align:
message:
"""
Aligning sequences to {input.reference}
Copy link
Member

@jameshadfield jameshadfield May 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good choice to move to nextalign for everything. If nextalign is missing from the environment we get the error

/usr/local/bin/bash: line 1: nextalign: command not found

which is probably good enough. I considered adding something simple to the Snakefile like the following, but it may be more trouble than it's worth.

# Ensure `nextalign` is available, as we no longer support `mafft`
import shutil
if shutil.which("nextalign") is None:
    logger.error("ERROR: `nextalign` does not seem to be installed, which is needed for this pipeline.")
    sys.exit(1)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think we can't handle this much better than recommending users add the --use-conda to their Snakemake commands or use the CLI and Docker image.

Copy link
Member

@jameshadfield jameshadfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice one @huddlej! Apart from some minor fixes this looks good to go 👍

@huddlej huddlej force-pushed the drop-backward-compatibility branch from 6b32071 to 1e68f6d Compare May 4, 2021 23:21
huddlej added 5 commits May 5, 2021 11:14
Makes nextalign the default and only tool for alignment, removing mafft
rules and adding configuration details (genes) needed by nextalign to
the default parameters.
Removes the haplotype status rule and script that is no longer required.
Removes experimental and unused rule to calculate mutation frequencies
from multiple sequence alignments.
Replaces custom script with augur distance command that does the same
thing.
Removes deprecated sequence and metadata inputs from the configuration
file and removes Snakemake logic required to support these files. Also,
removes references to this deprecated input format from the example
profiles and the "multiple inputs" tutorial.

Since we no longer support this old input format, we also no longer need
to support empty origin wildcards. We drop support for empty origin
wildcard and remove all references to trimming of origin wildcards
that start with an underscore and update all rules to reference the origin
wildcard with the underscore in the filename.

We also now print helpful errors when inputs aren't defined properly
through checks for configurations with old-style input definitions or
without any inputs defined. These error messages provide recommendations
about how to update the workflow configuration to fix the issues.

Fixes #616
@huddlej huddlej force-pushed the drop-backward-compatibility branch from 1e68f6d to 074816e Compare May 5, 2021 18:36
@rneher
Copy link
Member

rneher commented May 7, 2021

one issue with dropping backwards compatibility is that users using previous versions can no longer pull in the updated exclude files etc...

@huddlej
Copy link
Contributor Author

huddlej commented May 7, 2021

one issue with dropping backwards compatibility is that users using previous versions can no longer pull in the updated exclude files etc...

That's a good point. We've been treating these workflows and the data they use as parts of a single package, but the workflow is software and the data are data. If we want to avoid this issue generally in the future, we should probably think about how to decouple the software from the data it consumes.

For example, we could continue to version the excludes and other curated files in this repo, but we could host the latest files for others to use on data.nextstrain.org. This approach would treat those data more like the files people already download from GISAID, but we could allow users to provide paths to data.nextstrain.org in their config files so they don't have to manually download anything. This solution could lead to default/parameters.yaml section like this (I don't know all of the files people would want, so I updated a couple of the obvious ones):

files:
  include: "https://data.nextstrain.org/ncov_include.txt"
  exclude: "https://data.nextstrain.org/ncov_exclude.txt"
  reference: "defaults/reference_seq.gb"
  alignment_reference: "defaults/reference_seq.fasta"
  annotation: "defaults/annotation.gff"
  outgroup: "defaults/outgroup.fasta" # is this arg still used? file doesn't exist in gh repo
  ordering: "defaults/color_ordering.tsv"
  color_schemes: "defaults/color_schemes.tsv"
  auspice_config: "defaults/auspice_config.json"
  lat_longs: "https://data.nextstrain.org/ncov_lat_longs.tsv"
  description: "defaults/description.md"
  clades: "defaults/clades.tsv"
  emerging_lineages: "defaults/emerging_lineages.tsv"

huddlej added 2 commits May 7, 2021 09:56
Without a Conda environment, the distances rule tries to use whatever version of
Augur is installed globally for the user which either leads to an error (when
Augur is not installed) or a potential conflict in node data JSON versions
during augur export v2 (when an older version of Augur than 12.0.0 is installed
globally).
@huddlej
Copy link
Contributor Author

huddlej commented May 7, 2021

I successfully ran through the Nextstrain profile on both AWS (through a trial build) and our SLURM cluster, so this is ready to merge. If the nextalign change ends up being too problematic for folks, it will be easy to revert that specific commit and try something else like keeping mafft but making nextalign the default.

@huddlej huddlej merged commit c7a707f into master May 7, 2021
@huddlej huddlej deleted the drop-backward-compatibility branch May 7, 2021 23:25
@jameshadfield jameshadfield mentioned this pull request May 10, 2021
23 tasks
@sidneymbell
Copy link
Collaborator

This is very 'neat', thanks for sharing @huddlej!

Looking at the changelog, I had a few questions (apologies if these are already documented elsewhere, happy to be redirected :)

Major changes

  • Drop support for old sequence/metadata inputs

What do you mean by "old...inputs"? I.e., what is the requirement and how is it different?

  • Use nextalign for alignment instead of mafft

Is this now the only alignment algo available?

Minor changes

  • Drop unused haplotype status rule and script
  • Remove unused nucleotide mutation frequencies rule
  • Use augur distance for mutation counts

What does this last bullet mean?

cc @ttung

@huddlej
Copy link
Contributor Author

huddlej commented May 12, 2021

Thank you for these clarifying questions, @sidneymbell; I was a little too terse in the change log! I've proposed some revisions to the change log that should answer these questions. I want to include a link to the new configuration parameters docs from the revised change log, but these new docs haven't been merged yet so I can't quite yet.

@sidneymbell
Copy link
Collaborator

sidneymbell commented May 14, 2021 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants