Drop backward compatibility #615

huddlej · 2021-04-21T23:11:20Z

Description of proposed changes

This is a proof-of-concept PR showing components of the ncov workflow that I think could be removed or simplified.

Most of these changes rely on the idea of making a major backward incompatible release that removes much of our backward-compatibility that has survived at the expense of a more complicated workflow.

Each commit represents a single simplification step that could be cherry-picked or rebased onto master, if we choose.

Testing

Tested by CI
Tested on SLURM cluster with full Nextstrain profile
Tested with a trial build through GitHub Actions

Release checklist

If this pull request introduces backward incompatible changes, complete the following steps for a new release of the workflow:

Determine the version number for the new release by incrementing the most recent release: v5
Update docs/change_log.md in this pull request to document these changes and the new version number.
After merging, create a new GitHub release with the new version number as the tag and release title.

jameshadfield

Thanks for this @huddlej! Enforcing the new style of inputs allows the snakefile to be much nicer. Apart from the comment on trimmed_origin, everything looks good as is.

Given that this will be a "breaking change", I would favour adding a few more changes to this major version bump. (Equally some could go in the next one).

Remove usage of "exposure" - removes the need for a lot of "getter" functions, removes the rule incorporate_travel_history entirely, and generally makes auspice easier to understand. Edit: this might be quite a lot of work...
Add nice error messages if one supplies an old-style sequence/metadata input.

workflow/snakemake_rules/common.smk

huddlej · 2021-04-30T22:51:08Z

Thank you for checking this out, @jameshadfield! I'll work on cutting out the trim_origin bits.

Add nice error messages if one supplies an old-style sequence/metadata input.

This seems really important, since we're dropping support for that interface in this PR. We also need to provide nice errors when no inputs are defined.

Remove usage of "exposure"

If this will be a lot of work, we could handle it in a separate PR. Some of the proposals in the Google Doc for making things simpler won't be simple to implement themselves. :)

huddlej · 2021-05-03T17:05:22Z

Commit 6b32071 adds helpful error messages for users with the old input config or no input configs and also removes all references to the trimmed origin, underscores in origins, etc. from the Snakemake files and multiple inputs tutorial.

I think I caught every reference to the old config style and origin wildcards with underscores, but would you mind giving this another look, @jameshadfield, just to verify?

jameshadfield · 2021-05-04T03:51:34Z

my_profiles/example/builds.yaml

+inputs:
+  - name: example-data
+    metadata: data/example_metadata.tsv
+    sequences: data/example_sequences.fasta.gz


This currently fails, as the compressed fasta is handed to nextalign which results in an empty results/aligned_example-data.fasta (no error raised). The pipeline eventually fails at a downstream filter step (which is slightly confusing).

We could add a check to _get_path_for_input which throws a nice error if the input looks like it's compressed.

e.g.

diff --git a/workflow/snakemake_rules/common.smk b/workflow/snakemake_rules/common.smk index d4a09568..105d16b2 100644 --- a/workflow/snakemake_rules/common.smk +++ b/workflow/snakemake_rules/common.smk @@ -53,6 +53,10 @@ def _get_path_for_input(stage, origin_wildcard): if path_or_url=="" and stage in ["metadata", "sequences"]: raise Exception(f"ERROR: config->input->{origin_wildcard}->{stage} is not defined.") + ## Currently only metadata can be compressed! + if stage!="metadata" and (path_or_url.endswith(".xz") or path_or_url.endswith(".gz")): + raise Exception(f"Compressed sequence input is not yet supported. Please check {path_or_url}") + if stage=="metadata": return f"data/downloaded_{origin_wildcard}.tsv" if remote else path_or_url if stage=="sequences":

(And we should update my_profiles/example/builds.yaml to use data/example_sequences.fasta. The docs already instruct you to decompress the sequences file before running.)

Good catch! It turns out that the example profile has been broken on master for a while now (since the proximity script started ignoring the reference strains), so we'll need to fix that in a separate PR.

In the meantime, I switched the starting sequences for the example profile back to the uncompressed file. This compressed vs uncompressed issue should be mostly handled by #605. I was hoping to merge that PR first, so this PR won't need to add any checks for compression state.

Snakefile

jameshadfield · 2021-05-04T04:38:15Z

workflow/snakemake_rules/main_workflow.smk

+rule align:
+    message:
+        """
+        Aligning sequences to {input.reference}


Good choice to move to nextalign for everything. If nextalign is missing from the environment we get the error

/usr/local/bin/bash: line 1: nextalign: command not found

which is probably good enough. I considered adding something simple to the Snakefile like the following, but it may be more trouble than it's worth.

# Ensure `nextalign` is available, as we no longer support `mafft` import shutil if shutil.which("nextalign") is None: logger.error("ERROR: `nextalign` does not seem to be installed, which is needed for this pipeline.") sys.exit(1)

Yeah, I think we can't handle this much better than recommending users add the --use-conda to their Snakemake commands or use the CLI and Docker image.

jameshadfield

Nice one @huddlej! Apart from some minor fixes this looks good to go 👍

Makes nextalign the default and only tool for alignment, removing mafft rules and adding configuration details (genes) needed by nextalign to the default parameters.

Removes the haplotype status rule and script that is no longer required.

Removes experimental and unused rule to calculate mutation frequencies from multiple sequence alignments.

Replaces custom script with augur distance command that does the same thing.

Removes deprecated sequence and metadata inputs from the configuration file and removes Snakemake logic required to support these files. Also, removes references to this deprecated input format from the example profiles and the "multiple inputs" tutorial. Since we no longer support this old input format, we also no longer need to support empty origin wildcards. We drop support for empty origin wildcard and remove all references to trimming of origin wildcards that start with an underscore and update all rules to reference the origin wildcard with the underscore in the filename. We also now print helpful errors when inputs aren't defined properly through checks for configurations with old-style input definitions or without any inputs defined. These error messages provide recommendations about how to update the workflow configuration to fix the issues. Fixes #616

rneher · 2021-05-07T07:36:48Z

one issue with dropping backwards compatibility is that users using previous versions can no longer pull in the updated exclude files etc...

huddlej · 2021-05-07T16:42:03Z

one issue with dropping backwards compatibility is that users using previous versions can no longer pull in the updated exclude files etc...

That's a good point. We've been treating these workflows and the data they use as parts of a single package, but the workflow is software and the data are data. If we want to avoid this issue generally in the future, we should probably think about how to decouple the software from the data it consumes.

For example, we could continue to version the excludes and other curated files in this repo, but we could host the latest files for others to use on data.nextstrain.org. This approach would treat those data more like the files people already download from GISAID, but we could allow users to provide paths to data.nextstrain.org in their config files so they don't have to manually download anything. This solution could lead to default/parameters.yaml section like this (I don't know all of the files people would want, so I updated a couple of the obvious ones):

files:
  include: "https://data.nextstrain.org/ncov_include.txt"
  exclude: "https://data.nextstrain.org/ncov_exclude.txt"
  reference: "defaults/reference_seq.gb"
  alignment_reference: "defaults/reference_seq.fasta"
  annotation: "defaults/annotation.gff"
  outgroup: "defaults/outgroup.fasta" # is this arg still used? file doesn't exist in gh repo
  ordering: "defaults/color_ordering.tsv"
  color_schemes: "defaults/color_schemes.tsv"
  auspice_config: "defaults/auspice_config.json"
  lat_longs: "https://data.nextstrain.org/ncov_lat_longs.tsv"
  description: "defaults/description.md"
  clades: "defaults/clades.tsv"
  emerging_lineages: "defaults/emerging_lineages.tsv"

Without a Conda environment, the distances rule tries to use whatever version of Augur is installed globally for the user which either leads to an error (when Augur is not installed) or a potential conflict in node data JSON versions during augur export v2 (when an older version of Augur than 12.0.0 is installed globally).

huddlej · 2021-05-07T23:25:14Z

I successfully ran through the Nextstrain profile on both AWS (through a trial build) and our SLURM cluster, so this is ready to merge. If the nextalign change ends up being too problematic for folks, it will be easy to revert that specific commit and try something else like keeping mafft but making nextalign the default.

sidneymbell · 2021-05-10T21:14:15Z

This is very 'neat', thanks for sharing @huddlej!

Looking at the changelog, I had a few questions (apologies if these are already documented elsewhere, happy to be redirected :)

Major changes

Drop support for old sequence/metadata inputs

What do you mean by "old...inputs"? I.e., what is the requirement and how is it different?

Use nextalign for alignment instead of mafft

Is this now the only alignment algo available?

Minor changes

Drop unused haplotype status rule and script

Remove unused nucleotide mutation frequencies rule

Use augur distance for mutation counts

What does this last bullet mean?

cc @ttung

huddlej · 2021-05-12T19:03:15Z

Thank you for these clarifying questions, @sidneymbell; I was a little too terse in the change log! I've proposed some revisions to the change log that should answer these questions. I want to include a link to the new configuration parameters docs from the revised change log, but these new docs haven't been merged yet so I can't quite yet.

sidneymbell · 2021-05-14T00:19:16Z

Thank you!! Super helpful.

…

On Wed, May 12, 2021 at 12:03 PM John Huddleston ***@***.***> wrote: Thank you for these clarifying questions, @sidneymbell <https://github.com/sidneymbell>; I was a little too terse in the change log! I've proposed some revisions to the change log that should answer these questions <#634>. I want to include a link to the new configuration parameters docs <https://github.com/nextstrain/ncov/blob/add-config-docs/docs/configuration.md> from the revised change log, but these new docs haven't been merged yet so I can't quite yet. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#615 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ADAIYX3VXPERF5IT4FMAUETTNLGITANCNFSM43LJNAFQ> .

huddlej marked this pull request as ready for review April 27, 2021 04:41

jameshadfield reviewed Apr 30, 2021

View reviewed changes

workflow/snakemake_rules/common.smk Outdated Show resolved Hide resolved

huddlej mentioned this pull request Apr 30, 2021

Support GISAID downloads #605

Merged

8 tasks

huddlej force-pushed the drop-backward-compatibility branch from 80bb8d2 to 6b32071 Compare May 3, 2021 17:01

jameshadfield reviewed May 4, 2021

View reviewed changes

Snakefile Outdated Show resolved Hide resolved

jameshadfield reviewed May 4, 2021

View reviewed changes

Snakefile Outdated Show resolved Hide resolved

jameshadfield reviewed May 4, 2021

View reviewed changes

Snakefile Show resolved Hide resolved

jameshadfield reviewed May 4, 2021

View reviewed changes

jameshadfield requested changes May 4, 2021

View reviewed changes

huddlej force-pushed the drop-backward-compatibility branch from 6b32071 to 1e68f6d Compare May 4, 2021 23:21

huddlej added 5 commits May 5, 2021 11:14

Use nextalign for alignment instead of mafft

ff49561

Makes nextalign the default and only tool for alignment, removing mafft rules and adding configuration details (genes) needed by nextalign to the default parameters.

Drop unused haplotype status rule and script

383e1f6

Removes the haplotype status rule and script that is no longer required.

Remove unused nucleotide mutation frequencies rule

74c27f9

Removes experimental and unused rule to calculate mutation frequencies from multiple sequence alignments.

Use augur distance for mutation counts

aba6b1b

Replaces custom script with augur distance command that does the same thing.

huddlej force-pushed the drop-backward-compatibility branch from 1e68f6d to 074816e Compare May 5, 2021 18:36

huddlej added 2 commits May 7, 2021 09:56

Update change log to reflect new release

98d51a6

huddlej merged commit c7a707f into master May 7, 2021

huddlej deleted the drop-backward-compatibility branch May 7, 2021 23:25

jameshadfield mentioned this pull request May 10, 2021

Profile for Open Data builds #628

Merged

23 tasks

This was referenced May 10, 2021

Include distances and explicit aa muts by default #630

Merged

Clarify description of updates in v5 #634

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Drop backward compatibility #615

Drop backward compatibility #615

huddlej commented Apr 21, 2021 •

edited

Loading

jameshadfield left a comment •

edited

Loading

huddlej commented Apr 30, 2021

huddlej commented May 3, 2021

jameshadfield May 4, 2021

jameshadfield May 4, 2021

jameshadfield May 4, 2021

huddlej May 4, 2021

jameshadfield May 4, 2021 •

edited

Loading

huddlej May 4, 2021

jameshadfield left a comment

rneher commented May 7, 2021

huddlej commented May 7, 2021

huddlej commented May 7, 2021

sidneymbell commented May 10, 2021

Major changes

Minor changes

huddlej commented May 12, 2021

sidneymbell commented May 14, 2021 via email

Drop backward compatibility #615

Drop backward compatibility #615

Conversation

huddlej commented Apr 21, 2021 • edited Loading

Description of proposed changes

Testing

Release checklist

jameshadfield left a comment • edited Loading

Choose a reason for hiding this comment

huddlej commented Apr 30, 2021

huddlej commented May 3, 2021

jameshadfield May 4, 2021

Choose a reason for hiding this comment

jameshadfield May 4, 2021

Choose a reason for hiding this comment

jameshadfield May 4, 2021

Choose a reason for hiding this comment

huddlej May 4, 2021

Choose a reason for hiding this comment

jameshadfield May 4, 2021 • edited Loading

Choose a reason for hiding this comment

huddlej May 4, 2021

Choose a reason for hiding this comment

jameshadfield left a comment

Choose a reason for hiding this comment

rneher commented May 7, 2021

huddlej commented May 7, 2021

huddlej commented May 7, 2021

sidneymbell commented May 10, 2021

Major changes

Minor changes

huddlej commented May 12, 2021

sidneymbell commented May 14, 2021 via email

huddlej commented Apr 21, 2021 •

edited

Loading

jameshadfield left a comment •

edited

Loading

jameshadfield May 4, 2021 •

edited

Loading