Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Start from actually-aligned sequences #1057

Merged
merged 1 commit into from
Apr 10, 2023
Merged

Start from actually-aligned sequences #1057

merged 1 commit into from
Apr 10, 2023

Conversation

tsibley
Copy link
Member

@tsibley tsibley commented Apr 6, 2023

Use aligned sequences as the aligned sequences input, rather than pass off unaligned sequences as the aligned sequences input.

This should be inconsequential to workflow behaviour or results, but it makes the config a bit more straightforward and less confusing.

In a quick dig thru history, it seems like ncov-ingest's aligned.fasta.xz was not quite available when we first switched our profiles to use an "aligned" input instead of a "sequences" input. The original use of "aligned" with unaligned sequences was driven by run time concerns and related to the move of the filtering step after the subsampling step and move of the "preprocess" steps from this workflow (ncov) to ncov-ingest.¹

Resolves #1054.

¹ #814
#823

Testing

I've been starting from aligned.fasta.zst when testing the 21L-rooted builds and all seems fine.

Release checklist

If this pull request introduces backward incompatible changes, complete the following steps for a new release of the workflow:

  • Determine the version number for the new release by incrementing the most recent release (e.g., "v2" from "v1").
  • Update docs/src/reference/change_log.md in this pull request to document these changes and the new version number.
  • After merging, create a new GitHub release with the new version number as the tag and release title.

If this pull request introduces new features, complete the following steps:

  • Update docs/src/reference/change_log.md in this pull request to document these changes by the date they were added.

Use aligned sequences as the aligned sequences input, rather than pass
off unaligned sequences as the aligned sequences input.

This should be inconsequential to workflow behaviour or results, but it
makes the config a bit more straightforward and less confusing.

In a quick dig thru history, it seems like ncov-ingest's
aligned.fasta.xz was not _quite_ available when we first switched our
profiles to use an "aligned" input instead of a "sequences" input.  The
original use of "aligned" with unaligned sequences was driven by run
time concerns and related to the move of the filtering step after the
subsampling step and move of the "preprocess" steps from this workflow
(ncov) to ncov-ingest.¹

Resolves <#1054>.

¹ <#814>
  <#823>
@tsibley tsibley requested a review from a team April 6, 2023 22:34
@rneher
Copy link
Member

rneher commented Apr 7, 2023

If I remember correctly, the history of this is a bit convoluted. We used to require all sequences to be aligned since we generated subsamples by comparing all sequences to a focal sample. The workflow as it stands could run using aligned or unaligned sequences, I think. But we have an alignment step in the workflow

https://github.com/nextstrain/ncov/blob/trs/actually-aligned/workflow/snakemake_rules/main_workflow.smk#L60-L104

One might consider it a bit confusing that the aligned sequences are aligned again (and translated). There is also one caveat: The aligned sequences were aligned pairwise to the reference and insertions stripped. So some information got lost and translations in particular might be messed up if there are compensated frameshifts in some of the ORFs (I don't know of any examples for SC2). Using the original sequences avoids this and provides the insertions.csv file that could be used in downstream analysis.

@tsibley
Copy link
Member Author

tsibley commented Apr 7, 2023

I agree it's a bit convoluted. I wasn't involved in the history of it much, but I presume it got to be this way for decent reasons (e.g. to support different use cases). There's probably still room for improvement overall in the workflow graph, but that wasn't the intent of this PR.

The goal here was to provide what we claim to provide in the config. I don't particularly care if we provide unaligned sequences as the sequences input key vs. aligned sequences as the aligned input key, but more that what we provide should match what we say it is. :-)

Providing an aligned input key does skip that align rule, so they're not aligned again there. However, no matter what's provided as initial inputs, all sequences are always aligned a second time by nextclade in the build_align rule, after subsampling and filtering for each build:

rule build_align:
message:
"""
Running nextclade QC and aligning sequences
- gaps relative to reference are considered real
"""
input:
sequences = rules.combine_samples.output.sequences,
nextclade_dataset = "data/sars-cov-2-nextclade-defaults.zip",
output:
alignment = "results/{build_name}/aligned.fasta",
insertions = "results/{build_name}/insertions.tsv",
nextclade_qc = 'results/{build_name}/nextclade_qc.tsv',
translations = expand("results/{{build_name}}/translations/aligned.gene.{gene}.fasta", gene=config.get('genes', ['S']))
params:
outdir = "results/{build_name}/translations",
strain_prefixes=config["strip_strain_prefixes"],
sanitize_log="logs/sanitize_sequences_before_nextclade_{build_name}.txt",
output_translations = lambda w: f"results/{w.build_name}/translations/aligned.gene.{{gene}}.fasta"
log:
"logs/align_{build_name}.txt"
benchmark:
"benchmarks/align_{build_name}.txt"
conda: config["conda_environment"]
threads: 8
resources:
mem_mb=3000
shell:
"""
python3 scripts/sanitize_sequences.py \
--sequences {input.sequences} \
--strip-prefixes {params.strain_prefixes:q} \
--output /dev/stdout 2> {params.sanitize_log} \
| nextclade run \
--jobs {threads} \
--input-dataset {input.nextclade_dataset} \
--output-tsv {output.nextclade_qc} \
--output-fasta {output.alignment} \
--output-translations {params.output_translations} \
--output-insertions {output.insertions} 2>&1 | tee {log}
"""

@rneher
Copy link
Member

rneher commented Apr 8, 2023

yes, it skips the align rule bit still does the build_align rule which takes input from combine_samples:

rule combine_samples:

this is very confusing. In the current workflow, if one pretends the unaligned input sequences are aligned, one aligns just once. otherwise, things are aligned twice.

@jameshadfield
Copy link
Member

jameshadfield commented Apr 9, 2023

From memory: the build_align was introduced to improve alignments once QC & filtering had been done, as some erroneous sequences may have detrimentally affected the initial alignment (this may have been way back when we were using mafft). The reason we kept the initial align step around was to facilitate proximity filtering, which required an alignment; we no longer use this functionality and it's not computationally feasible for the entire dataset anymore.

Assuming proximity filtering is the only reason to align everything, perhaps modifying the workflow to skip align unless we have proximity filtering is the way to go. Or we could drop proximity support altogether.

Edit: I spent ~30 minutes trying to update the workflow to conditionally skip the initial align step. It's not as easy as I'd hoped. It would be cleaner to remove the proximity calculations entirely which would remove ~4 rules, a bunch of helper functions and multiple scripts; but that's only do-able if no-ones using this functionality anymore.

@tsibley
Copy link
Member Author

tsibley commented Apr 10, 2023

@rneher Yes, that's what I said.

@jameshadfield Thanks for providing that background. Seems to line up with my presumption above that "it got to be this way for decent reasons."

The tiny change in this PR is meant to be only a tiny clarification with zero impact on functionality. I'm going to merge it. The discussion of what steps the workflow should or shouldn't be doing (and how to do it) can continue elsewhere.

@tsibley tsibley merged commit acc1471 into master Apr 10, 2023
@tsibley tsibley deleted the trs/actually-aligned branch April 10, 2023 17:49
@emmahodcroft
Copy link
Member

Just as a note, I am using proximity calculations quite a lot, and I am not sure we can rule out that others don't use it. However, I'm also running a really old version of the ncov workflow generally, because I struggle to keep up with the updates. If we remove proximity rules all of my own current workflows will be forever diverged, though. (I do recognise that I'm 100% not the only user to be considered & not meaning to imply that - just adding a perspective.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

Why use unaligned sequences as "aligned" input instead of the actually aligned sequences?
4 participants