Start from actually-aligned sequences #1057

tsibley · 2023-04-06T22:34:24Z

Use aligned sequences as the aligned sequences input, rather than pass off unaligned sequences as the aligned sequences input.

This should be inconsequential to workflow behaviour or results, but it makes the config a bit more straightforward and less confusing.

In a quick dig thru history, it seems like ncov-ingest's aligned.fasta.xz was not quite available when we first switched our profiles to use an "aligned" input instead of a "sequences" input. The original use of "aligned" with unaligned sequences was driven by run time concerns and related to the move of the filtering step after the subsampling step and move of the "preprocess" steps from this workflow (ncov) to ncov-ingest.¹

Resolves #1054.

¹ #814
#823

Testing

I've been starting from aligned.fasta.zst when testing the 21L-rooted builds and all seems fine.

Release checklist

If this pull request introduces backward incompatible changes, complete the following steps for a new release of the workflow:

Determine the version number for the new release by incrementing the most recent release (e.g., "v2" from "v1").
Update docs/src/reference/change_log.md in this pull request to document these changes and the new version number.
After merging, create a new GitHub release with the new version number as the tag and release title.

If this pull request introduces new features, complete the following steps:

Update docs/src/reference/change_log.md in this pull request to document these changes by the date they were added.

Use aligned sequences as the aligned sequences input, rather than pass off unaligned sequences as the aligned sequences input. This should be inconsequential to workflow behaviour or results, but it makes the config a bit more straightforward and less confusing. In a quick dig thru history, it seems like ncov-ingest's aligned.fasta.xz was not _quite_ available when we first switched our profiles to use an "aligned" input instead of a "sequences" input. The original use of "aligned" with unaligned sequences was driven by run time concerns and related to the move of the filtering step after the subsampling step and move of the "preprocess" steps from this workflow (ncov) to ncov-ingest.¹ Resolves <#1054>. ¹ <#814> <#823>

rneher · 2023-04-07T12:52:24Z

If I remember correctly, the history of this is a bit convoluted. We used to require all sequences to be aligned since we generated subsamples by comparing all sequences to a focal sample. The workflow as it stands could run using aligned or unaligned sequences, I think. But we have an alignment step in the workflow

https://github.com/nextstrain/ncov/blob/trs/actually-aligned/workflow/snakemake_rules/main_workflow.smk#L60-L104

One might consider it a bit confusing that the aligned sequences are aligned again (and translated). There is also one caveat: The aligned sequences were aligned pairwise to the reference and insertions stripped. So some information got lost and translations in particular might be messed up if there are compensated frameshifts in some of the ORFs (I don't know of any examples for SC2). Using the original sequences avoids this and provides the insertions.csv file that could be used in downstream analysis.

tsibley · 2023-04-07T22:23:14Z

I agree it's a bit convoluted. I wasn't involved in the history of it much, but I presume it got to be this way for decent reasons (e.g. to support different use cases). There's probably still room for improvement overall in the workflow graph, but that wasn't the intent of this PR.

The goal here was to provide what we claim to provide in the config. I don't particularly care if we provide unaligned sequences as the sequences input key vs. aligned sequences as the aligned input key, but more that what we provide should match what we say it is. :-)

Providing an aligned input key does skip that align rule, so they're not aligned again there. However, no matter what's provided as initial inputs, all sequences are always aligned a second time by nextclade in the build_align rule, after subsampling and filtering for each build:

ncov/workflow/snakemake_rules/main_workflow.smk

Lines 469 to 509 in 0a093c0

    
           rule build_align: 
        
               message: 
        
                   """ 
        
                   Running nextclade QC and aligning sequences 
        
                       - gaps relative to reference are considered real 
        
                   """ 
        
               input: 
        
                   sequences = rules.combine_samples.output.sequences, 
        
                   nextclade_dataset = "data/sars-cov-2-nextclade-defaults.zip", 
        
               output: 
        
                   alignment = "results/{build_name}/aligned.fasta", 
        
                   insertions = "results/{build_name}/insertions.tsv", 
        
                   nextclade_qc = 'results/{build_name}/nextclade_qc.tsv', 
        
                   translations = expand("results/{{build_name}}/translations/aligned.gene.{gene}.fasta", gene=config.get('genes', ['S'])) 
        
               params: 
        
                   outdir = "results/{build_name}/translations", 
        
                   strain_prefixes=config["strip_strain_prefixes"], 
        
                   sanitize_log="logs/sanitize_sequences_before_nextclade_{build_name}.txt", 
        
                   output_translations = lambda w: f"results/{w.build_name}/translations/aligned.gene.{{gene}}.fasta" 
        
               log: 
        
                   "logs/align_{build_name}.txt" 
        
               benchmark: 
        
                   "benchmarks/align_{build_name}.txt" 
        
               conda: config["conda_environment"] 
        
               threads: 8 
        
               resources: 
        
                   mem_mb=3000 
        
               shell: 
        
                   """ 
        
                   python3 scripts/sanitize_sequences.py \ 
        
                       --sequences {input.sequences} \ 
        
                       --strip-prefixes {params.strain_prefixes:q} \ 
        
                       --output /dev/stdout 2> {params.sanitize_log} \ 
        
                       | nextclade run \ 
        
                       --jobs {threads} \ 
        
                       --input-dataset {input.nextclade_dataset} \ 
        
                       --output-tsv {output.nextclade_qc} \ 
        
                       --output-fasta {output.alignment} \ 
        
                       --output-translations {params.output_translations} \ 
        
                       --output-insertions {output.insertions} 2>&1 | tee {log} 
        
                   """

rneher · 2023-04-08T10:49:35Z

yes, it skips the align rule bit still does the build_align rule which takes input from combine_samples:

ncov/workflow/snakemake_rules/main_workflow.smk

Line 425 in 0a093c0

rule combine_samples:

this is very confusing. In the current workflow, if one pretends the unaligned input sequences are aligned, one aligns just once. otherwise, things are aligned twice.

jameshadfield · 2023-04-09T23:25:42Z

From memory: the build_align was introduced to improve alignments once QC & filtering had been done, as some erroneous sequences may have detrimentally affected the initial alignment (this may have been way back when we were using mafft). The reason we kept the initial align step around was to facilitate proximity filtering, which required an alignment; we no longer use this functionality and it's not computationally feasible for the entire dataset anymore.

Assuming proximity filtering is the only reason to align everything, perhaps modifying the workflow to skip align unless we have proximity filtering is the way to go. Or we could drop proximity support altogether.

Edit: I spent ~30 minutes trying to update the workflow to conditionally skip the initial align step. It's not as easy as I'd hoped. It would be cleaner to remove the proximity calculations entirely which would remove ~4 rules, a bunch of helper functions and multiple scripts; but that's only do-able if no-ones using this functionality anymore.

tsibley · 2023-04-10T17:48:54Z

@rneher Yes, that's what I said.

@jameshadfield Thanks for providing that background. Seems to line up with my presumption above that "it got to be this way for decent reasons."

⁂

The tiny change in this PR is meant to be only a tiny clarification with zero impact on functionality. I'm going to merge it. The discussion of what steps the workflow should or shouldn't be doing (and how to do it) can continue elsewhere.

emmahodcroft · 2023-04-11T08:58:46Z

Just as a note, I am using proximity calculations quite a lot, and I am not sure we can rule out that others don't use it. However, I'm also running a really old version of the ncov workflow generally, because I struggle to keep up with the updates. If we remove proximity rules all of my own current workflows will be forever diverged, though. (I do recognise that I'm 100% not the only user to be considered & not meaning to imply that - just adding a perspective.)

tsibley requested a review from a team April 6, 2023 22:34

tsibley merged commit acc1471 into master Apr 10, 2023

tsibley deleted the trs/actually-aligned branch April 10, 2023 17:49

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Start from actually-aligned sequences #1057

Start from actually-aligned sequences #1057

tsibley commented Apr 6, 2023

rneher commented Apr 7, 2023

tsibley commented Apr 7, 2023

rneher commented Apr 8, 2023

jameshadfield commented Apr 9, 2023 •

edited

Loading

tsibley commented Apr 10, 2023

emmahodcroft commented Apr 11, 2023

Start from actually-aligned sequences #1057

Start from actually-aligned sequences #1057

Conversation

tsibley commented Apr 6, 2023

Testing

Release checklist

rneher commented Apr 7, 2023

tsibley commented Apr 7, 2023

rneher commented Apr 8, 2023

jameshadfield commented Apr 9, 2023 • edited Loading

tsibley commented Apr 10, 2023

emmahodcroft commented Apr 11, 2023

jameshadfield commented Apr 9, 2023 •

edited

Loading