-
Notifications
You must be signed in to change notification settings - Fork 403
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Start from actually-aligned sequences #1057
Conversation
Use aligned sequences as the aligned sequences input, rather than pass off unaligned sequences as the aligned sequences input. This should be inconsequential to workflow behaviour or results, but it makes the config a bit more straightforward and less confusing. In a quick dig thru history, it seems like ncov-ingest's aligned.fasta.xz was not _quite_ available when we first switched our profiles to use an "aligned" input instead of a "sequences" input. The original use of "aligned" with unaligned sequences was driven by run time concerns and related to the move of the filtering step after the subsampling step and move of the "preprocess" steps from this workflow (ncov) to ncov-ingest.¹ Resolves <#1054>. ¹ <#814> <#823>
If I remember correctly, the history of this is a bit convoluted. We used to require all sequences to be aligned since we generated subsamples by comparing all sequences to a focal sample. The workflow as it stands could run using aligned or unaligned sequences, I think. But we have an alignment step in the workflow One might consider it a bit confusing that the aligned sequences are aligned again (and translated). There is also one caveat: The aligned sequences were aligned pairwise to the reference and insertions stripped. So some information got lost and translations in particular might be messed up if there are compensated frameshifts in some of the ORFs (I don't know of any examples for SC2). Using the original sequences avoids this and provides the |
I agree it's a bit convoluted. I wasn't involved in the history of it much, but I presume it got to be this way for decent reasons (e.g. to support different use cases). There's probably still room for improvement overall in the workflow graph, but that wasn't the intent of this PR. The goal here was to provide what we claim to provide in the config. I don't particularly care if we provide unaligned sequences as the Providing an ncov/workflow/snakemake_rules/main_workflow.smk Lines 469 to 509 in 0a093c0
|
yes, it skips the
this is very confusing. In the current workflow, if one pretends the unaligned input sequences are aligned, one aligns just once. otherwise, things are aligned twice. |
From memory: the Assuming proximity filtering is the only reason to align everything, perhaps modifying the workflow to skip Edit: I spent ~30 minutes trying to update the workflow to conditionally skip the initial align step. It's not as easy as I'd hoped. It would be cleaner to remove the proximity calculations entirely which would remove ~4 rules, a bunch of helper functions and multiple scripts; but that's only do-able if no-ones using this functionality anymore. |
@rneher Yes, that's what I said. @jameshadfield Thanks for providing that background. Seems to line up with my presumption above that "it got to be this way for decent reasons." ⁂ The tiny change in this PR is meant to be only a tiny clarification with zero impact on functionality. I'm going to merge it. The discussion of what steps the workflow should or shouldn't be doing (and how to do it) can continue elsewhere. |
Just as a note, I am using proximity calculations quite a lot, and I am not sure we can rule out that others don't use it. However, I'm also running a really old version of the |
Use aligned sequences as the aligned sequences input, rather than pass off unaligned sequences as the aligned sequences input.
This should be inconsequential to workflow behaviour or results, but it makes the config a bit more straightforward and less confusing.
In a quick dig thru history, it seems like ncov-ingest's aligned.fasta.xz was not quite available when we first switched our profiles to use an "aligned" input instead of a "sequences" input. The original use of "aligned" with unaligned sequences was driven by run time concerns and related to the move of the filtering step after the subsampling step and move of the "preprocess" steps from this workflow (ncov) to ncov-ingest.¹
Resolves #1054.
¹ #814
#823
Testing
I've been starting from
aligned.fasta.zst
when testing the 21L-rooted builds and all seems fine.Release checklist
If this pull request introduces backward incompatible changes, complete the following steps for a new release of the workflow:
docs/src/reference/change_log.md
in this pull request to document these changes and the new version number.If this pull request introduces new features, complete the following steps:
docs/src/reference/change_log.md
in this pull request to document these changes by the date they were added.