
Move filter after subsampling #814

Merged: 10 commits merged into master from move-filter on Jan 5, 2022
Conversation

@huddlej (Contributor) commented Dec 10, 2021

Description of proposed changes

Reorganizes the workflow to start subsampling as soon as possible without inspecting sequences. Specifically, this PR:

  • moves the "filter" rule after the mask rule in the post-subsampling section of the workflow
  • removes the "filtered" input and upload in favor of just the "aligned" files
  • moves the diagnostic rule after subsampling such that diagnostics only apply to the subsampled metadata (since diagnostics originally ran on each input "origin" but filtering occurs after merging of these inputs)
  • subsamples without sequences or the sequence index file to get only strain lists (speeding up subsampling substantially)
  • optionally extracts subsampled focal sequences with a new rule when those sequences are needed to calculate proximity values for priority-based subsampling
  • extracts all subsampled sequences without a sequence index (this index is not required when using the --exclude-all flag for augur filter; see the sketch after this list)
  • removes references to “filtered” from the workflow documentation and schema
  • maintains support for two input-specific filter parameters, min_length and skip_diagnostics, by building a "min length" filter query for each input (applied as a --query) and by adding a --skip-inputs argument to the diagnostics script to allow skipping inputs based on their named metadata columns
  • drops preprocess GitHub workflows
  • modifies the Nextstrain profiles to use aligned inputs for sequences even though they are not aligned (to skip right to subsampling)
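
To make the strain-list and --exclude-all bullets above concrete, here is a hedged sketch of the two kinds of augur filter calls involved; the file paths, query, and subsampling parameters are hypothetical, not the workflow's actual rules:

# Subsample from metadata alone, emitting only a strain list
# (no sequences or sequence index needed).
augur filter \
    --metadata results/sanitized_metadata.tsv.xz \
    --query "(country == 'Ghana')" \
    --subsample-max-sequences 1500 \
    --output-strains results/example-build/sample-focal.txt

# Extract all subsampled sequences in one pass from the strain lists,
# without a sequence index, via --exclude-all plus --include.
augur filter \
    --metadata results/sanitized_metadata.tsv.xz \
    --sequences results/aligned.fasta.xz \
    --exclude-all \
    --include results/example-build/sample-focal.txt results/example-build/sample-context.txt \
    --output-sequences results/example-build/subsampled_sequences.fasta.xz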

Breaking changes include:

  • drops support for input-specific filter parameters other than min_length and skip_diagnostics. The workflow prints an error to users when they try to run the filter step with any other input-specific parameters.

The upshot of these changes is that the most common use case (subsampling without priorities) can run much faster from prealigned sequences by avoiding construction of a sequence index, filtering the full sequence database, and making multiple passes through the full sequence database during subsampling. When subsampling with priorities, the workflow will build a sequence index, but it will only extract subsampled sequences for the focal data. This should still be much faster than the current implementation.

Additionally, users who don't use priority-based subsampling can skip alignment completely by passing their sequences to the workflow as aligned inputs.
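
For example, a minimal config sketch (the input name and paths are hypothetical):

inputs:
  - name: custom-data
    metadata: "data/metadata.tsv"
    # Passing unaligned sequences through the "aligned" key skips alignment
    # and goes straight to subsampling (only safe without priority-based
    # subsampling).
    aligned: "data/sequences.fasta"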

Here is an example workflow DAG from this PR with no priority subsampling:

[image: PR's DAG without priority subsampling]

Here is the corresponding DAG with priority subsampling:

[image: PR's DAG with priority subsampling]

Related issue(s)

Fixes #810
Fixes #807

Testing

Release checklist

If this pull request introduces backward incompatible changes, complete the following steps for a new release of the workflow:

  • Determine the version number for the new release by incrementing the most recent release: v10
  • Update docs/src/reference/change_log.md in this pull request to document these changes and the new version number.
  • After merging, create a new GitHub release with the new version number as the tag and release title.

@huddlej (Contributor, Author) commented Dec 15, 2021

I tested this workflow while running some builds for African countries by starting with the full unaligned Nextstrain GISAID sequences on S3 with a three-part subsampling scheme that includes focal sequences for a country, regional context from Africa, and global context from outside of Africa. By setting the inputs entry point to aligned, I could skip the alignment step and start with sanitizing metadata and subsampling. The figure below shows the run times for a single build for Ghana using Augur 13.1.0 (which includes some of @victorlin's recent group-by improvements but not the full priority logic rewrite).

[image: per-rule run times for the Ghana build]

The sanitize metadata script is obviously quite slow here, but this is a one-time cost that subsequent builds for other countries don't incur. The focal and regional subsampling take ~10 minutes and the global context takes ~25 minutes. These commands only output strain lists; sequences get extracted by a later augur filter command that runs in ~15 minutes.

Most importantly, the subsampling logic doesn't use priorities, so we can completely skip alignment of the full GISAID database. This workflow ran in ~3 hours instead of the ~15 hours it would have taken if we included alignment.

@victorlin (Member):

@huddlej great to see this run time visual for an end-to-end build! Curious, what did you use to generate this?

@huddlej (Contributor, Author) commented Dec 15, 2021

I used @tsibley's Snakemake run stats viewer.

@huddlej (Contributor, Author) commented Dec 15, 2021

@jameshadfield mentioned the idea that we could build on these changes so we don't need to fake that the inputs are aligned, but instead allow users to define a sequences input and let the workflow figure out whether alignment is needed based on whether subsampling requires priorities.

@huddlej (Contributor, Author) commented Dec 15, 2021

Here are stats for a subsequent Egypt build after sanitizing metadata has completed (Egypt has a similar number of genomes to Ghana):

[image: per-rule run times for the Egypt build]

This build took just under 2 hours.

@jameshadfield (Member) left a comment:

This is great @huddlej, thanks for tackling this. I've not run this, but have compared the DAGs and reviewed the code. (I realise this is still a draft PR so you've not covered everything).

I wrote responses to your in-line comments, but in this PR they're showing up as individual comments for some reason. I think there are a few additions that are needed or at least will help reduce papercuts:

Can no longer start from filtered inputs

The official nextstrain profile, some tests, and the "example_global_context" build start from filtered inputs. Additionally, the docs discuss filtered inputs in multiple places.

P.S. I'm going to make a PR on top of this one to remove the preprocess profile, so you can ignore those if you wish. Done, see #823.

Config warnings

The schema should remove filtered as a valid input, which will prevent those configs from running. Additionally, if a user defines a config setting filter → {origin}, we should exit and notify them that this is no longer possible.
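
A minimal sketch of such a check, with hypothetical names, written against the behavior this PR eventually settled on (min_length and skip_diagnostics stay supported; any other per-input setting errors out); it could run from the Snakefile before any rules execute:

SUPPORTED_INPUT_FILTERS = {"min_length", "skip_diagnostics"}

def check_input_filters(config):
    """Exit with a clear error for input-specific filter settings that are no longer supported."""
    filter_config = config.get("filter", {})
    for entry in config.get("inputs", []):
        origin = entry["name"]
        unsupported = set(filter_config.get(origin, {})) - SUPPORTED_INPUT_FILTERS
        if unsupported:
            raise Exception(
                f"Input-specific filter settings {sorted(unsupported)} for input "
                f"'{origin}' are no longer supported."
            )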

@huddlej (Contributor, Author) commented Dec 16, 2021

Thank you for looking through this, @jameshadfield! It was in the back of my mind to go back through all the configs, example builds, and docs to remove references to "filtered", but I should have written it out.

We just found out today that Daniel Bridges (from the Zambia PATH group) would like to use the input-specific filtering option of the ncov workflow that we talked about dropping yesterday. The ability to apply different filters per input was an important feature. There isn't a way in the current PR to support per-origin filtering by number of valid nucleotides (min_length) or to skip the diagnostics for specific inputs. The presence of the origin columns in the metadata would help with the second case. If we annotated the number of valid nucleotides in the metadata instead of the sequence index, users could use the --query interface to filter only for specific inputs. That feels pretty complicated for users, though. I'll think about this more...

@huddlej added the "source: office hours" label (issue mentioned during office hours) Dec 16, 2021
"""
augur filter \
--sequences {input.sequences} \
--metadata {input.metadata} \
@huddlej (Contributor, Author) commented on this code Dec 17, 2021:

We could support input-specific filters as long as the metadata includes columns like length or ACTG or valid_bases; then users could define a query that filters based on the metadata input column, like:

--query "(gisaid == 'yes' & length > 5000)"

If we update the alignment step above to use Nextclade instead of Nextalign, we could get the annotation of the number of gaps, etc. in a TSV to merge with metadata to allow the query above. Alternatively, it would be simpler to start with augur index output and another rule+script that merges the metadata with the index.

@jameshadfield points out that we could generate this query based on the existing filter config if users provide an input-specific filter and add that query to the existing augur filter. To do this we would loop through all "origins" defined in inputs and construct a query string with a component per origin that pulls from the config. For a config like the following:

inputs:
  - name: gisaid
    # ...snip...
  - name: aus
    # ...snip...

filter:
  min_length: 10000
  aus:
    min_length: 5000

the resulting query would be something like:

--query "(gisaid == 'yes' & length > 10000) | (aus == 'yes' & length > 5000)"

Because we generate this query from Snakemake, we can call the length field whatever we like and change its source in the future without breaking the workflow. We should consider calling this field a private name like "_length" or something that won't collide with existing metadata columns.
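
A sketch of that query construction (the helper name is hypothetical), assuming each origin column is "yes" for records from that input and falling back to the top-level min_length:

def _get_filter_query(config):
    """Build one augur filter --query clause per input from the filter config."""
    filter_config = config.get("filter", {})
    default_min_length = filter_config.get("min_length")
    clauses = []
    for entry in config.get("inputs", []):
        origin = entry["name"]
        min_length = filter_config.get(origin, {}).get("min_length", default_min_length)
        if min_length is not None:
            clauses.append(f"({origin} == 'yes' & _length >= {min_length})")
    return " | ".join(clauses)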

@huddlej (Contributor, Author) added:

Thinking about this more now, this solution only addresses the specific use case of filtering by length. We've technically allowed users to define any valid filter key for any origin and we've documented four specific options in the configuration guide and two of these in the multiple inputs example.

Instead of figuring out how to support each of these individually or collectively, we could revisit the use case. Here's an attempt at the user story we're trying to support:

James has 20 new likely Omicron genomes from his city that just came off the sequencer and he wants to know how they place in the phylogenetic context of other sequences from the city and the country. He configures a Nextstrain build with inputs including the full GISAID database and his local sequences and metadata. He wants a high-quality set of GISAID data for his phylogenetic context, but he wants to keep all local sequences in his analysis regardless of their quality. He sets stringent quality filters for GISAID data, applying a minimum genome length of 27,000 valid nucleotides and filtering out any genomes that do not pass all of Nextclade's quality control annotations.

From this user's perspective, we've got the current implementation a bit backwards; James wants to keep all of his inputs by default (with reasonable subsampling) and only wants to apply filters to specific datasets. We currently require the opposite approach where we apply more stringent filters by default and require users to manually relax those filters.

A project member replied:

I do think having a more stringent default is likely good - for users less clear on what's going on, reducing the odds that they take away big messages from poor-quality genomes is fair. However, it should also be as straightforward as possible to relax these filters if you are actively choosing to include worse-quality genomes (which is a very fair thing to do - it's just really important that you know you've done it!).

I think if most users are interested in QC-metric filtering, we could try to get this from Nextclade early on (as you suggest) and support quite a few different filtering options - but not "anything". Ideally we would also want user datasets to go through this, but not try to generate this for GISAID, preferring to wait until after subsampling (though those of us within Nextstrain can use the files we've already generated with this info).

I like the idea of putting this query together from the config as suggested above.

@huddlej (Contributor, Author) commented Dec 20, 2021

@jameshadfield Based on our conversations (above and offline), I've made the following changes to this PR:

  • Pass unmasked sequences to pangolin and augur ancestral
  • Remove references to “filtered” from the workflow documentation and schema
  • Build a "min length" filter query for each input
  • Support skipping diagnostics for each input
  • Add deprecation error for other input-specific filters

The most recent commits add two rules to the workflow: one creates a smaller sequence index after subsampling and another merges that index with the metadata to get "private" metadata fields with sequence content information. We could also decide to open this up a bit and make this annotated metadata more standard throughout the workflow, so it could be queried against more generically.
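
A sketch of that merge (paths and the derived field are illustrative), assuming augur index's per-strain base count columns:

import pandas as pd

metadata = pd.read_csv("results/sanitized_metadata.tsv.xz", sep="\t")
index = pd.read_csv("results/sequence_index.tsv", sep="\t")

# Count A+C+G+T as the valid nucleotides that the min length query filters on.
index["_length"] = index[["A", "C", "G", "T"]].sum(axis=1)

metadata.merge(
    index[["strain", "_length"]],
    on="strain",
    how="left",
).to_csv("results/metadata_with_index.tsv.xz", sep="\t", index=False)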

For a builds config like the following:

# Define inputs.
inputs:
  - name: gisaid
    # ...snip...
  - name: local
    # ...snip...

filter:
  # Apply length filters and diagnostics to all inputs by default.
  min_length: 27000

  # Keep most local inputs. 
  local:
    min_length: 5000
    skip_diagnostics: true

The filter step dynamically builds a query like this:

--query "(gisaid == 'yes' & _length >= 27000) | (local == 'yes' & _length >= 5000)"

The diagnostics script also runs with an additional argument of --skip-inputs local, which causes matching records to be dropped from the metadata prior to building a list of strains to exclude.
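
A sketch of that skip logic inside the diagnostics script (argument wiring abbreviated; assumes each input's origin column is "yes" for its records):

import argparse
import pandas as pd

parser = argparse.ArgumentParser()
parser.add_argument("--metadata", required=True)
parser.add_argument("--skip-inputs", nargs="+", default=[])
args = parser.parse_args()

metadata = pd.read_csv(args.metadata, sep="\t")

# Drop records from skipped inputs before any diagnostics run.
for origin in args.skip_inputs:
    if origin in metadata.columns:
        metadata = metadata.loc[metadata[origin] != "yes"]

# ...build the list of strains to exclude from the remaining records...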

For all other input-specific filters, the user gets an error message saying that these are not supported.

We'll need to test this workflow a bit more, after such major surgery. I will run some full Nextstrain builds on rhino and also some builds that use priorities, to get a better sense of where other bugs might exist. If you have other tests you'd like to run (e.g., with your own local data, etc.) that would be helpful, too.

@huddlej marked this pull request as ready for review December 20, 2021 23:47
@huddlej (Contributor, Author) commented Dec 21, 2021

I ran two full Nextstrain builds in parallel (Africa and Europe) on the cluster with this PR. The builds ran in ~3 hours. One of the longest steps was the sanitize metadata step, as shown below.

[image: per-rule run times for the parallel Africa and Europe builds]

In a separate PR, we should consider adding a boolean flag for each input to allow skipping the sanitize steps. Like:

inputs:
  - name: gisaid
    metadata: "s3://nextstrain-ncov-private/metadata.tsv.gz"
    aligned: "s3://nextstrain-ncov-private/sequences.fasta.xz"
    sanitize_metadata: false
    sanitize_sequences: false

Everyone else who uses GISAID-sourced data will still need these early steps, but the Nextstrain builds would run faster without them.

Next, I will run some similarly-sized builds with priorities enabled for subsampling.

huddlej and others added 9 commits December 22, 2021 13:33
Starts to reorganize the workflow such that we only need the sequence
index when subsampling with priorities and we only filter after the
subsampling, alignment, and mask steps.

Related to #810
These should be small enough that it isn't worth the extra
effort/complexity to compress.
Reverts inputs for pangolin annotations and augur ancestral to match the
current workflow, so these tools get unmasked sequences. This change
allows pangolin to do what it needs to internally without preemptively
throwing out information and keeps mutations from masked sites as
"decorations" on the tree even though these sites are otherwise ignored
during the analysis.
Maintains support for two input-specific filter settings: `min_length`
and `skip_diagnostics`. To support the minimum length filter, this
commit adds a sequence index rule to the workflow before filtering, a
rule and script that annotates the metadata with sequence index data,
and custom query parameter logic in Snakemake to build a filter query
for each input based on the minimum length given.

Additionally, this commit allows users to skip diagnostics for specific
inputs. When there is only one input, the top-level `skip_diagnostics`
parameter takes effect. When there are multiple inputs, the workflow
will look for input-specific requests to skip diagnostics. The
diagnostics script now optionally takes a list of input names whose
records should be skipped and filters these prior to other diagnostic
logic.
Alerts the user when input-specific filters are no longer supported, so
their workflows don't silently produce unexpected results.
Handle cases where `skip_diagnostics` is not defined for a build and
where only one input exists, so we can't filter metadata by the input
name.
@huddlej self-assigned this Dec 22, 2021
@huddlej (Contributor, Author) commented Dec 23, 2021

@jameshadfield The open builds we queued up yesterday ran without errors in just over 2 hours! 🎉

Edit: For comparison, the open build I just ran for the "allow numbers in build names" PR took a little over 4 hours.

@jameshadfield (Member) left a comment:

Looks great - I've looked through the code again and the successful open builds indicate that this works in practice.

The current actions look like this:

  • Ncov-ingest (GISAID/GenBank) -> trigger ncov preprocessing pipelines -> trigger ncov rebuild pipelines

And after this PR they look like:

  • Ncov-ingest (GISAID/GenBank) -> trigger ncov rebuild pipelines

Note that the companion PR ncov-ingest #255 is needed here, which stops ncov-ingest from trying to trigger the (non-existent) preprocessing pipeline.


@huddlej and I plan to merge this tomorrow (Wednesday Seattle time). The schedule for ncov-ingest runs is: [schedule omitted]

I propose we use the following steps to get this out:

  1. Cancel the scheduled ncov-ingest actions if they have started running (see above for start times)
  2. Merge this PR
  3. Merge the companion PR ncov-ingest #255
  4. Manually trigger the ncov-ingest actions “GISAID fetch and ingest” and “GenBank fetch and ingest” (if before 10h07 Seattle, we could let the cron-scheduled GISAID action run instead).
  5. Monitor the ncov-ingest actions (no changes here except the repository dispatch trigger). These take around 3-5 hours.
  6. Check that the ncov rebuild pipelines are correctly triggered from step 5.
  7. Check the ncov pipeline works as expected (c. 2 hours runtime) and updates the correct files on S3:
    1. (GenBank) Builds updated at s3://nextstrain-data/files/ncov/open/{global,africa,...}/*
    2. (GenBank) Datasets updated at s3://nextstrain-data/ncov_open_{global,africa,...}.json + sidecar files
    3. (GISAID) Builds updated at s3://nextstrain-private/{global,africa,...}/*
    4. (GISAID) Datasets updated at s3://nextstrain-data/ncov_gisaid_{global,africa,...}.json + sidecar files
  8. Remove the no-longer-used filtered sequences file from S3
    1. (GenBank) s3://nextstrain-data/files/ncov/open/filtered.fasta.xz
    2. (GISAID) s3://nextstrain-private/filtered.fasta.xz

docs/src/reference/configuration.md (resolved review thread)
@@ -71,7 +70,6 @@ This means that the full GenBank metadata and sequences are typically updated a
| Full GenBank data | metadata | https://data.nextstrain.org/files/ncov/open/metadata.tsv.gz |
| | sequences | https://data.nextstrain.org/files/ncov/open/sequences.fasta.xz |
| | aligned | https://data.nextstrain.org/files/ncov/open/aligned.fasta.xz |
| | filtered | https://data.nextstrain.org/files/ncov/open/filtered.fasta.xz |
A project member commented on the diff above:

Reminder to us to remove this file after the first successful run (and the corresponding GISAID file as well)

@huddlej merged commit cf79e41 into master Jan 5, 2022
@huddlej deleted the move-filter branch January 5, 2022 20:46
@danrlu (Contributor) commented Jan 14, 2022

Thank you for the changes! They all make a lot of sense and we're seeing a much faster run on our end as well. Our setup is similar in that we download and pre-align all data from GISAID at night and use the aligned fasta (which used to be the filtered fasta before the changes in this PR) plus user-provided data as the starting point for all our tree builds. All our tree builds have priority-based subsampling (builds.yaml and no changes to the main workflow), and the longest step now is the index after combining all inputs (with 6+ million GISAID sequences; great visualization tool btw).
[image: per-rule run times showing the long index step]
I'm looking into swapping augur index with seqtk, but want to check: why don't the runs above seem to have this long index step (was I not squinting enough XD)? Are we doing anything unnecessary? Thanks!

@huddlej (Contributor, Author) commented Jan 14, 2022

Hi @danrlu! The index step only happens when you need priority-based subsampling. You don't see the index step in any of the examples above because I avoided subsampling with priorities. The priority calculations use the number of Ns per sequence from the sequence index to penalize/deprioritize low quality sequences. As long as your --sequence-index input to the priorities script is a TSV with a valid N column, everything should work the same as if you used augur index.

Now that you mention it, I'm realizing that we don't need the full sequence index for anything else in the workflow. We currently require it as an input for the "extract subsampled sequences" rule (to get focal sequences to compare to the whole database), but it doesn't get used.

seqtk would be a much better choice for the current workflow's needs. Would you want to submit a PR to replace augur index with seqtk in the main workflow? Or do you want to try it out on your end first so we can copy your implementation here?

@danrlu (Contributor) commented Jan 14, 2022

Great tips John, thank you!! Let me swap in seqtk on our end and benchmark first, since I'm not sure how much time it saves in total once .xz decompression is included. Happy to send in a PR after that (it may break the dependencies for some ppl...). I'll report back by the end of next week. Thanks again!

@danrlu (Contributor) commented Jan 17, 2022

The testing run 2 posts above was with ncov feb2c02. To test seqtk, I replaced the code section of rule index_sequences: with:

shell:
    """
    unxz --stdout {input.sequences} \
        | seqtk comp - \
        | sed '1i strain\tlength\tA\tC\tG\tT\tmix2\tmix3\tN\tCpG\ttransversion\ttransition\tCpG_ts' \
        | xz -c -2 > {output.sequence_index} 2> {log}
    """

(Column headers are based on lh3/seqtk#47)

The good news is that with 6.4 million sequences, the index step went from 1h50min down to 27min:

[image: run times with seqtk replacing augur index]

The bad news is that seqtk does not distinguish gaps (-) from N (we would want to penalize low-quality sequences, but not sequences with real deletions...). In combined_sequence_index.tsv generated by augur index:

[image: combined_sequence_index.tsv from augur index]

With seqtk, the column that I thought would be N is actually the sum of N and - (and in fact anything outside of the IUPAC codes; see further below) 😭

[image: seqtk comp output]

One workaround is to count only capital letters with seqtk comp -u, which makes seqtk ignore -:

[image: seqtk comp -u output]

and that is still pretty fast (33min):

[image: run times with seqtk comp -u]

Comparing augur index and seqtk comp -u, the individual counts for A T C G N match completely for all our 6.4 million sequences, which also confirms that all fasta in our pipelines use only capital letters (I'm not sure whether GISAID formats them or ncov-ingest or ncov does). This may not be true for everyone and all use cases, so I'm not confident this is a solution for everyone, but it will work nicely for us.
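
For anyone who wants to check the same uppercase assumption on their own data, a quick shell sketch (the path is illustrative; assumes an xz-compressed FASTA):

xzcat results/aligned.fasta.xz \
    | grep -v '^>' \
    | grep -q '[a-z]' \
    && echo "lowercase bases present" \
    || echo "all uppercase"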


Side note: I also tried faCount from UCSC, and it also lumps other characters into the N column:
With test_N.fasta being:

>seq1
NNN---
>seq2
nnn---
>seq3
???ZZZ

The counts for N: [faCount output omitted; the gap and non-IUPAC characters above were all counted as N]

The only other tool I know that may also do counting is https://bioinf.shenwei.me/seqkit/usage/ but I'm not too familiar with it.

@huddlej (Contributor, Author) commented Jan 18, 2022

@danrlu Wow, thank you for the detailed analysis! Isn't this always the way with bioinformatics (or maybe just computers) that a conceptually simple task gets complicated quickly?

I had a quick look at seqkit stats (which seems to be the closest analog to seqtk comp). That command only outputs aggregate stats for all sequences in a given FASTA rather than per-sequence stats.

Since we already sanitize all sequences passed to the workflow, we could force all sequences to uppercase during that step and use the seqtk comp -u workaround. I'm worried about relying on a side effect of seqtk's design, though. That behavior probably won't change in the short term, so it could serve as a stop-gap solution.
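
A sketch of that uppercase pass (paths illustrative; assumes xz-compressed FASTA in and out):

xzcat sequences.fasta.xz \
    | awk '/^>/ {print; next} {print toupper($0)}' \
    | xz -c -2 > sequences.uppercase.fasta.xz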

Some other long-term solutions could include:

  • implement C-based counting in augur index (proposed by @tsibley when we first added augur index)
  • cache index information across runs, to avoid recalculating the same stats for sequences that haven't changed (similar to how we've started caching our sequence alignments)
  • use Nextclade for the full alignment in the ncov workflow and store its metadata annotations of sequence composition (something we've also discussed internally)

The Nextclade approach might actually be ideal in that it's already a parallelized C++ implementation and we already have an approach in ncov-ingest to cache alignments (and the corresponding metadata like the number of missing bases). We could easily modify the priority script to look for Ns in other columns like the totalMissing column provided by Nextclade's --output-tsv output. We could immediately benefit from Nextclade's speed by using it for the full alignment step and passing its metadata output to the priority script. Later, we could incorporate the caching of alignments (and corresponding output metadata) from ncov-ingest into the main ncov workflow to get an even greater speed-up.
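
A sketch of that adaptation (paths hypothetical; assumes Nextclade's TSV includes seqName and totalMissing columns):

import pandas as pd

nextclade = pd.read_csv("results/nextclade.tsv", sep="\t")

# Expose Nextclade's missing-base count under the "N" column name that the
# priorities script expects from a sequence index.
index = nextclade.rename(columns={"seqName": "strain", "totalMissing": "N"})
index[["strain", "N"]].to_csv("results/sequence_index.tsv", sep="\t", index=False)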

I'll continue this conversation with the rest of the team, but I'm curious about your thoughts on these proposals, too, and how they align with what your team has been thinking internally.

@danrlu (Contributor) commented Jan 19, 2022

Hahaha so true! It's never as straightforward as planned. With seqtk comp -u and uppercasing: if ppl have a lot of data, adding a step to uppercase the sequences may add more run time and potentially make the change from augur index to seqtk less time-saving (needs benchmarking); if they have a small dataset, there is no need for the change... so I'm not sure whether it's a worthwhile change for ncov. I'm totally in for a longer-term, more general, and more robust solution.

We do tree build in 2 steps:
Step 1: we have a nightly job to download public repo data, format/clean it with ncov-ingest, and process it as far as possible as an individual input with ncov. We currently pin to an ncov hash right before this PR, so we run alignment/filter fasta + sanitize metadata. (If I understand correctly, if we ever move to GenBank data, we can skip this step entirely and just use the open pre-processed data Nextstrain provides. Thanks for generating that dataset.)

Step 2: we run Nextstrain tree builds combining user input data and the pre-processed public data from the first job. This can be done many times for many users. So anything we can stash into the first job will save us time across all tree runs.

Per the potential solutions for the index, I think for small tree jobs, having fewer package/repo dependencies is nicer and cleaner, but for large datasets, speed matters more.

Before this PR merged, step 1 took 11hr and step 2 took 7-8hr, so we pretty much waited a day until new data got onto the tree.
After this PR merged, step 1 takes 8hr and step 2 takes 6-7hr, and with seqtk comp -u step 2 takes 4.5-5.5hr, so now we can see trees with new data in the morning. For our purposes, this is already a pretty good place to be. Thanks for the changes in this PR!

I was following this PR nextstrain/ncov-ingest#237 and thinking of doing something similar, but we currently don't have any Nextclade components in our workflow and I haven't looked into how Nextclade works. From what you were describing, it sounds like we could just use Nextclade for our first job instead of ncov, plus caching of the alignment?? And if the Nextclade route for the index gets implemented, that will take care of the index step and caching as well??? 🤩 🤩 🤩 I'll put in seqtk for us for now and learn more about this approach. Thank you very much!!

tsibley added a commit that referenced this pull request Apr 6, 2023
Use aligned sequences as the aligned sequences input, rather than pass
off unaligned sequences as the aligned sequences input.

This should be inconsequential to workflow behaviour or results, but it
makes the config a bit more straightforward and less confusing.

In a quick dig thru history, it seems like ncov-ingest's
aligned.fasta.xz was not _quite_ available when we first switched our
profiles to use an "aligned" input instead of a "sequences" input.  The
original use of "aligned" with unaligned sequences was driven by run
time concerns and related to the move of the filtering step after the
subsampling step and move of the "preprocess" steps from this workflow
(ncov) to ncov-ingest.¹

Resolves <#1054>.

¹ <#814>
  <#823>
joverlee521 added a commit that referenced this pull request Oct 23, 2024
This commit removes the rules `mutation_summary` and
`build_mutation_summary`, both of which called the removed script
`scripts/mutation_summary.py`.

I was looking into updating the script to work with Nextclade v3 outputs
but it seems like the rules using the script are not actively used in
the Snakemake workflow!

In Feb 2021, the script and rule `mutation_summary` were added, and the
only use of the output `results/mutation_summary_{origin}.tsv.xz`
was to upload it as an intermediate file of the builds.¹
In Jan 2022, the upload of the intermediate file was removed² and I
cannot find any other uses of the file, so the rule `mutation_summary`
seems unused.

In March 2021, the rule `build_mutation_summary` was added to support
the rule `add_mutation_counts`.³ However, `add_mutation_counts` was
later updated to use `augur distance`⁴ which does not require the
mutation summary output. I cannot find any other uses of the output file
`results/{build_name}/mutation_summary.tsv`, so the rule
`build_mutation_summary` seems unused.

¹ <#562>
² <#814>
³ <3b23e8f>
⁴ <d24d531>