Move filter after subsampling #814
Conversation
I tested this workflow while running some builds for African countries by starting with the full unaligned Nextstrain GISAID sequences on S3 with a three-part subsampling scheme that includes focal sequences for a country, regional context from Africa, and global context from outside of Africa. By setting the inputs entry point to `aligned`, the workflow skips straight to subsampling. The sanitize metadata script is obviously quite slow here, but this is a one-time cost that subsequent builds for other countries don't incur. The focal and regional subsampling take ~10 minutes and the global context takes ~25 minutes. These commands only output strain lists; sequences get extracted by a later augur filter command that runs in ~15 minutes. Most importantly, the subsampling logic doesn't use priorities, so we can completely skip alignment of the full GISAID database. This workflow ran in ~3 hours instead of the ~15 hours it would have taken if we included alignment.
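As a rough sketch, a three-part scheme like the one described above could look something like the following in a builds config (the scheme name, group-by fields, and sequence counts here are illustrative, not the actual values from these builds):

```yaml
subsampling:
  country-with-context:
    # Focal sequences for the country of interest.
    focal:
      query: --query "(country == '{country}')"
      max_sequences: 2000
    # Regional context from the rest of Africa.
    regional_context:
      query: --query "(region == 'Africa') & (country != '{country}')"
      max_sequences: 1000
    # Global context from outside of Africa.
    global_context:
      query: --query "(region != 'Africa')"
      max_sequences: 1000
```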
@huddlej great to see this run time visual for an end-to-end build! Curious, what did you use to generate this?

I used @tsibley's Snakemake run stats viewer.
@jameshadfield mentioned the idea that we could build on these changes such that we don't need to fake that the inputs are aligned, but instead allow users to define a…
This is great @huddlej, thanks for tackling this. I've not run this, but I have compared the DAGs and reviewed the code. (I realise this is still a draft PR, so you've not covered everything.)
I wrote responses to your in-line comments, but in this PR they're showing up as individual comments for some reason. I think there are a few additions that are needed, or at the least will help reduce papercuts:
Can no longer start from filtered inputs
The official nextstrain profile, some tests, and the "example_global_context" build start from filtered inputs. Additionally, the docs discuss filtered inputs in multiple places.
P.S. I'm going to make a PR on top of this one to remove the preprocess profile, so you can ignore those if you wish. Done, see #823.
Config warnings
The schema should remove `filtered` as a valid input, which will prevent those configs from running. Additionally, if a user defines a config setting `filter → {origin}`, then we should exit and notify them that this is no longer possible.
Thank you for looking through this, @jameshadfield! It was in the back of my mind to go back through all the configs, examples, builds, and docs to remove references to "filtered", but I should have written it out. We just found out today that Daniel Bridges (from the Zambia PATH group) would like to use the input-specific filtering option of the ncov workflow that we talked about dropping yesterday. The ability to apply different filters per input was an important feature. There isn't a way in the current PR to support per-origin filtering by number of valid nucleotides (min_length) or a way to skip the diagnostics for specific inputs. The presence of the origin columns in the metadata would help with the second case. If we annotated number of valid nucleotides in the metadata instead of the sequence index, users could use the existing `--query` support in augur filter to handle the first.
""" | ||
augur filter \ | ||
--sequences {input.sequences} \ | ||
--metadata {input.metadata} \ |
We could support input-specific filters as long as the metadata include columns like `length` or `ACTG` or `valid_bases`, and then users could define a query that filters based on a metadata input column like:

```
--query "(gisaid == 'yes' & length > 5000)"
```
If we update the alignment step above to use Nextclade instead of Nextalign, we could get the annotation of the number of gaps, etc. in a TSV to merge with metadata to allow the query above. Alternatively, it would be simpler to start with the `augur index` output and another rule+script that merges the metadata with the index.
@jameshadfield points out that we could generate this query based on the existing filter config if users provide an input-specific filter and add that query to the existing augur filter. To do this, we would loop through all "origins" defined in `inputs` and construct a query string with a component per origin that pulls from the config. For a config like the following:
```yaml
inputs:
  - name: gisaid
    # ...snip...
  - name: aus
    # ...snip...
filter:
  min_length: 10000
  aus:
    min_length: 5000
```
the resulting query would be something like:

```
--query "(gisaid == 'yes' & length > 10000) | (aus == 'yes' & length > 5000)"
```
Because we generate this query from Snakemake, we can call the length field whatever we like and change its source in the future without breaking the workflow. We should consider calling this field a private name like "_length" or something that won't collide with existing metadata columns.
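A minimal sketch of how the workflow could assemble such a query from the config (the function and the `_length` column follow the naming ideas above; this is illustrative, not the actual implementation):

```python
def build_filter_query(config):
    """Construct an augur filter --query string with one clause per origin,
    using the per-origin min_length when given, else the default."""
    default_min_length = config.get("filter", {}).get("min_length")
    clauses = []
    for origin in (entry["name"] for entry in config["inputs"]):
        # Per-origin settings override the top-level default.
        min_length = config.get("filter", {}).get(origin, {}).get("min_length", default_min_length)
        if min_length is not None:
            # "_length" is the proposed private metadata column.
            clauses.append(f"({origin} == 'yes' & _length > {min_length})")
    return " | ".join(clauses)
```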
Thinking about this more now, this solution only addresses the specific use case of filtering by length. We've technically allowed users to define any valid filter key for any origin and we've documented four specific options in the configuration guide and two of these in the multiple inputs example.
Instead of figuring out how to support each of these individually or collectively, we could revisit the use case. Here's an attempt at the user story we're trying to support:
James has 20 new likely Omicron genomes from his city that just came off the sequencer and he wants to know how they place in the phylogenetic context of other sequences from the city and the country. He configures a Nextstrain build with inputs including the full GISAID database and his local sequences and metadata. He wants a high-quality set of GISAID data for his phylogenetic context, but he wants to keep all local sequences in his analysis regardless of their quality. He sets stringent quality filters for GISAID data, applying a minimum genome length of 27,000 valid nucleotides and filtering out any genomes that do not pass all of Nextclade's quality control annotations.
From this user's perspective, we've got the current implementation a bit backwards; James wants to keep all of his inputs by default (with reasonable subsampling) and only wants to apply filters to specific datasets. We currently require the opposite approach where we apply more stringent filters by default and require users to manually relax those filters.
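For illustration, the length-based part of this user story could be expressed in the per-origin filter syntax discussed above (a sketch only; the Nextclade QC part of the story has no equivalent in this scheme):

```yaml
inputs:
  - name: gisaid
    # ...snip...
  - name: local
    # ...snip...
filter:
  gisaid:
    min_length: 27000
  # No filters defined for "local": James keeps all of his own sequences.
```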
I do think having a more stringent default is likely good - for users less clear on what's going on, I think reducing the odds they take away big messages from poor quality genomes is fair. However, it should also be as straightforward as possible to relax these filters if you are actively choosing to include worse-quality genomes (which is a very fair thing to do - it's just really important that you know you've done it!).
I think if most users are interested in QC metric filtering, we could try to get this from Nextclade early on (as you suggest) and support quite a few different filtering options - but not "anything". Ideally, we would also want user datasets to go through this, but not try to generate it for GISAID, preferring to wait until after subsampling (though those within Nextstrain can use the files we've generated already with this info).
I like the idea of putting this query together from the config as suggested above.
@jameshadfield Based on our conversations (above and offline), I've made the following changes to this PR:
The most recent commits add two rules to the workflow: one that creates a smaller sequence index after subsampling and one that merges that index with the metadata to get "private" metadata fields with sequence content information. We could also decide to open this up a bit and make this annotated metadata more standard throughout the workflow, so it could be queried against more generically. For a builds config like the following:
The filter step dynamically builds a query like this:
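The original comment's config and query examples did not survive here; as a sketch based on the per-origin `min_length` example earlier in this thread, the pair might look something like:

```yaml
inputs:
  - name: gisaid
    # ...snip...
  - name: aus
    # ...snip...
filter:
  min_length: 10000
  aus:
    min_length: 5000
    skip_diagnostics: true
```

with a generated query along the lines of:

```
--query "(gisaid == 'yes' & _length > 10000) | (aus == 'yes' & _length > 5000)"
```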
The diagnostics script also runs with an additional argument of `--skip-inputs`. For all other input-specific filters, the user gets an error message saying that these are not supported. We'll need to test this workflow a bit more after such major surgery. I will run some full Nextstrain builds on rhino and also some builds that use priorities, to get a better sense of where other bugs might exist. If you have other tests you'd like to run (e.g., with your own local data, etc.), that would be helpful, too.
I ran two full Nextstrain builds in parallel (Africa and Europe) on the cluster with this PR. The builds ran in ~3 hours. One of the longest steps was the sanitize metadata step, as shown below. In a separate PR, we should consider adding a boolean flag for each input to allow skipping the sanitize steps, like:

```yaml
inputs:
  - name: gisaid
    metadata: "s3://nextstrain-ncov-private/metadata.tsv.gz"
    aligned: "s3://nextstrain-ncov-private/sequences.fasta.xz"
    sanitize_metadata: false
    sanitize_sequences: false
```

Everyone else who uses GISAID-sourced data will still need these early steps, but the Nextstrain builds would run faster without them. Next, I will run some similarly-sized builds with priorities enabled for subsampling.
Starts to reorganize the workflow such that we only need the sequence index when subsampling with priorities and we only filter after the subsampling, alignment, and mask steps. Related to #810
These should be small enough that it isn't worth the extra effort/complexity to compress.
Reverts inputs for pangolin annotations and augur ancestral to match the current workflow, so these tools get unmasked sequences. This change allows pangolin to do what it needs to internally without preemptively throwing out information and keeps mutations from masked sites as "decorations" on the tree even though these sites are otherwise ignored during the analysis.
Maintains support for two input-specific filter settings: `min_length` and `skip_diagnostics`. To support the minimum length filter, this commit adds a sequence index rule to the workflow before filtering, a rule and script that annotates the metadata with sequence index data, and custom query parameter logic in Snakemake to build a filter query for each input based on the minimum length given. Additionally, this commit allows users to skip diagnostics for specific inputs. When there is only one input, the top-level `skip_diagnostics` parameter takes effect. When there are multiple inputs, the workflow will look for input-specific requests to skip diagnostics. The diagnostics script now optionally takes a list of input names whose records should be skipped and filters these prior to other diagnostic logic.
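A rough sketch of what the annotate-metadata-with-sequence-index script could do (file paths and the exact index column names here are assumptions, not the actual script):

```python
import pandas as pd

# Read sanitized metadata and the augur index output (both TSVs).
metadata = pd.read_csv("results/sanitized_metadata.tsv", sep="\t")
index = pd.read_csv("results/sequence_index.tsv", sep="\t")

# Prefix index columns other than the join key with "_" so these
# "private" fields cannot collide with existing metadata columns.
index = index.rename(columns={
    column: f"_{column}"
    for column in index.columns
    if column != "strain"
})

# Left join keeps metadata records even when a sequence is missing.
annotated = metadata.merge(index, on="strain", how="left")
annotated.to_csv("results/metadata_with_index.tsv", sep="\t", index=False)
```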
Alerts the user when input-specific filters are no longer supported, so their workflows don't silently produce unexpected results.
Handle cases where `skip_diagnostics` is not defined for a build and where only one input exists, so we can't filter metadata by the input name.
@jameshadfield The open builds we queued up yesterday ran without errors in just over 2 hours! 🎉 Edit: For comparison, the open build I just ran for the "allow numbers in build names" PR took a little over 4 hours.
Looks great - I've looked through the code again and the successful open builds indicate that this works in practice.
The current actions look like this:
- Ncov-ingest (GISAID/GenBank) -> trigger ncov preprocessing pipelines -> trigger ncov rebuild pipelines
And after this PR they look like:
- Ncov-ingest (GISAID/GenBank) -> trigger ncov rebuild pipelines
Note that the companion PR ncov-ingest #255 is needed here which stops ncov-ingest trying to trigger the (non-existent) preprocessing pipeline.
@huddlej and I plan to merge this tomorrow (Wednesday Seattle time). The schedule for ncov-ingest runs is:
- 14h07 UTC daily for GISAID which is 10h07 Seattle
- 12h07 UTC daily for GenBank which is 08h07 Seattle
I propose we use the following steps to get this out:
- Cancel the schedule ncov-ingest action, assuming they have started running (see above for start times)
- Merge this PR
- Merge the companion PR ncov-ingest #255
- Manually trigger the ncov-ingest actions "GISAID fetch and ingest" and "GenBank fetch and ingest" (if before 10h07 Seattle, then we could let the cron-scheduled GISAID action run instead).
- Monitor the ncov-ingest actions (no changes here except the repository dispatch trigger). These take around 3-5 hours.
- Check that the ncov rebuild pipelines are correctly triggered from step 5.
- Check the ncov pipeline works as expected (c. 2 hours runtime) and updates the correct files on S3:
  - (GenBank) Builds updated at s3://nextstrain-data/files/ncov/open/{global,africa,...}/*
  - (GenBank) Datasets updated at s3://nextstrain-data/ncov_open_{global,africa,...}.json + sidecar files
  - (GISAID) Builds updated at s3://nextstrain-private/{global,africa,...}/*
  - (GISAID) Datasets updated at s3://nextstrain-data/ncov_gisaid_{global,africa,...}.json + sidecar files
- Remove the no-longer-used filtered sequences files from S3:
  - (GenBank) s3://nextstrain-data/files/ncov/open/filtered.fasta.xz
  - (GISAID) s3://nextstrain-private/filtered.fasta.xz
```diff
@@ -71,7 +70,6 @@ This means that the full GenBank metadata and sequences are typically updated a
 | Full GenBank data | metadata  | https://data.nextstrain.org/files/ncov/open/metadata.tsv.gz    |
 |                   | sequences | https://data.nextstrain.org/files/ncov/open/sequences.fasta.xz |
 |                   | aligned   | https://data.nextstrain.org/files/ncov/open/aligned.fasta.xz   |
-|                   | filtered  | https://data.nextstrain.org/files/ncov/open/filtered.fasta.xz  |
```
Reminder to us to remove this file after the first successful run (and the corresponding GISAID file as well)
Thank you for the changes! They all make a lot of sense and we're seeing a much faster run on our end as well. Our setup is similar in that we download and pre-align all data from GISAID at night and use the aligned FASTA (which used to be the filtered FASTA before the changes in this PR) plus user-provided data as the starting point for all our tree builds. All our tree builds use priority-based subsampling (builds.yaml and no changes to the main workflow), and the longest step now is the index after combining all inputs (with 6+ million GISAID sequences). Great visualization tool, btw!
Hi @danrlu! The index step only happens when you need priority-based subsampling. You don't see the index step in any of the examples above because I avoided subsampling with priorities. The priority calculations use the number of Ns per sequence from the sequence index to penalize/deprioritize low quality sequences. As long as your…

Now that you mention it, I'm realizing that we don't need the full sequence index for anything else in the workflow. We currently require it as an input for the "extract subsampled sequences" rule (to get focal sequences to compare to the whole database), but it doesn't get used there. seqtk would be a much better choice for the current workflow's need. Would you want to submit a PR to replace it?
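For reference, extracting the subsampled sequences with seqtk could be as simple as the following (paths are illustrative):

```sh
# seqtk subseq pulls only the named records from the FASTA in a single
# pass, avoiding the full sequence index entirely.
seqtk subseq results/sequences.fasta results/subsampled_strains.txt > results/subsampled_sequences.fasta
```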
Great tips John, thank you!! Let me swap in `seqtk`.
The testing run 2 posts above was with…

(Column headers are based on lh3/seqtk#47.) The good news is that with 6.4 million sequences, the index step went from 1h50min down to 27min. The bad news is… One workaround is to count only capital letters with… Comparing… Side notes: I also tried…
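A sketch of the capital-letters workaround mentioned above, assuming `seqtk seq -U` for uppercasing (flags worth double-checking against your seqtk version):

```sh
# Uppercase all bases first so that the per-base counts from seqtk comp
# are not skewed by lowercase (soft-masked) characters.
xz -dc sequences.fasta.xz | seqtk seq -U /dev/stdin | seqtk comp /dev/stdin > sequence_counts.tsv
```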
The only other tool I know of that may also do counting is https://bioinf.shenwei.me/seqkit/usage/, but I'm not too familiar with it.
@danrlu Wow, thank you for the detailed analysis! Isn't this always the way with bioinformatics (or maybe just computers) that a conceptually simple task gets complicated quickly? I had a quick look at… Since we already sanitize all sequences passed to the workflow, we could force all sequences to uppercase during that step and use the seqtk comp counts as-is. Some other long-term solutions could include: …
The Nextclade approach might actually be ideal in that it's already a parallelized C++ implementation and we already have an approach in ncov-ingest to cache alignments (and the corresponding metadata like number of missing bases). We could easily modify the priority script to look for Ns in other columns like the… I'll continue this conversation with the rest of the team, but I'm curious about your thoughts on these proposals, too, and how they align with what your team has been thinking internally.
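As a sketch of the priority-script idea above, the script could read N counts from an annotated metadata column instead of the full sequence index (the column name `_N` is a placeholder, not an actual Nextclade or workflow field):

```python
import pandas as pd

metadata = pd.read_csv("metadata_with_qc.tsv", sep="\t")

# Fall back to zero Ns when the annotation is missing; "_N" is the
# hypothetical private column carrying the per-sequence N count.
n_per_sequence = metadata.set_index("strain")["_N"].fillna(0)

# Deprioritize sequences in proportion to their number of Ns, in the
# spirit of the existing index-based priority penalty.
priority_penalty = n_per_sequence / n_per_sequence.max()
```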
Hahaha, so true! It's never as straightforward as planned. With… We do tree builds in 2 steps. Step 1: we download and pre-align all public GISAID data (the nightly job mentioned above). Step 2: we run Nextstrain tree builds combining user input data and the pre-processed public data from the first job. This can be done many times for many users, so anything we can stash into the first job will save us time across all tree runs. Per the potential solutions for the index, I think for small tree jobs, having fewer package/repo dependencies is nicer and cleaner, but for large datasets, the speed matters more. Before this PR merge, step 1 took 11hr and step 2 took 7-8hr, so we pretty much wait a day until new data gets onto the tree. I was following this PR nextstrain/ncov-ingest#237 and thinking to do something similar, but we currently don't have any Nextclade components in our workflow and I have not looked into how Nextclade works. From what you were describing, it sounds like we could just use Nextclade for our first job instead of…
Use aligned sequences as the aligned sequences input, rather than pass off unaligned sequences as the aligned sequences input. This should be inconsequential to workflow behaviour or results, but it makes the config a bit more straightforward and less confusing. In a quick dig thru history, it seems like ncov-ingest's aligned.fasta.xz was not _quite_ available when we first switched our profiles to use an "aligned" input instead of a "sequences" input. The original use of "aligned" with unaligned sequences was driven by run time concerns and related to the move of the filtering step after the subsampling step and move of the "preprocess" steps from this workflow (ncov) to ncov-ingest.¹ Resolves <#1054>. ¹ <#814> <#823>
This commit removes the rules `mutation_summary` and `build_mutation_summary`, both of which called the removed script `scripts/mutation_summary.py`. I was looking into updating the script to work with Nextclade v3 outputs, but it seems like the rules using the script are not actively used in the Snakemake workflow! In Feb 2021, the script and rule `mutation_summary` were added, and the only use of the output `results/mutation_summary_{origin}.tsv.xz` was to upload it as an intermediate file of the builds.¹ In Jan 2022, the upload of the intermediate file was removed,² and I cannot find any other uses of the file, so the rule `mutation_summary` seems unused. In March 2021, the rule `build_mutation_summary` was added to support the rule `add_mutation_counts`.³ However, `add_mutation_counts` was later updated to use `augur distance`,⁴ which does not require the mutation summary output. I cannot find any other uses of the output file `results/{build_name}/mutation_summary.tsv`, so the rule `build_mutation_summary` seems unused.
¹ <#562>
² <#814>
³ <3b23e8f>
⁴ <d24d531>
Description of proposed changes
Reorganizes the workflow to start subsampling as soon as possible without inspecting sequences. Specifically, this PR:

- Moves filtering after subsampling (using the `--exclude-all` flag for augur filter)
- Maintains support for the input-specific filters `min_length` and `skip_diagnostics` by building a "min length" filter query for each input (applied as a `--query`) and adding a `--skip-inputs` argument to the diagnostics script to allow skipping inputs based on named metadata columns for each input
- Uses `aligned` inputs for sequences even though they are not aligned (to skip right to subsampling)

Breaking changes include:

- Dropped support for all input-specific filter parameters other than `min_length` and `skip_diagnostics`. The workflow prints an error to users when they try to run the filter step with any other input-specific parameters.

The upshot of these changes is that the most common use case (subsampling without priorities) can run much faster from prealigned sequences by avoiding construction of a sequence index, filtering of the full sequence database, and multiple passes through the full sequence database during subsampling. When subsampling with priorities, the workflow will build a sequence index, but it will only extract subsampled sequences for the focal data. This should still be much faster than the current implementation.
Additionally, users can skip alignment completely, if they don't use priority-based subsampling, by passing their sequences to the workflow as `aligned` inputs.

Here is an example workflow DAG from this PR with no priority subsampling:
Here is the corresponding DAG with priority subsampling:
Related issue(s)
Fixes #810
Fixes #807
Testing
Release checklist
If this pull request introduces backward incompatible changes, complete the following steps for a new release of the workflow:
- Update `docs/src/reference/change_log.md` in this pull request to document these changes and the new version number.