Parameterize Nextclade dataset in config #1046

huddlej · 2023-02-09T20:44:33Z

Description of proposed changes

Prototypes a way for users to choose their own Nextclade dataset for the final alignment with Nextclade. Adds minimal logic and config so that users can pull immune escape and ACE2 binding scores into their builds by switching away from the default dataset.

As with the Nextclade web UI, users must understand that their trees and annotations on those trees will not make sense biologically if they use sequences that don't descend from BA.2 (Nextstrain clade 21L). This PR doesn't implement any checks or force any filters to ensure that input data are reasonable, unlike the RBD levels annotations in this workflow or the ncov-ingest logic to pull in the same phenotypic scores.
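Because this PR leaves that check to the user, one option is to pre-filter the metadata to 21L descendants before running the workflow. The following is only a sketch and is not part of this PR: the function name `keep_21L_descendants`, the `Nextstrain_clade` column name, and the example clade set are assumptions for illustration, and the clade set is not exhaustive.

```python
import csv
import io

# Illustrative, non-exhaustive set of clades assumed to descend from 21L (BA.2).
# Replace it with the list that matches the dataset you actually use.
EXAMPLE_21L_CLADES = {"21L", "22A", "22B", "22C", "22D"}


def keep_21L_descendants(metadata_tsv, clade_column="Nextstrain_clade",
                         allowed_clades=EXAMPLE_21L_CLADES):
    """Return the metadata rows whose clade is in `allowed_clades`.

    Clade labels may carry a suffix such as "21L (Omicron)", so only the
    text before the first space is compared.
    """
    reader = csv.DictReader(io.StringIO(metadata_tsv), delimiter="\t")
    kept = []
    for row in reader:
        clade = (row.get(clade_column) or "").split(" ")[0]
        if clade in allowed_clades:
            kept.append(row)
    return kept
```

Rows with a missing or empty clade are dropped rather than kept, which errs on the side of excluding sequences that cannot be confirmed as 21L descendants.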

Testing

Tested with this minimal build config based on the CI build:

inputs:
  - name: gisaid
    metadata: "s3://nextstrain-ncov-private/100k/metadata.tsv.xz"
    aligned: "s3://nextstrain-ncov-private/100k/sequences.fasta.xz"

nextclade_dataset: sars-cov-2-21L

builds:
  europe:
    subsampling_scheme: nextstrain_ci_sampling
    region: Europe

filter:
  skip_diagnostics: true

subsampling:
  # Custom subsampling logic for CI tests.
  nextstrain_ci_sampling:
    # Focal samples for region
    region:
      group_by: "division year month"
      max_sequences: 50
      exclude: "--exclude-where 'region!={region}'"
      min_date: --min-date 2022-01-01
    # Contextual samples for region from the rest of the world
    global:
      group_by: "year month"
      max_sequences: 50
      exclude: "--exclude-where 'region={region}'"
      min_date: --min-date 2022-01-01
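The `{region}` placeholders in the subsampling rules above are filled in from the build's attributes (here, `region: Europe`). As a rough illustration, assuming plain Python string formatting (the workflow's actual substitution logic may differ, and `expand_subsampling_args` is a hypothetical helper):

```python
def expand_subsampling_args(template, build):
    """Fill placeholders such as {region} from a build's attributes."""
    return template.format(**build)


# Build attributes as they appear in the config above.
build = {"subsampling_scheme": "nextstrain_ci_sampling", "region": "Europe"}

# Focal samples exclude everything outside the region;
# contextual samples exclude the region itself.
focal = expand_subsampling_args("--exclude-where 'region!={region}'", build)
contextual = expand_subsampling_args("--exclude-where 'region={region}'", build)
```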

Run the build like so:

snakemake --forceall -p -j 4 --use-conda --conda-frontend mamba --configfile immune_escape.yaml

An example tree with immune escape scores from this build looked like this:

And ACE2 binding:

This tree is a good example of a case where the input sequences should have been limited to 21L descendants but were not fully filtered.

Release checklist

If this pull request introduces new features, complete the following steps:

Update docs/src/reference/change_log.md in this pull request to document these changes by the date they were added.

Moves the hardcoded Nextclade dataset name into the top level of the build configuration, allowing users to choose which dataset to use for their final alignments, QC scores, etc. As a side effect of this flexibility, users can select a dataset like "sars-cov-2-21L", which provides additional metadata annotations in the QC output that users would like to include (immune escape and ACE2 binding scores).

Adds colorings to the default Auspice config JSON for immune escape and ACE2 binding scores and updates the list of columns to merge into the metadata from Nextclade's QC output. This change allows users to get these scores in their builds by changing the `nextclade_dataset` option in their build config to "sars-cov-2-21L", but it has no effect on the behavior of the default dataset, "sars-cov-2".
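The behavior described in these two commits can be sketched roughly as follows. This is an illustration under stated assumptions, not the workflow's actual code: the helper names and the column names `immune_escape` and `ace2_binding` are hypothetical, while the `nextclade_dataset` option and the default value "sars-cov-2" come from the description above.

```python
DEFAULT_NEXTCLADE_DATASET = "sars-cov-2"

# Hypothetical names for the extra score columns in Nextclade's QC output.
PHENOTYPE_COLUMNS = ["immune_escape", "ace2_binding"]


def resolve_nextclade_dataset(config):
    """Use the top-level `nextclade_dataset` option, falling back to the default."""
    return config.get("nextclade_dataset", DEFAULT_NEXTCLADE_DATASET)


def merge_available_columns(metadata_row, nextclade_row, columns=PHENOTYPE_COLUMNS):
    """Copy the requested columns into the metadata only when they have values.

    With the default dataset the score columns are absent or empty, so the
    metadata row comes back unchanged.
    """
    merged = dict(metadata_row)
    for column in columns:
        value = nextclade_row.get(column, "")
        if value != "":
            merged[column] = value
    return merged
```

Skipping absent or empty columns is what keeps the default dataset's output unchanged while still letting "sars-cov-2-21L" builds pick up the scores.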

huddlej · 2023-03-16T23:30:20Z

Since this is a minor backwards-compatible change that got general verbal approval at our recent Nextstrain meeting, I'm going to merge this without review.

corneliusroemer · 2023-03-17T15:18:58Z

@huddlej thanks a lot for providing a good debug template config and snakemake command - used this to great advantage in #1050! Ran in 5min including mamba installation!

huddlej added 2 commits February 9, 2023 12:35

huddlej marked this pull request as ready for review March 16, 2023 21:17

huddlej added 2 commits March 16, 2023 15:52

Correct year for most recent feature

9fe7e65

Note new feature in changelog and reference docs

f80fc0c

huddlej merged commit cfa73be into master Mar 16, 2023

huddlej deleted the config-nextclade-dataset branch March 16, 2023 23:30
