Parameterize Nextclade dataset in config #1046

Merged: 4 commits into master from config-nextclade-dataset, Mar 16, 2023

@huddlej (Contributor) commented Feb 9, 2023

Description of proposed changes

Prototypes letting users choose their own Nextclade dataset for the workflow's final alignment with Nextclade. Adds minimal logic and config so that users can pull immune escape and ACE2 binding scores into their builds by switching away from the default dataset.

As with the Nextclade web UI, users who select the 21L dataset must understand that their trees, and the annotations on those trees, will not make sense biologically if they use sequences that don't descend from BA.2 (Nextstrain clade 21L). This PR doesn't implement any checks or enforce any filters to ensure that input data are reasonable, unlike the RBD levels annotations in this workflow or the ncov-ingest logic that pulls in the same phenotypic scores.
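Since the PR leaves that check to the user, one option is to pre-filter the metadata before running the build. The following is only a minimal sketch under stated assumptions: it assumes the metadata has a `Nextstrain_clade` column whose values start with the clade label (for example `21L (Omicron)`), and the function name `split_by_clade`, the example strain names, and the allowed-clade set are all hypothetical. The set of clades that actually descend from 21L has to be supplied by the user.

```python
import csv
import io


def split_by_clade(rows, allowed_clades, clade_column="Nextstrain_clade"):
    """Split metadata rows into (kept, dropped) based on the clade label.

    The label is taken as the first whitespace-separated token of the clade
    column, so "21L (Omicron)" is matched as "21L". Rows with a missing or
    empty clade are dropped.
    """
    kept, dropped = [], []
    for row in rows:
        tokens = (row.get(clade_column) or "").split()
        label = tokens[0] if tokens else ""
        (kept if label in allowed_clades else dropped).append(row)
    return kept, dropped


# Tiny illustrative metadata table (strain names are made up).
EXAMPLE_TSV = (
    "strain\tNextstrain_clade\n"
    "sample_A\t21L (Omicron)\n"
    "sample_B\t21J (Delta)\n"
    "sample_C\t\n"
)

rows = list(csv.DictReader(io.StringIO(EXAMPLE_TSV), delimiter="\t"))

# Illustrative only: extend this set with every clade you treat as a
# 21L descendant before using it on real data.
kept, dropped = split_by_clade(rows, allowed_clades={"21L"})
```

Reporting the `dropped` rows rather than silently discarding them makes it easier to notice when a clade is missing from the allowed set.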

Testing

Tested with this minimal build config based on the CI build:

inputs:
  - name: gisaid
    metadata: "s3://nextstrain-ncov-private/100k/metadata.tsv.xz"
    aligned: "s3://nextstrain-ncov-private/100k/sequences.fasta.xz"

nextclade_dataset: sars-cov-2-21L

builds:
  europe:
    subsampling_scheme: nextstrain_ci_sampling
    region: Europe

filter:
  skip_diagnostics: true

subsampling:
  # Custom subsampling logic for CI tests.
  nextstrain_ci_sampling:
    # Focal samples for region
    region:
      group_by: "division year month"
      max_sequences: 50
      exclude: "--exclude-where 'region!={region}'"
      min_date: --min-date 2022-01-01
    # Contextual samples for region from the rest of the world
    global:
      group_by: "year month"
      max_sequences: 50
      exclude: "--exclude-where 'region={region}'"
      min_date: --min-date 2022-01-01
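The `{region}` placeholders in the `exclude` settings above are filled in from the build's own attributes (here `region: Europe`). As a rough illustration of that substitution, and not the workflow's actual implementation, the sketch below uses plain `str.format`; the function name `expand_subsampling` is hypothetical.

```python
def expand_subsampling(scheme, build):
    """Fill `{placeholder}` fields in a subsampling scheme from build attributes.

    Only string settings are formatted; other values (such as the integer
    max_sequences) are copied unchanged.
    """
    expanded = {}
    for sample_name, settings in scheme.items():
        expanded[sample_name] = {
            key: value.format(**build) if isinstance(value, str) else value
            for key, value in settings.items()
        }
    return expanded


# The scheme and build from the test config, written as Python dicts.
nextstrain_ci_sampling = {
    "region": {
        "group_by": "division year month",
        "max_sequences": 50,
        "exclude": "--exclude-where 'region!={region}'",
        "min_date": "--min-date 2022-01-01",
    },
    "global": {
        "group_by": "year month",
        "max_sequences": 50,
        "exclude": "--exclude-where 'region={region}'",
        "min_date": "--min-date 2022-01-01",
    },
}
europe_build = {"subsampling_scheme": "nextstrain_ci_sampling", "region": "Europe"}

expanded = expand_subsampling(nextstrain_ci_sampling, europe_build)
```

With this build, the focal `region` sample excludes everything outside Europe and the `global` sample excludes Europe itself. Note that `str.format` would raise a `KeyError` for a placeholder the build does not define, so a real implementation would need to handle that case.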

Run the build like so:

snakemake --forceall -p -j 4 --use-conda --conda-frontend mamba --configfile immune_escape.yaml

An example tree with immune escape scores from this build looked like this:

[image: example tree colored by immune escape score]

And ACE2 binding:

[image: example tree colored by ACE2 binding score]

This tree is a good example of a case where the input sequences should have been limited to 21L descendants but were not fully filtered.

Release checklist

If this pull request introduces new features, complete the following steps:

  • Update docs/src/reference/change_log.md in this pull request to document these changes by the date they were added.

Moves hardcoded Nextclade dataset name into the top-level of the build
configuration, allowing users to choose which data to use for their
final alignments, QC scores, etc. As a side effect of this flexibility,
users can select a dataset like "sars-cov-2-21L" which provides
additional metadata annotations in the QC output that users would like
to include (immune escape and ACE2 binding scores).

Adds colorings to the default Auspice config JSON for immune escape and
ACE2 binding scores and updates the list of columns to merge into the
metadata from Nextclade's QC output. This change allows users to get
these scores in their builds by changing the `nextclade_dataset` option
in their build config to "sars-cov-2-21L", but it has no effect on the
behavior of the default dataset, "sars-cov-2".
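The two commit messages above describe a config lookup with a default and a merge of selected Nextclade QC columns into the metadata. The sketch below illustrates why the default dataset is unaffected; it is not the workflow's code. The default name "sars-cov-2" and the `nextclade_dataset` key come from the commit messages, while the function names, the `immune_escape` and `ace2_binding` column names, and the `strain`/`seqName` ID columns are assumptions made for illustration.

```python
DEFAULT_NEXTCLADE_DATASET = "sars-cov-2"  # default named in the commit message

# Assumed column names for the extra scores in the QC output.
EXTRA_QC_COLUMNS = ("immune_escape", "ace2_binding")


def resolve_nextclade_dataset(config):
    """Return the configured dataset name, falling back to the default."""
    return config.get("nextclade_dataset", DEFAULT_NEXTCLADE_DATASET)


def merge_qc_columns(metadata_rows, qc_rows, columns,
                     metadata_id="strain", qc_id="seqName"):
    """Copy the requested columns from QC rows into matching metadata rows.

    Columns that are absent from the QC output are skipped, so a dataset
    that does not emit the extra scores leaves the metadata unchanged.
    """
    qc_by_id = {row[qc_id]: row for row in qc_rows}
    merged = []
    for row in metadata_rows:
        new_row = dict(row)
        qc_row = qc_by_id.get(row[metadata_id], {})
        for column in columns:
            if column in qc_row:
                new_row[column] = qc_row[column]
        merged.append(new_row)
    return merged
```

For example, `resolve_nextclade_dataset({})` falls back to the default, while a config containing `nextclade_dataset: sars-cov-2-21L` selects the 21L dataset whose QC output carries the extra score columns.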

@huddlej marked this pull request as ready for review March 16, 2023 21:17

huddlej commented Mar 16, 2023

Since this is a minor backwards-compatible change that got general verbal approval at our recent Nextstrain meeting, I'm going to merge this without review.

@huddlej merged commit cfa73be into master Mar 16, 2023
@huddlej deleted the config-nextclade-dataset branch March 16, 2023 23:30

@corneliusroemer (Member) commented

@huddlej thanks a lot for providing a good debug template config and snakemake command - used this to great advantage in #1050! Ran in 5min including mamba installation!
