Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Include 21L rooted Nextstrain build for GISAID data #1029

Merged
merged 7 commits into from
Apr 12, 2023
Merged

Conversation

trvrb
Copy link
Member

@trvrb trvrb commented Nov 2, 2022

Description of proposed changes

This PR includes a new set of Nextstrain builds for GISAID data that use 21L as reference rather than Wuhan. All viruses that outgroup relative to 21L in the tree are dropped via filter rule:

filter:
  min_date: "2022-01-01"
  exclude_where: "division='USA' \
    Nextstrain_clade='19A' \
    Nextstrain_clade='19B' \
    Nextstrain_clade='20A' \
    Nextstrain_clade='20B' \
    ...

I decided to split out a separate nextstrain_profiles/nextstrain-gisaid-21L rather than padding out nextstrain_profiles/nextstrain-gisaid with additional logic because the latter would have resulted in extensive code-duplication within the builds.yaml in specifying different behavior across specific build endpoints.

I've run this locally with swapping

inputs:
  - name: gisaid
    metadata: "s3://nextstrain-ncov-private/global/metadata.tsv.xz"
    aligned: "s3://nextstrain-ncov-private/global/sequences.fasta.xz"

to

inputs:
  - name: gisaid
    metadata: "s3://nextstrain-ncov-private/metadata.tsv.xz"
    aligned: "s3://nextstrain-ncov-private/sequences.fasta.xz"

and things work well and run quickly. I'm still testing using the full data on AWS however and will update this PR with links to trial runs.

That said, restricting to 21L makes a number of visuals much clearer. For example, here is the smaller local tree looking at S1 mutations:

Screen Shot 2022-11-02 at 4 11 27 PM

I'd plan to split things out so that available endpoints include:

  • nextstrain.org/ncov/gisaid/ancestral/global/6m
  • nextstrain.org/ncov/gisaid/21L/global/6m
  • nextstrain.org/ncov/gisaid/21L/asia/2m
    etc...

However, all this said, I don't think this is the right tact to solve the problem.

Issues include:

  1. In the workflow, the filter rule happens after the subsampling rules. This makes it so that if we ask for say 2560 in a sampling bucket, we'll lose >50% due to filtering out non-21L-descending clades.
    This could be solved by padding count targets to compensate, but this is hacky and the numbers will change as time goes on. Or the filter rule could be placed again before subsample, but we moved it afterwards for good reasons.

  2. I'd anticipate wanting a number of analyses (like https://github.com/blab/ncov-escape and other projects) that use 21L as reference. Having to deal with the entire dataset and having to filter out outgroup viruses will be repeatedly annoying.


Instead, I'd propose that we move this filter logic to https://github.com/nextstrain/ncov-ingest and to make the filtered metadata and sequences available for both GISAID data and open data at URLs like:

  • s3://nextstrain-ncov-private/21L/metadata.tsv.gz
  • s3://nextstrain-ncov-private/21L/sequences.fasta.xz
  • s3://nextstrain-data/files/ncov/open/21L/metadata.tsv.gz
  • s3://nextstrain-data/files/ncov/open/21L/sequences.fasta.xz

This will streamline and speed up running of 21L builds (augur filter runs repeatedly on a smaller metadata file) and help enable many future analyses.

Testing

As above, I've locally tested with

inputs:
  - name: gisaid
    metadata: "s3://nextstrain-ncov-private/metadata.tsv.xz"
    aligned: "s3://nextstrain-ncov-private/sequences.fasta.xz"

and trial build is running that will output to https://nextstrain.org/staging/ncov/gisaid/21L/trial/21L-reference/global/6m. I'll update when this completes.

Copy link
Member

@corneliusroemer corneliusroemer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've noticed that you use a 21L reference with deletions.

Will we keep using Wuhan coordinates? This should be the case as long as we translate using Wuhan - otherwise Spike and other positions will be messed up.

@trvrb
Copy link
Member Author

trvrb commented Nov 3, 2022

I think best to keep everything in Wuhan/1 coordinates. Latest BQ.1 preprints still use Wuhan/1 coordinates. Will be way too confusing otherwise. So perhaps this PR is better thought of as 21L "root" rather than 21L "reference".

@corneliusroemer
Copy link
Member

I totally agree, especially given that deletions commonly occur, like 69/70del and 144del etc it'd be a mess to change coordinate systems and not worth it.

@tsibley
Copy link
Member

tsibley commented Nov 22, 2022

Commenting mostly on the organizational/conceptual bits (and not the code changes here)…

I'd plan to split things out so that available endpoints include:

  • nextstrain.org/ncov/gisaid/ancestral/global/6m
  • nextstrain.org/ncov/gisaid/21L/global/6m
  • nextstrain.org/ncov/gisaid/21L/asia/2m
    etc...

Why not keep existing builds (with the ancestral Wuhan root) at nextstrain.org/ncov/gisaid/global/6m?

i.e. don't change what's already there, only add the nextstrain.org/ncov/gisaid/21L/… datasets.

Issues include:

  1. In the workflow, the filter rule happens after the subsampling rules. This makes it so that if we ask for say 2560 in a sampling bucket, we'll lose >50% due to filtering out non-21L-descending clades.
    This could be solved by padding count targets to compensate, but this is hacky and the numbers will change as time goes on. Or the filter rule could be placed again before subsample, but we moved it afterwards for good reasons.

Alternatively, a pre-subsampling filter rule could be added to the DAG for this profile instead of moving the existing filter rule around (which I agree is probably a non-starter). This would look like the build config defining an "aligned" input that's a local file (to be generated by the new prefilter rule), e.g.:

inputs:
  - name: gisaid
    metadata: "results/gisaid_21L_metadata.fasta.xz"
    aligned: "results/gisaid_21L_aligned.fasta.xz"

a new custom_rules file, e.g.:

custom_rules:
  - nextstrain_profiles/nextstrain-gisaid-21L/prefilter.smk

and in that new file writing the rule to do the filtering down to 21L, e.g. something that looks roughly like:

rule 21L_prefilter:
    input:
        metadata = path_or_file("s3://nextstrain-ncov-private/metadata.tsv.xz"),
        aligned = path_or_file("s3://nextstrain-ncov-private/aligned.fasta.xz")
    output:
        metadata = "results/gisaid_21L_metadata.tsv.xz",
        aligned = "results/gisaid_21L_aligned.fasta.xz"
    shell:
        …

(You also might be able to accomplish the same thing without any custom rules by chaining inputs together, though I'm not sure it's possible with existing config.)

I don't think this as a one-off approach is necessarily better than one-off provisioning of a separate set of files upstream in ncov-ingest, but I think if the prefiltering concept was generalized for this workflow (even if prefiltering was still limited only to clades or something) that might be better than producing/maintaining lots of separate static subsets.

  1. I'd anticipate wanting a number of analyses (like https://github.com/blab/ncov-escape and other projects) that use 21L as reference. Having to deal with the entire dataset and having to filter out outgroup viruses will be repeatedly annoying.

Is the biggest annoyance here the slowness of augur filter? Or that it's verbose/difficult to use it to filter to just 21L? (Both?)

Instead, I'd propose that we move this filter logic to https://github.com/nextstrain/ncov-ingest and to make the filtered metadata and sequences available for both GISAID data and open data at URLs like:

  • s3://nextstrain-ncov-private/21L/metadata.tsv.gz
  • s3://nextstrain-ncov-private/21L/sequences.fasta.xz
  • s3://nextstrain-data/files/ncov/open/21L/metadata.tsv.gz
  • s3://nextstrain-data/files/ncov/open/21L/sequences.fasta.xz

I think it's reasonable to provision separate subsets for specific cases like this, though I do wonder if the alternative I describe above might be better in the long run.

Re: these specific proposed paths: they do perpetuate the comingling of workflow-associated files and dataset-associated files that ncov-ingest and ncov already do, and in that sense it's fine and maybe even preferred, but it might be nice to consider putting new workflow-associated files (which these are) in the new locations we decided upon, e.g.:

s3://nextstrain-data-private/files/workflows/ncov/21L/metadata.tsv.gz
s3://nextstrain-data-private/files/workflows/ncov/21L/sequences.fasta.xz
s3://nextstrain-data/files/workflows/ncov/open/21L/metadata.tsv.gz
s3://nextstrain-data/files/workflows/ncov/open/21L/sequences.fasta.xz

Since we're not planning to remove the existing comingled files for ncov/ncov-ingest, avoiding adding more might help avoid further confusion of what's what.

@tsibley tsibley changed the title Include 21L reference Nextstrain build Include 21L rooted Nextstrain build Nov 22, 2022
Comment on lines 761 to 766
Adjusting metadata for build '{wildcards.build_name}'
"""
input:
metadata="results/{build_name}/metadata_with_index.tsv",
metadata="results/{build_name}/metadata_filtered.tsv.xz",
output:
metadata = "results/{build_name}/metadata_adjusted.tsv.xz"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This changes the metadata that's input to several downstream rules. Rules which take metadata_adjusted.tsv.xz as input currently get the subsampled but unfiltered metadata and with this change would get the subsampled and filtered metadata.

Is this change a bug fix? It seems like it. But it does have potential visible changes like rule colors now will produce a scale based on the filtered rather than unfiltered metadata values.

I'd appreciate someone else's 👀 on this bit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/cc @nextstrain/core

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Split out separately in d8c17ec.

@tsibley
Copy link
Member

tsibley commented Mar 30, 2023

Rebased this onto the latest master to bring in changes. This also highlights a few things that need updating since the copy, which I'll mark separately with comments here so I can track them.

@tsibley tsibley force-pushed the 21L-reference branch 2 times, most recently from 72992da to bb384a7 Compare April 3, 2023 20:54
This is intended as a bug fix to more closely align metadata with
sequences in rules downstream of subsample + filter.  Rules which take
metadata_adjusted.tsv.xz as input used to get the subsampled _but not_
filtered metadata.  With this change, they now get the subsampled _and_
filtered metadata.

Many rules will not be affected as they join sequences → metadata and
ignore extra metadata rows.  However, there are some potential changes
to results as some rules do consult all of the metadata without joining
it to sequences.  For example, the "colors" rule will now produce a
scale based on the filtered rather than unfiltered metadata values,
which could be noticeably different sets.

Co-authored-by: Trevor Bedford <trevor@bedford.io>
All circulating viruses now descend from clade 21L. By rooting the tree
on 21L reference virus we can better emphasize recent evolution. This
includes tree layout, but also things like S1 mutations, where these
were basically all just red rather than emphasizing differences among
circulating viruses, or clades, where all the previous clades were
making 21L through 22E all similar colors of red.

This copies the existing Nextstrain GISAID build and adds some simple
logic to filter out clades previous to 21L. This results in some code
duplication between the ancestral GISAID build and the 21L GISAID build,
but overall much simpler code than if a single build.yml covered both
ancestral and 21L root cases.

Co-authored-by: Thomas Sibley <tsibley@fredhutch.org>
tsibley added a commit that referenced this pull request Apr 5, 2023
Filter rules in the config are applied _after_ subsampling, which poses
issues with reliably getting the desired number of sequences.  As @trvrb
wrote¹:

> In the workflow, the filter rule happens after the subsampling rules.
> This makes it so that if we ask for say 2560 in a sampling bucket, we'll
> lose >50% due to filtering out non-21L-descending clades.
>
> This could be solved by padding count targets to compensate, but this is
> hacky and the numbers will change as time goes on. Or the filter rule
> could be placed again before subsample, but we moved it afterwards for
> good reasons.

A few custom rules for the builds allows us to prefilter the full
dataset before subsampling.  Currently these rules are specific to our
GISAID data source, but they could be easily expanded to our Open data
sources too.

In the future we might also provide clade-partitioned subsets from
ncov-ingest², which we could use here instead with some adaptation of
the build config.

¹ <#1029 (comment)>
² e.g. <nextstrain/ncov-ingest#398>
@tsibley tsibley marked this pull request as ready for review April 5, 2023 20:35
@tsibley tsibley requested a review from a team April 5, 2023 20:35
@tsibley
Copy link
Member

tsibley commented Apr 5, 2023

Per discussion last week, I reworked this branch to prefilter the data before subsampling and to avoid memory-inefficient steps. See the commit messages for details. This is ready for review again by @nextstrain/core (and maybe particularly @trvrb).

There's also a trial build running now.

@tsibley
Copy link
Member

tsibley commented Apr 5, 2023

Looking at the global/6m build locally, it seems ok.

image

Oddly, I noticed there's a sequence that made it thru classified as 19A:

image

This sequence is in our metadata as 22B, so passes the pre-filters, and is from 2022-12, so passes the date filter, but must be getting reclassified to 19A (erroneously?) by the workflow. I haven't dug into it more, but perhaps we want to both pre-filter and (post-)filter on the clades to catch these reclassifications?

@emmahodcroft
Copy link
Member

emmahodcroft commented Apr 6, 2023

Yes - in fact, I bet it's not even the workflow proper where the reclassification is happening - it's just at the end when the clades are being 'put' on the tree (so that the labels show up and everything gets a nice colour). This overrides whatever classifications we've gotten from Nextclade in the display. But these are essentially only assigned at one node - then everything downstream from that is classified as "being that node" (unless another node downstream also has a label, which then overrides the first).

If I'm remembering correctly, node labels attach to the first node available that matches their specified mutations. 19A is a bit 'weak' in the respect that it's only defined by 2 mutations -- of course, we didn't have many muts at the start!! So this may well literally be all that defines this clade.
However, in a normal build, 19A split is near the root, so this node gets assigned early and won't end up lower down the tree (even though these mutations may happen elsewhere). However, in this build, there is no 'real' 19A node, so it's just wandering down the tree, and if it happens across a node with the two mutations it needs, it attaches there...

Naively I would have said that node assignments happen in order specified in the TSV, with nodes coming later in the list overriding any assigned earlier (after all, after the first clade is assigned 1/2 the tree may be 'part of that node' until you do your next assignment within that) - so I'm a little surprised this happened, but I may be completely misremembering how this works.

If useful for others, here's the tree: https://nextstrain.org/staging/ncov/gisaid/21L/trial/21L-reference/global/6m?label=clade:22E%20%28Omicron%29&showBranchLabels=all

@emmahodcroft
Copy link
Member

emmahodcroft commented Apr 6, 2023

As for solutions:

I don't think a pre-filter will help, as this node really is 22B, it's just that it also has these two 19A mutations.

The most straightforward (on the surface) solution would be to use a separate clades_nextstrain.tsv for these 21L builds - but already have a lot of these and we have to update each separately every time we add a new clade - so that's not ideal if we can find another way...

@tsibley
Copy link
Member

tsibley commented Apr 6, 2023

However, in this build, there is no 'real' 19A node, so it's just wandering down the tree, and if it happens across a node with the two mutations it needs, it attaches there...

Ahh, this makes sense. Thanks for the explanation!

The most straightforward (on the surface) solution would be to use a separate clades_nextstrain.tsv for these 21L builds

Hmm. I'd think we could programmatically filter the clades though to remove the ones prior to 21L… I'll try it.

@tsibley
Copy link
Member

tsibley commented Apr 6, 2023

Oh, hrm, but I can't just remove the excluded clades because they're defined hierarchically. So the filtering/transformation is a bit more complex.

@tsibley
Copy link
Member

tsibley commented Apr 6, 2023

I've pushed 59a672c to try to address the clade labeling issue. Testing a global/6m build first, then will trigger a full set of trial builds.

@emmahodcroft
Copy link
Member

Very nice little solution @tsibley ! I like the idea and the implementation - makes sense to me, and avoid us having another clades file to update manually!

@tsibley
Copy link
Member

tsibley commented Apr 6, 2023

Well, that didn't work as intended:

Screenshot from 2023-04-06 13-59-30

No clades got labeled…

This avoids incorrect labeling of the trees by `augur clades` if a
sequence happens to have mutations that define one of the excluded
clades.  This happens because there aren't earlier sequences for those
clades to be labeled due to their exclusion.  For example, we've seen
some spurious labelings of 19A because we exclude such sequences from
these builds.  Thanks to @emmahodcroft for debugging and explaining
this!¹

Since we define our clades hierarchically with inheritance, a small new
program is needed to fully expand those definitions so we can filter
them without worrying about filtering out their "parent" clades.

¹ <#1029 (comment)>
@tsibley
Copy link
Member

tsibley commented Apr 6, 2023

Off-by-one error on my part because I saw but then didn't recall this bit about augur.clades.read_in_clade_definitions():

Sites are 1 indexed in the file, and are converted to 0 indexed in the output

@tsibley
Copy link
Member

tsibley commented Apr 7, 2023

Ok, that's better:

Screenshot from 2023-04-06 17-01-36

@tsibley
Copy link
Member

tsibley commented Apr 7, 2023

I've started another full set of trial builds (build id 93366c93-b2df-4db1-9f84-f4778bb839a5), which will appear at URLs like https://nextstrain.org/staging/ncov/gisaid/21L/trial/21L-reference/global/6m if things go well.

@emmahodcroft
Copy link
Member

Looks like the test builds completed, and checking a few of them, they look really good now! Thanks for tackling this Tom, I think it'll be really useful!

@rneher
Copy link
Member

rneher commented Apr 7, 2023

LGTM, but just a heads up that the exclude_clades depends on the values of Nextstrain_clade in the metadata. We are currently moving the names displayed on the tree from 23A (Omicron) to smth like 23A (XBB.1.5) #1050 and the primary output of Nextclade is just 23A. We could add the latter the metadata and use that.

But for the time being this should work, just highlighting possible future divergence and breakage.

@tsibley
Copy link
Member

tsibley commented Apr 7, 2023

@rneher Yep, thanks for noting that. I think the sort of thing done by this PR is a great example of why we want to do nextstrain/ncov-ingest#399. :-) I'd much rather write and maintain code to filter on 23A vs. 23A (Omicron) or 23A (XBB.1.5). I don't think that's an option for this PR until that other work is complete, though.

tsibley added 2 commits April 12, 2023 11:20
This should have been removed in "Remove preprocessing pipelines"
(9abad2a).
This is the event triggered by ncov-ingest when there's new GISAID data,
which also triggers the standard ancestral-rooted builds.

The specific "rebuild-gisaid-21L" event for this workflow seems
unnecessary since it would only be triggered manually, and in that case
I expect folks to use the GitHub Actions web UI instead.
@tsibley
Copy link
Member

tsibley commented Apr 12, 2023

Pushed a couple additional small commits, one of which ensures these 21L-rooted builds are triggered by new data, and now will merge.

@tsibley tsibley merged commit c6fcdf6 into master Apr 12, 2023
@tsibley tsibley deleted the 21L-reference branch April 12, 2023 18:25
@tsibley
Copy link
Member

tsibley commented Apr 12, 2023

One thing I think that's worth addressing, but that I chose not to address now (because I have to timebox this to some reasonable scope), is the vast duplication between nextstrain-gisaid and nextstrain-gisaid-21L profiles. It will be less maintainable over time, esp. as a cross-product with other sources (e.g. open) and other filterings (e.g. 100k).

@tsibley
Copy link
Member

tsibley commented Apr 13, 2023

Five months ago, I wrote:

Why not keep existing builds (with the ancestral Wuhan root) at nextstrain.org/ncov/gisaid/global/6m?

i.e. don't change what's already there, only add the nextstrain.org/ncov/gisaid/21L/… datasets.

The above is what I've done in getting this PR over the line in the past two weeks.

One reason, given on the biweekly call right now by @trvrb, to switch existing builds to use an ancestral label, is that it would make it possible to update nextstrain.org's dataset manifest (ugh that this still exists) such that the dataset selector dropdowns work when viewing 21L builds (and 21L builds are navigable from the ancestral builds). Currently the dataset selector dropdowns do not work well for 21L builds.

I'd suggest that instead of continuing to shift existing URLs all the time and maintain all those redirects, we could instead prioritize a better dataset selector (such as the one sketched out by @jameshadfield) and subsequent removal of the manifest.

@trvrb
Copy link
Member Author

trvrb commented Apr 13, 2023

Good point about dataset selector. We currently also have this issue for https://nextstrain.org/nextclade/sars-cov-2/21L.

@trvrb
Copy link
Member Author

trvrb commented Apr 13, 2023

Also, I think we can get okay, but not ideal, behavior in the dataset selector from the current dataset names of:

  • /ncov/gisaid/global/2m
  • /ncov/gisaid/21L/global/2m
    etc...

by modifying manifest_guest.json to 21L under the "region" heading. We won't get the blank dropdown in this case.

The issue with https://nextstrain.org/nextclade/sars-cov-2/21L is having _21L.json as a trailing option. There's no way to address this with current system.

@emmahodcroft
Copy link
Member

BTW I definitely didn't intend to sound 'problem-finding' with the drop-downs - apologies if it came across that way! I think it's fantastic the 21L build is up, and thanks for pushing it over the line, @tsibley! I only brought it up because the drop-down is mainly how I remember what builds we have and how to get to them (so I find it really useful).

But, I totally get the issue at play here, and I agree that constantly modifying URLs is not an ideal solution. Coming up with a "better way" is almost certainly going to pay off in the long-run so we can have whatever URLs we want, without needing 'sisters' etc.

@tsibley
Copy link
Member

tsibley commented Apr 14, 2023

@emmahodcroft No worries! I have no problem (ha) with you pointing out the dropdown problems—they're real obstacles to usage!—I just wanted to make it clear here that a) the problem exists (since it hadn't been written down) but that b) I'm personally not going to solve it at this time. :-)

@victorlin victorlin changed the title Include 21L rooted Nextstrain build Include 21L rooted Nextstrain build for GISAID data Oct 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

5 participants