Include 21L rooted Nextstrain build for GISAID data #1029

trvrb · 2022-11-02T23:15:04Z

Description of proposed changes

This PR includes a new set of Nextstrain builds for GISAID data that use 21L as reference rather than Wuhan. All viruses that outgroup relative to 21L in the tree are dropped via filter rule:

filter:
  min_date: "2022-01-01"
  exclude_where: "division='USA' \
    Nextstrain_clade='19A' \
    Nextstrain_clade='19B' \
    Nextstrain_clade='20A' \
    Nextstrain_clade='20B' \
    ...

I decided to split out a separate nextstrain_profiles/nextstrain-gisaid-21L rather than padding out nextstrain_profiles/nextstrain-gisaid with additional logic because the latter would have resulted in extensive code-duplication within the builds.yaml in specifying different behavior across specific build endpoints.

I've run this locally with swapping

inputs:
  - name: gisaid
    metadata: "s3://nextstrain-ncov-private/global/metadata.tsv.xz"
    aligned: "s3://nextstrain-ncov-private/global/sequences.fasta.xz"

to

inputs:
  - name: gisaid
    metadata: "s3://nextstrain-ncov-private/metadata.tsv.xz"
    aligned: "s3://nextstrain-ncov-private/sequences.fasta.xz"

and things work well and run quickly. I'm still testing using the full data on AWS however and will update this PR with links to trial runs.

That said, restricting to 21L makes a number of visuals much clearer. For example, here is the smaller local tree looking at S1 mutations:

I'd plan to split things out so that available endpoints include:

nextstrain.org/ncov/gisaid/ancestral/global/6m
nextstrain.org/ncov/gisaid/21L/global/6m
nextstrain.org/ncov/gisaid/21L/asia/2m
etc...

However, all this said, I don't think this is the right tact to solve the problem.

Issues include:

In the workflow, the filter rule happens after the subsampling rules. This makes it so that if we ask for say 2560 in a sampling bucket, we'll lose >50% due to filtering out non-21L-descending clades.
This could be solved by padding count targets to compensate, but this is hacky and the numbers will change as time goes on. Or the filter rule could be placed again before subsample, but we moved it afterwards for good reasons.
I'd anticipate wanting a number of analyses (like https://github.com/blab/ncov-escape and other projects) that use 21L as reference. Having to deal with the entire dataset and having to filter out outgroup viruses will be repeatedly annoying.

Instead, I'd propose that we move this filter logic to https://github.com/nextstrain/ncov-ingest and to make the filtered metadata and sequences available for both GISAID data and open data at URLs like:

s3://nextstrain-ncov-private/21L/metadata.tsv.gz
s3://nextstrain-ncov-private/21L/sequences.fasta.xz
s3://nextstrain-data/files/ncov/open/21L/metadata.tsv.gz
s3://nextstrain-data/files/ncov/open/21L/sequences.fasta.xz

This will streamline and speed up running of 21L builds (augur filter runs repeatedly on a smaller metadata file) and help enable many future analyses.

Testing

As above, I've locally tested with

inputs:
  - name: gisaid
    metadata: "s3://nextstrain-ncov-private/metadata.tsv.xz"
    aligned: "s3://nextstrain-ncov-private/sequences.fasta.xz"

and trial build is running that will output to https://nextstrain.org/staging/ncov/gisaid/21L/trial/21L-reference/global/6m. I'll update when this completes.

corneliusroemer

I've noticed that you use a 21L reference with deletions.

Will we keep using Wuhan coordinates? This should be the case as long as we translate using Wuhan - otherwise Spike and other positions will be messed up.

trvrb · 2022-11-03T00:01:29Z

I think best to keep everything in Wuhan/1 coordinates. Latest BQ.1 preprints still use Wuhan/1 coordinates. Will be way too confusing otherwise. So perhaps this PR is better thought of as 21L "root" rather than 21L "reference".

corneliusroemer · 2022-11-03T00:18:55Z

I totally agree, especially given that deletions commonly occur, like 69/70del and 144del etc it'd be a mess to change coordinate systems and not worth it.

tsibley · 2022-11-22T23:26:05Z

Commenting mostly on the organizational/conceptual bits (and not the code changes here)…

I'd plan to split things out so that available endpoints include:

nextstrain.org/ncov/gisaid/ancestral/global/6m

nextstrain.org/ncov/gisaid/21L/global/6m

nextstrain.org/ncov/gisaid/21L/asia/2m
etc...

Why not keep existing builds (with the ancestral Wuhan root) at nextstrain.org/ncov/gisaid/global/6m?

i.e. don't change what's already there, only add the nextstrain.org/ncov/gisaid/21L/… datasets.

Issues include:

In the workflow, the filter rule happens after the subsampling rules. This makes it so that if we ask for say 2560 in a sampling bucket, we'll lose >50% due to filtering out non-21L-descending clades.
This could be solved by padding count targets to compensate, but this is hacky and the numbers will change as time goes on. Or the filter rule could be placed again before subsample, but we moved it afterwards for good reasons.

Alternatively, a pre-subsampling filter rule could be added to the DAG for this profile instead of moving the existing filter rule around (which I agree is probably a non-starter). This would look like the build config defining an "aligned" input that's a local file (to be generated by the new prefilter rule), e.g.:

inputs:
  - name: gisaid
    metadata: "results/gisaid_21L_metadata.fasta.xz"
    aligned: "results/gisaid_21L_aligned.fasta.xz"

a new custom_rules file, e.g.:

custom_rules:
  - nextstrain_profiles/nextstrain-gisaid-21L/prefilter.smk

and in that new file writing the rule to do the filtering down to 21L, e.g. something that looks roughly like:

rule 21L_prefilter:
    input:
        metadata = path_or_file("s3://nextstrain-ncov-private/metadata.tsv.xz"),
        aligned = path_or_file("s3://nextstrain-ncov-private/aligned.fasta.xz")
    output:
        metadata = "results/gisaid_21L_metadata.tsv.xz",
        aligned = "results/gisaid_21L_aligned.fasta.xz"
    shell:
        …

(You also might be able to accomplish the same thing without any custom rules by chaining inputs together, though I'm not sure it's possible with existing config.)

I don't think this as a one-off approach is necessarily better than one-off provisioning of a separate set of files upstream in ncov-ingest, but I think if the prefiltering concept was generalized for this workflow (even if prefiltering was still limited only to clades or something) that might be better than producing/maintaining lots of separate static subsets.

I'd anticipate wanting a number of analyses (like https://github.com/blab/ncov-escape and other projects) that use 21L as reference. Having to deal with the entire dataset and having to filter out outgroup viruses will be repeatedly annoying.

Is the biggest annoyance here the slowness of augur filter? Or that it's verbose/difficult to use it to filter to just 21L? (Both?)

Instead, I'd propose that we move this filter logic to https://github.com/nextstrain/ncov-ingest and to make the filtered metadata and sequences available for both GISAID data and open data at URLs like:

s3://nextstrain-ncov-private/21L/metadata.tsv.gz

s3://nextstrain-ncov-private/21L/sequences.fasta.xz

s3://nextstrain-data/files/ncov/open/21L/metadata.tsv.gz

s3://nextstrain-data/files/ncov/open/21L/sequences.fasta.xz

I think it's reasonable to provision separate subsets for specific cases like this, though I do wonder if the alternative I describe above might be better in the long run.

Re: these specific proposed paths: they do perpetuate the comingling of workflow-associated files and dataset-associated files that ncov-ingest and ncov already do, and in that sense it's fine and maybe even preferred, but it might be nice to consider putting new workflow-associated files (which these are) in the new locations we decided upon, e.g.:

s3://nextstrain-data-private/files/workflows/ncov/21L/metadata.tsv.gz
s3://nextstrain-data-private/files/workflows/ncov/21L/sequences.fasta.xz
s3://nextstrain-data/files/workflows/ncov/open/21L/metadata.tsv.gz
s3://nextstrain-data/files/workflows/ncov/open/21L/sequences.fasta.xz

Since we're not planning to remove the existing comingled files for ncov/ncov-ingest, avoiding adding more might help avoid further confusion of what's what.

workflow/snakemake_rules/export_for_nextstrain.smk

tsibley · 2023-03-30T22:45:21Z

workflow/snakemake_rules/main_workflow.smk

        Adjusting metadata for build '{wildcards.build_name}'
        """
    input:
-        metadata="results/{build_name}/metadata_with_index.tsv",
+        metadata="results/{build_name}/metadata_filtered.tsv.xz",
    output:
        metadata = "results/{build_name}/metadata_adjusted.tsv.xz"


This changes the metadata that's input to several downstream rules. Rules which take metadata_adjusted.tsv.xz as input currently get the subsampled but unfiltered metadata and with this change would get the subsampled and filtered metadata.

Is this change a bug fix? It seems like it. But it does have potential visible changes like rule colors now will produce a scale based on the filtered rather than unfiltered metadata values.

I'd appreciate someone else's 👀 on this bit.

/cc @nextstrain/core

Split out separately in d8c17ec.

tsibley · 2023-03-30T22:48:40Z

Rebased this onto the latest master to bring in changes. This also highlights a few things that need updating since the copy, which I'll mark separately with comments here so I can track them.

nextstrain_profiles/nextstrain-gisaid-21L/builds.yaml

nextstrain_profiles/nextstrain-gisaid-21L/nextstrain_description.md

This is intended as a bug fix to more closely align metadata with sequences in rules downstream of subsample + filter. Rules which take metadata_adjusted.tsv.xz as input used to get the subsampled _but not_ filtered metadata. With this change, they now get the subsampled _and_ filtered metadata. Many rules will not be affected as they join sequences → metadata and ignore extra metadata rows. However, there are some potential changes to results as some rules do consult all of the metadata without joining it to sequences. For example, the "colors" rule will now produce a scale based on the filtered rather than unfiltered metadata values, which could be noticeably different sets. Co-authored-by: Trevor Bedford <trevor@bedford.io>

All circulating viruses now descend from clade 21L. By rooting the tree on 21L reference virus we can better emphasize recent evolution. This includes tree layout, but also things like S1 mutations, where these were basically all just red rather than emphasizing differences among circulating viruses, or clades, where all the previous clades were making 21L through 22E all similar colors of red. This copies the existing Nextstrain GISAID build and adds some simple logic to filter out clades previous to 21L. This results in some code duplication between the ancestral GISAID build and the 21L GISAID build, but overall much simpler code than if a single build.yml covered both ancestral and 21L root cases. Co-authored-by: Thomas Sibley <tsibley@fredhutch.org>

@trvrb

Filter rules in the config are applied _after_ subsampling, which poses issues with reliably getting the desired number of sequences. As @trvrb wrote¹: > In the workflow, the filter rule happens after the subsampling rules. > This makes it so that if we ask for say 2560 in a sampling bucket, we'll > lose >50% due to filtering out non-21L-descending clades. > > This could be solved by padding count targets to compensate, but this is > hacky and the numbers will change as time goes on. Or the filter rule > could be placed again before subsample, but we moved it afterwards for > good reasons. A few custom rules for the builds allows us to prefilter the full dataset before subsampling. Currently these rules are specific to our GISAID data source, but they could be easily expanded to our Open data sources too. In the future we might also provide clade-partitioned subsets from ncov-ingest², which we could use here instead with some adaptation of the build config. ¹ <#1029 (comment)> ² e.g. <nextstrain/ncov-ingest#398>

tsibley · 2023-04-05T20:45:09Z

Per discussion last week, I reworked this branch to prefilter the data before subsampling and to avoid memory-inefficient steps. See the commit messages for details. This is ready for review again by @nextstrain/core (and maybe particularly @trvrb).

There's also a trial build running now.

tsibley · 2023-04-05T22:55:47Z

Looking at the global/6m build locally, it seems ok.

Oddly, I noticed there's a sequence that made it thru classified as 19A:

This sequence is in our metadata as 22B, so passes the pre-filters, and is from 2022-12, so passes the date filter, but must be getting reclassified to 19A (erroneously?) by the workflow. I haven't dug into it more, but perhaps we want to both pre-filter and (post-)filter on the clades to catch these reclassifications?

emmahodcroft · 2023-04-06T13:12:04Z

Yes - in fact, I bet it's not even the workflow proper where the reclassification is happening - it's just at the end when the clades are being 'put' on the tree (so that the labels show up and everything gets a nice colour). This overrides whatever classifications we've gotten from Nextclade in the display. But these are essentially only assigned at one node - then everything downstream from that is classified as "being that node" (unless another node downstream also has a label, which then overrides the first).

If I'm remembering correctly, node labels attach to the first node available that matches their specified mutations. 19A is a bit 'weak' in the respect that it's only defined by 2 mutations -- of course, we didn't have many muts at the start!! So this may well literally be all that defines this clade.
However, in a normal build, 19A split is near the root, so this node gets assigned early and won't end up lower down the tree (even though these mutations may happen elsewhere). However, in this build, there is no 'real' 19A node, so it's just wandering down the tree, and if it happens across a node with the two mutations it needs, it attaches there...

Naively I would have said that node assignments happen in order specified in the TSV, with nodes coming later in the list overriding any assigned earlier (after all, after the first clade is assigned 1/2 the tree may be 'part of that node' until you do your next assignment within that) - so I'm a little surprised this happened, but I may be completely misremembering how this works.

If useful for others, here's the tree: https://nextstrain.org/staging/ncov/gisaid/21L/trial/21L-reference/global/6m?label=clade:22E%20%28Omicron%29&showBranchLabels=all

emmahodcroft · 2023-04-06T13:15:24Z

As for solutions:

I don't think a pre-filter will help, as this node really is 22B, it's just that it also has these two 19A mutations.

The most straightforward (on the surface) solution would be to use a separate clades_nextstrain.tsv for these 21L builds - but already have a lot of these and we have to update each separately every time we add a new clade - so that's not ideal if we can find another way...

tsibley · 2023-04-06T16:43:33Z

However, in this build, there is no 'real' 19A node, so it's just wandering down the tree, and if it happens across a node with the two mutations it needs, it attaches there...

Ahh, this makes sense. Thanks for the explanation!

The most straightforward (on the surface) solution would be to use a separate clades_nextstrain.tsv for these 21L builds

Hmm. I'd think we could programmatically filter the clades though to remove the ones prior to 21L… I'll try it.

tsibley · 2023-04-06T17:05:16Z

Oh, hrm, but I can't just remove the excluded clades because they're defined hierarchically. So the filtering/transformation is a bit more complex.

tsibley · 2023-04-06T18:13:00Z

I've pushed 59a672c to try to address the clade labeling issue. Testing a global/6m build first, then will trigger a full set of trial builds.

emmahodcroft · 2023-04-06T18:28:14Z

Very nice little solution @tsibley ! I like the idea and the implementation - makes sense to me, and avoid us having another clades file to update manually!

tsibley · 2023-04-06T21:02:13Z

Well, that didn't work as intended:

No clades got labeled…

@emmahodcroft

This avoids incorrect labeling of the trees by `augur clades` if a sequence happens to have mutations that define one of the excluded clades. This happens because there aren't earlier sequences for those clades to be labeled due to their exclusion. For example, we've seen some spurious labelings of 19A because we exclude such sequences from these builds. Thanks to @emmahodcroft for debugging and explaining this!¹ Since we define our clades hierarchically with inheritance, a small new program is needed to fully expand those definitions so we can filter them without worrying about filtering out their "parent" clades. ¹ <#1029 (comment)>

tsibley · 2023-04-06T21:35:31Z

Off-by-one error on my part because I saw but then didn't recall this bit about augur.clades.read_in_clade_definitions():

Sites are 1 indexed in the file, and are converted to 0 indexed in the output

tsibley · 2023-04-07T00:02:19Z

Ok, that's better:

tsibley · 2023-04-07T00:03:45Z

I've started another full set of trial builds (build id 93366c93-b2df-4db1-9f84-f4778bb839a5), which will appear at URLs like https://nextstrain.org/staging/ncov/gisaid/21L/trial/21L-reference/global/6m if things go well.

emmahodcroft · 2023-04-07T11:32:49Z

Looks like the test builds completed, and checking a few of them, they look really good now! Thanks for tackling this Tom, I think it'll be really useful!

rneher · 2023-04-07T13:04:23Z

LGTM, but just a heads up that the exclude_clades depends on the values of Nextstrain_clade in the metadata. We are currently moving the names displayed on the tree from 23A (Omicron) to smth like 23A (XBB.1.5) #1050 and the primary output of Nextclade is just 23A. We could add the latter the metadata and use that.

But for the time being this should work, just highlighting possible future divergence and breakage.

tsibley · 2023-04-07T21:54:50Z

@rneher Yep, thanks for noting that. I think the sort of thing done by this PR is a great example of why we want to do nextstrain/ncov-ingest#399. :-) I'd much rather write and maintain code to filter on 23A vs. 23A (Omicron) or 23A (XBB.1.5). I don't think that's an option for this PR until that other work is complete, though.

This should have been removed in "Remove preprocessing pipelines" (9abad2a).

This is the event triggered by ncov-ingest when there's new GISAID data, which also triggers the standard ancestral-rooted builds. The specific "rebuild-gisaid-21L" event for this workflow seems unnecessary since it would only be triggered manually, and in that case I expect folks to use the GitHub Actions web UI instead.

tsibley · 2023-04-12T18:25:13Z

Pushed a couple additional small commits, one of which ensures these 21L-rooted builds are triggered by new data, and now will merge.

tsibley · 2023-04-12T18:43:35Z

One thing I think that's worth addressing, but that I chose not to address now (because I have to timebox this to some reasonable scope), is the vast duplication between nextstrain-gisaid and nextstrain-gisaid-21L profiles. It will be less maintainable over time, esp. as a cross-product with other sources (e.g. open) and other filterings (e.g. 100k).

tsibley · 2023-04-13T19:11:54Z

Five months ago, I wrote:

Why not keep existing builds (with the ancestral Wuhan root) at nextstrain.org/ncov/gisaid/global/6m?

i.e. don't change what's already there, only add the nextstrain.org/ncov/gisaid/21L/… datasets.

The above is what I've done in getting this PR over the line in the past two weeks.

One reason, given on the biweekly call right now by @trvrb, to switch existing builds to use an ancestral label, is that it would make it possible to update nextstrain.org's dataset manifest (ugh that this still exists) such that the dataset selector dropdowns work when viewing 21L builds (and 21L builds are navigable from the ancestral builds). Currently the dataset selector dropdowns do not work well for 21L builds.

I'd suggest that instead of continuing to shift existing URLs all the time and maintain all those redirects, we could instead prioritize a better dataset selector (such as the one sketched out by @jameshadfield) and subsequent removal of the manifest.

trvrb · 2023-04-13T19:57:40Z

Good point about dataset selector. We currently also have this issue for https://nextstrain.org/nextclade/sars-cov-2/21L.

trvrb · 2023-04-13T20:38:48Z

Also, I think we can get okay, but not ideal, behavior in the dataset selector from the current dataset names of:

/ncov/gisaid/global/2m
/ncov/gisaid/21L/global/2m
etc...

by modifying manifest_guest.json to 21L under the "region" heading. We won't get the blank dropdown in this case.

The issue with https://nextstrain.org/nextclade/sars-cov-2/21L is having _21L.json as a trailing option. There's no way to address this with current system.

emmahodcroft · 2023-04-14T07:36:24Z

BTW I definitely didn't intend to sound 'problem-finding' with the drop-downs - apologies if it came across that way! I think it's fantastic the 21L build is up, and thanks for pushing it over the line, @tsibley! I only brought it up because the drop-down is mainly how I remember what builds we have and how to get to them (so I find it really useful).

But, I totally get the issue at play here, and I agree that constantly modifying URLs is not an ideal solution. Coming up with a "better way" is almost certainly going to pay off in the long-run so we can have whatever URLs we want, without needing 'sisters' etc.

tsibley · 2023-04-14T21:02:55Z

@emmahodcroft No worries! I have no problem (ha) with you pointing out the dropdown problems—they're real obstacles to usage!—I just wanted to make it clear here that a) the problem exists (since it hadn't been written down) but that b) I'm personally not going to solve it at this time. :-)

corneliusroemer reviewed Nov 2, 2022

View reviewed changes

tsibley changed the title ~~Include 21L reference Nextstrain build~~ Include 21L rooted Nextstrain build Nov 22, 2022

tsibley reviewed Nov 22, 2022

View reviewed changes

workflow/snakemake_rules/export_for_nextstrain.smk Outdated Show resolved Hide resolved

corneliusroemer assigned tsibley Mar 30, 2023