Add display name clades_display_name.tsv for human friendly names #1050

corneliusroemer · 2023-03-16T18:24:31Z

Description of proposed changes

Adds an additional clades.tsv that is for human consumption
It combines Nextstrain clade, WHO variant and Pango lineage names
Nextstrain clade is always displayed
WHO variants are added in parens when meaningful (pre-Omicron)
Pango lineage names are used for Omicron era

For now, we keep the old clades.tsv around and add a new one - but if desired we could just replace the clades.tsv with the one added here. I don't have enough experience with this repo to know which pattern is better.

It will look something like this:

So 22A (Omicron) becomes 22A (BA.4)

Further patterns are:

19A
20I (Alpha)
21M (Omicron)
22C (BA.2.12.1)
22E (BQ.1)
23A (XBB.1.5)

Needs fixing

Because the new display names are not in the metadata (yet), color-ordering grays them out.

Possible fixes:

Change assign-colors.py script to allow metadata to be ignored for certain colorings
Edit assign-colors to take in node-data-JSONs and use these preferentially over metadata
Rewrite the clade_membership column values for metadata that is fed into assign-colors.py to work with existing assign-colors.py
Delete the clade_membership column from metadata, maybe that's enough to get assign-colors.py to ignore that column without requiring editing of the script.

None of these fixes are perfect. Rewriting the metadata column clade_membership before passing to assign-colors.py may be the easiest.

In general, it could be a good idea to turn assign-colors.py into an augur subcommand `augur

Testing

Tested locally with Europe build visible above and it worked.
Run big test builds: https://github.com/nextstrain/ncov/actions/runs/4449254982, result will be here: https://next.nextstrain.org/staging/ncov/open/trial/display-names/africa/all-time (and at similar URLs)

Release checklist

If this pull request introduces backward incompatible changes, complete the following steps for a new release of the workflow:

Determine the version number for the new release by incrementing the most recent release (e.g., "v2" from "v1").
Update docs/src/reference/change_log.md in this pull request to document these changes and the new version number.
After merging, create a new GitHub release with the new version number as the tag and release title.

If this pull request introduces new features, complete the following steps:

Update docs/src/reference/change_log.md in this pull request to document these changes by the date they were added.

Adds an additional clades.tsv that is for human consumption It combines Nextstrain clade, WHO variant and Pango lineage names Nextstrain clade is always displayed WHO variants are added in parens when meaningful (pre-Omicron) Pango lineage names are used for Omicron era

emmahodcroft · 2023-03-20T10:31:34Z

I think the names themselves look great.

For the colouring problem, what was our plan as far as actually having these new names in our metadata? That seems like perhaps the easiest option, to coordinate these two things?

corneliusroemer · 2023-03-20T18:32:10Z

For the colouring problem, what was our plan as far as actually having these new names in our metadata? That seems like perhaps the easiest option, to coordinate these two things?

That's a great question. I think the idea was originally not to put them in metadata at all, and just let users piece them together. We already have all the pieces in the metadata (Clade, WHO, Pango).

But now having actually tried to implement, there is advantage in having them in the metadata.

I'm starting to think that the way to go is to not treat them as extra clades, but as a mapping from Nextstrain clade to display name. So rather than having another clade.tsv (with duplicated genotypes), I could just make a simple mapping tsv.

The mapping would be:
a) used to add a new clade_display_name column in metadata.tsv during ingest (this way we don't burden Nextclade with an extra column)
b) used to find/replace the output of clades.tsv in ncov

Do you prefer to have the column in metadata @emmahodcroft? This is doable and not difficult, I'm just wondering whether this is worth it.

I'd still need to edit assign-colors.py to accept node-data jsons in addition to metadata, but that's something that should be done anyways.

emmahodcroft · 2023-03-20T19:29:32Z

To be fair I don't remember exactly what the plan was as far as replacing the names, it's been such a long time since we started.

I definitely wouldn't say to add an extra column in Nextclade - we don't need it, I just didn't know if we wanted to add it in our own metadata from what we can construct in Nextclade.

To me it feels like just adding an extra column would be the most transparent (easy for others to follow, figure out what we do, & see that 21L (BA.2) is the same as "21L" and "BA.2" and "Omicron" in all respective columns). But I'd be interested to hear others weigh in. I think the more than can happen in more transparent way, the better, rather than weird behind-the-scenes replacements.

corneliusroemer · 2023-03-20T20:07:42Z

I think there are a few questions:

What should the display names be? There seems to be agreement on the proposal
Do we want the display names as an extra column in metadata?
Do we encode the names in a clades.tsv like currently in the PR or do we make a new mapping tsv that goes from Nextstrain clade to display name, one row per clade.

If we want the display names in metadata, then the encoding potentially gets coupled to Nextclade:

if we encode display name in clades.tsv, then the display name should be added as a new custom clade column to Nextclade (like clade_legacy, who_variant etc.)
if we use a mapping tsv, it should not be done via Nextclade, but just in ingest

If we don't want the display names in metadata, either implementation can be used. The actual annotation in ncov works as follows:

if we add a new clades.tsv (as the PR suggests at the moment) we can just use this clades.tsv, easy
if we use a mapping, we either preprocess the clades.tsv before feeding into augur clades, or we postprocess the node data json coming out of augur clades using the mapping

corneliusroemer · 2023-03-30T18:34:25Z

We just discussed on the call that we can just change clades.tsv names (this would cause changes in nextclade's clade_legacy column, but that's ok, we've notified that there may be changes.

emmahodcroft · 2023-03-31T09:30:55Z

From mores discussion in slack, we've realised that though we moved through the steps of 'phasing out' the old naming in Nextclade files, we haven't done this in metadata (ex: providing columns with 23A, Omicron etc before removing/changing anything, so people can adjust code).
This was part of our original plan, we probably just dropped the ball along the way (and/or got confused that we'd done this with Nextclade but not with ncov).

We should do this and provide an overlapping period before breaking anything in metadata - however, at the same time we could add a column called 'display_name' or similar, which may let us proceed with adjusting the branches.

rneher · 2023-04-07T12:39:48Z

I think this looks good. From an implementation perspective, we might want to consider defining clades via their nextstrain clade name 23A and provide a separate mapping from 23A -> 23A (XBB.1.5). This would align the clade definition with what is called in Nextclade and since Nextstrain clades are unique this should be a fairly easy search-replace operation. But this isn't blocking from my perspective.

emmahodcroft · 2023-04-07T13:46:52Z

I've created a PR in ncov-ingest to add the extra columns to metadata so that we can proceed with this.
nextstrain/ncov-ingest#399

I'd say that my PR should be step 1, then we come back to this issue.

Link to slack discussion for easy finding.

tsibley · 2023-04-07T21:48:00Z

@rneher wrote:

From an implementation perspective, we might want to consider defining clades via their nextstrain clade name 23A and provide a separate mapping from 23A -> 23A (XBB.1.5).

Agreed. This is what I expected to find when I started fixing up #1029, but obviously it's not what's currently done. In any data, it's nice to separate the concerns of ids (23A) vs. labels (23A (XBB.1.5)), esp. for programmatic use.

emmahodcroft · 2023-04-13T08:04:03Z

Just to be extra explicit, I think as things stand at the moment we can't solve the clade-labelling issue by changing clades.tsv because it will change Nextclade output (which will change metadata.tsv contents).

Instead I'd suggest we add the proposed extra columns to metadata.tsv (nextstrain/ncov-ingest#399 - which will give us access to things like 23A, Omicron, etc) and then internally in ncov find a way of constructing the column we need for the clade labelling (which will be a mix of concat-ing different stuff, I guess) & proceed from there? (Other ideas welcome!)

tsibley · 2023-04-14T20:57:41Z

I think as things stand at the moment we can't solve the clade-labelling issue by changing clades.tsv because it will change Nextclade output (which will change metadata.tsv contents).

But, from a brief look at this earlier in the week and @corneliusroemer's comments in our meeting yesterday, doesn't Nextclade only pull/use clades.tsv from ncov at release time? And it's part of the Nextclade datasets, not the code itself, right? So we can change clades.tsv in ncov without immediately affecting Nextclade, and choose what to do for Nextclade separately?

emmahodcroft · 2023-04-16T11:04:49Z

Hmm, that may be true - but we'd have to be very careful to make any changes that would impact the ncov metadata before the next release, I guess. (If I'm understanding it correcntly.) I'm a very risk-averse person myself and I'd probably prefer that we have a firm plan so that we can ensure we don't accidentally break something by forgetting & releasing.

rneher · 2023-05-17T16:38:04Z

this is obsolete now, right? @corneliusroemer

rneher · 2023-05-25T20:43:08Z

This superseeded by #1065

corneliusroemer mentioned this pull request Mar 17, 2023

Parameterize Nextclade dataset in config #1046

Merged

1 task

Use clades_display_name.tsv as default & in color_ordering.tsv

07ac3c4

corneliusroemer mentioned this pull request Mar 17, 2023

ENH: Add augur colors subcommand based on assign-colors.py scripts nextstrain/augur#1185

Open

corneliusroemer requested a review from a team March 17, 2023 18:36

rneher approved these changes Apr 7, 2023

View reviewed changes

rneher mentioned this pull request Apr 7, 2023

Include 21L rooted Nextstrain build for GISAID data #1029

Merged

rneher self-requested a review April 7, 2023 13:42

emmahodcroft mentioned this pull request Apr 7, 2023

Add more clade columns to metadata nextstrain/ncov-ingest#399

Merged

rneher closed this May 25, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add display name clades_display_name.tsv for human friendly names #1050

Add display name clades_display_name.tsv for human friendly names #1050

corneliusroemer commented Mar 16, 2023 •

edited

Loading

emmahodcroft commented Mar 20, 2023

corneliusroemer commented Mar 20, 2023

emmahodcroft commented Mar 20, 2023

corneliusroemer commented Mar 20, 2023

corneliusroemer commented Mar 30, 2023

emmahodcroft commented Mar 31, 2023 •

edited

Loading

rneher commented Apr 7, 2023

emmahodcroft commented Apr 7, 2023

tsibley commented Apr 7, 2023 •

edited

Loading

emmahodcroft commented Apr 13, 2023 •

edited

Loading

tsibley commented Apr 14, 2023

emmahodcroft commented Apr 16, 2023 •

edited

Loading

rneher commented May 17, 2023

rneher commented May 25, 2023

Add display name clades_display_name.tsv for human friendly names #1050

Add display name clades_display_name.tsv for human friendly names #1050

Conversation

corneliusroemer commented Mar 16, 2023 • edited Loading

Description of proposed changes

Needs fixing

Testing

Release checklist

emmahodcroft commented Mar 20, 2023

corneliusroemer commented Mar 20, 2023

emmahodcroft commented Mar 20, 2023

corneliusroemer commented Mar 20, 2023

corneliusroemer commented Mar 30, 2023

emmahodcroft commented Mar 31, 2023 • edited Loading

rneher commented Apr 7, 2023

emmahodcroft commented Apr 7, 2023

tsibley commented Apr 7, 2023 • edited Loading

emmahodcroft commented Apr 13, 2023 • edited Loading

tsibley commented Apr 14, 2023

emmahodcroft commented Apr 16, 2023 • edited Loading

rneher commented May 17, 2023

rneher commented May 25, 2023

corneliusroemer commented Mar 16, 2023 •

edited

Loading

emmahodcroft commented Mar 31, 2023 •

edited

Loading

tsibley commented Apr 7, 2023 •

edited

Loading

emmahodcroft commented Apr 13, 2023 •

edited

Loading

emmahodcroft commented Apr 16, 2023 •

edited

Loading