Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add display name clades_display_name.tsv for human friendly names #1050

Closed
wants to merge 2 commits into from

Conversation

corneliusroemer
Copy link
Member

@corneliusroemer corneliusroemer commented Mar 16, 2023

Description of proposed changes

Adds an additional clades.tsv that is for human consumption
It combines Nextstrain clade, WHO variant and Pango lineage names
Nextstrain clade is always displayed
WHO variants are added in parens when meaningful (pre-Omicron)
Pango lineage names are used for Omicron era

For now, we keep the old clades.tsv around and add a new one - but if desired we could just replace the clades.tsv with the one added here. I don't have enough experience with this repo to know which pattern is better.

It will look something like this:
image

So 22A (Omicron) becomes 22A (BA.4)

Further patterns are:

  • 19A
  • 20I (Alpha)
  • 21M (Omicron)
  • 22C (BA.2.12.1)
  • 22E (BQ.1)
  • 23A (XBB.1.5)

Needs fixing

Because the new display names are not in the metadata (yet), color-ordering grays them out.

image

Possible fixes:

  • Change assign-colors.py script to allow metadata to be ignored for certain colorings
  • Edit assign-colors to take in node-data-JSONs and use these preferentially over metadata
  • Rewrite the clade_membership column values for metadata that is fed into assign-colors.py to work with existing assign-colors.py
  • Delete the clade_membership column from metadata, maybe that's enough to get assign-colors.py to ignore that column without requiring editing of the script.

None of these fixes are perfect. Rewriting the metadata column clade_membership before passing to assign-colors.py may be the easiest.

In general, it could be a good idea to turn assign-colors.py into an augur subcommand `augur

Testing

Release checklist

If this pull request introduces backward incompatible changes, complete the following steps for a new release of the workflow:

  • Determine the version number for the new release by incrementing the most recent release (e.g., "v2" from "v1").
  • Update docs/src/reference/change_log.md in this pull request to document these changes and the new version number.
  • After merging, create a new GitHub release with the new version number as the tag and release title.

If this pull request introduces new features, complete the following steps:

  • Update docs/src/reference/change_log.md in this pull request to document these changes by the date they were added.

Adds an additional clades.tsv that is for human consumption
It combines Nextstrain clade, WHO variant and Pango lineage names
Nextstrain clade is always displayed
WHO variants are added in parens when meaningful (pre-Omicron)
Pango lineage names are used for Omicron era
@emmahodcroft
Copy link
Member

I think the names themselves look great.

For the colouring problem, what was our plan as far as actually having these new names in our metadata? That seems like perhaps the easiest option, to coordinate these two things?

@corneliusroemer
Copy link
Member Author

For the colouring problem, what was our plan as far as actually having these new names in our metadata? That seems like perhaps the easiest option, to coordinate these two things?

That's a great question. I think the idea was originally not to put them in metadata at all, and just let users piece them together. We already have all the pieces in the metadata (Clade, WHO, Pango).

But now having actually tried to implement, there is advantage in having them in the metadata.

I'm starting to think that the way to go is to not treat them as extra clades, but as a mapping from Nextstrain clade to display name. So rather than having another clade.tsv (with duplicated genotypes), I could just make a simple mapping tsv.

The mapping would be:
a) used to add a new clade_display_name column in metadata.tsv during ingest (this way we don't burden Nextclade with an extra column)
b) used to find/replace the output of clades.tsv in ncov

Do you prefer to have the column in metadata @emmahodcroft? This is doable and not difficult, I'm just wondering whether this is worth it.

I'd still need to edit assign-colors.py to accept node-data jsons in addition to metadata, but that's something that should be done anyways.

@emmahodcroft
Copy link
Member

To be fair I don't remember exactly what the plan was as far as replacing the names, it's been such a long time since we started.

I definitely wouldn't say to add an extra column in Nextclade - we don't need it, I just didn't know if we wanted to add it in our own metadata from what we can construct in Nextclade.

To me it feels like just adding an extra column would be the most transparent (easy for others to follow, figure out what we do, & see that 21L (BA.2) is the same as "21L" and "BA.2" and "Omicron" in all respective columns). But I'd be interested to hear others weigh in. I think the more than can happen in more transparent way, the better, rather than weird behind-the-scenes replacements.

@corneliusroemer
Copy link
Member Author

I think there are a few questions:

  1. What should the display names be? There seems to be agreement on the proposal
  2. Do we want the display names as an extra column in metadata?
  3. Do we encode the names in a clades.tsv like currently in the PR or do we make a new mapping tsv that goes from Nextstrain clade to display name, one row per clade.

If we want the display names in metadata, then the encoding potentially gets coupled to Nextclade:

  • if we encode display name in clades.tsv, then the display name should be added as a new custom clade column to Nextclade (like clade_legacy, who_variant etc.)
  • if we use a mapping tsv, it should not be done via Nextclade, but just in ingest

If we don't want the display names in metadata, either implementation can be used. The actual annotation in ncov works as follows:

  • if we add a new clades.tsv (as the PR suggests at the moment) we can just use this clades.tsv, easy
  • if we use a mapping, we either preprocess the clades.tsv before feeding into augur clades, or we postprocess the node data json coming out of augur clades using the mapping

@corneliusroemer
Copy link
Member Author

We just discussed on the call that we can just change clades.tsv names (this would cause changes in nextclade's clade_legacy column, but that's ok, we've notified that there may be changes.

@emmahodcroft
Copy link
Member

emmahodcroft commented Mar 31, 2023

From mores discussion in slack, we've realised that though we moved through the steps of 'phasing out' the old naming in Nextclade files, we haven't done this in metadata (ex: providing columns with 23A, Omicron etc before removing/changing anything, so people can adjust code).
This was part of our original plan, we probably just dropped the ball along the way (and/or got confused that we'd done this with Nextclade but not with ncov).

We should do this and provide an overlapping period before breaking anything in metadata - however, at the same time we could add a column called 'display_name' or similar, which may let us proceed with adjusting the branches.

@rneher
Copy link
Member

rneher commented Apr 7, 2023

I think this looks good. From an implementation perspective, we might want to consider defining clades via their nextstrain clade name 23A and provide a separate mapping from 23A -> 23A (XBB.1.5). This would align the clade definition with what is called in Nextclade and since Nextstrain clades are unique this should be a fairly easy search-replace operation. But this isn't blocking from my perspective.

@emmahodcroft
Copy link
Member

I've created a PR in ncov-ingest to add the extra columns to metadata so that we can proceed with this.
nextstrain/ncov-ingest#399

I'd say that my PR should be step 1, then we come back to this issue.

Link to slack discussion for easy finding.

@tsibley
Copy link
Member

tsibley commented Apr 7, 2023

@rneher wrote:

From an implementation perspective, we might want to consider defining clades via their nextstrain clade name 23A and provide a separate mapping from 23A -> 23A (XBB.1.5).

Agreed. This is what I expected to find when I started fixing up #1029, but obviously it's not what's currently done. In any data, it's nice to separate the concerns of ids (23A) vs. labels (23A (XBB.1.5)), esp. for programmatic use.

@emmahodcroft
Copy link
Member

emmahodcroft commented Apr 13, 2023

Just to be extra explicit, I think as things stand at the moment we can't solve the clade-labelling issue by changing clades.tsv because it will change Nextclade output (which will change metadata.tsv contents).

Instead I'd suggest we add the proposed extra columns to metadata.tsv (nextstrain/ncov-ingest#399 - which will give us access to things like 23A, Omicron, etc) and then internally in ncov find a way of constructing the column we need for the clade labelling (which will be a mix of concat-ing different stuff, I guess) & proceed from there? (Other ideas welcome!)

@tsibley
Copy link
Member

tsibley commented Apr 14, 2023

I think as things stand at the moment we can't solve the clade-labelling issue by changing clades.tsv because it will change Nextclade output (which will change metadata.tsv contents).

But, from a brief look at this earlier in the week and @corneliusroemer's comments in our meeting yesterday, doesn't Nextclade only pull/use clades.tsv from ncov at release time? And it's part of the Nextclade datasets, not the code itself, right? So we can change clades.tsv in ncov without immediately affecting Nextclade, and choose what to do for Nextclade separately?

@emmahodcroft
Copy link
Member

emmahodcroft commented Apr 16, 2023

Hmm, that may be true - but we'd have to be very careful to make any changes that would impact the ncov metadata before the next release, I guess. (If I'm understanding it correcntly.) I'm a very risk-averse person myself and I'd probably prefer that we have a firm plan so that we can ensure we don't accidentally break something by forgetting & releasing.

@rneher
Copy link
Member

rneher commented May 17, 2023

this is obsolete now, right? @corneliusroemer

@rneher
Copy link
Member

rneher commented May 25, 2023

This superseeded by #1065

@rneher rneher closed this May 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

4 participants