-
Notifications
You must be signed in to change notification settings - Fork 403
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add display name clades_display_name.tsv for human friendly names #1050
Conversation
Adds an additional clades.tsv that is for human consumption It combines Nextstrain clade, WHO variant and Pango lineage names Nextstrain clade is always displayed WHO variants are added in parens when meaningful (pre-Omicron) Pango lineage names are used for Omicron era
I think the names themselves look great. For the colouring problem, what was our plan as far as actually having these new names in our metadata? That seems like perhaps the easiest option, to coordinate these two things? |
That's a great question. I think the idea was originally not to put them in metadata at all, and just let users piece them together. We already have all the pieces in the metadata (Clade, WHO, Pango). But now having actually tried to implement, there is advantage in having them in the metadata. I'm starting to think that the way to go is to not treat them as extra clades, but as a mapping from Nextstrain clade to display name. So rather than having another clade.tsv (with duplicated genotypes), I could just make a simple mapping tsv. The mapping would be: Do you prefer to have the column in metadata @emmahodcroft? This is doable and not difficult, I'm just wondering whether this is worth it. I'd still need to edit |
To be fair I don't remember exactly what the plan was as far as replacing the names, it's been such a long time since we started. I definitely wouldn't say to add an extra column in Nextclade - we don't need it, I just didn't know if we wanted to add it in our own metadata from what we can construct in Nextclade. To me it feels like just adding an extra column would be the most transparent (easy for others to follow, figure out what we do, & see that 21L (BA.2) is the same as "21L" and "BA.2" and "Omicron" in all respective columns). But I'd be interested to hear others weigh in. I think the more than can happen in more transparent way, the better, rather than weird behind-the-scenes replacements. |
I think there are a few questions:
If we want the display names in metadata, then the encoding potentially gets coupled to Nextclade:
If we don't want the display names in metadata, either implementation can be used. The actual annotation in ncov works as follows:
|
We just discussed on the call that we can just change clades.tsv names (this would cause changes in nextclade's |
From mores discussion in slack, we've realised that though we moved through the steps of 'phasing out' the old naming in Nextclade files, we haven't done this in We should do this and provide an overlapping period before breaking anything in |
I think this looks good. From an implementation perspective, we might want to consider defining clades via their nextstrain clade name |
I've created a PR in ncov-ingest to add the extra columns to I'd say that my PR should be step 1, then we come back to this issue. Link to slack discussion for easy finding. |
Agreed. This is what I expected to find when I started fixing up #1029, but obviously it's not what's currently done. In any data, it's nice to separate the concerns of ids ( |
Just to be extra explicit, I think as things stand at the moment we can't solve the clade-labelling issue by changing Instead I'd suggest we add the proposed extra columns to |
But, from a brief look at this earlier in the week and @corneliusroemer's comments in our meeting yesterday, doesn't Nextclade only pull/use |
Hmm, that may be true - but we'd have to be very careful to make any changes that would impact the |
this is obsolete now, right? @corneliusroemer |
This superseeded by #1065 |
Description of proposed changes
Adds an additional clades.tsv that is for human consumption
It combines Nextstrain clade, WHO variant and Pango lineage names
Nextstrain clade is always displayed
WHO variants are added in parens when meaningful (pre-Omicron)
Pango lineage names are used for Omicron era
For now, we keep the old
clades.tsv
around and add a new one - but if desired we could just replace theclades.tsv
with the one added here. I don't have enough experience with this repo to know which pattern is better.It will look something like this:
So
22A (Omicron)
becomes22A (BA.4)
Further patterns are:
19A
20I (Alpha)
21M (Omicron)
22C (BA.2.12.1)
22E (BQ.1)
23A (XBB.1.5)
Needs fixing
Because the new display names are not in the metadata (yet), color-ordering grays them out.
Possible fixes:
assign-colors.py
script to allow metadata to be ignored for certain coloringsclade_membership
column values for metadata that is fed intoassign-colors.py
to work with existingassign-colors.py
clade_membership
column from metadata, maybe that's enough to getassign-colors.py
to ignore that column without requiring editing of the script.None of these fixes are perfect. Rewriting the metadata column
clade_membership
before passing toassign-colors.py
may be the easiest.In general, it could be a good idea to turn
assign-colors.py
into an augur subcommand `augurTesting
Release checklist
If this pull request introduces backward incompatible changes, complete the following steps for a new release of the workflow:
docs/src/reference/change_log.md
in this pull request to document these changes and the new version number.If this pull request introduces new features, complete the following steps:
docs/src/reference/change_log.md
in this pull request to document these changes by the date they were added.