Rename "subtype" to "genotype" #41

Closed
j23414 opened this issue Apr 12, 2024 · 5 comments · Fixed by #51
Labels
enhancement New feature or request

Comments

@j23414
Contributor

j23414 commented Apr 12, 2024

Context

In response to comment:

I think we should generally be consistent with the nomenclature. I see that for metadata you have nextclade_subtype with entries like DENV1/II. This is canonically "DENV genotype". I suggest aiming for two columns in the metadata: one for serotype with DENV1, DENV2, etc., and one for denv_genotype with DENV1/II, etc. This is similar to how things work for SARS-CoV-2, which has a clade column as well as a lineage column. Mpox also uses clade and lineage as separate columns.

Description

Currently we have:

  • ncbi_serotype, because we rely on the "NCBI" annotation as the source of serotype assignment. No change to the column name here.
  • nextclade_subtype, because we use "nextclade" for genotype assignment. Rename this to "nextclade_genotype".

Of course feel free to comment on this GitHub Issue with other suggestions.
Optionally, we could reorder the metadata columns such that ncbi_serotype and nextclade_genotype are next to each other to make this distinction more obvious to people manually looking at the metadata file.
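As a rough illustration of the rename and the optional reordering described above, here is a minimal sketch that operates on tab-separated metadata text. The function name `rename_and_reorder` and the sample columns `strain` and `date` are hypothetical and not taken from the repository; only `ncbi_serotype`, `nextclade_subtype`, and `nextclade_genotype` come from this issue.

```python
import csv
import io


def rename_and_reorder(tsv_text, old="nextclade_subtype",
                       new="nextclade_genotype", anchor="ncbi_serotype"):
    """Rename column `old` to `new` and place it right after `anchor`.

    Hypothetical helper for illustration; not the pipeline's actual code.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    columns = [new if name == old else name for name in reader.fieldnames]

    # Move the renamed column so it sits directly after the anchor column.
    if new in columns and anchor in columns:
        columns.remove(new)
        columns.insert(columns.index(anchor) + 1, new)

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    for row in reader:
        if old in row:
            row[new] = row.pop(old)
        writer.writerow(row)
    return out.getvalue()


example = (
    "strain\tncbi_serotype\tdate\tnextclade_subtype\n"
    "A\tDENV1\t2020\tDENV1/II\n"
)
print(rename_and_reorder(example))
```

With the two columns adjacent, someone scanning the file by eye sees the serotype (e.g. DENV1) immediately followed by the genotype (e.g. DENV1/II).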

j23414 added the enhancement label Apr 12, 2024
@j23414
Contributor Author

j23414 commented May 15, 2024

Following from #48 (comment), expanding the scope of this issue to establish standard metadata column names pertaining to Dengue serotype, genotype, and the various methods we employ to derive them for a specific strain (NCBI, Nextclade, augur clades (Nextstrain)).

Below is a proposed standardization along with suggested modifications:

| Metadata Column | Auspice Display Title | Description of data |
| --- | --- | --- |
| ncbi_serotype -> serotype_ncbi -> serotype_genbank | NCBI serotype -> Serotype (NCBI) -> Serotype (Genbank metadata) | Indicates that the assignment of denv1-4 is based on NCBI GenBank record annotation. |
| clade_membership during the "all_genome build" | Serotype -> Serotype (Nextstrain) | Indicates that the assignment of denv1-4 is based on augur clades call using full-genome-level-serotype-defining amino acid mutations. |
| nextclade_subtype -> genotype_nextclade | Nextclade genotype -> Genotype (Nextclade) -> Dengue Genotype (Nextclade) | Denotes genotype level assignment (e.g., DENV1/S) within serotype, based on Nextclade call. |
| clade_membership during the "denvX_genome builds" | DENV genotype -> Genotype (Nextstrain) -> Dengue Genotype (Nextstrain) | Denotes genotype level assignment (e.g., DENV1/S) within serotype, based on augur clades call using full-genome-level-genotype-defining amino acid mutations. |

Feel free to suggest other naming conventions along with written justification. This also leaves room for the potential inclusion of Genotype (NCBI) -> Dengue Genotype (Genbank metadata) if a script is developed to parse genotypes from GenBank data.
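To make the proposal concrete, the sketch below pairs the final metadata column names with their display titles in the shape of an Auspice-style `colorings` list. It is an illustration only: the function `colorings_for_build`, the build-name check, and the `"categorical"` type are assumptions, and the repository's real config files may be organized differently.

```python
import json

# Columns whose display title is the same in every build (names from the table).
SHARED_COLORINGS = [
    {"key": "serotype_genbank", "title": "Serotype (Genbank metadata)",
     "type": "categorical"},
    {"key": "genotype_nextclade", "title": "Dengue Genotype (Nextclade)",
     "type": "categorical"},
]


def colorings_for_build(build_name):
    """Return coloring entries for a build (hypothetical helper).

    Assumes builds starting with "all" call serotypes with augur clades,
    while the per-serotype builds (denvX) call genotypes.
    """
    if build_name.startswith("all"):
        clade_title = "Serotype (Nextstrain)"
    else:
        clade_title = "Dengue Genotype (Nextstrain)"
    clade_entry = {"key": "clade_membership", "title": clade_title,
                   "type": "categorical"}
    return SHARED_COLORINGS + [clade_entry]


print(json.dumps({"colorings": colorings_for_build("denv1_genome")}, indent=2))
```

The sketch highlights that clade_membership carries a different meaning depending on the build, which is why its display title has to be chosen per build rather than once globally.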

@kimandrews

Thanks for this very organized summary! Based on comments here, we may want to use Serotype (GenBank metadata) rather than Serotype (NCBI) and change Genotype to Dengue Genotype (I'm planning to make similar changes for the live measles tree eventually). I think distinguishing between Genotype (Nextclade) and Genotype (Nextstrain) could be confusing, but I'm not sure if there is a better solution if we need to include output from both of these analyses on the same tree. Also I prefer Clade over Genotype, but it's probably best to use Genotype if that is what is used in the Dengue literature.

@j23414
Contributor Author

j23414 commented May 16, 2024

Thanks for linking some more recent discussion on naming! I'm open to changing NCBI to GenBank metadata (e.g. Serotype (NCBI) -> Serotype (GenBank metadata)). Clarification question: are you also planning to update the metadata column names? For example, changing genotype_ncbi to genotype_genbank during ingest here?

I ran a quick PubMed search, and it looks like the dengue literature uses "genotype". I've linked the results below, but it's good to double-check!

  • "dengue"+"genotype"
    • "genotype" is being used in the Dengue literature
  • "dengue"+"clade"
    • There are some mentions of "different clades within genotype" and "2 clades within genotype II", and the results seem to indicate that "clade" can be used for groupings smaller than "genotype"

@kimandrews

It's probably a good idea to change from genotype_ncbi to genotype_genbank in the measles repo as you suggested. I don't have strong opinions about these changes.

@j23414
Contributor Author

j23414 commented May 16, 2024

Thanks for clarifying that this is an optional path. I went ahead and updated the table above accordingly.


2 participants