parse-genbank-location should warn about region/locality mix ups #1578

Open
joverlee521 opened this issue Aug 14, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

@joverlee521
Contributor

joverlee521 commented Aug 14, 2024

Currently, parse-genbank-location strictly follows GenBank's documented pattern for geo_loc_name:

# Expected pattern for the location field is
# "<country_value>[:<region>][, <locality>]"
#
# See GenBank docs for their "country" field:
# https://www.ncbi.nlm.nih.gov/genbank/collab/country/

However, GenBank records don't always follow this pattern, as shown in nextstrain/rabies#10.
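For reference, here is a minimal sketch of splitting a value according to the documented pattern. It is not the actual parse-genbank-location implementation; the function name `parse_geo_loc_name` and the `GenBankLocation` tuple are hypothetical, and it assumes country names contain no `:` or `,`.

```python
from typing import NamedTuple


class GenBankLocation(NamedTuple):
    country: str
    region: str
    locality: str


def parse_geo_loc_name(value: str) -> GenBankLocation:
    """Split "<country_value>[:<region>][, <locality>]" into its parts.

    Hypothetical helper for illustration only. Missing parts come back
    as empty strings.
    """
    head, colon, rest = value.partition(":")
    if colon:
        # "<country>:<region>[, <locality>]"
        country = head
        region, _, locality = rest.partition(",")
    else:
        # "<country>[, <locality>]" with no region
        country, _, locality = head.partition(",")
        region = ""
    return GenBankLocation(country.strip(), region.strip(), locality.strip())
```

A purely positional split like this cannot tell when a record has the two parts swapped: `parse_geo_loc_name("USA: Seattle, Washington")` puts "Seattle" in the region slot and "Washington" in the locality slot.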

We've previously handled this in ncov-ingest specifically for USA locations by checking for US state codes, but we can do a more generalized check with something like pycountry. If there is a region/locality mix-up, the command should emit a warning with instructions on how to fix it with apply-geolocation-rules.
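A rough sketch of what such a check could look like, under some loud assumptions: the function name `check_region_locality` and the lookup table are hypothetical, and the table is a tiny hard-coded subset so the example has no dependencies. A real check would presumably build the per-country sets from something like pycountry's subdivision data.

```python
from typing import Mapping, Optional, Set

# Illustrative subset only (assumption): a real check would populate this
# from a subdivision database such as pycountry rather than hard-coding it.
ILLUSTRATIVE_REGIONS: Mapping[str, Set[str]] = {
    "USA": {"washington", "california", "new york"},
}


def check_region_locality(
    country: str,
    region: str,
    locality: str,
    known_regions: Mapping[str, Set[str]] = ILLUSTRATIVE_REGIONS,
) -> Optional[str]:
    """Return a warning message if region and locality look swapped, else None.

    Hypothetical helper: flags a record only when the locality matches a
    known region of the country and the region does not.
    """
    regions = known_regions.get(country)
    if not regions:
        # Nothing known about this country, so stay silent.
        return None

    region_is_known = region.strip().lower() in regions
    locality_is_known = locality.strip().lower() in regions

    if locality_is_known and not region_is_known:
        return (
            f"Possible region/locality mix-up for {country!r}: "
            f"region={region!r}, locality={locality!r}. "
            "Consider correcting this record with apply-geolocation-rules."
        )
    return None
```

The check only reports and never rewrites the record, so ambiguous cases such as a city that shares its name with a region (for example "New York") are left alone.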

joverlee521 added the enhancement (New feature or request) and proposal (Proposals that warrant further discussion) labels on Aug 14, 2024
@genehack
Contributor

The alternative would be warning loudly and providing instructions on how to use the geo location file to override bad annotations?

I always worry that automatically fixing things like this will actually inject difficult-to-detect errors.

@joverlee521
Contributor Author

> The alternative would be warning loudly and providing instructions on how to use the geo location file to override bad annotations?
>
> I always worry that automatically fixing things like this will actually inject difficult-to-detect errors.

That's fair! We'd still have to use something like pycountry to detect these mix-ups so that we can warn users about them.

joverlee521 changed the title from "Should parse-genbank-location automatically fix region/locality mix ups?" to "parse-genbank-location should warn about region/locality mix ups" on Sep 13, 2024
joverlee521 removed the proposal (Proposals that warrant further discussion) label on Sep 13, 2024