> 50% of queries are returning NA; and, separately, exact matches aren't returning where appropriate #36

Closed
cjcarlson opened this issue May 3, 2021 · 3 comments

cjcarlson commented May 3, 2021

Hi! This is gonna be a long one.

Here are three viruses that all have exact matches in the NCBI taxonomy:
Adeno-associated virus - 3
Adeno-associated virus 3B
Adenovirus predict_adv-20

They're an interesting case study for what's going horribly wrong here. In theory, they should all be retrieved as exact matches. Two are, in fact, the same "species". For example, the same NCBI API call through taxize:

> classification(get_uid("Adeno-associated virus - 3"), db = "ncbi")
==  1 queries  ===============

Retrieving data for taxon 'Adeno-associated virus - 3'

√  Found:  Adeno-associated+virus+-+3
==  Results  =================

* Total: 1 
* Found: 1 
* Not Found: 0
$`46350`
                                   name         rank      id
1                               Viruses superkingdom   10239
2                          Monodnaviria        clade 2731342
3                          Shotokuvirae      kingdom 2732092
4                         Cossaviricota       phylum 2732415
5                       Quintoviricetes        class 2732422
6                          Piccovirales        order 2732534
7                          Parvoviridae       family   10780
8                          Parvovirinae    subfamily   40119
9                     Dependoparvovirus        genus   10803
10 Adeno-associated dependoparvovirus A      species 1511891
11           Adeno-associated virus - 3      no rank   46350

attr(,"class")
[1] "classification"
attr(,"db")
[1] "ncbi"
> classification(get_uid("Adeno-associated virus 3B"), db = "ncbi")
==  1 queries  ===============

Retrieving data for taxon 'Adeno-associated virus 3B'

√  Found:  Adeno-associated+virus+3B
==  Results  =================

* Total: 1 
* Found: 1 
* Not Found: 0
$`68742`
                                   name         rank      id
1                               Viruses superkingdom   10239
2                          Monodnaviria        clade 2731342
3                          Shotokuvirae      kingdom 2732092
4                         Cossaviricota       phylum 2732415
5                       Quintoviricetes        class 2732422
6                          Piccovirales        order 2732534
7                          Parvoviridae       family   10780
8                          Parvovirinae    subfamily   40119
9                     Dependoparvovirus        genus   10803
10 Adeno-associated dependoparvovirus A      species 1511891
11           Adeno-associated virus - 3      no rank   46350
12            Adeno-associated virus 3B      no rank   68742

attr(,"class")
[1] "classification"
attr(,"db")
[1] "ncbi"

> classification(get_uid("Adenovirus predict_adv-20"), db = "ncbi")
==  1 queries  ===============

Retrieving data for taxon 'Adenovirus predict_adv-20'

√  Found:  Adenovirus+predict_adv-20
==  Results  =================

* Total: 1 
* Found: 1 
* Not Found: 0
$`2710954`
                       name         rank      id
1                   Viruses superkingdom   10239
2              Varidnaviria        clade 2732004
3              Bamfordvirae      kingdom 2732005
4         Preplasmiviricota       phylum 2732008
5          Tectiliviricetes        class 2732529
6               Rowavirales        order 2732559
7              Adenoviridae       family   10508
8 unclassified Adenoviridae      no rank  189831
9 Adenovirus PREDICT_AdV-20      species 2710954

attr(,"class")
[1] "classification"
attr(,"db")
[1] "ncbi"

Everything I'm going to describe is being run through an R function called jncbi(), which is included below for convenience:

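library(readr)    # write_csv(), read_csv()
library(stringr)  # str_to_sentence()

# Round-trip a vector of names through the Julia cleaning scripts via temporary CSV files
jncbi <- function(spnames, type = 'host') {
  raw <- data.frame(Name = spnames)
  write_csv(raw, '~/Github/virion/Code_Dev/TaxonomyTempIn.csv', eol = "\n")
  
  # Run the Julia script that corresponds to the requested type
  if(type == 'host') {system("julia C:/Users/cjcar/Documents/Github/virion/Code_Dev/host.jl")}
  if(type == 'virus') {system("julia C:/Users/cjcar/Documents/Github/virion/Code_Dev/virus.jl")}
  if(type == 'pathogen') {system("julia C:/Users/cjcar/Documents/Github/virion/Code_Dev/pathogen.jl")}
  
  # Read the cleaned output back in, then delete both temporary files
  clean <- read_csv("~/Github/virion/Code_Dev/TaxonomyTempOut.csv")
  file.remove('~/Github/virion/Code_Dev/TaxonomyTempIn.csv')
  file.remove('~/Github/virion/Code_Dev/TaxonomyTempOut.csv')
  
  # Sentence-case both name columns (this changes the capitalization shown in the output)
  clean$Name <- stringr::str_to_sentence(clean$Name)
  clean$match <- stringr::str_to_sentence(clean$match)
  return(clean)
}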

It doesn't really change anything about the attributes, apart from sentence-casing the Name and match columns. It just writes a file out for Julia to clean and brings the result back in.
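One thing worth keeping in mind when reading the tables further down: because jncbi() sentence-cases both Name and match, two strings that differ only in capitalization will print identically in its output. The sketch below illustrates this with a hypothetical base-R helper, `to_sentence()`, which only approximates stringr::str_to_sentence() for single-phrase names; it is an illustration of the masking effect, not a claim about what the Julia scripts actually return.

```r
# Hypothetical base-R stand-in for stringr::str_to_sentence(), adequate for
# single-phrase names: upper-case the first character, lower-case the rest.
to_sentence <- function(x) {
  paste0(toupper(substring(x, 1, 1)), tolower(substring(x, 2)))
}

query    <- "adenovirus predict_adv-20"   # a lower-cased input name
returned <- "Adenovirus PREDICT_AdV-20"   # the name as printed by taxize above

# The raw strings differ in case ...
raw_identical <- identical(query, returned)

# ... but after sentence-casing they are indistinguishable.
cased_identical <- identical(to_sentence(query), to_sentence(returned))

print(c(raw = raw_identical, sentence_cased = cased_identical))
```

Comparing the raw columns (before sentence-casing) is therefore the safer way to check whether a returned name really equals the query.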

Here are some contrasting results of virus.jl on different kinds of input.

  1. A BIG LIST

When I pass 8,632 viruses through jncbi, 4,968 come back NA (no match), 273 come back as fuzzy matches, and 3,419 come back as exact matches. (A file to reproduce this is attached. I'm only including these stats because I think they're probably relevant to our understanding of how big this bug is.) The results are concerning:

Name                        matched  match                       taxid
adeno-associated virus - 3  TRUE     Adeno-associated virus - 3  46350
adeno-associated virus 3B   NA       NA                          NA
adenovirus PREDICT_AdV-20   NA       NA                          NA
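To put the counts from the big list in proportion, here is a quick bit of arithmetic using only the figures quoted above. The variable names are made up for the sketch. Note that the three categories as quoted add up to 8,660 rather than 8,632, so the shares below are computed against the quoted total and should be read as approximate.

```r
# Counts as quoted in the issue text (variable names are illustrative only)
total <- 8632   # names passed through jncbi()
n_na  <- 4968   # no match
n_fuz <- 273    # fuzzy matches
n_exa <- 3419   # exact matches

# Share of each outcome, as a percentage of the quoted total
share <- round(100 * c(na = n_na, fuzzy = n_fuz, exact = n_exa) / total, 1)
print(share)

# The quoted categories do not sum exactly to the quoted total
category_sum <- n_na + n_fuz + n_exa
print(category_sum - total)
```

Either way, the NA share is above one half, which is the "> 50%" in the issue title.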

  2. JUST THOSE THREE VALUES
> jncbi(c("Adeno-associated virus - 3","Adeno-associated virus 3B","Adenovirus PREDICT_AdV-20"), type = 'virus')
Progress: 100%|█████████████████████████████████████████| Time: 0:00:01

-- Column specification ---------------------------------------------------------------------------------------------------------------------------------------------------------------------
cols(
  Name = col_character(),
  matched = col_logical(),
  match = col_character(),
  taxid = col_double()
)

# A tibble: 3 x 4
  Name                       matched match                      taxid
  <chr>                      <lgl>   <chr>                      <dbl>
1 Adeno-associated virus - 3 TRUE    Adeno-associated virus - 3 46350
2 Adeno-associated virus 3b  NA      NA                            NA
3 Adenovirus predict_adv-20  NA      NA                            NA

2B. THOSE THREE VALUES (LOWERCASE)

> jncbi(str_to_lower(c("Adeno-associated virus - 3","Adeno-associated virus 3B","Adenovirus PREDICT_AdV-20")), type = 'virus')
Progress: 100%|█████████████████████████████████████████| Time: 0:00:02

-- Column specification ---------------------------------------------------------------------------------------------------------------------------------------------------------------------
cols(
  Name = col_character(),
  matched = col_logical(),
  match = col_character(),
  taxid = col_double()
)

# A tibble: 3 x 4
  Name                       matched match                        taxid
  <chr>                      <lgl>   <chr>                        <dbl>
1 Adeno-associated virus - 3 TRUE    Adeno-associated virus - 3   46350
2 Adeno-associated virus 3b  FALSE   Adeno-associated virus 3a  1406223
3 Adenovirus predict_adv-20  FALSE   Adenovirus predict_adv-20  2710954

==========================
I haven't included it, but if you str_to_lower the virus names for the entire list before they're passed in, it significantly reduces the number of no-matches and extends the runtime from 5 mins to about 30 mins, confirming that this is, in fact, part (but not all) of the issue.

So there are two separate problems that need to be debugged.

  1. Capitalization appears to be making everything wonky. I don't want to solve this on the R end, given that there are capitalization changes in pathogen.jl - I think you can probably solve this by revisiting that script. (When you do, please do not turn it back into a generic script for both hosts and viruses.)

  2. The last two names should have exact (match=TRUE) matches in the NCBI taxonomy. Both instead fall through to fuzzy matching. The first fuzzy match is wrong, while the second fuzzy match is actually the correct exact match, and the strings returned are identical (no differences in spacing, as far as I can tell).
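The two problems above can interact, and a small sketch makes that easier to see. This is not the package's actual matching code: `ref` is a hypothetical four-name stand-in for the NCBI names table, `lookup()` is a made-up helper, and plain Levenshtein distance (base R's utils::adist()) is an assumption about what "fuzzy" means. Under those assumptions, a case-sensitive exact lookup misses a lower-cased query, and the fuzzy fallback then faces a tie between "3A" and "3B" that it resolves in favor of whichever comes first.

```r
# Hypothetical reference names standing in for the NCBI names table
ref <- c("Adeno-associated virus - 3",
         "Adeno-associated virus 3A",
         "Adeno-associated virus 3B",
         "Adenovirus PREDICT_AdV-20")

# Exact lookup first; if that fails, fall back to the nearest name by
# Levenshtein distance. ignore_case controls whether both steps fold case.
lookup <- function(query, ref, ignore_case = FALSE) {
  key <- if (ignore_case) tolower(ref) else ref
  q   <- if (ignore_case) tolower(query) else query
  hit <- match(q, key)
  if (!is.na(hit)) {
    return(list(matched = TRUE, match = ref[hit]))
  }
  d <- utils::adist(q, key)  # 1 x length(ref) matrix of edit distances
  list(matched = FALSE, match = ref[which.min(d)])  # ties go to the first name
}

# Case-sensitive: no exact hit, and "3A" and "3B" are equally distant
sensitive <- lookup("adeno-associated virus 3b", ref)

# Case-insensitive: the exact step already finds the right name
insensitive <- lookup("adeno-associated virus 3b", ref, ignore_case = TRUE)

print(sensitive)
print(insensitive)
```

If something like this is happening, folding case before the exact comparison (rather than only before the fuzzy one) would be the natural place to look, but that is a guess until the Julia side is checked.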

cjcarlson added the bug and need-triage labels May 3, 2021
tpoisot added the unreplicable label and removed the bug and need-triage labels May 3, 2021

tpoisot commented May 3, 2021

Doesn't seem to be a bug:

julia> ncbi"Adeno-associated virus - 3"
Adeno-associated virus - 3 (ncbi:46350)

julia> ncbi"Adeno-associated virus 3B"
Adeno-associated virus 3B (ncbi:68742)

julia> ncbi"Adenovirus predict_adv-20"

The last one is probably because I haven't updated my local DB recently; will report when I have it rebuilt.


tpoisot commented May 3, 2021

Last one requires fuzzy:

julia> taxon("Adenovirus predict_adv-20")

julia> taxon("Adenovirus predict_adv-20"; strict=false)
Adenovirus PREDICT_AdV-20 (ncbi:2710954)


tpoisot commented May 3, 2021

I think (and I can't tell because I don't have R) that you pull the latest version of the package every time; the match=true argument was replaced by strict=false to match the argument of GBIF.taxon in GBIF.jl. I'm going to close this one, as the package behaves as expected; let's move it to an issue on the repo where jncbi is defined.

tpoisot closed this as completed May 3, 2021