Here are three viruses that all have exact matches in the NCBI taxonomy:
Adeno-associated virus - 3
Adeno-associated virus 3B
Adenovirus predict_adv-20
They're an interesting case study for what's going horribly wrong here. In theory, they should all be retrieved as exact matches. Two are, in fact, the same "species". For example, here is the same NCBI API call through taxize:
> classification(get_uid("Adeno-associated virus - 3"), db = "ncbi")
== 1 queries ===============
Retrieving data for taxon 'Adeno-associated virus - 3'
√ Found: Adeno-associated+virus+-+3
== Results =================
* Total: 1
* Found: 1
* Not Found: 0
$`46350`
name rank id
1 Viruses superkingdom 10239
2 Monodnaviria clade 2731342
3 Shotokuvirae kingdom 2732092
4 Cossaviricota phylum 2732415
5 Quintoviricetes class 2732422
6 Piccovirales order 2732534
7 Parvoviridae family 10780
8 Parvovirinae subfamily 40119
9 Dependoparvovirus genus 10803
10 Adeno-associated dependoparvovirus A species 1511891
11 Adeno-associated virus - 3 no rank 46350
attr(,"class")
[1] "classification"
attr(,"db")
[1] "ncbi"
> classification(get_uid("Adeno-associated virus 3B"), db = "ncbi")
== 1 queries ===============
Retrieving data for taxon 'Adeno-associated virus 3B'
√ Found: Adeno-associated+virus+3B
== Results =================
* Total: 1
* Found: 1
* Not Found: 0
$`68742`
name rank id
1 Viruses superkingdom 10239
2 Monodnaviria clade 2731342
3 Shotokuvirae kingdom 2732092
4 Cossaviricota phylum 2732415
5 Quintoviricetes class 2732422
6 Piccovirales order 2732534
7 Parvoviridae family 10780
8 Parvovirinae subfamily 40119
9 Dependoparvovirus genus 10803
10 Adeno-associated dependoparvovirus A species 1511891
11 Adeno-associated virus - 3 no rank 46350
12 Adeno-associated virus 3B no rank 68742
attr(,"class")
[1] "classification"
attr(,"db")
[1] "ncbi"
> classification(get_uid("Adenovirus predict_adv-20"), db = "ncbi")
== 1 queries ===============
Retrieving data for taxon 'Adenovirus predict_adv-20'
√ Found: Adenovirus+predict_adv-20
== Results =================
* Total: 1
* Found: 1
* Not Found: 0
$`2710954`
name rank id
1 Viruses superkingdom 10239
2 Varidnaviria clade 2732004
3 Bamfordvirae kingdom 2732005
4 Preplasmiviricota phylum 2732008
5 Tectiliviricetes class 2732529
6 Rowavirales order 2732559
7 Adenoviridae family 10508
8 unclassified Adenoviridae no rank 189831
9 Adenovirus PREDICT_AdV-20 species 2710954
attr(,"class")
[1] "classification"
attr(,"db")
[1] "ncbi"
Everything I'm going to describe is being run through an R script called jncbi() which is included below for convenience:
It doesn't really change anything about the attributes. It just outsources a file to be cleaned and brings it back in.
Here are some contrasting results of virus.jl on different kinds of input.
A BIG LIST
When I pass 8,632 viruses through jncbi, 4,968 come back NA (no match) and 273 come back fuzzy matches (3,419 exact matches). (A file to reproduce this is attached. I'm only including these stats because I think they're probably relevant to our understanding of how big this bug is.) The results are concerning:
Name                        matched  match                       taxid
adeno-associated virus - 3  TRUE     Adeno-associated virus - 3  46350
adeno-associated virus 3B   NA       NA                          NA
adenovirus PREDICT_AdV-20   NA       NA                          NA
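For anyone reproducing the counts above, a minimal sketch of how the outcomes could be tallied from a result table. The `results` data frame is a toy stand-in built from the three rows shown, not the attached file, and it assumes that fuzzy matches come back as `matched = FALSE`; only `TRUE` and `NA` appear in the output shown here, so that part is a guess.

```r
# Toy stand-in for a jncbi() result; the rows are copied from the table above.
results <- data.frame(
  Name    = c("adeno-associated virus - 3",
              "adeno-associated virus 3B",
              "adenovirus PREDICT_AdV-20"),
  matched = c(TRUE, NA, NA),
  match   = c("Adeno-associated virus - 3", NA, NA),
  taxid   = c(46350, NA, NA),
  stringsAsFactors = FALSE
)

# Assumption: TRUE = exact match, FALSE = fuzzy match, NA = no match.
# `%in%` is used so that NA values are never counted as TRUE or FALSE.
counts <- c(
  exact = sum(results$matched %in% TRUE),
  fuzzy = sum(results$matched %in% FALSE),
  none  = sum(is.na(results$matched))
)

print(counts)
```

Running the same tally on the full list makes it easy to compare the split between exact, fuzzy, and missing matches before and after any change.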
JUST THOSE THREE VALUES
> jncbi(c("Adeno-associated virus - 3","Adeno-associated virus 3B","Adenovirus PREDICT_AdV-20"), type = 'virus')
Progress: 100%|█████████████████████████████████████████| Time: 0:00:01
-- Column specification ---------------------------------------------------------------------------------------------------------------------------------------------------------------------
cols(
Name = col_character(),
matched = col_logical(),
match = col_character(),
taxid = col_double()
)
# A tibble: 3 x 4
  Name                       matched match                      taxid
  <chr>                      <lgl>   <chr>                      <dbl>
1 Adeno-associated virus - 3 TRUE    Adeno-associated virus - 3 46350
2 Adeno-associated virus 3b  NA      NA                            NA
3 Adenovirus predict_adv-20  NA      NA                            NA
==========================
I haven't included it, but if you str_to_lower the virus names before passing the entire list, the number of no-matches drops significantly and the runtime goes from 5 mins to about 30 mins, confirming that this is, in fact, part (but not all) of the issue.
So there are two separate problems that need to be debugged.
Capitalization appears to be making everything wonky. I don't want to solve this on the R end, given that there are capitalization changes in pathogen.jl. I think you can probably solve this by revisiting that script. (When you do, please do not turn it back into a generic script for both hosts and viruses.)
These should have exact (match=TRUE) matches in the NCBI taxonomy. Both instead get sent to fuzzy matching. The first fuzzy match is wrong, while the second fuzzy match is actually the correct exact match, and the strings returned are identical (no differences in spacing, as far as I can tell).
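To make the expected behaviour concrete, here is a minimal sketch of an exact lookup that ignores capitalization. It is an illustration only, not a proposed R-side fix and not how virus.jl or pathogen.jl work internally: `ncbi_names` is a hypothetical three-row reference table (names and taxids copied from the taxize output above), and `match_ignore_case()` is a made-up helper.

```r
# Hypothetical reference table; names and taxids are copied from the
# taxize output earlier in this issue.
ncbi_names <- data.frame(
  name  = c("Adeno-associated virus - 3",
            "Adeno-associated virus 3B",
            "Adenovirus PREDICT_AdV-20"),
  taxid = c(46350, 68742, 2710954),
  stringsAsFactors = FALSE
)

# Hypothetical helper: exact string match after lowercasing both sides,
# so that differences in capitalization alone never cause a miss.
match_ignore_case <- function(queries, reference) {
  idx <- match(tolower(queries), tolower(reference$name))
  data.frame(
    Name    = queries,
    matched = !is.na(idx),          # TRUE when an exact hit was found
    match   = reference$name[idx],  # canonical spelling; NA if no hit
    taxid   = reference$taxid[idx],
    stringsAsFactors = FALSE
  )
}

res <- match_ignore_case(
  c("adeno-associated virus - 3",
    "Adeno-associated virus 3b",
    "Adenovirus predict_adv-20"),
  ncbi_names
)

print(res)
```

With a lookup like this, all three names resolve to their own taxid regardless of how the input was capitalized, which is the result I'd expect before anything falls through to fuzzy matching.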
I think (though I can't check, because I don't have R) that you pull the latest version of the package every time. The match=true argument was replaced by strict=false, to match the GBIF.jl argument of GBIF.taxon. I'm going to close this one, as the package behaves as expected; let's move it to an issue on the repo where jncbi is defined.