-
Notifications
You must be signed in to change notification settings - Fork 63
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Batch downloads for NCBI in classification
#798
Conversation
Hi @sckott, I tried to debug that error (C stack usage 51050657 is too close to the limit), but its inconsistent. During local testing, the error happens here: taxize/tests/testthat/test-children.R Line 19 in e3525a0
I debugged with Line 117 in e3525a0
However, running this code interactively works: > ch_ncbi <- children(8028, db = "ncbi")
> children(8028, db = "ncbi")
$`8028`
childtaxa_id childtaxa_name childtaxa_rank
1 2476907 unclassified Salmoninae no rank
2 504709 Parahucho genus
3 504571 salmonine intergeneric hybrids no rank
4 152108 Salvethymus genus
5 62066 Brachymystax genus
6 8041 Hucho genus
7 8033 Salvelinus genus
8 8028 Salmo genus
9 8016 Oncorhynchus genus
attr(,"class")
[1] "children"
attr(,"db")
[1] "ncbi" Also, it seems to sometimes happen when testing remotely and some times does not. Are there any changes you want made for this PR? The batch size is hard-coded at 10. Do you want this to be a changeable parameter or have another default value? It probably could be increased. |
thanks! Having a look at it. that error in my experience often comes down to circular code, so it ends up running in an endless loop. however, I can't replicate the problem. what version of |
Yea, seems like it only happens in some cases. I have: > packageVersion('crul')
[1] ‘0.9.0’ |
thanks. i think we can ignore that problem for now and focus on the code changes here ... |
nice, lots faster: x <- names_list("species", 200)
ids <- get_uid(x)
ids <- as.uid(na.omit(ids), check=FALSE)
system.time(out <- classification.uid(ids))
user system elapsed
0.216 0.011 4.354
# the old classification.uid
system.time(out2 <- classification_old_uid(ids))
user system elapsed
0.820 0.028 17.532 |
The queries are different with this change, can you delete and re-record the affected fixtures, something like
|
Sure, but I will have to learn a bit more about that first. Does the fixtures concept come from I adapted your benchmarking example to try to pick a default batch size and I think 50 looks to be at the point of diminishing returns: library(taxize)
library(purrr)
library(microbenchmark)
library(rbenchmark)
library(tibble)
library(ggplot2)
x <- names_list("species", 1000)
ids <- get_uid(x)
ids <- as.uid(na.omit(ids), check=FALSE)
res <- map_dfr(c(1, 2, 3, 4, 5, 7, 10, 20, 30, 40, 50, 70, 100, 150), function(b) {
res <- system.time(taxize:::classification.uid(ids, batch_size = b))
Sys.sleep(10)
tibble(batch_size = b, time = res[3])
})
ggplot(res) +
geom_point(aes(x = batch_size, y = time)) While testing, I was getting frequent errors from failed queries, so I added code to retry a query up to two times, by default and that seemed to make it more reliable based on my limited testing. |
the fixtures are from vcr thanks for doing the benchmarking - 50 sounds good |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good, just 2 things:
- re-record those fixtures, and
- can you add an example of fetching more than 50 ncbi ids through
classification
so users can see an example of that
I keep running into to the 'C stack usage 51050657 is too close to the limit' error while testing the code, so I don't know if I did the fixtures right. I also need to make sure the function works with invalid IDs. Its getting a bit messy with all the iterative changes, so I might just refactor it to clean it up. In regards to the 'C stack usage 51050657 is too close to the limit' error, I think I figured it out: Its caused by the |
the vcr PR was merged |
Work in progress, you might not want to merge yet
Description
Internal change to
classification.uid
so that queries to NCBI are made in batches in order to speed up large queries.Related Issue
In response to #678.
Issues
It seems to be working, but I get this test error that seems unrelated:
I need to look into this more, but I thought I would start this PR to keep a record of things.