
Batch downloads for NCBI in classification #798

Merged
merged 7 commits into from
Mar 6, 2020

Conversation

zachary-foster
Collaborator

Work in progress; you might not want to merge this yet.

Description

Internal change to classification.uid so that queries to NCBI are made in batches in order to speed up large queries.
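As a rough illustration of the batching idea (a sketch only, not the actual taxize internals; the variable names here are hypothetical), splitting a vector of UIDs into fixed-size chunks so each chunk goes out as one request might look like:

```r
# Illustrative sketch: chunk a vector of NCBI UIDs so that each
# chunk can be sent as one comma-separated E-utilities request.
ids <- c("9606", "10090", "8028", "7955", "562")
batch_size <- 2

batches <- split(ids, ceiling(seq_along(ids) / batch_size))
queries <- vapply(batches, paste, character(1), collapse = ",")
# queries: "9606,10090" "8028,7955" "562"
```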

Related Issue

In response to #678.

Issues

It seems to be working, but I get this test error that seems unrelated:

Testing taxize
✓ |  OK F W S | Context
✓ |   8       | apg* functions [1.4 s]
✓ |  14       | bold_search [2.1 s]
⠋ |   1       | childrenError: C stack usage  51050657 is too close to the limit
Execution halted

Exited with status 1.

I need to look into this more, but I thought I would start this PR to keep a record of things.

@zachary-foster
Collaborator Author

Hi @sckott, I tried to debug that error (C stack usage 51050657 is too close to the limit), but it's inconsistent. During local testing, the error happens here:

ch_ncbi <- children(8028, db = "ncbi")

I debugged with browser() while the tests were being run and the error happens here:

rr <- cli$get('entrez/eutils/esearch.fcgi', query = args)

However, running this code interactively works:

> ch_ncbi <- children(8028, db = "ncbi")
> children(8028, db = "ncbi")
$`8028`
 childtaxa_id                 childtaxa_name childtaxa_rank
1      2476907        unclassified Salmoninae        no rank
2       504709                      Parahucho          genus
3       504571 salmonine intergeneric hybrids        no rank
4       152108                    Salvethymus          genus
5        62066                   Brachymystax          genus
6         8041                          Hucho          genus
7         8033                     Salvelinus          genus
8         8028                          Salmo          genus
9         8016                   Oncorhynchus          genus

attr(,"class")
[1] "children"
attr(,"db")
[1] "ncbi"

Also, it sometimes happens when testing remotely and sometimes does not.
I am not sure what the issue is, but it seems to be associated with crul::HttpClient rather than with the changes I made, as far as I can tell.

Are there any changes you want made for this PR?

The batch size is hard-coded at 10. Do you want this to be a user-settable parameter, or should it have a different default? It could probably be increased.

@sckott
Contributor

sckott commented Feb 12, 2020

thanks! Having a look at it.

that error in my experience often comes down to circular code, so it ends up running in an endless loop. however, I can't replicate the problem. what version of crul do you have?

@zachary-foster
Collaborator Author

Yeah, it seems like it only happens in some cases. I have:

> packageVersion('crul')
[1] ‘0.9.0’

@sckott
Contributor

sckott commented Feb 12, 2020

thanks. i think we can ignore that problem for now and focus on the code changes here ...

@sckott
Contributor

sckott commented Feb 13, 2020

nice, lots faster:

x <- names_list("species", 200)
ids <- get_uid(x)
ids <- as.uid(na.omit(ids), check=FALSE)
system.time(out <- classification.uid(ids))
   user  system elapsed
  0.216   0.011   4.354

# the old classification.uid
system.time(out2 <- classification_old_uid(ids))
   user  system elapsed
  0.820   0.028  17.532

@sckott
Contributor

sckott commented Feb 13, 2020

The queries are different with this change; can you delete and re-record the affected fixtures? Something like:

rm tests/fixtures/classification_cbind_rbind.yml tests/fixtures/classification_rows_param.yml
Rscript -e 'devtools::test_file("tests/testthat/test-classification.R")'

@zachary-foster
Collaborator Author

can you delete and re-record the affected fixtures

Sure, but I will have to learn a bit more about that first. Does the fixtures concept come from the httptest package?

I adapted your benchmarking example to try to pick a default batch size, and 50 looks to be the point of diminishing returns:

library(taxize)
library(purrr)
library(microbenchmark)
library(rbenchmark)
library(tibble)
library(ggplot2)

x <- names_list("species", 1000)
ids <- get_uid(x)
ids <- as.uid(na.omit(ids), check=FALSE)

res <- map_dfr(c(1, 2, 3, 4, 5, 7, 10, 20, 30, 40, 50, 70, 100, 150), function(b) {
  res <- system.time(taxize:::classification.uid(ids, batch_size = b))
  Sys.sleep(10)
  tibble(batch_size = b, time = res[3])
})

ggplot(res) +
  geom_point(aes(x = batch_size, y = time))

[plot: elapsed time vs. batch_size, leveling off around a batch size of 50]

While testing, I was getting frequent errors from failed queries, so I added code to retry a query up to two times by default, and that seemed to make it more reliable in my limited testing.
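The retry behavior can be sketched like this (a hypothetical wrapper for illustration, not the actual code added in the PR):

```r
# Illustrative retry wrapper: re-run a query function up to max_tries
# times, returning the first successful result or raising the last error.
with_retries <- function(fun, max_tries = 3) {
  result <- NULL
  for (i in seq_len(max_tries)) {
    result <- tryCatch(fun(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
  }
  stop(result)
}
```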

@sckott
Contributor

sckott commented Feb 19, 2020

the fixtures are from vcr

thanks for doing the benchmarking - 50 sounds good

@sckott sckott linked an issue Feb 19, 2020 that may be closed by this pull request
@sckott sckott added this to the v0.9.93 milestone Feb 19, 2020
Contributor

@sckott sckott left a comment


looks good, just 2 things:

  • re-record those fixtures, and
  • can you add an example of fetching more than 50 ncbi ids through classification so users can see an example of that

@zachary-foster
Collaborator Author

I keep running into the 'C stack usage 51050657 is too close to the limit' error while testing the code, so I don't know if I did the fixtures right.

I also need to make sure the function works with invalid IDs. It's getting a bit messy with all the iterative changes, so I might just refactor it to clean it up.

Regarding the 'C stack usage 51050657 is too close to the limit' error, I think I figured it out:

It's caused by the vcr package. A bug causes an error message to be huge (~25 Mb in a text file, on one line), which for some reason breaks stop(), perhaps because stop() reports the error, which triggers the error again, and so on. I will submit a PR to vcr with the fix I made, in case it is useful.

@sckott
Contributor

sckott commented Feb 28, 2020

the vcr PR was merged

@sckott sckott merged commit 9dc3551 into ropensci:master Mar 6, 2020
Successfully merging this pull request may close these issues.

Batch downloads in classification