
Batch downloads for NCBI in classification #798

Merged
merged 7 commits into from
Mar 6, 2020

Conversation

zachary-foster
Collaborator

Work in progress; you might not want to merge this yet.

Description

Internal change to classification.uid so that queries to NCBI are made in batches in order to speed up large queries.
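As a rough illustration of the batching idea (a sketch only, not the actual taxize internals; the variable names here are hypothetical), splitting a vector of UIDs into fixed-size chunks so each chunk goes out as one request might look like:

```r
# Illustrative sketch: chunk a vector of NCBI UIDs so that each
# chunk can be sent as one comma-separated E-utilities request.
ids <- c("9606", "10090", "8028", "7955", "562")
batch_size <- 2

batches <- split(ids, ceiling(seq_along(ids) / batch_size))
queries <- vapply(batches, paste, character(1), collapse = ",")
# queries: "9606,10090" "8028,7955" "562"
```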

Related Issue

In response to #678.

Issues

It seems to be working, but I get this test error that seems unrelated:

Testing taxize
✓ |  OK F W S | Context
✓ |   8       | apg* functions [1.4 s]
✓ |  14       | bold_search [2.1 s]
⠋ |   1       | childrenError: C stack usage  51050657 is too close to the limit
Execution halted

Exited with status 1.

I need to look into this more, but I thought I would start this PR to keep a record of things.

@zachary-foster
Collaborator Author

Hi @sckott, I tried to debug that error (C stack usage 51050657 is too close to the limit), but it's inconsistent. During local testing, the error happens here:

ch_ncbi <- children(8028, db = "ncbi")

I debugged with browser() while the tests were being run and the error happens here:

rr <- cli$get('entrez/eutils/esearch.fcgi', query = args)

However, running this code interactively works:

> ch_ncbi <- children(8028, db = "ncbi")
> children(8028, db = "ncbi")
$`8028`
 childtaxa_id                 childtaxa_name childtaxa_rank
1      2476907        unclassified Salmoninae        no rank
2       504709                      Parahucho          genus
3       504571 salmonine intergeneric hybrids        no rank
4       152108                    Salvethymus          genus
5        62066                   Brachymystax          genus
6         8041                          Hucho          genus
7         8033                     Salvelinus          genus
8         8028                          Salmo          genus
9         8016                   Oncorhynchus          genus

attr(,"class")
[1] "children"
attr(,"db")
[1] "ncbi"

Also, it sometimes happens when testing remotely and sometimes does not.
I am not sure what the issue is, but it seems to be associated with crul::HttpClient rather than with the changes I made, as far as I can tell.

Are there any changes you want made for this PR?

The batch size is hard-coded at 10. Do you want this to be a user-settable parameter, or should it have a different default? It could probably be increased.

@sckott
Contributor

sckott commented Feb 12, 2020

thanks! Having a look at it.

that error in my experience often comes down to circular code, so it ends up running in an endless loop. however, I can't replicate the problem. what version of crul do you have?

@zachary-foster
Collaborator Author

Yeah, it seems like it only happens in some cases. I have:

> packageVersion('crul')
[1] ‘0.9.0’

@sckott
Contributor

sckott commented Feb 12, 2020

thanks. i think we can ignore that problem for now and focus on the code changes here ...

@sckott
Contributor

sckott commented Feb 13, 2020

nice, lots faster:

x <- names_list("species", 200)
ids <- get_uid(x)
ids <- as.uid(na.omit(ids), check=FALSE)
system.time(out <- classification.uid(ids))
   user  system elapsed
  0.216   0.011   4.354

# the old classification.uid
system.time(out2 <- classification_old_uid(ids))
   user  system elapsed
  0.820   0.028  17.532

@sckott
Contributor

sckott commented Feb 13, 2020

The queries are different with this change; can you delete and re-record the affected fixtures? Something like:

rm tests/fixtures/classification_cbind_rbind.yml tests/fixtures/classification_rows_param.yml
Rscript -e 'devtools::test_file("tests/testthat/test-classification.R")'

@zachary-foster
Collaborator Author

can you delete and re-record the affected fixtures

Sure, but I will have to learn a bit more about that first. Does the fixtures concept come from the httptest package?

I adapted your benchmarking example to try to pick a default batch size, and 50 looks to be the point of diminishing returns:

library(taxize)
library(purrr)
library(microbenchmark)
library(rbenchmark)
library(tibble)
library(ggplot2)

x <- names_list("species", 1000)
ids <- get_uid(x)
ids <- as.uid(na.omit(ids), check=FALSE)

res <- map_dfr(c(1, 2, 3, 4, 5, 7, 10, 20, 30, 40, 50, 70, 100, 150), function(b) {
  res <- system.time(taxize:::classification.uid(ids, batch_size = b))
  Sys.sleep(10)
  tibble(batch_size = b, time = res[3])
})

ggplot(res) +
  geom_point(aes(x = batch_size, y = time))

[plot: elapsed time vs. batch_size, leveling off around a batch size of 50]

While testing, I was getting frequent errors from failed queries, so I added code to retry a query up to two times by default, and that seemed to make it more reliable in my limited testing.
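The retry behavior can be sketched like this (a hypothetical wrapper for illustration, not the actual code added in the PR):

```r
# Illustrative retry wrapper: re-run a query function up to max_tries
# times, returning the first successful result or raising the last error.
with_retries <- function(fun, max_tries = 3) {
  result <- NULL
  for (i in seq_len(max_tries)) {
    result <- tryCatch(fun(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
  }
  stop(result)
}
```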

@sckott
Contributor

sckott commented Feb 19, 2020

the fixtures are from vcr

thanks for doing the benchmarking - 50 sounds good

@sckott sckott linked an issue Feb 19, 2020 that may be closed by this pull request
@sckott sckott added this to the v0.9.93 milestone Feb 19, 2020
Contributor

@sckott sckott left a comment


looks good, just 2 things:

  • re-record those fixtures, and
  • can you add an example of fetching more than 50 ncbi ids through classification so users can see an example of that

@zachary-foster
Collaborator Author

I keep running into the 'C stack usage 51050657 is too close to the limit' error while testing the code, so I don't know if I did the fixtures right.

I also need to make sure the function works with invalid IDs. It's getting a bit messy with all the iterative changes, so I might just refactor it to clean it up.

Regarding the 'C stack usage 51050657 is too close to the limit' error, I think I figured it out:

It's caused by the vcr package. A bug causes an error message to be huge (~25 Mb in a text file, on one line), which for some reason breaks stop(), perhaps because stop() reports the error, which triggers the error again, and so on. I will submit a PR to vcr with the fix I made, in case it is useful.

@sckott
Contributor

sckott commented Feb 28, 2020

the vcr PR was merged

@sckott sckott merged commit 9dc3551 into ropensci:master Mar 6, 2020
Successfully merging this pull request may close these issues.

Batch downloads in classification