HTTP errors when parsing long taxon_id list #202

Open
janstrauss1 opened this issue Nov 25, 2019 · 12 comments

@janstrauss1

Hi there,

I'm trying to create a taxmap from a long list of NCBI taxon IDs for subsequent filtering.

I have downloaded about 17k taxa containing a specific protein domain from InterPro and imported them into R:

my.tax_id <- read.table(file = "TaxID_IPR012674.txt")
> str(my.tax_id)
'data.frame':	17482 obs. of  1 variable:
 $ V1: int  104 158 162 166 17 172 192 195 196 197

I then try to set up my taxmap as follows:

my.taxmap <- lookup_tax_data(
  tax_data = my.tax_id, 
  type = "taxon_id", 
  column = 1, 
  datasets = list(),
  mappings = c(), 
  database = "ncbi", 
  include_tax_data = TRUE,
  use_database_ids = TRUE, 
  ask = TRUE
  )
Looking up classifications for 17482 unique taxon IDs from database "ncbi"...

Unfortunately, this throws the error
Error: Too Many Requests (HTTP 429)

I guess the API client is sending requests to the database faster than its rate limit allows, which causes the error.

Could you please help to fix it?

Many thanks in advance!

The output of sessionInfo() is

R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.6

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] urltools_1.7.3 taxize_0.9.91  taxa_0.3.2    

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.3        pillar_1.4.2      compiler_3.6.1    plyr_1.8.4        iterators_1.0.12  tools_3.6.1      
 [7] jsonlite_1.6      tibble_2.1.3      nlme_3.1-141      lattice_0.20-38   pkgconfig_2.0.3   rlang_0.4.1      
[13] foreach_1.4.7     cli_1.1.0         rstudioapi_0.10   crul_0.9.0        curl_4.2          parallel_3.6.1   
[19] dplyr_0.8.3       stringr_1.4.0     xml2_1.2.2        triebeard_0.3.0   grid_3.6.1        tidyselect_0.2.5 
[25] reshape_0.8.8     glue_1.3.1        httpcode_0.2.0    data.table_1.12.6 R6_2.4.1          reshape2_1.4.3   
[31] purrr_0.3.3       magrittr_1.5      codetools_0.2-16  assertthat_0.2.1  bold_0.9.0        ape_5.3          
[37] stringi_1.4.3     crayon_1.3.4      zoo_1.8-6  
@janstrauss1
Author

janstrauss1 commented Nov 25, 2019

There seems to be a related issue for the taxize package:
ropensci/taxize#785 (comment)

@sckott
Contributor

sckott commented Nov 25, 2019

Are you definitely using NCBI? The data source in question in taxize issue 785 is Catalogue of Life, not NCBI. That said, NCBI can also throw 429 errors. Do you have an NCBI Entrez API key set in the environment variable ENTREZ_KEY?

@janstrauss1
Author

@sckott,
yes, I'm definitely using NCBI taxon IDs.
No, I did not set an ENTREZ_KEY, but I think this might solve the problem. I have already obtained an NCBI API key, but how do I set it correctly?

Many thanks in advance for your help!

@janstrauss1
Author

@sckott,
I just set the key using Sys.setenv(ENTREZ_KEY = "my.api.key") as you outlined at #135 (comment).
It seems to partially solve my issue as the download stalled at 7% throwing the error:
Error: Bad Request (HTTP 400).
Any idea how to address this?
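
For reference, a minimal sketch of setting the key for the current session and confirming that R can see it before a long lookup starts. The key value is a placeholder, not a real key. To keep the key across sessions, the usual approach is a line of the form ENTREZ_KEY=... in ~/.Renviron.

```r
# Set the key for the current R session only ("my.api.key" is a placeholder).
Sys.setenv(ENTREZ_KEY = "my.api.key")

# Confirm that the variable is set and non-empty before starting a long lookup.
key_is_set <- nzchar(Sys.getenv("ENTREZ_KEY"))
if (!key_is_set) {
  stop("ENTREZ_KEY is not set; NCBI applies a lower rate limit to requests without a key.")
}
```

A key raises the request rate NCBI allows, but as the rest of this thread shows, it does not rule out intermittent 400 or 502 responses.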

@janstrauss1
Author

It appears that downloading the classifications for such a long list of taxon IDs from NCBI is very fragile. Setting my NCBI API key and re-running my script as outlined above, the download now stalled at 25% throwing the error: Bad Gateway (HTTP 502).
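
One way to make a long download less fragile is to split the ID list into batches, so that a failed request costs only one batch rather than the whole run. The sketch below uses base R only. The names chunk_ids, lookup_in_chunks, and lookup_fun are hypothetical helpers introduced for illustration, and lookup_fun stands for whatever function performs the actual lookup. How per-batch results are best combined depends on that function and is not addressed here.

```r
# Split a vector of IDs into consecutive batches of at most `chunk_size`.
chunk_ids <- function(ids, chunk_size = 500) {
  unname(split(ids, ceiling(seq_along(ids) / chunk_size)))
}

# Apply `lookup_fun` (hypothetical: any function taking a vector of IDs)
# to each batch, pausing between batches. A failed batch is recorded as
# NULL so that it can be retried later without repeating finished batches.
lookup_in_chunks <- function(ids, lookup_fun, chunk_size = 500, pause = 1) {
  chunks <- chunk_ids(ids, chunk_size)
  results <- vector("list", length(chunks))
  for (i in seq_along(chunks)) {
    value <- tryCatch(
      lookup_fun(chunks[[i]]),
      error = function(e) {
        message("Batch ", i, " failed: ", conditionMessage(e))
        NULL
      }
    )
    # Assign via list() so that a NULL keeps its slot instead of deleting it.
    results[i] <- list(value)
    Sys.sleep(pause)
  }
  results
}

# Toy usage with a stand-in lookup function (no network access involved).
ids <- 1:1200
res <- lookup_in_chunks(ids, function(x) length(x), chunk_size = 500, pause = 0)
```

The batch size of 500 and the one-second pause are arbitrary starting points, not values recommended by NCBI or by the package.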

@janstrauss1 janstrauss1 changed the title Too Many Requests (HTTP 429) error when parsing long taxon_id list HTTP errors when parsing long taxon_id list Nov 26, 2019
@janstrauss1
Author

It eventually worked to download the classifications of the full 17k list of NCBI taxon IDs.

@sckott
Contributor

sckott commented Nov 26, 2019

NCBI's infrastructure is not very good, so I'm not surprised that you are running into errors with a lot of names.

Another option is taxizedb - idea is the same as taxize, but using SQL dumps on your local machine.
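
A rough sketch of what the local route could look like, assuming the taxizedb package is installed. classify_locally is a hypothetical wrapper name, and the example IDs are taken from the first post. The code is wrapped in a function and not called here because the first run downloads the NCBI taxonomy dump, which is large.

```r
# Assumes the taxizedb package is installed. Nothing is downloaded until
# classify_locally() is actually called.
classify_locally <- function(ids) {
  # Download the NCBI taxonomy dump and build a local SQLite database.
  taxizedb::db_download_ncbi()
  # Look up the lineage of each taxon ID in the local database.
  taxizedb::classification(ids, db = "ncbi")
}

# Example call (commented out to avoid triggering the download):
# lineages <- classify_locally(c(104, 158, 162))
```

Because the lookups run against a local database, the HTTP 429, 400, and 502 errors reported above should not occur in this step once the download has succeeded.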

@ErwinFeringa

I have been running into an issue for some time now when trying to parse my data with lookup_tax_data.
I have around 4k taxon IDs and I want to visualize them, together with their fraction of total reads, in a heat tree.

This is what I run:

Sys.setenv(ENTREZ_KEY = "my key")
data15 <- read.delim("path to my file")
taxed_15 <- lookup_tax_data(
  data15,
  "taxon_id",
  column = 2,
  datasets = list("fraction_total_reads"),
  mappings = c("value"),
  database = "ncbi",
  include_tax_data = TRUE,
  use_database_ids = TRUE,
  ask = TRUE
)

I get one of the following errors:
Error: Bad Request (HTTP 400)
or:
Error in get_sort_var(tax_data, names(sort_var)) : No column named ""."

The last error does not show up if I leave out "datasets" and "mappings".

I hope there is a way to solve the problems I am facing.

@RJGrayEcology

Is this still not solved? I have the same problem with a list of about 600 species.

@zachary-foster
Collaborator

Are these errors random, or the same every time? If the latter, can you give me a command to test that causes this error?

@morellek

I had the same error. What did the trick for me was to wrap the query in try() and, if the Error: Bad Request (HTTP 400) message appeared, to wait with Sys.sleep() and then retry the query. In a loop, it looks like this:

for (i in seq_len(nrow(data))) {
  # First attempt; on failure, try() returns an object of class "try-error"
  classes_i <- try(tax_name(sci = data$taxon[i], get = c("genus", "family", "order", "class"), db = "ncbi"))
  if (inherits(classes_i, "try-error")) {
    # Wait 10 seconds, then retry the same query once
    Sys.sleep(10)
    classes_i <- try(tax_name(sci = data$taxon[i], get = c("genus", "family", "order", "class"), db = "ncbi"))
  }
  classes_both <- rbind(classes_both, classes_i)
}

@stephanJG

Thanks morellek, the loop worked for me. I was getting frustrated that, even after getting the NCBI API key and using Sys.sleep in my similar loop, I still got the Error: Bad Request (HTTP 400) message. I still get some rows filled with the error message, but that I can fix.

PS: classes_both = NULL before the loop is missing
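
The retry idea from this thread can also be written as a small helper that retries more than once and leaves failed queries out of the result instead of filling rows with error messages. This is a sketch under stated assumptions, not a tested fix for the package. with_retry and flaky_lookup are hypothetical names, and flaky_lookup is a stand-in that fails twice so the example runs without network access. With taxize, the call would presumably look like with_retry(tax_name, sci = ..., get = ..., db = "ncbi").

```r
# Call `fun(...)` up to `max_tries` times. After each failure, sleep `wait`
# seconds and double the wait. Returns NULL if every attempt fails.
with_retry <- function(fun, ..., max_tries = 3, wait = 10) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(fun(...), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    message("Attempt ", attempt, " failed: ", conditionMessage(result))
    if (attempt < max_tries) {
      Sys.sleep(wait)
      wait <- wait * 2
    }
  }
  NULL
}

# Stand-in for a network lookup: the first two calls fail, later calls succeed.
calls <- 0
flaky_lookup <- function(taxon) {
  calls <<- calls + 1
  if (calls < 3) stop("Bad Request (HTTP 400)")
  data.frame(query = taxon, ok = TRUE)
}

# Collect the results in a list, drop the failed queries, and bind once.
taxa <- c("a", "b")
results <- lapply(taxa, function(taxon) with_retry(flaky_lookup, taxon, wait = 0))
results <- Filter(Negate(is.null), results)
classes_both <- do.call(rbind, results)
```

Binding once at the end avoids both the missing initialization noted above and the cost of growing a data frame inside the loop.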
