bold children and downstream fxns? #817
still need to add tests |
|
@devonorourke let me know your thoughts on these functions |
It's been a while since I've had to tackle this kind of problem - can you remind me how the proposed changes would impact my workflow? This was the original R script I ran to get everything that did all the manual scraping. |
You started off the bold repo issue ropensci/bold#60 (comment) with usage of |
still want any feedback on these new functions, but need to push a new version to cran now |
Hey @sckott - I'm finally giving the new bold downstream functionality a go and it appears to be working. I had to make a tiny update to your example in the readme section on pulling large data. The initial list was created like this:
It's almost identical to the example/tutorial for pulling from NCBI, where all I'm doing is replacing the
However, when you use
I'm interested in pulling the entire COI database from BOLD. Getting all arthropod records is certainly the biggest task, but I was curious if there was a simple way in your R package to split this task into two halves: one where I pull all arthropod records, and one where I pull all the rest. Or maybe something like (1) all animals that are arthropods; (2) all animals that aren't arthropods; (3) all COI that aren't animals.
Do you think BOLD would ever set up a way where you can go to some kind of FTP site of theirs and just pull all public records in one shot? I get that's not their main purpose, but I'd love to be able to set up something where I can pull their existing records once per year to keep up some kind of versioned release of their data.
On a related note, I was curious if you knew of any licensing issues in using/sharing these BOLD data. My hope is to take these BOLD data and create a database (or ten) for others (all free, of course). Am I violating any terms of agreement from BOLD? I can't find anything on their website about that kind of thing, but figured you may have encountered it before.
Thanks for the help, hope all is well. |
@devonorourke Thanks for trying it!
> curious if there was a simple way in your R package to split this task into two halves
by this package, do you mean taxize or bold? Not really either way. The way I usually recommend is for people to use downstream() or similar functions to split the target taxon into smaller taxonomic groups to query on, because as you know BOLD does not handle very large queries well.
> Do you think BOLD would ever set up a way where you can go to some kind of FTP site of theirs and just pull all public records in one shot?
There is this page http://boldsystems.org/index.php/datarelease but looks like no new data since 2015
> curious if you knew of any licensing issues in using/sharing these BOLD data
not that I know of |
Thanks again. Do you know anyone specific at BOLD that I might ask about licensing? I wasn't sure how plugged in you are with that group.
Cheers
|
last time I got a response was in 2017 from mmilton@boldsystems.org Megan Milton - there's some email address here http://boldsystems.org/index.php/Resources/ContactUs to try |
Thanks. In an ideal scenario, the user would be able to run something like this to get a sense of how many records they are about to attempt to download:
What I think would be sweet is if there was a way to automatically flag instances where these total numbers of records are higher than a value we know is likely to trip an error. If the flag is thrown, maybe there would be a function that could then subset the elements of that list that are too big, break them down to their next taxonomic rank into smaller chunks, and proceed. Does that sound crazy? Is there a simple way to recycle the bold_stats function? Thanks |
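The flagging step described in this comment can be sketched in base R, assuming the record counts are already available as a named numeric vector. `flag_large()` is a hypothetical helper (not part of taxize or bold), the four counts are the Branchiopoda order counts reported later in this thread, and the cutoff of 1000 is arbitrary, since nobody in the thread knows the real limit.

```r
# Record counts per taxon; sample input only (Branchiopoda order counts
# reported later in this thread).
counts <- c(Anostraca = 1570, Ctenopoda = 515, Notostraca = 625, Spinicaudata = 1674)

# flag_large() is a hypothetical helper, not part of taxize or bold.
# It returns the entries whose count exceeds max_records.
flag_large <- function(counts, max_records) {
  counts[counts > max_records]
}

# max_records = 1000 is an arbitrary cutoff for illustration only.
too_big <- flag_large(counts, max_records = 1000)
names(too_big)
```

The flagged names are the ones that would then be broken down to the next taxonomic rank.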
> I'm still trying to work out exactly how large is too large for BOLD
It may vary as well, e.g., if they're getting a lot of requests to their servers at once, the "too large" value may be larger at this point vs. when there are very few requests coming in. Unfortunately, we're stuck not knowing.
> recycle the bold_stats function
library(taxize)
library(bold)
# all classes within Arthropoda, according to BOLD
z <- downstream("Arthropoda", db = "bold", downto = "class")
x <- bold_stats(taxon=z$Arthropoda$name)
# record counts for the first three classes, one request per class
w <- lapply(z$Arthropoda$name[1:3], bold_stats)
as.list(stats::setNames(vapply(w, "[[", 1, "total_records"), z$Arthropoda$name[1:3])) |
Yeah, it's definitely not a fixed value.
I got all of the chordate dataset to work in one shot (>800k records on BOLD's taxonomy page), but then it failed for the Coleopterans (~ 600k records)... maybe I should only be downloading at 3am EST.
|
Oof, not consistent, wish I was wrong and that it was consistent |
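Since the failures are not tied to a fixed size, one hedge is to record which requests failed and keep going, so only those taxa need to be retried or split further. This is a minimal sketch: `fetch_each()` and `fake_fetch()` are hypothetical names, and `fake_fetch()` is a stand-in that fails on purpose. In practice `fetch` might be something like `function(t) bold::bold_seqspec(taxon = t)`, but that wiring is an assumption and is not tested here.

```r
# fetch_each() is a hypothetical helper, not part of taxize or bold.
# It calls fetch() once per taxon and records which calls failed
# instead of stopping at the first error.
fetch_each <- function(taxa, fetch) {
  results <- list()
  failed <- character(0)
  for (taxon in taxa) {
    # an error (or a NULL result) is treated as a failed request
    res <- tryCatch(fetch(taxon), error = function(e) NULL)
    if (is.null(res)) {
      failed <- c(failed, taxon)
    } else {
      results[[taxon]] <- res
    }
  }
  list(results = results, failed = failed)
}

# Stand-in for a real download function: fails for one taxon on purpose.
fake_fetch <- function(taxon) {
  if (taxon == "Coleoptera") stop("request too large")
  data.frame(taxon = taxon)
}

out <- fetch_each(c("Chordata", "Coleoptera"), fake_fetch)
out$failed
```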
Here's hoping I don't have to break it back any further than Family-level bins. I'm now being forced to split up Dipterans into 4 groups, Coleopterans into 5 groups, Hymenopterans into 5 groups...
I wrote to BOLD yesterday asking if they could make their entire dataset available. I'm trying to *reduce* their server use by providing a dataset that users can further refine for their own dataset needs! ...
Have you ever looked at how different GenBank and BOLD are these days? Maybe it's time I give up on BOLD and just build databases from NCBI instead.
Thanks for your help over the years - I wouldn't have made it through my Ph.D without the taxize and bold R packages!
|
you could do something like
library(taxize)
library(bold)
# Rank order used to step one level down. `ranks` was not defined in the
# original snippet, so this vector is an assumption.
ranks <- c("phylum", "class", "order", "family", "genus", "species")
split_taxa <- function(taxa, downto, max_records = 5000) {
  # children of `taxa` at rank `downto`, with a record count for each
  x <- downstream(taxa, db = "bold", downto = downto)
  w <- lapply(x[[1]]$name, bold_stats)
  recs <- vapply(w, "[[", 1, "total_records")
  df <- data.frame(name = x[[1]]$name, total_records = recs)
  x <- merge(x[[1]], df)
  x2 <- data.frame(NULL)
  if (any(x$total_records > max_records)) {
    too_big <- x[x$total_records > max_records,]
    out <- list()
    for (i in seq_len(NROW(too_big))) {
      # recurse one rank below the rank of the oversized taxon
      rank_get <- ranks[which(ranks %in% too_big[i,"rank"]) + 1]
      out[[i]] <- split_taxa(too_big[i,"name"], downto = rank_get)
    }
    # drop the oversized taxa and keep their smaller children instead
    x <- x[!x$name %in% too_big$name, ]
    x2 <- dplyr::bind_rows(out)
  }
  return(dplyr::bind_rows(x, x2))
}
It's pretty slow though, this takes a long time
res <- split_taxa(taxa = "Arthropoda", downto = "class")
# ends up with many taxa with 0 records, filter them out
res <- dplyr::filter(res, total_records > 0)
head(res)
#>            name     id  rank total_records y[FALSE, ]
#> 1 Acrothoracica 987854 class             1         NA
#> 2 Cephalocarida     73 class            25         NA
#> 3     Chilopoda     75 class          2652         NA
#> 4     Diplopoda     85 class          4157         NA
#> 5       Diplura 734358 class           183         NA
#> 6   Hexanauplia 765970 class             2         NA
max(res$total_records)
#> [1] 4999
A faster example to try:
split_taxa("Branchiopoda", "order")
#>              name     id   rank total_records
#> 1       Anostraca    314  order          1570
#> 2       Ctenopoda 951535  order           515
#> 3   Cyclestherida 951605  order            64
#> 4       Haplopoda 951534  order           220
#> 5    Laevicaudata 951532  order            58
#> 6      Notostraca    328  order           625
#> 7      Onychopoda 951533  order           731
#> 8    Spinicaudata 951608  order          1674
#> 9      Bosminidae   1569 family           310
#> 10     Chydoridae   1573 family          1481
#> 11     Daphniidae   1572 family          3663
#> 12   Eurycercidae 154591 family           184
#> 13  Ilyocryptidae 177866 family            23
#> 14 Macrothricidae   1566 family           109
#> 15       Moinidae   1570 family           450 |
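Once a table like the one above exists, small taxa could be grouped so that each download request stays under a chosen cap, which would cut the number of requests. The sketch below is one way to do that in base R and is not something the thread itself proposes: `bin_taxa()` is a hypothetical helper, the cap of 3000 is arbitrary, and it assumes several taxon names can be sent in one request. The counts are the eight order-level rows from the Branchiopoda output above.

```r
# bin_taxa() is a hypothetical helper, not part of taxize or bold.
# It walks the taxa in order and starts a new bin whenever adding the next
# taxon would push the running total above max_records.
bin_taxa <- function(name, total_records, max_records) {
  bin <- integer(length(name))
  current <- 1L
  running <- 0
  for (i in seq_along(name)) {
    if (running > 0 && running + total_records[i] > max_records) {
      current <- current + 1L
      running <- 0
    }
    running <- running + total_records[i]
    bin[i] <- current
  }
  split(name, bin)
}

# Order-level counts copied from the Branchiopoda output above.
res <- data.frame(
  name = c("Anostraca", "Ctenopoda", "Cyclestherida", "Haplopoda",
           "Laevicaudata", "Notostraca", "Onychopoda", "Spinicaudata"),
  total_records = c(1570, 515, 64, 220, 58, 625, 731, 1674),
  stringsAsFactors = FALSE
)

# max_records = 3000 is an arbitrary cap for illustration only.
bins <- bin_taxa(res$name, res$total_records, max_records = 3000)
bins
```

A single taxon that is larger than the cap still ends up alone in its own bin, so it would need to be split by rank first, as `split_taxa` does.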
> Have you ever looked at how different GenBank and BOLD are these days?
No. I don't have a good sense of what different data they have. GenBank web services are at least a little easier to use, so that's something. In general it seems NCBI does provide bulk access to data, but I'm not sure if they do for all data types.
> Thanks for your help over the years
of course! happy to help |
via ropensci/bold#60