bold children and downstream fxns? #817

sckott · 2020-04-20T19:02:16Z

via ropensci/bold#60

sckott · 2020-04-20T19:07:03Z

still need to add tests

sckott · 2020-04-20T19:07:06Z

remotes::install_github("ropensci/taxize") - see ?bold_children, ?children, ?bold_downstream, ?downstream

sckott · 2020-04-21T19:11:27Z

@devonorourke let me know your thoughts on these functions

devonorourke · 2020-04-22T12:16:09Z

It's been a while since I've had to tackle this kind of problem - can you remind me how the proposed changes would impact my workflow? This was the original R script I ran to get everything that did all the manual scraping.
Thanks for the help

sckott · 2020-04-23T16:43:31Z

You started the bold repo issue ropensci/bold#60 (comment) using taxize::downstream() with NCBI. With these changes, you can now use BOLD as the database in downstream() instead, if you like.

sckott · 2020-04-23T22:51:52Z

still want any feedback on these new functions, but I need to push a new version to CRAN now

devonorourke · 2020-06-30T21:25:58Z

Hey @sckott - I'm finally giving the new bold downstream functionality a go and it appears to be working. I had to make a tiny update to your example in the readme section on pulling large data.

The initial list was created like this:

x <- downstream("Arthropoda", db="bold", downto="class")

It's almost identical to the example/tutorial for pulling from NCBI; all I'm doing is changing the db= value from ncbi to bold. My confusion came two steps later in the tutorial, where we loop over an element of the x object using the bold_seq function. In the tutorial, you do this with the list:

nms <- x$Arthropoda$childtaxa_name
out <- lapply(nms, bold_seq)

However, when you use bold as the database in the downstream function instead of ncbi, the vector of names is just called name instead of childtaxa_name. Nothing's wrong in the tutorial, of course, but it's funny to me that you created two different vector names depending on which database you pull from. I'm sure you had a reason! I ended up just doing this and things seem ok (obviously this is going to take a bit):

nms <- x$Arthropoda$name
out <- lapply(nms, bold_seq)
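
A minimal sketch of one way to avoid caring which column holds the names. The helper child_names() is hypothetical (it is not part of taxize or bold), and the two data frames below are mock stand-ins for downstream() output, using the column names mentioned in this thread; the id values are made up.

```r
# Hypothetical helper, not part of taxize or bold: return the child-taxon
# names from one element of a downstream() result, whichever column holds them.
child_names <- function(df) {
  col <- intersect(c("childtaxa_name", "name"), names(df))
  if (length(col) == 0) stop("no recognised name column found")
  df[[col[1]]]
}

# Mock data frames (ids are invented) mimicking the two column layouts
ncbi_like <- data.frame(
  childtaxa_id = c("1", "2"),
  childtaxa_name = c("Insecta", "Arachnida"),
  rank = "class",
  stringsAsFactors = FALSE
)
bold_like <- data.frame(
  name = c("Insecta", "Arachnida"),
  id = c(1L, 2L),
  rank = "class",
  stringsAsFactors = FALSE
)

child_names(ncbi_like)
child_names(bold_like)
```

With a real result, the call would presumably look like nms <- child_names(x$Arthropoda), followed by the same lapply(nms, bold_seq) as before.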

I'm interested in pulling the entire COI database from BOLD. Getting all arthropod records is certainly the biggest task, but I was curious if there was a simple way in your R package to split this task into two halves: one where I pull all arthropod records, and one where I pull all the rest. Or maybe something like (1) all animals that are arthropods; (2) all animals that aren't arthropods; (3) all COI that aren't animals.

Do you think BOLD would ever set up a way where you can go to some kind of FTP site of theirs and just pull all public records in one shot? I get that's not their main purpose, but I'd love to be able to set up something where I can pull their existing records once per year to keep up some kind of versioned release of their data.

On a related note, I was curious if you knew of any licensing issues in using/sharing these BOLD data. My hope is to take these BOLD data and create a database (or ten) for others (all free, of course). Am I violating any terms of agreement from BOLD? I can't find anything on their website about that kind of thing, but figured you may have encountered it before.

Thanks for the help, hope all is well.

sckott · 2020-06-30T22:50:36Z

@devonorourke Thanks for trying it!

curious if there was a simple way in your R package to split this task into two halves

by this package, do you mean taxize or bold? Not really either way. The way I usually recommend is for people to use downstream() or similar functions to split the target taxon into smaller taxonomic groups to query on because as you know BOLD does not handle very large queries well.

Do you think BOLD would ever set up a way where you can go to some kind of FTP site of theirs and just pull all public records in one shot?

There is this page http://boldsystems.org/index.php/datarelease but looks like no new data since 2015

curious if you knew of any licensing issues in using/sharing these BOLD data

not that I know of

devonorourke · 2020-06-30T22:53:38Z

Thanks again. Do you know anyone specific at BOLD that I might ask about licensing? I wasn't sure how plugged in you are with that group. Cheers

sckott · 2020-06-30T23:05:49Z

last time I got a response was in 2017 from mmilton@boldsystems.org Megan Milton - there's some email address here http://boldsystems.org/index.php/Resources/ContactUs to try

devonorourke · 2020-07-02T02:14:20Z

Thanks.
I ran into an issue in trying to get through the entire Arthropoda dataset in one fell swoop and it crashed. I then went back to my old ways of splitting that dataset into 5: Coleoptera, Diptera, Hymenoptera, Lepidoptera, and all other Insects. Then it crashed again! 😠 ... So now I'm splitting up the Dipterans into 4 groups (the 3 families with the most specimens, plus all others) and seeing if it crashes anew.
I'm still trying to work out exactly how large is too large for BOLD, because I just ran the same kind of script today to pull data for all Chordata in one shot, and it worked fine. So clearly you can pull a lot of records, but just how many exactly I'm not so sure about.

In an ideal scenario, the user would be able to run something like this to get a sense of how many records they are about to attempt to download:

arthropod_list <- downstream("Arthropoda", db = "bold", downto = "class")
x <- bold_stats(taxon=arthropod_list$Arthropoda$name)
x$total_records

What I think would be sweet is if there was a way to automatically flag instances where these total numbers of records are higher than a value we know is likely to trip an error. If the flag is thrown, maybe there would be a function that could then subset the elements of that list that are too big and then break them down to their next taxonomic rank into smaller chunks and proceed. Does that sound crazy?

Is there a simple way to recycle the bold_stats function across each element of arthropod_list$Arthropoda$name so that you get record totals per $name instead of one total aggregated across all the names (in this case, the Arthropod Class names)?
Even if the user is going to need to split up their searches manually, starting with a broad taxonomic group and getting the number of individual records per subgroup (say, next taxonomic rank down) would be helpful in figuring out how to chunk the subgroups.

Thanks

sckott · 2020-07-02T17:45:54Z

I'm still trying to work out exactly how large is too large for BOLD

It may vary as well, e.g., if they're getting a lot of requests to their servers at once, the "too large" value may be smaller at that point vs. when there are very few requests coming in. Unfortunately, we're stuck not knowing.
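
Since the failure point seems to depend on server load, one possible workaround is to retry a failed request after a pause. The sketch below is only an illustration of that idea: with_retry() and make_flaky() are hypothetical names, and the flaky function is a simulated stand-in for a real download call, so nothing here touches BOLD.

```r
# Hypothetical helper: call fun(...) up to `tries` times, pausing between
# failed attempts; the pause doubles after each failure.
with_retry <- function(fun, ..., tries = 3, wait = 5) {
  for (attempt in seq_len(tries)) {
    result <- tryCatch(fun(...), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    if (attempt < tries) Sys.sleep(wait * 2^(attempt - 1))
  }
  stop("all ", tries, " attempts failed; last error: ", conditionMessage(result))
}

# Simulated stand-in for a download function: errors on the first
# `fail_times` calls, then succeeds.
make_flaky <- function(fail_times) {
  calls <- 0
  function(taxon) {
    calls <<- calls + 1
    if (calls <= fail_times) stop("simulated server failure")
    paste("records for", taxon)
  }
}

flaky <- make_flaky(2)
with_retry(flaky, "Diptera", tries = 3, wait = 0)
```

In real use the wrapped function would presumably be something like bold_seq, with a non-zero wait. Whether a retry actually helps with BOLD's large-query failures is an assumption, not something established in this thread.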

recycle the bold_stats function

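library(taxize)
library(bold)
# class-level children of Arthropoda, from BOLD
z <- downstream("Arthropoda", db = "bold", downto = "class")
# one call with all names: totals aggregated across them
x <- bold_stats(taxon=z$Arthropoda$name)
# one call per name: per-taxon totals (first 3 names only, to keep it quick)
w <- lapply(z$Arthropoda$name[1:3], bold_stats)
as.list(stats::setNames(vapply(w, "[[", 1, "total_records"), z$Arthropoda$name[1:3]))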

devonorourke · 2020-07-02T17:52:03Z

Yeah, it's definitely not a fixed value. I got all of the chordate dataset to work in one shot (>800k records on BOLD's taxonomy page), but then it failed for the Coleopterans (~ 600k records)... maybe I should only be downloading at 3am EST.

-- Devon O'Rourke Postdoctoral researcher, Northern Arizona University Lab of Jeffrey T. Foster - https://fozlab.weebly.com/ twitter: @thesciencedork

sckott · 2020-07-02T17:57:39Z

Oof, not consistent, wish I was wrong and that it was consistent

devonorourke · 2020-07-02T18:06:58Z

Here's hoping I don't have to break it back any further than Family-level bins. I'm now being forced to split up Dipterans into 4 groups, Coleopterans into 5 groups, Hymenopterans into 5 groups... I wrote to BOLD yesterday asking if they could make their entire dataset available. I'm trying to *reduce* their server use by providing a dataset that users can further refine for their own dataset needs! ... Have you ever looked at how different GenBank and BOLD are these days? Maybe it's time I give up on BOLD and just build databases from NCBI instead. Thanks for your help over the years - I wouldn't have made it through my Ph.D without the taxize and bold R packages!

sckott · 2020-07-02T20:06:36Z

a way to automatically flag instances

you could do something like

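library(taxize)
library(bold)

# ASSUMPTION: `ranks` is not defined in the snippet as posted; the function
# needs an ordered vector of rank names (highest to lowest) along these lines
ranks <- c("phylum", "class", "order", "family", "genus", "species")

split_taxa <- function(taxa, downto, max_records = 5000) {
  # children of `taxa` at rank `downto`, with a record count for each
  x <- downstream(taxa, db = "bold", downto = downto)
  w <- lapply(x[[1]]$name, bold_stats)
  recs <- vapply(w, "[[", 1, "total_records")
  df <- data.frame(name = x[[1]]$name, total_records = recs)
  x <- merge(x[[1]], df)
  x2 <- data.frame(NULL)
  if (any(x$total_records > max_records)) {
    too_big <- x[x$total_records > max_records,]
    out <- list()
    for (i in seq_len(NROW(too_big))) {
      # recurse one rank below the oversized taxon's own rank
      rank_get <- ranks[which(ranks %in% too_big[i,"rank"]) + 1]
      # pass max_records along so a non-default limit applies at every level
      out[[i]] <- split_taxa(too_big[i,"name"], downto = rank_get,
        max_records = max_records)
    }
    x <- x[!x$name %in% too_big$name, ]
    x2 <- dplyr::bind_rows(out)
  }
  return(dplyr::bind_rows(x, x2))
}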

It's pretty slow though, this takes a long time

res <- split_taxa(taxa = "Arthropoda", downto = "class")
# ends up with many taxa with 0 records, filter them out
res <- dplyr::filter(res, total_records > 0)
head(res)
#>            name     id  rank total_records y[FALSE, ]
#> 1 Acrothoracica 987854 class             1         NA
#> 2 Cephalocarida     73 class            25         NA
#> 3     Chilopoda     75 class          2652         NA
#> 4     Diplopoda     85 class          4157         NA
#> 5       Diplura 734358 class           183         NA
#> 6   Hexanauplia 765970 class             2         NA

max(res$total_records)
#> [1] 4999

A faster example to try:

split_taxa("Branchiopoda", "order")

#>              name     id   rank total_records
#> 1       Anostraca    314  order          1570
#> 2       Ctenopoda 951535  order           515
#> 3   Cyclestherida 951605  order            64
#> 4       Haplopoda 951534  order           220
#> 5    Laevicaudata 951532  order            58
#> 6      Notostraca    328  order           625
#> 7      Onychopoda 951533  order           731
#> 8    Spinicaudata 951608  order          1674
#> 9      Bosminidae   1569 family           310
#> 10     Chydoridae   1573 family          1481
#> 11     Daphniidae   1572 family          3663
#> 12   Eurycercidae 154591 family           184
#> 13  Ilyocryptidae 177866 family            23
#> 14 Macrothricidae   1566 family           109
#> 15       Moinidae   1570 family           450
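
Once per-taxon counts like the ones above are in hand, grouping taxa into download batches that each stay under a record budget is a small bin-packing problem. Here is a rough sketch of a greedy first-fit approach. batch_taxa() and the budget of 5000 are illustrative assumptions, not part of taxize or bold; the counts are copied from the Branchiopoda output above.

```r
# Hypothetical helper: assign each taxon to the first batch that still has
# room, visiting taxa from largest to smallest. A single taxon larger than
# the budget ends up alone in its own (over-budget) batch.
batch_taxa <- function(df, budget) {
  df <- df[order(-df$total_records), ]
  batch <- integer(nrow(df))
  loads <- numeric(0)  # running record total of each batch
  for (i in seq_len(nrow(df))) {
    n <- df$total_records[i]
    fit <- which(loads + n <= budget)
    if (length(fit) > 0) {
      b <- fit[1]
      loads[b] <- loads[b] + n
    } else {
      loads <- c(loads, n)
      b <- length(loads)
    }
    batch[i] <- b
  }
  df$batch <- batch
  df
}

# Counts copied from the Branchiopoda example in this thread
branch <- data.frame(
  name = c("Anostraca", "Ctenopoda", "Cyclestherida", "Haplopoda",
           "Laevicaudata", "Notostraca", "Onychopoda", "Spinicaudata",
           "Bosminidae", "Chydoridae", "Daphniidae", "Eurycercidae",
           "Ilyocryptidae", "Macrothricidae", "Moinidae"),
  total_records = c(1570, 515, 64, 220, 58, 625, 731, 1674,
                    310, 1481, 3663, 184, 23, 109, 450),
  stringsAsFactors = FALSE
)

batched <- batch_taxa(branch, budget = 5000)
# record total per batch
tapply(batched$total_records, batched$batch, sum)
```

Each batch's names could then be passed to a single download call. Since the size BOLD tolerates is not fixed, the budget is a guess to be tuned rather than a known limit.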

sckott · 2020-07-02T20:12:21Z

Have you ever looked at how different GenBank and BOLD are these days?

No. I don't have a good sense of what different data they have. GenBank web services are at least a little easier to use, so that's something. In general it seems NCBI does provide bulk access to data, but I'm not sure if they do for all data types.

Thanks for your help

of course! happy to help

sckott added this to the v0.9.95 milestone Apr 20, 2020

sckott added a commit that referenced this issue Apr 20, 2020

#817 start fxns for bold children and downsttream based on scraping t…

9bc3d55

…heir site

sckott mentioned this issue Apr 20, 2020

a rare case of a missing Insect Order ropensci/bold#60

Closed

sckott added a commit that referenced this issue Apr 21, 2020

#817 hide bold_children man page, make internal

364adeb

sckott added a commit that referenced this issue Apr 21, 2020

#817 bold children/downsttream tests added

15dca1b

sckott closed this as completed Apr 23, 2020
