bold children and downstream fxns? #817

Closed
sckott opened this issue Apr 20, 2020 · 17 comments

@sckott
Contributor

sckott commented Apr 20, 2020

via ropensci/bold#60

@sckott sckott added this to the v0.9.95 milestone Apr 20, 2020
@sckott
Contributor Author

sckott commented Apr 20, 2020

still need to add tests

@sckott
Contributor Author

sckott commented Apr 20, 2020

remotes::install_github("ropensci/taxize") - see ?bold_children, ?children, ?bold_downstream, ?downstream

@sckott
Contributor Author

sckott commented Apr 21, 2020

@devonorourke let me know your thoughts on these functions

@devonorourke

It's been a while since I've had to tackle this kind of problem - can you remind me how the proposed changes would impact my workflow? This was the original R script I ran, which did all the manual scraping to get everything.
Thanks for the help

@sckott
Contributor Author

sckott commented Apr 23, 2020

You started off the bold repo issue ropensci/bold#60 (comment) with usage of taxize::downstream() with ncbi as the data source. With these changes, you can now use bold as the data source in downstream() instead, if you like.

@sckott sckott closed this as completed Apr 23, 2020
@sckott
Contributor Author

sckott commented Apr 23, 2020

still want any feedback on these new functions, but need to push a new version to cran now

@devonorourke

Hey @sckott - I'm finally giving the new bold downstream functionality a go and it appears to be working. I had to make a tiny update to your example in the readme section on pulling large data.

The initial list was created like this:

x <- downstream("Arthropoda", db="bold", downto="class")

It's almost identical to the example/tutorial for pulling from NCBI, where all I'm doing is replacing the db= value from ncbi to bold. My confusion came from following two steps later in the tutorial where we loop over an element of the x object using the bold_seqspec function. In the tutorial, you do this with the list:

nms <- x$Arthropoda$childtaxa_name
out <- lapply(nms, bold_seq)

However, when you use bold as the database in the downstream function instead of ncbi, the vector of names is just called name instead of childtaxa_name. Nothing's wrong in the tutorial, of course, but it's funny to me that you created two different vector names depending on which database you pull from. I'm sure you had a reason! I ended up just doing this and things seem ok (obviously this is going to take a bit):

nms <- x$Arthropoda$name
out <- lapply(nms, bold_seq)
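
Since the column holding the child names differs by data source (childtaxa_name for ncbi, name for bold, as described above), one way to keep a script working with either is a small helper that picks whichever column is present. This is only a sketch: `child_names` is a hypothetical helper, not part of taxize, and the two data frames below are made-up stand-ins for one element of a downstream() result.

```r
# Hypothetical helper (not part of taxize): return the child-taxon names from
# one element of a downstream() result, whichever column the data source used.
child_names <- function(df, candidates = c("childtaxa_name", "name")) {
  hit <- intersect(candidates, names(df))
  if (length(hit) == 0) {
    stop("none of the expected name columns found: ",
         paste(candidates, collapse = ", "))
  }
  as.character(df[[hit[1]]])
}

# Mock data frames standing in for downstream() output (ids are made up)
ncbi_like <- data.frame(childtaxa_id = c("1", "2"),
                        childtaxa_name = c("Insecta", "Arachnida"),
                        stringsAsFactors = FALSE)
bold_like <- data.frame(name = c("Insecta", "Arachnida"),
                        id = c(1L, 2L),
                        stringsAsFactors = FALSE)

child_names(ncbi_like)
child_names(bold_like)
```

With a real result this would be called as child_names(x$Arthropoda), assuming the column names are as reported in this thread.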

I'm interested in pulling the entire COI database from BOLD. Getting all arthropod records is certainly the biggest task, but I was curious if there was a simple way in your R package to split this task into two halves: one where I pull all arthropod records, and one where I pull all the rest. Or maybe something like (1) all animals that are arthropods; (2) all animals that aren't arthropods; (3) all COI that aren't animals.

Do you think BOLD would ever set up a way where you can go to some kind of FTP site of theirs and just pull all public records in one shot? I get that's not their main purpose, but I'd love to be able to set up something where I can pull their existing records once per year to keep up some kind of versioned release of their data.

On a related note, I was curious if you knew of any licensing issues in using/sharing these BOLD data. My hope is to take these BOLD data and create a database (or ten) for others (all free, of course). Am I violating any terms of agreement from BOLD? I can't find anything on their website about that kind of thing, but figured you may have encountered it before.

Thanks for the help, hope all is well.

@sckott
Contributor Author

sckott commented Jun 30, 2020

@devonorourke Thanks for trying it!

curious if there was a simple way in your R package to split this task into two halves

by this package, do you mean taxize or bold? Not really either way. The way I usually recommend is for people to use downstream() or similar functions to split the target taxon into smaller taxonomic groups to query on because as you know BOLD does not handle very large queries well.

Do you think BOLD would ever set up a way where you can go to some kind of FTP site of theirs and just pull all public records in one shot?

There is this page http://boldsystems.org/index.php/datarelease but looks like no new data since 2015

curious if you knew of any licensing issues in using/sharing these BOLD data

not that I know of

@devonorourke

devonorourke commented Jun 30, 2020 via email

@sckott
Contributor Author

sckott commented Jun 30, 2020

last time I got a response was in 2017 from mmilton@boldsystems.org Megan Milton - there's some email address here http://boldsystems.org/index.php/Resources/ContactUs to try

@devonorourke

Thanks.
I ran into an issue in trying to get through the entire Arthropoda dataset in one fell swoop and it crashed. I then went back to my old ways of splitting that dataset into 5: Coleoptera, Diptera, Hymenoptera, Lepidoptera, and all other Insects. Then it crashed again! 😠 ... So now I'm splitting up the Dipterans into 4 groups (the 3 families with the most specimens, plus all others) and seeing if it crashes anew.
I'm still trying to work out exactly how large is too large for BOLD, because I just ran the same kind of script today to pull data for all Chordata in one shot, and it worked fine. So clearly you can pull a lot of records, but just how many exactly I'm not so sure about.

In an ideal scenario, the user would be able to run something like this to get a sense of how many records they are about to attempt to download:

arthropod_list <- downstream("Arthropoda", db = "bold", downto = "class")
x <- bold_stats(taxon=arthropod_list$Arthropoda$name)
x$total_records

What I think would be sweet is if there was a way to automatically flag instances where these total numbers of records are higher than a value we know is likely to trip an error. If the flag is thrown, maybe there would be a function that could then subset the elements of that list that are too big and then break them down to their next taxonomic rank into smaller chunks and proceed. Does that sound crazy?

Is there a simple way to recycle the bold_stats function across each element of arthropod_list$Arthropoda$name so that you get record totals per $name instead of one total aggregated across all of the names (in this case, the Arthropod class names)?
Even if the user is going to need to split up their searches manually, starting with a broad taxonomic group and getting the number of individual records per subgroup (say, next taxonomic rank down) would be helpful in figuring out how to chunk the subgroups.
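
Once per-subgroup record counts are in hand, the chunking itself can be sketched as a greedy packing of taxa into batches that stay under a chosen cap. `chunk_by_records` is a hypothetical helper, and the cap of 2000 is an arbitrary value for illustration, not a known BOLD limit. The counts are the Branchiopoda order totals reported later in this thread.

```r
# Hypothetical helper: greedily pack taxa into batches whose summed record
# counts stay at or under `max_records`. A taxon that alone exceeds the cap
# gets a batch to itself (it would need splitting at a lower rank).
chunk_by_records <- function(counts, max_records) {
  stopifnot(is.numeric(counts), !is.null(names(counts)), max_records > 0)
  counts <- sort(counts, decreasing = TRUE)
  batches <- list()
  totals <- numeric(0)
  for (nm in names(counts)) {
    n <- counts[[nm]]
    # first existing batch with room for this taxon, if any
    fits <- which(totals + n <= max_records)
    if (length(fits) > 0) {
      i <- fits[1]
      batches[[i]] <- c(batches[[i]], nm)
      totals[i] <- totals[i] + n
    } else {
      batches[[length(batches) + 1]] <- nm
      totals <- c(totals, n)
    }
  }
  batches
}

counts <- c(Anostraca = 1570, Ctenopoda = 515, Cyclestherida = 64, Haplopoda = 220)
batches <- chunk_by_records(counts, max_records = 2000)
batches
```

Each element of `batches` is a character vector of taxon names that could be passed together to a single query.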

Thanks

@sckott
Contributor Author

sckott commented Jul 2, 2020

I'm still trying to work out exactly how large is too large for BOLD

It may vary as well, e.g., if they're getting a lot of requests to their servers at once, the "too large" value may be smaller at that point vs. when there are very few requests coming in. Unfortunately, we're stuck not knowing.
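
Because the limit seems to move around, one option is to retry a failed request after a pause rather than treat the first failure as final. The sketch below is generic R, not something taxize or bold provides: `with_retry` is a hypothetical wrapper, and the `flaky` function only simulates a server that fails twice before answering.

```r
# Hypothetical wrapper: call `f()` up to `tries` times, waiting longer after
# each failure. Returns the first successful result, or re-raises the last error.
with_retry <- function(f, tries = 3, wait = 1) {
  stopifnot(tries >= 1)
  last_err <- NULL
  for (attempt in seq_len(tries)) {
    result <- tryCatch(f(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    last_err <- result
    # wait 1x, 2x, 4x, ... the base interval before the next attempt
    if (attempt < tries) Sys.sleep(wait * 2^(attempt - 1))
  }
  stop(last_err)
}

# Simulated request that fails on its first two calls
state <- new.env()
state$calls <- 0
flaky <- function() {
  state$calls <- state$calls + 1
  if (state$calls < 3) stop("server busy")
  "ok"
}

with_retry(flaky, tries = 5, wait = 0)
```

A real call might look like with_retry(function() bold_stats(taxon = "Diplura"), wait = 5), though whether retrying helps with BOLD in practice is untested here.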

recycle the bold_stats function

library(taxize)
library(bold)
z <- downstream("Arthropoda", db = "bold", downto = "class")
x <- bold_stats(taxon=z$Arthropoda$name)
w <- lapply(z$Arthropoda$name[1:3], bold_stats)
as.list(stats::setNames(vapply(w, "[[", 1, "total_records"), z$Arthropoda$name[1:3]))

@devonorourke

devonorourke commented Jul 2, 2020 via email

@sckott
Contributor Author

sckott commented Jul 2, 2020

Oof, not consistent, wish I was wrong and that it was consistent

@devonorourke

devonorourke commented Jul 2, 2020 via email

@sckott
Contributor Author

sckott commented Jul 2, 2020

a way to automatically flag instances

you could do something like

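library(taxize)
library(bold)

# Assumption: `ranks` is not defined in this snippet; an ordered vector of
# rank names like this one (highest to lowest) is what the function needs
ranks <- c("phylum", "class", "order", "family", "genus", "species")

# Recursively split `taxa` into child taxa at rank `downto`, descending one
# more rank for any child with more than `max_records` records
split_taxa <- function(taxa, downto, max_records = 5000) {
  x <- downstream(taxa, db = "bold", downto = downto)
  # one bold_stats() call per child taxon to get per-taxon record totals
  w <- lapply(x[[1]]$name, bold_stats)
  recs <- vapply(w, "[[", 1, "total_records")
  df <- data.frame(name = x[[1]]$name, total_records = recs)
  x <- merge(x[[1]], df)
  x2 <- data.frame(NULL)
  if (any(x$total_records > max_records)) {
    too_big <- x[x$total_records > max_records,]
    out <- list()
    for (i in seq_len(NROW(too_big))) {
      # next rank below the current one
      rank_get <- ranks[which(ranks %in% too_big[i,"rank"]) + 1]
      # pass max_records along so a non-default cap applies at every level
      out[[i]] <- split_taxa(too_big[i,"name"], downto = rank_get,
                             max_records = max_records)
    }
    x <- x[!x$name %in% too_big$name, ]
    x2 <- dplyr::bind_rows(out)
  }
  return(dplyr::bind_rows(x, x2))
}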

It's pretty slow though, this takes a long time

res <- split_taxa(taxa = "Arthropoda", downto = "class")
# ends up with many taxa with 0 records, filter them out
res <- dplyr::filter(res, total_records > 0)
head(res)
#>            name     id  rank total_records y[FALSE, ]
#> 1 Acrothoracica 987854 class             1         NA
#> 2 Cephalocarida     73 class            25         NA
#> 3     Chilopoda     75 class          2652         NA
#> 4     Diplopoda     85 class          4157         NA
#> 5       Diplura 734358 class           183         NA
#> 6   Hexanauplia 765970 class             2         NA

max(res$total_records)
#> [1] 4999

A faster example to try:

split_taxa("Branchiopoda", "order")

#>              name     id   rank total_records
#> 1       Anostraca    314  order          1570
#> 2       Ctenopoda 951535  order           515
#> 3   Cyclestherida 951605  order            64
#> 4       Haplopoda 951534  order           220
#> 5    Laevicaudata 951532  order            58
#> 6      Notostraca    328  order           625
#> 7      Onychopoda 951533  order           731
#> 8    Spinicaudata 951608  order          1674
#> 9      Bosminidae   1569 family           310
#> 10     Chydoridae   1573 family          1481
#> 11     Daphniidae   1572 family          3663
#> 12   Eurycercidae 154591 family           184
#> 13  Ilyocryptidae 177866 family            23
#> 14 Macrothricidae   1566 family           109
#> 15       Moinidae   1570 family           450

@sckott
Contributor Author

sckott commented Jul 2, 2020

Have you ever looked at how different GenBank and BOLD are these days?

No. I don't have a good sense of what different data they have. GenBank web services are at least a little easier to use, so that's something. In general it seems NCBI does provide bulk access to data, but I'm not sure if they do for all data types.

Thanks for your help

of course! happy to help
