a rare case of a missing Insect Order #60
Comments
Thanks for the report @devonorourke. Not sure what the answer is off the top of my head. I'll poke around and see what I can find. It'd be great if there was a way to implement
It seems like BOLD may follow Catalogue of Life taxonomy. I'm trying to get an answer on this.
They're definitely not getting back to me. They do appear to have children on each of their taxon pages, so we can scrape the names, BUT scraping is super fragile, so I'm somewhat reluctant to put this code in a package. This should work as is:

# strextract() is not defined in the original snippet; this base R
# stand-in (an assumption) returns the first match of `pattern` in `str`
strextract <- function(str, pattern) regmatches(str, regexpr(pattern, str))

# Scrape the children listed on the BOLD taxon page for one taxon id
bold_children_one <- function(id) {
  x <- crul::HttpClient$new(paste0("https://v4.boldsystems.org/index.php/Taxbrowser_Taxonpage?taxid=", id))
  res <- x$get()
  res$raise_for_status()
  html <- xml2::read_html(res$parse("UTF-8"))
  # each <ol> holds one group of child taxa
  nodes <- xml2::xml_find_all(html, '//div[@class = "row"]//div[@class = "ibox float-e-margins"]//ol')
  if (length(nodes) == 0) {
    message("no children found")
    return(tibble::tibble())
  }
  # group headers (e.g. tribes, genera), used to name the list elements
  group_nmz <- xml2::xml_find_all(html, '//div[@class = "row"]//div[@class = "ibox float-e-margins"]//lh')
  bb <- lapply(nodes, bold_children_each_node)
  if (length(group_nmz) > 0) {
    # drop the "(count)" suffix and whitespace, then lowercase
    lst_nmz <- tolower(gsub("\\([0-9]+\\)|\\s", "", xml2::xml_text(group_nmz)))
    bb <- stats::setNames(bb, lst_nmz)
  }
  return(bb)
}

# Turn the links inside one group into a tibble of child names and ids
bold_children_each_node <- function(x) {
  out <- lapply(xml2::xml_find_all(x, ".//a"), function(w) {
    # strip the trailing " [count]" from the link text
    nm <- gsub("\\s\\[[0-9]+\\]$", "", xml2::xml_text(w))
    # the taxon id is the trailing run of digits in the href
    id <- strextract(xml2::xml_attr(w, "href"), "[0-9]+$")
    data.frame(name = nm, id = id, stringsAsFactors = FALSE)
  })
  tibble::as_tibble(data.table::rbindlist(out))
}
# Osmia (genus): 253 children
bold_children_one(id = 4940)
# Momotus (genus): 3 children
bold_children_one(id = 88899)
# Momotus aequatorialis (species): no children
bold_children_one(id = 115130)
# Osmia sp1 (species): no children
bold_children_one(id = 293378)
# Arthropoda (phylum): 27 children
bold_children_one(id = 82)
# Psocodea (order): 51 children
bold_children_one(id = 737139)
# Megachilinae (subfamily): 2 groups (tribes: 3, genera: 60)
bold_children_one(id = 4962)
# Stelis (species): 78 taxa
bold_children_one(id = 4952)
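The two regular expressions in the scraper can be checked without touching the network. Below is a minimal base R sketch of that idea; the link text and hrefs are made-up values shaped like what a taxon page might contain, and `extract_id()` is a hypothetical helper, not part of any package. Unlike a bare `regmatches()` call, it returns `NA` for hrefs with no trailing digits, so names and ids stay aligned.

```r
# Hypothetical link text and hrefs (not real BOLD output)
link_text <- c("Osmia bicornis [12]", "Osmia sp1 [3]")
link_href <- c("/index.php/Taxbrowser_Taxonpage?taxid=100",
               "/index.php/Taxbrowser_Taxonpage?taxid=293378")

# Drop the trailing " [count]" from each name
clean_names <- gsub("\\s\\[[0-9]+\\]$", "", link_text)

# Extract the trailing run of digits from each href; NA when there is none
extract_id <- function(href) {
  m <- regexpr("[0-9]+$", href)
  hit <- !is.na(m) & m > 0
  out <- rep(NA_character_, length(href))
  out[hit] <- regmatches(href, m)
  out
}

# Group headers: remove the "(count)" suffix and whitespace, then lowercase
clean_group <- tolower(gsub("\\([0-9]+\\)|\\s", "", "Tribes (3)"))

children <- data.frame(name = clean_names, id = extract_id(link_href),
                       stringsAsFactors = FALSE)
print(children)
```

If BOLD changes its link text or URL format, a check like this should fail quickly, which is one way to notice that the scraper has broken.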
I'm in support of whatever you advise. Agreed about the challenge of web scraping. Anything you need from me?
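Since scraping is fragile, one defensive pattern when fetching children for many taxon ids is to catch failures per id so a single bad page does not abort the whole run. This is only a sketch: `safe_children()` is a hypothetical wrapper, and the fetch function is passed in as an argument (it could be `bold_children_one` from the comment above, or a stand-in as in the example here).

```r
# Call `fetch` on each id; on error, report it and keep NULL for that id
safe_children <- function(ids, fetch) {
  res <- lapply(ids, function(id) {
    tryCatch(
      fetch(id),
      error = function(e) {
        message("taxid ", id, " failed: ", conditionMessage(e))
        NULL
      }
    )
  })
  stats::setNames(res, as.character(ids))
}

# Stand-in fetcher for illustration only: fails for id 2
fake_fetch <- function(id) {
  if (id == 2) stop("page layout changed")
  data.frame(name = paste0("taxon", id), id = id, stringsAsFactors = FALSE)
}

res <- safe_children(c(1, 2, 3), fake_fetch)
# ids whose fetch failed
failed <- names(res)[vapply(res, is.null, logical(1))]
print(failed)
```

The failed ids can then be retried or inspected by hand rather than silently lost.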
I stumbled across this quirk while following the Readme example for pulling all the arthropod data at once from BOLD. I'm guessing this is a rare thing, or perhaps a non thing and I'm just screwing something up, but if not, it seemed worth mentioning:
Great, all the Classes are there. So far so good.
But because the Insects have like 89% of all records, I thought I'd remove them from the subsequent lapply(x.nms, bold_seqspec) call and pull out all the Insects and do those separately. So the next step was to generate a list of all Insect Orders:

Having spent more time staring at Insect Order names than I care to admit, I noticed that one was missing: Psocodea. In the y.checks object you'll notice that 'Psocoptera' is actually the one that is listed as missing, and it's because that name isn't used in the BOLD database but is used in NCBI. The BOLD list of all Insect Orders (here) lists Psocodea as having 42380 records, so it's not a trivial issue. Especially for those bark lice lovers out there... which apparently include the bats I study! If you try a search for Psocoptera it'll come up empty in BOLD.

I think this is one of those weird instances where the superOrder 'Psocodea' is used in BOLD... so the NCBI approach may be screwing up what we're looking for in BOLD sometimes.
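One way to spot and work around this kind of mismatch is to compare the NCBI-derived names against the names BOLD actually lists, then remap the known synonyms before querying. The sketch below uses base R only; the two name vectors are short made-up samples, and the `ncbi_to_bold` lookup and `to_bold_names()` helper are hypothetical, holding only the single mapping described in this issue.

```r
# Made-up samples of order names from each source
ncbi_orders <- c("Coleoptera", "Diptera", "Psocoptera")
bold_orders <- c("Coleoptera", "Diptera", "Psocodea")

# Names NCBI returns that BOLD does not use, and vice versa
setdiff(ncbi_orders, bold_orders)
setdiff(bold_orders, ncbi_orders)

# Hypothetical synonym table: NCBI name -> name used in BOLD
ncbi_to_bold <- c(Psocoptera = "Psocodea")

# Replace any name found in the synonym table; leave the rest untouched
to_bold_names <- function(nms, synonyms) {
  hit <- nms %in% names(synonyms)
  nms[hit] <- unname(synonyms[nms[hit]])
  nms
}

query_names <- to_bold_names(ncbi_orders, ncbi_to_bold)
print(query_names)
```

The remapped vector could then be passed to the same lapply call over bold_seqspec used earlier, assuming the remaining names match BOLD's.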
Thanks for the consideration!