a rare case of a missing Insect Order #60

Closed
devonorourke opened this issue Feb 22, 2019 · 6 comments
@devonorourke

I stumbled across this quirk while following the Readme example for pulling all the arthropod data at once from BOLD. I'm guessing this is a rare thing, or perhaps a non thing and I'm just screwing something up, but if not, it seemed worth mentioning:

library("taxize")
library("bold")

x <- downstream("Arthropoda", db = "ncbi", downto = "class")
x.nms <- x$Arthropoda$childtaxa_name
x.checks <- bold_tax_name(x.nms)

Great, all the Classes are there. So far so good.

But because Insecta accounts for something like 89% of all records, I thought I'd remove it from the subsequent lapply(x.nms, bold_seqspec) call and pull the insects separately. So the next step was to generate a list of all Insect Orders:

y <- downstream("Insecta", db = "ncbi", downto = "order")
y.nms <- y$Insecta$childtaxa_name
y.checks <- bold_tax_name(y.nms)

Having spent more time staring at Insect Order names than I care to admit, I noticed that one was missing: Psocodea. In the y.checks object, 'Psocoptera' is the name that comes back without a match, because NCBI uses that name but BOLD does not. The BOLD list of all Insect Orders (here) lists Psocodea as having 42380 records, so it's not a trivial issue. Especially for those bark lice lovers out there... which apparently include the bats I study! If you search for Psocoptera in BOLD, it comes up empty.

I think this is one of those weird instances where the superOrder 'Psocodea' is used in BOLD... so the NCBI approach may be screwing up what we're looking for in BOLD sometimes.
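One possible stopgap for the mismatch described above is to remap names before they are passed to bold_tax_name. This is only a sketch: `ncbi_to_bold` and `remap_names` are hypothetical names, and the only mapping taken from this thread is Psocoptera to Psocodea.

```r
# Hypothetical lookup table: name as NCBI reports it -> name BOLD expects.
# Only the Psocoptera/Psocodea pair comes from this thread; extend as needed.
ncbi_to_bold <- c(Psocoptera = "Psocodea")

# Swap in the BOLD name wherever a mapping exists; leave other names untouched.
remap_names <- function(nms, lookup = ncbi_to_bold) {
  hit <- nms %in% names(lookup)
  nms[hit] <- unname(lookup[nms[hit]])
  nms
}

orders <- c("Coleoptera", "Psocoptera", "Diptera")
remap_names(orders)
```

The remapped vector could then be handed to bold_tax_name in place of the raw NCBI names.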

Thanks for the consideration!

@sckott
Contributor

sckott commented Feb 26, 2019

thanks for the report @devonorourke

not sure what the answer is off the top. I'll poke around and see what I can find.

It'd be great if there was a way to implement taxize::downstream for BOLD, but as far as I can remember, I don't think they have a way to get children of a taxon, which is the basis for making downstream work
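To illustrate why a children lookup is the key ingredient, here is a rough sketch of a downstream-style walk built on nothing but a "give me the children" function. It is not how taxize::downstream is implemented; `toy_children`, `toy_ranks`, `children_fun`, `rank_fun` and `downstream_sketch` are all made-up names, and the tiny taxonomy is a stand-in for a real data source.

```r
# Toy parent -> children table standing in for a real taxonomy source.
toy_children <- list(
  Insecta  = c("Psocodea", "Diptera"),
  Psocodea = c("Liposcelididae"),
  Diptera  = c("Culicidae", "Muscidae")
)
toy_ranks <- c(
  Insecta = "class", Psocodea = "order", Diptera = "order",
  Liposcelididae = "family", Culicidae = "family", Muscidae = "family"
)

# Hypothetical helpers: any function returning child names / a rank would do.
children_fun <- function(name) {
  kids <- toy_children[[name]]
  if (is.null(kids)) character(0) else kids
}
rank_fun <- function(name) unname(toy_ranks[name])

# Walk down from `name`, collecting every taxon found at rank `downto`.
downstream_sketch <- function(name, downto) {
  out <- character(0)
  for (k in children_fun(name)) {
    if (identical(rank_fun(k), downto)) {
      out <- c(out, k)                             # reached the target rank
    } else {
      out <- c(out, downstream_sketch(k, downto))  # keep descending
    }
  }
  out
}

downstream_sketch("Insecta", "order")
downstream_sketch("Insecta", "family")
```

With a real source, `children_fun` would be whatever call returns the children of a taxon, which is the piece BOLD does not seem to expose.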

@sckott sckott added this to the v1.0 milestone Jan 17, 2020
@sckott
Contributor

sckott commented Jan 17, 2020

it seems like BOLD may follow Catalogue of Life taxonomy - I'm trying to get an answer on this

@sckott
Contributor

sckott commented Apr 20, 2020

They're definitely not getting back to me.

They do appear to list children on each of their taxon pages, so we can scrape the names, BUT scraping is super fragile, so I'm somewhat reluctant to put this code in a package. This should work as is:

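# `strextract` is not defined in the snippet as posted; this is a minimal
# base-R stand-in that returns the first match of `pattern` in `str`.
strextract <- function(str, pattern) regmatches(str, regexpr(pattern, str))

# Scrape the child taxa listed on a single BOLD taxon page.
bold_children_one <- function(id) {
  x <- crul::HttpClient$new(paste0("https://v4.boldsystems.org/index.php/Taxbrowser_Taxonpage?taxid=", id))
  res <- x$get()
  res$raise_for_status()
  html <- xml2::read_html(res$parse("UTF-8"))
  # Each <ol> inside the taxon box holds one group of child taxa
  nodes <- xml2::xml_find_all(html, '//div[@class = "row"]//div[@class = "ibox float-e-margins"]//ol')
  if (length(nodes) == 0) {
    message("no children found")
    return(tibble::tibble())
  }
  # Group headers (<lh>), when present, are used to name the list elements
  group_nmz <- xml2::xml_find_all(html, '//div[@class = "row"]//div[@class = "ibox float-e-margins"]//lh')
  bb <- lapply(nodes, bold_children_each_node)
  if (length(group_nmz) > 0) {
    # Strip "(count)" and whitespace from each header, then lowercase it
    lst_nmz <- tolower(gsub("\\([0-9]+\\)|\\s", "", xml2::xml_text(group_nmz)))
    bb <- stats::setNames(bb, lst_nmz)
  }
  return(bb)
}

# Turn the links inside one <ol> into a tibble of child names and taxon ids.
bold_children_each_node <- function(x) {
  out <- lapply(xml2::xml_find_all(x, ".//a"), function(w) {
    # Link text carries a trailing " [count]"; the href ends in the taxon id
    nm <- gsub("\\s\\[[0-9]+\\]$", "", xml2::xml_text(w))
    id <- strextract(xml2::xml_attr(w, "href"), "[0-9]+$")
    data.frame(name = nm, id = id, stringsAsFactors = FALSE)
  })
  tibble::as_tibble(data.table::rbindlist(out))
}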

# Osmia (genus): 253 children
bold_children_one(id = 4940)
# Momotus (genus): 3 children
bold_children_one(id = 88899)
# Momotus aequatorialis (species): no children
bold_children_one(id = 115130)
# Osmia sp1 (species): no children
bold_children_one(id = 293378)
# Arthropoda (phylum): 27 children
bold_children_one(id = 82)
# Psocodea (order): 51 children
bold_children_one(id = 737139)
# Megachilinae (subfamily): 2 groups (tribes: 3, genera: 60)
bold_children_one(id = 4962)
# Stelis (genus): 78 taxa
bold_children_one(id = 4952)
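Since the scraper depends entirely on three regular expressions, it may help to see them run offline. The strings below are made up in the shape the patterns imply (a trailing " [count]" on link text, a "(count)" on group headers, a taxon id at the end of the href); they are not copied from a BOLD page.

```r
# Made-up inputs in the shape the scraper's patterns imply
link_text    <- c("Osmia aglaia [12]", "Osmia sp. 1 [3]")
group_header <- c("Tribes (3)", "Genera (60)")
href         <- "/index.php/Taxbrowser_Taxonpage?taxid=4940"

# Drop the trailing " [count]" to leave the child name
child_names <- gsub("\\s\\[[0-9]+\\]$", "", link_text)

# Drop "(count)" and whitespace, then lowercase, to get list names
group_names <- tolower(gsub("\\([0-9]+\\)|\\s", "", group_header))

# Pull the trailing digits (the taxon id) out of the link target
taxon_id <- regmatches(href, regexpr("[0-9]+$", href))

child_names  # "Osmia aglaia" "Osmia sp. 1"
group_names  # "tribes" "genera"
taxon_id     # "4940"
```

If BOLD changes any of these text formats, the patterns stop matching, which is the fragility mentioned above.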

@sckott
Contributor

sckott commented Apr 20, 2020

@devonorourke ^^

@devonorourke
Author

I'm in support of whatever you advise. Agreed about the challenges of web scraping. Anything you need from me?

@sckott
Contributor

sckott commented Apr 20, 2020

@sckott sckott removed this from the v1.0 milestone May 1, 2020
@sckott sckott closed this as completed May 1, 2020