Time to switch to BOLDv5 #110

Open
nickschurch opened this issue Dec 4, 2024 · 10 comments

@nickschurch

nickschurch commented Dec 4, 2024

I'm trying to use this package to get a bunch of COI sequences from BOLD. I know the code works because it sometimes returns results, but mostly I'm plagued with intermittent `Warning: Content was type '' when it should've been type 'text/html; charset=utf-8'` errors or, for larger groups of sequences, an occasional `The request timed out, see 'If a request times out'.` message that returns partial output.

For example:

> y <- bold_tax_name("Archaeognatha")
Warning: Content was type '' when it should've been type 'text/html; charset=utf-8'

Is this normal? Is something wrong with the BOLD servers? Can I set the timeout length or the maximum number of sequences to return or something?

@salix-d
Collaborator

salix-d commented Dec 4, 2024

Is there a time when it happens more often?

I can try reaching out to them, because if it can happen when only asking for one species, it's definitely on their server side.
I know they were working on a new version of the API, might be related to that 🤔

@salix-d
Collaborator

salix-d commented Dec 4, 2024

Oooooooooooh, the new API is finally out! 😮
Had asked them to let me know so I could update 😢

Well, I'll check how much has changed and see how long it will take to update.

@salix-d salix-d self-assigned this Dec 4, 2024
@salix-d salix-d added API-related API-side problem needing work around critical Something to fix URGENTLY labels Dec 4, 2024
@salix-d salix-d added this to the v1.5 milestone Dec 4, 2024
@salix-d salix-d changed the title Nothing but errors and warnings.... Time to switch to BOLDv5 Dec 4, 2024
@nickschurch
Author

> Is there a time when it happens more often?
>
> I can try reaching out to them, because if it can happen when only asking for one species, it's definitely on their server side. I know they were working on a new version of the API, might be related to that 🤔

Yesterday afternoon was particularly bad, evening not so much and I managed to get most of what I was after (all COI-5p sequences for Animalia), but I'm pretty sure it's not all of it which is annoying for reproducibility.

@nickschurch
Author

nickschurch commented Dec 5, 2024

BOLD is really odd. When I use the web interface to get records for the order Diptera, it lists 6,660,909 records. When I try to get these via the R tool, it times out, but I can loop over all the families under Diptera and query those individually. When I do, I get 6,241,574 records, failures for two families (Heterocheilidae & Braulidae), and timeouts for three families (Perissommatidae, Neminidae, Teratomyzidae).

Searching for the two failed families on the website search interface returns 1 & 3 sequences respectively. Searching for the three families that are timing out returns no hits, despite the fact I know they are in the database. I think this is because there are no public records: following the "sub-taxa" links from Animalia > Arthropoda > Insecta > Diptera lists the families (e.g., https://v4.boldsystems.org/index.php/Taxbrowser_Taxonpage?taxid=532727), but the page says there is no public data available for this family.

Using this tool to try and pull data for this family (where there is no public data) with bold_seq("Perissommatidae", marker = "COI-5P") returns 5056 sequences before timing out and saying it has only returned partial results. The head of these results is weird though: the first one is Scyliorhinus canicula, a catshark(!!), not an insect at all. Something very weird is going on here.

See, this is why I hate this restricted kind of database and semi-closed-source system.

@salix-d
Collaborator

salix-d commented Dec 5, 2024

I see their taxonomy browser is still using v4. There was some inconsistency between the taxonomy API and the seq/specimen one. They might still be working on fixing those tbh.

Regardless, it's pretty weird that it returns an unrelated taxon. I thought maybe they share a rank name, but not even x_x
ACTUALLY, that's down to the way their API (v4) works! When there's no match for the taxon, it returns matches to the other parameters, in this case "marker=COI-5P", hence the huge result that makes no sense.
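
Because of that fallback, a result set can look successful while containing none of the requested taxon. A minimal sketch of a sanity check, assuming the result already carries taxonomy columns whose names end in `_name` (as in the script posted later in this thread). `filter_to_taxon` is a hypothetical helper, not part of the bold package, and the toy data is made up for illustration:

```r
# Hypothetical helper: keep only rows whose taxonomy mentions the requested taxon.
# Assumes `df` has one or more character columns whose names end in "_name".
filter_to_taxon <- function(df, taxon) {
  name_cols <- grep("_name$", names(df), value = TRUE)
  if (length(name_cols) == 0) stop("no *_name taxonomy columns found")
  # TRUE for a row if any taxonomy column equals the requested taxon
  hits <- Reduce(`|`, lapply(name_cols, function(col) {
    !is.na(df[[col]]) & df[[col]] == taxon
  }))
  kept <- df[hits, , drop = FALSE]
  if (nrow(kept) < nrow(df)) {
    warning(sprintf("%d of %d records do not belong to '%s'",
                    nrow(df) - nrow(kept), nrow(df), taxon))
  }
  kept
}

# Toy example data (illustrative only, not real BOLD output)
toy <- data.frame(
  ID = c("A1", "A2"),
  order_name = c("Carcharhiniformes", "Diptera"),
  family_name = c("Scyliorhinidae", "Perissommatidae"),
  stringsAsFactors = FALSE
)
perisso <- suppressWarnings(filter_to_taxon(toy, "Perissommatidae"))
```

The warning is deliberate: silently dropping rows would hide exactly the kind of mismatch reported above.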

@salix-d
Collaborator

salix-d commented Dec 5, 2024

Well, it seems they decided to do their own package https://github.com/boldsystems-central/BOLDconnectR/ ...

@salix-d
Collaborator

salix-d commented Dec 5, 2024

> Searching for the two failed families on the website search interface returns 1 & 3 sequences respectively. Searching for the three families that are timing out returns no hits, despite the fact I know they are in the database. I think this is because there are no public records: following the "sub-taxa" links from Animalia > Arthropoda > Insecta > Diptera lists the families (e.g., https://v4.boldsystems.org/index.php/Taxbrowser_Taxonpage?taxid=532727), but the page says there is no public data available for this family.

Well, on their new web site, there's only one specimen for that one.
https://portal.boldsystems.org/result?query=Perissommatidae[tax]

@salix-d
Collaborator

salix-d commented Dec 5, 2024

> ACTUALLY, that's down to the way their API (v4) works! When there's no match for the taxon, it returns matches to the other parameters, in this case "marker=COI-5P", hence the huge result that makes no sense.

That's partially why they redid the whole API.
If you do try their new package, I'd be curious to hear if it provides what you need from them.

@nickschurch
Author

I'll give it a go and let you know.

The code is working and eventually gives me useful sequences. But after iterating over different taxonomic levels and retrying multiple times to get around the BOLD database HTTP errors, I have no idea whether the result is complete. And if it's not complete, I've got no idea what's missing.
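
One way to make the gap visible is to compare per-taxon record counts against an independent expectation, for example counts read off the BOLD web portal. A rough sketch, where `expected` and `retrieved` are hypothetical named vectors of record counts and `compare_counts` is an illustrative helper rather than anything from the bold package. The Culicidae numbers are made up:

```r
# Hypothetical helper: compare expected vs retrieved record counts per taxon.
# Both arguments are named numeric vectors (names = taxon names).
compare_counts <- function(expected, retrieved) {
  taxa <- union(names(expected), names(retrieved))
  exp_n <- unname(expected[taxa])
  got_n <- unname(retrieved[taxa])
  # A taxon absent from one side counts as zero records on that side
  exp_n[is.na(exp_n)] <- 0
  got_n[is.na(got_n)] <- 0
  out <- data.frame(taxon = taxa, expected = exp_n, retrieved = got_n,
                    stringsAsFactors = FALSE)
  out$missing <- pmax(out$expected - out$retrieved, 0)
  # Largest shortfall first
  out[order(-out$missing), ]
}

# Illustrative numbers only
expected  <- c(Heterocheilidae = 1, Braulidae = 3, Culicidae = 1000)
retrieved <- c(Culicidae = 950)
report <- compare_counts(expected, retrieved)
```

This does not prove completeness, since the expected counts may themselves include non-public records, but it at least shows which taxa fall short.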

@nickschurch
Author

Update:

  1. No joy on using BOLDconnectR to get programmatic access. This requires an API key to use. There is an option for requesting this, but it's been more than a week and I've still not been given an API key from BOLD and their customer support teams are asking questions like "What do you want to use the API key for?". Not hopeful, and very much not open science!

  2. However... I'm continuing to get a wide range of failures from using this library too. Trying to get all the COI-5p for Animalia is essentially proving impossible.

My conclusion is that, at the moment, it's practically impossible to get hold of data at scale from BOLD.

P.S. For reference, and in case it's useful for anyone else, the script I'm using to attempt to get the data is:

# load packages

library(tidyverse)
library(bold)    # API interface to BOLD
library(taxize)  # for NCBI taxonomy lookup
library(seqinr)  # for FASTA output

# set NCBI access token
Sys.setenv(ENTREZ_KEY = "mykey")

get_ncbi_id <- function(row, taxonomy_map) {
  
  taxonomic_levels <- c( "phylum_name", "class_name", "order_name",
                         "family_name", "subfamily_name", "genus_name",
                         "species_name")
  
  for (level in rev(taxonomic_levels)) {
    if (level %in% names(row)){
      tax_value <- row[[level]]
      if (!is.na(tax_value)) {
        # keep only rows with a usable NCBI id so the if() below sees a single TRUE/FALSE
        match_row <- taxonomy_map[taxonomy_map$tax_name == tax_value &
                                    !is.na(taxonomy_map$ncbi_id), ]
        if (nrow(match_row) > 0) {
          return(match_row$ncbi_id[1]) # Return the first match
        }
      }
    }
  }
  return(NA) # Return NA if no match is found
}

retrieve_bold_data <- function(taxon_toplevel, taxon_deslevel, marker = "COI-5P", maxiter = 5) {
  
	# Get taxa at the requested level below the top-level taxon from NCBI taxonomy
	message(sprintf("Getting %s level taxon information for %s from NCBI...",
	                taxon_deslevel, taxon_toplevel))
	redo <- TRUE
	iteration <- 1 # Track retry iterations
		
	while (redo) {
		taxa <- tryCatch(
			{downstream(taxon_toplevel, db = "ncbi", downto = taxon_deslevel)},
			error = function(e) {
			  if (iteration > maxiter) stop("Maximum retries reached.")
				message(sprintf("%s: retrying (%i/%i).", conditionMessage(e), iteration, maxiter))
				NULL # Return NULL on error to avoid breaking the loop
			}
		)

		# Check if retrieval was successful
		if (!is.null(taxa)) {
			redo <- FALSE
			message(sprintf("   ... found %i entries.", nrow(taxa[[1]])))
		}
		iteration <- iteration + 1
	}
	if (is.null(taxa)) return(NULL) # end the function if we couldn't get the taxonomy

	# check if taxa present in BOLD
	message("Matching taxa to BOLD database taxonomy information...")
	redo <- TRUE
	iteration <- 1        # Track retry iterations
	
	while (redo) {
		bold.taxa <- tryCatch(
			{bold_tax_name(taxa[[1]]$childtaxa_name) %>%
					filter(!is.na(taxid)) %>%
					filter(tax_rank == taxon_deslevel) %>%
			    left_join(taxa[[1]], by = c("taxon" = "childtaxa_name")) %>% # not hard-coded to Animalia
			    select(-c(rank, input, tax_rank, tax_division, taxonrep,
			              specimenrecords, representitive_image.image,
			              representitive_image.apectratio)) %>%
			    rename(ncbi_taxid = "childtaxa_id")
			  },
			error = function(e) {
			  if (iteration > maxiter) stop("Maximum retries reached.")
			  message(sprintf("error: %s: retrying (%i/%i).", conditionMessage(e), iteration, maxiter))
				NULL # Return NULL on error to avoid breaking the loop
			},
			warning = function(w) {
			  if (iteration > maxiter) stop("Maximum retries reached.")
			  message(sprintf("warning: %s: retrying (%i/%i).", conditionMessage(w), iteration, maxiter))
				NULL # Return NULL on error to avoid breaking the loop
			}
		)

		# Check if retrieval was successful
		if (!is.null(bold.taxa)) {
			redo <- FALSE
			message(sprintf("   ... found %i matching BOLD entries.", nrow(bold.taxa)))
		}
		iteration <- iteration + 1
	}
	if (is.null(bold.taxa)) return(NULL) # end the function if we couldn't get the taxonomy
	# I note that all the above bullshit is required because BOLD and NCBI are both
	# a bunch of flakey bullshit
	
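	all_results <- list() # To store successful results
	iteration <- 1        # Track retry iterations
	failed_taxa <- bold.taxa$taxon
	rety_lower <- list()
	# Defined up front so the final list() does not fail if no taxon succeeds
	taxonomy_map <- NULL
	taxonomy_names <- NULL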

	# Loop until there are no failed taxa
	while (length(failed_taxa) > 0) {
		message(sprintf("Iteration: %i/%i - Taxa to process: %i", iteration,
		                maxiter, length(failed_taxa)))
		current_failed <- vector() # To track failures in this iteration

		for (taxon in failed_taxa) {
			message(sprintf("Processing taxon: %s", taxon))

			# Try fetching sequences with tryCatch
			result <- tryCatch(
				{bold_seq(taxon, marker = marker) %>%
				    rename(ID = "processid") %>%
				    bold_identify_taxonomy() %>%
				    select(-c(ends_with("taxID"),
				              identification, marker, accession))},
				warning = function(w) {
					# Handle warnings (e.g., log and mark taxon as failed)
					message(sprintf(" ... warning for taxon: %s - %s", taxon, conditionMessage(w)))
				  NULL # Return NULL on error to avoid breaking the loop
				},
				error = function(e) {
					# Handle errors if needed
					message(sprintf(" ... error for taxon: %s - %s", taxon, conditionMessage(e)))
				  NULL # Return NULL on error to avoid breaking the loop
				}
			)
			
			# Check the result and decide if taxon was successfully processed
			if (is.data.frame(result) && nrow(result) > 0) {
				message(sprintf("... got %i sequences", nrow(result)))
			  message("Matching taxonomy back to NCBI, where possible...")
        
			  # get unique taxonomy names from the dataset
			  taxonomy_names <- result %>%
			    select(ends_with("name")) %>%
			    unlist() %>%
			    unique() %>%
			    na.omit()
			  
			  # get ncbi ids for the names
			  taxonomy_map <- data.frame(tax_name = taxonomy_names) %>%
			    mutate(ncbi_id = get_uid(tax_name, ask = TRUE, messages = FALSE,
			                             division_filter = "Animalia"))
			  
			  # merge them back to the data frame, keeping only the most specific taxon
			  # id for each record (species, if the species is recognised)
			  result$ncbi_id <- apply(result, 1, get_ncbi_id, taxonomy_map = taxonomy_map)
				
			  # merge with all the other results
			  all_results[[taxon]] <- result
			} else if (is.null(result)) {
			  # if the call resulted in an error or warning, list for retry
				message(paste("... marking for retry at lower taxonomic level for ", taxon))
				rety_lower[taxon] <- taxon_deslevel
			} else {
				# If the call returned zero rows, or something else weird, add to the retry list
				message("... failed")
				current_failed <- c(current_failed, taxon)
			}
			
			# Pause briefly to avoid rate-limiting
			Sys.sleep(1)
		}

		# Update failed taxa for the next iteration
		failed_taxa <- current_failed
		iteration <- iteration + 1
		if (iteration > maxiter) {
		  message("Maximum iterations reached; stopping.")
		  break
		}
	}

	# Combine all results into a single data frame
	combined_results <- bind_rows(all_results)
	list(results = combined_results, failed_taxa = failed_taxa,
	     retry_lower = rety_lower, taxonomy_map = taxonomy_map,
	     taxonomy_names = taxonomy_names)
}

taxonomy_order <- c("class", "order", "family", "genus", "species")
try_list <- list(Animalia  = "class")
res_list <- list()
while (length(names(try_list)) > 0) {
  # iterate over a snapshot of the names, since try_list is modified inside the loop
  for (taxname in names(try_list)) {
    taxorder <- try_list[[taxname]]
    message(sprintf("processing %s with taxonomic level %s...",
                    taxname, taxorder))
    thistry <- tryCatch({test <- retrieve_bold_data(taxname, taxorder)
      try_list[[taxname]] <- NULL
      res_list[[taxname]] <- test
      if (length(names(test$retry_lower)) > 0){
        for (retryname in names(test$retry_lower)){
          retryorderid <- which(taxonomy_order == test$retry_lower[[retryname]]) + 1
          if (retryorderid <= length(taxonomy_order)){
            try_list[retryname] <- taxonomy_order[retryorderid]
          } else {
            stop("reached lowest taxonomic level")
          }
        }
      }},
      warning = function(w) {
        # Handle warnings (e.g., log and mark taxon as failed)
        message(" ... warnings, but keeping taxon in trylist")
        NULL # Return NULL on error to avoid breaking the loop
      },
      error = function(e) {
        # Handle errors if needed
        message(" ... errors, but keeping taxon in trylist")
        NULL # Return NULL on error to avoid breaking the loop
      })
  }
}

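# combine results from every processed taxon, not just the last call
all_seqs <- bind_rows(lapply(res_list, function(x) x$results))

final_results <- all_seqs %>%
  mutate(uniqueid = sprintf("BOLD_COI-5p_seq%07i", seq_len(n())),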
         concatenated_taxonomy = pmap_chr(
           select(., phylum_name, class_name, order_name, family_name,
                  subfamily_name, genus_name, species_name),
           ~ paste(na.omit(c(...)), collapse = ":")),
         fasta_header = sprintf("%s taxid=%s BOLDid=%s BOLD_taxonomy=%s",
                                uniqueid, ncbi_id, ID, concatenated_taxonomy))


# write the results to a CSV file
write.csv(final_results, "bold_sequences.csv", row.names = FALSE)

# write sequences to file
write.fasta(
  sequences = as.list(final_results$sequence),
  names = final_results$fasta_header,
  nbchar = 80,
  as.string = TRUE, # sequences are single strings; needed for nbchar wrapping
  file.out = "coi5p.fasta"
)
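
The near-identical retry loops in the script could be folded into one helper. A sketch only, using base R: `with_retry`, `max_tries` and `base_delay` are names invented here, and treating every warning as a failure mirrors what the script already does rather than any recommendation from BOLD. The `flaky` function is a toy stand-in for an API call:

```r
# Hypothetical helper: call `fn()` until it returns without error or warning,
# waiting base_delay * 2^(attempt - 1) seconds between attempts.
with_retry <- function(fn, max_tries = 5, base_delay = 1) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(
      list(ok = TRUE, value = fn()),
      error   = function(e) list(ok = FALSE, msg = conditionMessage(e)),
      warning = function(w) list(ok = FALSE, msg = conditionMessage(w))
    )
    if (result$ok) return(result$value)
    message(sprintf("attempt %d/%d failed: %s", attempt, max_tries, result$msg))
    if (attempt < max_tries) Sys.sleep(base_delay * 2^(attempt - 1))
  }
  stop("Maximum retries reached.")
}

# Toy stand-in for a flaky API call: fails twice, then succeeds
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("simulated timeout")
  "ok"
}
value <- with_retry(flaky, max_tries = 5, base_delay = 0)
```

In the script, a call such as `downstream(taxon_toplevel, db = "ncbi", downto = taxon_deslevel)` would be wrapped as `with_retry(function() downstream(...))`. Backing off exponentially instead of retrying immediately gives an overloaded server some room to recover.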
