To systematically collect, filter, and process cited scientific data from the Global Biodiversity Information Facility (GBIF) literature API. The focus is on obtaining literature that has cited GBIF data and ensuring only relevant and peer-reviewed sources are included.
- GBIF API: The GBIF literature API provides access to a database of biodiversity-related literature that cites GBIF. This includes journals, working papers, books, and book sections.
- Python Libraries:
  - `requests`: For making HTTP requests to the GBIF API.
  - `json`: For parsing and writing JSON data.
  - `tqdm`: For providing a progress bar during data retrieval.
  - `os`: For managing file and directory operations.
  - `zipfile`: For extracting and processing ZIP files.
  - `csv`: For handling CSV files.
  - `sys`: For adjusting system settings to handle large CSV files.
- Storage:
  - Local storage on the D: drive to manage large data files, including downloaded ZIP files, output CSVs, and error logs.
- API Endpoint: `https://api.gbif.org/v1/literature/search`
- Parameters:
  - `contentType`: Filters to "literature".
  - `literatureType`: Includes "JOURNAL", "WORKING_PAPER", "BOOK", and "BOOK_SECTION".
  - `relevance`: Filters to literature that is "GBIF_CITED".
  - `peerReview`: Ensures only peer-reviewed literature is included.
  - `limit`: Sets the number of records per request to 10.
  - `offset`: Starts the search from the first record.
- Fetch Initial Data:
  - Make an initial API request to determine the total number of available records (see the sketch below).
  - If the initial request fails or the count is unavailable, terminate the process.
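As a concrete starting point, the filters above can be expressed as a `requests` query. This is a minimal sketch; it assumes the standard GBIF search response shape with a top-level `count` field:

```python
import requests

BASE_URL = "https://api.gbif.org/v1/literature/search"

# The filters described above, expressed as query parameters.
params = {
    "contentType": "literature",
    "literatureType": ["JOURNAL", "WORKING_PAPER", "BOOK", "BOOK_SECTION"],
    "relevance": "GBIF_CITED",
    "peerReview": "true",
    "limit": 10,
    "offset": 0,
}

# Initial request: determine the total number of available records.
response = requests.get(BASE_URL, params=params)
response.raise_for_status()
total_count = response.json().get("count")
if total_count is None:
    raise SystemExit("Count unavailable; terminating.")
```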
- Iterative Data Retrieval:
  - Continuously request data from the API in batches of 10 records.
  - Filter records that contain a `gbifDownloadKey`, indicating that the literature has associated GBIF data downloads.
  - Update the `offset` parameter to fetch the next batch until all data is retrieved.
  - Save the filtered records with a `gbifDownloadKey` to a JSON file named `filtered_gbif_entries.json` for further processing (the loop is sketched below).
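A minimal sketch of the retrieval loop, reusing `BASE_URL`, `params`, and `total_count` from the sketch above; it assumes the response carries its records in a `results` list, as GBIF search endpoints generally do:

```python
import json

import requests
from tqdm import tqdm

filtered_entries = []
offset = 0

with tqdm(total=total_count) as pbar:
    while offset < total_count:
        params["offset"] = offset
        resp = requests.get(BASE_URL, params=params)
        resp.raise_for_status()
        results = resp.json().get("results", [])
        if not results:
            break  # no more records available

        # Keep only literature with associated GBIF data downloads.
        filtered_entries.extend(r for r in results if r.get("gbifDownloadKey"))

        offset += params["limit"]
        pbar.update(len(results))

# Persist the filtered records for the processing stage.
with open("filtered_gbif_entries.json", "w", encoding="utf-8") as f:
    json.dump(filtered_entries, f, indent=2)
```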
- Increase the field size limit for CSV processing to handle large entries, setting it to the maximum allowable size.
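One common recipe for this is shown below; `csv.field_size_limit(sys.maxsize)` raises `OverflowError` on some platforms, hence the back-off loop:

```python
import csv
import sys

# Raise the CSV field size limit as high as the platform allows.
max_int = sys.maxsize
while True:
    try:
        csv.field_size_limit(max_int)
        break
    except OverflowError:
        # The C long behind the limit is smaller on some platforms; back off.
        max_int = int(max_int / 10)
```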
Because the files are so large, the process is likely to be interrupted and need restarting; hence the need for a skip file.
- Load Processed DOIs:
  - Load the previously processed DOIs from a skip file to avoid reprocessing.
- Save Processed DOIs:
  - Append each processed DOI to the skip file to keep track of completed entries (see the sketch below).
- Load Downloaded Keys:
  - Load previously downloaded keys to avoid duplicate downloads.
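A minimal sketch of the skip-file bookkeeping; the file name `processed_dois.txt` is illustrative, not taken from the original script:

```python
import os

SKIP_FILE = "processed_dois.txt"  # illustrative name for the skip file

def load_processed_dois():
    """Return the set of DOIs already processed, or an empty set on a first run."""
    if not os.path.exists(SKIP_FILE):
        return set()
    with open(SKIP_FILE, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def mark_doi_processed(doi):
    """Append a completed DOI so the entry is skipped after a restart."""
    with open(SKIP_FILE, "a", encoding="utf-8") as f:
        f.write(doi + "\n")
```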
- Directory Management:
  - Ensure that all necessary directories for storing downloads, logs, and outputs exist.
- Download Data:
  - Download the data files associated with each `gbifDownloadKey` and save them as ZIP files in the specified directory.
- Unzip and Extract:
  - Unzip the downloaded files and check for the presence of relevant data (e.g., `occurrence.txt` or CSV files).
- Filter and Save Relevant Data:
  - Filter records for preserved specimens and append them to the output CSV file (see the combined sketch below).
  - Include columns such as `gbifID`, `year`, `countryCode`, `gbifDownloadKey`, and `doi`.
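The download, extraction, and filtering steps might fit together as below. This is a sketch only: it assumes the GBIF endpoint `https://api.gbif.org/v1/occurrence/download/request/{key}` serves the ZIP, that the archive contains a tab-delimited `occurrence.txt` with `basisOfRecord`, `gbifID`, `year`, and `countryCode` columns, and that the directory paths are illustrative:

```python
import csv
import os
import shutil
import zipfile

import requests

DOWNLOAD_DIR = "D:/gbif/downloads"   # illustrative paths on the D: drive
EXTRACT_DIR = "D:/gbif/extracted"
OUTPUT_CSV = "D:/gbif/output_data.csv"
os.makedirs(DOWNLOAD_DIR, exist_ok=True)
os.makedirs(EXTRACT_DIR, exist_ok=True)

def process_download_key(key, doi):
    """Download one GBIF dataset, keep preserved-specimen rows, then clean up."""
    zip_path = os.path.join(DOWNLOAD_DIR, f"{key}.zip")

    # Stream the ZIP to disk to avoid holding large files in memory.
    url = f"https://api.gbif.org/v1/occurrence/download/request/{key}"
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(zip_path, "wb") as f:
            for chunk in r.iter_content(chunk_size=1 << 20):
                f.write(chunk)

    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(EXTRACT_DIR)

    occ_path = os.path.join(EXTRACT_DIR, "occurrence.txt")
    if os.path.exists(occ_path):
        with open(occ_path, newline="", encoding="utf-8") as src, \
             open(OUTPUT_CSV, "a", newline="", encoding="utf-8") as dst:
            reader = csv.DictReader(src, delimiter="\t")
            writer = csv.writer(dst)
            for row in reader:
                # GBIF-interpreted downloads use the enum value PRESERVED_SPECIMEN.
                if row.get("basisOfRecord") == "PRESERVED_SPECIMEN":
                    writer.writerow([row.get("gbifID"), row.get("year"),
                                     row.get("countryCode"), key, doi])

    # Delete the ZIP and extracted contents to conserve storage space.
    os.remove(zip_path)
    shutil.rmtree(EXTRACT_DIR)
    os.makedirs(EXTRACT_DIR, exist_ok=True)
```

In the full workflow, each call would sit inside a try/except that appends failures to the error log and records the DOI in the skip file only on success.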
- Error Handling:
  - Log errors encountered during download or processing to an error log file for review.
- Cleanup:
  - After successful extraction and processing, delete the downloaded ZIP files and extracted contents to conserve storage space.
- Filtered Entries File: `filtered_gbif_entries.json` containing all relevant entries with a `gbifDownloadKey`.
- Output CSV File: `output_data.csv` with filtered and processed data of preserved specimens.
- Error Log File: `error_log.txt` documenting any errors encountered during processing.
To analyze and visualize the thematic topics present in the literature that references specimens in GBIF. It uses the GBIF Literature API to extract topics associated with each Digital Object Identifier (DOI) and constructs a network graph representing the co-occurrence of these topics.
- Python Libraries:
  - `requests`: For querying the GBIF Literature API.
  - `pandas`: For handling and processing CSV data.
  - `networkx`: For constructing and analyzing the topic co-occurrence network.
  - `matplotlib`: For visualizing the topic network graph.
  - `tqdm.notebook`: For providing progress bars during data processing.
- Data Input:
  - CSV File (`allDOIs.csv`): A CSV file containing a list of DOIs that reference GBIF data. This file is used as the source for querying the literature API.
  - `allDOIs.csv` is created from `output_data.csv` using awk (the file names in the command are placeholders; column 5 of the output CSV is the `doi` field): `awk -F',' '!seen[$5]++ { print $5 }' filename.csv > unique_dois.txt`
- Querying the GBIF Literature API:
  - A function `query_gbif_literature(doi)` is defined to fetch literature data from the GBIF API using a given DOI (sketched below).
  - The function returns the JSON response if the request is successful; otherwise, it prints an error message.
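A sketch of what this function might look like, assuming the literature search endpoint accepts `doi` as a filter parameter:

```python
import requests

def query_gbif_literature(doi):
    """Fetch literature metadata for a single DOI from the GBIF Literature API."""
    url = "https://api.gbif.org/v1/literature/search"
    response = requests.get(url, params={"doi": doi})
    if response.status_code == 200:
        return response.json()
    print(f"Error {response.status_code} while querying literature for DOI {doi}")
    return None
```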
- Extract Topics:
  - The function `extract_topics(data)` processes the API response to extract the thematic topics associated with each literature entry.
  - Topics are extracted if they exist in the results; otherwise, a message is printed indicating the absence of topics.
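A corresponding sketch, assuming each result in the response carries a `topics` list as GBIF literature records do:

```python
def extract_topics(data):
    """Collect the thematic topics from a GBIF Literature API response."""
    topics = []
    for result in data.get("results", []):
        if result.get("topics"):
            topics.extend(result["topics"])
        else:
            print("No topics found for this entry.")
    return topics
```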
- Read DOIs from CSV:
  - The script reads the `allDOIs.csv` file using `pandas` to obtain a list of DOIs for further processing.
- Topic Count and Co-occurrence Analysis:
  - The script iterates through each DOI, querying the GBIF API and extracting topics.
  - It maintains two dictionaries (see the sketch below):
    - `topic_counts`: Tracks the frequency of each unique topic.
    - `topic_cooccurrences`: Tracks how often pairs of topics co-occur within the same literature entry.
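The counting loop might look like the following sketch, which reuses the two functions above; it assumes `allDOIs.csv` carries a `doi` header, and counts each unordered topic pair once per entry via `itertools.combinations`:

```python
from collections import defaultdict
from itertools import combinations

import pandas as pd
from tqdm.notebook import tqdm

topic_counts = defaultdict(int)
topic_cooccurrences = defaultdict(int)

dois = pd.read_csv("allDOIs.csv")["doi"].dropna().tolist()

for doi in tqdm(dois):
    data = query_gbif_literature(doi)
    if not data:
        continue
    topics = set(extract_topics(data))  # de-duplicate within one entry

    for topic in topics:
        topic_counts[topic] += 1
    # Each sorted pair is counted once per literature entry.
    for a, b in combinations(sorted(topics), 2):
        topic_cooccurrences[(a, b)] += 1
```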
- Create Network Graph:
  - A network graph `G` is created using the `networkx` library.
  - Nodes: Represent unique topics. The size of each node is proportional to the count of the topic in the literature.
  - Edges: Represent co-occurrences between topics. The weight of each edge corresponds to the number of times the topics co-occurred.
- Add Nodes and Edges:
  - Nodes are added to the graph with attributes such as size and count.
  - Edges are added between topics based on their co-occurrence frequency.
  - The constructed topic network is saved in GraphML format (`topic_network.graphml`), which allows for further analysis and visualization using various graph tools (see the sketch below).
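A sketch of the graph construction, using the two dictionaries built above; the node size scaling factor is illustrative:

```python
import networkx as nx

G = nx.Graph()

# Nodes: one per topic, carrying its frequency as attributes.
for topic, count in topic_counts.items():
    G.add_node(topic, count=count, size=count * 100)  # scaling factor is illustrative

# Edges: one per co-occurring pair, weighted by co-occurrence frequency.
for (a, b), weight in topic_cooccurrences.items():
    G.add_edge(a, b, weight=weight)

# GraphML keeps the node and edge attributes for external tools (e.g., Gephi).
nx.write_graphml(G, "topic_network.graphml")
```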
- Visualize Network:
  - The graph is visualized using `matplotlib`.
  - The size of each node in the visualization is scaled according to its count attribute.
  - Nodes are colored, and edges are drawn with varying thickness based on their weights.
- Plot Configuration:
  - A spring layout is used to arrange the nodes for better visualization (see the sketch below).
  - Node size, font size, and colors are customized for readability.
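A sketch of the plot, assuming the `count` and `weight` attributes set during graph construction; the figure size, colors, and scaling factors are illustrative:

```python
import matplotlib.pyplot as plt
import networkx as nx

plt.figure(figsize=(12, 12))

# Spring layout spreads the nodes apart; a fixed seed keeps it reproducible.
pos = nx.spring_layout(G, seed=42)

node_sizes = [G.nodes[n]["count"] * 100 for n in G.nodes]    # scale by topic count
edge_widths = [G.edges[e]["weight"] * 0.5 for e in G.edges]  # scale by co-occurrence

nx.draw_networkx_nodes(G, pos, node_size=node_sizes, node_color="skyblue")
nx.draw_networkx_edges(G, pos, width=edge_widths, alpha=0.5)
nx.draw_networkx_labels(G, pos, font_size=8)

plt.axis("off")
plt.show()
```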
- Network Graph File: `topic_network.graphml` containing the constructed topic network with nodes and edges.
- Visual Plot: A visual representation of the topic network, optionally displayed, showing the structure and connections between topics.