biorxiv-retriever is a resilient wrapper for the Biorxiv API. It consists of two main classes: BiorxivDataGenerator and BiorxivRetriever. BiorxivDataGenerator uses resilient HTTP requests to build a dataset of the preprints available in Biorxiv. BiorxivRetriever is an API wrapper that supports calls to any of the services offered by the Biorxiv API.
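To give an idea of what "resilient" requests usually mean in practice, here is a minimal sketch of retrying with exponential backoff. It is an illustration only, not the code used inside biorxiv-retriever: the helper name `call_with_retries` and its parameters are hypothetical, and the real classes may rely on a different retry strategy.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    func: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on any exception with exponential backoff.

    Hypothetical helper for illustration; not part of biorxiv-retriever.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Wait 1x, 2x, 4x, ... the base delay before the next try.
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

Any network call can be wrapped this way, for example `call_with_retries(lambda: urllib.request.urlopen(url).read())`. The `sleep` argument is injectable so that the behaviour can be checked without actually waiting.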
Clone the repository and set up a Python virtual environment:
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
From the repository root, you can get CLI help for each command with:
# To use BiorxivRetriever
python -m src.cli.search.search --help
# To use BiorxivDataGenerator
python -m src.cli.create_data.create_data --help
Use the details service of the Biorxiv API to find all papers posted between 1 May 2022 and the current date.
python -m src.cli.search.search details biorxiv \
--start_date=2022-05-01
Run the same details query against Medrxiv.
python -m src.cli.search.search details medrxiv \
--start_date=2022-05-01
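For readers curious about what such a call maps to, the sketch below builds the request URL for the details service. It assumes the public API follows the layout `/details/[server]/[start]/[end]/[cursor]` under `https://api.biorxiv.org`; the function name `details_url` is hypothetical and is not part of biorxiv-retriever, which may construct its requests differently.

```python
from datetime import date
from typing import Optional

# Assumed base URL of the public API; verify against the official docs.
API_ROOT = "https://api.biorxiv.org"


def details_url(
    server: str,
    start_date: str,
    end_date: Optional[str] = None,
    cursor: int = 0,
) -> str:
    """Build a details-service URL (hypothetical helper for illustration)."""
    if server not in ("biorxiv", "medrxiv"):
        raise ValueError(f"unknown server: {server!r}")
    # Parsing validates the YYYY-MM-DD format; the end date defaults to today,
    # mirroring the CLI examples above where only --start_date is given.
    start = date.fromisoformat(start_date)
    end = date.fromisoformat(end_date) if end_date else date.today()
    if start > end:
        raise ValueError("start_date must not be after end_date")
    return f"{API_ROOT}/details/{server}/{start.isoformat()}/{end.isoformat()}/{cursor}"


print(details_url("medrxiv", "2022-05-01", "2022-05-31"))
```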
Search for article details by publisher. In this case, the publisher with the DOI prefix 10.15252.
python -m src.cli.search.search publisher biorxiv \
--prefix=10.15252 \
--start_date=2021-05-01
Show the monthly summary of content statistics in Biorxiv.
python -m src.cli.search.search sum biorxiv \
--interval=m
Get all the available metadata in Biorxiv since 4 May 2022 <(-_-)> may the force be with you.
python -m src.cli.create_data.create_data biorxiv \
--start_date=2022-05-04 \
--email=your.email@company.acme
Same as above for Medrxiv.
python -m src.cli.create_data.create_data medrxiv \
--start_date=2022-05-04 \
--email=your.email@company.acme
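Collecting every record in a date range means walking through the API's paged results. The following is a rough sketch of that loop, under the assumption that each response is a JSON object whose `collection` key holds the records of one page and that the cursor is the offset of the first record to return. The names `iter_records` and `fetch_page` are hypothetical; this is not the implementation used by BiorxivDataGenerator.

```python
from typing import Callable, Iterator


def iter_records(fetch_page: Callable[[int], dict]) -> Iterator[dict]:
    """Yield records page by page until an empty page is returned.

    fetch_page(cursor) is a caller-supplied function (hypothetical) that
    returns the decoded JSON response for the page starting at cursor.
    """
    cursor = 0
    while True:
        page = fetch_page(cursor)
        # Assumed response shape: {"collection": [record, ...], ...}
        records = page.get("collection", [])
        if not records:
            return  # an empty page marks the end of the results
        yield from records
        # Advance by the number of records actually received.
        cursor += len(records)
```

Passing the page fetcher in as an argument keeps the loop independent of the HTTP layer, so it can be combined with whatever retry logic is in use and exercised with fake data.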
Retrieve all the metadata available since 4 May 2022, together with the source XML text.
python -m src.cli.create_data.create_data biorxiv \
--start_date=2022-05-04 \
--email=your.email@company.acme \
--xml=True
The functionality of biorxiv-retriever can also be used from regular Python modules when needed. The last command above (the one with --xml=True) can be run from a Python script as follows:
from src.dataset_generator import BiorxivDataGenerator

data = BiorxivDataGenerator(
    start_date='2022-05-04',
    email='your.email@company.acme',
    xml=True,
)
# Calling the instance runs the generation, like the CLI command above
data()
If you are interested in downloading only the metadata first and want to fetch the source XML
files at a later stage, we provide the BiorxivDataGenerator.dl_source_xml
method.
It accepts the path to the JSON file containing the generated metadata and downloads the
corresponding source files.
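To illustrate the idea behind this two-stage workflow, here is a small sketch that reads a metadata file and lists the XML locations to fetch later. It rests on two assumptions that should be checked against your own output: that the metadata file is a JSON list of records, and that each record carries `doi` and `jatsxml` keys. The function `xml_urls_from_metadata` is hypothetical and does not show how dl_source_xml is implemented.

```python
import json
from pathlib import Path
from typing import Dict, Union


def xml_urls_from_metadata(path: Union[str, Path]) -> Dict[str, str]:
    """Map each DOI to its source XML URL (hypothetical helper).

    Assumes the file holds a JSON list of records with "doi" and "jatsxml"
    keys; records missing either key are skipped.
    """
    records = json.loads(Path(path).read_text(encoding="utf-8"))
    return {
        rec["doi"]: rec["jatsxml"]
        for rec in records
        if rec.get("doi") and rec.get("jatsxml")
    }
```

The resulting mapping can then be fed to any downloader at a convenient time, which is the same separation of steps that dl_source_xml offers.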