source-data/biorxiv-retreiver

biorxiv-retreiver

biorxiv-retriever is a resilient wrapper around the Biorxiv API. It consists of two main classes: BiorxivDataGenerator and BiorxivRetriever. BiorxivDataGenerator uses resilient HTTP requests to build a dataset of the preprints available in Biorxiv. BiorxivRetriever is an API wrapper that can call any of the services supported by the Biorxiv API.
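
The README does not show how the resilient requests are implemented. As a rough illustration only, a retry-with-backoff wrapper could look like the sketch below. The names `call_with_retries`, `fetch`, `retries`, `backoff` and `sleep` are hypothetical and not part of biorxiv-retriever.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(fetch: Callable[[], T],
                      retries: int = 3,
                      backoff: float = 0.5,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fetch(), retrying on OSError with exponential backoff.

    Hypothetical helper for illustration; not the package's own code.
    """
    if retries < 0:
        raise ValueError("retries must be >= 0")
    for attempt in range(retries + 1):
        try:
            return fetch()
        except OSError:
            # Give up once the retry budget is exhausted
            if attempt == retries:
                raise
            # Wait backoff, 2*backoff, 4*backoff, ... between attempts
            sleep(backoff * 2 ** attempt)
    raise AssertionError("unreachable")
```

Network errors raised by the standard library (for example `urllib.error.URLError` or `ConnectionError`) are subclasses of `OSError`, so any callable that performs an HTTP request with the standard library can be passed in as `fetch`.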

Installing biorxiv-retriever

Clone the repository and set up a Python virtual environment:

    python3 -m venv .venv
    source .venv/bin/activate
    pip install --upgrade pip
    pip install -r requirements.txt 

Using biorxiv-retriever from the CLI

From the repository root, you can get CLI help for each command using:

# To use BiorxivRetriever
python -m src.cli.search.search --help
# To use DatasetGenerator
python -m src.cli.create_data.create_data --help

Examples on using BiorxivRetriever

Use the details service of the Biorxiv API to find all papers between 1 May 2022 and the current date.

python -m src.cli.search.search details biorxiv \
        --start_date=2022-05-01

Same as the previous example, but with data from Medrxiv.

python -m src.cli.search.search details medrxiv \
        --start_date=2022-05-01
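
For orientation, the public Biorxiv API addresses the details service with URLs of the form `https://api.biorxiv.org/details/[server]/[start]/[end]/[cursor]`. The sketch below builds such a URL. The function `details_url` is a hypothetical helper, and whether biorxiv-retriever builds its requests this way internally is an assumption.

```python
from datetime import date
from typing import Optional

BASE_URL = "https://api.biorxiv.org"


def details_url(server: str, start_date: str,
                end_date: Optional[str] = None, cursor: int = 0) -> str:
    """Build a details-service URL for `server` ('biorxiv' or 'medrxiv').

    Hypothetical helper for illustration; not the package's own code.
    """
    if server not in ("biorxiv", "medrxiv"):
        raise ValueError(f"unknown server: {server}")
    # Default the end of the interval to the current date
    end = end_date or date.today().isoformat()
    return f"{BASE_URL}/details/{server}/{start_date}/{end}/{cursor}"
```

Calling `details_url("medrxiv", "2022-05-01")` would give the URL matching the Medrxiv example, with the interval ending on the current date.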

Search for details of article publishers, in this case the publisher with the DOI prefix 10.15252.

python -m src.cli.search.search publisher biorxiv \
        --prefix=10.15252 \
        --start_date=2021-05-01

Show the summary of content statistics in Biorxiv.

python -m src.cli.search.search sum biorxiv \
        --interval=m

Examples on using DatasetGenerator

Get all the available metadata in biorxiv since 4th May 2022 <(-_-)> may the force be with you.

python -m src.cli.create_data.create_data biorxiv \
      --start_date=2022-05-04 \
      --email=your.email@company.acme

Same as above for Medrxiv.

python -m src.cli.create_data.create_data medrxiv \
      --start_date=2022-05-04 \
      --email=your.email@company.acme
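
The Biorxiv API returns results in pages addressed by a cursor, so collecting a full date range means requesting page after page. The sketch below shows one way such a loop could work. `iter_records` and the injected `fetch_page` callable are hypothetical, and the assumption that each decoded response carries its records under a `"collection"` key is based on the public API, not on this package's code.

```python
from typing import Callable, Dict, Iterator


def iter_records(fetch_page: Callable[[int], Dict]) -> Iterator[Dict]:
    """Yield every record, advancing the cursor page by page.

    `fetch_page(cursor)` is a hypothetical callable returning one decoded
    JSON response with a "collection" list.
    """
    cursor = 0
    while True:
        page = fetch_page(cursor)
        records = page.get("collection", [])
        if not records:
            # An empty page means there is nothing left to read
            break
        yield from records
        # The next page starts right after the records seen so far
        cursor += len(records)
```

Passing the page fetcher in as an argument keeps the loop independent of the HTTP layer, so it can be exercised with canned responses.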

Retrieve all the metadata available since 4th May 2022, along with the source XML text.

python -m src.cli.create_data.create_data biorxiv \
      --start_date=2022-05-04 \
      --email=your.email@company.acme \
      --xml=True

Using biorxiv-retriever as a Python module

The functionality of biorxiv-retriever can also be imported as regular Python modules. The last CLI example above can be run from a Python script using:

from src.dataset_generator import BiorxivDataGenerator

# Same options as the last CLI example above
data = BiorxivDataGenerator(start_date='2022-05-04',
                            email='your.email@company.acme',
                            xml=True)
# Calling the instance runs the retrieval
data()

If you want to download only the metadata first and fetch the source XML files at a later stage, we provide the BiorxivDataGenerator.dl_source_xml method. It accepts the path to the JSON file with the generated metadata and downloads the corresponding source files.
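
The README does not document the layout of the metadata file. As a sketch of the second stage, the code below lists the XML links found in such a file. It assumes, hypothetically, that the file is a JSON list of records and that each record keeps the `jatsxml` link returned by the Biorxiv details service. The helper `xml_urls_from_metadata` is not part of the package, and the real file may be structured differently.

```python
import json
from pathlib import Path
from typing import List


def xml_urls_from_metadata(path: str) -> List[str]:
    """Return the source XML URLs listed in a metadata JSON file.

    Assumes (hypothetically) a JSON list of records with a "jatsxml" key.
    """
    records = json.loads(Path(path).read_text(encoding="utf-8"))
    # Skip records that carry no XML link
    return [rec["jatsxml"] for rec in records if rec.get("jatsxml")]
```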
