GAP-Replay: MediTron's Pre-Training corpus

This directory contains code to download and pre-process the GAP-Replay corpus.

MediTron’s domain-adaptive pre-training corpus GAP-Replay combines 48.1B tokens from four datasets:

  • Clinical Guidelines: a new dataset of 46K clinical practice guidelines from various healthcare-related sources,
  • Paper Abstracts: abstracts from 16.1M closed-access PubMed and PubMed Central papers,
  • Medical Papers: full-text articles extracted from 5M publicly available PubMed and PubMed Central papers,
  • Replay dataset: general-domain data distilled to compose 1% of the entire corpus.


1. Downloading GAP-Replay

To download all datasets and combine them into a single GAP-Replay corpus, run:

./download.sh

1.1. Downloading PubMed papers and abstracts

To download and pre-process PubMed papers and abstracts from the S2ORC API, run:

./pubmed/download.sh

1.2. Downloading Replay Data

To download and sub-sample replay data from the RedPajama-v1 dataset, run:

./replay/download.sh
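
Sub-sampling a very large corpus can be pictured as drawing a fixed-size uniform sample from a stream of documents that does not fit in memory. The sketch below uses reservoir sampling for this. It is an illustration only and not necessarily what replay/download.sh does; the function name `reservoir_sample` and its parameters are hypothetical.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Return up to k items drawn uniformly at random from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1),
            # overwriting a uniformly chosen slot.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # Toy stream standing in for documents read from disk.
    docs = (f"doc-{n}" for n in range(10_000))
    sample = reservoir_sample(docs, k=100)
    print(len(sample))
```

A single pass over the stream is enough, and memory use depends only on `k`, which is why this pattern is common when sampling from multi-terabyte dumps.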

1.3. Downloading Guidelines

Only 8 of the 16 clinical guideline sources allow redistribution: CCO, CDC, CMA, ICRC, NICE, SPOR, WHO and WikiDoc. For these 36K open-access articles, we release raw and clean versions of the data on the HuggingFace datasets hub.

from datasets import load_dataset

dataset = load_dataset("epfl-llm/guidelines")
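
Once loaded, the released guidelines can be narrowed down to a single source. The sketch below assumes each record carries a `source` field holding one of the source tags from the table of sources further down (for example `who`). That field name and the helper `filter_by_source` are assumptions for illustration, so check the dataset card for the actual schema. The helper accepts any iterable of dictionaries, so it can be tried without downloading anything.

```python
from typing import Dict, Iterable, List


def filter_by_source(records: Iterable[Dict], tag: str, field: str = "source") -> List[Dict]:
    """Keep only the records whose `field` equals the given source tag."""
    return [r for r in records if r.get(field) == tag]


if __name__ == "__main__":
    # Toy records standing in for dataset rows (contents are made up).
    rows = [
        {"source": "who", "title": "A"},
        {"source": "nice", "title": "B"},
        {"source": "who", "title": "C"},
    ]
    print([r["title"] for r in filter_by_source(rows, "who")])
```

With the real dataset, the call would look something like `filter_by_source(dataset["train"], "who")`, assuming the data lives in a split named `train`.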

To scrape all 16 sources, use our web scrapers and cleaning code in guidelines/. First, set up the dependencies:

# Install dependencies
pip install -r guidelines/requirements.txt

# Download spaCy's small English pipeline
python -m spacy download en_core_web_sm

# Install scipdf from GitHub to convert PDFs to text
pip install git+https://github.com/titipata/scipdf_parser

Then, to download and pre-process all 46K clinical practice guidelines, run:

./guidelines/download.sh

All sources of clinical practice guidelines supported by our scrapers are shown below.

Sources of clinical practice guidelines (CPGs)

| Source | Full Name | Source tag | Total guidelines | Total words | Audience | Released |
|---|---|---|---|---|---|---|
| AAFP | American Academy of Family Physicians | aafp | 50 | 9.4K | Doctor | No |
| CCO | Cancer Care Ontario | cco | 87 | 199K | Doctor | Yes |
| CDC | Centers for Disease Control and Prevention | cdc | 621 | 6.7M | Doctor | Yes |
| CMA | Canadian Medical Association | cma | 431 | 1.7M | Doctor | Yes |
| CPS | Canadian Paediatric Society | cps | 54 | 133K | Doctor | No |
| drugs.com | Drugs.com | drugs | 6548 | 4.1M | Both | No |
| GuidelineCentral | GuidelineCentral | gc | 1029 | 1M | Doctor | No |
| ICRC | International Committee of the Red Cross | icrc | 49 | 1.2M | Doctor | Yes |
| IDSA | Infectious Diseases Society of America | idsa | 47 | 646K | Doctor | No |
| MAGIC | Making GRADE The Irresistible Choice | magic | 52 | 415K | Doctor | No |
| MayoClinic | MayoClinic | mayo | 1100 | 2.2M | Patient | No |
| NICE | National Institute for Health and Care Excellence | nice | 1656 | 8.1M | Doctor | Yes |
| RCH | Royal Children's Hospital Melbourne | rch | 384 | 410K | Doctor | No |
| SPOR | Strategy for Patient-Oriented Research | spor | 217 | 1.1M | Doctor | Yes |
| WHO | World Health Organization | who | 223 | 3.1M | Both | Yes |
| WikiDoc | WikiDoc | wikidoc | 33058 | 34M | Both | Yes |
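
The rounded figures quoted earlier (46K guidelines overall, 36K of them released) can be checked against the per-source counts in the table above. The sketch below copies the source tags, guideline counts and release flags from that table. The names `SOURCES` and `count_guidelines` are introduced here for illustration only.

```python
from typing import List, Tuple

# (source tag, total guidelines, released?) copied from the table of sources.
SOURCES: List[Tuple[str, int, bool]] = [
    ("aafp", 50, False),
    ("cco", 87, True),
    ("cdc", 621, True),
    ("cma", 431, True),
    ("cps", 54, False),
    ("drugs", 6548, False),
    ("gc", 1029, False),
    ("icrc", 49, True),
    ("idsa", 47, False),
    ("magic", 52, False),
    ("mayo", 1100, False),
    ("nice", 1656, True),
    ("rch", 384, False),
    ("spor", 217, True),
    ("who", 223, True),
    ("wikidoc", 33058, True),
]


def count_guidelines(sources: List[Tuple[str, int, bool]], released_only: bool = False) -> int:
    """Sum guideline counts, optionally restricted to redistributable sources."""
    return sum(n for _, n, released in sources if released or not released_only)


if __name__ == "__main__":
    print("all sources:", count_guidelines(SOURCES))
    print("released only:", count_guidelines(SOURCES, released_only=True))
```

The totals come to 45,606 guidelines across all 16 sources and 36,342 across the 8 released ones, matching the rounded 46K and 36K.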

NOTE: The endpoints or data shape of some of the sources may have changed since we scraped them, so the scrapers may be outdated.