GPT Pre-training Data #3

Open
kibitzing opened this issue Jun 19, 2024 · 4 comments

kibitzing commented Jun 19, 2024

GPT-3 data mix

[Screenshot: GPT-3 training data mix table]
  • Datasets are not sampled in proportion to their size
  • Datasets we view as higher-quality are sampled more frequently
    • WebText2, Books1, and Wikipedia are sampled 2-3 times over the course of training.
  • (Relatively) lower-quality datasets like CommonCrawl and Books2 are sampled less than once during training (see the sampling sketch below)
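
A minimal sketch of what "sampled more/less frequently" means in practice: each training document is drawn from a source chosen by fixed mixture weights rather than in proportion to dataset size. The weights below are the approximate per-source fractions reported in the GPT-3 paper; `sample_source` is an illustrative helper, not the paper's actual sampling code.

```python
import random

# Approximate GPT-3 mixture weights (fraction of training data drawn from each
# source). Note they are NOT proportional to raw dataset sizes: CommonCrawl is
# by far the largest corpus but is down-weighted, while WebText2 / Books1 /
# Wikipedia are up-weighted (hence seen 2-3 times during training).
MIX_WEIGHTS = {
    "common_crawl": 0.60,
    "webtext2":     0.22,
    "books1":       0.08,
    "books2":       0.08,
    "wikipedia":    0.03,
}

def sample_source(rng: random.Random) -> str:
    """Pick the source dataset for the next training document."""
    names, weights = zip(*MIX_WEIGHTS.items())
    # random.choices normalizes the weights, so they need not sum to exactly 1
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {name: 0 for name in MIX_WEIGHTS}
for _ in range(10_000):
    counts[sample_source(rng)] += 1
print(counts)  # counts follow MIX_WEIGHTS, not the datasets' sizes
```
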

High-quality datasets

  • an expanded version of the WebText dataset [RWC+19], collected by scraping links over a longer period of time and first described in [KMH+20]
  • two internet-based books corpora (Books1 and Books2)
  • English-language Wikipedia.

Data preparation process

  1. We downloaded and filtered a version of CommonCrawl based on similarity to a range of high-quality reference corpora
  2. We performed fuzzy deduplication at the document level, within and across datasets, to prevent redundancy and preserve the integrity of our held-out validation set as an accurate measure of overfitting
  3. We also added known high-quality reference corpora to the training mix to augment CommonCrawl and increase its diversity
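
Neither the issue nor the paper spells out how the fuzzy document-level deduplication was implemented. A common approach is MinHash signatures plus locality-sensitive hashing; the sketch below assumes the third-party `datasketch` library, and the `minhash` / `dedupe` helpers are illustrative rather than GPT-3's actual pipeline.

```python
from datasketch import MinHash, MinHashLSH  # pip install datasketch

def minhash(text: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash signature from 5-token shingles of a document."""
    sig = MinHash(num_perm=num_perm)
    tokens = text.lower().split()
    for i in range(max(len(tokens) - 4, 1)):
        sig.update(" ".join(tokens[i:i + 5]).encode("utf-8"))
    return sig

def dedupe(docs: dict[str, str], threshold: float = 0.8) -> list[str]:
    """Return the ids of documents kept after near-duplicate filtering."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for doc_id, text in docs.items():
        sig = minhash(text)
        if lsh.query(sig):      # a near-duplicate was already kept
            continue            # -> drop this document
        lsh.insert(doc_id, sig)
        kept.append(doc_id)
    return kept

docs = {
    "a": "the quick brown fox jumps over the lazy dog near the river bank",
    "b": "the quick brown fox jumps over the lazy dog near the river bank today",
    "c": "completely different text about language model pre-training data",
}
print(dedupe(docs))  # "b" should be dropped as a near-duplicate of "a"
```
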
kibitzing self-assigned this Jun 19, 2024

Common Crawl

  • downloaded from 41 shards of monthly CommonCrawl covering 2016 to 2019
  • Constituting 45TB of compressed plaintext before filtering
  • 570GB after filtering (heavily filtered: only about 1.27% retained)
  • Roughly equivalent to 400 billion byte-pair-encoded tokens
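
A quick sanity check of the "only 1.27%" retention figure above (assuming decimal units, i.e. 1 TB = 1000 GB):

```python
raw_tb = 45        # compressed plaintext before filtering, in TB
filtered_gb = 570  # plaintext kept after filtering, in GB

retention = filtered_gb / (raw_tb * 1000)
print(f"{retention:.2%}")  # -> 1.27%
```
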

kibitzing commented Jun 24, 2024

WebText 1 (from Language Models are Unsupervised Multitask Learners)

  • Our approach motivates building as large and diverse a dataset as possible in order to collect natural language demonstrations of tasks in as varied of domains and contexts as possible.
  • Common Crawl is large and diverse, but it has significant data quality issues.
  • To improve document quality, we only scraped web pages which have been curated/filtered by humans.
    • We scraped all outbound links from Reddit, a social media platform, which received at least 3 karma.
      • This can be thought of as a heuristic indicator for whether other users found the link interesting, educational, or just funny.
  • The resulting dataset, WebText, contains the text subset of these 45 million links.
  • We removed all Wikipedia documents
  • All results presented in this paper use a preliminary version of WebText
    • does not include links created after Dec 2017
    • contains slightly over 8 million documents for a total of 40 GB of text after de-duplication and some heuristic based cleaning
  • To extract the text from HTML responses we use a combination of the Dragnet and Newspaper content extractors.
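
The paper does not say how Dragnet and Newspaper were combined. The sketch below runs Newspaper3k's `Article` and Dragnet on the same HTML response and keeps the longer extraction; the "keep the longer output" rule and the `extract_text` helper are assumptions for illustration only.

```python
from newspaper import Article          # pip install newspaper3k
from dragnet import extract_content    # pip install dragnet

def extract_text(url: str, html: str) -> str:
    """Extract the main body text from an already-downloaded HTML response."""
    # Newspaper3k extraction
    article = Article(url)
    article.download(input_html=html)  # reuse the fetched HTML instead of re-downloading
    article.parse()
    newspaper_text = article.text

    # Dragnet extraction on the same HTML
    dragnet_text = extract_content(html)

    # How the two extractors were "combined" for WebText is unspecified;
    # preferring the longer output is just one plausible heuristic.
    return max(newspaper_text, dragnet_text, key=len)
```
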

WebText 2 (from Scaling Laws for Neural Language Models)

  • An extended version of WebText 1
    • + Outbound Reddit links from the period of January to October 2018 also with a minimum of 3 karma.
  • The text of the new links was extracted with the Newspaper3k python library.
  • In total, the dataset consists of
    • 20.3M documents containing 96 GB of text and 16.2B words (as defined by wc).
    • We then apply the reversible tokenizer described in [RWC+19], which yields 22.9B tokens.
  • We reserve 660M tokens for use as a test set
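
The "reversible tokenizer described in [RWC+19]" is the GPT-2 byte-level BPE. A small sketch of the word-count vs. token-count distinction above, assuming the `tiktoken` package as a stand-in implementation of that tokenizer:

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("gpt2")  # GPT-2 byte-level BPE

text = "WebText2 extends WebText with Reddit outbound links from 2018."

n_words = len(text.split())   # word count in the spirit of `wc -w`
token_ids = enc.encode(text)

print(n_words, len(token_ids))        # BPE yields more tokens than words (cf. 16.2B words -> 22.9B tokens)
assert enc.decode(token_ids) == text  # "reversible": decoding restores the exact text
```
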
