GPT Pre-training Data #3

Open
kibitzing opened this issue Jun 19, 2024 · 4 comments

kibitzing commented Jun 19, 2024

GPT-3 data mix

[Screenshot: GPT-3 training data mix table]
  • Datasets are not sampled in proportion to their size
  • Datasets we view as higher-quality are sampled more frequently
    • WebText2, Books1, and Wikipedia are sampled 2-3 times over the course of training.
  • (Relatively) lower-quality datasets like CommonCrawl and Books2 are sampled less than once during training (see the sampling sketch below)
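
A minimal sketch of what "sampled more/less frequently" means in practice: each training document is drawn from a source chosen by fixed mixture weights rather than in proportion to dataset size. The weights below are the approximate per-source fractions reported in the GPT-3 paper; `sample_source` is an illustrative helper, not the paper's actual sampling code.

```python
import random

# Approximate GPT-3 mixture weights (fraction of training data drawn from each
# source). Note they are NOT proportional to raw dataset sizes: CommonCrawl is
# by far the largest corpus but is down-weighted, while WebText2 / Books1 /
# Wikipedia are up-weighted (hence seen 2-3 times during training).
MIX_WEIGHTS = {
    "common_crawl": 0.60,
    "webtext2":     0.22,
    "books1":       0.08,
    "books2":       0.08,
    "wikipedia":    0.03,
}

def sample_source(rng: random.Random) -> str:
    """Pick the source dataset for the next training document."""
    names, weights = zip(*MIX_WEIGHTS.items())
    # random.choices normalizes the weights, so they need not sum to exactly 1
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {name: 0 for name in MIX_WEIGHTS}
for _ in range(10_000):
    counts[sample_source(rng)] += 1
print(counts)  # counts follow MIX_WEIGHTS, not the datasets' sizes
```
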

High-quality datasets

  • an expanded version of the WebText dataset [RWC+19], collected by scraping links over a longer period of time and first described in [KMH+20]
  • two internet-based books corpora (Books1 and Books2)
  • English-language Wikipedia.

Data preparation process

  1. We downloaded and filtered a version of CommonCrawl based on similarity to a range of high-quality reference corpora
  2. We performed fuzzy deduplication at the document level, within and across datasets, to prevent redundancy and preserve the integrity of our held-out validation set as an accurate measure of overfitting
  3. We also added known high-quality reference corpora to the training mix to augment CommonCrawl and increase its diversity
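
Neither the issue nor the paper spells out how the fuzzy document-level deduplication was implemented. A common approach is MinHash signatures plus locality-sensitive hashing; the sketch below assumes the third-party `datasketch` library, and the `minhash` / `dedupe` helpers are illustrative rather than GPT-3's actual pipeline.

```python
from datasketch import MinHash, MinHashLSH  # pip install datasketch

def minhash(text: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash signature from 5-token shingles of a document."""
    sig = MinHash(num_perm=num_perm)
    tokens = text.lower().split()
    for i in range(max(len(tokens) - 4, 1)):
        sig.update(" ".join(tokens[i:i + 5]).encode("utf-8"))
    return sig

def dedupe(docs: dict[str, str], threshold: float = 0.8) -> list[str]:
    """Return the ids of documents kept after near-duplicate filtering."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for doc_id, text in docs.items():
        sig = minhash(text)
        if lsh.query(sig):      # a near-duplicate was already kept
            continue            # -> drop this document
        lsh.insert(doc_id, sig)
        kept.append(doc_id)
    return kept

docs = {
    "a": "the quick brown fox jumps over the lazy dog near the river bank",
    "b": "the quick brown fox jumps over the lazy dog near the river bank today",
    "c": "completely different text about language model pre-training data",
}
print(dedupe(docs))  # "b" should be dropped as a near-duplicate of "a"
```
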
kibitzing self-assigned this Jun 19, 2024

Common Crawl

  • downloaded from 41 shards of monthly CommonCrawl covering 2016 to 2019
  • Constituting 45TB of compressed plaintext before filtering
  • 570GB after filtering (heavily filtered: only about 1.27% retained)
  • Roughly equivalent to 400 billion byte-pair-encoded tokens
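
A quick sanity check of the "only 1.27%" retention figure above (assuming decimal units, i.e. 1 TB = 1000 GB):

```python
raw_tb = 45        # compressed plaintext before filtering, in TB
filtered_gb = 570  # plaintext kept after filtering, in GB

retention = filtered_gb / (raw_tb * 1000)
print(f"{retention:.2%}")  # -> 1.27%
```
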

kibitzing commented Jun 24, 2024

WebText 1 (from Language Models are Unsupervised Multitask Learners)

  • Our approach motivates building as large and diverse a dataset as possible in order to collect natural language demonstrations of tasks in as varied of domains and contexts as possible.
  • Common Crawl is large and diverse, but it has significant data quality issues.
  • To improve document quality, we only scraped web pages which have been curated/filtered by humans.
    • We scraped all outbound links from Reddit, a social media platform, which received at least 3 karma.
      • This can be thought of as a heuristic indicator for whether other users found the link interesting, educational, or just funny.
  • The resulting dataset, WebText, contains the text subset of these 45 million links.
  • We removed all Wikipedia documents
  • All results presented in this paper use a preliminary version of WebText
    • does not include links created after Dec 2017
    • contains slightly over 8 million documents for a total of 40 GB of text after de-duplication and some heuristic based cleaning
  • To extract the text from HTML responses we use a combination of the Dragnet and Newspaper content extractors.
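
The paper does not say how Dragnet and Newspaper were combined. The sketch below runs Newspaper3k's `Article` and Dragnet on the same HTML response and keeps the longer extraction; the "keep the longer output" rule and the `extract_text` helper are assumptions for illustration only.

```python
from newspaper import Article          # pip install newspaper3k
from dragnet import extract_content    # pip install dragnet

def extract_text(url: str, html: str) -> str:
    """Extract the main body text from an already-downloaded HTML response."""
    # Newspaper3k extraction
    article = Article(url)
    article.download(input_html=html)  # reuse the fetched HTML instead of re-downloading
    article.parse()
    newspaper_text = article.text

    # Dragnet extraction on the same HTML
    dragnet_text = extract_content(html)

    # How the two extractors were "combined" for WebText is unspecified;
    # preferring the longer output is just one plausible heuristic.
    return max(newspaper_text, dragnet_text, key=len)
```
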

WebText 2 (from Scaling Laws for Neural Language Models)

  • An extended version of WebText 1
    • + Outbound Reddit links from the period of January to October 2018 also with a minimum of 3 karma.
  • The text of the new links was extracted with the Newspaper3k python library.
  • In total, the dataset consists of
    • 20.3M documents containing 96 GB of text and 16.2B words (as defined by wc).
    • We then apply the reversible tokenizer described in [RWC+19], which yields 22.9B tokens.
  • We reserve 660M tokens for use as a test set
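
The "reversible tokenizer described in [RWC+19]" is the GPT-2 byte-level BPE. A small sketch of the word-count vs. token-count distinction above, assuming the `tiktoken` package as a stand-in implementation of that tokenizer:

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("gpt2")  # GPT-2 byte-level BPE

text = "WebText2 extends WebText with Reddit outbound links from 2018."

n_words = len(text.split())   # word count in the spirit of `wc -w`
token_ids = enc.encode(text)

print(n_words, len(token_ids))        # BPE yields more tokens than words (cf. 16.2B words -> 22.9B tokens)
assert enc.decode(token_ids) == text  # "reversible": decoding restores the exact text
```
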
