An implementation of Byte-Pair Encoding (BPE) tokenization on a sample of the TinyStories dataset.
The main components are split into individual files for easier understanding.
pre_tokenize.py
- Pre-tokenizes the text into tokens. Ensures that the text is split at the correct boundaries (special tokens). Uses the GPT-2 regex pattern for pre-tokenization. Stores the frequency of the pre-tokenized tokens in a pickle file.
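
A minimal sketch of this step, assuming a single "<|endoftext|>" special token; the file paths and the pretoken_counts.pkl output name are illustrative, not necessarily the repo's exact API:

import pickle
from collections import Counter
import regex as re  # the GPT-2 pattern needs `regex` for \p{L}/\p{N}; stdlib `re` won't do

# GPT-2 pre-tokenization pattern (as used by tiktoken)
GPT2_PAT = re.compile(
    r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
)

def pretokenize(text, special_tokens=("<|endoftext|>",)):
    # Split on special tokens first so no pre-token (and later no merge)
    # ever crosses a document boundary, then count matches per chunk.
    counts = Counter()
    boundary = "|".join(re.escape(t) for t in special_tokens)
    for chunk in re.split(boundary, text):
        counts.update(m.group() for m in GPT2_PAT.finditer(chunk))
    return counts

if __name__ == "__main__":
    with open("data/TinyStoriesV2-GPT4-valid.txt", encoding="utf-8") as f:
        counts = pretokenize(f.read())
    with open("pretoken_counts.pkl", "wb") as f:
        pickle.dump(counts, f)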
train.py
- Trains the BPE model on the pre-tokenized text. Stores the vocabulary and merges in pickle files.
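
A minimal sketch of the merge loop, assuming the frequency table pickled above; the tie-breaking rule, special-token handling, and number of merges in the real script may differ:

import pickle
from collections import Counter

def train_bpe(pretoken_counts, num_merges):
    # Each pre-token starts as a tuple of single-byte tokens.
    words = {tuple(bytes([b]) for b in w.encode("utf-8")): c
             for w, c in pretoken_counts.items()}
    vocab = {i: bytes([i]) for i in range(256)}  # base vocabulary: all bytes
    merges = []
    for _ in range(num_merges):
        # Count adjacent pairs, weighted by pre-token frequency.
        pairs = Counter()
        for word, count in words.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties broken arbitrarily here
        merges.append(best)
        vocab[len(vocab)] = best[0] + best[1]
        # Replace every occurrence of the winning pair with the merged token.
        new_words = {}
        for word, count in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(best[0] + best[1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            key = tuple(out)
            new_words[key] = new_words.get(key, 0) + count
        words = new_words
    return vocab, merges

if __name__ == "__main__":
    with open("pretoken_counts.pkl", "rb") as f:
        counts = pickle.load(f)
    vocab, merges = train_bpe(counts, num_merges=1000)
    with open("vocab.pkl", "wb") as f:
        pickle.dump(vocab, f)
    with open("merges.pkl", "wb") as f:
        pickle.dump(merges, f)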
encode_decode.py
- Encodes and decodes text with the BPE model, using the stored vocabulary and merges.
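
A minimal sketch of encoding and decoding, assuming the vocab.pkl and merges.pkl written by the training step; a full encoder would first run the same GPT-2 pre-tokenization on the input and encode each pre-token separately:

import pickle

def encode_word(word, merges, token_to_id):
    # Start from single bytes and apply the learned merges in training order.
    parts = [bytes([b]) for b in word.encode("utf-8")]
    for a, b in merges:
        out, i = [], 0
        while i < len(parts):
            if i + 1 < len(parts) and parts[i] == a and parts[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(parts[i])
                i += 1
        parts = out
    return [token_to_id[p] for p in parts]

def decode(ids, vocab):
    # Concatenate token bytes, then decode; errors="replace" guards against
    # id slices whose bytes don't end on a UTF-8 boundary.
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")

if __name__ == "__main__":
    with open("vocab.pkl", "rb") as f:
        vocab = pickle.load(f)
    with open("merges.pkl", "rb") as f:
        merges = pickle.load(f)
    token_to_id = {tok: i for i, tok in vocab.items()}
    ids = encode_word(" hello", merges, token_to_id)
    print(ids, decode(ids, vocab))

To run the pipeline end to end: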
python pre_tokenize.py
python train.py
python encode_decode.py
Instructions from the Stanford CS336 assignment:
mkdir -p data
cd data
wget https://huggingface.co/datasets/roneneldan/TinyStories/resolve/main/TinyStoriesV2-GPT4-train.txt
wget https://huggingface.co/datasets/roneneldan/TinyStories/resolve/main/TinyStoriesV2-GPT4-valid.txt
wget https://huggingface.co/datasets/stanford-cs336/owt-sample/resolve/main/owt_train.txt.gz
gunzip owt_train.txt.gz
wget https://huggingface.co/datasets/stanford-cs336/owt-sample/resolve/main/owt_valid.txt.gz
gunzip owt_valid.txt.gz
cd ..