Add BPE tokenizers #62

angeloskath · 2024-04-18T11:15:59Z

This PR is stacked on top of #61; I will rebase to simplify the diff once that is merged. Please ignore the "replace"-related code here.

This PR adds support for BPE tokenizers. It extends the Trie with new functionality and adds a BPEMerges data structure, which is a thin wrapper around a map of maps. There are no backwards-incompatible changes, with one exception: read_trie_from_spm no longer changes the space character, which brings its tokenizations closer to SPM's even when not using BPE.

Trie

Most things now work with iterators internally, which removes a bunch of std::vector<char> creations and copies.
Add search_longest_prefix, which the Trie is a natural fit for.
Add the ability to set the id when inserting.
Change the vector that holds the keys to an unordered_map to support the above.

BPE

BPETokenizer::tokenize is the most interesting function. It is not the prettiest implementation, but it is pretty fast and beats SPM on my laptop. There is possible room for improvement in lines 135-160, where we search for neighbors with a linear search.
read_bpe_from_spm, ironically, implements a small BPE in Python to extract the merges from the file.
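
For readers unfamiliar with BPE, the merge loop can be sketched in a few lines of Python. This is a deliberately naive version, not BPETokenizer::tokenize: it rescans all adjacent pairs on every step, and the shape of the merge table (left symbol -> right symbol -> `(rank, merged)`, with lower rank merged first) is an assumption about how a map of maps could be used.

```python
def bpe_tokenize(symbols, merges):
    """Greedy BPE over a sequence of symbols.

    merges is a map of maps: merges[left][right] == (rank, merged).
    Lower rank means the merge is applied earlier (assumed convention).
    """
    symbols = list(symbols)
    while len(symbols) > 1:
        best = None  # (rank, position, merged symbol)
        for i in range(len(symbols) - 1):
            entry = merges.get(symbols[i], {}).get(symbols[i + 1])
            # Strict comparison keeps the leftmost pair on rank ties.
            if entry is not None and (best is None or entry[0] < best[0]):
                best = (entry[0], i, entry[1])
        if best is None:
            break  # no adjacent pair can be merged any more
        _, i, merged = best
        symbols[i:i + 2] = [merged]
    return symbols


# Toy merge table, made up for the example.
merges = {
    "l": {"o": (0, "lo")},
    "lo": {"w": (1, "low")},
    "e": {"r": (2, "er")},
}
print(bpe_tokenize("lower", merges))  # ['low', 'er']
```

Each merge here costs a full scan, so the loop is quadratic in the input length. A faster tokenizer avoids that rescan, for example by keeping candidate merges in a priority queue.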

TL;DR

The following implements SPM tokenization, so far with results identical to those of SPM or HF.

symbols, merges = read_bpe_from_spm("tokenizer.model")
ds = (
    ds
    .pad("text", 0, 1, 0, ord(" "))
    .replace("text", " ", "\u2581")
    .tokenize_bpe("text", symbols, merges)
)
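
As a rough guide to what the first two steps of that pipeline do to the text before BPE runs, here is a plain-Python equivalent. It assumes that the pad call prepends exactly one space and that replace substitutes every space with U+2581; `spm_preprocess` is a hypothetical helper name, not part of the library.

```python
def spm_preprocess(text):
    # Prepend a single space, mirroring the assumed effect of
    # .pad("text", 0, 1, 0, ord(" ")) on the character buffer.
    padded = " " + text
    # Replace every space with U+2581, the SentencePiece word-boundary marker.
    return padded.replace(" ", "\u2581")


print(spm_preprocess("hello world"))  # ▁hello▁world
```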

angeloskath added 6 commits April 16, 2024 01:28

Add a replace operation

bea27c5

Start standardizing the SPM tokenization

3df150d

Add a seemingly working bpe tokenizer

06add11

Write BPE reading from spm model and fix BPETokenizer

4ebe4f5

Add some docs and a small test

596ba14

Add the BPE tokenize op

4e61bbb

angeloskath requested a review from andresy April 18, 2024 11:16

angeloskath mentioned this pull request Apr 18, 2024

A draft implementation of BPE tokenizer #39

Closed

Add pointers to left and right in symbol

17d4960
