This PR is on top of #61 and I will rebase to simplify the diff once that is merged. Please ignore "replace" related code here.
The changes in this PR support an implementation of BPE tokenizers. It adds a bunch of functionality to the `Trie`, and it also adds a `BPEMerges` data structure, which is a thin wrapper on top of a map of maps. There are no backwards-incompatible changes anywhere except `read_trie_from_spm`, which now doesn't change the space character and results in tokenizations closer to SPM even when not using BPE.

Trie
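To make the `Trie` part concrete, here is a minimal sketch of a longest-prefix lookup over a trie whose nodes keep their children in an `unordered_map` and carry a token id. Only the name `search_longest_prefix` comes from this PR; `SketchTrie`, `SketchTrieNode`, the `-1` "no token" convention, and the `{length, id}` return type are assumptions made for illustration and may not match the actual signatures.

```cpp
#include <cstddef>
#include <memory>
#include <string_view>
#include <unordered_map>
#include <utility>

// Hypothetical node layout: children keyed by byte, plus an optional token id.
struct SketchTrieNode {
    std::unordered_map<char, std::unique_ptr<SketchTrieNode>> children;
    int id = -1;  // assumed convention: -1 means no token ends at this node
};

// Hypothetical stand-in for the PR's Trie, reduced to insert + prefix search.
class SketchTrie {
public:
    // Store `token` and attach the caller-supplied `id` to its final node.
    void insert(std::string_view token, int id) {
        SketchTrieNode* node = &root_;
        for (char c : token) {
            auto& child = node->children[c];
            if (!child) child = std::make_unique<SketchTrieNode>();
            node = child.get();
        }
        node->id = id;
    }

    // Return {length, id} of the longest stored token that is a prefix of
    // `text`, or {0, -1} when no stored token matches.
    std::pair<std::size_t, int> search_longest_prefix(std::string_view text) const {
        const SketchTrieNode* node = &root_;
        std::size_t best_len = 0;
        int best_id = -1;
        for (std::size_t i = 0; i < text.size(); ++i) {
            auto it = node->children.find(text[i]);
            if (it == node->children.end()) break;  // no longer match possible
            node = it->second.get();
            if (node->id != -1) {  // a complete token ends here; remember it
                best_len = i + 1;
                best_id = node->id;
            }
        }
        return {best_len, best_id};
    }

private:
    SketchTrieNode root_;
};
```

With the tokens `a`, `ab`, and `abcd` inserted, searching `abcx` walks as far as `abc` and then reports the match for `ab`, since that is the last position where a complete token ended. The walk works directly on a `std::string_view`, so no intermediate buffers are created.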
- `std::vector<char>` creations and copies.
- `search_longest_prefix`, which the `Trie` is perfect for.
- `id` when inserting.
- `unordered_map` to support the above.

BPE
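As a rough illustration of what "a thin wrapper on top of a map of maps" can look like for `BPEMerges`, here is a minimal sketch. Everything in it is an assumption rather than the PR's actual interface: the class name `SketchBPEMerges`, keying by left and right token ids, and storing a rank plus a merged token id per pair are all hypothetical choices.

```cpp
#include <optional>
#include <unordered_map>

// Hypothetical payload for one merge rule: its priority and the resulting id.
struct SketchMerge {
    int rank = 0;        // assumed: lower rank means the merge is applied earlier
    int merged_id = -1;  // assumed: id of the token produced by the merge
};

// Hypothetical stand-in for BPEMerges: outer map keyed by the left token id,
// inner map keyed by the right token id.
class SketchBPEMerges {
public:
    // Register the rule (left, right) -> merged_id with the given rank.
    void add(int left, int right, int rank, int merged_id) {
        merges_[left][right] = SketchMerge{rank, merged_id};
    }

    // Look up the rule for an adjacent pair; empty if the pair never merges.
    std::optional<SketchMerge> find(int left, int right) const {
        auto outer = merges_.find(left);
        if (outer == merges_.end()) return std::nullopt;
        auto inner = outer->second.find(right);
        if (inner == outer->second.end()) return std::nullopt;
        return inner->second;
    }

private:
    std::unordered_map<int, std::unordered_map<int, SketchMerge>> merges_;
};
```

A tokenizer loop would call `find` on each adjacent pair and apply the hit with the lowest rank. The nested layout avoids having to define a hash for a pair key, at the cost of two lookups per query.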
- `read_bpe_from_spm` ironically implements a small BPE in Python to extract the merges from the file.

TL;DR
So far, the following implements SPM tokenization with results exactly identical to those of SPM or HF.