Support Llama 3 tokenizer (implement `ignore_merges` behavior) #453

robertknight · 2024-12-08T20:23:46Z

Attempting to load the tokenizer.json file from Llama 3.2 fails with an error processing the BPE merge entries:

Error: BpeError(InvalidMergeEntry("Ġ ĠĠĠ"))

If rten-text is modified to ignore this error, then the qwen2_chat example works with Llama 3.2, after a minor modification to the special token IDs.

Edit: I have just noticed the `ignore_merges: true` setting in the tokenizer.json file. This seems relevant.

robertknight · 2024-12-08T20:54:57Z

ignore_merges was added in huggingface/tokenizers@914576f. See also https://github.com/huggingface/tokenizers/pull/1493/files.

The documentation says:

ignore_merges (bool, optional) — Whether or not to match tokens with the vocab before using merges.
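To make that description concrete, here is a minimal sketch of what "match tokens with the vocab before using merges" could look like for a single pre-tokenized word. It is a toy model, not the rten-text or huggingface/tokenizers implementation: `encode_word`, its parameters, and the string-based piece representation are all hypothetical names chosen for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical helper: encode one pre-tokenized word with a toy BPE model.
/// Returns `None` if some resulting piece has no entry in `vocab`.
fn encode_word(
    word: &str,
    vocab: &HashMap<String, u32>,
    merge_ranks: &HashMap<(String, String), usize>,
    ignore_merges: bool,
) -> Option<Vec<u32>> {
    // With `ignore_merges` set, a word that is already in the vocabulary is
    // emitted directly, without consulting the merge list at all.
    if ignore_merges {
        if let Some(&id) = vocab.get(word) {
            return Some(vec![id]);
        }
    }

    // Otherwise fall back to ordinary BPE: start from single characters and
    // repeatedly apply the lowest-ranked (highest-priority) applicable merge.
    let mut pieces: Vec<String> = word.chars().map(|c| c.to_string()).collect();
    loop {
        let best = pieces
            .windows(2)
            .enumerate()
            .filter_map(|(i, pair)| {
                merge_ranks
                    .get(&(pair[0].clone(), pair[1].clone()))
                    .map(|&rank| (rank, i))
            })
            .min();
        let Some((_, i)) = best else { break };
        let merged = format!("{}{}", pieces[i], pieces[i + 1]);
        pieces[i] = merged;
        pieces.remove(i + 1);
    }

    // Map each final piece to its token ID.
    pieces.iter().map(|p| vocab.get(p).copied()).collect()
}
```

In this sketch, a vocabulary entry that no sequence of merges can produce is only reachable when `ignore_merges` is true; with the flag off, the same word is split into the pieces that the merge list allows.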

robertknight added the tokenizers Issues related to the rten-text tokenization crate label Dec 8, 2024

robertknight mentioned this issue Dec 9, 2024

Clean up outdated comments in BPE tokenizer, pass configuration as a struct #455

Merged

robertknight changed the title ~~Investigate InvalidMergeEntry error when loading Llama 3 tokenizer~~ Support Llama 3 tokenizer (implement ignore_merges behavior) Dec 9, 2024
