LlamaTokenizer: Slow implementation opts for whitespace-lead token (different from fast) #24569

Closed · 1 of 2 tasks
lbeurerkellner opened this issue Jun 29, 2023 · 6 comments · Fixed by #24622

lbeurerkellner commented Jun 29, 2023

System Info

  • transformers version: 4.30.2
  • Platform: Linux-5.15.0-75-generic-x86_64-with-glibc2.31
  • Python version: 3.10.11
  • Huggingface_hub version: 0.14.1
  • Safetensors version: 0.3.1
  • PyTorch version (GPU?): 2.0.0 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Who can help?

@ArthurZucker @youn

Information

  • The official example scripts
  • My own modified scripts

Reproduction

Comparing slow and fast LlamaTokenizer instances with huggyllama/llama-7b.

from transformers import AutoTokenizer

model = "huggyllama/llama-7b"

fast = AutoTokenizer.from_pretrained(model)
slow = AutoTokenizer.from_pretrained(model, use_fast=False)

# use tokenize()
print(fast.tokenize("<s>uns"), slow.tokenize("<s>uns"))
# -> ['▁<s>', 'uns'] ['<s>', '▁uns']

# use __call__
print(fast(f"{fast.bos_token}uns", add_special_tokens=False), slow(f"{slow.bos_token}uns", add_special_tokens=False))
# -> {'input_ids': [1, 6948], 'token_type_ids': [0, 0], 'attention_mask': [1, 1]}
#    {'input_ids': [1, 9644], 'attention_mask': [1, 1]}

# round-tripping
print(fast.convert_tokens_to_string(fast.tokenize("<s>uns")), fast.convert_tokens_to_string(slow.tokenize("<s>uns")))
# -> <s>uns <s> uns   (the slow round-trip inserts an extra space)
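
# Sanity check (a sketch; the expected outputs are inferred from the
# tokenize() and __call__ results above, assuming the same
# huggyllama/llama-7b vocabulary): map the differing ids back to tokens.
print(fast.convert_ids_to_tokens([1, 6948]))  # expected: ['<s>', 'uns']
print(slow.convert_ids_to_tokens([1, 9644]))  # expected: ['<s>', '▁uns']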

Expected behavior

It looks like the slow LlamaTokenizer wrongly tokenizes uns. I would not expect the additional whitespace when round-tripping, or when tokenizing in the first place.

Thanks a lot in advance.

lbeurerkellner changed the title from "LlamaTokenizer: Fast and Slow implementations tokenize differently with" to "LlamaTokenizer: Slow implementation opts for whitespace-lead token" on Jun 29, 2023
lbeurerkellner changed the title from "LlamaTokenizer: Slow implementation opts for whitespace-lead token" to "LlamaTokenizer: Slow implementation opts for whitespace-lead token (different from fast)" on Jun 29, 2023
@ArthurZucker (Collaborator)

Thanks for reporting, will have a look

@Bearnardd (Contributor)

Hi @ArthurZucker! Are you currently working on this? If not, I think I could fix it pretty quickly :)

@ArthurZucker (Collaborator)

Sure! Feel free to take it! 😉 I'll have a look soon otherwise

@Bearnardd (Contributor)

@ArthurZucker @lbeurerkellner I have done some debugging and have a few observations. First, I checked other tokenizers that use LlamaTokenizer or LlamaTokenizerFast, and the results are pretty weird:

  1. The issue is not with uns specifically but with any word after a special token like <s>. Why this happens is pretty straightforward (see the sketch below the list):

# <s> is added to the Trie, so the text is split right after it is encountered
tokens = self.tokens_trie.split(text) # tokenization_utils.py:517

So it seems like it was a deliberate decision to split special tokens like this?

  2. Because of the above split, all slow tokenizers based on LlamaTokenizer return ['<s>', '▁uns'].

  3. More interestingly, most of the tokenizers based on LlamaTokenizerFast split the text into ['▁<s>', 'uns'] (e.g. fxmarty/tiny-llama-fast-tokenizer). But, for example, openlm-research/open_llama_3b, which is one of the most downloaded LLaMA-based models, outputs ['<s>', '▁uns'] even though it has the same tokenizer config as the one from fxmarty:

LlamaTokenizerFast(name_or_path='openlm-research/open_llama_3b', vocab_size=32000, model_max_length=2048, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'eos_token': AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'unk_token': AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=True)}, clean_up_tokenization_spaces=False)
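
For illustration, a minimal sketch of the split from point 1 (tokens_trie is an internal attribute of slow tokenizers, so this may change between versions):

from transformers import AutoTokenizer

slow = AutoTokenizer.from_pretrained("huggyllama/llama-7b", use_fast=False)
# Special tokens are registered in the Trie, so the input is cut at "<s>"
# before any SentencePiece processing happens; "uns" then starts a new word,
# which is why the slow path produces '▁uns'.
print(slow.tokens_trie.split("<s>uns"))
# expected: ['<s>', 'uns']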

@ArthurZucker (Collaborator)

The fast tokenizer is working properly! As suspected, this is linked to #24622 and #24565. I am working on a fix for all our SPM-based models.

As for other tokenizers, I wouldn't use them as a reference, since a lot of them are outdated or don't include some of the fixes.

@ArthurZucker (Collaborator)

Actually, this is fixed: the output is now ['▁<s>', 'uns'] ['<s>', 'uns']. The fast tokenizer just works that way at the token-string level, but the encoded output is the same. Use:

slow = AutoTokenizer.from_pretrained(model, use_fast=False, legacy=False)
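
For completeness, a quick check that the two implementations now agree on ids (a sketch assuming a transformers version that includes the fix from #24622, i.e. one where LlamaTokenizer accepts the legacy flag):

from transformers import AutoTokenizer

model = "huggyllama/llama-7b"
fast = AutoTokenizer.from_pretrained(model)
slow = AutoTokenizer.from_pretrained(model, use_fast=False, legacy=False)

text = f"{fast.bos_token}uns"
# Both tokenizers should now produce the same input_ids for the same text.
print(fast(text, add_special_tokens=False)["input_ids"])
print(slow(text, add_special_tokens=False)["input_ids"])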
