
M2M100Tokenizer vocabulary size is not equal to the m2m embedding_size for the "facebook/m2m100_418M" model. #33240

GerrySant opened this issue Sep 1, 2024 · 2 comments
Labels: bug, Core: Tokenization (Internals of the library; Tokenization)

Comments

@GerrySant

System Info

  • transformers version: 4.42.3
  • Platform: Linux-5.15.0-113-generic-x86_64-with-glibc2.31
  • Python version: 3.12.3
  • Huggingface_hub version: 0.23.4
  • Safetensors version: 0.4.3
  • Accelerate version: 0.32.1
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.3.1+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?:
  • Using GPU in script?:
  • GPU type: NVIDIA A100-SXM4-80GB

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Run the code snippet below:
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

# Load the checkpoint and its matching tokenizer.
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

# Compare the tokenizer length with the number of rows in the input embedding matrix.
print(len(tokenizer) == model.get_input_embeddings().weight.shape[0])

Expected behavior

I would expect the result of the above code to be True; however, it is False.
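
A quick way to see which quantity is off is to print the sizes side by side. The lines below are a minimal diagnostic sketch continuing from the reproduction snippet; no particular values are assumed:

# Continuing from the reproduction snippet: print each quantity individually.
print("len(tokenizer):          ", len(tokenizer))
print("tokenizer.vocab_size:    ", tokenizer.vocab_size)
print("model.config.vocab_size: ", model.config.vocab_size)
print("embedding rows:          ", model.get_input_embeddings().weight.shape[0])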

Because the tokenizer vocabulary size and the model's embedding size differ, this causes unwanted behavior. For example, examples/pytorch/translation/run_translation.py contains the following fragment, which performs this same comparison and resizes the model embeddings when the tokenizer vocabulary is larger:

embedding_size = model.get_input_embeddings().weight.shape[0]
if len(tokenizer) > embedding_size:
    model.resize_token_embeddings(len(tokenizer))
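
For completeness, a small sketch (again continuing from the reproduction snippet) that reports which direction the mismatch goes, since that determines whether the fragment above actually resizes anything:

embedding_size = model.get_input_embeddings().weight.shape[0]
if len(tokenizer) > embedding_size:
    # The tokenizer can emit ids the embedding cannot index, so the script resizes.
    print(f"tokenizer is larger by {len(tokenizer) - embedding_size} ids")
elif len(tokenizer) < embedding_size:
    # Extra embedding rows are harmless; the script leaves the model untouched.
    print(f"embedding has {embedding_size - len(tokenizer)} extra rows")
else:
    print("sizes match")
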
GerrySant added the bug label on Sep 1, 2024
@LysandreJik (Member)

cc @ArthurZucker

@ArthurZucker (Collaborator)

Hey! Feel free to update that step; in general there is absolutely no guarantee that the tokenizer has the same length as the model's input embeddings:

  • you can have holes in your tokenizer's vocab
  • the embedding can be padded for performance reasons (see the sketch after this comment).
    🤗
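
To illustrate the second bullet, here is a generic sketch of deliberate embedding padding via resize_token_embeddings and its pad_to_multiple_of argument; the multiple of 64 is an arbitrary choice for the example, not a claim about how the facebook/m2m100_418M checkpoint was produced:

from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

# Never shrink below the checkpoint's current row count, only pad upwards;
# the padded rows correspond to ids the tokenizer never produces.
current_rows = model.get_input_embeddings().weight.shape[0]
model.resize_token_embeddings(max(current_rows, len(tokenizer)), pad_to_multiple_of=64)

new_rows = model.get_input_embeddings().weight.shape[0]
print(new_rows % 64 == 0)           # True: row count rounded up to a multiple of 64
print(new_rows == len(tokenizer))   # typically still False, which is expected

After such padding, len(tokenizer) == embedding_size is simply not an invariant to rely on, which is why the example script only resizes when the tokenizer is strictly larger.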

LysandreJik added the Core: Tokenization label on Sep 6, 2024