
M2M100Tokenizer vocabulary size is not equal to the m2m embedding_size for the "facebook/m2m100_418M" model. #33240

GerrySant opened this issue Sep 1, 2024 · 2 comments
Labels: bug, Core: Tokenization (Internals of the library; Tokenization)

Comments

@GerrySant

System Info

  • transformers version: 4.42.3
  • Platform: Linux-5.15.0-113-generic-x86_64-with-glibc2.31
  • Python version: 3.12.3
  • Huggingface_hub version: 0.23.4
  • Safetensors version: 0.4.3
  • Accelerate version: 0.32.1
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.3.1+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?:
  • Using GPU in script?:
  • GPU type: NVIDIA A100-SXM4-80GB

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Run the code snippet below:
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

# Load the checkpoint and its matching tokenizer.
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

# Compare the tokenizer length with the number of rows in the input embedding matrix.
print(len(tokenizer) == model.get_input_embeddings().weight.shape[0])

Expected behavior

I would expect the result of the above code to be True; however, it is False.
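
A quick way to see which quantity is off is to print the sizes side by side. The lines below are a minimal diagnostic sketch continuing from the reproduction snippet; no particular values are assumed:

# Continuing from the reproduction snippet: print each quantity individually.
print("len(tokenizer):          ", len(tokenizer))
print("tokenizer.vocab_size:    ", tokenizer.vocab_size)
print("model.config.vocab_size: ", model.config.vocab_size)
print("embedding rows:          ", model.get_input_embeddings().weight.shape[0])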

Because the tokenizer vocabulary size and the model's embedding size differ, this causes unwanted behavior. For example, examples/pytorch/translation/run_translation.py contains the following fragment, which performs this same comparison and resizes the model embeddings when the tokenizer vocabulary is larger:

embedding_size = model.get_input_embeddings().weight.shape[0]
if len(tokenizer) > embedding_size:
    model.resize_token_embeddings(len(tokenizer))
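
For completeness, a small sketch (again continuing from the reproduction snippet) that reports which direction the mismatch goes, since that determines whether the fragment above actually resizes anything:

embedding_size = model.get_input_embeddings().weight.shape[0]
if len(tokenizer) > embedding_size:
    # The tokenizer can emit ids the embedding cannot index, so the script resizes.
    print(f"tokenizer is larger by {len(tokenizer) - embedding_size} ids")
elif len(tokenizer) < embedding_size:
    # Extra embedding rows are harmless; the script leaves the model untouched.
    print(f"embedding has {embedding_size - len(tokenizer)} extra rows")
else:
    print("sizes match")
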
GerrySant added the bug label on Sep 1, 2024
@LysandreJik (Member)

cc @ArthurZucker

@ArthurZucker (Collaborator)

Hey! Feel free to update that step; in general there is absolutely no guarantee that the tokenizer has the same length as the model's input embeddings:

  • you can have holes in your tokenizer's vocab
  • the embedding can be padded for performance reasons (see the sketch after this comment).
    🤗
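
To illustrate the second bullet, here is a generic sketch of deliberate embedding padding via resize_token_embeddings and its pad_to_multiple_of argument; the multiple of 64 is an arbitrary choice for the example, not a claim about how the facebook/m2m100_418M checkpoint was produced:

from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

# Never shrink below the checkpoint's current row count, only pad upwards;
# the padded rows correspond to ids the tokenizer never produces.
current_rows = model.get_input_embeddings().weight.shape[0]
model.resize_token_embeddings(max(current_rows, len(tokenizer)), pad_to_multiple_of=64)

new_rows = model.get_input_embeddings().weight.shape[0]
print(new_rows % 64 == 0)           # True: row count rounded up to a multiple of 64
print(new_rows == len(tokenizer))   # typically still False, which is expected

After such padding, len(tokenizer) == embedding_size is simply not an invariant to rely on, which is why the example script only resizes when the tokenizer is strictly larger.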

LysandreJik added the Core: Tokenization label on Sep 6, 2024