Load a pretrained fast tokenizer if fast=true and tokenizer.json exists #33751

Open · wants to merge 6 commits into main (changes from 5 commits)
3 changes: 3 additions & 0 deletions src/transformers/models/auto/tokenization_auto.py
@@ -895,9 +895,12 @@ def from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs):
            )
        elif config_tokenizer_class is not None:
            tokenizer_class = None
            class_name = config_tokenizer_class.rstrip("Tokenizer").lower()
Collaborator:
This is probably a bit brittle, no? config_tokenizer_class can be LayoutLVM but the id of the model is layout_lvm, so it's not just lowered!
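A quick illustration of the concern (the "gpt_neox" key is a real model-type id, though the exact mapping entries may differ):

# rstrip("Tokenizer") strips a trailing *character set*, not the suffix,
# and lowering a CamelCase class name need not match the model-type key:
"GPTNeoXTokenizer".rstrip("Tokenizer").lower()   # "gptneox", but the mapping key is "gpt_neox"
"RoFormerTokenizer".rstrip("Tokenizer").lower()  # "roform" (the trailing "er" is eaten too)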

Collaborator Author:
Yeah, the way TOKENIZER_MAPPING_NAMES builds its keys makes them virtually unusable otherwise (the keys don't correspond to any naming convention or classes in transformers). On this line we could also strip punctuation or add more string manipulation if we want to keep it as minimal as possible, or we write a function similar to tokenizer_class_from_name like I had before.

Collaborator Author:
# slow_class_name is something like "SiglipTokenizer"
def tokenizer_fast_class_from_name(slow_class_name: str):
    # importlib, TOKENIZER_MAPPING_NAMES, model_type_to_module_name and
    # PreTrainedTokenizerFast are already available in tokenization_auto.py
    fast_class_name = f"{slow_class_name}Fast"
    for model_type, tokenizers in TOKENIZER_MAPPING_NAMES.items():
        # search the VALUES of the dict to find the slow class
        if slow_class_name in tokenizers:
            # check if slow_class_name + "Fast" is also in the values
            if fast_class_name in tokenizers:
                # the keys are model types, which map cleanly to module names
                module_name = model_type_to_module_name(model_type)
                module = importlib.import_module(f".{module_name}", "transformers.models")
                try:
                    return getattr(module, fast_class_name)
                except AttributeError:
                    continue
            # check if "PreTrainedTokenizerFast" is in the values
            elif "PreTrainedTokenizerFast" in tokenizers:
                return PreTrainedTokenizerFast
            else:
                return None
    return None
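For instance, a hypothetical call, assuming the "llama" entry of the mapping lists both a slow and a fast class:

fast_cls = tokenizer_fast_class_from_name("LlamaTokenizer")
# -> LlamaTokenizerFast if the "llama" entry lists it; otherwise
#    PreTrainedTokenizerFast or None, per the branches above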

            if use_fast and not config_tokenizer_class.endswith("Fast"):
                tokenizer_class_candidate = f"{config_tokenizer_class}Fast"
                tokenizer_class = tokenizer_class_from_name(tokenizer_class_candidate)
                if tokenizer_class is None and class_name in TOKENIZER_MAPPING_NAMES:
                    tokenizer_class = tokenizer_class_from_name(TOKENIZER_MAPPING_NAMES[class_name][1])
            if tokenizer_class is None:
                tokenizer_class_candidate = config_tokenizer_class
                tokenizer_class = tokenizer_class_from_name(tokenizer_class_candidate)
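To spell out what the added fallback does, a hypothetical walk-through (the mapping entry below is assumed for illustration):

# assume TOKENIZER_MAPPING_NAMES["llama"] == ("LlamaTokenizer", "LlamaTokenizerFast")
config_tokenizer_class = "LlamaTokenizer"
class_name = config_tokenizer_class.rstrip("Tokenizer").lower()  # "llama"
# first try the naive candidate name:
tokenizer_class = tokenizer_class_from_name("LlamaTokenizerFast")
# if that returns None, the new lines fall back to the fast entry of the
# mapping: tokenizer_class_from_name(TOKENIZER_MAPPING_NAMES["llama"][1])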