Load a pretrainedfast tokenizer if fast=true and tokenizer.json exists #33751

Open · wants to merge 6 commits into main

Conversation

@itazap (Collaborator) commented Sep 27, 2024

Current status for AutoTokenizer with fast=True:

  1. check tokenizer_config.json to see if the tokenizer_class name ends with Fast
  2. if not, load a slow tokenizer

With this PR:

  1. (unchanged) check tokenizer_config.json to see if the tokenizer_class name ends with Fast
  2. if not, check if the repo has a tokenizer.json file
     2.1 if yes, load PreTrainedTokenizerFast
  3. if not, load a slow tokenizer

prereq for #29969
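
(For illustration, a rough sketch of the intended resolution order. This is simplified pseudologic, not the actual tokenization_auto code; file_exists from huggingface_hub and the load_class_by_name helper are assumptions used only to make the flow concrete.)

from huggingface_hub import file_exists
from transformers import PreTrainedTokenizerFast


def resolve_tokenizer_class(repo_id: str, config_tokenizer_class: str, use_fast: bool = True):
    # Simplified sketch of the resolution order described above, not the real implementation.
    # 1. tokenizer_config.json names a *Fast class -> use it
    if use_fast and config_tokenizer_class and config_tokenizer_class.endswith("Fast"):
        return load_class_by_name(config_tokenizer_class)  # hypothetical helper

    # 2. (this PR) no fast class in the config, but the repo ships a tokenizer.json
    #    -> fall back to the generic PreTrainedTokenizerFast
    if use_fast and file_exists(repo_id, "tokenizer.json"):
        return PreTrainedTokenizerFast

    # 3. otherwise, fall back to the slow tokenizer class
    return load_class_by_name(config_tokenizer_class)  # hypothetical helper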

@itazap itazap changed the title add support for loading a pretrainedfast tokenizer if fast=true and t… Load a pretrainedfast tokenizer if fast=true and tokenizer.json exists Sep 27, 2024
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@itazap itazap marked this pull request as ready for review September 27, 2024 10:36
@ArthurZucker (Collaborator) left a comment

Could you add a test as well?

Comment on lines 593 to 598
def has_pretrainedfast(class_name: str):
    for module_name, tokenizers in TOKENIZER_MAPPING_NAMES.items():
        if class_name in tokenizers and "PreTrainedTokenizerFast" in tokenizers:
            return PreTrainedTokenizerFast

    return None
@ArthurZucker (Collaborator)

I don't think we need a one-liner, nor do we need a for loop! We can just get TOKENIZER_MAPPING_NAMES[class_name][1]
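
(For context: entries in TOKENIZER_MAPPING_NAMES map a model type to a (slow, fast) pair of tokenizer class-name strings, so index 1 is the fast entry. A rough illustration - the concrete entries below are examples, not an excerpt of the real mapping.)

# Illustrative shape of TOKENIZER_MAPPING_NAMES; values are class-name strings.
TOKENIZER_MAPPING_NAMES = {
    "bert": ("BertTokenizer", "BertTokenizerFast"),
    "siglip": ("SiglipTokenizer", "PreTrainedTokenizerFast"),
}

# Index 1 of a value is the fast tokenizer entry (it may be None for models
# that ship no fast tokenizer).
fast_class_name = TOKENIZER_MAPPING_NAMES["siglip"][1]  # "PreTrainedTokenizerFast"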

@itazap (Collaborator Author) commented Sep 27, 2024

Yeah, the problem is we don't actually have the class name at this point; we would have to regex (or strip some characters off the tokenizer name) to get 'siglip'. I can make the change though.

@itazap (Collaborator Author)

Updated the PR.

@vignesh1507 left a comment

Thanks for updating the code @itazap.

@vignesh1507 left a comment

Looks good to me and thanks for improving the code's overall accuracy.

@itazap (Collaborator Author) commented Sep 30, 2024

@ArthurZucker SigLip will be the first model that can exercise this (the first model where we infer PreTrainedTokenizerFast without it being specified in the tokenizer_config.json). I can add the test to the SigLip PR #29969 to avoid creating an internal model, wdyt?

Edit: added the test to the SigLip PR here: https://github.com/huggingface/transformers/pull/29969/files#r1784784751

@itazap mentioned this pull request Oct 2, 2024
@@ -895,9 +895,12 @@ def from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs):
            )
        elif config_tokenizer_class is not None:
            tokenizer_class = None
            class_name = config_tokenizer_class.rstrip("Tokenizer").lower()
@ArthurZucker (Collaborator)

This is probably a bit brittle, no? config_tokenizer_class can be LayoutLVM but the id of the model is layout_lvm, so it's not just lowercased!
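
(To illustrate the concern - these examples are mine, not from the diff: str.rstrip strips by character set rather than removing a literal suffix, and lowercasing alone does not recover model ids that contain separators.)

# rstrip("Tokenizer") removes *any* trailing characters from the set
# {T, o, k, e, n, i, z, r}, not the literal "Tokenizer" suffix.
"CodeGenTokenizer".rstrip("Tokenizer").lower()
# -> "codeg"  (the trailing "en" of "CodeGen" is stripped too)

# Lowercasing alone does not reproduce ids that use separators.
"OpenAIGPTTokenizer".rstrip("Tokenizer").lower()
# -> "openaigp", while the model type key is "openai-gpt"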

@itazap (Collaborator Author)

Yeah, the way TOKENIZER_MAPPING_NAMES builds its keys makes them virtually unusable otherwise (the keys don't correspond to any naming convention or classes in transformers). On this line we can also strip punctuation or add more string manipulation if we want to keep it as minimal as possible, or we can write a function similar to tokenizer_class_from_name like I had before.

@itazap (Collaborator Author)

# slow_class_name is something like "SiglipTokenizer"
def tokenizer_fast_class_from_name(slow_class_name: str):
    for module_name, tokenizers in TOKENIZER_MAPPING_NAMES.items():
        # search the VALUES of the dict to find the slow class
        if slow_class_name in tokenizers:
            fast_class_name = f"{slow_class_name}Fast"
            # check if the slow class + "Fast" is in the values
            if fast_class_name in tokenizers:
                module = importlib.import_module(
                    f".{model_type_to_module_name(module_name)}", "transformers.models"
                )
                try:
                    return getattr(module, fast_class_name)
                except AttributeError:
                    continue
            # check if "PreTrainedTokenizerFast" is in the values
            elif "PreTrainedTokenizerFast" in tokenizers:
                return PreTrainedTokenizerFast
            else:
                return None
    return None

@ArthurZucker (Collaborator) left a comment

Could you add a test to show what we are enabling? I am not sure I follow, got a bit lost 😓

@itazap (Collaborator Author) commented Oct 5, 2024

@ArthurZucker the test is here in the Siglip PR: https://github.com/huggingface/transformers/pull/29969/files#r1784784751 (copy-pasted below)

        # Model does not have a fast tokenizer or PreTrainedTokenizerFast specified in config but can still load fast
        tokenizer = AutoTokenizer.from_pretrained("google/siglip-base-patch16-224", use_fast=True)
        self.assertEqual(type(tokenizer), PreTrainedTokenizerFast)

No, this test cannot be added to this PR because Siglip is not merged; Siglip would be the first model that makes this testable. It checks that when a fast tokenizer or PreTrainedTokenizerFast is not specified in the tokenizer config file, PreTrainedTokenizerFast can still be loaded.

In order to run this test in this PR, we could create an internal siglip model, but I think that would be redundant since we intend to merge siglip soon.

@ArthurZucker (Collaborator)

Okay, I mean we can still add a model on the hub that does not have a tokenizer_config.json -> it should still load a PreTrainedTokenizerFast.

@ArthurZucker (Collaborator)

Okay, it's just that IMO what we should do sounds simple: if we are not able to go through the normal route, but we have a tokenizer.json -> just load PreTrainedTokenizerFast.
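
(A minimal sketch of that fallback, assuming a repo that ships a tokenizer.json; "some-org/some-model" is a placeholder, not a real checkpoint.)

from transformers import PreTrainedTokenizerFast

# Build the generic fast tokenizer straight from a repo's tokenizer.json;
# no model-specific tokenizer class is needed.
tokenizer = PreTrainedTokenizerFast.from_pretrained("some-org/some-model")

# Or from a local tokenizer.json file:
tokenizer = PreTrainedTokenizerFast(tokenizer_file="path/to/tokenizer.json")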

@itazap (Collaborator Author) commented Oct 8, 2024

@ArthurZucker this feature requires that PreTrainedTokenizerFast is an option for a given model in this specific dictionary:
https://github.com/huggingface/transformers/pull/29969/files#r1792386118

So if we create a private testing model on the hub without the PreTrainedTokenizerFast class specified in the config, we would still have to modify tokenization_auto to add PreTrainedTokenizerFast to the value. Unless we change this feature to always try fast, but then it will never fall back to slow (which we currently do) if it doesn't work, because the error would happen too far in.

Also - I no longer have rights to create models for the HF org on the hub 😢

@ArthurZucker (Collaborator)

Ah okay let me add you on the org!

@ArthurZucker (Collaborator)

Added you back! 🤗

@ArthurZucker (Collaborator)

  1. check tokenizer_config.json to see if the tokenizer_class name ends with Fast
  2. if not, check if the repo has a tokenizer.json file
     2.1 if yes, load PreTrainedTokenizerFast
  3. if not, load a slow tokenizer

This is exactly what I want!

Unless we change this feature to always try fast, but then it will never fall back to slow (which we currently do) if it doesn't work, because the error would happen too far in.

IMO let's just aim for simplicity: if we can't load anything (no tokenizer_config.json) then we just load the tokenizer.json, that's it, it's the only thing to add!

Let's merge siglip; this one will follow!
