Load a pretrainedfast tokenizer if fast=true and tokenizer.json exists #33751

Open · wants to merge 6 commits into main

Conversation

@itazap (Collaborator) commented Sep 27, 2024

Current status for AutoTokenizer with fast=True:

  1. check tokenizer_config.json to see if the tokenizer_class name ends with Fast
  2. if not, load a slow tokenizer

With this PR:

  1. (unchanged) check tokenizer_config.json to see if the tokenizer_class name ends with Fast
  2. if not, check if the repo has a tokenizer.json file
     2.1 if yes, load PreTrainedTokenizerFast
  3. if not, load a slow tokenizer

prereq for #29969
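
(For illustration, a rough sketch of the intended resolution order. This is simplified pseudologic, not the actual tokenization_auto code; file_exists from huggingface_hub and the load_class_by_name helper are assumptions used only to make the flow concrete.)

from huggingface_hub import file_exists
from transformers import PreTrainedTokenizerFast


def resolve_tokenizer_class(repo_id: str, config_tokenizer_class: str, use_fast: bool = True):
    # Simplified sketch of the resolution order described above, not the real implementation.
    # 1. tokenizer_config.json names a *Fast class -> use it
    if use_fast and config_tokenizer_class and config_tokenizer_class.endswith("Fast"):
        return load_class_by_name(config_tokenizer_class)  # hypothetical helper

    # 2. (this PR) no fast class in the config, but the repo ships a tokenizer.json
    #    -> fall back to the generic PreTrainedTokenizerFast
    if use_fast and file_exists(repo_id, "tokenizer.json"):
        return PreTrainedTokenizerFast

    # 3. otherwise, fall back to the slow tokenizer class
    return load_class_by_name(config_tokenizer_class)  # hypothetical helper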

@itazap itazap changed the title add support for loading a pretrainedfast tokenizer if fast=true and t… Load a pretrainedfast tokenizer if fast=true and tokenizer.json exists Sep 27, 2024
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@itazap itazap marked this pull request as ready for review September 27, 2024 10:36
@ArthurZucker (Collaborator) left a comment

Could you add a test as well?

Comment on lines 593 to 598
def has_pretrainedfast(class_name: str):
    for module_name, tokenizers in TOKENIZER_MAPPING_NAMES.items():
        if class_name in tokenizers and "PreTrainedTokenizerFast" in tokenizers:
            return PreTrainedTokenizerFast

    return None
@ArthurZucker (Collaborator)

I don't think we need a one-liner, nor do we need a for loop! We can just get TOKENIZER_MAPPING_NAMES[class_name][1]
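
(For context: entries in TOKENIZER_MAPPING_NAMES map a model type to a (slow, fast) pair of tokenizer class-name strings, so index 1 is the fast entry. A rough illustration - the concrete entries below are examples, not an excerpt of the real mapping.)

# Illustrative shape of TOKENIZER_MAPPING_NAMES; values are class-name strings.
TOKENIZER_MAPPING_NAMES = {
    "bert": ("BertTokenizer", "BertTokenizerFast"),
    "siglip": ("SiglipTokenizer", "PreTrainedTokenizerFast"),
}

# Index 1 of a value is the fast tokenizer entry (it may be None for models
# that ship no fast tokenizer).
fast_class_name = TOKENIZER_MAPPING_NAMES["siglip"][1]  # "PreTrainedTokenizerFast"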

@itazap (Collaborator Author) commented Sep 27, 2024

Yeah, the problem is we don't actually have the class name at this point; we would have to regex (or strip some characters off the tokenizer name) to get 'siglip'. I can make the change though.

@itazap (Collaborator Author)

Updated the PR.

@vignesh1507 left a comment

Thanks for updating the code @itazap.

@vignesh1507 left a comment

Looks good to me and thanks for improving the code's overall accuracy.

@itazap (Collaborator Author) commented Sep 30, 2024

@ArthurZucker SigLip will be the first model that can exercise this (the first model where we infer PreTrainedTokenizerFast without it being specified in the tokenizer_config.json). I can add the test to the SigLip PR #29969 to avoid creating an internal model, wdyt?

Edit: added the test to the SigLip PR here: https://github.com/huggingface/transformers/pull/29969/files#r1784784751

@itazap mentioned this pull request Oct 2, 2024
@@ -895,9 +895,12 @@ def from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs):
            )
        elif config_tokenizer_class is not None:
            tokenizer_class = None
            class_name = config_tokenizer_class.rstrip("Tokenizer").lower()
@ArthurZucker (Collaborator)

This is probably a bit brittle, no? config_tokenizer_class can be LayoutLVM but the id of the model is layout_lvm, so it's not just lowercased!
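
(To illustrate the concern - these examples are mine, not from the diff: str.rstrip strips by character set rather than removing a literal suffix, and lowercasing alone does not recover model ids that contain separators.)

# rstrip("Tokenizer") removes *any* trailing characters from the set
# {T, o, k, e, n, i, z, r}, not the literal "Tokenizer" suffix.
"CodeGenTokenizer".rstrip("Tokenizer").lower()
# -> "codeg"  (the trailing "en" of "CodeGen" is stripped too)

# Lowercasing alone does not reproduce ids that use separators.
"OpenAIGPTTokenizer".rstrip("Tokenizer").lower()
# -> "openaigp", while the model type key is "openai-gpt"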

@itazap (Collaborator Author)

Yeah, the way TOKENIZER_MAPPING_NAMES builds its keys makes them virtually unusable otherwise (the keys don't correspond to any naming convention or classes in transformers). On this line we can also strip punctuation or add more string manipulation if we want to keep it as minimal as possible, or we can write a function similar to tokenizer_class_from_name like I had before.

@itazap (Collaborator Author)

# slow_class_name is something like "SiglipTokenizer"
def tokenizer_fast_class_from_name(slow_class_name: str):
    for module_name, tokenizers in TOKENIZER_MAPPING_NAMES.items():
        # search the VALUES of the dict to find the slow class
        if slow_class_name in tokenizers:
            fast_class_name = f"{slow_class_name}Fast"
            # check if the slow class + "Fast" is in the values
            if fast_class_name in tokenizers:
                module = importlib.import_module(
                    f".{model_type_to_module_name(module_name)}", "transformers.models"
                )
                try:
                    return getattr(module, fast_class_name)
                except AttributeError:
                    continue
            # check if "PreTrainedTokenizerFast" is in the values
            elif "PreTrainedTokenizerFast" in tokenizers:
                return PreTrainedTokenizerFast
            else:
                return None
    return None

@ArthurZucker (Collaborator) left a comment

Could you add a test to show what we are enabling? I am not sure I follow, got a bit lost 😓

@itazap (Collaborator Author) commented Oct 5, 2024

@ArthurZucker the test is here in the Siglip PR: https://github.com/huggingface/transformers/pull/29969/files#r1784784751 (copy-pasted below)

        # Model does not have a fast tokenizer or PreTrainedTokenizerFast specified in config but can still load fast
        tokenizer = AutoTokenizer.from_pretrained("google/siglip-base-patch16-224", use_fast=True)
        self.assertEqual(type(tokenizer), PreTrainedTokenizerFast)

No, this test cannot be added to this PR because Siglip is not merged; Siglip would be the first model that makes this testable. It checks that when a fast tokenizer or PreTrainedTokenizerFast is not specified in the tokenizer config file, PreTrainedTokenizerFast can still be loaded.

In order to run this test in this PR, we could create an internal siglip model, but I think that would be redundant since we intend to merge siglip soon.

@ArthurZucker (Collaborator)

Okay, I mean we can still add a model on the hub that does not have a tokenizer_config.json -> it should still load a PreTrainedTokenizerFast.

@ArthurZucker (Collaborator)

Okay, it's just that IMO what we should do sounds simple: if we are not able to go through the normal route, but we have a tokenizer.json -> just load PreTrainedTokenizerFast.
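
(A minimal sketch of that fallback, assuming a repo that ships a tokenizer.json; "some-org/some-model" is a placeholder, not a real checkpoint.)

from transformers import PreTrainedTokenizerFast

# Build the generic fast tokenizer straight from a repo's tokenizer.json;
# no model-specific tokenizer class is needed.
tokenizer = PreTrainedTokenizerFast.from_pretrained("some-org/some-model")

# Or from a local tokenizer.json file:
tokenizer = PreTrainedTokenizerFast(tokenizer_file="path/to/tokenizer.json")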

@itazap (Collaborator Author) commented Oct 8, 2024

@ArthurZucker this feature requires that PreTrainedTokenizerFast is an option for a given model in this specific dictionary:
https://github.com/huggingface/transformers/pull/29969/files#r1792386118

So if we create a private testing model on the hub without the PreTrainedTokenizerFast class specified in the config, we would still have to modify tokenization_auto to add PreTrainedTokenizerFast to the value. Unless we change this feature to always try fast, but then it will never fall back to slow (which we currently do) if it doesn't work, because the error would happen too far in.

Also - I no longer have rights to create models for the HF org on the hub 😢

@ArthurZucker (Collaborator)

Ah okay let me add you on the org!

@ArthurZucker (Collaborator)

Added you back! 🤗

@ArthurZucker (Collaborator)

  1. check tokenizer_config.json to see if the tokenizer_class name ends with Fast
  2. if not, check if the repo has a tokenizer.json file
     2.1 if yes, load PreTrainedTokenizerFast
  3. if not, load a slow tokenizer

This is exactly what I want!

Unless we change this feature to always try fast, but then it will never fall back to slow (which we currently do) if it doesn't work, because the error would happen too far in.

IMO let's just aim for simplicity: if we can't load anything (no tokenizer_config.json) then we just load the tokenizer.json, that's it, it's the only thing to add!

Let's merge siglip; this one will follow!
