Load a PreTrainedTokenizerFast tokenizer if use_fast=True and tokenizer.json exists #33751
base: main
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Could you add a test as well?
def has_pretrainedfast(class_name: str):
    for module_name, tokenizers in TOKENIZER_MAPPING_NAMES.items():
        if class_name in tokenizers and "PreTrainedTokenizerFast" in tokenizers:
            return PreTrainedTokenizerFast

    return None
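For context (a hand-abbreviated sketch, not the real table): TOKENIZER_MAPPING_NAMES in transformers maps a model type to a pair of tokenizer class names, which is what the membership checks above iterate over.

```python
from collections import OrderedDict

# Illustrative subset; the real mapping lives in transformers.models.auto.tokenization_auto.
# model_type -> (slow tokenizer class name, fast tokenizer class name or None)
TOKENIZER_MAPPING_NAMES = OrderedDict(
    [
        ("bert", ("BertTokenizer", "BertTokenizerFast")),
        # assumed entry for the case discussed in this PR: no dedicated fast class,
        # so the generic PreTrainedTokenizerFast is listed instead
        ("siglip", ("SiglipTokenizer", "PreTrainedTokenizerFast")),
    ]
)
```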
I don't think we need a one-liner, nor do we need a for loop! We can just get TOKENIZER_MAPPING_NAMES["class_name"][1]
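A minimal sketch of that direct lookup, assuming (hypothetically) that the model-type key were already in hand at this point:

```python
# Hypothetical: this needs the model-type key ("siglip"), which from_pretrained
# doesn't actually have at this point -- see the reply below.
slow_name, fast_name = TOKENIZER_MAPPING_NAMES["siglip"]
if fast_name == "PreTrainedTokenizerFast":
    tokenizer_class = PreTrainedTokenizerFast
```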
yeah the problem is we don't actually have the class name at this point, would have to regex (or strip some chars off the tokenizer name) to get 'siglip', I can make the change though
updated pr
Thanks for updating the code @itazap.
Looks good to me and thanks for improving the code's overall accuracy.
@ArthurZucker SigLip will be the first model to be able to test this (the first model where we infer the fast tokenizer class this way). Edit: added the test to the SigLip PR here: https://github.com/huggingface/transformers/pull/29969/files#r1784784751
@@ -895,9 +895,12 @@ def from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs):
            )
        elif config_tokenizer_class is not None:
            tokenizer_class = None
            class_name = config_tokenizer_class.rstrip("Tokenizer").lower()
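A side note on the rstrip call itself (standard Python behavior, separate from the naming-convention concern raised below): str.rstrip strips a trailing set of characters, not a literal suffix, so it can eat more than the word "Tokenizer":

```python
# rstrip("Tokenizer") removes any trailing run of the characters
# {T, o, k, e, n, i, z, r}, not the literal suffix "Tokenizer"
"SiglipTokenizer".rstrip("Tokenizer").lower()    # -> 'siglip'  (happens to work)
"RoFormerTokenizer".rstrip("Tokenizer").lower()  # -> 'roform'  (over-strips the trailing 'er')

# suffix-safe alternatives
"RoFormerTokenizer".removesuffix("Tokenizer").lower()  # -> 'roformer' (Python 3.9+)
"RoFormerTokenizer"[: -len("Tokenizer")].lower()       # -> 'roformer'
```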
this is probably a bit brittle no? config_tokenizer_class can be LayoutLVM but the id of the model is layout_lvm, so not just lowered!
Yeah, the way TOKENIZER_MAPPING_NAMES is structured makes its keys virtually unusable otherwise (the keys don't correspond to any naming convention or classes in transformers), so in this line we can also strip punctuation or add more string manipulation if we want to keep it as minimal as possible, or we do a function similar to tokenizer_class_from_name like I had before
# Assumes the names available in transformers.models.auto.tokenization_auto:
# importlib, TOKENIZER_MAPPING_NAMES, model_type_to_module_name, PreTrainedTokenizerFast

# slow_class_name is something like SiglipTokenizer
def tokenizer_fast_class_from_name(slow_class_name: str):
    fast_class_name = f"{slow_class_name}Fast"
    for module_name, tokenizers in TOKENIZER_MAPPING_NAMES.items():
        # search the VALUES of the dict to find the slow class
        if slow_class_name in tokenizers:
            # check if the slow class + "Fast" is in the values
            if fast_class_name in tokenizers:
                module = importlib.import_module(f".{model_type_to_module_name(module_name)}", "transformers.models")
                try:
                    return getattr(module, fast_class_name)
                except AttributeError:
                    continue
            # check if "PreTrainedTokenizerFast" is in the values
            elif "PreTrainedTokenizerFast" in tokenizers:
                return PreTrainedTokenizerFast
    return None
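A hedged usage sketch for the helper above, with return values that assume mappings shaped like the illustrative subset earlier:

```python
# Dedicated fast class registered alongside the slow one:
tokenizer_fast_class_from_name("BertTokenizer")     # -> BertTokenizerFast

# Only the generic fast class registered (the SigLip case discussed here):
tokenizer_fast_class_from_name("SiglipTokenizer")   # -> PreTrainedTokenizerFast

# Class name absent from every mapping:
tokenizer_fast_class_from_name("UnknownTokenizer")  # -> None
```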
Could you add a test to show what we are enabling? I am not sure I follow, got a bit lost 😓
@ArthurZucker the test is here in the SigLip PR: https://github.com/huggingface/transformers/pull/29969/files#r1784784751 (copy pasted below)

# Model does not have a fast tokenizer or PreTrainedTokenizerFast specified in config but can still load fast
tokenizer = AutoTokenizer.from_pretrained("google/siglip-base-patch16-224", use_fast=True)
self.assertEqual(type(tokenizer), PreTrainedTokenizerFast)

No, this test cannot be added to this PR because SigLip is not merged; SigLip would be the first model that it is possible to test here. It is testing that when a fast tokenizer or PreTrainedTokenizerFast is not specified in the config, the tokenizer can still load fast. In order to run this test in this PR, we could make an internal SigLip model, but I think that would be redundant since we are intending to merge SigLip soon
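For reference, a self-contained sketch of that test (the checkpoint name comes from the linked PR; that its config names no fast tokenizer class while the repo ships a tokenizer.json is an assumption carried over from this thread):

```python
from transformers import AutoTokenizer, PreTrainedTokenizerFast

# The config declares no fast tokenizer class, but the repo has a
# tokenizer.json, so use_fast=True should fall back to PreTrainedTokenizerFast.
tokenizer = AutoTokenizer.from_pretrained("google/siglip-base-patch16-224", use_fast=True)
assert type(tokenizer) is PreTrainedTokenizerFast
```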
Okay, I mean we can still add a model on the hub that would not have a tokenizer_config.json -> load a pretrained fast tokenizer still
Okay, it's just that IMO what we should do sounds simple: if we are not able to go through the normal route, but we have a tokenizer.json -> just load PreTrainedTokenizerFast.
@ArthurZucker this feature requires that tokenizer_config.json exists and specifies a tokenizer_class (the config_tokenizer_class is not None branch above). So if we create a private testing model on the hub without the tokenizer_config.json, it would not exercise this code path. Also, I no longer have rights to create models for the HF org on the hub 😢
Ah okay let me add you on the org!
Added you back! 🤗
This is exactly what I want!
IMO let's just aim for simplicity: if we can't load anything (no tokenizer_config.json) then we just load the tokenizer.json, that's it, it's the only thing to add! Let's merge SigLip and this one will follow!
Current status for AutoTokenizer with use_fast=True (this PR):
1. (unchanged) check tokenizer_config.json for a tokenizer_class name ending with Fast
2. if not, check whether the repo has a tokenizer.json file
   2.1. if yes, load PreTrainedTokenizerFast
3. if not, load a slow tokenizer
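A minimal pseudocode sketch of that resolution order (hypothetical helper name and signature, not the actual from_pretrained internals):

```python
def resolve_fast_tokenizer(repo_files, config_tokenizer_class):
    # 1. (unchanged) trust an explicit *Fast class named in tokenizer_config.json
    if config_tokenizer_class is not None and config_tokenizer_class.endswith("Fast"):
        return tokenizer_class_from_name(config_tokenizer_class)
    # 2. otherwise, fall back to a serialized fast tokenizer if the repo ships one
    if "tokenizer.json" in repo_files:
        return PreTrainedTokenizerFast  # 2.1
    # 3. otherwise, signal the caller to load a slow tokenizer
    return None
```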
prereq for #29969