AutoTokenizer hash value changes after datasets.map #14931
Comments
It seems like this issue also occurs with other AutoClasses like …
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
@lhoestq Hi, can you look at this issue? I don't know whether I should report it in datasets or transformers.
Hi! This should be handled by … I tried running the code above, but the hash didn't change; I wasn't able to reproduce the issue.
@lhoestq Hi, I reported this on datasets: huggingface/datasets#3638
Hi @tshu-w (for other readers, the interesting comment is here: huggingface/datasets#3638 (comment)).
The call is this:

```python
tokenizer(example["sentence1"], example["sentence2"], truncation=True)
```

This, by definition, will modify the tokenizer's underlying state, since it has to modify the `TruncationParams` to set truncation to `True`.

```python
from transformers import AutoTokenizer, BertTokenizer
from datasets import load_dataset
from datasets.fingerprint import Hasher

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# ADD THIS: call the tokenizer once up front so the truncation state
# is already set before the tokenizer is hashed.
tokenizer("Some", "test", truncation=True)

def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

# ... rest of the script should work, and hashes be the same.
```

Sorry I missed that reading the first issue. I had thought that this was triggered by some default configuration of the tokenizer (that wasn't properly set at initialization time), but this isn't the case. @lhoestq tagging you too, since looking into this issue, I realized that … Actually, the only state maintained by … And a last option would be to make …
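For readers who want to see the state change directly, here is a minimal sketch (not from the thread; the `bert-base-uncased` checkpoint and the `Hasher` usage are assumptions based on the snippets above):

```python
from datasets.fingerprint import Hasher
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

h_before = Hasher.hash(tokenizer)
# This call sets TruncationParams on the underlying fast tokenizer.
tokenizer("Some", "test", truncation=True)
h_after = Hasher.hash(tokenizer)

print(h_before == h_after)  # expected per this issue: False, hence the cache misses
```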
Thanks for the ideas @Narsil :)

Taking the hash after the first iteration could do the job, but the downside is that it would require users to wait for the first batch to be processed before checking the cache, which can be confusing IMO. Which …
Summarizing a discussion that happened orally: the best course of action right now is to try and modify …

This is a pretty big change, but it currently seems like the best course of action.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Unstale.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Unstale. General ping @Narsil: is there anything I can do?
TL;DR

Calling the function once on a dummy example beforehand will fix it:

```python
tokenizer("Some", "test", truncation=True)
```

Long answer

If I remember the last status, it's hard to do anything, since the call itself

```python
tokenizer(example["sentence1"], example["sentence2"], truncation=True)
```

will modify the tokenizer. It's the … Finding a fix that:

…

is IIRC impossible for this use case. I can explain a bit more why the first option is not desirable. In order to "fix" this for tokenizers, we would need to make … The other thing is that it would force … For the datasets-specific solutions, I am not 100% sure I can explain them properly.
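To make the TL;DR concrete, a minimal sketch (an illustration under the same assumptions as above, not code from the thread) checking that the fingerprint stabilizes once the warm-up call has been made:

```python
from datasets.fingerprint import Hasher
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Warm-up call: mutates the backend truncation state once.
tokenizer("Some", "test", truncation=True)

before = Hasher.hash(tokenizer)
# Further calls with the same truncation settings leave the state untouched.
tokenizer("Another", "pair", truncation=True)
after = Hasher.hash(tokenizer)

print(before == after)  # expected: True, so `map` can reuse its cache
```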
Yes, I think we can have a workaround: you have to reload your tokenizer before your second `map` call. Note that we still need to fix this issue in `datasets`.
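A sketch of that workaround, reusing `raw_datasets` and `tokenize_function` from the snippets above (both assumed, as is the checkpoint name):

```python
from transformers import AutoTokenizer

# First pass: `tokenize_function` closes over the global `tokenizer`,
# whose state gets mutated by truncation=True during map.
raw_datasets.map(tokenize_function, batched=True)

# Workaround: rebind a freshly loaded tokenizer before the second map, so
# the function is hashed with the tokenizer in its initial state again
# and the cached results from the first pass are found.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
raw_datasets.map(tokenize_function, batched=True)
```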
Good to know, @Narsil. So I think this issue can be closed now. 😄
I'll follow the discussion over there then.
Btw @Narsil, what's the attribute of the tokenizer we would need to ignore to have a hash of the tokenizer that doesn't depend on its state? We could implement a custom pickling on the …
The Python logic is there: https://github.com/huggingface/transformers/blob/main/src/transformers/tokenization_utils_fast.py#L354
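For illustration only, a hypothetical sketch of such a state-independent hash; neither library ships a helper like this, and it leans on the `no_truncation()`/`no_padding()` methods of the underlying `tokenizers.Tokenizer`:

```python
import copy

from datasets.fingerprint import Hasher
from transformers import AutoTokenizer

def stateless_hash(tokenizer):
    # Hypothetical helper: hash a deep copy whose truncation/padding state
    # has been reset, so warm-up calls don't change the fingerprint.
    tok = copy.deepcopy(tokenizer)
    tok.backend_tokenizer.no_truncation()
    tok.backend_tokenizer.no_padding()
    return Hasher.hash(tok)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
h1 = stateless_hash(tokenizer)
tokenizer("Some", "test", truncation=True)
h2 = stateless_hash(tokenizer)
print(h1 == h2)  # expected: True, the volatile state is ignored
```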
Environment info

- `transformers` version: 4.15.0

Who can help

@LysandreJik @lhoestq

Information

Model I am using (Bert, XLNet ...):

The problem arises when using:

The tasks I am working on is:

To reproduce

Steps to reproduce the behavior:

…

got:

…

Run

```python
raw_datasets.map(tokenize_function, batched=True)
```

again and see that some datasets are not using the cache.

Expected behavior

`AutoTokenizer` works like a specific Tokenizer (the hash value doesn't change after `map`):

got:

…
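A minimal sketch of the reproduction being described; the checkpoint (`bert-base-uncased`) and the dataset (`glue`/`mrpc`, which provides the `sentence1`/`sentence2` columns) are assumptions:

```python
from datasets import load_dataset
from datasets.fingerprint import Hasher
from transformers import AutoTokenizer

raw_datasets = load_dataset("glue", "mrpc")                      # assumed dataset
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")   # assumed checkpoint

def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

print(Hasher.hash(tokenizer))                       # hash before map
raw_datasets.map(tokenize_function, batched=True)
print(Hasher.hash(tokenizer))                       # hash after map: differs for AutoTokenizer
```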