AutoTokenizer hash value changes after datasets.map #14931

Closed

tshu-w opened this issue Dec 26, 2021 · 19 comments


tshu-w commented Dec 26, 2021

Environment info

  • transformers version: 4.15.0
  • Platform: Linux-5.4.0-91-generic-x86_64-with-glibc2.27
  • Python version: 3.9.7
  • PyTorch version (GPU?): 1.10.1+cu113 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help

@LysandreJik @lhoestq

Information

Model I am using (Bert, XLNet ...): Bert (bert-base-uncased)

The problem arises when using:

  • my own modified scripts

The tasks I am working on is:

  • an official GLUE/SQUaD task: mrpc

To reproduce

Steps to reproduce the behavior:

  1. clear the Hugging Face datasets cache
  2. run the following code:
from transformers import AutoTokenizer, BertTokenizer
from datasets import load_dataset
from datasets.fingerprint import Hasher
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

raw_datasets = load_dataset("glue", "mrpc")

print(Hasher.hash(tokenize_function))
print(Hasher.hash(tokenizer))

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

print(Hasher.hash(tokenize_function))
print(Hasher.hash(tokenizer))

got:

Reusing dataset glue (/home1/wts/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)
100%|██████████| 3/3 [00:00<00:00, 1112.35it/s]
f4976bb4694ebc51
3fca35a1fd4a1251
100%|██████████| 4/4 [00:00<00:00,  6.96ba/s]
100%|██████████| 1/1 [00:00<00:00, 15.25ba/s]
100%|██████████| 2/2 [00:00<00:00,  5.81ba/s]
d32837619b7d7d01
5fd925c82edd62b6
  3. run raw_datasets.map(tokenize_function, batched=True) again and observe that some splits are not loaded from the cache.

Expected behavior

AutoTokenizer should work like a specific tokenizer class (the hash value doesn't change after map):

from transformers import AutoTokenizer, BertTokenizer
from datasets import load_dataset
from datasets.fingerprint import Hasher
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

raw_datasets = load_dataset("glue", "mrpc")

print(Hasher.hash(tokenize_function))
print(Hasher.hash(tokenizer))

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

print(Hasher.hash(tokenize_function))
print(Hasher.hash(tokenizer))

got:

Reusing dataset glue (/home1/wts/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)
100%|██████████| 3/3 [00:00<00:00, 1091.22it/s]
46d4b31f54153fc7
5b8771afd8d43888
Loading cached processed dataset at /home1/wts/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-6b07ff82ae9d5c51.arrow
Loading cached processed dataset at /home1/wts/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-af738a6d84f3864b.arrow
Loading cached processed dataset at /home1/wts/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-531d2a603ba713c1.arrow
46d4b31f54153fc7
5b8771afd8d43888

tshu-w commented Dec 26, 2021

It seems this issue also occurs with other Auto classes, like AutoFeatureExtractor.

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.


tshu-w commented Jan 26, 2022

@lhoestq Hi, can you take a look at this issue? I'm not sure whether I should report it in datasets or transformers.


lhoestq commented Jan 26, 2022

Hi! This should be handled by datasets IMO - feel free to create an issue in the datasets GitHub repository: https://github.com/huggingface/datasets

I tried running the code above, but the hash didn't change; I wasn't able to reproduce the issue.


tshu-w commented Jan 27, 2022

@lhoestq Hi, I reported this on datasets huggingface/datasets#3638


Narsil commented Jan 28, 2022

Hi @tshu-w

(For other readers, the interesting comment is here: huggingface/datasets#3638 (comment).)
It seems your example falls into this category:

We could try and set these 2 dicts at initialization time, but it wouldn't work if a user modified the tokenizer state later

The call is this:

tokenizer(example["sentence1"], example["sentence2"], truncation=True)

This, by definition, will modify the tokenizer's underlying state, since it has to modify the TruncationParams to set truncation to True.
The only way to actually fix it is to call the tokenizer once before calling the map function, like so:

from transformers import AutoTokenizer, BertTokenizer
from datasets import load_dataset
from datasets.fingerprint import Hasher
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# ADD THIS: call the tokenizer once on dummy inputs with truncation=True, so its
# internal truncation state is already set before anything is hashed
tokenizer("dummy sentence 1", "dummy sentence 2", truncation=True)

def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

# ... rest of the script should work, and the hashes will be the same.

Sorry, I missed that when reading the first issue. I had thought this was triggered by some default configuration of the tokenizer (one that wasn't properly set at initialization time), but that isn't the case.


@lhoestq tagging you too: while looking into this issue, I realized that hash(tokenizer) stayed consistent while Hasher.hash(..) didn't. Maybe something can be done? Or could the hash be taken after one iteration?

Actually, the only state maintained by the transformers fast tokenizers themselves is padding_side and truncation_side; no other arguments are ever kept within the class itself (meaning we cannot create the proper state for them before the call).

A last option would be to make the tokenizers Tokenizer itself stateless. Performance-wise, I don't know whether the hit of passing the same arguments over and over would be significant (it probably needs to be checked), but it's also a pretty big change, I think.


lhoestq commented Jan 31, 2022

Thanks for the ideas @Narsil :)

hash is not well-suited for caching unfortunately: it may not return the same hash in two different sessions. In the same session, two identical objects might not even have the same hash (try calling hash(lambda x: x) several times for example). This would lead to lots of cache misses.
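
A quick illustration of that point (not from the thread): two identical-looking function objects already get different built-in hashes within a single session, because the default hash is identity-based.

f = lambda x: x
g = lambda x: x
print(hash(f) == hash(g))  # False, even though the functions are identical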

Taking the hash after the first iteration can do the job, but the downside is that it would require users to wait for the first batch to be processed before checking the cache, which can be confusing IMO.

Which TruncationParams are you talking about exactly? Would it make sense to make the datasets Hasher ignore them?


Narsil commented Jan 31, 2022

Summarizing a discussion that happened orally:

The best course of action right now is to try and modify __setstate__ and __getstate__ of the fast tokenizers, to override the default pickle behavior.

  • That's the standard way to override things in datasets.
  • We can do that because transformers fast tokenizers do not hold any state (except {padding|truncation}_side), so the only things that will actually cause a cache miss in datasets are:
    • modifying the map function code, or
    • modifying any of the values associated with that code.
    • The state of the tokenizers (specifically _tokenizer.truncation and _tokenizer.padding) will NOT cause a miss.

This is a pretty big change, but it currently seems like the best course of action (a rough sketch of the idea follows after the checklist below).
Since we're touching something relatively core, we need to be extra careful about:

  • Making sure we can always unpickle already-pickled tokenizers, and that this stays the case in the future
  • Making sure the cache hit/miss of datasets is kept working through time
  • Making sure _tokenizer pickling/unpickling can still happen correctly (this one still has state that has to be maintained).
  • Making sure both pickling behaviors don't interact in a nasty fashion.
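
For illustration only, here is a rough, hypothetical sketch of that idea - not the actual transformers implementation. It assumes the fast tokenizer keeps its Rust tokenizer in _tokenizer (as mentioned later in this thread) and uses the tokenizers APIs to_str/from_str/no_truncation/no_padding:

from tokenizers import Tokenizer

class StatelessPicklingMixin:
    """Hypothetical mixin: pickle a fast tokenizer without its runtime state."""

    def __getstate__(self):
        state = self.__dict__.copy()
        rust_tokenizer = state.get("_tokenizer")
        if rust_tokenizer is not None:
            # Rebuild the Rust tokenizer from its JSON form and clear the
            # runtime truncation/padding settings, so two tokenizers that only
            # differ in that state pickle (and therefore hash) identically.
            clean = Tokenizer.from_str(rust_tokenizer.to_str())
            clean.no_truncation()
            clean.no_padding()
            state["_tokenizer"] = clean
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)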

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.


Narsil commented Feb 24, 2022

Unstale.

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.


tshu-w commented Mar 22, 2022

Unstale. General ping @Narsil: is there anything I can do?


Narsil commented Mar 22, 2022

TL;DR

Calling the tokenizer once on a dummy example beforehand will fix it:

tokenizer("Some", "test", truncation=True)

Long answer

If I remember the last status correctly, it's hard to do anything, since the call itself

tokenizer(example["sentence1"], example["sentence2"], truncation=True)

will modify the tokenizer. It's the truncation=True that modifies the tokenizer to put it into truncation mode, if you will.
Calling the tokenizer once with that argument beforehand fixes the cache.
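
For completeness, a minimal end-to-end sketch of that workaround (it follows the reproduction script at the top of this issue; the final assert reflects the expected behavior once the warm-up call is in place):

from transformers import AutoTokenizer
from datasets import load_dataset
from datasets.fingerprint import Hasher

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Warm-up call: puts the underlying Rust tokenizer into truncation mode once,
# before anything is hashed, so map() no longer changes its state.
tokenizer("Some", "test", truncation=True)

def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

raw_datasets = load_dataset("glue", "mrpc")

hash_before = Hasher.hash(tokenize_function)
raw_datasets.map(tokenize_function, batched=True)
assert Hasher.hash(tokenize_function) == hash_before  # hash stays stable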

Finding a fix that:

  • Doesn't imply a huge chunk of work on tokenizers (with potential loss of performance, and breaking backward compatibility)
  • Doesn't imply datasets running a first pass of the loop
  • Doesn't imply datasets looking at the map function itself
  • Uses a sound hash for this object in datasets.

is IIRC impossible for this use case.

I can explain a bit more why the first option is not desirable.

In order to "fix" this for tokenizers, we would need to make tokenizer(..) purely without side effects. This means that the "options" of tokenization (like truncation and padding at least) would have to be sent every single time to make the function "pure". But it also means that we would need to send every single time a bunch of options from Python to Rust, and that boundary is not zero-cost. The cost hopefully would be minimal, but it could prove to be high (Python GIL is a tricky beast).

The other thing is that it would force the tokenizers library to behave differently for a datasets-specific use case, which is less than ideal.

As for the datasets-specific solutions, I am not 100% sure I can explain them properly.


tshu-w commented Mar 22, 2022

@Narsil Thank you for your detailed explanation. @lhoestq Can you take a look at whether there is a specific solution on the datasets side?


lhoestq commented Mar 22, 2022

Yes, I think we can have a workaround: reload your tokenizer before your second map call (a sketch follows below).

Note that we still need to fix this issue in datasets first: huggingface/datasets#3847
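
A sketch of that workaround, continuing the reproduction script from the top of this issue (tokenizer is a module-level global captured by tokenize_function, so reassigning it is enough):

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

# Reload the tokenizer before the second map call so it hashes like a fresh one.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)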


tshu-w commented Mar 23, 2022

Good to know. @Narsil So I think this issue can be closed now. 😄

@tshu-w tshu-w closed this as completed Mar 23, 2022

Narsil commented Mar 23, 2022

I'll follow the discussion over there then.


lhoestq commented Jan 25, 2023

Btw @Narsil, what's the attribute of the tokenizer we would need to ignore to get a hash of the tokenizer that doesn't depend on its state? We could implement a custom pickling on the datasets side only.


Narsil commented Jan 25, 2023

The Python logic is there: https://github.com/huggingface/transformers/blob/main/src/transformers/tokenization_utils_fast.py#L354

tokenizer._tokenizer.{truncation,padding}, it seems.
I don't think there are others. However, this might affect the global hash of tokenizer._tokenizer too.
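
As a rough illustration of what ignoring those attributes could look like - a hypothetical helper, not datasets' actual mechanism, and for simplicity it hashes only the Rust tokenizer's JSON serialization rather than the whole Python wrapper:

import json

from datasets.fingerprint import Hasher

def state_independent_hash(tokenizer):
    serialized = json.loads(tokenizer._tokenizer.to_str())
    # Drop the runtime truncation/padding configuration before hashing.
    serialized.pop("truncation", None)
    serialized.pop("padding", None)
    return Hasher.hash(json.dumps(serialized, sort_keys=True))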
