AutoTokenizer hash value changes after datasets.map #14931

Closed

tshu-w opened this issue Dec 26, 2021 · 19 comments


tshu-w commented Dec 26, 2021

Environment info

  • transformers version: 4.15.0
  • Platform: Linux-5.4.0-91-generic-x86_64-with-glibc2.27
  • Python version: 3.9.7
  • PyTorch version (GPU?): 1.10.1+cu113 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help

@LysandreJik @lhoestq

Information

Model I am using (Bert, XLNet ...): Bert (bert-base-uncased)

The problem arises when using:

  • my own modified scripts

The tasks I am working on is:

  • an official GLUE/SQUaD task: mrpc

To reproduce

Steps to reproduce the behavior:

  1. clear the Hugging Face datasets cache
  2. run the following code:
from transformers import AutoTokenizer, BertTokenizer
from datasets import load_dataset
from datasets.fingerprint import Hasher
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

raw_datasets = load_dataset("glue", "mrpc")

print(Hasher.hash(tokenize_function))
print(Hasher.hash(tokenizer))

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

print(Hasher.hash(tokenize_function))
print(Hasher.hash(tokenizer))

got:

Reusing dataset glue (/home1/wts/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)
100%|██████████| 3/3 [00:00<00:00, 1112.35it/s]
f4976bb4694ebc51
3fca35a1fd4a1251
100%|██████████| 4/4 [00:00<00:00,  6.96ba/s]
100%|██████████| 1/1 [00:00<00:00, 15.25ba/s]
100%|██████████| 2/2 [00:00<00:00,  5.81ba/s]
d32837619b7d7d01
5fd925c82edd62b6
  3. run raw_datasets.map(tokenize_function, batched=True) again and observe that some splits are not loaded from the cache.

Expected behavior

AutoTokenizer should work like a specific tokenizer class (the hash value doesn't change after map):

from transformers import AutoTokenizer, BertTokenizer
from datasets import load_dataset
from datasets.fingerprint import Hasher
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

raw_datasets = load_dataset("glue", "mrpc")

print(Hasher.hash(tokenize_function))
print(Hasher.hash(tokenizer))

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

print(Hasher.hash(tokenize_function))
print(Hasher.hash(tokenizer))

got:

Reusing dataset glue (/home1/wts/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)
100%|██████████| 3/3 [00:00<00:00, 1091.22it/s]
46d4b31f54153fc7
5b8771afd8d43888
Loading cached processed dataset at /home1/wts/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-6b07ff82ae9d5c51.arrow
Loading cached processed dataset at /home1/wts/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-af738a6d84f3864b.arrow
Loading cached processed dataset at /home1/wts/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-531d2a603ba713c1.arrow
46d4b31f54153fc7
5b8771afd8d43888

tshu-w commented Dec 26, 2021

It seems this issue also occurs with other Auto classes, like AutoFeatureExtractor.

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.


tshu-w commented Jan 26, 2022

@lhoestq Hi, can you take a look at this issue? I'm not sure whether I should report it in datasets or transformers.


lhoestq commented Jan 26, 2022

Hi! This should be handled by datasets IMO - feel free to create an issue in the datasets GitHub repository: https://github.com/huggingface/datasets

I tried running the code above, but the hash didn't change; I wasn't able to reproduce the issue.


tshu-w commented Jan 27, 2022

@lhoestq Hi, I reported this on datasets huggingface/datasets#3638


Narsil commented Jan 28, 2022

Hi @tshu-w

(For other readers, the interesting comment is here: huggingface/datasets#3638 (comment).)
It seems your example falls into this category:

We could try and set these 2 dicts at initialization time, but it wouldn't work if a user modified the tokenizer state later

The call is this:

tokenizer(example["sentence1"], example["sentence2"], truncation=True)

This, by definition, will modify the tokenizer's underlying state, since it has to modify the TruncationParams to set truncation to True.
The only way to actually fix it is to call the tokenizer once before calling the map function, like so:

from transformers import AutoTokenizer, BertTokenizer
from datasets import load_dataset
from datasets.fingerprint import Hasher
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# ADD THIS: call the tokenizer once on dummy inputs with truncation=True, so its
# internal truncation state is already set before anything is hashed
tokenizer("dummy sentence 1", "dummy sentence 2", truncation=True)

def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

# ... rest of the script should work, and the hashes will be the same.

Sorry, I missed that when reading the first issue. I had thought this was triggered by some default configuration of the tokenizer (one that wasn't properly set at initialization time), but that isn't the case.


@lhoestq tagging you too: while looking into this issue, I realized that hash(tokenizer) stayed consistent while Hasher.hash(..) didn't. Maybe something can be done? Or could the hash be taken after one iteration?

Actually, the only state maintained by the transformers fast tokenizers themselves is padding_side and truncation_side; no other arguments are ever kept within the class itself (meaning we cannot create the proper state for them before the call).

A last option would be to make the tokenizers Tokenizer itself stateless. Performance-wise, I don't know whether the hit of passing the same arguments over and over would be significant (it probably needs to be checked), but it's also a pretty big change, I think.


lhoestq commented Jan 31, 2022

Thanks for the ideas @Narsil :)

hash is not well-suited for caching unfortunately: it may not return the same hash in two different sessions. In the same session, two identical objects might not even have the same hash (try calling hash(lambda x: x) several times for example). This would lead to lots of cache misses.
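
A quick illustration of that point (not from the thread): two identical-looking function objects already get different built-in hashes within a single session, because the default hash is identity-based.

f = lambda x: x
g = lambda x: x
print(hash(f) == hash(g))  # False, even though the functions are identical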

Taking the hash after the first iteration can do the job, but the downside is that it would require users to wait for the first batch to be processed before checking the cache, which can be confusing IMO.

Which TruncationParams are you talking about exactly? Would it make sense to make the datasets Hasher ignore them?


Narsil commented Jan 31, 2022

Summarizing a discussion that happened orally:

The best course of action right now is to try and modify __setstate__ and __getstate__ of the fast tokenizers, to override the default pickle behavior.

  • That's the standard way to override things in datasets.
  • We can do that because transformers fast tokenizers do not hold any state (except {padding|truncation}_side), so the only things that will actually cause a cache miss in datasets are:
    • modifying the map function code, or
    • modifying any of the values associated with that code.
    • The state of the tokenizers (specifically _tokenizer.truncation and _tokenizer.padding) will NOT cause a miss.

This is a pretty big change, but it currently seems like the best course of action (a rough sketch of the idea follows after the checklist below).
Since we're touching something relatively core, we need to be extra careful about:

  • Making sure we can always unpickle already-pickled tokenizers, and that this stays the case in the future
  • Making sure the cache hit/miss of datasets is kept working through time
  • Making sure _tokenizer pickling/unpickling can still happen correctly (this one still has state that has to be maintained).
  • Making sure both pickling behaviors don't interact in a nasty fashion.
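
For illustration only, here is a rough, hypothetical sketch of that idea - not the actual transformers implementation. It assumes the fast tokenizer keeps its Rust tokenizer in _tokenizer (as mentioned later in this thread) and uses the tokenizers APIs to_str/from_str/no_truncation/no_padding:

from tokenizers import Tokenizer

class StatelessPicklingMixin:
    """Hypothetical mixin: pickle a fast tokenizer without its runtime state."""

    def __getstate__(self):
        state = self.__dict__.copy()
        rust_tokenizer = state.get("_tokenizer")
        if rust_tokenizer is not None:
            # Rebuild the Rust tokenizer from its JSON form and clear the
            # runtime truncation/padding settings, so two tokenizers that only
            # differ in that state pickle (and therefore hash) identically.
            clean = Tokenizer.from_str(rust_tokenizer.to_str())
            clean.no_truncation()
            clean.no_padding()
            state["_tokenizer"] = clean
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)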

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.


Narsil commented Feb 24, 2022

Unstale.

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.


tshu-w commented Mar 22, 2022

Unstale. General ping @Narsil: is there anything I can do?


Narsil commented Mar 22, 2022

TL;DR

Calling the tokenizer once on a dummy example beforehand will fix it:

tokenizer("Some", "test", truncation=True)

Long answer

If I remember the last status correctly, it's hard to do anything, since the call itself

tokenizer(example["sentence1"], example["sentence2"], truncation=True)

will modify the tokenizer. It's the truncation=True that modifies the tokenizer to put it into truncation mode, if you will.
Calling the tokenizer once with that argument beforehand fixes the cache.
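
For completeness, a minimal end-to-end sketch of that workaround (it follows the reproduction script at the top of this issue; the final assert reflects the expected behavior once the warm-up call is in place):

from transformers import AutoTokenizer
from datasets import load_dataset
from datasets.fingerprint import Hasher

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Warm-up call: puts the underlying Rust tokenizer into truncation mode once,
# before anything is hashed, so map() no longer changes its state.
tokenizer("Some", "test", truncation=True)

def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

raw_datasets = load_dataset("glue", "mrpc")

hash_before = Hasher.hash(tokenize_function)
raw_datasets.map(tokenize_function, batched=True)
assert Hasher.hash(tokenize_function) == hash_before  # hash stays stable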

Finding a fix that:

  • Doesn't imply a huge chunk of work on tokenizers (with potential loss of performance, and breaking backward compatibility)
  • Doesn't imply datasets running a first pass of the loop
  • Doesn't imply datasets looking at the map function itself
  • Uses a sound hash for this object in datasets.

is IIRC impossible for this use case.

I can explain a bit more why the first option is not desirable.

In order to "fix" this for tokenizers, we would need to make tokenizer(..) purely without side effects. This means that the "options" of tokenization (like truncation and padding at least) would have to be sent every single time to make the function "pure". But it also means that we would need to send every single time a bunch of options from Python to Rust, and that boundary is not zero-cost. The cost hopefully would be minimal, but it could prove to be high (Python GIL is a tricky beast).

The other thing is that it would force the tokenizers library to behave differently for a datasets-specific use case, which is less than ideal.

As for the datasets-specific solutions, I am not 100% sure I can explain them properly.


tshu-w commented Mar 22, 2022

@Narsil Thank you for your detailed explanation. @lhoestq Can you take a look at whether there is a specific solution on the datasets side?


lhoestq commented Mar 22, 2022

Yes, I think we can have a workaround: reload your tokenizer before your second map call (a sketch follows below).

Note that we still need to fix this issue in datasets first: huggingface/datasets#3847
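
A sketch of that workaround, continuing the reproduction script from the top of this issue (tokenizer is a module-level global captured by tokenize_function, so reassigning it is enough):

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

# Reload the tokenizer before the second map call so it hashes like a fresh one.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)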


tshu-w commented Mar 23, 2022

Good to know. @Narsil So I think this issue can be closed now. 😄

@tshu-w tshu-w closed this as completed Mar 23, 2022

Narsil commented Mar 23, 2022

I'll follow the discussion over there then.


lhoestq commented Jan 25, 2023

Btw @Narsil, what's the attribute of the tokenizer we would need to ignore to get a hash of the tokenizer that doesn't depend on its state? We could implement a custom pickling on the datasets side only.


Narsil commented Jan 25, 2023

The Python logic is there: https://github.com/huggingface/transformers/blob/main/src/transformers/tokenization_utils_fast.py#L354

tokenizer._tokenizer.{truncation,padding}, it seems.
I don't think there are others. However, this might affect the global hash of tokenizer._tokenizer too.
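
As a rough illustration of what ignoring those attributes could look like - a hypothetical helper, not datasets' actual mechanism, and for simplicity it hashes only the Rust tokenizer's JSON serialization rather than the whole Python wrapper:

import json

from datasets.fingerprint import Hasher

def state_independent_hash(tokenizer):
    serialized = json.loads(tokenizer._tokenizer.to_str())
    # Drop the runtime truncation/padding configuration before hashing.
    serialized.pop("truncation", None)
    serialized.pop("padding", None)
    return Hasher.hash(json.dumps(serialized, sort_keys=True))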
