Datasets' cache not re-used #3847

Open
gejinchen opened this issue Mar 7, 2022 · 26 comments · Fixed by #6318
Labels
bug Something isn't working

Comments

@gejinchen

Describe the bug

For most tokenizers I have tested (e.g. the RoBERTa tokenizer), the data preprocessing cache is not fully reused in the first few runs, even though the .arrow cache files are in the cache directory.

Steps to reproduce the bug

Here is a reproducer. The GPT2 tokenizer works perfectly with caching, but not the RoBERTa tokenizer in this example.

from datasets import load_dataset
from transformers import AutoTokenizer

raw_datasets = load_dataset("wikitext", "wikitext-2-raw-v1")
# tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
text_column_name = "text"
column_names = raw_datasets["train"].column_names

def tokenize_function(examples):
    return tokenizer(examples[text_column_name], return_special_tokens_mask=True)

tokenized_datasets = raw_datasets.map(
    tokenize_function,
    batched=True,
    remove_columns=column_names,
    load_from_cache_file=True,
    desc="Running tokenizer on every text in dataset",
)
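
To make cache hits visible, one option is to raise the datasets logging verbosity before calling map (the exact messages vary by version, so this is only a rough check):

    import datasets

    # Cache hits are then reported with messages like
    # "Loading cached processed dataset at ...", while a fresh run shows the
    # "Running tokenizer on every text in dataset" progress bar instead.
    datasets.logging.set_verbosity_info()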

Expected results

No tokenization would be required after the 1st run. Everything should be loaded from the cache.

Actual results

Tokenization for some subsets is repeated on the 2nd and 3rd runs. Starting from the 4th run, everything is loaded from the cache.

Environment info

  • datasets version: 1.18.3
  • Platform: Ubuntu 18.04.6 LTS
  • Python version: 3.6.9
  • PyArrow version: 6.0.1
@gejinchen added the bug label Mar 7, 2022
@lhoestq
Member

lhoestq commented Mar 8, 2022

I think this is because the tokenizer is stateful and because the order in which the splits are processed is not deterministic. Because of that, the hash of the tokenizer may change for certain splits, which causes issues with caching.

To fix this we can try making the order of the splits deterministic for map.
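
As a rough way to check this on your side (a sketch using the internal Hasher that map's fingerprinting relies on, not a public debugging API):

    from datasets.fingerprint import Hasher
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("roberta-base")

    def tokenize_function(examples):
        return tokenizer(examples["text"], return_special_tokens_mask=True)

    # map derives its cache fingerprint from (among other things) the hash of
    # the function, which recursively includes the tokenizer in its closure.
    print(Hasher.hash(tokenize_function))

If this value changes between runs (or after one split has been processed), the cache files written earlier can no longer be matched.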

@lhoestq
Member

lhoestq commented Mar 15, 2022

Actually this is not because of the order of the splits, but most likely because the tokenizer used to process the second split is in a state that has been modified by the first split.

Therefore after reloading the first split from the cache, then the second split can't be reloaded since the tokenizer hasn't seen the first split (and therefore is considered a different tokenizer).

This is a bit trickier to fix; we can explore fixing this next week maybe.
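
A quick way to check whether this is what happens (just a diagnostic sketch): hash the tokenizer before and after it tokenizes some text and compare.

    from datasets.fingerprint import Hasher
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("roberta-base")
    before = Hasher.hash(tokenizer)
    tokenizer("some text from the first split", return_special_tokens_mask=True)
    after = Hasher.hash(tokenizer)
    # If this prints False, the map call on the second split computes a new
    # fingerprint and misses the cache written with the "fresh" tokenizer.
    print(before == after)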

@lhoestq
Member

lhoestq commented Apr 19, 2022

Sorry didn't have the bandwidth to take care of this yet - will re-assign when I'm diving into it again !

@thorinf

thorinf commented Jun 14, 2022

I had this issue with run_speech_recognition_ctc.py for wav2vec 2.0 fine-tuning. I made a small change, and the hash for the function (which includes tokenisation) is now the same before and after pre-processing. With the hash being the same, the caching works as intended.

Before:

    def prepare_dataset(batch):
        # load audio
        sample = batch[audio_column_name]

        inputs = feature_extractor(sample["array"], sampling_rate=sample["sampling_rate"])
        batch["input_values"] = inputs.input_values[0]
        batch["input_length"] = len(batch["input_values"])

        # encode targets
        additional_kwargs = {}
        if phoneme_language is not None:
            additional_kwargs["phonemizer_lang"] = phoneme_language

        batch["labels"] = tokenizer(batch["target_text"], **additional_kwargs).input_ids

        return batch

    with training_args.main_process_first(desc="dataset map preprocessing"):
        vectorized_datasets = raw_datasets.map(
            prepare_dataset,
            remove_columns=next(iter(raw_datasets.values())).column_names,
            num_proc=num_workers,
            desc="preprocess datasets",
        )

After:

    def prepare_dataset(batch, feature_extractor, tokenizer):
        # load audio
        sample = batch[audio_column_name]

        inputs = feature_extractor(sample["array"], sampling_rate=sample["sampling_rate"])
        batch["input_values"] = inputs.input_values[0]
        batch["input_length"] = len(batch["input_values"])

        # encode targets
        additional_kwargs = {}
        if phoneme_language is not None:
            additional_kwargs["phonemizer_lang"] = phoneme_language

        batch["labels"] = tokenizer(batch["target_text"], **additional_kwargs).input_ids

        return batch

    pd = lambda batch: prepare_dataset(batch, feature_extractor, tokenizer)

    with training_args.main_process_first(desc="dataset map preprocessing"):
        vectorized_datasets = raw_datasets.map(
            pd,
            remove_columns=next(iter(raw_datasets.values())).column_names,
            num_proc=num_workers,
            desc="preprocess datasets",
        )

@lhoestq
Member

lhoestq commented Jun 14, 2022

Not sure why the second one would work and not the first one - they're basically the same with respect to hashing. In both cases the function is hashed recursively, and therefore the feature_extractor and the tokenizer are hashed the same way.

With which tokenizer or feature extractor are you experiencing this behavior ?

Do you also experience this ?

Tokenization for some subsets is repeated on the 2nd and 3rd runs. Starting from the 4th run, everything is loaded from the cache.

@lhoestq
Member

lhoestq commented Jun 14, 2022

Thanks ! Hopefully this can be useful to others, and also to better understand and improve hashing/caching

@thorinf

thorinf commented Jun 15, 2022

tokenizer.save_pretrained(training_args.output_dir) produces a tokenizer with a different hash when it is loaded on a restart of the script. When I was debugging before, I was terminating the script prior to this command and then rerunning.

I compared the tokenizer items on the first and second runs; two items differ:
1st:

('_additional_special_tokens', [AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True)])

...

('tokens_trie', <transformers.tokenization_utils.Trie object at 0x7f4d6d0ddb38>)

2nd:

('_additional_special_tokens', [AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True)])

...

('tokens_trie', <transformers.tokenization_utils.Trie object at 0x7efc23dcce80>)

On every run the special tokens are appended again, and the hash of the tokens_trie is different. The growth of the special tokens list could be cleaned up, but I'm not sure about the hash of the tokens_trie. What might work is translating the tokenizer encoding call into a function that strips any unnecessary information out, but that's a guess.
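
To narrow this down, one option is to hash each attribute of the two tokenizers separately and only print the ones that differ (a hypothetical helper, not part of the training script):

    from datasets.fingerprint import Hasher

    def diff_tokenizer_hashes(tok_a, tok_b):
        # Compare per-attribute fingerprints of two tokenizer instances to find
        # the attributes whose serialized state differs between runs.
        for key in sorted(set(tok_a.__dict__) | set(tok_b.__dict__)):
            hash_a = Hasher.hash(tok_a.__dict__.get(key))
            hash_b = Hasher.hash(tok_b.__dict__.get(key))
            if hash_a != hash_b:
                print(f"{key}: {hash_a} != {hash_b}")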

@lhoestq
Member

lhoestq commented Jun 16, 2022

Thanks for investigating ! Does that mean that save_pretrained() produces non-deterministic tokenizers on disk ? Or is it from_pretrained() which is not deterministic given the same files on disk ?

I think one way to fix this would be to make save/from_pretrained deterministic, or make the pickling of transformers.tokenization_utils.Trie objects deterministic (this could be implemented in transformers, but maybe let's discuss in an issue in transformers before opening a PR)

@Narsil
Contributor

Narsil commented Jul 4, 2022

Late to the party but everything should be deterministic (afaik at least).

But Trie is a simple class object, so afaik its hash function is linked to its id(self), i.e. basically where it's stored in memory, so super highly non-deterministic. Could that be the issue ?

@lhoestq
Member

lhoestq commented Jul 4, 2022

But Trie is a simple class object, so afaik its hash function is linked to its id(self), i.e. basically where it's stored in memory, so super highly non-deterministic. Could that be the issue ?

We're computing the hash of the pickle dump of the class so it should be fine, as long as the pickle dump is deterministic
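
For illustration (a sketch, assuming transformers.tokenization_utils.Trie as in the dumps above): two freshly built tries with the same content hash the same within a single run, because the fingerprint comes from the serialized state rather than from id(self). Whether this stays stable across separate runs is the open question here.

    from datasets.fingerprint import Hasher
    from transformers.tokenization_utils import Trie

    trie_a, trie_b = Trie(), Trie()
    for token in ["<s>", "</s>"]:
        trie_a.add(token)
        trie_b.add(token)

    # Same serialized state -> same fingerprint, regardless of memory addresses.
    print(Hasher.hash(trie_a) == Hasher.hash(trie_b))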

@thorinf

thorinf commented Jul 5, 2022

I've ported wav2vec 2.0 fine-tuning into Optimum-Graphcore, which is where I found the issue. The majority of the script was copied from the Transformers version to keep it similar; here is the tokenizer loading section from the source.

In the last comment I have two loaded tokenizers, one from run 'N' of the script and one from 'N+1'. I think what's happening is that when you add special tokens (e.g. PAD and UNK), another AddedToken object is appended when the tokenizer is saved, regardless of whether the special tokens are already there.

If there were an AddedTokens cleanup at load/save, this could solve the issue, but would the Trie still cause the hash to be different? I'm not sure.

@Narsil
Contributor

Narsil commented Jul 18, 2022

Which Python version are you using ?

The trie is basically a big dict of dicts, so its deterministic nature depends on the Python version:
https://stackoverflow.com/questions/2053021/is-the-order-of-a-python-dictionary-guaranteed-over-iterations

Maybe the investigation is actually not finding the right culprit though (the memory id changes, but datasets is not using that to compare, so maybe we need to look within datasets to see where the comparison fails).

@hankcs

hankcs commented Feb 2, 2023

Similar issue found with BartTokenizer. You can bypass the bug by loading a fresh tokenizer every time.

    dataset = dataset.map(lambda x: tokenize_func(x, BartTokenizer.from_pretrained(xxx)),
                          num_proc=num_proc, desc='Tokenize')

@bzz

bzz commented Sep 12, 2023

Linking in #6179 (comment) with an explanation.

@mpenagar

I got the same problem while using Wav2Vec2CTCTokenizer in a distributed experiment (many processes), and found that the problem was localized in the serialization (pickle dump) of the field tokenizer.tokens_trie._tokens (just a Python set). I focused on the set serialization and found it is not deterministic:

from datasets.fingerprint import Hasher
from pickle import dumps,loads

# used just once to get a serialized literal
#print(dumps(set("abc")))
serialized = b'\x80\x04\x95\x11\x00\x00\x00\x00\x00\x00\x00\x8f\x94(\x8c\x01a\x94\x8c\x01c\x94\x8c\x01b\x94\x90.'

myset = loads(serialized)
print(f'{myset=} {Hasher.hash(myset)}')
print(serialized == dumps(myset))

Every time you run the Python script (i.e. in a different process) you get a different result. @lhoestq does this make any sense?

@mpenagar

mpenagar commented Oct 19, 2023

OK, I assume Python's set is just a hash table implementation that internally uses the hash() function. The problem is that Python's hash() is not deterministic across runs (string hashing is salted per process). I believe that by setting the environment variable PYTHONHASHSEED to a fixed value, you can force it to be deterministic. I tried it (file set_pickle_dump.py):

#!/usr/bin/python3

from datasets.fingerprint import Hasher
from pickle import dumps,loads

# used just once to get a serialized literal (with environment variable PYTHONHASHSEED set to 42)
#print(dumps(set("abc")))
serialized = b'\x80\x04\x95\x11\x00\x00\x00\x00\x00\x00\x00\x8f\x94(\x8c\x01b\x94\x8c\x01c\x94\x8c\x01a\x94\x90.'

myset = loads(serialized)
print(f'{myset=} {Hasher.hash(myset)}')
print(serialized == dumps(myset))

and now every run (PYTHONHASHSEED=42 ./set_pickle_dump.py) gets the same result. I then tried to test it with the tokenizer (file test_tokenizer.py):

#!/usr/bin/python3
from transformers import AutoTokenizer
from datasets.fingerprint import Hasher

tokenizer = AutoTokenizer.from_pretrained('model')
print(f'{type(tokenizer)=}')
print(f'{Hasher.hash(tokenizer)=}')

Executed as PYTHONHASHSEED=42 ./test_tokenizer.py, the tokenizer fingerprint is now always the same!

@lhoestq
Member

lhoestq commented Oct 19, 2023

Thanks for reporting. I opened a PR to propose a fix: #6318. It doesn't require setting PYTHONHASHSEED.

Can you try to install datasets from this branch and tell me if it fixes the issue ?

@mpenagar

mpenagar commented Oct 19, 2023

I patched (*) the file datasets/utils/py_utils.py and the cache is working properly now. Thanks!

(*): I am running my experiments inside a docker container that depends on huggingface/transformers-pytorch-gpu:latest, so I patched the file instead of rebuilding the container from scratch.

@albertvillanova linked a pull request Oct 20, 2023 that will close this issue
@albertvillanova
Member

Fixed by #6318.

@lhoestq
Member

lhoestq commented Oct 20, 2023

The OP issue hasn't been fixed, re-opening

@lhoestq reopened this Oct 20, 2023
@enze5088

enze5088 commented Nov 1, 2023

I think the Trie()._tokens of PreTrainedTokenizer needs to be a sorted set, so that the result of hash_bytes(dumps(tokenizer)) is consistent every time.
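
A minimal illustration of the idea (a sketch, not the actual fix): fingerprint a sorted copy of the set so that the result no longer depends on iteration order.

    from datasets.fingerprint import Hasher

    special_tokens = {"<s>", "</s>", "<pad>", "<unk>"}

    # Sorting pins down the element order, so the fingerprint no longer depends
    # on the set's PYTHONHASHSEED-dependent iteration order.
    print(Hasher.hash(sorted(special_tokens)))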

@enze5088

enze5088 commented Nov 1, 2023

I believe the issue may be linked to tokenization_utils.py#L507, specifically the line where self.tokens_trie.add(token.content) is called. The function _update_trie appears to modify an unordered set. Consequently, this line:
value = hash_bytes(dumps(tokenizer.tokens_trie._tokens))
can lead to inconsistencies when rerunning the code.

This, in turn, results in inconsistent outputs for both hash_bytes(dumps(function)) at arrow_dataset.py#L3053 and
hasher.update(transform_args[key]) at fingerprint.py#L323

from datasets import Dataset
from datasets.fingerprint import (
    format_kwargs_for_fingerprint,
    format_transform_for_fingerprint,
    update_fingerprint,
)

# raw_datasets here is a single split (a Dataset); DatasetDict has no _fingerprint
dataset_kwargs = {
    "shard": raw_datasets,
    "function": tokenize_function,
}
transform = format_transform_for_fingerprint(Dataset._map_single)
kwargs_for_fingerprint = format_kwargs_for_fingerprint(Dataset._map_single, (), dataset_kwargs)
kwargs_for_fingerprint["fingerprint_name"] = "new_fingerprint"
new_fingerprint = update_fingerprint(raw_datasets._fingerprint, transform, kwargs_for_fingerprint)

@enze5088

enze5088 commented Nov 1, 2023

Alternatively, does the "dumps" function require separate processing for the set?

@lhoestq
Member

lhoestq commented Nov 1, 2023

We did a fix that does sorting whenever we hash sets. The fix is available on main if you want to try it out. We'll do a new release soon :)
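
For reference, the gist of such a fix looks roughly like this (a hypothetical sketch, not the actual datasets implementation, which plugs into its dill-based pickler): serialize sets as sorted lists so the bytes being hashed do not depend on set iteration order.

    import io
    import pickle

    class DeterministicSetPickler(pickle._Pickler):
        # Pure-Python pickler whose reduction for set sorts the elements first,
        # so the produced bytes do not depend on PYTHONHASHSEED.
        dispatch = dict(pickle._Pickler.dispatch)

        def _save_set(self, obj):
            self.save_reduce(set, (sorted(obj, key=pickle.dumps),), obj=obj)

        dispatch[set] = _save_set

    def deterministic_dumps(obj):
        buffer = io.BytesIO()
        DeterministicSetPickler(buffer, protocol=pickle.HIGHEST_PROTOCOL).dump(obj)
        return buffer.getvalue()

    print(deterministic_dumps({"a", "b", "c"}) == deterministic_dumps({"c", "a", "b"}))  # True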

@Guitaricet
Contributor

Is there a documentation chapter that discusses in which cases you should expect your dataset preprocessing to be cached, including do's and don'ts for the preprocessing functions? I think the Datasets team does an amazing job at tackling this issue on their side, but it would be great to have some guidelines on the user side as well.

In our current project we have two cases (text-to-text classification and summarization), and in one of them the cache is sometimes reused when it's not supposed to be, while in the other it's never used at all 😅

@lhoestq
Member

lhoestq commented Nov 20, 2023

You can find some docs here :)
https://huggingface.co/docs/datasets/about_cache
