Datasets' cache not re-used #3847

Open
gejinchen opened this issue Mar 7, 2022 · 26 comments · Fixed by #6318
Labels
bug Something isn't working

Comments

@gejinchen

Describe the bug

For most tokenizers I have tested (e.g. the RoBERTa tokenizer), the data preprocessing cache is not fully reused in the first few runs, even though the .arrow cache files are in the cache directory.

Steps to reproduce the bug

Here is a reproducer. The GPT2 tokenizer works perfectly with caching, but not the RoBERTa tokenizer in this example.

from datasets import load_dataset
from transformers import AutoTokenizer

raw_datasets = load_dataset("wikitext", "wikitext-2-raw-v1")
# tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
text_column_name = "text"
column_names = raw_datasets["train"].column_names

def tokenize_function(examples):
    return tokenizer(examples[text_column_name], return_special_tokens_mask=True)

tokenized_datasets = raw_datasets.map(
    tokenize_function,
    batched=True,
    remove_columns=column_names,
    load_from_cache_file=True,
    desc="Running tokenizer on every text in dataset",
)
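
To make cache hits visible, one option is to raise the datasets logging verbosity before calling map (the exact messages vary by version, so this is only a rough check):

    import datasets

    # Cache hits are then reported with messages like
    # "Loading cached processed dataset at ...", while a fresh run shows the
    # "Running tokenizer on every text in dataset" progress bar instead.
    datasets.logging.set_verbosity_info()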

Expected results

No tokenization would be required after the 1st run. Everything should be loaded from the cache.

Actual results

Tokenization for some subsets is repeated on the 2nd and 3rd runs. Starting from the 4th run, everything is loaded from the cache.

Environment info

  • datasets version: 1.18.3
  • Platform: Ubuntu 18.04.6 LTS
  • Python version: 3.6.9
  • PyArrow version: 6.0.1
@gejinchen added the bug label Mar 7, 2022
@lhoestq
Member

lhoestq commented Mar 8, 2022

I think this is because the tokenizer is stateful and because the order in which the splits are processed is not deterministic. Because of that, the hash of the tokenizer may change for certain splits, which causes issues with caching.

To fix this we can try making the order of the splits deterministic for map.
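
As a rough way to check this on your side (a sketch using the internal Hasher that map's fingerprinting relies on, not a public debugging API):

    from datasets.fingerprint import Hasher
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("roberta-base")

    def tokenize_function(examples):
        return tokenizer(examples["text"], return_special_tokens_mask=True)

    # map derives its cache fingerprint from (among other things) the hash of
    # the function, which recursively includes the tokenizer in its closure.
    print(Hasher.hash(tokenize_function))

If this value changes between runs (or after one split has been processed), the cache files written earlier can no longer be matched.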

@lhoestq
Member

lhoestq commented Mar 15, 2022

Actually this is not because of the order of the splits, but most likely because the tokenizer used to process the second split is in a state that has been modified by the first split.

Therefore after reloading the first split from the cache, then the second split can't be reloaded since the tokenizer hasn't seen the first split (and therefore is considered a different tokenizer).

This is a bit trickier to fix; we can explore fixing this next week maybe.
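
A quick way to check whether this is what happens (just a diagnostic sketch): hash the tokenizer before and after it tokenizes some text and compare.

    from datasets.fingerprint import Hasher
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("roberta-base")
    before = Hasher.hash(tokenizer)
    tokenizer("some text from the first split", return_special_tokens_mask=True)
    after = Hasher.hash(tokenizer)
    # If this prints False, the map call on the second split computes a new
    # fingerprint and misses the cache written with the "fresh" tokenizer.
    print(before == after)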

@lhoestq
Member

lhoestq commented Apr 19, 2022

Sorry didn't have the bandwidth to take care of this yet - will re-assign when I'm diving into it again !

@thorinf

thorinf commented Jun 14, 2022

I had this issue with run_speech_recognition_ctc.py for wav2vec 2.0 fine-tuning. I made a small change, and the hash for the function (which includes tokenisation) is now the same before and after pre-processing. With the hash being the same, the caching works as intended.

Before:

    def prepare_dataset(batch):
        # load audio
        sample = batch[audio_column_name]

        inputs = feature_extractor(sample["array"], sampling_rate=sample["sampling_rate"])
        batch["input_values"] = inputs.input_values[0]
        batch["input_length"] = len(batch["input_values"])

        # encode targets
        additional_kwargs = {}
        if phoneme_language is not None:
            additional_kwargs["phonemizer_lang"] = phoneme_language

        batch["labels"] = tokenizer(batch["target_text"], **additional_kwargs).input_ids

        return batch

    with training_args.main_process_first(desc="dataset map preprocessing"):
        vectorized_datasets = raw_datasets.map(
            prepare_dataset,
            remove_columns=next(iter(raw_datasets.values())).column_names,
            num_proc=num_workers,
            desc="preprocess datasets",
        )

After:

    def prepare_dataset(batch, feature_extractor, tokenizer):
        # load audio
        sample = batch[audio_column_name]

        inputs = feature_extractor(sample["array"], sampling_rate=sample["sampling_rate"])
        batch["input_values"] = inputs.input_values[0]
        batch["input_length"] = len(batch["input_values"])

        # encode targets
        additional_kwargs = {}
        if phoneme_language is not None:
            additional_kwargs["phonemizer_lang"] = phoneme_language

        batch["labels"] = tokenizer(batch["target_text"], **additional_kwargs).input_ids

        return batch

    pd = lambda batch: prepare_dataset(batch, feature_extractor, tokenizer)

    with training_args.main_process_first(desc="dataset map preprocessing"):
        vectorized_datasets = raw_datasets.map(
            pd,
            remove_columns=next(iter(raw_datasets.values())).column_names,
            num_proc=num_workers,
            desc="preprocess datasets",
        )

@lhoestq
Member

lhoestq commented Jun 14, 2022

Not sure why the second one would work and not the first one - they're basically the same with respect to hashing. In both cases the function is hashed recursively, and therefore the feature_extractor and the tokenizer are hashed the same way.

With which tokenizer or feature extractor are you experiencing this behavior ?

Do you also experience this ?

Tokenization for some subsets is repeated on the 2nd and 3rd runs. Starting from the 4th run, everything is loaded from the cache.

@lhoestq
Member

lhoestq commented Jun 14, 2022

Thanks ! Hopefully this can be useful to others, and also to better understand and improve hashing/caching

@thorinf

thorinf commented Jun 15, 2022

tokenizer.save_pretrained(training_args.output_dir) produces a tokenizer with a different hash when it is loaded on a restart of the script. When I was debugging before, I was terminating the script prior to this command and then rerunning.

I compared the tokenizer items on the first and second runs; two items differ:
1st:

('_additional_special_tokens', [AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True)])

...

('tokens_trie', <transformers.tokenization_utils.Trie object at 0x7f4d6d0ddb38>)

2nd:

('_additional_special_tokens', [AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True), AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True)])

...

('tokens_trie', <transformers.tokenization_utils.Trie object at 0x7efc23dcce80>)

On every run the special tokens are appended again, and the hash of the tokens_trie is different. The growth of the special tokens list could be cleaned up, but I'm not sure about the hash of the tokens_trie. What might work is translating the tokenizer encoding call into a function that strips any unnecessary information out, but that's a guess.
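
To narrow this down, one option is to hash each attribute of the two tokenizers separately and only print the ones that differ (a hypothetical helper, not part of the training script):

    from datasets.fingerprint import Hasher

    def diff_tokenizer_hashes(tok_a, tok_b):
        # Compare per-attribute fingerprints of two tokenizer instances to find
        # the attributes whose serialized state differs between runs.
        for key in sorted(set(tok_a.__dict__) | set(tok_b.__dict__)):
            hash_a = Hasher.hash(tok_a.__dict__.get(key))
            hash_b = Hasher.hash(tok_b.__dict__.get(key))
            if hash_a != hash_b:
                print(f"{key}: {hash_a} != {hash_b}")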

@lhoestq
Member

lhoestq commented Jun 16, 2022

Thanks for investigating ! Does that mean that save_pretrained() produces non-deterministic tokenizers on disk ? Or is it from_pretrained() which is not deterministic given the same files on disk ?

I think one way to fix this would be to make save/from_pretrained deterministic, or make the pickling of transformers.tokenization_utils.Trie objects deterministic (this could be implemented in transformers, but maybe let's discuss in an issue in transformers before opening a PR)

@Narsil
Contributor

Narsil commented Jul 4, 2022

Late to the party but everything should be deterministic (afaik at least).

But Trie is a simple class object, so afaik its hash function is linked to its id(self), i.e. basically where it's stored in memory, so super highly non-deterministic. Could that be the issue ?

@lhoestq
Member

lhoestq commented Jul 4, 2022

But Trie is a simple class object, so afaik its hash function is linked to its id(self), i.e. basically where it's stored in memory, so super highly non-deterministic. Could that be the issue ?

We're computing the hash of the pickle dump of the class so it should be fine, as long as the pickle dump is deterministic
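
For illustration (a sketch, assuming transformers.tokenization_utils.Trie as in the dumps above): two freshly built tries with the same content hash the same within a single run, because the fingerprint comes from the serialized state rather than from id(self). Whether this stays stable across separate runs is the open question here.

    from datasets.fingerprint import Hasher
    from transformers.tokenization_utils import Trie

    trie_a, trie_b = Trie(), Trie()
    for token in ["<s>", "</s>"]:
        trie_a.add(token)
        trie_b.add(token)

    # Same serialized state -> same fingerprint, regardless of memory addresses.
    print(Hasher.hash(trie_a) == Hasher.hash(trie_b))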

@thorinf

thorinf commented Jul 5, 2022

I've ported wav2vec 2.0 fine-tuning into Optimum-Graphcore, which is where I found the issue. The majority of the script was copied from the Transformers version to keep it similar; here is the tokenizer loading section from the source.

In the last comment I have two loaded tokenizers, one from run 'N' of the script and one from 'N+1'. I think what's happening is that when you add special tokens (e.g. PAD and UNK), another AddedToken object is appended when the tokenizer is saved, regardless of whether the special tokens are already there.

If there were an AddedTokens cleanup at load/save, this could solve the issue, but would the Trie still cause the hash to be different? I'm not sure.

@Narsil
Contributor

Narsil commented Jul 18, 2022

Which Python version are you using ?

The trie is basically a big dict of dicts, so its deterministic nature depends on the Python version:
https://stackoverflow.com/questions/2053021/is-the-order-of-a-python-dictionary-guaranteed-over-iterations

Maybe the investigation is actually not finding the right culprit though (the memory id changes, but datasets is not using that to compare, so maybe we need to look within datasets to see where the comparison fails).

@hankcs

hankcs commented Feb 2, 2023

Similar issue found with BartTokenizer. You can bypass the bug by loading a fresh tokenizer every time.

    dataset = dataset.map(lambda x: tokenize_func(x, BartTokenizer.from_pretrained(xxx)),
                          num_proc=num_proc, desc='Tokenize')

@bzz

bzz commented Sep 12, 2023

Linking in #6179 (comment) with an explanation.

@mpenagar

I got the same problem while using Wav2Vec2CTCTokenizer in a distributed experiment (many processes), and found that the problem was localized in the serialization (pickle dump) of the field tokenizer.tokens_trie._tokens (just a Python set). I focused on the set serialization and found it is not deterministic:

from datasets.fingerprint import Hasher
from pickle import dumps,loads

# used just once to get a serialized literal
#print(dumps(set("abc")))
serialized = b'\x80\x04\x95\x11\x00\x00\x00\x00\x00\x00\x00\x8f\x94(\x8c\x01a\x94\x8c\x01c\x94\x8c\x01b\x94\x90.'

myset = loads(serialized)
print(f'{myset=} {Hasher.hash(myset)}')
print(serialized == dumps(myset))

Every time you run the Python script (i.e. in a different process) you get a different result. @lhoestq does this make any sense?

@mpenagar

mpenagar commented Oct 19, 2023

OK, I assume Python's set is just a hash table implementation that internally uses the hash() function. The problem is that Python's hash() is not deterministic across runs (string hashing is salted per process). I believe that by setting the environment variable PYTHONHASHSEED to a fixed value, you can force it to be deterministic. I tried it (file set_pickle_dump.py):

#!/usr/bin/python3

from datasets.fingerprint import Hasher
from pickle import dumps,loads

# used just once to get a serialized literal (with environment variable PYTHONHASHSEED set to 42)
#print(dumps(set("abc")))
serialized = b'\x80\x04\x95\x11\x00\x00\x00\x00\x00\x00\x00\x8f\x94(\x8c\x01b\x94\x8c\x01c\x94\x8c\x01a\x94\x90.'

myset = loads(serialized)
print(f'{myset=} {Hasher.hash(myset)}')
print(serialized == dumps(myset))

and now every run (PYTHONHASHSEED=42 ./set_pickle_dump.py) gets the same result. I then tried to test it with the tokenizer (file test_tokenizer.py):

#!/usr/bin/python3
from transformers import AutoTokenizer
from datasets.fingerprint import Hasher

tokenizer = AutoTokenizer.from_pretrained('model')
print(f'{type(tokenizer)=}')
print(f'{Hasher.hash(tokenizer)=}')

Executed as PYTHONHASHSEED=42 ./test_tokenizer.py, the tokenizer fingerprint is now always the same!

@lhoestq
Member

lhoestq commented Oct 19, 2023

Thanks for reporting. I opened a PR to propose a fix: #6318. It doesn't require setting PYTHONHASHSEED.

Can you try to install datasets from this branch and tell me if it fixes the issue ?

@mpenagar

mpenagar commented Oct 19, 2023

I patched (*) the file datasets/utils/py_utils.py and the cache is working properly now. Thanks!

(*): I am running my experiments inside a docker container that depends on huggingface/transformers-pytorch-gpu:latest, so I patched the file instead of rebuilding the container from scratch.

@albertvillanova linked a pull request Oct 20, 2023 that will close this issue
@albertvillanova
Member

Fixed by #6318.

@lhoestq
Member

lhoestq commented Oct 20, 2023

The OP issue hasn't been fixed, re-opening

@lhoestq reopened this Oct 20, 2023
@enze5088

enze5088 commented Nov 1, 2023

I think the Trie()._tokens of PreTrainedTokenizer needs to be a sorted set, so that the result of hash_bytes(dumps(tokenizer)) is consistent every time.
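
A minimal illustration of the idea (a sketch, not the actual fix): fingerprint a sorted copy of the set so that the result no longer depends on iteration order.

    from datasets.fingerprint import Hasher

    special_tokens = {"<s>", "</s>", "<pad>", "<unk>"}

    # Sorting pins down the element order, so the fingerprint no longer depends
    # on the set's PYTHONHASHSEED-dependent iteration order.
    print(Hasher.hash(sorted(special_tokens)))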

@enze5088

enze5088 commented Nov 1, 2023

I believe the issue may be linked to tokenization_utils.py#L507, specifically the line where self.tokens_trie.add(token.content) is called. The function _update_trie appears to modify an unordered set. Consequently, this line:
value = hash_bytes(dumps(tokenizer.tokens_trie._tokens))
can lead to inconsistencies when rerunning the code.

This, in turn, results in inconsistent outputs for both hash_bytes(dumps(function)) at arrow_dataset.py#L3053 and
hasher.update(transform_args[key]) at fingerprint.py#L323

from datasets import Dataset
from datasets.fingerprint import (
    format_kwargs_for_fingerprint,
    format_transform_for_fingerprint,
    update_fingerprint,
)

# raw_datasets here is a single split (a Dataset); DatasetDict has no _fingerprint
dataset_kwargs = {
    "shard": raw_datasets,
    "function": tokenize_function,
}
transform = format_transform_for_fingerprint(Dataset._map_single)
kwargs_for_fingerprint = format_kwargs_for_fingerprint(Dataset._map_single, (), dataset_kwargs)
kwargs_for_fingerprint["fingerprint_name"] = "new_fingerprint"
new_fingerprint = update_fingerprint(raw_datasets._fingerprint, transform, kwargs_for_fingerprint)

@enze5088

enze5088 commented Nov 1, 2023

Alternatively, does the "dumps" function require separate processing for the set?

@lhoestq
Member

lhoestq commented Nov 1, 2023

We did a fix that does sorting whenever we hash sets. The fix is available on main if you want to try it out. We'll do a new release soon :)
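
For reference, the gist of such a fix looks roughly like this (a hypothetical sketch, not the actual datasets implementation, which plugs into its dill-based pickler): serialize sets as sorted lists so the bytes being hashed do not depend on set iteration order.

    import io
    import pickle

    class DeterministicSetPickler(pickle._Pickler):
        # Pure-Python pickler whose reduction for set sorts the elements first,
        # so the produced bytes do not depend on PYTHONHASHSEED.
        dispatch = dict(pickle._Pickler.dispatch)

        def _save_set(self, obj):
            self.save_reduce(set, (sorted(obj, key=pickle.dumps),), obj=obj)

        dispatch[set] = _save_set

    def deterministic_dumps(obj):
        buffer = io.BytesIO()
        DeterministicSetPickler(buffer, protocol=pickle.HIGHEST_PROTOCOL).dump(obj)
        return buffer.getvalue()

    print(deterministic_dumps({"a", "b", "c"}) == deterministic_dumps({"c", "a", "b"}))  # True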

@Guitaricet
Contributor

Is there a documentation chapter that discusses in which cases you should expect your dataset preprocessing to be cached, including do's and don'ts for the preprocessing functions? I think the Datasets team does an amazing job at tackling this issue on their side, but it would be great to have some guidelines on the user side as well.

In our current project we have two cases (text-to-text classification and summarization), and in one of them the cache is sometimes reused when it's not supposed to be, while in the other it's never used at all 😅

@lhoestq
Member

lhoestq commented Nov 20, 2023

You can find some docs here :)
https://huggingface.co/docs/datasets/about_cache
