
Multi process map did not load cache file correctly #6369

Closed
enze5088 opened this issue Nov 1, 2023 · 3 comments
enze5088 commented Nov 1, 2023

Describe the bug

When training a model on multiple GPUs with DDP, the dataset is tokenized again by the non-main processes after the main process has already tokenized it.

The code below is modified from run_clm.py.

Steps to reproduce the bug

import torch

# block_size, tokenizer, data_args, training_args, raw_datasets and column_names
# are defined earlier in the (modified) run_clm.py script.
block_size = data_args.block_size
IGNORE_INDEX = -100
Ignore_Input = False

def tokenize_function(examples):
    sources = []
    targets = []
    for instruction, inputs, output in zip(examples['instruction'], examples['input'], examples['output']):

        source = instruction + inputs 

        target = f"{output}{tokenizer.eos_token}"

        sources.append(source)
        targets.append(target)

    tokenized_sources = tokenizer(sources, return_attention_mask=False)

    tokenized_targets = tokenizer(targets, return_attention_mask=False,
                                  add_special_tokens=False
                                  )

    all_input_ids = []
    all_labels = []
    for s, t in zip(tokenized_sources['input_ids'], tokenized_targets['input_ids']):
        if len(s) > block_size and not Ignore_Input:
            # print(s)
            continue
        input_ids = torch.LongTensor(s + t)[:block_size]
        if Ignore_Input:
            labels = torch.LongTensor([IGNORE_INDEX] * len(s) + t)[:block_size]
        else:
            labels = input_ids
        assert len(input_ids) == len(labels)
        all_input_ids.append(input_ids)
        all_labels.append(labels)

    results = {
        'input_ids': all_input_ids,
        'labels': all_labels,
    }
    return results



with training_args.main_process_first(desc="dataset map tokenization ", local=False):
    # print('local_rank',training_args.local_rank)
    if not data_args.streaming:
        tokenized_datasets = raw_datasets.map(
            tokenize_function,
            batched=True,
            num_proc=data_args.preprocessing_num_workers,
            remove_columns=column_names,
            load_from_cache_file=not data_args.overwrite_cache,
            desc="Running tokenizer on dataset ",
        )
    else:
        tokenized_datasets = raw_datasets.map(
            tokenize_function,
            batched=True,
            remove_columns=column_names,
            desc="Running tokenizer on dataset "
        )
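
A quick way to see the mismatch (a diagnostic sketch, not part of the original script; _fingerprint is an internal attribute of datasets.Dataset and "train" assumes the usual run_clm.py split name) is to print the fingerprint and cache files on every rank right after the map call:

# Diagnostic sketch: if the cache were reused, every rank would print the same
# fingerprint and the same cache file paths. `_fingerprint` is internal and is
# used here only for debugging.
print(
    f"rank={training_args.local_rank} "
    f"fingerprint={tokenized_datasets['train']._fingerprint} "
    f"cache_files={tokenized_datasets['train'].cache_files}"
)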

Expected behavior

This code should tokenize the dataset only in the main process; the other processes should wait and then load the cached result instead of tokenizing it again.

Environment info

transformers == 4.34.1
datasets == 2.14.5

enze5088 commented Nov 1, 2023

The inconsistency may be caused by update_fingerprint together with loading the tokenizer with trust_remote_code=True.
When the tokenizer uses trust_remote_code, the behavior of the map function varies with each execution: even though the tokenizer's remote code is unchanged, the result of hasher.hexdigest() is different each time.
This can lead to different processes running the map multiple times instead of sharing one cache file.
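
This can be checked directly with datasets' Hasher, which map uses to fingerprint the transform (a sketch; it only assumes that tokenize_function and tokenizer from the snippet above are in scope):

# Sketch: tokenize_function closes over the tokenizer, so if the tokenizer hashes
# differently in each process, every rank computes a different map fingerprint
# and therefore writes/reads a different cache file.
from datasets.fingerprint import Hasher

print(f"rank={training_args.local_rank} tokenizer hash: {Hasher.hash(tokenizer)}")
print(f"rank={training_args.local_rank} function hash:  {Hasher.hash(tokenize_function)}")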

enze5088 commented Nov 1, 2023

The issue may be related to problems previously discussed in GitHub issues #3847 and #6318.
This arises from the fact that tokenizer.tokens_trie._tokens is an unordered set, leading to varying hash results:
value = hash_bytes(dumps(tokenizer.tokens_trie._tokens))
Consequently, this results in different outcomes each time for:
new_fingerprint = update_fingerprint(dataset._fingerprint, transform, kwargs_for_fingerprint)

To address this, the hash of Trie._tokens needs to be deterministic, i.e. its elements should be serialized in a consistent order after the final update of _tokens.
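
A small standalone sketch (stdlib only, with made-up token strings standing in for Trie._tokens) shows the underlying problem: pickling a set serializes its elements in iteration order, which depends on Python's per-process hash randomization, so two DDP workers can get different digests from the same set unless the elements are sorted first.

# Run this twice as separate processes and compare: the "unsorted" digest can
# change between runs, the "sorted" digest cannot.
import hashlib
import pickle

tokens = {"<s>", "</s>", "<unk>", "<pad>", "<mask>"}  # stand-in for Trie._tokens

print("unsorted:", hashlib.md5(pickle.dumps(tokens)).hexdigest())
print("sorted:  ", hashlib.md5(pickle.dumps(sorted(tokens))).hexdigest())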

mariosasko (Collaborator) commented
We now sort set and dict items to make their hashes deterministic (install from main with pip install git+https://github.com/huggingface/datasets to test this). Consequently, this should also make the tokenizer.tokens_trie's hash deterministic. Feel free to re-open the issue if this is not the case.
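
One way to verify this after installing from main (a sketch; it loads a slow tokenizer because only slow tokenizers have a tokens_trie, and gpt2 is just an arbitrary example) is to run the snippet in two separate processes and check that both digests match:

# With the sorted-set hashing from main, these digests should be identical
# across processes; before the fix they could differ from run to run.
from datasets.fingerprint import Hasher
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2", use_fast=False)
print("trie tokens hash:", Hasher.hash(tokenizer.tokens_trie._tokens))
print("tokenizer hash:  ", Hasher.hash(tokenizer))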
