
Multi process map did not load cache file correctly #6369

Closed
enze5088 opened this issue Nov 1, 2023 · 3 comments
enze5088 commented Nov 1, 2023

Describe the bug

When training a model on multiple GPUs with DDP, the dataset is tokenized again by the non-main processes after the main process has already tokenized it.

The code below is modified from run_clm.py.

Steps to reproduce the bug

import torch

# block_size, tokenizer, data_args, training_args, raw_datasets and column_names
# are defined earlier in the (modified) run_clm.py script.
block_size = data_args.block_size
IGNORE_INDEX = -100
Ignore_Input = False

def tokenize_function(examples):
    sources = []
    targets = []
    for instruction, inputs, output in zip(examples['instruction'], examples['input'], examples['output']):

        source = instruction + inputs 

        target = f"{output}{tokenizer.eos_token}"

        sources.append(source)
        targets.append(target)

    tokenized_sources = tokenizer(sources, return_attention_mask=False)

    tokenized_targets = tokenizer(targets, return_attention_mask=False,
                                  add_special_tokens=False
                                  )

    all_input_ids = []
    all_labels = []
    for s, t in zip(tokenized_sources['input_ids'], tokenized_targets['input_ids']):
        if len(s) > block_size and not Ignore_Input:
            # print(s)
            continue
        input_ids = torch.LongTensor(s + t)[:block_size]
        if Ignore_Input:
            labels = torch.LongTensor([IGNORE_INDEX] * len(s) + t)[:block_size]
        else:
            labels = input_ids
        assert len(input_ids) == len(labels)
        all_input_ids.append(input_ids)
        all_labels.append(labels)

    results = {
        'input_ids': all_input_ids,
        'labels': all_labels,
    }
    return results



with training_args.main_process_first(desc="dataset map tokenization ", local=False):
    # print('local_rank',training_args.local_rank)
    if not data_args.streaming:
        tokenized_datasets = raw_datasets.map(
            tokenize_function,
            batched=True,
            num_proc=data_args.preprocessing_num_workers,
            remove_columns=column_names,
            load_from_cache_file=not data_args.overwrite_cache,
            desc="Running tokenizer on dataset ",
        )
    else:
        tokenized_datasets = raw_datasets.map(
            tokenize_function,
            batched=True,
            remove_columns=column_names,
            desc="Running tokenizer on dataset "
        )
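
A quick way to see the mismatch (a diagnostic sketch, not part of the original script; _fingerprint is an internal attribute of datasets.Dataset and "train" assumes the usual run_clm.py split name) is to print the fingerprint and cache files on every rank right after the map call:

# Diagnostic sketch: if the cache were reused, every rank would print the same
# fingerprint and the same cache file paths. `_fingerprint` is internal and is
# used here only for debugging.
print(
    f"rank={training_args.local_rank} "
    f"fingerprint={tokenized_datasets['train']._fingerprint} "
    f"cache_files={tokenized_datasets['train'].cache_files}"
)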

Expected behavior

This code should tokenize the dataset only in the main process; the other processes should wait and then load the cached result instead of tokenizing it again.

Environment info

transformers == 4.34.1
datasets == 2.14.5

enze5088 commented Nov 1, 2023

The inconsistency may be caused by update_fingerprint together with loading the tokenizer with trust_remote_code=True.
When the tokenizer uses trust_remote_code, the behavior of the map function varies with each execution: even though the tokenizer's remote code is unchanged, the result of hasher.hexdigest() is different each time.
This can lead to different processes running the map multiple times instead of sharing one cache file.
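
This can be checked directly with datasets' Hasher, which map uses to fingerprint the transform (a sketch; it only assumes that tokenize_function and tokenizer from the snippet above are in scope):

# Sketch: tokenize_function closes over the tokenizer, so if the tokenizer hashes
# differently in each process, every rank computes a different map fingerprint
# and therefore writes/reads a different cache file.
from datasets.fingerprint import Hasher

print(f"rank={training_args.local_rank} tokenizer hash: {Hasher.hash(tokenizer)}")
print(f"rank={training_args.local_rank} function hash:  {Hasher.hash(tokenize_function)}")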

enze5088 commented Nov 1, 2023

The issue may be related to problems previously discussed in GitHub issues #3847 and #6318.
This arises from the fact that tokenizer.tokens_trie._tokens is an unordered set, leading to varying hash results:
value = hash_bytes(dumps(tokenizer.tokens_trie._tokens))
Consequently, this results in different outcomes each time for:
new_fingerprint = update_fingerprint(dataset._fingerprint, transform, kwargs_for_fingerprint)

To address this, the hash of Trie._tokens needs to be deterministic, i.e. its elements should be serialized in a consistent order after the final update of _tokens.
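
A small standalone sketch (stdlib only, with made-up token strings standing in for Trie._tokens) shows the underlying problem: pickling a set serializes its elements in iteration order, which depends on Python's per-process hash randomization, so two DDP workers can get different digests from the same set unless the elements are sorted first.

# Run this twice as separate processes and compare: the "unsorted" digest can
# change between runs, the "sorted" digest cannot.
import hashlib
import pickle

tokens = {"<s>", "</s>", "<unk>", "<pad>", "<mask>"}  # stand-in for Trie._tokens

print("unsorted:", hashlib.md5(pickle.dumps(tokens)).hexdigest())
print("sorted:  ", hashlib.md5(pickle.dumps(sorted(tokens))).hexdigest())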

mariosasko (Collaborator) commented
We now sort set and dict items to make their hashes deterministic (install from main with pip install git+https://github.com/huggingface/datasets to test this). Consequently, this should also make the tokenizer.tokens_trie's hash deterministic. Feel free to re-open the issue if this is not the case.
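
One way to verify this after installing from main (a sketch; it loads a slow tokenizer because only slow tokenizers have a tokens_trie, and gpt2 is just an arbitrary example) is to run the snippet in two separate processes and check that both digests match:

# With the sorted-set hashing from main, these digests should be identical
# across processes; before the fix they could differ from run to run.
from datasets.fingerprint import Hasher
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2", use_fast=False)
print("trie tokens hash:", Hasher.hash(tokenizer.tokens_trie._tokens))
print("tokenizer hash:  ", Hasher.hash(tokenizer))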
