Multi process map did not load cache file correctly #6369
Comments
The issue may be related to problems previously discussed in GitHub issues #3847 and #6318. To address this issue, it's essential to make |
We now sort |
Describe the bug
When training a model on multiple GPUs with DDP, the dataset is tokenized again in the other processes after the main process has finished, instead of being loaded from the cache file.
The code is adapted from run_clm.py.
Steps to reproduce the bug
Expected behavior
The code should tokenize the dataset only in the main process; the other processes should wait and then load the tokenized dataset from the cache.
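To make the expected behavior concrete, here is a minimal sketch of the "main process first" pattern using only the Python standard library. It is a simulation, not the code from this report: threads stand in for DDP ranks, a dict stands in for the cache file, and `run_main_process_first` and the `"tokenized"` key are hypothetical names. In run_clm.py the same idea is expressed by wrapping the `map` call in `training_args.main_process_first(...)`.

```python
import threading


def run_main_process_first(world_size, texts):
    """Simulate ranks with threads: rank 0 tokenizes, the others wait and reuse the cache."""
    cache = {}            # stands in for the on-disk cache file
    tokenize_calls = []   # ranks that actually ran the tokenizer
    results = [None] * world_size
    barrier = threading.Barrier(world_size)

    def tokenize(rank):
        tokenize_calls.append(rank)
        return [text.split() for text in texts]

    def worker(rank):
        if rank == 0:
            cache["tokenized"] = tokenize(rank)  # main process does the work first
            barrier.wait()                       # then releases the waiting ranks
        else:
            barrier.wait()                       # wait until rank 0 has written the cache
            if "tokenized" not in cache:         # a cache miss would force re-tokenization
                cache["tokenized"] = tokenize(rank)
        results[rank] = cache["tokenized"]

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return tokenize_calls, results


calls, results = run_main_process_first(4, ["hello world", "foo"])
print(calls)  # only rank 0 should appear here
```

The pattern only works if every rank looks up the same cache key. If the key differs between processes, each rank misses the cache and tokenizes again, which matches the symptom described here.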
Environment info
transformers == 4.34.1
datasets == 2.14.5