
Not loading cached datasets for preprocessing #74

Closed
tangzhy opened this issue Oct 18, 2023 · 1 comment

Comments


tangzhy commented Oct 18, 2023

Hi,

I am currently working with the script found at https://github.com/allenai/open-instruct/blob/main/scripts/finetune_with_hf_trainer.sh and utilizing 4 GPUs.

From the logs I observed:

Tokenizing and reformatting instruction data (num_proc=128): 100%|██████████| 2109561/2109561 [02:06<00:00, 16738.93 examples/s]
Tokenizing and reformatting instruction data (num_proc=128): 100%|██████████| 2109561/2109561 [02:38<00:00, 13336.92 examples/s] 
Tokenizing and reformatting instruction data (num_proc=128): 100%|██████████| 2109561/2109561 [03:10<00:00, 11098.11 examples/s] 
Filter: 100%|██████████| 2109561/2109561 [02:02<00:00, 17255.15 examples/s]
/usr/local/lib/python3.10/dist-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
Filter: 100%|██████████| 2109561/2109561 [01:46<00:00, 19796.75 examples/s]
/usr/local/lib/python3.10/dist-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
Filter: 100%|██████████| 2109561/2109561 [01:38<00:00, 21370.43 examples/s]

Although you explicitly let the main process run tokenization first in finetune_trainer.py, the remaining 3 processes didn't load the cached results. I've verified that the overwrite_cache option is turned off.
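The caching behavior being relied on can be sketched in plain Python (this is a stdlib sketch, not the HF datasets implementation): the cache key is a deterministic hash of the input data and the map function. If any rank computes a different key, for example because function hashing is non-deterministic, it misses the cache and re-tokenizes, which would explain the repeated progress bars in the logs above.

```python
# Stdlib sketch of fingerprint-based map caching (assumed mechanism, not
# the actual datasets code): hash the data plus the function's bytecode,
# and reuse the on-disk result if a matching cache file already exists.
import hashlib
import os
import pickle
import tempfile

def cached_map(records, fn, cache_dir):
    key = hashlib.sha256(pickle.dumps(records) + fn.__code__.co_code).hexdigest()
    path = os.path.join(cache_dir, key + ".pkl")
    if os.path.exists(path):  # another process (or a prior run) wrote the cache
        with open(path, "rb") as f:
            return pickle.load(f), True  # (result, cache_hit)
    result = [fn(r) for r in records]
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result, False

def upcase(s):
    return s.upper()

cache_dir = tempfile.mkdtemp()
out1, hit1 = cached_map(["hello", "world"], upcase, cache_dir)
out2, hit2 = cached_map(["hello", "world"], upcase, cache_dir)
print(out2, hit1, hit2)  # ['HELLO', 'WORLD'] False True
```

In this model, the "other 3 processes re-tokenize" symptom corresponds to each rank computing a different `key` for the same data and function.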

Have you encountered the same situation?

hamishivi (Collaborator) commented

Hi, yeah, I faced this issue recently. I suspect it might be related to this issue in HF datasets; updating to the latest datasets version actually fixed it for me. So I would try that first, and hopefully it works!

If not, I would recommend making a minimal example and opening an issue on the huggingface datasets repository, since it's probably an issue with their caching mechanism.
