
Not loading cached datasets for preprocessing #74

Closed
tangzhy opened this issue Oct 18, 2023 · 1 comment

Comments


tangzhy commented Oct 18, 2023

Hi,

I am currently working with the script found at https://github.com/allenai/open-instruct/blob/main/scripts/finetune_with_hf_trainer.sh and utilizing 4 GPUs.

From the logs I observed:

Tokenizing and reformatting instruction data (num_proc=128): 100%|██████████| 2109561/2109561 [02:06<00:00, 16738.93 examples/s]
Tokenizing and reformatting instruction data (num_proc=128): 100%|██████████| 2109561/2109561 [02:38<00:00, 13336.92 examples/s] 
Tokenizing and reformatting instruction data (num_proc=128): 100%|██████████| 2109561/2109561 [03:10<00:00, 11098.11 examples/s] 
Filter: 100%|██████████| 2109561/2109561 [02:02<00:00, 17255.15 examples/s]
/usr/local/lib/python3.10/dist-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
Filter: 100%|██████████| 2109561/2109561 [01:46<00:00, 19796.75 examples/s]
/usr/local/lib/python3.10/dist-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
Filter: 100%|██████████| 2109561/2109561 [01:38<00:00, 21370.43 examples/s]

Although you explicitly let the main process run tokenization first in finetune_trainer.py, the remaining 3 processes didn't load the cached results. I've verified that the overwrite_cache option is turned off.
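The caching behavior being relied on can be sketched in plain Python (this is a stdlib sketch, not the HF datasets implementation): the cache key is a deterministic hash of the input data and the map function. If any rank computes a different key, for example because function hashing is non-deterministic, it misses the cache and re-tokenizes, which would explain the repeated progress bars in the logs above.

```python
# Stdlib sketch of fingerprint-based map caching (assumed mechanism, not
# the actual datasets code): hash the data plus the function's bytecode,
# and reuse the on-disk result if a matching cache file already exists.
import hashlib
import os
import pickle
import tempfile

def cached_map(records, fn, cache_dir):
    key = hashlib.sha256(pickle.dumps(records) + fn.__code__.co_code).hexdigest()
    path = os.path.join(cache_dir, key + ".pkl")
    if os.path.exists(path):  # another process (or a prior run) wrote the cache
        with open(path, "rb") as f:
            return pickle.load(f), True  # (result, cache_hit)
    result = [fn(r) for r in records]
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result, False

def upcase(s):
    return s.upper()

cache_dir = tempfile.mkdtemp()
out1, hit1 = cached_map(["hello", "world"], upcase, cache_dir)
out2, hit2 = cached_map(["hello", "world"], upcase, cache_dir)
print(out2, hit1, hit2)  # ['HELLO', 'WORLD'] False True
```

In this model, the "other 3 processes re-tokenize" symptom corresponds to each rank computing a different `key` for the same data and function.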

Have you encountered the same situation?

hamishivi (Collaborator) commented

Hi, yeah, I faced this issue recently. I suspect it might be related to this issue in HF datasets; updating to the latest datasets version actually fixed it for me. So I would try that first, and hopefully it works!

If not, I would recommend making a minimal example and opening an issue on the huggingface datasets repository, since it's probably an issue with their caching mechanism.
