One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
My own task or dataset (give details below)
Reproduction
My code can run on 1/2/3/4 GPU(s), but errors occur when I try to use more GPUs.
The command I use: accelerate launch --multi_gpu --gpu_ids 0,1,2,3,4,5,6,7 --num_processes 8 --main_process_port 2525 ./train_args_multi.py --batch_size 4 --save_name tmp_model_multi
Errors occur while loading the checkpoint shards (as the progress bar in the log below shows):
$accelerate launch --multi_gpu --num_processes 8 --gpu_ids 0,1,2,3,4,5,6,7 --main_process_port 25252 ./train_args_multi.py --batch_size 4 --save_name tmp_model_multi
Device: cuda:0
Device: cuda:6
Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]
Device: cuda:5
Device: cuda:3
Device: cuda:4
Device: cuda:7
Device: cuda:1
Device: cuda:2
Loading checkpoint shards:  50%|████████████████████████████            | 2/4 [00:11<00:12, ...]
...r(args)
File "/home/admin/anaconda3/envs/llama_factory/lib/python3.10/site-packages/accelerate/commands/launch.py", line 793, in multi_gpu_launcher
distrib_run.run(args)
File "/home/admin/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/home/admin/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/admin/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
./train_args_multi.py FAILED
------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-11-26_16:17:47
host : pe-resource-pool033093226243.center
rank : 5 (local_rank: 5)
exitcode : -9 (pid: 84403)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 84403
======================================================
(llama_factory)
Using the free command, I found that it was system memory (not CUDA memory) that ran out while the models were being loaded.
My code runs fine on 1/2/3/4 GPUs, but not on more.
The program tries to load 8 copies of the same model (one per GPU process) into host memory at once, which causes the OOM.
I am wondering whether, and how, I can initialize the training process on each GPU one at a time to avoid this problem.
Thank you very much!
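For illustration, here is a minimal sketch of one way to stagger loading across ranks, assuming the model is loaded with transformers' from_pretrained and that model_name_or_path is a placeholder for the actual checkpoint directory (it is not a name from the original script). Each process takes its turn reading the shards into host RAM and immediately moves its copy to its own GPU before the next rank starts:

```python
# Sketch: serialize checkpoint loading so only one process reads the shards
# into host RAM at a time. `model_name_or_path` is a hypothetical placeholder.
from accelerate import Accelerator
from transformers import AutoModelForCausalLM

accelerator = Accelerator()
model_name_or_path = "path/to/checkpoint"  # placeholder

model = None
for rank in range(accelerator.num_processes):
    if accelerator.process_index == rank:
        model = AutoModelForCausalLM.from_pretrained(
            model_name_or_path,
            low_cpu_mem_usage=True,  # avoid materializing an extra full copy of the state dict
        ).to(accelerator.device)  # move to this rank's GPU before the next rank loads
    accelerator.wait_for_everyone()  # barrier: ranks load one after another

model = accelerator.prepare(model)
```

Moving each copy to its GPU inside the loop matters: it keeps only one process's weights in host memory at any moment, rather than letting 8 copies accumulate before training starts. This is only a sketch of the staggering idea, not an official Accelerate recipe.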
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.