One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
My own task or dataset (give details below)
Reproduction
My code can run on 1/2/3/4 GPU(s), but errors occur when I try to use more GPUs.
The command I use: accelerate launch --multi_gpu --gpu_ids 0,1,2,3,4,5,6,7 --num_processes 8 --main_process_port 2525 ./train_args_multi.py --batch_size 4 --save_name tmp_model_multi
Errors occur while loading the checkpoint shards (as the progress bar in the log below shows):
$accelerate launch --multi_gpu --num_processes 8 --gpu_ids 0,1,2,3,4,5,6,7 --main_process_port 25252 ./train_args_multi.py --batch_size 4 --save_name tmp_model_multi
Device: cuda:0
Device: cuda:6
Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]
Device: cuda:5
Device: cuda:3
Device: cuda:4
Device: cuda:7
Device: cuda:1
Device: cuda:2
Loading checkpoint shards:  50%|████████████████████████████            | 2/4 [00:11<00:12, ...]
...r(args)
File "/home/admin/anaconda3/envs/llama_factory/lib/python3.10/site-packages/accelerate/commands/launch.py", line 793, in multi_gpu_launcher
distrib_run.run(args)
File "/home/admin/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/home/admin/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/admin/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
./train_args_multi.py FAILED
------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-11-26_16:17:47
host : pe-resource-pool033093226243.center
rank : 5 (local_rank: 5)
exitcode : -9 (pid: 84403)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 84403
======================================================
(llama_factory)
Using the free command, I found that it was system memory (not CUDA memory) that ran out while the models were being loaded.
My code runs fine on 1/2/3/4 GPUs, but not on more.
The program tries to load 8 copies of the same model (one per GPU process) into host memory at once, which causes the OOM.
I am wondering whether, and how, I can initialize the training process on each GPU one at a time to avoid this problem.
Thank you very much!
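For illustration, here is a minimal sketch of one way to stagger loading across ranks, assuming the model is loaded with transformers' from_pretrained and that model_name_or_path is a placeholder for the actual checkpoint directory (it is not a name from the original script). Each process takes its turn reading the shards into host RAM and immediately moves its copy to its own GPU before the next rank starts:

```python
# Sketch: serialize checkpoint loading so only one process reads the shards
# into host RAM at a time. `model_name_or_path` is a hypothetical placeholder.
from accelerate import Accelerator
from transformers import AutoModelForCausalLM

accelerator = Accelerator()
model_name_or_path = "path/to/checkpoint"  # placeholder

model = None
for rank in range(accelerator.num_processes):
    if accelerator.process_index == rank:
        model = AutoModelForCausalLM.from_pretrained(
            model_name_or_path,
            low_cpu_mem_usage=True,  # avoid materializing an extra full copy of the state dict
        ).to(accelerator.device)  # move to this rank's GPU before the next rank loads
    accelerator.wait_for_everyone()  # barrier: ranks load one after another

model = accelerator.prepare(model)
```

Moving each copy to its GPU inside the loop matters: it keeps only one process's weights in host memory at any moment, rather than letting 8 copies accumulate before training starts. This is only a sketch of the staggering idea, not an official Accelerate recipe.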
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.