Information

Tasks

One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
My own task or dataset (give details below)
Reproduction
The error: Detected 1 oom-kill event(s) in StepId=125757.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.
Trying to load the optimizer and model checkpoints causes a CPU OOM with either of the following:

optimizer.load_state_dict(torch.load("optim.pt", map_location="cpu"))

or

accelerator.load_state("my_checkpoint")

If I use the code below instead, everything is fine:

optimizer.load_state_dict(torch.load("optim.pt", map_location=accelerator.device))
Expected behavior
Is it possible for accelerator.load_state("my_checkpoint") to load the state directly onto accelerator.device?
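For reference, a minimal self-contained sketch of the manual workaround, assuming the optimizer state was written with torch.save to "optim.pt" (the model here is a stand-in for illustration):

import torch
from accelerate import Accelerator

accelerator = Accelerator()
model = torch.nn.Linear(8, 8)  # stand-in model for illustration
optimizer = torch.optim.AdamW(model.parameters())
model, optimizer = accelerator.prepare(model, optimizer)

# Deserialize straight onto the local device so that N processes do not each
# materialize a full CPU copy of the optimizer state at the same time.
state = torch.load("optim.pt", map_location=accelerator.device)
optimizer.load_state_dict(state)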
I think this comes from loading the optimizer state on multiple processes at once. @muellerzr we should probably load the optimizer onto the device in a multi-GPU setting with num_processes > 1.
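A rough sketch of the idea (an illustration only, not the actual Accelerate implementation), reusing the accelerator and optimizer objects from the sketch above:

# In a multi-process run, deserialize each rank's copy onto its own device;
# with a single process, CPU remains a safe default.
map_location = accelerator.device if accelerator.num_processes > 1 else "cpu"
optimizer.load_state_dict(torch.load("optim.pt", map_location=map_location))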
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.