
Detected 1 oom-kill event(s) in StepId=125757.batch. Some of your processes may have been killed by the cgroup out-of-memory handler. #1210

Closed

QishengL opened this issue Mar 18, 2023 · 3 comments

@QishengL

System Info

- `Accelerate` version: 0.15.0
- Platform: Linux-5.4.199-ql-generic-12.0-18-x86_64-with-glibc2.31
- Python version: 3.10.4
- Numpy version: 1.23.1
- PyTorch version (GPU?): 1.12.1 (False)

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

The error: Detected 1 oom-kill event(s) in StepId=125757.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.
Trying to load the optimizer and model states causes a CPU OOM with either of the following:
optimizer.load_state_dict(torch.load("optim.pt", map_location='cpu'))
or
accelerator.load_state("my_checkpoint")

If I use the code below instead, everything is fine:
optimizer.load_state_dict(torch.load("optim.pt", map_location=accelerator.device))

Expected behavior

Is it possible to use accelerator.load_state("my_checkpoint") and load the state to accelerator.device directly?
@sgugger
Collaborator

sgugger commented Mar 20, 2023

I think this comes from loading the optimizer state on multiple processes at once. @muellerzr we should probably load the optimizer on the device when in a multi-GPU setting with num_processes > 1
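A sketch of that idea on the user side (not the Accelerate implementation; the optimizer setup is the same placeholder as above): with N processes, torch.load(..., map_location="cpu") materializes N full copies of the optimizer state in host RAM, so loading onto each process's own device when num_processes > 1 keeps that memory off the CPU.

import torch
from accelerate import Accelerator

accelerator = Accelerator()
model = torch.nn.Linear(10, 10)                      # placeholder
optimizer = torch.optim.AdamW(model.parameters())
model, optimizer = accelerator.prepare(model, optimizer)

# Load to the device in multi-process runs; fall back to CPU for a single process.
map_location = accelerator.device if accelerator.num_processes > 1 else "cpu"
optimizer.load_state_dict(torch.load("optim.pt", map_location=map_location))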

@muellerzr
Collaborator

Should be fixed with #1220, can you try it out via pip install git+https://github.com/huggingface/accelerate and let us know? :)

@github-actions

github-actions bot commented May 1, 2023

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this as completed May 9, 2023