Information

Tasks

One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
My own task or dataset (give details below)
Reproduction
The error: Detected 1 oom-kill event(s) in StepId=125757.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.
Trying to load the optimizer and model checkpoints causes a CPU OOM with either of the following:

optimizer.load_state_dict(torch.load("optim.pt", map_location="cpu"))

or

accelerator.load_state("my_checkpoint")

If I use the code below instead, everything is fine:

optimizer.load_state_dict(torch.load("optim.pt", map_location=accelerator.device))
Expected behavior
Is it possible for accelerator.load_state("my_checkpoint") to load the state directly onto accelerator.device?
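For reference, a minimal self-contained sketch of the manual workaround, assuming the optimizer state was written with torch.save to "optim.pt" (the model here is a stand-in for illustration):

import torch
from accelerate import Accelerator

accelerator = Accelerator()
model = torch.nn.Linear(8, 8)  # stand-in model for illustration
optimizer = torch.optim.AdamW(model.parameters())
model, optimizer = accelerator.prepare(model, optimizer)

# Deserialize straight onto the local device so that N processes do not each
# materialize a full CPU copy of the optimizer state at the same time.
state = torch.load("optim.pt", map_location=accelerator.device)
optimizer.load_state_dict(state)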
I think this comes from loading the optimizer state on multiple processes at once. @muellerzr we should probably load the optimizer onto the device in a multi-GPU setting with num_processes > 1.
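A rough sketch of the idea (an illustration only, not the actual Accelerate implementation), reusing the accelerator and optimizer objects from the sketch above:

# In a multi-process run, deserialize each rank's copy onto its own device;
# with a single process, CPU remains a safe default.
map_location = accelerator.device if accelerator.num_processes > 1 else "cpu"
optimizer.load_state_dict(torch.load("optim.pt", map_location=map_location))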
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.