[low CPU RAM] allocate on gpu directly, stagger checkpoint load/save #248
Conversation
pretrain_gpt.py (outdated)

```python
# enabled=args.zero_stage == 3,
# mpu=mpu):

# XXX: make `enabled` configurable or always load on GPU?
```
Is there any time where we don't want to allocate directly on GPU for pretrain?
For posterity: this works most of the time, except when it doesn't. In the complex Megatron-DeepSpeed framework, using this PR led to all kinds of very weird problems, often at the CUDA level. We eventually found that the problem wasn't a shortage of CPU memory but an issue in apex's FusedAdam, which is described here; a workaround has been merged: #249
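For context, a minimal sketch of the construct being discussed, assuming the usual pattern of wrapping model construction in the context manager (the builder function and surrounding names here are illustrative, not the exact Megatron-DeepSpeed code):

```python
import deepspeed

def model_provider(args, mpu):
    # `enabled` controls whether ZeRO-3 parameter partitioning is active while
    # the model is constructed; per the diff above it is tied to
    # `args.zero_stage == 3`.
    with deepspeed.zero.Init(enabled=args.zero_stage == 3, mpu=mpu):
        model = build_gpt_model(args)  # hypothetical model builder
    return model
```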
As we are dealing with CPU RAM < GPU RAM, this PR is trying to allocate the model directly on the GPU and to stagger checkpoint load/save.
Code is courtesy of @jeffra with a demo at https://gist.github.com/jeffra/ec8a0b762e58a64d19ad1417250dc600
The idea was to remove `deepspeed.zero.Init`, as supposedly it wasn't really doing anything under ZeRO-1, but when I did that the TP test started hanging. So I put it back and double-inserted `AllocateOnGPU` in a nested fashion. It turns out `deepspeed.zero.Init` actually does a whole bunch of things regardless of the ZeRO stage.
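The gist linked above has the actual helper; as a rough sketch of the underlying idea (build the weights directly on the GPU so they never occupy CPU RAM), something along these lines, where `allocate_on_gpu` is only a stand-in for the PR's `AllocateOnGPU`:

```python
import torch
from contextlib import contextmanager

@contextmanager
def allocate_on_gpu():
    # Newly created float tensors (e.g. model parameters built inside this
    # context) are allocated on the GPU instead of in CPU RAM.
    torch.set_default_tensor_type(torch.cuda.FloatTensor)
    try:
        yield
    finally:
        torch.set_default_tensor_type(torch.FloatTensor)

# Usage sketch:
# with allocate_on_gpu():
#     model = build_model()  # parameters go straight to GPU memory
```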
`--stagger_checkpoint_save_load_group_size` - if not set, the normal "all processes at once" behavior is used.

This works on `load_checkpoint`, but hangs on `save_checkpoint` - I think it deadlocks since `save_checkpoint` probably calls a `barrier` too.
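To illustrate the staggering and the suspected deadlock, here is a minimal sketch (names and structure are assumptions, not the PR's actual code): ranks run the checkpoint operation in waves of `--stagger_checkpoint_save_load_group_size`, with a barrier between waves.

```python
import torch.distributed as dist

def run_staggered(fn, group_size):
    # Run `fn` (e.g. a checkpoint load) in waves of `group_size` ranks so that
    # only a few processes hit CPU RAM / disk at the same time.
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    for wave_start in range(0, world_size, group_size):
        if wave_start <= rank < wave_start + group_size:
            fn()
        # Every rank must reach this barrier. If `fn` itself issues a
        # collective (e.g. save_checkpoint calling its own barrier), ranks
        # outside the current wave never get there and the job deadlocks,
        # which matches the save_checkpoint hang described above.
        dist.barrier()
```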