[low CPU RAM] allocate on gpu directly, stagger checkpoint load/save #248

Closed
wants to merge 10 commits

Conversation

stas00 (Contributor) commented Feb 14, 2022

As we are dealing with setups where CPU RAM < GPU RAM, this PR attempts:

  1. to init the model directly on the GPU, rather than building it in CPU RAM first.
    The code is courtesy of @jeffra, with a demo at https://gist.github.com/jeffra/ec8a0b762e58a64d19ad1417250dc600

The original idea was to remove deepspeed.zero.Init, as supposedly it wasn't really doing anything under ZeRO-1, but when I did that the TP test started hanging. So I put it back and added AllocateOnGPU nested alongside it - deepspeed.zero.Init actually does a whole bunch of things regardless of the ZeRO stage.
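For reference, the core trick looks roughly like this (a minimal sketch of the idea only, using a plain default-tensor-type override; this is not the code from the gist, and `allocate_on_gpu` is just a stand-in name for the PR's AllocateOnGPU):

```python
import contextlib

import torch
import torch.nn as nn


@contextlib.contextmanager
def allocate_on_gpu(enabled=True):
    """Create new default-dtype tensors directly on the current GPU.

    Sketch stand-in for the PR's AllocateOnGPU context manager.
    """
    if not enabled:
        yield
        return
    try:
        # Params/buffers built inside the block are allocated on the GPU,
        # so the full model never has to fit in CPU RAM.
        torch.set_default_tensor_type(torch.cuda.FloatTensor)
        yield
    finally:
        # Restore the normal CPU default for everything created afterwards.
        torch.set_default_tensor_type(torch.FloatTensor)


# usage: the weight is born on the GPU
with allocate_on_gpu():
    layer = nn.Linear(1024, 1024)
assert layer.weight.is_cuda
```

In the PR the real context manager wraps the model construction while still nested with deepspeed.zero.Init, since removing the latter broke the TP test as described above.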

  2. stagger save/load_checkpoint calls in groups defined by --stagger_checkpoint_save_load_group_size - if it is not set, the normal all-processes-at-once behavior is used.

This works for load_checkpoint, but hangs on save_checkpoint - I think it deadlocks because save_checkpoint probably calls a barrier of its own.
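The staggering itself is roughly the following pattern (a sketch only, assuming ranks are split into consecutive fixed-size groups; `staggered` is a hypothetical helper, not the PR's code):

```python
import torch.distributed as dist


def staggered(fn, group_size, *args, **kwargs):
    """Run `fn` (e.g. load_checkpoint) on `group_size` ranks at a time.

    Caveat: if `fn` itself issues collectives or barriers across all ranks,
    this pattern deadlocks -- likely what happens with save_checkpoint above.
    """
    if not group_size:
        # flag not set: everyone goes at once, as before
        return fn(*args, **kwargs)

    rank = dist.get_rank()
    world_size = dist.get_world_size()
    result = None
    for start in range(0, world_size, group_size):
        if start <= rank < start + group_size:
            result = fn(*args, **kwargs)
        # all ranks wait for the current group before the next group starts
        dist.barrier()
    return result
```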


  3. while at it, sped up several tests by making the test model smaller and training for ~50x fewer iterations.

pretrain_gpt.py Outdated
# enabled=args.zero_stage == 3,
# mpu=mpu):

# XXX: make `enabled` configurable or always load on GPU?
stas00 (Contributor, Author) commented:
Is there any time where we don't want to allocate directly on GPU for pretrain?
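If `enabled` were made configurable, the wiring could look like this (a sketch; `--cpu-init` is a hypothetical flag name, not one added by this PR):

```python
import argparse

parser = argparse.ArgumentParser()
# Hypothetical escape hatch: keep the old CPU-first init if it is ever needed.
parser.add_argument('--cpu-init', action='store_true',
                    help='build the model in CPU RAM first instead of directly on GPU')
args, _ = parser.parse_known_args()

gpu_init = not args.cpu_init
# Model construction would then become, schematically:
#   with AllocateOnGPU(enabled=gpu_init):
#       with deepspeed.zero.Init(enabled=args.zero_stage == 3, mpu=mpu):
#           model = model_provider()
```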

stas00 changed the title from "allocate on gpu directly" to "[low CPU RAM] allocate on gpu directly, stagger checkpoint load/save" on Feb 15, 2022
stas00 (Contributor, Author) commented Mar 7, 2022

For posterity: this works most of the time, except when it doesn't. In the complex Megatron-DeepSpeed framework, using this PR led to all kinds of very weird problems, often at the CUDA level.

We eventually found that the problem wasn't a shortage of CPU memory but an issue in apex's FusedAdam, which is described here; a workaround has been merged in #249.
The merged workaround is a hack and needs more investigation.

stas00 closed this on Mar 7, 2022