[low CPU RAM] allocate on gpu directly, stagger checkpoint load/save #248

Closed
wants to merge 10 commits

Conversation

stas00 (Contributor) commented Feb 14, 2022

As we are dealing with setups where CPU RAM < GPU RAM, this PR attempts:

  1. to init the model directly on the GPU, rather than building it in CPU RAM first.
    The code is courtesy of @jeffra, with a demo at https://gist.github.com/jeffra/ec8a0b762e58a64d19ad1417250dc600

The original idea was to remove deepspeed.zero.Init, as supposedly it wasn't really doing anything under ZeRO-1, but when I did that the TP test started hanging. So I put it back and added AllocateOnGPU nested alongside it - deepspeed.zero.Init actually does a whole bunch of things regardless of the ZeRO stage.
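For reference, the core trick looks roughly like this (a minimal sketch of the idea only, using a plain default-tensor-type override; this is not the code from the gist, and `allocate_on_gpu` is just a stand-in name for the PR's AllocateOnGPU):

```python
import contextlib

import torch
import torch.nn as nn


@contextlib.contextmanager
def allocate_on_gpu(enabled=True):
    """Create new default-dtype tensors directly on the current GPU.

    Sketch stand-in for the PR's AllocateOnGPU context manager.
    """
    if not enabled:
        yield
        return
    try:
        # Params/buffers built inside the block are allocated on the GPU,
        # so the full model never has to fit in CPU RAM.
        torch.set_default_tensor_type(torch.cuda.FloatTensor)
        yield
    finally:
        # Restore the normal CPU default for everything created afterwards.
        torch.set_default_tensor_type(torch.FloatTensor)


# usage: the weight is born on the GPU
with allocate_on_gpu():
    layer = nn.Linear(1024, 1024)
assert layer.weight.is_cuda
```

In the PR the real context manager wraps the model construction while still nested with deepspeed.zero.Init, since removing the latter broke the TP test as described above.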

  2. stagger save/load_checkpoint calls in groups defined by --stagger_checkpoint_save_load_group_size - if it is not set, the normal all-processes-at-once behavior is used.

This works for load_checkpoint, but hangs on save_checkpoint - I think it deadlocks because save_checkpoint probably calls a barrier of its own.
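The staggering itself is roughly the following pattern (a sketch only, assuming ranks are split into consecutive fixed-size groups; `staggered` is a hypothetical helper, not the PR's code):

```python
import torch.distributed as dist


def staggered(fn, group_size, *args, **kwargs):
    """Run `fn` (e.g. load_checkpoint) on `group_size` ranks at a time.

    Caveat: if `fn` itself issues collectives or barriers across all ranks,
    this pattern deadlocks -- likely what happens with save_checkpoint above.
    """
    if not group_size:
        # flag not set: everyone goes at once, as before
        return fn(*args, **kwargs)

    rank = dist.get_rank()
    world_size = dist.get_world_size()
    result = None
    for start in range(0, world_size, group_size):
        if start <= rank < start + group_size:
            result = fn(*args, **kwargs)
        # all ranks wait for the current group before the next group starts
        dist.barrier()
    return result
```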


  3. while at it, sped up several tests by making the test model smaller and training for ~50x fewer iterations.

pretrain_gpt.py Outdated
# enabled=args.zero_stage == 3,
# mpu=mpu):

# XXX: make `enabled` configurable or always load on GPU?
stas00 (Contributor, Author) commented:
Is there any time where we don't want to allocate directly on GPU for pretrain?
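If `enabled` were made configurable, the wiring could look like this (a sketch; `--cpu-init` is a hypothetical flag name, not one added by this PR):

```python
import argparse

parser = argparse.ArgumentParser()
# Hypothetical escape hatch: keep the old CPU-first init if it is ever needed.
parser.add_argument('--cpu-init', action='store_true',
                    help='build the model in CPU RAM first instead of directly on GPU')
args, _ = parser.parse_known_args()

gpu_init = not args.cpu_init
# Model construction would then become, schematically:
#   with AllocateOnGPU(enabled=gpu_init):
#       with deepspeed.zero.Init(enabled=args.zero_stage == 3, mpu=mpu):
#           model = model_provider()
```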

stas00 changed the title from "allocate on gpu directly" to "[low CPU RAM] allocate on gpu directly, stagger checkpoint load/save" on Feb 15, 2022
stas00 (Contributor, Author) commented Mar 7, 2022

For posterity: this works most of the time, except when it doesn't. In the complex Megatron-DeepSpeed framework, using this PR led to all kinds of very weird problems, often at the CUDA level.

We eventually found that the problem wasn't a shortage of CPU memory but an issue in apex's FusedAdam, which is described here; a workaround has been merged in #249.
The merged workaround is a hack and needs more investigation.

stas00 closed this on Mar 7, 2022