Update zero_to_fp32.py - to support deepspeed_stage_1 #3936
Conversation
Since the deepspeed 1 checkpoint structure is identical to the deepspeed 2 one (AFAIK), we should just change the version check and add support accordingly.
@stas00, do you remember why stage 1 was excluded?
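The proposed change can be sketched as follows. This is a minimal illustration, not the actual code in `zero_to_fp32.py`: the function and helper names (`get_fp32_reconstruction_handler`, `merge_partitioned_optimizer_states`, `merge_param_shards`) are hypothetical, assuming only that the script dispatches on the ZeRO stage recorded in the checkpoint.

```python
# Hypothetical sketch of the stage check being relaxed: since stage 1
# checkpoints (reportedly) share the stage 2 layout, the stage 2 merge
# path can be reused for them.

def merge_partitioned_optimizer_states():
    """Placeholder for the stage 1/2 fp32 reconstruction path."""
    return "merged stage 1/2 optimizer partitions"

def merge_param_shards():
    """Placeholder for the stage 3 fp32 reconstruction path."""
    return "merged stage 3 parameter shards"

def get_fp32_reconstruction_handler(zero_stage: int):
    """Pick the checkpoint-merging routine for a given ZeRO stage."""
    if zero_stage in (1, 2):  # previously: zero_stage == 2
        return merge_partitioned_optimizer_states
    elif zero_stage == 3:
        return merge_param_shards
    raise ValueError(f"unsupported ZeRO stage: {zero_stage}")
```

The one-line difference is the `in (1, 2)` membership test where the check used to accept only stage 2; stage 3 keeps its separate path because it partitions parameters, not just optimizer states.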
I think at the time I developed it I didn't think anybody used it, or at least I didn't use it myself, so I didn't have a use case for it.
@PicoCreator, thanks for this PR.
Glad to see my wild guess work and be of use (I found a few issues on dependent projects that encountered this). I currently use deepspeed 1 to train small toy models (<=3B) as fast as possible and to test param/model architecture changes =) on consumer hardware (where the gpu-to-gpu communication of deepspeed 2+ is noticeable).
@PicoCreator, thanks for sharing your context and experience. We always appreciate hearing customer stories. I want to share a minor naming clarification regarding DeepSpeed versus ZeRO.
Since the deepspeed 1 checkpoint structure is identical to the deepspeed 2 one (AFAIK), we should just change the stage check and add support accordingly.
However, I am not 100% sure whether this is intentional by design or a coincidence in my use case, so someone with more knowledge of this topic might need to weigh in 🤔