
Update zero_to_fp32.py - to support deepspeed_stage_1 #3936

Merged (1 commit) on Jul 12, 2023

Conversation

@PicoCreator (Contributor) commented on Jul 12, 2023

Since the deepspeed 1 checkpoint structure is identical to deepspeed 2's (AFAIK), we should just change the stage check and add support accordingly.

However, I am not 100% sure whether this is intentional by design or a coincidence in my use case, so it might need someone with more knowledge on this topic to weigh in 🤔

Commit: Since deepspeed 1 checkpoint structure is identical to deepspeed 2 (AFAIK), we should just change the version check and add support accordingly
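For illustration, the change amounts to relaxing the stage check in zero_to_fp32.py so that stage 1 checkpoints take the same code path as stage 2. A minimal, self-contained sketch of the idea; the function and handler names here are illustrative, not the script's actual API:

```python
# Illustrative sketch only: names below are hypothetical, not the
# real zero_to_fp32.py internals. The point is the relaxed stage check.

def select_checkpoint_handler(zero_stage: int) -> str:
    # Before this PR the check was effectively `zero_stage == 2`, which
    # rejected stage 1 checkpoints even though they share the stage 2 layout.
    if zero_stage <= 2:
        return "zero2-style merge"  # stages 1 and 2 partition only optimizer state
    elif zero_stage == 3:
        return "zero3-style merge"  # stage 3 also partitions the model weights
    raise ValueError(f"unsupported ZeRO stage: {zero_stage}")

if __name__ == "__main__":
    for stage in (1, 2, 3):
        print(stage, "->", select_checkpoint_handler(stage))
```

With the check relaxed, converting a stage 1 checkpoint works the same way as for stage 2, e.g. `python zero_to_fp32.py <checkpoint_dir> pytorch_model.bin`.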
@tjruwase (Contributor) commented

@stas00, do you remember why stage 1 was excluded?

@stas00 (Collaborator) commented on Jul 12, 2023

I think at the time I developed it, I didn't think anybody used stage 1 (or at least I didn't), so I didn't have a use case for it.

@tjruwase (Contributor) commented

@PicoCreator, thanks for this PR.

@PicoCreator (Contributor, Author) commented on Jul 12, 2023

Glad to see my wild guess work and be of use (I found a few issues in dependent projects that encountered this).

I currently use deepspeed stage 1 to train small toy models (<=3B) as fast as possible and to test param/model architecture changes =) on consumer hardware, where the GPU-to-GPU communication of deepspeed stage 2+ is noticeable.

@tjruwase (Contributor) commented

@PicoCreator, thanks for sharing your context and experience. We always appreciate hearing customer stories. I want to share a minor naming clarification regarding DeepSpeed versus ZeRO:

  1. DeepSpeed is the overall DL library that provides training, inference, compression, etc. optimizations.
  2. ZeRO is a memory optimization of DeepSpeed with stages 1 to 3, in addition to CPU and NVMe offloading.
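As a concrete illustration of that distinction (not part of this PR): ZeRO is selected via the `zero_optimization` section of a DeepSpeed config. A minimal sketch, assuming a placeholder model and hyperparameters; the `zero_optimization.stage` key is the standard one, everything else is illustrative:

```python
# Minimal sketch: enabling ZeRO stage 1 in a DeepSpeed config.
# The model and hyperparameters below are placeholders.
import torch
import deepspeed

model = torch.nn.Linear(8, 8)  # placeholder model

ds_config = {
    "train_batch_size": 8,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
    "fp16": {"enabled": True},          # ZeRO is typically run with fp16/bf16
    "zero_optimization": {"stage": 1},  # the ZeRO stage discussed above: 1, 2, or 3
}

# deepspeed.initialize returns (engine, optimizer, dataloader, lr_scheduler)
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```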

@tjruwase and @mrwyattii each added this pull request to the merge queue twice on Jul 12, 2023; the github-merge-queue bot removed the first three attempts due to failed status checks.

Merged via the queue into deepspeedai:master with commit 103884a on Jul 12, 2023