Support saving models trained with DeepSpeed in Trainer callbacks #31338

Open
dwyatte opened this issue Jun 9, 2024 · 4 comments
@dwyatte (Contributor) commented Jun 9, 2024

Feature request

The Trainer passes a model to every registered callback, but this model cannot be saved correctly when training with DeepSpeed ZeRO Stage 3 (saving it requires access to Trainer.accelerator and Trainer.model_wrapped).

Motivation

I have a custom callback that logs a model to a tracking server at the end of training, but I need access to Trainer.accelerator and Trainer.model_wrapped in the callback to avoid skipping sharded tensors when saving. I believe this may also cause a bug in the official wandb callback, which uses a "fake trainer" (I don't use wandb so can't confirm, but I see an error similar to one reported in another repo, e.g. axolotl-ai-cloud/axolotl#1092, when I try the "fake trainer" approach in my custom callback).

Your contribution

I can contribute this feature, but wanted to get guidance on the design first. Roughly (see the sketch after the list):

  • trainer_callback.CallbackHandler receives an additional argument, accelerator, which is set to Trainer.accelerator on instantiation
  • trainer_callback.CallbackHandler gains an attribute model_wrapped, which the Trainer keeps updated with Trainer.model_wrapped
  • trainer_callback.CallbackHandler.call_event passes self.accelerator and self.model_wrapped along when calling callbacks
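
A minimal sketch of what this could look like (simplified: the real trainer_callback.CallbackHandler takes more constructor arguments and handles TrainerControl flow, which I've omitted here):

```python
class CallbackHandler:
    def __init__(self, callbacks, model, accelerator=None):
        self.callbacks = list(callbacks)
        self.model = model
        self.accelerator = accelerator   # new: passed Trainer.accelerator on instantiation
        self.model_wrapped = None        # new: kept in sync with Trainer.model_wrapped by the Trainer

    def call_event(self, event, args, state, control, **kwargs):
        # Forward the accelerator and the wrapped model to every callback so that
        # callbacks can gather ZeRO-3 sharded parameters before saving.
        for callback in self.callbacks:
            result = getattr(callback, event)(
                args,
                state,
                control,
                model=self.model,
                accelerator=self.accelerator,
                model_wrapped=self.model_wrapped,
                **kwargs,
            )
            if result is not None:
                control = result
        return control
```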

This would allow my callback to do something like state_dict = accelerator.get_state_dict(model_wrapped) and pass that along to model.save_pretrained so that sharded tensors are not skipped when saving, as in the sketch below.
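
For example, a hypothetical end-of-training callback could then look like this (the accelerator and model_wrapped kwargs are the proposed additions, not an existing API):

```python
from transformers import TrainerCallback

class SaveToTrackerCallback(TrainerCallback):
    def on_train_end(self, args, state, control, model=None,
                     accelerator=None, model_wrapped=None, **kwargs):
        # Under ZeRO-3 the parameters live sharded across ranks on the wrapped model;
        # Accelerator.get_state_dict gathers them into a full state dict.
        state_dict = accelerator.get_state_dict(model_wrapped)
        if accelerator.is_main_process:
            # Passing the gathered state_dict avoids writing empty placeholder tensors.
            model.save_pretrained(args.output_dir, state_dict=state_dict)
```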

If there is a better design, or if this is better classified as a bug, please let me know.

@dwyatte added the Feature request label on Jun 9, 2024
@amyeroberts (Collaborator) commented
cc @muellerzr @SunMarc

@cubatic45 commented
Support saving models trained with DeepSpeed to wandb

@YeLuoSuiYou commented
Support for this: if we want to monitor per-layer gradients in a callback or log them to wandb, we also need access to self.model_wrapped in the callback.
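
A rough sketch of that use case, assuming the same proposed model_wrapped kwarg (deepspeed.utils.safe_get_full_grad is DeepSpeed's helper for reading a parameter's gathered gradient under ZeRO-3; depending on the Trainer version an earlier hook may be needed so gradients have not already been zeroed):

```python
from deepspeed.utils import safe_get_full_grad
from transformers import TrainerCallback

class GradNormCallback(TrainerCallback):
    def on_step_end(self, args, state, control, model_wrapped=None, **kwargs):
        # Under DeepSpeed, model_wrapped is the DeepSpeedEngine; .module is the underlying model.
        for name, param in model_wrapped.module.named_parameters():
            grad = safe_get_full_grad(param)  # gathers the full gradient under ZeRO-3
            if grad is not None:
                print(f"{name}: grad norm = {grad.norm().item():.4f}")
```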

@ri938 commented Dec 24, 2024

Also interested in this feature request: I want to save checkpoints when using DeepSpeed Stage 3. Currently, when I upload them to the Hugging Face Hub, empty tensors get pushed.
