Support saving models trained with DeepSpeed in Trainer callbacks #31338

Open
dwyatte opened this issue Jun 9, 2024 · 4 comments
@dwyatte (Contributor) commented Jun 9, 2024

Feature request

The Trainer passes a model to every registered callback, but this model cannot be saved correctly when training with DeepSpeed ZeRO Stage 3 (saving it requires access to Trainer.accelerator and Trainer.model_wrapped).

Motivation

I have a custom callback that logs a model to a tracking server at the end of training, but I need access to Trainer.accelerator and Trainer.model_wrapped in the callback to avoid skipping sharded tensors when saving. I believe this may also cause a bug in the official wandb callback, which uses a "fake trainer" (I don't use wandb so can't confirm, but I see an error similar to one reported in another repo, e.g. axolotl-ai-cloud/axolotl#1092, when I try the "fake trainer" approach in my custom callback).

Your contribution

I can contribute this feature, but wanted to get guidance on the design first. Roughly (see the sketch after the list):

  • trainer_callback.CallbackHandler receives an additional argument, accelerator, which is set to Trainer.accelerator on instantiation
  • trainer_callback.CallbackHandler gains an attribute model_wrapped, which the Trainer keeps updated with Trainer.model_wrapped
  • trainer_callback.CallbackHandler.call_event passes self.accelerator and self.model_wrapped along when calling callbacks
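
A minimal sketch of what this could look like (simplified: the real trainer_callback.CallbackHandler takes more constructor arguments and handles TrainerControl flow, which I've omitted here):

```python
class CallbackHandler:
    def __init__(self, callbacks, model, accelerator=None):
        self.callbacks = list(callbacks)
        self.model = model
        self.accelerator = accelerator   # new: passed Trainer.accelerator on instantiation
        self.model_wrapped = None        # new: kept in sync with Trainer.model_wrapped by the Trainer

    def call_event(self, event, args, state, control, **kwargs):
        # Forward the accelerator and the wrapped model to every callback so that
        # callbacks can gather ZeRO-3 sharded parameters before saving.
        for callback in self.callbacks:
            result = getattr(callback, event)(
                args,
                state,
                control,
                model=self.model,
                accelerator=self.accelerator,
                model_wrapped=self.model_wrapped,
                **kwargs,
            )
            if result is not None:
                control = result
        return control
```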

This would allow my callback to do something like state_dict = accelerator.get_state_dict(model_wrapped) and pass that along to model.save_pretrained so that sharded tensors are not skipped when saving, as in the sketch below.
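
For example, a hypothetical end-of-training callback could then look like this (the accelerator and model_wrapped kwargs are the proposed additions, not an existing API):

```python
from transformers import TrainerCallback

class SaveToTrackerCallback(TrainerCallback):
    def on_train_end(self, args, state, control, model=None,
                     accelerator=None, model_wrapped=None, **kwargs):
        # Under ZeRO-3 the parameters live sharded across ranks on the wrapped model;
        # Accelerator.get_state_dict gathers them into a full state dict.
        state_dict = accelerator.get_state_dict(model_wrapped)
        if accelerator.is_main_process:
            # Passing the gathered state_dict avoids writing empty placeholder tensors.
            model.save_pretrained(args.output_dir, state_dict=state_dict)
```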

If there is a better design, or if this is better classified as a bug, please let me know.

@dwyatte added the Feature request label on Jun 9, 2024
@amyeroberts (Collaborator) commented
cc @muellerzr @SunMarc

@cubatic45 commented
Support saving models trained with DeepSpeed to wandb

@YeLuoSuiYou commented
Support for this: if we want to monitor per-layer gradients in a callback or log them to wandb, we also need access to self.model_wrapped in the callback.
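
A rough sketch of that use case, assuming the same proposed model_wrapped kwarg (deepspeed.utils.safe_get_full_grad is DeepSpeed's helper for reading a parameter's gathered gradient under ZeRO-3; depending on the Trainer version an earlier hook may be needed so gradients have not already been zeroed):

```python
from deepspeed.utils import safe_get_full_grad
from transformers import TrainerCallback

class GradNormCallback(TrainerCallback):
    def on_step_end(self, args, state, control, model_wrapped=None, **kwargs):
        # Under DeepSpeed, model_wrapped is the DeepSpeedEngine; .module is the underlying model.
        for name, param in model_wrapped.module.named_parameters():
            grad = safe_get_full_grad(param)  # gathers the full gradient under ZeRO-3
            if grad is not None:
                print(f"{name}: grad norm = {grad.norm().item():.4f}")
```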

@ri938 commented Dec 24, 2024

Also interested in this feature request: I want to save checkpoints when using DeepSpeed Stage 3. Currently, when I upload them to the Hugging Face Hub, empty tensors get pushed.
