Feature request
Trainer callbacks pass a model to every registered callback, but that model cannot be saved when training with DeepSpeed Stage 3 (saving it requires access to Trainer.accelerator and Trainer.model_wrapped).
Motivation
I have a custom callback that logs a model to a tracking server at the end of training, but I need access to Trainer.accelerator and Trainer.model_wrapped inside the callback to avoid skipping sharded tensors when saving. I believe the same limitation may cause a bug in the official wandb callback, which builds a "fake trainer" (I don't use wandb so I can't confirm, but I see an error similar to one reported in another repo when I try the "fake trainer" approach in my custom callback, e.g. axolotl-ai-cloud/axolotl#1092).
Your contribution
I can contribute this feature, but wanted to get guidance on the design first. Roughly (see the sketch after this list):
- trainer_callback.CallbackHandler receives an additional argument, accelerator, which the Trainer passes as Trainer.accelerator on instantiation
- trainer_callback.CallbackHandler gains a model_wrapped attribute, which the Trainer keeps in sync with Trainer.model_wrapped
- trainer_callback.CallbackHandler.call_event passes self.accelerator and self.model_wrapped along when calling callbacks
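A minimal sketch of what this could look like, assuming the current CallbackHandler structure (existing constructor arguments such as the tokenizer/processing class and dataloaders are abbreviated here; only the accelerator and model_wrapped additions are new):

```python
from transformers.trainer_callback import TrainerCallback


class CallbackHandler(TrainerCallback):
    """Sketch only: existing attributes are trimmed down to the essentials."""

    def __init__(self, callbacks, model, optimizer, lr_scheduler, accelerator=None):
        self.callbacks = []
        for cb in callbacks:
            self.add_callback(cb)
        self.model = model
        self.optimizer = optimizer
        self.lr_scheduler = lr_scheduler
        # New: handed in by Trainer on instantiation (Trainer.accelerator).
        self.accelerator = accelerator
        # New: updated by Trainer whenever Trainer.model_wrapped changes
        # (e.g. after DeepSpeed/FSDP wrapping).
        self.model_wrapped = None

    def add_callback(self, callback):
        self.callbacks.append(callback() if isinstance(callback, type) else callback)

    def call_event(self, event, args, state, control, **kwargs):
        for callback in self.callbacks:
            result = getattr(callback, event)(
                args,
                state,
                control,
                model=self.model,
                optimizer=self.optimizer,
                lr_scheduler=self.lr_scheduler,
                # New: forwarded so callbacks can gather a full state dict
                # under DeepSpeed ZeRO Stage 3.
                accelerator=self.accelerator,
                model_wrapped=self.model_wrapped,
                **kwargs,
            )
            # Callbacks may return a modified TrainerControl.
            if result is not None:
                control = result
        return control
```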
This allows my callback to do something like state_dict = accelerator.get_state_dict(model_wrapped) and pass that along to model.save_pretrained, so that sharded tensors are not skipped.
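For example, with the proposal above a callback could look roughly like this (the accelerator and model_wrapped keyword arguments only exist under the proposed design, and the callback name is made up):

```python
from transformers import TrainerCallback


class SaveFullModelCallback(TrainerCallback):
    """Hypothetical callback relying on the proposed accelerator/model_wrapped kwargs."""

    def on_train_end(self, args, state, control, model=None,
                     accelerator=None, model_wrapped=None, **kwargs):
        # Under ZeRO Stage 3 each rank only holds a shard of the parameters;
        # get_state_dict() gathers the full, unsharded weights (a collective
        # call, so it must run on every process).
        state_dict = accelerator.get_state_dict(model_wrapped)
        if accelerator.is_main_process:
            # Passing state_dict explicitly keeps save_pretrained from writing
            # the empty/sharded parameters held by the local model.
            model.save_pretrained(args.output_dir, state_dict=state_dict)
```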
If there is a better design or if this is better classified as a bug, please let me know
Also interested in this feature request: I want to save checkpoints when using DeepSpeed Stage 3 and upload them to the Hugging Face Hub, but currently empty tensors get pushed.
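Until something like the above lands, one possible interim workaround is to subclass Trainer, which already exposes accelerator and model_wrapped. This is only an illustrative sketch (the class and method names are made up):

```python
from transformers import Trainer


class GatheredSaveTrainer(Trainer):
    """Illustrative workaround: gather the full ZeRO-3 state dict before saving."""

    def save_full_checkpoint(self, output_dir: str):
        # Collective call: every rank participates in gathering the shards.
        state_dict = self.accelerator.get_state_dict(self.model_wrapped)
        if self.accelerator.is_main_process:
            # The saved files contain real (non-empty) tensors and can then be
            # uploaded to the Hub, e.g. with huggingface_hub.upload_folder.
            self.model.save_pretrained(output_dir, state_dict=state_dict)
```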