
Saving with trainer deepspeed zero3 missing config.json and tokenizer files. #25368

Closed
zjjMaiMai opened this issue Aug 8, 2023 · 2 comments · Fixed by #25817

Comments

zjjMaiMai (Contributor) commented Aug 8, 2023

The Trainer does not save the tokenizer or config.json when training with DeepSpeed ZeRO-3 and stage3_gather_16bit_weights_on_model_save=False.

Line 2776 raises a ValueError, so line 2778 (self._save) never runs to save the tokenizer and the other files. Is this the expected behavior?

elif self.is_deepspeed_enabled:
    # this takes care of everything as long as we aren't under zero3
    if version.parse(accelerate_version) <= version.parse("0.20.3"):
        raise ValueError("Install Accelerate from main branch")
    try:
        state_dict = self.accelerator.get_state_dict(self.deepspeed)
        if self.args.should_save:
            self._save(output_dir, state_dict=state_dict)
    except ValueError:
        logger.warning(
            " stage3_gather_16bit_weights_on_model_save=false. Saving the full checkpoint instead, use"
            " zero_to_fp32.py to recover weights"
        )
        self.model_wrapped.save_checkpoint(output_dir)
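A minimal, self-contained sketch of the control-flow problem described above (the function names here are hypothetical stand-ins, not the actual Trainer API): once get_state_dict raises, execution jumps straight to the except branch, so everything self._save would have written besides the weights (config.json, tokenizer files) is silently skipped.

```python
def get_state_dict(gather_weights: bool) -> dict:
    # Stand-in for accelerator.get_state_dict, which raises under ZeRO-3
    # when stage3_gather_16bit_weights_on_model_save=False.
    if not gather_weights:
        raise ValueError("stage3_gather_16bit_weights_on_model_save=false")
    return {"weights": "..."}


def save_model(gather_weights: bool) -> list[str]:
    """Returns the list of files this (simplified) save path would write."""
    saved = []
    try:
        get_state_dict(gather_weights)
        # The _save path writes the weights *and* config.json / tokenizer files.
        saved += ["pytorch_model.bin", "config.json", "tokenizer.json"]
    except ValueError:
        # Only the DeepSpeed checkpoint is written; config.json and the
        # tokenizer files are never produced -- the reported bug.
        saved += ["deepspeed_checkpoint"]
    return saved


assert "config.json" in save_model(gather_weights=True)
assert "config.json" not in save_model(gather_weights=False)  # the bug
```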

Originally posted by @zjjMaiMai in #24728 (comment).

sgugger (Collaborator) commented Aug 8, 2023

cc @pacman100

pacman100 (Contributor) commented Aug 8, 2023

Hello @zjjMaiMai, thank you for the details. This shouldn't be the behaviour, and I'll be working on fixing this.
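One possible direction for such a fix, sketched below with hypothetical stand-in functions (this is an illustration only, not necessarily the change that landed in #25817): write the non-weight artifacts unconditionally, so a ValueError during weight gathering no longer skips config.json and the tokenizer files.

```python
import json
import os
import tempfile


def save_non_weight_files(output_dir: str) -> None:
    # Stand-in for config.save_pretrained / tokenizer.save_pretrained.
    os.makedirs(output_dir, exist_ok=True)
    with open(os.path.join(output_dir, "config.json"), "w") as f:
        json.dump({"model_type": "demo"}, f)


def save_model(output_dir: str, gather_weights: bool) -> None:
    try:
        if not gather_weights:
            # Mimics get_state_dict raising under ZeRO-3.
            raise ValueError("stage3_gather_16bit_weights_on_model_save=false")
        # Full save path: the real code would write the weights here.
    except ValueError:
        # Fall back to the DeepSpeed checkpoint (omitted in this sketch).
        pass
    # Key change: the non-weight files are written in every branch.
    save_non_weight_files(output_dir)


with tempfile.TemporaryDirectory() as d:
    save_model(d, gather_weights=False)
    assert os.path.exists(os.path.join(d, "config.json"))
```

The essential idea is moving the config/tokenizer saving out of the try block, so it runs whether or not the 16-bit weight gathering succeeds.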
