
[Bug] Cannot restore from checkpoint #3360

Closed
YuboLong opened this issue Dec 4, 2023 · 2 comments · Fixed by coqui-ai/Trainer#131 or coqui-ai/Trainer#135
Labels
bug Something isn't working

Comments


YuboLong commented Dec 4, 2023

Describe the bug

I'm training a VITS model. When continuing a training run with

python TTS/bin/train_tts.py --continue_path path/to/training/model/output/checkpoint/

the code in the function _restore_best_loss (trainer.py, line 1720) does not check the type of ch["model_loss"].
When restoring from best_model_xxx.pth or best_model.pth, ch["model_loss"] is a float, but when restoring from checkpoint_xxx.pth it is a dict.
At the end of an epoch, the trainer compares the current loss against the restored best loss, an error is raised saying that a 'dict' cannot be compared with a real number, and the training process exits.
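
For illustration, a minimal sketch of what the mismatch looks like when the two checkpoint types are inspected directly (file names and the keys inside the dict are only examples from my run):

import torch

# best_model.pth / best_model_xxx.pth: "model_loss" is a plain float
best = torch.load("best_model.pth", map_location="cpu")
print(type(best["model_loss"]))  # <class 'float'>

# checkpoint_xxx.pth: "model_loss" is a dict of losses
ckpt = torch.load("checkpoint_12345.pth", map_location="cpu")
print(type(ckpt["model_loss"]))  # <class 'dict'>, e.g. {"train_loss": ..., ...}

# so the end-of-epoch comparison fails with something like:
# TypeError: '<' not supported between instances of 'float' and 'dict'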

For now, I have modified the code as follows to avoid the problem.

    def _restore_best_loss(self):
        """Restore the best loss from args.best_path if provided, otherwise
        from the model (args.continue_path) used for resuming the training."""
        if self.args.continue_path and (self.restore_step != 0 or self.args.best_path):
            logger.info(" > Restoring best loss from %s ...", os.path.basename(self.args.best_path))
            ch = load_fsspec(self.args.restore_path, map_location="cpu")
            if "model_loss" in ch:
                model_loss = ch["model_loss"]
                if isinstance(model_loss, dict):
                    # checkpoint_xxx.pth stores a dict of losses; use the training loss
                    self.best_loss = model_loss["train_loss"]
                else:
                    # best_model*.pth stores a single float
                    self.best_loss = model_loss
            logger.info(" > Starting with loaded last best loss %f", self.best_loss)
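
Taking the "train_loss" entry when ch["model_loss"] is a dict is only my guess at the intended value; the proper fix probably belongs in the Trainer itself.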

To Reproduce

Restore a training run from a checkpoint created by a Ctrl-C interrupt or by 'save_best_after'.

Expected behavior

Training continues without the process exiting.

Logs

No response

Environment

Windows 10 with an RTX 3060
Colab with a T4
Git branch: dev (11ec9f7471620ebaa57db7ff5705254829ffe516)

I encounter the issue in both environments.

Additional context

No response


eginhard commented Dec 4, 2023

coqui-ai/Trainer#131 would fix this


erogol commented Dec 7, 2023

FYI, I just reverted that PR due to CI conflicts.
