
[Bug] Cannot restore from checkpoint #3360

Closed
YuboLong opened this issue Dec 4, 2023 · 2 comments · Fixed by coqui-ai/Trainer#131 or coqui-ai/Trainer#135
Labels
bug Something isn't working

Comments


YuboLong commented Dec 4, 2023

Describe the bug

I'm training a VITS model. When continuing a training run with

python TTS/bin/train_tts.py --continue_path path/to/training/model/output/checkpoint/

the code in the function _restore_best_loss (trainer.py, line 1720) does not check the type of ch["model_loss"].
When restoring from best_model_xxx.pth or best_model.pth, ch["model_loss"] is a float, but when restoring from checkpoint_xxx.pth it is a dict.
At the end of an epoch, the trainer compares the current loss against the restored best loss, an error is raised saying that a 'dict' cannot be compared with a real number, and the training process exits.
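
For illustration, a minimal sketch of what the mismatch looks like when the two checkpoint types are inspected directly (file names and the keys inside the dict are only examples from my run):

import torch

# best_model.pth / best_model_xxx.pth: "model_loss" is a plain float
best = torch.load("best_model.pth", map_location="cpu")
print(type(best["model_loss"]))  # <class 'float'>

# checkpoint_xxx.pth: "model_loss" is a dict of losses
ckpt = torch.load("checkpoint_12345.pth", map_location="cpu")
print(type(ckpt["model_loss"]))  # <class 'dict'>, e.g. {"train_loss": ..., ...}

# so the end-of-epoch comparison fails with something like:
# TypeError: '<' not supported between instances of 'float' and 'dict'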

For now, I have modified the code as follows to avoid the problem.

    def _restore_best_loss(self):
        """Restore the best loss from args.best_path if provided, otherwise
        from the model (args.continue_path) used for resuming the training."""
        if self.args.continue_path and (self.restore_step != 0 or self.args.best_path):
            logger.info(" > Restoring best loss from %s ...", os.path.basename(self.args.best_path))
            ch = load_fsspec(self.args.restore_path, map_location="cpu")
            if "model_loss" in ch:
                model_loss = ch["model_loss"]
                if isinstance(model_loss, dict):
                    # checkpoint_xxx.pth stores a dict of losses; use the training loss
                    self.best_loss = model_loss["train_loss"]
                else:
                    # best_model*.pth stores a single float
                    self.best_loss = model_loss
            logger.info(" > Starting with loaded last best loss %f", self.best_loss)
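
Taking the "train_loss" entry when ch["model_loss"] is a dict is only my guess at the intended value; the proper fix probably belongs in the Trainer itself.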

To Reproduce

Restore a training run from a checkpoint created by a Ctrl-C interrupt or by 'save_best_after'.

Expected behavior

Training continues without the process exiting.

Logs

No response

Environment

Windows 10 with an RTX 3060
Colab with a T4
Git branch: dev (11ec9f7471620ebaa57db7ff5705254829ffe516)

I encounter the issue in both environments.

Additional context

No response


eginhard commented Dec 4, 2023

coqui-ai/Trainer#131 would fix this


erogol commented Dec 7, 2023

FYI, I just reverted that PR due to CI conflicts.
