Describe the bug
I'm training a VITS model. When a training run is continued from a checkpoint, the function _restore_best_loss (trainer.py, line 1720) does not check the type of ch["model_loss"].
When restoring from best_model_xxx.pth or best_model.pth, ch["model_loss"] is a float; when restoring from checkpoint_xxx.pth, it is a dict.
At the end of an epoch the trainer compares this value against a loss, which raises an error saying that a 'dict' cannot be compared with a real number, and the training process exits.
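The failure described above can be reproduced outside the trainer. This is a minimal sketch, not the trainer's actual code: the checkpoint dictionaries and the `is_better` helper are hypothetical stand-ins, and only the `"model_loss"` and `"train_loss"` keys come from this report.

```python
# Hypothetical checkpoint contents, for illustration only.
best_model_ch = {"model_loss": 0.42}                    # as in best_model.pth
checkpoint_ch = {"model_loss": {"train_loss": 0.42}}    # as in checkpoint_xxx.pth


def is_better(current_loss, best_loss):
    """Stand-in for the end-of-epoch comparison (assumed, not the real trainer code)."""
    return current_loss < best_loss


# Works when the restored value is a plain float.
print(is_better(0.40, best_model_ch["model_loss"]))

# Fails when the restored value is a dict of losses.
try:
    is_better(0.40, checkpoint_ch["model_loss"])
except TypeError as err:
    print(f"TypeError: {err}")
```

Python raises the TypeError at the first comparison, which matches the point at which the training process exits.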
For now I have modified the code as follows to avoid the problem:
def _restore_best_loss(self):
    """Restore the best loss from the args.best_path if provided else
    from the model (`args.continue_path`) used for resuming the training"""
    if self.args.continue_path and (self.restore_step != 0 or self.args.best_path):
        logger.info(" > Restoring best loss from %s ...", os.path.basename(self.args.best_path))
        ch = load_fsspec(self.args.restore_path, map_location="cpu")
        if "model_loss" in ch:
            model_loss = ch["model_loss"]
            # checkpoint_xxx.pth stores a dict of losses; best_model*.pth stores a float.
            if isinstance(model_loss, dict):
                self.best_loss = model_loss["train_loss"]
            else:
                self.best_loss = model_loss
        logger.info(" > Starting with loaded last best loss %f", self.best_loss)
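The same type check can be isolated in a small helper so it is testable without a trainer instance. This is a sketch under stated assumptions: `extract_best_loss` and its `key` parameter are hypothetical names that do not exist in the trainer, and `"train_loss"` is the only dict key confirmed by this report.

```python
def extract_best_loss(model_loss, key="train_loss"):
    """Return the best loss as a float from either a number or a dict of losses.

    `model_loss` is the value stored under ch["model_loss"]; `key` selects
    which entry to use when that value is a dict (assumed: "train_loss").
    """
    if isinstance(model_loss, dict):
        value = model_loss.get(key)
        if value is None:
            raise KeyError(f"checkpoint loss dict has no usable {key!r} entry")
        return float(value)
    # Plain numbers (and other float-convertible scalars) pass through.
    return float(model_loss)


print(extract_best_loss(0.42))                   # float from best_model.pth
print(extract_best_loss({"train_loss": 0.37}))   # dict from checkpoint_xxx.pth
```

Raising on a missing key, rather than falling back silently, keeps a malformed checkpoint from producing a misleading best loss.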
To Reproduce
Restore a training run from a checkpoint that was saved on Ctrl-C or triggered by 'save_best_after'.
Expected behavior
Training continues without the process exiting.
Logs
No response
Environment
Windows 10 with RTX3060
Colab with T4
Git branch: Dev (11ec9f7471620ebaa57db7ff5705254829ffe516)
I encounter the issue in both environments.
Additional context
No response