Early Stopping stops too early when using SLURM #2038
Comments
Updated Lightning to current master; now early stopping doesn't work at all.
When you are using 0.7.6, early stopping is called twice in the training loop, so a patience of 50 was effectively 25. See #1751. This was supposed to be fixed by now, but there are still some issues with it.
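To illustrate what I mean, here is a toy sketch (not the actual Lightning code, just a simple wait counter) of how running the same check twice per epoch cuts the effective patience roughly in half:

```python
class ToyEarlyStopping:
    """Minimal stand-in for an early-stopping check with a wait counter."""

    def __init__(self, patience):
        self.patience = patience
        self.wait = 0
        self.best = float("inf")

    def check(self, metric):
        """Return True if training should stop."""
        if metric < self.best:
            self.best = metric
            self.wait = 0
            return False
        self.wait += 1
        return self.wait >= self.patience


stopper = ToyEarlyStopping(patience=50)
val_loss = 1.0  # pretend the metric has plateaued from the start
for epoch in range(100):
    # Bug pattern from 0.7.6: the same check runs twice in one training loop,
    # so the wait counter advances by 2 per epoch.
    if stopper.check(val_loss) or stopper.check(val_loss):
        print(f"stopped at epoch {epoch}")  # fires around epoch 25, not 50
        break
```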
@HansBambel thanks. Strange that it worked locally but not on the cluster, though. Maybe I was using a slightly different version of Lightning locally. Should I close this and open a new bug report for early stopping not working at all right now? I had a similar problem before because I didn't use a val_step; maybe something like that crept in again?
I think having no val_step could definitely be an issue, since early stopping relies on the validation metric (to my knowledge).
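For reference, a rough sketch of how the callback gets wired to a validation metric (assuming the 0.7.x-era API; exact names and signatures may differ slightly):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import EarlyStopping

early_stop = EarlyStopping(
    monitor="val_loss",  # must be produced by the validation loop
    patience=50,
    mode="min",
)

trainer = Trainer(early_stop_callback=early_stop)
# Inside the LightningModule, validation_epoch_end should return
# something like {"val_loss": avg_loss} so the callback can see it;
# without a validation step there is nothing to monitor.
```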
@HansBambel Okay, thanks, will do. I'll leave this open until I can test it once early stopping works again at all. Early stopping is also supposed to work with metrics from the training step.
Alright!
Closed via #2119
🐛 Bug
I have a really strange bug where the Early Stopping callback seems to fire too early, but only when using my university's SLURM cluster. When I train the same model on my laptop locally, this does not happen. Sadly I can't run the code directly on the login node to see whether it happens on all of their systems or only when SLURM is being used. What's really strange is that with a higher patience the training lasts longer: early stopping never stops training sooner than hparams.patience/2 (actually it happens weirdly close to hparams.patience/2), but almost never as late as hparams.patience. I tried to create a minimum working example, code below.
To Reproduce
Steps to reproduce the behavior:
Code sample
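The original code sample is not reproduced here; the following is only a hypothetical sketch of the kind of setup described above (made-up module, data, and hparams, assuming the 0.7.x-era API):

```python
import torch
from argparse import Namespace
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import EarlyStopping

hparams = Namespace(patience=50)  # illustrative; the real hparams came from argparse


class MinimalModel(pl.LightningModule):
    def __init__(self, hparams):
        super().__init__()
        self.hparams = hparams
        self.layer = torch.nn.Linear(32, 1)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return {"loss": torch.nn.functional.mse_loss(self(x), y)}

    def validation_step(self, batch, batch_idx):
        x, y = batch
        return {"val_loss": torch.nn.functional.mse_loss(self(x), y)}

    def validation_epoch_end(self, outputs):
        avg = torch.stack([o["val_loss"] for o in outputs]).mean()
        return {"val_loss": avg, "log": {"val_loss": avg}}

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())

    def train_dataloader(self):
        return DataLoader(TensorDataset(torch.randn(256, 32), torch.randn(256, 1)), batch_size=32)

    def val_dataloader(self):
        return DataLoader(TensorDataset(torch.randn(64, 32), torch.randn(64, 1)), batch_size=32)


early_stop = EarlyStopping(monitor="val_loss", patience=hparams.patience, mode="min")
trainer = Trainer(early_stop_callback=early_stop, max_epochs=1000)
trainer.fit(MinimalModel(hparams))
```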
And here is my .sh file which I call via sbatch slurm_script.sh:
Expected behavior
Training should last at least as long as the patience value of the Early Stopping callback.
I'm using PyTorch Lightning 0.7.7.dev0.