improve early stopping verbose logging #6811
Conversation
Codecov Report
```diff
@@            Coverage Diff            @@
##           master    #6811     +/-   ##
=========================================
- Coverage      91%      87%      -4%
=========================================
  Files         200      200
  Lines       12875    12887      +12
=========================================
- Hits        11765    11231     -534
- Misses       1110     1656     +546
```
msg = f"Metric {self.monitor} improved. New best score: {current:.3f}" | ||
return msg | ||
|
||
def _log_info(self, msg: str, rank: int) -> None: |
@awaelchli What's the point of having the `rank` parameter? Maybe a `rank_zero_only=True` parameter would make more sense? Or the `rank_zero_info` utility.
It depends. We can `log_info` on each rank when verbose, like the way I have done it for the debug message a few lines below. Basically, I implemented it like this so people can give feedback on what they prefer. Happy to go either way.
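(For illustration, a minimal sketch of the two options under discussion — the names and placement here are assumptions, not necessarily the PR's exact code:)

```python
import logging

from pytorch_lightning.utilities import rank_zero_info

log = logging.getLogger(__name__)


def _log_info_all_ranks(msg: str, rank: int) -> None:
    # Option A: emit on every rank, tagging the message with the rank that produced it.
    log.info(f"[rank: {rank}] {msg}")


def _log_info_rank_zero(msg: str) -> None:
    # Option B: only rank 0 emits the message, so duplicates are suppressed on multi-GPU.
    rank_zero_info(msg)
```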
I think it would make sense to display on all ranks only if the `current_value` is different.
it's different when the user does not use `sync_dist=True` in logging, but we don't know that. The best I could do is `all_gather` and then compare. Is it worth it?
Might be overkill. I don't have a strong opinion on either `log_info` or `rank_zero_info`.
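(For reference, a rough sketch of the `all_gather`-and-compare check being weighed above — a hypothetical helper, not part of this PR:)

```python
import torch
import torch.distributed as dist


def monitored_value_differs_across_ranks(current: torch.Tensor) -> bool:
    # Gather the monitored value from every rank and check whether they all agree.
    # Only meaningful once a distributed process group has been initialized.
    if not (dist.is_available() and dist.is_initialized()):
        return False
    gathered = [torch.zeros_like(current) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, current)
    return not all(torch.equal(gathered[0], other) for other in gathered[1:])
```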
Insofar as this was triggered by my suggestion, let me give you the use case I had in mind. Basically, I think this is almost a "return code" sort of situation? Maybe that is even a better way to do it for those GUIs. @williamFalcon, basically in the GUI I would want to be able to sort on which of my hyperparameter options diverged, hit max_time, etc. And @luiscape, this is also useful for the sorts of return codes from an experiment that we discussed for hyperparameter optimization. In the controller you might want to do different things based on them.
Hey @jlperla, sounds like a great idea. EarlyStopping should provide a summary of why it stops the run, which can then be parsed. Maybe in a follow-up PR, @awaelchli?
I cannot comment on the Grid integration part at this point. In this PR I propose to add a report for why EarlyStopping is triggered. Currently, the only reason for stopping is that the monitored metric did not improve.
Hey, sounds good. Take a look at this: #6795 (comment). I just really think that the whole early stopping stuff should be thought of holistically. To me, stopping at a threshold, hitting max time, hitting the maximum number of epochs, and having the optimizer stall out before it can solve the model (i.e. the …) …

Honestly, given that I could only have one, I would much prefer a "return code" to any logging. That way I can do a whole bunch of runs, put those return codes in a CSV table, and start analyzing the proportion that succeeded (e.g. below the tolerance), etc. The other thing is that this feature is useless if it only happens when …
Also, why would you only log the stopping reason for the early stopping callback rather than the max_epochs or max_time ones? I really think this smells like the whole user-added early stopping callback being the wrong design. There is no reason in my mind why …

So my proposal: stop on this and the other issues that spawned out of #6795, and see what would be involved in adding those things as core trainer options. I can't imagine a scenario where you wouldn't want early stopping, checking for divergence, etc., especially if you are paying by the minute. Users could still write callbacks if they wanted, but I am not sure it is that useful to make it first class. The way most optimizers work is that if a user wants to stop an optimization run, they are also able to just return -Inf or Inf or whatever. That is good enough for most scenarios I can think of.
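(To make the "return code" idea concrete, a hypothetical sketch — none of these names exist in Lightning — of what a machine-readable stopping reason could look like for a run controller:)

```python
from enum import Enum


class StopReason(Enum):
    # Hypothetical outcomes a controller might want to sort or filter runs by.
    NO_IMPROVEMENT = "no_improvement"          # patience exhausted without a min_delta improvement
    STOPPING_THRESHOLD = "stopping_threshold"  # monitored metric crossed the success threshold
    DIVERGED = "diverged"                      # monitored metric blew up
    MAX_EPOCHS = "max_epochs"
    MAX_TIME = "max_time"


# A controller could then collect one reason per run and analyze the results, e.g.
# write {"run_id": ..., "stop_reason": reason.value} rows to a CSV table.
```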
LGTM!
LGTM!
What does this PR do?
Part of #6795
When the user sets `verbose=True`, show detailed information on how early stopping tracks the metric over time. Adds a message when the metric improves by at least `min_delta`.

For multi-GPU, the message will show on all ranks. The user can log with `sync_dist=True` or use torchmetrics, in which case the values are the same for early stopping across all GPUs.

Also fixes #6113: make sure we don't report patience in terms of "epochs", because depending on the Trainer flags we may not check every epoch.
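For reference, a usage sketch of the behavior described above, using the standard Lightning APIs of the time (the exact verbose messages are whatever this PR emits):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import EarlyStopping

# verbose=True enables the detailed messages added in this PR
# (improvement reports and the reason early stopping triggered).
early_stop = EarlyStopping(monitor="val_loss", min_delta=0.01, patience=3, verbose=True)
trainer = Trainer(callbacks=[early_stop], gpus=2, accelerator="ddp")

# Inside the LightningModule, logging with sync_dist=True reduces the metric across
# GPUs so every rank sees (and reports) the same value:
#     self.log("val_loss", loss, sync_dist=True)
```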
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines.
Did you have fun?
Make sure you had fun coding 🙃