
improve early stopping verbose logging #6811

Merged: 20 commits into master from feature/early-stopping-verbosity on May 3, 2021

Conversation

@awaelchli (Contributor) commented Apr 3, 2021

What does this PR do?

Part of #6795
When the user sets verbose=True, show detailed information about how early stopping tracks the monitored metric over time.

Adds a message when the metric improves by at least min_delta. It looks like this:

Metric valid_loss improved by 0.076 >= min_delta = 0.075. New best score: 0.310
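
As a usage sketch (not part of this PR's diff; the monitor name and thresholds are illustrative), the message appears when verbose logging is enabled on the callback:

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import EarlyStopping

# verbose=True enables the improvement messages shown above
early_stopping = EarlyStopping(monitor="valid_loss", min_delta=0.075, patience=3, verbose=True)
trainer = Trainer(callbacks=[early_stopping])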

For multi-GPU, the message will show on all ranks. It looks like this:

[0] Metric valid_loss improved by 0.076 >= min_delta = 0.075. New best score: 0.310
[1] Metric valid_loss improved by 0.077 >= min_delta = 0.075. New best score: 0.309

If the user logs the metric with sync_dist=True or with torchmetrics, the values are the same on every rank, and early stopping behaves identically across GPUs.
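
A minimal sketch of what that looks like from the user side (compute_loss is a hypothetical helper standing in for the model's own loss computation):

import pytorch_lightning as pl

class LitModel(pl.LightningModule):
    def validation_step(self, batch, batch_idx):
        # compute_loss is a placeholder for the model's actual loss computation
        loss = self.compute_loss(batch)
        # sync_dist=True reduces the value across processes before logging,
        # so EarlyStopping sees the same number on every rank
        self.log("valid_loss", loss, sync_dist=True)
        return loss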

Also
Fixes #6113: make sure we don't report patience in terms of "epochs", because depending on the Trainer flags the metric may not be checked every epoch.
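
To illustrate the interaction (values here are made up): patience counts validation checks of the monitored metric, not epochs, so its meaning depends on how often the Trainer runs validation:

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import EarlyStopping

# patience=3 means "3 validation checks without improvement"; with validation
# running every 2nd epoch, that can correspond to up to 6 training epochs
early_stopping = EarlyStopping(monitor="valid_loss", patience=3, verbose=True)
trainer = Trainer(check_val_every_n_epoch=2, callbacks=[early_stopping])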

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

@awaelchli awaelchli added feature Is an improvement or enhancement callback labels Apr 3, 2021
codecov bot commented Apr 3, 2021

Codecov Report

Merging #6811 (7dd7405) into master (393b252) will decrease coverage by 4%.
The diff coverage is 93%.

@@           Coverage Diff           @@
##           master   #6811    +/-   ##
=======================================
- Coverage      91%     87%    -4%     
=======================================
  Files         200     200            
  Lines       12875   12887    +12     
=======================================
- Hits        11765   11231   -534     
- Misses       1110    1656   +546     

@awaelchli awaelchli added this to the 1.3 milestone Apr 5, 2021
@awaelchli awaelchli marked this pull request as ready for review April 5, 2021 09:28
msg = f"Metric {self.monitor} improved. New best score: {current:.3f}"
return msg

def _log_info(self, msg: str, rank: int) -> None:
@kaushikb11 (Contributor) commented Apr 5, 2021

@awaelchli What's the point of having the rank parameter? Maybe a rank_zero_only=True parameter would make more sense? Or the rank_zero_info utility.

@awaelchli (Contributor, Author) replied Apr 5, 2021

It depends. We can log_info on each rank when verbose, the way I have done it for debug a few lines below.
Basically, I implemented it like this so people can give feedback on what they prefer; happy to go either way.

@kaushikb11 (Contributor) replied

I think it would make sense to display on all ranks only if the current_value is different.

@awaelchli (Contributor, Author) replied

It's different when the user does not use sync_dist=True in logging, but we don't know that.
The best I could do is all_gather the values and then compare. Is it worth it?
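
For reference, a rough sketch of the all_gather-and-compare idea being discussed (plain torch.distributed, assuming an initialized process group; not what this PR implements):

import torch
import torch.distributed as dist

def values_differ_across_ranks(current: torch.Tensor) -> bool:
    # Gather the locally monitored value from every process and report whether
    # they disagree; only then would per-rank messages carry extra information.
    gathered = [torch.zeros_like(current) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, current)
    return any(not torch.equal(t, gathered[0]) for t in gathered)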

@kaushikb11 (Contributor) replied

Might be overkill. I don't have a strong opinion on either log_info or rank_zero_info.
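
For context, a sketch of the two options weighed in this thread; rank_zero_info is an existing Lightning utility, while the per-rank variant mirrors the prefixed output shown in the PR description (function names here are illustrative, not the exact diff):

import logging

from pytorch_lightning.utilities import rank_zero_info

log = logging.getLogger(__name__)

def log_on_each_rank(msg: str, rank: int) -> None:
    # Option A: emit on every process, prefixed with its rank, e.g. "[0] Metric ..."
    log.info(f"[{rank}] {msg}")

def log_on_rank_zero(msg: str) -> None:
    # Option B: emit once, only on the main process
    rank_zero_info(msg)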

jlperla commented Apr 5, 2021

Insofar as this was triggered by my suggestion, let me give you the use case I had in mind.
When I am running a whole bunch of experiments in a single run, I would love to be able to see, in the grid experiments view, the reason stopping occurred. With the early stopping changes discussed, it could be: max_epochs hit, max_time hit, metric divergence, metric hit patience level, metric hit threshold. You would almost always want to see that sort of thing, not just with a verbose flag, and since you would want it in a tight web view it would be nice to have it fairly succinct.

Basically I think this is almost a "return code" sort of situation? Maybe that is even a better way to do it for those GUIs. @williamFalcon basically in the GUI I would want to be able to sort on which of my hyperparameter options diverged, hit max_time, etc., and @luiscape this is also useful for the sorts of return codes from an experiment that we discussed for hyperparameter optimization. In the controller you might want to do different things based on them.

@tchaton (Contributor) commented Apr 6, 2021

(quoting @jlperla's comment above)

Hey @jlperla, sounds like a great idea. EarlyStopping should provide a summary of why it stops the run, which can then be parsed. Maybe in a follow-up PR @awaelchli?

@awaelchli (Contributor, Author) commented

I cannot comment on the Grid integration part at this point. In this PR I propose to add a report for why EarlyStopping is triggered. Currently, the only reason for stopping is that the monitored metric did not improve.

jlperla commented Apr 6, 2021

(quoting @awaelchli's and @tchaton's comments above)

Hey, sounds good. Take a look at this: #6795 (comment). I just really think that the whole early stopping story should be thought of holistically.

To me, stopping at a threshold, hitting max time, hitting the maximum number of epochs, and having the optimizer stall out before it can solve the model (i.e. the patience thing) are all related, and they should be part of the trainer/optimizer options themselves rather than an add-on, since they are so essential.

Honestly, if I could only have one, I would much prefer a "return code" to any logging. That way I can do a whole bunch of runs, put those return codes in a CSV table, and start analyzing the proportion that succeeded (e.g. below the tolerance), etc.; see the sketch after this comment.

The other thing is that this feature is useless if it only happens when verbose=True. I just want to know why it stopped, not get a whole bunch of other stuff in there.
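
To make the suggestion concrete, a purely hypothetical sketch (none of these names exist in Lightning) of a machine-readable stop reason written out to a CSV for later analysis:

import csv
from enum import Enum

class StopReason(Enum):
    # Hypothetical reasons mirroring the cases listed in the comments above
    MAX_EPOCHS = "max_epochs"
    MAX_TIME = "max_time"
    DIVERGENCE = "divergence"
    PATIENCE_EXHAUSTED = "patience_exhausted"
    THRESHOLD_REACHED = "threshold_reached"

def record_run(run_id: str, reason: StopReason, path: str = "runs.csv") -> None:
    # Append one row per run so the proportion of successful runs can be analyzed later
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([run_id, reason.value])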

jlperla commented Apr 6, 2021

Also, why would you only log the stopping reason for the early stopping callback rather than the max_epochs or max_time ones?

I really think this smells like the whole user-added early stopping callback is the wrong design. There is no reason in my mind why max_epochs and max_time belong in the trainer while the other stuff belongs in a separate callback. I think the intent was that it would be easy for users to create their own early stopping logic, but (1) it didn't work out that way because of the difficulty of stopping distributed processes; (2) as soon as you start having return codes etc. for a controller, it shows these things are all linked; (3) I think these features are a lot more consistent than it first appeared (basically, add in the features of an optimizer); and (4) if it is in a separate callback then users need to manually add in all of the CLI options and cannot use pl.Trainer.from_argparse_args(args) to load it. This last one is an issue for anyone paying for cloud services or GPUs (e.g. Grid), since you want it to be as easy as possible for people to stop things, or they will blow through their $$$ and be mad that they have little to show for it.

So my proposal: stop on this and the other issues that spawned out of #6795, and see what would be involved in adding those things as core trainer options. I can't imagine a scenario where you wouldn't want early stopping, checking for divergence, etc., especially if you are paying by the minute.

Users could still write callbacks if they wanted, but I am not sure it is that useful to make it first class. The way most optimizers work is that if a user wants to stop an optimization run they are also able to just return -Inf or Inf or whatever. That is good enough for most scenarios I can think of.

@tchaton tchaton left a comment

LGTM !

@awaelchli awaelchli changed the title Early Stopping verbose logging [blocked by #6868] Early Stopping verbose logging Apr 8, 2021
@mergify mergify bot added the has conflicts label Apr 9, 2021
@awaelchli awaelchli changed the title [blocked by #6868] Early Stopping verbose logging Early Stopping verbose logging Apr 26, 2021
@mergify mergify bot removed the has conflicts label Apr 26, 2021
@awaelchli awaelchli marked this pull request as draft April 26, 2021 10:01
@awaelchli awaelchli changed the title Early Stopping verbose logging improve early stopping verbose logging Apr 27, 2021
@awaelchli awaelchli marked this pull request as ready for review April 27, 2021 17:12
@awaelchli awaelchli added the ready PRs ready to be merged label Apr 30, 2021
@mergify mergify bot removed the has conflicts label Apr 30, 2021
@kaushikb11 kaushikb11 left a comment

LGTM!

@awaelchli awaelchli enabled auto-merge (squash) May 3, 2021 19:55
@awaelchli awaelchli merged commit bf1394a into master May 3, 2021
@awaelchli awaelchli deleted the feature/early-stopping-verbosity branch May 3, 2021 20:20
kaushikb11 pushed a commit to kaushikb11/pytorch-lightning that referenced this pull request May 4, 2021
Labels
callback · feature (Is an improvement or enhancement) · ready (PRs ready to be merged)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Training is cancelled when check_val_every_n_epoch > 1 and early stopping is used.
5 participants