
nan metric breaking ModelCheckpoint #2636

Closed
ehsanmok opened this issue Jul 17, 2020 · 4 comments · Fixed by #3863
Labels
bug Something isn't working · checkpointing Related to checkpointing · priority: 0 High priority task

Comments

@ehsanmok

ehsanmok commented Jul 17, 2020

🐛 Bug

Comparing any number to float('nan') evaluates to False in Python, so if a non-loss metric score is nan early in training, the callback can never checkpoint any score after that.
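
To illustrate (plain Python, not Lightning code): every ordering comparison against NaN is False, so once the tracked "best" value is NaN, no later score is ever considered an improvement.

```python
import math

best = float("nan")      # the monitored metric was nan on the first evaluation
candidate = 0.87         # a perfectly good score from a later epoch

print(candidate > best)  # False
print(candidate < best)  # False -- NaN compares False both ways
print(math.isnan(best))  # True  -- an explicit isnan check is needed to recover
```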

Expected behavior

Ignore a nan metric score. This is orthogonal to the case where gradients or weights become nan.

Environment

  • PyTorch Version (e.g., 1.0): 1.5.0
  • OS (e.g., Linux): Linux
  • How you installed PyTorch (conda, pip, source): pip
  • Build command you used (if compiling from source):
  • Python version: 3.6
  • CUDA/cuDNN version: 10.0
  • GPU models and configuration: Tesla V100

Additional context

The previous issue (#1008) wasn't addressed completely.

@ehsanmok ehsanmok added bug Something isn't working help wanted Open to be worked on labels Jul 17, 2020
@github-actions
Contributor

Hi! Thanks for your contribution, great first issue!

@Borda
Member

Borda commented Sep 15, 2020

@ehsanmok mind sending a PR? 🐰

@Borda Borda added the good first issue Good for newcomers label Sep 15, 2020
@edenlightning edenlightning added checkpointing Related to checkpointing Metrics labels Sep 22, 2020
@edenlightning edenlightning changed the title from "ModelCheckpoint is hopeless against a nan metric" to "nan metric breaking ModelCheckpoint" Sep 22, 2020
@edenlightning edenlightning added this to the 0.9.x milestone Sep 23, 2020
@teddykoker teddykoker self-assigned this Sep 23, 2020
@ddrevicky
Contributor

Assign this to me please.

@ddrevicky
Contributor

I can reproduce this issue; a rough sketch of a reproduction is included below.
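
The original reproduction snippet isn't preserved in this transcript; a minimal reproduction looks roughly like the sketch below (written against a newer self.log-style Lightning API than the 0.8-era original, so treat names and exact signatures as illustrative, not as the original code):

```python
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset


class NanMetricModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def validation_step(self, batch, batch_idx):
        # The metric is nan on the first epoch and fine afterwards; once the
        # checkpoint callback has recorded nan as its best value, later
        # (valid) scores never compare as an improvement.
        value = float("nan") if self.current_epoch == 0 else 1.0 / (self.current_epoch + 1)
        self.log("val_metric", value)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)


def run():
    dataset = TensorDataset(torch.randn(64, 32), torch.randn(64, 1))
    loader = DataLoader(dataset, batch_size=16)
    checkpoint = pl.callbacks.ModelCheckpoint(monitor="val_metric", mode="min")
    trainer = pl.Trainer(max_epochs=3, callbacks=[checkpoint])
    trainer.fit(NanMetricModel(), loader, loader)


if __name__ == "__main__":
    run()
```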

I am not exactly clear on what the expected behavior should be, though. In @awaelchli's PR for nan detection and intervention, training is stopped when the loss or weights contain nan or infinite values.

What do we want to do:

  1. If a metric that was passed as the monitor param to ModelCheckpoint goes nan/inf:

    • raise an error (like when loss/weights do the same)
    • raise a warning but continue training and do not save any more checkpoints
    • raise a warning but continue training and skip saving checkpoints only while monitor is nan/inf; what should happen if it returns to non-nan/inf values (as mentioned in #1008)? A rough sketch of such a guard is shown after this list.
  2. Separate from this issue, perhaps: if any metric goes nan/inf (regardless of whether ModelCheckpoint is used):

    • raise a warning or an error? (If the former, then this should perhaps be addressed in a different issue.)
  3. When do we want to detect nan/inf metrics (whether the monitor or any metric)?

    • ASAP, which would be in the first validation step where it happens (perhaps even in Trainer.run_sanity_check)
    • in on_validation_end (after all batches are processed), when the model checkpoint is being saved
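
For illustration, one possible shape of the "warn and skip" option from item 1, as a standalone helper (hypothetical name and placement; not the fix that eventually landed in #3863):

```python
import warnings
import torch


def monitor_is_usable(monitor_name: str, value) -> bool:
    """Return True if the monitored value can safely be compared and saved."""
    if torch.isfinite(torch.as_tensor(value, dtype=torch.float)).all():
        return True
    warnings.warn(
        f"Monitored metric {monitor_name!r} is nan/inf; skipping this "
        "checkpoint but continuing training.",
        RuntimeWarning,
    )
    return False
```

Calling a check like this from on_validation_end, right before the best-score comparison, would correspond to the second bullet of item 3; calling it per validation step would correspond to the first.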
