
nan metric breaking ModelCheckpoint #2636

Closed
ehsanmok opened this issue Jul 17, 2020 · 4 comments · Fixed by #3863
Labels
bug Something isn't working · checkpointing Related to checkpointing · priority: 0 High priority task

Comments

@ehsanmok

ehsanmok commented Jul 17, 2020

🐛 Bug

Comparing any number to float('nan') evaluates to False in Python, so if a non-loss metric score is nan early in training, the callback can never checkpoint any score after that.
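
To illustrate (plain Python, not Lightning code): every ordering comparison against NaN is False, so once the tracked "best" value is NaN, no later score is ever considered an improvement.

```python
import math

best = float("nan")      # the monitored metric was nan on the first evaluation
candidate = 0.87         # a perfectly good score from a later epoch

print(candidate > best)  # False
print(candidate < best)  # False -- NaN compares False both ways
print(math.isnan(best))  # True  -- an explicit isnan check is needed to recover
```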

Expected behavior

Ignore a nan metric score. This is orthogonal to the case where gradients or weights become nan.

Environment

  • PyTorch Version (e.g., 1.0): 1.5.0
  • OS (e.g., Linux): Linux
  • How you installed PyTorch (conda, pip, source): pip
  • Build command you used (if compiling from source):
  • Python version: 3.6
  • CUDA/cuDNN version: 10.0
  • GPU models and configuration: Tesla V100

Additional context

The previous issue (#1008) wasn't addressed completely.

@ehsanmok ehsanmok added bug Something isn't working help wanted Open to be worked on labels Jul 17, 2020
@github-actions
Contributor

Hi! Thanks for your contribution, great first issue!

@Borda
Member

Borda commented Sep 15, 2020

@ehsanmok mind sending a PR? 🐰

@Borda Borda added the good first issue Good for newcomers label Sep 15, 2020
@edenlightning edenlightning added checkpointing Related to checkpointing Metrics labels Sep 22, 2020
@edenlightning edenlightning changed the title from "ModelCheckpoint is hopeless against a nan metric" to "nan metric breaking ModelCheckpoint" Sep 22, 2020
@edenlightning edenlightning added this to the 0.9.x milestone Sep 23, 2020
@teddykoker teddykoker self-assigned this Sep 23, 2020
@ddrevicky
Contributor

Assign this to me please.

@ddrevicky
Contributor

I can reproduce this issue; a rough sketch of a reproduction is included below.
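
The original reproduction snippet isn't preserved in this transcript; a minimal reproduction looks roughly like the sketch below (written against a newer self.log-style Lightning API than the 0.8-era original, so treat names and exact signatures as illustrative, not as the original code):

```python
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset


class NanMetricModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def validation_step(self, batch, batch_idx):
        # The metric is nan on the first epoch and fine afterwards; once the
        # checkpoint callback has recorded nan as its best value, later
        # (valid) scores never compare as an improvement.
        value = float("nan") if self.current_epoch == 0 else 1.0 / (self.current_epoch + 1)
        self.log("val_metric", value)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)


def run():
    dataset = TensorDataset(torch.randn(64, 32), torch.randn(64, 1))
    loader = DataLoader(dataset, batch_size=16)
    checkpoint = pl.callbacks.ModelCheckpoint(monitor="val_metric", mode="min")
    trainer = pl.Trainer(max_epochs=3, callbacks=[checkpoint])
    trainer.fit(NanMetricModel(), loader, loader)


if __name__ == "__main__":
    run()
```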

I am not exactly clear on what the expected behavior should be, though. In @awaelchli's PR for nan detection and intervention, training is stopped when the loss or weights contain nan or infinite values.

What do we want to do:

  1. If a metric that was passed as the monitor param to ModelCheckpoint goes nan/inf:

    • raise an error (like when loss/weights do the same)
    • raise a warning but continue training and do not save any more checkpoints
    • raise a warning but continue training and skip saving checkpoints only while monitor is nan/inf; what should happen if it returns to non-nan/inf values (as mentioned in #1008)? A rough sketch of such a guard is shown after this list.
  2. Separate from this issue, perhaps: if any metric goes nan/inf (regardless of whether ModelCheckpoint is used):

    • raise a warning or an error? (If the former, then this should perhaps be addressed in a different issue.)
  3. When do we want to detect nan/inf metrics (whether the monitor or any metric)?

    • ASAP, which would be in the first validation step where it happens (perhaps even in Trainer.run_sanity_check)
    • in on_validation_end (after all batches are processed), when the model checkpoint is being saved
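
For illustration, one possible shape of the "warn and skip" option from item 1, as a standalone helper (hypothetical name and placement; not the fix that eventually landed in #3863):

```python
import warnings
import torch


def monitor_is_usable(monitor_name: str, value) -> bool:
    """Return True if the monitored value can safely be compared and saved."""
    if torch.isfinite(torch.as_tensor(value, dtype=torch.float)).all():
        return True
    warnings.warn(
        f"Monitored metric {monitor_name!r} is nan/inf; skipping this "
        "checkpoint but continuing training.",
        RuntimeWarning,
    )
    return False
```

Calling a check like this from on_validation_end, right before the best-score comparison, would correspond to the second bullet of item 3; calling it per validation step would correspond to the first.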
