🐛 Bug

Inside the training loop, we incorrectly skip running evaluation when `reload_dataloaders_every_epoch=True` and `num_sanity_val_steps=0`. With these settings, we defer setting the validation dataloader on the trainer until the evaluation loop is run from inside the training loop. However, this is too late, because the training loop depends on the validation dataloader settings to determine whether to run the evaluation loop at all.

This means the following state is possible inside the training loop when it decides whether to run the evaluation loop: `should_skip_eval=True` when `self.trainer.num_val_batches` isn't set. In this instance `trainer.num_val_batches=[]`.

https://github.com/PyTorchLightning/pytorch-lightning/blob/44d775fccfb825561937f6fa03fe258af25c2b83/pytorch_lightning/trainer/training_loop.py#L551

This shows that `should_check_val` and `should_train_only` were not consistent with each other :(

#6075 changed the order in which we call `run_evaluation` inside the training loop. Before that change, the bug was hidden by luck because of the ordering. Since the swap, evaluation has been broken.

Please reproduce using the BoringModel

https://colab.research.google.com/drive/1z9ln3gYBK-VGidNPdUE2UgE0ISAgjLpu?usp=sharing
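
The failure mode described above can be illustrated with a small, self-contained sketch. This is not the Lightning implementation: `ToyTrainer`, its methods, and the batch count of 4 are hypothetical stand-ins, and the sketch assumes the skip decision reduces to "the total number of validation batches is zero". It only shows why an empty `num_val_batches` list leads to evaluation never running when the dataloader is set lazily inside the evaluation loop.

```python
from typing import List


def should_skip_evaluation(num_val_batches: List[int]) -> bool:
    # Assumption: evaluation is skipped when there are no validation batches in total.
    # An empty list sums to 0, so "not set yet" looks the same as "nothing to validate".
    return sum(num_val_batches) == 0


class ToyTrainer:
    """Hypothetical stand-in for the trainer; not the Lightning class."""

    def __init__(self, reload_dataloaders_every_epoch: bool, num_sanity_val_steps: int):
        self.num_val_batches: List[int] = []
        self.evaluation_runs = 0
        # Simplification: the val dataloader is only set up front when it is not
        # deferred, or when a sanity check would have run the evaluation loop once.
        if not reload_dataloaders_every_epoch or num_sanity_val_steps > 0:
            self.reset_val_dataloader()

    def reset_val_dataloader(self) -> None:
        # Pretend there is a single validation dataloader with 4 batches.
        self.num_val_batches = [4]

    def run_evaluation(self) -> None:
        # In the deferred case, this is the first place the dataloader gets set.
        self.reset_val_dataloader()
        self.evaluation_runs += 1

    def run_training_epoch(self) -> None:
        # The decision is made before run_evaluation has had a chance to set
        # num_val_batches, so the deferred case never reaches run_evaluation.
        if should_skip_evaluation(self.num_val_batches):
            return
        self.run_evaluation()


buggy = ToyTrainer(reload_dataloaders_every_epoch=True, num_sanity_val_steps=0)
buggy.run_training_epoch()

working = ToyTrainer(reload_dataloaders_every_epoch=False, num_sanity_val_steps=0)
working.run_training_epoch()
```

In this sketch `buggy.evaluation_runs` stays at 0 while `working.evaluation_runs` becomes 1, which mirrors the circular dependency reported here: the check needs a value that only the skipped step would have set.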
To Reproduce
Use the following BoringModel and post it here
Expected behavior
Checkpointing should still work as expected because we run the evaluation loop when expected
Environment
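Note: Bugs with code are solved faster! Colab Notebook should be made public!

IDE: Please, use our python bug_report_model.py template.

Colab Notebook: Please copy and paste the output from our environment collection script (or fill out the checklist below manually).

You can get the script and run it with:

wget https://raw.githubusercontent.com/PyTorchLightning/pytorch-lightning/master/tests/collect_env_details.py
# For security purposes, please check the contents of collect_env_details.py before running it.
python collect_env_details.py

PyTorch Version (e.g., 1.0):
OS (e.g., Linux):
How you installed PyTorch (conda, pip, source):
Build command you used (if compiling from source):
Python version:
CUDA/cuDNN version:
GPU models and configuration:
Any other relevant information:

Additional context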