Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

trainer.progress_bar_dict values are delayed by one epoch #4863

Closed
carmocca opened this issue Nov 26, 2020 · 3 comments · Fixed by #4913
Closed

trainer.progress_bar_dict values are delayed by one epoch #4863

carmocca opened this issue Nov 26, 2020 · 3 comments · Fixed by #4913
Assignees
Labels
bug Something isn't working help wanted Open to be worked on priority: 0 High priority task
Milestone

Comments

@carmocca
Copy link
Contributor

carmocca commented Nov 26, 2020

🐛 Bug

trainer.progress_bar_dict does not contain what was self.log-ed in that epoch's training step, but the previous value.
This also means that there is no value for epoch 0.

To Reproduce (and to test)

def test_progress_bar_dict_contains_values_on_train_epoch_end(tmpdir):
    class TestModel(BoringModel):
        def training_step(self, *args):
            self.log("foo", torch.tensor(self.current_epoch), on_step=False, on_epoch=True, prog_bar=True)
            return super().training_step(*args)

        def on_train_epoch_end(self, *_):
            self.epoch_end_called = True
            assert self.trainer.progress_bar_dict["foo"] == self.current_epoch

    trainer = Trainer(
        default_root_dir=tmpdir,
        max_epochs=2,
        limit_train_batches=1,
        limit_val_batches=0,
        checkpoint_callback=False,
        logger=False,
        weights_summary=None,
        progress_bar_refresh_rate=0,
    )
    model = TestModel()
    trainer.fit(model)
    assert model.epoch_end_called

Expected behavior

trainer.progress_bar_dict contains the values logged in training_step

Environment

The test fails on master

The commit that introduced this bug is 9c8701f. The previous one works as expected.

I haven't dived into the changes much since it is a large commit. Maybe the author can help @tchaton

@carmocca carmocca added bug Something isn't working help wanted Open to be worked on labels Nov 26, 2020
@carmocca
Copy link
Contributor Author

carmocca commented Nov 27, 2020

Okay, the cause of the issue is that trainer.call_hook('on_train_epoch_end', epoch_output) is run before trainer.logger_connector.on_train_epoch_end(). And the progress_bar_dict is updated in the second one.

However, changing the order here
https://github.com/PyTorchLightning/pytorch-lightning/blob/217650320e376f4dadd1c7b8c034ec55dee60a23/pytorch_lightning/trainer/training_loop.py#L816-L820
fixes the test above but fails https://github.com/PyTorchLightning/pytorch-lightning/blob/master/tests/trainer/logging/test_logger_connector.py#L52-L215 with a KeyError

I will keep debugging...

@djovanoski
Copy link

I think the problem is in the reset in call_hook of the model._results as we call after that also self._cache_logged_metrics() which is on empty Result.

Also in the def training_step(...) in training_loop.py we have call_hook which reset the Result and after that collecting the results from model._results which is empty

@carmocca carmocca changed the title Missing values in trainer.progress_bar_dict after epoch 0 trainer.progress_bar_dict values are delayed by one epoch Nov 28, 2020
@tchaton tchaton self-assigned this Nov 28, 2020
@tchaton tchaton added the priority: 0 High priority task label Nov 28, 2020
@tchaton tchaton added this to the 1.1 milestone Nov 28, 2020
@tchaton
Copy link
Contributor

tchaton commented Nov 28, 2020

Hey @carmocca,

Thanks for reporting this bug. Yes, I needs to be permuted. I will resolve this next week.

Best regards,
T.C

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working help wanted Open to be worked on priority: 0 High priority task
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants