Training fails at the end of the epoch when returning None in the training step #7544

Closed
TommasoBendinelli opened this issue May 14, 2021 · 7 comments
Labels: bug, help wanted

@TommasoBendinelli commented May 14, 2021

🐛 Bug

Sometimes my training loss for a batch is NaN. Hence, I return None as the loss so that the model will not backpropagate through it, as suggested here: #4956. This works fine during the epoch; however, the code fails at the end of the epoch in the function reduce_across_time (line 532):

if isinstance(value, list):
    value = torch.tensor(value)

When the loss is None, value is equal to [None], and torch cannot create a tensor from it (RuntimeError: Could not infer dtype of NoneType).

Am I doing something wrong, or is this a bug in Lightning? Is there a workaround?
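For illustration, the pattern described above looks roughly like this (a minimal sketch; compute_loss is a hypothetical helper, and the torch.isnan check is just one way the bad batch might be detected):

def training_step(self, batch, batch_idx):
    loss = self.compute_loss(batch)  # hypothetical loss computation
    if torch.isnan(loss):
        loss = None  # skip backprop for this batch, as suggested in #4956
    self.log("train_loss", loss)  # logging None here is what later breaks reduce_across_time
    return loss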

Environment
pytorch-lightning 1.3.1
torch 1.8.1+cu11
python 3.7.9

@TommasoBendinelli added the bug and help wanted labels on May 14, 2021
@awaelchli (Contributor)

Thanks for reporting this. Can you reproduce it with our bug report model, please? That would help me a lot, thanks!

@rohitgr7 reopened this May 14, 2021
@TommasoBendinelli (Author)

Sure, this reproduces the bug:

import os
import random

import torch
from torch.utils.data import DataLoader, Dataset

from pytorch_lightning import LightningModule, Trainer


class RandomDataset(Dataset):

    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        if batch_idx == 2:
            # simulate a batch whose loss should be skipped: return None
            loss = None
        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("valid_loss", loss)

    def test_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("test_loss", loss)

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


def run():
    train_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    val_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    test_data = DataLoader(RandomDataset(32, 64), batch_size=2)

    model = BoringModel()
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        limit_train_batches=5,
        limit_val_batches=1,
        num_sanity_val_steps=0,
        max_epochs=10,
        weights_summary=None,
    )
    trainer.fit(model, train_dataloader=train_data, val_dataloaders=val_data)
    trainer.test(model, test_dataloaders=test_data)


if __name__ == '__main__':
    run()

@awaelchli self-assigned this May 14, 2021
@rohitgr7 (Contributor)

I think it's because of this:

if batch_idx == 2:
    loss = None
self.log("train_loss", loss)

None values are being logged and stored here, and they are then accumulated at epoch end, which throws this error. This should work:

if batch_idx == 2:
    loss = None
else:
    self.log("train_loss", loss)

Or should Lightning handle this internally?
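Applied to the reproduction script above, the adjusted training_step would look roughly like this (a sketch of the suggested workaround, not an official fix):

def training_step(self, batch, batch_idx):
    loss = self(batch).sum()
    if batch_idx == 2:
        loss = None  # skip this batch entirely
    else:
        self.log("train_loss", loss)  # only log actual tensor values
    return loss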

@TommasoBendinelli (Author)

Ah, I see, that makes sense. When averaging the loss across multiple batches, how does Lightning handle the fact that a batch was skipped because the loss was None? Does it simply not include it in the average?

@TommasoBendinelli (Author)

Perfect, thank you.

@awaelchli (Contributor) commented May 14, 2021

Sorry, I had to delete my answer and double-check, but yes, it averages only over the metrics that were logged, not over all training_steps.

@rohitgr7 (Contributor)

To be specific, it does a weighted average by default, using the batch size. In your case it never gets that far, because the error is thrown while converting the list of logged values to a PyTorch tensor, and since the list contains None values the conversion fails. Ideally, a skipped batch shouldn't contribute to the aggregated results, so adding an else statement there will work fine.
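Putting the thread together, the NaN use case from the original report could be handled roughly like this (a minimal sketch; compute_loss and the torch.isnan guard are assumptions about how the bad batches are produced and detected):

def training_step(self, batch, batch_idx):
    loss = self.compute_loss(batch)  # hypothetical loss computation
    if torch.isnan(loss):
        return None  # skipped batches contribute nothing to logging or aggregation
    self.log("train_loss", loss)
    return loss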
