
rename Model steps #1051

Merged · 37 commits · Mar 5, 2020
Changes from 20 commits

Commits (37)
4972608
training_end renamed to training_step_end
williamFalcon Mar 5, 2020
738692c
training_end renamed to training_step_end
williamFalcon Mar 5, 2020
84560fd
training_end renamed to training_step_end
williamFalcon Mar 5, 2020
a3dbe56
training_end renamed to training_step_end
williamFalcon Mar 5, 2020
165ca51
training_end to training_step_end
williamFalcon Mar 5, 2020
890364e
training_end to training_step_end
williamFalcon Mar 5, 2020
f76fdb8
training_end to training_step_end
williamFalcon Mar 5, 2020
39b66c3
training_end to training_step_end
williamFalcon Mar 5, 2020
e7e1ce9
fix lost model reference
williamFalcon Mar 5, 2020
9db4d1f
training_end to training_step_end
williamFalcon Mar 5, 2020
e13beac
training_end to training_step_end
williamFalcon Mar 5, 2020
684fc47
training_end to training_step_end
williamFalcon Mar 5, 2020
c508dfb
training_end to training_step_end
williamFalcon Mar 5, 2020
8b8679c
training_end to training_step_end
williamFalcon Mar 5, 2020
2afc704
training_end to training_step_end
williamFalcon Mar 5, 2020
697cb5d
training_end to training_step_end
williamFalcon Mar 5, 2020
4938b81
training_end to training_step_end
williamFalcon Mar 5, 2020
78e2435
training_end to training_step_end
williamFalcon Mar 5, 2020
8d980a9
training_end to training_step_end
williamFalcon Mar 5, 2020
bfa3fdd
training_end to training_step_end
williamFalcon Mar 5, 2020
77baa64
training_end to training_step_end
williamFalcon Mar 5, 2020
3f7d5e0
training_end to training_step_end
williamFalcon Mar 5, 2020
b13b348
training_end to training_step_end
williamFalcon Mar 5, 2020
e3ac274
training_end to training_step_end
williamFalcon Mar 5, 2020
0a27d95
training_end to training_step_end
williamFalcon Mar 5, 2020
a106c47
training_end to training_step_end
williamFalcon Mar 5, 2020
bc4db9f
training_end to training_step_end
williamFalcon Mar 5, 2020
8199a73
training_end to training_step_end
williamFalcon Mar 5, 2020
aa97340
training_end to training_step_end
williamFalcon Mar 5, 2020
8568674
training_end to training_step_end
williamFalcon Mar 5, 2020
b964b70
training_end to training_step_end
williamFalcon Mar 5, 2020
5a9e405
training_end to training_step_end
williamFalcon Mar 5, 2020
2184e52
training_end to training_step_end
williamFalcon Mar 5, 2020
c56b09c
training_end to training_step_end
williamFalcon Mar 5, 2020
f687043
training_end to training_step_end
williamFalcon Mar 5, 2020
a401841
training_end to training_step_end
williamFalcon Mar 5, 2020
7268839
training_end to training_step_end
williamFalcon Mar 5, 2020
6 changes: 3 additions & 3 deletions docs/source/experiment_reporting.rst
@@ -34,11 +34,11 @@ Log metrics

To plot metrics into whatever logger you passed in (tensorboard, comet, neptune, etc...)

1. Training_end, validation_end, test_end will all log anything in the "log" key of the return dict.
1. training_step_end, validation_end, test_end will all log anything in the "log" key of the return dict.

Review comment (Member):

Perhaps these should say validation_epoch_end or validation_step_end?

.. code-block:: python

def training_end(self, outputs):
def training_step_end(self, outputs):
loss = some_loss()
...

@@ -62,7 +62,7 @@ To plot metrics into whatever logger you passed in (tensorboard, comet, neptune,
results = {'log': logs}
return results

2. Most of the time, you only need training_step and not training_end. You can also return logs from here:
2. Most of the time, you only need training_step and not training_step_end. You can also return logs from here:

.. code-block:: python

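The bodies of these code blocks are partially collapsed in the diff view above. As a rough, self-contained sketch of the logging pattern the two items describe (not the PR's actual code; the metric name ``train_loss`` is illustrative), returning metrics under the ``log`` key from ``training_step`` could look like this:

.. code-block:: python

    # inside a LightningModule (F is torch.nn.functional)
    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.forward(x)
        loss = F.cross_entropy(y_hat, y)

        # anything placed under the 'log' key is sent to the attached logger
        return {'loss': loss, 'log': {'train_loss': loss}}
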
2 changes: 1 addition & 1 deletion docs/source/hooks.rst
@@ -26,7 +26,7 @@ Training loop
- on_batch_start
- tbptt_split_batch
- training_step
- training_end (optional)
- training_step_end (optional)
- backward
- on_after_backward
- optimizer.step()
45 changes: 40 additions & 5 deletions docs/source/multi_gpu.rst
@@ -165,12 +165,13 @@ you will only be operating on one of those pieces.
y_0 = batch

For most metrics, this doesn't really matter. However, if you want
full batch statistics or want to use the outputs of the training_step
to do something like a softmax, you can use the `training_end` step.
to add something to your computational graph (like softmax)
using all batch parts, you can use the `training_step_end` step.

.. code-block:: python

def training_end(self, outputs):
def training_step_end(self, outputs):
# only use when on dp
outputs = torch.cat(outputs, dim=1)
softmax = torch.softmax(outputs, dim=1)
out = softmax.mean()
@@ -195,9 +196,43 @@ In pseudocode, the full sequence is:
out = gpu_model(batch_split)
all_results.append(out)

# calculate statistics for all parts of the batch
full out = model.training_end(all_results)
# use the full batch for something like softmax
full_out = model.training_step_end(all_results)

To illustrate why this is needed, let's look at DataParallel:

.. code-block:: python

def training_step(self, batch, batch_idx):
x, y = batch
y_hat = self.forward(batch)

# on dp or ddp2 if we did softmax now it would be wrong
# because batch is actually a piece of the full batch
return y_hat

def training_step_end(self, batch_parts_outputs):
# batch_parts_outputs has outputs of each part of the batch

# do softmax here
outputs = torch.cat(batch_parts_outputs, dim=1)
softmax = torch.softmax(outputs, dim=1)
out = softmax.mean()

return out

If `training_step_end` is defined, it will be called regardless of TPU, dp, ddp, etc., which means
it will behave the same no matter the backend.

Validation and test steps also have the same option when using dp:

.. code-block:: python

def validation_step_end(self, batch_parts_outputs):
...

def test_step_end(self, batch_parts_outputs):
...
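
As a minimal sketch of how these might be filled in (a hedged illustration, assuming the per-part outputs arrive gathered into a single dict of stacked values; the ``val_loss`` key is illustrative, not from the PR):

.. code-block:: python

    # inside a LightningModule (F is torch.nn.functional)
    def validation_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.forward(x)
        # on dp/ddp2 this loss covers only one piece of the batch
        return {'val_loss': F.cross_entropy(y_hat, y)}

    def validation_step_end(self, batch_parts_outputs):
        # 'val_loss' now holds one loss per batch piece; average them
        return {'val_loss': batch_parts_outputs['val_loss'].mean()}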

Implement Your Own Distributed (DDP) training
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
193 changes: 169 additions & 24 deletions pytorch_lightning/core/__init__.py
@@ -1,16 +1,73 @@
"""
A LightningModule is a strict superclass of torch.nn.Module but provides an interface to standardize
the "ingredients" for a research or production system.
A LightningModule organizes your PyTorch code into the following sections:

- The model/system definition (__init__)
- The model/system computations (forward)
- What happens in the training loop (training_step, training_end)
- What happens in the validation loop (validation_step, validation_end)
- What happens in the test loop (test_step, test_end)
- What optimizers to use (configure_optimizers)
- What data to use (train_dataloader, val_dataloader, test_dataloader)
.. figure:: /_images/lightning_module/pt_to_pl.png
:alt: Convert from PyTorch to Lightning

Most methods are optional. Here's a minimal example.

Notice a few things.

1. It's the SAME code.

2. The PyTorch code IS NOT abstracted - just organized.

3. All the other code that didn't go in the LightningModule has been automated
for you by the trainer.

.. code-block:: python

net = Net()
trainer = Trainer()
trainer.fit(net)

4. There are no .cuda() or .to() calls... Lightning does these for you.

.. code-block:: python

# don't do in lightning
x = torch.Tensor(2, 3)
x = x.cuda()
x = x.to(device)

# do this instead
x = x # leave it alone!

# or to init a new tensor
new_x = torch.Tensor(2, 3)
new_x = new_x.type_as(x)


5. There are no samplers for distributed training; Lightning also does this for you.

.. code-block:: python

# Don't do in Lightning...
data = MNIST(...)
sampler = DistributedSampler(data)
DataLoader(data, sampler=sampler)

# do this instead
data = MNIST(...)
DataLoader(data)


6. A LightningModule is a torch.nn.Module but with added functionality. Use it as such!

.. code-block:: python

net = Net.load_from_checkpoint(PATH)
net.freeze()
out = net(x)

Thus, to use Lightning, you just need to organize your code, which takes about 30 minutes
(and let's be real, you probably should do that anyway).

------------

Minimal Example
---------------

Here are the only required methods.

.. code-block:: python

@@ -37,13 +94,13 @@ def training_step(self, batch, batch_idx):
y_hat = self.forward(x)
return {'loss': F.cross_entropy(y_hat, y)}

def configure_optimizers(self):
return torch.optim.Adam(self.parameters(), lr=0.02)

def train_dataloader(self):
return DataLoader(MNIST(os.getcwd(), train=True, download=True,
transform=transforms.ToTensor()), batch_size=32)

def configure_optimizers(self):
return torch.optim.Adam(self.parameters(), lr=0.02)

Which you can train by doing:

.. code-block:: python
@@ -53,7 +110,35 @@ def train_dataloader(self):

trainer.fit(model)

If you wanted to add a validation loop
----------

Training loop structure
-----------------------

The general pattern is that each loop (training, validation, test loop)
has 2 methods:

- ``___step``
- ``___epoch_end``

To show how Lightning calls these, let's use the validation loop as an example:

.. code-block:: python

val_outs = []
for val_batch in val_data:
# do something with each batch
out = validation_step(val_batch)
val_outs.append(out)

# do something with the outputs for all batches
# like calculate validation set accuracy or loss
validation_epoch_end(val_outs)

Add validation loop
^^^^^^^^^^^^^^^^^^^

Thus, if you wanted to add a validation loop, you would add this to your LightningModule:

.. code-block:: python

@@ -63,36 +148,96 @@ def validation_step(self, batch, batch_idx):
y_hat = self.forward(x)
return {'val_loss': F.cross_entropy(y_hat, y)}

def validation_end(self, outputs):
def validation_epoch_end(self, outputs):
val_loss_mean = torch.stack([x['val_loss'] for x in outputs]).mean()
return {'val_loss': val_loss_mean}

def val_dataloader(self):
# can also return a list of val dataloaders
return DataLoader(MNIST(os.getcwd(), train=True, download=True,
transform=transforms.ToTensor()), batch_size=32)
return DataLoader(...)

Or add a test loop
Add test loop
^^^^^^^^^^^^^

.. code_block:: python
.. code-block:: python

class CoolModel(pl.LightningModule):

def test_step(self, batch, batch_idx):
x, y = batch
y_hat = self.forward(x)
return {'test_loss': F.cross_entropy(y_hat, y)}

def test_end(self, outputs):
def test_epoch_end(self, outputs):
test_loss_mean = torch.stack([x['test_loss'] for x in outputs]).mean()
return {'test_loss': test_loss_mean}

def test_dataloader(self):
# OPTIONAL
# can also return a list of test dataloaders
return DataLoader(MNIST(os.getcwd(), train=False, download=True,
transform=transforms.ToTensor()), batch_size=32)
return DataLoader(...)

However, the test loop won't ever be called automatically, to make sure you
don't run your test data by accident. Instead, you have to call it explicitly:

.. code-block:: python

# call after training
trainer = Trainer()
trainer.fit(model)
trainer.test()

# or call with pretrained model
model = MyLightningModule.load_from_checkpoint(PATH)
trainer = Trainer()
trainer.test(model)

Training_step_end method
------------------------
When using DataParallel or DistributedDataParallel2 (ddp2), the training_step
will be operating on a portion of the batch. This is normally fine, but in special
cases, such as calculating NCE loss using negative samples, we might want to
perform a softmax across all samples in the batch.

For these types of situations, each loop has an additional ``___step_end`` method,
which allows you to operate on all the pieces of the batch; a concrete sketch follows the pseudocode below.

.. code-block:: python

training_outs = []
for train_batch in train_data:
# dp, ddp2 splits the batch
sub_batches = split_batches_for_dp(train_batch)

# run training_step on each piece of the batch
batch_parts_outputs = [training_step(sub_batch) for sub_batch in sub_batches]

# do softmax with all pieces
out = training_step_end(batch_parts_outputs)
training_outs.append(out)

# do something with the outputs for all batches
# like calculate training set accuracy or loss
training_epoch_end(training_outs)
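
A hedged sketch of what this looks like from inside a LightningModule, mirroring the dp softmax example from the multi-GPU docs changed in this same PR (the shapes and the softmax itself are illustrative):

.. code-block:: python

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.forward(x)
        # on dp/ddp2 this is only a piece of the full batch,
        # so don't do the full-batch softmax here
        return y_hat

    def training_step_end(self, batch_parts_outputs):
        # recombine the outputs of every batch piece,
        # then do the softmax over the full batch
        outputs = torch.cat(batch_parts_outputs, dim=1)
        softmax = torch.softmax(outputs, dim=1)
        return softmax.mean()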

Remove cuda calls
-----------------
In a LightningModule, all calls to ``.cuda()``
and ``.to(device)`` should be removed. Lightning will do these
automatically. This will allow your code to work on CPUs, TPUs and GPUs.

When you init a new tensor in your code, just use ``type_as``:

.. code-block:: python

def training_step(self, batch, batch_idx):
x, y = batch

# put the z on the appropriate gpu or tpu core
z = sample_noise()
z = z.type_as(x)

Live demo
---------
Check out how this live demo
Check out this
`COLAB <https://colab.research.google.com/drive/1F_RNcHzTfFuQf-LeKvSlud6x7jXYkG31#scrollTo=HOk9c4_35FKg>`_
for a live demo.