Merge branch 'master' into feat-early-stop-train
ananthsub authored Apr 27, 2021
2 parents 21d662c + a153c15 commit 1a48461
Showing 41 changed files with 706 additions and 133 deletions.
29 changes: 28 additions & 1 deletion CHANGELOG.md
@@ -13,6 +13,9 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Added support for the `EarlyStopping` callback to run at the end of the training epoch ([#6944](https://github.com/PyTorchLightning/pytorch-lightning/pull/6944/))
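
  A hedged sketch of how this might be used, assuming the behaviour is switched on via a `check_on_train_epoch_end` argument (the argument name is an assumption here; see #6944 for the exact API):

  ```python
  from pytorch_lightning import Trainer
  from pytorch_lightning.callbacks import EarlyStopping

  # Assumption: `check_on_train_epoch_end` is the switch added by #6944. The callback
  # monitors a metric logged during training (e.g. self.log("train_loss", ...)) and
  # evaluates the stopping condition at the end of each training epoch instead of
  # waiting for validation.
  early_stop = EarlyStopping(
      monitor="train_loss",
      mode="min",
      patience=3,
      check_on_train_epoch_end=True,  # assumed argument name
  )
  trainer = Trainer(callbacks=[early_stop], max_epochs=50)
  # trainer.fit(model, datamodule=dm)  # model / dm defined elsewhere
  ```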


- Added synchronization points before and after `setup` hooks are run ([#7202](https://github.com/PyTorchLightning/pytorch-lightning/pull/7202))


- Added a `teardown` hook to `ClusterEnvironment` ([#6942](https://github.com/PyTorchLightning/pytorch-lightning/pull/6942))


@@ -118,6 +121,11 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Added new `EarlyStopping` parameters `stopping_threshold` and `divergence_threshold` ([#6868](https://github.com/PyTorchLightning/pytorch-lightning/pull/6868))
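
  Illustrative usage of the two thresholds: `stopping_threshold` ends training once the monitored metric reaches a good-enough value, while `divergence_threshold` aborts as soon as it becomes worse than the given bound (the numbers are placeholders):

  ```python
  from pytorch_lightning.callbacks import EarlyStopping

  early_stop = EarlyStopping(
      monitor="val_loss",
      mode="min",
      stopping_threshold=0.05,    # stop as soon as val_loss is 0.05 or better
      divergence_threshold=10.0,  # abort immediately if val_loss diverges past 10
  )
  ```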


- Added new `UnrepeatedDistributedSampler` and `IndexBatchSamplerWrapper` for tracking distributed predictions ([#7215](https://github.com/PyTorchLightning/pytorch-lightning/pull/7215))


- Added `trainer.predict(return_predictions=None|False|True)` ([#7215](https://github.com/PyTorchLightning/pytorch-lightning/pull/7215))
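
  A hedged sketch of the new flag (`model` and `predict_dl` are assumed to exist; the exact default semantics of `return_predictions=None` are described in #7215):

  ```python
  from pytorch_lightning import Trainer

  trainer = Trainer(gpus=2, accelerator="ddp")

  # Collect and return predictions from the predict loop.
  preds = trainer.predict(model, dataloaders=predict_dl, return_predictions=True)

  # Skip collecting predictions, e.g. when a prediction-writer callback persists
  # them to disk and keeping them in memory would be wasteful.
  trainer.predict(model, dataloaders=predict_dl, return_predictions=False)
  ```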


### Changed

@@ -148,11 +156,20 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Changed warnings and recommendations for dataloaders in `ddp_spawn` ([#6762](https://github.com/PyTorchLightning/pytorch-lightning/pull/6762/))


- `pl.seed_everyting` will now also set the seed on the `DistributedSampler` ([#7024](https://github.com/PyTorchLightning/pytorch-lightning/pull/7024))
- `pl.seed_everything` will now also set the seed on the `DistributedSampler` ([#7024](https://github.com/PyTorchLightning/pytorch-lightning/pull/7024))
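
  In practice this means one call now covers distributed shuffling as well; a minimal sketch:

  ```python
  from pytorch_lightning import Trainer, seed_everything

  seed_everything(7)  # per #7024, the seed is also propagated to the DistributedSampler

  # The sampler Lightning creates for DDP training now shuffles deterministically.
  trainer = Trainer(gpus=2, accelerator="ddp", max_epochs=1)
  # trainer.fit(model, datamodule=dm)  # model / dm defined elsewhere
  ```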


- Changed default setting for communication of multi-node training using `DDPShardedPlugin` ([#6937](https://github.com/PyTorchLightning/pytorch-lightning/pull/6937))


- `LightningDataModule.from_datasets()` now accepts `IterableDataset` instances as training datasets. ([#7503](https://github.com/PyTorchLightning/pytorch-lightning/pull/7503))


### Deprecated

- Deprecated the `save_function` property from the `ModelCheckpoint` callback ([#7201](https://github.com/PyTorchLightning/pytorch-lightning/pull/7201))


- Deprecated `LightningModule.write_predictions` and `LightningModule.write_predictions_dict` ([#7066](https://github.com/PyTorchLightning/pytorch-lightning/pull/7066))


@@ -194,6 +211,7 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).

### Removed


- Removed `automatic_optimization` as a property from the training loop in favor of `LightningModule.automatic_optimization` ([#7130](https://github.com/PyTorchLightning/pytorch-lightning/pull/7130))


@@ -344,9 +362,18 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Fixed parsing for pre-release package versions ([#6999](https://github.com/PyTorchLightning/pytorch-lightning/pull/6999))


- Fixed `num_sanity_val_steps` affecting reproducibility of training data shuffling ([#7014](https://github.com/PyTorchLightning/pytorch-lightning/pull/7014))


- Fixed resetting device after `fitting/evaluating/predicting` ([#7188](https://github.com/PyTorchLightning/pytorch-lightning/pull/7188))


- Fixed metrics not being properly logged with `precision=16` and `manual_optimization` ([#7228](https://github.com/PyTorchLightning/pytorch-lightning/pull/7228))


- Fixed `parameters_to_ignore` not properly set to DDPWrapper ([#7239](https://github.com/PyTorchLightning/pytorch-lightning/pull/7239))


## [1.2.7] - 2021-04-06

### Fixed
22 changes: 21 additions & 1 deletion dockers/README.md
@@ -45,7 +45,7 @@ docker image list
docker image rm pytorch-lightning:latest
```

### Run docker image with GPUs
## Run docker image with GPUs

To run the docker image with access to your GPUs, you need to install
```bash
@@ -63,3 +63,23 @@ and later run the docker image with `--gpus all`, for example
```
docker run --rm -it --gpus all pytorchlightning/pytorch_lightning:base-cuda-py3.7-torch1.6
```

## Run Jupyter server

Inspiration comes from https://u.group/thinking/how-to-put-jupyter-notebooks-in-a-dockerfile

1. Build the docker image:
```bash
docker image build \
-t pytorch-lightning:v1.2.9 \
-f dockers/nvidia/Dockerfile \
--build-arg LIGHTNING_VERSION=1.2.9 \
.
```
2. Start the server and map the ports:
```bash
docker run --rm -it --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all -p 8888:8888 pytorch-lightning:v1.2.9
```
3. Connect in a local browser:
- Copy the generated URL, e.g. `http://hostname:8888/?token=0719fa7e1729778b0cec363541a608d5003e26d4910983c6`
- Replace `hostname` with `localhost`
26 changes: 18 additions & 8 deletions dockers/nvidia/Dockerfile
@@ -13,18 +13,18 @@
# limitations under the License.

# https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel_21-03.html#rel_21-03
FROM nvcr.io/nvidia/pytorch:20.12-py3
FROM nvcr.io/nvidia/pytorch:21.03-py3

MAINTAINER PyTorchLightning <https://github.com/PyTorchLightning>

ARG LIGHTNING_VERSION=""

RUN python -c "import torch ; print(torch.__version__)" >> torch_version.info

COPY ./ /workspace/pytorch-lightning/

RUN \
cd /workspace && \
mv pytorch-lightning/notebooks . && \
mv pytorch-lightning/pl_examples . && \
# replace by specific version if asked
if [ ! -z "$LIGHTNING_VERSION" ] ; then \
rm -rf pytorch-lightning ; \
@@ -33,18 +33,28 @@ RUN \
mv pytorch-lightning-*/ pytorch-lightning ; \
rm *.zip ; \
fi && \
# save the examples
mv pytorch-lightning/notebooks . && \
mv pytorch-lightning/pl_examples . && \

# Installations
python -c "fname = './pytorch-lightning/requirements/extra.txt' ; lines = [line for line in open(fname).readlines() if not line.startswith('horovod')] ; open(fname, 'w').writelines(lines)" && \
pip install -r ./pytorch-lightning/requirements/extra.txt --no-cache-dir --upgrade-strategy only-if-needed && \
pip install -r ./pytorch-lightning/requirements/examples.txt --no-cache-dir --upgrade-strategy only-if-needed && \
pip install ./pytorch-lightning --no-cache-dir && \
pip install "Pillow>=8.1" "torchtext>=0.9.0" ipython[all] --no-cache-dir --upgrade-strategy only-if-needed && \
rm -rf pytorch-lightning
pip install "Pillow>=8.1" --no-cache-dir --upgrade-strategy only-if-needed && \
rm -rf pytorch-lightning && \
pip list

ENV PYTHONPATH="/workspace"

RUN python --version && \
RUN \
TORCH_VERSION=$(cat torch_version.info) && \
rm torch_version.info && \
python --version && \
pip --version && \
pip list && \
pip list | grep torch && \
python -c "from torch import __version__ as ver ; assert ver == '$TORCH_VERSION', ver" && \
python -c "import pytorch_lightning as pl; print(pl.__version__)"

# CMD ["/bin/bash"]
CMD ["jupyter", "notebook", "--port=8888", "--no-browser", "--ip=0.0.0.0", "--allow-root"]
2 changes: 1 addition & 1 deletion docs/source/advanced/multi_gpu.rst
@@ -675,7 +675,7 @@ To use Sharded Training, you need to first install FairScale using the command below
.. code-block:: python
# train using Sharded DDP
trainer = Trainer(accelerator='ddp', plugins='ddp_sharded')
trainer = Trainer(plugins='ddp_sharded')
Sharded Training can work across all DDP variants by adding the additional ``--plugins ddp_sharded`` flag.
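
For instance, assuming the usual `accelerator` values, the plugin flag composes with any of them:

```python
from pytorch_lightning import Trainer

# Sharded DDP with the default accelerator selection (as in the updated example above).
trainer = Trainer(gpus=2, plugins='ddp_sharded')

# The same flag combined with an explicit DDP variant, e.g. ddp_spawn.
trainer = Trainer(gpus=2, accelerator='ddp_spawn', plugins='ddp_sharded')
```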

7 changes: 6 additions & 1 deletion pytorch_lightning/accelerators/accelerator.py
@@ -331,7 +331,12 @@ def clip_gradients(
gradient_clip_algorithm: GradClipAlgorithmType = GradClipAlgorithmType.NORM,
) -> None:
"""clips all the optimizer parameters to the given value"""
self.precision_plugin.clip_gradients(optimizer, clip_val, gradient_clip_algorithm=gradient_clip_algorithm)
self.precision_plugin.clip_gradients(
optimizer,
clip_val,
gradient_clip_algorithm=gradient_clip_algorithm,
model=self.model,
)

def on_train_epoch_end(self, outputs: EPOCH_OUTPUT) -> None:
"""Hook to do something on the end of an training epoch
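
The hunk above forwards the wrapped `model` to the precision plugin when clipping. A hedged sketch of how a custom precision plugin might use the new argument (class name, logging behaviour and import paths are assumptions; check `PrecisionPlugin.clip_gradients` for the exact base signature):

```python
from typing import Optional, Union

import torch
from pytorch_lightning.plugins import PrecisionPlugin
from pytorch_lightning.utilities import GradClipAlgorithmType


class NormLoggingPrecisionPlugin(PrecisionPlugin):
    """Sketch: report the pre-clip gradient norm of the model now passed in by the accelerator."""

    def clip_gradients(
        self,
        optimizer: torch.optim.Optimizer,
        clip_val: Union[int, float],
        gradient_clip_algorithm: GradClipAlgorithmType = GradClipAlgorithmType.NORM,
        model: Optional[torch.nn.Module] = None,
    ) -> None:
        if model is not None:
            norms = [p.grad.detach().norm() for p in model.parameters() if p.grad is not None]
            if norms:
                print(f"grad norm before clipping: {torch.stack(norms).norm():.4f}")
        # Defer the actual clipping to the base implementation, forwarding the model.
        super().clip_gradients(optimizer, clip_val, gradient_clip_algorithm=gradient_clip_algorithm, model=model)
```

A plugin like this would be passed via `Trainer(plugins=[NormLoggingPrecisionPlugin()])`; in practice one would usually subclass the concrete 32-bit or AMP plugin rather than the base class.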
33 changes: 23 additions & 10 deletions pytorch_lightning/callbacks/model_checkpoint.py
@@ -23,7 +23,7 @@
import re
from copy import deepcopy
from pathlib import Path
from typing import Any, Dict, Optional, Union
from typing import Any, Callable, Dict, Optional, Union

import numpy as np
import torch
@@ -201,19 +201,19 @@ def __init__(
self.best_model_score = None
self.best_model_path = ""
self.last_model_path = ""
self.save_function = None

self.__init_monitor_mode(monitor, mode)
self.__init_ckpt_dir(dirpath, filename, save_top_k)
self.__init_triggers(every_n_train_steps, every_n_val_epochs, period)
self.__validate_init_configuration()
self._save_function = None

def on_pretrain_routine_start(self, trainer, pl_module):
"""
When pretrain routine starts we build the ckpt dir on the fly
"""
self.__resolve_ckpt_dir(trainer)
self.save_function = trainer.save_checkpoint
self._save_function = trainer.save_checkpoint

def on_train_batch_end(
self, trainer, pl_module, outputs: Any, batch: Any, batch_idx: int, dataloader_idx: int
@@ -254,9 +254,9 @@ def on_load_checkpoint(self, callback_state: Dict[str, Any]):

def save_checkpoint(self, trainer, unused: Optional = None):
"""
Performs the main logic around saving a checkpoint.
This method runs on all ranks, it is the responsibility of `self.save_function`
to handle correct behaviour in distributed training, i.e., saving only on rank 0.
Performs the main logic around saving a checkpoint. This method runs on all ranks.
It is the responsibility of `trainer.save_checkpoint` to correctly handle the behaviour in distributed training,
i.e., saving only on rank 0 for data parallel use cases.
"""
if unused is not None:
rank_zero_deprecation(
@@ -396,6 +396,22 @@ def period(self, value: Optional[int]) -> None:
)
self._period = value

@property
def save_function(self) -> Optional[Callable]:
rank_zero_deprecation(
'Property `save_function` in `ModelCheckpoint` is deprecated in v1.3 and will be removed in v1.5.'
' Please use `trainer.save_checkpoint` instead.'
)
return self._save_function

@save_function.setter
def save_function(self, value: Optional[Callable]) -> None:
rank_zero_deprecation(
'Property `save_function` in `ModelCheckpoint` is deprecated in v1.3 and will be removed in v1.5.'
' Please use `trainer.save_checkpoint` instead.'
)
self._save_function = value

@rank_zero_only
def _del_model(self, filepath: str):
if self._fs.exists(filepath):
@@ -420,10 +436,7 @@ def _do_save(self, trainer, filepath: str):
self._fs.makedirs(os.path.dirname(filepath), exist_ok=True)

# delegate the saving to the trainer
if self.save_function is not None:
self.save_function(filepath, self.save_weights_only)
else:
raise ValueError(".save_function() not set")
trainer.save_checkpoint(filepath, self.save_weights_only)

def check_monitor_top_k(self, trainer, current: Optional[torch.Tensor] = None) -> bool:
if current is None:
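
For code that used to call the callback's `save_function` directly, the migration implied by the deprecated property above is to go through the trainer; a minimal sketch (paths and monitor are illustrative):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

checkpoint_callback = ModelCheckpoint(dirpath="checkpoints/", monitor="val_loss")
trainer = Trainer(callbacks=[checkpoint_callback])
# ... trainer.fit(model) runs here, attaching a model to the trainer ...

# Before v1.3 (now emits the deprecation warning added above):
# checkpoint_callback.save_function("manual.ckpt", False)

# From v1.3 on, delegate to the trainer, which saves only on rank 0 in
# distributed (data-parallel) runs:
trainer.save_checkpoint("manual.ckpt", weights_only=False)
```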
7 changes: 4 additions & 3 deletions pytorch_lightning/core/datamodule.py
@@ -17,7 +17,7 @@
from argparse import ArgumentParser, Namespace
from typing import Any, List, Mapping, Optional, Sequence, Tuple, Union

from torch.utils.data import DataLoader, Dataset
from torch.utils.data import DataLoader, Dataset, IterableDataset

from pytorch_lightning.core.hooks import CheckpointHooks, DataHooks
from pytorch_lightning.utilities import rank_zero_only
@@ -26,7 +26,7 @@

class _DataModuleWrapper(type):

def __init__(self, *args, **kwargs):
def __init__(self, *args: Any, **kwargs: Any) -> None:
super().__init__(*args, **kwargs)
self.__has_added_checks = False

@@ -363,7 +363,8 @@ def from_datasets(
"""

def dataloader(ds, shuffle=False):
def dataloader(ds: Dataset, shuffle: bool = False) -> DataLoader:
shuffle &= not isinstance(ds, IterableDataset)
return DataLoader(
ds,
batch_size=batch_size,
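
Since `shuffle &= not isinstance(ds, IterableDataset)` silently drops `shuffle=True` for iterable-style datasets (which a `DataLoader` cannot shuffle anyway), an `IterableDataset` now works as the training set out of the box. A hedged usage sketch via `LightningDataModule.from_datasets`, the classmethod shown in this file (keyword names should be checked against the signature):

```python
from torch.utils.data import IterableDataset

from pytorch_lightning import LightningDataModule


class StreamDataset(IterableDataset):
    """Toy iterable dataset standing in for a real streaming source."""

    def __iter__(self):
        return iter(range(100))


# With #7503 an IterableDataset is accepted as the training dataset; the internal
# dataloader helper skips shuffling for it instead of raising.
dm = LightningDataModule.from_datasets(train_dataset=StreamDataset(), batch_size=8, num_workers=0)
train_loader = dm.train_dataloader()
```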
5 changes: 1 addition & 4 deletions pytorch_lightning/core/lightning.py
@@ -74,7 +74,7 @@ class LightningModule(
"model_size",
] + DeviceDtypeModuleMixin.__jit_unused_properties__

def __init__(self, *args, **kwargs):
def __init__(self, *args: Any, **kwargs: Any) -> None:
super().__init__(*args, **kwargs)

# see (https://github.com/pytorch/pytorch/blob/3e6bb5233f9ca2c5aa55d9cda22a7ee85439aa6e/
@@ -1379,9 +1379,6 @@ def optimizer_step(self, epoch, batch_idx, optimizer, optimizer_idx,
optimizer.step(closure=optimizer_closure)
"""
if not isinstance(optimizer, LightningOptimizer):
# wraps into LightingOptimizer only for running step
optimizer = LightningOptimizer._to_lightning_optimizer(optimizer, self.trainer, optimizer_idx)
optimizer.step(closure=optimizer_closure)

def optimizer_zero_grad(self, epoch: int, batch_idx: int, optimizer: Optimizer, optimizer_idx: int):
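
Because the default `optimizer_step` above no longer re-wraps the optimizer, a user override can simply call `step(closure=...)` on whatever it receives. A minimal sketch of such an override (the warm-up schedule and the `1e-3` base learning rate are assumptions):

```python
from pytorch_lightning import LightningModule


class WarmupModule(LightningModule):
    """Sketch: linear learning-rate warm-up inside an optimizer_step override."""

    def optimizer_step(self, epoch, batch_idx, optimizer, optimizer_idx,
                       optimizer_closure, on_tpu=False, using_native_amp=False, using_lbfgs=False):
        # Warm up the learning rate over the first 500 optimizer steps.
        if self.trainer.global_step < 500:
            scale = min(1.0, float(self.trainer.global_step + 1) / 500.0)
            for pg in optimizer.param_groups:
                pg["lr"] = scale * 1e-3  # assumed base learning rate

        # The optimizer already arrives wrapped, so stepping it directly is enough.
        optimizer.step(closure=optimizer_closure)
```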
3 changes: 3 additions & 0 deletions pytorch_lightning/overrides/base.py
@@ -36,6 +36,9 @@ def __init__(self, pl_module: LightningModule):
super().__init__()
self.module = pl_module

# set the parameters_to_ignore from LightningModule.
self._ddp_params_and_buffers_to_ignore = getattr(pl_module, "_ddp_params_and_buffers_to_ignore", [])

def forward(self, *inputs, **kwargs):
trainer = self.module.trainer

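
The attribute copied onto the wrapper above originates on the user's `LightningModule`, so opting a parameter or buffer out of DDP synchronisation looks roughly like this (a sketch; whether names need a `module.` prefix once wrapped should be verified against #7239):

```python
import torch

from pytorch_lightning import LightningModule


class MyModel(LightningModule):

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)
        # Rank-local running statistics that DDP should not synchronise.
        self.register_buffer("local_stats", torch.zeros(2))
        # Private attribute read by the wrapper in overrides/base.py (see the diff
        # above) and handed to DDP so the reducer ignores these names.
        self._ddp_params_and_buffers_to_ignore = ["local_stats"]
```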