Merge branch 'master' into feat-early-stop-train
ananthsub authored Apr 27, 2021
2 parents 21d662c + a153c15 commit 1a48461
Showing 41 changed files with 706 additions and 133 deletions.
29 changes: 28 additions & 1 deletion CHANGELOG.md
@@ -13,6 +13,9 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Added support for the `EarlyStopping` callback to run at the end of the training epoch ([#6944](https://github.com/PyTorchLightning/pytorch-lightning/pull/6944/))
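
  A hedged sketch of how this might be used, assuming the behaviour is switched on via a `check_on_train_epoch_end` argument (the argument name is an assumption here; see #6944 for the exact API):

  ```python
  from pytorch_lightning import Trainer
  from pytorch_lightning.callbacks import EarlyStopping

  # Assumption: `check_on_train_epoch_end` is the switch added by #6944. The callback
  # monitors a metric logged during training (e.g. self.log("train_loss", ...)) and
  # evaluates the stopping condition at the end of each training epoch instead of
  # waiting for validation.
  early_stop = EarlyStopping(
      monitor="train_loss",
      mode="min",
      patience=3,
      check_on_train_epoch_end=True,  # assumed argument name
  )
  trainer = Trainer(callbacks=[early_stop], max_epochs=50)
  # trainer.fit(model, datamodule=dm)  # model / dm defined elsewhere
  ```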


- Added synchronization points before and after `setup` hooks are run ([#7202](https://github.com/PyTorchLightning/pytorch-lightning/pull/7202))


- Added a `teardown` hook to `ClusterEnvironment` ([#6942](https://github.com/PyTorchLightning/pytorch-lightning/pull/6942))


@@ -118,6 +121,11 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Added new `EarlyStopping` parameters `stopping_threshold` and `divergence_threshold` ([#6868](https://github.com/PyTorchLightning/pytorch-lightning/pull/6868))
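
  Illustrative usage of the two thresholds: `stopping_threshold` ends training once the monitored metric reaches a good-enough value, while `divergence_threshold` aborts as soon as it becomes worse than the given bound (the numbers are placeholders):

  ```python
  from pytorch_lightning.callbacks import EarlyStopping

  early_stop = EarlyStopping(
      monitor="val_loss",
      mode="min",
      stopping_threshold=0.05,    # stop as soon as val_loss is 0.05 or better
      divergence_threshold=10.0,  # abort immediately if val_loss diverges past 10
  )
  ```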


- Added new `UnrepeatedDistributedSampler` and `IndexBatchSamplerWrapper` for tracking distributed predictions ([#7215](https://github.com/PyTorchLightning/pytorch-lightning/pull/7215))


- Added `trainer.predict(return_predictions=None|False|True)` ([#7215](https://github.com/PyTorchLightning/pytorch-lightning/pull/7215))
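
  A hedged sketch of the new flag (`model` and `predict_dl` are assumed to exist; the exact default semantics of `return_predictions=None` are described in #7215):

  ```python
  from pytorch_lightning import Trainer

  trainer = Trainer(gpus=2, accelerator="ddp")

  # Collect and return predictions from the predict loop.
  preds = trainer.predict(model, dataloaders=predict_dl, return_predictions=True)

  # Skip collecting predictions, e.g. when a prediction-writer callback persists
  # them to disk and keeping them in memory would be wasteful.
  trainer.predict(model, dataloaders=predict_dl, return_predictions=False)
  ```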


### Changed

@@ -148,11 +156,20 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Changed warnings and recommendations for dataloaders in `ddp_spawn` ([#6762](https://github.com/PyTorchLightning/pytorch-lightning/pull/6762/))


- `pl.seed_everyting` will now also set the seed on the `DistributedSampler` ([#7024](https://github.com/PyTorchLightning/pytorch-lightning/pull/7024))
- `pl.seed_everything` will now also set the seed on the `DistributedSampler` ([#7024](https://github.com/PyTorchLightning/pytorch-lightning/pull/7024))
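
  In practice this means one call now covers distributed shuffling as well; a minimal sketch:

  ```python
  from pytorch_lightning import Trainer, seed_everything

  seed_everything(7)  # per #7024, the seed is also propagated to the DistributedSampler

  # The sampler Lightning creates for DDP training now shuffles deterministically.
  trainer = Trainer(gpus=2, accelerator="ddp", max_epochs=1)
  # trainer.fit(model, datamodule=dm)  # model / dm defined elsewhere
  ```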


- Changed default setting for communication of multi-node training using `DDPShardedPlugin` ([#6937](https://github.com/PyTorchLightning/pytorch-lightning/pull/6937))


- `LightningDataModule.from_datasets()` now accepts `IterableDataset` instances as training datasets. ([#7503](https://github.com/PyTorchLightning/pytorch-lightning/pull/7503))


### Deprecated

- Deprecated the `save_function` property from the `ModelCheckpoint` callback ([#7201](https://github.com/PyTorchLightning/pytorch-lightning/pull/7201))


- Deprecated `LightningModule.write_predictions` and `LightningModule.write_predictions_dict` ([#7066](https://github.com/PyTorchLightning/pytorch-lightning/pull/7066))


@@ -194,6 +211,7 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).

### Removed


- Removed `automatic_optimization` as a property from the training loop in favor of `LightningModule.automatic_optimization` ([#7130](https://github.com/PyTorchLightning/pytorch-lightning/pull/7130))


@@ -344,9 +362,18 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Fixed parsing for pre-release package versions ([#6999](https://github.com/PyTorchLightning/pytorch-lightning/pull/6999))


- Fixed `num_sanity_val_steps` affecting reproducibility of training data shuffling ([#7014](https://github.com/PyTorchLightning/pytorch-lightning/pull/7014))


- Fixed resetting device after `fitting/evaluating/predicting` ([#7188](https://github.com/PyTorchLightning/pytorch-lightning/pull/7188))


- Fixed metrics not being properly logged with `precision=16` and `manual_optimization` ([#7228](https://github.com/PyTorchLightning/pytorch-lightning/pull/7228))


- Fixed `parameters_to_ignore` not properly set to DDPWrapper ([#7239](https://github.com/PyTorchLightning/pytorch-lightning/pull/7239))


## [1.2.7] - 2021-04-06

### Fixed
22 changes: 21 additions & 1 deletion dockers/README.md
@@ -45,7 +45,7 @@ docker image list
docker image rm pytorch-lightning:latest
```

### Run docker image with GPUs
## Run docker image with GPUs

To run the docker image with access to your GPUs, you need to install
```bash
@@ -63,3 +63,23 @@ and later run the docker image with `--gpus all`, for example
```
docker run --rm -it --gpus all pytorchlightning/pytorch_lightning:base-cuda-py3.7-torch1.6
```

## Run Jupyter server

Inspiration comes from https://u.group/thinking/how-to-put-jupyter-notebooks-in-a-dockerfile

1. Build the docker image:
```bash
docker image build \
-t pytorch-lightning:v1.2.9 \
-f dockers/nvidia/Dockerfile \
--build-arg LIGHTNING_VERSION=1.2.9 \
.
```
2. Start the server and map the ports:
```bash
docker run --rm -it --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all -p 8888:8888 pytorch-lightning:v1.2.9
```
3. Connect in a local browser:
- Copy the generated URL, e.g. `http://hostname:8888/?token=0719fa7e1729778b0cec363541a608d5003e26d4910983c6`
- Replace `hostname` with `localhost`
26 changes: 18 additions & 8 deletions dockers/nvidia/Dockerfile
@@ -13,18 +13,18 @@
# limitations under the License.

# https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel_21-03.html#rel_21-03
FROM nvcr.io/nvidia/pytorch:20.12-py3
FROM nvcr.io/nvidia/pytorch:21.03-py3

MAINTAINER PyTorchLightning <https://github.com/PyTorchLightning>

ARG LIGHTNING_VERSION=""

RUN python -c "import torch ; print(torch.__version__)" >> torch_version.info

COPY ./ /workspace/pytorch-lightning/

RUN \
cd /workspace && \
mv pytorch-lightning/notebooks . && \
mv pytorch-lightning/pl_examples . && \
# replace by specific version if asked
if [ ! -z "$LIGHTNING_VERSION" ] ; then \
rm -rf pytorch-lightning ; \
@@ -33,18 +33,28 @@ RUN \
mv pytorch-lightning-*/ pytorch-lightning ; \
rm *.zip ; \
fi && \
# save the examples
mv pytorch-lightning/notebooks . && \
mv pytorch-lightning/pl_examples . && \

# Installations
python -c "fname = './pytorch-lightning/requirements/extra.txt' ; lines = [line for line in open(fname).readlines() if not line.startswith('horovod')] ; open(fname, 'w').writelines(lines)" && \
pip install -r ./pytorch-lightning/requirements/extra.txt --no-cache-dir --upgrade-strategy only-if-needed && \
pip install -r ./pytorch-lightning/requirements/examples.txt --no-cache-dir --upgrade-strategy only-if-needed && \
pip install ./pytorch-lightning --no-cache-dir && \
pip install "Pillow>=8.1" "torchtext>=0.9.0" ipython[all] --no-cache-dir --upgrade-strategy only-if-needed && \
rm -rf pytorch-lightning
pip install "Pillow>=8.1" --no-cache-dir --upgrade-strategy only-if-needed && \
rm -rf pytorch-lightning && \
pip list

ENV PYTHONPATH="/workspace"

RUN python --version && \
RUN \
TORCH_VERSION=$(cat torch_version.info) && \
rm torch_version.info && \
python --version && \
pip --version && \
pip list && \
pip list | grep torch && \
python -c "from torch import __version__ as ver ; assert ver == '$TORCH_VERSION', ver" && \
python -c "import pytorch_lightning as pl; print(pl.__version__)"

# CMD ["/bin/bash"]
CMD ["jupyter", "notebook", "--port=8888", "--no-browser", "--ip=0.0.0.0", "--allow-root"]
2 changes: 1 addition & 1 deletion docs/source/advanced/multi_gpu.rst
@@ -675,7 +675,7 @@ To use Sharded Training, you need to first install FairScale using the command below
.. code-block:: python
# train using Sharded DDP
trainer = Trainer(accelerator='ddp', plugins='ddp_sharded')
trainer = Trainer(plugins='ddp_sharded')
Sharded Training can work across all DDP variants by adding the additional ``--plugins ddp_sharded`` flag.
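
For instance, assuming the usual `accelerator` values, the plugin flag composes with any of them:

```python
from pytorch_lightning import Trainer

# Sharded DDP with the default accelerator selection (as in the updated example above).
trainer = Trainer(gpus=2, plugins='ddp_sharded')

# The same flag combined with an explicit DDP variant, e.g. ddp_spawn.
trainer = Trainer(gpus=2, accelerator='ddp_spawn', plugins='ddp_sharded')
```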

7 changes: 6 additions & 1 deletion pytorch_lightning/accelerators/accelerator.py
@@ -331,7 +331,12 @@ def clip_gradients(
gradient_clip_algorithm: GradClipAlgorithmType = GradClipAlgorithmType.NORM,
) -> None:
"""clips all the optimizer parameters to the given value"""
self.precision_plugin.clip_gradients(optimizer, clip_val, gradient_clip_algorithm=gradient_clip_algorithm)
self.precision_plugin.clip_gradients(
optimizer,
clip_val,
gradient_clip_algorithm=gradient_clip_algorithm,
model=self.model,
)

def on_train_epoch_end(self, outputs: EPOCH_OUTPUT) -> None:
"""Hook to do something on the end of an training epoch
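
The hunk above forwards the wrapped `model` to the precision plugin when clipping. A hedged sketch of how a custom precision plugin might use the new argument (class name, logging behaviour and import paths are assumptions; check `PrecisionPlugin.clip_gradients` for the exact base signature):

```python
from typing import Optional, Union

import torch
from pytorch_lightning.plugins import PrecisionPlugin
from pytorch_lightning.utilities import GradClipAlgorithmType


class NormLoggingPrecisionPlugin(PrecisionPlugin):
    """Sketch: report the pre-clip gradient norm of the model now passed in by the accelerator."""

    def clip_gradients(
        self,
        optimizer: torch.optim.Optimizer,
        clip_val: Union[int, float],
        gradient_clip_algorithm: GradClipAlgorithmType = GradClipAlgorithmType.NORM,
        model: Optional[torch.nn.Module] = None,
    ) -> None:
        if model is not None:
            norms = [p.grad.detach().norm() for p in model.parameters() if p.grad is not None]
            if norms:
                print(f"grad norm before clipping: {torch.stack(norms).norm():.4f}")
        # Defer the actual clipping to the base implementation, forwarding the model.
        super().clip_gradients(optimizer, clip_val, gradient_clip_algorithm=gradient_clip_algorithm, model=model)
```

A plugin like this would be passed via `Trainer(plugins=[NormLoggingPrecisionPlugin()])`; in practice one would usually subclass the concrete 32-bit or AMP plugin rather than the base class.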
33 changes: 23 additions & 10 deletions pytorch_lightning/callbacks/model_checkpoint.py
@@ -23,7 +23,7 @@
import re
from copy import deepcopy
from pathlib import Path
from typing import Any, Dict, Optional, Union
from typing import Any, Callable, Dict, Optional, Union

import numpy as np
import torch
@@ -201,19 +201,19 @@ def __init__(
self.best_model_score = None
self.best_model_path = ""
self.last_model_path = ""
self.save_function = None

self.__init_monitor_mode(monitor, mode)
self.__init_ckpt_dir(dirpath, filename, save_top_k)
self.__init_triggers(every_n_train_steps, every_n_val_epochs, period)
self.__validate_init_configuration()
self._save_function = None

def on_pretrain_routine_start(self, trainer, pl_module):
"""
When pretrain routine starts we build the ckpt dir on the fly
"""
self.__resolve_ckpt_dir(trainer)
self.save_function = trainer.save_checkpoint
self._save_function = trainer.save_checkpoint

def on_train_batch_end(
self, trainer, pl_module, outputs: Any, batch: Any, batch_idx: int, dataloader_idx: int
@@ -254,9 +254,9 @@ def on_load_checkpoint(self, callback_state: Dict[str, Any]):

def save_checkpoint(self, trainer, unused: Optional = None):
"""
Performs the main logic around saving a checkpoint.
This method runs on all ranks, it is the responsibility of `self.save_function`
to handle correct behaviour in distributed training, i.e., saving only on rank 0.
Performs the main logic around saving a checkpoint. This method runs on all ranks.
It is the responsibility of `trainer.save_checkpoint` to correctly handle the behaviour in distributed training,
i.e., saving only on rank 0 for data parallel use cases.
"""
if unused is not None:
rank_zero_deprecation(
@@ -396,6 +396,22 @@ def period(self, value: Optional[int]) -> None:
)
self._period = value

@property
def save_function(self) -> Optional[Callable]:
rank_zero_deprecation(
'Property `save_function` in `ModelCheckpoint` is deprecated in v1.3 and will be removed in v1.5.'
' Please use `trainer.save_checkpoint` instead.'
)
return self._save_function

@save_function.setter
def save_function(self, value: Optional[Callable]) -> None:
rank_zero_deprecation(
'Property `save_function` in `ModelCheckpoint` is deprecated in v1.3 and will be removed in v1.5.'
' Please use `trainer.save_checkpoint` instead.'
)
self._save_function = value

@rank_zero_only
def _del_model(self, filepath: str):
if self._fs.exists(filepath):
@@ -420,10 +436,7 @@ def _do_save(self, trainer, filepath: str):
self._fs.makedirs(os.path.dirname(filepath), exist_ok=True)

# delegate the saving to the trainer
if self.save_function is not None:
self.save_function(filepath, self.save_weights_only)
else:
raise ValueError(".save_function() not set")
trainer.save_checkpoint(filepath, self.save_weights_only)

def check_monitor_top_k(self, trainer, current: Optional[torch.Tensor] = None) -> bool:
if current is None:
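
For code that used to call the callback's `save_function` directly, the migration implied by the deprecated property above is to go through the trainer; a minimal sketch (paths and monitor are illustrative):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

checkpoint_callback = ModelCheckpoint(dirpath="checkpoints/", monitor="val_loss")
trainer = Trainer(callbacks=[checkpoint_callback])
# ... trainer.fit(model) runs here, attaching a model to the trainer ...

# Before v1.3 (now emits the deprecation warning added above):
# checkpoint_callback.save_function("manual.ckpt", False)

# From v1.3 on, delegate to the trainer, which saves only on rank 0 in
# distributed (data-parallel) runs:
trainer.save_checkpoint("manual.ckpt", weights_only=False)
```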
7 changes: 4 additions & 3 deletions pytorch_lightning/core/datamodule.py
@@ -17,7 +17,7 @@
from argparse import ArgumentParser, Namespace
from typing import Any, List, Mapping, Optional, Sequence, Tuple, Union

from torch.utils.data import DataLoader, Dataset
from torch.utils.data import DataLoader, Dataset, IterableDataset

from pytorch_lightning.core.hooks import CheckpointHooks, DataHooks
from pytorch_lightning.utilities import rank_zero_only
@@ -26,7 +26,7 @@

class _DataModuleWrapper(type):

def __init__(self, *args, **kwargs):
def __init__(self, *args: Any, **kwargs: Any) -> None:
super().__init__(*args, **kwargs)
self.__has_added_checks = False

@@ -363,7 +363,8 @@ def from_datasets(
"""

def dataloader(ds, shuffle=False):
def dataloader(ds: Dataset, shuffle: bool = False) -> DataLoader:
shuffle &= not isinstance(ds, IterableDataset)
return DataLoader(
ds,
batch_size=batch_size,
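
Since `shuffle &= not isinstance(ds, IterableDataset)` silently drops `shuffle=True` for iterable-style datasets (which a `DataLoader` cannot shuffle anyway), an `IterableDataset` now works as the training set out of the box. A hedged usage sketch via `LightningDataModule.from_datasets`, the classmethod shown in this file (keyword names should be checked against the signature):

```python
from torch.utils.data import IterableDataset

from pytorch_lightning import LightningDataModule


class StreamDataset(IterableDataset):
    """Toy iterable dataset standing in for a real streaming source."""

    def __iter__(self):
        return iter(range(100))


# With #7503 an IterableDataset is accepted as the training dataset; the internal
# dataloader helper skips shuffling for it instead of raising.
dm = LightningDataModule.from_datasets(train_dataset=StreamDataset(), batch_size=8, num_workers=0)
train_loader = dm.train_dataloader()
```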
5 changes: 1 addition & 4 deletions pytorch_lightning/core/lightning.py
@@ -74,7 +74,7 @@ class LightningModule(
"model_size",
] + DeviceDtypeModuleMixin.__jit_unused_properties__

def __init__(self, *args, **kwargs):
def __init__(self, *args: Any, **kwargs: Any) -> None:
super().__init__(*args, **kwargs)

# see (https://github.com/pytorch/pytorch/blob/3e6bb5233f9ca2c5aa55d9cda22a7ee85439aa6e/
@@ -1379,9 +1379,6 @@ def optimizer_step(self, epoch, batch_idx, optimizer, optimizer_idx,
optimizer.step(closure=optimizer_closure)
"""
if not isinstance(optimizer, LightningOptimizer):
# wraps into LightingOptimizer only for running step
optimizer = LightningOptimizer._to_lightning_optimizer(optimizer, self.trainer, optimizer_idx)
optimizer.step(closure=optimizer_closure)

def optimizer_zero_grad(self, epoch: int, batch_idx: int, optimizer: Optimizer, optimizer_idx: int):
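
Because the default `optimizer_step` above no longer re-wraps the optimizer, a user override can simply call `step(closure=...)` on whatever it receives. A minimal sketch of such an override (the warm-up schedule and the `1e-3` base learning rate are assumptions):

```python
from pytorch_lightning import LightningModule


class WarmupModule(LightningModule):
    """Sketch: linear learning-rate warm-up inside an optimizer_step override."""

    def optimizer_step(self, epoch, batch_idx, optimizer, optimizer_idx,
                       optimizer_closure, on_tpu=False, using_native_amp=False, using_lbfgs=False):
        # Warm up the learning rate over the first 500 optimizer steps.
        if self.trainer.global_step < 500:
            scale = min(1.0, float(self.trainer.global_step + 1) / 500.0)
            for pg in optimizer.param_groups:
                pg["lr"] = scale * 1e-3  # assumed base learning rate

        # The optimizer already arrives wrapped, so stepping it directly is enough.
        optimizer.step(closure=optimizer_closure)
```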
3 changes: 3 additions & 0 deletions pytorch_lightning/overrides/base.py
@@ -36,6 +36,9 @@ def __init__(self, pl_module: LightningModule):
super().__init__()
self.module = pl_module

# set the parameters_to_ignore from LightningModule.
self._ddp_params_and_buffers_to_ignore = getattr(pl_module, "_ddp_params_and_buffers_to_ignore", [])

def forward(self, *inputs, **kwargs):
trainer = self.module.trainer

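
The attribute copied onto the wrapper above originates on the user's `LightningModule`, so opting a parameter or buffer out of DDP synchronisation looks roughly like this (a sketch; whether names need a `module.` prefix once wrapped should be verified against #7239):

```python
import torch

from pytorch_lightning import LightningModule


class MyModel(LightningModule):

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)
        # Rank-local running statistics that DDP should not synchronise.
        self.register_buffer("local_stats", torch.zeros(2))
        # Private attribute read by the wrapper in overrides/base.py (see the diff
        # above) and handed to DDP so the reducer ignores these names.
        self._ddp_params_and_buffers_to_ignore = ["local_stats"]
```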