Commit

Merge branch 'master' into issue/wandb_global_step
williamFalcon authored Apr 30, 2020
2 parents d9c0e66 + 142bc02 commit 30d8dc4
Showing 17 changed files with 275 additions and 67 deletions.
2 changes: 1 addition & 1 deletion .github/PULL_REQUEST_TEMPLATE.md
@@ -4,7 +4,7 @@
- [ ] Did you read the [contributor guideline](https://github.com/PyTorchLightning/pytorch-lightning/blob/master/.github/CONTRIBUTING.md), Pull Request section?
- [ ] Did you make sure to update the docs?
- [ ] Did you write any new necessary tests?
- [ ] If you made a notable change (that affects users), did you update the [CHANGELOG](https://github.com/PyTorchLightning/pytorch-lightning/blob/master/.github/CHANGELOG.md)?
- [ ] If you made a notable change (that affects users), did you update the [CHANGELOG](https://github.com/PyTorchLightning/pytorch-lightning/blob/master/CHANGELOG.md)?

<!-- For CHANGELOG separate each item in unreleased section by blank line to reduce collisions -->

12 changes: 9 additions & 3 deletions CHANGELOG.md
@@ -8,6 +8,8 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).

### Added

- Added callback for logging learning rates ([#1498](https://github.com/PyTorchLightning/pytorch-lightning/pull/1498))

### Changed

### Deprecated
@@ -17,11 +19,15 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
### Fixed

- Fixed wandb logger `global_step` affecting other loggers ([#1492](https://github.com/PyTorchLightning/pytorch-lightning/pull/1492))
- Fixed broken link in PR template ([#1675](https://github.com/PyTorchLightning/pytorch-lightning/pull/1675))
- Fixed `ModelCheckpoint` not checking `filepath` for None ([#1654](https://github.com/PyTorchLightning/pytorch-lightning/pull/1654))
- Trainer now calls `on_load_checkpoint()` when resuming from a checkpoint ([#1666](https://github.com/PyTorchLightning/pytorch-lightning/pull/1666))


## [0.7.5] - 2020-04-27

### Changed

- Allow logging of metrics together with `hparams` ([#1630](https://github.com/PyTorchLightning/pytorch-lightning/pull/1630))
- Allow metrics logged together with hparams ([#1630](https://github.com/PyTorchLightning/pytorch-lightning/pull/1630))

@@ -52,7 +58,7 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Added `ddp_cpu` backend for testing ddp without GPUs ([#1158](https://github.com/PyTorchLightning/pytorch-lightning/pull/1158))
- Added [Horovod](http://horovod.ai) support as a distributed backend `Trainer(distributed_backend='horovod')` ([#1529](https://github.com/PyTorchLightning/pytorch-lightning/pull/1529))
- Added support for 8 core distributed training on Kaggle TPUs ([#1568](https://github.com/PyTorchLightning/pytorch-lightning/pull/1568))
- Added support for native AMP ([#1561](https://github.com/PyTorchLightning/pytorch-lightning/pull/1561), [#1580](https://github.com/PyTorchLightning/pytorch-lightning/pull/1580))

### Changed

@@ -79,7 +85,7 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Fixed loggers - flushing last logged metrics even before continue, e.g. `trainer.test()` results ([#1459](https://github.com/PyTorchLightning/pytorch-lightning/pull/1459))
- Fixed optimizer configuration when `configure_optimizers` returns dict without `lr_scheduler` ([#1443](https://github.com/PyTorchLightning/pytorch-lightning/pull/1443))
- Fixed `LightningModule` - mixing hparams and arguments in `LightningModule.__init__()` crashes load_from_checkpoint() ([#1505](https://github.com/PyTorchLightning/pytorch-lightning/pull/1505))
- Added a missing call to the `on_before_zero_grad` model hook ([#1493](https://github.com/PyTorchLightning/pytorch-lightning/pull/1493)).
- Allow use of sweeps with `WandbLogger` ([#1512](https://github.com/PyTorchLightning/pytorch-lightning/pull/1512))
- Fixed a bug that caused the `callbacks` Trainer argument to reference a global variable ([#1534](https://github.com/PyTorchLightning/pytorch-lightning/pull/1534)).
- Fixed a bug that set all boolean CLI arguments from `Trainer.add_argparse_args` always to True ([#1571](https://github.com/PyTorchLightning/pytorch-lightning/pull/1571))
8 changes: 8 additions & 0 deletions docs/source/callbacks.rst
@@ -84,3 +84,11 @@ We successfully extended functionality without polluting our super clean
.. automodule:: pytorch_lightning.callbacks.progress
   :noindex:
   :exclude-members:

---------

.. automodule:: pytorch_lightning.callbacks.lr_logger
   :noindex:
   :exclude-members:
      _extract_lr,
      _find_names
57 changes: 19 additions & 38 deletions docs/source/fast_training.rst
@@ -42,45 +42,26 @@ Must use an int if using an IterableDataset.
    # check every 100 train batches (ie: for IterableDatasets or fixed frequency)
    trainer = Trainer(val_check_interval=100)
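For contrast, a hedged sketch of the float form (a fraction of the training epoch), which only works for finite datasets:

.. code-block:: python

    # check validation 4 times per training epoch (finite datasets only)
    trainer = Trainer(val_check_interval=0.25)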
Use training data subset
------------------------
If you don't want to check 100% of the training set (for debugging or if it's huge), set this flag.
Use data subset for training, validation and test
-------------------------------------------------
If you don't want to check 100% of the training/validation/test set (for debugging or if it's huge), set these flags.

.. code-block:: python

    # DEFAULT
    trainer = Trainer(train_percent_check=1.0)

    # check 10% only
    trainer = Trainer(train_percent_check=0.1)

.. note:: ``train_percent_check`` will be overwritten by ``overfit_pct`` if ``overfit_pct`` > 0.

Use test data subset
--------------------
If you don't want to check 100% of the test set (for debugging or if it's huge), set this flag.

.. code-block:: python

    # DEFAULT
    trainer = Trainer(test_percent_check=1.0)

    # check 10% only
    trainer = Trainer(test_percent_check=0.1)

.. note:: ``test_percent_check`` will be overwritten by ``overfit_pct`` if ``overfit_pct`` > 0.

Use validation data subset
--------------------------
If you don't want to check 100% of the validation set (for debugging or if it's huge), set this flag.

.. code-block:: python

    # DEFAULT
    trainer = Trainer(val_percent_check=1.0)

    # check 10% only
    trainer = Trainer(val_percent_check=0.1)

.. note:: ``val_percent_check`` will be overwritten by ``overfit_pct`` if ``overfit_pct`` > 0 and ignored if
   ``fast_dev_run=True``.

    trainer = Trainer(
        train_percent_check=1.0,
        val_percent_check=1.0,
        test_percent_check=1.0
    )

    # check 10%, 20%, 30% only, respectively for training, validation and test set
    trainer = Trainer(
        train_percent_check=0.1,
        val_percent_check=0.2,
        test_percent_check=0.3
    )

.. note:: ``train_percent_check``, ``val_percent_check`` and ``test_percent_check`` will be overwritten by ``overfit_pct`` if ``overfit_pct`` > 0. ``val_percent_check`` will be ignored if ``fast_dev_run=True``.

.. note:: If you set ``val_percent_check=0``, validation will be disabled.
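As a hedged illustration of the ``overfit_pct`` override described in the notes above (a minimal sketch assuming a 0.7.x-style ``Trainer``; the 1% figure is arbitrary):

.. code-block:: python

    # overfit_pct takes precedence: despite the explicit 10%/20%/30% settings,
    # all three loaders end up capped at 1% of their data
    trainer = Trainer(
        train_percent_check=0.1,
        val_percent_check=0.2,
        test_percent_check=0.3,
        overfit_pct=0.01
    )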
2 changes: 1 addition & 1 deletion docs/source/new-project.rst
@@ -100,7 +100,7 @@ To also add a validation loop add the following functions
    def validation_epoch_end(self, outputs):
        avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        tensorboard_logs = {'val_loss': avg_loss}
        return {'val_loss': avg_loss, 'log': tensorboard_logs
        return {'val_loss': avg_loss, 'log': tensorboard_logs}

    def val_dataloader(self):
        # TODO: do a real train/val split
9 changes: 4 additions & 5 deletions docs/source/slurm.rst
@@ -11,17 +11,14 @@ To train a model using multiple-nodes do the following:

1. Design your LightningModule.

2. Add `torch.DistributedSampler <https://pytorch.org/docs/stable/data.html#torch.utils.data.distributed.DistributedSampler>`_
which enables access to a subset of your full dataset to each GPU.

3. Enable ddp in the trainer
2. Enable ddp in the trainer

.. code-block:: python

    # train on 32 GPUs across 4 nodes
    trainer = Trainer(gpus=8, num_nodes=4, distributed_backend='ddp')

4. It's a good idea to structure your train.py file like this:
3. It's a good idea to structure your train.py file like this:

.. code-block:: python
@@ -91,6 +88,8 @@ To train a model using multiple-nodes do the following:
sbatch submit.sh
.. note:: using :class:`~torch.utils.data.distributed.DistributedSampler` is already handled by Lightning.
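As a minimal sketch of what this note means in practice (hedged; ``self.dataset`` is a placeholder): return a plain ``DataLoader`` and Lightning injects the ``DistributedSampler`` when ddp is enabled.

.. code-block:: python

    from torch.utils.data import DataLoader

    def train_dataloader(self):
        # no DistributedSampler needed -- Lightning adds one automatically
        # when distributed_backend='ddp' is used
        return DataLoader(self.dataset, batch_size=32)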

Walltime auto-resubmit
-----------------------------------
When you use Lightning in a SLURM cluster, lightning automatically detects when it is about
2 changes: 2 additions & 0 deletions pytorch_lightning/callbacks/__init__.py
@@ -2,13 +2,15 @@
from pytorch_lightning.callbacks.early_stopping import EarlyStopping
from pytorch_lightning.callbacks.gradient_accumulation_scheduler import GradientAccumulationScheduler
from pytorch_lightning.callbacks.model_checkpoint import ModelCheckpoint
from pytorch_lightning.callbacks.lr_logger import LearningRateLogger
from pytorch_lightning.callbacks.progress import ProgressBarBase, ProgressBar

__all__ = [
    'Callback',
    'EarlyStopping',
    'ModelCheckpoint',
    'GradientAccumulationScheduler',
    'LearningRateLogger',
    'ProgressBarBase',
    'ProgressBar',
]
118 changes: 118 additions & 0 deletions pytorch_lightning/callbacks/lr_logger.py
@@ -0,0 +1,118 @@
r"""
Logging of learning rates
=========================
Log learning rate for lr schedulers during training
"""

from pytorch_lightning.callbacks.base import Callback
from pytorch_lightning.utilities.exceptions import MisconfigurationException


class LearningRateLogger(Callback):
r"""
Automatically logs learning rate for learning rate schedulers during training.
Example::
>>> from pytorch_lightning import Trainer
>>> from pytorch_lightning.callbacks import LearningRateLogger
>>> lr_logger = LearningRateLogger()
>>> trainer = Trainer(callbacks=[lr_logger])
Logging names are automatically determined based on optimizer class name.
In case of multiple optimizers of same type, they will be named `Adam`,
`Adam-1` etc. If a optimizer has multiple parameter groups they will
be named `Adam/pg1`, `Adam/pg2` etc. To control naming, pass in a
`name` keyword in the construction of the learning rate schdulers
Example::
def configure_optimizer(self):
optimizer = torch.optim.Adam(...)
lr_scheduler = {'scheduler': torch.optim.lr_schedulers.LambdaLR(optimizer, ...)
'name': 'my_logging_name'}
return [optimizer], [lr_scheduler]
"""
def __init__(self):
self.lrs = None
self.lr_sch_names = []

def on_train_start(self, trainer, pl_module):
""" Called before training, determines unique names for all lr
schedulers in the case of multiple of the same type or in
the case of multiple parameter groups
"""
if trainer.lr_schedulers == []:
raise MisconfigurationException(
'Cannot use LearningRateLogger callback with models that have no'
' learning rate schedulers. Please see documentation for'
' `configure_optimizers` method.')

if not trainer.logger:
raise MisconfigurationException(
'Cannot use LearningRateLogger callback with Trainer that has no logger.')

# Find names for schedulers
names = self._find_names(trainer.lr_schedulers)

# Initialize for storing values
self.lrs = dict.fromkeys(names, [])

def on_batch_start(self, trainer, pl_module):
latest_stat = self._extract_lr(trainer, 'step')
if trainer.logger and latest_stat:
trainer.logger.log_metrics(latest_stat, step=trainer.global_step)

def on_epoch_start(self, trainer, pl_module):
latest_stat = self._extract_lr(trainer, 'epoch')
if trainer.logger and latest_stat:
trainer.logger.log_metrics(latest_stat, step=trainer.global_step)

def _extract_lr(self, trainer, interval):
""" Extracts learning rates for lr schedulers and saves information
into dict structure. """
latest_stat = {}
for name, scheduler in zip(self.lr_sch_names, trainer.lr_schedulers):
if scheduler['interval'] == interval:
param_groups = scheduler['scheduler'].optimizer.param_groups
if len(param_groups) != 1:
for i, pg in enumerate(param_groups):
lr, key = pg['lr'], f'{name}/{i + 1}'
self.lrs[key].append(lr)
latest_stat[key] = lr
else:
self.lrs[name].append(param_groups[0]['lr'])
latest_stat[name] = param_groups[0]['lr']
return latest_stat

def _find_names(self, lr_schedulers):
# Create uniqe names in the case we have multiple of the same learning
# rate schduler + multiple parameter groups
names = []
for scheduler in lr_schedulers:
sch = scheduler['scheduler']
if 'name' in scheduler:
name = scheduler['name']
else:
opt_name = 'lr-' + sch.optimizer.__class__.__name__
i, name = 1, opt_name
# Multiple schduler of the same type
while True:
if name not in names:
break
i, name = i + 1, f'{opt_name}-{i}'

# Multiple param groups for the same schduler
param_groups = sch.optimizer.param_groups
if len(param_groups) != 1:
for i, pg in enumerate(param_groups):
temp = name + '/pg' + str(i + 1)
names.append(temp)
else:
names.append(name)

self.lr_sch_names.append(name)
return names
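A minimal usage sketch for the new callback (hedged: ``LitModel`` and its scheduler setup are illustrative assumptions, not part of this commit):

import torch
import pytorch_lightning as pl
from pytorch_lightning.callbacks import LearningRateLogger

class LitModel(pl.LightningModule):  # hypothetical minimal module (training logic omitted)
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        # 'name' controls how this scheduler shows up in the logs
        scheduler = {'scheduler': torch.optim.lr_scheduler.StepLR(optimizer, step_size=10),
                     'name': 'my_logging_name', 'interval': 'epoch'}
        return [optimizer], [scheduler]

lr_logger = LearningRateLogger()
trainer = pl.Trainer(callbacks=[lr_logger])
# trainer.fit(LitModel())  # would also need training_step and train_dataloader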
2 changes: 1 addition & 1 deletion pytorch_lightning/callbacks/model_checkpoint.py
@@ -86,7 +86,7 @@ def __init__(self, filepath: Optional[str] = None, monitor: str = 'val_loss', ve
                 save_top_k: int = 1, save_weights_only: bool = False,
                 mode: str = 'auto', period: int = 1, prefix: str = ''):
        super().__init__()
        if save_top_k > 0 and os.path.isdir(filepath) and len(os.listdir(filepath)) > 0:
        if save_top_k > 0 and filepath is not None and os.path.isdir(filepath) and len(os.listdir(filepath)) > 0:
            rank_zero_warn(
                f"Checkpoint directory {filepath} exists and is not empty with save_top_k != 0."
                "All files in this directory will be deleted when a checkpoint is saved!"
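A short hedged sketch of what the added None check permits (the fallback to a default checkpoint path follows the class docstring, not this diff):

from pytorch_lightning.callbacks import ModelCheckpoint

# Previously this could crash inside os.path.isdir(None); with the added
# `filepath is not None` guard it constructs cleanly and defers to the
# trainer's default checkpoint location.
checkpoint_callback = ModelCheckpoint(filepath=None, save_top_k=1)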
8 changes: 5 additions & 3 deletions pytorch_lightning/core/lightning.py
@@ -930,10 +930,12 @@ def init_ddp_connection(
        if 'MASTER_ADDR' not in os.environ:
            log.warning("MASTER_ADDR environment variable is not defined. Set as localhost")
            os.environ['MASTER_ADDR'] = '127.0.0.1'
        log.debug(f"MASTER_ADDR: {os.environ['MASTER_ADDR']}")

        if 'MASTER_PORT' not in os.environ:
            log.warning("MASTER_PORT environment variable is not defined. Set as 12910")
            os.environ['MASTER_PORT'] = '12910'
        log.debug(f"MASTER_PORT: {os.environ['MASTER_PORT']}")

        if 'WORLD_SIZE' in os.environ and os.environ['WORLD_SIZE'] != world_size:
            log.warning("WORLD_SIZE environment variable is not equal to the computed "
@@ -1160,9 +1162,9 @@ def optimizer_step(self, current_epoch, batch_idx, optimizer,

            # native amp + lbfgs is a no go right now
            if self.trainer.use_amp and self.trainer.use_native_amp:
                m = 'native PyTorch amp and lbfgs are not compatible. To request, please file' \
                    'a Github issue in PyTorch and tag @mcarilli'
                raise MisconfigurationException(m)
                raise MisconfigurationException(
                    'native PyTorch amp and lbfgs are not compatible.'
                    ' To request, please file a Github issue in PyTorch and tag @mcarilli')
            optimizer.step(second_order_closure)
        else:
            if self.trainer.use_amp and self.trainer.use_native_amp:
6 changes: 5 additions & 1 deletion pytorch_lightning/trainer/distrib_data_parallel.py
@@ -277,6 +277,10 @@ def configure_slurm_ddp(self, num_gpu_nodes):
        except Exception as e:
            pass

        # notify user that SLURM is managing tasks
        if self.is_slurm_managing_tasks:
            log.info('Multi-processing is handled by Slurm.')

    def set_nvidia_flags(self, is_slurm_managing_tasks, data_parallel_device_ids):
        if data_parallel_device_ids is None:
            return
@@ -293,7 +297,7 @@ def set_nvidia_flags(self, is_slurm_managing_tasks, data_parallel_device_ids):
        gpu_str = ','.join([str(x) for x in data_parallel_device_ids])
        os.environ["CUDA_VISIBLE_DEVICES"] = gpu_str

        log.info(f'CUDA_VISIBLE_DEVICES: [{os.environ["CUDA_VISIBLE_DEVICES"]}]')
        log.debug(f'CUDA_VISIBLE_DEVICES: [{os.environ["CUDA_VISIBLE_DEVICES"]}]')

    def ddp_train(self, process_idx, model):
        """
13 changes: 9 additions & 4 deletions pytorch_lightning/trainer/distrib_parts.py
@@ -461,10 +461,15 @@ def __transfer_data_to_device(self, batch, device, gpu_id=None):

# when tuple
if isinstance(batch, tuple):
batch = list(batch)
for i, x in enumerate(batch):
batch[i] = self.__transfer_data_to_device(x, device, gpu_id)
return tuple(batch)
# when namedtuple
if hasattr(batch, '_fields'):
elem_type = type(batch)
return elem_type(*(self.__transfer_data_to_device(x, device, gpu_id) for x in batch))
else:
batch = list(batch)
for i, x in enumerate(batch):
batch[i] = self.__transfer_data_to_device(x, device, gpu_id)
return tuple(batch)

# when dict
if isinstance(batch, dict):
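A hedged, self-contained sketch of the behavior this hunk adds (the ``Batch`` namedtuple is illustrative; the device move is stubbed out since ``__transfer_data_to_device`` is private to the trainer):

from collections import namedtuple

import torch

Batch = namedtuple('Batch', ['x', 'y'])
batch = Batch(x=torch.zeros(4, 3), y=torch.ones(4))

# Before this change, a namedtuple fell into the generic tuple branch and
# came back as a plain tuple, losing its field accessors. The `_fields`
# check rebuilds the original namedtuple type instead:
elem_type = type(batch)
rebuilt = elem_type(*(t.float() for t in batch))  # stand-in for the per-element device move
assert isinstance(rebuilt, Batch) and rebuilt.x.shape == (4, 3)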