
Commit

Refactoring train.py, removing OpenCV, adding training results to Tensorboard, bug fixes (#264)

I think moving forward, we'll use smaller PRs. But here are the changes in this one:

Fixes issue #236, which involved rewriting a large portion of train.py such that:

    All the tensorboard event handlers are organized in tensorboard_handlers.py and are only called in train.py to log training and validation results in Tensorboard.
    The code logs the same metrics for training and validation, and also adds the per-class IoU score.
    All single-use functions (e.g. _select_max, _tensor_to_numpy, _select_pred_and_mask) are lambda functions now (see the sketch after this list).
    The code is organized into more meaningful "chunks": e.g. all the optimizer-related code is kept together where possible, and the same goes for logging, configuration, loaders, tensorboard, etc.
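
For illustration, a minimal sketch of what the lambda refactor might look like (the helper names follow the ones mentioned above; the exact bodies in train.py may differ):

```python
# Hypothetical sketch of the single-use helpers rewritten as lambdas;
# exact signatures in train.py may differ.
_select_max = lambda pred_tensor: pred_tensor.max(1)[1]  # argmax over the class dimension
_tensor_to_numpy = lambda pred_tensor: pred_tensor.squeeze().cpu().numpy()  # move a prediction to a numpy array
_select_pred_and_mask = lambda out: (out["y_pred"].squeeze(), out["mask"].squeeze())  # pick tensors for visualization
```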

In addition:

    Fixed a visualization bug where the seismic images were not normalized correctly. This solves Issue #217.
    Fixed a visualization bug where the predictions were not masked where the input image was padded. This improves the ability to visually inspect and evaluate the results. This solves Issue #230.
    Fixes a potential issue where Tensorboard can crash when a large training batch size is used. Now the number of images visualized in Tensorboard from every batch has an upper limit (see the sketch after this list).
    Completely removed OpenCV as a dependency from the DeepSeismic Repo. It was only used in a small part of the code where it wasn't really necessary, and OpenCV is a huge library.
    Fixes Issue #218 where the epoch number for the images in Tensorboard was always logged as 1 (therefore not allowing us to see the epoch number of the different results in Tensorboard).
    Removes the HorovodLRScheduler class since it is no longer used.
    Removes toolz.take from debug mode and uses PyTorch's native Subset() dataset class instead (also illustrated in the sketch after this list).
    Changes default patch size for the HRNet model to 256
    Plus several other minor changes.
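
As a rough, self-contained sketch of the debug-mode and Tensorboard-cap changes (the dataset, cap value, and helper names below are illustrative assumptions, not the repo's exact identifiers):

```python
# Minimal sketch: Subset()-based debug mode and a cap on images logged per batch.
# The dataset, the cap of 8, and limit_batch are stand-ins for illustration only.
import torch
from torch.utils.data import Subset, TensorDataset

dataset = TensorDataset(torch.randn(64, 1, 256, 256))  # stand-in for the seismic patch dataset

debug = True
if debug:
    # Replaces the old toolz.take approach: keep only a few samples so a debug epoch finishes quickly.
    dataset = Subset(dataset, list(range(4)))

MAX_IMAGES = 8  # upper limit on images visualized in Tensorboard from any one batch

def limit_batch(images: torch.Tensor) -> torch.Tensor:
    """Cap how many images from a batch are written to Tensorboard, so large batch sizes cannot crash it."""
    return images[:MAX_IMAGES]
```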


Co-authored-by: Yazeed Alaudah <yalaudah@users.noreply.github.com>
Co-authored-by: Ubuntu <yazeed@yaalauda-dsvm-nd24.jsxrnelwp15e1jpgk5vvfmbzyb.bx.internal.cloudapp.net>
Co-authored-by: Max Kaznady <maxkaz@microsoft.com>
4 people authored Apr 22, 2020
1 parent cf336ee commit e0de54c
Showing 35 changed files with 325 additions and 494 deletions.
3 changes: 2 additions & 1 deletion AUTHORS.md
@@ -9,14 +9,15 @@ Contributors (sorted alphabetically)
-------------------------------------
To contributors: please add your name to the list when you submit a patch to the project.

* Yazeed Alaudah
* Ashish Bhatia
* Sharat Chikkerur
* Daniel Ciborowski
* George Iordanescu
* Ilia Karmanov
* Max Kaznady
* Vanja Paunic
* Mathew Salvaris
* Sharat Chikkerur
* Wee Hyong Tok

## How to be a contributor to the repository
10 changes: 5 additions & 5 deletions README.md
@@ -287,7 +287,7 @@ for the Penobscot dataset follow the same instructions but navigate to the [peno

## Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit [https://cla.opensource.microsoft.com](https://cla.opensource.microsoft.com).

### Submitting a Pull Request

@@ -321,7 +321,7 @@ A typical output will be:
someusername@somevm:/projects/DeepSeismic$ which python
/anaconda/envs/py35/bin/python
```
which will indicate that anaconda folder is __/anaconda__. We'll refer to this location in the instructions below, but you should update the commands according to your local anaconda folder.
which will indicate that anaconda folder is `__/anaconda__`. We'll refer to this location in the instructions below, but you should update the commands according to your local anaconda folder.

<details>
<summary><b>Data Science Virtual Machine conda package installation errors</b></summary>
@@ -339,7 +339,7 @@ which will indicate that anaconda folder is __/anaconda__. We'll refer to this l
<details>
<summary><b>Data Science Virtual Machine conda package installation warnings</b></summary>

It could happen that while creating the conda environment defined by environment/anaconda/local/environment.yml on an Ubuntu DSVM, one can get multiple warnings like so:
It could happen that while creating the conda environment defined by `environment/anaconda/local/environment.yml` on an Ubuntu DSVM, one can get multiple warnings like so:
```
WARNING conda.gateways.disk.delete:unlink_or_rename_to_trash(140): Could not remove or rename /anaconda/pkgs/ipywidgets-7.5.1-py_0/site-packages/ipywidgets-7.5.1.dist-info/LICENSE. Please remove this file manually (you may need to reboot to free file handles)
```
@@ -350,7 +350,7 @@ which will indicate that anaconda folder is __/anaconda__. We'll refer to this l
sudo chown -R $USER /anaconda
```

After these command completes, try creating the conda environment in __environment/anaconda/local/environment.yml__ again.
After these command completes, try creating the conda environment in `__environment/anaconda/local/environment.yml__` again.

</details>

@@ -395,7 +395,7 @@ which will indicate that anaconda folder is __/anaconda__. We'll refer to this l
<details>
<summary><b>GPU out of memory errors</b></summary>

You should be able to see how much GPU memory your process is using by running
You should be able to see how much GPU memory your process is using by running:
```bash
nvidia-smi
```
@@ -9,6 +9,7 @@ WORKERS: 4
PRINT_FREQ: 10
LOG_CONFIG: logging.conf
SEED: 2019
OPENCV_BORDER_CONSTANT: 0


DATASET:
@@ -73,7 +74,7 @@ TRAIN:
WEIGHT_DECAY: 0.0001
SNAPSHOTS: 5
AUGMENTATION: True
DEPTH: "section" #"patch" # Options are No, Patch and Section
DEPTH: "section" #"patch" # Options are none, patch, and section
STRIDE: 50
PATCH_SIZE: 100
AUGMENTATIONS:
@@ -30,7 +30,7 @@ TRAIN:
WEIGHT_DECAY: 0.0001
SNAPSHOTS: 5
AUGMENTATION: True
DEPTH: "none" #"patch" # Options are None, Patch and Section
DEPTH: "none" #"patch" # Options are none, patch, and section
STRIDE: 50
PATCH_SIZE: 99
AUGMENTATIONS:
@@ -30,7 +30,7 @@ TRAIN:
WEIGHT_DECAY: 0.0001
SNAPSHOTS: 5
AUGMENTATION: True
DEPTH: "none" #"patch" # Options are None, Patch and Section
DEPTH: "none" #"patch" # Options are none, patch, and section
STRIDE: 50
PATCH_SIZE: 99
AUGMENTATIONS:
@@ -30,7 +30,7 @@ TRAIN:
WEIGHT_DECAY: 0.0001
SNAPSHOTS: 5
AUGMENTATION: True
DEPTH: "section" # Options are No, Patch and Section
DEPTH: "section" # Options are none, patch, and section
STRIDE: 50
PATCH_SIZE: 100
AUGMENTATIONS:
@@ -33,7 +33,7 @@ TRAIN:
WEIGHT_DECAY: 0.0001
SNAPSHOTS: 5
AUGMENTATION: True
DEPTH: "section" # Options are No, Patch and Section
DEPTH: "section" # Options are none, patch, and section
STRIDE: 50
PATCH_SIZE: 100
AUGMENTATIONS:
@@ -20,6 +20,7 @@
_C.PIN_MEMORY = True
_C.LOG_CONFIG = "logging.conf"
_C.SEED = 42
_C.OPENCV_BORDER_CONSTANT = 0

# Cudnn related params
_C.CUDNN = CN()
@@ -58,7 +59,7 @@
_C.TRAIN.PATCH_SIZE = 99
_C.TRAIN.MEAN = 0.0009997 # 0.0009996710808862074
_C.TRAIN.STD = 0.21 # 0.20976548783479299
_C.TRAIN.DEPTH = "None" # Options are None, Patch and Section
_C.TRAIN.DEPTH = "none" # Options are: none, patch, and section
# None adds no depth information and the num of channels remains at 1
# Patch adds depth per patch so is simply the height of that patch from 0 to 1, channels=3
# Section adds depth per section so contains depth information for the whole section, channels=3
118 changes: 49 additions & 69 deletions contrib/experiments/interpretation/dutchf3_patch/distributed/train.py
@@ -21,59 +21,30 @@
import os
from os import path

import cv2
import fire
import numpy as np
import toolz
import torch
from albumentations import Compose, HorizontalFlip, Normalize, Resize, PadIfNeeded
from cv_lib.utils import load_log_configuration
from cv_lib.event_handlers import (
SnapshotHandler,
logging_handlers,
tensorboard_handlers,
)
from cv_lib.event_handlers.logging_handlers import Evaluator
from cv_lib.event_handlers.tensorboard_handlers import (
create_image_writer,
create_summary_writer,
)
from cv_lib.segmentation import models
from cv_lib.segmentation import extract_metric_from
from deepseismic_interpretation.dutchf3.data import get_patch_loader, decode_segmap
from cv_lib.segmentation.dutchf3.engine import (
create_supervised_evaluator,
create_supervised_trainer,
)

from ignite.metrics import Loss
from cv_lib.segmentation.metrics import (
pixelwise_accuracy,
class_accuracy,
mean_class_accuracy,
class_iou,
mean_iou,
)

from cv_lib.segmentation.dutchf3.utils import (
current_datetime,
generate_path,
git_branch,
git_hash,
np_to_tb,
)
from default import _C as config
from default import update_config
from ignite.contrib.handlers import (
ConcatScheduler,
CosineAnnealingScheduler,
LinearCyclicalScheduler,
)
from albumentations import Compose, HorizontalFlip, Normalize, PadIfNeeded, Resize
from ignite.contrib.handlers import ConcatScheduler, CosineAnnealingScheduler, LinearCyclicalScheduler
from ignite.engine import Events
from ignite.metrics import Loss
from ignite.utils import convert_tensor
from toolz import compose, curry
from torch.utils import data

from cv_lib.event_handlers import SnapshotHandler, logging_handlers, tensorboard_handlers
from cv_lib.event_handlers.logging_handlers import Evaluator
from cv_lib.event_handlers.tensorboard_handlers import create_image_writer, create_summary_writer
from cv_lib.segmentation import extract_metric_from, models
from cv_lib.segmentation.dutchf3.engine import create_supervised_evaluator, create_supervised_trainer
from cv_lib.segmentation.dutchf3.utils import current_datetime, generate_path, git_branch, git_hash, np_to_tb
from cv_lib.segmentation.metrics import class_accuracy, class_iou, mean_class_accuracy, mean_iou, pixelwise_accuracy
from cv_lib.utils import load_log_configuration
from deepseismic_interpretation.dutchf3.data import decode_segmap, get_patch_loader
from default import _C as config
from default import update_config


def prepare_batch(batch, device=None, non_blocking=False):
x, y = batch
@@ -123,7 +94,7 @@ def run(*options, cfg=None, local_rank=0, debug=False):
# provide environment variables, and requires that you use init_method=`env://`.
torch.distributed.init_process_group(backend="nccl", init_method="env://")

scheduler_step = config.TRAIN.END_EPOCH // config.TRAIN.SNAPSHOTS
epochs_per_cycle = config.TRAIN.END_EPOCH // config.TRAIN.SNAPSHOTS
torch.backends.cudnn.benchmark = config.CUDNN.BENCHMARK

torch.manual_seed(config.SEED)
@@ -137,7 +108,7 @@ def run(*options, cfg=None, local_rank=0, debug=False):
PadIfNeeded(
min_height=config.TRAIN.PATCH_SIZE,
min_width=config.TRAIN.PATCH_SIZE,
border_mode=cv2.BORDER_CONSTANT,
border_mode=config.OPENCV_BORDER_CONSTANT,
always_apply=True,
mask_value=255,
),
@@ -147,7 +118,7 @@ def run(*options, cfg=None, local_rank=0, debug=False):
PadIfNeeded(
min_height=config.TRAIN.AUGMENTATIONS.PAD.HEIGHT,
min_width=config.TRAIN.AUGMENTATIONS.PAD.WIDTH,
border_mode=cv2.BORDER_CONSTANT,
border_mode=config.OPENCV_BORDER_CONSTANT,
always_apply=True,
mask_value=255,
),
@@ -185,15 +156,16 @@ def run(*options, cfg=None, local_rank=0, debug=False):
logger.info(f"Validation examples {len(val_set)}")
n_classes = train_set.n_classes

#if debug:
#val_set = data.Subset(val_set, range(config.VALIDATION.BATCH_SIZE_PER_GPU))
#train_set = data.Subset(train_set, range(config.TRAIN.BATCH_SIZE_PER_GPU*2))
if debug:
logger.info("Running in debug mode..")
train_set = data.Subset(train_set, list(range(4)))
val_set = data.Subset(val_set, list(range(4)))

logger.info(f"Training examples {len(train_set)}")
logger.info(f"Validation examples {len(val_set)}")

train_sampler = torch.utils.data.distributed.DistributedSampler(train_set, num_replicas=world_size, rank=local_rank)

train_sampler = torch.utils.data.distributed.DistributedSampler(train_set, num_replicas=world_size, rank=local_rank)
train_loader = data.DataLoader(
train_set, batch_size=config.TRAIN.BATCH_SIZE_PER_GPU, num_workers=config.WORKERS, sampler=train_sampler,
)
@@ -226,9 +198,7 @@ def run(*options, cfg=None, local_rank=0, debug=False):

model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[device], find_unused_parameters=True)

snapshot_duration = scheduler_step * len(train_loader)
if debug:
snapshot_duration = 2
snapshot_duration = epochs_per_cycle * len(train_loader) if not debug else 2*len(train_loader)
warmup_duration = 5 * len(train_loader)
warmup_scheduler = LinearCyclicalScheduler(
optimizer,
Expand All @@ -238,7 +208,7 @@ def run(*options, cfg=None, local_rank=0, debug=False):
cycle_size=10 * len(train_loader),
)
cosine_scheduler = CosineAnnealingScheduler(
optimizer, "lr", config.TRAIN.MAX_LR * world_size, config.TRAIN.MIN_LR * world_size, snapshot_duration,
optimizer, "lr", config.TRAIN.MAX_LR * world_size, config.TRAIN.MIN_LR * world_size, cycle_size=snapshot_duration,
)

scheduler = ConcatScheduler(schedulers=[warmup_scheduler, cosine_scheduler], durations=[warmup_duration])
@@ -270,18 +240,27 @@ def _select_pred_and_mask(model_out_dict):
device=device,
)

# Set the validation run to start on the epoch completion of the training run
# Set the validation run to start on the epoch completion of the training run

trainer.add_event_handler(Events.EPOCH_COMPLETED, Evaluator(evaluator, val_loader))

if local_rank == 0: # Run only on master process

trainer.add_event_handler(
Events.ITERATION_COMPLETED, logging_handlers.log_training_output(log_interval=config.TRAIN.BATCH_SIZE_PER_GPU),
Events.ITERATION_COMPLETED,
logging_handlers.log_training_output(log_interval=config.TRAIN.BATCH_SIZE_PER_GPU),
)
trainer.add_event_handler(Events.EPOCH_STARTED, logging_handlers.log_lr(optimizer))
trainer.add_event_handler(Events.EPOCH_STARTED, logging_handlers.log_lr(optimizer))

try:
output_dir = generate_path(config.OUTPUT_DIR, git_branch(), git_hash(), config_file_name, config.TRAIN.MODEL_DIR, current_datetime(),)
output_dir = generate_path(
config.OUTPUT_DIR,
git_branch(),
git_hash(),
config_file_name,
config.TRAIN.MODEL_DIR,
current_datetime(),
)
except TypeError:
output_dir = generate_path(config.OUTPUT_DIR, config_file_name, config.TRAIN.MODEL_DIR, current_datetime(),)

@@ -322,9 +301,7 @@ def _tensor_to_numpy(pred_tensor):
return pred_tensor.squeeze().cpu().numpy()

transform_func = compose(np_to_tb, decode_segmap(n_classes=n_classes), _tensor_to_numpy)

transform_pred = compose(transform_func, _select_max)

evaluator.add_event_handler(
Events.EPOCH_COMPLETED, create_image_writer(summary_writer, "Validation/Image", "image"),
)
@@ -341,19 +318,22 @@ def snapshot_function():
return (trainer.state.iteration % snapshot_duration) == 0

checkpoint_handler = SnapshotHandler(
output_dir,
config.MODEL.NAME,
extract_metric_from("mIoU"),
snapshot_function,
output_dir, config.MODEL.NAME, extract_metric_from("mIoU"), snapshot_function,
)
evaluator.add_event_handler(Events.EPOCH_COMPLETED, checkpoint_handler, {"model": model})

logger.info("Starting training")

if debug:
trainer.run(train_loader, max_epochs=config.TRAIN.END_EPOCH, epoch_length = config.TRAIN.BATCH_SIZE_PER_GPU*2, seed = config.SEED)
trainer.run(
train_loader,
max_epochs=config.TRAIN.END_EPOCH,
epoch_length=config.TRAIN.BATCH_SIZE_PER_GPU * 2,
seed=config.SEED,
)
else:
trainer.run(train_loader, max_epochs=config.TRAIN.END_EPOCH, epoch_length = len(train_loader), seed = config.SEED)
trainer.run(
train_loader, max_epochs=config.TRAIN.END_EPOCH, epoch_length=len(train_loader), seed=config.SEED
)


if __name__ == "__main__":
@@ -21,6 +21,7 @@
_C.PIN_MEMORY = True
_C.LOG_CONFIG = "./logging.conf" # Logging config file relative to the experiment
_C.SEED = 42
_C.OPENCV_BORDER_CONSTANT = 0

# Cudnn related params
_C.CUDNN = CN()
@@ -55,7 +56,7 @@
_C.TRAIN.AUGMENTATION = True
_C.TRAIN.MEAN = 0.0009997 # 0.0009996710808862074
_C.TRAIN.STD = 0.20977 # 0.20976548783479299
_C.TRAIN.DEPTH = "none" # Options are 'none', 'patch' and 'section'
_C.TRAIN.DEPTH = "none" # Options are: none, patch, and section
# None adds no depth information and the num of channels remains at 1
# Patch adds depth per patch so is simply the height of that patch from 0 to 1, channels=3
# Section adds depth per section so contains depth information for the whole section, channels=3
@@ -84,7 +84,7 @@ def run(*options, cfg=None, debug=False):
load_log_configuration(config.LOG_CONFIG)
logger = logging.getLogger(__name__)
logger.debug(config.WORKERS)
scheduler_step = config.TRAIN.END_EPOCH // config.TRAIN.SNAPSHOTS
epochs_per_cycle = config.TRAIN.END_EPOCH // config.TRAIN.SNAPSHOTS
torch.backends.cudnn.benchmark = config.CUDNN.BENCHMARK

torch.manual_seed(config.SEED)
@@ -164,8 +164,8 @@ def __len__(self):

summary_writer = create_summary_writer(log_dir=path.join(output_dir, config.LOG_DIR))

snapshot_duration = scheduler_step * len(train_loader)
scheduler = CosineAnnealingScheduler(optimizer, "lr", config.TRAIN.MAX_LR, config.TRAIN.MIN_LR, snapshot_duration)
snapshot_duration = epochs_per_cycle * len(train_loader) if not debug else 2*len(train_loader)
scheduler = CosineAnnealingScheduler(optimizer, "lr", config.TRAIN.MAX_LR, config.TRAIN.MIN_LR, cycle_size=snapshot_duration)

# weights are inversely proportional to the frequency of the classes in
# the training set
@@ -29,7 +29,7 @@ TRAIN:
LR: 0.02
MOMENTUM: 0.9
WEIGHT_DECAY: 0.0001
DEPTH: "voxel" # Options are No, Patch, Section and Voxel
DEPTH: "voxel" # Options are none, patch, section and voxel
MODEL_DIR: "models"

VALIDATION: