
Commit

Refactoring train.py, removing OpenCV, adding training results to Tensorboard, bug fixes (#264)

I think moving forward, we'll use smaller PRs. But here are the changes in this one:

Fixes issue #236, which involved rewriting a large portion of train.py such that:

    All the tensorboard event handlers are organized in tensorboard_handlers.py and are only called in train.py to log training and validation results in Tensorboard.
    The code logs the same metrics for training and validation, and also adds the per-class IoU score.
    All single-use functions (e.g. _select_max, _tensor_to_numpy, _select_pred_and_mask) are lambda functions now (see the sketch after this list).
    The code is organized into more meaningful "chunks": e.g. all the optimizer-related code is kept together where possible, and the same goes for logging, configuration, loaders, tensorboard, etc.
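
For illustration, a minimal sketch of what the lambda refactor might look like (the helper names follow the ones mentioned above; the exact bodies in train.py may differ):

```python
# Hypothetical sketch of the single-use helpers rewritten as lambdas;
# exact signatures in train.py may differ.
_select_max = lambda pred_tensor: pred_tensor.max(1)[1]  # argmax over the class dimension
_tensor_to_numpy = lambda pred_tensor: pred_tensor.squeeze().cpu().numpy()  # move a prediction to a numpy array
_select_pred_and_mask = lambda out: (out["y_pred"].squeeze(), out["mask"].squeeze())  # pick tensors for visualization
```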

In addition:

    Fixed a visualization bug where the seismic images were not normalized correctly. This solves Issue #217.
    Fixed a visualization bug where the predictions were not masked where the input image was padded. This improves the ability to visually inspect and evaluate the results. This solves Issue #230.
    Fixes a potential issue where Tensorboard can crash when a large training batch size is used. Now the number of images visualized in Tensorboard from every batch has an upper limit (see the sketch after this list).
    Completely removed OpenCV as a dependency from the DeepSeismic Repo. It was only used in a small part of the code where it wasn't really necessary, and OpenCV is a huge library.
    Fixes Issue #218 where the epoch number for the images in Tensorboard was always logged as 1 (therefore not allowing us to see the epoch number of the different results in Tensorboard).
    Removes the HorovodLRScheduler class since it is no longer used.
    Removes toolz.take from debug mode and uses PyTorch's native Subset() dataset class instead (also illustrated in the sketch after this list).
    Changes default patch size for the HRNet model to 256
    Plus several other minor changes.
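
As a rough, self-contained sketch of the debug-mode and Tensorboard-cap changes (the dataset, cap value, and helper names below are illustrative assumptions, not the repo's exact identifiers):

```python
# Minimal sketch: Subset()-based debug mode and a cap on images logged per batch.
# The dataset, the cap of 8, and limit_batch are stand-ins for illustration only.
import torch
from torch.utils.data import Subset, TensorDataset

dataset = TensorDataset(torch.randn(64, 1, 256, 256))  # stand-in for the seismic patch dataset

debug = True
if debug:
    # Replaces the old toolz.take approach: keep only a few samples so a debug epoch finishes quickly.
    dataset = Subset(dataset, list(range(4)))

MAX_IMAGES = 8  # upper limit on images visualized in Tensorboard from any one batch

def limit_batch(images: torch.Tensor) -> torch.Tensor:
    """Cap how many images from a batch are written to Tensorboard, so large batch sizes cannot crash it."""
    return images[:MAX_IMAGES]
```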


Co-authored-by: Yazeed Alaudah <yalaudah@users.noreply.github.com>
Co-authored-by: Ubuntu <yazeed@yaalauda-dsvm-nd24.jsxrnelwp15e1jpgk5vvfmbzyb.bx.internal.cloudapp.net>
Co-authored-by: Max Kaznady <maxkaz@microsoft.com>
4 people authored Apr 22, 2020
1 parent cf336ee commit e0de54c
Showing 35 changed files with 325 additions and 494 deletions.
3 changes: 2 additions & 1 deletion AUTHORS.md
@@ -9,14 +9,15 @@ Contributors (sorted alphabetically)
-------------------------------------
To contributors: please add your name to the list when you submit a patch to the project.

* Yazeed Alaudah
* Ashish Bhatia
* Sharat Chikkerur
* Daniel Ciborowski
* George Iordanescu
* Ilia Karmanov
* Max Kaznady
* Vanja Paunic
* Mathew Salvaris
* Sharat Chikkerur
* Wee Hyong Tok

## How to be a contributor to the repository
10 changes: 5 additions & 5 deletions README.md
@@ -287,7 +287,7 @@ for the Penobscot dataset follow the same instructions but navigate to the [peno

## Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit [https://cla.opensource.microsoft.com](https://cla.opensource.microsoft.com).

### Submitting a Pull Request

@@ -321,7 +321,7 @@ A typical output will be:
someusername@somevm:/projects/DeepSeismic$ which python
/anaconda/envs/py35/bin/python
```
which will indicate that anaconda folder is __/anaconda__. We'll refer to this location in the instructions below, but you should update the commands according to your local anaconda folder.
which will indicate that anaconda folder is `__/anaconda__`. We'll refer to this location in the instructions below, but you should update the commands according to your local anaconda folder.

<details>
<summary><b>Data Science Virtual Machine conda package installation errors</b></summary>
@@ -339,7 +339,7 @@ which will indicate that anaconda folder is __/anaconda__. We'll refer to this l
<details>
<summary><b>Data Science Virtual Machine conda package installation warnings</b></summary>

It could happen that while creating the conda environment defined by environment/anaconda/local/environment.yml on an Ubuntu DSVM, one can get multiple warnings like so:
It could happen that while creating the conda environment defined by `environment/anaconda/local/environment.yml` on an Ubuntu DSVM, one can get multiple warnings like so:
```
WARNING conda.gateways.disk.delete:unlink_or_rename_to_trash(140): Could not remove or rename /anaconda/pkgs/ipywidgets-7.5.1-py_0/site-packages/ipywidgets-7.5.1.dist-info/LICENSE. Please remove this file manually (you may need to reboot to free file handles)
```
@@ -350,7 +350,7 @@ which will indicate that anaconda folder is __/anaconda__. We'll refer to this l
sudo chown -R $USER /anaconda
```

After these command completes, try creating the conda environment in __environment/anaconda/local/environment.yml__ again.
After these command completes, try creating the conda environment in `__environment/anaconda/local/environment.yml__` again.

</details>

@@ -395,7 +395,7 @@ which will indicate that anaconda folder is __/anaconda__. We'll refer to this l
<details>
<summary><b>GPU out of memory errors</b></summary>

You should be able to see how much GPU memory your process is using by running
You should be able to see how much GPU memory your process is using by running:
```bash
nvidia-smi
```
@@ -9,6 +9,7 @@ WORKERS: 4
PRINT_FREQ: 10
LOG_CONFIG: logging.conf
SEED: 2019
OPENCV_BORDER_CONSTANT: 0


DATASET:
@@ -73,7 +74,7 @@ TRAIN:
WEIGHT_DECAY: 0.0001
SNAPSHOTS: 5
AUGMENTATION: True
DEPTH: "section" #"patch" # Options are No, Patch and Section
DEPTH: "section" #"patch" # Options are none, patch, and section
STRIDE: 50
PATCH_SIZE: 100
AUGMENTATIONS:
@@ -30,7 +30,7 @@ TRAIN:
WEIGHT_DECAY: 0.0001
SNAPSHOTS: 5
AUGMENTATION: True
DEPTH: "none" #"patch" # Options are None, Patch and Section
DEPTH: "none" #"patch" # Options are none, patch, and section
STRIDE: 50
PATCH_SIZE: 99
AUGMENTATIONS:
@@ -30,7 +30,7 @@ TRAIN:
WEIGHT_DECAY: 0.0001
SNAPSHOTS: 5
AUGMENTATION: True
DEPTH: "none" #"patch" # Options are None, Patch and Section
DEPTH: "none" #"patch" # Options are none, patch, and section
STRIDE: 50
PATCH_SIZE: 99
AUGMENTATIONS:
@@ -30,7 +30,7 @@ TRAIN:
WEIGHT_DECAY: 0.0001
SNAPSHOTS: 5
AUGMENTATION: True
DEPTH: "section" # Options are No, Patch and Section
DEPTH: "section" # Options are none, patch, and section
STRIDE: 50
PATCH_SIZE: 100
AUGMENTATIONS:
@@ -33,7 +33,7 @@ TRAIN:
WEIGHT_DECAY: 0.0001
SNAPSHOTS: 5
AUGMENTATION: True
DEPTH: "section" # Options are No, Patch and Section
DEPTH: "section" # Options are none, patch, and section
STRIDE: 50
PATCH_SIZE: 100
AUGMENTATIONS:
@@ -20,6 +20,7 @@
_C.PIN_MEMORY = True
_C.LOG_CONFIG = "logging.conf"
_C.SEED = 42
_C.OPENCV_BORDER_CONSTANT = 0

# Cudnn related params
_C.CUDNN = CN()
@@ -58,7 +59,7 @@
_C.TRAIN.PATCH_SIZE = 99
_C.TRAIN.MEAN = 0.0009997 # 0.0009996710808862074
_C.TRAIN.STD = 0.21 # 0.20976548783479299
_C.TRAIN.DEPTH = "None" # Options are None, Patch and Section
_C.TRAIN.DEPTH = "none" # Options are: none, patch, and section
# None adds no depth information and the num of channels remains at 1
# Patch adds depth per patch so is simply the height of that patch from 0 to 1, channels=3
# Section adds depth per section so contains depth information for the whole section, channels=3
118 changes: 49 additions & 69 deletions contrib/experiments/interpretation/dutchf3_patch/distributed/train.py
@@ -21,59 +21,30 @@
import os
from os import path

import cv2
import fire
import numpy as np
import toolz
import torch
from albumentations import Compose, HorizontalFlip, Normalize, Resize, PadIfNeeded
from cv_lib.utils import load_log_configuration
from cv_lib.event_handlers import (
SnapshotHandler,
logging_handlers,
tensorboard_handlers,
)
from cv_lib.event_handlers.logging_handlers import Evaluator
from cv_lib.event_handlers.tensorboard_handlers import (
create_image_writer,
create_summary_writer,
)
from cv_lib.segmentation import models
from cv_lib.segmentation import extract_metric_from
from deepseismic_interpretation.dutchf3.data import get_patch_loader, decode_segmap
from cv_lib.segmentation.dutchf3.engine import (
create_supervised_evaluator,
create_supervised_trainer,
)

from ignite.metrics import Loss
from cv_lib.segmentation.metrics import (
pixelwise_accuracy,
class_accuracy,
mean_class_accuracy,
class_iou,
mean_iou,
)

from cv_lib.segmentation.dutchf3.utils import (
current_datetime,
generate_path,
git_branch,
git_hash,
np_to_tb,
)
from default import _C as config
from default import update_config
from ignite.contrib.handlers import (
ConcatScheduler,
CosineAnnealingScheduler,
LinearCyclicalScheduler,
)
from albumentations import Compose, HorizontalFlip, Normalize, PadIfNeeded, Resize
from ignite.contrib.handlers import ConcatScheduler, CosineAnnealingScheduler, LinearCyclicalScheduler
from ignite.engine import Events
from ignite.metrics import Loss
from ignite.utils import convert_tensor
from toolz import compose, curry
from torch.utils import data

from cv_lib.event_handlers import SnapshotHandler, logging_handlers, tensorboard_handlers
from cv_lib.event_handlers.logging_handlers import Evaluator
from cv_lib.event_handlers.tensorboard_handlers import create_image_writer, create_summary_writer
from cv_lib.segmentation import extract_metric_from, models
from cv_lib.segmentation.dutchf3.engine import create_supervised_evaluator, create_supervised_trainer
from cv_lib.segmentation.dutchf3.utils import current_datetime, generate_path, git_branch, git_hash, np_to_tb
from cv_lib.segmentation.metrics import class_accuracy, class_iou, mean_class_accuracy, mean_iou, pixelwise_accuracy
from cv_lib.utils import load_log_configuration
from deepseismic_interpretation.dutchf3.data import decode_segmap, get_patch_loader
from default import _C as config
from default import update_config


def prepare_batch(batch, device=None, non_blocking=False):
x, y = batch
@@ -123,7 +94,7 @@ def run(*options, cfg=None, local_rank=0, debug=False):
# provide environment variables, and requires that you use init_method=`env://`.
torch.distributed.init_process_group(backend="nccl", init_method="env://")

scheduler_step = config.TRAIN.END_EPOCH // config.TRAIN.SNAPSHOTS
epochs_per_cycle = config.TRAIN.END_EPOCH // config.TRAIN.SNAPSHOTS
torch.backends.cudnn.benchmark = config.CUDNN.BENCHMARK

torch.manual_seed(config.SEED)
@@ -137,7 +108,7 @@ def run(*options, cfg=None, local_rank=0, debug=False):
PadIfNeeded(
min_height=config.TRAIN.PATCH_SIZE,
min_width=config.TRAIN.PATCH_SIZE,
border_mode=cv2.BORDER_CONSTANT,
border_mode=config.OPENCV_BORDER_CONSTANT,
always_apply=True,
mask_value=255,
),
@@ -147,7 +118,7 @@ def run(*options, cfg=None, local_rank=0, debug=False):
PadIfNeeded(
min_height=config.TRAIN.AUGMENTATIONS.PAD.HEIGHT,
min_width=config.TRAIN.AUGMENTATIONS.PAD.WIDTH,
border_mode=cv2.BORDER_CONSTANT,
border_mode=config.OPENCV_BORDER_CONSTANT,
always_apply=True,
mask_value=255,
),
@@ -185,15 +156,16 @@ def run(*options, cfg=None, local_rank=0, debug=False):
logger.info(f"Validation examples {len(val_set)}")
n_classes = train_set.n_classes

#if debug:
#val_set = data.Subset(val_set, range(config.VALIDATION.BATCH_SIZE_PER_GPU))
#train_set = data.Subset(train_set, range(config.TRAIN.BATCH_SIZE_PER_GPU*2))
if debug:
logger.info("Running in debug mode..")
train_set = data.Subset(train_set, list(range(4)))
val_set = data.Subset(val_set, list(range(4)))

logger.info(f"Training examples {len(train_set)}")
logger.info(f"Validation examples {len(val_set)}")

train_sampler = torch.utils.data.distributed.DistributedSampler(train_set, num_replicas=world_size, rank=local_rank)

train_sampler = torch.utils.data.distributed.DistributedSampler(train_set, num_replicas=world_size, rank=local_rank)
train_loader = data.DataLoader(
train_set, batch_size=config.TRAIN.BATCH_SIZE_PER_GPU, num_workers=config.WORKERS, sampler=train_sampler,
)
@@ -226,9 +198,7 @@ def run(*options, cfg=None, local_rank=0, debug=False):

model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[device], find_unused_parameters=True)

snapshot_duration = scheduler_step * len(train_loader)
if debug:
snapshot_duration = 2
snapshot_duration = epochs_per_cycle * len(train_loader) if not debug else 2*len(train_loader)
warmup_duration = 5 * len(train_loader)
warmup_scheduler = LinearCyclicalScheduler(
optimizer,
Expand All @@ -238,7 +208,7 @@ def run(*options, cfg=None, local_rank=0, debug=False):
cycle_size=10 * len(train_loader),
)
cosine_scheduler = CosineAnnealingScheduler(
optimizer, "lr", config.TRAIN.MAX_LR * world_size, config.TRAIN.MIN_LR * world_size, snapshot_duration,
optimizer, "lr", config.TRAIN.MAX_LR * world_size, config.TRAIN.MIN_LR * world_size, cycle_size=snapshot_duration,
)

scheduler = ConcatScheduler(schedulers=[warmup_scheduler, cosine_scheduler], durations=[warmup_duration])
@@ -270,18 +240,27 @@ def _select_pred_and_mask(model_out_dict):
device=device,
)

# Set the validation run to start on the epoch completion of the training run
# Set the validation run to start on the epoch completion of the training run

trainer.add_event_handler(Events.EPOCH_COMPLETED, Evaluator(evaluator, val_loader))

if local_rank == 0: # Run only on master process

trainer.add_event_handler(
Events.ITERATION_COMPLETED, logging_handlers.log_training_output(log_interval=config.TRAIN.BATCH_SIZE_PER_GPU),
Events.ITERATION_COMPLETED,
logging_handlers.log_training_output(log_interval=config.TRAIN.BATCH_SIZE_PER_GPU),
)
trainer.add_event_handler(Events.EPOCH_STARTED, logging_handlers.log_lr(optimizer))
trainer.add_event_handler(Events.EPOCH_STARTED, logging_handlers.log_lr(optimizer))

try:
output_dir = generate_path(config.OUTPUT_DIR, git_branch(), git_hash(), config_file_name, config.TRAIN.MODEL_DIR, current_datetime(),)
output_dir = generate_path(
config.OUTPUT_DIR,
git_branch(),
git_hash(),
config_file_name,
config.TRAIN.MODEL_DIR,
current_datetime(),
)
except TypeError:
output_dir = generate_path(config.OUTPUT_DIR, config_file_name, config.TRAIN.MODEL_DIR, current_datetime(),)

@@ -322,9 +301,7 @@ def _tensor_to_numpy(pred_tensor):
return pred_tensor.squeeze().cpu().numpy()

transform_func = compose(np_to_tb, decode_segmap(n_classes=n_classes), _tensor_to_numpy)

transform_pred = compose(transform_func, _select_max)

evaluator.add_event_handler(
Events.EPOCH_COMPLETED, create_image_writer(summary_writer, "Validation/Image", "image"),
)
@@ -341,19 +318,22 @@ def snapshot_function():
return (trainer.state.iteration % snapshot_duration) == 0

checkpoint_handler = SnapshotHandler(
output_dir,
config.MODEL.NAME,
extract_metric_from("mIoU"),
snapshot_function,
output_dir, config.MODEL.NAME, extract_metric_from("mIoU"), snapshot_function,
)
evaluator.add_event_handler(Events.EPOCH_COMPLETED, checkpoint_handler, {"model": model})

logger.info("Starting training")

if debug:
trainer.run(train_loader, max_epochs=config.TRAIN.END_EPOCH, epoch_length = config.TRAIN.BATCH_SIZE_PER_GPU*2, seed = config.SEED)
trainer.run(
train_loader,
max_epochs=config.TRAIN.END_EPOCH,
epoch_length=config.TRAIN.BATCH_SIZE_PER_GPU * 2,
seed=config.SEED,
)
else:
trainer.run(train_loader, max_epochs=config.TRAIN.END_EPOCH, epoch_length = len(train_loader), seed = config.SEED)
trainer.run(
train_loader, max_epochs=config.TRAIN.END_EPOCH, epoch_length=len(train_loader), seed=config.SEED
)


if __name__ == "__main__":
@@ -21,6 +21,7 @@
_C.PIN_MEMORY = True
_C.LOG_CONFIG = "./logging.conf" # Logging config file relative to the experiment
_C.SEED = 42
_C.OPENCV_BORDER_CONSTANT = 0

# Cudnn related params
_C.CUDNN = CN()
@@ -55,7 +56,7 @@
_C.TRAIN.AUGMENTATION = True
_C.TRAIN.MEAN = 0.0009997 # 0.0009996710808862074
_C.TRAIN.STD = 0.20977 # 0.20976548783479299
_C.TRAIN.DEPTH = "none" # Options are 'none', 'patch' and 'section'
_C.TRAIN.DEPTH = "none" # Options are: none, patch, and section
# None adds no depth information and the num of channels remains at 1
# Patch adds depth per patch so is simply the height of that patch from 0 to 1, channels=3
# Section adds depth per section so contains depth information for the whole section, channels=3
@@ -84,7 +84,7 @@ def run(*options, cfg=None, debug=False):
load_log_configuration(config.LOG_CONFIG)
logger = logging.getLogger(__name__)
logger.debug(config.WORKERS)
scheduler_step = config.TRAIN.END_EPOCH // config.TRAIN.SNAPSHOTS
epochs_per_cycle = config.TRAIN.END_EPOCH // config.TRAIN.SNAPSHOTS
torch.backends.cudnn.benchmark = config.CUDNN.BENCHMARK

torch.manual_seed(config.SEED)
@@ -164,8 +164,8 @@ def __len__(self):

summary_writer = create_summary_writer(log_dir=path.join(output_dir, config.LOG_DIR))

snapshot_duration = scheduler_step * len(train_loader)
scheduler = CosineAnnealingScheduler(optimizer, "lr", config.TRAIN.MAX_LR, config.TRAIN.MIN_LR, snapshot_duration)
snapshot_duration = epochs_per_cycle * len(train_loader) if not debug else 2*len(train_loader)
scheduler = CosineAnnealingScheduler(optimizer, "lr", config.TRAIN.MAX_LR, config.TRAIN.MIN_LR, cycle_size=snapshot_duration)

# weights are inversely proportional to the frequency of the classes in
# the training set
@@ -29,7 +29,7 @@ TRAIN:
LR: 0.02
MOMENTUM: 0.9
WEIGHT_DECAY: 0.0001
DEPTH: "voxel" # Options are No, Patch, Section and Voxel
DEPTH: "voxel" # Options are none, patch, section and voxel
MODEL_DIR: "models"

VALIDATION: