Refactoring train.py to enhance readability, maintainability, and add training results to Tensborboard. #236

yalaudah · 2020-03-26T18:04:37Z

In the process of fixing Issue #230 , I ran into several issues in train.py that needed to be fixed. Refactoring train.py will to allow for better readability and maintainability for the code. It will also potentially allow us to reduce a lot of code duplication across dutchf3/penobscot and local/distributed scripts.

Here are some of the issues that should be addressed:

All the tensorboard event handlers should be organized in tensorboard_handlers.py and only called in train.py to log training and validation results in Tensorboard
The code should log the same results for training and validation. At the very minimum, we should be able to see training images, masks, and predictions.
All single-use functions (e.g. _select_max, _tensor_to_numpy, _select_pred_and_mask) should be lambda functions
The code should be organized in meaningful "chunks".. e.g. all the optimizer-related code should be together if possible, same thing for logging, configuration, loaders, tensorboard, ..etc. This is somewhat already the case but can be improved.

The text was updated successfully, but these errors were encountered:

…nsborboard, bug fixes (#264) I think moving forward, we'll use smaller PRs. But here are the changes in this one: Fixes issue #236 that involves rewriting a big portion of train.py such that: All the tensorboard event handlers are organized in tensorboard_handlers.py and only called in train.py to log training and validation results in Tensorboard The code logs the same results for training and validation. Also, it adds the class IoU score as well. All single-use functions (e.g. _select_max, _tensor_to_numpy, _select_pred_and_mask) are lambda functions now The code is organized into more meaningful "chunks".. e.g. all the optimizer-related code should be together if possible, same thing for logging, configuration, loaders, tensorboard, ..etc. In addition: Fixed a visualization bug where the seismic images where not normalized correctly. This solves Issue #217. Fixed a visualization bug where the predictions where not masked where the input image was padded. This improves the ability to visually inspect and evaluate the results. This solves Issue #230. Fixes a potential issue where Tensorboard can crash when a large training batchsize is used. Now the number of images visualized in Tensorboard from every batch has an upper limit. Completely removed OpenCV as a dependency from the DeepSeismic Repo. It was only used in a small part of the code where it wasn't really necessary, and OpenCV is a huge library. Fixes Issue #218 where the epoch number for the images in Tensorboard was always logged as 1 (therefore, not allowing use to see the epoch number of the different results in Tensorboard. Removes the HorovodLRScheduler class since its no longer used Removes toolz.take from Debug mode, and uses PyTorch's native Subset() dataset class Changes default patch size for the HRNet model to 256 In addition to several other minor changes Co-authored-by: Yazeed Alaudah <yalaudah@users.noreply.github.com> Co-authored-by: Ubuntu <yazeed@yaalauda-dsvm-nd24.jsxrnelwp15e1jpgk5vvfmbzyb.bx.internal.cloudapp.net> Co-authored-by: Max Kaznady <maxkaz@microsoft.com>

yalaudah · 2020-04-22T20:02:32Z

Merged in #264

* correctness branch setup (#251) * created correctnes branch, trimmed experiments to Dutch F3 only * trivial change to re-trigger build * dummy PR to re-trigger malfunctioning builds * reducing scope further (#258) * created correctnes branch, trimmed experiments to Dutch F3 only * trivial change to re-trigger build * dummy PR to re-trigger malfunctioning builds * reducing scope of the correctness branch further * added branch triggers * 214 Ignite 0.3.0 upgrade (#261) * upgraded to Ignite 0.3.0 and fixed upgrade compatibility * added seeds and modified notebook for ignite 0.3.0 * updated code and tests to work with ignite 0.3.0 * made code consistent with Ignite 0.3.0 as much as possible * fixed iterator epoch_length bug by subsetting validation set * applied same fix to the notebook * bugfix in distributed train.py * increased distributed tests to 2 batched - hoping for one batch per GPU * resolved rebase conflict * added seeds and modified notebook for ignite 0.3.0 * updated code and tests to work with ignite 0.3.0 * made code consistent with Ignite 0.3.0 as much as possible * fixed iterator epoch_length bug by subsetting validation set * applied same fix to the notebook * bugfix in distributed train.py * increased distributed tests to 2 batched - hoping for one batch per GPU * update docker readme (#262) Co-authored-by: maxkazmsft <maxkaz@microsoft.com> * tagged all TODOs with issues on github (and created issues) (#278) * created correctnes branch, trimmed experiments to Dutch F3 only * trivial change to re-trigger build * dummy PR to re-trigger malfunctioning builds * resolved merge conflict * flagged all non-contrib TODO with github issues * resolved rebase conflict * resolved merge conflict * cleaned up archaic voxel code * Refactoring train.py, removing OpenCV, adding training results to Tensborboard, bug fixes (#264) I think moving forward, we'll use smaller PRs. But here are the changes in this one: Fixes issue #236 that involves rewriting a big portion of train.py such that: All the tensorboard event handlers are organized in tensorboard_handlers.py and only called in train.py to log training and validation results in Tensorboard The code logs the same results for training and validation. Also, it adds the class IoU score as well. All single-use functions (e.g. _select_max, _tensor_to_numpy, _select_pred_and_mask) are lambda functions now The code is organized into more meaningful "chunks".. e.g. all the optimizer-related code should be together if possible, same thing for logging, configuration, loaders, tensorboard, ..etc. In addition: Fixed a visualization bug where the seismic images where not normalized correctly. This solves Issue #217. Fixed a visualization bug where the predictions where not masked where the input image was padded. This improves the ability to visually inspect and evaluate the results. This solves Issue #230. Fixes a potential issue where Tensorboard can crash when a large training batchsize is used. Now the number of images visualized in Tensorboard from every batch has an upper limit. Completely removed OpenCV as a dependency from the DeepSeismic Repo. It was only used in a small part of the code where it wasn't really necessary, and OpenCV is a huge library. Fixes Issue #218 where the epoch number for the images in Tensorboard was always logged as 1 (therefore, not allowing use to see the epoch number of the different results in Tensorboard. Removes the HorovodLRScheduler class since its no longer used Removes toolz.take from Debug mode, and uses PyTorch's native Subset() dataset class Changes default patch size for the HRNet model to 256 In addition to several other minor changes Co-authored-by: Yazeed Alaudah <yalaudah@users.noreply.github.com> Co-authored-by: Ubuntu <yazeed@yaalauda-dsvm-nd24.jsxrnelwp15e1jpgk5vvfmbzyb.bx.internal.cloudapp.net> Co-authored-by: Max Kaznady <maxkaz@microsoft.com> * Fixes training/validation overlap #143, #233, #253, and #259 (#282) * Correctness single GPU switch (#290) * resolved rebase conflict * resolved merge conflict * resolved rebase conflict * resolved merge conflict * reverted multi-GPU builds to run on single GPU * 249r3 (#283) * resolved rebase conflict * resolved merge conflict * resolved rebase conflict * resolved merge conflict * wrote the bulk of checkerboard example * finished checkerboard generator * resolved merge conflict * resolved rebase conflict * got binary dataset to run * finished first implementation mockup - commit before rebase * made sure rebase went well manually * added new files * resolved PR comments and made tests work * fixed build error * fixed build VM errors * more fixes to get the test to pass * fixed n_classes issue in data.py * fixed notebook as well * cleared notebook run cell * trivial commit to restart builds * addressed PR comments * moved notebook tests to main build pipeline * fixed checkerboard label precision * relaxed performance tests for now * resolved merge conflict * resolved merge conflict * fixed build error * resolved merge conflicts * fixed another merge mistake * enabling development on docker (#291) * 289: correctness metrics and tighter tests (#293) * resolved rebase conflict * resolved merge conflict * resolved rebase conflict * resolved merge conflict * wrote the bulk of checkerboard example * finished checkerboard generator * resolved merge conflict * resolved rebase conflict * got binary dataset to run * finished first implementation mockup - commit before rebase * made sure rebase went well manually * added new files * resolved PR comments and made tests work * fixed build error * fixed build VM errors * more fixes to get the test to pass * fixed n_classes issue in data.py * fixed notebook as well * cleared notebook run cell * trivial commit to restart builds * addressed PR comments * moved notebook tests to main build pipeline * fixed checkerboard label precision * relaxed performance tests for now * resolved merge conflict * resolved merge conflict * fixed build error * resolved merge conflicts * fixed another merge mistake * resolved rebase conflict * resolved rebase 2 * resolved merge conflict * resolved merge conflict * adding new logging * added better logging - cleaner - debugged metrics on checkerboard dataset * resolved rebase conflict * resolved merge conflict * resolved merge conflict * resolved merge conflict * resolved rebase 2 * resolved merge conflict * updated notebook with the changes * addressed PR comments * addressed another PR comment * uniform colormap and correctness tests (#295) * correctness code good for PR review * addressed PR comments * added data dumps to the code * all dumps work properly now * fixed build error, added binary dataset * done - now need to test * finished dev build script * updates to tests to run on local machine as well we build * updated gradient direction in gen_checkerboard * increased Dutch F3 timeout Co-authored-by: yalaudah <yazeed.alaudah@microsoft.com> Co-authored-by: Yazeed Alaudah <yalaudah@users.noreply.github.com> Co-authored-by: Ubuntu <yazeed@yaalauda-dsvm-nd24.jsxrnelwp15e1jpgk5vvfmbzyb.bx.internal.cloudapp.net>

yalaudah self-assigned this Mar 26, 2020

yalaudah added Status: In-progress Active on-going work in the sprint Type: Enhancement This an enhancement to an existing feature labels Mar 26, 2020

yalaudah modified the milestones: Outstanding items, Ver 1.0 Mar 26, 2020

yalaudah mentioned this issue Mar 31, 2020

Refactoring train.py, removing OpenCV, adding training results to Tensborboard, bug fixes #240

Closed

yalaudah linked a pull request Mar 31, 2020 that will close this issue

Refactoring train.py, removing OpenCV, adding training results to Tensborboard, bug fixes #240

Closed

yalaudah mentioned this issue Apr 13, 2020

Refactoring train.py, removing OpenCV, adding training results to Tensborboard, bug fixes #256

Merged

yalaudah linked a pull request Apr 13, 2020 that will close this issue

Refactoring train.py, removing OpenCV, adding training results to Tensborboard, bug fixes #256

Merged

This was referenced Apr 17, 2020

Update visualization #263

Closed

Refactoring train.py, removing OpenCV, adding training results to Tensborboard, bug fixes #264

Merged

yalaudah linked a pull request Apr 17, 2020 that will close this issue

Refactoring train.py, removing OpenCV, adding training results to Tensborboard, bug fixes #264

Merged

maxkazmsft modified the milestones: Ver 1.0, V0.2 [Correctness] Apr 21, 2020

yalaudah closed this as completed Apr 22, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactoring train.py to enhance readability, maintainability, and add training results to Tensborboard. #236

Refactoring train.py to enhance readability, maintainability, and add training results to Tensborboard. #236

yalaudah commented Mar 26, 2020

yalaudah commented Apr 22, 2020

Refactoring train.py to enhance readability, maintainability, and add training results to Tensborboard. #236

Refactoring train.py to enhance readability, maintainability, and add training results to Tensborboard. #236

Comments

yalaudah commented Mar 26, 2020

yalaudah commented Apr 22, 2020