[Fix] Move init dist connection into the setup function #6506
…we set up the accelerator
Some tests in |
Maybe it helps to go back to before the accelerator refactor and look again at what the order of calls was. The commit before the accelerator refactor is 309ce7a
self.setup_training_type_plugin(self.training_type_plugin, model)
self.setup_optimizers(trainer)
self.connect_precision_plugin(self.precision_plugin)
self.setup_precision_plugin(self.precision_plugin)
If anyone extended the accelerator with their own subclass, this will be a breaking change, so we might need to handle a deprecation path here.
Do you need to rename them?
I don't, I guess? It's just a bit concerning because if I don't rename them, `connect_training_type_plugin` will be calling `plugin.setup`, and `connect` will be calling `training_type_plugin.connect`. Just confusing function names. If it becomes an issue we can make this backward compatible; however, in most cases it seems users should be defining plugins, not accelerators.
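For reference, the deprecation path mentioned above could look roughly like the sketch below. It assumes the old method name is `connect_training_type_plugin` and the new one is `setup_training_type_plugin`, as this thread suggests; `ToyAccelerator` and its `calls` list are hypothetical stand-ins, not the real Lightning accelerator.

```python
import warnings


class ToyAccelerator:
    """Hypothetical stand-in for an accelerator; not the real Lightning class."""

    def __init__(self):
        # Records what was called, purely for illustration.
        self.calls = []

    def setup_training_type_plugin(self, plugin, model):
        # New name: this is where the plugin setup would happen.
        self.calls.append(("setup", plugin, model))

    def connect_training_type_plugin(self, plugin, model):
        # Old name kept as a thin alias so existing subclasses and callers
        # keep working while being told to migrate.
        warnings.warn(
            "`connect_training_type_plugin` is deprecated; "
            "use `setup_training_type_plugin` instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.setup_training_type_plugin(plugin, model)
```

With an alias like this, code that still calls the old name gets a `DeprecationWarning` but ends up in the new method, so the rename would not break custom accelerators immediately.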
# TODO: we moved it to the trainer.fit after calling pre_dispatch
# ... need to double check that it is the correct place
# self.trainer.call_setup_hook(self.model)
yeah my silly todo....
"need to double check that it is the correct place"
Thanks for double checking @SeanNaren 😄
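To make the intended ordering concrete, here is a toy sketch of what the PR description states: the distributed connection is initialised as part of the accelerator setup, and the setup hook is called afterwards. `ToyTrainer` and all of its method names are hypothetical and only record the call order; the real trainer does far more.

```python
class ToyTrainer:
    """Hypothetical trainer that only records the order of setup calls."""

    def __init__(self):
        self.events = []

    def init_dist_connection(self):
        # Stand-in for initialising the distributed process group.
        self.events.append("init_dist_connection")

    def setup_accelerator(self):
        # The connection is set up inside the accelerator/plugin setup.
        self.init_dist_connection()
        self.events.append("accelerator_setup")

    def call_setup_hook(self):
        # Stand-in for calling the user's setup hook.
        self.events.append("setup_hook")

    def fit(self):
        # Accelerator first, then the setup hook.
        self.setup_accelerator()
        self.call_setup_hook()
        return self.events
```

In this ordering the setup hook can rely on the distributed connection already being available.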
LGTM ! Great job !
# Conflicts: # CHANGELOG.md
🚀
log.info("-" * 100)
log.info(f"distributed_backend={self.distributed_backend}")
log.info(f"All DDP processes registered. Starting ddp with {self.world_size} processes")
log.info("-" * 100)
btw, shall we have this as a single message instead of 4 separate ones?
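A possible shape for the single-message version, as a sketch only: the helper names `format_ddp_banner` and `log_ddp_banner` are made up for illustration, and the values are passed in as arguments rather than read from `self`.

```python
import logging

log = logging.getLogger(__name__)


def format_ddp_banner(distributed_backend, world_size):
    """Build the four-line banner as one string (hypothetical helper)."""
    rule = "-" * 100
    return (
        f"{rule}\n"
        f"distributed_backend={distributed_backend}\n"
        f"All DDP processes registered. Starting ddp with {world_size} processes\n"
        f"{rule}"
    )


def log_ddp_banner(distributed_backend, world_size):
    # One log record instead of four, so other processes' output
    # cannot be interleaved between the banner lines.
    log.info(format_ddp_banner(distributed_backend, world_size))
```

The trade-off is that handlers which prefix every record (timestamp, rank) will only prefix the first line of the banner.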
@@ -180,7 +180,7 @@ def test_deepspeed_defaults(tmpdir):
     assert isinstance(plugin.config["zero_optimization"], dict)


-@RunIf(deepspeed=True)
+@RunIf(min_gpus=1, deepspeed=True, special=True)
this is so cool :D
* Move connection setup into the setup function. Call setup hook after we set up the accelerator
* Added CHANGELOG.md
* fix setup order in callback test
* fix input arguments in test
* Mock distributed function, remove protection to turn into training type hook
* Remove import
* Add missing mock, ensure custom plugin does not create children process
* Skip test on windows
* Update deepspeed to init connection in setup
* Do not initialize distributed module
* Move DeepSpeed tests to special tests since dist communication is being set up
* Special the test to see if this fixes CI
* Delete accelerator connector test to see if its causing build to fail
* Delete deepspeed test
* Revert "Delete accelerator connector test to see if its causing build to fail" This reverts commit edde60b
* Revert "Delete deepspeed test" This reverts commit 9d317429
* Reverse hook
* Reverse setup hooks to debug again
* Add todo so i know where i left off
* For single device move in pre_dispatch after setup function
* Add additional model to device hook if any additional parameters have been set
* See if we can enable deepspeed tests
* Revert "See if we can enable deepspeed tests" This reverts commit b5450de
* See if this hook approach works
* Introduce new granular hooks
* Remove import, fix tpu spawn by moving the function to setup
* Added missing special test
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
(cherry picked from commit 4e9b453)
Setting milestone as 1.3 as it requires separating the fix from the API change to get into |
* Add hint in docs for how to use shared memory (#6036) * Prevent flickering progress bar (#6009) * add padding * fix * fix * Update pytorch_lightning/callbacks/progress.py Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> * updated based on suggestion * changelog * add test * fix pep8 * resolve test * fix code format Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> Co-authored-by: tchaton <thomas@grid.ai> * Fix Wrapping optimizers upon assignment (#6006) * Update properties.py * pep8 * [Bugfix] Apply untoggle_optimizer when result is None (#5983) * update changelog * apply untoggle_optimizer when result is None * update tests * still return loss sometimes * Update CHANGELOG.md Co-authored-by: deng-cy <dcy1996@gmail.com> Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * remove outdated info (#6032) * DeepSpeed Integration (#5954) * Add initial deepspeed changes * Address code review * Move static method outside of function * Fixes * Add missing annotation * Remove seed setting * Doc changes * Doc changes, add address reviews * Fix docs * Try fixing issue by moving to torch adam * Clean up check * Changes, better APIs! * Add wrapper, swap to git install revision * Add special test * Add warning * Address review * Add better disclaimer * Turn off ZeRO for testing due to compilation * Add description on modifying parameters via the plugin * Doc strings clear * Small doc fixes * Fix hash, reduce test * Added CI change * Move to azure pipeline * Fix test name * Add missing flag * Remove sudo... 
* Try conda instead * Swap to conda base * Try suggested install * Apply suggestions from code review * Apply suggestions from code review * Revert "Apply suggestions from code review" This reverts commit 41cca05a * Revert "Apply suggestions from code review" This reverts commit e06ec29e * Remove setter * Address most review * Move out function, remove DeepSpeed from requirements * Install deepspeed/mpi4py within container * Use special tests, move to master commit for deepspeed * Export path * Force compile to happen first * Remove! * Debugging ninja * Fix error in optimizer step logic * Attempt to fix symbolic link * Reverse to aid debugging * Export path again * Clean up mess * var * Revert "var" This reverts commit 3450eaca * Address review, add todo * Add note about unsupported functionality Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: tchaton <thomas@grid.ai> Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz> * Trainer only references accelerator (#6039) * Trainer only references accelerator where it can * Move teardown to the trainer, as it is reponsible for the accelerator * Address code review for deepspeed (#6042) * [feat] Add Trainer(stochastic_weight_avg=True/False) (#6038) Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com> Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> * [CI] Move DeepSpeed into CUDA image, remove DeepSpeed install from azure (#6043) * Move to CUDA image * Remove deepspeed install as deepspeed now in the cuda image * Remove path setting, as ninja should be in the container now * drop deprecated result object 1/n (#5005) * ro1 * ro2 * Add option for weight tying on TPU's (#5441) * added on_post_move_to_device * added tests * docs and refactors * Update tests/backends/test_tpu_backend.py Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * Update docs/source/tpu.rst Co-authored-by: Jirka 
Borovec <Borda@users.noreply.github.com> * Update docs/source/tpu.rst Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * Update pytorch_lightning/core/decorators.py Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * Update pytorch_lightning/core/decorators.py Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * Update docs/source/tpu.rst Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com> * Update pytorch_lightning/core/decorators.py Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com> * Update pytorch_lightning/core/decorators.py Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com> * Update pytorch_lightning/core/decorators.py Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com> * Update pytorch_lightning/core/decorators.py Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com> * Update pytorch_lightning/core/hooks.py Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com> * moved weight sharing module back to test updated tpu available * add count to warning * fix doctest * import trainer in doctest * import trainer in doctest * do not test code as no TPU device * param count to layer count * formatting * update docs * update import * update * resolve tests * remove legacy accelerator Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com> Co-authored-by: tchaton <thomas@grid.ai> Co-authored-by: Your Name <you@example.com> * Delete tests.helpers.TrialMNISTDataModule (#5999) * Remove TrialMNISTDataModule * Allow using TrialMNIST in the MNISTDataModule * Update tests/helpers/datasets.py * Fix: Allow hashing of metrics with lists in their state (#5939) * Fix: Allow hashing of metrics with lists in their state * Add test case and modify semantics of Metric __hash__ in order to be compatible with structural equality checks * Fix pep8 style issue Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: mergify[bot] 
<37929162+mergify[bot]@users.noreply.github.com> * et al. (#6050) * et al. * Apply suggestions from code review * Apply suggestions from code review Co-authored-by: chaton <thomas@grid.ai> Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: chaton <thomas@grid.ai> * [ModelPruning] Add missing attribute with use_global_unstructured=False and verbose (#6045) * fix/test quant (#6040) * fix/test quant * ... * --- * Add descriptions to accelerator broadcast function/clean up all_gather (#6044) * Add descriptions to accelerator broadcast function/clean up all_gather * Remove todo * Add before_batch_transfer and after_batch_transfer hooks (#3671) * add hooks * comment * docs * add tests * make it private * fix tests * docs * chlog * testcode * codefactor * fix doctest * fix doctest * suggestions * is always overriden * pep and BoringModel * BoringModel * docs * docs * docs * fix * rebase * rebase * suggestions * docs * suggestions * try fix docs * docs * update name * yapf * docs * rebase * yapf * Make parallel devices optional across all plugins (#6051) * Make parallel devices optional across all plugins so that they can be instantiated * Add any to types to capture vars passed in * clarify gpu / process (#6049) * Fix docs typo (#6055) Put .test() in code blocks * Docs for Pruning, Quantization, and SWA (#6041) Co-authored-by: chaton <thomas@grid.ai> Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> Co-authored-by: Sean Naren <sean.narenthiran@gmail.com> * Replace .get_model() with explicit .lightning_module (#6035) * rename get_model -> lightning_module * update references to get_model * pep8 * add proper deprecation * remove outdated _get_reference_model * fix cyclic import * rename accelerator_backend -> accelerator (#6034) * rename accelerator backend * rename new additions from master * add proper deprecation * pep8 * warning match * add missing warning type * fix 
flake8 for new plugins (#5951) * flake8 * fix cyclic import * isort * fix docs links (#6057) * Add warnings to on_before/after_batch_transfer hooks (#6059) * Add warnings to hooks * Add default idx to prevent signature change in the future * Nothing to see here * Add default val to transfer_batch_to_device hook * Apply suggestions from code review Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * Revert "Add default val to transfer_batch_to_device hook" This reverts commit 5c6a68f2 Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * v1.2.0rc2 (#6063) * v1.2.0rc2 * chlogs * chlogs * format * Apply suggestions from code review Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com> Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com> * Update auto-opt docs (#6037) * fix docs * update on comments * Apply suggestions from code review Co-authored-by: Nicki Skafte <skaftenicki@gmail.com> * Apply suggestions from code review Co-authored-by: Nicki Skafte <skaftenicki@gmail.com> * Apply suggestions from code review Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> * rm comment * Update docs/source/common/lightning_module.rst Co-authored-by: chaton <thomas@grid.ai> Co-authored-by: Nicki Skafte <skaftenicki@gmail.com> Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> Co-authored-by: chaton <thomas@grid.ai> * Raise AttributeError in lightning_getattr and lightning_setattr when attribute not found (#6024) * Empty commit * Raise AttributeError instead of ValueError * Make functions private * Update tests * Add match string * Apply suggestions from code review Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * lightning to Lightning Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * default sched (#6062) * v1.2.0 (#6065) * v1.2.0 * docs * add Azure tags trigger (#6066) * add Azure tags trigger * fix * mnodes * pypi azure badges - tags (#6068) * pypi azure badges - 
tags * pep8 * id * continue towards 1.3 (#6069) * Fix amp autocast (#6080) * precision fixes * add amp test model * fix test * revert * move assert to training step * fix test * fix test * remove unrelated changes * add changelog * remove unused import * add sanity check on nb available GPUs (#6092) * consistent behavior for reduce method across all Plugins (#6011) * reduction docs * docs for abstract base method * make mean the default * add preliminary chlog Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * [Hot Fix] Give priority to plugins to set distributed mode, and then accelerator (#6089) * Give priority to plugins to set distributed mode, and then accelerator * Add CHANGELOG.md * Update CHANGELOG.md * Remove very scary line * Ensure we set cluster environment after slurm configured if necessary * Simplify the fix with a reset Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> * Enable ZeRO tests for CI, fix to/half function calls (#6070) * Enable ZeRO optimization, and make sure that the lightning module hook is called when we move to half precision * Added test, update to function * Expose DeepSpeed FP16 parameters due to loss instability (#6115) * Expose deepspeed config parameters to init function due to instability in parameters * See if tests can run on normal CI, without special tests * Add changelog * Update pytorch_lightning/plugins/training_type/deepspeed.py Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> * Collapse 2 DeepSpeed tests (#6108) * fix amp/apex misconfiguration error for cpu (#6107) * fix weird test * fix apex plugin test * fix raise * cpu test * fix type * add changelog * Update Contributing Guide (#6118) * Update Contributing Guide * update docs * Minor fixes/improvements in Metric docs (#6114) * Fix wrong render * Improve classification metrics docs * Improve other domain metrics docs * Change the structure level in the docs * Avoid printing 
ModelCheckpoint log with monitor=None and verbose=True (#6109) * Feature/5275 clean progress bar print (#5470) * Trainer.test should return only test metrics (#5214) * resolve bug * merge tests * Fix metric state reset (#5273) * Fix metric state reset * Fix test * Improve formatting Co-authored-by: Ananya Harsh Jha <ananya@pytorchlightning.ai> * print() method added to ProgressBar * printing alongside progress bar added to LightningModule.print() * LightningModule.print() method documentation updated * ProgressBarBase.print() stub added * stub * add progress bar tests * fix isort * Progress Callback fixes * test_metric.py duplicate DummyList removed * PEP and isort fixes * CHANGELOG updated * test_progress_bar_print win linesep fix * test_progress_bar.py remove whitespaces * Update CHANGELOG.md Co-authored-by: chaton <thomas@grid.ai> Co-authored-by: Tadej Svetina <tadej.svetina@gmail.com> Co-authored-by: Ananya Harsh Jha <ananya@pytorchlightning.ai> Co-authored-by: Alexander Snorkin <Alexander.Snorkin@acronis.com> Co-authored-by: rohitgr7 <rohitgr1998@gmail.com> Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> * mini refactor for _running_stage access (#5724) * running stage * circular import * running stage cleanup * fix unused import * fix running stage access * add return type * Revert "add return type" This reverts commit 65b0fe269c6547213e34b6a88b97bee31cdfe8c7. * try fix typing * Add specifics around DeepSpeed docs (#6142) * Be more specific with DeepSpeed compatibility * Better wording * Ensure accelerator is valid if running interactively (#5970) Co-authored-by: chaton <thomas@grid.ai> Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com> * fixing miss-leading tested acc values (#5876) * fixing tested values * . 
* tests * yapf * softmax * hvd * rename * lr * duplicate * drop * classif * rm EvalModel * Revert "rm EvalModel" This reverts commit 6c3fb39ebe0c4bfb52357bccfd050438f2c0f31c. * update tests * fix * azure * azure * self * cpu * Apply suggestions from code review Co-authored-by: rohitgr7 <rohitgr1998@gmail.com> * Update CHANGELOG (#6156) * prune deprecated profiler as bool (#6164) * prune profiler * chlog * prune deprecated Trainer arg `enable_pl_optimizer` (#6163) * prune enable_pl_optimizer * prune automatic_optimization * Prune deprecated metrics for 1.3 (#6161) * prune deprecated metrics for 1.3 * isort / yapf * [Bugfix] Fixed epoch level schedulers not being called when val_check_interval < 1.0 (#6075) * fix bug * fix tests * changelog * fix pep8 * fix tests * fix and add some tests * add test for rlop * chlog * Update CHANGELOG.md Co-authored-by: rohitgr7 <rohitgr1998@gmail.com> * Prune deprecated checkpoint arguments (#6162) * prune prefix * prune mode=auto * chlog * Prune deprecated EarlyStopping(mode='auto') (#6167) Co-authored-by: Roger Shieh <sh.rog@protonmail.ch> Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com> * Fix typo (#6178) * Update issue template to use discussions for questions (#6155) * add issue config * remove question template * update URL * Update README.md * Update README.md Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com> * Update .github/ISSUE_TEMPLATE/config.yml Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com> Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com> * Update with GitHub Discussions (#6186) * Update gpu warning (#6181) Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: Kaushik Bokka <kaushikbokka@gmail.com> * type accelerators (#6148) * Fix for multiple callbacks (#6197) * Fix for multiple callbacks * Add CHANGELOG.md * Remove old params * Skip tests on windows using ddp * Change name of the variable to not clash with should stop, which is separate * Apply suggestions from code review * Fix 
params Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * Add checkpoint parameter to on_save_checkpoint (#6072) Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com> * Document exceptions in loggers (#6171) * Document exceptions in loggers * minor formatting * docstring changed in comet.py * Apply suggestions from code review Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com> * Prune deprecated Trainer(checkpoint_callback=ModelCheckpoint()) (#6166) * fix parallel devices return type & add copyright (#6215) * Add mypy typing to precision plugins. (#6149) Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz> Co-authored-by: Akihiro Nitta <nitta@akihironitta.com> * apply_func.py: from torchtext.legacy.data import Batch (#6211) * Update apply_func.py The name Batch is no longer located under torchtext.data --Error message-- File "/home/daniel/py38/lib/python3.8/site-packages/pytorch_lightning/utilities/apply_func.py", line 25, in <module> from torchtext.data import Batch ImportError: cannot import name 'Batch' from 'torchtext.data' (/home/daniel/py38/lib/p ython3.8/site-packages/torchtext/data/__init__.py) You can fix this by changing line line 28 to: from torchtext.legacy.data import Batch * Update apply_func.py * Update apply_func.py * Update apply_func.py * Update apply_func.py * Update apply_func.py * fix(wandb): prevent WandbLogger from dropping values (#5931) Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> Co-authored-by: chaton <thomas@grid.ai> Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * Prune deprecated hparams setter (#6207) * document exceptions for metrics/regression (#6202) Co-authored-by: Akihiro Nitta <nitta@akihironitta.com> Co-authored-by: Prajakta Phadke <pphadke@iu.edu> Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com> * simplify skip-if tests >> 0/n (#5920) * skipif + yapf + isort * tests * docs * pp * update (#6237) * 
Document Exceptions in profilers (#6229) * docstring changes in profilers * minor changes in profilers.py * Call `optimizer.zero_grad()` before backward inside closure in AutoOpt (#6147) Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com> * Fix for incorrect usage of detach(), cpu(), to() (#6216) * Fix for incorrect detach/cpu calls (#6214) * Fix incorrect use of detach(), to(), and cpu(), #6214 * Fix incorrect use of detach() and cpu(), #6214 * update pr * add typing * chlog * more... * revert on module * update on comments * revert changes on model Co-authored-by: tchaton <thomas@grid.ai> Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz> * add skipif warpper (#6258) * cleaning SWA (#6259) * rename * if * test * chlog * Remove opt from manual_backward in docs (#6267) * switch agents pool (#6270) * docstring changes in tuner (#6264) * docstring changes in tuner * added full stop * Disable CPU Offload as default for DeepSpeed (#6262) * Change default for CPU offload to false for best throughput/memory efficiency * Add changelog * default Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * split profilers (#6261) * Refactor: skipif for multi - gpus 1/n (#6266) * ngpus * gpu * isort * pt * flake8 * Improved EarlyStopping.patience documentation (#6278) * Improved early stopping documentation * Changed to 120 column format * doc * doc * doc Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz> * Refactor: skipif for Windows 2/n (#6268) * win * isort * flake8 * fix duplicate console logging bug v2 (#6275) Co-authored-by: chaton <thomas@grid.ai> Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * Refactor: skipif for AMPs 3/n (#6293) * args * native * apex * isort * [fix] Ensure we check deepspeed/sharded in multinode DDP (#6297) * Ensure we check deepspeed/sharded in multinode * Add CHANGELOG.md * Add CHANGELOG.md * Drop mock, use actual multi-gpu node * unfreeze torchtext version (#6302) * Add possibility for custom naming when 
using multiple dataloaders (#6274) * try to fix imports for parsing (#6256) * try to fix imports * legacy 1.2.1 * Refactor: Runif for TPU and Horovod 5/n (#6301) * TPU * horovod * extra * fix * Apply suggestions from code review Co-authored-by: Nicki Skafte <skaftenicki@gmail.com> * doc Co-authored-by: Nicki Skafte <skaftenicki@gmail.com> * Refactor: runif for spec 6/6 (#6307) * special * rpc * Add fairscale & deepspeed to skipif 4/n (#6281) * add fairscale & windows to skipif * add deepspeed to runif * fairscale * deepspeed * flake8 Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz> * [bugfix] TPU test hangs to barrier on 1 process (#6272) * update * resolve flake8 * update * update * update changelog * update * resolve flake8 Co-authored-by: Your Name <you@example.com> * prune duplicite test in optim (#6312) * Simplify test for AMP plugins (#6311) * AMP * fuse * yapf * Fix ModelPruning(make_pruning_permanent=True) buffers getting removed when saved during training (#6073) Co-authored-by: chaton <thomas@grid.ai> * [bugfix] TPU + all_gather + SingleTPU shouldn't call xm.all_gather (#6296) * resolve an issue with TPU * update * add changelog * drop unused variable in API (#6308) * drop unused pl model in ckpt * irelevant * on_evaluation_batch_start * evaluation_epoch_end * attach_datamodule * hotfix for PT1.6 and torchtext (#6323) * ci: azure reinstall torchtext * move * todos * 0.6.0 * skip examples * formatter * skip * todo * Apply suggestions from code review * [fix] Use training type plugin hook when saving (FSDP 1/n) (#6321) * Rely on training type plugin when saving * Add better typing to training type plugin * leaving lezwon (#6347) * Add `tests/utilities/test_parsing.py` (#4460) * Create branch tests/4400_parsing * Rename test file for parsing.py * Fix lightning_hasattr * Fix lightning_hasattr * Fix lightning_setattr * Add empty lines and remove rubbish spaces * Raise AttributeError not ValueError * Use getattr in hasattr * Remove rubbish spaces * Fix 
getattr * Fix by flake8 * Add tests for str_to_bool_or_str * Fix by flake8 * Add tests for str_to_bool * Add tests for is_picklable * Add tests for clean_namespace * Fix typo * Fix lightning_getattr * Add tests for AttributeDict * Add tests for flatten_dict * Fix by flake8 * Apply suggestions from code review Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * Apply isort * Revert "Apply suggestions from code review" * Define unpicklable_function outside * Add comment to test_clean_namespace * Add tests for parse_class_init_keys * Add tests for get_init_args and collect_init_args * Share objects across the tests Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: Ethan Harris <ewah1g13@soton.ac.uk> * Add ignore param to save_hyperparameters (#6056) * add ignore param to save_hyperparameters * add docstring for ignore * add type for frame object * Update pytorch_lightning/core/lightning.py Co-authored-by: Nicki Skafte <skaftenicki@gmail.com> * Update pytorch_lightning/core/lightning.py Co-authored-by: Nicki Skafte <skaftenicki@gmail.com> * fix whitespace * Update pytorch_lightning/core/lightning.py Co-authored-by: Nicki Skafte <skaftenicki@gmail.com> * Parametrize tests * Update pytorch_lightning/core/lightning.py Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com> * Update pytorch_lightning/core/lightning.py Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com> * seq * fix docs * Update lightning.py * Update lightning.py * fix docs errors * add example keyword * update docstring Co-authored-by: Nicki Skafte <skaftenicki@gmail.com> Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com> Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com> * Fix when _stable_1d_sort to work when n >= N (#6177) * Fix when _stable_1d_sort to work when n >= N * Apply suggestions Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com> * Update docs on arg train_dataloader in fit (#6076) * add to docs * update docs * Apply suggestions from code 
review * Update pytorch_lightning/core/hooks.py Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com> * nested loaders * Apply suggestions from code review Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * shorten text length * Update pytorch_lightning/core/hooks.py Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com> Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * missing tests default_root_dir=tmpdir (#6314) * default_root_dir=tmpdir * miss * Document exception for metrics/classification (#6190) * document exception for metrics/classification * minor formatting fixes * fix trailing whitespaces * document exception for metrics * Apply suggestions from code review Co-authored-by: Nicki Skafte <skaftenicki@gmail.com> * Apply suggestions from code review Co-authored-by: Nicki Skafte <skaftenicki@gmail.com> * Apply suggestions from code review Co-authored-by: Akihiro Nitta <nitta@akihironitta.com> Co-authored-by: Nicki Skafte <skaftenicki@gmail.com> Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com> Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: Akihiro Nitta <nitta@akihironitta.com> * [Fix] Call clip gradients if clip val greater than 0 (#6330) * Call clip gradients if clip val greater than 0 * format * Format * Move to top of file * [bugfix] Check LightningOptimizer doesn't delete optimizer hooks (#6305) * update * resolve bug * docstring changes in accelerators (#6327) * docstring changes in accelerators * docstrings moved * whitespaces removed * PEP8 correction[1] * [bugfix] Perform reduction for dict in training_step and DP (#6324) * fix * update * update * add changelog * Update CHANGELOG.md Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> * Update tests/accelerators/test_dp.py Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> * update changelog Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> * introduce default cluster 
environment for lightning-specific ddp (#5915) * handle distributed_sampler_kwargs * move emptying cache to accelertor * fix a few tests * restoring the result from subprocess * fix queue.get() order for results * add missing "block_backward_sync" context manager * add missing "block_backward_sync" context manager * fix sync_batchnorm * fix supported gpu-ids for tuple * fix clip gradients and inf recursion * accelerator selection: added cluster_environment plugin * fix torchelastic test * fix reduce early stopping decision for DDP * fix tests: callbacks, conversion to lightning optimizer * fix lightning optimizer does not pickle * fix setting benchmark and deterministic option * fix slurm amp test * fix prepare_data test and determine node_rank * fix retrieving last path when testing * remove obsolete plugin argument * fix test: test_trainer_config * fix torchscript tests * fix trainer.model access * move properties * fix test_transfer_batch_hook * fix auto_select_gpus * fix omegaconf test * fix test that needs to simulate slurm ddp * add horovod plugin * fix test with named arguments * clean up whitespace * fix datamodules test * remove old accelerators * fix naming * move old plugins * move to plugins * create precision subpackage * create training_type subpackage * fix all new import errors * fix wrong arguments order passed to test * fix LR finder * Added sharded training type and amp plugin * Move clip grad to precision plugin * Added sharded spawn, select accelerators based on distributed_backend + enable custom fp16 plugin automatically * Fix import issue, attempting to fix tests * Fix initial test * Reflect hook logic from master, should wrap model after move to device * Optional state consolidation, since master has optimizers not wrapped * change attribute for instance test * reset optimizers optimizers are not used in main process, so state would be wrong. 
* legacy * imports in accel * legacy2 * trainer imports * fix import errors after rebase * move hook to new setup location * provide unwrapping logic * fix trainer callback system * added ddp2 implementation * fix imports .legacy * move plugins * restore legacy * drop test.py from root * add tpu accelerator and plugins * fixes * fix lightning optimizer merge * reset bugreportmodel * unwrapping * step routing forward * model access * unwrap * opt * integrate distrib_type * sync changes * sync * fixes * add forgotten generators * add missing logic * update * import * missed imports * import fixes * isort * mv f * changelog * format * move helper to parallel plugin * d * add world size * clean up * duplicate * activate ddp_sharded and tpu * set nvidia flags * remove unused colab var * use_tpu <-> on_tpu attrs * make some ddp_cpu and clusterplugin tests pass * Ref/accelerator connector (#5742) * final cleanup Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * connector cleanup Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * trainer cleanup Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * accelerator cleanup + missing logic in accelerator connector Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * add missing changes to callbacks Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * reflect accelerator changes to lightning module Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * clean cluster envs Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * cleanup plugins Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * add broadcasting Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * yapf * remove plugin connector Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * plugins * manual optimization * update optimizer routing * add rank to torchelastic * fix memory mixed precision * setstate on trainer for pickling in ddp spawn * add predict method * add back commented accelerator code * adapt 
test for sync_batch_norm to new plugin * fix deprecated tests * fix ddp cpu choice when no num_processes are given * yapf format * skip a memory test that cannot pass anymore * fix pickle error in spawn plugin * x * avoid * x * fix cyclic import in docs build * add support for sharded * update typing * add sharded and sharded_spawn to distributed types * make unwrap model default * refactor LightningShardedDataParallel similar to LightningDistributedDataParallel * update sharded spawn to reflect changes * update sharded to reflect changes * Merge 1.1.5 changes * fix merge * fix merge * yapf isort * fix merge * yapf isort * fix indentation in test * copy over reinit scheduler implementation from dev1.2 * fix apex tracking calls with dev_debugger * reduce diff to dev1.2, clean up * fix trainer config test when gpus>0 and num_processes >0 and ddp_cpu * sort plugin tests legacy/new * fix error handling for amp on cpu * fix merge fix merge fix merge * [Feat] Resolve manual_backward (#5837) * resolve manual_backward * resolve flake8 * update * resolve for ddp_spawn * resolve flake8 * resolve flake8 * resolve flake8 Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal> * fix tests/accelerator tests on cpu * [BugFix] Resolve manual optimization (#5852) * resolve manual_optimization * update * update Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal> * Remove copy trainer parameters to happen earlier within the loop and add safe guard to get ref model (#5856) * resovle a bug * Accelerator refactor sharded rpc (#5854) * rpc branch * merge * update handling of rpc * make devices etc. Optional in RPC * set devices etc. 
later if necessary * remove devices from sequential * make devices optional in rpc * fix import * uncomment everything * fix cluster selection Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal> * resolve bug * fix assert in rpc test * resolve a test * fix docs compilation * accelerator refactor - fix for sharded parity test (#5866) * fix memory issue with ddp_spawn * x x x x x x x x x * x * Remove DDP2 as this does not apply * Add missing pre optimizer hook to ensure lambda closure is called * fix apex docstring * [accelerator][BugFix] Resolve some test for 1 gpu (#5863) * update * revert init * resolve a bug * update * resolve flake8 * update * update * update * revert init * resolve a bug * update * resolve flake8 * update * update * update * update * update * revert init * resolve a bug * update * resolve flake8 * update * update * update * revert init * update * resolve flake8 * update * update * update * update * update * all_gather * update * make plugins work, add misconfig for RPC * update * update * remove breaking test * resolve some tests * resolve flake8 * revert to ddp_spawn Co-authored-by: root <root@ip-172-31-88-60.ec2.internal> Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal> Co-authored-by: Justus Schock <justus.schock@rwth-aachen.de> * yapf isort * resolve flake8 * fix apex doctests * fix apex doctests 2 * resolve docs * update drone * clean env * update * update * update * update * merge * Fix RPC related tests, clean out old API, update for new accelerator API [skip ci] (#5881) * Fix RPC related tests, clean out old API, update for new accelerator API * Move tests out of legacy folder, update paths and names * Update test_remove_1-4.py * Expose properties for tpu cores/gpus/num_gpus * Add root GPU property * Move properties to properties.py * move tests that were previously in drone * Fix root GPU property (#5908) * Move root GPU to property, remove horovod set as this is handled in horovod plugin, ensure we mock correctly 
to set GPU accelerator * Add missing tests back * fix best model path transfer when no checkpoint callback available * Fix setup hook order [wip] (#5858) * Call trainer setup hook before accelerator setup * Add test case * add new test * typo * fix callback order in test Co-authored-by: tchaton <thomas@grid.ai> Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * rename ddp sequential -> rpc sequential for special test * revert * fix stupid merge problem * abstract the cluster plugins * default plugin * integrate default environment * fix property * adapt tests * adjust test * fix world size access * base cluster env * revert rebase errors * revert rebase errors * missing import * revert unrelated change * remove unused cluster local rank * remove unrelated changes * fix unrelated changes * fix pep8 * remove unused var * reset permissions * ypaf * test default environment * test torchelastic environment * world size as int * tests for slurm environment * changelog * test comments * remove unintended change * keep master port fixed after it is generated * test random master port * yapf * add missing default environment * move helper function * rename default environment * rename * rename * yapf * Update pytorch_lightning/plugins/environments/lightning_environment.py Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> * Update CHANGELOG.md Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com> * spawn -> create Co-authored-by: justusschock <justus.schock@posteo.de> Co-authored-by: SeanNaren <sean@grid.ai> Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com> Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz> Co-authored-by: Justus Schock <justus.schock@rwth-aachen.de> Co-authored-by: chaton <thomas@grid.ai> Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal> Co-authored-by: Sean Naren <sean.narenthiran@gmail.com> Co-authored-by: root <root@ip-172-31-88-60.ec2.internal> Co-authored-by: Carlos 
Mocholí <carlossmocholi@gmail.com> * [bugfix] Resolve memory leak for evaluation (#6326) * resolve bug * resolve flake8 * revert name * Update changelog for v1.2.2 (#6325) * update changelog for v1.2.2 * ckpr 1.2.2 Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz> * CI: fix examples - patch download MNIST (#6357) * patch download * CI * isort * extra * [bug] Fix Pytorch profiler with emit_nvtx (#6260) * resolve bug * update changelog * Update tests/trainer/test_trainer.py * Update pytorch_lightning/profiler/profilers.py Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * resolve comments * resolve flake8 Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * fix importing torchtext batch (#6365) * copy torchtext batch * update * rev * rev * give a more complete GAN example (#6294) * Refactor RunningStage usage in advance of implementing Trainer.validate() (#4945) * Update code Co-authored-by: EliaCereda * More property updates * Move properties. Introduce trainer._fitting * Use trainer.fitting * Fix reset dataloaders * Unused code * RunningStage.SANITY_CHECKING * Use setters * Fix bugs * Fix bugs * TrainerState.{FITTING,VALIDATING,TESTING,PREDICTING,TUNING} * Fix bugs * Fix bugs * Fix tests * Update CHANGELOG. Add deprecation warning. Fix tests * Unused imports * Optional trainer * More deprecation. 
More refactoring * Correct version * Use properties * Address comments * flake8 * Missed renamings * Typo * is -> == It is recommended to use for Enums since they are singletons, however, since the LightningEnum subclasses str, it's not a good idea in case a user sets the state/stage with a str * Also for tests * Typo * Address @tchaton's comments * PEP8 * Correct property * Update CHANGELOG * Apply suggestions from code review Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * Update pytorch_lightning/trainer/trainer.py Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * Remove called sanity check Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com> Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * require: adjust versions (#6363) * adjust versions * release * manifest * pep8 * CI * fix * build * Use f-"""-string in a Trainer comment (#6377) * Use f-"""-string * Add r * Use Trainer. * r -> noqa: W605 * Remove no return warning from val/test step (#6139) * remove warning * auto_opt * chlog * auto_opt * no_warning_call * rm old code * add warning for predict * Apply suggestions from code review Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * Fix manual optimization in pl_example (#6373) * Fix automatic_optimization * Fix automatic_optimization * Uncomment fairscale * Update Sharded test with RunIf (#6384) * Remove optimizer_idx arg in manual optimization (#6093) Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> Co-authored-by: chaton <thomas@grid.ai> * [doc] Improve Multiple Val/Test Dataloaders with simultaneous batches option (#6320) * improve doc to describe how to combine batches of multiple test and val dataloaders simultaneously * fix typo Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * use paramref Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * [doc] Fix closure in manual 
optimization (#6374) * Fix manual optimization docs * Fix typo. Thanks @import-antigravity * Fix ModelCheckpoint(monitor=None, save_last=True) not saving checkpoints (#6136) Co-authored-by: ananthsub <ananth.subramaniam@gmail.com> * Update TBLogger docs (#6315) * Update tensorboard.py * Update logging.rst * pep8 * Update logging.rst * Update logging.rst * Apply suggestions from code review * add code sample * Update logging.rst Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * Fix trainer not resetting lightning_optimizers (#6372) Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> * update python version (#6399) * Fix AttributeError: 'NoneType' object has no attribute 'finalize' on TPU (#6221) * Fix bug Fix AttributeError: 'NoneType' object has no attribute 'finalize' * Update CHANGELOG.md * deleted a period * Update CHANGELOG.md Co-authored-by: Akihiro Nitta <nitta@akihironitta.com> * Update CHANGELOG.md * Update pytorch_lightning/plugins/training_type/tpu_spawn.py Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com> Co-authored-by: Akihiro Nitta <nitta@akihironitta.com> Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com> * Run CI (#6402) * Pass {fit,validate,test,predict} to setup() and teardown() (#6386) * fix dp reduction test (#6404) * fix * update * fix * move the class outside * Add check for verbose attribute of ModelCheckpoint (#6419) Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * fixed bug where tuner would not tune lr if also tuning batch_size (#4688) * fixed bug where tuner would not tune lr if also tuning batch_size * added a '+1' to computing the smoothed loss. 
This maintains the behavior for the smoothed loss as before the bug fix * pep8 fix * add changelog Co-authored-by: chaton <thomas@grid.ai> Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com> Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * update (#6403) * fix logger creating directory structure too early in DDP (#6380) * fix * add simple test * fix imports * add changelog * tighter test with on_fit_start hook closer to the dispatch call * move class inside test f unction * add a comment * Typing for tests 1/n (#6313) * typing * yapf * typing * [changelog] Update Changelog on release v1.2.3 (#6444) * update changelog * legacy 1.2.3 Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz> * Improve DummyLogger (#6398) * fix dummy logger * docs * update docs * add changelog * add none return annotation * return empty string for name, version * Raise an exception if check_val_every_n_epoch is not an integer (#6411) * raise an exception if check_val_every_n_epoch is not an integer * remove unused object * add type hints * add return type * update exception message * update exception message * Set find unused parameters to True by default to fix breaking compatibility (#6438) * Set find unused parameters to True by default to fix breaking models, add suggestion to re-enable * Add changelog * [bug] All_gather support tensor on cpu (#6416) * add test * update changelog * update * rename function * [Fix] Ensure we set the default device before initializing deepspeed (#6460) * Ensure we set the default device before initializing deepspeed * Add CHANGELOG.md * Update pytorch_lightning/plugins/training_type/deepspeed.py Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com> Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com> * Remove redundant test (#6466) * Add Trainer.validate(…) method to run one validation epoch (#4948) Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com> Co-authored-by: chaton <thomas@grid.ai> 
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * Allow user to disable the automatic formatting of checkpoint file names. (#6277) * cleaning SWA (#6259) * rename * if * test * chlog * Remove opt from manual_backward in docs (#6267) * switch agents pool (#6270) * Allow user to disable the automatic formatting of checkpoint file names. * Added changelog entry. * Made flake8 happy. * Applied review suggestion: quotes for special characters in docstring Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> * Fixed example in docstring. * Fixed syntax error in docstring. Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: Akihiro Nitta <nitta@akihironitta.com> Co-authored-by: thomas chaton <thomas@grid.ai> Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> * Hotfix for torchvision (#6476) * cover subproc coverage (#6477) * argparse: Add use_argument_group=True (#6088) * argparse: Add inplace option Replicate in GAN model * datamodule: Deduplicate logic w/ argparser utilities * Update pl_examples/domain_templates/generative_adversarial_net.py Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Akihiro Nitta <nitta@akihironitta.com> * Keep docstrings * Correct name * Whitespace * Consistency * fix weird type stuff * try alt - use_argument_group * fix syntax + lint * fix ci errs * fix ci * change examples... 
still failing w/ "unrecognized arguments: --batch_size" * address review * mnist_datamodule: add some docstrings * argparse: check cls or cls.__init__ for param didn't capture issue, but meh * fix lint * fix no-doc edge case * address review Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: Akihiro Nitta <nitta@akihironitta.com> Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com> * Disable batch transfer in DP mode (#6098) * add exceptions and test * hook * fix * clean up * clean up * regex * regex * docs * rev * comment and docs * chlog * Apply suggestions from code review Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> * Apply suggestions from code review Co-authored-by: chaton <thomas@grid.ai> * Monkey-patch device count * docs * pep * api_change Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> Co-authored-by: chaton <thomas@grid.ai> * remove obsolete todo in pl_examples (#6475) * [feat] Support iteration-based checkpointing in model checkpoint callback (#6146) * Update model_checkpoint.py * add tests * Update model_checkpoint.py * Update test_model_checkpoint.py * fix tests * every_n_batches * Update test_model_checkpoint.py * defaults * rm tests * Update model_checkpoint.py * Update test_model_checkpoint.py * Prune deprecated metrics for 1.3 (#6161) * prune deprecated metrics for 1.3 * isort / yapf * Update model_checkpoint.py * add tests * defaults * Update CHANGELOG.md * pre-commit * Update model_checkpoint.py * update defaults * Update test_remove_1-5.py * Update model_checkpoint.py * Update model_checkpoint.py * Update model_checkpoint.py * Update model_checkpoint.py * Update model_checkpoint.py * Update model_checkpoint.py * fix tests * Update test_model_checkpoint.py * Update model_checkpoint.py * Update model_checkpoint.py * Update model_checkpoint.py * Update test_model_checkpoint.py * ckpt-callback * Update test_model_checkpoint.py * Update model_checkpoint.py * Update model_checkpoint.py * 
validation-end * Update model_checkpoint.py * Update test_model_checkpoint.py * Update test_model_checkpoint.py * Update test_model_checkpoint.py * Update test_model_checkpoint.py * clarify-names - Make names explicit as to which hooks they apply to - Use step instead of batch for consistency with global step * Update model_checkpoint.py * Update model_checkpoint.py * Update model_checkpoint.py * Update model_checkpoint.py * Update model_checkpoint.py * mutual-exclusive Make every_n_train_steps and every_n_val_epochs mutually exclusive * fix-default-0 * Update CHANGELOG.md * formatting * make-private make attributes private to the class * rebase Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * update xla version (#6464) * Remove unused mixin attributes (#6487) * Remove unused mixing attributes * Missing import * [doc] Update the order of zero_grad and backward (#6478) * Fix zero_grad in docs * Fix zero_grad in docs * Fix tuner.scale_batch_size not finding batch size attribute when using datamodule (#5968) * Update docs for limit_predict_batches (#6507) * add docs and minor updates * docs * fraction * [bug] Update broadcast + reduce decision ModelCheckpoint] (#6410) * resolve bug * update * update changelog * update PR * Update pytorch_lightning/trainer/connectors/logger_connector/epoch_result_store.py Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> * add todo * resolve issues * resolve flake8 * update * add coverage for reduce * wip * restore back to brodbact * remove test.py * resolve flake8 * update * check world size * resolve test * update * use pytorch version when defined * update on comments * update on comments * flake8 * resolve bugs * Update CHANGELOG.md Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> * update * update * update * update * remove test * update * resolve flake8 * update * update * update * proxy * update * update * resolve typo * prune * update parallel * update Co-authored-by: Carlos Mocholí 
<carlossmocholi@gmail.com> * Handle torch.jit scripted modules in layer summary (#6511) * CI: resume testing with py3.8 (#6516) * testing on python 3.8 * req * document exceptions for metrics/functional (#6273) * document exceptions for metrics/functional * Apply suggestions from code review Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com> * Apply suggestions from code review Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com> Co-authored-by: Akihiro Nitta <nitta@akihironitta.com> * Mean Average Precision metric for Information Retrieval (1/5) (#5032) * init information retrieval metrics * changed retrieval metrics names, expanded arguments and fixed typo * added 'Retrieval' prefix to metrics and fixed conflict with already-present 'average_precision' file * improved code formatting * pep8 code compatibility * features/implemented new Mean Average Precision metrics for Information Retrieval + doc * fixed pep8 compatibility * removed threshold parameter and fixed typo on types in RetrievalMAP and improved doc * improved doc, put first class-specific args in RetrievalMetric and transformed RetrievalMetric in abstract class * implemented tests for functional and class metric. 
fixed typo when input tensors are empty or when all targets are False * fixed typos in doc and changed torch.true_divide to torch.div * fixed typos pep8 compatibility * fixed types in long division in ir_average_precision and example in mean_average_precision * RetrievalMetric states are not lists and _metric method accepts predictions and targets for easier extension * updated CHANGELOG file * added '# noqa: F401' flag to not used imports * added double space before '# noqa: F401' flag * Update CHANGELOG.md Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * change get_mini_groups in get_group_indexes * added checks on target inputs * minor refactoring for code cleanness * split tests over exception raising in separate function && refactored test code into multiple functions * fixed pep8 compatibility * implemented suggestions of @SkafteNicki * fixed imports for isort and added types annontations to functions in test_map.py * isort on test_map and fixed typing * isort on retrieval and on __init__.py and utils.py in metrics package * fixed typo in pytorch_lightning/metrics/__init__.py regarding code style * fixed yapf compatibility * fixed yapf compatibility * fixed typo in doc Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: Nicki Skafte <skaftenicki@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * CI: Azure publish results (#6514) * deprecate metrics pkg (#6505) * deprecate metrics * examples * req * docs * Apply suggestions from code review Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> Co-authored-by: Nicki Skafte <skaftenicki@gmail.com> * pep8 Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> Co-authored-by: Nicki Skafte <skaftenicki@gmail.com> * [test] lr_find with bs_scale (#6422) * init test: test_lr_find_with_bs_scale * Update test_lr_finder.py * remove gpu req * try boring model * custom boring model * pep8 * fix typo * Update test_lr_finder.py * typo * 
typo * Update DeepSpeed docs (#6528) * Clean up docs and add some explicitness around stages * Apply suggestions from code review Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com> Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com> * fix attribute access in LightningModule.toggle_optimizer (#6513) * Update hook lifecycle (#6538) * Update hook lifecycle * Update docs/source/common/lightning_module.rst * Prune metrics base classes 2/n (#6530) * base class * extensions * chlog * _stable_1d_sort * _check_same_shape * _input_format_classification_one_hot * utils * to_onehot * select_topk * to_categorical * get_num_classes * reduce * class_reduce * tests * Custom Plugin is_distributed (#6537) * return from plugin * dont return for tpu * refactor reading env defaults (#6510) * change tests * fix * test * _defaults_from_env_vars Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> * Prune metric: helpers and inputs 3/n (#6547) * _basic_input_validation * _check_shape_and_type_consistency * _check_num_classes_binary * _check_num_classes_mc * _check_num_classes_ml * _check_top_k * _check_classification_inputs * _input_format_classification * _reduce_stat_scores * DataType * rest * flake8 * chlog * prune warning & deprecation wrapper (#6540) * docs * wrapper * test * count * flake8 * Add outputs param for `on_val/test_epoch_end` hooks (#6120) * add outputs param for on_val/test_epoch_end hooks * update changelog * fix warning message * add custom call hook * cache logged metrics * add args to docstrings * use warning cache * add utility method for param in sig check * Update CHANGELOG.md Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * update docstring * add test for eval epoch end hook * add types and replace model ref * add deprecation test * fix test fx name * add model hooks warning * add old signature model to tests * add clear warning cache * sopport args param * update tests * add tests for model hooks * code suggestions * add signature utils * 
fix pep8 issues * fix pep8 issues * fix outputs issue * fix tests * code fixes * fix validate test * test Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * [doc] Add Zero Grad `set_to_none=True` trick (#6548) * add trick to doc * update * update path * Update docs/source/benchmarking/performance.rst Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com> Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com> * fix deprecation wrapper & tests (#6553) * fix deprecation wrapper & tests * flake8 * prune metric: accuracy 4/n (#6515) * prune accuracy * chlog * flake8 * Apply suggestions from code review Co-authored-by: Nicki Skafte <skaftenicki@gmail.com> * wrap * test * test * fix Co-authored-by: Nicki Skafte <skaftenicki@gmail.com> * Prune metrics: AUC & AUROC (#6572) * class: AUC AUROC * func: auc auroc * format * tests * [doc] Update Dict Train Loader doc. (#6579) * update doc * update example * Prune metrics: precision & recall 6/n (#6573) * avg precision * precision * recall * curve * tests * chlog * isort * fix * Update Changelog for v1.2.4 (#6581) * Update changelog for v1.2.4 * lagacy v1.2.4 * prune duplicates from changelog Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz> * [Fix] Move init dist connection into the setup function (#6506) * Move connection setup into the setup function. 
Call setup hook after we set up the accelerator * Added CHANGELOG.md * fix setup order in callback test * fix input arguments in test * Mock distributed function, remove protection to turn into training type hook * Remove import * Add missing mock, ensure custom plugin does not create children process * Skip test on windows * Update deepspeed to init connection in setup * Do not initialize distributed module * Move DeepSpeed tests to special tests since dist communication is being set up * Special the test to see if this fixes CI * Delete accelerator connector test to see if its causing build to fail * Delete deepspeed test * Revert "Delete accelerator connector test to see if its causing build to fail" This reverts commit edde60b8 * Revert "Delete deepspeed test" This reverts commit 9d317429 * Reverse hook * Reverse setup hooks to debug again * Add todo so i know where i left off * For single device move in pre_dispatch after setup function * Add additional model to device hook if any additional parameters have been set * See if we can enable deepspeed tests * Revert "See if we can enable deepspeed tests" This reverts commit b5450def * See if this hook approach works * Introduce new granular hooks * Remove import, fix tpu spawn by moving the function to setup * Added missing special test Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * Fix all_gather for tpu_cores=8 (#6587) * Update Gradient Clipping for TPU Accelerator (#6576) * NGC container PoC (#6187) * add NVIDIA flows * push * pull * ... * extras * ci prune * fix * tag * . 
* list * Automatically set sync_batchnorm for training_type_plugin (#6536) Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> Co-authored-by: Roger Shieh <sh.rog@protonmail.ch> Co-authored-by: Kaushik Bokka <kaushikbokka@gmail.com> * Prune metrics: other classification 7/n (#6584) * confusion_matrix * iou * f_beta * hamming_distance * stat_scores * tests * flake8 * chlog * fixing examples (#6600) * try Azure * -e * path * Add AMP for validation, prediction and testing (#6565) * Add Tests for val and test-steps * Add native AMP * pep8 tests * pep8 plugin * changelog * Add trainer.predict config validation (#6543) Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> * Add DDP Spawn being default for Multi GPUs (#6292) * Move profiler tests (#6619) * drop mypy from .pre-commit-config.yaml (#6542) * Clean utilities/argparse and add missing tests (#6607) * Allow training type plugin to delay optimizer creation (FSDP 2/n) (#6331) * Allow training_type_plugin to delay optimizer configure * Add missing references to trainer, add a CPU accelerator based test * Add teardown method to BaseProfiler. (#6370) Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> Co-authored-by: ananthsub <ananth.subramaniam@gmail.com> * refactoring setup (#6590) * refactoring setup * . * docs * flake8 * hotfix: mock examples (#6632) * mock examples * drop from GA * [refactor] Add setup to profilers + _run_stage_setup to trainer 2/5 (#6633) * add setup * update * updates on comment * Minor changes * Extra import * Docs Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com> * fix comparing versions (#6434) * fix comparing versions * chlog * . * ... 
* datasets * Prune metrics: regression 8/n (#6636) * explained_variance * tests * mean_absolute_error * mean_squared_error * mean_relative_error * mean_squared_log_error * chlog * Prune metyrics: regression 9/n (#6637) * psnr * r2score * ssim * chlog * Refactor base profilers 3/5 (#6621) Co-authored-by: tchaton <thomas@grid.ai> * prune metrics: info retrieval (#6649) * Flash predict step (#6577) * add predict_step * Update predict_loop.py * Update trainer.py * Update trainer.py * resolve bugs * update * update * update * resolve bug * resolve some failing tests * udpate tests * update * resolve tests * add a test * remove typo * add a test for attachement * update * changed to on_train_dataloader * remove __flash_special_attr__ * resolve tests * update * update * update * update on comments * Update pytorch_lightning/trainer/data_loading.py Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com> Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * fix back-compatibility for Accel (#6655) * Refactor PyTorch profiler 4/5 (#6349) Co-authored-by: thomas chaton <thomas@grid.ai> * Add PyTorch 1.8 Profiler 5/5 (#6618) * Refactor profilers * Update PassThrough * WIP - This is broken and will change * Update pytorch_lightning/profiler/pytorch.py Co-authored-by: thomas chaton <thomas@grid.ai> * resolve tests * resolve tests * find output * try something * update * add support for test and predict * update * update * use getattr * test * test * update * tests * update * update * update * update * update * remove file * update * update * update * update * update * test * update# * update * update tests * update * add suport for 1.8 * rename records * add support for 1.8 * update * resolve flake8 * resolve test * Refactor basic profilers * Fixes * Unused import * Introduce setup * Profile on all ranks. Print to stdout on 0 * Introduce dirpath + filename * CHANGELOG * Add tests. 
Address comments * add `on_run_stage_setup` * add on_run_stage_setup function * update * add test for RegisterRecordFunction * update lightnng flow direction * move variable to private * remove trace * Undo code that should be in 3/4 * Multi-stage multi-rank * 2/5 changes * Pass stage in __del__ * Remove TODOs * Describe on_evaluation_end. Add tests * Typo * Address comments * deepcopy tests * Advanced teardown * Fix teardown test * Fix tests * Minor change * Update CHANGELOG.md * Fix test * Quick fixes * Fix 6522 * resolve ddp tests * resolve tests * resolve some tests …
…ter) to github/third-party/PyTorchLightning/pytorch-lightning

Summary:

### New commit log messages

## [UnReleased] - 2021-MM-DD

### Added

- Added more explicit exception message when trying to execute `trainer.test()` or `trainer.validate()` with `fast_dev_run=True` ([#6667](Lightning-AI/pytorch-lightning#6667))
- Added `LightningCLI` class to provide simple reproducibility with minimum boilerplate training cli. ([#4492](Lightning-AI/pytorch-lightning#4492))
- Trigger warning when non-metric logged value with multi processes hasn't been reduced ([#6417](Lightning-AI/pytorch-lightning#6417))
- Added `gradient_clip_algorithm` argument to Trainer for gradient clipping by value ([#6123](Lightning-AI/pytorch-lightning#6123)).
- Added a way to print to terminal without breaking up the progress bar ([#5470](Lightning-AI/pytorch-lightning#5470))
- Added support to checkpoint after training steps in `ModelCheckpoint` callback ([#6146](Lightning-AI/pytorch-lightning#6146))
- Added `checkpoint` parameter to callback's `on_save_checkpoint` hook ([#6072](Lightning-AI/pytorch-lightning#6072))
- Added `RunningStage.SANITY_CHECKING` ([#4945](Lightning-AI/pytorch-lightning#4945))
- Added `TrainerState.{FITTING,VALIDATING,TESTING,PREDICTING,TUNING}` ([#4945](Lightning-AI/pytorch-lightning#4945))
- Added `Trainer.validate()` method to perform one evaluation epoch over the validation set ([#4948](Lightning-AI/pytorch-lightning#4948))
- Added `LightningEnvironment` for Lightning-specific DDP ([#5915](Lightning-AI/pytorch-lightning#5915))
- Added `teardown()` hook to LightningDataModule ([#4673](Lightning-AI/pytorch-lightning#4673))
- Added `auto_insert_metric_name` parameter to `ModelCheckpoint` ([#6277](Lightning-AI/pytorch-lightning#6277))
- Added arg to `self.log` that enables users to give custom names when dealing with multiple dataloaders ([#6274](Lightning-AI/pytorch-lightning#6274))
- Added `teardown` method to `BaseProfiler` to enable subclasses defining post-profiling steps outside of `__del__` ([#6370](Lightning-AI/pytorch-lightning#6370))
- Added `setup` method to `BaseProfiler` to enable subclasses defining pre-profiling steps for every process ([#6633](Lightning-AI/pytorch-lightning#6633))
- Added no return warning to predict ([#6139](Lightning-AI/pytorch-lightning#6139))
- Added `Trainer.predict` config validation ([#6543](Lightning-AI/pytorch-lightning#6543))
- Added `AbstractProfiler` interface ([#6621](Lightning-AI/pytorch-lightning#6621))
- Added support for including module names for forward in the autograd trace of `PyTorchProfiler` ([#6349](Lightning-AI/pytorch-lightning#6349))
- Added support for the PyTorch 1.8.1 autograd profiler ([#6618](Lightning-AI/pytorch-lightning#6618))
- Added `outputs` parameter to callback's `on_validation_epoch_end` & `on_test_epoch_end` hooks ([#6120](Lightning-AI/pytorch-lightning#6120))
- Added `configure_sharded_model` hook ([#6679](Lightning-AI/pytorch-lightning#6679))
- Added support for `precision=64`, enabling training with double precision ([#6595](Lightning-AI/pytorch-lightning#6595))
- Added support for DDP communication hooks ([#6736](Lightning-AI/pytorch-lightning#6736))
- Added `artifact_location` argument to `MLFlowLogger` which will be passed to the `MlflowClient.create_experiment` call ([#6677](Lightning-AI/pytorch-lightning#6677))
- Added `model` parameter to precision plugins' `clip_gradients` signature ([#6764](Lightning-AI/pytorch-lightning#6764))

### Changed

- Renamed `pytorch_lightning.callbacks.swa` to `pytorch_lightning.callbacks.stochastic_weight_avg` ([#6259](Lightning-AI/pytorch-lightning#6259))
- Refactor `RunningStage` and `TrainerState` usage ([#4945](Lightning-AI/pytorch-lightning#4945))
- Changed `trainer.evaluating` to return `True` if validating or testing ([#4945](Lightning-AI/pytorch-lightning#4945))
- Changed `setup()` and `teardown()` stage argument to take any of `{fit,validate,test,predict}` ([#6386](Lightning-AI/pytorch-lightning#6386))
- Changed profilers to save separate report files per state and rank ([#6621](Lightning-AI/pytorch-lightning#6621))
- Changed `PyTorchProfiler` to use `torch.autograd.profiler.record_function` to record functions ([#6349](Lightning-AI/pytorch-lightning#6349))

### Deprecated

- `period` has been deprecated in favor of `every_n_val_epochs` in the `ModelCheckpoint` callback ([#6146](Lightning-AI/pytorch-lightning#6146))
- Deprecated `trainer.running_sanity_check` in favor of `trainer.sanity_checking` ([#4945](Lightning-AI/pytorch-lightning#4945))
- Deprecated `Profiler(output_filename)` in favor of `dirpath` and `filename` ([#6621](Lightning-AI/pytorch-lightning#6621))
- Deprecated `PytorchProfiler(profiled_functions)` in favor of `record_functions` ([#6349](Lightning-AI/pytorch-lightning#6349))
- Deprecated metrics in favor of `torchmetrics` ([#6505](Lightning-AI/pytorch-lightning#6505), [#6530](Lightning-AI/pytorch-lightning#6530), [#6540](Lightning-AI/pytorch-lightning#6540), [#6547](Lightning-AI/pytorch-lightning#6547), [#6515](Lightning-AI/pytorch-lightning#6515), [#6572](Lightning-AI/pytorch-lightning#6572), [#6573](Lightning-AI/pytorch-lightning#6573), [#6584](Lightning-AI/pytorch-lightning#6584), [#6636](Lightning-AI/pytorch-lightning#6636), [#6637](Lightning-AI/pytorch-lightning#6637), [#6649](Lightning-AI/pytorch-lightning#6649), [#6659](Lightning-AI/pytorch-lightning#6659))

### Removed

- Removed support for passing a bool value to `profiler` argument of Trainer ([#6164](Lightning-AI/pytorch-lightning#6164))
- Removed no return warning from val/test step ([#6139](Lightning-AI/pytorch-lightning#6139))
- Removed passing a `ModelCheckpoint` instance to `Trainer(checkpoint_callback)` ([#6166](Lightning-AI/pytorch-lightning#6166))
- Removed deprecated Trainer argument `enable_pl_optimizer` and `automatic_optimization` ([#6163](Lightning-AI/pytorch-lightning#6163))
- Removed deprecated metrics ([#6161](Lightning-AI/pytorch-lightning#6161))
    * from `pytorch_lightning.metrics.functional.classification` removed `to_onehot`, `to_categorical`, `get_num_classes`, `roc`, `multiclass_roc`, `average_precision`, `precision_recall_curve`, `multiclass_precision_recall_curve`
    * from `pytorch_lightning.metrics.functional.reduction` removed `reduce`, `class_reduce`
- Removed deprecated `ModelCheckpoint` arguments `prefix`, `mode="auto"` ([#6162](Lightning-AI/pytorch-lightning#6162))
- Removed `mode='auto'` from `EarlyStopping` ([#6167](Lightning-AI/pytorch-lightning#6167))
- Removed legacy references for magic keys in the `Result` object ([#6016](Lightning-AI/pytorch-lightning#6016))
- Removed deprecated `LightningModule` `hparams` setter ([#6207](Lightning-AI/pytorch-lightning#6207))
- Removed legacy code to log or include metrics in the progress bar by returning them in a dict with the `"log"/"progress_bar"` magic keys. Use `self.log` instead ([#6734](Lightning-AI/pytorch-lightning#6734))
- Removed `optimizer_idx` argument from `training_step` in manual optimization ([#6093](Lightning-AI/pytorch-lightning#6093))

### Fixed

- Set better defaults for `rank_zero_only.rank` when training is launched with SLURM and torchelastic ([#6802](Lightning-AI/pytorch-lightning#6802))
- Made the `Plugin.reduce` method more consistent across all Plugins to reflect a mean-reduction by default ([#6011](Lightning-AI/pytorch-lightning#6011))
- Move lightning module to correct device type when using LightningDistributedWrapper ([#6070](Lightning-AI/pytorch-lightning#6070))
- Do not print top-k verbose log with `ModelCheckpoint(monitor=None)` ([#6109](Lightning-AI/pytorch-lightning#6109))
- Fixed csv extension check ([#6436](Lightning-AI/pytorch-lightning#6436))
- Fixed `ModelCheckpoint(monitor=None, save_last=True)` not saving checkpoints ([#6136](Lightning-AI/pytorch-lightning#6136))
- Fixed `ModelCheckpoint(save_top_k=0, save_last=True)` not saving the `last` checkpoint ([#6136](Lightning-AI/pytorch-lightning#6136))
- Fixed `.teardown(stage='fit')` getting called during `trainer.test` ([#6386](Lightning-AI/pytorch-lightning#6386))
- Fixed `.on_fit_{start,end}()` getting called during `trainer.test` ([#6386](Lightning-AI/pytorch-lightning#6386))
- Fixed LightningModule `all_gather` on cpu tensors ([#6416](Lightning-AI/pytorch-lightning#6416))
- Fixed torch distributed not available in setup hook for DDP ([#6506](Lightning-AI/pytorch-lightning#6506))
- Fixed `EarlyStopping` logic when `min_epochs` or `min_steps` requirement is not met ([#6705](Lightning-AI/pytorch-lightning#6705))

## [1.2.7] - 2021-04-06

### Fixed

- Fixed resolve a bug with omegaconf and xm.save ([#6741](Lightning-AI/pytorch-lightning#6741))
- Fixed an issue with IterableDataset when __len__ is not defined ([#6828](Lightning-AI/pytorch-lightning#6828))
- Sanitize None params during pruning ([#6836](Lightning-AI/pytorch-lightning#6836))
- Enforce an epoch scheduler interval when using SWA ([#6588](Lightning-AI/pytorch-lightning#6588))
- Fixed TPU Colab hang issue, post training ([#6816](Lightning-AI/pytorch-lightning#6816))
- Fixed a bug where `TensorBoardLogger` would give a warning and not log correctly to a symbolic link `save_dir` ([#6730](Lightning-AI/pytorch-lightning#6730))

## [1.2.6] - 2021-03-30

### Changed

- Changed the behavior of `on_epoch_start` to run at the beginning of validation & test epoch ([#6498](Lightning-AI/pytorch-lightning#6498))

### Removed

- Removed legacy code to include `step` dictionary returns in `callback_metrics`. Use `self.log_dict` instead. ([#6682](Lightning-AI/pytorch-lightning#6682))

### Fixed

- Fixed `DummyLogger.log_hyperparams` raising a `TypeError` when running with `fast_dev_run=True` ([#6398](Lightning-AI/pytorch-lightning#6398))
- Fixed error on TPUs when there was no `ModelCheckpoint` ([#6654](Lightning-AI/pytorch-lightning#6654))
- Fixed `trainer.test` freeze on TPUs ([#6654](Lightning-AI/pytorch-lightning#6654))
- Fixed a bug where gradients were disabled after calling `Trainer.predict` ([#6657](Lightning-AI/pytorch-lightning#6657))
- Fixed bug where no TPUs were detected in a TPU pod env ([#6719](Lightning-AI/pytorch-lightning#6719))

## [1.2.5] - 2021-03-23

### Changed

- Update Gradient Clipping for the TPU Accelerator ([#6576](Lightning-AI/pytorch-lightning#6576))
- Refactored setup for typing friendly ([#6590](Lightning-AI/pytorch-lightning#6590))

### Fixed

- Fixed a bug where `all_gather` would not work correctly with `tpu_cores=8` ([#6587](Lightning-AI/pytorch-lightning#6587))
- Fixed comparing required versions ([#6434](Lightning-AI/pytorch-lightning#6434))
- Fixed duplicate logs appearing in console when using the python logging module ([#6275](Lightning-AI/pytorch-lightning#6275))
- Added Autocast in validation, test and predict modes for Native AMP ([#6565](Lightning-AI/pytorch-lightning#6565))

Reviewed By: shuyingsunshine21

Differential Revision: D27528929

fbshipit-source-id: 311c88f71461c2c79bbf185e28d7a6d683ccc26f
What does this PR do?
Fixes #6318
This fix moves the initialization of the DDP connection into the setup function and reorders the hooks so that `setup` now has access to the initialized distributed environment. This is also important for FSDP.
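To illustrate what the reordering enables, here is a minimal sketch of a `setup` hook that reads the rank and world size from `torch.distributed`. The helper `distributed_info` and the class `ShardAwareSetup` are hypothetical names for this example and are not part of this PR or of Lightning; the class is a plain Python stand-in for a `LightningModule` so the snippet runs without Lightning installed. It assumes only the public `torch.distributed` query functions and falls back to single-process defaults when the process group is not initialized.

```python
from typing import Any, Dict, Optional


def distributed_info() -> Dict[str, Any]:
    """Return rank and world size, falling back to single-process defaults."""
    fallback = {"initialized": False, "rank": 0, "world_size": 1}
    try:
        import torch.distributed as dist
    except ImportError:
        # torch is not installed: behave like a single process.
        return fallback
    if dist.is_available() and dist.is_initialized():
        return {
            "initialized": True,
            "rank": dist.get_rank(),
            "world_size": dist.get_world_size(),
        }
    return fallback


class ShardAwareSetup:
    """Hypothetical stand-in for a LightningModule; only the hook is sketched."""

    def setup(self, stage: Optional[str] = None) -> None:
        info = distributed_info()
        # With the process group initialized before `setup`, these values
        # reflect the real DDP environment rather than the fallback.
        self.rank = info["rank"]
        self.world_size = info["world_size"]


module = ShardAwareSetup()
module.setup("fit")
print(module.rank, module.world_size)
```

Run in a single process, the hook simply sees the fallback values; under DDP with this change, the same hook could use the rank to, for example, pick a per-process shard of a dataset.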
However, this fix makes DDP Spawn diverge from DDP, which should be noted in the docs. As @ananthsub and @awaelchli have brought up, we may need to discuss the responsibility of hooks, primarily because we're seeing some inflexibility in the API when changing the call order of hooks or custom loops.
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines.
Did you have fun?
Make sure you had fun coding 🙃