
Re-design call_hook interface #10575

Merged (43 commits into master) on Dec 4, 2021

Conversation

@daniellepintz (Contributor) commented Nov 16, 2021

What does this PR do?

Fixes #8506

Replaces call_hook with 4 methods: call_callback_hooks, call_lightning_module_hook, call_accelerator_hook and call_ttp_hook. The last two can hopefully be removed at some point in the future.
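For context, a rough sketch of what the split looks like conceptually (toy, standalone names; not the code merged in this PR). Each dispatcher targets exactly one kind of hook owner instead of a single call_hook() fanning out to callbacks, the LightningModule, the accelerator, and the training type plugin at once:

# Toy sketch only; names are hypothetical and simplified.
class ToyCallback:
    def on_train_start(self, trainer, pl_module):
        print("callback hook")

class ToyModule:
    def on_train_start(self):
        print("module hook")

class ToyTrainer:
    def __init__(self, callbacks, module):
        self.callbacks = callbacks
        self.lightning_module = module

    def _call_callback_hooks(self, hook_name, *args, **kwargs):
        # dispatch only to callbacks
        for cb in self.callbacks:
            fn = getattr(cb, hook_name, None)
            if callable(fn):
                fn(self, self.lightning_module, *args, **kwargs)

    def _call_lightning_module_hook(self, hook_name, *args, **kwargs):
        # dispatch only to the LightningModule and return its result
        fn = getattr(self.lightning_module, hook_name, None)
        if callable(fn):
            return fn(*args, **kwargs)

trainer = ToyTrainer([ToyCallback()], ToyModule())
trainer._call_callback_hooks("on_train_start")         # prints "callback hook"
trainer._call_lightning_module_hook("on_train_start")  # prints "module hook"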

Does your PR introduce any breaking changes? If yes, please list them.

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

@four4fish (Contributor)

[RFC] @daniellepintz @awaelchli what's our plan for TrainerOptimizerMixin? TrainerOptimizerMixin is basically unrelated to the trainer; the optimizer and LR setup logic can move into Strategy. Strategy is the only place that uses this logic, and it owns the optimizers and LR schedulers.
But TrainerOptimizerMixin calls trainer.call_hook(). Could we separate call_hook from the trainer? Then we could remove the Trainer reference from Strategy and remove TrainerOptimizerMixin.

@ananthsub (Contributor) commented Nov 19, 2021

@four4fish TrainerOptimizerMixin is mostly separate from reworking call_hook, so let's write up a separate issue/doc to discuss it

@daniellepintz (Contributor, Author) commented Nov 19, 2021

@four4fish the plan for TrainerOptimizerMixin is to move all of its logic to Strategy and then remove it. I don't think we need to separate call_hook from the trainer in order to remove TrainerOptimizerMixin. Re: the trainer reference on Strategy, is call_hook the only reason it is needed?

@four4fish (Contributor)

@daniellepintz after the refactor (including the spawning-logic simplification), yes, I think so. Right now we need to pass the trainer to setup and setup_optimizers because of the call_hook in TrainerOptimizerMixin. Otherwise we would only need trainer.state.fn.
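(For illustration, a toy sketch of the direction being discussed here, using hypothetical standalone names rather than Lightning's actual classes: the Strategy owns optimizer setup and only consumes the trainer's state, so it no longer needs to call back into trainer.call_hook().)

from enum import Enum

class ToyTrainerFn(Enum):
    FITTING = "fit"
    VALIDATING = "validate"

class ToyStrategy:
    def __init__(self, module):
        self.lightning_module = module
        self.optimizers = []

    def setup_optimizers(self, trainer_fn):
        # only set up optimizers when fitting; other trainer functions don't need them
        if trainer_fn != ToyTrainerFn.FITTING:
            return
        self.optimizers = list(self.lightning_module.configure_optimizers())

class ToyModule:
    def configure_optimizers(self):
        return ["sgd"]  # stand-in for real torch optimizers

strategy = ToyStrategy(ToyModule())
strategy.setup_optimizers(ToyTrainerFn.FITTING)
print(strategy.optimizers)  # ['sgd']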

@daniellepintz (Contributor, Author)

There are 12 failing tests remaining in the CI, and they are all passing locally for me so I am quite confused:

FAILED tests/accelerators/test_accelerator_connector.py::test_plugin_accelerator_choice[ddp_spawn-ddp_sharded]
FAILED tests/accelerators/test_accelerator_connector.py::test_plugin_accelerator_choice[None-ddp_sharded]
FAILED tests/accelerators/test_accelerator_connector.py::test_accelerator_choice_multi_node_gpu[1-ddp_sharded-DDPShardedPlugin]
FAILED tests/accelerators/test_accelerator_connector.py::test_accelerator_choice_multi_node_gpu[1-ddp_sharded_spawn-DDPSpawnShardedPlugin]
FAILED tests/accelerators/test_accelerator_connector.py::test_accelerator_choice_multi_node_gpu[2-ddp_sharded-DDPShardedPlugin]
FAILED tests/accelerators/test_accelerator_connector.py::test_accelerator_choice_multi_node_gpu[2-ddp_sharded_spawn-DDPSpawnShardedPlugin]
FAILED tests/plugins/test_cluster_integration.py::test_ranks_available_manual_plugin_selection[DDPShardedPlugin]
FAILED tests/plugins/test_cluster_integration.py::test_ranks_available_automatic_plugin_selection[trainer_kwargs1]
FAILED tests/plugins/test_plugins_registry.py::test_ddp_find_unused_parameters_training_type_plugins_registry[ddp_sharded_spawn_find_unused_parameters_false-DDPSpawnShardedPlugin]
FAILED tests/plugins/test_plugins_registry.py::test_ddp_find_unused_parameters_training_type_plugins_registry[ddp_sharded_find_unused_parameters_false-DDPShardedPlugin]
FAILED tests/trainer/test_trainer.py::test_trainer_config_strategy[trainer_kwargs27-expected27]
FAILED tests/trainer/test_trainer.py::test_trainer_config_strategy[trainer_kwargs28-expected28]

@ananthsub (Contributor) commented Dec 4, 2021

The CI is failing with this error:

pytorch_lightning.utilities.exceptions.MisconfigurationException: `DDPShardedPlugin` requires `fairscale` to be installed. Install it by running `pip install fairscale`.

Was fairscale removed as a dependency?

@daniellepintz (Contributor, Author)

No, I didn't change anything with fairscale. Maybe @awaelchli knows something, since he has been working with DDP recently?

@awaelchli (Contributor) commented Dec 4, 2021

Happy to help.

Take the example of this failing test: tests/trainer/test_trainer.py::test_trainer_config_strategy
You see the error in CI but not locally because:

  • the test initializes a Trainer with the sharded plugin
  • during Trainer init, there is a call to _call_callback_hooks("on_init_start")
  • that call accesses training_type_plugin.lightning_module
  • the sharded plugin cannot work if fairscale is not installed, hence the error
  • locally you have fairscale installed, so you don't see the error; the CI, however, does not install fairscale
  • even if the sharded plugin behaved differently, a LightningModule reference is not available during Trainer init anyway.

This all occurs because this PR replaced
self.on_init_start()
with
self._call_callback_hooks("on_init_start")
and that is no longer equivalent to what it was before. The same problem would occur if we replaced self.on_init_start() with self.call_hook("on_init_start") on master: on_init_start only exists for callbacks, not for the LightningModule.
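To make the failure mode concrete, here is a minimal, self-contained toy illustration (hypothetical names, not Lightning's actual classes) of why the old callback-only method was safe during init while a generic dispatcher that resolves the module is not:

class ToyCallback:
    def on_init_start(self, trainer):
        print("callback on_init_start")

class ToyTrainer:
    def __init__(self, callbacks):
        self.callbacks = callbacks

    @property
    def lightning_module(self):
        # mimics trainer.lightning_module -> training_type_plugin.lightning_module
        raise RuntimeError("plugin not usable here (e.g. fairscale missing)")

    def on_init_start(self):
        # old style: touches callbacks only, safe during __init__
        for cb in self.callbacks:
            cb.on_init_start(self)

    def _call_callback_hooks(self, hook_name):
        _ = self.lightning_module  # resolving the module raises during __init__
        for cb in self.callbacks:
            getattr(cb, hook_name)(self)

trainer = ToyTrainer([ToyCallback()])
trainer.on_init_start()  # works: prints "callback on_init_start"
try:
    trainer._call_callback_hooks("on_init_start")
except RuntimeError as err:
    print(err)  # plugin not usable here (e.g. fairscale missing)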

My suggestion for resolution:

@daniellepintz (Contributor, Author)

@awaelchli thank you SO much for your help! That makes total sense. Fixed!!

@daniellepintz enabled auto-merge (squash) December 4, 2021 05:24
@carmocca added this to the 1.6 milestone Dec 4, 2021
@carmocca added the hooks (Related to the hooks API) and refactor labels Dec 4, 2021
@carmocca (Contributor) left a comment

Epic!

Review thread on pytorch_lightning/trainer/trainer.py:
fn = getattr(self.accelerator, hook_name)
if not callable(fn):
return None

Contributor:

We discussed not setting the current_fx_name for accelerator but since we do it manually in all places that call this method, might as well put it inside.

@daniellepintz (Contributor, Author):

Yeah, I ended up having to change it because some tests were failing. Also, at least this way we don't use the LM's trainer reference in TTP.py.

@awaelchli (Contributor) commented Dec 6, 2021:

I have some mind-boggling problems in #10890. One question: why do these new call_xyz methods not reset the current_fx_name to its previous value? Was this forgotten or intentional?

@daniellepintz (Contributor, Author):

After discussing with @carmocca, I learned that only LightningModule hooks and callback hooks need to set current_fx_name, so I only included that in _call_lightning_module_hook and _call_callback_hooks, and not in _call_ttp_hook or _call_accelerator_hook. However, before this PR, all the places that called the accelerator hook had a line like this:
https://github.com/PyTorchLightning/pytorch-lightning/blob/3d6262b7a91215e72019d720e742e6261e1636dc/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py#L220

So I included self.lightning_module._current_fx_name = hook_name in _call_accelerator_hook and got rid of the individual lines like the one above.
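(A hedged sketch of the set-and-restore pattern under discussion, with toy standalone names rather than the code in this PR: set _current_fx_name for the duration of the hook call, then restore the previous value so nested hook calls don't clobber it.)

class ToyModule:
    _current_fx_name = None

    def on_train_start(self):
        print(f"running hook while _current_fx_name={self._current_fx_name}")

def call_lightning_module_hook(pl_module, hook_name, *args, **kwargs):
    fn = getattr(pl_module, hook_name, None)
    if not callable(fn):
        return None
    prev_fx_name = pl_module._current_fx_name
    pl_module._current_fx_name = hook_name
    try:
        return fn(*args, **kwargs)
    finally:
        # restoring to the previous value addresses the question raised above
        pl_module._current_fx_name = prev_fx_name

module = ToyModule()
call_lightning_module_hook(module, "on_train_start")
assert module._current_fx_name is None  # restored afterwards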

The mergify bot added the ready (PRs ready to be merged) label Dec 4, 2021
@carmocca disabled auto-merge December 4, 2021 13:03
The codecov bot commented Dec 4, 2021
Codecov Report

Merging #10575 (492fc62) into master (a28b4cd) will decrease coverage by 4%.
The diff coverage is 98%.

@@           Coverage Diff            @@
##           master   #10575    +/-   ##
========================================
- Coverage      92%      88%    -4%     
========================================
  Files         177      177            
  Lines       16553    16484    -69     
========================================
- Hits        15204    14522   -682     
- Misses       1349     1962   +613     

@daniellepintz enabled auto-merge (squash) December 4, 2021 21:39
@daniellepintz merged commit 6043179 into Lightning-AI:master Dec 4, 2021
@daniellepintz deleted the call_hook branch December 4, 2021 21:40
@daniellepintz (Contributor, Author)

Merged, but @carmocca, if you have any follow-ups I can add them in subsequent PRs.

carmocca added a commit that referenced this pull request Dec 6, 2021
@carmocca mentioned this pull request Dec 6, 2021
@carmocca (Contributor) commented Dec 6, 2021

Just opened #10957 with what I think was missing.

@awaelchli, for your PR, you'll have to set and unset it in _call_ttp_hook, as you've moved the calls from the accelerator to the TTP.

carmocca added a commit that referenced this pull request Dec 7, 2021 (Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>)
@daniellepintz mentioned this pull request Dec 7, 2021
@AndresAlgaba mentioned this pull request Sep 23, 2022
Labels: hooks (Related to the hooks API), ready (PRs ready to be merged), refactor
Development

Successfully merging this pull request may close these issues.

[RFC] Re-design call_hook interface
8 participants