
Re-design call_hook interface #10575

Merged (43 commits into master) on Dec 4, 2021

Conversation

@daniellepintz (Contributor) commented Nov 16, 2021

What does this PR do?

Fixes #8506

Replaces call_hook with 4 methods: call_callback_hooks, call_lightning_module_hook, call_accelerator_hook and call_ttp_hook. The last two can hopefully be removed at some point in the future.
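For context, a rough sketch of what the split looks like conceptually (toy, standalone names; not the code merged in this PR). Each dispatcher targets exactly one kind of hook owner instead of a single call_hook() fanning out to callbacks, the LightningModule, the accelerator, and the training type plugin at once:

# Toy sketch only; names are hypothetical and simplified.
class ToyCallback:
    def on_train_start(self, trainer, pl_module):
        print("callback hook")

class ToyModule:
    def on_train_start(self):
        print("module hook")

class ToyTrainer:
    def __init__(self, callbacks, module):
        self.callbacks = callbacks
        self.lightning_module = module

    def _call_callback_hooks(self, hook_name, *args, **kwargs):
        # dispatch only to callbacks
        for cb in self.callbacks:
            fn = getattr(cb, hook_name, None)
            if callable(fn):
                fn(self, self.lightning_module, *args, **kwargs)

    def _call_lightning_module_hook(self, hook_name, *args, **kwargs):
        # dispatch only to the LightningModule and return its result
        fn = getattr(self.lightning_module, hook_name, None)
        if callable(fn):
            return fn(*args, **kwargs)

trainer = ToyTrainer([ToyCallback()], ToyModule())
trainer._call_callback_hooks("on_train_start")         # prints "callback hook"
trainer._call_lightning_module_hook("on_train_start")  # prints "module hook"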

Does your PR introduce any breaking changes? If yes, please list them.

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

@four4fish (Contributor)

[RFC] @daniellepintz @awaelchli what's our plan for TrainerOptimizerMixin? TrainerOptimizerMixin is basically unrelated to the trainer; the optimizer and LR setup logic can move into Strategy. Strategy is the only place that uses this logic, and it owns the optimizers and LR schedulers.
But TrainerOptimizerMixin calls trainer.call_hook(). Could we separate call_hook from the trainer? Then we could remove the Trainer reference from Strategy and remove TrainerOptimizerMixin.

@ananthsub (Contributor) commented Nov 19, 2021

@four4fish TrainerOptimizerMixin is mostly separate from reworking call_hook, so let's write up a separate issue/doc to discuss it

@daniellepintz (Contributor, Author) commented Nov 19, 2021

@four4fish the plan for TrainerOptimizerMixin is to move all of its logic to Strategy and then remove it. I don't think we need to separate call_hook from the trainer in order to remove TrainerOptimizerMixin. Re: the trainer reference on Strategy, is call_hook the only reason it is needed?

@four4fish (Contributor)

@daniellepintz after the refactor (including the spawning-logic simplification), yes, I think so. Right now we need to pass the trainer to setup and setup_optimizers because of the call_hook in TrainerOptimizerMixin. Otherwise we would only need trainer.state.fn.
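(For illustration, a toy sketch of the direction being discussed here, using hypothetical standalone names rather than Lightning's actual classes: the Strategy owns optimizer setup and only consumes the trainer's state, so it no longer needs to call back into trainer.call_hook().)

from enum import Enum

class ToyTrainerFn(Enum):
    FITTING = "fit"
    VALIDATING = "validate"

class ToyStrategy:
    def __init__(self, module):
        self.lightning_module = module
        self.optimizers = []

    def setup_optimizers(self, trainer_fn):
        # only set up optimizers when fitting; other trainer functions don't need them
        if trainer_fn != ToyTrainerFn.FITTING:
            return
        self.optimizers = list(self.lightning_module.configure_optimizers())

class ToyModule:
    def configure_optimizers(self):
        return ["sgd"]  # stand-in for real torch optimizers

strategy = ToyStrategy(ToyModule())
strategy.setup_optimizers(ToyTrainerFn.FITTING)
print(strategy.optimizers)  # ['sgd']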

@daniellepintz (Contributor, Author)

There are 12 failing tests remaining in the CI, and they are all passing locally for me so I am quite confused:

FAILED tests/accelerators/test_accelerator_connector.py::test_plugin_accelerator_choice[ddp_spawn-ddp_sharded]
FAILED tests/accelerators/test_accelerator_connector.py::test_plugin_accelerator_choice[None-ddp_sharded]
FAILED tests/accelerators/test_accelerator_connector.py::test_accelerator_choice_multi_node_gpu[1-ddp_sharded-DDPShardedPlugin]
FAILED tests/accelerators/test_accelerator_connector.py::test_accelerator_choice_multi_node_gpu[1-ddp_sharded_spawn-DDPSpawnShardedPlugin]
FAILED tests/accelerators/test_accelerator_connector.py::test_accelerator_choice_multi_node_gpu[2-ddp_sharded-DDPShardedPlugin]
FAILED tests/accelerators/test_accelerator_connector.py::test_accelerator_choice_multi_node_gpu[2-ddp_sharded_spawn-DDPSpawnShardedPlugin]
FAILED tests/plugins/test_cluster_integration.py::test_ranks_available_manual_plugin_selection[DDPShardedPlugin]
FAILED tests/plugins/test_cluster_integration.py::test_ranks_available_automatic_plugin_selection[trainer_kwargs1]
FAILED tests/plugins/test_plugins_registry.py::test_ddp_find_unused_parameters_training_type_plugins_registry[ddp_sharded_spawn_find_unused_parameters_false-DDPSpawnShardedPlugin]
FAILED tests/plugins/test_plugins_registry.py::test_ddp_find_unused_parameters_training_type_plugins_registry[ddp_sharded_find_unused_parameters_false-DDPShardedPlugin]
FAILED tests/trainer/test_trainer.py::test_trainer_config_strategy[trainer_kwargs27-expected27]
FAILED tests/trainer/test_trainer.py::test_trainer_config_strategy[trainer_kwargs28-expected28]

@ananthsub (Contributor) commented Dec 4, 2021

The CI is failing with this error:

pytorch_lightning.utilities.exceptions.MisconfigurationException: `DDPShardedPlugin` requires `fairscale` to be installed. Install it by running `pip install fairscale`.

Was fairscale removed as a dependency?

@daniellepintz (Contributor, Author)

No, I didn't change anything with fairscale. Maybe @awaelchli knows something, since he has been working with DDP recently?

@awaelchli (Contributor) commented Dec 4, 2021

Happy to help.

Take the example of this failing test: tests/trainer/test_trainer.py::test_trainer_config_strategy
You see the error in CI but not locally because:

  • the test initializes a Trainer with the sharded plugin
  • during Trainer init, there is a call to _call_callback_hooks("on_init_start")
  • that call accesses training_type_plugin.lightning_module
  • the sharded plugin cannot work if fairscale is not installed, hence the error
  • locally you have fairscale installed, so you don't see the error; the CI, however, does not install fairscale
  • even if the sharded plugin behaved differently, a LightningModule reference is not available during Trainer init anyway.

This all occurs because this PR replaced
self.on_init_start()
with
self._call_callback_hooks("on_init_start")
and that is no longer equivalent to what it was before. The same problem would occur if we replaced self.on_init_start() with self.call_hook("on_init_start") on master: on_init_start only exists for callbacks, not for the LightningModule.
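To make the failure mode concrete, here is a minimal, self-contained toy illustration (hypothetical names, not Lightning's actual classes) of why the old callback-only method was safe during init while a generic dispatcher that resolves the module is not:

class ToyCallback:
    def on_init_start(self, trainer):
        print("callback on_init_start")

class ToyTrainer:
    def __init__(self, callbacks):
        self.callbacks = callbacks

    @property
    def lightning_module(self):
        # mimics trainer.lightning_module -> training_type_plugin.lightning_module
        raise RuntimeError("plugin not usable here (e.g. fairscale missing)")

    def on_init_start(self):
        # old style: touches callbacks only, safe during __init__
        for cb in self.callbacks:
            cb.on_init_start(self)

    def _call_callback_hooks(self, hook_name):
        _ = self.lightning_module  # resolving the module raises during __init__
        for cb in self.callbacks:
            getattr(cb, hook_name)(self)

trainer = ToyTrainer([ToyCallback()])
trainer.on_init_start()  # works: prints "callback on_init_start"
try:
    trainer._call_callback_hooks("on_init_start")
except RuntimeError as err:
    print(err)  # plugin not usable here (e.g. fairscale missing)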

My suggestion for resolution:

@daniellepintz (Contributor, Author)

@awaelchli thank you SO much for your help! That makes total sense. Fixed!!

@daniellepintz enabled auto-merge (squash) December 4, 2021 05:24
@carmocca added this to the 1.6 milestone Dec 4, 2021
@carmocca added the hooks (Related to the hooks API) and refactor labels Dec 4, 2021
@carmocca (Contributor) left a comment

Epic!

Review thread on pytorch_lightning/trainer/trainer.py:
fn = getattr(self.accelerator, hook_name)
if not callable(fn):
return None

Contributor:

We discussed not setting the current_fx_name for accelerator but since we do it manually in all places that call this method, might as well put it inside.

@daniellepintz (Contributor, Author):

Yeah, I ended up having to change it because some tests were failing. Also, at least this way we don't use the LM's trainer reference in TTP.py.

@awaelchli (Contributor) commented Dec 6, 2021:

I have some mind-boggling problems in #10890. One question: why do these new call_xyz methods not reset the current_fx_name to its previous value? Was this forgotten or intentional?

@daniellepintz (Contributor, Author):

After discussing with @carmocca, I learned that only LightningModule hooks and callback hooks need to set current_fx_name, so I only included that in _call_lightning_module_hook and _call_callback_hooks, and not in _call_ttp_hook or _call_accelerator_hook. However, before this PR, all the places that called the accelerator hook had a line like this:
https://github.com/PyTorchLightning/pytorch-lightning/blob/3d6262b7a91215e72019d720e742e6261e1636dc/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py#L220

So I included self.lightning_module._current_fx_name = hook_name in _call_accelerator_hook and got rid of the individual lines like the one above.
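(A hedged sketch of the set-and-restore pattern under discussion, with toy standalone names rather than the code in this PR: set _current_fx_name for the duration of the hook call, then restore the previous value so nested hook calls don't clobber it.)

class ToyModule:
    _current_fx_name = None

    def on_train_start(self):
        print(f"running hook while _current_fx_name={self._current_fx_name}")

def call_lightning_module_hook(pl_module, hook_name, *args, **kwargs):
    fn = getattr(pl_module, hook_name, None)
    if not callable(fn):
        return None
    prev_fx_name = pl_module._current_fx_name
    pl_module._current_fx_name = hook_name
    try:
        return fn(*args, **kwargs)
    finally:
        # restoring to the previous value addresses the question raised above
        pl_module._current_fx_name = prev_fx_name

module = ToyModule()
call_lightning_module_hook(module, "on_train_start")
assert module._current_fx_name is None  # restored afterwards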

The mergify bot added the ready (PRs ready to be merged) label Dec 4, 2021
@carmocca disabled auto-merge December 4, 2021 13:03
The codecov bot commented Dec 4, 2021
Codecov Report

Merging #10575 (492fc62) into master (a28b4cd) will decrease coverage by 4%.
The diff coverage is 98%.

@@           Coverage Diff            @@
##           master   #10575    +/-   ##
========================================
- Coverage      92%      88%    -4%     
========================================
  Files         177      177            
  Lines       16553    16484    -69     
========================================
- Hits        15204    14522   -682     
- Misses       1349     1962   +613     

@daniellepintz enabled auto-merge (squash) December 4, 2021 21:39
@daniellepintz merged commit 6043179 into Lightning-AI:master Dec 4, 2021
@daniellepintz deleted the call_hook branch December 4, 2021 21:40
@daniellepintz (Contributor, Author)

Merged, but @carmocca, if you have any follow-ups I can add them in subsequent PRs.

carmocca added a commit that referenced this pull request Dec 6, 2021
@carmocca mentioned this pull request Dec 6, 2021
@carmocca (Contributor) commented Dec 6, 2021

Just opened #10957 with what I think was missing.

@awaelchli, for your PR, you'll have to set and unset it in _call_ttp_hook, as you've moved the calls from the accelerator to the TTP.

carmocca added a commit that referenced this pull request Dec 7, 2021 (Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>)
@daniellepintz mentioned this pull request Dec 7, 2021
@AndresAlgaba mentioned this pull request Sep 23, 2022
Labels: hooks (Related to the hooks API), ready (PRs ready to be merged), refactor
Development

Successfully merging this pull request may close these issues.

[RFC] Re-design call_hook interface
8 participants