
[draft] testing for moving precision plugin as property of TrainingTypePlugin #7805

Closed

Conversation

@shuyingsunshine21 (Contributor) commented Jun 2, 2021

What does this PR do?

Fixes #7324

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

Shuying Sun and others added 30 commits March 23, 2021 12:06
Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
…oint_consolidate

Update test_all_gather_grad.py
…1-checkpoint_consolidate"

This reverts commit c5053da, reversing
changes made to 0d23d75.
This reverts commit 70fe5da.
This reverts commit a9aae99.
@pep8speaks commented Jun 2, 2021

Hello @shuyingsunshine21! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-06-22 13:53:43 UTC

@justusschock (Member) left a comment

Some initial comments.

All in all, I am not sure if I really like this. I see the convenience of having this, but I also think it leads in the wrong direction (back to monolithic class structures) and again entangles a lot of the logic we want to keep separated from other parts.

This includes (but isn't limited to)

  • No more strong separation of the training and precision plugins (yes, I know they're still separate classes, but you don't need a precision plugin anymore; you can do everything in the training type plugin, and out of convenience and laziness people will do that).
  • Users have to care about both plugins even when only implementing the training type (at the very least they need to call the precision plugin's hooks).
  • The accelerator as an orchestration class becomes basically useless.

So in total this feels to me like we are heading back to fixed combinations of training and precision plugins (partly even incorporating the accelerator) with no clear responsibility anymore.
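Below is a minimal, purely illustrative sketch of the two orchestration styles under discussion; all class and method names are placeholders, not the actual Lightning API.

```python
class PrecisionPlugin:
    """Placeholder precision plugin."""

    def backward(self, loss):
        # e.g. scale the loss for AMP before the actual backward call
        return loss


class TrainingTypePlugin:
    """Placeholder training type plugin."""

    def reduce(self, loss):
        # e.g. all-reduce the loss across processes
        return loss


class OrchestratingAccelerator:
    """Current design: the accelerator calls both plugins explicitly."""

    def __init__(self, training_type, precision):
        self.training_type_plugin = training_type
        self.precision_plugin = precision

    def backward(self, loss):
        loss = self.precision_plugin.backward(loss)
        return self.training_type_plugin.reduce(loss)


class OwningTrainingTypePlugin(TrainingTypePlugin):
    """Proposed design: the training type plugin owns the precision plugin,
    so every custom training type must remember to call its hooks itself."""

    def __init__(self, precision):
        self.precision_plugin = precision

    def backward(self, loss):
        loss = self.precision_plugin.backward(loss)
        return self.reduce(loss)
```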

@@ -201,7 +199,7 @@ def training_step(
"""
step_kwargs = self.to_device(step_kwargs)

with self.precision_plugin.train_step_context(), self.training_type_plugin.train_step_context():
with self.training_type_plugin.precision_plugin.train_step_context(), self.training_type_plugin.train_step_context():
Member:

We basically have two different contexts here: One for forward + backward (current train_step_context) and one for forward only.

I think we could reduce it to just these two in this PR as well. From my perspective, there is no difference between the forward pass in training and the forward pass in val/test/predict, so we should be able to combine them.
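A small sketch of what the two remaining contexts could look like; the class and hook names are illustrative only, not the actual plugin interface.

```python
import contextlib


class SketchPrecisionPlugin:
    """Illustrative only: one context for plain forward passes and one for
    forward + backward, instead of a separate context per stage."""

    @contextlib.contextmanager
    def forward_context(self):
        # shared by training, validation, test and predict forward passes;
        # for native AMP this could wrap torch.cuda.amp.autocast()
        yield

    @contextlib.contextmanager
    def train_step_context(self):
        # forward + backward: reuse the forward context and layer any
        # backward-specific handling (e.g. grad scaling) on top
        with self.forward_context():
            yield
```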

training_type_plugin: TrainingTypePlugin,
) -> None:
"""
Args:
precision_plugin: the plugin to handle precision-specific parts
training_type_plugin: the plugin to handle different training routines
"""
self.precision_plugin = precision_plugin
self.precision_plugin = training_type_plugin.precision_plugin
Member:

Do we even need a fixed reference here then? I think replacing it with a property might be better, since that does not increase the reference count of the object.
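A minimal sketch of the property-based alternative suggested here (simplified, hypothetical constructor):

```python
class Accelerator:
    """Sketch: resolve the precision plugin through the training type plugin
    instead of storing a second reference on the accelerator."""

    def __init__(self, training_type_plugin):
        self.training_type_plugin = training_type_plugin

    @property
    def precision_plugin(self):
        # always delegates, so there is a single owning reference and the two
        # attributes can never get out of sync
        return self.training_type_plugin.precision_plugin
```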

self, optimizer: Optimizer, optimizer_idx: int, lambda_closure: Callable, **kwargs: Any
) -> None:
self.training_type_plugin.optimizer_step(optimizer, lambda_closure=lambda_closure, **kwargs)
self.training_type_plugin.optimizer_step(optimizer, opt_idx, lambda_closure, **kwargs)
Member:

The only downside here is that the training type plugin must make sure to call the precision plugin. This is something users have to take care of when implementing their own training type plugin.
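A rough sketch of that burden; the hook names (pre_optimizer_step / post_optimizer_step) and signatures are assumptions for illustration, not the exact plugin API.

```python
class CustomTrainingTypePlugin:
    """Hypothetical custom plugin: with the proposed design it has to invoke
    the precision plugin around the optimizer step itself."""

    def __init__(self, precision_plugin):
        self.precision_plugin = precision_plugin

    def optimizer_step(self, optimizer, opt_idx, lambda_closure, **kwargs):
        # forgetting either of these calls would silently break mixed precision
        self.precision_plugin.pre_optimizer_step(optimizer, opt_idx)
        optimizer.step(closure=lambda_closure, **kwargs)
        self.precision_plugin.post_optimizer_step(optimizer, opt_idx)
```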

@@ -28,7 +28,7 @@ def global_rank(self) -> int:
def world_size(self) -> int:
return self.num_nodes

def setup(self, model):
def setup_model(self, model):
Member:

I'd rather not rename these, since they aren't meant to set up only the model. They can actually set up arbitrary things.

@@ -54,9 +54,9 @@ def backward(
self,
model: 'pl.LightningModule',
closure_loss: Tensor,
should_accumulate: bool,
Member:

Not sure, but I think the order should be preserved here to avoid breaking backwards compatibility for positional args.

@@ -54,7 +67,7 @@ def _reinit_optimizers_with_oss(self):
optim_class = type(optimizer)
zero_optimizer = OSS(params=optimizer.param_groups, optim=optim_class, **optimizer.defaults)
if _FAIRSCALE_OSS_FP16_BROADCAST_AVAILABLE:
is_fp16 = self.lightning_module.trainer.precision == 16
is_fp16 = self.lightning_module.trainer.precision == "mixed"
Member:

Wasn't that properly handled before?

Contributor (Author):

Yeah, unfortunately not. I will separate this small fix out.
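For context, a defensive version of the check (illustrative only, not part of this PR) that accepts both representations the trainer's precision setting has used:

```python
def is_fp16(trainer) -> bool:
    # The precision attribute has appeared both as the integer 16 and as the
    # string "mixed" depending on how AMP is configured, so accept either
    # form rather than comparing against a single value.
    return getattr(trainer, "precision", 32) in (16, "16", "mixed")
```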

@@ -66,9 +66,8 @@ def model_to_device(self) -> None:

self._model.to(self.root_device)

def setup(self, model: torch.nn.Module) -> torch.nn.Module:
def setup_model(self, model: 'pl.LightningModule') -> None:
Member:

same as above

@@ -111,9 +111,8 @@ def pre_dispatch(self):
if self.debug:
os.environ["PT_XLA_DEBUG"] = str(1)

def setup(self, model: Module) -> Module:
def setup_model(self, model: 'pl.LightningModule') -> None:
Member:

same as above

@@ -55,7 +55,7 @@ def distributed_sampler_kwargs(self):
distributed_sampler_kwargs = dict(num_replicas=self.world_size, rank=self.global_rank)
return distributed_sampler_kwargs

def setup(self, model):
def setup_model(self, model):
Member:

same as above

return self._precision_plugin

@precision_plugin.setter
def precision_plugin(self, args: Dict[str, Any]) -> None:
Member:

TBH, I don't like this. This should live in the accelerator connector. You perform the usual checks there and check whether this class implements some other logic, but ideally these would be determined beforehand (maybe as a staticmethod, since you don't seem to really need the state for the selection).
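A sketch of the staticmethod-based selection suggested here; the class names and the selection logic are placeholders, not the real accelerator connector code.

```python
class PrecisionPlugin:
    """Placeholder stand-in for the real precision plugin classes."""


class NativeMixedPrecisionPlugin(PrecisionPlugin):
    pass


class SketchTrainingTypePlugin:
    """Illustrative only: expose the selection as a staticmethod so the
    accelerator connector can resolve the precision plugin up front,
    without instance state or a property setter."""

    @staticmethod
    def select_precision_plugin(precision, amp_type="native"):
        # a pure function of the requested settings
        if precision == 16 and amp_type == "native":
            return NativeMixedPrecisionPlugin()
        return PrecisionPlugin()
```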

@shuyingsunshine21 (Contributor, Author) commented:

Thanks @justusschock for the comments!

I am also wondering whether going back to one class like TrainingStrategyPlugin that controls everything is really a bad thing, given that interleaving calls across TrainingType and Precision handled by the accelerator might be problematic. That flow might cause some combinations of TrainingType and Precision to not work as expected.

No more strong separation of the training and precision plugins (yes, I know they're still separate classes, but you don't need a precision plugin anymore; you can do everything in the training type plugin, and out of convenience and laziness people will do that).

I also feel that the separation between the current training type plugin and precision plugin is a bit vague. One can actually do everything in the training type plugin and use a dummy precision plugin even with the current design. If we would like a clear separation, what would be a good interface to ensure that?

Users have to care about both plugins even when only implementing the training type (at the very least they need to call the precision plugin's hooks).

I am thinking this is actually intended, as it avoids implementing one of them without considering the logic of the other, which would cause unexpected behavior. With the current design, the user needs to care about this at the accelerator level, though.

@stale (bot) commented Jun 17, 2021

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. If you need further help see our docs: https://pytorch-lightning.readthedocs.io/en/latest/generated/CONTRIBUTING.html#pull-request or ask the assistance of a core contributor here or on Slack. Thank you for your contributions.

stale bot added the "won't fix" (This will not be worked on) label on Jun 17, 2021
@stale (bot) commented Jun 22, 2021

This pull request is going to be closed. Please feel free to reopen it or create a new one from the current master.

stale bot closed this on Jun 22, 2021
@kaushikb11 reopened this on Jun 22, 2021
stale bot removed the "won't fix" (This will not be worked on) label on Jun 22, 2021
@stale (bot) commented Jul 6, 2021

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. If you need further help see our docs: https://pytorch-lightning.readthedocs.io/en/latest/generated/CONTRIBUTING.html#pull-request or ask the assistance of a core contributor here or on Slack. Thank you for your contributions.

stale bot added the "won't fix" (This will not be worked on) label on Jul 6, 2021
@edenlightning removed the "won't fix" (This will not be worked on) label on Jul 8, 2021
@stale (bot) commented Jul 22, 2021

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. If you need further help see our docs: https://pytorch-lightning.readthedocs.io/en/latest/generated/CONTRIBUTING.html#pull-request or ask the assistance of a core contributor here or on Slack. Thank you for your contributions.

stale bot added the "won't fix" (This will not be worked on) label on Jul 22, 2021
@awaelchli added this to the v1.5 milestone on Jul 22, 2021
stale bot removed the "won't fix" (This will not be worked on) label on Jul 22, 2021
@awaelchli closed this on Nov 1, 2021
Development

Successfully merging this pull request may close these issues.

Precision Plugins should be part of Training Type Plugins
7 participants