
[RFC] Directly call TrainingTypePlugin APIs instead of going through the Accelerator #9426

Closed
ananthsub opened this issue Sep 10, 2021 · 8 comments · Fixed by #9901
Labels
design (Includes a design discussion), feature (Is an improvement or enhancement), help wanted (Open to be worked on), let's do it! (approved to implement), refactor

Comments

@ananthsub
Contributor

ananthsub commented Sep 10, 2021

Proposed refactoring or deprecation

Directly call TrainingTypePlugin APIs instead of going through the Accelerator wherever possible

@four4fish @awaelchli @justusschock @SeanNaren

Motivation

This carries forward the discussion from #9373 (comment)

Most of the Accelerator class today is a shell class that delegates calls to its attached TrainingTypePlugin. This creates an unnecessary level of indirection in many places. It also creates doubt as to whether custom accelerators should override these functions or not.

As most of the strategy around model distribution is embedded in the training type plugin, this is the hub where the following logic lives:

  • Rank information
  • Which ranks conduct IO for checkpoint saving/loading
  • Control/Ownership of the LightningModule
  • Collective communications

However, the accelerator is positioned as the gateway component the trainer interacts with for this functionality, so much of the training type plugin's logic is currently replicated on the accelerator. This creates an undesirable coupling (nearly doubling the APIs exposed). We could cut out this level of indirection by having the trainer call the training type plugin directly wherever applicable. This would shrink the accelerator interface and eventually allow the accelerator to live as a component inside the training type plugin, where it can manage the device logic as part of the overall parallelization strategy.
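
To illustrate the indirection, here is a minimal sketch (not the actual Lightning source; `barrier` stands in for the many delegated methods and properties):

```python
# Simplified sketch of the current pass-through pattern (illustrative only).
class TrainingTypePlugin:
    def barrier(self, name=None):
        ...  # collective synchronization actually lives here


class Accelerator:
    def __init__(self, training_type_plugin):
        self.training_type_plugin = training_type_plugin

    def barrier(self, name=None):
        # pure pass-through: no device-specific logic is added
        return self.training_type_plugin.barrier(name=name)


# Trainer code currently goes through the accelerator:
#     self.accelerator.barrier()
# The proposal is to call the plugin directly instead:
#     self.training_type_plugin.barrier()
```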

Pitch

Have the Trainer call the training type plugin APIs directly for these methods, then deprecate and remove the corresponding APIs from the accelerator. The affected blocks are linked below; a rough sketch of the cutover follows them.

https://github.com/PyTorchLightning/pytorch-lightning/blob/c963bf6568e6abd16615b7dfaa446b1f5c446793/pytorch_lightning/accelerators/accelerator.py#L63-L65

https://github.com/PyTorchLightning/pytorch-lightning/blob/c963bf6568e6abd16615b7dfaa446b1f5c446793/pytorch_lightning/accelerators/accelerator.py#L67-L74

https://github.com/PyTorchLightning/pytorch-lightning/blob/c963bf6568e6abd16615b7dfaa446b1f5c446793/pytorch_lightning/accelerators/accelerator.py#L86-L93

https://github.com/PyTorchLightning/pytorch-lightning/blob/c963bf6568e6abd16615b7dfaa446b1f5c446793/pytorch_lightning/accelerators/accelerator.py#L122-L153

https://github.com/PyTorchLightning/pytorch-lightning/blob/c963bf6568e6abd16615b7dfaa446b1f5c446793/pytorch_lightning/accelerators/accelerator.py#L181-L182

https://github.com/PyTorchLightning/pytorch-lightning/blob/c963bf6568e6abd16615b7dfaa446b1f5c446793/pytorch_lightning/accelerators/accelerator.py#L208-L231

https://github.com/PyTorchLightning/pytorch-lightning/blob/c963bf6568e6abd16615b7dfaa446b1f5c446793/pytorch_lightning/accelerators/accelerator.py#L334-L339

https://github.com/PyTorchLightning/pytorch-lightning/blob/c963bf6568e6abd16615b7dfaa446b1f5c446793/pytorch_lightning/accelerators/accelerator.py#L341-L477
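
As a rough sketch of how the cutover could look (the deprecation shim and its message are illustrative assumptions, not the final implementation; `barrier` is one example of the methods that would move):

```python
# Hypothetical deprecation shim kept temporarily so existing user code does not break.
from warnings import warn


class Accelerator:
    def __init__(self, training_type_plugin):
        self.training_type_plugin = training_type_plugin

    def barrier(self, name=None):
        # deprecated pass-through; points users to the plugin API
        warn(
            "`Accelerator.barrier` is deprecated; call "
            "`TrainingTypePlugin.barrier` directly instead.",
            DeprecationWarning,
        )
        return self.training_type_plugin.barrier(name=name)


# Trainer call sites switch from `self.accelerator.barrier()` to
# `self.training_type_plugin.barrier()`.
```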

Additional context



@ananthsub added the feature, help wanted, refactor and design labels on Sep 10, 2021
@justusschock
Member

I think we can do so wherever no device-specific logic is involved (which should be the case for all the properties/functions you mentioned). When creating this, I wanted to avoid users reaching too deeply into the framework internals before they were mature, but I think now we can enable this :)

@awaelchli
Contributor

that sounds good to me too.

once the accelerator has moved, we will still follow this pattern and keep the calls separate, right?

@ananthsub
Contributor Author

once the accelerator has moved, we will still follow this pattern and keep the calls separate, right?

Yup!

That said, I think it could make sense to hold off on this until #7324 is completed. That way the trainer will interact entirely with the training type plugin, and we can make the full API cutover at the same time (and we won't have to explain why some calls use trainer.accelerator while others use trainer.training_type_plugin).

@awaelchli
Contributor

awaelchli commented Sep 10, 2021

Maybe at this point we want to consider adding #7324 to the project planning. On the other hand, the effort is probably quite large and the feature freeze is coming up soon, so we might not be able to finish it in time for 1.5. Then again, we also probably don't want to keep the accelerator/plugins experimental forever, so it's better to start sooner rather than later. What do you think?

@ananthsub
Contributor Author

maybe at this point we want to consider adding #7324 to the project/planning.

Yes, I agree we should include it and that we should start as soon as possible. It's been around for a while and is blocking these other refactors. @awaelchli @justusschock @four4fish maybe we can set up some time early next week to go over how to proceed with it?

@daniellepintz
Contributor

@ananthsub mind adding me to the meeting as well? I'm interested in learning more about this.

@justusschock
Member

justusschock commented Sep 13, 2021

@ananthsub do you think it makes sense to start by introducing purely abstract classes (as discussed earlier) before we make such severe changes? I am a bit afraid that without a clearly defined interface, we may run into trouble later where we "forget" to update a portion of the plugin interfaces.
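
For example, something along these lines (a purely illustrative sketch; the interface name and method set are placeholders, not a concrete proposal):

```python
# Illustrative-only sketch of a purely abstract interface for the plugin.
from abc import ABC, abstractmethod
from typing import Any, Optional


class TrainingTypePluginInterface(ABC):
    """Spells out the contract the Trainer relies on, so that removing the
    Accelerator pass-throughs cannot silently drop part of the API."""

    @abstractmethod
    def barrier(self, name: Optional[str] = None) -> None:
        """Synchronize all processes."""

    @abstractmethod
    def broadcast(self, obj: Any, src: int = 0) -> Any:
        """Broadcast an object from the source rank to all other ranks."""
```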

@tchaton
Contributor

tchaton commented Sep 13, 2021

Hey @justusschock. As you originally designed the interface, mind leading the conversation there?

Also @ananthsub, we started refactoring the Accelerators to be fully independent components that can be used as their own distributed engines, similar to HF Accelerate. Should we re-ignite this instead?

Best,
T.C
