[RFC] Directly call TrainingTypePlugin APIs instead of going through the Accelerator #9426
Comments
I think we can do so wherever no device-specific logic is involved (which should be the case for all the properties/functions you mentioned). When creating this, I wanted to keep users from reaching too deeply into the framework internals before they were mature, but I think we can enable this now :)
That sounds good to me too. Once the accelerator has moved, we will still follow this pattern and keep the calls separate, right?
Yup! Though I think it could make sense to hold off on this until #7324 is completed. That way the trainer will interact entirely with the training type plugin, and we can make the full API cutover at the same time (and we don't have to balance/explain why some calls are using …)
Maybe at this point we want to consider adding #7324 to the project planning. On the other hand, the effort is probably quite large and the feature freeze is coming up soon, so we might not be able to finish it in time for 1.5. Then again, we also probably don't want to keep the accelerator/plugins experimental forever, so better to start sooner than later. What do you think?
Yes, I agree we should include it, and that we should start as soon as possible. It's been around for a while and is blocking these other refactors. @awaelchli @justusschock @four4fish maybe we can set up some time early next week to go over how we can proceed with it?
@ananthsub mind adding me to the meeting as well? I am interested in learning more about this.
@ananthsub do you think it makes sense to start introducing purely abstract classes (as discussed earlier) before we make such severe changes? I am a bit afraid that without a clearly defined interface, we may run into trouble later where we "forget" to update a portion of the plugin interfaces.
Hey @justusschock. As you originally designed the interface, mind leading the conversation there? Also @ananthsub, we have started refactoring the Accelerator into fully independent components that can be used as their own distributed engine, similar to HF accelerate. Should we re-ignite this instead? Best,
Proposed refactoring or deprecation
Directly call TrainingTypePlugin APIs instead of going through the Accelerator wherever possible
@four4fish @awaelchli @justusschock @SeanNaren
Motivation
This carries forward the discussion from #9373 (comment)
Most of the Accelerator class today is a shell class that delegates calls to its attached TrainingTypePlugin. This creates an unnecessary level of indirection in many places. It also creates doubt as to whether custom accelerators should override these functions or not.
As most of the strategy around model distribution is embedded in the training type plugin, the plugin is the hub where this logic lives (see the methods linked under the Pitch below).
However, the accelerator is positioned as the gateway component the trainer interacts with for this functionality. In turn, much of the training type plugin's logic is currently replicated on the accelerator. This creates undesirable coupling: we nearly double the number of exposed APIs. We could cut out this level of indirection by having the trainer call the training type plugin directly wherever applicable. This would shrink the accelerator interface and ultimately allow the accelerator to live as a component inside the training type plugin, where it can manage the device logic as part of the overall parallelization strategy.
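For context, here is a minimal sketch of the delegation pattern described above. It is simplified and abbreviated rather than copied from the source; method names such as `barrier`, `broadcast`, and `lightning_module` mirror the linked `accelerator.py`, but the signatures shown are illustrative only.

```python
class Accelerator:
    """Simplified sketch: most methods are pure pass-throughs to the plugin."""

    def __init__(self, training_type_plugin, precision_plugin):
        self.training_type_plugin = training_type_plugin
        self.precision_plugin = precision_plugin

    @property
    def lightning_module(self):
        # no device-specific logic, just forwarding
        return self.training_type_plugin.lightning_module

    def barrier(self, name=None):
        self.training_type_plugin.barrier(name=name)

    def broadcast(self, obj, src=0):
        return self.training_type_plugin.broadcast(obj, src)
```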
Pitch
Have the Trainer directly call the training type plugin APIs for these methods, then deprecate/remove the corresponding APIs from the accelerator. The affected code paths are linked below, followed by a sketch of the call-site change.
https://github.com/PyTorchLightning/pytorch-lightning/blob/c963bf6568e6abd16615b7dfaa446b1f5c446793/pytorch_lightning/accelerators/accelerator.py#L63-L65
https://github.com/PyTorchLightning/pytorch-lightning/blob/c963bf6568e6abd16615b7dfaa446b1f5c446793/pytorch_lightning/accelerators/accelerator.py#L67-L74
https://github.com/PyTorchLightning/pytorch-lightning/blob/c963bf6568e6abd16615b7dfaa446b1f5c446793/pytorch_lightning/accelerators/accelerator.py#L86-L93
https://github.com/PyTorchLightning/pytorch-lightning/blob/c963bf6568e6abd16615b7dfaa446b1f5c446793/pytorch_lightning/accelerators/accelerator.py#L122-L153
https://github.com/PyTorchLightning/pytorch-lightning/blob/c963bf6568e6abd16615b7dfaa446b1f5c446793/pytorch_lightning/accelerators/accelerator.py#L181-L182
https://github.com/PyTorchLightning/pytorch-lightning/blob/c963bf6568e6abd16615b7dfaa446b1f5c446793/pytorch_lightning/accelerators/accelerator.py#L208-L231
https://github.com/PyTorchLightning/pytorch-lightning/blob/c963bf6568e6abd16615b7dfaa446b1f5c446793/pytorch_lightning/accelerators/accelerator.py#L334-L339
https://github.com/PyTorchLightning/pytorch-lightning/blob/c963bf6568e6abd16615b7dfaa446b1f5c446793/pytorch_lightning/accelerators/accelerator.py#L341-L477
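As a hedged illustration only, the cutover at a Trainer call site could look roughly like the following. The `trainer.accelerator` and `trainer.training_type_plugin` accessors already exist; the specific methods and call sites would follow the permalinks above.

```python
# Assumes `trainer` is an initialized pytorch_lightning.Trainer and
# `result` is some picklable object produced during the run.

# Before: the Trainer goes through the Accelerator shim.
trainer.accelerator.barrier("setup")
result = trainer.accelerator.broadcast(result, src=0)

# After: the Trainer calls the TrainingTypePlugin directly, and the
# corresponding Accelerator wrappers are deprecated, then removed.
trainer.training_type_plugin.barrier("setup")
result = trainer.training_type_plugin.broadcast(result, src=0)
```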