
[RFC] Directly call TrainingTypePlugin APIs instead of going through the Accelerator #9426

Closed
ananthsub opened this issue Sep 10, 2021 · 8 comments · Fixed by #9901
Labels
design (Includes a design discussion), feature (Is an improvement or enhancement), help wanted (Open to be worked on), let's do it! (approved to implement), refactor

Comments

@ananthsub
Contributor

ananthsub commented Sep 10, 2021

Proposed refactoring or deprecation

Directly call TrainingTypePlugin APIs instead of going through the Accelerator wherever possible

@four4fish @awaelchli @justusschock @SeanNaren

Motivation

This carries forward the discussion from #9373 (comment)

Most of the Accelerator class today is a shell class that delegates calls to its attached TrainingTypePlugin. This creates an unnecessary level of indirection in many places. It also creates doubt as to whether custom accelerators should override these functions or not.

As most of the strategy around model distribution is embedded in the training type plugin, this is the hub where the following logic lives:

  • Rank information
  • Which ranks conduct IO for checkpoint saving/loading
  • Control/Ownership of the LightningModule
  • Collective communications

However, the accelerator is positioned as the gateway component the trainer interacts with for this functionality, so much of the training type plugin's logic is currently replicated on the accelerator. This creates an undesirable coupling (nearly doubling the APIs exposed). We could cut out this level of indirection by having the trainer call the training type plugin directly wherever applicable. This would shrink the accelerator interface and eventually allow the accelerator to live as a component inside the training type plugin, where it can manage the device logic as part of the overall parallelization strategy.
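
To illustrate the indirection, here is a minimal sketch (not the actual Lightning source; `barrier` stands in for the many delegated methods and properties):

```python
# Simplified sketch of the current pass-through pattern (illustrative only).
class TrainingTypePlugin:
    def barrier(self, name=None):
        ...  # collective synchronization actually lives here


class Accelerator:
    def __init__(self, training_type_plugin):
        self.training_type_plugin = training_type_plugin

    def barrier(self, name=None):
        # pure pass-through: no device-specific logic is added
        return self.training_type_plugin.barrier(name=name)


# Trainer code currently goes through the accelerator:
#     self.accelerator.barrier()
# The proposal is to call the plugin directly instead:
#     self.training_type_plugin.barrier()
```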

Pitch

Have the Trainer call the training type plugin APIs directly for these methods, then deprecate and remove the corresponding APIs from the accelerator. The affected blocks are linked below; a rough sketch of the cutover follows them.

https://github.com/PyTorchLightning/pytorch-lightning/blob/c963bf6568e6abd16615b7dfaa446b1f5c446793/pytorch_lightning/accelerators/accelerator.py#L63-L65

https://github.com/PyTorchLightning/pytorch-lightning/blob/c963bf6568e6abd16615b7dfaa446b1f5c446793/pytorch_lightning/accelerators/accelerator.py#L67-L74

https://github.com/PyTorchLightning/pytorch-lightning/blob/c963bf6568e6abd16615b7dfaa446b1f5c446793/pytorch_lightning/accelerators/accelerator.py#L86-L93

https://github.com/PyTorchLightning/pytorch-lightning/blob/c963bf6568e6abd16615b7dfaa446b1f5c446793/pytorch_lightning/accelerators/accelerator.py#L122-L153

https://github.com/PyTorchLightning/pytorch-lightning/blob/c963bf6568e6abd16615b7dfaa446b1f5c446793/pytorch_lightning/accelerators/accelerator.py#L181-L182

https://github.com/PyTorchLightning/pytorch-lightning/blob/c963bf6568e6abd16615b7dfaa446b1f5c446793/pytorch_lightning/accelerators/accelerator.py#L208-L231

https://github.com/PyTorchLightning/pytorch-lightning/blob/c963bf6568e6abd16615b7dfaa446b1f5c446793/pytorch_lightning/accelerators/accelerator.py#L334-L339

https://github.com/PyTorchLightning/pytorch-lightning/blob/c963bf6568e6abd16615b7dfaa446b1f5c446793/pytorch_lightning/accelerators/accelerator.py#L341-L477
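
As a rough sketch of how the cutover could look (the deprecation shim and its message are illustrative assumptions, not the final implementation; `barrier` is one example of the methods that would move):

```python
# Hypothetical deprecation shim kept temporarily so existing user code does not break.
from warnings import warn


class Accelerator:
    def __init__(self, training_type_plugin):
        self.training_type_plugin = training_type_plugin

    def barrier(self, name=None):
        # deprecated pass-through; points users to the plugin API
        warn(
            "`Accelerator.barrier` is deprecated; call "
            "`TrainingTypePlugin.barrier` directly instead.",
            DeprecationWarning,
        )
        return self.training_type_plugin.barrier(name=name)


# Trainer call sites switch from `self.accelerator.barrier()` to
# `self.training_type_plugin.barrier()`.
```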

Additional context



@ananthsub added the feature, help wanted, refactor and design labels on Sep 10, 2021
@justusschock
Member

I think we can do so wherever no device-specific logic is involved (which should be the case for all the properties/functions you mentioned). When creating this, I wanted to avoid users reaching too deeply into the framework internals before they were mature, but I think now we can enable this :)

@awaelchli
Contributor

that sounds good to me too.

once the accelerator has moved, we will still follow this pattern and keep the calls separate, right?

@ananthsub
Contributor Author

once the accelerator has moved, we will still follow this pattern and keep the calls separate, right?

Yup!

That said, I think it could make sense to hold off on this until #7324 is completed. That way the trainer will interact entirely with the training type plugin, and we can make the full API cutover at the same time (and we won't have to explain why some calls use trainer.accelerator while others use trainer.training_type_plugin).

@awaelchli
Contributor

awaelchli commented Sep 10, 2021

Maybe at this point we want to consider adding #7324 to the project planning. On the other hand, the effort is probably quite large and the feature freeze is coming up soon, so we might not be able to finish it in time for 1.5. Then again, we also probably don't want to keep the accelerator/plugins experimental forever, so it's better to start sooner rather than later. What do you think?

@ananthsub
Contributor Author

maybe at this point we want to consider adding #7324 to the project/planning.

Yes, I agree we should include it and that we should start as soon as possible. It's been around for a while and is blocking these other refactors. @awaelchli @justusschock @four4fish maybe we can set up some time early next week to go over how to proceed with it?

@daniellepintz
Contributor

@ananthsub mind adding me to the meeting as well? I'm interested in learning more about this.

@justusschock
Member

justusschock commented Sep 13, 2021

@ananthsub do you think it makes sense to start by introducing purely abstract classes (as discussed earlier) before we make such severe changes? I am a bit afraid that without a clearly defined interface, we may run into trouble later where we "forget" to update a portion of the plugin interfaces.
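
For example, something along these lines (a purely illustrative sketch; the interface name and method set are placeholders, not a concrete proposal):

```python
# Illustrative-only sketch of a purely abstract interface for the plugin.
from abc import ABC, abstractmethod
from typing import Any, Optional


class TrainingTypePluginInterface(ABC):
    """Spells out the contract the Trainer relies on, so that removing the
    Accelerator pass-throughs cannot silently drop part of the API."""

    @abstractmethod
    def barrier(self, name: Optional[str] = None) -> None:
        """Synchronize all processes."""

    @abstractmethod
    def broadcast(self, obj: Any, src: int = 0) -> Any:
        """Broadcast an object from the source rank to all other ranks."""
```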

@tchaton
Contributor

tchaton commented Sep 13, 2021

Hey @justusschock. As you originally designed the interface, mind leading the conversation there?

Also @ananthsub, we started refactoring the Accelerators to be fully independent components that can be used as their own distributed engines, similar to HF Accelerate. Should we re-ignite this instead?

Best,
T.C
