[RFC] Introduce strategy flag to Trainer #9053
Comments
FYI @four4fish |
Hi @kaushikb11, thanks for the proposal! It makes perfect sense. Using […]. However, I don't think I'd vote for […]. Edit: clarification |
@kaushikb11 - I really like the proposal. Regarding […]: will these be deprecated? I would strongly prefer we have one way to specify different devices for distributed training, and the […]. @yifuwang - what about […]? |
I think it's a very strong user-facing API change for the users to accept, unfortunately. Also, it's more intuitive for the user to set […]. But the […]. |
Initially, we had considered […]. |
As a user, this is confusing to me: with the new arguments, there are now double the ways to configure training. What is the framework's recommendation for what I do? I'd have to learn the precedence of accelerator vs […]. This optionality will also be reflected in more framework complexity. You know best how complex the accelerator connector class has become :/ . The bloat has all sorts of side effects on dev velocity too (harder to work in the framework -> need for more tests across options to cover all the possible inputs -> dealing with slow and/or flaky tests, etc.). All of this is a result of one more thing to remember: https://github.com/PyTorchLightning/pytorch-lightning/blob/master/.github/CONTRIBUTING.md#main-core-value-one-less-thing-to-remember |
@ananthsub, @kaushikb11 - sorry for the confusion. I wasn't suggesting to rename […]. IMHO distributed training algorithms like […]. I'd even argue that […]. Thus I don't think […]. |
Yes, this is the idea being formulated here: https://docs.google.com/document/d/1xHU7-iQSpp9KJTjI3As2EM0mfNHHr37WZYpDpwLkivA/edit#heading=h.vv5zw27fkkpe (@four4fish), where we flip the accelerator & training type, and make the training type (or whatever the new name is) own: […] |
Ah @yifuwang, I misunderstood, my apologies. I think including […]. @kaushikb11, what do you think of one of […]? |
Hey everyone, personally I am not a huge fan of parallel, and this is new terminology to adapt to. […]
Trainer(distributed_backend="deepspeed", accelerator="auto", num_devices=4, precision=16)
Best, |
IMO we should not use the term distributed, since this is not all distributed training (thinking of DP and single-device stuff). To me a term like […]. @ananthsub, regarding the deprecation of arguments like […]. |
Personally, I don't like strategy as it doesn't inform the users that this is meant to be extended for their own use case. Would […]? Best, |
I think […]. |
@tchaton […]
@justusschock with the encapsulation of collectives, distributed algorithms in Lightning are close to being device agnostic. IMO the algorithms are not strategies of device usage (as @ananthsub mentioned above, there's a plan to eliminate the […]). |
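As a purely illustrative sketch of that point (the names Collective, TorchDistributedCollective, SingleDeviceCollective and sync_gradients are made up for this example and are not Lightning classes), an algorithm written against a collectives interface no longer needs to know which device it runs on:
from abc import ABC, abstractmethod
import torch
import torch.distributed

class Collective(ABC):
    # Device-agnostic interface the distributed algorithm talks to.
    @abstractmethod
    def all_reduce(self, tensor: torch.Tensor) -> torch.Tensor: ...

class TorchDistributedCollective(Collective):
    # Multi-process CPU/GPU backend; assumes init_process_group was called.
    def all_reduce(self, tensor):
        torch.distributed.all_reduce(tensor)
        return tensor

class SingleDeviceCollective(Collective):
    # Single-device runs: nothing to reduce across processes.
    def all_reduce(self, tensor):
        return tensor

def sync_gradients(parameters, collective: Collective):
    # The "algorithm" only depends on the Collective, not on the hardware.
    for p in parameters:
        if p.grad is not None:
            collective.all_reduce(p.grad)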
@yifuwang Yes, that's true. That's why it's only the strategy for how to use the given devices (the device/hardware handling is implemented by the accelerator). Just the term […]. But please let's not use […]. |
Makes sense! What about […]? |
I think maybe we can draw inspiration from similar ideas in the Hadoop/Spark ecosystem. It sounds like we have two different concerns here: the responsibility of interfacing with hardware (CPUs, GPUs, TPUs, ...) and a separate, Hadoop-YARN-like concern of orchestration/resource management. I would suggest the former be called a HardwareConnector and the latter a ResourceManager or HardwareOrchestrator? Spark 3.x introduced GPU acceleration, so maybe there are good examples in their source code to check out. |
We (the Lightning team) had a discussion about this issue at the offsite. We have decided to go ahead with the strategy flag. |
A few folks mentioned that we are planning to eliminate the Accelerator -> TrainingTypePlugin delegation. With precision moved into the TTP and the TTP called directly instead of going through the Accelerator, the Accelerator and its subclasses could be the device_manager_plugin @justusschock mentioned above. I feel the flags will make more sense after the tasks above finish; we can have […]. More details here: https://docs.google.com/document/d/1E5t8auWf5DrNHzutvMmrJC_KqBqVyZuYX0thra69Ad8/edit#heading=h.nehmb825f40s Are we planning to change the flag in 1.5? |
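To illustrate roughly what that restructuring could look like (the class and attribute names here, such as DeviceManager and precision_plugin, are assumptions for this sketch, not the final design):
class DeviceManager:
    # Assumed name for the slimmed-down Accelerator: only hardware setup/teardown.
    def setup_device(self):
        ...  # e.g. select the CUDA device, initialise XLA, etc.

class TrainingTypePlugin:
    # The TTP owns the device manager and the precision plugin directly.
    def __init__(self, device_manager, precision_plugin):
        self.device_manager = device_manager      # device handling delegated downward
        self.precision_plugin = precision_plugin  # precision moved into the TTP

    def setup(self):
        # The Trainer calls the TTP directly; no Accelerator -> TTP delegation.
        self.device_manager.setup_device()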
Thanks @four4fish, I think that could be a follow-up decision outside this PR; there were a lot of ideas introduced during this PR, so it's important to focus on just one. As @kaushikb11 said, we spent some time going through the potential API choices and came to the conclusion that the […]:
# current API
Trainer(accelerator='ddp', gpus=2)
Trainer(plugins=DDPPlugin(), gpus=2)
# introduce strategy API
Trainer(strategy='ddp', accelerator='cpu', num_processes=2)
# introduce device API for agnostic devices
Trainer(strategy='ddp', accelerator='cpu', devices=2, num_nodes=2)
Trainer(strategy='ddp', accelerator='gpu', devices=2, num_nodes=2)
# introduce error handling
Trainer(strategy='ddp', accelerator='ipu', devices=2, num_nodes=2) # crashes as ddp not supported on IPUs
Trainer(strategy='ddp_find_unused_parameters_false', accelerator='ipu', devices=2, num_nodes=2) |
(The issue title was changed from "[RFC] Introduce accelerator_strategy flag to Trainer" to "[RFC] Introduce strategy flag to Trainer".)
What's left to do here? |
Closing this issue, will create a new issue to keep track of this with additional pointers. |
🚀 Feature
Motivation
The motivation is to have a separate accelerator_strategy flag to support passing training type aliases (ddp, ddp_spawn, etc.) and custom TrainingTypePlugin objects.
Background
At the moment, there's a single flag accelerator tied to both Accelerators and Training Type plugins. We wish to have them decoupled and would like to add a separate flag accelerator_strategy for Training Type plugins!
Alternate flags to set Training Types
accelerator: Optional[Union[str, Accelerator]] = None
distributed_backend: Optional[str] = None (deprecated, use accelerator instead)
plugins: Optional[Union[List[Union[Plugin, ClusterEnvironment, str]], Plugin, ClusterEnvironment, str]] = None
What's the difference between passing training type to accelerator, distributed_backend, or plugins?
accelerator and distributed_backend only support DistributedType aliases (ddp, ddp_spawn, etc.), whereas plugins also supports custom Training Types (DDPPlugin(), ddp_find_unused_parameters_false, etc.).
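Concretely, all of the following select DDP today (a small illustration of the current overlap; gpus=2 is an arbitrary choice for the example):
from pytorch_lightning import Trainer
from pytorch_lightning.plugins import DDPPlugin

Trainer(accelerator="ddp", gpus=2)
Trainer(distributed_backend="ddp", gpus=2)                        # deprecated alias of the line above
Trainer(plugins="ddp_find_unused_parameters_false", gpus=2)       # string alias of a custom training type
Trainer(plugins=DDPPlugin(find_unused_parameters=False), gpus=2)  # custom TrainingTypePlugin object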
Proposed Solution
Introduce a strategy flag to Trainer.
Exceptions:
Trainer(distributed_backend="ddp_cpu", strategy="ddp_spawn")
Trainer(accelerator="ddp", strategy="ddp_spawn")
Trainer(plugins="ddp_find_unused_parameters_false", strategy="ddp_spawn")
Deprecations: (Deprecated in v1.5 & will be removed in v1.6)
Passing training types to the accelerator flag
Passing training types to the plugins flag
Related PR: #8597
Related Issue: #6090
If you agree with this change, react with 🎉, if not then 🙅🏽 with comments.
Alternatives
Use the plugins argument, not the accelerator argument.
Use the strategy argument instead of accelerator_strategy.
Additional context
If you enjoy Lightning, check out our other projects! ⚡
Metrics: Machine learning metrics for distributed, scalable PyTorch applications.
Flash: The fastest way to get a Lightning baseline! A collection of tasks for fast prototyping, baselining, finetuning and solving problems with deep learning
Bolts: Pretrained SOTA Deep Learning models, callbacks, and more for research and production with PyTorch Lightning and PyTorch
Lightning Transformers: Flexible interface for high-performance research using SOTA Transformers leveraging PyTorch Lightning, Transformers, and Hydra.