Release 2.3.0 #19954
Merged
Conversation
github-actions bot added the fabric, pl, package, and data labels on Jun 6, 2024
awaelchli requested review from lantiga, Borda, tchaton, and justusschock as code owners on June 10, 2024 23:56
justusschock approved these changes on Jun 11, 2024
lantiga approved these changes on Jun 11, 2024
Looks good, would it make sense to add a mention of ModelParallelStrategy to the README?

Thanks for the suggestion. I'll look for a good spot to mention it.
Below is the draft for the release notes:
Lightning v2.3: Tensor Parallelism and 2D Parallelism
Lightning AI is excited to announce the release of Lightning 2.3 ⚡
Did you know? The Lightning philosophy extends beyond a boilerplate-free deep learning framework: We've been hard at work bringing you Lightning Studio. Code together, prototype, train, deploy, host AI web apps. All from your browser, with zero setup.
This release introduces experimental support for Tensor Parallelism and 2D Parallelism, PyTorch 2.3 support, and several bugfixes and stability improvements.
Highlights
Tensor Parallelism (beta)
Tensor parallelism (TP) is a technique that splits up the computation of selected layers across GPUs to save memory and speed up distributed models. To enable TP as well as other forms of parallelism, we introduce a ModelParallelStrategy for both Lightning Trainer and Fabric. Under the hood, TP is enabled through new experimental PyTorch APIs like DTensor and torch.distributed.tensor.parallel.

PyTorch Lightning

Enabling TP in a model with PyTorch Lightning requires you to implement the LightningModule.configure_model() method, where you convert selected layers of the model to parallelized layers. This is an advanced feature because it requires a deep understanding of the model architecture. Open the tutorial Studio to learn the basics of Tensor Parallelism.

Full training example (requires at least 2 GPUs).
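Below is a minimal sketch of what configure_model() can look like. It assumes a toy feed-forward module with three linear layers (the layer names, sizes, and optimizer are illustrative) and that the strategy exposes the device mesh on the module as self.device_mesh; see the linked example and tutorial Studio for the authoritative version.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import lightning as L
from lightning.pytorch.strategies import ModelParallelStrategy
from torch.distributed.tensor.parallel import ColwiseParallel, RowwiseParallel, parallelize_module


class FeedForward(nn.Module):
    # Toy feed-forward block; the layer names below are what the TP plan refers to
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))


class LitModel(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.model = FeedForward(8192, 8192)

    def configure_model(self):
        # The strategy sets up the device mesh; convert the selected layers
        # to parallelized layers with PyTorch's tensor-parallel APIs.
        tp_mesh = self.device_mesh["tensor_parallel"]
        plan = {
            "w1": ColwiseParallel(),
            "w2": RowwiseParallel(),
            "w3": ColwiseParallel(),
        }
        parallelize_module(self.model, tp_mesh, plan)

    def training_step(self, batch):
        return self.model(batch).sum()

    def configure_optimizers(self):
        return torch.optim.AdamW(self.model.parameters(), lr=3e-4)


trainer = L.Trainer(accelerator="cuda", devices=2, strategy=ModelParallelStrategy())
# trainer.fit(LitModel(), train_dataloaders=...)  # supply your own DataLoader
```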
Lightning Fabric
Applying TP in a model with Fabric requires you to implement a special function where you convert selected layers of the model to parallelized layers. This is an advanced feature because it requires a deep understanding of the model architecture. Open the tutorial Studio to learn the basics of Tensor Parallelism.
Full training example (requires at least 2 GPUs).
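A corresponding minimal sketch for Fabric, assuming ModelParallelStrategy accepts a parallelize_fn callback that receives the model and the device mesh (the toy module and layer names mirror the Trainer sketch above); see the linked example for the authoritative version.

```python
import lightning as L
import torch.nn as nn
import torch.nn.functional as F
from lightning.fabric.strategies import ModelParallelStrategy
from torch.distributed.tensor.parallel import ColwiseParallel, RowwiseParallel, parallelize_module


class FeedForward(nn.Module):
    # Same toy feed-forward block as in the Trainer sketch above
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))


def parallelize_feed_forward(model, device_mesh):
    # Convert the selected layers to tensor-parallel layers over the TP mesh dimension
    tp_mesh = device_mesh["tensor_parallel"]
    plan = {"w1": ColwiseParallel(), "w2": RowwiseParallel(), "w3": ColwiseParallel()}
    parallelize_module(model, tp_mesh, plan)
    return model


strategy = ModelParallelStrategy(parallelize_fn=parallelize_feed_forward)
fabric = L.Fabric(accelerator="cuda", devices=2, strategy=strategy)
fabric.launch()

model = fabric.setup(FeedForward(8192, 8192))  # parallelize_fn is applied during setup
```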
2D Parallelism (beta)
Tensor Parallelism by itself can be very effective for efficient inference of very large models. For training, TP is typically combined with other forms of parallelism, such as FSDP, to increase throughput and scalability on large clusters with 100s of GPUs. The new ModelParallelStrategy in this release supports the combination of TP + FSDP, which is referred to as 2D parallelism.

For an introduction to this feature, please also refer to the tutorial Studios (PyTorch Lightning, Lightning Fabric). At the moment, the PyTorch team is reimplementing FSDP under the name FSDP2 with the aim to make it compose well with other parallelisms such as TP. Therefore, for the experimental 2D parallelism support, you'll need to switch to using FSDP2 with the new ModelParallelStrategy. Please refer to our docs (PyTorch Lightning, Lightning Fabric) and stay tuned for future releases as these APIs mature.
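A minimal 2D sketch with Fabric, under the assumption that ModelParallelStrategy takes data_parallel_size/tensor_parallel_size arguments and exposes "data_parallel" and "tensor_parallel" mesh dimensions, and using PyTorch's experimental fully_shard (FSDP2) API; treat it as illustrative and refer to the docs for the authoritative version.

```python
import lightning as L
from lightning.fabric.strategies import ModelParallelStrategy
from torch.distributed._composable.fsdp import fully_shard
from torch.distributed.tensor.parallel import ColwiseParallel, RowwiseParallel, parallelize_module


def parallelize_2d(model, device_mesh):
    # 1) Tensor-parallelize the selected layers across the "tensor_parallel" mesh dimension
    tp_mesh = device_mesh["tensor_parallel"]
    plan = {"w1": ColwiseParallel(), "w2": RowwiseParallel(), "w3": ColwiseParallel()}
    parallelize_module(model, tp_mesh, plan)

    # 2) Shard the tensor-parallel model with FSDP2 across the "data_parallel" mesh dimension
    dp_mesh = device_mesh["data_parallel"]
    fully_shard(model, mesh=dp_mesh)
    return model


# 2 (data parallel) x 2 (tensor parallel) = 4 GPUs
strategy = ModelParallelStrategy(
    parallelize_fn=parallelize_2d,
    data_parallel_size=2,
    tensor_parallel_size=2,
)
fabric = L.Fabric(accelerator="cuda", devices=4, strategy=strategy)
fabric.launch()
```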
Training Mode in Model Summary

The model summary table that gets displayed when you run Trainer.fit() now contains a new column "Mode" that shows the training mode each layer is in (#19468). A module in PyTorch is always either in train (the default) or eval mode. This improvement should give users more visibility into the state of their model and help debug issues, for example when you need to make sure certain layers of the model are frozen.
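For reference, the mode is PyTorch's standard per-module train/eval flag; a quick illustration (the layers are arbitrary):

```python
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 4), nn.BatchNorm1d(4))
print(model.training)     # True: modules are in train mode by default
model[1].eval()           # put only the BatchNorm layer into eval mode
print(model[0].training)  # True  -> shown as "train" in the summary's Mode column
print(model[1].training)  # False -> shown as "eval"
```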
Special Forward Methods in Fabric
Until now, Lightning Fabric warned the user in case the forward pass of the model or a subset of its modules was conducted through methods other than the dedicated forward method of the PyTorch module. The reason for this is that PyTorch needs to run special hooks in case of DDP/FSDP and other strategies to function properly, and not running through the real forward method would skip these hooks and lead to correctness issues.

In Lightning Fabric 2.3, we added a feature to explicitly mark alternative forward methods so that Fabric can add the necessary rerouting behind the scenes:
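A minimal sketch of the idea, assuming the marking is done via mark_forward_method() on the Fabric-wrapped module (the model and its generate method are illustrative; the linked docs contain the full example).

```python
import torch
import lightning as L


class MyModel(torch.nn.Module):
    def forward(self, x):
        return x + 1

    def generate(self, x):
        # A special-purpose method that internally runs forward-like logic
        return self.forward(x) * 2


fabric = L.Fabric(devices=1)
fabric.launch()
model = fabric.setup(MyModel())

# Tell Fabric that `generate` is an alternative forward method so that calls to it
# are rerouted through the strategy's hooks (important for DDP/FSDP correctness).
model.mark_forward_method("generate")
output = model.generate(torch.ones(3))
```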
Find the full example and more details in our docs.
Notable Changes
The 2.0 series of Lightning releases guarantees core API stability: no name changes, argument renaming, hook removals, etc. on core interfaces (Trainer, LightningModule, etc.) unless a feature is specifically marked experimental. Here we list a few behavioral changes made in places where the change was justified because it significantly improves the user experience, improves performance, or fixes the correctness of a feature. These changes will likely not impact most users.
Skipping the training step in DDP
It is no longer allowed to skip training_step() by returning None in distributed training (#19918). The following usage was previously possible but would result in unpredictable hangs and timeouts in distributed training:
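A sketch of the now-disallowed pattern (the loss computation and skip condition are illustrative):

```python
import lightning as L


class MyModel(L.LightningModule):
    def training_step(self, batch, batch_idx):
        loss = self.compute_loss(batch)  # hypothetical loss computation
        if loss > 100:
            # Returning None skips the optimization step for this batch.
            # In distributed training, ranks can disagree on whether to skip,
            # which desynchronizes collective calls and causes hangs and timeouts.
            return None
        return loss
```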
We decided to raise an error if the user attempts to return None when running in a multi-GPU setting.

Miscellaneous Changes
The prepare_data() hook in LightningModule and LightningDataModule is now subject to a barrier without timeout to prevent long-running tasks from being interrupted (#19448).

CHANGELOG
PyTorch Lightning
Added
Changed
Deprecated
Removed
Fixed
Lightning Fabric
Added
Changed
Removed
Fixed
Full commit list: 2.2.0 -> 2.3.0
Contributors
We thank all our contributors who submitted pull requests for features, bug fixes and documentation updates.
New Contributors
TODO
cc @Borda @carmocca @justusschock @awaelchli