
protect tensor parallel usage #34800

Merged
merged 1 commit into main from protect-tp on Nov 19, 2024
Conversation

ArthurZucker (Collaborator)

What does this PR do?

Fixes #34795

@LysandreJik (Member) left a comment

🔥

@ArthurZucker merged commit dadb286 into main on Nov 19, 2024
23 of 25 checks passed
@ArthurZucker deleted the protect-tp branch on November 19, 2024 at 08:54
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@@ -5005,6 +5006,8 @@ def tensor_parallel(self, device_mesh):
device_mesh (`torch.distributed.DeviceMesh`):
The device mesh to use for tensor parallelism.
"""
if not is_torch_greater_or_equal_than_2_4:
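
For context, a hedged sketch of the guard this hunk introduces: the flag name is taken from the diff, while the class name and the exact error message are illustrative (the message presumably asks for torch 2.5, which is the mismatch the next comment points out).

from packaging import version
import torch

# Illustrative stand-in for the internal flag referenced in the hunk above.
is_torch_greater_or_equal_than_2_4 = version.parse(torch.__version__) >= version.parse("2.4")


class MySketchModel:  # hypothetical class, used only to host the method
    def tensor_parallel(self, device_mesh):
        """Shard the model across `device_mesh` (sketch, body elided)."""
        if not is_torch_greater_or_equal_than_2_4:
            # Fail fast with a clear error instead of crashing later on a
            # missing torch.distributed.tensor API.
            raise EnvironmentError(
                "tensor parallel is only supported for `torch>=2.5`."
            )
        ...  # actual sharding logic lives here in the real method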
A Contributor left a comment:
Hi @ArthurZucker - this mismatch between the torch 2.4 check and the 2.5 requirement means that torch 2.4 still hits this issue (while torch 2.3 now works properly).

@kwen2501 (Contributor) left a comment

Thanks for the fix!

Comment on lines +42 to +47
if is_torch_greater_or_equal_than_2_4:
from torch.distributed.tensor import Replicate
from torch.distributed.tensor.parallel import (
ColwiseParallel,
RowwiseParallel,
)
@kwen2501 (Contributor) commented on these lines:
Actually, Replicate is only in torch.distributed.tensor for torch >= 2.5.
Between 2.0 and 2.4 (inclusive), it is in torch.distributed._tensor.

Thus it seems there are two options:
Option 1:

try:
    from torch.distributed.tensor import Replicate
except ImportError:
    from torch.distributed._tensor import Replicate

Option 2:
bump the requirement to 2.5.

ColwiseParallel and RowwiseParallel have existed since 2.0.
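
For illustration, a sketch of Option 1 dropped into the imports quoted above; only the fallback module path and the "available since 2.0" note come from this comment, the rest of the layout is illustrative.

try:
    # torch >= 2.5 exposes Replicate in the public namespace.
    from torch.distributed.tensor import Replicate
except ImportError:
    # torch 2.0-2.4 only ship it under the private _tensor module.
    from torch.distributed._tensor import Replicate

# ColwiseParallel and RowwiseParallel have been importable from the public
# parallel module since torch 2.0, so they need no fallback.
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
)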

A Contributor replied:

Hi @kwen2501 - no preference from my side. I have a PR that does the first option here: #34816

But happy to abandon it or switch in favor of any PR you have that adds support.

loadams added a commit to microsoft/DeepSpeed that referenced this pull request Nov 22, 2024
Reverts #6759

Requires from transformers: 
huggingface/transformers#34816
huggingface/transformers#34800

Todo:
- [x] Need to merge first PR to get support for torch 2.4
BernardZach pushed a commit to BernardZach/transformers that referenced this pull request Dec 5, 2024
BernardZach pushed a commit to innovationcore/transformers that referenced this pull request Dec 6, 2024
Successfully merging this pull request may close these issues:

Errors when using transformers with torch<2.5