Issue when using a dataloader on MULTI_GPU #1400
Comments
I can't reproduce on my side, as the prompt yielded in the batches makes Accelerate fail: since this is an iterable dataset, …
Thank you for your response. However, I am quite surprised; maybe it has to do with my version of Accelerate. Even when I replaced …, I get the same behavior:

```
prepared dataloader
batch size = 0
step = 0
input ids = torch.Size([0, 3])
attention mask = tensor([], device='cuda:2', size=(0, 3), dtype=torch.int64)
index prompt = tensor([], device='cuda:2', dtype=torch.int8)
An error occured
prepared dataloader
step = 0
input ids = torch.Size([1, 3])
attention mask = tensor([[1, 1, 1]], device='cuda:0')
index prompt = tensor([0], device='cuda:0', dtype=torch.int8)
decoding : I am happy
prepared dataloader
step = 0
input ids = torch.Size([1, 3])
attention mask = tensor([[1, 1, 1]], device='cuda:1')
index prompt = tensor([0], device='cuda:1', dtype=torch.int8)
decoding : I am happy
prepared dataloader
batch size = 0
step = 0
input ids = torch.Size([0, 3])
attention mask = tensor([], device='cuda:3', size=(0, 3), dtype=torch.int64)
index prompt = tensor([], device='cuda:3', dtype=torch.int8)
An error occured
```

The error also remains when I remove …
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
Information
Tasks
`no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
Reproduction
I am using Accelerate to speed up inference on a set of prompts. I sequentially receive batches of prompts, which I turn into a torch `IterableDataset`. From that dataset I build a dataloader, which I prepare along with the model I am using. I then try to run inference by iterating over the prepared dataloader. I expect Accelerate to dispatch my prompts across all my GPUs, but I run into an issue.
The problem arises with code along the lines of the sketch below.
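A minimal sketch of this kind of setup, not necessarily the original script: the `gpt2` checkpoint, the `PromptDataset` class, and the exact print statements are illustrative assumptions.

```python
import torch
from torch.utils.data import DataLoader, IterableDataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from accelerate import Accelerator


class PromptDataset(IterableDataset):
    """Wraps an already-tokenized batch of prompts as an iterable dataset."""

    def __init__(self, encodings):
        self.encodings = encodings

    def __iter__(self):
        for i in range(self.encodings["input_ids"].shape[0]):
            yield {
                "input_ids": self.encodings["input_ids"][i],
                "attention_mask": self.encodings["attention_mask"][i],
                "index_prompt": torch.tensor(i, dtype=torch.int8),
            }


accelerator = Accelerator()

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder checkpoint, not the one from the report
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompts = ["I am happy"]  # a single prompt, while the machine has 4 GPUs
encodings = tokenizer(prompts, return_tensors="pt", padding=True)

dataloader = DataLoader(PromptDataset(encodings), batch_size=1)
model, dataloader = accelerator.prepare(model, dataloader)

for step, batch in enumerate(dataloader):
    print("prepared dataloader")
    print(f"step = {step}")
    print(f"input ids = {batch['input_ids'].shape}")
    print(f"attention mask = {batch['attention_mask']}")
    print(f"index prompt = {batch['index_prompt']}")
    try:
        print("decoding :", tokenizer.decode(batch["input_ids"][0]))
    except Exception:
        print("An error occured")  # spelling kept as in the reported logs
```

Launched with `accelerate launch` on a multi-GPU machine, each process prints its own view of the batch, similar to the per-device logs quoted above.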
Here my dataset has just one sentence (`I am happy`), but I have 4 GPUs. Knowing how Accelerate works, I expect each GPU to have a duplicate of the sentence on it. However, this is not the case; this is what I get in the error logs. As you can probably notice, `cuda:0` and `cuda:1` have the sentence, as expected (`input_ids` is a non-empty tensor and its decoding yields the original prompt `I am happy`). However, `cuda:2` and `cuda:3` have empty tensors (you can notice the `An error occured` message, which comes from the `try`/`except` that catches the error raised when trying to decode an empty `input_ids` tensor). I think the problem may come from how the last batch is handled when the content of the dataloader is dispatched in Accelerate.
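This is not a fix for the dispatch behaviour itself, but a defensive guard on the consuming side can at least avoid the exception; a sketch, assuming `dataloader` and `tokenizer` as in the snippet above:

```python
for step, batch in enumerate(dataloader):
    # On processes that received an empty shard, input_ids has shape [0, seq_len];
    # skip those instead of letting the decode raise an IndexError.
    if batch["input_ids"].shape[0] == 0:
        continue
    print("decoding :", tokenizer.decode(batch["input_ids"][0]))
```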
Expected behavior
I expected the tensors `input_ids` and `attention_mask` (corresponding to the sentence 'I am happy') to be duplicated across all my devices: not just `cuda:0` and `cuda:1`, but also `cuda:2` and `cuda:3`.
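One way to make the mismatch explicit, sketched under the same assumptions as above (`accelerator` and `dataloader` as in the earlier snippet), is to gather the per-process batch sizes so that the main process can print what every device actually received:

```python
import torch

for batch in dataloader:
    # Each process reports how many prompts it received for this step.
    local_size = torch.tensor([batch["input_ids"].shape[0]], device=accelerator.device)
    all_sizes = accelerator.gather(local_size)
    if accelerator.is_main_process:
        # Duplication across 4 GPUs would show [1, 1, 1, 1]; the report instead
        # observes two devices with empty tensors, i.e. something like [1, 1, 0, 0].
        print(f"per-process batch sizes: {all_sizes.tolist()}")
```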