Trainer get_train_dataloader creates wrong batch size when using IterableDataset and multi-gpu training on single machine #21444
Comments
Sounds like the … As an aside, using DataParallel is not the recommended way to run on multiple GPUs in PyTorch; you should launch your training script with …
Thanks, Sylvain. I issued the pull request. It's my first time doing so, so I hope I did it OK!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Reopening for FSDP use case
…but the streaming dataset still fetches more items than per_device_train_batch_size.
@edwardpwtsoi My bad - I thought the above was a merged PR. Regardless, it would be useful to have a new issue with specifics about the FSDP case.
System Info
@sgugger
I'm not sure if I'm missing something here or not. But I am doing masked language modeling with RobertaForMaskedLM, working in PyTorch on an AWS machine with 8 V100s. I set args.per_device_train_batch_size=32. If I train with a regular Dataset object, the data loader produces a big batch of 32 * 8 = 256 examples each time, which is then split up and sent to each GPU in batches of 32, as expected. But if I switch to an IterableDataset, the DataLoader produces batches of 32, which get split into batches of 4 sent to each GPU.
This happens because of this code in Trainer.get_train_dataloader. If we have an IterableDataset, we end up creating a DataLoader based on per_device_train_batch_size (which is 32). But if we have any other type of dataset, we create a DataLoader with self._train_batch_size (which is 256). I confess I don't know what the first if self.args.world_size > 1 block is supposed to be doing, but it doesn't get executed in my situation (running on a single machine with multiple GPUs).
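For reference, a simplified sketch of the branching described above (a paraphrase of the reported behavior, not the actual transformers source; assuming 8 GPUs and per_device_train_batch_size=32):

```python
# Minimal sketch of the reported branching (not the actual transformers code).
from torch.utils.data import DataLoader, IterableDataset


def build_train_dataloader(train_dataset, per_device_train_batch_size, n_gpu, collate_fn=None):
    # Total batch the data loader should produce before DataParallel splits it
    # across GPUs: 32 * 8 = 256 in the setup described above.
    train_batch_size = per_device_train_batch_size * max(n_gpu, 1)

    if isinstance(train_dataset, IterableDataset):
        # Reported behavior: this branch uses the per-device size (32), so
        # DataParallel later splits each batch into chunks of 32 / 8 = 4.
        return DataLoader(
            train_dataset,
            batch_size=per_device_train_batch_size,
            collate_fn=collate_fn,
        )

    # Map-style branch: uses the full batch (256), which DataParallel splits
    # into the expected per-GPU batches of 32.
    return DataLoader(
        train_dataset,
        batch_size=train_batch_size,
        collate_fn=collate_fn,
    )
```

The symptom suggests the fix is simply to hand the full train_batch_size to the DataLoader in the iterable branch as well.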
Am I doing something wrong, or is this a bug?
Thanks,
Andy
Who can help?
No response
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)
Reproduction
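A minimal sketch along the lines of the setup described above; the toy streaming dataset, the roberta-base checkpoint, and the sizes are illustrative assumptions rather than the reporter's actual script:

```python
# Toy reproduction sketch: an IterableDataset fed to Trainer for MLM.
from torch.utils.data import IterableDataset
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    RobertaForMaskedLM,
    Trainer,
    TrainingArguments,
)


class ToyStream(IterableDataset):
    """Yields tokenized examples one at a time, like a streaming corpus."""

    def __init__(self, tokenizer, n_examples=10_000):
        self.tokenizer = tokenizer
        self.n_examples = n_examples

    def __iter__(self):
        for i in range(self.n_examples):
            yield self.tokenizer(f"example sentence number {i}", truncation=True)


tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=32,  # with 8 GPUs, expect 32 examples per GPU
    max_steps=10,  # required since an IterableDataset has no length
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=ToyStream(tokenizer),
    data_collator=DataCollatorForLanguageModeling(tokenizer),
)

# Per the report, with an IterableDataset this prints the per-device 32 rather
# than 32 * n_gpu (256 on the 8-GPU machine).
print(trainer.get_train_dataloader().batch_size)
```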
Expected behavior
The train batch size should be the same whether using a regular Dataset or an IterableDataset.
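Continuing from the sketch above, this expectation amounts to roughly the following check (assuming trainer is the object built in the Reproduction sketch and that TrainingArguments.train_batch_size is the per-device size scaled by the number of GPUs):

```python
# Expected: the training dataloader's batch size equals args.train_batch_size
# (per_device_train_batch_size * number of GPUs) for Dataset and IterableDataset alike.
assert trainer.get_train_dataloader().batch_size == trainer.args.train_batch_size
```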