
Trainer get_train_dataloader creates wrong batch size when using IterableDataset and multi-gpu training on single machine #21444

Closed
agossard opened this issue Feb 3, 2023 · 5 comments


agossard commented Feb 3, 2023

System Info

@sgugger

I'm not sure if I'm missing something here or not, but I am doing masked language modeling with RobertaForMaskedLM in PyTorch on an AWS machine with 8 V100s. I set args.per_device_train_batch_size=32. If I train with a regular Dataset object, the data loader produces a big batch of 32 * 8 = 256 examples each time, which is then split up and sent to each GPU in batches of 32, as expected. But if I switch to an IterableDataset, the DataLoader produces batches of 32, which get split into batches of 4 sent to each GPU.

This happens because of the code below from Trainer.get_train_dataloader. If we have an IterableDataset, we end up creating a DataLoader based on per_device_train_batch_size (which is 32). But for any other type of dataset, we create a DataLoader with self._train_batch_size (which is 256). I confess I don't know what the first if self.args.world_size > 1 block is supposed to be doing, but it doesn't get executed in my situation (running on a single machine with multiple GPUs).
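
As far as I can tell, the difference comes down to how the two sizes are derived. A rough sketch of the arithmetic on my setup (not the exact Trainer source; my understanding is that TrainingArguments.train_batch_size multiplies the per-device size by the number of visible GPUs, and self._train_batch_size is initialized from it):

    per_device_train_batch_size = 32
    n_gpu = 8  # single machine, 8 V100s, DataParallel

    # Effective "full" batch size used by the non-iterable branch below:
    _train_batch_size = per_device_train_batch_size * max(1, n_gpu)  # 256

    # Regular Dataset: the DataLoader yields batches of 256, which DataParallel
    # splits into 8 chunks of 32 per GPU (expected).
    # IterableDataset: the DataLoader yields batches of 32, which DataParallel
    # splits into 8 chunks of 4 per GPU (what I'm seeing).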

Am I doing something wrong, or is this a bug?

Thanks,
Andy

    if isinstance(train_dataset, torch.utils.data.IterableDataset):
        if self.args.world_size > 1:
            train_dataset = IterableDatasetShard(
                train_dataset,
                batch_size=self._train_batch_size,
                drop_last=self.args.dataloader_drop_last,
                num_processes=self.args.world_size,
                process_index=self.args.process_index,
            )

        return DataLoader(
            train_dataset,
batch_size=self.args.per_device_train_batch_size,  # <-- per-device size (32 here)
            collate_fn=data_collator,
            num_workers=self.args.dataloader_num_workers,
            pin_memory=self.args.dataloader_pin_memory,
        )

    train_sampler = self._get_train_sampler()

    return DataLoader(
        train_dataset,
batch_size=self._train_batch_size,  # <-- full batch size (256 here)
        sampler=train_sampler,
        collate_fn=data_collator,
        drop_last=self.args.dataloader_drop_last,
        num_workers=self.args.dataloader_num_workers,
        pin_memory=self.args.dataloader_pin_memory,
        worker_init_fn=seed_worker,
    )

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Use a PyTorch model on a single machine with multiple GPUs
  2. Set TrainingArguments.per_device_train_batch_size=32
  3. Create a regular dataset in memory from a pandas data frame (or whatever)
  4. Put a breakpoint (or debugging statement) in the forward pass of the model to print out inputs.shape -> verify that the first dimension is 32
  5. Now create an IterableDataset and run again
  6. See that inputs.shape now has a first dimension of 4 (a minimal sketch of these steps follows below)
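
Putting these steps together, a minimal sketch of the reproduction (the model is randomly initialized, the dataset text and output_dir are placeholders, and the observed split of 4 assumes 8 GPUs under DataParallel):

    import torch
    from torch.utils.data import IterableDataset
    from transformers import (
        DataCollatorForLanguageModeling,
        RobertaConfig,
        RobertaForMaskedLM,
        RobertaTokenizerFast,
        Trainer,
        TrainingArguments,
    )

    tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")

    class StreamingTextDataset(IterableDataset):
        """Yields tokenized examples one at a time, like a streaming corpus."""
        def __iter__(self):
            for _ in range(10_000):
                yield tokenizer("some example sentence", truncation=True,
                                padding="max_length", max_length=64)

    class ShapeLoggingRoberta(RobertaForMaskedLM):
        """Same model, but prints the batch size seen by each forward pass."""
        def forward(self, input_ids=None, **kwargs):
            print("per-GPU batch size:", input_ids.shape[0])
            return super().forward(input_ids=input_ids, **kwargs)

    model = ShapeLoggingRoberta(RobertaConfig(vocab_size=tokenizer.vocab_size))

    args = TrainingArguments(
        output_dir="out",
        per_device_train_batch_size=32,
        max_steps=2,
    )

    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=StreamingTextDataset(),
        data_collator=DataCollatorForLanguageModeling(tokenizer),
    )
    trainer.train()
    # With 8 GPUs and an IterableDataset this prints 4; with a regular in-memory
    # Dataset of the same examples it prints 32.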

Expected behavior

The train batch size should be the same whether using a regular Dataset or an IterableDataset.


sgugger commented Feb 3, 2023

Sounds like the self.args.per_device_train_batch_size should be self._train_batch_size indeed. Do you want to open a PR?
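
Concretely, the suggested change (a sketch against the snippet quoted above) would make the IterableDataset branch use the full batch size as well:

    return DataLoader(
        train_dataset,
        batch_size=self._train_batch_size,  # was self.args.per_device_train_batch_size
        collate_fn=data_collator,
        num_workers=self.args.dataloader_num_workers,
        pin_memory=self.args.dataloader_pin_memory,
    )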

As an aside, using DataParallel is not the recommended way to run on multiple GPUs with PyTorch; you should launch your training script with torchrun.
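
For reference, a typical single-node multi-GPU launch with torchrun looks like this (the script name and GPU count are placeholders); each spawned process then sees one GPU, so per_device_train_batch_size applies per process:

    torchrun --nproc_per_node=8 run_mlm.py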


agossard commented Feb 3, 2023

Thanks, Sylvain. I issued the pull request. My first time doing so, so I hope I did it OK!


github-actions bot commented Mar 6, 2023

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@edwardpwtsoi

Reopening for the FSDP use case.
IMO, per_device_train_batch_size should be used in the FSDP case, as the machines should be treated as a single device. Let me know if my understanding is wrong. I tried to test whether a patch could fix the issue:

    @property
    def train_batch_size(self) -> int:
        """
        The actual batch size for training (may differ from `per_gpu_train_batch_size` in distributed training).
        """
        return self.per_device_train_batch_size

but the streaming dataset still fetches more items than per_device_train_batch_size.
Does anyone have insight into a possible fix for this use case?


amyeroberts commented Jul 10, 2024

@edwardpwtsoi Could you open a new issue if it's not covered by the fix in huggingface/datasets#5506? This helps us better track what has and hasn't been resolved.

My bad - I thought the above was a merged PR. Regardless, it would be useful to have a new issue with specifics about the FSDP case.
