Fix the bug where `DataLoaderDispatcher` gets stuck in an infinite wait when the dataset is an `IterDataPipe` during multi-process training #1709

Conversation
The documentation is not available anymore as the PR was closed or merged.
```python
# We can safely pass because the default is -1
with suppress(Exception):
    length = getattr(self.dataset, "total_dataset_length", len(self.dataset))
    self.remainder = length % self.total_batch_size
```
Is there a reason this has been moved into `__iter__` and not `__init__`?
I didn't go through the rest of this repo to figure out how `remainder` is used. However, since `self.reset()` will always set it to -1, it makes more sense (looking at `data_loader.py`) to follow `DataLoaderShard` and set the remainder in `__iter__`. Otherwise, it should be safe to remove this code completely.
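Roughly, the interaction I mean looks like this (a minimal sketch of the pattern, not the actual `accelerate` source; `reset()` here is assumed to behave as described above):

```python
from contextlib import suppress

class DataLoaderDispatcher:
    def __init__(self, dataset, total_batch_size):
        self.dataset = dataset
        self.total_batch_size = total_batch_size
        self.reset()  # assumption: reset() runs here too and wipes remainder

    def reset(self):
        # Re-initializes per-epoch iteration state.
        self.remainder = -1

    def __iter__(self):
        self.reset()  # any remainder computed in __init__ is lost here...
        # ...so, as in DataLoaderShard, it has to be (re)computed after the reset:
        with suppress(Exception):
            length = getattr(self.dataset, "total_dataset_length", len(self.dataset))
            self.remainder = length % self.total_batch_size
        yield from self.dataset  # stand-in for the real dispatch loop
```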
It's used for gradient accumulation. Here it makes sense to have it in `__init__`, as there's no need to calculate it twice.
Agreed
Well, I might be missing something, but in `__iter__`, `self.reset()` will set `self.remainder` to -1. So `self.remainder` won't be useful once we start iterating through the dataloader/dataset. Is that correct?
Looking more closely, it is also updated at the end of the iteration loop. @muellerzr do we actually need those lines?
While we investigate, I agree that it's safer to just copy this after the reset.
Hmmm... will look into this today.
This looks great, thanks! Left one comment on moving a chunk of code.
Thanks for your PR! Let's just keep the remainder logic in the init as mentioned by @muellerzr and we should be good to go.
```python
# We can safely pass because the default is -1
with suppress(Exception):
    length = getattr(self.dataset, "total_dataset_length", len(self.dataset))
    self.remainder = length % self.total_batch_size
```
Agreed
Fix the bug where `DataLoaderDispatcher` gets stuck in an infinite wait when the dataset is an `IterDataPipe` during multi-process training.

In newer versions of PyTorch, the iterator of `DataLoader` will try to broadcast a shared seed across all distributed processes (see here). In the current implementation of `DataLoaderDispatcher`, the iterator is only created in the main process. This causes the training to hang when the dataset is an `IterDataPipe`.

One can try out the script below to see the effect.
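The original script is not included in this excerpt; what follows is a minimal sketch of the kind of reproduction the description refers to. It assumes torchdata's `IterableWrapper` as the `IterDataPipe` and accelerate's `dispatch_batches=True` (which selects `DataLoaderDispatcher`), and is meant to be launched with multiple processes, e.g. `accelerate launch --num_processes 2 repro.py`.

```python
# Hypothetical reproduction sketch (the PR's actual script is not shown here).
from torch.utils.data import DataLoader
from torchdata.datapipes.iter import IterableWrapper
from accelerate import Accelerator

def main():
    # dispatch_batches=True makes accelerate wrap the loader in DataLoaderDispatcher.
    accelerator = Accelerator(dispatch_batches=True)
    pipe = IterableWrapper(range(16))  # an IterDataPipe dataset
    loader = accelerator.prepare(DataLoader(pipe, batch_size=4))
    # Before the fix, this loop hangs: only the main process creates the inner
    # DataLoader iterator, which blocks broadcasting a shared seed that the
    # other ranks never participate in.
    for batch in loader:
        accelerator.print(batch)

if __name__ == "__main__":
    main()
```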