Bug in caching dataset with existing file error in distributed computing #421

Closed
Pale-Blue-Dot-97 opened this issue Jan 23, 2024 · 1 comment · Fixed by #535
Labels
bug Something isn't working

Comments

@Pale-Blue-Dot-97
Owner

Describe the bug
In a distributed computing setup, caching a dataset fails with an existing-file error: the dataset is cached to a file that already exists.

To Reproduce
Steps to reproduce the behavior:

  1. Use a distributed computing setup
  2. Ensure cache==True for make_dataset
  3. See error

Expected behavior
No error. If the dataset is not already cached, it should be created then cached under the unique hash. If it exists, the hash should be recognised and the dataset loaded.
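
The expected load-or-create behaviour could be sketched as below. This is a minimal illustration, not the project's implementation: `cache_path`, `load_or_create`, the JSON-based hash and the pickle storage format are all assumptions made for the example.

```python
import hashlib
import json
import pickle
from pathlib import Path


def cache_path(cache_dir: Path, params: dict) -> Path:
    """Derive a cache file path from a hash of the dataset parameters (hypothetical scheme)."""
    digest = hashlib.sha256(json.dumps(params, sort_keys=True).encode()).hexdigest()
    return cache_dir / f"{digest}.pkl"


def load_or_create(cache_dir: Path, params: dict, build):
    """Load the dataset if its hash is already cached, otherwise build and cache it."""
    path = cache_path(cache_dir, params)
    if path.exists():
        # Hash recognised: reuse the cached dataset.
        with path.open("rb") as f:
            return pickle.load(f)
    # Hash not seen before: build the dataset, then cache it under the hash.
    dataset = build(params)
    cache_dir.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(dataset, f)
    return dataset
```

In a single process this is safe. The check and the write are separate steps, though, so several processes running it at once can all pass the `exists()` check before any of them has written the file.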

Environment (please complete the following information):

  • OS: Ubuntu
  • Version: 0.27.0
  • Python Version: 3.11
  • Packages: pytorch==2.1.2

Pale-Blue-Dot-97 added the bug label on Jan 23, 2024
Pale-Blue-Dot-97 self-assigned this on Jan 23, 2024
Pale-Blue-Dot-97 changed the title from "Tries to cache dataset to existing file in distributed computing" to "Bug in caching dataset with existing file error in distributed computing" on Jan 23, 2024

@Pale-Blue-Dot-97
Owner Author

It appears that every process in the distributed process group sees that the requested dataset does not exist, so each one independently tries to create and cache it. Because they all write to the same location, a conflict arises when the slower processes try to cache to a file that now exists.
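
The race can be replayed deterministically in a single process. The sketch below assumes the cache writer refuses to overwrite an existing file, modelled here with Python's exclusive-create mode `"x"`. The project's actual write path may differ, and `simulate_race` is a hypothetical name.

```python
import tempfile
from pathlib import Path


def simulate_race(path: Path) -> list[str]:
    """Replay the check-then-write sequence of two ranks in the problematic order."""
    outcomes = []
    # Both ranks check before either has written, so both see a missing cache.
    missing = {rank: not path.exists() for rank in (0, 1)}
    for rank in (0, 1):
        if missing[rank]:
            try:
                # Mode "x" raises FileExistsError if the file already exists.
                with path.open("x") as f:
                    f.write(f"dataset written by rank {rank}")
                outcomes.append(f"rank {rank}: cached")
            except FileExistsError:
                outcomes.append(f"rank {rank}: FileExistsError")
    return outcomes


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(simulate_race(Path(tmp) / "dataset.cache"))
```

Rank 0 writes the file first, so rank 1, acting on its stale check, fails with `FileExistsError`.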

The solution is to ensure that only process 0 attempts to create the dataset. All other processes should wait until process 0 has finished and then load the dataset from the new cache.
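
A minimal sketch of this rank-0-first pattern follows. It is not the change made in the linked fix: `get_or_create_cached`, `build`, `save` and `load` are hypothetical names, and the rank and barrier are passed in so the snippet runs without a process group. In a PyTorch job they would typically be `torch.distributed.get_rank()` and `torch.distributed.barrier`.

```python
from pathlib import Path
from typing import Any, Callable


def get_or_create_cached(
    path: Path,
    rank: int,
    barrier: Callable[[], None],
    build: Callable[[], Any],
    save: Callable[[Any, Path], None],
    load: Callable[[Path], Any],
) -> Any:
    """Let rank 0 create the cache; every other rank waits, then loads it."""
    dataset = None
    if rank == 0:
        if path.exists():
            dataset = load(path)
        else:
            dataset = build()
            save(dataset, path)
    # Every rank, including rank 0, must reach the barrier or the job deadlocks.
    barrier()
    if rank != 0:
        # The cache is guaranteed to exist once the barrier has been passed.
        dataset = load(path)
    return dataset
```

Two caveats apply to this sketch. If rank 0 raises while building, the other ranks will block at the barrier until it times out. On multi-node jobs the cache location also has to be on storage that every node can read.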
