Bug in caching dataset with existing file error in distributed computing #421

Pale-Blue-Dot-97 · 2024-01-23T10:38:15Z

Describe the bug
With caching enabled for make_dataset in a distributed computing setup, caching the dataset fails with an existing-file error.

To Reproduce
Steps to reproduce the behavior:

1. Use a distributed computing setup
2. Ensure cache==True for make_dataset
3. See error

Expected behavior
No error. If the dataset is not already cached, it should be created then cached under the unique hash. If it exists, the hash should be recognised and the dataset loaded.
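To make the expected behaviour concrete, here is a minimal sketch of how a cache file could be keyed on a hash of the dataset parameters. The function name cache_path_for, the use of JSON plus SHA-256, and the .pkl extension are all assumptions for illustration; the issue does not say how the unique hash is actually computed.

```python
import hashlib
import json
import os


def cache_path_for(params, cache_dir):
    """Hypothetical helper: map dataset parameters to a cache file path."""
    # Serialise with sorted keys so equal parameter dicts give the same hash
    # regardless of key order.
    encoded = json.dumps(params, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(encoded).hexdigest()
    return os.path.join(cache_dir, digest + ".pkl")
```

Under this scheme, a dataset exists in the cache exactly when the file at the returned path exists, so the same parameters always resolve to the same cached file.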

Environment (please complete the following information):

OS: Ubuntu
Version: 0.27.0
Python Version: 3.11
Packages: pytorch==2.1.2


Pale-Blue-Dot-97 · 2024-01-23T10:44:02Z

What appears to be happening is that each process in a distributed process group sees that the requested dataset does not exist, so they all independently try to create and cache it. Because every process writes to the same location, a conflict arises when the slower processes try to write to a cache that now exists.

The solution is to ensure that only process 0 attempts to create the dataset. All other processes should then wait until 0 is finished, then they can load the dataset from the new cache.
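The approach described above could look roughly like the following sketch, which uses torch.distributed.barrier to hold the other ranks until rank 0 has written the cache. This is not the code from the actual fix: the helper name load_or_create_cached, the create_fn callback, and the pickle-based cache format are assumptions made for illustration. It also assumes the cache path is on storage visible to every process.

```python
import os
import pickle
import tempfile

try:
    import torch.distributed as dist
except ImportError:  # torch not installed: behave as a single process
    dist = None


def _is_distributed():
    # True only inside an initialised process group.
    return dist is not None and dist.is_available() and dist.is_initialized()


def load_or_create_cached(cache_path, create_fn):
    """Hypothetical helper: rank 0 creates the cache, all ranks then load it."""
    distributed = _is_distributed()
    rank = dist.get_rank() if distributed else 0

    if rank == 0 and not os.path.exists(cache_path):
        dataset = create_fn()
        # Write to a temporary file and rename it into place, so a partially
        # written cache file is never visible under the final name.
        directory = os.path.dirname(os.path.abspath(cache_path))
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "wb") as f:
            pickle.dump(dataset, f)
        os.replace(tmp_path, cache_path)

    if distributed:
        # Every rank blocks here until rank 0 has finished writing.
        dist.barrier()

    with open(cache_path, "rb") as f:
        return pickle.load(f)
```

One caveat with this pattern: if rank 0 raises before reaching the barrier, the other ranks will wait indefinitely, so a real implementation would likely want error handling or a timeout around the creation step.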

Pale-Blue-Dot-97 added the bug label Jan 23, 2024

Pale-Blue-Dot-97 self-assigned this Jan 23, 2024

Pale-Blue-Dot-97 changed the title ~~Tries to cache dataset to existing file in distributed computing~~ Bug in caching dataset with existing file error in distributed computing Jan 23, 2024

Pale-Blue-Dot-97 mentioned this issue Jan 23, 2024

421 bug in caching dataset #422

Merged

Pale-Blue-Dot-97 added a commit that referenced this issue Feb 6, 2024

Fixed #421 using dist.barrier

87bbc4e

Pale-Blue-Dot-97 added a commit that referenced this issue Feb 6, 2024

Fixed #421 using dist.barrier

fc61c31

Pale-Blue-Dot-97 added a commit that referenced this issue Feb 6, 2024

Fixed #421 using dist.barrier

34e8f92

Pale-Blue-Dot-97 mentioned this issue Sep 15, 2024

Minerva v0.28.0 #535

Merged

Pale-Blue-Dot-97 closed this as completed in 637bbce Sep 17, 2024

Pale-Blue-Dot-97 closed this as completed in #535 Sep 17, 2024
