Timeout while waiting for prepare_data to finish #19266

Closed
kfoynt opened this issue Jan 10, 2024 · 2 comments · Fixed by #19448
Labels
data handling Generic data-related topic docs Documentation related feature Is an improvement or enhancement

Comments

@kfoynt

kfoynt commented Jan 10, 2024

📚 Documentation

I initialize two processes. The first process is the one that creates the data inside prepare_data.
The second process waits for the first process.

The problem is that when I generate a large dataset it takes more than 1800 seconds, which is the default DDP timeout.
So my script kept freezing.

I think that the timeout should be emphasized in the documentation when one uses prepare_data in the lightning data module.

Currently, the documentation says that prepare_data is called by only one process, but it does not mention that the other processes wait for that process to finish, or that this wait is subject to a timeout. It also does not say when the other processes start counting the time: at the beginning of prepare_data or at the end? It seems to be at the beginning, which is what caused my script to time out.
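
To make the timing concrete, here is a rough standard-library simulation of what I think is happening. It is only an analogy, not Lightning's or PyTorch's actual code: two threads stand in for the two processes, `threading.Barrier` stands in for the distributed barrier, and the names `run_ranks`, `prepare_seconds` and `barrier_timeout` are made up for this sketch. The waiting side starts its clock as soon as it reaches the barrier, so the whole duration of the data preparation counts against the timeout.

```python
import threading
import time


def run_ranks(prepare_seconds, barrier_timeout):
    """Simulate two ranks: rank 0 prepares data, rank 1 waits at a barrier.

    Hypothetical helper for illustration only; threads stand in for processes.
    """
    barrier = threading.Barrier(2)
    outcome = {}

    def rank0():
        # Stand-in for the work done inside prepare_data.
        time.sleep(prepare_seconds)
        try:
            barrier.wait()
            outcome[0] = "ok"
        except threading.BrokenBarrierError:
            # The other side already gave up, so the barrier is broken.
            outcome[0] = "barrier broken"

    def rank1():
        try:
            # The clock starts here, while rank 0 is still preparing data.
            barrier.wait(timeout=barrier_timeout)
            outcome[1] = "ok"
        except threading.BrokenBarrierError:
            outcome[1] = "timed out"

    threads = [threading.Thread(target=rank0), threading.Thread(target=rank1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcome


if __name__ == "__main__":
    # Preparation finishes well inside the timeout: both sides pass the barrier.
    print(run_ranks(prepare_seconds=0.05, barrier_timeout=1.0))
    # Preparation outlasts the timeout: the waiting side gives up first.
    print(run_ranks(prepare_seconds=0.5, barrier_timeout=0.1))
```

In the second call the waiting thread fails even though the preparing thread would have finished on its own, which matches what I saw once data generation went past 1800 seconds.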

The timeout error is also not very informative: it just says that it timed out. Since the documentation does not describe exactly how prepare_data works, the error is difficult to interpret.

I guess that the developer assumed that prepare_data won't take that long, but when one deals with very large data, almost certainly it will need more than 1800 seconds.

cc @Borda @justusschock @awaelchli

@kfoynt kfoynt added docs Documentation related needs triage Waiting to be triaged by maintainers labels Jan 10, 2024
@awaelchli
Contributor

@kfoynt Yes, we are aware of that. Unfortunately, PyTorch doesn't allow one to configure or disable the timeout for a single collective (in this case, the barrier), so this is a pretty strict limitation. Maybe we can get away with creating a new process subgroup just for these barriers and setting its timeout to a very high number.

@awaelchli awaelchli added feature Is an improvement or enhancement data handling Generic data-related topic and removed needs triage Waiting to be triaged by maintainers labels Jan 11, 2024
@kfoynt
Author

kfoynt commented Jan 11, 2024

I think that even a warning in the documentation will do. It would have saved me a day of debugging for sure.
