GPUCompiler.reset_runtime() race condition #2168

Open
simonbyrne opened this issue Nov 14, 2023 · 4 comments
Labels
bug Something isn't working

Comments

@simonbyrne
Contributor

Describe the bug

If multiple processes attempt to precompile CUDA.jl concurrently, they may each call GPUCompiler.reset_runtime(), which, due to a bug in Julia 1.9.3 and earlier, can trigger a race condition in the recursive rm. See the log here:
https://buildkite.com/clima/climaatmos-ci/builds/14843#018bc9e6-4dcd-4a53-87e4-60467b240fda/162-168

The Julia bug is fixed in JuliaLang/julia#50842, but this has not made it to a release version yet.

(I'm trying to figure out why they were getting recompiled, but @vchuravy suggested I open an issue for this).

To reproduce

Honestly, I can't figure out a way to reproduce it, but I have seen it several times.

@simonbyrne simonbyrne added the bug Something isn't working label Nov 14, 2023
@maleadt
Member

maleadt commented Nov 14, 2023

Ah, good catch. What do you suggest as workaround or fix? Using a pidlock seems heavyweight; I wonder if we could first do an atomic rename of the directory before wiping it.
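
A minimal sketch of the rename-before-wipe idea, for illustration only. The helper name `wipe_dir_via_rename` and the `.stale.` suffix are hypothetical and not part of GPUCompiler. It assumes the renamed directory stays on the same filesystem as the original, so that `mv` reduces to a plain rename; Julia's `mv` may fall back to a copy when a rename is not possible, in which case the step is no longer atomic.

```julia
# Hypothetical helper, not GPUCompiler's actual code: wipe a directory by first
# renaming it to a unique sibling, so that concurrent processes never run a
# recursive rm over the same tree.
function wipe_dir_via_rename(dir::AbstractString)
    isdir(dir) || return false
    # A unique sibling name keeps the rename inside the same parent directory.
    base = rstrip(dir, ['/', '\\'])
    doomed = string(base, ".stale.", getpid(), ".", rand(UInt32))
    try
        mv(dir, doomed)
    catch
        # If `dir` is gone, another process renamed or removed it first,
        # so there is nothing left for this process to wipe.
        isdir(dir) || return false
        rethrow()
    end
    # Only this process knows the new name, so the recursive rm is uncontended.
    rm(doomed; recursive=true, force=true)
    return true
end
```

Calling `wipe_dir_via_rename(path)` returns `true` if this process did the wipe and `false` if the directory was already gone. Whether this is enough for `reset_runtime()` depends on how the directory is repopulated afterwards, which this sketch does not cover.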

@simonbyrne
Contributor Author

An atomic rename before and after writing seems like the easiest fix?

@simonbyrne
Contributor Author

Any idea why we might be getting the compilecache call? Or any suggestions to try? (I can't reliably reproduce it, unfortunately.)

@simonbyrne
Contributor Author

I think this might be related to segfaults I'm seeing on loading of CUDA_Runtime_jll:
https://buildkite.com/clima/climacore-ci/builds/2842#018bd01c-5d70-4661-bc81-5cf000add4a3/161-422
