CUDA __syncthreads() malfunctioning with -O2 optimization or higher (clang 14.0.6 and 15.0.3) #58626
This may be the same issue as these: It should already be fixed in HEAD.
stonea added a commit to chapel-lang/chapel that referenced this issue on Jan 24, 2023
Alt approach for GPU threadblock barrier sync

This PR makes two changes:
- Changes how we generate threadblock barrier sync calls
- Starts gathering performance data for SHOC sort benchmark

To give some context: as discussed here (Cray/chapel-private#4179), with our Chapel implementation of the SHOC sort benchmark we were running into an issue where we'd succeed when not compiling with `--fast` and fail when using `--fast`. I was able to narrow this down to a small reproducer example (seen here Cray/chapel-private#4179 (comment)), which looks incredibly similar to this example on an LLVM bug report: llvm/llvm-project#58626

In that bug report the author shows they can work around it by using inline assembly (marked volatile) to generate the sync call instead. This might be fixed in later versions of clang. I don't know. This seems like a reasonable workaround in the interim.

[Reviewed by @e-kayrakli]
riftEmber pushed a commit to chapel-lang/chapel that referenced this issue on Jan 30, 2023
I use CUDA 11.3 and an NVIDIA A100, hence the architecture sm_80 is specified during compilation.

`__syncthreads()` has stopped working for me when compiling CUDA code since I upgraded from Clang 12 to Clang 14.0.6. The problem persists in Clang 15.0.3.

I have a short reproducer. It throws runtime asserts when compiled with `-O2` and above.

As a workaround, when I call the below inline assembly function `barSync()` in place of `__syncthreads()`, the code starts working again.
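The listing for `barSync()` is not preserved in this copy of the report. As a rough illustration only, a workaround of this kind can be sketched as below, assuming it emits the PTX `bar.sync 0` instruction through volatile inline assembly. The body of `barSync()`, the `reverseBlock` kernel, and the host-side check are all assumptions made for this sketch, not the author's reproducer.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Assumed shape of the workaround: issue the block-wide barrier directly in
// PTX. `volatile` and the "memory" clobber ask the compiler not to drop the
// statement or move memory accesses across it.
__device__ __forceinline__ void barSync() {
    asm volatile("bar.sync 0;" ::: "memory");
}

// Hypothetical kernel (not the issue's reproducer): reverse one block of
// integers through shared memory, which needs a barrier between the
// shared-memory write and the shared-memory read.
__global__ void reverseBlock(int *data) {
    __shared__ int tile[256];
    int t = threadIdx.x;
    tile[t] = data[t];
    barSync();  // used in place of __syncthreads()
    data[t] = tile[blockDim.x - 1 - t];
}

int main() {
    const int n = 256;
    int host[n];
    for (int i = 0; i < n; ++i) host[i] = i;

    int *dev = nullptr;
    cudaMalloc(&dev, n * sizeof(int));
    cudaMemcpy(dev, host, n * sizeof(int), cudaMemcpyHostToDevice);
    reverseBlock<<<1, n>>>(dev);
    cudaMemcpy(host, dev, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    // Each element should now hold the value from the mirrored position.
    for (int i = 0; i < n; ++i) assert(host[i] == n - 1 - i);
    printf("barrier check passed\n");
    return 0;
}
```

Building this at `-O2` for sm_80 (for example with clang's `--cuda-gpu-arch=sm_80`) once with `barSync()` and once with `__syncthreads()` gives a like-for-like comparison, though this simple kernel is not guaranteed to trigger the miscompilation described here.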