Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CUDA __syncthreads() malfunctioning with -O2 optimization or higher (clang 14.0.6 and 15.0.3) #58626

Closed
cbuchner1 opened this issue Oct 26, 2022 · 2 comments
Assignees
Labels

Comments

@cbuchner1
Copy link

cbuchner1 commented Oct 26, 2022

I use CUDA 11.3 and an nVidia A100, hence the architecture sm_80 is specified during compilation.

__syncthreads() is no longer working for me when compiling CUDA code ever since upgrading from Clang 12 to Clang 14.0.6. The problem persists in Clang 15.0.3

I have a short reproducer. It throws runtime asserts when compiled with -O2 and above.

// clang++ -O3 --cuda-gpu-arch=sm_80 -x cuda test.cu -o test -L/usr/local/cuda-11.3/lib64 -lcudart
#include <cassert>
__global__ void test()
{
   __shared__ int test;
   test = 0;
   __syncthreads();
   if (threadIdx.x == 0 && threadIdx.y == 0 && threadIdx.z == 0)
   {
     test = 1234;
   }
   __syncthreads();
   assert(test == 1234);
}
#include <iostream>
int main(int argc, char **argv)
{
  dim3 block(16,16,1);
  dim3 grid(1,1,1);
  test<<<grid, block>>>();
  cudaDeviceSynchronize();
  std::cerr << "CUDA error code: " << cudaGetLastError() << std::endl;
}

As a workaround, when I call the below inline assembly function barSync() in place of __syncthreads(), the code starts working again.

// inline assembly to insert a barrier synchronization equivalent to __syncthreads()
__device__ __forceinline__ void barSync() {
      asm volatile("bar.sync 0;" : : : "memory");
}
@Artem-B
Copy link
Member

Artem-B commented Oct 26, 2022

This may be the same issue as these:
https://lists.llvm.org/pipermail/llvm-dev/2021-November/154060.html
#54851

It should be already fixed in HEAD.

stonea added a commit to chapel-lang/chapel that referenced this issue Jan 24, 2023
Alt approach for GPU threadblock barrier sync

This PR makes two changes:

Changes how we generate threadblock barrier sync calls
Starts gathering performance data for SHOC sort benchmark

To give some context ---
As discussed here (Cray/chapel-private#4179) with our Chapel implementation of the SHOC sort benchmark we were running into an issue where we'd succeed when not compiling with --fast and fail when using `--fast).
I was able to narrow this down to a small reproducer example (seen here Cray/chapel-private#4179 (comment)), which looks incredibly similar to this example on an LLVM bug report: llvm/llvm-project#58626
In that bug report the author shows they can work around it by using inline assembly (marked volatile) to generate the sync call instead.
This might be fixed in later versions of clang. I don't know. This seems like a reasonable workaround in the interim.

[Reviewed by @e-kayrakli]
riftEmber pushed a commit to chapel-lang/chapel that referenced this issue Jan 30, 2023
Alt approach for GPU threadblock barrier sync

This PR makes two changes:

Changes how we generate threadblock barrier sync calls
Starts gathering performance data for SHOC sort benchmark

To give some context ---
As discussed here (Cray/chapel-private#4179) with our Chapel implementation of the SHOC sort benchmark we were running into an issue where we'd succeed when not compiling with --fast and fail when using `--fast).
I was able to narrow this down to a small reproducer example (seen here Cray/chapel-private#4179 (comment)), which looks incredibly similar to this example on an LLVM bug report: llvm/llvm-project#58626
In that bug report the author shows they can work around it by using inline assembly (marked volatile) to generate the sync call instead.
This might be fixed in later versions of clang. I don't know. This seems like a reasonable workaround in the interim.

[Reviewed by @e-kayrakli]
@Artem-B Artem-B self-assigned this May 8, 2023
@Artem-B
Copy link
Member

Artem-B commented May 8, 2023

Fixed by 9dc7da3 #54851

@Artem-B Artem-B closed this as completed May 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants