CUDA __syncthreads() malfunctioning with -O2 optimization or higher (clang 14.0.6 and 15.0.3) #58626

cbuchner1 · 2022-10-26T13:46:46Z

I use CUDA 11.3 on an NVIDIA A100, so I compile for the sm_80 architecture.

__syncthreads() has stopped working for me when compiling CUDA code since I upgraded from Clang 12 to Clang 14.0.6. The problem persists in Clang 15.0.3.

I have a short reproducer. Its device-side assert fires at runtime when the code is compiled with -O2 or above.

// clang++ -O3 --cuda-gpu-arch=sm_80 -x cuda test.cu -o test -L/usr/local/cuda-11.3/lib64 -lcudart
#include <cassert>
__global__ void test()
{
   __shared__ int test;
   test = 0;
   __syncthreads();
   if (threadIdx.x == 0 && threadIdx.y == 0 && threadIdx.z == 0)
   {
     test = 1234;
   }
   __syncthreads();
   assert(test == 1234);
}
#include <iostream>
int main(int argc, char **argv)
{
  dim3 block(16,16,1);
  dim3 grid(1,1,1);
  test<<<grid, block>>>();
  cudaDeviceSynchronize();
  std::cerr << "CUDA error code: " << cudaGetLastError() << std::endl;
}

As a workaround, when I call the inline assembly function barSync() shown below in place of __syncthreads(), the code works again.

// inline assembly to insert a barrier synchronization equivalent to __syncthreads()
__device__ __forceinline__ void barSync() {
      asm volatile("bar.sync 0;" : : : "memory");
}
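
For reference, here is a minimal sketch of the reproducer with barSync() substituted for both __syncthreads() calls. The names testWithBarSync, flag, and runWithBarSync are hypothetical and used only for illustration; the kernel logic and the 16x16 launch configuration follow the reproducer above. The host helper also returns the result of cudaDeviceSynchronize(), which is where a failed device-side assert would normally be reported.

```cuda
// Sketch only: same logic as the reproducer, with barSync() replacing __syncthreads().
// Assumed build line, mirroring the one in the report:
// clang++ -O3 --cuda-gpu-arch=sm_80 -x cuda test_workaround.cu -o test_workaround -L/usr/local/cuda-11.3/lib64 -lcudart
#include <cassert>

// Inline-assembly barrier from the report.
__device__ __forceinline__ void barSync() {
    asm volatile("bar.sync 0;" : : : "memory");
}

// Hypothetical kernel name; the shared variable is renamed to "flag"
// so that it no longer shadows the kernel name.
__global__ void testWithBarSync()
{
    __shared__ int flag;
    flag = 0;
    barSync();   // every thread has written 0 before thread (0,0,0) writes 1234
    if (threadIdx.x == 0 && threadIdx.y == 0 && threadIdx.z == 0)
    {
        flag = 1234;
    }
    barSync();   // the write of 1234 must be visible to the whole block here
    assert(flag == 1234);
}

// Hypothetical host helper: launches one 16x16 block and returns the first
// error seen, either from the launch itself or from kernel execution.
cudaError_t runWithBarSync()
{
    dim3 block(16, 16, 1);
    dim3 grid(1, 1, 1);
    testWithBarSync<<<grid, block>>>();
    cudaError_t launchErr = cudaGetLastError();
    if (launchErr != cudaSuccess) return launchErr;
    return cudaDeviceSynchronize();
}
```

Calling runWithBarSync() from main() and printing cudaGetErrorString() on the result gives a more readable message than the raw error code.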


Artem-B · 2022-10-26T16:56:17Z

This may be the same issue as these:
https://lists.llvm.org/pipermail/llvm-dev/2021-November/154060.html
#54851

It should already be fixed in HEAD.

@e-kayrakli

Alt approach for GPU threadblock barrier sync

This PR makes two changes:

- Changes how we generate threadblock barrier sync calls
- Starts gathering performance data for SHOC sort benchmark

To give some context
---

As discussed here (Cray/chapel-private#4179), with our Chapel implementation of the SHOC sort benchmark we were running into an issue where we'd succeed when not compiling with `--fast` and fail when using `--fast`. I was able to narrow this down to a small reproducer example (seen here Cray/chapel-private#4179 (comment)), which looks incredibly similar to this example on an LLVM bug report: llvm/llvm-project#58626

In that bug report the author shows they can work around it by using inline assembly (marked volatile) to generate the sync call instead. This might be fixed in later versions of clang. I don't know. This seems like a reasonable workaround in the interim.

[Reviewed by @e-kayrakli]

Artem-B · 2023-05-08T21:24:04Z

Fixed by 9dc7da3 #54851

github-actions bot added the new issue label Oct 26, 2022

EugeneZelenko added cuda and removed new issue labels Oct 26, 2022

Artem-B mentioned this issue Jan 4, 2023

clang CUDA: shared variables sometimes are not updated (on Pascal architecture) #59632

Closed

stonea mentioned this issue Jan 18, 2023

Alt approach for GPU threadblock barrier sync chapel-lang/chapel#21387

Merged

Artem-B self-assigned this May 8, 2023

Artem-B closed this as completed May 8, 2023
