Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Clang] Fix cross-lane scan when given divergent lanes #127703

Merged
merged 1 commit into from
Feb 19, 2025

Conversation

jhuber6
Copy link
Contributor

@jhuber6 jhuber6 commented Feb 18, 2025

Summary:
The scan operation implemented here only works if there are contiguous
ones in the executation mask that can be used to propagate the result.
There are two solutions to this, one is to enter 'whole-wave-mode' and
forcibly turn them back on, or to do this serially. This implementation
does the latter because it's more portable, but checks to see if the
parallel fast-path is applicable.

Needs to be backported for correct behavior and because it fixes a
failing libc test.

@llvmbot llvmbot added clang Clang issues not falling into any other category backend:X86 clang:headers Headers provided by Clang, e.g. for intrinsics libc labels Feb 18, 2025
@llvmbot
Copy link
Member

llvmbot commented Feb 18, 2025

@llvm/pr-subscribers-backend-x86

@llvm/pr-subscribers-clang

Author: Joseph Huber (jhuber6)

Changes

Summary:
The scan operation implemented here only works if there are contiguous
ones in the executation mask that can be used to propagate the result.
There are two solutions to this, one is to enter 'whole-wave-mode' and
forcibly turn them back on, or to do this serially. This implementation
does the latter because it's more portable, but checks to see if the
parallel fast-path is applicable.

Needs to be backported for correct behavior and because it fixes a
failing libc test.


Full diff: https://github.com/llvm/llvm-project/pull/127703.diff

4 Files Affected:

  • (modified) clang/lib/Headers/gpuintrin.h (+24-9)
  • (modified) clang/lib/Headers/nvptxintrin.h (+4-1)
  • (modified) libc/test/integration/src/__support/GPU/CMakeLists.txt (+2)
  • (modified) libc/test/integration/src/__support/GPU/scan_reduce.cpp (+50)
diff --git a/clang/lib/Headers/gpuintrin.h b/clang/lib/Headers/gpuintrin.h
index 11c87e85cd497..75271edbc84a2 100644
--- a/clang/lib/Headers/gpuintrin.h
+++ b/clang/lib/Headers/gpuintrin.h
@@ -154,7 +154,7 @@ __gpu_shuffle_idx_f64(uint64_t __lane_mask, uint32_t __idx, double __x,
 #define __DO_LANE_SUM(__type, __suffix)                                        \
   _DEFAULT_FN_ATTRS static __inline__ __type __gpu_lane_sum_##__suffix(        \
       uint64_t __lane_mask, __type __x) {                                      \
-    for (uint32_t __step = __gpu_num_lanes() / 2; __step > 0; __step /= 2) {   \
+    for (uint32_t __step = 1; __step < __gpu_num_lanes(); __step *= 2) {       \
       uint32_t __index = __step + __gpu_lane_id();                             \
       __x += __gpu_shuffle_idx_##__suffix(__lane_mask, __index, __x,           \
                                           __gpu_num_lanes());                  \
@@ -171,14 +171,29 @@ __DO_LANE_SUM(double, f64);   // double __gpu_lane_sum_f64(m, x)
 #define __DO_LANE_SCAN(__type, __bitmask_type, __suffix)                       \
   _DEFAULT_FN_ATTRS static __inline__ uint32_t __gpu_lane_scan_##__suffix(     \
       uint64_t __lane_mask, uint32_t __x) {                                    \
-    for (uint32_t __step = 1; __step < __gpu_num_lanes(); __step *= 2) {       \
-      uint32_t __index = __gpu_lane_id() - __step;                             \
-      __bitmask_type bitmask = __gpu_lane_id() >= __step;                      \
-      __x += __builtin_bit_cast(                                               \
-          __type, -bitmask & __builtin_bit_cast(__bitmask_type,                \
-                                                __gpu_shuffle_idx_##__suffix(  \
-                                                    __lane_mask, __index, __x, \
-                                                    __gpu_num_lanes())));      \
+    uint64_t __first = __lane_mask >> __builtin_ctzll(__lane_mask);            \
+    bool __divergent = __gpu_read_first_lane_##__suffix(                       \
+        __lane_mask, __first & (__first + 1));                                 \
+    if (__divergent) {                                                         \
+      __type __accum = 0;                                                      \
+      for (uint64_t __mask = __lane_mask; __mask; __mask &= __mask - 1) {      \
+        __type __index = __builtin_ctzll(__mask);                              \
+        __type __tmp = __gpu_shuffle_idx_##__suffix(__lane_mask, __index, __x, \
+                                                    __gpu_num_lanes());        \
+        __x = __gpu_lane_id() == __index ? __accum + __tmp : __x;              \
+        __accum += __tmp;                                                      \
+      }                                                                        \
+    } else {                                                                   \
+      for (uint32_t __step = 1; __step < __gpu_num_lanes(); __step *= 2) {     \
+        uint32_t __index = __gpu_lane_id() - __step;                           \
+        __bitmask_type bitmask = __gpu_lane_id() >= __step;                    \
+        __x += __builtin_bit_cast(                                             \
+            __type,                                                            \
+            -bitmask & __builtin_bit_cast(__bitmask_type,                      \
+                                          __gpu_shuffle_idx_##__suffix(        \
+                                              __lane_mask, __index, __x,       \
+                                              __gpu_num_lanes())));            \
+      }                                                                        \
     }                                                                          \
     return __x;                                                                \
   }
diff --git a/clang/lib/Headers/nvptxintrin.h b/clang/lib/Headers/nvptxintrin.h
index f857a87b5f4c7..29d0adcabc82f 100644
--- a/clang/lib/Headers/nvptxintrin.h
+++ b/clang/lib/Headers/nvptxintrin.h
@@ -155,8 +155,11 @@ _DEFAULT_FN_ATTRS static __inline__ void __gpu_sync_lane(uint64_t __lane_mask) {
 _DEFAULT_FN_ATTRS static __inline__ uint32_t
 __gpu_shuffle_idx_u32(uint64_t __lane_mask, uint32_t __idx, uint32_t __x,
                       uint32_t __width) {
+  // Mask out inactive lanes to match AMDGPU behavior.
   uint32_t __mask = (uint32_t)__lane_mask;
-  return __nvvm_shfl_sync_idx_i32(__mask, __x, __idx,
+  bool __bitmask = (1ull << __idx) & __lane_mask;
+  return -__bitmask &
+         __nvvm_shfl_sync_idx_i32(__mask, __x, __idx,
                                   ((__gpu_num_lanes() - __width) << 8u) | 0x1f);
 }
 
diff --git a/libc/test/integration/src/__support/GPU/CMakeLists.txt b/libc/test/integration/src/__support/GPU/CMakeLists.txt
index e066830f6cc0d..a750f2302b511 100644
--- a/libc/test/integration/src/__support/GPU/CMakeLists.txt
+++ b/libc/test/integration/src/__support/GPU/CMakeLists.txt
@@ -6,6 +6,8 @@ add_integration_test(
   SUITE libc-support-gpu-tests
   SRCS
     scan_reduce.cpp
+  DEPENDS
+    libc.src.stdio.printf
   LOADER_ARGS
     --threads 64
 )
diff --git a/libc/test/integration/src/__support/GPU/scan_reduce.cpp b/libc/test/integration/src/__support/GPU/scan_reduce.cpp
index bc621c3300cbe..7739137f5b200 100644
--- a/libc/test/integration/src/__support/GPU/scan_reduce.cpp
+++ b/libc/test/integration/src/__support/GPU/scan_reduce.cpp
@@ -8,6 +8,7 @@
 
 #include "src/__support/CPP/bit.h"
 #include "src/__support/GPU/utils.h"
+#include "src/stdio/printf.h"
 #include "test/IntegrationTest/test.h"
 
 using namespace LIBC_NAMESPACE;
@@ -53,10 +54,59 @@ static void test_scan() {
   EXPECT_EQ(z, gpu::get_lane_id() % 2 ? gpu::get_lane_id() / 2 + 1 : 0);
 }
 
+static uint32_t random(uint64_t *rand_next) {
+  uint64_t x = *rand_next;
+  x ^= x >> 12;
+  x ^= x << 25;
+  x ^= x >> 27;
+  *rand_next = x;
+  return static_cast<uint32_t>((x * 0x2545F4914F6CDD1Dul) >> 32);
+}
+
+// Scan operations can break down under thread divergence, make sure that the
+// function works under some random divergence. We do this by trivially
+// implementing a scan with shared scratch memory and then comparing the
+// results.
+static void test_scan_divergent() {
+  static uint32_t input[64] = {0};
+  static uint32_t result[64] = {0};
+  uint64_t state = gpu::processor_clock() * __gpu_lane_id();
+
+  for (int i = 0; i < 64; ++i) {
+    uint64_t lanemask = gpu::get_lane_mask();
+    if (random(&state) & (1ull << gpu::get_lane_id())) {
+      uint64_t divergent = gpu::get_lane_mask();
+      uint32_t value = random(&state) % 256;
+      input[gpu::get_lane_id()] = value;
+
+      if (gpu::is_first_lane(divergent)) {
+        uint32_t accumulator = 0;
+        for (uint32_t lane = 0; lane < gpu::get_lane_size(); ++lane) {
+          uint32_t tmp = input[lane];
+          result[lane] = tmp + accumulator;
+          accumulator += tmp;
+        }
+      }
+      gpu::sync_lane(divergent);
+
+      uint32_t scan = gpu::scan(divergent, value);
+      EXPECT_EQ(scan, result[gpu::get_lane_id()]);
+    }
+    if (gpu::is_first_lane(lanemask))
+      __builtin_memset(input, 0, sizeof(input));
+    gpu::sync_lane(lanemask);
+  }
+}
+
 TEST_MAIN(int argc, char **argv, char **envp) {
+  if (gpu::get_thread_id() >= gpu::get_lane_size())
+    return 0;
+
   test_reduce();
 
   test_scan();
 
+  test_scan_divergent();
+
   return 0;
 }

@llvmbot
Copy link
Member

llvmbot commented Feb 18, 2025

@llvm/pr-subscribers-libc

Author: Joseph Huber (jhuber6)

Changes

Summary:
The scan operation implemented here only works if there are contiguous
ones in the executation mask that can be used to propagate the result.
There are two solutions to this, one is to enter 'whole-wave-mode' and
forcibly turn them back on, or to do this serially. This implementation
does the latter because it's more portable, but checks to see if the
parallel fast-path is applicable.

Needs to be backported for correct behavior and because it fixes a
failing libc test.


Full diff: https://github.com/llvm/llvm-project/pull/127703.diff

4 Files Affected:

  • (modified) clang/lib/Headers/gpuintrin.h (+24-9)
  • (modified) clang/lib/Headers/nvptxintrin.h (+4-1)
  • (modified) libc/test/integration/src/__support/GPU/CMakeLists.txt (+2)
  • (modified) libc/test/integration/src/__support/GPU/scan_reduce.cpp (+50)
diff --git a/clang/lib/Headers/gpuintrin.h b/clang/lib/Headers/gpuintrin.h
index 11c87e85cd497..75271edbc84a2 100644
--- a/clang/lib/Headers/gpuintrin.h
+++ b/clang/lib/Headers/gpuintrin.h
@@ -154,7 +154,7 @@ __gpu_shuffle_idx_f64(uint64_t __lane_mask, uint32_t __idx, double __x,
 #define __DO_LANE_SUM(__type, __suffix)                                        \
   _DEFAULT_FN_ATTRS static __inline__ __type __gpu_lane_sum_##__suffix(        \
       uint64_t __lane_mask, __type __x) {                                      \
-    for (uint32_t __step = __gpu_num_lanes() / 2; __step > 0; __step /= 2) {   \
+    for (uint32_t __step = 1; __step < __gpu_num_lanes(); __step *= 2) {       \
       uint32_t __index = __step + __gpu_lane_id();                             \
       __x += __gpu_shuffle_idx_##__suffix(__lane_mask, __index, __x,           \
                                           __gpu_num_lanes());                  \
@@ -171,14 +171,29 @@ __DO_LANE_SUM(double, f64);   // double __gpu_lane_sum_f64(m, x)
 #define __DO_LANE_SCAN(__type, __bitmask_type, __suffix)                       \
   _DEFAULT_FN_ATTRS static __inline__ uint32_t __gpu_lane_scan_##__suffix(     \
       uint64_t __lane_mask, uint32_t __x) {                                    \
-    for (uint32_t __step = 1; __step < __gpu_num_lanes(); __step *= 2) {       \
-      uint32_t __index = __gpu_lane_id() - __step;                             \
-      __bitmask_type bitmask = __gpu_lane_id() >= __step;                      \
-      __x += __builtin_bit_cast(                                               \
-          __type, -bitmask & __builtin_bit_cast(__bitmask_type,                \
-                                                __gpu_shuffle_idx_##__suffix(  \
-                                                    __lane_mask, __index, __x, \
-                                                    __gpu_num_lanes())));      \
+    uint64_t __first = __lane_mask >> __builtin_ctzll(__lane_mask);            \
+    bool __divergent = __gpu_read_first_lane_##__suffix(                       \
+        __lane_mask, __first & (__first + 1));                                 \
+    if (__divergent) {                                                         \
+      __type __accum = 0;                                                      \
+      for (uint64_t __mask = __lane_mask; __mask; __mask &= __mask - 1) {      \
+        __type __index = __builtin_ctzll(__mask);                              \
+        __type __tmp = __gpu_shuffle_idx_##__suffix(__lane_mask, __index, __x, \
+                                                    __gpu_num_lanes());        \
+        __x = __gpu_lane_id() == __index ? __accum + __tmp : __x;              \
+        __accum += __tmp;                                                      \
+      }                                                                        \
+    } else {                                                                   \
+      for (uint32_t __step = 1; __step < __gpu_num_lanes(); __step *= 2) {     \
+        uint32_t __index = __gpu_lane_id() - __step;                           \
+        __bitmask_type bitmask = __gpu_lane_id() >= __step;                    \
+        __x += __builtin_bit_cast(                                             \
+            __type,                                                            \
+            -bitmask & __builtin_bit_cast(__bitmask_type,                      \
+                                          __gpu_shuffle_idx_##__suffix(        \
+                                              __lane_mask, __index, __x,       \
+                                              __gpu_num_lanes())));            \
+      }                                                                        \
     }                                                                          \
     return __x;                                                                \
   }
diff --git a/clang/lib/Headers/nvptxintrin.h b/clang/lib/Headers/nvptxintrin.h
index f857a87b5f4c7..29d0adcabc82f 100644
--- a/clang/lib/Headers/nvptxintrin.h
+++ b/clang/lib/Headers/nvptxintrin.h
@@ -155,8 +155,11 @@ _DEFAULT_FN_ATTRS static __inline__ void __gpu_sync_lane(uint64_t __lane_mask) {
 _DEFAULT_FN_ATTRS static __inline__ uint32_t
 __gpu_shuffle_idx_u32(uint64_t __lane_mask, uint32_t __idx, uint32_t __x,
                       uint32_t __width) {
+  // Mask out inactive lanes to match AMDGPU behavior.
   uint32_t __mask = (uint32_t)__lane_mask;
-  return __nvvm_shfl_sync_idx_i32(__mask, __x, __idx,
+  bool __bitmask = (1ull << __idx) & __lane_mask;
+  return -__bitmask &
+         __nvvm_shfl_sync_idx_i32(__mask, __x, __idx,
                                   ((__gpu_num_lanes() - __width) << 8u) | 0x1f);
 }
 
diff --git a/libc/test/integration/src/__support/GPU/CMakeLists.txt b/libc/test/integration/src/__support/GPU/CMakeLists.txt
index e066830f6cc0d..a750f2302b511 100644
--- a/libc/test/integration/src/__support/GPU/CMakeLists.txt
+++ b/libc/test/integration/src/__support/GPU/CMakeLists.txt
@@ -6,6 +6,8 @@ add_integration_test(
   SUITE libc-support-gpu-tests
   SRCS
     scan_reduce.cpp
+  DEPENDS
+    libc.src.stdio.printf
   LOADER_ARGS
     --threads 64
 )
diff --git a/libc/test/integration/src/__support/GPU/scan_reduce.cpp b/libc/test/integration/src/__support/GPU/scan_reduce.cpp
index bc621c3300cbe..7739137f5b200 100644
--- a/libc/test/integration/src/__support/GPU/scan_reduce.cpp
+++ b/libc/test/integration/src/__support/GPU/scan_reduce.cpp
@@ -8,6 +8,7 @@
 
 #include "src/__support/CPP/bit.h"
 #include "src/__support/GPU/utils.h"
+#include "src/stdio/printf.h"
 #include "test/IntegrationTest/test.h"
 
 using namespace LIBC_NAMESPACE;
@@ -53,10 +54,59 @@ static void test_scan() {
   EXPECT_EQ(z, gpu::get_lane_id() % 2 ? gpu::get_lane_id() / 2 + 1 : 0);
 }
 
+static uint32_t random(uint64_t *rand_next) {
+  uint64_t x = *rand_next;
+  x ^= x >> 12;
+  x ^= x << 25;
+  x ^= x >> 27;
+  *rand_next = x;
+  return static_cast<uint32_t>((x * 0x2545F4914F6CDD1Dul) >> 32);
+}
+
+// Scan operations can break down under thread divergence, make sure that the
+// function works under some random divergence. We do this by trivially
+// implementing a scan with shared scratch memory and then comparing the
+// results.
+static void test_scan_divergent() {
+  static uint32_t input[64] = {0};
+  static uint32_t result[64] = {0};
+  uint64_t state = gpu::processor_clock() * __gpu_lane_id();
+
+  for (int i = 0; i < 64; ++i) {
+    uint64_t lanemask = gpu::get_lane_mask();
+    if (random(&state) & (1ull << gpu::get_lane_id())) {
+      uint64_t divergent = gpu::get_lane_mask();
+      uint32_t value = random(&state) % 256;
+      input[gpu::get_lane_id()] = value;
+
+      if (gpu::is_first_lane(divergent)) {
+        uint32_t accumulator = 0;
+        for (uint32_t lane = 0; lane < gpu::get_lane_size(); ++lane) {
+          uint32_t tmp = input[lane];
+          result[lane] = tmp + accumulator;
+          accumulator += tmp;
+        }
+      }
+      gpu::sync_lane(divergent);
+
+      uint32_t scan = gpu::scan(divergent, value);
+      EXPECT_EQ(scan, result[gpu::get_lane_id()]);
+    }
+    if (gpu::is_first_lane(lanemask))
+      __builtin_memset(input, 0, sizeof(input));
+    gpu::sync_lane(lanemask);
+  }
+}
+
 TEST_MAIN(int argc, char **argv, char **envp) {
+  if (gpu::get_thread_id() >= gpu::get_lane_size())
+    return 0;
+
   test_reduce();
 
   test_scan();
 
+  test_scan_divergent();
+
   return 0;
 }

@jhuber6 jhuber6 added this to the LLVM 20.X Release milestone Feb 18, 2025
@jhuber6 jhuber6 force-pushed the ScanFix branch 2 times, most recently from bf81a92 to ba202ea Compare February 18, 2025 22:02
Summary:
The scan operation implemented here only works if there are contiguous
ones in the executation mask that can be used to propagate the result.
There are two solutions to this, one is to enter 'whole-wave-mode' and
forcibly turn them back on, or to do this serially. This implementation
does the latter because it's more portable, but checks to see if the
parallel fast-path is applicable.

Needs to be backported for correct behavior and because it fixes a
failing libc test.
@@ -188,6 +186,32 @@ __DO_LANE_SCAN(float, uint32_t, f32); // float __gpu_lane_scan_f32(m, x)
__DO_LANE_SCAN(double, uint64_t, f64); // double __gpu_lane_scan_f64(m, x)
#undef __DO_LANE_SCAN

// Gets the sum of all lanes inside the warp or wavefront.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: why do you need to sink this macro?

@jhuber6 jhuber6 merged commit 6cc7ca0 into llvm:main Feb 19, 2025
12 of 14 checks passed
@jhuber6
Copy link
Contributor Author

jhuber6 commented Feb 20, 2025

/cherry-pick 6cc7ca0

@llvmbot
Copy link
Member

llvmbot commented Feb 20, 2025

/pull-request #128085

swift-ci pushed a commit to swiftlang/llvm-project that referenced this pull request Feb 21, 2025
Summary:
The scan operation implemented here only works if there are contiguous
ones in the executation mask that can be used to propagate the result.
There are two solutions to this, one is to enter 'whole-wave-mode' and
forcibly turn them back on, or to do this serially. This implementation
does the latter because it's more portable, but checks to see if the
parallel fast-path is applicable.

Needs to be backported for correct behavior and because it fixes a
failing libc test.

(cherry picked from commit 6cc7ca0)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backend:X86 clang:headers Headers provided by Clang, e.g. for intrinsics clang Clang issues not falling into any other category libc
Projects
Development

Successfully merging this pull request may close these issues.

3 participants