[MATRIX][DOC][E2E] Add note on sm version nvidia device issue. (#14178)
For some sm_7x devices it is important to compile using the sm version
(compute capability) matching the device that will run the joint_matrix
code. This adds a note in docs explaining this.

Also qualify the call as `sycl::fabs` in joint_matrix_gemm_cuda.hpp to
avoid ambiguity with `sycl::native::fabs`.

---------

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
JackAKirk authored Jun 24, 2024
1 parent dc93a36 commit 9800153
Showing 3 changed files with 16 additions and 1 deletion.
@@ -1085,6 +1085,13 @@ multiple of 4 when `T` is `float`; where `T` is the type of the
`joint_matrix` elements. When `T` is not `half` or `float` there are
no restrictions to `stride`.

+IMPORTANT: For some devices it is important to use the sm version
+(Compute Capability) corresponding to the device that will run the
+program when specifying e.g. `-fsycl-targets=nvidia_gpu_sm_xx` during
+compilation. This particularly affects matrix operations using `half`.
+For more information on this issue consult
+https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#wmma-restrictions

==== AMD Matrix Cores Supported Combinations
The complete set of matrix data types and dimensions that are supported by
the `ext_oneapi_hip` backend are represented in the following
@@ -1139,4 +1146,5 @@ supported combinations
load/store overloads
|11 |2024-04-29 |Yury Plyakhin | Add 1x64x16 supported combination for
Intel XMX (intel_gpu_pvc)
+|12 |2024-06-14 |Jack Kirk | Add note on sm version device matching issue.
|======================
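As a concrete illustration of the note added above, a build pinned to the exact Compute Capability of the target device might look like the following. This is a hedged sketch: the device suffix (`sm_75`) and the source file name are assumptions for illustration, and `clang++` here is the DPC++ compiler built with CUDA support.

```shell
# Sketch: target the exact Compute Capability of the device that will run the
# program (assumed here to be sm_75), rather than a generic sm_7x build.
# The source file name is illustrative.
clang++ -fsycl -fsycl-targets=nvidia_gpu_sm_75 \
    joint_matrix_example.cpp -o joint_matrix_example
```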
3 changes: 2 additions & 1 deletion sycl/test-e2e/Matrix/joint_matrix_gemm_cuda.hpp
@@ -229,7 +229,8 @@ void test(queue &q) {
auto res_device =
matrix_ref_mn<Big_N, Big_K, Big_M, layout_A, layout_B>(m, n, A, B,
C);
-assert(fabs(2 * (D[index_D] - res_device)) / (D[index_D] + res_device) <
+assert(sycl::fabs(2 * (D[index_D] - res_device)) /
+       (D[index_D] + res_device) <
bf16_eps * 2);
} else {
assert((D[index_D] ==
6 changes: 6 additions & 0 deletions sycl/test-e2e/Matrix/joint_matrix_tensorcores_sm70.cpp
@@ -13,6 +13,12 @@
// This tests the unified matrix extension interfaces for the cuda backend.
// This test must be compiled with -Xsycl-target-backend --cuda-gpu-arch=sm_xx,
// where sm_xx >= sm_70.
+// For some devices it is important to use the sm version (Compute Capability)
+// corresponding to the device that will run the program when specifying e.g.
+// `-fsycl-targets=nvidia_gpu_sm_xx` during compilation. This particularly
+// affects matrix operations using `half` such as those in this test. For more
+// information on this issue consult
+// https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#wmma-restrictions

#include "joint_matrix_apply_cuda.hpp"
#include "joint_matrix_gemm_cuda.hpp"