[MATRIX][DOC][E2E] Add note on sm version nvidia device issue. (#14178)
For some sm_7x devices it is important to compile using the sm version
(compute capability) matching the device that will run the joint_matrix
code. This adds a note in docs explaining this.

Also qualify the call as `sycl::fabs` in joint_matrix_gemm_cuda.hpp to
avoid ambiguity with `sycl::native::fabs`.

---------

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
JackAKirk authored Jun 24, 2024
1 parent dc93a36 commit 9800153
Showing 3 changed files with 16 additions and 1 deletion.
@@ -1085,6 +1085,13 @@ multiple of 4 when `T` is `float`; where `T` is the type of the
`joint_matrix` elements. When `T` is not `half` or `float` there are
no restrictions to `stride`.

+IMPORTANT: For some devices it is important to use the sm version
+(Compute Capability) corresponding to the device that will run the
+program when specifying e.g. `-fsycl-targets=nvidia_gpu_sm_xx` during
+compilation. This particularly affects matrix operations using `half`.
+For more information on this issue consult
+https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#wmma-restrictions

==== AMD Matrix Cores Supported Combinations
The complete set of matrix data types and dimensions that are supported by
the `ext_oneapi_hip` backend are represented in the following
@@ -1139,4 +1146,5 @@ supported combinations
load/store overloads
|11 |2024-04-29 |Yury Plyakhin | Add 1x64x16 supported combination for
Intel XMX (intel_gpu_pvc)
+|12 |2024-06-14 |Jack Kirk | Add note on sm version device matching issue.
|======================
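As a concrete illustration of the note added above, a build pinned to the exact Compute Capability of the target device might look like the following. This is a hedged sketch: the device suffix (`sm_75`) and the source file name are assumptions for illustration, and `clang++` here is the DPC++ compiler built with CUDA support.

```shell
# Sketch: target the exact Compute Capability of the device that will run the
# program (assumed here to be sm_75), rather than a generic sm_7x build.
# The source file name is illustrative.
clang++ -fsycl -fsycl-targets=nvidia_gpu_sm_75 \
    joint_matrix_example.cpp -o joint_matrix_example
```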
3 changes: 2 additions & 1 deletion sycl/test-e2e/Matrix/joint_matrix_gemm_cuda.hpp
@@ -229,7 +229,8 @@ void test(queue &q) {
auto res_device =
matrix_ref_mn<Big_N, Big_K, Big_M, layout_A, layout_B>(m, n, A, B,
C);
-assert(fabs(2 * (D[index_D] - res_device)) / (D[index_D] + res_device) <
+assert(sycl::fabs(2 * (D[index_D] - res_device)) /
+       (D[index_D] + res_device) <
bf16_eps * 2);
} else {
assert((D[index_D] ==
6 changes: 6 additions & 0 deletions sycl/test-e2e/Matrix/joint_matrix_tensorcores_sm70.cpp
@@ -13,6 +13,12 @@
// This tests the unified matrix extension interfaces for the cuda backend.
// This test must be compiled with -Xsycl-target-backend --cuda-gpu-arch=sm_xx,
// where sm_xx >= sm_70.
+// For some devices it is important to use the sm version (Compute Capability)
+// corresponding to the device that will run the program when specifying e.g.
+// `-fsycl-targets=nvidia_gpu_sm_xx` during compilation. This particularly
+// affects matrix operations using `half` such as those in this test. For more
+// information on this issue consult
+// https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#wmma-restrictions

#include "joint_matrix_apply_cuda.hpp"
#include "joint_matrix_gemm_cuda.hpp"