Nightly Sycl unit test failures with intel/2023.1.0, intel/2024.1.0 on Intel Ponte Vecchio #1961

Closed
ndellingwood opened this issue Aug 31, 2023 · 7 comments

Comments

@ndellingwood
Contributor

ndellingwood commented Aug 31, 2023

Testing with the Sycl backend on Intel Ponte Vecchio on the new Blake showed a couple of failing sub-tests (failure output is listed below each failing executable), depending on which environment variables are set:

Default (ZES_ENABLE_SYSMAN unset)

The following tests FAILED:
   13 - sparse_sycl (Failed)
[  FAILED  ] sycl_test.sparse_coo2crs
[  FAILED  ] sycl_test.sparse_spgemm_jacobi_double_int_size_t_TestExecSpace
[  FAILED  ] sycl_test.sparse_spgemm_double_int_size_t_TestExecSpace
[  FAILED  ] sycl_test.sparse_par_ilut_double_int_size_t_TestExecSpace
[  FAILED  ] sycl_test.sparse_par_ilut_precond_double_int_size_t_TestExecSpace

   14 - blocksparse_sycl (Failed)
[  FAILED  ] sycl_test.sparse_bsr_gauss_seidel_rank1_double_int_size_t_TestExecSpace
[  FAILED  ] sycl_test.sparse_bsr_gauss_seidel_rank2_double_int_size_t_TestExecSpace
[  FAILED  ] sycl_test.sparse_block_spgemm_double_int_size_t_TestExecSpace

   22 - wiki_spgemm (Subprocess aborted)
terminate called after throwing an instance of 'std::runtime_error'
  what():  Error: No memory modules for the SYCL backend found. Make sure that ZES_ENABLE_SYSMAN=1 is set at run time!

ZES_ENABLE_SYSMAN=1

The following tests FAILED:
13/27 Test #13: sparse_sycl ......................Subprocess aborted***Exception:  45.76 sec
[==========] Running 48 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 48 tests from sycl_test
[ RUN      ] sycl_test.sparse_coo2crs
/home/ndellin/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 4.34205 vs 3.75255e-10
row: 17, crs_col_ids_ref(504) = 20 mismatched values!
Begin arguments for above failure...
RandCooMat<N6Kokkos7complexIdEE, N6Kokkos10LayoutLeftE, N6Kokkos12Experimental4SYCLE130...): rand seed: 3072659895
scalar: N6Kokkos7complexIdEE
layout: N6Kokkos10LayoutLeftE
m: 130, n: 130
...end arguments for above failure.
...
[  FAILED  ] sycl_test.sparse_coo2crs (20842 ms)
[ RUN      ] sycl_test.sparse_spgemm_jacobi_double_int_size_t_TestExecSpace
terminate called after throwing an instance of 'std::runtime_error'
  what():  There was a synchronous SYCL error:
Native API failed. Native API returns: -5 (PI_ERROR_OUT_OF_RESOURCES) -5 (PI_ERROR_OUT_OF_RESOURCES)

14/27 Test #14: blocksparse_sycl .................***Failed   33.50 sec
...
[ RUN      ] sycl_test.sparse_block_spgemm_double_int_size_t_TestExecSpace
nentries_actual:1564 nentries_reference:2423
/home/ndellin/kokkos-kernels/sparse/unit_test/Test_Sparse_bspgemm.hpp:235: Failure
Value of: is_identical
  Actual: false
Expected: true
SPGEMM_KK
...

Reproducer (Blake PV queue):
SHAs:
kokkos/kokkos@7e299b4
acdd896

module load cmake intel-oneapi-compilers/2023.1.0

$KOKKOSKERNELS_PATH/cm_generate_makefile.bash --with-sycl --arch=INTEL_PVC --compiler=/projects/x86-64-icelake-rocky8/compilers/intel-oneapi-compilers/2023.1.0/gcc/8.5.0/base/6g2jkiv/compiler/2023.1.0/linux/bin-llvm/clang++ --cxxflags="-fp-model=precise" --shared --kokkos-cmake-flags=-DKokkos_ENABLE_ONEDPL=OFF

Edit: Added the SHAs used in testing.

@ndellingwood ndellingwood changed the title Sycl unit test failures with intel/2023.1.0 on Intel Ponte Vecchio Nightly Sycl unit test failures with intel/2023.1.0 on Intel Ponte Vecchio Nov 14, 2023
@ndellingwood
Contributor Author

Updating the issue with failures as of SHA 32aa75a

Configuration 1 (no TPLs):

salloc -N 1 -p PV

source /projects/x86-64-icelake-rocky8/spack-config/blake-setup-user-module-env.sh
module purge
module load cmake intel-oneapi-compilers/2023.1.0 intel-oneapi-dpl/2022.1.0 git

# Required for the hashmap accumulator
export ZES_ENABLE_SYSMAN=1

# Configuration
$KOKKOSKERNELS_PATH/cm_generate_makefile.bash --with-sycl --arch=INTEL_PVC --compiler=/projects/x86-64-icelake-rocky8/compilers/intel-oneapi-compilers/2023.1.0/gcc/8.5.0/base/6g2jkiv/compiler/2023.1.0/linux/bin-llvm/clang++ --cxxflags="-fp-model=precise" --shared --kokkos-cmake-flags=-DKokkos_ENABLE_ONEDPL=OFF --kokkos-path=$KOKKOS_PATH

Test failures on PVC:

23:43:24 The following tests FAILED:
23:43:24 	 15 - sparse_sycl (SEGFAULT)
23:43:24 	 16 - blocksparse_sycl (Failed)

Configuration 2 (oneMKL):

salloc -N 1 -p PV

source /projects/x86-64-icelake-rocky8/spack-config/blake-setup-user-module-env.sh
module purge
module load git cmake intel-oneapi-compilers/2023.1.0 intel-oneapi-dpl/2022.1.0 intel-oneapi-mkl/2023.1.0 intel-oneapi-tbb/2021.9.0

# Required for the hashmap accumulator
export ZES_ENABLE_SYSMAN=1

# Configuration
$KOKKOSKERNELS_PATH/cm_generate_makefile.bash --with-sycl --arch=INTEL_PVC --compiler=icpx --cxxflags="-fp-model=precise" --shared --with-tpls=mkl --kokkos-cmake-flags=-DKokkos_ENABLE_ONEDPL=OFF --kokkos-path=$KOKKOS_PATH

Test failures on PVC:

05:49:17 The following tests FAILED:
05:49:17 	  9 - blas_sycl (Failed)
05:49:17 	 15 - sparse_sycl (Subprocess aborted)
05:49:17 	 16 - blocksparse_sycl (Failed)
05:49:17 	 26 - wiki_spadd (Subprocess aborted)

@ndellingwood
Contributor Author

Joe installed Intel oneAPI 2024.1.0 on Blake, and I tested the MKL configuration above:

Test failures:

15/32 Test #15: sparse_sycl ......................***Failed  194.78 sec
...
[  PASSED  ] 47 tests.
[  FAILED  ] 4 tests, listed below:
[  FAILED  ] sycl_test.sparse_spgemm_jacobi_double_int_int_TestDevice
[  FAILED  ] sycl_test.sparse_spgemm_double_int_int_TestDevice
[  FAILED  ] sycl_test.sparse_spmv_double_int_int_TestDevice
[  FAILED  ] sycl_test.sparse_par_ilut_double_int_int_TestDevice

16/32 Test #16: blocksparse_sycl .................***Failed   29.87 sec
...
[==========] 7 tests from 1 test case ran. (29406 ms total)
[  PASSED  ] 6 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] sycl_test.sparse_block_spgemm_double_int_int_TestDevice

Configuration (Sycl backend, intel/2024.1.0 with mkl/2024.0.0):

source /projects/x86-64-icelake-rocky8/spack-config/blake-setup-user-module-env.sh
module purge
module load cmake intel-oneapi-compilers/2024.1.0 intel-oneapi-dpl/2022.5.0 intel-oneapi-tbb/2021.12.0 intel-oneapi-mkl/2024.0.0
module list

# Required for the hashmap accumulator
export ZES_ENABLE_SYSMAN=1

# Configuration
$KOKKOSKERNELS_PATH/cm_generate_makefile.bash --with-sycl --arch=INTEL_PVC --compiler=icpx --cxxflags="-fp-model=precise -Wno-pass-failed" --shared --with-tpls=mkl --kokkos-path=$KOKKOS_PATH

make -j16

# Unit tests
export ONEAPI_DEVICE_SELECTOR=ext_oneapi_level_zero:gpu
ctest --output-on-failure

@ndellingwood ndellingwood changed the title Nightly Sycl unit test failures with intel/2023.1.0 on Intel Ponte Vecchio Nightly Sycl unit test failures with intel/2023.1.0, intel/2024.1.0 on Intel Ponte Vecchio Jun 5, 2024
@cwpearson
Contributor

I've been poking around with this:

In the SpGEMM, it seems that Kokkos::atomic_add(addr, val); always results in *addr = 0 (or possibly leaves *addr unchanged, i.e. Kokkos::atomic_add is a no-op).

However, *addr += val; does cause some math to happen, though it produces incorrect values in a context where atomics are needed.

I've tried replacing Kokkos::atomic_add(addr, val) with various flavors of

auto v = sycl::atomic_ref<std::remove_reference_t<decltype(*addr)>,
                          sycl::memory_order::relaxed,
                          sycl::memory_scope::device, 
                          sycl::access::address_space::global_space>(*addr);
v += val;

but no luck so far.

I've also tried running the Kokkos Core atomics unit tests built with the same Core that I use for the Kernels unit tests, and the Core atomic unit tests all pass.

@cwpearson
Contributor

cwpearson commented Jun 28, 2024

Reimplementing alignPtr as

template <typename InPtr, typename T>
KOKKOS_INLINE_FUNCTION T *alignPtr(InPtr p) {
  std::uintptr_t ptrVal = reinterpret_cast<std::uintptr_t>(p);
  while (ptrVal % alignof(T)) {
    ++ptrVal;
  }
  return reinterpret_cast<T *>(ptrVal);
}

seems to make the SpGEMM unit tests pass. However, using the equivalent

template <typename InPtr, typename T>
KOKKOS_INLINE_FUNCTION T *alignPtr(InPtr p) {
  std::uintptr_t ptrVal = reinterpret_cast<std::uintptr_t>(p);
  return reinterpret_cast<T *>((ptrVal + alignof(T) - 1) / alignof(T) * alignof(T));
}

does not. This may be a SYCL compiler issue (unless (ptrVal + alignof(T) - 1) overflows).
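For reference, here is a host-only sketch that checks whether the two alignPtr variants above return the same pointer in an ordinary CPU build. The names alignPtrLoop, alignPtrDiv, and variantsAgree are hypothetical, and KOKKOS_INLINE_FUNCTION is dropped so the code compiles without Kokkos. It says nothing about what the SYCL device compiler generates; it only checks that the two formulations are arithmetically equivalent when no overflow occurs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Loop-based variant from the comment above, host-only (no Kokkos macros).
template <typename InPtr, typename T>
T *alignPtrLoop(InPtr p) {
  std::uintptr_t ptrVal = reinterpret_cast<std::uintptr_t>(p);
  while (ptrVal % alignof(T)) {
    ++ptrVal;
  }
  return reinterpret_cast<T *>(ptrVal);
}

// Divide-and-multiply variant from the comment above, host-only.
template <typename InPtr, typename T>
T *alignPtrDiv(InPtr p) {
  std::uintptr_t ptrVal = reinterpret_cast<std::uintptr_t>(p);
  return reinterpret_cast<T *>((ptrVal + alignof(T) - 1) / alignof(T) * alignof(T));
}

// Hypothetical helper: returns true if both variants agree for every byte
// offset into a small buffer, and the result is aligned and not moved too far.
bool variantsAgree() {
  alignas(std::max_align_t) unsigned char buf[64] = {};
  for (std::size_t off = 0; off < 32; ++off) {
    unsigned char *p = buf + off;
    double *a = alignPtrLoop<unsigned char *, double>(p);
    double *b = alignPtrDiv<unsigned char *, double>(p);
    const std::uintptr_t av = reinterpret_cast<std::uintptr_t>(a);
    const std::uintptr_t pv = reinterpret_cast<std::uintptr_t>(p);
    if (a != b) return false;                     // variants disagree
    if (av % alignof(double) != 0) return false;  // result is not aligned
    if (av < pv || av - pv >= alignof(double)) return false;  // moved backwards or too far
  }
  return true;
}
```

Calling variantsAgree() from a host test is enough to rule out a plain arithmetic difference between the two versions.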

@cwpearson
Contributor

cwpearson commented Jun 28, 2024

unsigned int f1(unsigned int i, unsigned int align) // today
{
    return ((i + align - 1) & (~(align - 1)));
}

unsigned int f2(unsigned int i, unsigned int align)
{
    return ((i + align - 1) / align * align);
}

unsigned int f3(unsigned int i, unsigned int align) // gcc
{
    return (i + align - 1) & (-align);
}

unsigned int f4(unsigned int i, unsigned int align)
{
    while (i % align) {
        ++i;
    }
    return i;
}

Only f4 works for the SYCL SpGEMM.

In clang trunk for x86 on godbolt, f1 and f3 compile to the same instructions; f2 and f4 each compile to something different again.
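A host-side sketch comparing the four roundings follows. f1 to f4 are copied from the comment above; allAgree and agreeForPowersOfTwo are hypothetical helper names. On a CPU, all four agree whenever align is a power of two, which every alignof value is. The mask-based f1 and f3 assume a power-of-two align, while f2 and f4 do not. This does not reproduce the SYCL behavior; it only checks that the four are interchangeable on the host.

```cpp
#include <cassert>

unsigned int f1(unsigned int i, unsigned int align) {  // mask with ~(align - 1)
  return ((i + align - 1) & (~(align - 1)));
}

unsigned int f2(unsigned int i, unsigned int align) {  // divide, then multiply
  return ((i + align - 1) / align * align);
}

unsigned int f3(unsigned int i, unsigned int align) {  // mask with -align
  return (i + align - 1) & (-align);
}

unsigned int f4(unsigned int i, unsigned int align) {  // increment until aligned
  while (i % align) {
    ++i;
  }
  return i;
}

// Hypothetical helper: true if f1, f2 and f3 all match the loop-based f4.
bool allAgree(unsigned int i, unsigned int align) {
  const unsigned int expected = f4(i, align);
  return f1(i, align) == expected && f2(i, align) == expected &&
         f3(i, align) == expected;
}

// Hypothetical helper: checks agreement for align = 1, 2, 4, ..., 64.
bool agreeForPowersOfTwo() {
  for (unsigned int align = 1; align <= 64; align *= 2) {
    for (unsigned int i = 0; i < 1000; ++i) {
      if (!allAgree(i, align)) return false;
    }
  }
  return true;
}
```

With a non-power-of-two align such as 3, the mask-based versions diverge: f1(4, 3) and f3(4, 3) return 4, while f2(4, 3) and f4(4, 3) return 6.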

@ndellingwood
Contributor Author

Status update as of 7/9/2024, following the merge of some recent fixes:

Sycl + PV, no MKL

Failing tests

23:41:05 The following tests FAILED:
23:41:05 	 15 - sparse_sycl (Failed)

Failure output snips:
sparse_sycl

23:40:11 [ RUN      ] sycl_test.sparse_coo2crs
23:40:11 /home/jenkins/blake-new/workspace/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_Sycl_PV/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp:159: Failure
23:40:11 Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1.37386 vs 3.71438e-09
23:40:11 row: 31, crs_col_ids_ref(2871) = 25 mismatched values!
23:40:11 Begin arguments for above failure...
23:40:11 RandCooMat<N6Kokkos7complexIdEE, N6Kokkos10LayoutLeftE, N6Kokkos12Experimental4SYCLE409...): rand seed: 990578628
23:40:11 scalar: N6Kokkos7complexIdEE
23:40:11 layout: N6Kokkos10LayoutLeftE
23:40:11 m: 409, n: 409
23:40:11 ...end arguments for above failure.
23:40:11 
23:40:11 /home/jenkins/blake-new/workspace/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_Sycl_PV/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp:159: Failure
23:40:11 Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 3.55771 vs 3.71438e-09
23:40:11 row: 37, crs_col_ids_ref(3420) = 213 mismatched values!
23:40:11 Begin arguments for above failure...
23:40:11 RandCooMat<N6Kokkos7complexIdEE, N6Kokkos10LayoutLeftE, N6Kokkos12Experimental4SYCLE409...): rand seed: 990578628
23:40:11 scalar: N6Kokkos7complexIdEE
23:40:11 layout: N6Kokkos10LayoutLeftE
23:40:11 m: 409, n: 409
23:40:11 ...end arguments for above failure.

Sycl + PV, with MKL

Failing tests:

05:51:47 The following tests FAILED:
05:51:47 	  9 - blas_sycl (Failed)
05:51:47 	 15 - sparse_sycl (Failed)
05:51:47 	 16 - blocksparse_sycl (Failed)

Failure output snips:
blas_sycl

05:47:45 [ RUN      ] sycl_test.gemv_double
05:47:45 /home/jenkins/blake-new/workspace/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_Sycl_PV-oneMKL/kokkos-kernels/blas/unit_test/Test_Blas2_gemv.hpp:141: expected(0)=-1.50128, h_y(0)=nan, eps=2.22045e-16, 1024*2*eps=4.54747e-13
...
05:47:45 beta = 0, input contains NaN, A is 2131x2131, mode T: gemv incorrect
05:47:45 [  FAILED  ] sycl_test.gemv_double (643 ms)
05:47:45 [ RUN      ] sycl_test.blas_gemv_streams_double_int_int_TestDevice
05:47:45 /home/jenkins/blake-new/workspace/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_Sycl_PV-oneMKL/kokkos-kernels/blas/unit_test/Test_Blas2_gemv.hpp:141: expected(0)=19.9203, h_y(0)=nan, eps=2.22045e-16, 1024*2*eps=4.54747e-13
...
05:47:45 Value of: 0
05:47:45 Expected: numErrors
05:47:45 Which is: 40
05:47:45 beta = 0, input contains NaN, A is 50x40, mode T: gemv incorrect
05:47:45 [  FAILED  ] sycl_test.blas_gemv_streams_double_int_int_TestDevice (79 ms)

sparse_sycl:

05:50:55 [ RUN      ] sycl_test.sparse_coo2crs
05:50:55 /home/jenkins/blake-new/workspace/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_Sycl_PV-oneMKL/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp:159: Failure
05:50:55 Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 5.46403 vs 2.86173e-09
05:50:55 row: 19, crs_col_ids_ref(1595) = 288 mismatched values!
05:50:55 Begin arguments for above failure...
05:50:55 RandCooMat<N6Kokkos7complexIdEE, N6Kokkos10LayoutLeftE, N6Kokkos12Experimental4SYCLE359...): rand seed: 3195414009
05:50:55 scalar: N6Kokkos7complexIdEE
05:50:55 layout: N6Kokkos10LayoutLeftE
05:50:55 m: 359, n: 359
05:50:55 ...end arguments for above failure.
...
05:50:55 [ RUN      ] sycl_test.sparse_spmv_double_int_int_TestDevice
05:50:55 /home/jenkins/blake-new/workspace/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_Sycl_PV-oneMKL/kokkos-kernels/sparse/unit_test/Test_Sparse_spmv.hpp:216: Failure
05:50:55 Value of: threw
05:50:55   Actual: true
05:50:55 Expected: false
05:50:55 KokkosSparse::Test::spmv 1D, mode T: threw exception:
05:50:55 oneapi::mkl::sparse::gemv: unimplemented functionality: currently only supports the oneapi::mkl::transpose::nontrans op
...
05:50:55 [ RUN      ] sycl_test.sparse_spmv_mv_double_int_int_LayoutLeft_TestDevice
05:50:55 /home/jenkins/blake-new/workspace/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_Sycl_PV-oneMKL/kokkos-kernels/sparse/unit_test/Test_Sparse_spmv.hpp:268: Failure
05:50:55 Value of: threw
05:50:55   Actual: true
05:50:55 Expected: false
05:50:55 KokkosSparse::Test::spmv 2D, mode T: threw exception:
05:50:55 oneapi::mkl::sparse::gemv: unimplemented functionality: currently only supports the oneapi::mkl::transpose::nontrans op
...
05:50:55 [ RUN      ] sycl_test.sparse_sptrsv_double_int_int_TestDevice
05:50:55 unknown file: Failure
05:50:55 C++ exception with description "oneapi::mkl::sparse::optimize_gemv: unimplemented functionality: currently only supports the oneapi::mkl::transpose::nontrans op" thrown in the test body.
...
05:50:55 [  PASSED  ] 47 tests.
05:50:55 [  FAILED  ] 4 tests, listed below:
05:50:55 [  FAILED  ] sycl_test.sparse_coo2crs
05:50:55 [  FAILED  ] sycl_test.sparse_spmv_double_int_int_TestDevice
05:50:55 [  FAILED  ] sycl_test.sparse_spmv_mv_double_int_int_LayoutLeft_TestDevice
05:50:55 [  FAILED  ] sycl_test.sparse_sptrsv_double_int_int_TestDevice
05:50:55 
05:50:55  4 FAILED TESTS

blocksparse_sycl

05:51:18 [ RUN      ] sycl_test.sparse_bsr_spmv_double_int_int_TestDevice
05:51:18 unknown file: Failure
05:51:18 C++ exception with description "oneapi::mkl::sparse::optimize_gemv: unimplemented functionality: currently only supports the oneapi::mkl::transpose::nontrans op" thrown in the test body.
05:51:18 [  FAILED  ] sycl_test.sparse_bsr_spmv_double_int_int_TestDevice (1627 ms)
05:51:18 [ RUN      ] sycl_test.sparse_bsr_spmmv_double_int_int_LayoutLeft_TestDevice
05:51:18 unknown file: Failure
05:51:18 C++ exception with description "oneapi::mkl::sparse::optimize_gemv: unimplemented functionality: currently only supports the oneapi::mkl::transpose::nontrans op" thrown in the test body.
05:51:18 [  FAILED  ] sycl_test.sparse_bsr_spmmv_double_int_int_LayoutLeft_TestDevice (3108 ms)
05:51:18 [----------] 7 tests from sycl_test (22517 ms total)
05:51:18 
05:51:18 [----------] Global test environment tear-down
05:51:18 [==========] 7 tests from 1 test case ran. (22517 ms total)
05:51:18 [  PASSED  ] 5 tests.
05:51:18 [  FAILED  ] 2 tests, listed below:
05:51:18 [  FAILED  ] sycl_test.sparse_bsr_spmv_double_int_int_TestDevice
05:51:18 [  FAILED  ] sycl_test.sparse_bsr_spmmv_double_int_int_LayoutLeft_TestDevice

@ndellingwood
Contributor Author

Status update 7/12/2024:

After the recent gemv fallback updates, the Sycl builds are in better shape, with only the sparse_coo2crs test failure remaining:

sparse_sycl

23:40:11 [ RUN      ] sycl_test.sparse_coo2crs
23:40:11 /home/jenkins/blake-new/workspace/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_Sycl_PV/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp:159: Failure
23:40:11 Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1.37386 vs 3.71438e-09
23:40:11 row: 31, crs_col_ids_ref(2871) = 25 mismatched values!
23:40:11 Begin arguments for above failure...
23:40:11 RandCooMat<N6Kokkos7complexIdEE, N6Kokkos10LayoutLeftE, N6Kokkos12Experimental4SYCLE409...): rand seed: 990578628
23:40:11 scalar: N6Kokkos7complexIdEE
23:40:11 layout: N6Kokkos10LayoutLeftE
23:40:11 m: 409, n: 409
23:40:11 ...end arguments for above failure.
23:40:11 
23:40:11 /home/jenkins/blake-new/workspace/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_Sycl_PV/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp:159: Failure
23:40:11 Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 3.55771 vs 3.71438e-09
23:40:11 row: 37, crs_col_ids_ref(3420) = 213 mismatched values!
23:40:11 Begin arguments for above failure...
23:40:11 RandCooMat<N6Kokkos7complexIdEE, N6Kokkos10LayoutLeftE, N6Kokkos12Experimental4SYCLE409...): rand seed: 990578628
23:40:11 scalar: N6Kokkos7complexIdEE
23:40:11 layout: N6Kokkos10LayoutLeftE
23:40:11 m: 409, n: 409
23:40:11 ...end arguments for above failure.
