Nightly test failures - sparse_{cuda,hip} (child aborted) #1373

Closed
ndellingwood opened this issue Mar 27, 2022 · 6 comments

@ndellingwood

Following the merge of #1342, the sparse unit tests began failing in sparse_csc2csr in some of our CUDA and HIP builds:

02:59:33 4: [ RUN      ] cuda.sparse_csc2csr
02:59:33 4: cudaStreamSynchronize(stream) error( cudaErrorIllegalAddress): an illegal memory access was encountered /home/jenkins/jenkins-new/workspace/KokkosKernels_KokkosDev2_CUDA10_1/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:162
02:59:33 4: Backtrace:
02:59:33 4:                        [0x12368a3]
02:59:33 4:                        [0x122ee08]
02:59:33 4:                        [0x122ee8b]
02:59:33 4:                        [0x123b75e]
02:59:33 4:                        [0x123b946]
02:59:33 4:                        [0x123c93e]
02:59:33 4:                         [0x82b96c]
02:59:33 4:                         [0x82e01e]
02:59:33 4:                         [0x82e640]
02:59:33 4:                         [0x82eab8]
02:59:33 4:                         [0x47c826]
02:59:33 4:                        [0x12160ea]
02:59:33 4:                        [0x120a736]
02:59:33 4:                        [0x120aced]
02:59:33 4:                        [0x120af25]
02:59:33 4:                        [0x120c8df]
02:59:33 4:                        [0x120cba1]
02:59:33 4:                         [0x408440]
02:59:33 4: __libc_start_main [0x7f46e68ee555]
02:59:33 4:                         [0x44c209]
02:59:33  4/23 Test  #4: sparse_cuda ......................Child aborted***Exception: 102.67 sec

@e10harvey are you able to investigate?

Reproducer (kokkos-dev-2):

source /projects/sems/modulefiles/utils/sems-archive-modules-init.sh ; module use /home/projects/x86-64/modulefiles/local
module load sems-archive-env sems-archive-cmake/3.17.1 sems-archive-gcc/7.3.0 sems-archive-cuda/10.1

$KOKKOSKERNELS_PATH/cm_generate_makefile.bash --with-devices=Cuda,OpenMP --arch=SNB,Volta70 --compiler=$KOKKOS_PATH/bin/nvcc_wrapper --cxxflags="-O3 -Wall -Wunused-parameter -Wshadow -pedantic -Werror -Wsign-compare -Wtype-limits -Wuninitialized " --cxxstandard="14" --kokkos-path=$KOKKOS_PATH --kokkoskernels-path=$KOKKOSKERNELS_PATH --with-scalars='double,complex_double' --with-ordinals=int --with-offsets=int,size_t --with-layouts=LayoutLeft --with-options=disable_deprecated_code --no-examples

e10harvey self-assigned this on Mar 28, 2022

@e10harvey

Thanks, @ndellingwood! Yes, I will investigate later today.

lucbv mentioned this issue on Mar 28, 2022

@e10harvey

@lucbv, @ndellingwood: Do you know why this runtime error is not showing up in the CUDA CI checks?

The CUDA 10 CI check runs:

../kokkos-kernels/scripts/cm_test_all_sandia --spot-check-tpls cuda/10.1.243 --kokkos-path=../kokkos --kokkoskernels-path=../kokkos-kernels --arch=Power9,Volta70
../kokkos-kernels/scripts/cm_test_all_sandia --spot-check-tpls cuda/10.1.243 --kokkos-path=../kokkos --kokkoskernels-path=../kokkos-kernels --arch=Power9,Volta70 --no-default-eti --with-layouts=LayoutRight --with-spaces=hostspace,cudaspace,cudauvmspace

@e10harvey

It looks like this has to do with copying views from host to device by reference. This is fixed by #1375:

../../cm_generate_makefile.bash --with-devices=Cuda,OpenMP --arch=SNB,Volta70 --compiler=$KOKKOS_PATH/bin/nvcc_wrapper --cxxflags="-O3 -Wall -Wunused-parameter -Wshadow -pedantic -Werror -Wsign-compare -Wtype-limits -Wuninitialized " --cxxstandard="14" --kokkos-path=$KOKKOS_PATH --kokkoskernels-path=$KOKKOSKERNELS_PATH --with-scalars='double,complex_double' --with-ordinals=int --with-offsets=int,size_t --with-layouts=LayoutLeft --with-options=disable_deprecated_code --no-examples ../../
make -j16 KokkosKernels_sparse_cuda
unit_test/KokkosKernels_sparse_cuda --gtest_filter='*csc*'
[issue1373]$ unit_test/KokkosKernels_sparse_cuda --gtest_filter='*csc*'
Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
  In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
  For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
  For unit testing set OMP_PROC_BIND=false
Note: Google Test filter = *csc*
[==========] Running 2 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 2 tests from cuda
[ RUN      ] cuda.sparse_randcscmat
[       OK ] cuda.sparse_randcscmat (104 ms)
[ RUN      ] cuda.sparse_csc2csr
[       OK ] cuda.sparse_csc2csr (562 ms)
[----------] 2 tests from cuda (666 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test case ran. (666 ms total)
[  PASSED  ] 2 tests.
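
For illustration only, here is a minimal host-only C++ sketch of the kind of capture problem described above. It is not the code from #1375 and does not use Kokkos: `ViewHandle`, `make_fill_task`, and `sum_after_fill` are hypothetical names, and `ViewHandle` is an assumed stand-in for a view-like handle that shares ownership of its buffer. The point is that a deferred task should hold its own copy of the handle rather than a reference to a variable that lives on the host stack.

```cpp
#include <cstddef>
#include <functional>
#include <memory>
#include <vector>

// Hypothetical stand-in for a view: a cheap handle that shares ownership of
// its buffer, so copying the handle does not copy the data.
struct ViewHandle {
  std::shared_ptr<std::vector<int>> data;
  explicit ViewHandle(std::size_t n)
      : data(std::make_shared<std::vector<int>>(n, 0)) {}
};

// Build a deferred "kernel" that fills the buffer with `value`.
// The handle is captured BY VALUE, so the task owns its own copy of the
// handle and stays valid after this function returns.
// Capturing with [&v] instead would leave the task holding a reference to
// the parameter `v`, which is destroyed when this function returns. On a
// GPU the analogous mistake hands the device a host stack address.
std::function<void()> make_fill_task(ViewHandle v, int value) {
  return [v, value]() {
    for (int &x : *v.data) x = value;
  };
}

// Run the task and sum the buffer through the original handle, which shares
// the same underlying storage as the captured copy.
int sum_after_fill(int value) {
  ViewHandle v(4);
  std::function<void()> task = make_fill_task(v, value);
  task();
  int sum = 0;
  for (int x : *v.data) sum += x;
  return sum;
}
```

With four elements, `sum_after_fill(3)` returns 12, because the captured copy and the caller's handle refer to the same buffer.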

@ndellingwood

Thanks for quickly addressing this, @e10harvey, nightlies are passing :)

@ndellingwood

> Do you know why this runtime error is not showing up in the CUDA CI checks?

@e10harvey I'm not certain. The nightlies typically test cuda/10.1.105, and the failures were on x86+Volta70; the cuda/10.1.105 builds on weaver (Power9+Volta70) passed, though I'm not sure why a differing host architecture would be the underlying cause here.
