KokkosKernels_sparse_* tests are failing on ATDM cuda builds #3438

Closed
fryeguy52 opened this issue Sep 13, 2018 · 39 comments
Labels
client: ATDM Any issue primarily impacting the ATDM project PA: Data Services Issues that fall under the Trilinos Data Services Product Area pkg: KokkosKernels type: bug The primary issue is a bug in Trilinos code or tests

Comments

@fryeguy52
Contributor

fryeguy52 commented Sep 13, 2018

CC: @trilinos/kokkos-kernels , @kddevin (Trilinos Data Services Product Lead)

Next Action Status

The test KokkosKernels_sparse_serial_MPI_1 on 'waterman' has been passing without timing out in each 'debug' build on 'waterman' since 10/9/2018 as shown here.

Description

As shown in this query the tests:

  • KokkosKernels_sparse_serial_MPI_1
  • KokkosKernels_sparse_cuda_MPI_1

are failing in all the cuda builds on white, ride, hansen, and waterman:

  • Trilinos-atdm-waterman-gnu-debug-openmp
  • Trilinos-atdm-waterman-cuda-9.2-debug
  • Trilinos-atdm-waterman-cuda-9.2-opt
  • Trilinos-atdm-white-ride-cuda-9.2-opt
  • Trilinos-atdm-white-ride-cuda-9.2-debug-pt
  • Trilinos-atdm-white-ride-cuda-9.2-debug
  • Trilinos-atdm-hansen-shiller-cuda-8.0-opt
  • Trilinos-atdm-hansen-shiller-cuda-9.0-debug
  • Trilinos-atdm-hansen-shiller-cuda-9.0-opt

Here you can see that these tests started failing on 9/7/2018. At the bottom of this page is a list of commits that were new on that day.

Steps to Reproduce

One should be able to reproduce this failure as described in:

$ cd <some_build_dir>/

$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh cuda-debug

$ cmake \
  -GNinja \
  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
  -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_KokkosKernels=ON \
  $TRILINOS_DIR

$ make NP=16

$ bsub -x -Is -q rhel7F -n 16 ctest -j16
@fryeguy52 fryeguy52 added type: bug The primary issue is a bug in Trilinos code or tests pkg: KokkosKernels client: ATDM Any issue primarily impacting the ATDM project labels Sep 13, 2018
@bartlettroscoe
Member

bartlettroscoe commented Sep 13, 2018

@srajama1 and @kyungjoo-kim, looking at the commits pulled that day shown here, the likely commits that broke these tests were:

cbae754:  KokkosKernels - TPL interface should match not to confuse linker.
Author: Kyungjoo Kim (-EXP) <kyukim@bread.sandia.gov>
Date:   Tue Aug 28 10:29:32 2018 -0600

M	packages/kokkos-kernels/src/impl/tpls/KokkosBlas1_scal_tpl_spec_decl.hpp

...

c4fee29:  Merge remote-tracking branch 'upstream/develop' into tpetra-develop
Author: Kyungjoo Kim (-EXP) <kyukim@bread.sandia.gov>
Date:   Mon Aug 27 13:20:00 2018 -0600

217c7d1:  KokkosKernels - direct push commits from the KokkosKernels repo.
Author: Kyungjoo Kim (-EXP) <kyukim@bread.sandia.gov>
Date:   Mon Aug 27 13:05:10 2018 -0600

M	packages/kokkos-kernels/CMakeLists.txt
M	packages/kokkos-kernels/Makefile.kokkos-kernels
M	packages/kokkos-kernels/cmake/Dependencies.cmake
M	packages/kokkos-kernels/cmake/KokkosKernels_config.h.in
M	packages/kokkos-kernels/src/blas/KokkosBlas.hpp
M	packages/kokkos-kernels/unit_test/CMakeLists.txt

These commits were integrated into 'develop' in PR #3223 merged on 9/6/2018.

@kyungjoo-kim
Contributor

kyungjoo-kim commented Sep 13, 2018

There is a cuda error

[ RUN      ] cuda.sparse_spgemm_double_int_int_TestExecSpace
terminate called after throwing an instance of 'std::runtime_error'
  what():  cudaDeviceSynchronize() error( cudaErrorIllegalAddress): an illegal memory access was encountered /home/jenkins/ride/workspace/Trilinos-atdm-white-ride-cuda-9.2-opt/SRC_AND_BUILD/Trilinos/packages/kokkos/core/src/Cuda/Kokkos_Cuda_Impl.cpp:119
Traceback functionality not available

There is an error related to MKL macro issues, same error message as kokkos/kokkos-kernels#289

[ RUN      ] serial.sparse_spgemm_double_int_int_TestExecSpace
/home/jenkins/waterman/workspace/Trilinos-atdm-waterman-cuda-9.2-opt/SRC_AND_BUILD/Trilinos/packages/kokkos-kernels/unit_test/sparse/Test_Sparse_spgemm.hpp:366: Failure
Value of: (failed == is_expected_to_fail)
  Actual: false

My modification changes the cmake configuration to use the TPLs correctly. @srajama1 Does the spgemm test use any cublas internally?

@srajama1
Contributor

The spgemm tests use cusparse, yes. How did this pass spot checks?

@kyungjoo-kim
Contributor

kyungjoo-kim commented Sep 13, 2018

@ndellingwood

I think that 1) waterman and white use gcc 7.2 + cuda 9.2, and 2) the hansen cuda version also uses gcc.

The macro fix described in kokkos/kokkos-kernels#290 handles the case when Intel MKL is used.
This is the case when Intel MKL is not used. Do we have a test for this when we merge the PR?

@srajama1
Contributor

I don't see why we are trying to combine this with the MKL changes. This is a CUDA bug that got introduced in Trilinos. Let us resolve it separately.

@kyungjoo-kim
Contributor

Look at the error message from the serial build with gcc compilers. It is the same error described in kokkos/kokkos-kernels#289. The corresponding PR patch should be applied. If it is already applied, then we have another edge case.

@ndellingwood
Contributor

@srajama1 @kyungjoo-kim the fix in kokkos/kokkos-kernels#289 is merged into the kokkos-kernels develop branch. Does a patch need to be applied to Trilinos?

If the problem is similar to that issue then there may need to be an additional update to the offending test to guard here: test SPGEMM_CUSPARSE

Replace with something like
#if !defined(KERNELS_HAVE_CUSPARSE) && !defined(KOKKOSKERNELS_ENABLE_TPL_CUSPARSE)

where the second macro is defined in the config file
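
A minimal sketch of how such a guard could set the test's expectation, assuming the two macro names quoted above. The helper name `spgemm_cusparse_expected_to_fail` is hypothetical and is not part of the KokkosKernels test sources; it only illustrates the preprocessor condition.

```cpp
// Hypothetical helper (not in the KokkosKernels sources): reports whether the
// SPGEMM_CUSPARSE case should be expected to fail in this build.
constexpr bool spgemm_cusparse_expected_to_fail() {
#if !defined(KERNELS_HAVE_CUSPARSE) && !defined(KOKKOSKERNELS_ENABLE_TPL_CUSPARSE)
  // Neither macro is defined: treat cuSPARSE as unavailable.
  return true;
#else
  // At least one macro is defined: the cuSPARSE path is expected to work.
  return false;
#endif
}
```

With this condition, a build that defines either macro expects the cuSPARSE path to succeed, and a build that defines neither expects it to fail.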

@kyungjoo-kim
Contributor

kyungjoo-kim commented Sep 13, 2018

@ndellingwood Yes, please apply the patch to Trilinos. Would you also test this on waterman or white (no cuda test, just the serial or openmp test)?

@ndellingwood
Contributor

@kyungjoo-kim will do, should have PR up shortly. Should I include the CUSPARSE modification I suggested above as well?

ndellingwood added a commit to ndellingwood/Trilinos that referenced this issue Sep 13, 2018
Address Trilinos issue trilinos#3438, incorporates part of fix from
PR kokkos/kokkos-kernels#290
@kyungjoo-kim
Contributor

kyungjoo-kim commented Sep 13, 2018

@ndellingwood

Yes, please do so. I don't think that adding an ifdef guard for the TPLs is harmful. Thanks a lot.

@ndellingwood
Contributor

PR #3442 issued with (hopeful) fix.

@ndellingwood
Contributor

Following reproducer instructions to test now on white - is there any reason I shouldn't build on an interactive node rather than the head node?

@ndellingwood
Contributor

ndellingwood commented Sep 13, 2018

Head node is pretty busy, I started an interactive session and tried to compile but was unable to:
bsub -Is -n 1 -q rhel7F bash

bash-4.2$ make NP=16
ninja -C .  -j 16 ./all
ninja: Entering directory `.'
ninja: no work to do.

@bartlettroscoe is there a simple modification to the reproducer instructions so that I can build on an interactive node?

Edit: I didn't state the issue I'm having very clearly. I haven't built anything yet (clean start to the build) and received the message ninja: no work to do., so I need to modify the instructions in order to build on an interactive node (assuming this should work).

@ndellingwood
Contributor

I get the same output regardless of head node or interactive node...
On Head node

bash-4.2$ make NP=16
ninja -C .  -j 16 ./all
ninja: Entering directory `.'
ninja: no work to do.

@bartlettroscoe
Member

@ndellingwood,

What are the exact commands you are using?

Also, can you please do:

$ cmake <whatever-options-you-used> &> configure.txt

and then attach the configure.txt file? That should explain what is happening and why it is not building anything.

@ndellingwood
Contributor

ndellingwood commented Sep 13, 2018

Thanks @bartlettroscoe
Here was my set up:
TRILINOS_DIR

-bash-4.2$ echo $TRILINOS_DIR 
/ascldap/users/ndellin/trilinos/Trilinos

Build directory:

-bash-4.2$ pwd
/ascldap/users/ndellin/trilinos/Trilinos/Build/ATDM-cudadbg-issue3438

Then I followed the reproducer instructions:

$ cd /ascldap/users/ndellin/trilinos/Trilinos/Build/ATDM-cudadbg-issue3438

$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh cuda-debug

$ cmake \
  -GNinja \
  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
  -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_KokkosKernels=ON \
  $TRILINOS_DIR

$ make NP=16

I'll blow away the directory and relogin to start clean and post the configure output in next message

@ndellingwood
Contributor

Fresh start with the same procedure above led to same issue.

@bartlettroscoe attached is the configure.txt file, thanks for the help
configure.txt

@bartlettroscoe
Member

@ndellingwood, if you look at the bottom of the file configure.txt, you see the problem:



***
*** WARNING:  There were no packages configured so no libraries or tests/examples will be built!
***


Generating dummy makefiles in each directory to call Ninja ...


Set up for creating a distribution ...


Finished configuring Trilinos!

-- Configuring done
-- Generating done
CMake Warning:
  Manually-specified variables were not used by the project:

    Trilinos_ENABLE_KokkosKernals

Is it obvious how to fix this?

@bartlettroscoe
Member

FYI: I fixed the typo in the "Steps to Reproduce" above. Run those instructions and it should work now. Sorry about that.

@ndellingwood
Contributor

@bartlettroscoe working fine now, thanks for the help! Will post results when it completes.

@ndellingwood
Contributor

@kyungjoo-kim @srajama1
Sparse tests are still failing, fixes from kokkos-kernels develop branch were no help :(


-bash-4.2$ bsub -x -Is -q rhel7F -n 16 ctest -j16
***Forced exclusive execution
Job <36964> is submitted to queue <rhel7F>.
<<Waiting for dispatch ...>>
<<Starting on white22>>
Test project /ascldap/users/ndellin/trilinos/Trilinos/Build/ATDM-cudadbg-issue3438
    Start 1: KokkosKernels_blas_cuda_MPI_1
    Start 2: KokkosKernels_sparse_cuda_MPI_1
    Start 3: KokkosKernels_graph_cuda_MPI_1
    Start 4: KokkosKernels_common_cuda_MPI_1
    Start 5: KokkosKernels_blas_serial_MPI_1
    Start 6: KokkosKernels_sparse_serial_MPI_1
    Start 7: KokkosKernels_graph_serial_MPI_1
    Start 8: KokkosKernels_common_serial_MPI_1
1/8 Test #8: KokkosKernels_common_serial_MPI_1 ...   Passed    1.20 sec
2/8 Test #4: KokkosKernels_common_cuda_MPI_1 .....   Passed    1.21 sec
3/8 Test #3: KokkosKernels_graph_cuda_MPI_1 ......   Passed   38.30 sec
4/8 Test #1: KokkosKernels_blas_cuda_MPI_1 .......   Passed   52.65 sec
5/8 Test #2: KokkosKernels_sparse_cuda_MPI_1 .....***Failed   95.85 sec
6/8 Test #5: KokkosKernels_blas_serial_MPI_1 .....   Passed  153.74 sec
7/8 Test #7: KokkosKernels_graph_serial_MPI_1 ....   Passed  194.39 sec
8/8 Test #6: KokkosKernels_sparse_serial_MPI_1 ...***Failed  256.09 sec

75% tests passed, 2 tests failed out of 8

Subproject Time Summary:
KokkosKernels    = 793.44 sec*proc (8 tests)

Total Test time (real) = 256.14 sec

The following tests FAILED:
	  2 - KokkosKernels_sparse_cuda_MPI_1 (Failed)
	  6 - KokkosKernels_sparse_serial_MPI_1 (Failed)

@ndellingwood
Contributor

Running the individual tests on an interactive node, here is some additional output:


bash-4.2$ ./KokkosKernels_sparse_cuda.exe --gtest_filter=cuda.sparse_spgemm_double_int_int_TestExecSpace
Note: Google Test filter = cuda.sparse_spgemm_double_int_int_TestExecSpace
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from cuda
[ RUN      ] cuda.sparse_spgemm_double_int_int_TestExecSpace
terminate called after throwing an instance of 'std::runtime_error'
  what():  cudaDeviceSynchronize() error( cudaErrorIllegalAddress): an illegal memory access was encountered ../../packages/kokkos/core/src/Cuda/Kokkos_Cuda_Impl.cpp:119
Traceback functionality not available

And running through cuda-gdb (output abbreviated):

bash-4.2$ cuda-gdb --args ./KokkosKernels_sparse_cuda.exe --gtest_filter=cuda.sparse_spgemm_double_int_int_TestExecSpace
...
[ RUN      ] cuda.sparse_spgemm_double_int_int_TestExecSpace

CUDA Exception: Warp Out-of-range Address

Thread 1 "KokkosKernels_s" received signal CUDA_EXCEPTION_5, Warp Out-of-range Address.
[Switching focus to CUDA kernel 0, grid 97, block (90,0,0), thread (32,0,0), device 0, sm 0, warp 24, lane 0]
0x000000001d4ec860 in void csrgemmNnz_kernel2<128, 32, 2, 4>(csrgemmNnz_params)
   <<<(2500,1,1),(128,1,1)>>> ()

@ndellingwood
Contributor

The MKL change in PR #3442 fixed the MKL issue, but there are still problems with CUSPARSE.

I added some print statements to the test, this is a summary of what I'm seeing:

cuda.sparse_spgemm_double_int_int_TestExecSpace

  • KERNELS_HAVE_CUSPARSE is not defined
  • Since above is undefined, the test is expected to fail
  • Test throws a runtime error causing termination, but is not caught by the try... catch... catch... blocks which would have set the failed flag to true (as expected)
terminate called after throwing an instance of 'std::runtime_error'
  what():  cudaDeviceSynchronize() error( cudaErrorIllegalAddress): an illegal memory access was encountered ../../packages/kokkos/core/src/Cuda/Kokkos_Cuda_Impl.cpp:119

@kyungjoo-kim @srajama1 based on the testing setup, should this error have been caught or instead should it not even run unless cusparse kernels are enabled? I'm guessing the latter...

serial.sparse_spgemm_double_int_int_TestExecSpace
This test's issue is weirder than the previous one:

  • KERNELS_HAVE_CUSPARSE is not defined
  • Since above is undefined, the test is expected to fail
  • The test does not fail and does not throw an error! So the failed flag is never set to true, leading to the discrepancy and runtime error that it should have failed but did not, even though this is the serial test.
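
The check described in both cases can be sketched as follows. This is an illustration only: `run_and_report_failure`, `outcome_matches_expectation`, and `usage_example` are hypothetical names, and the catch clause types are assumed rather than copied from Test_Sparse_spgemm.hpp.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <stdexcept>

// Hypothetical stand-in for the test body: runs the kernel and reports
// whether it failed by throwing. The handler types are assumptions.
inline bool run_and_report_failure(const std::function<void()>& kernel) {
  bool failed = false;
  try {
    kernel();
  } catch (const char*) {
    failed = true;  // failure reported as a C string
  } catch (const std::exception&) {
    failed = true;  // failure reported as a standard exception
  }
  return failed;
}

// Mirrors the reported gtest condition (failed == is_expected_to_fail).
inline bool outcome_matches_expectation(bool failed, bool is_expected_to_fail) {
  return failed == is_expected_to_fail;
}

// Usage sketch: a kernel that throws inside the try block is reported as
// failed, and one that returns normally is not.
inline void usage_example() {
  assert(run_and_report_failure([] { throw std::runtime_error("illegal address"); }));
  assert(!run_and_report_failure([] {}));
}
```

In the serial case, `failed` stays false while `is_expected_to_fail` is true, so the comparison is false, which matches the `Actual: false` line in the reported gtest output. In the cuda case the process terminates before any handler can set `failed`; the output above does not establish why the exception escapes the handlers.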

@ndellingwood
Contributor

As an extra data point, the kokkos-kernels sparse tests pass with VOTD kokkos-kernels + kokkos on the same White queue when using devpack/20180521/openmpi/3.1.0/gcc/7.2.0/cuda/9.2.88 and also with the same atdm environment reported in this issue (I used the generate_makefile script for the testing in this case); need to determine if this points to CMake-related issues or possible code changes.

@bartlettroscoe
Member

CC: @bathmatt, @jmgate

FYI: The merge of PR #3442 on 9/14/2018 does not appear to have fixed any of the individual failing unit tests on any of the builds on any of the platforms shown yesterday, for example, here.

What is the next course of action? Should the original merge commit be reverted while this gets fixed offline? NOTE: The EMPIRE team has flagged the original PR #3223 as breaking the usage of Trilinos for EMPIRE. Therefore, there is a strong case to back out that PR merge commit and fix this offline so that EMPIRE can get an updated version of Trilinos.

@jmgate
Contributor

jmgate commented Sep 17, 2018

@bartlettroscoe, I don't think this was holding EMPIRE up. We should see a successful update to our fork happen soon. No need to revert unless I get back in touch.

@bartlettroscoe
Member

@bartlettroscoe, I don't think this was holding EMPIRE up. We should see a successful update to our fork happen soon. No need to revert unless I get back in touch.

@jmgate, I thought that your bisection study showed that some commit in PR #3442 broke the EMPIRE test suite?

@ndellingwood
Contributor

Rechecked the kokkos-kernels VOTD tests: in my previous check I didn't enable the cusparse tpl properly, and the same tests are failing on the kokkos-kernels develop branch.
@kyungjoo-kim can you look into this? The MKL fix in #3442 was necessary but only a partial fix, with the MKL issue likely masked by these CUSPARSE errors (which occur before the MKL path gets hit).

@kyungjoo-kim
Contributor

@ndellingwood I agree. I have not had time to fully check the test yet, but I also think that the cusparse version is not tested in spotcheck. The spgemm test passes 3 tests in kokkoskernels, all but the cusparse version. Let's remove the cusparse version from the test list for now and think about this in kokkoskernels.

@jmgate
Contributor

jmgate commented Sep 18, 2018

@jmgate, I thought that your bisection study showed that some commit in PR #3442 broke the EMPIRE test suite?

@bartlettroscoe, yes, that is what two independent bisects showed, but you know how bisecting Trilinos goes. @ccober6 did a manual bisect and found our actual problem elsewhere.

@bartlettroscoe
Member

@bartlettroscoe, yes, that is what two independent bisects showed, but you know how bisecting Trilinos goes. @ccober6 did a manual bisect and found our actual problem elsewhere.

@jmgate, did you only bisect on the first-parent merge commits directly on the 'develop' branch or did you try to bisect all commits? I think that the PR process only provides confidence that you can bisect on commits that pass PR testing. (NOTE: The Trilinos GitHub project currently allows people to rebase and push, which destroys the ability to bisect robustly on the first-parent commits on 'develop'. Therefore, if some Trilinos developers are doing this, this might explain why your bisect study did not do what it should. See #2726.)

@jmgate
Contributor

jmgate commented Sep 18, 2018

@jmgate, did you only bisect on the first-parent merge commits directly on the 'develop' branch or did you try to bisect all commits?

Both, just to be sure.

@bartlettroscoe
Member

Both, just to be sure.

@jmgate, if you try to bisect on all of the commits, I think you are almost guaranteed to get false failures since some of these commits may not even configure. And even if a commit configures and builds, there is a good chance that the tests will not pass. I guess if you just skipped over commits that did not configure and build Trilinos or EMPIRE and just looked at your particular EMPIRE test, then you might be able to more robustly bisect all of the commits. Is this what you did?

@jmgate
Contributor

jmgate commented Sep 18, 2018

@bartlettroscoe, we don't need to have this conversation, particularly in this issue.

@kyungjoo-kim
Contributor

here

We move this problem into kokkoskernels and fix the problem there. As this problem is gone from Trilinos testing, we close this issue.

@kyungjoo-kim
Contributor

kyungjoo-kim commented Sep 20, 2018

I just found a timeout failure from Gauss-Seidel on waterman.

@srajama1 Who is working on GS now? A simple solution is to disable the test for the debug build. Another solution is to simplify the testing.

Anyway, I cannot help with this issue anymore, as the timeout failure shows up on waterman, which I do not have access to.

@bartlettroscoe
Member

FYI: PR #3559 will disable the timing-out KokkosKernels_sparse_serial_MPI_1 test for now in the debug builds on 'waterman'.

However, I will work to get an opt-debug build running shortly that uses CMAKE_BUILD_TYPE=RELEASE but Trilinos_ENABLE_DEBUG=ON, which will hopefully let us re-enable a lot of these timing-out tests on many of these builds.

@bartlettroscoe
Member

Commit 837804d, merged in PR #3559 on 10/3/2018, disabled some of the slower-running unit tests in the test executable KokkosKernels_sparse_serial_MPI_1 on 'waterman'. That has resulted in this test passing without timing out in each 'debug' build on 'waterman' since 10/9/2018 as shown here.

With the new release-debug builds running on 'waterman', all of the unit tests in this test executable are run with runtime debug-mode checking. Therefore, I will close this as complete and not leave open with "Disabled Tests".

@mhoemmen mhoemmen changed the title KokkosKernals_sparse_* tests are failing on ATDM cuda builds KokkosKernels_sparse_* tests are failing on ATDM cuda builds Oct 21, 2018
@bartlettroscoe bartlettroscoe added the PA: Data Services Issues that fall under the Trilinos Data Services Product Area label Nov 30, 2018
tjfulle pushed a commit to tjfulle/Trilinos that referenced this issue Dec 6, 2018
Address Trilinos issue trilinos#3438, incorporates part of fix from
PR kokkos/kokkos-kernels#290