
cuda.graph_graph_color* COLORING_VBD test failures with cuda/9.2 + gcc/7.2 on White #317

Closed
ndellingwood opened this issue Oct 8, 2018 · 27 comments


@ndellingwood
Contributor

ndellingwood commented Oct 8, 2018

The Jenkins job KokkosKernels_White_CudaOpenMP_cuda_92_gcc_720 shows at least the following test failure:

00:03:47 [ FAILED ] cuda.graph_graph_color_double_int_int_TestExecSpace

Subsequent tests were also failing in the Jenkins job, but this may be due to the first test failure. After reproducing and filtering out the first failing test I'll post other valid failures.

This began failing after updating the test_all_sandia script to match the versions of cuda and gcc tested by Kokkos and the Trilinos ATDM scripts.

Edit: Added the name of the Jenkins job.

@ndellingwood
Contributor Author

Here is the output of that failing test:

00:03:47 [ RUN      ] cuda.graph_graph_color_double_int_int_TestExecSpace
00:03:47 unknown file: Failure
00:03:47 C++ exception with description "cudaMemcpy( dst , src , n , cudaMemcpyDefault ) error( cudaErrorIllegalAddress): an illegal memory access was encountered /home/jenkins/white/workspace/KokkosKernels_White_CudaOpenMP_cuda_92_gcc_720/kokkos/core/src/Cuda/Kokkos_CudaSpace.cpp:92
00:03:47 Traceback functionality not available
00:03:47 " thrown in the test body.
00:03:47 [  FAILED  ] cuda.graph_graph_color_double_int_int_TestExecSpace (268 ms)
00:03:47 [ RUN      ] cuda.graph_graph_color_double_int64_t_int_TestExecSpace


@srajama1
Contributor

srajama1 commented Oct 8, 2018

Is this distance-1 or distance-2 coloring? Might this be related to what @lucbv is also working on (on KNL)?

@william76
Contributor

This will be the D1 graph coloring, I believe... if I read that correctly, it's failing in unit_test/graph/Test_Graph_graph_color.hpp, which is D1.

@william76
Contributor

The Jenkins job is KokkosKernels_White_CudaOpenMP_cuda_92_gcc_720; looking at its build-time trend:

It looks like from 8/24 until 10/2 the execution time had been around 10 seconds, which indicates the tests weren't actually running (even though the Jenkins job returned SUCCESS).

Starting on 10/3 the execution times have been in the 24-minute range. Before that, from 7/12 to 8/24, the runs were typically taking 7h 0m and timing out in the queue, so they never really ran either.

My read on this is that this test hadn't really been run between 7/15 and 10/2. :(

Looking at the console output from last night's run, there is a lot more than just the 1 failing test shown above:

00:03:47 
00:03:47 113 FAILED TESTS
00:03:47 terminate called after throwing an instance of 'std::runtime_error'
00:03:47   what():  cudaMemcpy( dst , src , n , cudaMemcpyDefault ) error( cudaErrorIllegalAddress): an illegal memory access was encountered /home/jenkins/white/workspace/KokkosKernels_White_CudaOpenMP_cuda_92_gcc_720/kokkos/core/src/Cuda/Kokkos_CudaSpace.cpp:89
00:03:47 Traceback functionality not available
00:03:47 
00:03:47 make[2]: *** [test-cuda] Aborted (core dumped)
00:03:47 make[2]: Leaving directory `/home/jenkins/white/workspace/KokkosKernels_White_CudaOpenMP_cuda_92_gcc_720/TestAll_2018-10-07_23.40.01/cuda/9.2.88/OpenMP_Cuda-release/unit_test'
00:03:47 make[1]: *** [test] Error 2
00:03:47 make[1]: Leaving directory `/home/jenkins/white/workspace/KokkosKernels_White_CudaOpenMP_cuda_92_gcc_720/TestAll_2018-10-07_23.40.01/cuda/9.2.88/OpenMP_Cuda-release/unit_test'
00:03:47 make: *** [test] Error 2
00:03:48 #######################################################
00:03:48 PASSED TESTS
00:03:48 #######################################################
00:03:48 #######################################################
00:03:48 FAILED TESTS
00:03:48 #######################################################
00:03:48 cuda-9.2.88-OpenMP_Cuda-release (test failed)

but as far as I can tell they all fail on that same cudaMalloc call:

00:03:47 [ RUN      ] cuda.batched_scalar_team_gemm_t_t_dcomplex_dcomplex
00:03:47 unknown file: Failure
00:03:47 C++ exception with description "cudaMalloc( &ptr, arg_alloc_size ) error( cudaErrorIllegalAddress): an illegal memory access was encountered /home/jenkins/white/workspace/KokkosKernels_White_CudaOpenMP_cuda_92_gcc_720/kokkos/core/src/Cuda/Kokkos_CudaSpace.cpp:175
00:03:47 Traceback functionality not available
00:03:47 " thrown in the test body.
00:03:47 [  FAILED  ] cuda.batched_scalar_team_gemm_t_t_dcomplex_dcomplex (0 ms)

@ndellingwood Tribal-knowledge question: how are the tests driven by KokkosKernels? Does it generate one monolithic test-everything binary with all the tests dumped in, so a failure in one can affect the others, or does it generate separate test binaries by package or by test, so a failure in one won't disrupt the rest? Normally I'd think that 113 test failures imply something in the core library is suspect, but it sounds like KokkosKernels builds the single monolithic binary that includes everything, so one failing test can make others fail even if they don't depend on the thing that's actually broken?

@ndellingwood
Contributor Author

I'm getting a build going on White right now to see if I can triage a bit better. When I ran a spot-check for this PR 12 days ago, this test was not failing: #283 (comment)
That was before the test_all_sandia update, but I had manually updated my test_all_sandia script to test cuda/9.2.

@ndellingwood
Contributor Author

@william76

does it generate 1 mongo test-everything binary with all the tests dumped in so a failure on one can affect the others

It's closer to the '1 mongo test' setup, which is likely the trigger for at least some of the other failing tests.

@ndellingwood
Contributor Author

@william76 For reference, to run this particular unit test you can use gtest_filter:

./KokkosKernels_UnitTest_Cuda --gtest_filter=cuda.graph_graph_color_double_int_int_TestExecSpace

And to run all tests excluding this one, prefix the pattern with a minus sign to remove it, i.e.

./KokkosKernels_UnitTest_Cuda --gtest_filter=-cuda.graph_graph_color_double_int_int_TestExecSpace

@ndellingwood
Contributor Author

All other tests pass if I filter out all the graph_graph_color tests:
./KokkosKernels_UnitTest_Cuda --gtest_filter=-cuda.graph_graph_color_*

@ndellingwood
Contributor Author

More detailed output for the double type failing test:

[ RUN      ] cuda.graph_graph_color_double_int_int_TestExecSpace
/ascldap/users/ndellin/kokkos-kernels-testing/unit_test/graph/Test_Graph_graph_color.hpp:171: Failure
Value of: (num_conflict == conf)
  Actual: false
Expected: true
/ascldap/users/ndellin/kokkos-kernels-testing/unit_test/graph/Test_Graph_graph_color.hpp:173: Failure
Value of: (num_conflict == 0)
  Actual: false
Expected: true
/ascldap/users/ndellin/kokkos-kernels-testing/unit_test/graph/Test_Graph_graph_color.hpp:171: Failure
Value of: (num_conflict == conf)
  Actual: false
Expected: true
/ascldap/users/ndellin/kokkos-kernels-testing/unit_test/graph/Test_Graph_graph_color.hpp:173: Failure
Value of: (num_conflict == 0)
  Actual: false
Expected: true
[  FAILED  ] cuda.graph_graph_color_double_int_int_TestExecSpace (2818 ms)

@william76
Contributor

@ndellingwood ooh... cool! Thanks for posting about that option.

That output looks different from the output in the console on Jenkins, which showed the illegal memory access exception. Does the gtest_filter option cause the test to do something different?

@ndellingwood
Contributor Author

ndellingwood commented Oct 8, 2018

@william76 I built with debug enabled and ran it through cuda-gdb, not sure if that is the reason. It's probably because I ran the test in isolation.

Running the code with cuda-gdb
cuda-gdb --args ./KokkosKernels_UnitTest_Cuda --gtest_filter=cuda.graph_graph_color_double_int_int_TestExecSpace

When setting breakpoints you may need:
set breakpoint pending on

@ndellingwood
Contributor Author

ndellingwood commented Oct 8, 2018

./KokkosKernels_UnitTest_Cuda --gtest_filter=cuda.graph_graph_color_double_int_int_TestExecSpace runs two tests.

Failing algorithm: COLORING_VBD

First test:
bandwidth = 200

conf num_conflicts
2762594 22100752

Second test:
bandwidth = 200

conf num_conflicts
2530004 20240032

@ndellingwood
Contributor Author

Updating the issue to mark more clearly the failing test and algorithm.

@ndellingwood ndellingwood changed the title cuda/9.2 + gcc/7.2 test failure(s) cuda.graph_graph_color* COLORING_VBD test failures with cuda/9.2 + gcc/7.2 on White Oct 8, 2018
@ndellingwood
Contributor Author

@lucbv it looks like the COLORING_VBD test is the deterministic vertex-based coloring, did you work on this and do you have time to look into this?

@ndellingwood
Contributor Author

One quick note: line 103 of kokkos-kernels/src/graph/KokkosGraph_graph_color.hpp should be updated to replace the deprecated dimension_0() method with extent(0).

@ndellingwood
Contributor Author

ndellingwood commented Oct 8, 2018

Reproducer instructions:

Log onto White:
ssh white

Modules:
module load cuda/9.2.88 gcc/7.2.0

In testing directory generate makefile:
cd MY_TESTING_DIR
../../scripts/generate_makefile.bash --with-cuda --with-openmp --arch="Power8,Kepler37" --debug --compiler=${HOME}/kokkos/bin/nvcc_wrapper

Allocate a node (interactive session) in the Kepler queue rhel7F

Install the lib (within testing directory)
make install-lib -j16

Build the unit tests
cd unit_test
make -j32

Run the test:
./KokkosKernels_UnitTest_Cuda --gtest_filter=cuda.graph_graph_color_double_int_int_TestExecSpace

@srajama1
Contributor

srajama1 commented Oct 9, 2018

The VBD coloring should not be run on GPUs. @lucbv

@william76
Contributor

@ndellingwood
Are you doing anything special to get the conf and num_conflicts values? I stuck a printout into Test_Graph_graph_color.hpp at line 171, just before the EXPECT_TRUE(...) lines. Asking in case there's some special option that makes the system print the values from the EXPECT_TRUE macros.

Thus far I cannot replicate what you're getting; this is what I see on rhel7F nodes on White:

$ ./KokkosKernels_UnitTest_Cuda --gtest_filter=cuda.graph_graph_color_double_int_int_TestExecSpace
Note: Google Test filter = cuda.graph_graph_color_double_int_int_TestExecSpace
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from cuda
[ RUN      ] cuda.graph_graph_color_double_int_int_TestExecSpace
num_conflict = 0
conf         = 0
num_conflict = 0
conf         = 0
num_conflict = 0
conf         = 0
num_conflict = 0
conf         = 0
num_conflict = 0
conf         = 0
num_conflict = 0
conf         = 0
unknown file: Failure
C++ exception with description "cudaGetLastError() error( cudaErrorIllegalAddress): an illegal memory access was encountered /ascldap/users/wcmclen/dev/kk-dev/kk-test/white-ride-Serial-Cuda/kokkos/install/include/Cuda/Kokkos_CudaExec.hpp:401
Traceback functionality not available
" thrown in the test body.
[  FAILED  ] cuda.graph_graph_color_double_int_int_TestExecSpace (433 ms)
[----------] 1 test from cuda (434 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (435 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] cuda.graph_graph_color_double_int_int_TestExecSpace

 1 FAILED TEST
terminate called after throwing an instance of 'std::runtime_error'
  what():  cudaMemcpy( dst , src , n , cudaMemcpyDefault ) error( cudaErrorIllegalAddress): an illegal memory access was encountered /ascldap/users/wcmclen/dev/kk-dev/source/Kokkos/core/src/Cuda/Kokkos_CudaSpace.cpp:89
Traceback functionality not available

Aborted

It still fails with that memory error, but I'm not getting nonzero values in num_conflict and conf...

@srajama1
Contributor

srajama1 commented Oct 9, 2018

@william76: You probably missed my comment above. The VBD option should not be used on GPUs. I don't think this was what @lucbv planned.

@ndellingwood
Contributor Author

@william76 I added print statements and manual intervention/recompilation ;)

@srajama1 @lucbv I can add macros to guard the test so it only runs for host architectures; does this sound like the right thing to do?

@lucbv
Contributor

lucbv commented Oct 9, 2018

@srajama1 @ndellingwood Indeed, this test is not meant to run on GPUs; VBD is likely to use more memory than is available on a GPU. VBDBIT, on the other hand, should do just fine on GPUs.

@ndellingwood
Contributor Author

PR #318 submitted with a fix by @william76

@srajama1
Contributor

srajama1 commented Oct 9, 2018

@ndellingwood @william76 Thanks for resolving this!

@william76
Contributor

@ndellingwood
with PR #318 merged, should we close this issue?

Or is the process to wait an extra day to let the nightlies run?

@lucbv
Contributor

lucbv commented Oct 10, 2018

@william76 we need to keep this open, as the PR has also disabled VBDBIT, which should compile and run fine on GPUs.

@ndellingwood
Contributor Author

@william76 Also, we like to keep issues open, even if the fix is in develop, until the kokkos+kokkos-kernels promotion. It helps with tracking fixes in the changelog that we generate with each promotion.
