cuda.graph_graph_color* COLORING_VBD test failures with cuda/9.2 + gcc/7.2 on White #317
Comments
Here is the output of that failing test:
Is this distance-1 or distance-2 coloring? Might be related to what @lucbv is also working on (on KNL)?
This will be the D1 graph coloring I believe... If I read that correctly, it's failing in
The Jenkins job is KokkosKernels_White_CudaOpenMP_cuda_92_gcc_720 ... with build-time trend. It looks like from 8/24 until 10/2 the execution time had been around 10 seconds -- which indicates the tests weren't actually running (even though the Jenkins job returned Starting on 10/3 the execution times have been in the 24 minute range, but from 7/12 to 8/24 the runs were typically taking 7h 0m and timing out in the queue so they never really ran. My read on this is that this test hadn't really been run between 7/15 and 10/2. :( Looking at the console output from last night's run, there is a lot more than just the 1 failing test shown above:
but they are all on that cudaMalloc instruction as far as I can tell:
@ndellingwood Tribal knowledge question: How are the tests driven by KokkosKernels, does it generate 1 mongo test-everything binary with all the tests dumped in so a failure on one can affect the others, or does it generate separate test binaries by package or by test so a failure in one won't disrupt others? Normally I'd think that if I see 113 test fails then it'd imply something in the core library is suspect -- but it sounds like in KokkosKernels the testing is designed to make the single mongo-binary that includes everything so one test failing can affect others and make them fail even if they're not dependent on the thing that's failing?
I'm getting a build going on White right now to see if I can triage a bit better. When I ran a spot-check for this PR 12 days ago this test was not failing, #283 (comment)
It's closer to the '1 mongo test' which is likely the trigger for at least some of the other failing tests
@william76 For reference, to run only this particular unit test you can use the Google Test filter option (--gtest_filter):
And to run all tests except this one, prefix the pattern with a minus sign to exclude it, i.e.
All other tests pass if I filter out all the graph_graph_color tests:
More detailed output for the double type failing test:
@ndellingwood ooh... cool! Thanks for posting about that option. That output looks different from the output in the console on Jenkins, which showed the illegal memory access exception. Does the
@william76 I built with debug enabled and ran it through cuda-gdb, not sure if that is the reason. It's probably because I ran the test in isolation.
Running the code with cuda-gdb
When setting breakpoints may need this:
Failing algorithm: First run:
Second test:
Updating the issue to mark more clearly the failing test and algorithm.
@lucbv it looks like the
One quick note, line 103 of
Reproducer instructions:
Log onto White:
Modules:
In testing directory generate makefile:
Allocate node (interactive session) for the Kepler queue rhel7F
Install the lib (within testing directory)
Build the unit tests
Run the test:
Coloring VBD should not be run on the GPUs. @lucbv
@ndellingwood Thus far, I cannot replicate what you're getting, I'm getting this on rhel7F nodes on White:
It still fails with that memory error but I'm not getting nonzeros in
@william76 : You probably missed my comment above. The VBD option should not be used on GPUs. I don't think this was what @lucbv planned.
@william76 I added print statements and manual intervention/recompilation ;) @srajama1 @lucbv I can add macros to guard the test so it only runs for host architectures, does this sound like the right thing to do?
@srajama1 @ndellingwood indeed this test is not meant to run on GPUs,
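The host-only guard discussed above could be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from the actual fix: the `HostSpace` and `CudaSpace` tag types, the `Algorithm` enum, and the `algorithms_to_test`, `contains`, and `check_guard` helpers are all hypothetical stand-ins for the real Kokkos execution spaces and KokkosKernels coloring options. It also uses a type check rather than preprocessor macros, which is an assumption about one possible way to express the guard.

```cpp
#include <cassert>
#include <type_traits>
#include <vector>

// Hypothetical tag types standing in for Kokkos execution spaces.
struct HostSpace {};
struct CudaSpace {};

// Hypothetical stand-in for the coloring algorithm options.
enum class Algorithm { Default, Serial, VB, VBD };

// Build the list of algorithms a test should exercise for ExecSpace.
// VBD is appended only when ExecSpace is not the stand-in CUDA space,
// so a GPU test run never reaches the host-only algorithm.
template <class ExecSpace>
std::vector<Algorithm> algorithms_to_test() {
  std::vector<Algorithm> algs = {Algorithm::Default, Algorithm::Serial,
                                 Algorithm::VB};
  if (!std::is_same<ExecSpace, CudaSpace>::value) {
    algs.push_back(Algorithm::VBD);
  }
  return algs;
}

// Returns true if `a` appears in `algs`.
inline bool contains(const std::vector<Algorithm>& algs, Algorithm a) {
  for (Algorithm x : algs) {
    if (x == a) return true;
  }
  return false;
}

// Self-check: VBD is skipped for the CUDA stand-in and kept for the host.
inline void check_guard() {
  assert(!contains(algorithms_to_test<CudaSpace>(), Algorithm::VBD));
  assert(contains(algorithms_to_test<HostSpace>(), Algorithm::VBD));
}
```

A test body would then loop over `algorithms_to_test<TestExecSpace>()` instead of a fixed list, so the same test source can be instantiated for both host and device backends.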
PR #318 submitted with fix by @william76
@ndellingwood @william76 Thanks for resolving this!
@ndellingwood Or is the process to wait an extra day to let the nightlies run?
@william76 we need to keep this open as the PR has disabled
@william76 also we like to keep issues open, even if the fix is in develop, until the kokkos+kokkos-kernels promotion. It helps with keeping track of fixes in the changelog that we generate with each promotion.
The Jenkins job KokkosKernels_White_CudaOpenMP_cuda_92_gcc_720 shows at least the following test failure:
00:03:47 [ FAILED ] cuda.graph_graph_color_double_int_int_TestExecSpace
Subsequent tests were also failing in the Jenkins job, but this may be due to the first test failure. After reproducing and filtering out the first failing test I'll post other valid failures.
This began failing after updating the test_all_sandia script to match the versions of cuda and gcc tested by Kokkos and the Trilinos ATDM scripts.
Edit: Added the name of the Jenkins job.