KokkosKernels_sparse_* tests are failing on ATDM cuda builds #3438
Comments
@srajama1 and @kyungjoo-kim, looking at the commits pulled that day shown here, the likely commits that broke these tests were:
These commits were integrated into 'develop' in PR #3223 merged on 9/6/2018.
There is a cuda error
There is an error related to MKL macro issues, same error message as kokkos/kokkos-kernels#289
My modification is to change the CMake configuration to use the TPLs correctly. @srajama1 Does the spgemm test use any cuBLAS internally?
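For context, a configure-time TPL change like the one described would look roughly like the fragment below. This is a hedged sketch only: the option names follow the common Trilinos `TPL_ENABLE_<NAME>` convention and the `TRILINOS_SOURCE_DIR` variable is illustrative, not the actual flags from this fix.

```shell
# Hedged sketch of a configure fragment; option names follow the usual
# Trilinos TPL_ENABLE_<NAME> convention and are illustrative only.
cmake \
  -D TPL_ENABLE_CUSPARSE=ON \
  -D TPL_ENABLE_MKL=OFF \
  ${TRILINOS_SOURCE_DIR}
```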
The spgemm test uses cusparse, yes. How did this pass spot checks?
I think that 1) waterman and white use gcc 7.2 + cuda 9.2, and 2) the hansen cuda build also uses gcc. The macro fix described in kokkos/kokkos-kernels#290 handles the case when Intel MKL is used.
I don't see why we are trying to combine this with the MKL changes. This is a CUDA bug that got introduced in Trilinos. Let us resolve it separately.
Look at the error message from the serial build with gcc compilers. It has the same error described in kokkos/kokkos-kernels#289. The corresponding PR patch should be applied. If it is already applied, then we have another edge case.
@srajama1 @kyungjoo-kim the fix in kokkos/kokkos-kernels#289 is merged into the kokkos-kernels develop branch; does a patch need to be applied to Trilinos? If the problem is similar to that issue then there may need to be an additional update to the offending test to guard the SPGEMM_CUSPARSE case, replacing it with something where the second macro is defined in the config file.
@ndellingwood Yes, please apply the patch to Trilinos, and would you test this on waterman or white (not the cuda test, but the serial or openmp test)?
@kyungjoo-kim will do; I should have a PR up shortly. Should I include the CUSPARSE modification I suggested above as well?
Address Trilinos issue trilinos#3438, incorporates part of fix from PR kokkos/kokkos-kernels#290
Yes, please do so. I don't think that putting ifdef guards on TPLs is harmful. Thanks a lot.
PR #3442 issued with (hopeful) fix.
Following reproducer instructions to test now on white - is there any reason I shouldn't build on an interactive node rather than the head node?
Head node is pretty busy, I started an interactive session and tried to compile but was unable to:
@bartlettroscoe is there a simple modification to the reproducer instructions so that I can build on an interactive node? Edit: I didn't state the issue I'm having very clearly: I haven't built anything yet (clean start to the build) and received the message
I get the same output regardless of head node or interactive node...
What are the exact commands you are using? Also, can you please do:
and then attach the
Thanks @bartlettroscoe
Build directory:
Then I followed the reproducer instructions:
I'll blow away the directory and relogin to start clean, and will post the configure output in the next message.
Fresh start with the same procedure above led to the same issue. @bartlettroscoe attached is the configure.txt file; thanks for the help.
@ndellingwood, if you look at the bottom of the file configure.txt, you see the problem:
Is it obvious how to fix this?
FYI: I fixed the typo in the "Steps to Reproduce" above. Run those instructions and it should work now. Sorry about that.
@bartlettroscoe working fine now, thanks for the help! Will post results when it completes.
@kyungjoo-kim @srajama1
Running the individual tests on an interactive node, here is some additional output:
And running through cuda-gdb (output abbreviated):
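For readers reproducing this, a typical cuda-gdb session on a failing test looks roughly like the fragment below; the executable name is an assumption for illustration, not the actual test binary from this thread.

```shell
# Hedged sketch of a cuda-gdb session; the executable name is illustrative.
cuda-gdb --args ./KokkosKernels_sparse_cuda.exe
# then inside cuda-gdb:
#   (cuda-gdb) run      # run until the failure
#   (cuda-gdb) bt       # backtrace at the failure point
```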
The MKL change in PR #3442 fixed the MKL issue, but there are still problems with CUSPARSE. I added some print statements to the test; this is a summary of what I'm seeing: cuda.sparse_spgemm_double_int_int_TestExecSpace
@kyungjoo-kim @srajama1 based on the testing setup, should this error have been caught, or should the test instead not even run unless cusparse kernels are enabled? I'm guessing the latter... serial.sparse_spgemm_double_int_int_TestExecSpace
As an extra data point, the kokkos-kernels sparse tests pass with VOTD kokkos-kernels + kokkos on the same White queue when using
FYI: The merge of PR #3442 on 9/14/2018 does not appear to have fixed any of the individual failing unit tests in any of the builds on any of the platforms shown yesterday, for example, here. What is the next course of action? Should the original merge commit be reverted while this gets fixed offline? NOTE: The EMPIRE team has flagged the original PR #3223 as breaking the usage of Trilinos for EMPIRE. Therefore, there is a strong case for backing out that PR merge commit and fixing this offline so that EMPIRE can get an updated version of Trilinos.
@bartlettroscoe, I don't think this was holding EMPIRE up. We should see a successful update to our fork happen soon. No need to revert unless I get back in touch.
@jmgate, I thought that your bisection study showed that some commit in PR #3442 broke the EMPIRE test suite?
Rechecked the kokkos-kernels VOTD tests; in my previous check I didn't enable the cusparse TPL properly, and the same tests are failing on the kokkos-kernels develop branch.
@ndellingwood I agree. I did not have time to fully check the test yet, but I also think that the cusparse version is not tested in the spot check. The spgemm test passes 3 tests in kokkoskernels, except for the cusparse version. Let's remove the cusparse version from the test list for now and think about this in kokkoskernels.
@bartlettroscoe, yes, that is what two independent bisects showed, but you know how bisecting Trilinos goes. @ccober6 did a manual bisect and found our actual problem elsewhere.
@jmgate, did you only bisect on the first-parent merge commits directly on the 'develop' branch, or did you try to bisect all commits? I think that the PR process only provides confidence that you can bisect on commits that pass PR testing. (NOTE: The Trilinos GitHub project currently allows people to rebase and push, which destroys the ability to bisect robustly on the first-parent commits on 'develop'. Therefore, if some Trilinos developers are doing this, it might explain why your bisect study did not do what it should. See #2726.)
Both, just to be sure.
@jmgate, if you try to bisect on all of the commits, I think you are almost guaranteed to get false failures, since some of these commits may not even configure. And even if one configures and builds, there is a good chance that the tests will not pass. I guess if you just skipped over commits that did not configure and build Trilinos or EMPIRE and only looked at your particular EMPIRE test, then you might be able to bisect all of the commits more robustly. Is this what you did?
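The skip-over-unbuildable-commits approach described above maps directly onto the `git bisect run` exit-code convention: exit status 125 means "this commit cannot be tested, skip it", while 0 means good and other nonzero codes mean bad. A minimal runnable sketch (`mock_build` is a stand-in for the real configure-and-build step, not an actual command from this thread):

```shell
# mock_build stands in for "configure and build the tree at this commit";
# it is a placeholder, not a real command from this thread.
mock_build() { return 1; }     # pretend this commit fails to build

if ! mock_build; then
  status=125                   # 125 tells `git bisect run` to skip the commit
else
  status=0                     # 0 = good; 1-124/126-127 would mean bad
fi
echo "bisect-run exit status would be: $status"
```

A real run script would exit with `$status` so that `git bisect run ./script.sh` steps over unbuildable commits and only judges the test of interest.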
@bartlettroscoe, we don't need to have this conversation, particularly in this issue.
We will move this problem into kokkoskernels and fix it there. As this problem is gone from Trilinos testing, we can close this issue.
I just found the timeout failure from Gauss-Seidel on waterman. @srajama1 Who is working on GS now? A simple solution is to disable the test for the debug build. Another solution is to simplify the testing. Anyway, I cannot help with this issue anymore, as the timeout failure pops up on waterman, which I do not have access to.
FYI: PR #3559 will disable the timing-out However, I will work to get an
Commit 837804d merged in PR #3559 on 10/3/2018 disabled some of the slower-running unit tests in the test executable With the new
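As an alternative to disabling slow cases at compile time, a gtest-based test executable can exclude them at run time with a negative filter. The executable and test-name patterns below are assumptions for illustration, not the actual names from PR #3559:

```shell
# Hedged sketch: exclude slow Gauss-Seidel cases from a gtest-based test
# executable via a negative --gtest_filter pattern (names are illustrative).
./KokkosKernels_sparse_serial.exe --gtest_filter=-*gauss_seidel*
```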
CC: @trilinos/kokkos-kernels , @kddevin (Trilinos Data Services Product Lead)
Next Action Status
The test
KokkosKernels_sparse_serial_MPI_1
on 'waterman' has been passing without timing out in each 'debug' build since 10/9/2018, as shown here.
Description
As shown in this query the tests:
are failing in all the cuda builds on white, ride, hansen, and waterman:
Here you can see that these tests started failing on 9/7/2018; at the bottom of this page is a list of commits that were new on that day.
Steps to Reproduce
One should be able to reproduce this failure as described in:
More specifically, the commands given for the system white are provided at:
The exact commands to reproduce this issue should be: