Test ThyraTpetraAdapters_TpetraThyraWrappersUnitTests_serial_MPI_1 failing in CUDA-10 builds on 'ats2' and 'cee-rhel6' starting 2020-07-17 #7690
Comments
@bartlettroscoe I was able to replicate this, and yes, it was caused by #7677 introducing an old KokkosKernels bug. The fix is ready in #7694.
@brian-kelley, why do some CUDA builds show this and not others? Is it because this error was only triggered with cuda-10.1 instead of cuda-9.2? The CUDA builds on 'vortex' and the 'ascicgpu' machines are using cuda-10.1, and all of the other CUDA builds on other platforms (including 'ride') are using cuda-9.2.
@bartlettroscoe This should have been triggered if and only if
@brian-kelley, the CMakeCache file for the build
@bartlettroscoe Figured out this one:
@bartlettroscoe #7694 includes proper support for cuSPARSE 9.x. Here's what I found:
@brian-kelley, thanks for digging into this. Now another case like this will be caught by the Trilinos PR CUDA build. That is huge!
Automatically Merged using Trilinos Pull Request AutoTester
PR Title: KokkosKernels: fix conjugate-transpose cusparse spmv (#7690)
PR Author: brian-kelley
@bartlettroscoe I'm closing this since it looks like the CUDA tests that were failing on 2020-07-20 passed today.
CC: @trilinos/thyra, @trilinos/tpetra, @trilinos/kokkos-kernels, @kddevin (Trilinos Data Services Product Lead), @brian-kelley
Next Action Status
Description
As shown in this query, the test:
ThyraTpetraAdapters_TpetraThyraWrappersUnitTests_serial_MPI_1
in the builds:
Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_complex_static_opt
Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_complex_static_opt_cuda-aware-mpi
Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_dbg
Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_dbg_cuda-aware-mpi
Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_opt
Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_opt_cuda-aware-mpi
Trilinos-atdm-cee-rhel6_cuda-10.1.243_gcc-7.2.0_openmpi-4.0.3_shared_dbg
Trilinos-atdm-cee-rhel6_cuda-10.1.243_gcc-7.2.0_openmpi-4.0.3_shared_opt
started failing on testing day 2020-07-18.
The failure is in the unit test
TpetraThyraWrappers_double_TpetraLinearOp_UnitTest
and shows the error:
It looks like the adjoint operator does not match up with the forward operator.
That check verifies that the forward and adjoint applies of the operator are consistent, i.e. that <A*x, w> matches <x, A^T*w> for this real-scalar case.
The new commits pulled on testing day 2020-07-17, as shown, for example, here, included commits from the merged PRs #7681, #7671, #7677, and #7676. Looking over the commits in these PRs, my guess is that this was triggered by the commits from @brian-kelley in PR #7677?
Could the updated implementation of the adjoint sparse mat-vec by KokkosKernels be having a problem on these systems? That is a tiny little 4x4 matrix, so it should be easy to manually debug. (You can tell the unit test to be very verbose and print out everything to allow for manual checking; see the sketch below.)
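For reference, a minimal sketch of running just this test with full output from a configured build directory. The ctest invocation is standard; the direct-executable path and the --show-test-details flag are assumptions based on the usual Teuchos unit-test harness, so adjust as needed:

```shell
# Run just the failing test through ctest with maximum verbosity
ctest -R ThyraTpetraAdapters_TpetraThyraWrappersUnitTests_serial_MPI_1 -VV

# Or run the test executable directly (path assumed) and ask the Teuchos
# unit-test harness to print the details of every individual check
./packages/thyra/adapters/tpetra/test/ThyraTpetraAdapters_TpetraThyraWrappersUnitTests.exe \
  --show-test-details=ALL
```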
Current Status on CDash
ThyraTpetraAdapters_TpetraThyraWrappersUnitTests_serial_MPI_1 results in Trilinos builds over the last 5 days
Steps to Reproduce
One should be able to reproduce this failure on the relevant machines as described below:
The failures for the 'ats2' builds can be reproduced on the machine 'vortex' as described at:
The failures for the 'cee-rhel6' CUDA builds can be reproduced on any CEE 'ascicgpu' machine as described at:
On 'vortex', one should be able to reproduce the test failure with:
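A rough sketch of the standard ATDM Trilinos configure/build/test workflow for one of the failing builds (the build name, package enable, and NP value are assumptions taken from the list above and the usual ATDM instructions; the linked documents are authoritative, and on 'vortex' the ctest step must run from a compute-node allocation, e.g. under bsub):

```shell
# Load the ATDM env matching one of the failing builds (name from the list above)
source $TRILINOS_DIR/cmake/std/atdm/load-env.sh \
  Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_opt

# Configure with the standard ATDM options file, enabling just the package
# that owns the failing test
cmake \
  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
  -DTrilinos_ENABLE_TESTS=ON \
  -DTrilinos_ENABLE_ThyraTpetraAdapters=ON \
  $TRILINOS_DIR

# Build, then run just the failing test
make NP=16
ctest -R ThyraTpetraAdapters_TpetraThyraWrappersUnitTests_serial_MPI_1
```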