Test ThyraTpetraAdapters_TpetraThyraWrappersUnitTests_serial_MPI_1 in CUDA-10 builds on 'ats2' and 'cee-rhel6' starting 2020-07-17 #7690

Closed
bartlettroscoe opened this issue Jul 19, 2020 · 9 comments
Labels: ATDM Sev: Blocker, client: ATDM, impacting: tests, pkg: KokkosKernels, pkg: Thyra, pkg: Tpetra, type: bug

Comments

@bartlettroscoe
Member

CC: @trilinos/thyra, @trilinos/tpetra, @trilinos/kokkos-kernels @kddevin (Trilinos Data Services Product Lead), @brian-kelley

Next Action Status

Description

As shown in this query the test:

  • ThyraTpetraAdapters_TpetraThyraWrappersUnitTests_serial_MPI_1

in the builds:

  • Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_complex_static_opt
  • Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_complex_static_opt_cuda-aware-mpi
  • Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_dbg
  • Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_dbg_cuda-aware-mpi
  • Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_opt
  • Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_opt_cuda-aware-mpi
  • Trilinos-atdm-cee-rhel6_cuda-10.1.243_gcc-7.2.0_openmpi-4.0.3_shared_dbg
  • Trilinos-atdm-cee-rhel6_cuda-10.1.243_gcc-7.2.0_openmpi-4.0.3_shared_opt

started failing on testing day 2020-07-17.

The failure is in the unit test TpetraThyraWrappers_double_TpetraLinearOp_UnitTest and shows the error:

12. TpetraThyraWrappers_double_TpetraLinearOp_UnitTest ... 
 tpetraOp = Tpetra::CrsMatrix (Kokkos refactor):
   Template parameters:
    Scalar: double
    LocalOrdinal: int
    GlobalOrdinal: long long
    Node: Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace>
   isFillComplete: true
   Global dimensions: [4, 4]
   Global number of entries: 10
   
   Global max number of entries in a row: 3
   
   Row Map:
   
    "Tpetra::Map":
     Template parameters:
      LocalOrdinal: int
      GlobalOrdinal: long long
      Node: Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace>
     Global number of entries: 4
     Minimum global index: 0
     Maximum global index: 3
     Index base: 0
     Number of processes: 1
     Uniform: false
     Contiguous: true
     Distributed: false
     Process 0 of 1:
      My number of entries: 4
      My minimum global index: 0
      My maximum global index: 3
   Column Map: same as row Map
   Domain Map: same as row Map
   Range Map: same as domain Map
   Process rank: 0
    Number of allocated entries: 10
    Number of entries: 10
    Max number of entries per row: 3
       Proc Rank   Global Row  Num Entries
               0            0            2
               0            1            3
               0            2            3
               0            3            2
 
 nonnull(tpetraOp) = 1 == true : passed
 nonnull(thyraLinearOp) = 1 == true : passed
 
 Check that operator returns the right thing ...
 
 Check: rel_err(sum_y, as<Scalar>(3+1+2*(y->space()->dim()-2)))
        = rel_err(8, 8) = 0
          <= tol = 2.22045e-14 : passed
 
 Check the general LinearOp interface ...
  
  *** Entering LinearOpTester<double,double>::check(op,...) ...
  
  describe op: Thyra::TpetraLinearOp<double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >{rangeDim=4,domainDim=4}
  
  Checking the domain and range spaces ... passed!
  
  this->check_linear_properties()==true:Checking the linear properties of the forward linear operator ... passed!
  
  (this->check_linear_properties()&&this->check_adjoint())==true: Checking the linear properties of the adjoint operator ... passed!
  
  this->check_adjoint()==true: Checking the agreement of the adjoint and forward operators ... 
  op.opSupported(CONJTRANS) = true == true : passed
  
  Checking that the adjoint agrees with the non-adjoint operator as:
  
    <0.5*op'*v2,v1> == <v2,0.5*op*v1>
     \________/            \_______/
         v4                   v3
  
           <v4,v1>  == <v2,v3>
  
  Random vector tests = 1
   
   v1 = randomize(-1,+1); ...
   
   v2 = randomize(-1,+1); ...
   
   v3 = 0.5*op*v1 ...
   
   v4 = 0.5*op'*v2 ...
   
   Check: rel_err(<v4,v1>, <v2,v3>)
          = rel_err(0.0508384, -0.117265) = 1.43353
            <= adjoint_error_tol() = 2.22045e-12 : FAILED
  
  this->check_for_symmetry()==false: Skipping check of symmetry ...
  
  Oh no, at least one of the tests performed with this LinearOpBase object failed (see above failures)!
  
  *** Leaving LinearOpTester<double,double>::check(...)
 linearOpTester.check(*thyraLinearOp, Teuchos::inOutArg(out)) = 0 == true : FAILED ==> /scratch/atdm-devops-admin/atdm-trilinos-nightly-builds/Trilinos-atdm-cee-rhel6_cuda-10.1.243_gcc-7.2.0_openmpi-4.0.3_shared_dbg/SRC_AND_BUILD/Trilinos/packages/thyra/adapters/tpetra/test/TpetraThyraWrappers_UnitTests.cpp:600
 [FAILED]  (0.43 sec) TpetraThyraWrappers_double_TpetraLinearOp_UnitTest
 Location: /scratch/atdm-devops-admin/atdm-trilinos-nightly-builds/Trilinos-atdm-cee-rhel6_cuda-10.1.243_gcc-7.2.0_openmpi-4.0.3_shared_dbg/SRC_AND_BUILD/Trilinos/packages/thyra/adapters/tpetra/test/TpetraThyraWrappers_UnitTests.cpp:567

It looks like the adjoint operator does not match up with the forward operator.

The new commits pulled on testing day 2020-07-17 (as shown, for example, here) included commits from the merged PRs #7681, #7671, #7677, and #7676. Looking over the commits in these PRs, my guess is that this was triggered by the commits by @brian-kelley in PR #7677.

Could the updated implementation of the adjoint sparse mat-vec by KokkosKernels be having a problem on these systems? That is a tiny little 4x4 matrix, so it should be easy to manually debug this. (You can tell the unit test to be very verbose and print out everything to allow for manual checking.)

Current Status on CDash

ThyraTpetraAdapters_TpetraThyraWrappersUnitTests_serial_MPI_1 results in Trilinos builds over the last 5 days

Steps to Reproduce

One should be able to reproduce this failure on the machine as described in:

The failures for the 'ats2' builds can be reproduced on the machine 'vortex' as described at:

The failures for the 'cee-rhel6' CUDA build can be reproduced on any CEE 'ascicgpu' machine as described at:

On 'vortex', one should be able to reproduce the test failure with:

$ cd <some_build_dir>/

$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh \
    Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_dbg

$ cmake \
 -GNinja \
 -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
 -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Thyra=ON \
 $TRILINOS_DIR

$ bsub -J <job-name> -W 6:00 -Is bash

$ lrun -n 1 make NP=32

$ ctest -R ThyraTpetraAdapters_TpetraThyraWrappersUnitTests_serial_MPI_1
@bartlettroscoe bartlettroscoe added type: bug The primary issue is a bug in Trilinos code or tests pkg: Tpetra pkg: Thyra Issues primarily dealing with the Thyra Package impacting: tests The defect (bug) is primarily a test failure (vs. a build failure) pkg: KokkosKernels client: ATDM Any issue primarily impacting the ATDM project ATDM Sev: Blocker Problems that make Trilinos unfit to be adopted by one or more ATDM APPs labels Jul 19, 2020
@brian-kelley brian-kelley self-assigned this Jul 20, 2020
@brian-kelley
Contributor

@bartlettroscoe I was able to replicate this and yes it was caused by #7677 introducing an old KokkosKernels bug. The fix is ready in #7694.

@bartlettroscoe
Member Author

@brian-kelley, why do some CUDA builds show this and not others? Is it because this error was only triggered with cuda-10.1 instead of cuda-9.2? The CUDA builds on 'vortex' and the 'ascicgpu' machines are using cuda-10.1 and all of the other CUDA builds on other platforms (including 'ride') are using cuda-9.2.

@bartlettroscoe bartlettroscoe changed the title from "Test ThyraTpetraAdapters_TpetraThyraWrappersUnitTests_serial_MPI_1 in CUDA builds on 'ats2' and 'cee-rhel6' CUDA starting 2020-07-17" to "Test ThyraTpetraAdapters_TpetraThyraWrappersUnitTests_serial_MPI_1 in CUDA-10 builds on 'ats2' and 'cee-rhel6' starting 2020-07-17" Jul 21, 2020
@brian-kelley
Contributor

@bartlettroscoe This should have been triggered if and only if KokkosKernels_ENABLE_CUSPARSE is enabled, and both the 10.1 and 9.2 builds (I looked at Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release) enable it. Yet the test passed on that 9.2 build, so I need to figure out why the test wasn't failing before. Something in KokkosKernels might still be predicated on a CUSPARSE version >= 10 that shouldn't be.

@bartlettroscoe
Member Author

@brian-kelley, the CMakeCache file for the build Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release-debug today shown here shows:

KokkosKernels_ENABLE_TPL_CUSPARSE:BOOL=ON

@brian-kelley
Contributor

@bartlettroscoe Figured out this one: CUSPARSE_VERSION was not defined in cusparse.h prior to CUDA 10, so cusparse was not actually getting called. Will fix by using CUDA_VERSION instead.
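
The silent failure mode described here follows from a general preprocessor rule: an identifier that is not defined evaluates to 0 inside `#if`, so a version guard on a missing macro quietly takes the fallback branch. A minimal sketch, assuming made-up macro names and an arbitrary threshold (`HYPOTHETICAL_TPL_VERSION` stands in for `CUSPARSE_VERSION` with a pre-10 toolkit, `HYPOTHETICAL_TOOLKIT_VERSION` for `CUDA_VERSION`; neither the names nor the 9000 threshold are taken from KokkosKernels):

```cpp
// HYPOTHETICAL_TPL_VERSION is intentionally never defined in this file.
// In an #if expression an undefined identifier is replaced by 0, so this
// guard is false and the fallback branch is compiled, with no error.
#if HYPOTHETICAL_TPL_VERSION >= 9000
constexpr bool kGuardOnUndefinedMacro = true;
#else
constexpr bool kGuardOnUndefinedMacro = false;
#endif

// Stand-in for a toolkit version macro that is always defined.
// 9020 follows the CUDA_VERSION encoding (major * 1000 + minor * 10) for 9.2.
#define HYPOTHETICAL_TOOLKIT_VERSION 9020
#if HYPOTHETICAL_TOOLKIT_VERSION >= 9000
constexpr bool kGuardOnDefinedMacro = true;
#else
constexpr bool kGuardOnDefinedMacro = false;
#endif

static_assert(!kGuardOnUndefinedMacro, "undefined macro compares as 0");
static_assert(kGuardOnDefinedMacro, "defined macro selects the guarded path");

// Hypothetical accessors so the result can also be inspected at run time.
bool tplPathEnabledWithUndefinedMacro() { return kGuardOnUndefinedMacro; }
bool tplPathEnabledWithDefinedMacro() { return kGuardOnDefinedMacro; }
```

Compiling with a warning such as GCC's `-Wundef` would flag the first guard, which is one way to catch this class of problem earlier.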

@brian-kelley
Contributor

@bartlettroscoe #7694 includes proper support for cuSPARSE 9.x. Here's what I found:

  • Build on RIDE w/ CUDA 9.2, develop branch: Thyra passes, because cuSPARSE wasn't getting called
  • Fix the macros by using CUDA_VERSION: Thyra now fails, because cuSPARSE is getting called but conjugate transpose is broken
  • Combine the macro fix into #7694 (KokkosKernels: fix conjugate-transpose cusparse spmv (#7690)): Thyra passes again. Still passes on CUDA 10.1 as well.

@bartlettroscoe
Member Author

@brian-kelley, thanks for digging into this. Now another case like this will be caught by the Trilinos PR CUDA build. That is huge!

trilinos-autotester added a commit that referenced this issue Jul 23, 2020
Automatically Merged using Trilinos Pull Request AutoTester
PR Title: KokkosKernels: fix conjugate-transpose cusparse spmv (#7690)
PR Author: brian-kelley
@brian-kelley
Contributor

@bartlettroscoe I'm closing this since it looks like the CUDA tests that were failing on 7-20 passed today.

@bartlettroscoe
Member Author

As shown in this query, the tests that started failing on testing day 2020-07-17 started passing on testing day 2020-07-24. We can see from the PRs merged that day here that PRs #7634, #7694, and #7714 were merged. I think PR #7694 fixed this.
