Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Promotion 2.7.24 broke MV unit tests in Tpetra with complex types #360

Closed
ndellingwood opened this issue Dec 13, 2018 · 16 comments
Closed

Comments

@ndellingwood
Copy link
Contributor

ndellingwood commented Dec 13, 2018

This issue opened to address @lucbv's issue trilinos/Trilinos#3823.

Previously, @mhoemmen committed PR trilinos/Trilinos#3538 to resolve this issue in Tpetra.

The recent promotion clobbered that commit and replaced the fix with those in PR #314. However, problems with complex types in Tpetra's MV are occurring again, as reported by @lucbv.

@ndellingwood
Copy link
Contributor Author

This can be reproduced on White (Kepler nodes) in Trilinos using the following:

export TRILINOS_DIR=<PATH_TO_TRILINOS>/Trilinos

Load modules

module purge
source ${TRILINOS_DIR}/cmake/std/atdm/load-env.sh cuda-9.2-opt-Kepler37
module swap cmake/3.9.6 cmake/3.12.3

Configure

cmake \
 -GNinja \
 -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
 -DTrilinos_ENABLE_TESTS:BOOL=ON \
  -DTrilinos_ENABLE_Tpetra=ON \
  -DTrilinos_ENABLE_COMPLEX_DOUBLE:BOOL=ON \
  -DKOKKOS_ENABLE_DEPRECATED_CODE:BOOL=ON \
$TRILINOS_DIR

@ndellingwood
Copy link
Contributor Author

I suggest making the changes in Trilinos' kokkos-kernels and testing there first; trying to link to VOTD kokkos and kokkos-kernels will break unless using the kokkos-promotion branch of Trilinos.

@ndellingwood
Copy link
Contributor Author

@kyungjoo-kim do you have time to look at this? It seems related to the difference between PR #314 you worked on and the trilinos/Trilinos#3538 PR we clobbered with the promotion.

@ndellingwood
Copy link
Contributor Author

@kyungjoo-kim it could also be related to some of the dual_view changes as well, I can check that first before we consider difference between the commits mentioned above.

@kyungjoo-kim
Copy link
Contributor

kyungjoo-kim commented Dec 13, 2018

@lucbv If you don't mind, could you check if this test still fails after randomizing only real part ?

You can do so by changing the randomize() method in Tpetra_MultiVector_def.hpp.

const IST max = ATS::one();
const IST min = ATS::zero();

In this way, the complex values have zero imagineries. If the double test passes, the complex test with zero imagineris also should passes. In my case, it fails.

I found this problem in KokkosKernels and I am wondering if this is also the case. If the test still fails while double test passes, then we need to fix Kokkos complex.

@ndellingwood ndellingwood changed the title Promotion 2.7.24 broke MV in Tpetra with complex types Promotion 2.7.24 broke MV unit tests in Tpetra with complex types Dec 20, 2018
@mhoemmen
Copy link
Contributor

@kddevin @trilinos/tpetra @trilinos/framework @jwillenbring @rrdrake @theguruat12

My work-around for this got clobbered. BLAS functions that return complex have been known to be broken for a long time, due to ABI issues. See e.g., CheckCXXComplexBlasProblem.cmake and CheckCXXComplexBlasProblemCanBeFixed.cmake in Trilinos/packages/teuchos/cmake/. Teuchos knew how to work around this issue quite a while ago, as the latter filename indicates.

@ndellingwood
Copy link
Contributor Author

@mhoemmen I'll get a build going on White and also test with @kyungjoo-kim 's suggestion in the comment above - if we test complex<double> and set the imaginary part to 0 and get incorrect results (in some sense making the problem seem like we're using double types, known to work fine, but running through the complex<double> code paths) this will better help define the issue.

@mhoemmen
Copy link
Contributor

Thanks @ndellingwood !

@ndellingwood
Copy link
Contributor Author

Build in progress, may take a little bit...

@ndellingwood
Copy link
Contributor Author

Extra failure info from Tpetra MV unit test for reference- the issue seems to occur in kokkos-kernels so it is pasted here rather than the Trilinos issue.

Same setup as reproducer steps above on White.

Allocate node:
bsub -Is -n 16 -q rhel7F bash

Run MV tests:
ctest -R TpetraCore_MultiVector_UnitTests_MPI_4 -VV

Failures:

1: The following tests FAILED:
1:     2. MultiVector_double_default_local_ordinal_type_default_global_ordinal_type_Kokkos_Compat_KokkosSerialWrapperNode_ComplexDotOneColumn_UnitTest ...
1:     123. MultiVector_int_int_std_complex0double0_Kokkos_Compat_KokkosSerialWrapperNode_NonContigView_UnitTest ...
1:     255. MultiVector_int_longlong_std_complex0double0_Kokkos_Compat_KokkosSerialWrapperNode_NonContigView_UnitTest ...

Specific output grouped by test above:

1: 2. MultiVector_double_default_local_ordinal_type_default_global_ordinal_type_Kokkos_Compat_KokkosSerialWrapperNode_ComplexDotOneColumn_UnitTest ...

1: 2. MultiVector_double_default_local_ordinal_type_default_global_ordinal_type_Kokkos_Compat_KokkosSerialWrapperNode_ComplexDotOneColumn_UnitTest ...
1:  results[0] = (0,0) == STS::one() = (1,0) : FAILED ==> ../../packages/tpetra/core/test/MultiVector/MultiVector_UnitTests.cpp:3192
1:  results[0] = (0,0) == -STS::one() = (-1,-0) : FAILED ==> ../../packages/tpetra/core/test/MultiVector/MultiVector_UnitTests.cpp:3207
1:  gblSuccess == 1 = 0 == true : FAILED ==> ../../packages/tpetra/core/test/MultiVector/MultiVector_UnitTests.cpp:3213
1:  [FAILED]  (0.00056 sec) MultiVector_double_default_local_ordinal_type_default_global_ordinal_type_Kokkos_Compat_KokkosSerialWrapperNode_ComplexDotOneColumn_UnitTest
1:  Location: ../../packages/tpetra/core/test/MultiVector/MultiVector_UnitTests.cpp:3136
1: 123. MultiVector_int_int_std_complex0double0_Kokkos_Compat_KokkosSerialWrapperNode_NonContigView_UnitTest ...

1:   Check: rel_err(dotsOrig[j], dotsView[j])
1:          = rel_err((0,0), (-2.29429,-7.26526)) = 1
1:            <= tol = 2.22045e-12 : FAILED
...
1:   gblSuccess == 1 = 0 == true : FAILED ==> ../../packages/tpetra/core/test/MultiVector/MultiVector_UnitTests.cpp:569
1:  [FAILED]  (0.00182 sec) MultiVector_int_int_std_complex0double0_Kokkos_Compat_KokkosSerialWrapperNode_NonContigView_UnitTest
1:  Location: ../../packages/tpetra/core/test/MultiVector/MultiVector_UnitTests.cpp:363
1: 255. MultiVector_int_longlong_std_complex0double0_Kokkos_Compat_KokkosSerialWrapperNode_NonContigView_UnitTest ...

1:   Check: rel_err(dotsOrig[j], dotsView[j])
1:          = rel_err((0,0), (-2.29429,-7.26526)) = 1
1:            <= tol = 2.22045e-12 : FAILED
...
1:   gblSuccess == 1 = 0 == true : FAILED ==> ../../packages/tpetra/core/test/MultiVector/MultiVector_UnitTests.cpp:569
1:  [FAILED]  (0.00182 sec) MultiVector_int_longlong_std_complex0double0_Kokkos_Compat_KokkosSerialWrapperNode_NonContigView_UnitTest
1:  Location: ../../packages/tpetra/core/test/MultiVector/MultiVector_UnitTests.cpp:363

@ndellingwood
Copy link
Contributor Author

The problem occurs with the Serial tests (not the CudaUVM tests).

In the case when Cuda is the execution space, the kokkos-kernels specialization is called and everything works fine.

In the case when Serial is the execution space, the BLAS tpl specialization is hit, ultimately making a call to zdotu_ like so:

       zdotu_(reinterpret_cast<std::complex<double>* >(R.data()),        \
             &N,                                                        \
             reinterpret_cast<const std::complex<double>* >(X.data()),&one, \
             reinterpret_cast<const std::complex<double>* >(Y.data()),&one);

I checked the values of the X and Y data, before and after and they look correct; I also tried creating new std::complex<double>* allocations and copied the data for X, Y, and R over and passing those pointers to zdotu_ (I checked the data were as expected) in case something weird was happening with the reinterpret_cast etc, but the result stored was always incorrectly returned as 0.

@mhoemmen 's previous PR trilinos/Trilinos#3538 makes sure we use the kokkos-kernels implementation (rather than the BLAS tpl specialization) and would serve as a solution.

@kyungjoo-kim @mhoemmen @crtrott @srajama1 any suggestions where to go from here? The fastest path is to reimplement Mark's fix clobbered by the promotion, which would disable TPL options when complex types are used. If that isn't the desired solution I'll have to sit down with someone to get a clear understanding how to handle this.

@ndellingwood
Copy link
Contributor Author

Also pinging @lucbv so he is in the loop, he has been waiting on a fix for this.

@mhoemmen
Copy link
Contributor

mhoemmen commented Jan 19, 2019

@ndellingwood Just to be clear, it sounds like zdotu_ is getting correct input, but is returning the wrong result. Is that correct?

If that's correct, then my guess is that this is caused by a different Fortran ABI for returning a complex number, than what kokkos-kernels is expecting. kokkos-kernels expects zdotu_ and its kin to return the complex number as the first output argument. However, this is not standard. It looks like the MKL "returns" a complex number by hidden first output argument, while GCC's gfortran actually returns the complex number. The latter might happen if using a system BLAS library that was built using gfortran.

Trilinos has seen issues like this before. Teuchos' CMake logic includes tests for this case:

and Teuchos actually uses this to decide which ABI variant to use.

To fix this, we would just copy those tests over to kokkos-kernels (since kokkos-kernels is not supposed to depend on Teuchos). We can use the latter CHECK_CXX_COMPLEX_BLAS_PROBLEM_CAN_BE_FIXED CMake function to define a macro, and use the macro to pick from one of two different extern declarations of the relevant functions. The only BLAS 1 functions that return complex are {c,z}dotc and {c,z}dotu, so this shouldn't be too much trouble.

@mhoemmen
Copy link
Contributor

@theguruat12 @rrdrake If I'm right, this should be pretty straightforward to fix. We can apply the fix to the kokkos-kernels and Trilinos repos simultaneously, so that we can go ahead with the Trilinos promotion.

@ndellingwood
Copy link
Contributor Author

Cross-referencing PR #374 and #373 with fixes for the issue.

@ndellingwood
Copy link
Contributor Author

PR's #373, #374 and #380 should complete the fix for this, all merged into develop.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants