
Adding team-based BLAS functions #215

Merged 17 commits into develop from issue-209-214 on May 5, 2018
Conversation

@vqd8a (Contributor) commented Apr 23, 2018

Addressing issues #209, #214

@mhoemmen (Contributor) left a review comment:

Do we want the team versions of functions like axpby, gemv, and scal to respect BLAS semantics regarding coefficients alpha and beta? (See comments for details.) It's up to y'all, but the node-global versions of these functions currently do respect BLAS semantics.

const typename YVector::non_const_value_type& b, const YVector& y) {
const int N = x.extent(0);
Kokkos::parallel_for(Kokkos::TeamThreadRange(team,N), [&] (const int& i) {
y(i) = b*y(i) + a*x(i);

@mhoemmen (Contributor) commented on the snippet above:
This breaks BLAS semantics:

  1. If b is zero, the code should overwrite y(i) without multiplying. This matters if y(i) is Inf or NaN.
  2. If a is zero, the code should ignore x(i) without multiplying. This matters if x(i) is Inf or NaN.

If we intend to do this, then we should fix the behavior of the node-global version to be consistent with this.
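
For concreteness, a minimal sketch of what a BLAS-semantics-respecting team axpby body could look like, reusing the names from the snippet above (team, a, b, x, y). The four-way branch and the zero tests via Kokkos::ArithTraits are illustrative, not code from this PR:

```cpp
// Illustrative only: branch on zero coefficients once per call so that
// Inf/NaN entries in x or y are never propagated through a multiply.
typedef Kokkos::ArithTraits<typename YVector::non_const_value_type> ATS;
const int N = x.extent(0);
if (b == ATS::zero()) {
  if (a == ATS::zero())
    Kokkos::parallel_for(Kokkos::TeamThreadRange(team, N),
                         [&](const int& i) { y(i) = ATS::zero(); });  // overwrite y, no multiply
  else
    Kokkos::parallel_for(Kokkos::TeamThreadRange(team, N),
                         [&](const int& i) { y(i) = a * x(i); });     // ignore old y
} else {
  if (a == ATS::zero())
    Kokkos::parallel_for(Kokkos::TeamThreadRange(team, N),
                         [&](const int& i) { y(i) = b * y(i); });     // ignore x
  else
    Kokkos::parallel_for(Kokkos::TeamThreadRange(team, N),
                         [&](const int& i) { y(i) = b * y(i) + a * x(i); });
}
```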

dot_type result;
int N = X.extent(0);
Kokkos::parallel_reduce(Kokkos::TeamThreadRange(team,N), [&] (const int& i, dot_type& val) {
val += X(i)*Y(i);

@mhoemmen (Contributor) commented on the snippet above:

If you're using InnerProductSpaceTraits (which you should), then you should use InnerProductSpaceTraits::dot to compute this product, not operator*.
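
For illustration, a sketch of the suggested change under the assumptions of the snippet above (X, Y, and the team handle as shown; the XVector and IPT names here are illustrative):

```cpp
// Let InnerProductSpaceTraits pick both the accumulator type and the
// per-entry product, so scalar types whose dot_type differs from the
// value type (e.g. some Stokhos types) are handled in one place.
typedef Kokkos::Details::InnerProductSpaceTraits<typename XVector::non_const_value_type> IPT;
typedef typename IPT::dot_type dot_type;
dot_type result;
const int N = X.extent(0);
Kokkos::parallel_reduce(Kokkos::TeamThreadRange(team, N),
                        [&](const int& i, dot_type& val) {
                          val += IPT::dot(X(i), Y(i));  // instead of X(i) * Y(i)
                        },
                        result);
```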


@crtrott (Member) replied:

When is that not the same for dot? I understand why it wouldn't be the same for NRM2 but for dot?


@mhoemmen (Contributor) replied:

@crtrott It's for some Stokhos scalar types, where Eric wants to do Krylov methods with plain old double as the dot product / norm results, even though the vectors have Stokhos types in them.

This is also a reason why one might want axpby with coefficients of a different type than the vectors' entries.

const typename AVector::non_const_value_type& alpha, const AVector& a, const XVector& x) {
const int N = x.extent(0);
Kokkos::parallel_for(Kokkos::TeamThreadRange(team,N), [&] (const int& i) {
y(i) = gamma*y(i) + alpha*a(i)*x(i);

@mhoemmen (Contributor) commented on the snippet above:

See above note on BLAS semantics.

static KOKKOS_INLINE_FUNCTION void team_scal (const TeamType& team, const RV& R, const typename XV::non_const_value_type& a, const XV& X) {
const int N = X.extent(0);
Kokkos::parallel_for(Kokkos::TeamThreadRange(team,N), [&] (const int& i) {
R(i) = a*X(i);

@mhoemmen (Contributor) commented on the snippet above:

See above note on BLAS semantics. (If a is zero, this should fill R with zeros, regardless of the contents of X.)
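
A possible shape for that zero-coefficient case, again only a sketch reusing the names from the snippet (team, R, a, X):

```cpp
// Illustrative only: when a is zero, write zeros into R directly so that
// Inf/NaN entries of X are not propagated.
typedef Kokkos::ArithTraits<typename XV::non_const_value_type> ATS;
const int N = X.extent(0);
if (a == ATS::zero())
  Kokkos::parallel_for(Kokkos::TeamThreadRange(team, N),
                       [&](const int& i) { R(i) = ATS::zero(); });
else
  Kokkos::parallel_for(Kokkos::TeamThreadRange(team, N),
                       [&](const int& i) { R(i) = a * X(i); });
```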

const typename ZVector::non_const_value_type& gamma, const ZVector& z) {
const int N = x.extent(0);
Kokkos::parallel_for(Kokkos::TeamThreadRange(team,N), [&] (const int& i) {
z(i) = gamma*z(i) + alpha*x(i) + beta*y(i);

@mhoemmen (Contributor) commented on the snippet above:

See above note on BLAS semantics.

Kokkos::parallel_reduce(Kokkos::ThreadVectorRange(team,M), [&] (const int& j, dot_type& val ) {
val += A(j,i) * x(j);
},Ax_i);
y(i) = beta * y(i) + alpha * Ax_i;

@mhoemmen (Contributor) commented on the snippet above:

See above note on BLAS semantics.

Kokkos::parallel_reduce(Kokkos::ThreadVectorRange(team,M), [&] (const int& j, dot_type& val ) {
val += Kokkos::ArithTraits<value_type>::conj(A(j,i)) * x(j);
},Ax_i);
y(i) = beta * y(i) + alpha * Ax_i;

@mhoemmen (Contributor) commented on the snippet above:

See above note on BLAS semantics.


ScalarA expected_result = 0;
for(int i=0;i<N;i++)
expected_result += AT::abs(h_x(i)) * AT::abs(h_x(i));

@mhoemmen (Contributor) commented on the snippet above:

This would be a good place to compute a Wilkinson-style running error bound, rather than using an ad-hoc bound (see below).
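
As an illustration of that suggestion (not code from this PR), the host-side reference loop could carry a running bound alongside the sum. This assumes AT is the test's existing Kokkos::ArithTraits<ScalarA> typedef; error_bound, term, and eps are illustrative names:

```cpp
// Wilkinson-style running bound: each addition of a term t to a running sum s
// can commit a rounding error of roughly eps * (|s| + |t|). Accumulating that
// alongside the sum gives a tolerance tied to the data, instead of the
// ad-hoc eps*expected_result used below.
typename AT::mag_type error_bound = 0;
const auto eps = AT::epsilon();

ScalarA expected_result = 0;
for (int i = 0; i < N; i++) {
  const typename AT::mag_type term = AT::abs(h_x(i)) * AT::abs(h_x(i));
  expected_result += term;
  error_bound += eps * (AT::abs(expected_result) + term);
}
// ... and later, e.g.:
// EXPECT_NEAR_KK(nonconst_nonconst_result, expected_result, error_bound);
```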

} );

ScalarB nonconst_nonconst_result = KokkosBlas::dot(y,y);
EXPECT_NEAR_KK( nonconst_nonconst_result, expected_result, eps*expected_result);

@mhoemmen (Contributor) commented on the snippet above:

See above note.

} );

ScalarB const_nonconst_result = KokkosBlas::dot(y,y);
EXPECT_NEAR_KK( const_nonconst_result, expected_result, eps*expected_result);

@mhoemmen (Contributor) commented on the snippet above:

See above note.

@crtrott (Member) commented Apr 23, 2018

Hm, interesting question regarding the BLAS semantics. At the top level we handle that logic pretty far outward and just do ETI for everything. Here we would need to inject all the different code variants into your kernel, which could have some bad performance effects. I definitely want to keep BLAS semantics on the global functions but for the nested ones I just don't know whether the possible performance issues are worth it. Any thoughts?

@mhoemmen (Contributor) replied:

@crtrott wrote:

I definitely want to keep BLAS semantics on the global functions but for the nested ones I just don't know whether the possible performance issues are worth it. Any thoughts?

Tpetra can always implement BLAS semantics on its own, so I'm not personally worried so much about the global functions. However, you do have to worry about users following current developments in the batched linear algebra standardization effort. In addition, it depends on how we intend to use these new functions. I'm not sure what multifrontal / supernodal sparse factorizations expect, for example.

@vqd8a (Contributor, Author) commented Apr 24, 2018

@crtrott Any further guidance?

@srajama1 (Contributor) replied:

@vqd8a : The BLAS semantics are important for node-level calls. For the team-level ones you are working on, the semantics are not clear, especially with the performance hit Christian is talking about. In fact, this is a point of discussion for the batched BLAS committee. Let me check with them next time we talk. We don't expect these semantics for our sparse solvers, so what you have done is good for us.

The use case of coefficients different than vector entries is also a node-level use case. I have never heard of it for a team-level use case, though it could extend there. I am willing to go with what you have and then iterate for the other use case. Can you create a wiki page describing these comments so users know what they are getting?

srajama1 previously approved these changes Apr 24, 2018
mhoemmen previously approved these changes Apr 25, 2018
@mhoemmen (Contributor) commented Apr 25, 2018

I've approved the changes so as not to block progress, but you really need to fix the InnerProductSpaceTraits issue or Stokhos might not build correctly. @etphipp may wish to comment.

@vqd8a (Contributor, Author) commented Apr 25, 2018

Thanks, @srajama1 @mhoemmen.
@srajama1 I am going to create a wiki page for these. BTW, I am not quite clear about "The use case of coefficients different than vector entries is also a node-level use case". Could you please explain more?

@etphipp (Contributor) commented Apr 25, 2018

For some Stokhos scalar types, the result type of a dot product differs from the scalar type, hence the use of InnerProductSpaceTraits as Mark described. However, Stokhos has historically overloaded these functions for its scalar types anyway, so it probably won't really matter (unless this breaks those overloads so that they no longer override the right function signature).

@etphipp (Contributor) commented Apr 25, 2018

And after looking at the code more closely, you really should follow Mark's suggestion. It's a 1-line change and makes the code consistent.

@vqd8a (Contributor, Author) commented Apr 25, 2018

Thanks. @etphipp

@srajama1 (Contributor) replied:

@vqd8a : I was wondering whether InnerProductSpaceTraits is really needed, because of the overloading @etphipp mentioned. He says it is needed, so let us do that.

vqd8a dismissed stale reviews from mhoemmen and srajama1 via 836726a, April 25, 2018 17:26
@vqd8a (Contributor, Author) commented Apr 25, 2018

Changed to using InnerProductSpaceTraits::dot to compute the product in KokkosBlas1_team_dot_spec.hpp.

@srajama1 (Contributor) replied:

As @mhoemmen and @etphipp wanted this change, I will wait for them to review. I can click the green button once either of them approves.

@etphipp (Contributor) left a review comment:

Looks fine to me, although you could shorten the dot_type typedef by using IPT.

@etphipp (Contributor) commented Apr 25, 2018

Inner products and norms are treated differently from general multiplies, which are what those functions do.

@vqd8a (Contributor, Author) commented Apr 25, 2018

But in KokkosBlas1_team_mult_spec.hpp and KokkosBlas2_team_gemv_spec.hpp we still need to compute products, right? Correct me if I am wrong. Thanks.

@etphipp (Contributor) commented Apr 25, 2018

Yes, but these functions mathematically represent different operations, so you don't use the IPT there.

@vqd8a (Contributor, Author) commented Apr 25, 2018

Oh, I see. Thanks. @etphipp

mhoemmen previously approved these changes Apr 25, 2018
@crtrott (Member) commented May 2, 2018

Running tests now.

@mhoemmen (Contributor) left a review comment:

Kokkos::ArithTraits already has real and imag functions. For example:

static KOKKOS_FORCEINLINE_FUNCTION mag_type real (const T& x);

Thus, you don't need these extra RealPart and ImagPart functions. If you can instantiate Kokkos::ArithTraits for the SIMD<Kokkos::complex<T>> type, then you should be able to call the function in ArithTraits, and it should return the right thing. If it doesn't return the right thing, then we should fix ArithTraits to do the right thing.

template<typename T, int l> typename Kokkos::Details::ArithTraits<Vector<SIMD<Kokkos::complex<T> >,l> >::mag_type
template<typename T>
KOKKOS_INLINE_FUNCTION
typename Kokkos::Details::ArithTraits<T>::mag_type RealPart(const T &val) { return val; }

@mhoemmen (Contributor) commented on the snippet above:
Kokkos::ArithTraits already has the real function:

static KOKKOS_FORCEINLINE_FUNCTION mag_type real (const T& x);

You don't need this extra function. If you can instantiate Kokkos::ArithTraits for the SIMD type, then you should be able to call the function in ArithTraits, and it should return the right thing. If it doesn't return the right thing, then please fix ArithTraits instead of reimplementing real here.

template<typename T, int l> typename Kokkos::Details::ArithTraits<Vector<SIMD<Kokkos::complex<T> >,l> >::mag_type
template<typename T>
KOKKOS_INLINE_FUNCTION
typename Kokkos::Details::ArithTraits<T>::mag_type ImagPart(const T &val) { return 0; }

@mhoemmen (Contributor) commented on the snippet above:

See above comment (ArithTraits already has an imag function).
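
As a sketch of the direction Mark is suggesting (assuming ArithTraits is, or gets, specialized for the SIMD<Kokkos::complex<T>> vector type), call sites would use ArithTraits directly instead of the local helpers; ats_type and val are placeholder names:

```cpp
// Illustrative use at a call site: ArithTraits already provides real/imag.
typedef Kokkos::Details::ArithTraits<value_type> ats_type;
const typename ats_type::mag_type re = ats_type::real(val);  // instead of RealPart(val)
const typename ats_type::mag_type im = ats_type::imag(val);  // instead of ImagPart(val)
```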

crtrott added 2 commits May 2, 2018 21:10
We are not using experimental flags as a rule.
@crtrott (Member) commented May 3, 2018

Mark, you can open a separate issue for that. I only touched that file to fix some compile issues I had.

@mhoemmen (Contributor) commented May 3, 2018

@crtrott wrote:

Mark you can open a separate issue for that. I only touched that file to fix some compile issues I had.

Done: #225

@mhoemmen (Contributor) left a review comment:

See #225.

mhoemmen previously approved these changes May 3, 2018
mhoemmen dismissed their stale review May 3, 2018 15:38

See @crtrott's comments above.

@crtrott (Member) commented May 5, 2018

Running on machine: apollo
Going to test compilers:  gcc/4.8.4 gcc/5.3.0 intel/16.0.1 clang/3.9.0 cuda/9.1
Testing compiler gcc/4.8.4
...
#######################################################
PASSED TESTS
#######################################################
clang-3.9.0-Pthread_Serial-release build_time=211 run_time=1508
gcc-4.8.4-Pthread-release build_time=123 run_time=911
#######################################################
FAILED TESTS
#######################################################
cuda-9.1-Cuda_OpenMP-release (test failed)
gcc-4.8.4-OpenMP-release (test failed)
gcc-5.3.0-Serial-release (test failed)
intel-16.0.1-OpenMP-release (test failed)

@crtrott (Member) commented May 5, 2018

[crtrott@apollo KokkosKernelsBuild]$ grep "error:" nohup.PR215e
[crtrott@apollo KokkosKernelsBuild]$ grep "FAIL" nohup.PR215e
  FAILED gcc-4.8.4-OpenMP-release
[  FAILED  ] openmp.batched_vector_arithmatic_simd_dcomplex2 (137 ms)
[  FAILED  ] 1 test, listed below:
[  FAILED  ] openmp.batched_vector_arithmatic_simd_dcomplex2
 1 FAILED TEST
  FAILED gcc-5.3.0-Serial-release
[  FAILED  ] serial.batched_vector_arithmatic_simd_dcomplex2 (117 ms)
[  FAILED  ] 1 test, listed below:
[  FAILED  ] serial.batched_vector_arithmatic_simd_dcomplex2
 1 FAILED TEST
  FAILED intel-16.0.1-OpenMP-release
[  FAILED  ] openmp.batched_scalar_serial_trsv_l_nt_n_dcomplex_dcomplex (10 ms)
[  FAILED  ] openmp.batched_scalar_serial_trsv_l_nt_n_dcomplex_double (10 ms)
[  FAILED  ] 2 tests, listed below:
[  FAILED  ] openmp.batched_scalar_serial_trsv_l_nt_n_dcomplex_dcomplex
[  FAILED  ] openmp.batched_scalar_serial_trsv_l_nt_n_dcomplex_double
 2 FAILED TESTS
  FAILED cuda-9.1-Cuda_OpenMP-release
[  FAILED  ] openmp.batched_vector_arithmatic_simd_dcomplex2 (126 ms)
[  FAILED  ] 1 test, listed below:
[  FAILED  ] openmp.batched_vector_arithmatic_simd_dcomplex2
 1 FAILED TEST
FAILED TESTS

@crtrott (Member) commented May 5, 2018

Looks like this is all batched stuff. So I am going to approve this pull request and then we fix the rest.

crtrott merged commit dac474e into develop May 5, 2018
crtrott deleted the issue-209-214 branch May 5, 2018 04:12
@ndellingwood (Contributor) commented:

Issue #231 was opened regarding the batched stuff; @kyungjoo-kim is working on it.

@srajama1 (Contributor) commented May 7, 2018

@crtrott Did the SpGEMM issue get resolved because of the Kokkos fix?
