
Adding team-based BLAS functions #215

Merged 17 commits into develop from issue-209-214 on May 5, 2018
Conversation

@vqd8a (Contributor) commented Apr 23, 2018

Addressing issues #209, #214

@mhoemmen (Contributor) left a review comment:

Do we want the team versions of functions like axpby, gemv, and scal to respect BLAS semantics regarding coefficients alpha and beta? (See comments for details.) It's up to y'all, but the node-global versions of these functions currently do respect BLAS semantics.

const typename YVector::non_const_value_type& b, const YVector& y) {
const int N = x.extent(0);
Kokkos::parallel_for(Kokkos::TeamThreadRange(team,N), [&] (const int& i) {
y(i) = b*y(i) + a*x(i);

@mhoemmen (Contributor) commented on the snippet above:
This breaks BLAS semantics:

  1. If b is zero, the code should overwrite y(i) without multiplying. This matters if y(i) is Inf or NaN.
  2. If a is zero, the code should ignore x(i) without multiplying. This matters if x(i) is Inf or NaN.

If we intend to do this, then we should fix the behavior of the node-global version to be consistent with this.
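
For concreteness, a minimal sketch of what a BLAS-semantics-respecting team axpby body could look like, reusing the names from the snippet above (team, a, b, x, y). The four-way branch and the zero tests via Kokkos::ArithTraits are illustrative, not code from this PR:

```cpp
// Illustrative only: branch on zero coefficients once per call so that
// Inf/NaN entries in x or y are never propagated through a multiply.
typedef Kokkos::ArithTraits<typename YVector::non_const_value_type> ATS;
const int N = x.extent(0);
if (b == ATS::zero()) {
  if (a == ATS::zero())
    Kokkos::parallel_for(Kokkos::TeamThreadRange(team, N),
                         [&](const int& i) { y(i) = ATS::zero(); });  // overwrite y, no multiply
  else
    Kokkos::parallel_for(Kokkos::TeamThreadRange(team, N),
                         [&](const int& i) { y(i) = a * x(i); });     // ignore old y
} else {
  if (a == ATS::zero())
    Kokkos::parallel_for(Kokkos::TeamThreadRange(team, N),
                         [&](const int& i) { y(i) = b * y(i); });     // ignore x
  else
    Kokkos::parallel_for(Kokkos::TeamThreadRange(team, N),
                         [&](const int& i) { y(i) = b * y(i) + a * x(i); });
}
```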

dot_type result;
int N = X.extent(0);
Kokkos::parallel_reduce(Kokkos::TeamThreadRange(team,N), [&] (const int& i, dot_type& val) {
val += X(i)*Y(i);

@mhoemmen (Contributor) commented on the snippet above:

If you're using InnerProductSpaceTraits (which you should), then you should use InnerProductSpaceTraits::dot to compute this product, not operator*.
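
For illustration, a sketch of the suggested change under the assumptions of the snippet above (X, Y, and the team handle as shown; the XVector and IPT names here are illustrative):

```cpp
// Let InnerProductSpaceTraits pick both the accumulator type and the
// per-entry product, so scalar types whose dot_type differs from the
// value type (e.g. some Stokhos types) are handled in one place.
typedef Kokkos::Details::InnerProductSpaceTraits<typename XVector::non_const_value_type> IPT;
typedef typename IPT::dot_type dot_type;
dot_type result;
const int N = X.extent(0);
Kokkos::parallel_reduce(Kokkos::TeamThreadRange(team, N),
                        [&](const int& i, dot_type& val) {
                          val += IPT::dot(X(i), Y(i));  // instead of X(i) * Y(i)
                        },
                        result);
```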


@crtrott (Member) replied:

When is that not the same for dot? I understand why it wouldn't be the same for NRM2 but for dot?


@mhoemmen (Contributor) replied:

@crtrott It's for some Stokhos scalar types, where Eric wants to do Krylov methods with plain old double as the dot product / norm results, even though the vectors have Stokhos types in them.

This is also a reason why one might want axpby with coefficients of a different type than the vectors' entries.

const typename AVector::non_const_value_type& alpha, const AVector& a, const XVector& x) {
const int N = x.extent(0);
Kokkos::parallel_for(Kokkos::TeamThreadRange(team,N), [&] (const int& i) {
y(i) = gamma*y(i) + alpha*a(i)*x(i);

@mhoemmen (Contributor) commented on the snippet above:

See above note on BLAS semantics.

static KOKKOS_INLINE_FUNCTION void team_scal (const TeamType& team, const RV& R, const typename XV::non_const_value_type& a, const XV& X) {
const int N = X.extent(0);
Kokkos::parallel_for(Kokkos::TeamThreadRange(team,N), [&] (const int& i) {
R(i) = a*X(i);

@mhoemmen (Contributor) commented on the snippet above:

See above note on BLAS semantics. (If a is zero, this should fill R with zeros, regardless of the contents of X.)
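
A possible shape for that zero-coefficient case, again only a sketch reusing the names from the snippet (team, R, a, X):

```cpp
// Illustrative only: when a is zero, write zeros into R directly so that
// Inf/NaN entries of X are not propagated.
typedef Kokkos::ArithTraits<typename XV::non_const_value_type> ATS;
const int N = X.extent(0);
if (a == ATS::zero())
  Kokkos::parallel_for(Kokkos::TeamThreadRange(team, N),
                       [&](const int& i) { R(i) = ATS::zero(); });
else
  Kokkos::parallel_for(Kokkos::TeamThreadRange(team, N),
                       [&](const int& i) { R(i) = a * X(i); });
```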

const typename ZVector::non_const_value_type& gamma, const ZVector& z) {
const int N = x.extent(0);
Kokkos::parallel_for(Kokkos::TeamThreadRange(team,N), [&] (const int& i) {
z(i) = gamma*z(i) + alpha*x(i) + beta*y(i);

@mhoemmen (Contributor) commented on the snippet above:

See above note on BLAS semantics.

Kokkos::parallel_reduce(Kokkos::ThreadVectorRange(team,M), [&] (const int& j, dot_type& val ) {
val += A(j,i) * x(j);
},Ax_i);
y(i) = beta * y(i) + alpha * Ax_i;

@mhoemmen (Contributor) commented on the snippet above:

See above note on BLAS semantics.

Kokkos::parallel_reduce(Kokkos::ThreadVectorRange(team,M), [&] (const int& j, dot_type& val ) {
val += Kokkos::ArithTraits<value_type>::conj(A(j,i)) * x(j);
},Ax_i);
y(i) = beta * y(i) + alpha * Ax_i;

@mhoemmen (Contributor) commented on the snippet above:

See above note on BLAS semantics.


ScalarA expected_result = 0;
for(int i=0;i<N;i++)
expected_result += AT::abs(h_x(i)) * AT::abs(h_x(i));

@mhoemmen (Contributor) commented on the snippet above:

This would be a good place to compute a Wilkinson-style running error bound, rather than using an ad-hoc bound (see below).
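
As an illustration of that suggestion (not code from this PR), the host-side reference loop could carry a running bound alongside the sum. This assumes AT is the test's existing Kokkos::ArithTraits<ScalarA> typedef; error_bound, term, and eps are illustrative names:

```cpp
// Wilkinson-style running bound: each addition of a term t to a running sum s
// can commit a rounding error of roughly eps * (|s| + |t|). Accumulating that
// alongside the sum gives a tolerance tied to the data, instead of the
// ad-hoc eps*expected_result used below.
typename AT::mag_type error_bound = 0;
const auto eps = AT::epsilon();

ScalarA expected_result = 0;
for (int i = 0; i < N; i++) {
  const typename AT::mag_type term = AT::abs(h_x(i)) * AT::abs(h_x(i));
  expected_result += term;
  error_bound += eps * (AT::abs(expected_result) + term);
}
// ... and later, e.g.:
// EXPECT_NEAR_KK(nonconst_nonconst_result, expected_result, error_bound);
```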

} );

ScalarB nonconst_nonconst_result = KokkosBlas::dot(y,y);
EXPECT_NEAR_KK( nonconst_nonconst_result, expected_result, eps*expected_result);

@mhoemmen (Contributor) commented on the snippet above:

See above note.

} );

ScalarB const_nonconst_result = KokkosBlas::dot(y,y);
EXPECT_NEAR_KK( const_nonconst_result, expected_result, eps*expected_result);

@mhoemmen (Contributor) commented on the snippet above:

See above note.

@crtrott (Member) commented Apr 23, 2018

Hm, interesting question regarding the BLAS semantics. At the top level we handle that logic pretty far outward and just do ETI for everything. Here we would need to inject all the different code variants into your kernel, which could have some bad performance effects. I definitely want to keep BLAS semantics on the global functions but for the nested ones I just don't know whether the possible performance issues are worth it. Any thoughts?

@mhoemmen (Contributor) replied:

@crtrott wrote:

I definitely want to keep BLAS semantics on the global functions but for the nested ones I just don't know whether the possible performance issues are worth it. Any thoughts?

Tpetra can always implement BLAS semantics on its own, so I'm not personally worried so much about the global functions. However, you do have to worry about users following current developments in the batched linear algebra standardization effort. In addition, it depends on how we intend to use these new functions. I'm not sure what multifrontal / supernodal sparse factorizations expect, for example.

@vqd8a (Contributor, Author) commented Apr 24, 2018

@crtrott Any further guidance?

@srajama1 (Contributor) replied:

@vqd8a : The BLAS semantics are important for node-level calls. For the team-level ones you are working on, the semantics are not clear, especially with the performance hit Christian is talking about. In fact, this is a point of discussion for the batched BLAS committee. Let me check with them next time we talk. We don't expect these semantics for our sparse solvers, so what you have done is good for us.

The use case of coefficients different than vector entries is also a node-level use case. I have never heard of it for a team-level use case, though it could extend there. I am willing to go with what you have and then iterate for the other use case. Can you create a wiki page describing these comments so users know what they are getting?

srajama1 previously approved these changes Apr 24, 2018
mhoemmen previously approved these changes Apr 25, 2018
@mhoemmen (Contributor) commented Apr 25, 2018

I've approved the changes so as not to block progress, but you really need to fix the InnerProductSpaceTraits issue or Stokhos might not build correctly. @etphipp may wish to comment.

@vqd8a (Contributor, Author) commented Apr 25, 2018

Thanks, @srajama1 @mhoemmen.
@srajama1 I am going to create a wiki page for these. BTW, I am not quite clear about "The use case of coefficients different than vector entries is also a node-level use case". Could you please explain more?

@etphipp (Contributor) commented Apr 25, 2018

For some Stokhos scalar types, the result type of a dot product differs from the scalar type, hence the use of InnerProductSpaceTraits as Mark described. However, Stokhos has historically overloaded these functions for its scalar types anyway, so it probably won't really matter (unless this breaks those overloads so that they no longer override the right function signature).

@etphipp (Contributor) commented Apr 25, 2018

And after looking at the code more closely, you really should follow Mark's suggestion. It's a 1-line change and makes the code consistent.

@vqd8a (Contributor, Author) commented Apr 25, 2018

Thanks. @etphipp

@srajama1 (Contributor) replied:

@vqd8a : I was wondering whether InnerProductSpaceTraits is really needed, because of the overloading @etphipp mentioned. He says it is needed, so let us do that.

vqd8a dismissed stale reviews from mhoemmen and srajama1 via 836726a, April 25, 2018 17:26
@vqd8a (Contributor, Author) commented Apr 25, 2018

Changed to using InnerProductSpaceTraits::dot to compute the product in KokkosBlas1_team_dot_spec.hpp.

@srajama1 (Contributor) replied:

As @mhoemmen and @etphipp wanted this change, I will wait for them to review. I can click the green button once either of them approves.

@etphipp (Contributor) left a review comment:

Looks fine to me, although you could shorten the dot_type typedef by using IPT.

@etphipp (Contributor) commented Apr 25, 2018

Inner products and norms are treated differently from general multiplies, which are what those functions do.

@vqd8a (Contributor, Author) commented Apr 25, 2018

But in KokkosBlas1_team_mult_spec.hpp and KokkosBlas2_team_gemv_spec.hpp we still need to compute products, right? Correct me if I am wrong. Thanks.

@etphipp (Contributor) commented Apr 25, 2018

Yes, but these functions mathematically represent different operations, so you don't use the IPT there.

@vqd8a (Contributor, Author) commented Apr 25, 2018

Oh, I see. Thanks. @etphipp

mhoemmen previously approved these changes Apr 25, 2018
@crtrott (Member) commented May 2, 2018

Running tests now.

@mhoemmen (Contributor) left a review comment:

Kokkos::ArithTraits already has real and imag functions. For example:

static KOKKOS_FORCEINLINE_FUNCTION mag_type real (const T& x);

Thus, you don't need these extra RealPart and ImagPart functions. If you can instantiate Kokkos::ArithTraits for the SIMD<Kokkos::complex<T>> type, then you should be able to call the function in ArithTraits, and it should return the right thing. If it doesn't return the right thing, then we should fix ArithTraits to do the right thing.

template<typename T, int l> typename Kokkos::Details::ArithTraits<Vector<SIMD<Kokkos::complex<T> >,l> >::mag_type
template<typename T>
KOKKOS_INLINE_FUNCTION
typename Kokkos::Details::ArithTraits<T>::mag_type RealPart(const T &val) { return val; }

@mhoemmen (Contributor) commented on the snippet above:
Kokkos::ArithTraits already has the real function:

static KOKKOS_FORCEINLINE_FUNCTION mag_type real (const T& x);

You don't need this extra function. If you can instantiate Kokkos::ArithTraits for the SIMD type, then you should be able to call the function in ArithTraits, and it should return the right thing. If it doesn't return the right thing, then please fix ArithTraits instead of reimplementing real here.

template<typename T, int l> typename Kokkos::Details::ArithTraits<Vector<SIMD<Kokkos::complex<T> >,l> >::mag_type
template<typename T>
KOKKOS_INLINE_FUNCTION
typename Kokkos::Details::ArithTraits<T>::mag_type ImagPart(const T &val) { return 0; }

@mhoemmen (Contributor) commented on the snippet above:

See above comment (ArithTraits already has an imag function).
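
As a sketch of the direction Mark is suggesting (assuming ArithTraits is, or gets, specialized for the SIMD<Kokkos::complex<T>> vector type), call sites would use ArithTraits directly instead of the local helpers; ats_type and val are placeholder names:

```cpp
// Illustrative use at a call site: ArithTraits already provides real/imag.
typedef Kokkos::Details::ArithTraits<value_type> ats_type;
const typename ats_type::mag_type re = ats_type::real(val);  // instead of RealPart(val)
const typename ats_type::mag_type im = ats_type::imag(val);  // instead of ImagPart(val)
```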

crtrott added 2 commits May 2, 2018 21:10
We are not using experimental flags as a rule.
@crtrott (Member) commented May 3, 2018

Mark, you can open a separate issue for that. I only touched that file to fix some compile issues I had.

@mhoemmen (Contributor) commented May 3, 2018

@crtrott wrote:

Mark you can open a separate issue for that. I only touched that file to fix some compile issues I had.

Done: #225

@mhoemmen (Contributor) left a review comment:

See #225.

mhoemmen previously approved these changes May 3, 2018
mhoemmen dismissed their stale review May 3, 2018 15:38

See @crtrott's comments above.

@crtrott (Member) commented May 5, 2018

Running on machine: apollo
Going to test compilers:  gcc/4.8.4 gcc/5.3.0 intel/16.0.1 clang/3.9.0 cuda/9.1
Testing compiler gcc/4.8.4
...
#######################################################
PASSED TESTS
#######################################################
clang-3.9.0-Pthread_Serial-release build_time=211 run_time=1508
gcc-4.8.4-Pthread-release build_time=123 run_time=911
#######################################################
FAILED TESTS
#######################################################
cuda-9.1-Cuda_OpenMP-release (test failed)
gcc-4.8.4-OpenMP-release (test failed)
gcc-5.3.0-Serial-release (test failed)
intel-16.0.1-OpenMP-release (test failed)

@crtrott (Member) commented May 5, 2018

[crtrott@apollo KokkosKernelsBuild]$ grep "error:" nohup.PR215e
[crtrott@apollo KokkosKernelsBuild]$ grep "FAIL" nohup.PR215e
  FAILED gcc-4.8.4-OpenMP-release
[  FAILED  ] openmp.batched_vector_arithmatic_simd_dcomplex2 (137 ms)
[  FAILED  ] 1 test, listed below:
[  FAILED  ] openmp.batched_vector_arithmatic_simd_dcomplex2
 1 FAILED TEST
  FAILED gcc-5.3.0-Serial-release
[  FAILED  ] serial.batched_vector_arithmatic_simd_dcomplex2 (117 ms)
[  FAILED  ] 1 test, listed below:
[  FAILED  ] serial.batched_vector_arithmatic_simd_dcomplex2
 1 FAILED TEST
  FAILED intel-16.0.1-OpenMP-release
[  FAILED  ] openmp.batched_scalar_serial_trsv_l_nt_n_dcomplex_dcomplex (10 ms)
[  FAILED  ] openmp.batched_scalar_serial_trsv_l_nt_n_dcomplex_double (10 ms)
[  FAILED  ] 2 tests, listed below:
[  FAILED  ] openmp.batched_scalar_serial_trsv_l_nt_n_dcomplex_dcomplex
[  FAILED  ] openmp.batched_scalar_serial_trsv_l_nt_n_dcomplex_double
 2 FAILED TESTS
  FAILED cuda-9.1-Cuda_OpenMP-release
[  FAILED  ] openmp.batched_vector_arithmatic_simd_dcomplex2 (126 ms)
[  FAILED  ] 1 test, listed below:
[  FAILED  ] openmp.batched_vector_arithmatic_simd_dcomplex2
 1 FAILED TEST
FAILED TESTS

@crtrott (Member) commented May 5, 2018

Looks like this is all batched stuff. So I am going to approve this pull request and then we fix the rest.

crtrott merged commit dac474e into develop May 5, 2018
crtrott deleted the issue-209-214 branch May 5, 2018 04:12
@ndellingwood (Contributor) commented:

Issue #231 was opened regarding the batched stuff; @kyungjoo-kim is working on it.

@srajama1 (Contributor) commented May 7, 2018

@crtrott Did the SpGEMM issue get resolved because of the Kokkos fix?
