Adding team-based BLAS functions #215
Conversation
This provides the basic template of how to do the team-based things.
Do we want the team versions of functions like `axpby`, `gemv`, and `scal` to respect BLAS semantics regarding coefficients `alpha` and `beta`? (See comments for details.) It's up to y'all, but the node-global versions of these functions currently do respect BLAS semantics.
```cpp
const typename YVector::non_const_value_type& b, const YVector& y) {
  const int N = x.extent(0);
  Kokkos::parallel_for(Kokkos::TeamThreadRange(team,N), [&] (const int& i) {
    y(i) = b*y(i) + a*x(i);
```
This breaks BLAS semantics:
- If `b` is zero, the code should overwrite `y(i)` without multiplying. This matters if `y(i)` is `Inf` or `NaN`.
- If `a` is zero, the code should ignore `x(i)` without multiplying. This matters if `x(i)` is `Inf` or `NaN`.
If we intend to do this, then we should fix the behavior of the node-global version to be consistent with this.
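The special cases described above can be sketched as a plain serial loop (this is an illustration of the BLAS coefficient semantics, not the PR's Kokkos code; the function name `axpby_blas` is made up for this sketch):

```cpp
#include <cassert>
#include <limits>
#include <vector>

// Sketch of axpby with BLAS coefficient semantics: y = b*y + a*x,
// but zero coefficients short-circuit so that Inf/NaN in the
// skipped operand cannot propagate into the result.
void axpby_blas(double a, const std::vector<double>& x,
                double b, std::vector<double>& y) {
  const std::size_t N = y.size();
  for (std::size_t i = 0; i < N; ++i) {
    if (a == 0.0 && b == 0.0)  y[i] = 0.0;              // read neither x nor y
    else if (a == 0.0)         y[i] = b * y[i];         // ignore x(i), even if Inf/NaN
    else if (b == 0.0)         y[i] = a * x[i];         // overwrite y(i), even if Inf/NaN
    else                       y[i] = b * y[i] + a * x[i];
  }
}
```

The extra branches are exactly the performance concern raised later in this thread: each team-level kernel would need these code variants injected.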
```cpp
dot_type result;
int N = X.extent(0);
Kokkos::parallel_reduce(Kokkos::TeamThreadRange(team,N), [&] (const int& i, dot_type& val) {
  val += X(i)*Y(i);
```
If you're using `InnerProductSpaceTraits` (which you should), then you should use `InnerProductSpaceTraits::dot` to compute this product, not `operator*`.
When is that not the same for dot? I understand why it wouldn't be the same for NRM2 but for dot?
@crtrott It's for some Stokhos scalar types, where Eric wants to do Krylov methods with plain old `double` as the dot product / norm results, even though the vectors have Stokhos types in them. This is also a reason why one might want `axpby` with coefficients of a different type than the vectors' entries.
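The pattern being asked for can be illustrated with a toy traits class (everything here — `Ensemble`, `ToyIPT` — is made up for this sketch; the real machinery is Kokkos's `InnerProductSpaceTraits`): the dot result type can differ from the vectors' entry type, so the reduction must call the trait's `dot` instead of `operator*`.

```cpp
#include <cassert>
#include <vector>

// Toy "ensemble" scalar: a pair of samples (stand-in for a Stokhos type).
struct Ensemble { double v[2]; };

// Toy inner-product-space traits (illustrative, not the Kokkos API):
// the dot result collapses the ensemble to a plain double.
template <class T> struct ToyIPT;

template <> struct ToyIPT<Ensemble> {
  using dot_type = double;
  static dot_type dot(const Ensemble& x, const Ensemble& y) {
    return x.v[0] * y.v[0] + x.v[1] * y.v[1];
  }
};

// Dot product accumulating in dot_type, not in the entry type.
template <class T>
typename ToyIPT<T>::dot_type dot(const std::vector<T>& X,
                                 const std::vector<T>& Y) {
  typename ToyIPT<T>::dot_type result = 0;
  for (std::size_t i = 0; i < X.size(); ++i)
    result += ToyIPT<T>::dot(X[i], Y[i]);  // not X[i] * Y[i]
  return result;
}
```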
```cpp
const typename AVector::non_const_value_type& alpha, const AVector& a, const XVector& x) {
  const int N = x.extent(0);
  Kokkos::parallel_for(Kokkos::TeamThreadRange(team,N), [&] (const int& i) {
    y(i) = gamma*y(i) + alpha*a(i)*x(i);
```
See above note on BLAS semantics.
```cpp
static KOKKOS_INLINE_FUNCTION void team_scal (const TeamType& team, const RV& R, const typename XV::non_const_value_type& a, const XV& X) {
  const int N = X.extent(0);
  Kokkos::parallel_for(Kokkos::TeamThreadRange(team,N), [&] (const int& i) {
    R(i) = a*X(i);
```
See above note on BLAS semantics. (If `a` is zero, this should fill `R` with zeros, regardless of the contents of `X`.)
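The zero-fill case can be sketched serially (an illustration of the semantics, not the PR's kernel; `scal_blas` is a made-up name):

```cpp
#include <cassert>
#include <limits>
#include <vector>

// scal with BLAS semantics: R = a*X, but a == 0 fills R with zeros
// without reading X, so Inf/NaN entries in X cannot leak into R.
void scal_blas(std::vector<double>& R, double a, const std::vector<double>& X) {
  for (std::size_t i = 0; i < R.size(); ++i)
    R[i] = (a == 0.0) ? 0.0 : a * X[i];
}
```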
```cpp
const typename ZVector::non_const_value_type& gamma, const ZVector& z) {
  const int N = x.extent(0);
  Kokkos::parallel_for(Kokkos::TeamThreadRange(team,N), [&] (const int& i) {
    z(i) = gamma*z(i) + alpha*x(i) + beta*y(i);
```
See above note on BLAS semantics.
```cpp
Kokkos::parallel_reduce(Kokkos::ThreadVectorRange(team,M), [&] (const int& j, dot_type& val ) {
  val += A(j,i) * x(j);
},Ax_i);
y(i) = beta * y(i) + alpha * Ax_i;
```
See above note on BLAS semantics.
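A serial sketch of this reduction with the BLAS special cases applied (illustrative only; `gemv_t_blas` is not the PR's code): `beta == 0` overwrites `y(i)` and `alpha == 0` skips reading `A` and `x` entirely.

```cpp
#include <cassert>
#include <limits>
#include <vector>

// y = beta*y + alpha*A^T*x with BLAS coefficient semantics.
// A is stored as M rows of N columns.
void gemv_t_blas(double alpha, const std::vector<std::vector<double>>& A,
                 const std::vector<double>& x, double beta,
                 std::vector<double>& y) {
  const std::size_t M = A.size(), N = y.size();
  for (std::size_t i = 0; i < N; ++i) {
    double Ax_i = 0.0;
    if (alpha != 0.0)                      // alpha == 0: skip reading A and x
      for (std::size_t j = 0; j < M; ++j)
        Ax_i += A[j][i] * x[j];
    y[i] = (beta == 0.0) ? alpha * Ax_i    // beta == 0: overwrite y(i)
                         : beta * y[i] + alpha * Ax_i;
  }
}
```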
```cpp
Kokkos::parallel_reduce(Kokkos::ThreadVectorRange(team,M), [&] (const int& j, dot_type& val ) {
  val += Kokkos::ArithTraits<value_type>::conj(A(j,i)) * x(j);
},Ax_i);
y(i) = beta * y(i) + alpha * Ax_i;
```
See above note on BLAS semantics.
```cpp
ScalarA expected_result = 0;
for(int i=0;i<N;i++)
  expected_result += AT::abs(h_x(i)) * AT::abs(h_x(i));
```
This would be a good place to compute a Wilkinson-style running error bound, rather than using an ad-hoc bound (see below).
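One way to carry a Wilkinson-style running bound alongside the sum, under the standard rounding model fl(a op b) = (a op b)(1 + δ) with |δ| ≤ u (a sketch; the function name is made up, and this is not the test's current code):

```cpp
#include <cassert>
#include <cmath>
#include <limits>
#include <vector>

// Running error bound for sum_i x(i)^2: each multiply and each add
// contributes at most u * |intermediate result| of rounding error,
// accumulated term by term as the sum is formed.
double sum_of_squares_with_bound(const std::vector<double>& x, double& bound) {
  const double u = std::numeric_limits<double>::epsilon() / 2.0;  // unit roundoff
  double s = 0.0;
  bound = 0.0;
  for (double xi : x) {
    const double t = xi * xi;
    bound += u * std::fabs(t);   // rounding in the multiply
    s += t;
    bound += u * std::fabs(s);   // rounding in the add
  }
  return s;
}
```

A tolerance of this form (or a small multiple of `bound`) could then replace the ad-hoc `eps*expected_result` used below.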
```cpp
} );
ScalarB nonconst_nonconst_result = KokkosBlas::dot(y,y);
EXPECT_NEAR_KK( nonconst_nonconst_result, expected_result, eps*expected_result);
```
See above note.
```cpp
} );
ScalarB const_nonconst_result = KokkosBlas::dot(y,y);
EXPECT_NEAR_KK( const_nonconst_result, expected_result, eps*expected_result);
```
See above note.
Hm, interesting question regarding the BLAS semantics. At the top level we handle that logic pretty far outward and just do ETI for everything. Here we would need to inject all the different code variants into your kernel, which could have some bad performance effects. I definitely want to keep BLAS semantics on the global functions, but for the nested ones I just don't know whether the possible performance issues are worth it. Any thoughts?
@crtrott wrote:
Tpetra can always implement BLAS semantics on its own, so I'm not personally worried so much about the global functions. However, you do have to worry about users following current developments in the batched linear algebra standardization effort. In addition, it depends on how we intend to use these new functions. I'm not sure what multifrontal / supernodal sparse factorizations expect, for example.
@crtrott Any further guidance?
@vqd8a: The BLAS semantics are important for node-level calls. For the team-level ones you are working on, the semantics are not clear, especially with the performance hit Christian is talking about. In fact, this is a point of discussion for the batched BLAS committee. Let me check with them next time we talk. We don't expect these semantics for our sparse solvers, so what you have done is good for us. The use case of coefficients of a different type than the vector entries is also a node-level use case. I have never heard of this for a team-level use case, though it could extend. I am willing to go with what you have and then iterate for the other use case. Can you create a wiki page describing these comments so users know what they are getting?
I've approved changes so as not to block progress, but you really need to fix the InnerProductSpaceTraits issue or Stokhos might not build correctly. @etphipp may wish to comment.
For some Stokhos scalar types, the results of dot products are different from the scalar type, hence the use of InnerProductSpaceTraits as Mark described. However, Stokhos has historically overloaded these functions for its scalar types anyway, so it probably won't really matter (unless this breaks those overloads where they are no longer overriding the right function signature).
And after looking at the code more closely, you really should follow Mark's suggestion. It's a 1-line change and makes the code consistent.
Thanks. @etphipp
Changed to using `InnerProductSpaceTraits::dot` to compute the product in KokkosBlas1_team_dot_spec.hpp.
Looks fine to me, although you could shorten the dot_type typedef by using IPT.
Inner products and norms are treated differently than general multiplies, which are what those functions do.
But in KokkosBlas1_team_mult_spec.hpp and KokkosBlas2_team_gemv_spec.hpp, we still need to compute products, right? Correct me if I am wrong. Thanks.
Yes, but these functions mathematically represent different operations, so you don't use the IPT there.
Oh, I see. Thanks. @etphipp
Running tests now.
`Kokkos::ArithTraits` already has `real` and `imag` functions. For example (kokkos-kernels/src/Kokkos_ArithTraits.hpp, line 442 in 6e8e97a):

```cpp
static KOKKOS_FORCEINLINE_FUNCTION mag_type real (const T& x);
```
Thus, you don't need these extra `RealPart` and `ImagPart` functions. If you can instantiate `Kokkos::ArithTraits` for the `SIMD<Kokkos::complex<T>>` type, then you should be able to call the function in ArithTraits, and it should return the right thing. If it doesn't return the right thing, then we should fix ArithTraits to do the right thing.
src/batched/KokkosBatched_Vector.hpp (outdated)

```cpp
template<typename T, int l> typename Kokkos::Details::ArithTraits<Vector<SIMD<Kokkos::complex<T> >,l> >::mag_type
template<typename T>
KOKKOS_INLINE_FUNCTION
typename Kokkos::Details::ArithTraits<T>::mag_type RealPart(const T &val) { return val; }
```
`Kokkos::ArithTraits` already has the `real` function (kokkos-kernels/src/Kokkos_ArithTraits.hpp, line 442 in 6e8e97a):

```cpp
static KOKKOS_FORCEINLINE_FUNCTION mag_type real (const T& x);
```

You don't need this extra function. If you can instantiate `Kokkos::ArithTraits` for the SIMD type, then you should be able to call the function in ArithTraits, and it should return the right thing. If it doesn't return the right thing, then please fix ArithTraits instead of reimplementing `real` here.
src/batched/KokkosBatched_Vector.hpp (outdated)

```cpp
template<typename T, int l> typename Kokkos::Details::ArithTraits<Vector<SIMD<Kokkos::complex<T> >,l> >::mag_type
template<typename T>
KOKKOS_INLINE_FUNCTION
typename Kokkos::Details::ArithTraits<T>::mag_type ImagPart(const T &val) { return 0; }
```
See above comment (ArithTraits already has an `imag` function).
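The traits pattern the reviewer is pointing at can be illustrated with a minimal stand-in (the `ToyArithTraits` name and everything in it are made up for this sketch; the real class is `Kokkos::ArithTraits`): one generic `real`/`imag` interface, specialized per scalar type, covers both real and complex types, so per-type free functions like `RealPart`/`ImagPart` are redundant.

```cpp
#include <cassert>
#include <complex>

// Toy stand-in for Kokkos::ArithTraits (illustrative only).
// Primary template: a real scalar is its own real part; imag is zero.
template <class T> struct ToyArithTraits {
  using mag_type = T;
  static mag_type real(const T& x) { return x; }
  static mag_type imag(const T&)   { return T(0); }
};

// Specialization for complex: forward to the components.
template <class T> struct ToyArithTraits<std::complex<T>> {
  using mag_type = T;
  static mag_type real(const std::complex<T>& x) { return x.real(); }
  static mag_type imag(const std::complex<T>& x) { return x.imag(); }
};
```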
We are not using experimental flags as a rule.
Mark, you can open a separate issue for that. I only touched that file to fix some compile issues I had.
See #225.
Looks like this is all batched stuff. So I am going to approve this pull request and then we fix the rest.
Issue #231 was opened regarding the batched stuff, @kyungjoo-kim is working on it. |
@crtrott Did the SpGEMM issue get resolved because of the Kokkos fix?
Addressing issues #209, #214