-
Notifications
You must be signed in to change notification settings - Fork 99
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Serial code path for spmv #893
Conversation
I heard that we now have auto tester. The following spotcheck result is just for reference.
|
Status Flag 'Pre-Test Inspection' - Auto Inspected - Inspection Is Not Necessary for this Pull Request. |
Status Flag 'Pull Request AutoTester' - Testing Jenkins Projects: Pull Request Auto Testing STARTING (click to expand)Build InformationTest Name: KokkosKernels_PullRequest_GCC720
Jenkins Parameters
Build InformationTest Name: KokkosKernels_PullRequest_Tpls_GCC720
Jenkins Parameters
Build InformationTest Name: KokkosKernels_PullRequest_Tpls_INTEL18
Jenkins Parameters
Build InformationTest Name: KokkosKernels_PullRequest_GCC720_Light
Jenkins Parameters
Build InformationTest Name: KokkosKernels_PullRequest_Tpls_CUDA10
Jenkins Parameters
Build InformationTest Name: KokkosKernels_PullRequest_Tpls_CUDA9
Jenkins Parameters
Build InformationTest Name: KokkosKernels_PullRequest_Tpls_GCC720_GCC740
Jenkins Parameters
Using Repos:
Pull Request Author: kyungjoo-kim |
Status Flag 'Pull Request AutoTester' - Jenkins Testing: all Jobs PASSED Pull Request Auto Testing has PASSED (click to expand)Build InformationTest Name: KokkosKernels_PullRequest_GCC720
Jenkins Parameters
Build InformationTest Name: KokkosKernels_PullRequest_Tpls_GCC720
Jenkins Parameters
Build InformationTest Name: KokkosKernels_PullRequest_Tpls_INTEL18
Jenkins Parameters
Build InformationTest Name: KokkosKernels_PullRequest_GCC720_Light
Jenkins Parameters
Build InformationTest Name: KokkosKernels_PullRequest_Tpls_CUDA10
Jenkins Parameters
Build InformationTest Name: KokkosKernels_PullRequest_Tpls_CUDA9
Jenkins Parameters
Build InformationTest Name: KokkosKernels_PullRequest_Tpls_GCC720_GCC740
Jenkins Parameters
|
Status Flag 'Pre-Merge Inspection' - - This Pull Request Requires Inspection... The code must be inspected by a member of the Team before Testing/Merging |
All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks fine to me, I just have a few minor comments about the style I guess and performance.
Any reason for using 4 tmp rather than something else, I am guessing this gave the best performance?
const ordinal_type *__restrict__ col_idx_ptr = A.graph.entries.data(); | ||
const value_type *__restrict__ values_ptr = A.values.data(); | ||
|
||
typename YVector::non_const_value_type *__restrict__ y_ptr = y.data(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since I already spotted it a few times, maybe add something like:
using output_value_type = typename YVector::non_const_value_type
typename YVector::non_const_value_type *__restrict__ y_ptr = y.data(); | ||
typename XVector::const_value_type *__restrict__ x_ptr = x.data(); | ||
|
||
const typename YVector::non_const_value_type zero(0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think there is anything wrong with this especially since it is on the host side, but in general using:
const output_value_type zero = Kokkos::ArithTraits<output_value_type>::zero()
can be more robust but that's no big deal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think that this is just a style. ; )
if (alpha == zero) { | ||
if (dobeta == 0) { | ||
memset(y_ptr, 0, sizeof(typename YVector::value_type)*nrow); | ||
} else if (dobeta == 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not check for dobeta == -1
, it would probably be a bit faster to do it with -= operator than with *= operator?
Again not a big deal
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a case with alpha = zero. When beta is any number, it is scaling. -= cannot be used here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I guess I should have said, y = -y
, I am actually hoping that the compiler can detect that alpha is equal to -1
and can optimize this but again I do not think it is critical here. I think alpha=-1
is a more common case because of residual calculations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See the comment in the below.
Status Flag 'Pre-Merge Inspection' - - This Pull Request Requires Inspection... The code must be inspected by a member of the Team before Testing/Merging |
All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur... |
@lucbv The performance is in paricular slower on power architecture comparing with intel. Intel already perform reasonably well without this trick. The power architecture vectorization has 128 bit length and I just give possibility of vectorization with latency hiding by feeding 2x food. This kind of optimization should be done by a compiler. With advance in compiler optimization, the performance can be improved without hand optimization like this (that will be the best at the end). The performance depends on the sparsity of the matrix. I do not really expect performance with this trick but some problems on a coarse domain is quite dense in the application. It just performs better. If there is no critical concern, would you approve and include this in the code ? Earlier integration with Trilinos would also help for the application level testing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
Status Flag 'Pre-Merge Inspection' - SUCCESS: The last commit to this Pull Request has been INSPECTED AND APPROVED by [ lucbv ]! |
Status Flag 'Pull Request AutoTester' - Pull Request MUST BE MERGED MANUALLY BY Project Team - This Repo does not support Automerge |
if (dobeta == 0) { | ||
y_ptr[i] = alpha*(tmp1 + tmp2 + tmp3 + tmp4); | ||
} else if (dobeta == -1) { | ||
y_ptr[i] -= alpha*(tmp1 + tmp2 + tmp3 + tmp4); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lucbv This is when alpha = -1.
Status Flag 'Pull Request AutoTester' - Pull Request MUST BE MERGED MANUALLY BY Project Team - This Repo does not support Automerge |
1 similar comment
Status Flag 'Pull Request AutoTester' - Pull Request MUST BE MERGED MANUALLY BY Project Team - This Repo does not support Automerge |
could we merge this ? |
Sure |
The current spmv with transpose uses atomic updates and follows the generic interface with parallel for. This significantly lowers the performance as seen in trilinos/Trilinos#8658.
This PR address the performance issue of serial implementation of spmv.