Transpose ND HIP Kernel Optimizations #216
made changes in test suite to read fixed input and compare against golden output
made changes to update destination strides based on permute order
added golden input and outputs for 3D [2, 0, 1] transpose version; removed golden inputs and outputs for 6D data
src/modules/hip/kernel/transpose.hpp
int maxLength = dstStrides[0];
int xAlignedLength = maxLength & ~7; // alignedLength for vectorized global loads
int xDiff = maxLength - xAlignedLength; // difference between roiWidth and alignedLength
Remove the xAlignedLength variable and just substitute with
int xDiff = maxLength - (maxLength & ~7); // difference between roiWidth and (alignedLength = maxLength & ~7)
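To make the mask arithmetic in this suggestion concrete, here is a minimal sketch. The function names are hypothetical and not part of the PR; it assumes a non-negative length and a vector width of 8, matching the `& ~7` mask in the diff.

```cpp
// Elements per vectorized load; 8 matches the `& ~7` mask used in the diff.
constexpr int kVectorWidth = 8;

// Largest multiple of 8 that does not exceed `length` (assumes length >= 0).
constexpr int alignedLength(int length)
{
    return length & ~(kVectorWidth - 1);
}

// Leftover elements after the aligned portion, i.e. the xDiff value in the diff.
constexpr int tailLength(int length)
{
    return length - alignedLength(length);
}
```

For a non-negative length the tail is the same value as `length & 7`, so the subtraction could arguably be dropped as well.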
src/modules/cpu/kernel/transpose.hpp
*(srcPtr3 + 2 * srcGenericDescPtr->strides[1]),
*(srcPtr3 + 3 * srcGenericDescPtr->strides[1]),
*(srcPtr3 + 4 * srcGenericDescPtr->strides[1]),
*(srcPtr3 + 5 * srcGenericDescPtr->strides[1]),
This whole 1/2/3/4/5/6/7 * srcGenericDescPtr->strides[1] is constant.
Can be taken out of 4 loops!
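A rough sketch of the hoisting being asked for, under stated assumptions: `sumStridedGroups`, its parameters, and the data layout are hypothetical stand-ins for the kernel's nested loops, and `stride` plays the role of `srcGenericDescPtr->strides[1]`.

```cpp
#include <cstddef>

// Sums 8 strided elements for every (row, col) position.
// Hypothetical stand-in for the kernel loops; not the PR's actual code.
inline float sumStridedGroups(const float *src, std::size_t stride,
                              std::size_t rows, std::size_t cols)
{
    // Loop-invariant offsets k * stride, computed once before the loops
    // instead of being re-multiplied on every inner iteration.
    std::size_t offsets[8];
    for (std::size_t k = 0; k < 8; ++k)
        offsets[k] = k * stride;

    float total = 0.0f;
    for (std::size_t r = 0; r < rows; ++r)
    {
        for (std::size_t c = 0; c < cols; ++c)
        {
            const float *base = src + r * cols + c;
            for (std::size_t k = 0; k < 8; ++k)
                total += *(base + offsets[k]);
        }
    }
    return total;
}
```

An optimizing compiler may already hoist such multiplications, so the gain is not guaranteed; precomputing them mainly makes the invariance explicit.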
Addressed all comments with respect to transpose HIP
added 2d golden output for transpose
modified test suite code to display the transpose variant being tested
@r-abishek Please take another look if possible and let me know if further changes are needed.
for(; vectorLoopCount < alignedCols; vectorLoopCount += vectorIncrement)
{
__m256 pSrc[4];
rpp_simd_load(rpp_load8_f32_to_f32_avx, srcPtrRow[0], &pSrc[0]);
@sampath1117 Pls add the AVX flags
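One plausible reading of this request is to wrap the AVX loop in a preprocessor guard so the file still builds without AVX support. The sketch below assumes the compiler-defined `__AVX2__` macro is the intended flag, which the thread does not confirm; `copyRow` is a hypothetical stand-in for the row loop in the diff.

```cpp
#include <cstddef>
#if __AVX2__
#include <immintrin.h>
#endif

// Copies `count` floats from src to dst.
// Hypothetical example; the guard macro is an assumption about what "AVX flags" means.
inline void copyRow(const float *src, float *dst, std::size_t count)
{
    std::size_t i = 0;
#if __AVX2__
    // Vectorized part: 8 floats per iteration, compiled only when AVX2 is enabled.
    const std::size_t alignedCount = count & ~static_cast<std::size_t>(7);
    for (; i < alignedCount; i += 8)
    {
        __m256 p = _mm256_loadu_ps(src + i);
        _mm256_storeu_ps(dst + i, p);
    }
#endif
    // Scalar part: handles the tail, or the whole row when AVX2 is unavailable.
    for (; i < count; ++i)
        dst[i] = src[i];
}
```

Because the scalar loop continues from wherever the guarded loop stopped, both builds produce the same output.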
src/modules/cpu/kernel/transpose.hpp
dstPtr[i] += increment;
}

void rpp_store16_f32_f32_channelwise(Rpp32f **dstPtr, __m128 *p)
Can we make these helpers inline?
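As an illustration of what marking such a helper `inline` looks like, here is a small sketch. `storeAndAdvance` is hypothetical and only mirrors the `dstPtr[i] += increment;` pattern visible in the diff; it is not one of the PR's helpers.

```cpp
// Hypothetical helper: writes one value per channel, then advances each channel pointer.
// `inline` lets the definition live in a header included by several translation units
// without multiple-definition link errors; actual inlining remains the compiler's choice.
inline void storeAndAdvance(float **dstPtr, const float *values, int channels, int increment)
{
    for (int i = 0; i < channels; i++)
    {
        *dstPtr[i] = values[i];
        dstPtr[i] += increment;
    }
}
```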