Transpose ND HIP Kernel Optimizations #216

Merged
merged 122 commits into from
May 21, 2024

@snehaa8 snehaa8 commented Jan 4, 2024

No description provided.

r-abishek and others added 30 commits November 27, 2023 22:10
made changes in test suite to read fixed input and compare with golden output
made changes to update destination strides based on permute order
added golden input and outputs for 3D [2, 0, 1] transpose version

removed golden inputs and outputs for 6D data
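
As a rough illustration of how a permute order such as [2, 0, 1] determines the destination shape and strides, here is a minimal sketch. The helper names `permuteDims` and `contiguousStrides` are hypothetical and not taken from this PR, and the sketch assumes the common convention that output axis i takes its length from input axis perm[i], with a densely packed row-major destination.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the PR): output shape of a transpose,
// assuming output axis i takes its length from input axis perm[i].
inline std::vector<std::size_t> permuteDims(const std::vector<std::size_t> &srcDims,
                                            const std::vector<std::size_t> &perm)
{
    assert(srcDims.size() == perm.size());
    std::vector<std::size_t> dstDims(perm.size());
    for (std::size_t i = 0; i < perm.size(); i++)
        dstDims[i] = srcDims[perm[i]];
    return dstDims;
}

// Hypothetical helper (not from the PR): row-major strides, in elements,
// for a densely packed tensor of the given shape.
inline std::vector<std::size_t> contiguousStrides(const std::vector<std::size_t> &dims)
{
    std::vector<std::size_t> strides(dims.size(), 1);
    for (std::size_t i = dims.size(); i-- > 1;)
        strides[i - 1] = strides[i] * dims[i]; // each stride spans all faster-varying axes
    return strides;
}
```

Under these assumptions, a source of shape {2, 3, 4} with permute order {2, 0, 1} gives a destination of shape {4, 2, 3} with strides {6, 3, 1}.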

int maxLength = dstStrides[0];
int xAlignedLength = maxLength & ~7; // alignedLength for vectorized global loads
int xDiff = maxLength - xAlignedLength; // difference between roiWidth and alignedLength
Owner

Remove the xAlignedLength variable and substitute its expression directly:

int xDiff = maxLength - (maxLength & ~7);    // difference between roiWidth and (alignedLength = maxLength & ~7)
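
To make the arithmetic in this suggestion concrete, here is a small standalone sketch. The struct `AlignedSplit` and the function `splitForVectorLoads` are hypothetical names, not part of the PR; the only assumption carried over is a vector width of 8 elements, as implied by the `~7` mask.

```cpp
#include <cassert>

// Hypothetical helper (not from the PR): splits a row length into the part
// covered by 8-wide vector loads and the leftover tail.
struct AlignedSplit
{
    int alignedLength; // largest multiple of 8 that is <= maxLength
    int xDiff;         // leftover elements, always in [0, 7]
};

inline AlignedSplit splitForVectorLoads(int maxLength)
{
    assert(maxLength >= 0); // the mask arithmetic below assumes a non-negative length
    AlignedSplit s;
    s.alignedLength = maxLength & ~7;       // clear the low 3 bits
    s.xDiff = maxLength - (maxLength & ~7); // equivalent to maxLength & 7
    return s;
}
```

For instance, a length of 29 splits into an aligned part of 24 and a tail of 5, so the single-expression form yields the same `xDiff` without keeping `xAlignedLength` around.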

*(srcPtr3 + 2 * srcGenericDescPtr->strides[1]),
*(srcPtr3 + 3 * srcGenericDescPtr->strides[1]),
*(srcPtr3 + 4 * srcGenericDescPtr->strides[1]),
*(srcPtr3 + 5 * srcGenericDescPtr->strides[1]),
Owner

Each of the 1/2/3/4/5/6/7 * srcGenericDescPtr->strides[1] products is constant.
They can be computed once, outside the 4 loops!
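
A minimal sketch of the kind of hoisting this comment asks for, using plain arrays rather than the PR's descriptors. The functions `makeStrideOffsets` and `sumStridedGroups` are hypothetical, and a single loop stands in for the four nested loops of the real kernel.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical helper (not from the PR): the offsets k * stride depend on no
// loop index, so they can be computed a single time.
inline std::array<std::size_t, 8> makeStrideOffsets(std::size_t stride)
{
    std::array<std::size_t, 8> offsets{};
    for (std::size_t k = 0; k < 8; k++)
        offsets[k] = k * stride;
    return offsets;
}

// Sums 8 elements spaced `stride` apart for each of `rows` starting positions.
inline float sumStridedGroups(const float *src, std::size_t rows, std::size_t stride)
{
    assert(src != nullptr);
    const std::array<std::size_t, 8> offsets = makeStrideOffsets(stride); // hoisted out of the loops
    float total = 0.0f;
    for (std::size_t r = 0; r < rows; r++)
    {
        const float *srcPtr = src + r; // per-iteration base pointer
        for (std::size_t k = 0; k < 8; k++)
            total += *(srcPtr + offsets[k]); // no multiplication inside the loops
    }
    return total;
}
```

A compiler may already perform this hoisting on its own, so any speed-up would need to be measured; the explicit form mainly makes the invariant visible to the reader.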

@snehaa8
Author

snehaa8 commented Apr 30, 2024

Addressed all comments with respect to transpose HIP.

@sampath1117
Collaborator

sampath1117 commented May 10, 2024

@r-abishek
I have made the test suite changes for both host and HIP and addressed the remaining comments in the PR.

Please take another look if possible and let me know if further changes are needed.

for(; vectorLoopCount < alignedCols; vectorLoopCount += vectorIncrement)
{
__m256 pSrc[4];
rpp_simd_load(rpp_load8_f32_to_f32_avx, srcPtrRow[0], &pSrc[0]);
Owner

@sampath1117 Please add the AVX flags
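
One possible reading of this request, stated here as an assumption, is that the vectorized loop should sit behind a compile-time AVX guard so the code still builds and runs on targets without AVX2. The sketch below illustrates that pattern with a hypothetical `copyRow` function and the standard `__AVX2__` predefined macro; it does not reproduce the PR's `rpp_simd_load` helpers.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

// Hypothetical example (not the PR's code): copies `length` floats, using
// 8-wide loads and stores only when the compiler targets AVX2.
inline void copyRow(const float *src, float *dst, std::size_t length)
{
    assert(src != nullptr && dst != nullptr);
    std::size_t i = 0;
#if defined(__AVX2__)
    const std::size_t alignedLength = length & ~static_cast<std::size_t>(7);
    for (; i < alignedLength; i += 8)
    {
        __m256 p = _mm256_loadu_ps(src + i); // unaligned 8-float load
        _mm256_storeu_ps(dst + i, p);        // unaligned 8-float store
    }
#endif
    for (; i < length; i++) // scalar tail, or the whole row when AVX2 is unavailable
        dst[i] = src[i];
}
```

With this layout the scalar loop doubles as the fallback path, so both build configurations produce the same output.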

dstPtr[i] += increment;
}

void rpp_store16_f32_f32_channelwise(Rpp32f **dstPtr, __m128 *p)
Owner

Can we make these helpers inline?

@r-abishek r-abishek changed the base branch from develop to ar/transpose_tensor May 21, 2024 19:04
@r-abishek r-abishek added this to the sow10ms1 milestone May 21, 2024
@r-abishek r-abishek merged commit 8879af0 into r-abishek:ar/transpose_tensor May 21, 2024
Labels
enhancement New feature or request

7 participants