Transpose ND HIP Kernel Optimizations #216

snehaa8 · 2024-01-04T04:15:08Z

No description provided.

r-abishek · 2024-04-30T01:56:23Z

src/modules/hip/kernel/transpose.hpp

+
+    int maxLength = dstStrides[0];
+    int xAlignedLength =  maxLength & ~7;      // alignedLength for vectorized global loads
+    int xDiff = maxLength - xAlignedLength;    // difference between roiWidth and alignedLength


Remove the xAlignedLength variable and just substitute with

int xDiff = maxLength - (maxLength & ~7); // difference between roiWidth and (alignedLength = maxLength & ~7)
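As a quick check of the arithmetic behind this suggestion, here is a minimal standalone sketch. It is not code from the PR: `alignedLength`, `tailLength` and `tailMatchesLowBits` are hypothetical helper names, and the sketch assumes a non-negative `int` length and a vector width of 8 elements. It illustrates that `maxLength & ~7` rounds down to a multiple of 8, so the difference is just the low three bits of the length.

```cpp
// Hypothetical helpers for illustration only; they are not part of the PR.

// Round a non-negative length down to the nearest multiple of 8.
inline int alignedLength(int maxLength) { return maxLength & ~7; }

// Elements left over after the 8-wide vectorized portion (xDiff in the diff).
inline int tailLength(int maxLength) { return maxLength - (maxLength & ~7); }

// For non-negative lengths the tail equals the low three bits of the length.
inline bool tailMatchesLowBits(int maxLength)
{
    return tailLength(maxLength) == (maxLength & 7);
}
```

Under these assumptions, `maxLength & 7` would give the same `xDiff` in a single operation, though the subtraction form arguably documents the intent more directly.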

r-abishek · 2024-04-30T02:01:36Z

src/modules/cpu/kernel/transpose.hpp

+                                                                 *(srcPtr3 + 2 * srcGenericDescPtr->strides[1]),
+                                                                 *(srcPtr3 + 3 * srcGenericDescPtr->strides[1]),
+                                                                 *(srcPtr3 + 4 * srcGenericDescPtr->strides[1]),
+                                                                 *(srcPtr3 + 5 * srcGenericDescPtr->strides[1]),


The products 1/2/3/4/5/6/7 * srcGenericDescPtr->strides[1] are loop-invariant.
They can be computed once, outside all 4 loops!
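A minimal sketch of the hoisting being asked for, under stated assumptions: `gatherStrided` and its parameters are hypothetical and only mimic the access pattern in the diff, where `stride` stands in for `srcGenericDescPtr->strides[1]`. The caller is assumed to provide at least `7 * stride + count` source elements and `8 * count` destination elements.

```cpp
#include <array>
#include <cstddef>

// Hypothetical sketch: for each of `count` consecutive starting positions,
// gather 8 values spaced `stride` elements apart.
inline void gatherStrided(const float *src, std::size_t stride, std::size_t count, float *dst)
{
    // Loop-invariant: depends only on `stride`, so compute once up front.
    std::array<std::size_t, 8> offsets;
    for (std::size_t k = 0; k < 8; k++)
        offsets[k] = k * stride;

    for (std::size_t i = 0; i < count; i++)
    {
        const float *srcPtr = src + i;
        for (std::size_t k = 0; k < 8; k++)
            dst[i * 8 + k] = *(srcPtr + offsets[k]);   // stride products are reused, not recomputed
    }
}
```

An optimizing compiler may already hoist such products, but precomputing them makes the invariance explicit and keeps the load expressions short.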

snehaa8 · 2024-04-30T07:49:36Z

Addressed all comments w.r.t. transpose HIP

sampath1117 · 2024-05-10T14:20:20Z

@r-abishek
I have made the changes to the test suite for both host and hip and addressed the remaining comments in the PR

Please take another look if possible and let me know if further changes are needed

r-abishek · 2024-05-21T00:30:39Z

src/modules/cpu/kernel/transpose.hpp

+        for(; vectorLoopCount < alignedCols; vectorLoopCount += vectorIncrement)
+        {
+            __m256 pSrc[4];
+            rpp_simd_load(rpp_load8_f32_to_f32_avx, srcPtrRow[0], &pSrc[0]);


@sampath1117 Pls add the AVX flags
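For readers unfamiliar with the request, a hedged sketch of what guarding a vectorized loop behind an AVX flag can look like. This assumes the "flag" is the compiler-defined `__AVX2__` preprocessor macro (set, for example, by `-mavx2` on GCC/Clang); `copyRow` is a hypothetical function, not the PR's kernel.

```cpp
#include <cstddef>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

// Hypothetical example: copy `n` floats, using 8-wide AVX loads/stores only
// when the compiler was invoked with AVX2 enabled.
inline void copyRow(const float *src, float *dst, std::size_t n)
{
    std::size_t i = 0;
#if defined(__AVX2__)
    std::size_t alignedN = n & ~static_cast<std::size_t>(7);
    for (; i < alignedN; i += 8)
    {
        __m256 p = _mm256_loadu_ps(src + i);   // unaligned 8-float load
        _mm256_storeu_ps(dst + i, p);          // unaligned 8-float store
    }
#endif
    for (; i < n; i++)   // scalar tail, or the whole row when AVX2 is unavailable
        dst[i] = src[i];
}
```

With the guard in place the same source builds on targets without AVX2, falling back to the scalar loop.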

r-abishek · 2024-05-21T00:31:30Z

src/modules/cpu/kernel/transpose.hpp

+        dstPtr[i] += increment;
+}
+
+void rpp_store16_f32_f32_channelwise(Rpp32f **dstPtr, __m128 *p)


Can we make these helpers inline?
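A minimal sketch of the idea, assuming the helpers are defined in a header such as `transpose.hpp`. `incrementPtrs` is a hypothetical stand-in for the pointer-advancing helper visible in the diff; its name and signature are assumptions.

```cpp
#include <cstddef>

// Hypothetical stand-in for the pointer-advancing helper in the diff;
// the name and signature are assumptions, not the PR's code.
// `inline` lets this definition live in a header that is included from
// several translation units without violating the one-definition rule.
inline void incrementPtrs(float **dstPtr, std::size_t numPtrs, std::size_t increment)
{
    for (std::size_t i = 0; i < numPtrs; i++)
        dstPtr[i] += increment;
}
```

Besides avoiding duplicate-symbol link errors, `inline` on small helpers like this also hints that the call may be expanded in place inside the hot loop, though the compiler makes the final decision.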

r-abishek and others added 30 commits November 27, 2023 22:10

Add transpose ref input/output

230a29d

added initial support for generic ND transpose in HOST

edc1b9c

added golden outputs for transpose

69722bf

made changes in test suite to read fixed input and compare with golden output

optimized 2D transpose with SSE instructions

63c1929

added initial SSE version for 3D inputs with last dimension fixed to 16

4ff7b87

minor changes

29be31a

added support for transposing 3D inputs when innermost dimension remained constant

2b39906

added support for transposing 3D inputs when innermost dimension remains constant

598b010

minor cleanup

7151677

optimized 2D transpose with AVX2 instructions

ddc54a1

added generic case to do memcpy if permute order is same as input layout

d8883bf

made changes to process w.r.t strides instead of ROI

b7b66e4

made changes in test suite to run performance tests

bbcb707

made changes to update destination strides based on permute order

added initial transpose SSE version for 4D inputs

3ca31a4

optimized 4d transpose with avx2

eccaa3c

updated golden inputs and output with actual float data

b961d34

added golden input and outputs for 3D [2, 0, 1] transpose version; removed golden inputs and outputs for 6D data

changed the name of transpose test suite file

79682e5

ported transpose 3D 16 channel variant from SSE to AVX2 instructions

117d4bc

added a templated generic transpose kernel for other bitdepths

6f4c273

changed ROI buffer to also have begin values

b0b85b1

made transpose test suite generic for supporting any ND kernel

8637c82

added support to run transpose test suite from python

c909277

changed file name of transpose test suite for better readability

7aef9ca

minor changes

8e2cc6e

address review comments

b143008

removed usage of malloc/calloc in transpose function

4a60069

fixed build error in test suite

ad29567

Add initial generic templated u8/f16/f32/i8 unvectorized transpose - QA passing

1a41104

Add initial hip misc tests for transpose

1817f63

Increase max dims to 8

b6d6fad

r-abishek reviewed Apr 30, 2024

snehaa8 and others added 2 commits April 30, 2024 03:38

Cleanup and optimize

84ebccd

Merge branch 'develop' into sn/transpose_ND

178fc89

minor change in comment

485f181

r-abishek force-pushed the develop branch from d24f1ea to 9dcae9d Compare May 1, 2024 18:31

sampath1117 added 12 commits May 10, 2024 06:17

Merge branch 'develop' into sn/transpose_ND

5a699f7

Merge branch 'develop' into sn/transpose_ND

9a6cfc4

revert unnecessary changes that happened with merge

840b6e6

added transpose test case

bfc5598

removed .txt input files

bf20223

moved normalize inputs and outputs to another folder

9fea8f4

made changes to make the golden input and output path generic for all ND kernels

52df0c3

removed .txt output files for transpose

b67fed3

use hipMemcpyAsync instead of hipMemcpy in hip kernel

a99727e

code cleanup

modified compare output function to do comparison for transpose case

976f5f7

added 2d golden output for transpose

added golden output for 3d inputs

35f04c8

modified test suite code to display the transpose variant being tested

moved constant compute outside the loop

e12d5c7

sampath1117 added 2 commits May 10, 2024 14:24

minor change in description

59355c0

updated print statement for usage in test suites

9ec2f6b

r-abishek reviewed May 21, 2024

sampath1117 added 2 commits May 21, 2024 13:40

Merge branch 'develop' into sn/transpose_ND

538b53d

added AVX2 flags and made helper functions inline in HOST kernel

9d199da

r-abishek changed the base branch from develop to ar/transpose_tensor May 21, 2024 19:04

r-abishek added this to the sow10ms1 milestone May 21, 2024

r-abishek approved these changes May 21, 2024

r-abishek merged commit 8879af0 into r-abishek:ar/transpose_tensor May 21, 2024
