
[SPARSE] Improve sparse performance on ROCM #7935

Merged
4 commits merged into apache:main on May 4, 2021

Conversation

tkonolige
Contributor

The current sparse-dense GPU kernel uses warp-level storage to handle caching of data. Warp-level storage uses shuffle intrinsics, which are slow on ROCm (because they actually read and write to shared memory). ROCm does provide intrinsics to do the correct memory management, but they are not available through TVM. Instead, this PR switches to using shared memory on ROCm devices. Performance is about 2x faster.

@tmoreau89 @jwfromm
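
For readers less familiar with the two strategies, here is a minimal sketch in CUDA-style syntax. It is illustrative only: the actual kernel is TVM-generated code, and the function names below are made up for the example. HIP exposes the same intrinsics, with a 64-lane wavefront on AMD hardware.

```cuda
// Strategy 1: warp-level storage via shuffle intrinsics. Every access is a
// shuffle; the claim in this PR is that ROCm lowers each one to an LDS
// round trip (a write plus a read).
__device__ float broadcast_shuffle(float val, int src_lane) {
  return __shfl_sync(0xffffffff, val, src_lane);
}

// Strategy 2 (what this PR switches to on ROCm): stage the value in shared
// memory once, then let every lane read it. One write, many reads.
// Simplified to a single slot per block for brevity.
__device__ float broadcast_shared(float val, int lane, int src_lane) {
  __shared__ float buf;
  if (lane == src_lane) buf = val;  // single write to shared memory
  __syncthreads();
  return buf;                       // each lane reads the staged value
}
```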

@masahi
Member

masahi commented Apr 28, 2021

This post says: "They (ds_permute and ds_bpermute instructions) use LDS hardware to route data between the 64 lanes of a wavefront, but they don’t actually write to an LDS location". I don't know what they mean by "route without actually writing".
https://gpuopen.com/learn/amd-gcn-assembly-cross-lane-operations/

I wonder: if both approaches use shared memory, why is the explicit approach in this PR faster?

@tkonolige
Contributor Author

@masahi With ds_permute, we do a write to and a read from LDS for each of the 64 lanes' accesses, versus a single write to LDS and 64 reads with the approach in this PR.

Lower down it says "All active lanes write data to a temporary buffer. All active lanes read data from the temporary buffer...".
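
To put rough numbers on this (assuming each cached value is broadcast to all 64 lanes of a wavefront): the shuffle path costs 64 writes plus 64 reads, i.e. 128 LDS transactions per broadcast value, while the explicit path costs 1 write plus 64 reads, i.e. 65. Halving the LDS traffic lines up with the roughly 2x speedup reported in the PR description.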

@masahi
Member

masahi commented Apr 28, 2021

I'm planning to work on improving our GPU scan kernel using warp shuffle instructions, so I want to investigate this issue when I get there. Warp shuffle on AMD being slower than shared memory sounds surprising and counterintuitive. In the PR that introduced warp shuffle support to TVM ROCm, #5727, @t-vi mentioned that he got a good speed-up on softmax reduction #5727 (comment). So I was under the impression that warp shuffle is generally a good thing on AMD too.
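
For reference, the warp-shuffle scan pattern in question looks roughly like the sketch below (a minimal per-warp inclusive prefix sum in CUDA terms; the function name is illustrative, and TVM's actual scan kernel is generated code, not this):

```cuda
// Minimal per-warp inclusive prefix sum using shuffle intrinsics.
__device__ float warp_inclusive_scan(float val) {
  const int lane = threadIdx.x & 31;  // lane index within a 32-lane warp
  for (int offset = 1; offset < 32; offset *= 2) {
    float up = __shfl_up_sync(0xffffffff, val, offset);
    if (lane >= offset) val += up;    // lanes below `offset` keep their value
  }
  return val;  // lane i now holds the sum of values from lanes 0..i
}
```

On AMD a wavefront is 64 lanes wide, so the mask, loop bound, and lane math would change accordingly; whether each shuffle lowers to an LDS round trip on ROCm is exactly the question raised in this thread.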

@t-vi
Contributor

t-vi commented Apr 28, 2021

I don't think the descriptions are entirely accurate, but the Vega ISA manual says

This does not access LDS memory and may be called even if no LDS memory is allocated to the wave. It uses LDS hardware to implement an arbitrary swizzle across threads in a wavefront.

so I would expect that the performance lies somewhere between using LDS and registers. I can imagine that doing a lot less writing might save time in this specific case, but it probably is best to check with AMD before drawing global conclusions.

@masahi masahi self-assigned this Apr 28, 2021
@masahi masahi merged commit 9070c65 into apache:main May 4, 2021
umangyadav pushed a commit to umangyadav/tvm that referenced this pull request May 5, 2021
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request May 6, 2021
trevor-m pushed a commit to neo-ai/tvm that referenced this pull request May 11, 2021

Each referenced commit carries the same message, listing the PR's four commits:

* [SPARSE] Improve sparse performance on ROCM
* default to shared mem
* formatting
* formatting