
[SPARSE] Improve sparse performance on ROCM #7935

Merged
4 commits merged into apache:main on May 4, 2021

Conversation

tkonolige
Contributor

The current sparse-dense GPU kernel uses warp-level storage to handle caching of data. Warp-level storage uses shuffle intrinsics, which are slow on ROCm (because they actually read and write to shared memory). ROCm does provide intrinsics to do the correct memory management, but they are not available through TVM. Instead, this PR switches to using shared memory on ROCm devices. Performance is about 2x faster.

@tmoreau89 @jwfromm
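
For readers less familiar with the two strategies, here is a minimal sketch in CUDA-style syntax. It is illustrative only: the actual kernel is TVM-generated code, and the function names below are made up for the example. HIP exposes the same intrinsics, with a 64-lane wavefront on AMD hardware.

```cuda
// Strategy 1: warp-level storage via shuffle intrinsics. Every access is a
// shuffle; the claim in this PR is that ROCm lowers each one to an LDS
// round trip (a write plus a read).
__device__ float broadcast_shuffle(float val, int src_lane) {
  return __shfl_sync(0xffffffff, val, src_lane);
}

// Strategy 2 (what this PR switches to on ROCm): stage the value in shared
// memory once, then let every lane read it. One write, many reads.
// Simplified to a single slot per block for brevity.
__device__ float broadcast_shared(float val, int lane, int src_lane) {
  __shared__ float buf;
  if (lane == src_lane) buf = val;  // single write to shared memory
  __syncthreads();
  return buf;                       // each lane reads the staged value
}
```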

@masahi
Member

masahi commented Apr 28, 2021

This post says: "They (ds_permute and ds_bpermute instructions) use LDS hardware to route data between the 64 lanes of a wavefront, but they don’t actually write to an LDS location". I don't know what they mean by "route without actually writing".
https://gpuopen.com/learn/amd-gcn-assembly-cross-lane-operations/

I wonder: if both approaches use shared memory, why is the explicit approach in this PR faster?

@tkonolige
Contributor Author

@masahi With ds_permute, we do a write to and a read from LDS for each of the 64 lanes' accesses, versus a single write to LDS and 64 reads with the approach in this PR.

Lower down it says "All active lanes write data to a temporary buffer. All active lanes read data from the temporary buffer...".
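
To put rough numbers on this (assuming each cached value is broadcast to all 64 lanes of a wavefront): the shuffle path costs 64 writes plus 64 reads, i.e. 128 LDS transactions per broadcast value, while the explicit path costs 1 write plus 64 reads, i.e. 65. Halving the LDS traffic lines up with the roughly 2x speedup reported in the PR description.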

@masahi
Member

masahi commented Apr 28, 2021

I'm planning to work on improving our GPU scan kernel using warp shuffle instructions, so I want to investigate this issue when I get there. Warp shuffle on AMD being slower than shared memory sounds surprising and counterintuitive. In the PR that introduced warp shuffle support to TVM ROCm, #5727, @t-vi mentioned that he got a good speed-up on softmax reduction #5727 (comment). So I was under the impression that warp shuffle is generally a good thing on AMD too.
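
For reference, the warp-shuffle scan pattern in question looks roughly like the sketch below (a minimal per-warp inclusive prefix sum in CUDA terms; the function name is illustrative, and TVM's actual scan kernel is generated code, not this):

```cuda
// Minimal per-warp inclusive prefix sum using shuffle intrinsics.
__device__ float warp_inclusive_scan(float val) {
  const int lane = threadIdx.x & 31;  // lane index within a 32-lane warp
  for (int offset = 1; offset < 32; offset *= 2) {
    float up = __shfl_up_sync(0xffffffff, val, offset);
    if (lane >= offset) val += up;    // lanes below `offset` keep their value
  }
  return val;  // lane i now holds the sum of values from lanes 0..i
}
```

On AMD a wavefront is 64 lanes wide, so the mask, loop bound, and lane math would change accordingly; whether each shuffle lowers to an LDS round trip on ROCm is exactly the question raised in this thread.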

@t-vi
Contributor

t-vi commented Apr 28, 2021

I don't think the descriptions are entirely accurate, but the Vega ISA manual says

This does not access LDS memory and may be called even if no LDS memory is allocated to the wave. It uses LDS hardware to implement an arbitrary swizzle across threads in a wavefront.

so I would expect that the performance lies somewhere between using LDS and registers. I can imagine that doing a lot less writing might save time in this specific case, but it probably is best to check with AMD before drawing global conclusions.

@masahi masahi self-assigned this Apr 28, 2021
@masahi masahi merged commit 9070c65 into apache:main May 4, 2021
umangyadav pushed a commit to umangyadav/tvm that referenced this pull request May 5, 2021
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request May 6, 2021
trevor-m pushed a commit to neo-ai/tvm that referenced this pull request May 11, 2021

Each referenced commit carries the same message, listing the PR's four commits:

* [SPARSE] Improve sparse performance on ROCM
* default to shared mem
* formatting
* formatting