
[TOPI][CUDA] Improve the performance of scatter_nd #8479

Merged

Conversation

zhuwenxi (Contributor)

  1. Split into 2 kernels, one doing the "Init" and the other doing the "Update".
    This lets them use different Grid/Block configurations to better utilize
    the SMs.
  2. Use atomic_add instead of direct assignment, which avoids the race
    condition when multiple indices point to the same location of the output
    tensor. With this modification, it is now safe to use more CUDA threads
    to gain more parallelism.

Detailed discussion: https://discuss.tvm.apache.org/t/topi-cuda-scatter-nd-has-a-very-poor-performance-on-cuda-backend-1000x-slower-than-hand-written-cuda-code/10426
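
A minimal sketch of the scheme described above, using TVM's tir ir_builder (not the PR's actual code): it assumes static shapes and the simplified case where indices has shape (1, N); names such as scatter_nd_add_ir_sketch, ceil_div, and update_size are made up for illustration.

```python
import tvm
from tvm import te


def scatter_nd_add_ir_sketch(data, indices, updates, out):
    """Two-kernel scatter_nd for mode="add", simplified to indices of shape (1, N)."""
    ib = tvm.tir.ir_builder.create()
    data_ptr = ib.buffer_ptr(data)
    indices_ptr = ib.buffer_ptr(indices)
    updates_ptr = ib.buffer_ptr(updates)
    out_ptr = ib.buffer_ptr(out)

    ceil_div = lambda a, b: (a + b - 1) // b
    max_threads = int(tvm.target.Target.current(allow_none=False).max_num_threads)

    full_size = 1
    for d in data.shape:
        full_size *= int(d)
    num_indices = int(indices.shape[1])            # N sets of indices
    update_size = full_size // int(data.shape[0])  # elements written per index set

    # Kernel 1 ("Init"): copy `data` into `out`, parallel over the whole output.
    with ib.new_scope():
        bx = te.thread_axis("blockIdx.x")
        tx = te.thread_axis("threadIdx.x")
        ib.scope_attr(bx, "thread_extent", ceil_div(full_size, max_threads))
        ib.scope_attr(tx, "thread_extent", max_threads)
        i = bx * max_threads + tx
        with ib.if_scope(i < full_size):
            out_ptr[i] = data_ptr[i]

    # Kernel 2 ("Update"): its own grid/block shape; one blockIdx.y per index set,
    # threads (plus blockIdx.x) over the update dimension; atomic_add makes
    # concurrent writes to the same output location safe.
    with ib.new_scope():
        bx = te.thread_axis("blockIdx.x")
        by = te.thread_axis("blockIdx.y")
        tx = te.thread_axis("threadIdx.x")
        tdim = min(max_threads, update_size)
        ib.scope_attr(bx, "thread_extent", ceil_div(update_size, tdim))
        ib.scope_attr(by, "thread_extent", num_indices)
        ib.scope_attr(tx, "thread_extent", tdim)
        j = bx * tdim + tx
        with ib.if_scope(j < update_size):
            index = indices_ptr[by] * update_size + j
            out_ptr[index] = tvm.tir.call_intrin(
                updates.dtype,
                "tir.atomic_add",
                tvm.tir.call_intrin("handle", "tir.address_of", out_ptr[index]),
                updates_ptr[by * update_size + j],
            )

    return ib.get()
```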

@zhuwenxi (Contributor, Author):

@tkonolige Could you help review this PR? Thank you.

1. Split into 2 kernels, one does the "Init" and another does the "Update".
   Thus they can have different Grid/Block configurations to better utilize
   SMs.
2. Use atomic_add instead of direct assignment, which could avoid the race
   condition when multiple indices point to the same location of the output
   tensor. With this modification, it's now safe to use more CUDA threads
   to gain more parallelism.
@zhuwenxi force-pushed the feature/wenxizhu/improve-scatter-performance branch from 4107191 to 930043e on July 15, 2021 09:32
@zhuwenxi changed the title from "[TOPI][CUDA] Improve the performance of scatter_nd by:" to "[TOPI][CUDA] Improve the performance of scatter_nd" on Jul 15, 2021
@tkonolige (Contributor) left a comment:

Thanks zhuwenxi! Do you have performance numbers for the PR? I'd be interested in seeing them.

blockDim = data_ptr.shape[-1]

ib.scope_attr(bidx, "thread_extent", gridDim)
ib.scope_attr(tidx, "thread_extent", blockDim)
Contributor:

In some cases this dimension will be very small. Can you instead split the full shape by max_num_threads?
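
A minimal sketch of the suggested split (illustrative, not the PR's code): cap threadIdx.x at the target's max_num_threads and add blocks along blockIdx.x to cover the rest of the fused dimension; fused_updates_dimension is assumed to be a Python int, as in the surrounding TOPI code.

```python
import tvm
from tvm import te


def bind_update_threads(ib, fused_updates_dimension):
    """Bind threadIdx.x/blockIdx.x over the fused update dimension."""
    max_threads = int(tvm.target.Target.current(allow_none=False).max_num_threads)
    tdim = min(max_threads, fused_updates_dimension)
    bdim = (fused_updates_dimension + tdim - 1) // tdim  # ceil_div

    bidx = te.thread_axis("blockIdx.x")
    tidx = te.thread_axis("threadIdx.x")
    ib.scope_attr(bidx, "thread_extent", bdim)
    ib.scope_attr(tidx, "thread_extent", tdim)
    # Flattened position in the update dimension; callers should guard the tail
    # with ib.if_scope(j < fused_updates_dimension) when bdim * tdim overshoots.
    j = bidx * tdim + tidx
    return j
```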

with ib.new_scope():
    bidx = te.thread_axis("blockIdx.x")
    tidx = te.thread_axis("threadIdx.x")
    gridDim = fused_indices_dimension  # 32 * 600 = 19200
Contributor:

remove this comment

Contributor (Author):

Fixed.

# Build up the indices[0, y_0, .. y_{K-1}], .. indices[M-1, y_0, .. y_{K-1}] part
# of the index into out.
for l in reversed(range(indices_ptr.shape[0].value)):
findex = j
Contributor:

You've set j = tidx and then only use it in one spot. Why not just use tidx everywhere?

Contributor:

Fixed

Comment on lines -790 to -795
# For now we avoid parallizing over dimensions indexed by `indices` as
# there may be repeated indices and hadling parallel accumulation can
# be hard. So we parallelize over X_M .. X_{N-1} instead. This will
# work well when these dimensions are large enough to saturate memory
# bandwidth, but performance will be bad when these dimensions are
# small.
Contributor:

Can you add a comment about how we are doing parallelism? (We are thread-parallel over the whole update dimension, and each block handles one set of indices?)

@CaptainDuke (Contributor), Jul 20, 2021:

We follow the original parallelism scheme, but replace ib.for_range() with blockIdx.y.
Atomic_add guarantees correctness when mode == "add".
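
An illustrative before/after of this change (not the PR's exact code); build_body and fused_indices_dimension are placeholders for the real per-index-set kernel body and the fused extent of the scattered indices.

```python
from tvm import te


def update_loop_serial(ib, fused_indices_dimension, build_body):
    # Original scheme: each thread walks over all index sets serially.
    with ib.for_range(0, fused_indices_dimension) as i:
        build_body(i)


def update_loop_block_parallel(ib, fused_indices_dimension, build_body):
    # New scheme: one blockIdx.y per index set. With mode == "add", correctness
    # under concurrent writes comes from using atomic_add inside build_body.
    by = te.thread_axis("blockIdx.y")
    ib.scope_attr(by, "thread_extent", fused_indices_dimension)
    build_body(by)
```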

Contributor:

Can you update the comment in the code to reflect this?

Contributor:

Added

- Split ScatterND kernel into 2 sub-kernels using ib.new_scope()

- Replace ib.for_range() with blockIdx.y

- Using atomic_add when mode == "add"

- Keep threadIdx.x less than max_threads of GPU
@zhuwenxi (Contributor, Author) commented on Jul 20, 2021:

@tkonolige About the performance comparison: it's 23 ms vs. 4.9 ms on my NVIDIA T4 card, for the case I provided in https://discuss.tvm.apache.org/t/topi-cuda-scatter-nd-has-a-very-poor-performance-on-cuda-backend-1000x-slower-than-hand-written-cuda-code/10426.

@zhuwenxi (Contributor, Author):

@tkonolige We just pushed a commit to fix a unit test and the comment issue. The remaining fixes are on the way.

@tkonolige (Contributor) left a comment:

Could you provide timing information for a variety of shapes and ranks? I just want to make sure this is faster on all inputs.

offset *= data_ptr.shape[l]
if mode == "update":
-   out[index] = updates[i * fused_updates_dimension + j]
+   out[index] = updates[by * fused_updates_dimension + j]
Contributor:

Can you move updates[by * fused_updates_dimension + j] outside of the if statements?

Contributor:

Fixed

Comment on lines -790 to -795
# For now we avoid parallizing over dimensions indexed by `indices` as
# there may be repeated indices and hadling parallel accumulation can
# be hard. So we parallelize over X_M .. X_{N-1} instead. This will
# work well when these dimensions are large enough to saturate memory
# bandwidth, but performance will be bad when these dimensions are
# small.
Contributor:

Can you update the comment in the code to reflect this?

@CaptainDuke (Contributor) commented on Jul 21, 2021:

Could you provide timing information for a variety of shapes and ranks? I just want to make sure this is faster on all inputs.

[Image: ScatterND_performance (timing table)]

@tkonolige
We evaluated the performance with 3 types of ranks and shapes. Time (in nanoseconds) is collected using Nsight Systems.

So long as the extent of the original `with ib.for_range() as i` loop is large enough, the two separated kernels enlarge dimGrid and achieve significantly better parallelism.

@tkonolige (Contributor) left a comment:

Performance results look great! Could you also test 1. where indices is small (~10) and updates is large and 2. where indices is large and updates is size 1.

Comment on lines 794 to 796
# work well when these dimensions are large enough to saturate memory
# bandwidth, but performance will be bad when these dimensions are
# small.
Contributor:

This comment is no longer valid, right?

Contributor:

Deleted


# For better performance, we introduce blockIdx.y to implement for-loops
# within one thread.
# Atomic_add guarantees correctness when mode=="add"
Contributor:

Suggested change:
- # Atomic_add guarantees correctness when mode=="add"
+ # The code is parallel over the scattered indices, so we use atomic_add to guarantee correctness when mode=="add".

Contributor:

Fixed

Comment on lines -815 to -817
index = j # This is x_M, .. x_{N-1} part of the index into out.
# Build up the indices[0, y_0, .. y_{K-1}], .. indices[M-1, y_0, .. y_{K-1}] part
# of the index into out.
Contributor:

Can you keep this comment? I believe it still holds.

Contributor:

Added

@CaptainDuke (Contributor):

Performance results look great! Could you also test 1. where indices is small (~10) and updates is large and 2. where indices is large and updates is size 1.

[Image: ScatterND_test2 (timing table)]

@tkonolige
2 cases tested.

@CaptainDuke (Contributor) commented on Jul 22, 2021:

@tkonolige
We found that some test cases failed because atomic_add from CUDA doesn't support the int64 data type, so we added a fallback implementation to pass these test cases.

Do you have any suggestions on this fallback?
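
For reference, the fallback being described amounts to something like the following sketch (illustrative names; compute_out_index is a placeholder for the gather-index math, and this is not the committed code): keep the serial ib.for_range loop over the index sets and accumulate with an ordinary "+=", so no atomic support is needed.

```python
from tvm import te


def scatter_add_fallback(ib, out, updates, fused_indices_dimension,
                         fused_updates_dimension, compute_out_index):
    # Serial loop over index sets inside each thread: "+=" instead of atomic_add,
    # which stays correct because the loop is not parallel over the scattered
    # indices.
    tx = te.thread_axis("threadIdx.x")
    ib.scope_attr(tx, "thread_extent", fused_updates_dimension)
    with ib.for_range(0, fused_indices_dimension) as i:
        index = compute_out_index(i, tx)
        out[index] += updates[i * fused_updates_dimension + tx]
```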

- Atomic_add from CUDA doesn't support the int64 data type
- Change "ind{i}" to "ind%d" % i so that the names of relay.var display correctly
@CaptainDuke force-pushed the feature/wenxizhu/improve-scatter-performance branch from 83e8aa6 to 1faa97a on July 26, 2021 08:06
else:
    raise NotImplementedError("scatter_nd mode not in [update, add]:", mode)
with ib.new_scope():
    if updates.dtype == "int64" and mode == "add":
Contributor:

I don't think this is the correct way to check for atomic add support. @masahi What is the correct way?

Member:

Unfortunately there is not a good way. I think we should encode atomic support information into the target description (similar to @Lunderberg's Vulkan work).

For now, atomic is not supported by vulkan and metal.

Contributor:

I'd agree, the target would be the best location for checking atomic support. I have it on my to-do list to document/RFC which parameters should be standardized across target kinds, so that they'll be available for use in strategies/optimizations.

Contributor:

@CaptainDuke you need to check for Vulkan or metal here. Can you also add a comment as to why we have this if statement.

@CaptainDuke (Contributor) commented on Jul 27, 2021:

@tkonolige
I have committed several times but CI failed at different stages. Any suggestions for this situation?

Moreover, there is one test case error I cannot reproduce; the Jenkins log points to test_gru(), which seems unrelated to this PR.

The above results are computed on CPU, since the target and device were hardcoded (#8565).

@tkonolige (Contributor):

@CaptainDuke CI has been having issues. Just push a new commit and it will re-run.

Comment on lines 836 to 837
bdim_x = ceil_div(fused_updates_dimension, tdim)
bdim_y = fused_indices_dimension
Contributor:

For large input sizes, this creates too many blocks. Try with input sizes data=(21, 3, 2600, 212), indices=(4, 1, 1, 2600, 212), updates=(1, 1, 2600, 212).

@CaptainDuke (Contributor), Jul 28, 2021:

bx = te.thread_axis("blockIdx.y")
by = te.thread_axis("blockIdx.x")

Since the maximum x-dimension of a grid of thread blocks is 2^31 - 1, I exchanged blockIdx.x and blockIdx.y to avoid going out of bounds.

For the given input sizes, performance on mode="add" was

time: 784638840
grid=(1,1,1), block=(1,1,1)

vs.

time: 105102068 + 2141897 = 107243965.
grid=(34725600,1,1), block=(1,1,1)
grid=(551200,1,1), block=(1,1,1)

7.3x faster
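
A cleaned-up illustration of the swap (not the exact diff): the large fused_indices_dimension is bound to blockIdx.x, whose grid limit is 2^31 - 1, while the smaller per-update block count goes on blockIdx.y (limit 65535); the variable names here are chosen to match the axes rather than the PR's bx/by naming.

```python
from tvm import te


def bind_scatter_grid(ib, fused_indices_dimension, bdim_update):
    bx_indices = te.thread_axis("blockIdx.x")  # one block per set of indices
    by_update = te.thread_axis("blockIdx.y")   # blocks covering the update dim
    ib.scope_attr(bx_indices, "thread_extent", fused_indices_dimension)
    ib.scope_attr(by_update, "thread_extent", bdim_update)
    return bx_indices, by_update
```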

index += offset * indices[by + l * fused_indices_dimension]
offset *= data_ptr.shape[l]
if mode == "update":
    out[index] = updates[up_index]
@masahi (Member), Jul 28, 2021:

For update mode, does this give deterministic output? To me it seems it doesn't.

@CaptainDuke (Contributor), Jul 28, 2021:

The output is calculated via the following equation:

  import numpy as np

  output = np.copy(data)
  update_indices = indices.shape[:-1]
  for idx in np.ndindex(update_indices):
      output[indices[idx]] = updates[idx]

The order of iteration in the above loop is not specified. In particular, indices should not have duplicate entries: that is, if idx1 != idx2, then indices[idx1] != indices[idx2]. This ensures that the output value does not depend on the iteration order.

@masahi
According to the definition of ScatterND in ONNX, the output does not depend on the iteration order.

Based on that assumption, we replaced the original `with ib.for_range(0, fused_indices_dimension) as i` with blockIdx.y, where bdim_y = fused_indices_dimension.

Is this what you were concerned about?

Member:

It doesn't matter what ONNX says. If the previous implementation gives a deterministic output, a performance improvement shouldn't break that. If you use atomics for add mode, then I assume that multiple threads compete for the same write index. This leads to non-deterministic output for update mode.

Contributor:

@masahi
I see. So, should I fall back to the previous algorithm when mode="update"? Or do you have any suggestions? Thanks.

@masahi (Member), Jul 29, 2021:

Yes, for update the previous algorithm should be used.

If you do care about the performance improvement for update mode, we can add a new attribute allow_non_deterministic to scatter_nd op, which is False by default. And change ONNX frontend to emit scatter_op with allow_non_deterministic = True, which will allow the new code path for update mode as well. I think we can also choose this option if @tkonolige thinks this is reasonable.

Contributor:

OK, I'll fall back to the previous algorithm for update.
For the allow_non_deterministic feature, maybe we could open a new PR?

Member:

Yes that sounds good, we can discuss with more people then. This has been on my mind for a while, since both our scatter and scatter_nd op sacrifice performance for deterministic output, while all other frameworks make the opposite choice (they say output is undefined when indices are not unique).

Contributor:

Great! Looking forward to further improvements.

Comment on lines 791 to 792
# For now we avoid parallizing over dimensions indexed by `indices` as
# there may be repeated indices and hadling parallel accumulation can
Contributor:

This comment is not valid anymore, right?

Contributor:

Fixed

else:
    raise NotImplementedError("scatter_nd mode not in [update, add]:", mode)
with ib.new_scope():
    if updates.dtype == "int64" and mode == "add":
Contributor:

@CaptainDuke you need to check for Vulkan or metal here. Can you also add a comment as to why we have this if statement.

Comment on lines 840 to 841
bx = te.thread_axis("blockIdx.y")
by = te.thread_axis("blockIdx.x")
Contributor:

Can you update the names so they match the dimensions? Alternatively rename them to reflect what they are indexing over.

@CaptainDuke (Contributor), Jul 29, 2021:

@tkonolige @masahi Check for vulkan & metal added.
Comment added.
Names updated.

# For now, atomic is not supported by target "vulkan", "metal", or "cuda" with "int64"
# So we fallback to normal algorithm, using "+=" rather than atomic_add

# TODO:
Contributor:

Please put a username on the TODO (your username, assuming you will do this).

Contributor:

Added

python/tvm/topi/cuda/scatter.py (outdated review comment, resolved)
CaptainDuke and others added 2 commits July 30, 2021 09:50
Co-authored-by: Tristan Konolige <tristan.konolige@gmail.com>
@CaptainDuke (Contributor):

@tkonolige @masahi
All checks have passed. Ready to merge.

mode == "update"
or cur_target_kind("vulkan")
or cur_target_kind("metal")
or (updates.dtype == "int64" and mode == "add")
Member:

I think atomic is only supported for 32-bit types, so float64 or int16 should also be caught here. Also, since you now have the mode == "update" check above, there is no need to check mode == "add".

I suggest swapping the then and else blocks and making the condition:

if mode == "add" and target not in ["vulkan", "metal"] and updates.dtype not in ["int32", "float32"]:
    use atomic code path
else:
    ...

Contributor:

Done.
Tiny fix: updates.dtype in ["int32", "float32"] rather than not in.
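
A hedged sketch of the resulting dispatch condition (the helper name use_atomic_path is made up and the merged code may differ):

```python
import tvm


def use_atomic_path(mode, updates_dtype):
    """Return True when the atomic_add kernel can be used for scatter_nd."""
    cur_kind = str(tvm.target.Target.current(allow_none=False).kind)
    return (
        mode == "add"
        and cur_kind not in ["vulkan", "metal"]
        and updates_dtype in ["int32", "float32"]
    )
```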

# Since multiple threads compete for the same write index, which leads to
# non-determinstic output for update mode. We could add a new attribute
# "allow_non_deterministic" to scatter_nd op, which is False by default.
# And change ONNX frontend to emit scatter_op with allow_non_deterministic = True,
Member:

Remove the reference to "ONNX".

We could add a new attribute, "allow_non_deterministic", which can be conditionally set to True by each frontend when non-determinism is allowed.

Contributor:

Done

@@ -764,6 +764,9 @@ def scatter_nd(data, indices, updates, mode):
     """
     _verify_scatter_nd_inputs(data, indices, updates)

+    def cur_target_kind(kind="cuda"):
+        return tvm.target.Target.current(allow_none=False).kind == tvm.target.Target(kind).kind
Member:

tvm.target.Target.current(allow_none=False).kind == kind

Contributor:

Fixed

I use str(tvm.target.Target.current(allow_none=False).kind) to convert <class 'tvm.target.target.TargetKind'> to a string.
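
A small illustrative check of that conversion (the "cuda" target here is just an example):

```python
import tvm

# Inside a target scope, Target.current().kind is a TargetKind object; converting
# it with str() yields the kind name, which can then be compared against strings
# such as "vulkan" or "metal".
with tvm.target.Target("cuda"):
    kind = tvm.target.Target.current(allow_none=False).kind
    print(type(kind))           # <class 'tvm.target.target.TargetKind'>
    print(str(kind) == "cuda")  # True
```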

@masahi masahi merged commit 887324f into apache:main Aug 1, 2021
ylc pushed a commit to ylc/tvm that referenced this pull request Sep 29, 2021
* [TOPI][CUDA] Improve the performance of scatter_nd by:

1. Split into 2 kernels, one does the "Init" and another does the "Update".
   Thus they can have different Grid/Block configurations to better utilize
   SMs.
2. Use atomic_add instead of direct assignment, which could avoid the race
   condition when multiple indices point to the same location of the output
   tensor. With this modification, it's now safe to use more CUDA threads
   to gain more parallelism.

* Fix python code format.

* FIX: [TOPI][CUDA] Improve the performance of scatter_nd apache#8479

- Split ScatterND kernel into 2 sub-kernels using ib.new_scope()

- Replace ib.for_range() with blockIdx.y

- Using atomic_add when mode == "add"

- Keep threadIdx.x less than max_threads of GPU

* Comment added

* Add fallback implementation when "mode=add" meets int64

- Atomic_add from CUDA doesn't support the int64 data type
- Change "ind{i}" to "ind%d" % i so that the names of relay.var display correctly

* Python format

* Fix line too long

* CI pass

* Empty, for CI pass

* Empty, for CI pass

* Empty, for CI pass

* Empty, for CI pass

* Empty, for CI pass

* Exchange blockIdx.x and blockIdx.y

* check for Vulkan or metal

* Fallback to previous algorithm when mode==update

* Update python/tvm/topi/cuda/scatter.py

Co-authored-by: Tristan Konolige <tristan.konolige@gmail.com>

* Assign TODO

* Swapping then and else block

Co-authored-by: wenxizhu <wenxizhu@tencent.com>
Co-authored-by: CaptainDuke <captainduke328@gmail.com>
Co-authored-by: Tristan Konolige <tristan.konolige@gmail.com>
ylc pushed a commit to ylc/tvm that referenced this pull request Jan 13, 2022