ND-reverse #416

maleadt · 2019-09-16T16:05:58Z

Continuation of #389

src/array.jl

kraftpunk97-zz · 2019-09-16T20:44:36Z

@maleadt I would like to work on the in-place version of ND-reverse. Do I add to this branch/PR?

maleadt · 2019-09-17T06:05:25Z

@maleadt I would like to work on the in-place version of ND-reverse. Do I add to this branch/PR?

Sure, but you probably won't be able to push to this branch. Feel free to push to a new one and have me merge, or just open a new PR.

I realized that the ND-kernels should be usable for the 1D case, so I've unified them. The TODO left is to adjust the indexing for the in-place kernel to support multiple dimensions. Note that it works a little differently, it's not one thread that's responsible for reversing a single element, instead independently writing to shared memory and reading from it. It becomes clear with some print statements:

+++ b/src/array.jl
@@ -307,6 +307,7 @@ function reverse_by_moving(data::CuArray{T, N}, dims::Integer=1) where {T, N}
         if index_in <= length(data)
             index_shared = blockDim().x - threadIdx().x + 1
             @inbounds shared[index_shared] = data[index_in]
+            @cuprintln("thread ", threadIdx().x, ", block ", blockIdx().x, ": shared[", index_shared, "] = data[", index_in, "]")
         end
 
         sync_threads()
@@ -319,6 +320,7 @@ function reverse_by_moving(data::CuArray{T, N}, dims::Integer=1) where {T, N}
         if 1 <= index_out <= length(data)
             index_shared = threadIdx().x
             @inbounds data[index_out] = shared[index_shared]
+            @cuprintln("thread ", threadIdx().x, ", block ", blockIdx().x, ": data[", index_out, "] = shared[", index_shared, "]")
         end
 
         return

julia> a = cu([1,2,3,4,5,6]); reverse!(a, 2, 5)
6-element CuArray{Float32,1}:
thread 1, block 1: shared[256] = data[1]
thread 2, block 1: shared[255] = data[2]
thread 3, block 1: shared[254] = data[3]
thread 4, block 1: shared[253] = data[4]
 1.0
 5.0
 4.0
 3.0
 2.0
 6.0

Using shared memory for out-of-place is better since that enables memory coalescing on the output write.

maleadt · 2019-09-23T13:46:12Z

~~Ugh, this failure looks to be another case of JuliaGPU/CUDAnative.jl#4: only occurs under --check-bounds=yes 😭~~

EDIT: nope, this is just LLVM reordering code differently based on the presence of bounds checking. In-place reversing needs to swap, and can't use any smarter techniques IIUC, since we would otherwise be writing over values that still have to be read by threads in another block.

Wrt. shared memory: newer hardware can coalesce both forward and backward access patterns, which we would get without shared memory in the 1D case, while for the ND case we don't get either of those if just writing backwards in the shared memory buffer. Wrt. separate kernels: the in-place version would trample over values to be read by other blocks. Grid synchronization is much to costly, so just swap numbers on half the number of threads.

maleadt · 2019-09-24T10:11:48Z

After having confused myself for half a day, this is ready to go. Thanks again, @kraftpunk97!

maleadt commented Sep 16, 2019

View reviewed changes

src/array.jl Outdated Show resolved Hide resolved

maleadt added the enhancement label Sep 16, 2019

maleadt force-pushed the tb/reverse branch from b38f5cf to 1687f4c Compare September 16, 2019 20:12

maleadt force-pushed the tb/reverse branch from 1057844 to a913c31 Compare September 23, 2019 07:10

maleadt mentioned this pull request Sep 23, 2019

Added in-place ND reverse #424

Closed

kraftpunk97 and others added 11 commits September 23, 2019 14:49

added reverse for arbitrary dimension CuArrays

da8960b

removed a redundant variable

6149161

Simplified kernel

21d44f8

Whitespace fixes.

27bbc33

More compact testing.

0fcbf0c

Clean-up reverse kernels.

e9d4abb

Drop specific typing.

008760f

Unify kernels.

0893ba3

Working, but non-coalesced in-place ND.

b7a7556

Make the ND-case coalesced for 1D inputs.

0ad5491

Unify in-place and out-of-place kernels.

ba586b9

Using shared memory for out-of-place is better since that enables memory coalescing on the output write.

maleadt force-pushed the tb/reverse branch from a913c31 to 1966169 Compare September 24, 2019 09:27

maleadt merged commit 536b4e4 into master Sep 24, 2019

bors bot deleted the tb/reverse branch September 24, 2019 10:11

maleadt referenced this pull request Nov 4, 2019

Test released Flux.

7642038

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ND-reverse #416

ND-reverse #416

maleadt commented Sep 16, 2019

kraftpunk97-zz commented Sep 16, 2019

maleadt commented Sep 17, 2019

maleadt commented Sep 23, 2019 •

edited

Loading

maleadt commented Sep 24, 2019

ND-reverse #416

ND-reverse #416

Conversation

maleadt commented Sep 16, 2019

kraftpunk97-zz commented Sep 16, 2019

maleadt commented Sep 17, 2019

maleadt commented Sep 23, 2019 • edited Loading

maleadt commented Sep 24, 2019

maleadt commented Sep 23, 2019 •

edited

Loading