This repository has been archived by the owner on Mar 12, 2021. It is now read-only.

ND-reverse #416

Merged
merged 12 commits into from
Sep 24, 2019

Conversation

maleadt
Member

@maleadt maleadt commented Sep 16, 2019

Continuation of #389

src/array.jl Outdated
@kraftpunk97-zz

@maleadt I would like to work on the in-place version of ND-reverse. Do I add to this branch/PR?

@maleadt
Member Author

maleadt commented Sep 17, 2019

> @maleadt I would like to work on the in-place version of ND-reverse. Do I add to this branch/PR?

Sure, but you probably won't be able to push to this branch. Feel free to push to a new one and have me merge, or just open a new PR.

I realized that the ND kernels should be usable for the 1D case, so I've unified them. The remaining TODO is to adjust the indexing of the in-place kernel to support multiple dimensions. Note that it works a little differently: there isn't one thread responsible for reversing a single element; instead, threads independently write to shared memory and read back from it. This becomes clear with some print statements:

+++ b/src/array.jl
@@ -307,6 +307,7 @@ function reverse_by_moving(data::CuArray{T, N}, dims::Integer=1) where {T, N}
         if index_in <= length(data)
             index_shared = blockDim().x - threadIdx().x + 1
             @inbounds shared[index_shared] = data[index_in]
+            @cuprintln("thread ", threadIdx().x, ", block ", blockIdx().x, ": shared[", index_shared, "] = data[", index_in, "]")
         end
 
         sync_threads()
@@ -319,6 +320,7 @@ function reverse_by_moving(data::CuArray{T, N}, dims::Integer=1) where {T, N}
         if 1 <= index_out <= length(data)
             index_shared = threadIdx().x
             @inbounds data[index_out] = shared[index_shared]
+            @cuprintln("thread ", threadIdx().x, ", block ", blockIdx().x, ": data[", index_out, "] = shared[", index_shared, "]")
         end
 
         return
julia> a = cu([1,2,3,4,5,6]); reverse!(a, 2, 5)
6-element CuArray{Float32,1}:
thread 1, block 1: shared[256] = data[1]
thread 2, block 1: shared[255] = data[2]
thread 3, block 1: shared[254] = data[3]
thread 4, block 1: shared[253] = data[4]
 1.0
 5.0
 4.0
 3.0
 2.0
 6.0
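The index arithmetic behind those prints can be sketched serially on the CPU. The following is a Python stand-in, not the actual kernel: the `block_dim` default, the explicit block/thread loops, and the separate output list are all illustrative (each block copies its chunk into a reversed scratch buffer, "syncs", then writes the buffer straight into the mirrored block position of the output).

```python
import math

def reverse_blocked(data, block_dim=4):
    # CPU sketch of the shared-memory reverse kernel.
    # An explicit output list stands in for the destination array; as
    # discussed below, reversing in place across blocks would race.
    n = len(data)
    out = [None] * n
    nblocks = math.ceil(n / block_dim)
    for b in range(1, nblocks + 1):            # blockIdx().x
        shared = [None] * block_dim            # per-block "shared memory"
        for t in range(1, block_dim + 1):      # threadIdx().x
            index_in = (b - 1) * block_dim + t
            if index_in <= n:
                # reversed write into shared, as in the first hunk above
                shared[block_dim - t] = data[index_in - 1]
        # sync_threads() happens here on the GPU
        for t in range(1, block_dim + 1):
            # mirrored block position; threads past the edge are masked off
            index_out = n - b * block_dim + t
            if 1 <= index_out <= n:
                out[index_out - 1] = shared[t - 1]  # straight read
    return out
```

Because the buffer is written backwards but read forwards, every block both reads and writes a contiguous chunk; only the `shared` accesses are reversed.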

@maleadt
Member Author

maleadt commented Sep 23, 2019

Ugh, this failure looks to be another case of JuliaGPU/CUDAnative.jl#4: only occurs under --check-bounds=yes 😭

EDIT: nope, this is just LLVM reordering code differently based on the presence of bounds checking. In-place reversing needs to swap, and can't use any smarter techniques IIUC, since we would otherwise be writing over values that still have to be read by threads in another block.

Wrt. shared memory: newer hardware can coalesce both forward and backward access patterns, which we would get without shared memory in the 1D case; for the ND case we get neither if we just write backwards into the shared memory buffer.

Wrt. separate kernels: the in-place version would trample over values to be read by other blocks.
Grid synchronization is much too costly, so instead we just swap values using half the number of threads.

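The swap-on-half-the-threads strategy amounts to the following CPU sketch (Python; the serial loop stands in for the CUDA grid, and the function name is illustrative, not the actual kernel):

```python
def reverse_inplace(data):
    # In-place strategy: one "thread" per mirrored pair, i.e. half as
    # many threads as elements. Each thread swaps both ends of its pair,
    # so no value is overwritten before it has been read.
    n = len(data)
    for t in range(n // 2):  # half the number of threads
        data[t], data[n - 1 - t] = data[n - 1 - t], data[t]
    return data
```

With an odd length the middle element simply stays put, and no synchronization between pairs is needed since each pair is touched by exactly one thread.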
@maleadt
Member Author

maleadt commented Sep 24, 2019

After having confused myself for half a day, this is ready to go. Thanks again, @kraftpunk97!

@maleadt maleadt merged commit 536b4e4 into master Sep 24, 2019
@bors bors bot deleted the tb/reverse branch September 24, 2019 10:11