-
-
Notifications
You must be signed in to change notification settings - Fork 83
Conversation
@maleadt I would like to work on the in-place version of ND-reverse. Do I add to this branch/PR? |
Sure, but you probably won't be able to push to this branch. Feel free to push to a new one and have me merge, or just open a new PR. I realized that the ND-kernels should be usable for the 1D case, so I've unified them. The TODO left is to adjust the indexing for the in-place kernel to support multiple dimensions. Note that it works a little differently, it's not one thread that's responsible for reversing a single element, instead independently writing to shared memory and reading from it. It becomes clear with some print statements: +++ b/src/array.jl
@@ -307,6 +307,7 @@ function reverse_by_moving(data::CuArray{T, N}, dims::Integer=1) where {T, N}
if index_in <= length(data)
index_shared = blockDim().x - threadIdx().x + 1
@inbounds shared[index_shared] = data[index_in]
+ @cuprintln("thread ", threadIdx().x, ", block ", blockIdx().x, ": shared[", index_shared, "] = data[", index_in, "]")
end
sync_threads()
@@ -319,6 +320,7 @@ function reverse_by_moving(data::CuArray{T, N}, dims::Integer=1) where {T, N}
if 1 <= index_out <= length(data)
index_shared = threadIdx().x
@inbounds data[index_out] = shared[index_shared]
+ @cuprintln("thread ", threadIdx().x, ", block ", blockIdx().x, ": data[", index_out, "] = shared[", index_shared, "]")
end
return
|
Using shared memory for out-of-place is better since that enables memory coalescing on the output write.
EDIT: nope, this is just LLVM reordering code differently based on the presence of bounds checking. In-place reversing needs to swap, and can't use any smarter techniques IIUC, since we would otherwise be writing over values that still have to be read by threads in another block. |
Wrt. shared memory: newer hardware can coalesce both forward and backward access patterns, which we would get without shared memory in the 1D case, while for the ND case we don't get either of those if just writing backwards in the shared memory buffer. Wrt. separate kernels: the in-place version would trample over values to be read by other blocks. Grid synchronization is much to costly, so just swap numbers on half the number of threads.
After having confused myself for half a day, this is ready to go. Thanks again, @kraftpunk97! |
Continuation of #389