Slow mapreduce compared to KnetArray #141
P.S. this is on latest master:
Can you try out the latest master of CUDAnative?
You should use BenchmarkTools for this kind of measurement. I can't reproduce the difference:
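For reference, a minimal BenchmarkTools pattern looks like the following (a CPU-only sketch; the function and array names are illustrative, not from the issue):

```julia
using BenchmarkTools, Statistics

f(x) = sum(x, dims=1)   # toy stand-in for the GPU workload

x = rand(Float32, 1000, 16)
a = @benchmark f($x) seconds=1   # $x interpolates the value, so global-variable access isn't benchmarked
b = @benchmark f($x) seconds=1
judge(median(a), median(b))      # classifies the ratio as improvement, regression, or invariant
```

`judge` on two `median` estimates is what the snippets below use to compare the CuArrays and Knet trials.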
But as @vchuravy points out, this might have been due to a recent fix on CUDAnative (JuliaGPU/CUDAnative.jl#254).
I updated and fixed the benchmark, and upgraded
I'm not seeing these huge differences, "only" 5x. Looking into it. However, it looks like Knet doesn't properly free memory, so the CuArrays allocator might be running into OOM all the time. Try running both of your tests, individually, in a clean Julia session, and comparing those timings.
Hmm, the

```julia
import CuArrays
import Knet
using BenchmarkTools
using Statistics

const EPSILON = 1.0f-5
BenchmarkTools.DEFAULT_PARAMETERS.gcsample = true

# Per-sample normalization: subtract the mean and divide by the
# epsilon-stabilized standard deviation over the leading dims.
function normit(x)
    d = ndims(x) == 4 ? [1, 2, 3] : [1]
    s = prod(size(x)[d])
    mu = sum(x, dims=d) ./ s
    x = x .- mu
    sigma = sqrt.(EPSILON .+ sum(x .* x, dims=d) ./ s)
    return x ./ sigma
end

function benchmark(r)
    for _ in 1:1000
        normit(r)
    end
    return
end

data = rand(Float32, 84, 84, 1, 16)
cuarray = @benchmark benchmark($(CuArrays.CuArray(data)))
knet = @benchmark benchmark($(Knet.KnetArray(data)))
@show judge(median(cuarray), median(knet))
```
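As a CPU-side sanity check (a sketch, not part of the original thread), `normit` should leave each sample with near-zero mean and near-unit variance; the function below is a copy of the benchmarked one, run on a plain `Array`:

```julia
using Statistics

const EPS = 1.0f-5   # same role as the EPSILON constant in the benchmark

function normit(x)   # CPU copy of the benchmarked function
    d = ndims(x) == 4 ? [1, 2, 3] : [1]
    s = prod(size(x)[d])
    mu = sum(x, dims=d) ./ s
    x = x .- mu
    sigma = sqrt.(EPS .+ sum(x .* x, dims=d) ./ s)
    return x ./ sigma
end

y = normit(rand(Float32, 84, 84, 1, 16))
maximum(abs, mean(y, dims=(1, 2, 3)))   # ≈ 0 for every sample
```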
I noticed this myself and updated the issue (it was a typo). Even when running separately there is a 5~6x difference. I also updated the benchmark so CuArrays runs first and Knet second, so CuArrays doesn't have to deal with any memory problem. Still, the slowness remains...
Looks like a problem with

```julia
using BenchmarkTools, Statistics
import CuArrays, Knet

BenchmarkTools.DEFAULT_PARAMETERS.gcsample = true

# Isolate the reduction kernel: sum over dims 1:3 of a 4-D array.
function benchmark(r)
    for _ in 1:1000
        sum(r, dims=[1, 2, 3])
    end
    return
end

data = rand(Float32, 84, 84, 1, 16)
cuarray = @benchmark benchmark($(CuArrays.CuArray(data)))
knet = @benchmark benchmark($(Knet.KnetArray(data)))
@show judge(median(cuarray), median(knet))
```
Knet:
CuArrays:
It really is the kernel that's slow.
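If the generic `mapreducedim` kernel is the bottleneck, one possible workaround (an assumption, not a confirmed fix from the thread; `sum123` is a hypothetical helper name) is to flatten the reduced dims so the operation becomes a single-dimension sum, which may hit a different code path:

```julia
# Reduce dims 1:3 by collapsing them into one leading dimension first.
sum123(r) = reshape(sum(reshape(r, :, size(r, 4)), dims=1), 1, 1, 1, size(r, 4))

r = rand(Float32, 84, 84, 1, 16)
sum123(r) ≈ sum(r, dims=[1, 2, 3])   # same result; whether it is faster on the GPU is the open question
```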
I'm not really surprised that there's another order of magnitude of performance left for certain sizes of the mapreducedim kernel.
@jekbradbury what do you mean by "certain sizes"? I just tested on many different sizes ((128,128,1,1), (256,256,1,16), etc.), and all of them are slower...
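That size sweep can be scripted; a CPU-only sketch (swap the `rand` arrays for `CuArrays.CuArray`/`Knet.KnetArray` data to reproduce the GPU comparison):

```julia
using BenchmarkTools, Statistics

# One median timing of the dims=[1,2,3] reduction per array size.
results = Dict()
for sz in ((128, 128, 1, 1), (256, 256, 1, 16), (84, 84, 1, 16))
    x = rand(Float32, sz...)
    results[sz] = median(@benchmark sum($x, dims=[1, 2, 3]) seconds=1)
end
results
```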
a 6x slowdown in the following benchmark:
Results: