This repository has been archived by the owner on Mar 12, 2021. It is now read-only.

Slow mapreduce compared to KnetArray #141

Closed
skariel opened this issue Sep 20, 2018 · 11 comments

@skariel

skariel commented Sep 20, 2018

There is a 6x slowdown in the following benchmark:

import CuArrays
import Knet
using BenchmarkTools

const EPSILON = 1.0f-5

# normalize over the leading dims: subtract the mean, divide by the (epsilon-stabilized) std
function normit(x)
    d = ndims(x) == 4 ? [1,2,3] : [1,]
    s = prod(size(x)[d])
    mu = sum(x, dims=d) ./ s
    x = x .- mu
    sigma = sqrt.(EPSILON .+ (sum(x .* x, dims=d)) ./ s)
    return x ./ sigma
end

function bench_knet()
    r = Knet.KnetArray(rand(Float32, 84,84,1,16));
    for _ in 1:1000 normit(r); end
    nothing
end

function bench_cuarrays()
    r = CuArrays.CuArray(rand(Float32, 84,84,1,16));
    for _ in 1:1000 normit(r); end
    nothing
end

@btime bench_cuarrays()
@btime bench_knet()

Results:

 238.931 ms (416907 allocations: 18.68 MiB)   # bench_cuarrays
  43.466 ms (90007 allocations: 4.32 MiB)     # bench_knet
@skariel
Author

skariel commented Sep 20, 2018

p.s. this is on latest master:

  Status `~/.julia/environments/v1.0/Project.toml`
  [c5f51814] CUDAdrv v0.8.5
  [be33ccc6] CUDAnative v0.8.10
  [3a865a2d] CuArrays v0.8.0+ #master (https://github.com/JuliaGPU/CuArrays.jl.git)
  [717857b8] DSP v0.5.1
  [864edb3b] DataStructures v0.12.0
  [39dd38d3] Dierckx v0.4.0
  [7a1cc6ca] FFTW v0.2.4
  [587475ba] Flux v0.6.7+ #master (https://github.com/FluxML/Flux.jl.git)
  [f6369f11] ForwardDiff v0.9.0
  [0c68f7d7] GPUArrays v0.4.0+ #master (https://github.com/JuliaGPU/GPUArrays.jl.git)
  [7073ff75] IJulia v1.11.1+ #master (https://github.com/JuliaLang/IJulia.jl.git)
  [916415d5] Images v0.16.0
  [033835bb] JLD2 v0.1.2
  [1902f260] Knet v1.1.0+ #master (https://github.com/denizyuret/Knet.jl.git)
  [b964fa9f] LaTeXStrings v1.0.2
  [4d1e1d77] Nullables v0.0.8
  [f0f68f2c] PlotlyJS v0.11.1
  [91a5bcdd] Plots v0.20.2+ #master (https://github.com/JuliaPlots/Plots.jl.git)
  [438e738f] PyCall v1.18.4
  [d330b81b] PyPlot v2.6.3
  [295af30f] Revise v0.7.10
  [276daf66] SpecialFunctions v0.7.0
  [37b6cedf] Traceur v0.2.0

@vchuravy
Member

Can you try out the latest master of CUDAnative?

@maleadt
Member

maleadt commented Sep 20, 2018

You should use BenchmarkTools for this kind of measurement.
You also have an undefined EPSILON there.

I can't reproduce the difference:

               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.0.0 (2018-08-08)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> using BenchmarkTools

julia> import CuArrays

julia> import Knet

julia> function normit(x)
           d = ndims(x) == 4 ? [1,2,3] : [1,]
           s = prod(size(x)[d])
           mu = sum(x, dims=d) ./ s
           x = x .- mu
           sigma = sqrt.(1 .+ (sum(x .* x, dims=d)) ./ s)
           return x ./ sigma
       end
normit (generic function with 1 method)

julia> a = Knet.KnetArray(rand(Float32, 84,84,1,1));

julia> @btime normit($a);
  90.968 μs (82 allocations: 3.58 KiB)

julia> b = CuArrays.CuArray(rand(Float32, 84,84,1,1));

julia> @btime normit($b);
  89.114 μs (405 allocations: 17.88 KiB)

But as @vchuravy points out, this might have been due to a recent fix on CUDAnative (JuliaGPU/CUDAnative.jl#254).

@skariel
Author

skariel commented Sep 20, 2018

I updated and fixed the benchmark, and upgraded CUDAnative and CUDAdrv. The slowdown is now much larger, at 6x. Please try the new benchmark.

@maleadt
Member

maleadt commented Sep 20, 2018

@btime already runs the code in a loop, so you don't need the for loop within.
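For example (a sketch; the `$` interpolation keeps the global variable itself out of the measurement):

using BenchmarkTools
import CuArrays

r = CuArrays.CuArray(rand(Float32, 84, 84, 1, 16))
@btime normit($r);   # normit as defined above; @btime handles the repetitions itself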

I'm not seeing these huge differences, "only" 5x. Looking into it. However, it looks like Knet doesn't properly free memory, so the CuArrays allocator might be running into OOM all the time. Try running both of your tests, individually, in a clean Julia session, and comparing those timings.
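I.e. something along these lines (a sketch; the file names are just for illustration, each file holding one backend's benchmark):

# launch each benchmark in its own fresh Julia process,
# so one allocator's leftover GPU memory can't affect the other
run(`julia bench_cuarrays.jl`)
run(`julia bench_knet.jl`)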

@maleadt
Member

maleadt commented Sep 20, 2018

Hmm, the for loop seems required to hit the 5x slowdown:

import CuArrays
import Knet
using BenchmarkTools
using Statistics

const EPSILON = 1.0f-5
BenchmarkTools.DEFAULT_PARAMETERS.gcsample = true

function normit(x)
    d = ndims(x) == 4 ? [1,2,3] : [1,]
    s = prod(size(x)[d])
    mu = sum(x, dims=d) ./ s
    x = x .- mu
    sigma = sqrt.(EPSILON .+ (sum(x .* x, dims=d)) ./ s)
    return x ./ sigma
end

function benchmark(r)
    for _ in 1:1000
        normit(r)
    end
    return
end

data = rand(Float32, 84,84,1,16)
cuarray = @benchmark benchmark($(CuArrays.CuArray(data)))
knet = @benchmark benchmark($(Knet.KnetArray(data)))
@show judge(median(cuarray), median(knet))

@skariel
Author

skariel commented Sep 20, 2018

I noticed this myself and updated the issue (it was a typo). Even when running separately there is a 5-6x difference. I also updated the benchmark so CuArrays runs first and Knet second, so CuArrays doesn't have to deal with any memory problems. Still, the slowness remains...

@maleadt
Member

maleadt commented Sep 20, 2018

Looks like a problem with mapreducedim:

using BenchmarkTools, Statistics
import CuArrays, Knet

BenchmarkTools.DEFAULT_PARAMETERS.gcsample = true

function benchmark(r)
    for _ in 1:1000
        sum(r, dims=[1,2,3])
    end
    return
end

data = rand(Float32, 84, 84, 1, 16)
cuarray = @benchmark benchmark($(CuArrays.CuArray(data)))
knet = @benchmark benchmark($(Knet.KnetArray(data)))
@show judge(median(cuarray), median(knet))
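
One way to look at the kernel itself is to profile an isolated reduction under nvprof, along these lines (a sketch; one backend per run):

# reduce_only.jl -- sketch: profile one backend in isolation, e.g. with
#   nvprof julia reduce_only.jl
# (swap the import/constructor to Knet.KnetArray to profile the Knet path)
import CuArrays

data = CuArrays.CuArray(rand(Float32, 84, 84, 1, 16))
for _ in 1:1000
    sum(data, dims=[1,2,3])
end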

@maleadt
Member

maleadt commented Sep 20, 2018

Knet:

            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  1.03033s    110000  9.3660us  8.6720us  99.489us  _sum_32_21(int, float*, int, int, float*)
                    0.00%  37.665us         1  37.665us  37.665us  37.665us  [CUDA memcpy HtoD]
      API calls:   86.77%  922.18ms    110000  8.3830us  6.9930us  34.077ms  cudaLaunchKernel
                   13.18%  140.06ms     13001  10.773us  9.2290us  407.26us  cudaMalloc
                    0.03%  295.02us        93  3.1720us     300ns  108.81us  cuDeviceGetAttribute
                    0.01%  117.07us         1  117.07us  117.07us  117.07us  cudaMemcpy
                    0.01%  80.632us         1  80.632us  80.632us  80.632us  cuDeviceTotalMem
                    0.01%  62.943us         1  62.943us  62.943us  62.943us  cuDeviceGetName
                    0.00%  2.4570us         1  2.4570us  2.4570us  2.4570us  cuDriverGetVersion
                    0.00%  2.1410us         1  2.1410us  2.1410us  2.1410us  cuInit
                    0.00%     618ns         1     618ns     618ns     618ns  cuDeviceGetCount
                    0.00%     514ns         1     514ns     514ns     514ns  cuDeviceGet

CuArrays:

            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   98.30%  7.35275s     44000  167.11us  164.23us  322.76us  ptxcall_mapreducedim_kernel_parallel_2
                    1.70%  127.46ms     44000  2.8960us  2.8160us  4.3840us  ptxcall_anonymous_1
                    0.00%  40.608us         1  40.608us  40.608us  40.608us  [CUDA memcpy HtoD]
      API calls:   94.85%  4.56706s     88000  51.898us  6.3000us  29.877ms  cuLaunchKernel
                    1.66%  79.730ms         2  39.865ms  14.968ms  64.763ms  cuModuleLoadDataEx
                    1.30%  62.838ms      6575  9.5570us  8.3350us  403.99us  cuMemAlloc
                    0.88%  42.424ms     94579     448ns     305ns  241.41us  cuCtxGetCurrent
                    0.60%  29.046ms     88004     330ns     290ns  256.82us  cuDeviceGetAttribute
                    0.39%  18.559ms     44002     421ns     363ns  256.31us  cuCtxGetDevice
                    0.32%  15.482ms     44002     351ns     289ns  290.98us  cuDeviceGet
                    0.00%  101.34us         1  101.34us  101.34us  101.34us  cuMemcpyHtoD
                    0.00%  2.7020us         2  1.3510us  1.1270us  1.5750us  cuModuleGetFunction

It really is the kernel that's slow.

maleadt changed the title from "slow performance compared to KnetArray" to "Slow mapreduce compared to KnetArray" on Sep 21, 2018
@jekbradbury
Contributor

I'm not really surprised that there's another order of magnitude of performance left for certain sizes of the mapreducedim kernel.

@skariel
Author

skariel commented Feb 14, 2019

@jekbradbury what do you mean by "certain sizes"? I just tested many different sizes, (128,128,1,1), (256,256,1,16), etc., and all of them are slower...
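
Roughly the kind of sweep I mean (a sketch; the timing code here is illustrative):

# compare the dims=[1,2,3] reduction across a few input shapes
using BenchmarkTools
import CuArrays, Knet

for sz in ((84,84,1,16), (128,128,1,1), (256,256,1,16))
    data = rand(Float32, sz...)
    t_cu = @belapsed sum($(CuArrays.CuArray(data)), dims=[1,2,3])
    t_kn = @belapsed sum($(Knet.KnetArray(data)), dims=[1,2,3])
    println(sz, ": CuArrays ", t_cu, " s, Knet ", t_kn, " s")
end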
