This repository has been archived by the owner on Mar 12, 2021. It is now read-only.

Slow mapreduce compared to KnetArray #141

Closed
skariel opened this issue Sep 20, 2018 · 11 comments

@skariel

skariel commented Sep 20, 2018

There is a 6x slowdown in the following benchmark:

import CuArrays
import Knet
using BenchmarkTools

const EPSILON = 1.0f-5

# normalize over the leading dims: subtract the mean, divide by the (epsilon-stabilized) std
function normit(x)
    d = ndims(x) == 4 ? [1,2,3] : [1,]
    s = prod(size(x)[d])
    mu = sum(x, dims=d) ./ s
    x = x .- mu
    sigma = sqrt.(EPSILON .+ (sum(x .* x, dims=d)) ./ s)
    return x ./ sigma
end

function bench_knet()
    r = Knet.KnetArray(rand(Float32, 84,84,1,16));
    for _ in 1:1000 normit(r); end
    nothing
end

function bench_cuarrays()
    r = CuArrays.CuArray(rand(Float32, 84,84,1,16));
    for _ in 1:1000 normit(r); end
    nothing
end

@btime bench_cuarrays()
@btime bench_knet()

Results:

 238.931 ms (416907 allocations: 18.68 MiB)   # bench_cuarrays
  43.466 ms (90007 allocations: 4.32 MiB)     # bench_knet
@skariel
Author

skariel commented Sep 20, 2018

p.s. this is on latest master:

  Status `~/.julia/environments/v1.0/Project.toml`
  [c5f51814] CUDAdrv v0.8.5
  [be33ccc6] CUDAnative v0.8.10
  [3a865a2d] CuArrays v0.8.0+ #master (https://github.com/JuliaGPU/CuArrays.jl.git)
  [717857b8] DSP v0.5.1
  [864edb3b] DataStructures v0.12.0
  [39dd38d3] Dierckx v0.4.0
  [7a1cc6ca] FFTW v0.2.4
  [587475ba] Flux v0.6.7+ #master (https://github.com/FluxML/Flux.jl.git)
  [f6369f11] ForwardDiff v0.9.0
  [0c68f7d7] GPUArrays v0.4.0+ #master (https://github.com/JuliaGPU/GPUArrays.jl.git)
  [7073ff75] IJulia v1.11.1+ #master (https://github.com/JuliaLang/IJulia.jl.git)
  [916415d5] Images v0.16.0
  [033835bb] JLD2 v0.1.2
  [1902f260] Knet v1.1.0+ #master (https://github.com/denizyuret/Knet.jl.git)
  [b964fa9f] LaTeXStrings v1.0.2
  [4d1e1d77] Nullables v0.0.8
  [f0f68f2c] PlotlyJS v0.11.1
  [91a5bcdd] Plots v0.20.2+ #master (https://github.com/JuliaPlots/Plots.jl.git)
  [438e738f] PyCall v1.18.4
  [d330b81b] PyPlot v2.6.3
  [295af30f] Revise v0.7.10
  [276daf66] SpecialFunctions v0.7.0
  [37b6cedf] Traceur v0.2.0

@vchuravy
Member

Can you try out the latest master of CUDAnative?

@maleadt
Member

maleadt commented Sep 20, 2018

You should use BenchmarkTools for this kind of measurement.
You also have an undefined EPSILON there.

I can't reproduce the difference:

               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.0.0 (2018-08-08)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> using BenchmarkTools

julia> import CuArrays

julia> import Knet

julia> function normit(x)
           d = ndims(x) == 4 ? [1,2,3] : [1,]
           s = prod(size(x)[d])
           mu = sum(x, dims=d) ./ s
           x = x .- mu
           sigma = sqrt.(1 .+ (sum(x .* x, dims=d)) ./ s)
           return x ./ sigma
       end
normit (generic function with 1 method)

julia> a = Knet.KnetArray(rand(Float32, 84,84,1,1));

julia> @btime normit($a);
  90.968 μs (82 allocations: 3.58 KiB)

julia> b = CuArrays.CuArray(rand(Float32, 84,84,1,1));

julia> @btime normit($b);
  89.114 μs (405 allocations: 17.88 KiB)

But as @vchuravy points out, this might have been due to a recent fix on CUDAnative (JuliaGPU/CUDAnative.jl#254).

@skariel
Author

skariel commented Sep 20, 2018

I updated and fixed the benchmark, and upgraded CUDAnative and CUDAdrv. The slowdown is now much larger, at 6x. Please try the new benchmark.

@maleadt
Member

maleadt commented Sep 20, 2018

@btime already runs the code in a loop, so you don't need the for loop within.
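For example (a sketch; the `$` interpolation keeps the global variable itself out of the measurement):

using BenchmarkTools
import CuArrays

r = CuArrays.CuArray(rand(Float32, 84, 84, 1, 16))
@btime normit($r);   # normit as defined above; @btime handles the repetitions itself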

I'm not seeing these huge differences, "only" 5x. Looking into it. However, it looks like Knet doesn't properly free memory, so the CuArrays allocator might be running into OOM all the time. Try running both of your tests, individually, in a clean Julia session, and comparing those timings.
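I.e. something along these lines (a sketch; the file names are just for illustration, each file holding one backend's benchmark):

# launch each benchmark in its own fresh Julia process,
# so one allocator's leftover GPU memory can't affect the other
run(`julia bench_cuarrays.jl`)
run(`julia bench_knet.jl`)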

@maleadt
Member

maleadt commented Sep 20, 2018

Hmm, the for loop seems required to hit the 5x slowdown:

import CuArrays
import Knet
using BenchmarkTools
using Statistics

const EPSILON = 1.0f-5
BenchmarkTools.DEFAULT_PARAMETERS.gcsample = true

function normit(x)
    d = ndims(x) == 4 ? [1,2,3] : [1,]
    s = prod(size(x)[d])
    mu = sum(x, dims=d) ./ s
    x = x .- mu
    sigma = sqrt.(EPSILON .+ (sum(x .* x, dims=d)) ./ s)
    return x ./ sigma
end

function benchmark(r)
    for _ in 1:1000
        normit(r)
    end
    return
end

data = rand(Float32, 84,84,1,16)
cuarray = @benchmark benchmark($(CuArrays.CuArray(data)))
knet = @benchmark benchmark($(Knet.KnetArray(data)))
@show judge(median(cuarray), median(knet))

@skariel
Author

skariel commented Sep 20, 2018

I noticed this myself and updated the issue (it was a typo). Even when running separately there is a 5-6x difference. I also updated the benchmark so CuArrays runs first and Knet second, so CuArrays doesn't have to deal with any memory problems. Still, the slowness remains...

@maleadt
Member

maleadt commented Sep 20, 2018

Looks like a problem with mapreducedim:

using BenchmarkTools, Statistics
import CuArrays, Knet

BenchmarkTools.DEFAULT_PARAMETERS.gcsample = true

function benchmark(r)
    for _ in 1:1000
        sum(r, dims=[1,2,3])
    end
    return
end

data = rand(Float32, 84, 84, 1, 16)
cuarray = @benchmark benchmark($(CuArrays.CuArray(data)))
knet = @benchmark benchmark($(Knet.KnetArray(data)))
@show judge(median(cuarray), median(knet))
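
One way to look at the kernel itself is to profile an isolated reduction under nvprof, along these lines (a sketch; one backend per run):

# reduce_only.jl -- sketch: profile one backend in isolation, e.g. with
#   nvprof julia reduce_only.jl
# (swap the import/constructor to Knet.KnetArray to profile the Knet path)
import CuArrays

data = CuArrays.CuArray(rand(Float32, 84, 84, 1, 16))
for _ in 1:1000
    sum(data, dims=[1,2,3])
end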

@maleadt
Member

maleadt commented Sep 20, 2018

Knet:

            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  1.03033s    110000  9.3660us  8.6720us  99.489us  _sum_32_21(int, float*, int, int, float*)
                    0.00%  37.665us         1  37.665us  37.665us  37.665us  [CUDA memcpy HtoD]
      API calls:   86.77%  922.18ms    110000  8.3830us  6.9930us  34.077ms  cudaLaunchKernel
                   13.18%  140.06ms     13001  10.773us  9.2290us  407.26us  cudaMalloc
                    0.03%  295.02us        93  3.1720us     300ns  108.81us  cuDeviceGetAttribute
                    0.01%  117.07us         1  117.07us  117.07us  117.07us  cudaMemcpy
                    0.01%  80.632us         1  80.632us  80.632us  80.632us  cuDeviceTotalMem
                    0.01%  62.943us         1  62.943us  62.943us  62.943us  cuDeviceGetName
                    0.00%  2.4570us         1  2.4570us  2.4570us  2.4570us  cuDriverGetVersion
                    0.00%  2.1410us         1  2.1410us  2.1410us  2.1410us  cuInit
                    0.00%     618ns         1     618ns     618ns     618ns  cuDeviceGetCount
                    0.00%     514ns         1     514ns     514ns     514ns  cuDeviceGet

CuArrays:

            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   98.30%  7.35275s     44000  167.11us  164.23us  322.76us  ptxcall_mapreducedim_kernel_parallel_2
                    1.70%  127.46ms     44000  2.8960us  2.8160us  4.3840us  ptxcall_anonymous_1
                    0.00%  40.608us         1  40.608us  40.608us  40.608us  [CUDA memcpy HtoD]
      API calls:   94.85%  4.56706s     88000  51.898us  6.3000us  29.877ms  cuLaunchKernel
                    1.66%  79.730ms         2  39.865ms  14.968ms  64.763ms  cuModuleLoadDataEx
                    1.30%  62.838ms      6575  9.5570us  8.3350us  403.99us  cuMemAlloc
                    0.88%  42.424ms     94579     448ns     305ns  241.41us  cuCtxGetCurrent
                    0.60%  29.046ms     88004     330ns     290ns  256.82us  cuDeviceGetAttribute
                    0.39%  18.559ms     44002     421ns     363ns  256.31us  cuCtxGetDevice
                    0.32%  15.482ms     44002     351ns     289ns  290.98us  cuDeviceGet
                    0.00%  101.34us         1  101.34us  101.34us  101.34us  cuMemcpyHtoD
                    0.00%  2.7020us         2  1.3510us  1.1270us  1.5750us  cuModuleGetFunction

It really is the kernel that's slow.

maleadt changed the title from "slow performance compared to KnetArray" to "Slow mapreduce compared to KnetArray" on Sep 21, 2018
@jekbradbury
Contributor

I'm not really surprised that there's another order of magnitude of performance left for certain sizes of the mapreducedim kernel.

@skariel
Author

skariel commented Feb 14, 2019

@jekbradbury what do you mean by "certain sizes"? I just tested many different sizes, (128,128,1,1), (256,256,1,16), etc., and all of them are slower...
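
Roughly the kind of sweep I mean (a sketch; the timing code here is illustrative):

# compare the dims=[1,2,3] reduction across a few input shapes
using BenchmarkTools
import CuArrays, Knet

for sz in ((84,84,1,16), (128,128,1,1), (256,256,1,16))
    data = rand(Float32, sz...)
    t_cu = @belapsed sum($(CuArrays.CuArray(data)), dims=[1,2,3])
    t_kn = @belapsed sum($(Knet.KnetArray(data)), dims=[1,2,3])
    println(sz, ": CuArrays ", t_cu, " s, Knet ", t_kn, " s")
end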
