This repository has been archived by the owner on Mar 12, 2021. It is now read-only.

Performance regression with mapreduce #611

Closed · MasonProtter opened this issue Feb 27, 2020 · 6 comments

MasonProtter commented Feb 27, 2020

Here's an example for me on the master branch:

julia> using BenchmarkTools, CuArrays

julia> function pi_mc_cu(nsamples)
           xs = CuArrays.rand(nsamples); ys = CuArrays.rand(nsamples)
           mapreduce((x, y) -> (x^2 + y^2) < 1.0, +, xs, ys, init=0) * 4/nsamples
       end
pi_mc_cu (generic function with 1 method)

julia> @benchmark pi_mc_cu(10000000)
BenchmarkTools.Trial: 
  memory estimate:  16.63 KiB
  allocs estimate:  473
  --------------
  minimum time:     1.620 ms (0.00% GC)
  median time:      1.666 ms (0.00% GC)
  mean time:        1.709 ms (1.60% GC)
  maximum time:     9.460 ms (7.77% GC)
  --------------
  samples:          2921
  evals/sample:     1

(@v1.4) pkg> st CuArrays 
Status `~/.julia/environments/v1.4/Project.toml`
  [3a865a2d] CuArrays v1.7.0 #master (https://github.com/JuliaGPU/CuArrays.jl.git)

(@v1.4) pkg> st CUDAnative
Status `~/.julia/environments/v1.4/Project.toml`
  [be33ccc6] CUDAnative v2.10.2 #master (https://github.com/JuliaGPU/CUDAnative.jl.git)

and here's that same example on the latest tagged version:

julia> using BenchmarkTools, CuArrays

julia> function pi_mc_cu(nsamples)
           xs = CuArrays.rand(nsamples); ys = CuArrays.rand(nsamples)
           mapreduce((x, y) -> (x^2 + y^2) < 1.0, +, xs, ys, init=0) * 4/nsamples
       end
pi_mc_cu (generic function with 1 method)

julia> @benchmark pi_mc_cu(10000000)
BenchmarkTools.Trial: 
  memory estimate:  4.61 KiB
  allocs estimate:  126
  --------------
  minimum time:     594.302 μs (0.00% GC)
  median time:      659.321 μs (0.00% GC)
  mean time:        667.914 μs (1.58% GC)
  maximum time:     2.338 ms (39.61% GC)
  --------------
  samples:          7463
  evals/sample:     1

(@v1.4) pkg> st CuArrays
Status `~/.julia/environments/v1.4/Project.toml`
  [3a865a2d] CuArrays v1.7.2

(@v1.4) pkg> st CUDAnative
Status `~/.julia/environments/v1.4/Project.toml`
  [be33ccc6] CUDAnative v2.10.2

As you can see, I lost around a factor of 3 in performance on the new master. I tested the master version both with and without JULIA_CUDA_USE_BINARYBUILDER=false, so BinaryBuilder is not the problem. Likely due to #602
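For completeness, toggling that looks something like this (the variable has to be set before the packages are loaded; shown here via ENV, an exported shell variable works too):

```julia
# Opt out of the BinaryBuilder-provided binaries and use the local CUDA
# toolkit instead. Must be set before CuArrays/CUDAnative are loaded.
ENV["JULIA_CUDA_USE_BINARYBUILDER"] = "false"

using CuArrays
```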


julia> versioninfo()
Julia Version 1.4.0-rc1.0
Commit b0c33b0cf5* (2020-01-23 17:23 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Ryzen 5 2600 Six-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, znver1)
Environment:
  JULIA_NUM_THREADS = 6

[mason@mason-pc ~]$ sudo pacman -Q --info cuda
Name            : cuda
Version         : 10.2.89-3
Description     : NVIDIA's GPU programming toolkit
Architecture    : x86_64
URL             : https://developer.nvidia.com/cuda-zone
Licenses        : custom:NVIDIA
Groups          : None
Provides        : cuda-toolkit  cuda-sdk
Depends On      : gcc8-libs  gcc8  opencl-nvidia  nvidia-utils
Optional Deps   : gdb: for cuda-gdb
                  java-runtime=8: for nsight and nvvp
Required By     : cudnn
Optional For    : None
Conflicts With  : None
Replaces        : cuda-toolkit  cuda-sdk
Installed Size  : 4.04 GiB
Packager        : Sven-Hendrik Haase <svenstaro@gmail.com>
Build Date      : Tue 31 Dec 2019 01:07:53 AM MST
Install Date    : Wed 26 Feb 2020 03:04:42 PM MST
Install Reason  : Explicitly installed
Install Script  : Yes
Validated By    : Signature

[mason@mason-pc ~]$ lspci  -v -s  $(lspci | grep ' VGA ' | cut -d" " -f 1)
1f:00.0 VGA compatible controller: NVIDIA Corporation TU106 [GeForce RTX 2060 Rev. A] (rev a1) (prog-if 00 [VGA controller])
        Subsystem: ZOTAC International (MCO) Ltd. TU106 [GeForce RTX 2060 Rev. A]
        Flags: bus master, fast devsel, latency 0, IRQ 71
        Memory at f6000000 (32-bit, non-prefetchable) [size=16M]
        Memory at e0000000 (64-bit, prefetchable) [size=256M]
        Memory at f0000000 (64-bit, prefetchable) [size=32M]
        I/O ports at e000 [size=128]
        [virtual] Expansion ROM at 000c0000 [disabled] [size=128K]
        Capabilities: <access denied>
        Kernel driver in use: nvidia
        Kernel modules: nouveau, nvidia_drm, nvidia
MasonProtter (Author) commented

Not really a bug, but the only two options when creating the issue were 'bug report' and 'feature request'.

MasonProtter (Author) commented

Just an update: trying this again on the current master, I get a further factor-of-10 performance regression:

julia> using BenchmarkTools, CuArrays
[ Info: Precompiling CuArrays [3a865a2d-5b23-5a0f-bc46-62713ec82fae]
WARNING: using CuArrays.BLAS in module Main conflicts with an existing identifier.

julia> function pi_mc_cu(nsamples)
           xs = CuArrays.rand(nsamples); ys = CuArrays.rand(nsamples)
           mapreduce((x, y) -> (x^2 + y^2) < 1.0, +, xs, ys, init=0) * 4/nsamples
       end
pi_mc_cu (generic function with 1 method)

julia> @benchmark pi_mc_cu(10000000)
[ Info: Building the CUDAnative run-time library for your sm_75 device, this might take a while...
BenchmarkTools.Trial:
  memory estimate:  17.08 KiB
  allocs estimate:  494
  --------------
  minimum time:     11.079 ms (0.00% GC)
  median time:      11.140 ms (0.00% GC)
  mean time:        11.188 ms (0.30% GC)
  maximum time:     13.158 ms (10.40% GC)
  --------------
  samples:          447
  evals/sample:     1

maleadt (Member) commented Mar 20, 2020

I had hoped #642 would fix this, but it doesn't do much. Maybe the serial fallback for small arrays, which existed in the old GPUArrays and CuArrays mapreduce implementations, is crucial in this situation. The input here isn't particularly tiny, though, so I'd need to profile properly first.
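Something along these lines, hypothetically (the threshold and names are illustrative, not the old API):

```julia
using CuArrays

# Hypothetical sketch of a small-array serial fallback: below some size
# threshold, skip the kernel launch entirely and reduce serially on the
# host after copying the data over.
function mapreduce_fallback(f, op, A::CuArray; init, threshold=2^10)
    if length(A) < threshold
        mapreduce(f, op, Array(A); init=init)  # serial, host side
    else
        mapreduce(f, op, A; init=init)         # parallel, GPU kernel
    end
end
```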

maleadt (Member) commented Mar 20, 2020

OK, one problem is the missing specialization of mapreduce for multiple containers, which falls back to a separate map followed by a reduce.
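In other words, without the specialization the two-container call behaves roughly like this (illustrative; assumes a CUDA-capable device):

```julia
using CuArrays

xs = CuArrays.rand(10^7); ys = CuArrays.rand(10^7)

# Fused path: a single kernel maps and reduces, with no temporary array.
fused = mapreduce((x, y) -> (x^2 + y^2) < 1.0, +, xs, ys, init=0)

# Fallback path: map materializes an intermediate CuArray, and a second
# kernel (with its own launch overhead) reduces it.
tmp = map((x, y) -> (x^2 + y^2) < 1.0, xs, ys)
unfused = reduce(+, tmp, init=0)
```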

MasonProtter (Author) commented

Ahh, that makes sense.

bors bot added a commit that referenced this issue Mar 31, 2020
646: Improve mapreduce performance r=maleadt a=wongalvis14

~More than 3-fold improvement over the latest implementation~

Benchmarking function from #611

First stage: using the maximum number of threads a single block can hold as the number of blocks, perform the reduction with serial iteration where needed.

Second stage: reduction in a single block, with no serial iteration.

This approach aims to strike an optimal balance between the workload of each thread, kernel launch overhead, and parallel resource exhaustion. (A CPU analogue of the scheme is sketched after this commit message.)

```
New impl:
julia> @benchmark pi_mc_cu(10000000)
BenchmarkTools.Trial: 
  memory estimate:  16.98 KiB
  allocs estimate:  468
  --------------
  minimum time:     2.520 ms (0.00% GC)
  median time:      2.536 ms (0.00% GC)
  mean time:        2.584 ms (0.64% GC)
  maximum time:     15.600 ms (50.62% GC)
  --------------
  samples:          1930
  evals/sample:     1

Old recursion impl:
julia> @benchmark pi_mc_cu(10000000)
BenchmarkTools.Trial: 
  memory estimate:  17.05 KiB
  allocs estimate:  472
  --------------
  minimum time:     4.059 ms (0.00% GC)
  median time:      4.076 ms (0.00% GC)
  mean time:        4.130 ms (0.64% GC)
  maximum time:     23.199 ms (63.12% GC)
  --------------
  samples:          1209
  evals/sample:     1

Latest serial impl:
BenchmarkTools.Trial: 
  memory estimate:  7.81 KiB
  allocs estimate:  242
  --------------
  minimum time:     8.544 ms (0.00% GC)
  median time:      8.579 ms (0.00% GC)
  mean time:        8.622 ms (0.27% GC)
  maximum time:     26.172 ms (41.80% GC)
  --------------
  samples:          580
  evals/sample:     1
```

Co-authored-by: wongalvis14 <wongalvis14@gmail.com>
Co-authored-by: Tim Besard <tim.besard@gmail.com>
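
A minimal CPU analogue of the two-stage scheme described above, for intuition only (the real implementation launches CUDA kernels and reduces within thread blocks; names and the block count here are illustrative):

```julia
# CPU analogue of the two-stage reduction, not the actual kernel code.
function two_stage_reduce(op, data; nblocks=256)
    n = length(data)
    # First stage: each "block" serially reduces a strided slice of the
    # input, yielding one partial result per block.
    partials = [reduce(op, @view(data[b:nblocks:n])) for b in 1:min(nblocks, n)]
    # Second stage: a single "block" reduces the partials; no serial
    # iteration remains.
    reduce(op, partials)
end

two_stage_reduce(+, rand(10^6))  # ≈ sum of 10^6 uniform samples ≈ 5e5
```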
maleadt (Member) commented Mar 31, 2020

CuArrays 1.7.3:

BenchmarkTools.Trial: 
  memory estimate:  4.61 KiB
  allocs estimate:  126
  --------------
  minimum time:     428.960 μs (0.00% GC)
  median time:      466.305 μs (0.00% GC)
  mean time:        495.852 μs (1.42% GC)
  maximum time:     2.296 ms (33.22% GC)
  --------------
  samples:          10000
  evals/sample:     1

Current master:

BenchmarkTools.Trial: 
  memory estimate:  10.45 KiB
  allocs estimate:  323
  --------------
  minimum time:     452.205 μs (0.00% GC)
  median time:      497.083 μs (0.00% GC)
  mean time:        528.749 μs (2.02% GC)
  maximum time:     3.221 ms (35.36% GC)
  --------------
  samples:          9406
  evals/sample:     1

A minor regression, but the reduction now always happens on the GPU, whereas the old GPUArrays implementation performed the second phase on the CPU (which is invalid when using GPU-specific functions).
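For example (contrived, but it shows the constraint): an operator built from device intrinsics compiles fine inside a kernel, yet would throw if a final reduction phase executed it on the host.

```julia
using CuArrays, CUDAnative

xs = CuArrays.rand(10^6)

# Contrived associative operator built from device intrinsics:
# log(exp(a) * exp(b)) == a + b, but CUDAnative.exp/log only run in
# GPU kernel code and error when called on the CPU.
device_add(a, b) = CUDAnative.log(CUDAnative.exp(a) * CUDAnative.exp(b))

# Works only if every phase of the reduction stays on the GPU.
reduce(device_add, xs, init=0f0)
```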
