This repository has been archived by the owner on Mar 12, 2021. It is now read-only.

Performance regression with mapreduce #611

Closed · MasonProtter opened this issue Feb 27, 2020 · 6 comments

MasonProtter commented Feb 27, 2020

Here's an example for me on the master branch:

julia> using BenchmarkTools, CuArrays

julia> function pi_mc_cu(nsamples)
           xs = CuArrays.rand(nsamples); ys = CuArrays.rand(nsamples)
           mapreduce((x, y) -> (x^2 + y^2) < 1.0, +, xs, ys, init=0) * 4/nsamples
       end
pi_mc_cu (generic function with 1 method)

julia> @benchmark pi_mc_cu(10000000)
BenchmarkTools.Trial: 
  memory estimate:  16.63 KiB
  allocs estimate:  473
  --------------
  minimum time:     1.620 ms (0.00% GC)
  median time:      1.666 ms (0.00% GC)
  mean time:        1.709 ms (1.60% GC)
  maximum time:     9.460 ms (7.77% GC)
  --------------
  samples:          2921
  evals/sample:     1

(@v1.4) pkg> st CuArrays 
Status `~/.julia/environments/v1.4/Project.toml`
  [3a865a2d] CuArrays v1.7.0 #master (https://github.com/JuliaGPU/CuArrays.jl.git)

(@v1.4) pkg> st CUDAnative
Status `~/.julia/environments/v1.4/Project.toml`
  [be33ccc6] CUDAnative v2.10.2 #master (https://github.com/JuliaGPU/CUDAnative.jl.git)

and here's that same example on the latest tagged version:

julia> using BenchmarkTools, CuArrays

julia> function pi_mc_cu(nsamples)
           xs = CuArrays.rand(nsamples); ys = CuArrays.rand(nsamples)
           mapreduce((x, y) -> (x^2 + y^2) < 1.0, +, xs, ys, init=0) * 4/nsamples
       end
pi_mc_cu (generic function with 1 method)

julia> @benchmark pi_mc_cu(10000000)
BenchmarkTools.Trial: 
  memory estimate:  4.61 KiB
  allocs estimate:  126
  --------------
  minimum time:     594.302 μs (0.00% GC)
  median time:      659.321 μs (0.00% GC)
  mean time:        667.914 μs (1.58% GC)
  maximum time:     2.338 ms (39.61% GC)
  --------------
  samples:          7463
  evals/sample:     1

(@v1.4) pkg> st CuArrays
Status `~/.julia/environments/v1.4/Project.toml`
  [3a865a2d] CuArrays v1.7.2

(@v1.4) pkg> st CUDAnative
Status `~/.julia/environments/v1.4/Project.toml`
  [be33ccc6] CUDAnative v2.10.2

As you can see, I lost around a factor of 3 in performance on the new master. I tested the master version both with and without JULIA_CUDA_USE_BINARYBUILDER=false, so BinaryBuilder is not the problem. Likely due to #602
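For completeness, toggling that looks something like this (the variable has to be set before the packages are loaded; shown here via ENV, an exported shell variable works too):

```julia
# Opt out of the BinaryBuilder-provided binaries and use the local CUDA
# toolkit instead. Must be set before CuArrays/CUDAnative are loaded.
ENV["JULIA_CUDA_USE_BINARYBUILDER"] = "false"

using CuArrays
```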


julia> versioninfo()
Julia Version 1.4.0-rc1.0
Commit b0c33b0cf5* (2020-01-23 17:23 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Ryzen 5 2600 Six-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, znver1)
Environment:
  JULIA_NUM_THREADS = 6

[mason@mason-pc ~]$ sudo pacman -Q --info cuda
Name            : cuda
Version         : 10.2.89-3
Description     : NVIDIA's GPU programming toolkit
Architecture    : x86_64
URL             : https://developer.nvidia.com/cuda-zone
Licenses        : custom:NVIDIA
Groups          : None
Provides        : cuda-toolkit  cuda-sdk
Depends On      : gcc8-libs  gcc8  opencl-nvidia  nvidia-utils
Optional Deps   : gdb: for cuda-gdb
                  java-runtime=8: for nsight and nvvp
Required By     : cudnn
Optional For    : None
Conflicts With  : None
Replaces        : cuda-toolkit  cuda-sdk
Installed Size  : 4.04 GiB
Packager        : Sven-Hendrik Haase <svenstaro@gmail.com>
Build Date      : Tue 31 Dec 2019 01:07:53 AM MST
Install Date    : Wed 26 Feb 2020 03:04:42 PM MST
Install Reason  : Explicitly installed
Install Script  : Yes
Validated By    : Signature

[mason@mason-pc ~]$ lspci  -v -s  $(lspci | grep ' VGA ' | cut -d" " -f 1)
1f:00.0 VGA compatible controller: NVIDIA Corporation TU106 [GeForce RTX 2060 Rev. A] (rev a1) (prog-if 00 [VGA controller])
        Subsystem: ZOTAC International (MCO) Ltd. TU106 [GeForce RTX 2060 Rev. A]
        Flags: bus master, fast devsel, latency 0, IRQ 71
        Memory at f6000000 (32-bit, non-prefetchable) [size=16M]
        Memory at e0000000 (64-bit, prefetchable) [size=256M]
        Memory at f0000000 (64-bit, prefetchable) [size=32M]
        I/O ports at e000 [size=128]
        [virtual] Expansion ROM at 000c0000 [disabled] [size=128K]
        Capabilities: <access denied>
        Kernel driver in use: nvidia
        Kernel modules: nouveau, nvidia_drm, nvidia
MasonProtter (Author) commented

Not really a bug, but the only two options when creating the issue were 'bug report' and 'feature request'.

MasonProtter (Author) commented

Just an update: trying this again on the current master, I get a further factor-of-10 performance regression:

julia> using BenchmarkTools, CuArrays
[ Info: Precompiling CuArrays [3a865a2d-5b23-5a0f-bc46-62713ec82fae]
WARNING: using CuArrays.BLAS in module Main conflicts with an existing identifier.

julia> function pi_mc_cu(nsamples)
           xs = CuArrays.rand(nsamples); ys = CuArrays.rand(nsamples)
           mapreduce((x, y) -> (x^2 + y^2) < 1.0, +, xs, ys, init=0) * 4/nsamples
       end
pi_mc_cu (generic function with 1 method)

julia> @benchmark pi_mc_cu(10000000)
[ Info: Building the CUDAnative run-time library for your sm_75 device, this might take a while...
BenchmarkTools.Trial:
  memory estimate:  17.08 KiB
  allocs estimate:  494
  --------------
  minimum time:     11.079 ms (0.00% GC)
  median time:      11.140 ms (0.00% GC)
  mean time:        11.188 ms (0.30% GC)
  maximum time:     13.158 ms (10.40% GC)
  --------------
  samples:          447
  evals/sample:     1

maleadt (Member) commented Mar 20, 2020

I had hoped #642 would fix this, but it doesn't do much. Maybe the serial fallback for small arrays, which existed in the old GPUArrays and CuArrays mapreduce implementations, is crucial in this situation. The input here isn't particularly tiny, though, so I'd need to profile properly first.
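Something along these lines, hypothetically (the threshold and names are illustrative, not the old API):

```julia
using CuArrays

# Hypothetical sketch of a small-array serial fallback: below some size
# threshold, skip the kernel launch entirely and reduce serially on the
# host after copying the data over.
function mapreduce_fallback(f, op, A::CuArray; init, threshold=2^10)
    if length(A) < threshold
        mapreduce(f, op, Array(A); init=init)  # serial, host side
    else
        mapreduce(f, op, A; init=init)         # parallel, GPU kernel
    end
end
```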

maleadt (Member) commented Mar 20, 2020

OK, one problem is the missing specialization of mapreduce for multiple containers, which falls back to a separate map followed by a reduce.
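In other words, without the specialization the two-container call behaves roughly like this (illustrative; assumes a CUDA-capable device):

```julia
using CuArrays

xs = CuArrays.rand(10^7); ys = CuArrays.rand(10^7)

# Fused path: a single kernel maps and reduces, with no temporary array.
fused = mapreduce((x, y) -> (x^2 + y^2) < 1.0, +, xs, ys, init=0)

# Fallback path: map materializes an intermediate CuArray, and a second
# kernel (with its own launch overhead) reduces it.
tmp = map((x, y) -> (x^2 + y^2) < 1.0, xs, ys)
unfused = reduce(+, tmp, init=0)
```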

MasonProtter (Author) commented

Ahh, that makes sense.

bors bot added a commit that referenced this issue Mar 31, 2020
646: Improve mapreduce performance r=maleadt a=wongalvis14

~More than 3-fold improvement over the latest implementation~

Benchmarking function from #611

First stage: using the maximum number of threads a single block can hold as the number of blocks, perform the reduction with serial iteration where needed.

Second stage: reduction in a single block, with no serial iteration.

This approach aims to strike an optimal balance between the workload of each thread, kernel launch overhead, and parallel resource exhaustion. (A CPU analogue of the scheme is sketched after this commit message.)

```
New impl:
julia> @benchmark pi_mc_cu(10000000)
BenchmarkTools.Trial: 
  memory estimate:  16.98 KiB
  allocs estimate:  468
  --------------
  minimum time:     2.520 ms (0.00% GC)
  median time:      2.536 ms (0.00% GC)
  mean time:        2.584 ms (0.64% GC)
  maximum time:     15.600 ms (50.62% GC)
  --------------
  samples:          1930
  evals/sample:     1

Old recursion impl:
julia> @benchmark pi_mc_cu(10000000)
BenchmarkTools.Trial: 
  memory estimate:  17.05 KiB
  allocs estimate:  472
  --------------
  minimum time:     4.059 ms (0.00% GC)
  median time:      4.076 ms (0.00% GC)
  mean time:        4.130 ms (0.64% GC)
  maximum time:     23.199 ms (63.12% GC)
  --------------
  samples:          1209
  evals/sample:     1

Latest serial impl:
BenchmarkTools.Trial: 
  memory estimate:  7.81 KiB
  allocs estimate:  242
  --------------
  minimum time:     8.544 ms (0.00% GC)
  median time:      8.579 ms (0.00% GC)
  mean time:        8.622 ms (0.27% GC)
  maximum time:     26.172 ms (41.80% GC)
  --------------
  samples:          580
  evals/sample:     1
```

Co-authored-by: wongalvis14 <wongalvis14@gmail.com>
Co-authored-by: Tim Besard <tim.besard@gmail.com>
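
A minimal CPU analogue of the two-stage scheme described above, for intuition only (the real implementation launches CUDA kernels and reduces within thread blocks; names and the block count here are illustrative):

```julia
# CPU analogue of the two-stage reduction, not the actual kernel code.
function two_stage_reduce(op, data; nblocks=256)
    n = length(data)
    # First stage: each "block" serially reduces a strided slice of the
    # input, yielding one partial result per block.
    partials = [reduce(op, @view(data[b:nblocks:n])) for b in 1:min(nblocks, n)]
    # Second stage: a single "block" reduces the partials; no serial
    # iteration remains.
    reduce(op, partials)
end

two_stage_reduce(+, rand(10^6))  # ≈ sum of 10^6 uniform samples ≈ 5e5
```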
maleadt (Member) commented Mar 31, 2020

CuArrays 1.7.3:

BenchmarkTools.Trial: 
  memory estimate:  4.61 KiB
  allocs estimate:  126
  --------------
  minimum time:     428.960 μs (0.00% GC)
  median time:      466.305 μs (0.00% GC)
  mean time:        495.852 μs (1.42% GC)
  maximum time:     2.296 ms (33.22% GC)
  --------------
  samples:          10000
  evals/sample:     1

Current master:

BenchmarkTools.Trial: 
  memory estimate:  10.45 KiB
  allocs estimate:  323
  --------------
  minimum time:     452.205 μs (0.00% GC)
  median time:      497.083 μs (0.00% GC)
  mean time:        528.749 μs (2.02% GC)
  maximum time:     3.221 ms (35.36% GC)
  --------------
  samples:          9406
  evals/sample:     1

A minor regression, but the reduction now always happens on the GPU, whereas the old GPUArrays implementation performed the second phase on the CPU (which is invalid when using GPU-specific functions).
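For example (contrived, but it shows the constraint): an operator built from device intrinsics compiles fine inside a kernel, yet would throw if a final reduction phase executed it on the host.

```julia
using CuArrays, CUDAnative

xs = CuArrays.rand(10^6)

# Contrived associative operator built from device intrinsics:
# log(exp(a) * exp(b)) == a + b, but CUDAnative.exp/log only run in
# GPU kernel code and error when called on the CPU.
device_add(a, b) = CUDAnative.log(CUDAnative.exp(a) * CUDAnative.exp(b))

# Works only if every phase of the reduction stays on the GPU.
reduce(device_add, xs, init=0f0)
```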
