Large performance regression in trunc with vector argument #19849

Closed
giordano opened this issue Jan 3, 2017 · 6 comments · Fixed by #19879
Labels:
broadcast: Applying a function over a collection
performance: Must go faster
potential benchmark: Could make a good benchmark in BaseBenchmarks
regression: Regression in behavior compared to a previous version
types and dispatch: Types, subtyping and method dispatch
Milestone: 0.6.0

Comments

@giordano (Contributor) commented Jan 3, 2017

After #19791, the manually vectorized method of trunc is deprecated and the dot syntax should be used. However, the 2-arg method is now much slower than before.

I have a function where this regression is noticeable, even though trunc(Int, array) is a relatively small part of the computation, and profiling confirms that broadcasting trunc accounts for a sizeable share of the time.

Before the PR (manually vectorized method):

julia> versioninfo()
Julia Version 0.6.0-dev.1807
Commit 26c8d856a (2016-12-31 04:12 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i7-4700MQ CPU @ 2.40GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, haswell)

julia> using BenchmarkTools

julia> const array = collect(0.0:0.1:100.0);

julia> @benchmark trunc(Int, array)
BenchmarkTools.Trial: 
  memory estimate:  8.02 kb
  allocs estimate:  2
  --------------
  minimum time:     2.611 μs (0.00% GC)
  median time:      2.722 μs (0.00% GC)
  mean time:        3.242 μs (12.88% GC)
  maximum time:     165.228 μs (96.58% GC)
  --------------
  samples:          10000
  evals/sample:     9
  time tolerance:   5.00%
  memory tolerance: 1.00%

After the PR (dot syntax):

julia> versioninfo()
Julia Version 0.6.0-dev.1847
Commit 8f9036a7d (2017-01-02 23:39 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i7-4700MQ CPU @ 2.40GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, haswell)

julia> const array = collect(0.0:0.1:100.0);

julia> using BenchmarkTools

julia> @benchmark trunc.(Int, array)
BenchmarkTools.Trial: 
  memory estimate:  23.75 kb
  allocs estimate:  1008
  --------------
  minimum time:     270.303 μs (0.00% GC)
  median time:      274.493 μs (0.00% GC)
  mean time:        281.129 μs (0.67% GC)
  maximum time:     2.250 ms (86.97% GC)
  --------------
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark trunc.(array)
BenchmarkTools.Trial: 
  memory estimate:  8.00 kb
  allocs estimate:  1
  --------------
  minimum time:     952.900 ns (0.00% GC)
  median time:      1.140 μs (0.00% GC)
  mean time:        1.409 μs (15.34% GC)
  maximum time:     148.246 μs (97.14% GC)
  --------------
  samples:          10000
  evals/sample:     10
  time tolerance:   5.00%
  memory tolerance: 1.00%

Now trunc.(Int, array) is 100× slower than the old trunc(Int, array) and allocates far more memory. @code_warntype trunc.(Int, array) indicates that the returned value is type-stable, but there are type-unstable variables inside the function.

Note that the current trunc.(array) is comparable to the old trunc(Int, array).
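
For reference, a sketch of how that inspection can be reproduced (output omitted here):

julia> @code_warntype trunc.(Int, array)   # internal variables are flagged as Any, while the return type is still inferred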

Edit: for comparison, these are the results of the same benchmarks in Julia 0.5:

julia> @benchmark trunc(Int, array)
BenchmarkTools.Trial: 
  memory estimate:  8.03 kb
  allocs estimate:  3
  --------------
  minimum time:     2.918 μs (0.00% GC)
  median time:      3.013 μs (0.00% GC)
  mean time:        3.355 μs (7.09% GC)
  maximum time:     107.170 μs (90.67% GC)
  --------------
  samples:          10000
  evals/sample:     9
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark trunc(array)
BenchmarkTools.Trial: 
  memory estimate:  8.03 kb
  allocs estimate:  3
  --------------
  minimum time:     3.505 μs (0.00% GC)
  median time:      3.585 μs (0.00% GC)
  mean time:        3.938 μs (6.13% GC)
  maximum time:     112.471 μs (89.47% GC)
  --------------
  samples:          10000
  evals/sample:     8
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark trunc.(array)
BenchmarkTools.Trial: 
  memory estimate:  8.02 kb
  allocs estimate:  2
  --------------
  minimum time:     2.882 μs (0.00% GC)
  median time:      3.000 μs (0.00% GC)
  mean time:        3.317 μs (6.78% GC)
  maximum time:     99.173 μs (83.76% GC)
  --------------
  samples:          10000
  evals/sample:     9
  time tolerance:   5.00%
  memory tolerance: 1.00%

trunc.(array) was already faster than trunc(array) and is now even faster; the problem is the method with a type argument.

@KristofferC added the performance, regression, and broadcast labels Jan 3, 2017
@KristofferC (Member) commented Jan 3, 2017

There have been previous performance regressions when using types as arguments to functions. They were fixed by explicitly specializing on the type, i.e. f{T}(::Type{T}). Maybe it is the same symptom here?
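
For illustration, a minimal sketch of that idiom (f1 and f2 are hypothetical names, in 0.6-era syntax; this is not the actual fix):

julia> f1(T, x) = trunc(T, x);             # T is a plain argument, so the compiler does not specialize on it

julia> f2{T}(::Type{T}, x) = trunc(T, x);  # T is a static parameter: a method instance is compiled per type, and trunc(T, x) becomes a direct call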

@tkelman added the potential benchmark label Jan 4, 2017
@martinholters (Member):

There is an ugly workaround:

julia> @benchmark trunc.(Int, $array)
BenchmarkTools.Trial: 
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  23.75 kb
  allocs estimate:  1008
  minimum time:     380.08 μs (0.00% GC)
  median time:      428.49 μs (0.00% GC)
  mean time:        452.06 μs (0.44% GC)
  maximum time:     16.49 ms (0.00% GC)

julia> @benchmark (x->trunc(Int, x)).($array)
BenchmarkTools.Trial: 
  samples:          10000
  evals/sample:     8
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  8.00 kb
  allocs estimate:  1
  minimum time:     3.28 μs (0.00% GC)
  median time:      4.37 μs (0.00% GC)
  mean time:        5.13 μs (3.22% GC)
  maximum time:     2.01 ms (0.00% GC)

@martinholters (Member):

I have the following theory for the cause of the problem: broadcast does not specialize on the ::Type{Int} argument, hence every call to trunc has to go through dynamic dispatch.
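
If that is right, a minimal stand-in for the broadcast kernel should show the same cost (a hypothetical experiment; kernel is a made-up name, and it assumes the heuristic that a plain Type argument merely passed through to another function is not specialized on):

julia> kernel(f, T, x) = [f(T, v) for v in x];   # T reaches the inner call as an abstractly typed value

julia> @benchmark kernel(trunc, Int, $array)     # if the theory holds, this pays dynamic dispatch per element, just like trunc.(Int, array)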

@tkelman (Contributor) commented Jan 4, 2017

cc @pabloferz as well, if you have any ideas

@stevengj (Member) commented Jan 4, 2017

Maybe we need something like _broadcast_getindex{T}(::ScalarType, ::Type{T}, I) = (Base.@_pure_meta; T) in broadcast.jl?

@stevengj added the types and dispatch label Jan 4, 2017
@stevengj added this to the 0.6.0 milestone Jan 4, 2017
@pabloferz (Contributor) commented Jan 4, 2017

This performance issue and #19421 (comment) have the same root cause. I already have an idea on how to fix it and will submit a PR.
