Large performance regression in trunc with vector argument #19849

Closed
giordano opened this issue Jan 3, 2017 · 6 comments · Fixed by #19879
Labels:
broadcast: Applying a function over a collection
performance: Must go faster
potential benchmark: Could make a good benchmark in BaseBenchmarks
regression: Regression in behavior compared to a previous version
types and dispatch: Types, subtyping and method dispatch
Milestone: 0.6.0

Comments

@giordano (Contributor) commented Jan 3, 2017

After #19791, the manually vectorized method of trunc is deprecated and the dot syntax should be used. However, the 2-arg method is now much slower than before.

I have a function where this regression is noticeable, even though trunc(Int, array) is a relatively small part of the computation, and profiling confirms that broadcasting trunc accounts for a sizeable share of the time.

Before the PR (manually vectorized method):

julia> versioninfo()
Julia Version 0.6.0-dev.1807
Commit 26c8d856a (2016-12-31 04:12 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i7-4700MQ CPU @ 2.40GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, haswell)

julia> using BenchmarkTools

julia> const array = collect(0.0:0.1:100.0);

julia> @benchmark trunc(Int, array)
BenchmarkTools.Trial: 
  memory estimate:  8.02 kb
  allocs estimate:  2
  --------------
  minimum time:     2.611 μs (0.00% GC)
  median time:      2.722 μs (0.00% GC)
  mean time:        3.242 μs (12.88% GC)
  maximum time:     165.228 μs (96.58% GC)
  --------------
  samples:          10000
  evals/sample:     9
  time tolerance:   5.00%
  memory tolerance: 1.00%

After the PR (dot syntax):

julia> versioninfo()
Julia Version 0.6.0-dev.1847
Commit 8f9036a7d (2017-01-02 23:39 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i7-4700MQ CPU @ 2.40GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, haswell)

julia> const array = collect(0.0:0.1:100.0);

julia> using BenchmarkTools

julia> @benchmark trunc.(Int, array)
BenchmarkTools.Trial: 
  memory estimate:  23.75 kb
  allocs estimate:  1008
  --------------
  minimum time:     270.303 μs (0.00% GC)
  median time:      274.493 μs (0.00% GC)
  mean time:        281.129 μs (0.67% GC)
  maximum time:     2.250 ms (86.97% GC)
  --------------
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark trunc.(array)
BenchmarkTools.Trial: 
  memory estimate:  8.00 kb
  allocs estimate:  1
  --------------
  minimum time:     952.900 ns (0.00% GC)
  median time:      1.140 μs (0.00% GC)
  mean time:        1.409 μs (15.34% GC)
  maximum time:     148.246 μs (97.14% GC)
  --------------
  samples:          10000
  evals/sample:     10
  time tolerance:   5.00%
  memory tolerance: 1.00%

Now trunc.(Int, array) is 100× slower than the old trunc(Int, array) and allocates far more memory. @code_warntype trunc.(Int, array) indicates that the returned value is type-stable, but there are type-unstable variables inside the function.

Note that the current trunc.(array) is comparable to the old trunc(Int, array).
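
For reference, a sketch of how that inspection can be reproduced (output omitted here):

julia> @code_warntype trunc.(Int, array)   # internal variables are flagged as Any, while the return type is still inferred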

Edit: for comparison, these are the results of the same benchmarks in Julia 0.5:

julia> @benchmark trunc(Int, array)
BenchmarkTools.Trial: 
  memory estimate:  8.03 kb
  allocs estimate:  3
  --------------
  minimum time:     2.918 μs (0.00% GC)
  median time:      3.013 μs (0.00% GC)
  mean time:        3.355 μs (7.09% GC)
  maximum time:     107.170 μs (90.67% GC)
  --------------
  samples:          10000
  evals/sample:     9
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark trunc(array)
BenchmarkTools.Trial: 
  memory estimate:  8.03 kb
  allocs estimate:  3
  --------------
  minimum time:     3.505 μs (0.00% GC)
  median time:      3.585 μs (0.00% GC)
  mean time:        3.938 μs (6.13% GC)
  maximum time:     112.471 μs (89.47% GC)
  --------------
  samples:          10000
  evals/sample:     8
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark trunc.(array)
BenchmarkTools.Trial: 
  memory estimate:  8.02 kb
  allocs estimate:  2
  --------------
  minimum time:     2.882 μs (0.00% GC)
  median time:      3.000 μs (0.00% GC)
  mean time:        3.317 μs (6.78% GC)
  maximum time:     99.173 μs (83.76% GC)
  --------------
  samples:          10000
  evals/sample:     9
  time tolerance:   5.00%
  memory tolerance: 1.00%

trunc.(array) was already faster than trunc(array) and is now even faster; the problem is the method with a type argument.

@KristofferC added the performance, regression, and broadcast labels Jan 3, 2017
@KristofferC (Member) commented Jan 3, 2017

There have been previous performance regressions when using types as arguments to functions. They were fixed by explicitly specializing on the type, i.e. f{T}(::Type{T}). Maybe it is the same symptom here?
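
For illustration, a minimal sketch of that idiom (f1 and f2 are hypothetical names, in 0.6-era syntax; this is not the actual fix):

julia> f1(T, x) = trunc(T, x);             # T is a plain argument, so the compiler does not specialize on it

julia> f2{T}(::Type{T}, x) = trunc(T, x);  # T is a static parameter: a method instance is compiled per type, and trunc(T, x) becomes a direct call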

@tkelman added the potential benchmark label Jan 4, 2017
@martinholters (Member):

There is an ugly workaround:

julia> @benchmark trunc.(Int, $array)
BenchmarkTools.Trial: 
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  23.75 kb
  allocs estimate:  1008
  minimum time:     380.08 μs (0.00% GC)
  median time:      428.49 μs (0.00% GC)
  mean time:        452.06 μs (0.44% GC)
  maximum time:     16.49 ms (0.00% GC)

julia> @benchmark (x->trunc(Int, x)).($array)
BenchmarkTools.Trial: 
  samples:          10000
  evals/sample:     8
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  8.00 kb
  allocs estimate:  1
  minimum time:     3.28 μs (0.00% GC)
  median time:      4.37 μs (0.00% GC)
  mean time:        5.13 μs (3.22% GC)
  maximum time:     2.01 ms (0.00% GC)

@martinholters (Member):

I have the following theory for the cause of the problem: broadcast does not specialize on the ::Type{Int} argument, hence every call to trunc has to go through dynamic dispatch.
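
If that is right, a minimal stand-in for the broadcast kernel should show the same cost (a hypothetical experiment; kernel is a made-up name, and it assumes the heuristic that a plain Type argument merely passed through to another function is not specialized on):

julia> kernel(f, T, x) = [f(T, v) for v in x];   # T reaches the inner call as an abstractly typed value

julia> @benchmark kernel(trunc, Int, $array)     # if the theory holds, this pays dynamic dispatch per element, just like trunc.(Int, array)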

@tkelman (Contributor) commented Jan 4, 2017

cc @pabloferz as well, if you have any ideas

@stevengj (Member) commented Jan 4, 2017

Maybe we need something like _broadcast_getindex{T}(::ScalarType, ::Type{T}, I) = (Base.@_pure_meta; T) in broadcast.jl?

@stevengj added the types and dispatch label Jan 4, 2017
@stevengj added this to the 0.6.0 milestone Jan 4, 2017
@pabloferz (Contributor) commented Jan 4, 2017

This performance issue and #19421 (comment) have the same root cause. I already have an idea on how to fix it and will submit a PR.
