Add evalpoly function #32753
Conversation
What's the point of having both?
I'd say there's no reason to have the macro if the function can do the same job just as fast, but the macro can't be removed until 2.0.
Hoping that the compiler will end up going through 12 calls of inlining (which might be true for floats, but not for other types) seems worse. Although I can see the argument that it might facilitate writing more generic code, if the compiler fails to inline through the whole thing, it is likely slower than the loop-based version.
Good points! I guess my thinking was just that macros tend to be a bit more brittle than functions. One thing I like about this is that I can do things like splat a container of coefficients into the function (see the sketch below).
I know that's not going to produce optimal code unless the container's length is known to the compiler. However, this should probably be clarified in the docstring, along with the note I wrote saying that it should be just as fast as the macro.
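The code example in the comment above was elided in this transcript; presumably it resembled splatting coefficients into a recursive Horner function. A stand-in sketch (hypothetical name, same pattern as the horner defined two comments below):

# Stand-in sketch: a recursive Horner evaluation that accepts
# splatted coefficients, which a macro cannot.
horner_sketch(x, p1) = p1
horner_sketch(x, p1, ps...) = muladd(x, horner_sketch(x, ps...), p1)

v = (1.0, 2.0, 3.0)
horner_sketch(0.5, v...)  # 1.0 + 2.0*0.5 + 3.0*0.25 == 2.75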
Unless I've done this wrong (which is always possible), it looks like the ability to splat into myhorner is a misfeature?
Splatting into horner works fine here:

using BenchmarkTools

horner(x) = zero(x)
horner(x, p1) = p1
horner(x, p1, ps...) = muladd(x, horner(x, ps...), p1)
f(x, v) = horner(x, v...)
g(x, p1, p2, p3) = horner(x, p1, p2, p3)
v = (1.0, 2.0, 3.0)
x = 0.5  # test point; the original snippet didn't show x being defined
@btime f($(Ref(x))[], $v)             # 1.696 ns (0 allocations: 0 bytes)
@btime g($(Ref(x))[], 1.0, 2.0, 3.0)  # 1.696 ns (0 allocations: 0 bytes)
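For contrast (an assumed illustration, not from the thread): splatting a container whose length is not part of its type, such as a Vector, hits dynamic dispatch in the same recursion:

w = [1.0, 2.0, 3.0]
f(x, w)  # works, but the compiler can't see the number of arguments
         # ahead of time, so this path allocates and is far slower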
Another advantage of doing this as a function is that we could have something like a method for coefficients stored in an AbstractVector. Plus, people can add methods to the function for their own coefficient types.
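A hypothetical sketch of that extensibility (the Chebyshev type and Clenshaw method below are invented for illustration; extending Base.evalpoly like this works on Julia ≥ 1.4):

struct Chebyshev{T}
    coeffs::Vector{T}
end

# Clenshaw recurrence for a Chebyshev-basis polynomial: a custom
# evaluation scheme selected by dispatch on the coefficient container,
# which a macro interface could not provide.
function Base.evalpoly(x, p::Chebyshev)
    c = p.coeffs
    isempty(c) && return zero(x)
    bk1 = bk2 = zero(x)
    for k in length(c):-1:2
        bk1, bk2 = muladd(2x, bk1, c[k] - bk2), bk1
    end
    muladd(x, bk1, c[1] - bk2)
end

evalpoly(0.3, Chebyshev([1.0, 2.0, 3.0]))  # 1 + 2*T₁(0.3) + 3*T₂(0.3) == -0.86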
Now that makes good sense.
If we add a function here, I would call it evalpoly. See also #18410.
I'm not sure I see the advantage to having functions for this instead of the macros.
One case I showed above where the functions actually do something that the macros can't do is stuff like

f(x, v::SVector) = evalpoly(x, v...)

which is just as efficient as the macro. The crucial point here is that for statically sized containers such as tuples and SVectors, the length is encoded in the type, so the compiler can unroll the evaluation. If we're worried about having to rely on inlining to generate efficient code, we could also just do this as a generated function. Then we get to guarantee that the loop is unrolled, but have (most of) the advantages of functions, like using type-level information to dispatch and generate custom implementations.
Hi all, sorry for the hiatus on this; I was away on vacation. I have made some overhauls.
I also thought a bit more about why this makes sense as a function more than a macro. To clarify what I said before, the loop unrolling being done in the macro happens at parse time, before any type information is available. In the function version, the unrolling can instead be driven by type information, such as the tuple length. Looking forward, my preference would be to keep the generated, unrolled method as the default for tuples. Perhaps I should instead get rid of the option to not use the generated function.
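A minimal illustration of that parse-time vs. type-level distinction, using the evalpoly/@evalpoly names this PR settled on:

coeffs = (1.0, 2.0, 3.0)

# The macro must see each coefficient as its own expression at parse
# time; it has no way to accept a container of statically unknown length.
f_macro(x) = @evalpoly(x, 1.0, 2.0, 3.0)

# The function takes the coefficients as a value; the tuple's length is
# part of its type, so the compiler can still unroll the evaluation.
f_func(x) = evalpoly(x, coeffs)

f_macro(0.5) == f_func(0.5)  # true (both compute 2.75)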
What remains to be decided for merging this into v1.4?
@MasonProtter your benchmark timings are meaningless since they're sub-nanosecond, i.e. the computation is being hoisted out of the benchmark loop. With a 50-element tuple:

@btime evalpoly1($(Ref(1.0))[], $(ntuple(n -> (-1)^n / n, 50)))
  48.683 ns (0 allocations: 0 bytes)
@btime evalpoly2($(Ref(1.0))[], $(ntuple(n -> (-1)^n / n, 50)))
  36.253 ns (0 allocations: 0 bytes)

In addition, code generation shows that the two versions are not equivalent.
@yuyichao I'm not arguing for or against evalpoly1 or evalpoly2, but my tests show that identical code is generated in both cases using your test example in #18410 (comment)
@MasonProtter on closer thought, I'm in support of the PR as it currently stands. From my perspective it's better to guarantee unrolling in the static tuple case.
No, I don't believe they are meaningless. You're measuring the case where the compiler doesn't know the inputs at compile time; the interesting comparison also includes the case where it does.

However, further testing I've done has shown that this process is fairly finicky, and the difference in behaviour between the two methods is not always reliable. For instance,

julia> f() = evalpoly1(1.0, ntuple(n -> (-1)^n / n, 50))
f (generic function with 1 method)
julia> g() = evalpoly2(1.0, ntuple(n -> (-1)^n / n, 50))
g (generic function with 1 method)
julia> @btime f()
2.305 μs (54 allocations: 1.77 KiB)
-0.6832471605759182
julia> @btime g()
2.291 μs (55 allocations: 1.78 KiB)
shows basically no difference, though I'd have naïvely hoped that Julia would be able to lower both of these functions to just a constant.
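One direct way to check whether either version actually constant-folds (a suggested check, not from the thread):

using InteractiveUtils  # for @code_llvm outside the REPL

@code_llvm f()  # if folding happened, the body is a single `ret double` of the constant
@code_llvm g()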
I guess the only question remaining here is whether we should employ the trick of dropping to the looping definition after a length threshold, as sketched below.
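A sketch of that trick (hypothetical names; the cutoff of 32 is arbitrary, not from the PR):

# Runtime loop used past the threshold.
function evalpoly_loop(x, p::Tuple)
    ex = p[end]
    for i in length(p)-1:-1:1
        ex = muladd(x, ex, p[i])
    end
    ex
end

# Unroll short tuples; fall back to the loop for long ones.
@generated function evalpoly_threshold(x, p::Tuple)
    N = length(p.parameters)
    N > 32 && return :(evalpoly_loop(x, p))  # arbitrary cutoff
    ex = :(p[end])
    for i in N-1:-1:1
        ex = :(muladd(x, $ex, p[$i]))
    end
    ex
end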
@musm I don't see a point where the looping method ever outperforms the generated method. Keep in mind these timings across tuple sizes:

julia> function evalpoly1(x, p::Tuple)
N = length(p)
ex = p[end]
for i in N-1:-1:1
ex = muladd(x, ex, p[i])
end
ex
end
evalpoly1 (generic function with 1 method)
julia> @generated function evalpoly2(x, p::Tuple)
N = length(p.parameters)
ex = :(p[end])
for i in N-1:-1:1
ex = :(muladd(x, $ex, p[$i]))
end
ex
end
evalpoly2 (generic function with 1 method)
julia> for N in [1, 10, 30, 100, 200, 500, 1000]
@show N
print(" Non-generated: "); @btime evalpoly1($(Ref(1.0))[], $(ntuple(n -> (-1)^n / n, N)))
print(" Generated: "); @btime evalpoly2($(Ref(1.0))[], $(ntuple(n -> (-1)^n / n, N)))
end
N = 1
Non-generated: 0.029 ns (0 allocations: 0 bytes)
Generated: 0.029 ns (0 allocations: 0 bytes)
N = 10
Non-generated: 2.310 ns (0 allocations: 0 bytes)
Generated: 2.305 ns (0 allocations: 0 bytes)
N = 30
Non-generated: 18.259 ns (0 allocations: 0 bytes)
Generated: 12.537 ns (0 allocations: 0 bytes)
N = 100
Non-generated: 85.047 ns (0 allocations: 0 bytes)
Generated: 61.601 ns (0 allocations: 0 bytes)
N = 200
Non-generated: 196.571 ns (0 allocations: 0 bytes)
Generated: 162.356 ns (0 allocations: 0 bytes)
N = 500
Non-generated: 530.747 ns (0 allocations: 0 bytes)
Generated: 495.928 ns (0 allocations: 0 bytes)
N = 1000
Non-generated: 1.092 μs (0 allocations: 0 bytes)
Generated: 1.061 μs (0 allocations: 0 bytes) |
Agreed. Ok, I think we should go ahead and merge this PR as is!
Looks good to me. I don't know if it matters, but …
@JeffBezanson Good catch. One way I could fix that is by making it macroexpand to something like …
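The inline expressions in the two comments above were lost in this transcript. For reference, a sketch of a macro that simply expands to one evalpoly call (close in spirit to what Base ended up with, though not necessarily the fix discussed here):

macro evalpoly_sketch(z, p...)
    # A single function call: `z` and the coefficients are ordinary
    # call arguments, so each is evaluated exactly once, in order.
    :(evalpoly($(esc(z)), ($(map(esc, p)...),)))
end

@evalpoly_sketch(0.5, 1.0, 2.0, 3.0)  # == evalpoly(0.5, (1.0, 2.0, 3.0))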
The win64 test failure is unrelated; otherwise LGTM.
@MasonProtter could you confirm that some of the elementary math functions, e.g. exp and exp10, didn't regress in performance?
I see only performance gains, at least relative to v1.3.0-rc3.

julia> using Printf, BenchmarkTools
julia> begin
@show VERSION
for x in 10 .* randn(10)
@show x
print(" exp("*@sprintf("%.3f",x)*"): "); @btime exp( $(Ref(x))[])
print(" exp10("*@sprintf("%.3f",x)*"): "); @btime exp10($(Ref(x))[])
end
end
VERSION = v"1.3.0-rc3.0"
x = -7.15243458415763
exp(-7.152): 7.744 ns (0 allocations: 0 bytes)
exp10(-7.152): 9.172 ns (0 allocations: 0 bytes)
x = -8.719494400795519
exp(-8.719): 7.741 ns (0 allocations: 0 bytes)
exp10(-8.719): 9.201 ns (0 allocations: 0 bytes)
x = -29.801956722722643
exp(-29.802): 7.741 ns (0 allocations: 0 bytes)
exp10(-29.802): 9.201 ns (0 allocations: 0 bytes)
x = -8.877111035670996
exp(-8.877): 7.741 ns (0 allocations: 0 bytes)
exp10(-8.877): 8.392 ns (0 allocations: 0 bytes)
x = -0.6928593337570201
exp(-0.693): 6.449 ns (0 allocations: 0 bytes)
exp10(-0.693): 9.200 ns (0 allocations: 0 bytes)
x = 2.866124556333614
exp(2.866): 7.737 ns (0 allocations: 0 bytes)
exp10(2.866): 8.117 ns (0 allocations: 0 bytes)
x = 23.658277524062772
exp(23.658): 7.743 ns (0 allocations: 0 bytes)
exp10(23.658): 9.196 ns (0 allocations: 0 bytes)
x = -8.531107335629256
exp(-8.531): 7.748 ns (0 allocations: 0 bytes)
exp10(-8.531): 9.203 ns (0 allocations: 0 bytes)
x = -1.907378473171837
exp(-1.907): 7.748 ns (0 allocations: 0 bytes)
exp10(-1.907): 9.199 ns (0 allocations: 0 bytes)
x = 12.353234804733408
exp(12.353): 7.740 ns (0 allocations: 0 bytes)
exp10(12.353): 9.198 ns (0 allocations: 0 bytes)

julia> using Printf, BenchmarkTools
julia> begin
@show VERSION
for x in 10 .* randn(10)
@show x
print(" exp("*@sprintf("%.3f",x)*"): "); @btime exp( $(Ref(x))[])
print(" exp10("*@sprintf("%.3f",x)*"): "); @btime exp10($(Ref(x))[])
end
end
VERSION = v"1.4.0-DEV.539"
x = 3.0797159237964333
exp(3.080): 7.720 ns (0 allocations: 0 bytes)
exp10(3.080): 8.107 ns (0 allocations: 0 bytes)
x = 11.524550980616235
exp(11.525): 7.727 ns (0 allocations: 0 bytes)
exp10(11.525): 8.647 ns (0 allocations: 0 bytes)
x = -13.828224826793532
exp(-13.828): 7.726 ns (0 allocations: 0 bytes)
exp10(-13.828): 8.652 ns (0 allocations: 0 bytes)
x = 10.171399187874462
exp(10.171): 7.727 ns (0 allocations: 0 bytes)
exp10(10.171): 8.648 ns (0 allocations: 0 bytes)
x = -14.692467227301847
exp(-14.692): 7.719 ns (0 allocations: 0 bytes)
exp10(-14.692): 8.650 ns (0 allocations: 0 bytes)
x = 12.429415466997678
exp(12.429): 7.725 ns (0 allocations: 0 bytes)
exp10(12.429): 8.646 ns (0 allocations: 0 bytes)
x = 3.5732507205285162
exp(3.573): 7.727 ns (0 allocations: 0 bytes)
exp10(3.573): 8.649 ns (0 allocations: 0 bytes)
x = -5.010189256550127
exp(-5.010): 7.727 ns (0 allocations: 0 bytes)
exp10(-5.010): 8.648 ns (0 allocations: 0 bytes)
x = -4.809001460910974
exp(-4.809): 7.724 ns (0 allocations: 0 bytes)
exp10(-4.809): 8.653 ns (0 allocations: 0 bytes)
x = 19.73389331115626
exp(19.734): 7.724 ns (0 allocations: 0 bytes)
exp10(19.734): 8.651 ns (0 allocations: 0 bytes)
Great! For 2.0 we should deprecate @evalpoly. In the meantime, we can replace all instances of it in Base with the function.
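The replacement is mechanical; the coefficients just move into a tuple:

x = 0.5
@evalpoly(x, 1.0, 2.0, 3.0)    # macro form
evalpoly(x, (1.0, 2.0, 3.0))   # equivalent function form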
This caused a regression for some functions using SIMD intrinsics with llvmcall, because the evalpoly call stops being inlined: the compiler's cost model charges llvmcall-based vector operations far more than their scalar counterparts (see the costs below), so the unrolled muladd chain that the macro used to splice directly into the caller now sits behind a call the inliner rejects (myexp and sx32 here come from the poster's SLEEFPirates test session):

julia> @inline function expb_kernel(::Val{ℯ}, x::Union{Float32, SVec{W, Float32} where W})
return SLEEFPirates.@horner(x, 1.0f0, 1.0f0, 0.5f0,
0.16666415f0, 0.041666083f0,
0.008375129f0, 0.0013956056f0)
end
expb_kernel (generic function with 1 method)
julia> @btime myexp($(Ref(sx32))[]) # vector
3.512 ns (0 allocations: 0 bytes)
SVec{16,Float32}<16639.281f0, 1.8628123f0, 5.235295f0, 7.906962f-5, 1822.3082f0, 2.4061686f-9, 0.0030119985f0, 0.0002081529f0, 7.7609496f0, 23.372593f0, 8.717485f11, 3.6029082f6, 13742.452f0, 3.1532597f-9, 0.007988205f0, 0.00020794437f0>
julia> @inline function expb_kernel(::Val{ℯ}, x::Union{Float32, SVec{W, Float32} where W})
return evalpoly(x, (1.0f0, 1.0f0, 0.5f0,
0.16666415f0, 0.041666083f0,
0.008375129f0, 0.0013956056f0))
end
expb_kernel (generic function with 1 method)
julia> @btime myexp($(Ref(sx32))[]) # vector
4.975 ns (0 allocations: 0 bytes)
SVec{16,Float32}<16639.281f0, 1.8628123f0, 5.235295f0, 7.906962f-5, 1822.3082f0, 2.4061686f-9, 0.0030119985f0, 0.0002081529f0, 7.7609496f0, 23.372593f0, 8.717485f11, 3.6029082f6, 13742.452f0, 3.1532597f-9, 0.007988205f0, 0.00020794437f0>

Costs:

julia> function cost(f, tt) # Thanks to Tim Holy; note this function needs updating for Julia 1.6
params = Core.Compiler.Params(typemax(UInt))
mi = Base.method_instances(f, tt)[1]
ci = code_typed(f, tt)[1][1]
opt = Core.Compiler.OptimizationState(mi, params)
cost(stmt::Expr) = Core.Compiler.statement_cost(stmt, -1, ci, opt.sptypes, opt.slottypes, opt.params)
cost(stmt) = 0
sum(cost, ci.code)
end
cost (generic function with 1 method)
julia> cost(+, Tuple{SVec{16,Float32},SVec{16,Float32}})
10
julia> cost(+, Tuple{Float32,Float32})
1
julia> cost(+, Tuple{SVec{8,Float32},SVec{8,Float32}})
10
julia> cost(muladd, Tuple{SVec{8,Float32},SVec{8,Float32},SVec{8,Float32}})
10
julia> cost(muladd, Tuple{Float32,Float32,Float32})
5

On my computer (with AVX512), all these costs should match the scalar versions, since the vector operations execute just as fast as the scalar ones there.
Ref #32753 (comment)
Note that comment was in regard to the original implementation of this PR, which relied on recursion rather than a generated function. There is no reliance on '12 steps of inlining' here unless the compiler decides not to use the generated method.
Do we still need a generated function for this? I noticed recently that the compiler seems able to unroll/inline the plain loop version just fine.
For code generation, no, but for effect inference, unfortunately, yes:

julia> Base.infer_effects(evalpoly, (Float64, Tuple{Float64, Float64, Float64}))
(+c,+e,+n,+t,+s)
julia> Base.infer_effects(Base.Math._evalpoly, (Float64, Tuple{Float64, Float64, Float64}))
(!c,+e,!n,!t,+s)
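A sketch of what those effect bits buy (expected behavior, not quoted from the thread): with clean effects, a call on constant inputs can be folded away entirely.

h() = evalpoly(2.0, (1.0, 2.0, 3.0))  # 1 + 2*2 + 3*4
@code_typed h()  # with (+c,+n,+t) effects this should reduce to `return 17.0`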
Code taken from Steven G. Johnson's JuliaCon 2019 talk. Gives the same performance as @horner.

A lot has changed since I wrote this PR originally, and the code above is not the current implementation. See #32753 (comment) for the current state of this PR.