
Same function, same arguments, different generated code #20913

Closed
KristofferC opened this issue Mar 6, 2017 · 14 comments
Labels
compiler:codegen (Generation of LLVM IR and native code)
performance (Must go faster)
potential benchmark (Could make a good benchmark in BaseBenchmarks)

Comments

@KristofferC
Member

KristofferC commented Mar 6, 2017

Consider the following small fixed-size matrix implementation, together with a reduce function and two implementations of sum:

# R×C matrix with element type T, stored as a flat NTuple of length RC (= R*C)
struct FixedMatrix{R,C,T,RC} <: AbstractMatrix{T}
    data::NTuple{RC, T}
end

Base.size(::FixedMatrix{R,C}) where {R, C} = (R, C)
Base.length(::FixedMatrix{R,C,T,RC}) where {R,C,T,RC} = RC
Base.IndexStyle(::Type{<:FixedMatrix}) = IndexLinear()
Base.getindex(fm::FixedMatrix, i::Int) = fm.data[i]

@inline function myreduce(op, v0, a::FixedMatrix)
    if length(a) == 0
        return v0
    else
        s = v0
        @inbounds @simd for j = 1:length(a)
            s = op(s, a[j])
        end
        return s
    end
end

sum2(a::FixedMatrix) = myreduce(+, zero(eltype(a)), a)
sum3(a::FixedMatrix) = myreduce(+, 0.0, a)

Now, sum2 and sum3 clearly call (or at least should call) the same method with the same arguments when the matrix element type is Float64 (which I confirmed by putting a @which in there).

However:

julia> using BenchmarkTools

julia> m = FixedMatrix{8, 8, Float64, 64}((rand(64)...,))

julia> @btime sum2($m)
  67.113 ns (0 allocations: 0 bytes)
36.441929014270464

julia> @btime sum3($m)
  8.379 ns (0 allocations: 0 bytes)
36.441929014270464

In fact, the generated code is vastly different (one uses SIMD, the other does not).

julia> @code_llvm sum2(m)

julia> @code_llvm sum3(m)

How can so different code be generated when the exact same function is called with the same arguments?
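For what it's worth, the starting values really are bit-identical Float64 zeros, so the divergence can't come from the Julia-level arguments. A self-contained re-check (re-declaring the issue's type so the snippet runs on its own):

```julia
# Minimal re-declaration of the issue's FixedMatrix so this snippet is standalone.
struct FixedMatrix{R,C,T,RC} <: AbstractMatrix{T}
    data::NTuple{RC,T}
end
Base.size(::FixedMatrix{R,C}) where {R,C} = (R, C)
Base.getindex(fm::FixedMatrix, i::Int) = fm.data[i]
Base.IndexStyle(::Type{<:FixedMatrix}) = IndexLinear()

m = FixedMatrix{2,2,Float64,4}((1.0, 2.0, 3.0, 4.0))

# sum2's starting value, zero(eltype(a)), is the exact same Float64 as sum3's 0.0 ...
@assert zero(eltype(m)) === 0.0
# ... so both entry points hand myreduce bitwise-identical arguments.
@assert typeof(zero(eltype(m))) === Float64
```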

@KristofferC
Member Author

KristofferC commented Mar 6, 2017

Also note that the built-in sum is slower than the one I wrote up there (although probably more robust against accumulating floating-point errors):

julia> @btime sum($m)
  17.394 ns (0 allocations: 0 bytes)
36.441929014270464
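Base's `sum` uses pairwise summation for arrays, which trades a little speed for better rounding behavior than a strict left-to-right loop like `myreduce`. A quick sketch of the accuracy difference (illustrative values chosen so the naive error is visible; not code from this issue):

```julia
# Float32 so rounding error accumulates quickly over a long naive sum.
xs = fill(0.1f0, 10_000)

naive = foldl(+, xs)   # left-to-right accumulation, like myreduce
pairwise = sum(xs)     # Base's pairwise reduction

# The naive sum drifts further from the exact value than the pairwise one.
exact = 10_000 * Float64(0.1f0)
@assert abs(Float64(naive) - exact) > abs(Float64(pairwise) - exact)
```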

@yuyichao
Contributor

yuyichao commented Mar 6, 2017

What's m?

@KristofferC
Member Author

KristofferC commented Mar 6, 2017

Sorry! m = FixedMatrix{8, 8, Float64, 64}((rand(64)...,))

@fredrikekre
Member

fredrikekre commented Mar 6, 2017

julia> @btime sum2($m);
  9.603 ns (0 allocations: 0 bytes)

julia> @btime sum3($m);
  9.341 ns (0 allocations: 0 bytes)

EDIT: The timings above were for a 5x5 matrix. For 8x8:

julia> @btime sum2($m);
  111.510 ns (0 allocations: 0 bytes)

julia> @btime sum3($m);
  9.231 ns (0 allocations: 0 bytes)

@KristofferC
Member Author

What version? Is the generated code the same for you?

@KristofferC
Member Author

My master was 11 days old... maybe it has been fixed already. Will look.

@fredrikekre
Member

I got the same generated code, but that was for the 5x5 case. See my edit in the comment above.

@KristofferC
Member Author

Yes, for 5x5 I think everything gets unrolled.

@KristofferC
Member Author

Same result on 0.5.

@ararslan added the compiler:codegen label on Mar 6, 2017
@JeffBezanson
Member

The only difference in IR seems to be a constant 0.0 vs. sitofp(0). Maybe the sitofp is enough to throw off the optimizer? I guess we should add some early constant folding to sitofp.
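At the Julia level, `sitofp(0)` (LLVM's signed-integer-to-float conversion applied to the constant 0) is just `Float64(0)`, and folding it must yield exactly the literal the other entry point already passes. A small illustration of why the fold is value-preserving:

```julia
# Folding an integer-to-float conversion of a constant yields the literal itself.
@assert Float64(0) === 0.0
# Same bit pattern, so IR with the conversion folded is identical for both paths.
@assert reinterpret(UInt64, Float64(0)) === reinterpret(UInt64, 0.0)
```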

@Keno
Member

Keno commented Mar 6, 2017

That would be either a serious bug in LLVM or a pass order problem. Either way we should find out.

@JeffBezanson added the performance label on Mar 6, 2017
@KristofferC
Member Author

KristofferC commented Apr 18, 2017

Any ideas here? AFAIU this inhibits SIMD for the reductions in StaticArrays.

@KristofferC
Member Author

Still happens.

@KristofferC added the potential benchmark label on Aug 2, 2017
@KristofferC
Member Author

This seems to be fixed on 0.7 🎉.

julia> @btime sum2($m)
  7.670 ns (0 allocations: 0 bytes)
35.36299421828035

julia> @btime sum3($m)
  7.603 ns (0 allocations: 0 bytes)
35.36299421828035
