Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Poor performance of mul! with scalar #33105

Closed
dkarrasch opened this issue Aug 28, 2019 · 5 comments · Fixed by #33108 or #33187
Closed

Poor performance of mul! with scalar #33105

dkarrasch opened this issue Aug 28, 2019 · 5 comments · Fixed by #33108 or #33187
Labels
linear algebra Linear algebra performance Must go faster potential benchmark Could make a good benchmark in BaseBenchmarks regression Regression in behavior compared to a previous version
Milestone

Comments

@dkarrasch
Copy link
Member

On

julia> versioninfo()
Julia Version 1.3.0-rc1.0
Commit 768b25f6a8 (2019-08-18 00:04 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.6.0)
  CPU: Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)

I see

using BenchmarkTools, LinearAlgebra
x = rand(10); y = similar(x); λ = rand();
@btime mul!($y, $λ, $x) # 1.117 μs (30 allocations: 640 bytes)

In contrast, the hand-written

@btime $y .= $λ .* $x # 9.764 ns (0 allocations: 0 bytes)

as expected. Interestingly, on Julia v1.2 I get

@btime $y .= $λ .* $x # 20.617 ns (0 allocations: 0 bytes)
@btime mul!($y, $λ, $x) # 11.733 ns (0 allocations: 0 bytes)

so it seems like broadcasting has significantly improved, and something is simply going wrong with mul!.

@KristofferC KristofferC added linear algebra Linear algebra performance Must go faster regression Regression in behavior compared to a previous version labels Aug 28, 2019
@KristofferC KristofferC added this to the 1.3 milestone Aug 28, 2019
@KristofferC
Copy link
Member

KristofferC commented Aug 28, 2019

Looks like the splatting in _modify! might be a problem? cc @tkf

Might be good old #29114 again.

; ┌ @ /Users/kristoffer/julia/usr/share/julia/stdlib/v1.4/LinearAlgebra/src/generic.jl:65 within `_modify!'
   %58 = call noalias nonnull %jl_value_t addrspace(10)* @jl_gc_pool_alloc(i8* %52, i32 1424, i32 32) #1
   %59 = bitcast %jl_value_t addrspace(10)* %58 to %jl_value_t addrspace(10)* addrspace(10)*
   %60 = getelementptr %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)* addrspace(10)* %59, i64 -1
   store %jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 4604265024 to %jl_value_t*) to %jl_value_t addrspace(10)*), %jl_value_t addrspace(10)* addrspace(10)* %60
   %61 = addrspacecast %jl_value_t addrspace(10)* %58 to %jl_value_t addrspace(11)*
   %62 = bitcast %jl_value_t addrspace(10)* %58 to %jl_value_t addrspace(10)* addrspace(10)*
   store %jl_value_t addrspace(10)* %0, %jl_value_t addrspace(10)* addrspace(10)* %62, align 8
   %63 = bitcast %jl_value_t addrspace(11)* %61 to i8 addrspace(11)*
   %64 = getelementptr inbounds i8, i8 addrspace(11)* %63, i64 8
   %65 = bitcast i8 addrspace(11)* %64 to double addrspace(11)*
   store double %57, double addrspace(11)* %65, align 8
   %66 = getelementptr %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)** %gcframe, i32 3
   store %jl_value_t addrspace(10)* %58, %jl_value_t addrspace(10)** %66
   %67 = call %jl_value_t addrspace(10)* @jl_box_int64(i64 signext %value_phi13)
   %68 = getelementptr %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)** %gcframe, i32 2
   store %jl_value_t addrspace(10)* %67, %jl_value_t addrspace(10)** %68
   %69 = getelementptr %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)** %4, i32 0
   store %jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 4444035360 to %jl_value_t*) to %jl_value_t addrspace(10)*), %jl_value_t addrspace(10)** %69
   %70 = getelementptr %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)** %4, i32 1
   store %jl_value_t addrspace(10)* %58, %jl_value_t addrspace(10)** %70
   %71 = getelementptr %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)** %4, i32 2
   store %jl_value_t addrspace(10)* %67, %jl_value_t addrspace(10)** %71
   %72 = call nonnull %jl_value_t addrspace(10)* @jl_f__apply(%jl_value_t addrspace(10)* addrspacecast (%jl_value_t* null to %jl_value_t addrspace(10)*), %jl_value_t addrspace(10)** %4, i32 3)
; └

@KristofferC KristofferC added the potential benchmark Could make a good benchmark in BaseBenchmarks label Aug 28, 2019
@tkf
Copy link
Member

tkf commented Aug 28, 2019

Looks like the splatting in _modify! might be a problem?

Looks like so. Using (IC,) instead of IC in _modify! fixes the performance.

julia> using LinearAlgebra: MulAddMul, _modify!

julia> function mymul!(C::AbstractArray, X::AbstractArray, s::Number, _add::MulAddMul)
           if length(C) != length(X)
               throw(DimensionMismatch("first array has length $(length(C)) which does not match the length of the second, $(length(X))."))
           end
           for (IC, IX) in zip(eachindex(C), eachindex(X))
               @inbounds _modify!(_add, X[IX] * s, C, (IC,))
           end
           C
       end
mymul! (generic function with 1 method)

julia> @btime mymul!($y, $x, $λ, $(LinearAlgebra.MulAddMul(true, false)));
  9.512 ns (0 allocations: 0 bytes)

julia> @btime LinearAlgebra.generic_mul!($y, $λ, $x, $(LinearAlgebra.MulAddMul(true, false)));
  938.111 ns (30 allocations: 640 bytes)

julia> @btime $y .= $λ .* $x;
  7.661 ns (0 allocations: 0 bytes)

@tkf
Copy link
Member

tkf commented Aug 28, 2019

Can eltype of eachindex(::AbstractArray) ever be a tuple of ints? If not, I think we can safely use (IC,). Otherwise the simplest fix is probably to add idx::Integer specialization for _modify!?

@Jutho
Copy link
Contributor

Jutho commented Sep 6, 2019

Why is the splatting in _modify! necessary? Isn't idx either an integer or an CartesianIndex. With CartesianIndex, this fails and I get the following bug on Julia master:

To reproduce:

using LinearAlgebra
A = randn(ComplexF64, (10,10))
B = SymTridiagonal(randn(10), randn(9))
mul!(A, B, 0.5) # works
mul!(view(A,1:10,1:10), B, 0.5)

yields

julia> mul!(view(A,1:10,1:10), B, 0.5)
ERROR: iteration is deliberately unsupported for CartesianIndex. Use `I` rather than `I...`, or use `Tuple(I)...`
Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] iterate(::CartesianIndex{2}) at ./multidimensional.jl:161
 [3] _modify! at /usr/local/julia/usr/share/julia/stdlib/v1.4/LinearAlgebra/src/generic.jl:67 [inlined]
 [4] generic_mul!(::SubArray{Complex{Float64},2,Array{Complex{Float64},2},Tuple{UnitRange{Int64},UnitRange{Int64}},false}, ::SymTridiagonal{Float64,Array{Float64,1}}, ::Float64, ::LinearAlgebra.MulAddMul{true,true,Bool,Bool}) at /usr/local/julia/usr/share/julia/stdlib/v1.4/LinearAlgebra/src/generic.jl:89
 [5] mul! at /usr/local/julia/usr/share/julia/stdlib/v1.4/LinearAlgebra/src/generic.jl:107 [inlined]
 [6] mul!(::SubArray{Complex{Float64},2,Array{Complex{Float64},2},Tuple{UnitRange{Int64},UnitRange{Int64}},false}, ::SymTridiagonal{Float64,Array{Float64,1}}, ::Float64) at /usr/local/julia/usr/share/julia/stdlib/v1.4/LinearAlgebra/src/matmul.jl:195
 [7] top-level scope at REPL[10]:1

@tkf
Copy link
Member

tkf commented Sep 7, 2019

@Jutho See #33187 for a fix I suggest.

Why is the splatting in _modify! necessary?

There are many functions passing a tuple of ints to _modify!; e.g.,

@inline function _mul!(A::UpperTriangular, B::UnitUpperTriangular, c::Number, _add::MulAddMul)
n = checksquare(B)
iszero(_add.alpha) && return _rmul_or_fill!(C, _add.beta)
for j = 1:n
@inbounds _modify!(_add, c, A, (j,j))
for i = 1:(j - 1)
@inbounds _modify!(_add, B[i,j] * c, A, (i,j))
end
end
return A
end

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
linear algebra Linear algebra performance Must go faster potential benchmark Could make a good benchmark in BaseBenchmarks regression Regression in behavior compared to a previous version
Projects
None yet
4 participants