Poor performance of mul! with scalar #33105

dkarrasch · 2019-08-28T19:29:24Z

On

julia> versioninfo()
Julia Version 1.3.0-rc1.0
Commit 768b25f6a8 (2019-08-18 00:04 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.6.0)
  CPU: Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)

I see

using BenchmarkTools, LinearAlgebra
x = rand(10); y = similar(x); λ = rand();
@btime mul!($y, $λ, $x) # 1.117 μs (30 allocations: 640 bytes)

In contrast, the hand-written

@btime $y .= $λ .* $x # 9.764 ns (0 allocations: 0 bytes)

as expected. Interestingly, on Julia v1.2 I get

@btime $y .= $λ .* $x # 20.617 ns (0 allocations: 0 bytes)
@btime mul!($y, $λ, $x) # 11.733 ns (0 allocations: 0 bytes)

so it seems like broadcasting has significantly improved, and something is simply going wrong with mul!.

The text was updated successfully, but these errors were encountered:

KristofferC · 2019-08-28T20:59:28Z

Looks like the splatting in _modify! might be a problem? cc @tkf

Might be good old #29114 again.

; ┌ @ /Users/kristoffer/julia/usr/share/julia/stdlib/v1.4/LinearAlgebra/src/generic.jl:65 within `_modify!'
   %58 = call noalias nonnull %jl_value_t addrspace(10)* @jl_gc_pool_alloc(i8* %52, i32 1424, i32 32) #1
   %59 = bitcast %jl_value_t addrspace(10)* %58 to %jl_value_t addrspace(10)* addrspace(10)*
   %60 = getelementptr %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)* addrspace(10)* %59, i64 -1
   store %jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 4604265024 to %jl_value_t*) to %jl_value_t addrspace(10)*), %jl_value_t addrspace(10)* addrspace(10)* %60
   %61 = addrspacecast %jl_value_t addrspace(10)* %58 to %jl_value_t addrspace(11)*
   %62 = bitcast %jl_value_t addrspace(10)* %58 to %jl_value_t addrspace(10)* addrspace(10)*
   store %jl_value_t addrspace(10)* %0, %jl_value_t addrspace(10)* addrspace(10)* %62, align 8
   %63 = bitcast %jl_value_t addrspace(11)* %61 to i8 addrspace(11)*
   %64 = getelementptr inbounds i8, i8 addrspace(11)* %63, i64 8
   %65 = bitcast i8 addrspace(11)* %64 to double addrspace(11)*
   store double %57, double addrspace(11)* %65, align 8
   %66 = getelementptr %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)** %gcframe, i32 3
   store %jl_value_t addrspace(10)* %58, %jl_value_t addrspace(10)** %66
   %67 = call %jl_value_t addrspace(10)* @jl_box_int64(i64 signext %value_phi13)
   %68 = getelementptr %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)** %gcframe, i32 2
   store %jl_value_t addrspace(10)* %67, %jl_value_t addrspace(10)** %68
   %69 = getelementptr %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)** %4, i32 0
   store %jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 4444035360 to %jl_value_t*) to %jl_value_t addrspace(10)*), %jl_value_t addrspace(10)** %69
   %70 = getelementptr %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)** %4, i32 1
   store %jl_value_t addrspace(10)* %58, %jl_value_t addrspace(10)** %70
   %71 = getelementptr %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)** %4, i32 2
   store %jl_value_t addrspace(10)* %67, %jl_value_t addrspace(10)** %71
   %72 = call nonnull %jl_value_t addrspace(10)* @jl_f__apply(%jl_value_t addrspace(10)* addrspacecast (%jl_value_t* null to %jl_value_t addrspace(10)*), %jl_value_t addrspace(10)** %4, i32 3)
; └

tkf · 2019-08-28T22:48:49Z

Looks like the splatting in _modify! might be a problem?

Looks like so. Using (IC,) instead of IC in _modify! fixes the performance.

julia> using LinearAlgebra: MulAddMul, _modify!

julia> function mymul!(C::AbstractArray, X::AbstractArray, s::Number, _add::MulAddMul)
           if length(C) != length(X)
               throw(DimensionMismatch("first array has length $(length(C)) which does not match the length of the second, $(length(X))."))
           end
           for (IC, IX) in zip(eachindex(C), eachindex(X))
               @inbounds _modify!(_add, X[IX] * s, C, (IC,))
           end
           C
       end
mymul! (generic function with 1 method)

julia> @btime mymul!($y, $x, $λ, $(LinearAlgebra.MulAddMul(true, false)));
  9.512 ns (0 allocations: 0 bytes)

julia> @btime LinearAlgebra.generic_mul!($y, $λ, $x, $(LinearAlgebra.MulAddMul(true, false)));
  938.111 ns (30 allocations: 640 bytes)

julia> @btime $y .= $λ .* $x;
  7.661 ns (0 allocations: 0 bytes)

tkf · 2019-08-28T22:56:15Z

Can eltype of eachindex(::AbstractArray) ever be a tuple of ints? If not, I think we can safely use (IC,). Otherwise the simplest fix is probably to add idx::Integer specialization for _modify!?

Jutho · 2019-09-06T23:29:47Z

Why is the splatting in _modify! necessary? Isn't idx either an integer or an CartesianIndex. With CartesianIndex, this fails and I get the following bug on Julia master:

To reproduce:

using LinearAlgebra
A = randn(ComplexF64, (10,10))
B = SymTridiagonal(randn(10), randn(9))
mul!(A, B, 0.5) # works
mul!(view(A,1:10,1:10), B, 0.5)

yields

julia> mul!(view(A,1:10,1:10), B, 0.5)
ERROR: iteration is deliberately unsupported for CartesianIndex. Use `I` rather than `I...`, or use `Tuple(I)...`
Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] iterate(::CartesianIndex{2}) at ./multidimensional.jl:161
 [3] _modify! at /usr/local/julia/usr/share/julia/stdlib/v1.4/LinearAlgebra/src/generic.jl:67 [inlined]
 [4] generic_mul!(::SubArray{Complex{Float64},2,Array{Complex{Float64},2},Tuple{UnitRange{Int64},UnitRange{Int64}},false}, ::SymTridiagonal{Float64,Array{Float64,1}}, ::Float64, ::LinearAlgebra.MulAddMul{true,true,Bool,Bool}) at /usr/local/julia/usr/share/julia/stdlib/v1.4/LinearAlgebra/src/generic.jl:89
 [5] mul! at /usr/local/julia/usr/share/julia/stdlib/v1.4/LinearAlgebra/src/generic.jl:107 [inlined]
 [6] mul!(::SubArray{Complex{Float64},2,Array{Complex{Float64},2},Tuple{UnitRange{Int64},UnitRange{Int64}},false}, ::SymTridiagonal{Float64,Array{Float64,1}}, ::Float64) at /usr/local/julia/usr/share/julia/stdlib/v1.4/LinearAlgebra/src/matmul.jl:195
 [7] top-level scope at REPL[10]:1

tkf · 2019-09-07T05:15:44Z

@Jutho See #33187 for a fix I suggest.

Why is the splatting in _modify! necessary?

There are many functions passing a tuple of ints to _modify!; e.g.,

julia/stdlib/LinearAlgebra/src/triangular.jl

Lines 480 to 490 in cb76f2a

    
           @inline function _mul!(A::UpperTriangular, B::UnitUpperTriangular, c::Number, _add::MulAddMul) 
        
               n = checksquare(B) 
        
               iszero(_add.alpha) && return _rmul_or_fill!(C, _add.beta) 
        
               for j = 1:n 
        
                   @inbounds _modify!(_add, c, A, (j,j)) 
        
                   for i = 1:(j - 1) 
        
                       @inbounds _modify!(_add, B[i,j] * c, A, (i,j)) 
        
                   end 
        
               end 
        
               return A 
        
           end

KristofferC added linear algebra Linear algebra performance Must go faster regression Regression in behavior compared to a previous version labels Aug 28, 2019

KristofferC added this to the 1.3 milestone Aug 28, 2019

KristofferC added the potential benchmark Could make a good benchmark in BaseBenchmarks label Aug 28, 2019

This was referenced Aug 29, 2019

Workaround for performance penalty of splatting a number #33108

Merged

use LinearAlgebra's new generic in-place 5-arg mul! (requires Julia 1.3)? JuliaLinearAlgebra/LinearMaps.jl#56

Closed

andreasnoack closed this as completed in #33108 Aug 29, 2019

Jutho mentioned this issue Sep 6, 2019

Heads up about test error on 1.3 Jutho/KrylovKit.jl#21

Closed

tkf mentioned this issue Sep 7, 2019

Normalize index to CartesianIndex in _modify! #33187

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Poor performance of mul! with scalar #33105

Poor performance of mul! with scalar #33105

dkarrasch commented Aug 28, 2019

KristofferC commented Aug 28, 2019 •

edited

Loading

tkf commented Aug 28, 2019 •

edited

Loading

tkf commented Aug 28, 2019

Jutho commented Sep 6, 2019 •

edited

Loading

tkf commented Sep 7, 2019

Poor performance of mul! with scalar #33105

Poor performance of mul! with scalar #33105

Comments

dkarrasch commented Aug 28, 2019

KristofferC commented Aug 28, 2019 • edited Loading

tkf commented Aug 28, 2019 • edited Loading

tkf commented Aug 28, 2019

Jutho commented Sep 6, 2019 • edited Loading

tkf commented Sep 7, 2019

KristofferC commented Aug 28, 2019 •

edited

Loading

tkf commented Aug 28, 2019 •

edited

Loading

Jutho commented Sep 6, 2019 •

edited

Loading