
mul!(::Matrix{T}, ::Matrix{T}, ::AbstractTriangular{T, <:Matrix{T}}) where T<:BlasFloat does not use BLAS #42550

Closed · sostock opened this issue on Oct 8, 2021 · 11 comments · Fixed by #43972
Labels: domain:linear algebra, performance

@sostock (Contributor) commented on Oct 8, 2021

See https://discourse.julialang.org/t/mul-matrix-matrix-uppertriangular-is-really-slow/69368. CC @Luapulu

mul! doesn’t use BLAS when the second matrix is an AbstractTriangular:

julia> A = rand(1000, 1000); B = rand(1000, 1000); dest = similar(A);

julia> @btime mul!($dest, $A, $B); # BLAS (fast)
  50.609 ms (0 allocations: 0 bytes)

julia> @btime mul!($dest, UpperTriangular($A), $B); # BLAS (fast)
  32.929 ms (0 allocations: 0 bytes)

julia> @btime mul!($dest, $A, UpperTriangular($B)); # generic (slow)
  914.817 ms (6 allocations: 336 bytes)

However, the * methods all use BLAS:

julia> @btime $A * $B; # BLAS (fast)
  50.560 ms (2 allocations: 7.63 MiB)

julia> @btime UpperTriangular($A) * $B; # BLAS (fast)
  33.183 ms (2 allocations: 7.63 MiB)

julia> @btime $A * UpperTriangular($B); # BLAS (fast)
  27.190 ms (2 allocations: 7.63 MiB)
sostock added the performance and domain:linear algebra labels on Oct 8, 2021
sostock changed the title from "mul!(::StridedMatrix{T}, ::StridedMatrix{T}, ::AbstractTriangular{T, <:StridedMatrix{T}}) where T<:BlasFloat does not use BLAS" to "mul!(::Matrix{T}, ::Matrix{T}, ::AbstractTriangular{T, <:Matrix{T}}) where T<:BlasFloat does not use BLAS" on Oct 8, 2021
@andreasnoack (Member)

Notice that three-argument triangular matrix multiplication is not a BLAS operation. In BLAS, this is an in-place operation; see https://www.netlib.org/lapack/explore-html/d1/d54/group__double__blas__level3_gaf07edfbb2d2077687522652c9e283e1e.html#gaf07edfbb2d2077687522652c9e283e1e. I'm not in favor of adding extra dispatch versions of the ! methods to make the three-argument version call out to BLAS. IMO, if you want to optimize by using the ! methods, then you should consider the structure of the matrices involved. Otherwise, you'll end up with unnecessary copies. In contrast, the non-mutating methods like * should be as generic as possible.

I'm not sure why the generic fallback allocates, though. That shouldn't be necessary, so it would be good to figure out why it happens.
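
For reference, the linked routine (dtrmm) overwrites its dense argument in place. A minimal sketch of calling it directly through the LinearAlgebra.BLAS wrappers, assuming Float64 matrices and the upper triangular factor on the right:

using LinearAlgebra

A = rand(1000, 1000); B = rand(1000, 1000)

# Computes B := 1.0 * B * UpperTriangular(A) in place via dtrmm:
# 'R' = triangular factor on the right, 'U' = read the upper triangle of A,
# 'N' = no transpose, 'N' = non-unit diagonal.
BLAS.trmm!('R', 'U', 'N', 'N', 1.0, A, B)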

@Luapulu commented on Oct 8, 2021

So, to do mul!(dest, A, UpperTriangular(B)) efficiently, I have to do something like

copyto!(dest, A)                 # materialize A into the preallocated output
rmul!(dest, UpperTriangular(B))  # then multiply in place from the right

am I getting that right?

And, why not create a mul! method to do this?

@sostock (Contributor, Author) commented on Oct 8, 2021

I'm not in favor of adding extra dispatch versions of the ! methods to make the three argument version call out to BLAS. IMO, if you want to optimize by using the ! methods then you should consider the structure of the matrices involved. Otherwise, you'll end up with unnecessary copies. In contrast, the non-mutating methods like * should be as generic as possible.

I don’t really understand this argument. The case mul!(full1, triangular, full2) already calls out to BLAS, so why not do it for the case mul!(full1, full2, triangular) as well? They would both call the same BLAS method (the one you linked), since it can handle either the first or the second matrix as triangular (it updates the other one). If you’re not in favor of adding dispatch for the latter case, are you also in favor of removing it for the former?

I think it is really unintuitive that A * B uses a fast method, but when one wants to preallocate the output, mul!(dest, A, B) is 30 times slower.

@andreasnoack (Member)

And, why not create a mul! method to do this?

Because you are using the ! methods to optimize your code by avoiding unnecessary allocations and copies, it's not a good idea to introduce a hidden copy. Judging from the signature, you'd think the operation is purely out-of-place, with no hidden copies.

If you’re not in favor of adding dispatch for the latter case, are you also in favor of removing it for the former?

Yes, but it's probably not worth the effort.

My main point is that you shouldn't use three-argument triangular matrix multiplication if you are trying to write optimized code.

@Luapulu commented on Oct 8, 2021

I still think mul! should have an optimised method here, because it should at least not be slower than using mul! on two full matrices. And yes, in some cases users may accidentally copy a result when they could have used an in-place rmul! or lmul!, but that extra copy is certainly better than the 100x slowdown users currently get, and it's still faster than using mul! on two full matrices. If the use case permits it, users can choose to switch to lmul! or rmul!. My case doesn't permit it, and I don't see why we should force users to essentially write their own version of the mul! method when their use case doesn't allow for the in-place rmul! or lmul!.

@andreasnoack I think your argument is mainly an argument for adding a line to the docs that says users may want to consider using rmul! or lmul! instead of mul! if they want to avoid an extra copy in some cases.

@sostock (Contributor, Author) commented on Oct 8, 2021

I'm not sure why the generic fallback allocates, though. That shouldn't be necessary so it would be good to figure out why that happens.

It uses a tiled algorithm. The allocations reported above are on Julia 1.6, where it uses global arrays and unsafe_wrap, but still allocates. On master it allocates differently, due to #42309, which explicitly creates new Arrays for the tiles.

julia> @btime mul!($dest, $A, UpperTriangular($B));
  894.954 ms (3 allocations: 30.75 KiB)

@dkarrasch (Member)

I may be missing something, but given that we have

# The three methods are necessary to avoid ambiguities with definitions in matmul.jl
mul!(C::AbstractVector , A::AbstractTriangular, B::AbstractVector) = lmul!(A, copyto!(C, B))
mul!(C::AbstractMatrix , A::AbstractTriangular, B::AbstractVecOrMat) = lmul!(A, copyto!(C, B))
mul!(C::AbstractVecOrMat, A::AbstractTriangular, B::AbstractVecOrMat) = lmul!(A, copyto!(C, B))

and

mul!(C::AbstractVector, A::AbstractTriangular, transB::Transpose{<:Any,<:AbstractVecOrMat}) =
(B = transB.parent; lmul!(A, transpose!(C, B)))
mul!(C::AbstractMatrix, A::AbstractTriangular, transB::Transpose{<:Any,<:AbstractVecOrMat}) =
(B = transB.parent; lmul!(A, transpose!(C, B)))
mul!(C::AbstractMatrix, A::AbstractTriangular, adjB::Adjoint{<:Any,<:AbstractVecOrMat}) =
(B = adjB.parent; lmul!(A, adjoint!(C, B)))
mul!(C::AbstractVecOrMat, A::AbstractTriangular, adjB::Adjoint{<:Any,<:AbstractVecOrMat}) =
(B = adjB.parent; lmul!(A, adjoint!(C, B)))

and that we even fully materialize transposes/adjoints here

@inline mul!(C::AbstractMatrix, A::AbstractTriangular, B::Adjoint{<:Any,<:AbstractVecOrMat}, alpha::Number, beta::Number) =
mul!(C, A, copy(B), alpha, beta)
@inline mul!(C::AbstractMatrix, A::AbstractTriangular, B::Transpose{<:Any,<:AbstractVecOrMat}, alpha::Number, beta::Number) =
mul!(C, A, copy(B), alpha, beta)

I feel like the missing methods are more of an oversight than deliberately missing? Of course, fixing the issue requires adding what feels like a dozen new mul! methods, but OTOH the performance penalty is rather hefty.
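
For concreteness, the missing right-multiplication methods would presumably mirror the lmul!-based ones quoted above; a rough sketch (not necessarily what a fix would actually add):

mul!(C::AbstractMatrix, A::AbstractVecOrMat, B::AbstractTriangular) = rmul!(copyto!(C, A), B)
mul!(C::AbstractMatrix, transA::Transpose{<:Any,<:AbstractVecOrMat}, B::AbstractTriangular) =
    (A = transA.parent; rmul!(transpose!(C, A), B))
mul!(C::AbstractMatrix, adjA::Adjoint{<:Any,<:AbstractVecOrMat}, B::AbstractTriangular) =
    (A = adjA.parent; rmul!(adjoint!(C, A), B))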

@andreasnoack (Member)

I just tried to refresh my memory on this topic, and it looks like I agree with 2015 Andreas, see #10019. That issue was closed without introducing new methods. Then later that year #12812 was opened, but it wasn't merged. Two years later I merged it without leaving a comment, and I don't remember why I did so. Afterwards, we then had to add all the other methods to avoid ambiguities.

So it looks like it might have been a bit random that these methods were introduced in the first place. I don't like them and believe that people would be better off without them, but given the current state, if somebody makes a PR with these methods and associated tests, then I won't try to filibuster it (even though it adds extra compilation to an already compilation-heavy and therefore very slow test file).

@leostenzel

Noticed a similar (?) issue with mul!(::Matrix, ::HermOrSym, ::HermOrSym), so decided to test this more generally:

using LinearAlgebra, BenchmarkTools
using Base.Iterators: product
A = rand(200, 200)
B = rand(200, 200) 
dest = similar(A)

types = [identity, Symmetric, Hermitian, UpperTriangular, UnitUpperTriangular,
  LowerTriangular, UnitLowerTriangular, UpperHessenberg, 
  Tridiagonal, SymTridiagonal ∘ Symmetric, 
  a->Bidiagonal(diag(a), diag(a, 1), :U), Diagonal]

for (Ta, Tb) in product(types, types)
  _A = Ta(A)
  _B = Tb(B)
  t1 = time(@benchmark $_A * $_B seconds=0.1)
  tmul = time(@benchmark mul!($dest, $_A, $_B) seconds=0.1)
  if !isapprox(t1, tmul, rtol=0.5)
    arg_types = nameof.(typeof.([_A, _B]))
    @show arg_types, ratio(t1, tmul)
  end
end

On my M1 Mac with Julia 1.9rc2 this yields a bunch of type combinations where mul! is significantly slower than * (the number is the ratio time(*)/time(mul!), so values below 1 mean mul! is slower):

([:Symmetric, :Symmetric], 0.15212131077987395)
([:Hermitian, :Symmetric], 0.14867050011769367)
([:Symmetric, :Hermitian], 0.1501201330590954)
([:Hermitian, :Hermitian], 0.15710876467798882)
([:Array, :UpperTriangular], 0.0783870960007864)
([:Symmetric, :UpperTriangular], 0.08869831751722534)
([:Hermitian, :UpperTriangular], 0.08843682327204656)
([:UpperHessenberg, :UpperTriangular], 0.08703282707024972)
([:Diagonal, :UpperTriangular], 0.3072659809309098)
([:Array, :UnitUpperTriangular], 0.0801960959640439)
([:Symmetric, :UnitUpperTriangular], 0.08932123686354122)
([:Hermitian, :UnitUpperTriangular], 0.08990729018466966)
([:UpperHessenberg, :UnitUpperTriangular], 0.08850920305059527)
([:Diagonal, :UnitUpperTriangular], 0.35110059171597635)
([:Array, :LowerTriangular], 0.07920584733372955)
([:Symmetric, :LowerTriangular], 0.08806820049841861)
([:Hermitian, :LowerTriangular], 0.08689892949913336)
([:UpperHessenberg, :LowerTriangular], 0.08785087978687132)
([:Diagonal, :LowerTriangular], 0.2959498808788494)
([:Array, :UnitLowerTriangular], 0.07715129397475415)
([:Symmetric, :UnitLowerTriangular], 0.09019522967933709)
([:Hermitian, :UnitLowerTriangular], 0.08938477232681072)
([:UpperHessenberg, :UnitLowerTriangular], 0.08934690586575288)
([:Diagonal, :UnitLowerTriangular], 0.34585527043905834)
([:Diagonal, :UpperHessenberg], 0.3220858895705521)
([:Diagonal, :Tridiagonal], 0.1578947764617085)
([:Diagonal, :SymTridiagonal], 0.20002717522004712)
([:UpperTriangular, :Bidiagonal], 0.30357989568515886)
([:UnitUpperTriangular, :Bidiagonal], 0.2987819866247383)
([:LowerTriangular, :Bidiagonal], 0.2744375206094478)
([:UnitLowerTriangular, :Bidiagonal], 0.27931542328982784)
([:Diagonal, :Bidiagonal], 0.1732365783162167)
([:UpperTriangular, :Diagonal], 0.024214568979777973)
([:UnitUpperTriangular, :Diagonal], 0.03201487204047439)
([:LowerTriangular, :Diagonal], 0.026398215281650865)
([:UnitLowerTriangular, :Diagonal], 0.029608938547486034)
([:UpperHessenberg, :Diagonal], 0.32266666666666666)
([:Tridiagonal, :Diagonal], 0.19409417260880255)
([:SymTridiagonal, :Diagonal], 0.19127713498622592)
([:Bidiagonal, :Diagonal], 0.1897607790860655)
([:Diagonal, :Diagonal], 0.028365610288337475)

But there are also some cases where mul! is faster; this could just be caused by the allocations that mul! avoids, of course…

([:Tridiagonal, :UpperTriangular], 6.484450754024589)
([:SymTridiagonal, :UpperTriangular], 6.755345806882726)
([:Tridiagonal, :UnitUpperTriangular], 5.848837490624924)
([:SymTridiagonal, :UnitUpperTriangular], 5.85030633412128)
([:Tridiagonal, :UnitLowerTriangular], 5.854310386415167)
([:SymTridiagonal, :UnitLowerTriangular], 5.8304607385296805)
([:Tridiagonal, :LowerTriangular], 5.472315492957747)
([:SymTridiagonal, :LowerTriangular], 6.163230320159385)
([:SymTridiagonal, :Tridiagonal], 2.412737178282923)
([:Tridiagonal, :Tridiagonal], 2.2284411276948592)
([:Tridiagonal, :Tridiagonal], 2.482101850601108)
([:SymTridiagonal, :Bidiagonal], 2.6301575735495035)
([:Bidiagonal, :Bidiagonal], 2.3222368960335174)
([:Tridiagonal, :Tridiagonal], 2.4564705882352937)
([:Tridiagonal, :SymTridiagonal], 2.210109392682007)
([:SymTridiagonal, :SymTridiagonal], 2.437875751503006)
([:Bidiagonal, :SymTridiagonal], 2.491171597633136)

@dkarrasch (Member)

Noticed a similar (?) issue with mul!(::Matrix, ::HermOrSym, ::HermOrSym)

I think the main reason for this is that * is allowed to simply allocate new plain arrays with easy memory access, as in

*(A::HermOrSym, B::HermOrSym) = A * copyto!(similar(parent(B)), B)

and subsequently use specialized BLAS routines. In contrast, mul!(C, ::Hermitian, ::Hermitian) falls back to generic_matmatmul!. The point is that the promise of mul! is that it operates in a memory-aware fashion and doesn't create copies of the factors. So what we do is add mul! methods whenever we find a more efficient way of doing the multiplication without creating copies.

Another aspect is that you measure computation times with the output dest::Matrix. In some structured cases, this means that mul! needs to overwrite many entries with zeros, whereas * may be able to create a specific output type that keeps all those zeros implicit, like Diagonal etc. Think of *(::Diagonal, ::Diagonal) compared to mul!(::Matrix, ::Diagonal, ::Diagonal) (and compared to mul!(::Diagonal, ::Diagonal, ::Diagonal))!
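
For example, *(::Diagonal, ::Diagonal) returns a Diagonal, so the off-diagonal zeros stay implicit, whereas mul! into a Matrix has to write every entry:

julia> using LinearAlgebra

julia> D1 = Diagonal(rand(3)); D2 = Diagonal(rand(3));

julia> D1 * D2 isa Diagonal # only the 3 diagonal entries are stored
true

julia> mul!(zeros(3, 3), D1, D2); # has to fill all 9 entries, zeros included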

@leostenzel commented on Apr 11, 2023

Thanks for your reply! I agree with your arguments, and many cases already seem to be addressed by your PR.
It seems logical that a user should, e.g., allocate a Triangular matrix if they multiply mul!(::Triangular, ::Diagonal, ::[Unit]Triangular). Though it's surprising that dest::StridedMatrix is 40 times slower…

I'm just not happy with mul!(::StridedMatrix, ::HermOrSym, ::HermOrSym) (and its 5-arg counterpart).
I wanted to propose something like:

mul!(C::StridedMatrix{T}, A::HermOrSym{T,<:StridedMatrix},
    B::HermOrSym{T,<:StridedMatrix}) where {T<:BlasFloat} = mul!(C, copyto!(A.data, A), B)

I guess it's not ideal to mutate A… But as a user, I'd be less surprised by getting a symmetrized A.data than by a 7-fold increase in runtime?
I would open a corresponding PR if you think that makes sense.
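
An alternative that pays for one allocation instead of mutating A could mirror the * method quoted above; a rough sketch, assuming mul!(::StridedMatrix, ::HermOrSym, ::StridedMatrix) dispatches to BLAS:

# Sketch: materialize B into a fresh dense matrix (like * does), then the
# Symmetric/Hermitian-times-dense mul! can take the BLAS path. One extra
# allocation, but neither A nor B is mutated.
mul!(C::StridedMatrix{T}, A::HermOrSym{T,<:StridedMatrix},
    B::HermOrSym{T,<:StridedMatrix}) where {T<:BlasFloat} =
    mul!(C, A, copyto!(similar(parent(B)), B))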
