
mul!(::Matrix{T}, ::Matrix{T}, ::AbstractTriangular{T, <:Matrix{T}}) where T<:BlasFloat does not use BLAS #42550

Closed · sostock opened this issue on Oct 8, 2021 · 11 comments · Fixed by #43972
Labels: domain:linear algebra, performance

@sostock (Contributor) commented on Oct 8, 2021

See https://discourse.julialang.org/t/mul-matrix-matrix-uppertriangular-is-really-slow/69368. CC @Luapulu

mul! doesn’t use BLAS when the second matrix is an AbstractTriangular:

julia> A = rand(1000, 1000); B = rand(1000, 1000); dest = similar(A);

julia> @btime mul!($dest, $A, $B); # BLAS (fast)
  50.609 ms (0 allocations: 0 bytes)

julia> @btime mul!($dest, UpperTriangular($A), $B); # BLAS (fast)
  32.929 ms (0 allocations: 0 bytes)

julia> @btime mul!($dest, $A, UpperTriangular($B)); # generic (slow)
  914.817 ms (6 allocations: 336 bytes)

However, the * methods all use BLAS:

julia> @btime $A * $B; # BLAS (fast)
  50.560 ms (2 allocations: 7.63 MiB)

julia> @btime UpperTriangular($A) * $B; # BLAS (fast)
  33.183 ms (2 allocations: 7.63 MiB)

julia> @btime $A * UpperTriangular($B); # BLAS (fast)
  27.190 ms (2 allocations: 7.63 MiB)
sostock added the performance and domain:linear algebra labels on Oct 8, 2021
sostock changed the title from "mul!(::StridedMatrix{T}, ::StridedMatrix{T}, ::AbstractTriangular{T, <:StridedMatrix{T}}) where T<:BlasFloat does not use BLAS" to "mul!(::Matrix{T}, ::Matrix{T}, ::AbstractTriangular{T, <:Matrix{T}}) where T<:BlasFloat does not use BLAS" on Oct 8, 2021
@andreasnoack (Member)

Notice that three-argument triangular matrix multiplication is not a BLAS operation. In BLAS, this is an in-place operation; see https://www.netlib.org/lapack/explore-html/d1/d54/group__double__blas__level3_gaf07edfbb2d2077687522652c9e283e1e.html#gaf07edfbb2d2077687522652c9e283e1e. I'm not in favor of adding extra dispatch versions of the ! methods to make the three-argument version call out to BLAS. IMO, if you want to optimize by using the ! methods, then you should consider the structure of the matrices involved. Otherwise, you'll end up with unnecessary copies. In contrast, the non-mutating methods like * should be as generic as possible.

I'm not sure why the generic fallback allocates, though. That shouldn't be necessary, so it would be good to figure out why it happens.
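
For reference, the linked routine (dtrmm) overwrites its dense argument in place. A minimal sketch of calling it directly through the LinearAlgebra.BLAS wrappers, assuming Float64 matrices and the upper triangular factor on the right:

using LinearAlgebra

A = rand(1000, 1000); B = rand(1000, 1000)

# Computes B := 1.0 * B * UpperTriangular(A) in place via dtrmm:
# 'R' = triangular factor on the right, 'U' = read the upper triangle of A,
# 'N' = no transpose, 'N' = non-unit diagonal.
BLAS.trmm!('R', 'U', 'N', 'N', 1.0, A, B)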

@Luapulu commented on Oct 8, 2021

So, to do mul!(dest, A, UpperTriangular(B)) efficiently, I have to do something like

copyto!(dest, A)                 # materialize A into the preallocated output
rmul!(dest, UpperTriangular(B))  # then multiply in place from the right

am I getting that right?

And, why not create a mul! method to do this?

@sostock (Contributor, Author) commented on Oct 8, 2021

I'm not in favor of adding extra dispatch versions of the ! methods to make the three argument version call out to BLAS. IMO, if you want to optimize by using the ! methods then you should consider the structure of the matrices involved. Otherwise, you'll end up with unnecessary copies. In contrast, the non-mutating methods like * should be as generic as possible.

I don’t really understand this argument. The case mul!(full1, triangular, full2) already calls out to BLAS, so why not do it for the case mul!(full1, full2, triangular) as well? They would both call the same BLAS method (the one you linked), since it can handle either the first or the second matrix as triangular (it updates the other one). If you’re not in favor of adding dispatch for the latter case, are you also in favor of removing it for the former?

I think it is really unintuitive that A * B uses a fast method, but when one wants to preallocate the output, mul!(dest, A, B) is 30 times slower.

@andreasnoack (Member)

And, why not create a mul! method to do this?

Because you are using the ! methods to optimize your code by avoiding unnecessary allocations and copies, it's not a good idea to introduce a hidden copy. Judging from the signature, you'd think the operation is purely out-of-place, with no hidden copies.

If you’re not in favor of adding dispatch for the latter case, are you also in favor of removing it for the former?

Yes, but it's probably not worth the effort.

My main point is that you shouldn't use three-argument triangular matrix multiplication if you are trying to write optimized code.

@Luapulu commented on Oct 8, 2021

I still think mul! should have an optimised method here, because it should at least not be slower than using mul! on two full matrices. And yes, in some cases users may accidentally copy a result when they could have used an in-place rmul! or lmul!, but that extra copy is certainly better than the 100x slowdown users currently get, and it's still faster than using mul! on two full matrices. If the use case permits it, users can choose to switch to lmul! or rmul!. My case doesn't permit it, and I don't see why we should force users to essentially write their own version of the mul! method when their use case doesn't allow for the in-place rmul! or lmul!.

@andreasnoack I think your argument is mainly an argument for adding a line to the docs that says users may want to consider using rmul! or lmul! instead of mul! if they want to avoid an extra copy in some cases.

@sostock (Contributor, Author) commented on Oct 8, 2021

I'm not sure why the generic fallback allocates, though. That shouldn't be necessary so it would be good to figure out why that happens.

It uses a tiled algorithm. The allocations reported above are on Julia 1.6, where it uses global arrays and unsafe_wrap, but still allocates. On master it allocates differently, due to #42309, which explicitly creates new Arrays for the tiles.

julia> @btime mul!($dest, $A, UpperTriangular($B));
  894.954 ms (3 allocations: 30.75 KiB)

@dkarrasch (Member)

I may be missing something, but given that we have

# The three methods are necessary to avoid ambiguities with definitions in matmul.jl
mul!(C::AbstractVector , A::AbstractTriangular, B::AbstractVector) = lmul!(A, copyto!(C, B))
mul!(C::AbstractMatrix , A::AbstractTriangular, B::AbstractVecOrMat) = lmul!(A, copyto!(C, B))
mul!(C::AbstractVecOrMat, A::AbstractTriangular, B::AbstractVecOrMat) = lmul!(A, copyto!(C, B))

and

mul!(C::AbstractVector, A::AbstractTriangular, transB::Transpose{<:Any,<:AbstractVecOrMat}) =
(B = transB.parent; lmul!(A, transpose!(C, B)))
mul!(C::AbstractMatrix, A::AbstractTriangular, transB::Transpose{<:Any,<:AbstractVecOrMat}) =
(B = transB.parent; lmul!(A, transpose!(C, B)))
mul!(C::AbstractMatrix, A::AbstractTriangular, adjB::Adjoint{<:Any,<:AbstractVecOrMat}) =
(B = adjB.parent; lmul!(A, adjoint!(C, B)))
mul!(C::AbstractVecOrMat, A::AbstractTriangular, adjB::Adjoint{<:Any,<:AbstractVecOrMat}) =
(B = adjB.parent; lmul!(A, adjoint!(C, B)))

and that we even fully materialize transposes/adjoints here

@inline mul!(C::AbstractMatrix, A::AbstractTriangular, B::Adjoint{<:Any,<:AbstractVecOrMat}, alpha::Number, beta::Number) =
mul!(C, A, copy(B), alpha, beta)
@inline mul!(C::AbstractMatrix, A::AbstractTriangular, B::Transpose{<:Any,<:AbstractVecOrMat}, alpha::Number, beta::Number) =
mul!(C, A, copy(B), alpha, beta)

I feel like the missing methods are more of an oversight than deliberately missing? Of course, fixing the issue requires adding what feels like a dozen new mul! methods, but OTOH the performance penalty is rather hefty.
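
For concreteness, the missing right-multiplication methods would presumably mirror the lmul!-based ones quoted above; a rough sketch (not necessarily what a fix would actually add):

mul!(C::AbstractMatrix, A::AbstractVecOrMat, B::AbstractTriangular) = rmul!(copyto!(C, A), B)
mul!(C::AbstractMatrix, transA::Transpose{<:Any,<:AbstractVecOrMat}, B::AbstractTriangular) =
    (A = transA.parent; rmul!(transpose!(C, A), B))
mul!(C::AbstractMatrix, adjA::Adjoint{<:Any,<:AbstractVecOrMat}, B::AbstractTriangular) =
    (A = adjA.parent; rmul!(adjoint!(C, A), B))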

@andreasnoack (Member)

I just tried to refresh my memory on this topic, and it looks like I agree with 2015 Andreas, see #10019. That issue was closed without introducing new methods. Then later that year #12812 was opened, but it wasn't merged. Two years later I merged it without leaving a comment, and I don't remember why I did so. Afterwards, we then had to add all the other methods to avoid ambiguities.

So it looks like it might have been a bit random that these methods were introduced in the first place. I don't like them and believe that people would be better off without them, but given the current state, if somebody makes a PR with these methods and associated tests, then I won't try to filibuster it (even though it adds extra compilation to an already compilation-heavy and therefore very slow test file).

@leostenzel

Noticed a similar (?) issue with mul!(::Matrix, ::HermOrSym, ::HermOrSym), so decided to test this more generally:

using LinearAlgebra, BenchmarkTools
using Base.Iterators: product
A = rand(200, 200)
B = rand(200, 200) 
dest = similar(A)

types = [identity, Symmetric, Hermitian, UpperTriangular, UnitUpperTriangular,
  LowerTriangular, UnitLowerTriangular, UpperHessenberg, 
  Tridiagonal, SymTridiagonal ∘ Symmetric, 
  a->Bidiagonal(diag(a), diag(a, 1), :U), Diagonal]

for (Ta, Tb) in product(types, types)
  _A = Ta(A)
  _B = Tb(B)
  t1 = time(@benchmark $_A * $_B seconds=0.1)
  tmul = time(@benchmark mul!($dest, $_A, $_B) seconds=0.1)
  if !isapprox(t1, tmul, rtol=0.5)
    arg_types = nameof.(typeof.([_A, _B]))
    @show arg_types, ratio(t1, tmul)
  end
end

On my M1 Mac with Julia 1.9rc2 this yields a bunch of type combinations where mul! is significantly slower than * (the number is the ratio time(*)/time(mul!), so values below 1 mean mul! is slower):

([:Symmetric, :Symmetric], 0.15212131077987395)
([:Hermitian, :Symmetric], 0.14867050011769367)
([:Symmetric, :Hermitian], 0.1501201330590954)
([:Hermitian, :Hermitian], 0.15710876467798882)
([:Array, :UpperTriangular], 0.0783870960007864)
([:Symmetric, :UpperTriangular], 0.08869831751722534)
([:Hermitian, :UpperTriangular], 0.08843682327204656)
([:UpperHessenberg, :UpperTriangular], 0.08703282707024972)
([:Diagonal, :UpperTriangular], 0.3072659809309098)
([:Array, :UnitUpperTriangular], 0.0801960959640439)
([:Symmetric, :UnitUpperTriangular], 0.08932123686354122)
([:Hermitian, :UnitUpperTriangular], 0.08990729018466966)
([:UpperHessenberg, :UnitUpperTriangular], 0.08850920305059527)
([:Diagonal, :UnitUpperTriangular], 0.35110059171597635)
([:Array, :LowerTriangular], 0.07920584733372955)
([:Symmetric, :LowerTriangular], 0.08806820049841861)
([:Hermitian, :LowerTriangular], 0.08689892949913336)
([:UpperHessenberg, :LowerTriangular], 0.08785087978687132)
([:Diagonal, :LowerTriangular], 0.2959498808788494)
([:Array, :UnitLowerTriangular], 0.07715129397475415)
([:Symmetric, :UnitLowerTriangular], 0.09019522967933709)
([:Hermitian, :UnitLowerTriangular], 0.08938477232681072)
([:UpperHessenberg, :UnitLowerTriangular], 0.08934690586575288)
([:Diagonal, :UnitLowerTriangular], 0.34585527043905834)
([:Diagonal, :UpperHessenberg], 0.3220858895705521)
([:Diagonal, :Tridiagonal], 0.1578947764617085)
([:Diagonal, :SymTridiagonal], 0.20002717522004712)
([:UpperTriangular, :Bidiagonal], 0.30357989568515886)
([:UnitUpperTriangular, :Bidiagonal], 0.2987819866247383)
([:LowerTriangular, :Bidiagonal], 0.2744375206094478)
([:UnitLowerTriangular, :Bidiagonal], 0.27931542328982784)
([:Diagonal, :Bidiagonal], 0.1732365783162167)
([:UpperTriangular, :Diagonal], 0.024214568979777973)
([:UnitUpperTriangular, :Diagonal], 0.03201487204047439)
([:LowerTriangular, :Diagonal], 0.026398215281650865)
([:UnitLowerTriangular, :Diagonal], 0.029608938547486034)
([:UpperHessenberg, :Diagonal], 0.32266666666666666)
([:Tridiagonal, :Diagonal], 0.19409417260880255)
([:SymTridiagonal, :Diagonal], 0.19127713498622592)
([:Bidiagonal, :Diagonal], 0.1897607790860655)
([:Diagonal, :Diagonal], 0.028365610288337475)

But there are also some cases where mul! is faster; this could just be caused by the allocations that mul! avoids, of course…

([:Tridiagonal, :UpperTriangular], 6.484450754024589)
([:SymTridiagonal, :UpperTriangular], 6.755345806882726)
([:Tridiagonal, :UnitUpperTriangular], 5.848837490624924)
([:SymTridiagonal, :UnitUpperTriangular], 5.85030633412128)
([:Tridiagonal, :UnitLowerTriangular], 5.854310386415167)
([:SymTridiagonal, :UnitLowerTriangular], 5.8304607385296805)
([:Tridiagonal, :LowerTriangular], 5.472315492957747)
([:SymTridiagonal, :LowerTriangular], 6.163230320159385)
([:SymTridiagonal, :Tridiagonal], 2.412737178282923)
([:Tridiagonal, :Tridiagonal], 2.2284411276948592)
([:Tridiagonal, :Tridiagonal], 2.482101850601108)
([:SymTridiagonal, :Bidiagonal], 2.6301575735495035)
([:Bidiagonal, :Bidiagonal], 2.3222368960335174)
([:Tridiagonal, :Tridiagonal], 2.4564705882352937)
([:Tridiagonal, :SymTridiagonal], 2.210109392682007)
([:SymTridiagonal, :SymTridiagonal], 2.437875751503006)
([:Bidiagonal, :SymTridiagonal], 2.491171597633136)

@dkarrasch (Member)

Noticed a similar (?) issue with mul!(::Matrix, ::HermOrSym, ::HermOrSym)

I think the main reason for this is that * is allowed to simply allocate new plain arrays with easy memory access, as in

*(A::HermOrSym, B::HermOrSym) = A * copyto!(similar(parent(B)), B)

and subsequently use specialized BLAS routines. In contrast, mul!(C, ::Hermitian, ::Hermitian) falls back to generic_matmatmul!. The point is that the promise of mul! is that it operates in a memory-aware fashion and doesn't create copies of the factors. So what we do is add mul! methods whenever we find a more efficient way of doing the multiplication without creating copies.

Another aspect is that you measure computation times with the output dest::Matrix. In some structured cases, this means that mul! needs to overwrite many entries with zeros, whereas * may be able to create a specific output type that keeps all those zeros implicit, like Diagonal etc. Think of *(::Diagonal, ::Diagonal) compared to mul!(::Matrix, ::Diagonal, ::Diagonal) (and compared to mul!(::Diagonal, ::Diagonal, ::Diagonal))!
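
For example, *(::Diagonal, ::Diagonal) returns a Diagonal, so the off-diagonal zeros stay implicit, whereas mul! into a Matrix has to write every entry:

julia> using LinearAlgebra

julia> D1 = Diagonal(rand(3)); D2 = Diagonal(rand(3));

julia> D1 * D2 isa Diagonal # only the 3 diagonal entries are stored
true

julia> mul!(zeros(3, 3), D1, D2); # has to fill all 9 entries, zeros included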

@leostenzel commented on Apr 11, 2023

Thanks for your reply! I agree with your arguments, and many cases already seem to be addressed by your PR.
It seems logical that a user should, e.g., allocate a Triangular matrix if they multiply mul!(::Triangular, ::Diagonal, ::[Unit]Triangular). Though it's surprising that dest::StridedMatrix is 40 times slower…

I'm just not happy with mul!(::StridedMatrix, ::HermOrSym, ::HermOrSym) (and its 5-arg counterpart).
I wanted to propose something like:

mul!(C::StridedMatrix{T}, A::HermOrSym{T,<:StridedMatrix},
    B::HermOrSym{T,<:StridedMatrix}) where {T<:BlasFloat} = mul!(C, copyto!(A.data, A), B)

I guess it's not ideal to mutate A… But as a user, I'd be less surprised by getting a symmetrized A.data than by a 7-fold increase in runtime?
I would open a corresponding PR if you think that makes sense.
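
An alternative that pays for one allocation instead of mutating A could mirror the * method quoted above; a rough sketch, assuming mul!(::StridedMatrix, ::HermOrSym, ::StridedMatrix) dispatches to BLAS:

# Sketch: materialize B into a fresh dense matrix (like * does), then the
# Symmetric/Hermitian-times-dense mul! can take the BLAS path. One extra
# allocation, but neither A nor B is mutated.
mul!(C::StridedMatrix{T}, A::HermOrSym{T,<:StridedMatrix},
    B::HermOrSym{T,<:StridedMatrix}) where {T<:BlasFloat} =
    mul!(C, A, copyto!(similar(parent(B)), B))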
