
Remove buffer pool for tiled GEMM #42309

Merged: 1 commit merged into JuliaLang:master from tkf:unbuf on Sep 20, 2021
Conversation

@tkf (Member) commented Sep 19, 2021

This PR is an RFC for removing the pooled buffer for the generic tiled matrix multiplication. The current implementation has several issues and a quick benchmark does not suggest that allocation and/or GC are the bottleneck.

Issue 1: Reentrancy

Inside of _generic_matmatmul!, we have the check

tile_size = 0
if isbitstype(R) && isbitstype(T) && isbitstype(S) && (tA == 'N' || tB != 'N')
    tile_size = floor(Int, sqrt(tilebufsize / max(sizeof(R), sizeof(S), sizeof(T), 1)))
end
@inbounds begin
if tile_size > 0

which then uses pre-allocated arrays

const tilebufsize = 10800 # Approximately 32k/3
# per-thread arrays of buffers resized by __init__ if needed
const Abuf = [Vector{UInt8}(undef, tilebufsize)]
const Bbuf = [Vector{UInt8}(undef, tilebufsize)]
const Cbuf = [Vector{UInt8}(undef, tilebufsize)]

via

Atile = unsafe_wrap(Array, convert(Ptr{T}, pointer(Abuf[Threads.threadid()])), sz)
Btile = unsafe_wrap(Array, convert(Ptr{S}, pointer(Bbuf[Threads.threadid()])), sz)
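
For concreteness: with Float64 operands (sizeof(Float64) == 8), this gives tile_size = floor(Int, sqrt(10800 / 8)) = 36, so each of the three pooled buffers holds one 36×36 tile (36² × 8 = 10368 bytes ≤ 10800).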

I think the major and practical issue here is that isbitstype is not enough to ensure non-reentrancy of * (i.e., that the buffers are not re-used by multiple calls at once). I can quite easily define an arbitrarily large matrix type with a small isbits representation:

struct LargeMatrix <: AbstractMatrix{Float64}
    id::UInt64
end

const MATRICES = Dict{UInt64,Matrix{Float64}}()

Base.size(m::LargeMatrix) = size(MATRICES[m.id])
Base.getindex(m::LargeMatrix, i::Int, j::Int) = MATRICES[m.id][i, j]

This example is rather silly, but it's conceivable to have a Julia wrapper of externally managed matrices that is just a Ptr{Cvoid} as a Julia object. If such isbits-but-large matrices are used as the elements of nested/blocked matrices, the current implementation can corrupt the result.

Issue 2: @async- and @spawn-safety

A method of * can yield (e.g., for a wrapper type that logs all operations to a file). We then have the same reentrancy problem as above.

Furthermore, if the current task is a @spawned task, it can migrate to another worker thread when it yields, so Threads.threadid() may change mid-computation.
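
For illustration, here is a minimal sketch (the type and field names are hypothetical, not from the PR) of a wrapper whose element access can yield. Its eltype is Float64 (isbits), so generic_matmatmul! on it still takes the tiled branch:

struct LoggedMatrix <: AbstractMatrix{Float64}
    data::Matrix{Float64}
    io::IO
end

Base.size(m::LoggedMatrix) = size(m.data)
function Base.getindex(m::LoggedMatrix, i::Int, j::Int)
    println(m.io, "read ($i, $j)")  # I/O can yield to another task
    return m.data[i, j]
end

While the copy into a tile is in flight, each getindex may yield; another task scheduled on the same thread can then enter the tiled GEMM and overwrite the very buffer indexed by Threads.threadid(), and a migrated task resumes indexing a different thread's buffer.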

Issue 3: nthreads may not be constant in the future

After a change in the scheduler like #42302, it will no longer be possible to rely on nthreads staying constant after Julia starts. Supporting a pattern like this would also require a proper "static" initialization API (e.g., using the double-checked locking pattern; see the sketch below).
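
To illustrate what such an initialization API might look like (a sketch only, not part of this PR; the names get_buffers, BUF_INITIALIZED, BUF_LOCK, and BUFS are invented for the example), double-checked locking guards a one-time lazy initialization of the pool:

const BUF_INITIALIZED = Threads.Atomic{Bool}(false)
const BUF_LOCK = ReentrantLock()
const BUFS = Vector{Vector{UInt8}}()

function get_buffers()
    if !BUF_INITIALIZED[]                  # fast path: unlocked check
        lock(BUF_LOCK) do
            if !BUF_INITIALIZED[]          # slow path: re-check under the lock
                for _ in 1:Threads.nthreads()
                    push!(BUFS, Vector{UInt8}(undef, tilebufsize))
                end
                BUF_INITIALIZED[] = true   # publish only after init completes
            end
        end
    end
    return BUFS
end

Even this only works if nthreads() is fixed at the time of first use; once the thread count can grow, as after #42302, the per-thread pool itself becomes invalid.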

Issue 4: GC

Also, as noted in #23914, this code has a problem due to a missing GC.@preserve: nothing keeps the UInt8 buffer alive while the unsafe_wrapped Array is in use.
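
For illustration, a fix of the kind #23914 implies would have to root the byte buffer while the wrapped array is alive, roughly like this (reusing the names from the snippets above):

buf = Abuf[Threads.threadid()]
GC.@preserve buf begin
    # `buf` is rooted here, so the memory aliased by Atile stays valid
    Atile = unsafe_wrap(Array, convert(Ptr{T}, pointer(buf)), sz)
    # ... all uses of Atile must stay inside this block ...
end

The issue becomes moot once the buffers are ordinary freshly allocated arrays, as in this PR.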

Performance impact

It also seems that allocation is not the bottleneck. For inputs large enough that tiling matters, the effect of allocation/GC is not observable. For small matrices, where the effect of GC is visible, it's faster not to use the tiled version at all.

using BenchmarkTools, LinearAlgebra

A = randn(n, n)
B = randn(n, n)
C = similar(A)
@btime LinearAlgebra.generic_matmatmul!($C, 'N', 'N', $A, $B)

shows

|         | After PR | No tile    | Before PR  |
| ------- | -------- | ---------- | ---------- |
| n = 500 | 82.1 ms  | 109.9 ms   | 81.3 ms    |
| n = 10  | 1.070 μs | 510.361 ns | 770.266 ns |

"No tile" means applying this diff

diff --git a/stdlib/LinearAlgebra/src/matmul.jl b/stdlib/LinearAlgebra/src/matmul.jl
index 0cbfeaf9ed..b4ff0c0c5a 100644
--- a/stdlib/LinearAlgebra/src/matmul.jl
+++ b/stdlib/LinearAlgebra/src/matmul.jl
@@ -768,6 +768,7 @@ function _generic_matmatmul!(C::AbstractVecOrMat{R}, tA, tB, A::AbstractVecOrMat
     if isbitstype(R) && isbitstype(T) && isbitstype(S) && (tA == 'N' || tB != 'N')
         tile_size = floor(Int, sqrt(tilebufsize / max(sizeof(R), sizeof(S), sizeof(T), 1)))
     end
+    tile_size = 0
     @inbounds begin
     if tile_size > 0
         sz = (tile_size, tile_size)

@tkf added the domain:linear algebra and domain:multithreading labels on Sep 19, 2021
@DilumAluthge (Member) commented:

cc: @chriselrod

@@ -797,8 +792,7 @@ function _generic_matmatmul!(C::AbstractVecOrMat{R}, tA, tB, A::AbstractVecOrMat
            end
        end
    else
        # FIXME: This code is completely invalid!!!
🎉

@chriselrod (Contributor) commented Sep 19, 2021

> I think the major and practical issue here is that isbitstype is not enough to ensure non-reentrancy of * (i.e., that the buffers are not re-used by multiple calls at once). I can quite easily define an arbitrarily large matrix type with a small isbits representation …

I don't follow this example.
The eltype is Float64, so the buffer is unsafe_wrapped into an Array{Float64} that you can copy elements into.
Given the inefficient memory accesses of LargeMatrix, this example looks like a case where we'd really want to use buffers.

FWIW, Octavian.jl avoids the GC issues for Abuf by using a large mutable struct holding an NTuple (think of a large StaticArrays.MVector). This will be stack-allocated if you're careful, resulting in zero memory allocations.

However, for Bbuf, Octavian is using a large preallocated array as Bbuf is much larger than Abuf and would blow the stack.
Stack size also seems smaller in 32 bit Julia, so it's using arrays instead of stack allocation there, too.
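
A rough sketch of the mutable-struct-holding-an-NTuple idea (hypothetical code, not Octavian.jl's actual implementation):

mutable struct TileBuffer{N,T}
    data::NTuple{N,T}
    TileBuffer{N,T}() where {N,T} = new{N,T}()   # leave `data` uninitialized
end

function with_tile(f, ::Type{T}, ::Val{N}) where {T,N}
    buf = TileBuffer{N,T}()
    GC.@preserve buf begin                       # root `buf` while we hold a raw pointer
        p = convert(Ptr{T}, pointer_from_objref(buf))
        f(p)                                     # read/write tile elements through `p`
    end
end

If buf never escapes, the compiler is free to stack-allocate it, giving the zero-allocation hot loop; if it does escape, it silently falls back to the heap, which is the "if you're careful" caveat above.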

LinearAlgebra.generic_matmatmul! is extremely slow anyway, so I think it's fine to go with a simpler implementation like this PR's instead of trying to stack-allocate buffers on 64-bit platforms.

@tkf (Member, Author) commented Sep 19, 2021

Note that the main concern here (discussed in the issue sections above) is correctness, not performance. I discussed performance a bit only to point out that the impact is small for previously working cases.

But I think the point on reentrancy can be clarified a bit more. Consider

struct LargeMatrix <: AbstractMatrix{Float64}
    id::UInt64
end

struct LargeMatrix2 <: AbstractMatrix{LargeMatrix}
    id::UInt64
end

The implementation of * on LargeMatrix2 needs to compute * on LargeMatrix, since its elements are themselves matrices. If both methods hit the tiled GEMM, and the tiled GEMM uses pooled buffers, the inner * on LargeMatrix corrupts the buffers still in use by the outer * on LargeMatrix2, as sketched below.
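
Schematically, the failure mode is:

# C2 = A2 * B2                   -- outer tiled GEMM on LargeMatrix2 starts
#                                   filling Abuf/Bbuf with LargeMatrix elements
#   -> Atile[i, j] * Btile[j, k] -- multiplying two elements is itself a
#                                   LargeMatrix matmul ...
#      -> tiled GEMM again       -- ... which grabs the SAME pooled buffers,
#                                   clobbering the outer call's tiles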

> The eltype is Float64

So the point is when LargeMatrix is used as the element type. That's why I mentioned block matrices in the OP.

> this example looks like a case where we'd really want to use buffers.

After this PR, we'll still use buffers. The point is to not pool them.

@chriselrod (Contributor) commented Sep 19, 2021

> The implementation of * on LargeMatrix2 needs to compute * on LargeMatrix. If both methods hit the tiled GEMM …

Got it, thanks for the explanation.

I understood that this needed changes because of possible task migration as well as the threadpools PR, but I missed the example of LargeMatrix being isbits yet potentially recursively calling the tiled matmul.

@KristofferC (Sponsor Member) left a comment

This clearly has to be done at some point, and the benchmarks show that it doesn't have a significant performance impact (with perhaps the exception of cases that shouldn't use tiling anyway), so LGTM.

@vchuravy merged commit 6893f21 into JuliaLang:master on Sep 20, 2021
@tkf deleted the unbuf branch on September 20, 2021 at 18:44
KristofferC pushed a commit that referenced this pull request on Sep 28, 2021
@tkf changed the title from "RFC: Remove buffer pool for tiled GEMM" to "Remove buffer pool for tiled GEMM" on Oct 15, 2021
@tkf added the backport 1.6, backport 1.7, and kind:bugfix labels on Oct 15, 2021
@tkf (Member, Author) commented Oct 15, 2021

This is a bugfix (the bug I illustrated in the comment above can actually be triggered), so I suppose this is a target for backporting?

KristofferC pushed a commit that referenced this pull request on Nov 11, 2021
@KristofferC removed the backport 1.6 label on Nov 12, 2021
@KristofferC (Sponsor Member) commented:
This causes quite a bit of breakage on 1.6, so I'll remove it from the backports for now. This is of course the packages' fault, but we try to be very conservative with patch releases.

@tkf (Member, Author) commented Nov 12, 2021

Ah, I hadn't considered at all that this could trigger breakage in packages. But yes, it makes sense to err on the safe side with backports. Thanks for hunting down backport issues!

(This makes me wonder if it makes sense to delay backporting, or at least PkgEval, until the related patches land in the latest release. That would give packages some chance to fix their bugs. Probably the only disadvantage is that we couldn't backport things while the patches are still fresh in the PR authors' memory.)

LilithHafner pushed a commit to LilithHafner/julia that referenced this pull request on Feb 22, 2022
LilithHafner pushed a commit to LilithHafner/julia that referenced this pull request on Mar 8, 2022
Labels: domain:linear algebra, domain:multithreading, kind:bugfix