Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More efficient hvcat of scalars and arrays of numbers #39729

Merged
merged 12 commits into from
Jan 13, 2025

Conversation

BioTurboNick
Copy link
Contributor

@BioTurboNick BioTurboNick commented Feb 18, 2021

First attempt to address #39713

Original:

const a, b, c, d = zeros(Int, 2, 2), [3 4], [2 ; 4], 5
using BenchmarkTools
@btime [a c ; b d]   # 31 allocations and 1.25 kb

New:

@btime [a c ; b d]   # 15 allocations and 656 bytes

Others unchanged, as expected. Though if different types of numbers are mixed, it still takes the longer path. I tried expanding the definition but there's some weird stuff going on that increases allocations in the other situations I posted in that issue. Works for any Number element type.

Fixes #39713

@BioTurboNick
Copy link
Contributor Author

Oops, yep. completely broken. Should have guessed.

@BioTurboNick BioTurboNick changed the title More efficient hvcat of same-type elements More efficient hvcat of scalars and arrays of numbers Feb 18, 2021
base/abstractarray.jl Outdated Show resolved Hide resolved
@BioTurboNick
Copy link
Contributor Author

const a, b, c, d, e, f = zeros(Int, 2, 2), [3 4], [2 ; 4], 5, 5.0, "string"
using BenchmarkTools
e19() = [a c ; b d]
e20() = [a c ; b e]
e21() = [a c ; b f]
e22() = [a c ; b [d]]
@btime e19() # 12 alloc: 528 bytes     ; 1.6.0-rc1: 31 alloc: 1.25 KiB
@btime e20() # 13 alloc: 544 bytes     ; 1.6.0-rc1: 39 alloc: 1.38 KiB
@btime e21() # <broken>                 ; 1.6.0-rc1: 31 alloc: 1.25 KiB
@btime e22() # 10 alloc: 512 bytes     ; 1.6.0-rc1: 10 alloc: 512 bytes
const x = [1 2; 3 4;;; 5 6; 7 8] # cat([1 2; 3 4], [5 6; 7 8], dims=3)
const y = x .+ 1

e6() = [x y ; x y]
@btime e6() # 4 alloc: 672 bytes        ; 1.6.0-rc1: 104 alloc: 4.98 KiB

This is what I achieved by consolidating with hvncat (once it's ready). Note it integrates #39725, which is currently broken for assigning strings into a range setindex.

@timholy
Copy link
Member

timholy commented Feb 19, 2021

I haven't had time to look at this, but in case you're interested and if it helps, I recently sped up cat: #39292, #39294, #39314

@BioTurboNick
Copy link
Contributor Author

Cool! @timholy I think I noticed some of that effect without knowing the cause. Don't worry about reviewing this yet, it's dependent on my other PR #33697

@BioTurboNick
Copy link
Contributor Author

BioTurboNick commented Jun 6, 2021

Updated this, now that the hvncat mechanism is merged:

julia> @btime e19()
  904.651 ns (8 allocations: 576 bytes)
3×3 Matrix{Int64}:
 0  0  2
 0  0  4
 3  4  5
# compare with 1.6.1: 2.611 μs (31 allocations: 1.25 KiB)
#             master: 2.156 μs (30 allocations: 1.44 KiB)

julia> @btime e20()
  910.811 ns (9 allocations: 592 bytes)
3×3 Matrix{Float64}:
 0.0  0.0  2.0
 0.0  0.0  4.0
 3.0  4.0  5.0
# compare with 1.6.1: 2.811 μs (39 allocations: 1.38 KiB)
#             master: 2.267 μs (38 allocations: 1.56 KiB)

julia> @btime e21()
  847.368 ns (5 allocations: 432 bytes)
3×3 Matrix{Any}:
 0  0  2
 0  0  4
 3  4   "string"
# compare with 1.6.1: 2.678 μs (31 allocations: 1.25 KiB)
#             master: 2.167 μs (26 allocations: 1.25 KiB)

julia> @btime e22()
  873.333 ns (6 allocations: 528 bytes)
3×3 Matrix{Int64}:
 0  0  2
 0  0  4
 3  4  5
# compare with 1.6.1: 1.390 μs (10 allocations: 512 bytes)
#             master: 1.240 μs (10 allocations: 512 bytes)

julia> @btime e6() # 4 alloc: 672 bytes
  515.104 ns (5 allocations: 640 bytes)
4×4×2 Array{Int64, 3}:
[:, :, 1] =
 1  2  2  3
 3  4  4  5
 1  2  2  3
 3  4  4  5

[:, :, 2] =
 5  6  6  7
 7  8  8  9
 5  6  6  7
 7  8  8  9
# compare with 1.6.1: 4.414 μs (56 allocations: 2.41 KiB)
#             master: 3.337 μs (50 allocations: 2.39 KiB)

@btime [1 2; 4 5]
  29.879 ns (1 allocation: 112 bytes)
2×2 Matrix{Int64}:
 1  2
 4  5
# compare with 1.6.1: 91.208 ns (2 allocations: 160 bytes)
#             master:  93.194 ns (2 allocations: 160 bytes)

julia> @btime [1 2; 4 fill(1, 1, 1, 1)]
  877.551 ns (9 allocations: 624 bytes)
2×2 Matrix{Int64}:
 1  2
 4  1
# compare with 1.6.1: 4.188 μs (43 allocations: 1.62 KiB)
#             master: 3.487 μs (42 allocations: 1.81 KiB)

@BioTurboNick
Copy link
Contributor Author

BioTurboNick commented Jun 6, 2021

With respect to the original motivating examples:

const a, b, c, d = zeros(Int, 2, 2), [3 4], [2 ; 4], 5
using BenchmarkTools
# mixed arrays and scalars
@btime [a c ; b d]   # 2.922 μs (31 allocations: 1.25 KiB); now 897.778 ns (8 allocations: 576 bytes)
@btime [a c ; [b d]] # 2.322 μs (21 allocations: 880 bytes); 1.690 μs (26 allocations: 1.30 KiB)
@btime [[a c] ; [b d]] # 975.000 ns 16 allocations and 816 bytes -- explicit hcat nested within vcat (should be worst case)
# scalars wrapped in arrays
@btime [a c ; b [d]] # 1.410 μs ( (10 allocations: 512 bytes); now 865.957 ns (6 allocations: 528 bytes)
@btime [a c ; [b [d]]] # 1.230 μs (9 allocations: 560 bytes); now 816.250 ns (14 allocations: 1008 bytes)
@btime [[a c] ; [b [d]]] # 142.069 ns (4 allocations: 496 bytes) -- explicit hcat nested within vcat (should be worst case)

So better. Still some more improvements to make. Not entirely sure why the middle case is relatively poor.

@BioTurboNick

This comment was marked as outdated.

@BioTurboNick
Copy link
Contributor Author

Current status of this PR vs. Master:

(Some timing differences from previous test likely due to my new computer)

This PR:

const a, b, c, d = zeros(Int, 2, 2), [3 4], [2 ; 4], 5
using BenchmarkTools

# mixed arrays and scalars
@btime [a c ; b d]   # 259.371 ns (22 allocations: 1.17 KiB) +++
@btime [a c ; [b d]] # 1.612 μs (26 allocations: 1.12 KiB) ---
@btime [[a c] ; [b d]] # 497.052 ns (16 allocations: 720 bytes) ---

# scalars wrapped in arrays
@btime [a c ; b [d]] # 189.981 ns (4 allocations: 352 bytes)
@btime [a c ; [b [d]]] # 1.161 μs (14 allocations: 816 bytes) ---
@btime [[a c] ; [b [d]]] # 84.902 ns (4 allocations: 384 bytes)

Master:

const a, b, c, d = zeros(Int, 2, 2), [3 4], [2 ; 4], 5
using BenchmarkTools

# mixed arrays and scalars
@btime [a c ; b d]   # 628.402 ns (17 allocations: 944 bytes)
@btime [a c ; [b d]] # 1.110 μs (21 allocations: 800 bytes)
@btime [[a c] ; [b d]] # 480.513 ns (16 allocations: 720 bytes)

# scalars wrapped in arrays
@btime [a c ; b [d]] # 801.111 ns (10 allocations: 448 bytes)
@btime [a c ; [b [d]]] # 665.161 ns (9 allocations: 464 bytes)
@btime [[a c] ; [b [d]]] # 79.321 ns (4 allocations: 384 bytes)

+++ = Improved since last test
--- = Regression since last test

So currently, Master is actually faster or equivalent for 4 of the 6 cases now, which is great. So will need to think if this PR is truly necessary anymore, and see what can improve that last case.

@BioTurboNick
Copy link
Contributor Author

Updating based on current master:

This PR:

const a, b, c, d = zeros(Int, 2, 2), [3 4], [2 ; 4], 5
using BenchmarkTools

# mixed arrays and scalars
@btime [a c ; b d]   # 2.473 μs (43 allocations: 2.03 KiB)
@btime [a c ; [b d]] # 1.561 μs (33 allocations: 1.17 KiB)
@btime [[a c] ; [b d]] # 467.454 ns (19 allocations: 768 bytes)

# scalars wrapped in arrays
@btime [a c ; b [d]] # 176.155 ns (8 allocations: 368 bytes)
@btime [a c ; [b [d]]] # 1.088 μs (22 allocations: 864 bytes)
@btime [[a c] ; [b [d]]] # 57.834 ns (8 allocations: 432 bytes)

Master:

const a, b, c, d = zeros(Int, 2, 2), [3 4], [2 ; 4], 5
using BenchmarkTools

# mixed arrays and scalars
@btime [a c ; b d]   # 564.249 ns (17 allocations: 816 bytes)
@btime [a c ; [b d]] # 1.041 μs (23 allocations: 832 bytes)
@btime [[a c] ; [b d]] # 470.286 ns (19 allocations: 768 bytes)

# scalars wrapped in arrays
@btime [a c ; b [d]] # 708.369 ns (12 allocations: 464 bytes)
@btime [a c ; [b [d]]] # 585.099 ns (12 allocations: 496 bytes)
@btime [[a c] ; [b [d]]] # 59.694 ns (8 allocations: 432 bytes)

All cases on master are all faster than they were, even in the presence of additional allocations. I'll see about whether this PR can be wrapped up one way or the other this weekend I think.

@BioTurboNick
Copy link
Contributor Author

BioTurboNick commented Feb 11, 2024

New simple approach - just dispatching hvcat to the hvncat_shape methods. Back to being similar or faster than master. But need to verify I didn't slow any other conditions down.

EDIT: updated timings after better addressing dynamic dispatch of cat_similar

const a, b, c, d = zeros(Int, 2, 2), [3 4], [2 ; 4], 5
using BenchmarkTools

# mixed arrays and scalars
@btime [a c ; b d]   # 150.041 ns (25 allocations: 1.19 KiB)
@btime [a c ; [b d]] # 565.120 ns (31 allocations: 1.11 KiB)
@btime [[a c] ; [b d]] # 472.211 ns (19 allocations: 768 bytes)

# scalars wrapped in arrays
@btime [a c ; b [d]] # 85.399 ns (8 allocations: 368 bytes)
@btime [a c ; [b [d]]] # 148.468 ns (20 allocations: 800 bytes)
@btime [[a c] ; [b [d]]] # 58.048 ns (8 allocations: 432 bytes)

base/abstractarray.jl Outdated Show resolved Hide resolved
base/abstractarray.jl Outdated Show resolved Hide resolved
@BioTurboNick
Copy link
Contributor Author

BioTurboNick commented Feb 12, 2024

The remaining slowdown I'm seeing relates to the hcat methods.

In 1.10, hcat for an array and a scalar dispatches to a SparseArrays method (even though I haven't loaded SparseArrays?) which then redirects to typed_hcat, and then to _cat_t(Val(2), T, X...), which is very fast.

On master, the fallback abstractarrays.jl hcat is called instead, which feeds into the slower cat(A...; dims=Val(2)) instead.

Possibly related to #49322 ? Looks like a regression in cat, haven't fully investigated:

#1.10:
julia> @btime cat(b, d, dims = 2)
#  24.975 ns (1 allocation: 80 bytes)

#master:
julia> @btime cat(b, d, dims = 2)
#  554.371 ns (19 allocations: 576 bytes)

But we can probably ignore that for the purposes of this PR, which is about hvcat.

@BioTurboNick BioTurboNick marked this pull request as ready for review February 12, 2024 15:11
@RomeoV
Copy link

RomeoV commented Mar 10, 2024

I wonder if this resolves the following issue:

using LinearAlgebra
function measure()
  data = rand(3,3)
  mat = [
    data ones(size(data, 1))
    ones(size(data, 2))' 0
  ]
  det(mat)
end
@code_warntype measure()  # -> can't infer type of `mat`

There has been some discussion at https://julialang.zulipchat.com/#narrow/stream/225542-helpdesk/topic/Matrix.20comprehension.20failed.20type.20inference.

@BioTurboNick
Copy link
Contributor Author

Seems like it does!

MethodInstance for measure()
  from measure() @ Main REPL[2]:1
Arguments
  #self#::Core.Const(measure)
Locals
  mat::Matrix{Float64}
  data::Matrix{Float64}
Body::Float64
1 ─ %1  = Main.rand::Core.Const(rand)
│         (data = (%1)(3, 3))
│   %3  = Core.tuple(2, 2)::Core.Const((2, 2))
│   %4  = data::Matrix{Float64}
│   %5  = Main.ones::Core.Const(ones)
│   %6  = Main.size::Core.Const(size)
│   %7  = data::Matrix{Float64}
│   %8  = (%6)(%7, 1)::Int64
│   %9  = (%5)(%8)::Vector{Float64}
│   %10 = Main.:var"'"::Core.Const(adjoint)
│   %11 = Main.ones::Core.Const(ones)
│   %12 = Main.size::Core.Const(size)
│   %13 = data::Matrix{Float64}
│   %14 = (%12)(%13, 2)::Int64
│   %15 = (%11)(%14)::Vector{Float64}
│   %16 = (%10)(%15)::Adjoint{Float64, Vector{Float64}}
│         (mat = Base.hvcat(%3, %4, %9, %16, 0))
│   %18 = Main.det::Core.Const(LinearAlgebra.det)
│   %19 = mat::Matrix{Float64}
│   %20 = (%18)(%19)::Float64
└──       return %20

@BioTurboNick
Copy link
Contributor Author

BioTurboNick commented Jan 9, 2025

Alright, I think it's time! Can we consider this for merging? Universally faster hvcat, even though a few instances have a higher allocation count, all tests passing. And fixes the type instability mentioned above.

Comparing against current master:

const a, b, c, d, e, f = zeros(Int, 2, 2), [3 4], [2 ; 4], 5, 5.0, "string"
using BenchmarkTools
e19() = [a c ; b d]
e20() = [a c ; b e]
e21() = [a c ; b f]
e22() = [a c ; b [d]]

@btime e19() # 1.088 μs (17 allocations: 816 bytes)
@btime e20() # 1.117 μs (19 allocations: 848 bytes)
@btime e21() # 1.154 μs (17 allocations: 816 bytes)
@btime e22() # 2.429 μs (32 allocations: 1.20 KiB)

@btime [a c ; b d] # 1.075 μs (17 allocations: 816 bytes)
@btime [a c ; [b d]] # 1.906 μs (25 allocations: 1008 bytes)

@btime [a c ; b [d]] # 2.406 μs (32 allocations: 1.20 KiB)
@btime [a c ; [b [d]]] # 1.908 μs (27 allocations: 1.05 KiB)

And current PR:

@btime e19() # 312.075 ns (25 allocations: 1.19 KiB)
@btime e20() # 356.990 ns (30 allocations: 1.27 KiB)
@btime e21() # 115.967 ns (6 allocations: 304 bytes)
@btime e22() # 141.525 ns (8 allocations: 368 bytes)

@btime [a c ; b d] # 312.979 ns (25 allocations: 1.19 KiB)
@btime [a c ; [b d]] # 244.602 ns (18 allocations: 736 bytes)

@btime [a c ; b [d]] # 139.707 ns (8 allocations: 368 bytes)
@btime [a c ; [b [d]]] # 241.090 ns (20 allocations: 800 bytes)

@oscardssmith oscardssmith added performance Must go faster arrays [a, r, r, a, y, s] labels Jan 9, 2025
@oscardssmith
Copy link
Member

oscardssmith commented Jan 9, 2025

The one thing left that it would be good to see is performance on a literal 100x100 matrix (or similar). It's fairly common for large arrays to be written out in source code and it's important that the compiler not choke on them.

Other than that, this PR looks awesome! Deleting code and getting speedup is a great combination.

@JeffBezanson
Copy link
Member

Love the 4-year persistence on this!! 😄

@BioTurboNick
Copy link
Contributor Author

BioTurboNick commented Jan 10, 2025

I did a 100x100 array of 1.0 and on this PR: 8.533 μs (4 allocations: 79.01 KiB)
and with one string in it: 5.113 ms (120016 allocations: 2.90 MiB)

vs master: 8.108 μs (4 allocations: 79.01 KiB)
and with one string in it: 4.813 ms (71808 allocations: 2.29 MiB)

Even though they seem to be a bit faster on master, I also got a bunch of warnings I've never seen before about inference being slow that I don't see on the PR, when I threw the string in:

info: inference of var"##sample#280"(Tuple{}, BenchmarkTools.Parameters) from var"##sample#280"(Any, BenchmarkTools.Parameters) exceeding 2501 frames (may be slow).
info: inference of var"##sample#280"(Tuple{}, BenchmarkTools.Parameters) from var"##sample#280"(Any, BenchmarkTools.Parameters) exceeding 5001 frames (may be slow).
info: inference of var"##sample#280"(Tuple{}, BenchmarkTools.Parameters) from var"##sample#280"(Any, BenchmarkTools.Parameters) exceeding 10001 frames (may be slow).

Side note - don't try to paste a literal 100x100 matrix into PowerShell, it goes very, very slowly. (WSL is fine)

@BioTurboNick
Copy link
Contributor Author

Looking at the profiles, I think the performance decrease for the mixed-type example is largely due to a little more overhead, and the extensive dimensionality checking (because the concatenated elements could be numbers or arrays).

The current code concatenates each row and then concatenates the rows. This PR checks dimensions, creates the destination array, and then populates it one element at a time. I suspect the master performance would get worse if more rows were mixed types, since all the intermediate arrays would have to have an Any eltype instead of Float64.

The differences in timing might not matter that much, I think - might not be repeatable? And the first example with all 1s actually doesn't touch my code.

@oscardssmith oscardssmith merged commit f135f3a into JuliaLang:master Jan 13, 2025
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrays [a, r, r, a, y, s] performance Must go faster
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Extra allocations with hvcat of mixed arrays and scalars
7 participants