
Copying small arrays with broadcasting syntax should be faster #45487

Open
BioTurboNick opened this issue May 28, 2022 · 9 comments
Labels
compiler:simd, domain:arrays, performance

Comments

@BioTurboNick
Contributor

For loops are 2-10x faster (depending on size, destination) for copying small arrays than copyto!, which the broadcasting machinery relies on.

It would be amazing if the broadcasting syntax could be faster for small arrays automatically.

I am aware of LoopVectorization, the intent here is 1) using the tools available in Base; 2) using the elegant broadcasting syntax; to 3) provide a reasonable default.

[benchmark plots]

using BenchmarkTools

function broadcastedcopy(x, y)
    a = rand(x)
    b = zeros(x) # zeros(x, 3)
    for i = 1:y
        b .= a # b[:, 1] .= a
    end
    b
end

function forloopcopy(x, y)
    a = rand(x)
    b = zeros(x) # zeros(x, 3)
    for i = 1:y
        for j = 1:x
            b[j] = a[j] # b[j, 1] = a[j]
        end
    end
    b
end

iterations = 1_000_000
ns = (1, 2, 3, 4, 5, 10, 20, 50, 63, 64, 65, 100)

println("Broadcasted copy")
for n in ns
    print("$n: ")
    @btime broadcastedcopy($n, $iterations)
end

println("For loop copy")
for n in ns
    print("$n: ")
    @btime forloopcopy($n, $iterations)
end
@oscardssmith added the performance and domain:arrays labels May 28, 2022
@N5N3
Member

N5N3 commented May 28, 2022

Looks like the for loop version is faster only because it's fully inlined.
My local bench shows that copyto! is faster than the for loop if we separate the copy kernel and mark it @noinline. (.= is about 1.5x slower than copyto!, but that is likely due to the extra overhead introduced by @noinline.)

My Bench Code
function floop(a, b)
    @inbounds for i in eachindex(a, b)
        a[i] = b[i]
    end
    a
end
_copyto!(a, b) = a .= b
_repeat(times, f, a, b) = for _ in 1:times
    @noinline f(a, b)
end
using BenchmarkTools
ele = [1, 2, 3, 4, 5, 10, 20, 50, 63, 64, 65, 100]
times = zeros(length(ele), 3)
for n in 1:length(ele)
    a = zeros(ele[n])
    b = randn(ele[n])
    times[n, 1] = @belapsed _repeat(100, floop, $a, $b)
    times[n, 2] = @belapsed _repeat(100, copyto!, $a, $b)
    times[n, 3] = @belapsed _repeat(100, _copyto!, $a, $b)
end

@BioTurboNick
Contributor Author

BioTurboNick commented May 28, 2022

Hmm... Looking at @code_native, the assembly generated for the broadcast version is 874 lines long, whereas the for loop is 267 lines long.

The code generated for the broadcast references _copyto_impl! but also jl_array_copy, plus several inlined SIMD copy loops; perhaps it chooses among them for different situations? According to @profile, _copyto_impl! is the one used here.

Meanwhile, the for loop version is just a single scalar-register copy loop.
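For anyone wanting to reproduce the inspection, here is a minimal sketch (the helper names bcopy! and loopcopy! are mine, not from Base):

```julia
using InteractiveUtils  # provides @code_native outside the REPL

bcopy!(b, a) = (b .= a; b)          # broadcast path, dispatches into copyto!
function loopcopy!(b, a)
    for j in eachindex(a, b)
        b[j] = a[j]                 # plain scalar copy loop
    end
    b
end

a = rand(100); b = zeros(100)
@code_native debuginfo=:none bcopy!(b, a)     # long listing: SIMD loops + copyto! machinery
@code_native debuginfo=:none loopcopy!(b, a)  # short listing: one scalar loop
```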

@BioTurboNick
Contributor Author

BioTurboNick commented May 28, 2022

Here I'm comparing two changes: removing the copyto! call altogether (falling back to the @simd loop), versus testing for length ≤ 8 and using a plain for loop only in that case.

@inline function copyto!(dest::AbstractArray, bc::Broadcasted{Nothing})
    axes(dest) == axes(bc) || throwdm(axes(dest), axes(bc))
    # Performance optimization: broadcast!(identity, dest, A) is equivalent to copyto!(dest, A) if indices match
    if bc.f === identity && bc.args isa Tuple{AbstractArray} # only a single input argument to broadcast!
        A = bc.args[1]
        if axes(dest) == axes(A)
            if length(A) > 8
                return copyto!(dest, A)
            else
                for i ∈ eachindex(A)
                    dest[i] = A[i]
                end
                return dest
            end
        end
    end
    bc′ = preprocess(dest, bc)
    # Performance may vary depending on whether `@inbounds` is placed outside the
    # for loop or not. (cf. https://github.com/JuliaLang/julia/issues/38086)
    @inbounds @simd for I in eachindex(bc′)
        dest[I] = bc′[I]
    end
    return dest
end

Still some overhead from the broadcast, but better scaling

[benchmark plot]

And for copying into an array column:
[benchmark plot]

@N5N3
Member

N5N3 commented May 29, 2022

IIRC, only Array's Base.copyto! is based on memmove.
So optimization on the copyto! side seems the better choice?
(BTW, a pure for-loop is not correct when dest and A have overlapping memory.)
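A minimal illustration of that aliasing hazard (broadcast unaliases its arguments in preprocess, a naive loop does not):

```julia
a = collect(1:5)
dest = view(a, 2:5)
src  = view(a, 1:4)

# Naive loop: each write clobbers the element the next read needs,
# so the first value propagates through the whole range.
for i in eachindex(src)
    dest[i] = src[i]
end
a  # == [1, 1, 1, 1, 1]; `dest .= src` would instead give [1, 1, 2, 3, 4]
```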

@BioTurboNick
Contributor Author

@N5N3 - going back to your original observation: if I decorate all the copyto! methods and _copyto_impl! with @inline, I get better performance; memcpy now appears in the code_native output. It's still not quite as good as a simple for loop, but it removes some of the broadcast overhead (~30%).

Though I suspect there are reasons those functions aren't already marked @inline?

@brenhinkeller
Sponsor Contributor

cf. also #28126

@BioTurboNick
Contributor Author

BioTurboNick commented Feb 14, 2024

Thought I'd check in on how this is doing, and this is interesting:
[benchmark plot]

In 1.11-DEV.1597, the simple for loop now has a somewhat erratic performance profile and shows a cache effect around length 64 (the spike). It looks a lot like the SIMD broadcasting profile shown above, so I presume Julia is better at using SIMD here than it used to be.

The 1.11-DEV.1597 for loop seems to share a worst-case bound with broadcasting, and it is still strictly faster for most inputs, but at the low end it is slower than the 1.10 for loop. I notice the benchmark reports 4 allocations in 1.11-DEV.1597 versus only 2 in 1.10, so that could be the source.

@BioTurboNick
Contributor Author

BioTurboNick commented Feb 14, 2024

The extra time seems to be spent in this line, which is new in 1.11:

@boundscheck ult_int(bitcast(UInt, sub_int(i, 1)), bitcast(UInt, length(A))) || throw_boundserror(A, (i,))

I'll make a new issue for this specific case.

@oscardssmith
Member

The 4 allocations vs. 2 is just because the Memory and the Array are no longer a single allocation.
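That split is visible from the REPL; a quick sketch (field names reflect 1.11's internal layout, which is subject to change):

```julia
# Julia 1.11+: an Array is now a small wrapper object holding a MemoryRef
# into a separately allocated Memory buffer, hence two heap allocations
# where 1.10 had one.
v = zeros(4)
v.ref       # MemoryRef{Float64} pointing into the buffer
v.ref.mem   # the underlying Memory{Float64} that owns the data
```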
