
Copying small arrays with broadcasting syntax should be faster #45487

Open
BioTurboNick opened this issue May 28, 2022 · 9 comments
Labels
compiler:simd, domain:arrays, performance

Comments

@BioTurboNick
Contributor

For loops are 2-10x faster (depending on size, destination) for copying small arrays than copyto!, which the broadcasting machinery relies on.

It would be amazing if the broadcasting syntax could be faster for small arrays automatically.

I am aware of LoopVectorization, the intent here is 1) using the tools available in Base; 2) using the elegant broadcasting syntax; to 3) provide a reasonable default.

[benchmark plots]

using BenchmarkTools

function broadcastedcopy(x, y)
    a = rand(x)
    b = zeros(x) # zeros(x, 3)
    for i = 1:y
        b .= a # b[:, 1] .= a
    end
    b
end

function forloopcopy(x, y)
    a = rand(x)
    b = zeros(x) # zeros(x, 3)
    for i = 1:y
        for j = 1:x
            b[j] = a[j] # b[j, 1] = a[j]
        end
    end
    b
end

iterations = 1_000_000
ns = (1, 2, 3, 4, 5, 10, 20, 50, 63, 64, 65, 100)

println("Broadcasted copy")
for n in ns
    print("$n: ")
    @btime broadcastedcopy($n, $iterations)
end

println("For loop copy")
for n in ns
    print("$n: ")
    @btime forloopcopy($n, $iterations)
end
@oscardssmith added the performance and domain:arrays labels May 28, 2022
@N5N3
Member

N5N3 commented May 28, 2022

Looks like the for loop version is faster only because it's fully inlined.
My local bench shows that copyto! is faster than the for loop if we separate the copy kernel and mark it @noinline. (.= is about 1.5x slower than copyto!, but that is likely due to the extra overhead introduced by @noinline.)

My Bench Code
function floop(a, b)
    @inbounds for i in eachindex(a, b)
        a[i] = b[i]
    end
    a
end
_copyto!(a, b) = a .= b
_repeat(times, f, a, b) = for _ in 1:times
    @noinline f(a, b)
end
using BenchmarkTools
ele = [1, 2, 3, 4, 5, 10, 20, 50, 63, 64, 65, 100]
times = zeros(length(ele), 3)
for n in 1:length(ele)
    a = zeros(ele[n])
    b = randn(ele[n])
    times[n, 1] = @belapsed _repeat(100, floop, $a, $b)
    times[n, 2] = @belapsed _repeat(100, copyto!, $a, $b)
    times[n, 3] = @belapsed _repeat(100, _copyto!, $a, $b)
end

@BioTurboNick
Contributor Author

BioTurboNick commented May 28, 2022

Hmm... Looking at @code_native, the assembly generated for the broadcast version is 874 lines long, whereas the for loop is 267 lines long.

The code generated for the broadcast references _copyto_impl! but also jl_array_copy, plus several inlined SIMD copy loops; perhaps it chooses among them for different situations? According to @profile, _copyto_impl! is the one used here.

Meanwhile, the for loop version is just a single scalar-register copy loop.
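For anyone wanting to reproduce the inspection, here is a minimal sketch (the helper names bcopy! and loopcopy! are mine, not from Base):

```julia
using InteractiveUtils  # provides @code_native outside the REPL

bcopy!(b, a) = (b .= a; b)          # broadcast path, dispatches into copyto!
function loopcopy!(b, a)
    for j in eachindex(a, b)
        b[j] = a[j]                 # plain scalar copy loop
    end
    b
end

a = rand(100); b = zeros(100)
@code_native debuginfo=:none bcopy!(b, a)     # long listing: SIMD loops + copyto! machinery
@code_native debuginfo=:none loopcopy!(b, a)  # short listing: one scalar loop
```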

@BioTurboNick
Contributor Author

BioTurboNick commented May 28, 2022

Here I'm comparing two changes: removing the copyto! call altogether (falling back to the @simd loop), versus testing for length ≤ 8 and using a plain for loop only in that case.

@inline function copyto!(dest::AbstractArray, bc::Broadcasted{Nothing})
    axes(dest) == axes(bc) || throwdm(axes(dest), axes(bc))
    # Performance optimization: broadcast!(identity, dest, A) is equivalent to copyto!(dest, A) if indices match
    if bc.f === identity && bc.args isa Tuple{AbstractArray} # only a single input argument to broadcast!
        A = bc.args[1]
        if axes(dest) == axes(A)
            if length(A) > 8
                return copyto!(dest, A)
            else
                for i ∈ eachindex(A)
                    dest[i] = A[i]
                end
                return dest
            end
        end
    end
    bc′ = preprocess(dest, bc)
    # Performance may vary depending on whether `@inbounds` is placed outside the
    # for loop or not. (cf. https://github.com/JuliaLang/julia/issues/38086)
    @inbounds @simd for I in eachindex(bc′)
        dest[I] = bc′[I]
    end
    return dest
end

Still some overhead from the broadcast, but better scaling

[benchmark plot]

And for copying into an array column:
[benchmark plot]

@N5N3
Member

N5N3 commented May 29, 2022

IIRC, only Array's Base.copyto! is based on memmove.
So optimization on the copyto! side seems the better choice?
(BTW, a pure for-loop is not correct when dest and A have overlapping memory.)
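A minimal illustration of that aliasing hazard (broadcast unaliases its arguments in preprocess, a naive loop does not):

```julia
a = collect(1:5)
dest = view(a, 2:5)
src  = view(a, 1:4)

# Naive loop: each write clobbers the element the next read needs,
# so the first value propagates through the whole range.
for i in eachindex(src)
    dest[i] = src[i]
end
a  # == [1, 1, 1, 1, 1]; `dest .= src` would instead give [1, 1, 2, 3, 4]
```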

@BioTurboNick
Contributor Author

@N5N3 - going back to your original observation: if I decorate all the copyto! methods and _copyto_impl! with @inline, I get better performance; memcpy now appears in the code_native output. It's still not quite as good as a simple for loop, but it removes some of the broadcast overhead (~30%).

Though I suspect there are reasons those functions aren't already marked @inline?

@brenhinkeller
Sponsor Contributor

cf. also #28126

@BioTurboNick
Contributor Author

BioTurboNick commented Feb 14, 2024

Thought I'd check in on how this is doing, and this is interesting:
[benchmark plot]

In 1.11-DEV.1597, the simple for loop now has a somewhat erratic performance profile and shows a cache effect around length 64 (the spike). It looks a lot like the SIMD broadcasting profile shown above, so I presume Julia is better at using SIMD here than it used to be.

The 1.11-DEV.1597 for loop seems to share a worst-case bound with broadcasting, and it is still strictly faster for most inputs, but at the low end it is slower than the 1.10 for loop. I notice the benchmark reports 4 allocations in 1.11-DEV.1597 versus only 2 in 1.10, so that could be the source.

@BioTurboNick
Contributor Author

BioTurboNick commented Feb 14, 2024

The extra time seems to be spent in this line, which is new in 1.11:

@boundscheck ult_int(bitcast(UInt, sub_int(i, 1)), bitcast(UInt, length(A))) || throw_boundserror(A, (i,))

I'll make a new issue for this specific case.

@oscardssmith
Member

The 4 allocations vs. 2 is just because the Memory and the Array are no longer a single allocation.
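That split is visible from the REPL; a quick sketch (field names reflect 1.11's internal layout, which is subject to change):

```julia
# Julia 1.11+: an Array is now a small wrapper object holding a MemoryRef
# into a separately allocated Memory buffer, hence two heap allocations
# where 1.10 had one.
v = zeros(4)
v.ref       # MemoryRef{Float64} pointing into the buffer
v.ref.mem   # the underlying Memory{Float64} that owns the data
```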
