Broadcasting is much slower than a for loop #28126

YingboMa · 2018-07-15T20:13:27Z

Here is a minimal working example.

julia> using BenchmarkTools

julia> function foo(a::Vector{T}, b::Vector{T}, c::Vector{T}, d::Vector{T}, e::Vector{T}) where T
           @. a = b + 0.1 * (0.2c + 0.3d + 0.4e)
           nothing
       end
foo (generic function with 1 method)

julia> function goo(a::Vector{T}, b::Vector{T}, c::Vector{T}, d::Vector{T}, e::Vector{T}) where T
           @assert length(a) == length(b) == length(c) == length(d) == length(e)
           @inbounds for i in eachindex(a)
               a[i] = b[i] + 0.1 * (0.2c[i] + 0.3d[i] + 0.4e[i])
           end
           nothing
       end
goo (generic function with 1 method)

julia> a,b,c,d,e=(rand(1000) for i in 1:5)
Base.Generator{UnitRange{Int64},getfield(Main, Symbol("##9#10"))}(getfield(Main, Symbol("##9#10"))(), 1:5)

julia> @btime foo($a,$b,$c,$d,$e)
  1.277 μs (0 allocations: 0 bytes)

julia> @btime goo($a,$b,$c,$d,$e)
  345.568 ns (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 0.7.0-beta2.12
Commit a878341 (2018-07-15 15:57 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, skylake)
Environment:
  JULIA_PKG3_PRECOMPILE = 1

gasagna · 2018-07-15T21:19:15Z

Maybe related to #27988 ?

KristofferC · 2018-07-15T22:05:33Z

There is no inference problem, the inbounds loop manages to simd while the broadcast doesn't. Removing the inbounds and they are equally fast.

YingboMa · 2018-07-15T22:07:12Z

No, I don't think so. After applying this diff and rebuilding Julia, I got the same result.

julia> @btime foo($a,$b,$c,$d,$e)
  1.274 μs (0 allocations: 0 bytes)

julia> @btime goo($a,$b,$c,$d,$e)
  348.897 ns (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 0.7.0-beta2.12
Commit a878341* (2018-07-15 15:57 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, skylake)
Environment:
  JULIA_PKG3_PRECOMPILE = 1

jebej · 2018-07-15T22:17:32Z

Don't dot-ops do operations inbounds?

YingboMa · 2018-07-15T22:27:44Z

Maybe related to #22255?

julia> using BenchmarkTools

julia> a,b,c,d,e=(rand(1000) for i in 1:5)
Base.Generator{UnitRange{Int64},getfield(Main, Symbol("##5#6"))}(getfield(Main, Symbol("##5#6"))(), 1:5)

julia> function foo(a::Vector{T}, b::Vector{T}, c::Vector{T}, d::Vector{T}) where T
           @. a = b + 0.2c + 0.3d
           nothing
       end
foo (generic function with 1 method)

julia> function goo(a::Vector{T}, b::Vector{T}, c::Vector{T}, d::Vector{T}) where T
           @assert length(a) == length(b) == length(c) == length(d)
           @inbounds for i in eachindex(a)
               a[i] = b[i] + 0.2c[i] + 0.3d[i]
           end
           nothing
       end
goo (generic function with 1 method)

julia> function foo(a::Vector{T}, b::Vector{T}, c::Vector{T}, d::Vector{T}, e::Vector{T}) where T
           @. a = b + 0.2c + 0.3d + 0.4e
           nothing
       end
foo (generic function with 2 methods)

julia> function goo(a::Vector{T}, b::Vector{T}, c::Vector{T}, d::Vector{T}, e::Vector{T}) where T
           @assert length(a) == length(b) == length(c) == length(d) == length(e)
           @inbounds for i in eachindex(a)
               a[i] = b[i] + 0.2c[i] + 0.3d[i] + 0.4e[i]
           end
           nothing
       end
goo (generic function with 2 methods)

julia> @btime foo($a,$b,$c,$d)
  168.767 ns (0 allocations: 0 bytes)

julia> @btime goo($a,$b,$c,$d)
  166.587 ns (0 allocations: 0 bytes)

julia> @btime foo($a,$b,$c,$d,$e)
  1.135 μs (0 allocations: 0 bytes)

julia> @btime goo($a,$b,$c,$d,$e)
  342.142 ns (0 allocations: 0 bytes)

Keno · 2018-07-15T23:42:20Z

Looks like LLVM is having some trouble with the array bounds in the broadcast example:

remark: <unknown>:0:0: loop not vectorized: cannot identify array bounds

Keno · 2018-07-15T23:54:54Z

If I'm reading this correctly, the array pointer load is not getting LICM'ed. @vtjnash was looking at something related recently:

  %171 = load double*, double* addrspace(11)* %152, align 8
  %172 = getelementptr inbounds double, double* %171, i64 %value_phi1517.us.us
  store double %170, double* %172, align 8

Keno · 2018-07-15T23:56:58Z

Actually, my bad, I accidentally dropped TBAA.

Keno · 2018-07-16T00:08:41Z

Ok, the array bounds are still the issue though. Not sure why LLVM is unable to find them. We should probably fix that. However, there is a larger issue here. We go through all this trouble on the julia side to unalias the destination array from any of the arguments, but we don't actually tell LLVM that we did that, so it tries to insert a dynamic memory range check for which it then can't find the bounds. The better fix here is probably to figure out how to inform LLVM that the pointers don't alias.

ChrisRackauckas · 2018-07-16T02:57:16Z

the inbounds loop manages to simd while the broadcast doesn't

If this is the case, why? Why not assert that the sizes work and then just @inbounds @simd (or let it auto-vectorize)?

ChrisRackauckas · 2018-07-16T03:12:10Z

Maybe related to #22255?

It's not #22255 because if it was then the reduced size example would have no regression and the larger size example would have allocations.

chriselrod · 2018-07-16T16:03:15Z

If this is the case, why? Why not assert that the sizes work and then just @inbounds @simd (or let it auto-vectorize)?

I don't know the details of how broadcasting is implemented (and all the LLVM interfacing is way beyond me), but as Keno elaborated above, LLVM isn't aware that the array's aren't aliased while broadcasting, and therefore doesn't auto-vectorize (if b[i+1] refers to the same element as a[i] in @. a = b + 0.1 * (0.2c + 0.3d + 0.4e), you'd obviously have to calculate each a[i] in order, instead of in SIMD chunks).

ChrisRackauckas · 2018-07-18T05:37:45Z

Would #19658 be a good solution?

chriselrod · 2018-07-18T09:49:07Z

I think having @assert_nolias would be pretty cool. I struggled with getting the compiler to vectorize while messing with matmul.
But, hopefully it'll be unnecessary for plain old array data types. In there Keno says:

Eventually we may want to introduce more first class support for non-aliasing arrays (which array almost are, were it not for sharing by reshaping etc).

I think that's why we now have reinterpret and reshaped arrays, to try and make it so whenever
isa(a, Array) && isa(b, Array) && !(a === b)
they'd be guaranteed not to alias, and no runtime checks are needed.

I thought we were already there, but I realized unsafe_wrap still creates a plain old array. To test usnafe_wrap, and whether alias checks are happening...

julia> function foo(a::Vector{T}, b::Vector{T}, c::Vector{T}, d::Vector{T}, e::Vector{T}) where T
           @. a = b + 0.2c + 0.3d + 0.4e
           nothing
       end
foo (generic function with 1 method)

julia> function goo(a::Vector{T}, b::Vector{T}, c::Vector{T}, d::Vector{T}, e::Vector{T}) where T
           @assert length(a) == length(b) == length(c) == length(d) == length(e)
           @inbounds for i in eachindex(a)
               a[i] = b[i] + 0.2c[i] + 0.3d[i] + 0.4e[i]
           end
           nothing
       end
goo (generic function with 1 method)

julia> c = randn(80);  d = randn(80);  e = randn(80);

julia> x1 = randn(81);

julia> x2 = copy(x1); x3 = copy(x1); #x3 backup, if we want to double check answers later

julia> px1 = pointer(x1); px2 = pointer(x2);

julia> a1 = unsafe_wrap(Vector{Float64}, px1 + 8, 80); # a[i] = b[i+1]

julia> a2 = unsafe_wrap(Vector{Float64}, px2 + 8, 80);

julia> b1 = unsafe_wrap(Vector{Float64}, px1, 80);

julia> b2 = unsafe_wrap(Vector{Float64}, px2, 80);

julia> foo(a1, b1, c, d, e)

julia> goo(a2, b2, c, d, e)

julia> a1'
1×80 LinearAlgebra.Adjoint{Float64,Array{Float64,1}}:
 0.805164  -0.134631  -0.696007  -0.39507  0.139368  0.209027  0.992377  …  -0.288711  -0.417188  -0.330905  -0.518275  -0.0962622  0.0562025

julia> a2'
1×80 LinearAlgebra.Adjoint{Float64,Array{Float64,1}}:
 0.805164  -0.134631  -0.696007  -0.39507  0.139368  0.209027  0.992377  …  -0.288711  -0.417188  -0.330905  -0.518275  -0.0962622  0.0562025

but they agreed!

Benchmarking also supports that they got caught by a runtime check:

julia> using BenchmarkTools

julia> @btime foo($a1, $b1, $c, $d, $e)
  304.583 ns (0 allocations: 0 bytes)

julia> @btime goo($a2, $b2, $c, $d, $e)
  284.393 ns (0 allocations: 0 bytes)

julia> a3 = randn(80); b3 = randn(80);

julia> @btime foo($a3, $b3, $c, $d, $e)
  89.477 ns (0 allocations: 0 bytes)

julia> @btime goo($a3, $b3, $c, $d, $e)
  24.370 ns (0 allocations: 0 bytes)

julia> function goo2(a::Vector{T}, c::Vector{T}, d::Vector{T}, e::Vector{T}) where T
           @assert length(a) - 1 == length(c) == length(d) == length(e)
           # Make aliasing obvious
           @inbounds for i in eachindex(c)
               a[i+1] = a[i] + 0.2c[i] + 0.3d[i] + 0.4e[i]
           end
           nothing
       end
goo2 (generic function with 1 method)

julia> a4 = randn(81);

julia> @btime goo2($a4, $c, $d, $e)
  192.453 ns (0 allocations: 0 bytes)

The clearly aliasing version was much faster (195ns vs 285ns) than the disguised alias. Both much slower of course than no aliasing.

What all is left, before the compiler can be promised different arrays never alias? Will that happen, and how much does it matter (ie, what is the cost of runtime checking)?

vtjnash · 2018-07-18T13:53:10Z

Would #19658 be a good solution?

I'm not sure – this used to roughly be the meaning of the @simd macro, but we recently removed it since we found that @no_alias also essentially ends up asserting that no data may be read from the destination as well ("self-aliasing"), and there didn't seem to be a good way to express that contract in general. This was OK for the basic Array type, but was breaking generic code when it happened to get passed a different AbstractArray, such as a BitArray.

KristofferC · 2018-07-20T10:36:56Z

I think that's why we now have reinterpret and reshaped arrays, to try and make it so whenever
isa(a, Array) && isa(b, Array) && !(a === b)
they'd be guaranteed not to alias, and no runtime checks are needed.

julia> a = rand(2,2); b = vec(a);

julia> isa(a, Array) && isa(b, Array) && !(a === b)
true

julia> a[1,1] = 2
2

julia> b[1]
2.0

chriselrod · 2018-07-20T11:05:14Z

That's a much less convoluted example than the unsafe_wraps.
I also realize now:

julia> x = rand(2,2,2);

julia> reshape(x,2,4)
2×4 Array{Float64,2}:
 0.575181  0.154845  0.845574  0.772895
 0.722177  0.329466  0.258449  0.718844

I realize now the ReshapedArray type only applies to AbstractArrays?

I was thinking of this comment: https://discourse.julialang.org/t/use-of-pointer/8849/9
so I thought there was an effort to disallow aliasing in that way.
Any good issues/prs that show some discussion?
I don't know all these low level / compiler details, but am trying to get some grasp of what's going on.

AshtonSBradley · 2019-10-19T10:37:28Z

maybe @stevengj has some ideas about this?

giordano · 2020-10-05T22:47:51Z

Interestingly enough, when the vectors are much larger (10k x those used by @YingboMa in the original benchmarks), broadcasting and for loop are equally fast (or slow?):

julia> a,b,c,d,e=(rand(10_000_000) for i in 1:5)
Base.Generator{UnitRange{Int64}, var"#28#29"}(var"#28#29"(), 1:5)

julia> @btime foo($a,$b,$c,$d,$e)
  32.563 ms (0 allocations: 0 bytes)

julia> @btime goo($a,$b,$c,$d,$e)
  32.485 ms (0 allocations: 0 bytes)

YingboMa · 2020-10-06T14:03:49Z

Yeah, when the data is large, then the performance is bounded by memory not compute.

chriselrod · 2020-10-06T15:50:06Z

You can see this by comparing with benchmarks of shorter vectors:

julia> a,b,c,d,e=(rand(10_000_000) for i in 1:5);

julia> @btime foo($a,$b,$c,$d,$e)
  28.240 ms (0 allocations: 0 bytes)

julia> @btime goo($a,$b,$c,$d,$e)
  25.086 ms (0 allocations: 0 bytes)

julia> a,b,c,d,e=(rand(512) for i in 1:5);

julia> @btime foo($a,$b,$c,$d,$e)
  648.171 ns (0 allocations: 0 bytes)

julia> @btime goo($a,$b,$c,$d,$e)
  73.751 ns (0 allocations: 0 bytes)

julia> ((28.24e-3, 25.086e-3) ./ (648.171e-9, 73.751e-9)) ./ (10_000_000 / 512)
(2.230719979758428, 17.41540046914618)

So by shortening the vectors from 10 million to 512 elements, I get 2x and 17x faster results relative to the number of elements.

giordano · 2020-10-06T16:11:59Z

Yeah, after writing the message I realised that I was getting a very bad scaling compared the initial 1000-element example (10k x more elements but 22k x slower, so the same 2x factor you're seeing), but didn't think that performance was actually bound by memory. I should have probably have gone to sleep yesterday night instead of doing dumb benchmarks 😆

ViralBShah · 2022-03-11T22:14:59Z

@YingboMa Close?

YingboMa · 2022-03-11T22:19:26Z

This is not resolved.

chriselrod · 2022-03-12T15:34:09Z

To resolve this issue for most cases, we'd need to take a FastBroadcast-like approach of generating at least two versions of the loop explicitly, so that fast code can be generated for at least 1 case.

YingboMa · 2022-03-13T02:17:22Z

That's what #30973 does.

vtjnash · 2023-08-24T19:24:41Z

julia> @btime foo($a,$b,$c,$d,$e)
  313.348 ns (0 allocations: 0 bytes)

julia> @btime goo($a,$b,$c,$d,$e)
  305.598 ns (0 allocations: 0 bytes)

julia> @btime foo($a,$b,$c,$d)
  142.696 ns (0 allocations: 0 bytes)

julia> @btime goo($a,$b,$c,$d)
  141.994 ns (0 allocations: 0 bytes)

julia> @btime foo($a,$b,$c,$d,$e)
  337.626 ns (0 allocations: 0 bytes)

julia> @btime goo($a,$b,$c,$d,$e)
  330.529 ns (0 allocations: 0 bytes)

jishnub · 2023-08-25T05:06:47Z

Is this fixed on v1.10 as well, or is scheduled to be backported?

EDIT: it seems to be fixed.

julia> using BenchmarkTools

julia> function foo(a::Vector{T}, b::Vector{T}, c::Vector{T}, d::Vector{T}, e::Vector{T}) where T
           @. a = b + 0.1 * (0.2c + 0.3d + 0.4e)
           nothing
       end
foo (generic function with 1 method)

julia> function goo(a::Vector{T}, b::Vector{T}, c::Vector{T}, d::Vector{T}, e::Vector{T}) where T
           @assert length(a) == length(b) == length(c) == length(d) == length(e)
           @inbounds for i in eachindex(a)
               a[i] = b[i] + 0.1 * (0.2c[i] + 0.3d[i] + 0.4e[i])
           end
           nothing
       end
goo (generic function with 1 method)

julia> a,b,c,d,e=(rand(1000) for i in 1:5)
Base.Generator{UnitRange{Int64}, var"#1#2"}(var"#1#2"(), 1:5)

julia> @btime foo($a,$b,$c,$d,$e)
  406.280 ns (0 allocations: 0 bytes)

julia> @btime goo($a,$b,$c,$d,$e)
  404.220 ns (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.10.0-beta2
Commit a468aa198d0 (2023-08-17 06:27 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × 11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, tigerlake)
  Threads: 1 on 8 virtual cores
Environment:
  LD_LIBRARY_PATH = :/usr/lib/x86_64-linux-gnu/gtk-3.0/modules
  JULIA_EDITOR = subl

chriselrod · 2023-08-28T10:46:53Z

Looking at FastBroadcast.jl's vectorization tests (foo uses FastBroadcast, bar uses Base broadcast):

julia> using MuladdMacro, FastBroadcast

julia> function foo9(a, b, c, d, e, f, g, h, i)
           @muladd @.. a = b + 0.1 * (0.2c + 0.3d + 0.4e + 0.5f + 0.6g + 0.6h + 0.6i)
           nothing
       end
foo9 (generic function with 1 method)

julia> function foo26(a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z)
           @muladd @.. a =
               b +
               0.1 * (
                   0.2c +
                   0.3d +
                   0.4e +
                   0.5f +
                   0.6g +
                   0.6h +
                   0.6i +
                   0.6j +
                   0.6k +
                   0.6l +
                   0.6m +
                   0.6n +
                   0.6o +
                   0.6p +
                   0.6q +
                   0.6r +
                   0.6s +
                   0.6t +
                   0.6u +
                   0.6v +
                   0.6w +
                   0.6x +
                   0.6y +
                   0.6z
               )
           nothing
       end
foo26 (generic function with 1 method)

julia> function bar9(a, b, c, d, e, f, g, h, i)
           @muladd @. a = b + 0.1 * (0.2c + 0.3d + 0.4e + 0.5f + 0.6g + 0.6h + 0.6i)
           nothing
       end
bar9 (generic function with 1 method)

julia> function bar26(a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z)
           @muladd @. a =
               b +
               0.1 * (
                   0.2c +
                   0.3d +
                   0.4e +
                   0.5f +
                   0.6g +
                   0.6h +
                   0.6i +
                   0.6j +
                   0.6k +
                   0.6l +
                   0.6m +
                   0.6n +
                   0.6o +
                   0.6p +
                   0.6q +
                   0.6r +
                   0.6s +
                   0.6t +
                   0.6u +
                   0.6v +
                   0.6w +
                   0.6x +
                   0.6y +
                   0.6z
               )
           nothing
       end
bar26 (generic function with 1 method)

julia> a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z =
           [rand(100, 100, 2) for i = 1:26];

julia> foo9(a, b, c, d, e, f, g, h, i)

julia> foo26(a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z)

julia> bar9(a, b, c, d, e, f, g, h, i)

julia> bar26(a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z)

julia> @benchmark foo9($a, $b, $c, $d, $e, $f, $g, $h, $i)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  44.698 μs … 100.563 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     45.143 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   45.254 μs ± 727.554 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

        ▁▄▆▇█▆▃▁                                                
  ▂▂▂▃▄▆████████▇▅▄▄▃▃▃▂▃▃▂▂▂▂▂▂▂▂▂▁▁▂▂▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  44.7 μs         Histogram: frequency by time         47.2 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark bar9($a, $b, $c, $d, $e, $f, $g, $h, $i)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  88.260 μs … 105.204 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     88.657 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   88.839 μs ± 651.841 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▂▄▅▇▇██▇▇▆▄▂                               ▁    ▁▁▂▂▂▂▂▂▁▁ ▃
  ▆██████████████▇▅▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▃▅▄▇▇▇▇████▇████████████ █
  88.3 μs       Histogram: log(frequency) by time      91.1 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark foo26($a, $b, $c, $d, $e, $f, $g, $h, $i, $j, $k, $l, $m, $n, $o, $p, $q, $r, $s, $t, $u, $v, $w, $x, $y, $z)
BenchmarkTools.Trial: 6248 samples with 1 evaluation.
 Range (min … max):  154.994 μs … 175.292 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     156.067 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   156.352 μs ±   1.123 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

              ▁▃▇█▆▂          ▁▃▃▃▃▂▁                           ▁
  ▃▃▄▆███▇▇▇▇▇██████▇▇▇▅▅▅▅▄▅▅█████████▇███▇▆▅▅▄▄▃▅▃▁▁▁▁▁▃▁▁▆█▇ █
  155 μs        Histogram: log(frequency) by time        159 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark bar26($a, $b, $c, $d, $e, $f, $g, $h, $i, $j, $k, $l, $m, $n, $o, $p, $q, $r, $s, $t, $u, $v, $w, $x, $y, $z)
BenchmarkTools.Trial: 1994 samples with 1 evaluation.
 Range (min … max):  494.945 μs … 524.828 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     496.478 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   497.447 μs ±   1.812 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

        ▁▄▄██▆▄▁                              ▁▃▃▂▅ ▂▁           
  ▂▂▃▃▆▇█████████▆▅▃▂▂▂▂▁▁▁▁▂▂▂▂▂▂▂▃▃▃▃▃▃▄▅▆▆█████████▇▄▅▄▃▃▂▂▂ ▄
  495 μs           Histogram: frequency by time          500 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

EDIT:
Just to confirm, specifying broadcast=true (so that it actually has the same semantics as base broadcasting) doesn't hurt performance:

julia> function foo26(a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z)
           @muladd @.. broadcast=true a =
               b +
               0.1 * (
                   0.2c +
                   0.3d +
                   0.4e +
                   0.5f +
                   0.6g +
                   0.6h +
                   0.6i +
                   0.6j +
                   0.6k +
                   0.6l +
                   0.6m +
                   0.6n +
                   0.6o +
                   0.6p +
                   0.6q +
                   0.6r +
                   0.6s +
                   0.6t +
                   0.6u +
                   0.6v +
                   0.6w +
                   0.6x +
                   0.6y +
                   0.6z
               )
           nothing
       end
foo26 (generic function with 1 method)

julia> function foo9(a, b, c, d, e, f, g, h, i)
           @muladd @.. broadcast=true a = b + 0.1 * (0.2c + 0.3d + 0.4e + 0.5f + 0.6g + 0.6h + 0.6i)
           nothing
       end
foo9 (generic function with 1 method)

julia> @benchmark foo9($a, $b, $c, $d, $e, $f, $g, $h, $i)BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  44.795 μs …  55.059 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     45.219 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   45.327 μs ± 479.755 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▃▄▅▆▇█████▇▆▅▄▄▃▃▂▂▁▂▁▂▁▁                      ▁▁▁▁▁  ▁▁   ▃
  ▆███████████████████████████▇▇▅▆▁▃▁▃▁▁▃▁▄▁▃▅▅▆▇████████████▇ █
  44.8 μs       Histogram: log(frequency) by time      47.3 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark foo26($a, $b, $c, $d, $e, $f, $g, $h, $i, $j, $k, $l, $m, $n, $o, $p, $q, $r, $s, $t, $u, $v, $w, $x, $y, $z)
BenchmarkTools.Trial: 6227 samples with 1 evaluation.
 Range (min … max):  154.880 μs … 178.189 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     156.287 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   156.849 μs ±   1.892 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

     ▁ ▂█▄  ▁▄▃▁▁     ▁▂                                        ▁
  ▃▄███████▆██████▆▅▆▄██▃▃▃▆▇▅▄▄▄▁▁▃▃▁▃▁▁▁▁▁▁▁▁▁▃▁▁▁▄▅▆▅▆▅▅▅▅▇▆ █
  155 μs        Histogram: log(frequency) by time        168 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Meaning that this is an implementation deficiency in base broadcasting.

N5N3 · 2023-08-29T15:48:49Z

IIUC, FastBroadcast has a fast branch for non-broadcast cases (args with identity axes).
Once we try some "real" broadcasting, FastBroadcast might become (slightly) slower than Base.broadcast

function foo9(a, b, c, d, e, f, g, h, i)
    @. @fastmath a = b + 0.1 * (0.2c + 0.3d + 0.4e + 0.5f + 0.6g + 0.6h + 0.6i)
    nothing
end

function bar9(a, b, c, d, e, f, g, h, i)
    @.. broadcast = true @fastmath a = b + 0.1 * (0.2c + 0.3d + 0.4e + 0.5f + 0.6g + 0.6h + 0.6i)
    nothing
end

a, b, c, d, e, f, g, h, i = [rand(1024) for i = 1:26];

julia> @benchmark foo9($a, $[1.], $c, $d, $e, $f, $g, $h, $i)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.150 μs …  18.840 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.210 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.348 μs ± 506.534 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▅█▅▂▂▁ ▁    ▂▂▁▁▁ ▁   ▁  ▂ ▃  ▃  ▂  ▁                      ▂
  ██████████▇█████████▇████▅█▇██▅██▅█▆▅█▄▄▆▄▂▄▃▄▄▄▄▄▄▂▅▄▅▄▂▃▇ █
  1.15 μs      Histogram: log(frequency) by time      2.69 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark bar9($a, $[1.], $c, $d, $e, $f, $g, $h, $i)
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min … max):  2.422 μs … 340.733 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.600 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.969 μs ±   3.826 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▆▂█▃▃▂▅▃▂▁▂▁▁ ▁▁ ▁                ▁                         ▁
  ████████████████▇█▇▅▇▅█▅█▅█▅▅█▅█▅▃█▄▃▇▃▄▃▃▄▄▂▅▄▄▄▄▄▄▃▃▄▄▅▃▅ █
  2.42 μs      Histogram: log(frequency) by time      7.16 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Compared with the result before the LLVM upgrading and pass tuning, now LLVM does generate vectorization code even there are more than 3 dynamic axes1.
But apparently, it still not vectorize all axes1 combinations to avoid possible code bloat. And the vectorization might fail for high dimension cases.

vtjnash · 2023-08-29T15:55:57Z

Thanks N5N3 for checking

giordano · 2023-08-29T16:09:01Z

Is there a way to make sure this doesn't regress again? An llvm test for example?

chriselrod · 2023-08-29T16:42:13Z

IIUC, FastBroadcast has a fast branch for non-broadcast cases (args with identity axes).
Once we try some "real" broadcasting, FastBroadcast might become (slightly) slower than Base.broadcast

I am aware. Getting worse performance in "real broadcasting" was a deliberate choice that I introduced myself, because I wanted to avoid excessive code generation/loop multiversioning by LLVM for cases I and the SciML ecosystem do not prioritize.

If we wanted to rice benchmarks at the cost of code generation, we could revert that and have FB dominate base broadcast's runtime performance.

chriselrod · 2023-08-29T16:52:49Z

@N5N3 the original issue title was "Broadcasting is much slower than a for loop".

julia> function foo9(a, b, c, d, e, f, g, h, i)
           @. @fastmath a = b + 0.1 * (0.2c + 0.3d + 0.4e + 0.5f + 0.6g + 0.6h + 0.6i)
           nothing
       end
foo9 (generic function with 1 method)

julia> function bar9(a, b, c, d, e, f, g, h, i)
           @.. broadcast = true @fastmath a = b + 0.1 * (0.2c + 0.3d + 0.4e + 0.5f + 0.6g + 0.6h + 0.6i)
           nothing
       end
bar9 (generic function with 1 method)

julia> function loo9(a, b, c, d, e, f, g, h, i)
           @fastmath for ii = eachindex(a) 
             a[ii] = b[1] + 0.1 * (0.2c[ii] + 0.3d[ii] + 0.4e[ii] + 0.5f[ii] + 0.6g[ii] + 0.6h[ii] + 0.6i[ii])
           end
       end
loo9 (generic function with 1 method)

julia> a, b, c, d, e, f, g, h, i = [rand(1024) for i = 1:26];

julia> @benchmark foo9($a, $[1.], $c, $d, $e, $f, $g, $h, $i)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.560 μs …  1.846 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.568 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.570 μs ± 15.111 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▁█▅                                                       
  ▂▃███▆▃▂▂▁▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂▁▂ ▂
  1.56 μs        Histogram: frequency by time        1.67 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark bar9($a, $[1.], $c, $d, $e, $f, $g, $h, $i)
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
 Range (min … max):  4.245 μs …   9.063 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.247 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.334 μs ± 541.530 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █ ▁▂                                                        ▁
  █▁██▆▃▄▃▃▃▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▄▃▁▃▄▄▄▁▃▄▄▄▄▅▄▄▄▄▄▄▅▅▅▆▆▇ █
  4.25 μs      Histogram: log(frequency) by time      8.03 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark loo9($a, $[1.], $c, $d, $e, $f, $g, $h, $i)
BenchmarkTools.Trial: 10000 samples with 195 evaluations.
 Range (min … max):  490.226 ns … 710.538 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     492.646 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   498.221 ns ±  13.150 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃█▇▅▂▃▂ ▂▄▄▃▂▄▅▃▂▁▂▁ ▁▃▄▃▃▁▁                                  ▂
  ███████▇████████████▇███████▆██▇█▅▅▇▆▆▆▆█▆▄▅█▇▃▅█▅▃▁▅▄▃▁▅▃▃▁▆ █
  490 ns        Histogram: log(frequency) by time        544 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

ChrisRackauckas · 2023-08-29T16:53:25Z

Reopening for regression tests.

IIUC, FastBroadcast has a fast branch for non-broadcast cases (args with identity axes).

I'd argue that Base.broadcast should get this fast path, as using broadcast for map operations is quite common and so getting a major performance hit in the standard case isn't great.

fredrikekre added the performance Must go faster label Jul 15, 2018

JeffBezanson added the domain:broadcast Applying a function over a collection label Jul 15, 2018

YingboMa mentioned this issue Jul 18, 2018

Not fully broadcasting methods SciML/OrdinaryDiffEq.jl#106

Closed

39 tasks

ChrisRackauckas mentioned this issue Jul 20, 2018

inference failure for broadcast implementations that use flatten (eg SparseArrays) #27988

Closed

JeffBezanson added the kind:regression Regression in behavior compared to a previous version label Jul 25, 2018

ChrisRackauckas mentioned this issue Jul 30, 2018

Minor performance regressions on v0.7 SciML/OrdinaryDiffEq.jl#459

Closed

ChrisRackauckas mentioned this issue Sep 5, 2018

RFC: Macro for expression noalias hints #19658

Open

mbauman mentioned this issue Sep 10, 2018

fix perf in exp(::Matrix) (for smallish matrices) #29116

Merged

cstjean mentioned this issue Dec 19, 2018

Broadcasting with missings is slow #30455

Closed

ChrisRackauckas mentioned this issue Dec 20, 2018

Many of the implicit SDE solvers fail when adaptive=true with no method matching +(::Array{Float64,1}, ::Float64) SciML/StochasticDiffEq.jl#123

Closed

ChrisRackauckas mentioned this issue Jan 23, 2019

DiffEqFlux Blog Post JuliaLang/www.julialang.org#274

Merged

JeffBezanson added this to the 1.x milestone Jan 23, 2019

ChrisRackauckas mentioned this issue Feb 3, 2019

RFC/WIP: Use macro to auto switch between for loop and broadcast SciML/OrdinaryDiffEq.jl#639

Closed

KristofferC mentioned this issue Feb 20, 2020

WIP: Backports release 1.0.6 #34011

Closed

56 tasks

ViralBShah changed the title ~~Broadcasting is much slower than a for loop on Julia 0.7~~ Broadcasting is much slower than a for loop Mar 11, 2022

DilumAluthge removed this from the 1.x milestone Mar 13, 2022

giordano mentioned this issue May 30, 2022

large broadcasting overhead on small subarray #45510

Closed

This was referenced Nov 19, 2022

Significant performance gap in assigning to a subsection of an array with and without broadcasting #40962

Open

Copying small arrays with broadcasting syntax should be faster #45487

Open

giordano mentioned this issue Feb 2, 2023

Inferior performance of inplace broadcasting over map! for multiple wide matrices #47873

Open

vtjnash closed this as completed Aug 24, 2023

chriselrod reopened this Aug 28, 2023

vtjnash closed this as completed Aug 29, 2023

ChrisRackauckas reopened this Aug 29, 2023

charleskawczynski mentioned this issue Jul 18, 2024

Leverage linear indexing for pointwise kernels CliMA/ClimaCore.jl#1889

Open

Broadcasting is much slower than a for loop #28126

Broadcasting is much slower than a for loop #28126

Comments

YingboMa commented Jul 15, 2018

gasagna commented Jul 15, 2018

KristofferC commented Jul 15, 2018

YingboMa commented Jul 15, 2018

jebej commented Jul 15, 2018

YingboMa commented Jul 15, 2018

Keno commented Jul 15, 2018

Keno commented Jul 15, 2018

Keno commented Jul 15, 2018

Keno commented Jul 16, 2018

ChrisRackauckas commented Jul 16, 2018

ChrisRackauckas commented Jul 16, 2018

chriselrod commented Jul 16, 2018 • edited Loading

ChrisRackauckas commented Jul 18, 2018

chriselrod commented Jul 18, 2018 • edited Loading

vtjnash commented Jul 18, 2018

KristofferC commented Jul 20, 2018

chriselrod commented Jul 20, 2018

AshtonSBradley commented Oct 19, 2019

giordano commented Oct 5, 2020

YingboMa commented Oct 6, 2020

chriselrod commented Oct 6, 2020

giordano commented Oct 6, 2020

ViralBShah commented Mar 11, 2022

YingboMa commented Mar 11, 2022

chriselrod commented Mar 12, 2022

YingboMa commented Mar 13, 2022

vtjnash commented Aug 24, 2023

jishnub commented Aug 25, 2023 • edited Loading

chriselrod commented Aug 28, 2023 • edited Loading

N5N3 commented Aug 29, 2023 • edited Loading

vtjnash commented Aug 29, 2023

giordano commented Aug 29, 2023

chriselrod commented Aug 29, 2023

chriselrod commented Aug 29, 2023 • edited Loading

ChrisRackauckas commented Aug 29, 2023

chriselrod commented Jul 16, 2018 •

edited

Loading

chriselrod commented Jul 18, 2018 •

edited

Loading

jishnub commented Aug 25, 2023 •

edited

Loading

chriselrod commented Aug 28, 2023 •

edited

Loading

N5N3 commented Aug 29, 2023 •

edited

Loading

chriselrod commented Aug 29, 2023 •

edited

Loading