Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

changed broadcast! into bitarray algorithm #32048

Merged
merged 9 commits into from
Nov 4, 2023

Conversation

chethega
Copy link
Contributor

Cf https://discourse.julialang.org/t/broadcast-vs-slow-performance-allocations/24259/6 for some more discussion and #32047 for the question of validity in view of exceptions.

@mbauman is this valid with respect to eachindex, etc?

Before:

julia> using BenchmarkTools, Random
julia> y=1; xsmall=[1]; Random.seed!(42); xlarge=rand(1:4, 100_003);
julia> @btime broadcast(==, $xsmall, $y); @btime  broadcast(==, $xlarge, $y); @show hash(broadcast(==, xlarge, y).chunks);
  860.500 ns (3 allocations: 4.31 KiB)
  152.971 μs (3 allocations: 16.59 KiB)
hash((broadcast(==, xlarge, y)).chunks) = 0xaa3b5a692968e128

After:

julia> @btime broadcast(==, $xsmall, $y); @btime  broadcast(==, $xlarge, $y); @show hash(broadcast(==, xlarge, y).chunks);
  65.466 ns (2 allocations: 128 bytes)
  42.927 μs (2 allocations: 12.41 KiB)
hash((broadcast(==, xlarge, y)).chunks) = 0xaa3b5a692968e128

Monkeypatch:

julia> @eval Base.Broadcast @inline function copyto!(dest::BitArray, bc::Broadcasted{Nothing})
           axes(dest) == axes(bc) || throwdm(axes(dest), axes(bc))
           ischunkedbroadcast(dest, bc) && return chunkedcopyto!(dest, bc)
           destc = dest.chunks
           ind = cind = 1
           bcp = preprocess(dest, bc)
           length(bcp)<=0 && return dest
           @inbounds for i = 0:Base.num_bit_chunks(length(bcp))-2
               z = UInt64(0)
               for j=0:63
                  z |= (bcp[i*64 + j + 1]::Bool) << (j&63)
               end
               destc[i+1] = z
           end
           i = Base.num_bit_chunks(length(bcp))-1
           z = UInt64(0)
           @inbounds for j=0:(length(bcp)-1)&63
                z |= (bcp[i*64 + j + 1]::Bool) << (j&63)
           end
           @inbounds destc[i+1] = z
           return dest
       end

@KristofferC KristofferC added the performance Must go faster label May 16, 2019
@fredrikekre fredrikekre requested a review from mbauman May 16, 2019 11:52
@fredrikekre fredrikekre added the broadcast Applying a function over a collection label May 16, 2019
@chethega
Copy link
Contributor Author

Ok, this does not work for cartesian indices. Multidimensional variants I tried:

julia> @eval Base.Broadcast @inline function copyto!(dest::BitArray, bc::Broadcasted{Nothing})
           axes(dest) == axes(bc) || throwdm(axes(dest), axes(bc))
           ischunkedbroadcast(dest, bc) && return chunkedcopyto!(dest, bc)
           destc = dest.chunks
           bcp = preprocess(dest, bc)
           length(destc)<=0 && return dest
           ea = eachindex(bcp)
           i1 = 1
           i2 = 0
           z = UInt64(0)
           @inbounds for idx in ea # = 0:Base.num_bit_chunks(length(bcp))-2
               z |= (bcp[idx]::Bool) << (i2&63)
               i2 += 1
               if (i2 & 63) == 0
                   destc[i1]=z
                   i2 = 0
                   i1 += 1
                   z = UInt64(0)
               end
           end
           i2 != 0 && @inbounds destc[i1] = z
           return dest
       end

and

julia> @eval Base.Broadcast @inline function copyto!(dest::BitArray, bc::Broadcasted{Nothing})
           axes(dest) == axes(bc) || throwdm(axes(dest), axes(bc))
           ischunkedbroadcast(dest, bc) && return chunkedcopyto!(dest, bc)
           destc = dest.chunks
           bcp = preprocess(dest, bc)
           length(bcp)<=0 && return dest
           ea = eachindex(bcp)
           idx = first(ea)
           @inbounds for i = 0:Base.num_bit_chunks(length(bcp))-2
               z = UInt64(0)
               for j=0:63
                  z |= (bcp[idx]::Bool) << (j&63)
                  idx = nextind(ea, idx)
               end
               destc[i+1] = z
           end
           i = Base.num_bit_chunks(length(bcp))-1
           z = UInt64(0)
           @inbounds for j=0:(length(bcp)-1)&63
                z |= (bcp[idx]::Bool) << (j&63)
                idx = nextind(ea, idx)
           end
           @inbounds destc[i+1] = z
           return dest
       end

None of these present an unambiguous win on my machine. So I restricted the new alg to bitvectors, for the time being. Any ideas how to get the fast variant for higher-dimensional arrays?

The best way would be to propagate IndexLinear information. The second best way would be to find a way to only cache 64 values in a register, instead of an array of Bools (I think the main perf reason for caching is to avoid memory-carried dependencies). Both of my above variants do this, but fail to consistently outperform the current implementation.

@mbauman
Copy link
Member

mbauman commented May 16, 2019

This is great — thanks. Definitely an improvement over the bitcache. And I do think it should be possible to get this fast for n-dimensional broadcasts, too, but it is indeed tricky. This gets us a bit closer:

    @inbounds for i in 1:Base.num_bit_chunks(length(bcp))-1
        z = UInt64(0)
        for j in 0:63
            z |= bcp[idx] << j
            (idx, s) = iterate(ea, s)
        end
        destc[i]=z
    end
    z = UInt64(0)
    for j in 0:63
        z |= bcp[idx] << j
        r = iterate(ea, s)
        r === nothing && break
        (idx, s) = r
    end
    destc[end]=z

But the trouble is that we're still introducing a branch with (idx, s) = iterate(ea, s) since that could return nothing and we can't really turn that off. We could potentially just use nextind instead of iterate, which I think should give us the performance we want.

@chethega
Copy link
Contributor Author

If we somehow managed to get linear indexing working for common constructions, then I think that we could maybe remove the bitcache logic entirely (also from bitarray constructors), replacing it by caching a single UInt64 (reducing complexity and linecount, yay!). If linear indexing is genuinely unsupported, then the current slight perf regression for that should not matter (would need benchmarking). And linear indexing would be a giant win for more important broadcast operations than bitarrays, cf. 32051 for a 20 x speedup of arr .+ const between N and 1 x N sizes.

julia> @eval Base.Broadcast @inline function copyto!(dest::BitArray, bc::Broadcasted{Nothing})
           axes(dest) == axes(bc) || throwdm(axes(dest), axes(bc))
           ischunkedbroadcast(dest, bc) && return chunkedcopyto!(dest, bc)
           destc = dest.chunks
           bcp = preprocess(dest, bc)
           length(destc)<=0 && return dest
           ea = eachindex(bcp)
           i1 = 1
           i2 = 0
           z = UInt64(0)
           @inbounds for idx in ea
               z |= (bcp[idx]::Bool) << (i2&63)
               i2 += 1
               if i2 == 64
                   destc[i1]=z
                   i2 = 0
                   i1 += 1
                   z = UInt64(0)
               end
           end
           i2 != 0 && @inbounds destc[i1] = z
           return dest
       end

base/broadcast.jl Outdated Show resolved Hide resolved
@vtjnash vtjnash added the merge me PR is reviewed. Merge when all tests are passing label Oct 31, 2023
base/broadcast.jl Outdated Show resolved Hide resolved
@vtjnash vtjnash merged commit f3ae44c into JuliaLang:master Nov 4, 2023
2 checks passed
@giordano giordano removed the merge me PR is reviewed. Merge when all tests are passing label Nov 4, 2023
N5N3 added a commit to N5N3/julia that referenced this pull request Dec 21, 2023
N5N3 added a commit to N5N3/julia that referenced this pull request Jan 4, 2024
N5N3 added a commit that referenced this pull request Jan 5, 2024
…52736)

Follows #32048.
This PR fully avoids the allocation thus make nd logical broadcast
better scaled for small inputs.

---------

Co-authored-by: Matt Bauman <mbauman@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
broadcast Applying a function over a collection performance Must go faster
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants