Add shareindexes macro for (sometimes) faster iteration #15356

Status: Open. timholy wants to merge 1 commit into master from teh/shareindexes.
Conversation

timholy (Member) commented Mar 4, 2016

I'm preparing another couple of PRs that will expand the range of array iterators (and I'm planning even further expansion later on). I'm thinking ahead about loops that involve multiple arrays.

The problem

Currently we have two options:

for I in eachindex(A, B)
    # body
end

and

for (IA, IB) in zip(eachindex(A), eachindex(B))
    # body
end

My sense is that our eachindex(A, B) mechanism is too limited for the coming brave new world, and should gradually be retired. That means we should use zip. However, zip sometimes still has performance issues, some of which are unavoidable: when you have to increment two iterators rather than one, it's going to be slower, especially if incrementing is more expensive than the body. See benchmarks below.

This PR's solution

This PR introduces a new macro, @shareindexes, which takes a block like this:

@shareindexes for (IA,IB) in zip(eachindex(A), eachindex(B))
    @inbounds s += A[IA]*B[IB]
end

and turns it into

RA, RB = eachindex(A), eachindex(B)
if RA == RB
    for I in RA
        @inbounds s += A[I]*B[I]
    end
else
    for (IA,IB) in zip(RA, RB)
        @inbounds s += A[IA]*B[IB]
    end
end
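
For readers curious how such a rewrite can be expressed, here is a rough sketch of the macro (this is not the PR's actual implementation; it assumes the loop is written literally as for (I1, I2) in zip(it1, it2)):

macro shareindexes(loop)
    # Sketch only: assumes the loop has the exact form `for (I1, I2) in zip(it1, it2) ... end`
    @assert loop.head == :for
    iterspec = loop.args[1]            # :((I1, I2) = zip(it1, it2))
    body     = loop.args[2]
    vars     = iterspec.args[1].args   # [:I1, :I2]
    zipcall  = iterspec.args[2]        # :(zip(it1, it2))
    @assert zipcall.args[1] == :zip && length(vars) == 2
    it1, it2 = zipcall.args[2], zipcall.args[3]
    I1, I2 = map(esc, vars)
    quote
        R1, R2 = $(esc(it1)), $(esc(it2))
        if R1 == R2
            # single shared iterator: one increment per iteration
            for I in R1
                $I1 = I; $I2 = I
                $(esc(body))
            end
        else
            # fall back to the general zip-based loop
            for ($I1, $I2) in zip(R1, R2)
                $(esc(body))
            end
        end
    end
end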

An alternative that didn't work

I tried an alternative implementation based on

immutable Zip2Each{I1, I2} <: AbstractZipIterator
    same::Bool
    a::I1
    b::I2
end

and then defining start, done, and next functions that test same at runtime. Unfortunately, this was slower than just plain zip. If we could hoist that same check out of the entire block, then this solution would be just like the macro-based solution (but prettier).

Benchmarks

Using the functions mydot1, mydot2, and mydotshared from the new tests (with @simd added in front of the loop in mydot1 for good measure), and arrays of size (10001, 10000), here are the timings.

mydot1
  0.093147 seconds (5 allocations: 176 bytes)
  0.094556 seconds (5 allocations: 176 bytes)
  0.094084 seconds (5 allocations: 176 bytes)
mydot2
  0.115129 seconds (5 allocations: 176 bytes)
  0.796566 seconds (5 allocations: 176 bytes)
  0.161923 seconds (5 allocations: 176 bytes)
mydotshared
  0.114697 seconds (5 allocations: 176 bytes)
  0.180229 seconds (5 allocations: 176 bytes)
  0.161990 seconds (5 allocations: 176 bytes)

As in the tests, the first timing in each block is for a pair of LinearFast arrays, the second for a pair of LinearSlow arrays, and the third for one of each.

I'm not quite sure why mydotshared isn't as fast as mydot1, but I've verified that this also holds when I write out that function manually rather than using the macro. The noteworthy case is the one with two CartesianRange (LinearSlow) iterators, where the macro yields a ~4x speedup over plain zip.
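
For reference, the three benchmark functions correspond roughly to the following (a sketch reconstructed from the descriptions in this thread; the exact definitions live in the PR's new tests):

function mydot1(A, B)                  # shared index via eachindex(A, B), plus @simd
    s = 0.0
    @simd for I in eachindex(A, B)
        @inbounds s += A[I]*B[I]
    end
    s
end

function mydot2(A, B)                  # separate indexes via zip
    s = 0.0
    for (IA, IB) in zip(eachindex(A), eachindex(B))
        @inbounds s += A[IA]*B[IB]
    end
    s
end

function mydotshared(A, B)             # this PR's macro
    s = 0.0
    @shareindexes for (IA, IB) in zip(eachindex(A), eachindex(B))
        @inbounds s += A[IA]*B[IB]
    end
    s
end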

CC @JeffBezanson, @mbauman.

eschnett (Contributor) commented Mar 4, 2016

It seems as if this could be solved by "control inversion": instead of passing the iterator to the for loop, one could pass the for loop body (as a closure) to the iterator, which can then easily implement the if statement outside the loop. If closures are fast (are they?), this could be an option. The front-end would need to lower for loops to a call to Base.forloop.

I know this sounds radical and far-fetched, and I'm only mentioning this tongue-in-cheek. However, it would be quite a natural way to implement parallel loops, (partial) unrolling, etc. Of course, a macro has the same expressive power, and it's thus safer (for guaranteed performance) to use a macro than to rely on loop hoisting or common subexpression elimination.
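
For concreteness, a minimal sketch of what the callee side of that idea could look like (foreachindex is a hypothetical name, not anything in Base): the branch on matching index ranges runs once, outside the loop, and the loop just calls the body.

function foreachindex(f, A, B)
    RA, RB = eachindex(A), eachindex(B)
    if RA == RB
        # the sameness check happens once, out here
        for I in RA
            f(I, I)
        end
    else
        for (IA, IB) in zip(RA, RB)
            f(IA, IB)
        end
    end
    nothing
end

The call site is then a do-block, as in the next comment; whether this is actually fast hinges on the closure being inlined and on any captured accumulator staying type-stable (see the inference issue raised below).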

mbauman (Member) commented Mar 4, 2016

That'd definitely be worth a test, @eschnett. It doesn't need a complete front-end re-write:

eachindex(A, B, ...) do (IA, IB, ...)
    # body
end

Brave new world, indeed!

Edit: I suppose this will only work if the body is small enough to get inlined. I'll look at this closer later, Tim, but a huge 👍 on the direction here.

timholy (Member, Author) commented Mar 4, 2016

Great suggestion, @eschnett. I will test this (if @mbauman doesn't beat me to it). As @mbauman says, I suspect control over inlining may be necessary. I did notice that

function foo(A)
    @inline function inner(i)
        body
    end
    loop
end

is valid syntax. However, I am also noticing that setting external variables from a closure has inference problems. That is,

getith = i -> A[i]

is fast, but

addith = i -> (@inbounds s += A[i])

is slow.
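
One possible workaround for the captured-s problem, purely as a sketch (foldindexes and mydot_fold are illustrative names, not from this PR): thread the accumulator through return values instead of mutating a captured variable.

function foldindexes(f, v0, A, B)
    RA, RB = eachindex(A), eachindex(B)
    v = v0
    if RA == RB
        for I in RA
            v = f(v, I, I)      # accumulator is passed in and returned, never captured
        end
    else
        for (IA, IB) in zip(RA, RB)
            v = f(v, IA, IB)
        end
    end
    v
end

function mydot_fold(A, B)
    foldindexes(0.0, A, B) do s, IA, IB
        @inbounds ab = A[IA]*B[IB]
        s + ab
    end
end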

@@ -341,3 +341,73 @@ function collect{I<:IteratorND}(g::Generator{I})
dest[1] = first
return map_to!(g.f, 2, st, dest, g.iter)
end

"""
`@shareindexes` may improve the efficiency of loops that have multiple indexes.

Review comment (Contributor):

should make its way to the rst docs if we go with this

tkelman (Contributor) commented Mar 5, 2016

also ref #12902

@@ -1430,6 +1430,7 @@ export
@inbounds,
@fastmath,
@simd,
@shareindexes,

Review comment (Contributor):

That's a good opportunity to revive the "indices" vs. "indexes" discussion.

timholy (Member, Author) replied:

FYI re #12902 I'm OK with either outcome. I was originally surprised by indexes but have grown fond of it.

Since it looks like the closure solution might not be feasible for a while, we might need the macro. Should I simply rename this to @shareindex to circumvent the debate?

Review comment (Member):

Maybe make it more general? @shareiterators? There's nothing inherently index-y about this macro right now... besides the current use-case.

timholy (Member, Author) replied:

Though what's really being shared is the state, not the iterator.

timholy (Member, Author) added:

Well, I guess you could say the iterator is being shared too. So I think we could go with that name.

nalimilan (Member) commented:

I don't understand why eachindex(A, B) is too limited. Couldn't it return a wrapper iterator (like zip does) that does whatever is needed, with a fast path for when a single index can be used for both arrays? I must say I don't really like the idea that standard for loops couldn't be used for efficient generic array indexing. That's one of their main uses!

timholy (Member, Author) commented Mar 5, 2016

It wasn't very clear, but what I meant is that having eachindex return a single index that is magically supposed to be good for both arrays is too limited.

nalimilan (Member) commented:

Got it. Could we imagine writing this instead?

for (IA, IB) in eachindex(A, B)
    # body
end

That would allow the iterator to return the same index twice if possible.

(Yes, this would be a breaking change, but we could find another name for this function.)

timholy (Member, Author) commented Mar 5, 2016

Yes, that's exactly what zip(eachindex(A), eachindex(B)) does. I think I'd rather keep the name eachindex, and go through a release cycle where we deprecate the single-index version. Then for the release after that we can introduce the multi-index return.

nalimilan (Member) commented:

The point is that for (IA, IB) in eachindex(A, B) could return the same index twice and increment a single state only once when possible, while for (IA, IB) in zip(eachindex(A), eachindex(B)) cannot. That's what you're trying to fix, right?

timholy (Member, Author) commented Mar 5, 2016

Ideally yes, but as you wrote it that falls under the "An alternative that didn't work" section above.

nalimilan (Member) commented:

Sorry, I may be missing something, but here's an illustration of what I have in mind for the 2-arrays case. The equality check would only happen once.

EDIT: Of course that's type-unstable. It could only work if RA == RB could be replaced with a dispatch on the types of A and B.

immutable PseudoZip2{I} <: Base.AbstractZipIterator
    a::I
end
Base.length(z::PseudoZip2) = length(z.a)
Base.eltype{I}(::Type{PseudoZip2{I}}) = Tuple{eltype(I), eltype(I)}
Base.start(z::PseudoZip2) = (s = start(z.a); (s, s))
function Base.next(z::PseudoZip2, st)
    n = next(z.a, st[1])
    return ((n[1], n[1]), (n[2], n[2]))
end
Base.done(z::PseudoZip2, st) = done(z.a, st[1])

function eachindex_shared(A, B)
    RA = eachindex(A)
    RB = eachindex(B)

    if RA == RB
        PseudoZip2(RA)
    else
        zip(RA, RB)
    end
end

for (IA, IB) in eachindex_shared(A, B)
    # ...
end

timholy (Member, Author) commented Mar 5, 2016

Yes, I'm trying to avoid the type-instability. Worse, you can't do this purely on the basis of the types of RA and RB. For example, say you're copying the elements of a 4x6 array to a 6x4 array. The two iterators are not equal (so you'd want to use regular Zip2), but they differ only in the value domain, not in the type domain.
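
To make the value-domain point concrete, a small illustration using the sub-based LinearSlow views from this thread's benchmarks (sketch only; the exact results depend on the array types):

A = rand(4, 6); B = rand(6, 4)
AS = sub(A, 1:size(A,1), :)             # the LinearSlow views used in this thread's benchmarks
BS = sub(B, 1:size(B,1), :)
RA, RB = eachindex(AS), eachindex(BS)   # CartesianRanges over 4x6 and 6x4
typeof(RA) == typeof(RB)                # expected true: the shapes don't enter the type
RA == RB                                # expected false: the ranges differ in the value domain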

toivoh (Contributor) commented Mar 5, 2016 via email

timholy (Member, Author) commented Mar 5, 2016

I basically tried that: that Zip2Each structure in the top post is immutable, and I think I tried to force-inline start, next, and done. Maybe I screwed up, want to give it a try yourself and see if you can pull it off?

toivoh (Contributor) commented Mar 5, 2016 via email

nalimilan (Member) commented:

I wondered whether, for zip(eachindex(A), eachindex(B)), LLVM shouldn't be able to detect that the code is incrementing two indices in parallel and that they cannot possibly be different, so that only one of them needs to be computed. Indeed, if you change:

Base.done(z::Base.Zip2, st) = done(z.a,st[1]) | done(z.b,st[2])

to

Base.done(z::Base.Zip2, st) = done(z.a,st[1])

then the generated code for the loop is identical to that obtained with eachindex(A, B) when passing two Arrays of the same size.

This kind of optimization could be applied via a special, say, SameLengthZip2 type which would check for equal lengths of its inputs. I guess eachindex isn't intended to be used with arrays of different lengths, right? Maybe LLVM could also be improved for this particular case, since it's already smart enough to get rid of the double computation.

Code to replicate:

A = rand(10000, 1000);
B = rand(10000, 1000);

function f(A, B)
    s = 0.0
    for I in eachindex(A, B)
        @inbounds s += A[I]*B[I]
    end
    s
end

Base.done(z::Base.Zip2, st) = done(z.a,st[1])

function g(A, B)
    s = 0.0
    for (IA, IB) in zip(eachindex(A), eachindex(B))
        @inbounds s += A[IA]*B[IB]
    end
    s
end

@code_native f(A, B)
@code_native g(A, B)

f(A, B); g(A, B);

@time for i in 1:100; f(A, B); end
@time for i in 1:100; g(A, B); end

timholy (Member, Author) commented Mar 5, 2016

@nalimilan, that looks really promising! I'll check that myself (or feel free to submit your own PR, since you discovered this).

We'd probably need a separate function, since zip is designed to handle mismatched iterators. zipmatched? Or zip(Val{:matched}, iter1, iter2, ...)?

timholy (Member, Author) commented Mar 5, 2016

I'm not able to replicate this, @nalimilan, and the case of two LinearSlow arrays (AS = sub(A, 1:size(A,1), :)) is still much slower. Can you check to see if it works for you?

timholy force-pushed the teh/shareindexes branch from 459bfc9 to 818fd03 on March 6, 2016
nalimilan (Member) commented:

As you found out, it only works for LinearFast, since for LinearSlow the code needs to handle the possibility of arrays with different shapes, e.g. (2, 3) and (3, 2). That said, I'm surprised you don't see the same efficient code as I do for LinearFast. Also, this means LLVM isn't that far from being able to optimize this itself.

Could one of our LLVM experts comment here? It looks like LLVM should be able to do this, at least if the correct passes are enabled in the expected order. There's also the issue of the threshold. http://llvm.org/docs/Passes.html#loop-unswitch-unswitch-loops

I've written a small C program to see whether clang is able to hoist the check on same in a loop similar to ours: https://gist.github.com/nalimilan/fffd490da8686ef72ab2 From my (limited) understanding of the assembly output (clang -S -masm=intel unswitch.c), it looks like the hoisting only happens with -O3, not with -O2. So maybe we only need to tweak the thresholds, or have a way to tell the compiler what it has to do in this particular case. It would be interesting to see what happens with a more complex example similar to LinearSlow, but that's a bit more work.

timholy (Member, Author) commented Mar 6, 2016

I changed the name and also implemented the case with 3 iterators, since that's also quite important in practice. I tried to ensure the docstring would end up in the RST docs, but ../julia genstdlib.jl yielded

WARNING: missing docs for signature:

    @shareiterators

WARNING: missing docs for signature:

    build_sysimg(sysimg_path=default_sysimg_path, cpu_target="native", userimg_path=nothing; force=false)

What am I doing wrong? Is there something broken for macros?

timholy (Member, Author) commented Mar 6, 2016

Nice detective work, @nalimilan. Looks like we already enable the relevant passes:

julia/src/jitlayers.cpp, lines 70 to 71 at 8a7eac2:

PM->add(createLICMPass()); // Hoist loop invariants
PM->add(createLoopUnswitchPass()); // Unswitch loops.
So it's not quite clear why it's not working more generally. To clarify: on this branch, even when I use two LinearFast arrays, the native instructions are not identical.

@@ -341,3 +341,104 @@ function collect{I<:IteratorND}(g::Generator{I})
dest[1] = first
return map_to!(g.f, 2, st, dest, g.iter)
end

"""
`@shareiterators` may improve the efficiency of loops that have multiple indexes.

MichaelHatherly (Member) commented:

@timholy, the warning for missing @shareiterators docs is because the signature needs to be

"""
    @shareiterators

...
"""

rather than an inline one.

timholy (Member, Author) commented Mar 6, 2016

Thanks, @MichaelHatherly!

timholy (Member, Author) commented Mar 6, 2016

@nalimilan, things are looking both more mysterious and more promising. With

immutable Zip2Same{I1, I2} # <: AbstractZipIterator
    same::Bool
    a::I1
    b::I2
end

zipsame(a, b) = Zip2Same(typeof(a)==typeof(b) && a==b, a, b)
Base.length(z::Zip2Same) = z.same ? length(z.a) : min(length(z.a), length(z.b))
Base.eltype{I1,I2}(::Type{Zip2Same{I1,I2}}) = Tuple{eltype(I1), eltype(I2)}
@inline Base.start(z::Zip2Same) = z.same ? (s = start(z.a); (s,s::typeof(start(z.b)))) : (start(z.a), start(z.b))
@inline function Base.next(z::Zip2Same, st)
    n1 = next(z.a,st[1])
    n2 = next(z.b,st[2])
    if z.same
        n11, n12 = n1
        n21, n22 = n2
        return ((n11, n11::typeof(n21)), (n12, n12::typeof(n22)))
    end
    return ((n1[1], n2[1]), (n1[2], n2[2]))
end
@inline Base.done(z::Zip2Same, st) = z.same ? done(z.a,st[1]) : done(z.a,st[1]) | done(z.b,st[2])

Note those type annotations in, e.g., next are intended to maintain type-stability. I'm betting/hoping that type inference will use the type of n2 and then LLVM will notice it's not being used and skip computing it.

Test function:

function mydotshared(A, B)
    s = 0.0
    for (IA,IB) in zipsame(eachindex(A), eachindex(B))
        @inbounds s += A[IA]*B[IB]
    end
    s
end

To the extent that I'm reading it correctly, this looks quite promising:

julia> @code_llvm mydotshared(A, B)

define double @julia_mydotshared_22504(%jl_value_t*, %jl_value_t*) #0 {
top:
  %_var1.sroa.0 = alloca i8, align 1
  %2 = alloca %Zip2Same, align 8
  %3 = alloca %UnitRange.1, align 8
  %4 = alloca %UnitRange.1, align 8
  %5 = bitcast %jl_value_t* %0 to double**
  %6 = load double*, double** %5, align 8
  %7 = getelementptr inbounds %jl_value_t, %jl_value_t* %0, i64 1
  %8 = bitcast %jl_value_t* %7 to i64*
  %9 = load i64, i64* %8, align 8
  %10 = bitcast %jl_value_t* %1 to double**
  %11 = load double*, double** %10, align 8
  %12 = getelementptr inbounds %jl_value_t, %jl_value_t* %1, i64 1
  %13 = bitcast %jl_value_t* %12 to i64*
  %14 = load i64, i64* %13, align 8
  %15 = icmp sgt i64 %9, 0
  %16 = select i1 %15, i64 %9, i64 0
  %17 = getelementptr inbounds %UnitRange.1, %UnitRange.1* %3, i64 0, i32 0
  store i64 1, i64* %17, align 8
  %18 = getelementptr inbounds %UnitRange.1, %UnitRange.1* %3, i64 0, i32 1
  store i64 %16, i64* %18, align 8
  %19 = icmp sgt i64 %14, 0
  %20 = select i1 %19, i64 %14, i64 0
  %21 = getelementptr inbounds %UnitRange.1, %UnitRange.1* %4, i64 0, i32 0
  store i64 1, i64* %21, align 8
  %22 = getelementptr inbounds %UnitRange.1, %UnitRange.1* %4, i64 0, i32 1
  store i64 %20, i64* %22, align 8
  call void @julia_zipsame_22505(%Zip2Same* nonnull sret %2, %UnitRange.1* nonnull %3, %UnitRange.1* nonnull %4) #0
  %23 = getelementptr inbounds %Zip2Same, %Zip2Same* %2, i64 0, i32 0
  %24 = load i8, i8* %23, align 8
  %25 = getelementptr inbounds %Zip2Same, %Zip2Same* %2, i64 0, i32 1, i32 0
  %26 = load i64, i64* %25, align 8
  %27 = and i8 %24, 1
  %28 = icmp eq i8 %27, 0
  br i1 %28, label %L1.split.us, label %L1.L1.split_crit_edge

L1.L1.split_crit_edge:                            ; preds = %top
  %29 = getelementptr inbounds %Zip2Same, %Zip2Same* %2, i64 0, i32 1, i32 1
  %30 = load i64, i64* %29, align 8
  %31 = add i64 %30, 1
  %32 = bitcast i8* %_var1.sroa.0 to i1*
  br label %L4

L1.split.us:                                      ; preds = %top
  %33 = getelementptr inbounds %Zip2Same, %Zip2Same* %2, i64 0, i32 2, i32 0
  %34 = load i64, i64* %33, align 8
  %35 = getelementptr inbounds %Zip2Same, %Zip2Same* %2, i64 0, i32 1, i32 1
  %36 = load i64, i64* %35, align 8
  %37 = add i64 %36, 1
  %38 = bitcast i8* %_var1.sroa.0 to i1*
  %39 = getelementptr inbounds %Zip2Same, %Zip2Same* %2, i64 0, i32 2, i32 1
  %40 = load i64, i64* %39, align 8
  %41 = add i64 %40, 1
  br label %L4.us

L4.us:                                            ; preds = %L6.us, %L1.split.us
  %"#s2.sroa.0.0.us" = phi i64 [ %26, %L1.split.us ], [ %48, %L6.us ]
  %"#s2.sroa.4.0.us" = phi i64 [ %34, %L1.split.us ], [ %49, %L6.us ]
  %s.0.us = phi double [ 0.000000e+00, %L1.split.us ], [ %57, %L6.us ]
  %42 = icmp eq i64 %"#s2.sroa.0.0.us", %37
  %43 = icmp eq i64 %"#s2.sroa.4.0.us", %41
  %44 = or i1 %42, %43
  store i1 %44, i1* %38, align 1
  %45 = load i8, i8* %_var1.sroa.0, align 1
  %46 = and i8 %45, 1
  %47 = icmp eq i8 %46, 0
  br i1 %47, label %L6.us, label %L8.loopexit

L6.us:                                            ; preds = %L4.us
  %48 = add i64 %"#s2.sroa.0.0.us", 1
  %49 = add i64 %"#s2.sroa.4.0.us", 1
  %50 = add i64 %"#s2.sroa.0.0.us", -1
  %51 = getelementptr double, double* %6, i64 %50
  %52 = load double, double* %51, align 8
  %53 = add i64 %"#s2.sroa.4.0.us", -1
  %54 = getelementptr double, double* %11, i64 %53
  %55 = load double, double* %54, align 8
  %56 = fmul double %52, %55
  %57 = fadd double %s.0.us, %56
  br label %L4.us

L4:                                               ; preds = %L1.L1.split_crit_edge, %L6
  %"#s2.sroa.0.0" = phi i64 [ %26, %L1.L1.split_crit_edge ], [ %62, %L6 ]
  %s.0 = phi double [ 0.000000e+00, %L1.L1.split_crit_edge ], [ %70, %L6 ]
  %58 = icmp eq i64 %"#s2.sroa.0.0", %31
  store i1 %58, i1* %32, align 1
  %59 = load i8, i8* %_var1.sroa.0, align 1
  %60 = and i8 %59, 1
  %61 = icmp eq i8 %60, 0
  br i1 %61, label %L6, label %L8.loopexit22

L6:                                               ; preds = %L4
  %62 = add i64 %"#s2.sroa.0.0", 1
  %63 = add i64 %"#s2.sroa.0.0", -1
  %64 = getelementptr double, double* %6, i64 %63
  %65 = load double, double* %64, align 8
  %66 = add i64 %"#s2.sroa.0.0", -1
  %67 = getelementptr double, double* %11, i64 %66
  %68 = load double, double* %67, align 8
  %69 = fmul double %65, %68
  %70 = fadd double %s.0, %69
  br label %L4

L8.loopexit:                                      ; preds = %L4.us
  br label %L8

L8.loopexit22:                                    ; preds = %L4
  br label %L8

L8:                                               ; preds = %L8.loopexit22, %L8.loopexit
  %s.0.lcssa = phi double [ %s.0.us, %L8.loopexit ], [ %s.0, %L8.loopexit22 ]
  ret double %s.0.lcssa
}

Yet the performance is almost 3-fold worse than mydot1 and mydot2. Here's the code for mydot1:

julia> @code_llvm mydot1(A, B)

define double @julia_mydot1_22541(%jl_value_t*, %jl_value_t*) #0 {
top:
  %2 = getelementptr inbounds %jl_value_t, %jl_value_t* %0, i64 1
  %3 = bitcast %jl_value_t* %2 to i64*
  %4 = load i64, i64* %3, align 8
  %5 = getelementptr inbounds %jl_value_t, %jl_value_t* %1, i64 1
  %6 = bitcast %jl_value_t* %5 to i64*
  %7 = load i64, i64* %6, align 8
  %8 = icmp sge i64 %7, %4
  %9 = select i1 %8, i64 %7, i64 %4
  %10 = icmp slt i64 %9, 1
  br i1 %10, label %L2, label %if.lr.ph

if.lr.ph:                                         ; preds = %top
  %11 = bitcast %jl_value_t* %0 to double**
  %12 = load double*, double** %11, align 8
  %13 = bitcast %jl_value_t* %1 to double**
  %14 = load double*, double** %13, align 8
  br label %if

L2.loopexit:                                      ; preds = %if
  br label %L2

L2:                                               ; preds = %L2.loopexit, %top
  %s.0.lcssa = phi double [ 0.000000e+00, %top ], [ %22, %L2.loopexit ]
  ret double %s.0.lcssa

if:                                               ; preds = %if.lr.ph, %if
  %"#s2.04" = phi i64 [ 1, %if.lr.ph ], [ %15, %if ]
  %s.03 = phi double [ 0.000000e+00, %if.lr.ph ], [ %22, %if ]
  %15 = add i64 %"#s2.04", 1
  %16 = add i64 %"#s2.04", -1
  %17 = getelementptr double, double* %12, i64 %16
  %18 = load double, double* %17, align 8
  %19 = getelementptr double, double* %14, i64 %16
  %20 = load double, double* %19, align 8
  %21 = fmul double %18, %20
  %22 = fadd double %s.03, %21
  %23 = icmp eq i64 %"#s2.04", %9
  br i1 %23, label %L2.loopexit, label %if
}

It looks like there's quite a bit more logic going on in the body of the mydotshared loop.

timholy (Member, Author) commented Mar 6, 2016

My bet is that what we need here is an x:::T operator (note the 3 colons), meaning "don't check that the type is OK, just trust me and run with it."

nalimilan (Member) commented:

In the line you pointed to above, if I use createLoopUnswitchPass(false), then the loop unswitching happens as expected, and most of the overhead is gone when using the Zip2Each type, even for LinearSlow arrays (about 7% slower). Here's the code I used:
https://gist.github.com/nalimilan/f84d0c8880904cde586a

EDIT: false here means "do not OptimizeForSize"; when OptimizeForSize is set, it apparently disables all non-trivial unswitching operations (look for that identifier in http://llvm.org/docs/doxygen/html/LoopUnswitch_8cpp_source.html). Of course we should measure the broader implications of toggling this parameter. I can make a PR to run benchmarks on it. I'm not sure whether we could enable this selectively like @inline.

nalimilan (Member) commented:

Is that after modifying the LLVM unswitch pass? I get this (after loading my gist):

Two Arrays:
  0.258120 seconds (5 allocations: 176 bytes)
  0.255581 seconds (5 allocations: 176 bytes)
  0.271568 seconds (5 allocations: 176 bytes)

Two LinearSlow SubArrays
  0.534033 seconds (5 allocations: 176 bytes)
  1.651132 seconds (5 allocations: 176 bytes)
  0.603574 seconds (5 allocations: 176 bytes)

Mixed case
  0.410661 seconds (5 allocations: 176 bytes)
  0.361416 seconds (5 allocations: 176 bytes)
  0.490788 seconds (5 allocations: 176 bytes)

My machine is old, so vectorization might not make that much of a difference (but I'm not sure why one version would be vectorized and not the other.)

timholy (Member, Author) commented Mar 6, 2016

Are you sure that modifying the pass makes any difference? Isn't false the default value? http://legup.eecg.utoronto.ca/doxygen/namespacellvm.html#ef1f57caf4390a4c863cd52a574153f6

timholy (Member, Author) commented Mar 6, 2016

Amusingly, if I just add @inline in front of Zip2's start, next, and done, I get this:

Two Arrays:
  0.117608 seconds (5 allocations: 176 bytes)
  0.121061 seconds (5 allocations: 176 bytes)
  0.312968 seconds (5 allocations: 176 bytes)
Two LinearSlow SubArrays
  0.285209 seconds (5 allocations: 176 bytes)
  0.355089 seconds (5 allocations: 176 bytes)
  0.361093 seconds (5 allocations: 176 bytes)
Mixed case
  0.250418 seconds (5 allocations: 176 bytes)
  0.220168 seconds (5 allocations: 176 bytes)
  0.283443 seconds (5 allocations: 176 bytes)

which is better than Zip2Same in every case.

timholy (Member, Author) commented Mar 6, 2016

However, @shareiterators is tied for the best performance in all 3 cases (the 4th entry is with the macro):

Two Arrays:
  0.116534 seconds (5 allocations: 176 bytes)
  0.118737 seconds (5 allocations: 176 bytes)
  0.311935 seconds (5 allocations: 176 bytes)
  0.118407 seconds (5 allocations: 176 bytes)
Two LinearSlow SubArrays
  0.281333 seconds (5 allocations: 176 bytes)
  0.354340 seconds (5 allocations: 176 bytes)
  0.357118 seconds (5 allocations: 176 bytes)
  0.283382 seconds (5 allocations: 176 bytes)
Mixed case
  0.250119 seconds (5 allocations: 176 bytes)
  0.221807 seconds (5 allocations: 176 bytes)
  0.284371 seconds (5 allocations: 176 bytes)
  0.221170 seconds (5 allocations: 176 bytes)

nalimilan (Member) commented:

Oops, so false wouldn't have made any difference. This means that on my machine Zip2Same is fast already. Could you post the native code for Zip2Same and mydot1 for two LinearFast arrays? I don't understand what should make such a difference... (Then I should go back to work. :-)

timholy (Member, Author) commented Mar 6, 2016

julia> @code_native mydot1(A, B)
        .text
Filename: zip2samec.jl
Source line: 0
        pushq   %rbp
        movq    %rsp, %rbp
        vxorpd  %xmm0, %xmm0, %xmm0
Source line: 24
        movq    8(%rdi), %r8
        movq    8(%rsi), %rax
Source line: 240
        cmpq    %r8, %rax
        movq    %r8, %rcx
        cmovgeq %rax, %rcx
Source line: 83
        cmpq    $1, %rcx
        jl      L94
        vxorpd  %xmm0, %xmm0, %xmm0
Source line: 24
        movq    (%rdi), %rcx
        movq    (%rsi), %rdx
Source line: 83
        cmpq    %rax, %r8
        cmovgq  %r8, %rax
        nopw    %cs:(%rax,%rax)
Source line: 26
L64:
        vmovsd  (%rcx), %xmm1           # xmm1 = mem[0],zero
        vmulsd  (%rdx), %xmm1, %xmm1
        vaddsd  %xmm1, %xmm0, %xmm0
Source line: 83
        addq    $-1, %rax
        addq    $8, %rdx
        addq    $8, %rcx
        cmpq    $0, %rax
        jne     L64
Source line: 28
L94:
        popq    %rbp
        retq

julia> @code_native mydotshared(A, B)
        .text
Filename: zip2samec.jl
Source line: 0
        pushq   %rbp
        movq    %rsp, %rbp
        pushq   %r14
        pushq   %rbx
        subq    $80, %rsp
        xorl    %eax, %eax
Source line: 40
        movq    (%rdi), %r14
        movq    8(%rdi), %rcx
        movq    (%rsi), %rbx
        movq    8(%rsi), %rdx
Source line: 83
        cmpq    $0, %rcx
        cmovleq %rax, %rcx
        movq    $1, -80(%rbp)
        movq    %rcx, -72(%rbp)
        cmpq    $0, %rdx
        cmovgq  %rdx, %rax
        movq    $1, -96(%rbp)
        movq    %rax, -88(%rbp)
        movabsq $zipsame, %rax
        leaq    -64(%rbp), %rdi
        leaq    -80(%rbp), %rsi
        leaq    -96(%rbp), %rdx
        callq   *%rax
        movb    -64(%rbp), %al
        movq    -56(%rbp), %rdx
        vxorpd  %xmm0, %xmm0, %xmm0
Source line: 10
        andb    $1, %al
        jne     L219
Source line: 83
        movq    -40(%rbp), %rax
        movq    -48(%rbp), %rsi
        movq    -32(%rbp), %rcx
Source line: 10
        leaq    -1(%rdx), %rdi
        shlq    $3, %rdi
        addq    %rdi, %r14
        addq    $1, %rsi
        subq    %rdx, %rsi
        leaq    -1(%rax), %rdx
        shlq    $3, %rdx
        addq    %rdx, %rbx
        addq    $1, %rcx
        subq    %rax, %rcx
        jmp     L189
        nopl    (%rax)
Source line: 42
L160:
        vmovsd  (%r14), %xmm1           # xmm1 = mem[0],zero
        vmulsd  (%rbx), %xmm1, %xmm1
        vaddsd  %xmm1, %xmm0, %xmm0
        addq    $8, %r14
        addq    $-1, %rsi
        addq    $8, %rbx
        addq    $-1, %rcx
Source line: 21
L189:
        cmpq    $0, %rsi
        sete    %dl
        cmpq    $0, %rcx
        sete    %al
        orb     %dl, %al
        andb    $1, %al
        movb    %al, -17(%rbp)
        movb    -17(%rbp), %al
        andb    $1, %al
        je      L160
        jmp     L302
Source line: 83
L219:
        movq    -48(%rbp), %rax
Source line: 21
        addq    $1, %rax
Source line: 10
        leaq    -1(%rdx), %rcx
        shlq    $3, %rcx
        addq    %rcx, %rbx
        addq    %rcx, %r14
        subq    %rdx, %rax
        jmp     L281
        nopw    %cs:(%rax,%rax)
Source line: 42
L256:
        vmovsd  (%r14), %xmm1           # xmm1 = mem[0],zero
        vmulsd  (%rbx), %xmm1, %xmm1
        vaddsd  %xmm1, %xmm0, %xmm0
        addq    $8, %rbx
        addq    $8, %r14
        addq    $-1, %rax
Source line: 21
L281:
        cmpq    $0, %rax
        sete    %cl
        andb    $1, %cl
        movb    %cl, -17(%rbp)
        movb    -17(%rbp), %cl
        andb    $1, %cl
        je      L256
Source line: 44
L302:
        addq    $80, %rsp
        popq    %rbx
        popq    %r14
        popq    %rbp
        retq

julia> versioninfo()
Julia Version 0.5.0-dev+3001
Commit 1f101b3* (2016-03-06 13:54 UTC)
Platform Info:
  System: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i7-5500U CPU @ 2.40GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1

timholy (Member, Author) commented Mar 6, 2016

I also just tried this on the very latest master (in case there was a difference due to a recent commit, e.g., @vtjnash's recent workq PR), but I got the same performance.

nalimilan (Member) commented:

I get very similar code, except with movsd and mulsd instead of their v counterparts. Maybe that's the cause of the difference, even though both functions have the same core sequence of instructions. But mydotshared contains an additional series of instructions in each loop iteration, which might introduce delays. I don't understand why they are there, since the point of unswitching was to move this overhead outside the loop.

        cmpq    $0, %rax
        sete    %cl
        andb    $1, %cl
        movb    %cl, -17(%rbp)
        movb    -17(%rbp), %cl
        andb    $1, %cl

UPDATE: these instructions come from the branch in done. If I define (abusively) @inline Base.done(z::Zip2Same, st) = done(z.a,st[1]), they go away. That doesn't explain how to get rid of this, though...

nalimilan (Member) commented:

Please check again with the updated gist. I've just found out that Base.done(z::Zip2Same, st) = done(z.a,st[1]) & (z.same | done(z.b,st[2])) allows getting rid of these instructions. On my box it looks slightly better.

nalimilan (Member) commented:

The idea is that if we check beforehand that the arrays have the same lengths, we don't actually need to check done for both. We can call it in case the implementation relies on it, but we don't need the return value.

timholy (Member, Author) commented Mar 6, 2016

Woot! That did it for me:

Algorithm                     Two Arrays   Two LinearSlow SubArrays   One of each
eachindex(A,B)                0.116820     0.284779                   0.249632
eachindex(A,B) (with @simd)   0.091651     0.224375                   0.141257
zip (with forced @inline)     0.120010     0.376532                   0.248000
zipsame                       0.118089     0.312213                   0.218812
@shareiterators               0.115690     0.284069                   0.219706

The macro is consistently a little faster in the case of two LinearSlow iterators, but it's about a 10% difference. I'm not convinced it's worth it.

nalimilan (Member) commented:

Great! We could continue trying to reduce the gap with the macro, but I agree in general it's not worth adding a new construct just for 10%.

Note that SubArrays covering the entirety of their parent seem to lead to more efficient code than views that use only a subset of the rows. At least I've seen a difference in some cases, so it would be better to check that case too.
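
For the record, the two kinds of views being compared would look something like this (illustrative only):

A = rand(10001, 10000)
AS_full = sub(A, 1:size(A,1), :)      # view covering the whole parent
AS_part = sub(A, 1:size(A,1)-1, :)    # drops the last row; the case worth re-checking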

JeffBezanson (Member) commented:

Excellent performance investigation, @timholy and @nalimilan. I'd be very much in favor of something like zipsame (ideally a special case of the existing Zip types) over the macro.

timholy (Member, Author) commented Mar 8, 2016

I suspect we can just add a same field to Zip2 and Zip. If we don't introduce a special function zipsame, the key question will be how to control when it checks for sameness. You certainly wouldn't want it to check for sameness with zip(A, B), but perhaps we could define a CompactIterator trait similar to our LinearFast/LinearSlow dichotomy.

I'm sure it's obvious to you already, but I should add that most of the savings seem to come from forced inlining. We might get another 20-30% or so out of zipsame compared to zip. I suspect that's worth it, but I'm putting it on a bit of a back burner compared to fleshing out the overall ReshapedArray infrastructure (when I have time to get back to it).
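
A rough sketch of what such a trait could look like (every name below other than CompactIterator is hypothetical, and CompactIterator itself is only proposed above, not implemented): iterators opt in to the runtime sameness check, and everything else falls back to plain zip.

abstract CompactIterator
immutable CompactYes <: CompactIterator end
immutable CompactNo  <: CompactIterator end

# Opt-in: only iterators declared "compact" get the runtime sameness check.
compactness(::Type)                        = CompactNo()
compactness{T<:UnitRange}(::Type{T})       = CompactYes()
compactness{T<:CartesianRange}(::Type{T})  = CompactYes()

zipsame(a, b) = _zipsame(compactness(typeof(a)), compactness(typeof(b)), a, b)
_zipsame(::CompactYes, ::CompactYes, a, b) = Zip2Same(typeof(a) == typeof(b) && a == b, a, b)
_zipsame(::CompactIterator, ::CompactIterator, a, b) = zip(a, b)   # fall back to plain zip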

nalimilan (Member) commented:

I had the same idea actually. I was thinking we could always check for "sameness" using === so that it only works with immutables, for which there is no risk of introducing incorrect behavior. Then we could maybe have zip{T}(a::T, b::T) = Zip2Same(a, b) (and recursively). The contract for Zip2Same implies that we might avoid calling next and done. This shouldn't be an issue for immutables, since they don't save any state, right?

BTW, the definition of done I gave above is incorrect for iterators of different lengths. It should be done(z.a,st[1]) | (!z.same & done(z.b,st[2])), which appears to offer the same performance. I've fixed the gist.
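
Put together, those two points might look like this (sketch only; using a zipsame function rather than overloading zip itself, to avoid clashing with Base):

zipsame{T}(a::T, b::T) = Zip2Same(a === b, a, b)   # same concrete type: === is a cheap, safe check for immutables
zipsame(a, b)          = Zip2Same(false, a, b)     # different types can never be "same"
@inline Base.done(z::Zip2Same, st) = done(z.a, st[1]) | (!z.same & done(z.b, st[2]))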

timholy (Member, Author) commented Mar 9, 2016

I'm going to close this, since the final version will have a very different design.

nalimilan (Member) commented:

Coming back to this old issue, I realize that using Float64 arrays and a dot product for the tests wasn't the best choice, since their vectorization requires @simd, which isn't supported for all of the tested solutions (#15378). Using e.g. Int8, vectorization happens automatically for mydot1 and mydot2, but not for mydotshared. That means moving to either the zip(eachindex(A), eachindex(B)) or the zipsame(A, B) approach currently kills performance in cases where vectorization really makes a difference.

That also means that we should probably not blindly replace eachindex(A, B) with zip(eachindex(A), eachindex(B)): it's fine in places where SIMD won't be enabled anyway (most cases, apparently), but absolutely not where it would have been enabled. For example, changes like this appear problematic to me: adding a scalar to an array is no longer vectorized. Compare with the original loop:

function add{T}(A::AbstractArray{T}, B::Number)
    F = similar(A, typeof(zero(T) + zero(B)))
    for i in eachindex(A, F)
        @inbounds F[i] = A[i] + B
    end
    return F
end

@code_llvm add([1], 1)
@code_llvm [1] .+ 1

@timholy Does this analysis sound correct?

nalimilan reopened this on Sep 18, 2016
timholy (Member, Author) commented Sep 18, 2016

Very interesting. We definitely don't want to give up vectorization when it would help. OTOH, there are cases where you'd like the two containers to have separate iterators for performance (ReshapedArrays being a poster child for this, but even LinearSlow vs LinearFast would be nice).

nalimilan (Member) commented:

So how can we proceed to fix the .+ regression on 0.5? Should we move back to the old approach until we have found a way to get vectorized code?

KristofferC (Member) commented Oct 14, 2016

@nalimilan I benchmarked the old way for a few things whose generated code I see has regressed (mostly no SIMD): 5f550ea.

Locally, I don't see any real speed difference, though, so maybe it is fine the way it is.

brenhinkeller added the iteration label on Nov 22, 2022