Add shareindexes macro for (sometimes) faster iteration #15356
base: master
Conversation
It seems as if this could be solved by "control inversion": instead of passing the iterator to the for loop, one could pass the for-loop body (as a closure) to the iterator, which can then easily implement the optimization. I know this sounds radical and far-fetched, and I'm only mentioning this tongue-in-cheek. However, it would be a quite natural solution for implementing parallel loops, (partial) unrolling, etc. Of course, a macro has the same expressive power, and it's thus safer (for guaranteed performance) to use a macro than to rely on loop hoisting or common subexpression elimination.
That'd definitely be worth a test, @eschnett. It doesn't need a complete front-end re-write:

```julia
eachindex(A, B, …) do (IA, IB, …)
    # body
end
```

Brave new world, indeed!

Edit: I suppose this will only work if the body is small enough to get inlined. I'll look at this closer later, Tim, but a huge 👍 on the direction here.
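A minimal sketch of the control-inversion idea under discussion (everything here is illustrative: `foreachindex` is a hypothetical name, not a Base function, and the sketch assumes the closure gets inlined):

```julia
# Hypothetical sketch: the loop body is passed in as the closure `f`,
# so the "same indexes?" test runs only once, outside the loop.
function foreachindex(f, A, B)
    RA, RB = eachindex(A), eachindex(B)
    if RA == RB
        for I in RA
            f(I, I)            # shared case: advance a single iterator
        end
    else
        for (IA, IB) in zip(RA, RB)
            f(IA, IB)          # general case: advance both iterators
        end
    end
end

# do-block syntax supplies the body as the closure:
# foreachindex(A, B) do IA, IB
#     # body
# end
```

As the comment above notes, this only pays off if the closure is inlined; otherwise the per-iteration call overhead dominates.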
Great suggestion, @eschnett. I will test this (if @mbauman doesn't beat me to it). As @mbauman says, I suspect control over inlining may be necessary. I did notice that the `do`-block form is valid syntax. However, I am also noticing that setting external variables from a closure has inference problems. That is, `getith = i -> A[i]` is fast, but `addith = i -> (@inbounds s += A[i])` is slow.
```diff
@@ -341,3 +341,73 @@ function collect{I<:IteratorND}(g::Generator{I})
     dest[1] = first
     return map_to!(g.f, 2, st, dest, g.iter)
 end
+
+"""
+`@shareindexes` may improve the efficiency of loops that have multiple indexes.
```
should make its way to the rst docs if we go with this
also ref #12902
```diff
@@ -1430,6 +1430,7 @@ export
     @inbounds,
     @fastmath,
     @simd,
+    @shareindexes,
```
That's a good opportunity to revive the "indices" vs. "indexes" discussion.
FYI re #12902 I'm OK with either outcome. I was originally surprised by `indexes` but have grown fond of it.
Since it looks like the closure solution might not be feasible for a while, we might need the macro. Should I rename this `@shareindex` simply to circumvent the debate?
Maybe make it more general? `@shareiterators`? There's nothing inherently index-y about this macro right now... besides the current use-case.
Though what's really being shared is the *state*, not the *iterator*.
Well, I guess you could say the iterator is being shared too. So I think we could go with that name.

I don't understand why

It wasn't very clear, but what I meant is that having
Got it. Could we imagine writing this instead?

```julia
for (IA, IB) in eachindex(A, B)
    # body
end
```

That would allow the iterator to return the same index twice when possible. (Yes, this would be a breaking change, but we could find another name for this function.)
Yes, that's exactly what the

The point is that

Ideally yes, but as you wrote it that falls under the "An alternative that didn't work" section above.
Sorry, I may be missing something, but here's an illustration of what I have in mind for the 2-arrays case. The equality check would only happen once.

EDIT: Of course that's type-unstable. It could only work if

```julia
immutable PseudoZip2{I} <: Base.AbstractZipIterator
    a::I
end

Base.length(z::PseudoZip2) = length(z.a)
Base.eltype{I}(::Type{PseudoZip2{I}}) = Tuple{eltype(I), eltype(I)}
Base.start(z::PseudoZip2) = (s = start(z.a); (s, s))
function Base.next(z::PseudoZip2, st)
    n = next(z.a, st[1])
    return ((n[1], n[1]), (n[2], n[2]))
end
Base.done(z::PseudoZip2, st) = done(z.a, st[1])

function eachindex_shared(A, B)
    RA = eachindex(A)
    RB = eachindex(B)
    if RA == RB
        PseudoZip2(RA)
    else
        zip(RA, RB)
    end
end

for (IA, IB) in eachindex_shared(A, B)
    # ...
end
```
Yes, I'm trying to avoid the type-instability. Worse, you can't do this purely on the basis of the types of
If you could design the iterator to make LLVM perform loop unswitching (use conditional code in the loop that obviously depends on a loop-invariant condition), then it might be possible to get LLVM to duplicate the loop for us and choose which one based on the value domain information at runtime. But I'm not sure how robust that would be, even if it could be made to work.
I basically tried that: that

I see. It'd be fun to try it, but I'm not so sure that I will.
I wondered whether changing

```julia
Base.done(z::Base.Zip2, st) = done(z.a,st[1]) | done(z.b,st[2])
```

to

```julia
Base.done(z::Base.Zip2, st) = done(z.a,st[1])
```

would help; then the generated code for the loop is identical to that obtained with

This kind of optimization could be applied via a special, say,

Code to replicate:

```julia
A = rand(10000, 1000);
B = rand(10000, 1000);

function f(A, B)
    s = 0.0
    for I in eachindex(A, B)
        @inbounds s += A[I]*B[I]
    end
    s
end

Base.done(z::Base.Zip2, st) = done(z.a,st[1])

function g(A, B)
    s = 0.0
    for (IA, IB) in zip(eachindex(A), eachindex(B))
        @inbounds s += A[IA]*B[IB]
    end
    s
end

@code_native f(A, B)
@code_native g(A, B)
f(A, B); g(A, B);
@time for i in 1:100; f(A, B); end
@time for i in 1:100; g(A, B); end
```
@nalimilan, that looks really promising! I'll check that myself (or feel free to submit your own PR, since you discovered this). We'd probably need a separate function, since

I'm not replicating this, @nalimilan, and the case of two
Force-pushed from 459bfc9 to 818fd03 (compare)
As you found out, it only works for

Could one of our LLVM experts comment here? It looks like LLVM should be able to do this, at least if the correct passes are enabled in the expected order. There's also the issue of the threshold. http://llvm.org/docs/Passes.html#loop-unswitch-unswitch-loops

I've written a small C program to see whether clang is able to hoist the check for
I changed the name and also implemented the case with 3 iterators, since that's also quite important in practice. I tried to ensure the docstring would end up in the RST docs, but

What am I doing wrong? Is there something broken for macros?
Nice detective work, @nalimilan. Looks like we already enable the relevant passes:

> Lines 70 to 71 in 8a7eac2

...for `LinearFast` arrays, the native instructions are not identical.
```diff
@@ -341,3 +341,104 @@ function collect{I<:IteratorND}(g::Generator{I})
     dest[1] = first
     return map_to!(g.f, 2, st, dest, g.iter)
 end
+
+"""
+`@shareiterators` may improve the efficiency of loops that have multiple indexes.
```
@timholy, the warning for missing `@shareiterators` docs is because the signature needs to be

```julia
"""
@shareiterators
...
"""
```

rather than an inline one.

Thanks, @MichaelHatherly!
@nalimilan, things are looking both more mysterious and more promising. With

```julia
immutable Zip2Same{I1, I2} # <: AbstractZipIterator
    same::Bool
    a::I1
    b::I2
end

zipsame(a, b) = Zip2Same(typeof(a)==typeof(b) && a==b, a, b)
Base.length(z::Zip2Same) = z.same ? length(z.a) : min(length(z.a), length(z.b))
Base.eltype{I1,I2}(::Type{Zip2Same{I1,I2}}) = Tuple{eltype(I1), eltype(I2)}
@inline Base.start(z::Zip2Same) = z.same ? (s = start(z.a); (s, s::typeof(start(z.b)))) : (start(z.a), start(z.b))
@inline function Base.next(z::Zip2Same, st)
    n1 = next(z.a, st[1])
    n2 = next(z.b, st[2])
    if z.same
        n11, n12 = n1
        n21, n22 = n2
        return ((n11, n11::typeof(n21)), (n12, n12::typeof(n22)))
    end
    return ((n1[1], n2[1]), (n1[2], n2[2]))
end
@inline Base.done(z::Zip2Same, st) = z.same ? done(z.a, st[1]) : done(z.a, st[1]) | done(z.b, st[2])
```

Note those type annotations in, e.g., `start` and `next`.

Test function:

```julia
function mydotshared(A, B)
    s = 0.0
    for (IA, IB) in zipsame(eachindex(A), eachindex(B))
        @inbounds s += A[IA]*B[IB]
    end
    s
end
```

To the extent that I'm reading it correctly, this looks quite promising:

Yet the performance is almost 3-fold worse than

It looks like there's quite a bit more logic going on in the body of the
My bet is that what we need here is a

In the line you pointed to above, if I use

EDIT:
Is that after modifying the LLVM unswitch pass? I get this (after loading my gist):

My machine is old, so vectorization might not make that much of a difference (but I'm not sure why one version would be vectorized and not the other).
Are you sure that modifying the pass makes any difference? Isn't

Amusingly, if I just add

which is better than

However,

Oops, so
I also just tried this on the very latest master (in case there was a difference due to a recent commit, e.g., @vtjnash's recent workq PR), but I got the same performance.
I get very similar code, except with

UPDATE: these instructions come from the branch in

Please check again with the updated gist. I've just found out that

The idea is that if we check beforehand that the arrays have the same lengths, we don't actually need to check
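A sketch of that length-precheck idea (the type and names here are hypothetical, reconstructed from the discussion, not code from the PR): verify equal lengths once up front, after which `done` only needs to consult one iterator, so the loop carries a single termination check just like `eachindex`.

```julia
# Hypothetical sketch: a zip whose constructor guarantees equal lengths,
# so termination can be decided from the first iterator alone.
immutable ZipSameLength{I1, I2}
    a::I1
    b::I2
end

function zipsamelength(a, b)
    length(a) == length(b) || throw(DimensionMismatch("iterators must have equal length"))
    ZipSameLength(a, b)
end

Base.start(z::ZipSameLength) = (start(z.a), start(z.b))
@inline function Base.next(z::ZipSameLength, st)
    n1 = next(z.a, st[1])
    n2 = next(z.b, st[2])
    ((n1[1], n2[1]), (n1[2], n2[2]))
end
# Equal lengths mean both iterators finish together, so check only `z.a`:
@inline Base.done(z::ZipSameLength, st) = done(z.a, st[1])
```

This trades the per-iteration `done(z.a,...) | done(z.b,...)` of plain `zip` for one `length` comparison at construction time.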
Woot! That did it for me:

The macro is consistently a little faster in the case of two

Great! We could continue trying to reduce the gap with the macro, but I agree in general it's not worth adding a new construct just for 10%. Note that

Excellent performance investigation, @timholy and @nalimilan. I'd be very much in favor of something like
I suspect we can just add a

I'm sure it's obvious to you already, but I should add that most of the savings seem to come from forced inlining. We might get another 20-30% or so out of
I had the same idea actually. I was thinking we could always check for "sameness" using

BTW, the definition of
I'm going to close this, since the final version will have a very different design.
Coming back to this old issue, I realize that using

That also means that we should probably not blindly replace

```julia
function add{T}(A::AbstractArray{T}, B::Number)
    F = similar(A, typeof(zero(T) + zero(B)))
    for i in eachindex(A, F)
        @inbounds F[i] = A[i] + B
    end
    return F
end

@code_llvm add([1], 1)
@code_llvm [1] .+ 1
```

@timholy Does this analysis sound correct?
Very interesting. We definitely don't want to give up vectorization when it would help. OTOH, there are cases where you'd like the two containers to have separate iterators for performance (

So how can we proceed to fix the

@nalimilan Benchmarking the old way for a few things that I see have regressed in generated code (no SIMD mostly) 5f550ea. Locally, I don't see any real speed difference though, so maybe it is fine the way it is.
I'm preparing another couple of PRs that will expand the range of array iterators (and I'm planning even further expansion later on). I'm thinking ahead about loops that involve multiple arrays.

The problem

Currently we have two options:

and

My sense is that our `eachindex(A, B)` mechanism is too limited for the coming brave new world, and should gradually be retired. That means we should use `zip`. However, `zip` sometimes still has performance issues, some of which are unavoidable: when you have to increment two iterators rather than one, it's going to be slower, especially if incrementing is more expensive than the body. See benchmarks below.

This PR's solution

This PR introduces a new macro, `@shareindexes`, which takes a block like this:

and turns it into

An alternative that didn't work

I tried an alternative implementation based on

and then defining `start`, `done`, and `next` functions that test `same` at runtime. Unfortunately, this was slower than just plain `zip`. If we could hoist that `same` check out of the entire block, then this solution would be just like the macro-based solution (but prettier).

Benchmarks

Using the functions `mydot1`, `mydot2`, and `mydotshared` in the new tests (with a `@simd` added in front of the loop in `mydot1` for good measure), with arrays of size (10001,10000), here are timings.

Like in the tests, the first in each block is for a pair of `LinearFast` arrays, the second for `LinearSlow` arrays, and the third with one of each.

I'm not quite sure why `mydotshared` isn't as fast as `mydot1`, but I've verified that this also holds when I write out that function manually rather than using the macro. The noteworthy case is when there are two `CartesianRange` iterators, which yields a ~4x speedup.

CC @JeffBezanson, @mbauman.
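The macro's input block and its expansion were elided from this page. The following is a hedged reconstruction of the kind of rewrite the description implies (hoist the sameness test out of the loop and duplicate the body), not the PR's actual code; all names are illustrative.

```julia
# Input block (as described: a zip over two eachindex iterators):
#
#     for (IA, IB) in zip(eachindex(A), eachindex(B))
#         # body
#     end
#
# Plausible expansion: test iterator equality once, then run one of two
# loops, so the fast branch advances only a single iterator.
RA = eachindex(A)
RB = eachindex(B)
if RA == RB
    for I in RA
        IA = IB = I          # shared index: one increment per iteration
        # body
    end
else
    for (IA, IB) in zip(RA, RB)
        # body (unchanged general case)
    end
end
```

Because the branch sits outside both loops, each loop body is free of per-iteration sameness checks, which is the effect the runtime-`same` alternative could not achieve without hoisting.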