Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Efficient cartesian iteration (new version of #6437) #8432

Merged
merged 7 commits into from
Nov 15, 2014

Conversation

timholy
Copy link
Member

@timholy timholy commented Sep 21, 2014

This is the complement of "adding array views": while it seems not to be widely appreciated, the reality is that none of the View proposals (mine included) have good performance in the general case if we rely on linear indexing (demos below, see for example also a recent StackOverflow post). This is basically a direct consequence of the fact that div is horribly slow. Consequently, we need to transition to writing algorithms with multidimensional/cartesian iterators. While the Cartesian macros provide a way to do this, I think it's fair to say that everyone would like a solution that doesn't rely on macros. While this PR offers only a small fraction of the power of Cartesian, at least simple cases can now get good performance without resorting to macros.

This is basically a redo of the long-languishing #6437, with two differences:

At least for low dimensional Arrays, which have efficient linear indexing, there's scarcely any difference between linear and cartesian iteration. Nevertheless I preserved linear indexing for Arrays using the "traits trick". If you have an AbstractArray that has fast linear indexing, you can cause start to use it simply by declaring Base.linearindexing(::MyArrayType) = Base.LinearFast().

The only user-visible disadvantage I'm aware of in not using tuples is that the notation for multidimensional indexing needs to be A[I] rather than A[I...]. Consequently if/when we switch to tuples, people will have to update their code. In my opinion, that's a small price to pay for finally having efficient multidimensional iteration.

In the performance demos below, there's still room for some improvement. The most important one, I suspect, would be for types that have tuples in them to "inline" the tuple into the type (as pointed out by @simonster among others). Of course if we get that, this could be redone using tuples.

Performance demos:

function mdsum(A)
    s = 0.0
    for a in eachelement(A)
        s += a
    end
    s
end

function mdsum2(A)
    s = 0.0
    @inbounds for I in eachindex(A)
        s += A[I]
    end
    s
end

function sumlinear(A)
    s = 0.0
    @inbounds for i = 1:length(A)
        s += A[i]
    end
    s
end

function sumcart2(A::AbstractMatrix)
    s = 0.0
    @inbounds for j = 1:size(A,2), i = 1:size(A,1)
        s += A[i,j]
    end
    s
end

println("Arrays")
A = rand(10^4, 10^4)
mdsum(A); mdsum2(A); sumlinear(A); sumcart2(A)
@time mdsum(A)
@time mdsum2(A)
@time sumlinear(A)
@time sumcart2(A)

println("SubArrays")
B = sub(A, 2:9999, 2:9999)
mdsum(B); mdsum2(B); sumlinear(B); sumcart2(B)
@time mdsum(B)
@time mdsum2(B)
@time sumlinear(B)
@time sumcart2(B)

using ArrayViews
println("ArrayViews")
B = view(A, 2:9999, 2:9999)
mdsum(B); mdsum2(B); sumlinear(B); sumcart2(B)
@time mdsum(B)
@time mdsum2(B)
@time sumlinear(B)
@time sumcart2(B)

Results:

julia> include("/tmp/testiteration.jl")
Arrays
elapsed time: 0.15763366 seconds (13896 bytes allocated)
elapsed time: 0.209725186 seconds (96 bytes allocated)
elapsed time: 0.133438424 seconds (96 bytes allocated)
elapsed time: 0.133621887 seconds (96 bytes allocated)
SubArrays
elapsed time: 0.491196255 seconds (96 bytes allocated)
elapsed time: 0.406468852 seconds (96 bytes allocated)
elapsed time: 1.871101162 seconds (96 bytes allocated)
elapsed time: 0.283826772 seconds (96 bytes allocated)
ArrayViews
elapsed time: 1.331280489 seconds (96 bytes allocated)
elapsed time: 0.27865117 seconds (96 bytes allocated)
elapsed time: 2.543739468 seconds (96 bytes allocated)
elapsed time: 0.253180521 seconds (96 bytes allocated)
4.99765960924932e7

@timholy
Copy link
Member Author

timholy commented Sep 21, 2014

I should have added, the reason for the poor performance with mdsum and ArrayViews is the check for negative array dimensions. If you delete that, then the performance of the two mdsum variants are similar. It would be nice, of course, if the compiler would hoist out the size check.

@carlobaldassi
Copy link
Member

@timholy sorry, I just saw your comment on #6035 (I've been catching up with my inbox today). If this gets merged, it could indeed be used to simplify and generalize the broadcasting code considerably (among other things), so I'm all for it.

This was referenced Sep 22, 2014
@timholy timholy changed the title Efficient cartesian iteration (new version of #6437) WIP: efficient cartesian iteration (new version of #6437) Sep 25, 2014
@timholy
Copy link
Member Author

timholy commented Sep 25, 2014

I changed the title now to include WIP, since with the merger of stagedfunctions there are likely a couple of tweaks I want to make (see #8472 (comment)).

@jakebolewski
Copy link
Member

Does this mean that cartesian.jl in its current form can be removed from Base?

@timholy
Copy link
Member Author

timholy commented Sep 25, 2014

Not even close---Cartesian does so much that this cannot (yet) do. But it does mean that one won't have to use @nloops in simple cases, like copy!, fill!, etc. (Basically, any function that has the pattern of visiting each element of an AbstractArray.) The reality is that most of base (main exceptions: reductions & broadcasting) doesn't need more than that, so I think this could have a pervasive effect in allowing us to write macro-free code that nevertheless performs well for arbitrary AbstractArrays.

It also allows something cool that Cartesian can't easily do: "zipped" iteration over containers that have commensurate sizes but different shapes, like copying a 5x1x3 array to a 5x3 array. I've always been annoyed by this open TODO. It really relied on linear indexing, and if X is of type that doesn't have efficient linear indexing, you're doomed. Cartesian struggled with this because @ngenerate can really only handle iteration over a single fixed dimensionality. (@Jutho added a nice extension that can handle such cases. It was merged into the Cartesian package but not into Base.)

It's also worth pointing out that with stagedfunctions, @ngenerate is on the chopping block. Very happy about that. But won't have time to get to it for a while.

@jiahao jiahao force-pushed the master branch 3 times, most recently from 6c7c7e3 to 1a4c02f Compare October 11, 2014 22:06
@timholy timholy changed the title WIP: efficient cartesian iteration (new version of #6437) Efficient cartesian iteration (new version of #6437) Nov 11, 2014
@timholy
Copy link
Member Author

timholy commented Nov 11, 2014

While I can't find it now, I seem to recall a recent declaration that stagedfunctions should not have side effects. So, until we have stagedtypes (#8472), I think this should stay as-is. Will merge soon if no complaints arise.

@vchuravy
Copy link
Member

Comment about side effects: #8886 (comment)

@Jutho
Copy link
Contributor

Jutho commented Nov 11, 2014

I have one comment. The current implementation does not allow new methods to make use of the new Iterators, since they would need to be implemented in the

let implemented
...
end

block, together with eachelement and eachindex. A nicer implementation could be to have an

let implemented=IntSet()
    stagedfunction Base.call(::Type{SizeIterator},sizes::NTuple{N,Int})
        if !(N in implemented)
            generate ...
        end
       . ..
        return ...
    end
end

and then let eachelement do something like

function eachindex(A::AbstractArray)
return SizeIterator(size(A))
end

@timholy
Copy link
Member Author

timholy commented Nov 11, 2014

Thanks, @vchuravy.

Agreed, @Jutho. I'll look into that.

timholy added a commit that referenced this pull request Nov 11, 2014
@timholy
Copy link
Member Author

timholy commented Nov 11, 2014

@Jutho, it's done, see if this is better.

Waiting for Travis (had to re-start OSX, but it passed Linux).

@Jutho
Copy link
Contributor

Jutho commented Nov 11, 2014

I would think one could still do without the register function. I was thinking something along the lines of
https://gist.github.com/Jutho/832f3f4aee84cf927a53

I also took the liberty to rename your types to CartesianIndex and IndexIterator, which I find personally more appropriate than Subscript (kind of vague) and SizeIterator (it's not really iterating over sizes, but over indices). But that's a purely subjective matter.

@timholy
Copy link
Member Author

timholy commented Nov 12, 2014

It looks mostly good, but if I insert a println("Here") at this line, then I get a problem (note that it pre-constructs the types for dimensions 1-8, and starting at 9 it relies on dynamic generation):

julia> A9 = rand(3,3,3,3,3,3,3,3,3);

julia> A8 = rand(3,3,3,3,3,3,3,3);

julia> S9 = sub(A9, 1:3, 1:3, 1:3, 1:3, 1:3, 1:3, 1:3, 1:3, 1:3);

julia> S8 = sub(A8, 1:3, 1:3, 1:3, 1:3, 1:3, 1:3, 1:3, 1:3);

julia> z = 0.0; for s in IteratorsMD.eachelement(S8)
           z += s
       end
Here

julia> z = 0.0; for s in IteratorsMD.eachelement(S9)
           z += s
       end

julia> 

It never actually builds the type.

@Jutho
Copy link
Contributor

Jutho commented Nov 12, 2014

Indeed, I made a small last-minute change which broke it. I restored this now (latest revision of the same gist) by having the start{T,N}(AT::(AbstractArray{T,N},LinearSlow)) out of the eval and letting it call CartesianIndex, which triggers the type generation when using eachelement. For eachindex, the types are already generated when the iterator is constructed. In principle, one could also move start{N}(iter::IndexIterator{N}) out of the eval and letting it call CartesianIndex, but the check for any zero dimension might be slightly uglier.

@timholy timholy force-pushed the teh/cartesian_iteration branch from fb036e3 to b9744b1 Compare November 12, 2014 21:56
@timholy
Copy link
Member Author

timholy commented Nov 12, 2014

I decided to go with your implementation. Technically it is not effect-free, since calling gen_cartesian from inside the stagedfunction can trigger consequences that are not encapsulated by the returned expression. I had been wondering about doing it this way, but decided against it given #8886 (comment). But it's such a carefully-managed side-effect, I'm back to thinking that this counts as a legitimate exception.

One nice thing is that this version allows us to get rid of eachelement. (Before we only needed it for dimensions > 8, but it's just nicer this way.) I also fixed a small bug in one implementation of next. b9744b1#diff-89e97f67f8058d32eac8d917db5bbdbfR80.

Let's see what Travis says; if it goes green, then I say it's merge-time.

@ivarne
Copy link
Member

ivarne commented Nov 15, 2014

Not sure. I might be miss remembering this, but I'm pretty sure I tried this once, and I don't remember any problems. I would also guess absolute file references would just work, because the previous directory is still there. Now I'm starting to fear that I never finished that test, because the copy took too long and I got side tracked.

@timholy
Copy link
Member Author

timholy commented Nov 15, 2014

OK, I think I figured it out. julia was getting confused, especially during the bootstrap phase, about the difference between the traits-call start(::(AbstractArray,LinearFast/Slow)) and normal iteration for tuples.

I ran the entire perf suite, and didn't see anything convincing in terms of a performance regression. And of course, we have things like this:

function mdsum(A)
    s = 0.0
    for a in A
        s += a
    end
    s
end
A = randn(10^4,10^4)
S = sub(A, :, :)

After warmup, master:

julia> @time mdsum(A)
elapsed time: 0.184208858 seconds (13824 bytes allocated)
2311.344158094607

julia> @time mdsum(S)
elapsed time: 2.006607241 seconds (96 bytes allocated)
2311.344158094607

This PR:

julia> @time mdsum(A)
elapsed time: 0.217518824 seconds (13824 bytes allocated)
-653.3611064403863

julia> @time mdsum(S)
elapsed time: 0.223128702 seconds (216 bytes allocated)
-653.3611064403863

Any other issues to be concerned about before merging?

@Jutho
Copy link
Contributor

Jutho commented Nov 15, 2014

That's great news. Makes sense, ... in hindsight. Great detective work.

@timholy
Copy link
Member Author

timholy commented Nov 15, 2014

Yeah. The tricky part was that (1) it wasn't causing errors, and (2) it only affected code compiled during the bootstrap sequence before type-inference gets turned on. So I was getting different timing results for whether this was being built on top of a successful build of master, or if I did a make clean && make. For a while that was quite confusing.

timholy added a commit that referenced this pull request Nov 15, 2014
Efficient cartesian iteration (new version of #6437)
@timholy timholy merged commit 37ffa13 into master Nov 15, 2014
@timholy
Copy link
Member Author

timholy commented Nov 15, 2014

Man that felt good. I've been wanting this for ages.

@StefanKarpinski
Copy link
Member

Exciting.

@ViralBShah
Copy link
Member

This is really amazing to have.

@tkelman
Copy link
Contributor

tkelman commented Nov 16, 2014

This is causing a bunch of ambiguity warnings on Windows with SharedArrays - https://gist.github.com/anonymous/dc0a75ac1474f5d67b8a (the backtrace failure at the end has been happening for a while, that's not relevant here)

@Jutho
Copy link
Contributor

Jutho commented Nov 16, 2014

I guess the current implementation assumes that there were no SharedArrays for Windows, that't easy to change. However, the underlying problem might also be that the getindex definitions for SharedArray are to general. I had the same problem with trying to define a new getindex method for indexing an AbstractArray with a new custom type in TensorOperations.jl in order to implement some index notation. Because SharedArray defines a getindex(::SharedArray,::Any), you always need to define special methods for SharedArray, which requires looking into its definition etc. I don't think this is a viable strategy, so I would like to see the type of the index arguments in the getindex methods become restricted to e.g. Union(Real,AbstractVector). Then these extra patches like https://github.com/JuliaLang/julia/pull/8432/files#diff-89e97f67f8058d32eac8d917db5bbdbfR46 become unnecessary.

@@ -515,6 +515,8 @@ export
cumsum,
cumsum!,
cumsum_kbn,
eachelement,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the end you ended up not keeping eachelement, so this export should be removed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, indeed. See 40a5a41. Thanks!

Jutho added a commit that referenced this pull request Nov 17, 2014
@timholy timholy deleted the teh/cartesian_iteration branch November 18, 2014 18:42
@ivarne
Copy link
Member

ivarne commented Nov 19, 2014

Hmm... How does this new function eachindex() work?

@timholy
Copy link
Member Author

timholy commented Nov 19, 2014

Hmm... seems I should add some documentation!

@timholy
Copy link
Member Author

timholy commented Nov 19, 2014

@ivarne, see 9b4f212

@ivarne
Copy link
Member

ivarne commented Nov 19, 2014

😃 Now I understand what this is really about.

Should it be defined for other indexable containers too?

Does it have enough overlap with keys() so that we can overload that function instead? (I guess the answer is NO, because eachindex returns a tuple you have to splat into getindex, while keys returns the key value)

@Jutho
Copy link
Contributor

Jutho commented Nov 19, 2014

Well, it's not really a tuple because of the overhead, so its a new type CartesianIndex_N. Since getindex is overloaded to accept this type, there is also no splatting involved. So maybe we could consider this as another method of keys.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.