Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More efficient permutedims #6517

Closed
wants to merge 14 commits into from
Closed

Conversation

Jutho
Copy link
Contributor

@Jutho Jutho commented Apr 14, 2014

Dear Julia developers,

this is a first suggestion for a performance improvement for permutedims! by using a cache-friendly blocking strategy. The improvement can be large (up to a factor of 5 or more) for big arrays (100 x 100 x 100 x 100 or 10 x 10 x 10 x 10 x 10 x 10 x 10 x 10) and permutations that mix strongly, such as the complete reversing of dimensions (p=N:-1:1), although there might be a penalty for small arrays because of the dynamic computation of a blocking strategy. This could matter in loops where small arrays are permuted many times. I will try to post some timing results later today.

@Jutho
Copy link
Contributor Author

Jutho commented Apr 14, 2014

There is certainly something still wrong with the current pull request, as it is actually performing worse on several test cases. I will try to further optimise it over the course of the following days. Any suggestions are welcome.

@Jutho
Copy link
Contributor Author

Jutho commented Apr 15, 2014

I have made some additional changes and benchmarked the new and old implementation. The result can be viewed here:
http://nbviewer.ipython.org/gist/Jutho/10713261
For small arrays, the new implementation of permutedims! is at most 2 times slower (red line on final plot), which I think is due to the fact that the new implementation is more general: it allows a general StridedArray as output array and it thus has to compute two non-trivial linear indices. In this regime, the data completely fits into the cache and the computation time is dominated by the index computation, so the new implementation provides more general functionality at the cost of a doubled computation time. For larger data sizes where the complete data does not fit into the cache, the new implementation is often several times faster than the old one.

I have also removed permutedims! from export, as it is not documented and this fits with what I believe is the general philosophy not to export the mutating functions (e.g. transpose! is not exported).

@quinnj
Copy link
Member

quinnj commented Apr 15, 2014

This seems like great work Jutho. Perhaps for smaller arrays, we default to
a faster implementation and use this for larger. This same style is used in
the Sort module.
On Apr 15, 2014 4:27 AM, "Jutho" notifications@github.com wrote:

I have made some additional changes and benchmarked the new and old
implementation. The result can be viewed here:
http://nbviewer.ipython.org/gist/Jutho/10713261
For small arrays, the new implementation of permutedims! is at most 2
times slower (red line on final plot), which I think is due to the fact
that the new implementation is more general: it allows a general
StridedArray as output array and it thus has to compute two non-trivial
linear indices. In this regime, the data completely fits into the cache and
the computation time is dominated by the index computation, so the new
implementation provides more general functionality at the cost of a doubled
computation time. For larger data sizes where the complete data does not
fit into the cache, the new implementation is often several times faster
than the old one.

I have also removed permutedims! from export, as it is not documented and
this fits with what I believe is the general philosophy not to export the
mutating functions (e.g. transpose! is not exported).


Reply to this email directly or view it on GitHubhttps://github.com//pull/6517#issuecomment-40456983
.

@Jutho
Copy link
Contributor Author

Jutho commented Apr 16, 2014

I can certainly implement this. The point is that the double index computation is unavoidable if the output tensor is also a general StridedArray. Hence the additional method (which will basically be a copy of the old version of permutedims!) will only handle the case of small arrays for which the output is a normal Array or BitArray.

I also have another question. Currently the cache size information is independently hardwired in several methods throughout Julia Base, e.g. in Base.transpose! and in Base.LinAlg.generic_matmatmul, and now also in my implementation of permutedims! (actually in the auxiliary function blockdims). It would probably make sense to centralise this information and maybe even to try to get it from the system (or to set it at build time) instead of fixing it at 32k.

@timholy
Copy link
Member

timholy commented Apr 16, 2014

I like the idea of centralizing cache size information.

@ViralBShah
Copy link
Member

+1 to cenralizing cache size info.

@Jutho
Copy link
Contributor Author

Jutho commented Apr 18, 2014

Should I create a new file cache.jl containing the single line

cachesize = 1<<15

? (or preferably something smarter that actually queries it from the system environment)

@timholy
Copy link
Member

timholy commented Apr 18, 2014

I would put it in sysinfo.jl. If necessary I suspect that file could be moved earlier in the build sequence (although any constant won't be needed until used).

@ViralBShah
Copy link
Member

Apart from the cache information being hardcoded (should be a separate PR), is this ready to merge?

@ViralBShah
Copy link
Member

We certainly need to default to the current implementation for small arrays, and use this for larger arrays.

@andrewcooke
Copy link
Contributor

what cache are we talking about here? if it's a hardware cache i guess there is much bike-shedding to be had over which and how this is all represented? but anyway, i would like to know what L1 (say) is on the current machine. is that the idea? if so, how can i get it? has a separate issue been raised for that? where? thanks.

@ViralBShah
Copy link
Member

Let's get this PR done without worrying about the bike-shedding of cache stuff. There isn't a new issue to discuss this, but there probably should be. L1 and L2 cache sizes would be nice to have available, but it may require some effort to get this right across OSes and processor types.

@Jutho
Copy link
Contributor Author

Jutho commented Apr 26, 2014

I will try to finalise this PR over the weekend.

@Jutho
Copy link
Contributor Author

Jutho commented Apr 27, 2014

I have been experimenting with adding the default implementation for small arrays. I can then get the speed down to the old version; but I do not understand how to choose the value of the transition size. I would have thought that this would also be of the order of the level 1 cache size, but apparently I have to choose it much larger (order 100000 elements for a Float64 array, i.e. 800000 bytes) to get it equal speed (on average) to the old permutedims. Anybody has any insight in this. I will submit once my code is cleaned up.

@ViralBShah
Copy link
Member

The size of the L2 cache is where you are getting the performance transition, perhaps?

@timholy
Copy link
Member

timholy commented Apr 27, 2014

When you profile your new version, where are the bottlenecks?

@Jutho
Copy link
Contributor Author

Jutho commented Apr 27, 2014

Latest commit is untested, I pushed it as an easy way of saving because I had to completely reinstall julia (something with -malign-double being an unknown argument in libfftw after a make cleanall, and I was not able to fix it). I will come back with more news soon.

This commit contains the old version of permutedims! with its original name, and contains permutedims1! using a blocking strategy and permutedims2! using a recursive strategy. I will report test results soon and then make a decision.

@Jutho
Copy link
Contributor Author

Jutho commented Apr 28, 2014

While trying to benchmark my new code versus the old version of permutedims!, I suddenly noticed that the old permutedims! has become unbelievably slow for arrays with N>6 (i.e. N=7 and beyond). I switched back to today's master "Version 0.3.0-prerelease+2787 (2014-04-27 20:12 UTC)" to confirm that this was not something in my private branch. This currently takes more than 68 seconds on my machine:

A=randn(10,10,10,10,10,10,10,10);
B=similar(A);
@time Base.permutedims!(B,A,[8:-1:1]);

whereas it used to take around 2 seconds I would think. @time also indicates 28005483496 bytes allocated, which is impossible for this mutating method. This makes it a bit harder to properly benchmark my own implementations, but more importantly seems to indicate some major efficiency decrease in a recent update to Julia's base library. Profiling indicates that the issue is with the line
sumc = sum(@ntuple N d->counts_{d+1})
so I assume something is broken in the type inference.

@JeffBezanson
Copy link
Member

Wow. Could you try to bisect where this regression happened?

@Jutho
Copy link
Contributor Author

Jutho commented Apr 28, 2014

I have implemented an @nfunction in cartesian.jl and wrote a permutedimsnew2! which uses an explicit recursive approach with an inner function innerrec, defined via @nfunction. Nevertheless, this still seems to create overhead:

A=randn(100,100,100,100);
B=similar(A);B1=similar(A);B2=similar(A);

@time Base.permutedims!(B,A,[4:-1:1]);
@time Base.permutedimsnew!(B1,A,[4:-1:1]);
@time Base.permutedimsnew2!(B2,A,[4:-1:1]);

returns (after first compilation run)

elapsed time: 1.979682352 seconds (352 bytes allocated) # original permutedims!
elapsed time: 0.276397226 seconds (2416 bytes allocated) # explicit stack
elapsed time: 0.569465434 seconds (15358880 bytes allocated) # explicit recursion

I have no idea how to investigate these methods as code_typed is not particularly helpful for functions created using @ngenerate.

@Jutho
Copy link
Contributor Author

Jutho commented Apr 29, 2014

Experiencing a bit with the simpler case of transpose!, which doesn't require any Cartesian macros, has learned that the recursive approach is really fast (no allocation overhead) if it is just one big recursive function. If, on the other hand, the recursive kernel and base kernel are implemented as inner functions inside the transpose! function definition, then there is again a big allocation overhead. So to implement permutedims! nicer I need to find a way to make a recursive function that combines well with @ngenerate. I think I did something like that in early versions of my TensorOperations.jl package, but then I stopped using recursive methods because there was still the overhead of using tuples. But now with my @nfunction macro that should also be avoidable.

@Jutho
Copy link
Contributor Author

Jutho commented Apr 29, 2014

I will actually have to redraw my statement of yesterday, cause as soon as the recursive function gets wrapped inside another function or some let block, there seems to be a huge allocation overhead to the recursive calls. So to summarise (and maybe somebody can comment and correct me if I am wrong):

  • recursive function with NTuple{N,Int} for the arguments (strides,dimensions,...): no need to use any dynamic code generation using metaprogramming, but allocation overhead coming from creating new tuples at every level of recursion
  • recursive function with all arguments as simple Ints: new function definitions need to be generated for every N, and these will live inside some other function or let block or ..., and there also seems to be a huge allocation overhead for recursive application of such functions

So I guess this only leaves the current implementation as option:

  • building a 'wannabe' recursive function using an explicit stack: not very clean code, easy to make mistakes, but it works

@quinnj
Copy link
Member

quinnj commented Jul 2, 2014

Bump. Seems a lot of great work has gone into this. What else needs to happen for a merge here?

@Jutho
Copy link
Contributor Author

Jutho commented Jul 2, 2014

I think the code works. What is less satisfactory is:

  • Code is not very nice, possibly the work on staged functions could help here.
  • Even though it is memory-oblivious via recursion (although not explicitly), the base case is still pretty large to get proper speed
  • This implementation is not always faster than the original one, it is just faster averaged over random permutations, with a much smaller standard deviation. It is a lot faster for the strongly mixing permutations, such as complete reverse perm=N:-1:1, but can be a bit slower for permutations that only move big blocks of contiguous data, e.g. no permutation perm=1:N, or permutations that only swap for example the last few indices, e.g. perm=[1,2,...,N,N-1].

@Jutho
Copy link
Contributor Author

Jutho commented Jul 2, 2014

Also, it seems like I still have a lot of cleaning up to do.

@Jutho
Copy link
Contributor Author

Jutho commented Jul 13, 2014

I have completely rewritten my TensorOperations.jl package using this type of cache-oblivious 'recursive' algorithms, see: https://github.com/Jutho/TensorOperations.jl/tree/cacheoblivious . It includes a function tensorcopy! , which is essentially permutedims! with a different way of specifying the arguments (i.e. the permutation is implicitly defined by assigning labels to the different dimensions/indices of the input and output tensor). It also includes tensorcontract! for performing the general contractions of two tensors, which can also be used to compute a matrix multiplication (and is thus related to what I tried in #6690).

I have now explicitly included a copy of cartesian.jl, which I modified a bit for my purpose. @timholy , I hope this is ok? Should I include any kind of reference or copyright at the top of this file? Before I just used on Base.Cartesian, but since made my package unavailable for julia 2.0. I made the following modifications: I removed all functionality related to splatting, which is unneeded in my case. I included macros @mngenerate and @mnpgenerate for creating methods with 2 or 3 different dimensionality parameter. I performed some integer arithmetic evaluation so that @nloops, @nexprs etc could be called with e.g. N1-div(N1+N2-N3,2). I also included a new macro @stridedloop which allows to write the permutedims kernel in a single line, and is also useful for the other algorithms. I also generate the kernels (base case) for the other methods via macros, such that they are easy to replace by better ones at a later stage.

Performancewise, not much has changed since before. I will try to benchmark some more the next few days but I am still not sure whether this kind of implementation of permutedims! is ready to be merged to base.

@timholy
Copy link
Member

timholy commented Jul 13, 2014

Certainly, you can do whatever you like with Cartesian! Glad it was useful to you.

However, just a couple of days ago I ran into the same issue, and decided to make the Cartesian package just be a mirror of Base.Cartesian---basically bringing Base.Cartesian to 0.2. So now Images.jl contains logic like this at the top:

if VERSION.minor < 3
    using Cartesian
else
    using Base.Cartesian
end

I think that might be a good practice in general when people want to maintain version compatibility but use some new Julia feature. Perhaps we need a JuliaBackports organization?

Your other changes sound really interesting. @ngenerate will eventually be killed off by staged functions, but in the meantime I completely agree that being able to do easy multidimensional generation is quite useful. Instead of @ngenerate, @mngenerate, and @mnpgenerate, what about generalizing @ngenerate to check whether the first argument is a tuple-expression, and call it as @ngenerate (M,N,P) retturntype expr?

Re @stridedloops, I can't quite figure out what that's supposed to do. Presumably you know you can pass a step, e.g., @nloops 3 i d->(1:3:size(A,d)) bodyexpr? Even d->(start_d:step_d:stop_d) works fine.

Also, 💯 for being able to do math on N. I should have added that ages ago, don't know why I didn't.

Finally, in 3f0adc6 I made some changes in lreplace! that should make certain multidimensional operations much more useful. Necessary make expressions like dest_{m+(n-1)*$K} = tmp work properly in #7568.

@Jutho
Copy link
Contributor Author

Jutho commented Jul 13, 2014

Thanks for all the input and tips.

On 13 Jul 2014, at 12:15, Tim Holy notifications@github.com wrote:

Certainly, you can do whatever you like with Cartesian! Glad it was useful to you.

However, just a couple of days ago I ran into the same issue, and decided to make the Cartesian package just be a mirror of Base.Cartesian---basically bringing Base.Cartesian to 0.2. So now Images.jl contains logic like this at the top:

if VERSION.minor < 3
using Cartesian
else
using Base.Cartesian
end
I think that might be a good practice in general when people want to maintain version compatibility but use some new Julia feature. Perhaps we need a JuliaBackports organisation?

I’ll try to do something similar. This of course requires that I make some commits to Cartesian and Base.Cartesian to include the extra functionality. Should I commit to Cartesian or to Base.Cartesian, or to both.

Your other changes sound really interesting. @ngenerate will eventually be killed off by staged functions, but in the meantime I completely agree that being able to do easy multidimensional generation is quite useful. Instead of @ngenerate, @mngenerate, and @mnpgenerate, what about generalizing @ngenerate to check whether the first argument is a tuple-expression, and call it as @ngenerate (M,N,P) retturntype expr?

That's a good suggestion, I will try to rewrite the code.
Re @stridedloops, I can't quite figure out what that's supposed to do. Presumably you know you can pass a step, e.g., @nLoops 3 i d->(1:3:size(A,d)) bodyexpr? Even d->(start_d:step_d:stop_d) works fine.

That’s indeed the idea, but the point is that you often want counters that increment at different rates. e.g. for the permutedims! you need something like

indA_2=startA
indC_2=startC
for i_2=1:dim_2
    indA_1=indA_2
    indC_1=indC_2
    for i_1=1:dim_1
        @inbound C[indC_1]=A[indC_1]
        indA_1+=stepA_1
        indC_1+=stepC_1
    end
    indA_2+=stepA_2
    indC_2+=stepC_2
end

With @stridedloops, this can be written as @stridedloops(2,i,indA,startA,stepA,indC,startC,stepC,@inbounds C[indC]=A[indA]). Nothing spectacular, just very convenient. This does not have to be added to Cartesian, it can remain in my TensorOperations package.

Also, :+100: for being able to do math on N. I should have added that ages ago, don't know why I didn't.

I wrote a new function intarith! for this, but I later noticed you have a function exprresolve which does the same (and even more). Maybe this can be applied to the body expression passed to ngenerate?
Finally, in 3f0adc6 I made some changes in lreplace! that should make certain multidimensional operations much more useful. Necessary make expressions like dest_{m+(n-1)*$K} = tmp work properly in #7568.

Good to know.


Reply to this email directly or view it on GitHub.

@timholy
Copy link
Member

timholy commented Jul 13, 2014

Should I commit to Cartesian or to Base.Cartesian, or to both.

Pick whichever is easier for you (presumably Cartesian, so you don't have to rebuild Julia all the time), and then we can discuss.

Thanks for the explanation about @stridedloops, makes much more sense now. However, I suspect you can do all that with pre and post expressions. For example, in the reduction code (reducedim.jl) we access the input array A with indexes i, and the (reduced) output array R with indexes j. We use a pre-expression d->(j_d = sizeR_d==1 ? 1 : i_d) which basically says "if we're reducing along dimension d, set j_d to 1, otherwise set it equal to i_d." (A fun point for the afficianados: while it might seem that this branch in the reduction code would be expensive, for any given call to this function it evaluates the same way every time, so the CPU's branch prediction seems to basically eliminate it.)

I suspect you could write your increments something like this:

@nexprs N d->(indA_d = startA)
# The last looks like assigned them all, but LLVM will probably get rid of the unnecessary ones
@nexprs N d->(d == N ? (indC_d = startC) : nothing)  # or you could get rid of them this way
@nloops(N, i, A,
        d->(d > 1 ? (indA_{d-1}=indA_d; indC_{d-1}=indC_d) : nothing), # pre-expression
        d->(indA_d += stepA_d; indC_d += stepC_d),   # post-expression
                @inbounds C[indC_1]=A[indC_1])   # body

Maybe this can be applied to the body expression passed to ngenerate?

Yes, that would do it.

@Jutho
Copy link
Contributor Author

Jutho commented Jul 13, 2014

Thanks for the explanation about @stridedloops, makes much more sense now. However, I suspect you can do all that with pre and post expressions. For example, in the reduction code (reducedim.jl) we access the input array A with indexes i, and the (reduced) output array R with indexes j. We use a pre-expression d->(j_d = sizeR_d==1 ? 1 : i_d) which basically says "if we're reducing along dimension d, set j_d to 1, otherwise set it equal to i_d." (A fun point for the afficianados: while it might seem that this branch in the reduction code would be expensive, for any given call to this function it evaluates the same way every time, so the CPU's branch prediction seems to basically eliminate it.)

It most certainly can be done with the pre and post of nloops, this is how I was doing it before and how the original permutedims! is written. It is just a more specialised version of nloops that makes it easier (shorter) to write this specific pattern, because I needed it quite often in the algorithms in TensorOperations, including nested three times in the tensorcontract routines (and nested 6 times in the old implementation that used blocking). Ignore my enthusiasm, I was just happy to be able to write this in a single line since the original code of my algorithms looked horrible :D.

@timholy
Copy link
Member

timholy commented Jul 14, 2014

I didn't mean it as criticism---there is certainly room for more specialized implementations that make common tasks easier. When I started developing Cartesian in the first place, I was struggling to find a generic API, and had about 5 different variants of @nloops. Once I figured out a more general API, I have tended to obsess about using it everywhere 😄.

@quinnj
Copy link
Member

quinnj commented May 9, 2016

Going through old PRs here......is this something that is still relevant for Base? Seems like a lot of new machinery has gone into Arrays/iteration, but I also wonder if this is a better candidate for the Combinatorics.jl package?

@ViralBShah
Copy link
Member

This one certainly belongs to Base.

@Jutho
Copy link
Contributor Author

Jutho commented May 9, 2016

But this particular PR can certainly be closed as it is completely outdated and will need to be redone ( (this was before @generated functions. permutedims has already evolved since this PR was created, though as far as I know it is still not yet cache efficient / oblivious.

@ViralBShah ViralBShah closed this May 9, 2016
@timholy
Copy link
Member

timholy commented May 9, 2016

The generic algorithm (meaning, the one that doesn't assume linear indexing & strides) is somewhat cache-efficient, but without a doubt more could be done to tune its performance. Once I beat the alternative of copying the input array, I just quit there (bad me).

@stevengj
Copy link
Member

stevengj commented May 9, 2016

@timholy, I'm a little confused by that code. Where is the permutation in P[I1, i, I2, j, I3] = src[I1, i, I2, j, I3]? All the arguments seem to be in the same order on the lhs and rhs.

@timholy
Copy link
Member

timholy commented May 9, 2016

So glad you noticed how sneaky it is. The magic is in the definition of setindex!. The implementation is:

  • allocate the output array dest
  • wrap dest in a PermutedDimsArray "view," called P. P has the same size as the original matrix you want to permute.
  • copy the data to P. From the perspective of dest, setindex! takes care of rearranging the indexes to enact the permutation. (There's also a getindex, naturally, which does the same index-swapping upon access.)
  • discard the wrapper and return dest

So it's the implementation of copy! that does all the work---you basically get two methods for the price of one, meaning you can concentrate your efforts at improving performance.

Plus, julia now supports arrays with arbitrary storage order (in less than 100 lines of code). This simply turned out to be the most modular way to implement permutedims in a way that avoids linear indexing (which is horrible for many array types). But obviously this can go way, way beyond that. I'm not sure how well these views will work for general use, hence this infrastructure is deliberately buried (not exported, not even visible in Base) until there's evidence that it's safe to use more broadly. It definitely would be good for adventurous souls to kick the tires and see what's missing.

@stevengj
Copy link
Member

stevengj commented May 9, 2016

@timholy, because of that rearrangement of indices, it is not as cache-efficient as it could be. The problem is that at least one of the arrays (either the source or the destination) will be reading/writing data non-consecutively in memory, which hurts cache-line utilization.

However, for power-of-two sizes (which are the worst case for cache-line conflicts), permutedims(A, (2,1)) on Julia 0.4 seems 2-3 times worse than copy(A), which is actually surprisingly good. Maybe cache associativity has gotten a lot larger since the last time I looked at this sort of thing? Or is the transpose case special-cased somewhere?

@timholy
Copy link
Member

timholy commented May 9, 2016

That's the reason for splitting out the loops over src's first dimension and dest's first dimension (the i and j loops) separately. It does the copy in 8x8 blocks (a number I chose arbitrarily).

@timholy
Copy link
Member

timholy commented May 9, 2016

Oh shoot, I was going by a mix of memory and the code, but just glanced again and I see I've been a little bit confusing. To clarify, _permutedims! is just a specialized implementation of copy!, for those cases where the arrays don't share an extended first dimension in common. When both arrays share the same first dimension(s), and those dimensions correspond to quite a few elements (perhaps I should set the threshold bigger than 1 😄), then there's really no harm in using plain-old copy!.

@Jutho Jutho deleted the jh/permuteperformance branch August 23, 2018 08:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants