More efficient permutedims #6517
Conversation
There is certainly something still wrong with the current pull request, as it is actually performing worse on several test cases. I will try to further optimise it over the course of the following days. Any suggestions are welcome. |
I have made some additional changes and benchmarked the new and old implementations. The result can be viewed here:

I have also removed permutedims! from export, as it is not documented and this fits with what I believe is the general philosophy of not exporting mutating functions (e.g. transpose! is not exported). |
This seems like great work Jutho. Perhaps for smaller arrays, we default to the current implementation.
|
I can certainly implement this. The point is that the double index computation is unavoidable if the output tensor is also a general StridedArray. Hence the additional method (which will basically be a copy of the old version of permutedims!) will only handle the case of small arrays for which the output is a normal Array or BitArray. I also have another question. Currently the cache size information is independently hardwired in several methods throughout Julia Base, e.g. in Base.transpose! and in Base.LinAlg.generic_matmatmul, and now also in my implementation of permutedims! (actually in the auxiliary function blockdims). It would probably make sense to centralise this information and maybe even to try to get it from the system (or to set it at build time) instead of fixing it at 32k. |
I like the idea of centralizing cache size information. |
+1 to centralizing cache size info. |
Should I create a new file cache.jl containing the single line cachesize = 1<<15? (Or preferably something smarter that actually queries it from the system environment.) |
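A minimal sketch of what such a file could contain; the constant name is made up here, and nothing below is taken from the PR:

# Hypothetical cache.jl: one central home for the heuristic currently
# duplicated in transpose!, generic_matmatmul and blockdims.
# 1 << 15 = 32768 bytes (32k); a smarter version would query the system.
const CACHESIZE = 1 << 15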
I would put it in |
Apart from the cache information being hardcoded (should be a separate PR), is this ready to merge? |
We certainly need to default to the current implementation for small arrays, and use this for larger arrays. |
what cache are we talking about here? if it's a hardware cache i guess there is much bike-shedding to be had over which and how this is all represented? but anyway, i would like to know what L1 (say) is on the current machine. is that the idea? if so, how can i get it? has a separate issue been raised for that? where? thanks. |
Let's get this PR done without worrying about the bike-shedding of cache stuff. There isn't a new issue to discuss this, but there probably should be. L1 and L2 cache sizes would be nice to have available, but it may require some effort to get this right across OSes and processor types. |
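For what it's worth, on Linux this information is exposed through sysfs, so a runtime query is possible in pure Julia. A hedged sketch, with the function name invented and the sysfs layout assumed (other OSes would need sysctl or Win32 calls instead):

# index0 under sysfs is typically the L1 data cache, index2 the L2 cache.
# Returns the size in bytes, or nothing if the information is unavailable.
function sysfs_cache_size(index::Int=0)
    path = "/sys/devices/system/cpu/cpu0/cache/index$(index)/size"
    isfile(path) || return nothing
    s = readchomp(path)                      # e.g. "32K"
    endswith(s, "K") ? 1024 * parse(Int, chop(s)) : parse(Int, s)
end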
I will try to finalise this PR over the weekend. |
I have been experimenting with adding the default implementation for small arrays. I can then match the speed of the old version, but I do not understand how to choose the transition size. I would have thought that it would also be of the order of the level-1 cache size, but apparently I have to choose it much larger (of order 100000 elements for a Float64 array, i.e. 800000 bytes) to get speed equal (on average) to the old permutedims. Does anybody have any insight into this? I will submit once my code is cleaned up. |
The size of the L2 cache is where you are getting the performance transition, perhaps? |
When you profile your new version, where are the bottlenecks? |
The latest commit is untested; I pushed it as an easy way of saving my work because I had to completely reinstall julia (something with -malign-double being an unknown argument in libfftw after a make cleanall, which I was not able to fix). I will come back with more news soon. This commit contains the old version of permutedims! with its original name, plus permutedims1! using a blocking strategy and permutedims2! using a recursive strategy. I will report test results soon and then make a decision. |
While trying to benchmark my new code versus the old version of permutedims!, I suddenly noticed that the old permutedims! has become unbelievably slow for arrays with N>6 (i.e. N=7 and beyond). I switched back to today's master "Version 0.3.0-prerelease+2787 (2014-04-27 20:12 UTC)" to confirm that this was not something in my private branch. This currently takes more than 68 seconds on my machine:

A=randn(10,10,10,10,10,10,10,10);
B=similar(A);
@time Base.permutedims!(B,A,[8:-1:1]);

whereas it used to take around 2 seconds, I would think. @time also indicates 28005483496 bytes allocated, which is impossible for this mutating method. This makes it a bit harder to properly benchmark my own implementations, but more importantly it seems to indicate a major efficiency decrease in a recent update to Julia's base library. Profiling indicates that the issue is with the line |
Wow. Could you try to bisect where this regression happened? |
I have implemented an

A=randn(100,100,100,100);
B=similar(A); B1=similar(A); B2=similar(A);
@time Base.permutedims!(B,A,[4:-1:1]);
@time Base.permutedimsnew!(B1,A,[4:-1:1]);
@time Base.permutedimsnew2!(B2,A,[4:-1:1]);

returns (after the first compilation run)
I have no idea how to investigate these methods as |
Experimenting a bit with the simpler case of transpose!, which doesn't require any Cartesian macros, has taught me that the recursive approach is really fast (no allocation overhead) if it is just one big recursive function. If, on the other hand, the recursive kernel and base kernel are implemented as inner functions inside the transpose! function definition, then there is again a big allocation overhead. So to implement permutedims! more cleanly I need to find a way to make a recursive function that combines well with |
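For illustration, a minimal sketch of the "one big recursive function" shape described above for the transpose! case; the function name, base-case size, and structure are mine, not the PR's:

const BASELEN = 64   # assumed base-case block length, not tuned

function rtranspose!(B::AbstractMatrix, A::AbstractMatrix,
                     r1::UnitRange{Int}=1:size(A,1), r2::UnitRange{Int}=1:size(A,2))
    if length(r1) <= BASELEN && length(r2) <= BASELEN
        # base kernel: the block is small enough to be cache-resident
        @inbounds for j in r2, i in r1
            B[j,i] = A[i,j]
        end
    elseif length(r1) >= length(r2)
        # recursive kernel: split the longer dimension in half
        h = first(r1) + (length(r1) >> 1) - 1
        rtranspose!(B, A, first(r1):h, r2)
        rtranspose!(B, A, h+1:last(r1), r2)
    else
        h = first(r2) + (length(r2) >> 1) - 1
        rtranspose!(B, A, r1, first(r2):h)
        rtranspose!(B, A, r1, h+1:last(r2))
    end
    return B
end

# Usage: B must already have the transposed shape, e.g.
# A = randn(1000, 800); B = Matrix{Float64}(undef, 800, 1000); rtranspose!(B, A)

Because the recursion is a single top-level function, no closures are allocated; this is the cache-oblivious idea, since halving continues until a block fits in whatever cache exists without ever knowing its size.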
I will actually have to retract my statement from yesterday, because as soon as the recursive function gets wrapped inside another function or some
So I guess this only leaves the current implementation as an option:
|
Bump. Seems a lot of great work has gone into this. What else needs to happen for a merge here? |
I think the code works. What is less satisfactory is:
|
Also, it seems like I still have a lot of cleaning up to do. |
I have completely rewritten my TensorOperations.jl package using this type of cache-oblivious 'recursive' algorithms, see: https://github.com/Jutho/TensorOperations.jl/tree/cacheoblivious . It includes a function tensorcopy!, which is essentially permutedims! with a different way of specifying the arguments (i.e. the permutation is implicitly defined by assigning labels to the different dimensions/indices of the input and output tensor). It also includes tensorcontract! for performing general contractions of two tensors, which can also be used to compute a matrix multiplication (and is thus related to what I tried in #6690). I have now explicitly included a copy of cartesian.jl, which I modified a bit for my purpose. @timholy, I hope this is ok? Should I include any kind of reference or copyright at the top of this file? Before, I just relied on Base.Cartesian, but this made my package unavailable for julia 0.2. I made the following modifications: I removed all functionality related to splatting, which is unneeded in my case. I included macros

Performance-wise, not much has changed since before. I will try to benchmark some more over the next few days, but I am still not sure whether this kind of implementation of |
Certainly, you can do whatever you like with Cartesian! Glad it was useful to you. However, just a couple of days ago I ran into the same issue, and decided to make the Cartesian package just be a mirror of Base.Cartesian---basically bringing Base.Cartesian to 0.2. So now Images.jl contains logic like this at the top:
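The snippet itself did not survive this scrape; a plausible reconstruction of such version-conditional loading (the exact Images.jl code may have differed):

if VERSION < v"0.3-"
    using Cartesian        # the package mirror, on julia 0.2
else
    using Base.Cartesian   # available from julia 0.3 onward
end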
I think that might be a good practice in general when people want to maintain version compatibility but use some new Julia feature. Perhaps we need a

Your other changes sound really interesting. Re

Also, 💯 for being able to do math on

Finally, in 3f0adc6 I made some changes in |
Thanks for all the input and tips.
indA_2 = startA          # linear index into A
indC_2 = startC          # linear index into C
for i_2 = 1:dim_2
    indA_1 = indA_2
    indC_1 = indC_2
    for i_1 = 1:dim_1
        @inbounds C[indC_1] = A[indA_1]
        indA_1 += stepA_1
        indC_1 += stepC_1
    end
    indA_2 += stepA_2
    indC_2 += stepC_2
end

With
|
Pick whichever is easier for you (presumably Cartesian, so you don't have to rebuild Julia all the time), and then we can discuss. Thanks for the explanation about

I suspect you could write your increments something like this:

@nexprs N d->(indA_d = startA)
# The line above looks like it assigns them all, but LLVM will probably get rid of the unnecessary ones
@nexprs N d->(d == N ? (indC_d = startC) : nothing) # or you could get rid of them this way
@nloops(N, i, A,
    d->(d > 1 ? (indA_{d-1} = indA_d; indC_{d-1} = indC_d) : nothing), # pre-expression
    d->(indA_d += stepA_d; indC_d += stepC_d),                         # post-expression
    @inbounds C[indC_1] = A[indA_1])                                   # body
Yes, that would do it. |
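To make that pattern concrete, here is a self-contained N = 2 instance of the @nloops idiom above; the function wrapper, its argument list, and the usage line are mine, added for illustration, while the variable names mirror the snippets in this thread:

using Base.Cartesian

function permcopy2!(C, A, startA, startC, dim_1, dim_2,
                    stepA_1, stepA_2, stepC_1, stepC_2)
    indA_2 = startA
    indC_2 = startC
    @nloops(2, i, d->1:dim_d,
        d->(d > 1 ? (indA_{d-1} = indA_d; indC_{d-1} = indC_d) : nothing), # pre
        d->(indA_d += stepA_d; indC_d += stepC_d),                         # post
        @inbounds C[indC_1] = A[indA_1])                                   # body
    return C
end

# e.g. a 2-D transpose of an m-by-n matrix A into an n-by-m matrix C:
# permcopy2!(C, A, 1, 1, m, n, 1, m, n, 1)

This expands to exactly the two hand-written nested loops shown earlier in the thread.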
It most certainly can be done with the pre and post of |
I didn't mean it as criticism---there is certainly room for more specialized implementations that make common tasks easier. When I started developing Cartesian in the first place, I was struggling to find a generic API, and had about 5 different variants of |
Going through old PRs here… is this something that is still relevant for Base? Seems like a lot of new machinery has gone into Arrays/iteration, but I also wonder if this is a better candidate for the Combinatorics.jl package? |
This one certainly belongs to Base. |
But this particular PR can certainly be closed, as it is completely outdated and will need to be redone (this was before
The generic algorithm (meaning, the one that doesn't assume linear indexing & strides) is somewhat cache-efficient, but without a doubt more could be done to tune its performance. Once I beat the alternative of copying the input array, I just quit there (bad me). |
@timholy, I'm a little confused by that code. Where is the permutation in |
So glad you noticed how sneaky it is. The magic is in the definition of
So it's the implementation of Plus, julia now supports arrays with arbitrary storage order (in less than 100 lines of code). This simply turned out to be the most modular way to implement |
@timholy, because of that rearrangement of indices, it is not as cache-efficient as it could be. The problem is that at least one of the arrays (either the source or the destination) will be reading/writing data non-consecutively in memory, which hurts cache-line utilization. However, for power-of-two sizes (which are the worst case for cache-line conflicts), |
That's the reason for splitting out the loops over |
Oh shoot, I was going by a mix of memory and the code, but just glanced again and I see I've been a little bit confusing. To clarify, |
Dear Julia developers,
this is a first suggestion for a performance improvement for permutedims! by using a cache-friendly blocking strategy. The improvement can be large (up to a factor of 5 or more) for big arrays (100 x 100 x 100 x 100 or 10 x 10 x 10 x 10 x 10 x 10 x 10 x 10) and permutations that mix strongly, such as the complete reversal of dimensions (p=N:-1:1), although there might be a penalty for small arrays because of the dynamic computation of a blocking strategy. This could matter in loops where small arrays are permuted many times. I will try to post some timing results later today.
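To illustrate the blocking idea in its simplest two-dimensional form, a minimal sketch with a fixed block length; the fixed length is an assumption on my part, whereas the PR computes block sizes dynamically in blockdims from the cache size:

const BLOCK = 64   # assumed block length, not taken from the PR

function blocked_transpose!(B::AbstractMatrix, A::AbstractMatrix)
    m, n = size(A)
    # Visit A in BLOCK-by-BLOCK tiles so that both the contiguous reads
    # from A and the strided writes to B stay within a cache-sized working set.
    for jb in 1:BLOCK:n, ib in 1:BLOCK:m
        @inbounds for j in jb:min(jb+BLOCK-1, n), i in ib:min(ib+BLOCK-1, m)
            B[j, i] = A[i, j]
        end
    end
    return B
end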