room for performance improvement for SubArray #5117
A better version that writes profiles into different files:

function mykern(r, s, a, x, y)
    const nsteps = 25000
    istep = 0
    ifirst = minimum(r)
    ilast = maximum(r)
    while istep < nsteps #&& t < tmax
        for i = ifirst : ilast-1
            s[i] = a * (x[i+1] - x[i]) / (y[i+1] - y[i]) # line #10
        end
        for i = ifirst : ilast
            x[i] = a * s[i-1] # line #14
        end
        for i = ifirst : ilast
            y[i] = a * x[i-1]
        end
        for i = ifirst : ilast-1
            s[i] = s[i] - (x[i+1] - x[i]) / (y[i+1] - y[i])
        end
        istep = istep + 1
    end
    s
end
const siz = 10^4
const a = e^pi - pi^e
s = rand(siz)
x = rand(siz)
y = rand(siz)
s1 = sub(s, 1:(siz - 1))
x1 = sub(x, 1:(siz - 1))
y1 = sub(y, 1:(siz - 1))
s2 = sub(s, 2:siz)
x2 = sub(x, 2:siz)
y2 = sub(y, 2:siz)
s3 = sub(s, 4:siz)
# warming up
# mykern( 2:(siz - 3), s, a, x, y )
# mykern( 2:(siz - 3), s, a, x1, y1 )
# mykern( 3:(siz - 2), s, a, x2, y2 )
# @time mykern( 2:(siz - 3), s, a, x, y )
# the real test
println("follow up: mykern( 2:(siz - 3), s, a, x, y )")
@time mykern( 2:(siz - 3), s, a, x, y )
#println("follow up: mykern( 2:(siz - 3), s1, a, x1, y1 )")
#@time mykern( 2:(siz - 3), s1, a, x1, y1 )
println("follow up: mykern( 3:(siz - 2), s2, a, x2, y2 )")
@time mykern( 3:(siz - 2), s2, a, x2, y2 )
#@time mykern( 3:(siz - 2), s1, a, x2, y2 )
#@time mykern( 5:(siz - 5), s3, a, x2, y2 )
println("profiling: mykern( 2:(siz - 3), s, a, x, y )")
@profile mykern( 2:(siz - 3), s, a, x, y )
# use a distinct name for the file handle so it doesn't shadow the SubArray s1
f1 = open("mykern.SubArray.prof.txt","w")
Profile.print(f1)
close(f1)
# code_llvm/code_native expect a tuple of argument *types*, not values
#code_typed(mykern, map(typeof, (2:(siz - 3), s, a, x, y)))
code_llvm(mykern, map(typeof, (2:(siz - 3), s, a, x, y)))
code_native(mykern, map(typeof, (2:(siz - 3), s, a, x, y)))
Profile.clear()
println("profiling: mykern( 3:(siz - 2), s2, a, x2, y2 )")
@profile mykern( 3:(siz - 2), s2, a, x2, y2 )
# again, avoid shadowing the SubArray s2 with the file handle
f2 = open("mykern.array.prof.txt","w")
Profile.print(f2)
close(f2)
#code_typed(mykern, map(typeof, (3:(siz - 2), s2, a, x2, y2)))
code_llvm(mykern, map(typeof, (3:(siz - 2), s2, a, x2, y2)))
code_native(mykern, map(typeof, (3:(siz - 2), s2, a, x2, y2)))

And profiles diff:

Not clear how to study the performance further. Thanks!
This seems to be a more common misperception than I expected; it's not counting the number of function calls, it's counting the number of samples collected. Since the samples are collected at regular intervals (by default, 1 ms), a larger count simply means that more time was spent there. You've pretty much dived down as far as one can go with the profiler; any further optimization would have to start with the SubArray code in Base itself. You could also add @inbounds. If you're truly working just in 1d, the only other step I can think of is pre-shifting the pointer. That won't be feasible in higher dimensions, unfortunately.
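To make the sampling point concrete, here is a minimal sketch (the workload function busy is made up for illustration; on Julia 1.0+ you would also need using Profile first):

```julia
# The profiler takes a stack snapshot at a fixed interval (about 1 ms
# by default), so the count shown for a line is the number of samples
# that landed there, i.e. it is proportional to time, not call count.
function busy(n)           # hypothetical workload
    s = 0.0
    for i = 1:n
        s += sin(i)
    end
    s
end

busy(10)                   # warm up so compilation isn't profiled
@profile busy(10^8)
Profile.print()            # counts are samples, not invocations
Profile.clear()
```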
Thanks for clarifying my misconception. I should have known better that in this case the number of samples was printed; I just skimmed through the docs. Does it make sense to compare the code using the code_llvm and code_native calls?

It does, but I'm still struggling with my assembly. If you know what you're doing there, you're ahead of me.
I doubt that I'm ahead of you :-)

s[i] = a * (x[i+1] - x[i]) / (y[i+1] - y[i])

In any case this is not very human readable. I will try to structure it and write up a summary.
A first attempt to structure it for the assignment.

Array-based version:

SubArray-based version:
Here's the assembly for the code at line 14, with some annotations that should be at least approximately correct, although I'm certainly not an expert here.

Array case: Source line: 14
# Make sure x is in bounds
lea RCX, QWORD PTR [RSI - 2]
cmp RCX, QWORD PTR [R15 + 16]
jae 343
# Make sure y is in bounds
lea RCX, QWORD PTR [RSI - 1]
cmp RCX, QWORD PTR [RBX + 16]
jae 329
# Load pointer to data for x
mov RCX, QWORD PTR [R15 + 8]
# Perform multiplication
vmulsd XMM0, XMM2, QWORD PTR [RCX + RDI]
# Load pointer to data for y
mov RCX, QWORD PTR [RBX + 8]
# Set y[i]
vmovsd QWORD PTR [RCX + RDX], XMM0

SubArray case: Source line: 14
# Load pointer to x.parent
mov R9, QWORD PTR [RAX + 8]
# Make sure x.parent is defined
test R9, R9
je 774
# Load pointer to x.strides
mov RSI, QWORD PTR [RAX + 32]
# Make sure x.strides is defined
test RSI, RSI
je 761
# Make sure length of x.strides > 0
cmp QWORD PTR [RSI + 16], 0
je 765
# Load pointer to data for x.strides
mov RDI, QWORD PTR [RSI + 8]
# Multiply (i-1)*x.strides[1]
lea RSI, QWORD PTR [RCX - 2]
imul RSI, QWORD PTR [RDI]
# Load x.first_index
mov RDI, QWORD PTR [RAX + 40]
# Make sure x.parent[x.first_index + (i-1)*x.strides[1]] is in bounds
lea RSI, QWORD PTR [RDI + RSI - 1]
cmp RSI, QWORD PTR [R9 + 16]
jae 734
# Load pointer to y.parent
mov RDI, QWORD PTR [R13 + 8]
# Make sure y.parent is defined
test RDI, RDI
je 706
# Load pointer to y.strides
mov RBX, QWORD PTR [R13 + 32]
# Make sure y.strides is defined
test RBX, RBX
je 693
# Make sure length of y.strides > 0
cmp QWORD PTR [RBX + 16], 0
je 697
# Load pointer to data for y.strides
mov R8, QWORD PTR [RBX + 8]
# Multiply (i-1)*y.strides[1]
lea RBX, QWORD PTR [RCX - 1]
imul RBX, QWORD PTR [R8]
# Load y.first_index
mov RDX, QWORD PTR [R13 + 40]
# Make sure y.parent[y.first_index + (i-1)*y.strides[1]] is in bounds
lea RBX, QWORD PTR [RDX + RBX - 1]
cmp RBX, QWORD PTR [RDI + 16]
jae 666
# Load pointer to data for x
mov RDX, QWORD PTR [R9 + 8]
# Perform multiplication
vmulsd XMM0, XMM2, QWORD PTR [RDX + 8*RSI]
# Load pointer to data for y
mov RDX, QWORD PTR [RDI + 8]
# Set y[i]
vmovsd QWORD PTR [RDX + 8*RBX], XMM0

Obviously for SubArrays we're doing a lot of work to check/load things that don't actually change between iterations of the loop. I am not sure if there's anything we can do to change this in the general case without compiler optimizations.
Thanks for your analysis and for the annotated assembly listing.

A lot of the overhead stems from x.strides being a Vector. Maybe it could be replaced by some kind of immutable type FixedVector{Int, N}, which would end up being "inlined" in the SubArray type, and access to it would yield the same code as using fields like x.stride1, x.stride2, x.stride3, ... would. You could probably implement such a thing using the C interface stuff (Ptr{T}, unsafe_load and friends).
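A minimal sketch of that idea, using the era's immutable syntax (the name FixedVector3 and its layout are hypothetical, not an existing type):

```julia
# All fields are plain Ints, so the whole struct is a bits type and can
# be stored inline in the SubArray, avoiding the pointer dereference,
# null check, and length check that a Vector{Int} of strides requires.
immutable FixedVector3
    s1::Int
    s2::Int
    s3::Int
end

# Indexing reduces to branches over field accesses; no heap access.
Base.getindex(v::FixedVector3, i::Int) =
    i == 1 ? v.s1 : (i == 2 ? v.s2 : v.s3)
```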
I think the strides should be of type NTuple{Dim,Int64}. As tuples are immutable, one has to calculate the strides using ntuple:
NTuples work for the most part, but for some reason they are not "inlined" into immutable types:

immutable Strides4
    stride1 :: Int
    stride2 :: Int
    stride3 :: Int
    stride4 :: Int
end

immutable MySubArray1
    strides :: Strides4
end

immutable MySubArray2
    strides :: NTuple{4, Int}
end

println(sizeof(MySubArray1))
println(sizeof(MySubArray2))

This prints 32 and 8, i.e. 4 "inlined" Ints for MySubArray1 and one pointer to 4 Ints for MySubArray2. Using NTuples is probably the right choice anyway, but the compiler should inline them like other immutables. Doing that, using
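One way to check whether a given field type can even be stored inline is the isbits predicate (renamed isbitstype on later Julia versions):

```julia
# On versions where tuples of bits types are themselves bits types,
# the first line prints true; a Vector is always stored by pointer.
println(isbits(NTuple{4, Int}))
println(isbits(Vector{Int}))      # false: heap-allocated
```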
The inlining work is ongoing. It will happen very soon.

I've argued that immutables should not be allowed to be incompletely initialized, since that would make all null checks go away for them and allow inlining much more easily.

That is a reasonable thing to do that would nudge everyone towards higher performance.

Thanks to all for the feedback! I added a new comment to the thread Fortran-like arrays... that inspired this issue.
SubArray performance would benefit significantly if Julia had partial specialization akin to C++'s, where

Julia already does this.
Does most of your faster performance arise simply from the fact that you don't allow arbitrary strides? And to expand on what Stefan said, not only does Julia allow it, but check out the code in |
Thanks for the hints on partial specialization. Julia is such an excellent language, flexible and expressive, and very well designed. Yes, of course: the FArray class is specialized only for a particular task, namely shifted indices that allow Fortran-like arrays starting from negative (or zero) bases, so it can be optimized better for that task. The latest score: Array code (no shifted indices) 0.44s, FArray code with shifted indices 0.73s.
Just happened to stumble across this. With the new SubArrays in 0.4, if you add @inbounds
Thanks Tim for your efforts.

function mykern(r, s, a, x, y)
    const nsteps = 250
    istep = 0
    ifirst = minimum(r)
    ilast = maximum(r)
    while istep < nsteps #&& t < tmax
        @inbounds for i = ifirst : ilast-1
            s[i] = a * (x[i+1] - x[i]) / (y[i+1] - y[i]) # line #10
        end
        @inbounds for i = ifirst : ilast
            x[i] = a * s[i-1] # line #14
        end
        @inbounds for i = ifirst : ilast
            y[i] = a * x[i-1]
        end
        @inbounds for i = ifirst : ilast-1
            s[i] = s[i] - (x[i+1] - x[i]) / (y[i+1] - y[i])
        end
        istep = istep + 1
    end
    s
end
const siz = 10^4
const a = e^pi - pi^e
s = rand(siz)
x = rand(siz)
y = rand(siz)
s1 = sub(s, 1:(siz - 1))
x1 = sub(x, 1:(siz - 1))
y1 = sub(y, 1:(siz - 1))
s2 = sub(s, 2:siz)
x2 = sub(x, 2:siz)
y2 = sub(y, 2:siz)
s3 = sub(s, 4:siz)
# warming up
mykern( 2:(siz - 3), s, a, x, y )
mykern( 2:(siz - 3), s, a, x1, y1 )
mykern( 3:(siz - 2), s, a, x2, y2 )
# @time mykern( 2:(siz - 3), s, a, x, y )
# the real test
println("follow up: mykern( 2:(siz - 3), s, a, x, y )")
@time mykern( 2:(siz - 3), s, a, x, y )
println("follow up: mykern( 2:(siz - 3), s1, a, x1, y1 )")
@time mykern( 2:(siz - 3), s1, a, x1, y1 )
println("follow up: mykern( 3:(siz - 2), s2, a, x2, y2 )")
@time mykern( 3:(siz - 2), s2, a, x2, y2 )
@time mykern( 3:(siz - 2), s1, a, x2, y2 )
@time mykern( 5:(siz - 5), s3, a, x2, y2 )

julia shifted3.jl
follow up: mykern( 2:(siz - 3), s, a, x, y )
17.038 milliseconds (165 allocations: 11149 bytes)
follow up: mykern( 2:(siz - 3), s1, a, x1, y1 )
9.040 seconds (73071 k allocations: 1115 MB, 1.11% gc time)
follow up: mykern( 3:(siz - 2), s2, a, x2, y2 )
8.964 seconds (73056 k allocations: 1115 MB, 0.94% gc time)
8.449 seconds (73056 k allocations: 1115 MB, 0.88% gc time)
8.462 seconds (73026 k allocations: 1114 MB, 0.89% gc time)

What's wrong with the code above?
My results:
Are you by any chance running julia 0.3? This is all only on julia 0.4.

That has to be 0.4, unless somebody backported my

Yeah, good point. What is your julia version, anyway?

Also, with that large number of allocations (~73 million, ~1 GB), it doesn't surprise me that it's slow...
Your timing is good. I wish I had the same.

               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "help()" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.4.0-dev+5208 (2015-06-04 23:18 UTC)
 _/ |\__'_|_|_|\__'_|  |  Commit e9fa25b* (0 days old master)
|__/                   |  x86_64-linux-gnu
That's why I assumed julia 0.3, since SubArray construction was type-unstable on 0.3. Also @alsam, your "warmup" doesn't really warm up the case where a subarray is the 2nd argument. Run

include("shifted3.jl")
include("shifted3.jl")

in the REPL and see if the 2nd call is faster.
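The warm-up matters because Julia compiles a separate specialization of a function for each new combination of argument types; a small illustration (assuming nothing beyond standard @time behavior):

```julia
# The first call with a given signature pays the compilation cost;
# later calls with the same argument types reuse the compiled code.
f(x) = 2x

@time f(1)      # includes compile time for f(::Int)
@time f(1)      # fast: specialization already compiled
@time f(1.0)    # compiles a new specialization for f(::Float64)
```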
Just tried to run

include("shifted3.jl")
include("shifted3.jl")

The performance is unfortunately still bad. The experiment with rotating parameters to
Thanks @timholy for the hint.

julia> a = [-10:10];
WARNING: [a] concatenation is deprecated; use collect(a) instead
[...]
julia> b = Base.sub_unsafe(a, 12:21);
julia> b[-10];
ERROR: BoundsError: attempt to access 0-dimensional SubArray{Int64,0,Array{Int64,1},Tuple{Int64},1}:
1
at index [-10]
in throw_boundserror at abstractarray.jl:146
in checkbounds at abstractarray.jl:166
in getindex at subarray2.jl:4

Julia version: Version 0.4.0-dev+5236 (2015-06-06 15:02 UTC)
Commit 01abdc3* (0 days old master)

julia> @code_llvm b[-10]
...
%10 = icmp eq i32 %2, 1
br i1 %10, label %fail, label %pass2
fail: ; preds = %top
%11 = getelementptr %jl_value_t** %1, i64 1
call void @jl_bounds_error_tuple_int(%jl_value_t** %11, i64 0, i64 1)
unreachable
...

So even
Sorry, pressed the wrong button and closed the issue. |
This seems to do what worked in 0.3.9.

julia> a = [-10:10];
julia> b = sub(a, 12:21);
julia> Base.unsafe_getindex(b, -10)
-10 |
Sure, it works in 0.3.9, even conventional indexing:

julia> a = [-10:10];
julia> b = sub(a, 12:21);
julia> b[-10]
-10
julia> b[-11]
ERROR: BoundsError()
in getindex at subarray.jl:194

Thanks @KristofferC for noticing it.
@alsam I think you need a = [-10:10;] (note the trailing semicolon):

               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "help()" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.4.0-dev+5236 (2015-06-06 15:02 UTC)
 _/ |\__'_|_|_|\__'_|  |  Commit 01abdc3* (1 day old master)
|__/                   |  x86_64-unknown-linux-gnu
julia> a = [-10:10;];
julia> b = sub(a, 12:21);
julia> Base.unsafe_getindex(b, -10)
-10 |
Copy that. Thanks @yuyichao for the useful hint!

:) and that's exactly what I did in my julia session. Just wanted to include the version info and point out that it is not
Thank you @KristofferC and @yuyichao for clarification. |
The code below is probably fit for an insanity wolf meme, hehe.

julia> import Base: getindex
julia> const getindex = Base.unsafe_getindex
Warning: imported binding for getindex overwritten in module Main
unsafe_getindex (generic function with 19 methods)
julia> a = [-10:10;];
julia> b = sub(a, 12:21);
julia> b[-10]
-10

But seriously, you should probably be able to do the same but only for the SubArray methods.
Excellent, thanks @KristofferC ! |
Note, you should do something like

getindex(V::SubArray, I::Int64...) = Base.unsafe_getindex(V, I...)
getindex(V::SubArray, I::Union(Range{Int64},Array{Int64,1},Int64,Colon)...) = Base.unsafe_getindex(V, I...)
getindex{T,N,P,IV,LD}(V::SubArray{T,N,P,IV,LD}, I::Union(AbstractArray{T,N},Colon,Real)...) = Base.unsafe_getindex(V, I...)

instead, to keep the bounds checking on for non-SubArray arrays.
Cool, thanks a lot @KristofferC :))

julia> b[-10]
-10
julia> b[-11]
ERROR: BoundsError: attempt to access 21-element Array{Int64,1}: |
Chiming in late here, but yes, indexing changed a lot for #10525, so my old instincts are out of date. Glad y'all figured stuff out. |
Let me summarize the performance with the code below; thanks to @KristofferC, @timholy, @yuyichao for the hints.

getindex(V::SubArray, I::Int...) = Base.unsafe_getindex(V, I...)
getindex(V::SubArray, I::Union(Range{Int},Array{Int,1},Int,Colon)...) = Base.unsafe_getindex(V, I...)
getindex{T,N,P,IV}(V::SubArray{T,N,P,IV}, I::Union(Real, AbstractArray, Colon)...) = Base.unsafe_getindex(V, to_index(I)...)
getindex{T,N,P,IV,LD}(V::SubArray{T,N,P,IV,LD}, I::Union(AbstractArray{T,N},Colon,Real)...) = Base.unsafe_getindex(V, I...)
setindex!(V::SubArray, v, I::Int...) = Base.unsafe_setindex!(V, v, I...)
#import Base: getindex, setindex!
#const getindex = Base.unsafe_getindex
#const setindex! = Base.unsafe_setindex!

The performance is:

follow up: mykern( 2:(siz - 3), s, a, x, y )
12.896 milliseconds (166 allocations: 11485 bytes)
follow up: mykern( 2:(siz - 3), s1, a, x1, y1 )
81.872 milliseconds (16694 allocations: 713 KB)
follow up: mykern( 3:(siz - 2), s2, a, x2, y2 )
19.551 milliseconds (6 allocations: 192 bytes)
19.710 milliseconds (6 allocations: 192 bytes)
19.322 milliseconds (6 allocations: 192 bytes)

That is, the performance penalty is about 40%, as @timholy predicted.

follow up: mykern( 2:(siz - 3), s, a, x, y )
elapsed time: 0.013599747 seconds (13888 bytes allocated)
follow up: mykern( 2:(siz - 3), s1, a, x1, y1 )
elapsed time: 0.170130061 seconds (938696 bytes allocated)
follow up: mykern( 3:(siz - 2), s2, a, x2, y2 )
elapsed time: 0.124578779 seconds (88 bytes allocated)
elapsed time: 0.114813522 seconds (88 bytes allocated)
elapsed time: 0.113371351 seconds (88 bytes allocated)

Ok, let me summarize:

getindex(V::SubArray, I::Int64...) = Base.unsafe_getindex(V, I...)
getindex(V::SubArray, I::Union(Range{Int64},Array{Int64,1},Int64,Colon)...) = Base.unsafe_getindex(V, I...)
getindex{T,N,P,IV}(V::SubArray{T,N,P,IV}, I::Union(Real, AbstractArray, Colon)...) = Base.unsafe_getindex(V, to_index(I)...)
setindex!(V::SubArray, v, I::Int64...) = Base.unsafe_setindex!(V, v, I...)

macro shifted_array(T, r)
    :(sub(Array($T, length($r)), -minimum($r) + 2 : length($r)))
end

and usage:

julia> x = @shifted_array(Float64, -10:10)
10-element SubArray{Float64,1,Array{Float64,1},Tuple{UnitRange{Int64}},1}:
6.90782e-310
...
julia> x[-10]
6.9078200935909e-310
julia> x[-11]
ERROR: BoundsError: attempt to access 21-element Array{Float64,1}:
...
julia> x[0]
6.90782011166424e-310
julia> x[10]
6.9078200930178e-310
julia> x[11]
ERROR: BoundsError: attempt to access 21-element Array{Float64,1}:

The penalty for a bigger example from 1d acoustics is quite reasonable, ~40% as expected: 593.164 milliseconds (40 allocations: 236 KB) for

Thanks!
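For readers on later Julia versions: the shifted-index pattern that the macro above emulates is provided by the OffsetArrays.jl package (which did not exist at the time of this thread), so no custom getindex overrides are needed. A rough sketch, assuming OffsetArrays is installed:

```julia
using OffsetArrays

# A Float64 vector indexed from -10 to 10, like @shifted_array above.
x = OffsetArray(zeros(21), -10:10)

x[-10] = 1.0    # in bounds
x[10]  = 2.0    # in bounds
# x[-11]        # would throw a BoundsError
```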
This resolves the core issue behind SubArray's performance as reported in #5117. Unfortunately, without #7799, the surface syntax isn't very nice since it's a ton of unsafe_ function calls. Using the test case from #5117 (comment):

* Julia pre-10525 (with `@inbounds`)

```
follow up: mykern( 2:(siz - 3), s, a, x, y )
9.484 milliseconds (157 allocations: 10893 bytes)
follow up: mykern( 2:(siz - 3), s1, a, x1, y1 )
68.632 milliseconds (16353 allocations: 718 KB)
follow up: mykern( 3:(siz - 2), s2, a, x2, y2 )
16.753 milliseconds (6 allocations: 192 bytes)
15.493 milliseconds (6 allocations: 192 bytes)
17.036 milliseconds (6 allocations: 192 bytes)
```

* This branch (with `unsafe_getindex`/`unsafe_setindex!`):

```
follow up: mykern( 2:(siz - 3), s, a, x, y )
10.046 milliseconds (158 allocations: 11469 bytes)
follow up: mykern( 2:(siz - 3), s1, a, x1, y1 )
70.425 milliseconds (19097 allocations: 827 KB)
follow up: mykern( 3:(siz - 2), s2, a, x2, y2 )
14.263 milliseconds (6 allocations: 192 bytes)
16.470 milliseconds (6 allocations: 192 bytes)
16.711 milliseconds (6 allocations: 192 bytes)
```
It is fantastic, but with the latest Julia

@inline getindex(V::SubArray, I::Int64...) = Base.unsafe_getindex(V, I...)
@inline getindex(V::SubArray, I::Union(Range{Int64},Array{Int64,1},Int64,Colon)...) = Base.unsafe_getindex(V, I...)
@inline getindex{T,N,P,IV}(V::SubArray{T,N,P,IV}, I::Union(Real, AbstractVector, Colon)...) = Base.unsafe_getindex(V, to_index(I)...)
@inline setindex!(V::SubArray, v, I::Int64...) = Base.unsafe_setindex!(V, v, I...)

(please note that

julia main_sub.jl
...
333.805 milliseconds (19501 allocations: 539 KB)

vs.

julia main.jl
...
633.032 milliseconds (40 allocations: 236 KB)

but this is due to

import Base: getindex, setindex!
const getindex = Base.unsafe_getindex
const setindex! = Base.unsafe_setindex!

then we will get

julia main.jl
Warning: imported binding for getindex overwritten in module Main
Warning: imported binding for setindex! overwritten in module Main
...
198.278 milliseconds (40 allocations: 236 KB)

as expected, i.e. the regular array version is faster than
I've seen some strange performance effects due to using

import Base: getindex, unsafe_getindex
@inline getindex(V::SubArray, I::Int64...) = unsafe_getindex(V, I...)
...

We're planning on reverting back to the old SubArray semantics, where bounds checks are deferred to the parent array (which is then affected by @inbounds).
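Under those semantics, a sketch of what "deferred to the parent" buys (assuming the standard @inbounds behavior of eliding bounds checks inside the annotated block):

```julia
# With the view's own checks deferred to the parent array, an @inbounds
# loop over a view compiles down to plain parent accesses with no
# per-iteration bounds checks at all.
function sumview(v)
    s = zero(eltype(v))
    @inbounds for i = 1:length(v)
        s += v[i]
    end
    s
end
```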
Great, it would be nice to have old SubArray semantics reverted, thanks @mbauman ! |
Confirmed that with the latest julia:

julia> macro shifted_array(T, r)
           :(sub(Array($T, length($r)), -minimum($r) + 2 : length($r)))
       end
julia> z = @shifted_array(Float64, -7:7);
julia> z[-7]
6.92464944583943e-310
julia> z[-8]
ERROR: BoundsError: attempt to access 15-element Array{Float64,1}:
6.92465e-310

The last score for the 1d acoustics code: baseline 320.611 milliseconds (40 allocations: 236 KB) vs. 483.243 milliseconds (19501 allocations: 539 KB), so the penalty is quite reasonable.
For the following synthetic testcase:

SubArray-based arrays showed a significant slowdown, up to 4x, compared to Array-based ones.

Profiling diff:

revealed that the SubArray code was invoked more frequently; say, for line 11 the Array version was invoked 801 times and the SubArray version 3301 times, i.e. 4x more frequently.

Any hints for more detailed profiling/study of the code?