Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

room for performance improvement for SubArray #5117

Closed
alsam opened this issue Dec 12, 2013 · 75 comments
Closed

room for performance improvement for SubArray #5117

alsam opened this issue Dec 12, 2013 · 75 comments
Labels
performance Must go faster

Comments

@alsam
Copy link

alsam commented Dec 12, 2013

For the following synthetic testcase:

function mykern(r, s, a, x, y)
    const nsteps = 25000 
    istep = 0

    ifirst = minimum(r)
    ilast  = maximum(r)
    while istep < nsteps #&& t < tmax

        for i = ifirst : ilast-1
            s[i] = a * (x[i+1] - x[i]) / (y[i+1] - y[i])
        end

        for i = ifirst : ilast
            x[i] = a * s[i-1]
        end

        for i = ifirst : ilast
            y[i] = a * x[i-1]
        end

        for i = ifirst : ilast-1
            s[i] = s[i] - (x[i+1] - x[i]) / (y[i+1] - y[i])
        end
        istep=istep + 1
    end 
    s
end 

const siz = 10^4
const a = e^pi - pi^e

s = rand(siz)
x = rand(siz)
y = rand(siz)

s1 = sub(s, 1:(siz - 1))
x1 = sub(x, 1:(siz - 1))
y1 = sub(y, 1:(siz - 1))

s2 = sub(s, 2:siz)
x2 = sub(x, 2:siz)
y2 = sub(y, 2:siz)

s3 = sub(s, 4:siz)

# warming up
mykern( 2:(siz - 3), s, a, x,  y  )
mykern( 2:(siz - 3), s, a, x1, y1 )

# the real test

println("follow up: mykern( 2:(siz - 3), s, a, x, y )")
@time mykern( 2:(siz - 3), s, a, x,  y  )
println("follow up: mykern( 3:(siz - 2), s2, a, x2, y2 )")
@time mykern( 3:(siz - 2), s2, a, x2, y2 )

println("profiling: mykern( 2:(siz - 3), s, a, x, y )")
@profile mykern( 2:(siz - 3), s, a, x,  y  )
Profile.print()
#code_typed(mykern,(2:(siz - 3), s, a, x,  y))
code_llvm(mykern,(2:(siz - 3), s, a, x,  y))
code_native(mykern,(2:(siz - 3), s, a, x,  y))

println("profiling: mykern( 3:(siz - 2), s2, a, x2, y2 )")
@profile mykern( 3:(siz - 2), s2, a, x2, y2 )
Profile.print()
#code_typed(mykern,(3:(siz - 2), s2, a, x2, y2))
code_llvm(mykern,(3:(siz - 2), s2, a, x2, y2))
code_native(mykern,(3:(siz - 2), s2, a, x2, y2))

SubArray based arrays got significant - up to 4x performance slowdown than Array based ones.

Profiling diff:

-follow up: mykern( 2:(siz - 3), s, a, x, y )
-elapsed time: 2.848171233 seconds (611520 bytes allocated)
-follow up: mykern( 3:(siz - 2), s2, a, x2, y2 )
-elapsed time: 8.839612056 seconds (704324 bytes allocated)
-profiling: mykern( 2:(siz - 3), s, a, x, y )
-2807 boot.jl; include; line: 238
-    2807 profile.jl; anonymous; line: 14
-       98  ...ray/src/shifted3.jl; mykern; line: 10
-       801 ...ray/src/shifted3.jl; mykern; line: 11
-       149 ...ray/src/shifted3.jl; mykern; line: 14
-       344 ...ray/src/shifted3.jl; mykern; line: 15
-       96  ...ray/src/shifted3.jl; mykern; line: 18
-       430 ...ray/src/shifted3.jl; mykern; line: 19
-       889 ...ray/src/shifted3.jl; mykern; line: 23
+profiling: mykern( 3:(siz - 2), s2, a, x2, y2 )
+11591 boot.jl; include; line: 238
+    11591 profile.jl; anonymous; line: 14
+       185  ...ray/src/shifted3.jl; mykern; line: 10
+       3302 ...ray/src/shifted3.jl; mykern; line: 11
+       149  ...ray/src/shifted3.jl; mykern; line: 14
+       2078 ...ray/src/shifted3.jl; mykern; line: 15
+       96   ...ray/src/shifted3.jl; mykern; line: 18
+       2133 ...ray/src/shifted3.jl; mykern; line: 19
+       3648 ...ray/src/shifted3.jl; mykern; line: 23

revealed that for SubArray code was invoked more frequently, say for the line 11

            s[i] = a * (x[i+1] - x[i]) / (y[i+1] - y[i])

The Array version was invoked 801 times, SubArray version - 3301, i.e. 4x time more frequently.

Any hints for more detailed profiling/study of the code?

@alsam
Copy link
Author

alsam commented Dec 12, 2013

A better version that writes profiles into different files:

function mykern(r, s, a, x, y)
    const nsteps = 25000 
    istep = 0

    ifirst = minimum(r)
    ilast  = maximum(r)
    while istep < nsteps #&& t < tmax

        for i = ifirst : ilast-1
            s[i] = a * (x[i+1] - x[i]) / (y[i+1] - y[i]) # line #10
        end

        for i = ifirst : ilast
            x[i] = a * s[i-1]                            # line #14
        end

        for i = ifirst : ilast
            y[i] = a * x[i-1]
        end

        for i = ifirst : ilast-1
            s[i] = s[i] - (x[i+1] - x[i]) / (y[i+1] - y[i])
        end
        istep=istep + 1
    end 
    s
end 

const siz = 10^4
const a = e^pi - pi^e

s = rand(siz)
x = rand(siz)
y = rand(siz)

s1 = sub(s, 1:(siz - 1))
x1 = sub(x, 1:(siz - 1))
y1 = sub(y, 1:(siz - 1))

s2 = sub(s, 2:siz)
x2 = sub(x, 2:siz)
y2 = sub(y, 2:siz)

s3 = sub(s, 4:siz)

# warming up
# mykern( 2:(siz - 3), s, a, x,  y  )
# mykern( 2:(siz - 3), s, a, x1, y1 )
# mykern( 3:(siz - 2), s, a, x2, y2 )
# @time mykern( 2:(siz - 3), s, a, x,  y )

# the real test

println("follow up: mykern( 2:(siz - 3), s, a, x, y )")
@time mykern( 2:(siz - 3), s, a, x,  y  )
#println("follow up: mykern( 2:(siz - 3), s1, a, x1, y1 )")
#@time mykern( 2:(siz - 3), s1, a, x1, y1 )
println("follow up: mykern( 3:(siz - 2), s2, a, x2, y2 )")
@time mykern( 3:(siz - 2), s2, a, x2, y2 )
#@time mykern( 3:(siz - 2), s1, a, x2, y2 )
#@time mykern( 5:(siz - 5), s3, a, x2, y2 )

println("profiling: mykern( 2:(siz - 3), s, a, x, y )")
@profile mykern( 2:(siz - 3), s, a, x,  y  )
s1 = open("mykern.SubArray.prof.txt","w")
Profile.print(s1)
close(s1)
#code_typed(mykern,(2:(siz - 3), s, a, x,  y))
code_llvm(mykern,(2:(siz - 3), s, a, x,  y))
code_native(mykern,(2:(siz - 3), s, a, x,  y))


Profile.clear()
println("profiling: mykern( 3:(siz - 2), s2, a, x2, y2 )")
@profile mykern( 3:(siz - 2), s2, a, x2, y2 )
s2 = open("mykern.array.prof.txt","w")
Profile.print(s2)
close(s2)
#code_typed(mykern,(3:(siz - 2), s2, a, x2, y2))
code_llvm(mykern,(3:(siz - 2), s2, a, x2, y2))
code_native(mykern,(3:(siz - 2), s2, a, x2, y2))

And profiles diff:

diff -u mykern.array.prof.txt mykern.SubArray.prof.txt 
--- mykern.array.prof.txt   2013-12-12 19:41:07.454722693 +0400
+++ mykern.SubArray.prof.txt    2013-12-12 19:40:58.354723031 +0400
@@ -1,7 +1,9 @@
-8843 boot.jl; include; line: 238
-    8843 profile.jl; anonymous; line: 14
-       95   ...ray/src/shifted3.jl; mykern; line: 9
-       2527 ...ray/src/shifted3.jl; mykern; line: 10
-       1708 ...ray/src/shifted3.jl; mykern; line: 14
-       1756 ...ray/src/shifted3.jl; mykern; line: 18
-       2757 ...ray/src/shifted3.jl; mykern; line: 22
+2823 boot.jl; include; line: 238
+    2823 profile.jl; anonymous; line: 14
+       103 ...ray/src/shifted3.jl; mykern; line: 9
+       787 ...ray/src/shifted3.jl; mykern; line: 10
+       129 ...ray/src/shifted3.jl; mykern; line: 13
+       329 ...ray/src/shifted3.jl; mykern; line: 14
+       72  ...ray/src/shifted3.jl; mykern; line: 17
+       469 ...ray/src/shifted3.jl; mykern; line: 18
+       934 ...ray/src/shifted3.jl; mykern; line: 22

Not clear how to study the performance further.
Could you please give some hints?

Thanks!

@timholy
Copy link
Member

timholy commented Dec 12, 2013

The Array version was invoked 801 times, SubArray version - 3301, i.e. 4x time more frequently.

This seems to be a more common misperception than I expected; it's not counting the number of function calls, it's counting the number of samples collected. Since the samples are collected at regular intervals (by default, 1 ms), this simply means that it took more time.

You've pretty much dived down into as far as one can go with the profiler; since the getindex and setindex! calls are inlined, there is no more detail available. I guess one more thing would be to set combine=false, but I tried this and it wasn't deeply informative, except to point out that there are more operations being performed with the SubArray code than with Array code.

Any further optimization would have to start with the code in base/subarray. One thing I noticed when looking is that we seem to be missing a setindex! optimized for 1d. Adding that might make a difference.

You could also add @inbounds to your mykern function.

If you're truly working just in 1d, the only other step I can think of is pre-shifting the pointer. That won't be feasible in higher dimensions, unfortunately.

@alsam
Copy link
Author

alsam commented Dec 12, 2013

Thanks for clarifying my misconception. I should have known better that in this case a number of samples was printed - just skimmed through the docs.

Does it make sense to compare codes using code_llvm, code_native calls?

@timholy
Copy link
Member

timholy commented Dec 12, 2013

It does, but I'm still struggling with my assembly. If you know what you're doing there, you're ahead of me.

@alsam
Copy link
Author

alsam commented Dec 13, 2013

I doubt that I'm ahead of you :-)
You are quite right, combine=false is not useful at all.
code_llvm and code_native revealed a significant difference, no difference of code_lowered and more interesting is the difference in code_typed for the line 10

            s[i] = a * (x[i+1] - x[i]) / (y[i+1] - y[i])
-        #s3 = top(box)(Float64,top(div_float)(top(box)(Float64,top(mul_float)(a::Float64,top(box)(Float64,top(sub_float)(arrayref(x::Array{Float64,1},top(box)(Int64,top(add_int)(#s32::Int64,1))::Int64)::Float64,arrayref(x::Array{Float64,1},#s32::Int64)::Float64))::Float64))::Float64,top(box)(Float64,top(sub_float)(arrayref(y::Array{Float64,1},top(box)(Int64,top(add_int)(#s32::Int64,1))::Int64)::Float64,arrayref(y::Array{Float64,1},#s32::Int64)::Float64))::Float64))::Float64
-        arrayset(s::Array{Float64,1},#s3::Float64,#s32::Int64)::Array{Float64,1}
+        #s3 = top(box)(Float64,top(div_float)(top(box)(Float64,top(mul_float)(a::Float64,top(box)(Float64,top(sub_float)(arrayref(top(getfield)(x::SubArray{Float64,1,Array{Float64,1},(Range1{Int64},)},:parent)::Array{Float64,1},top(box)(Int64,top(add_int)(top(getfield)(x::SubArray{Float64,1,Array{Float64,1},(Range1{Int64},)},:first_index)::Int64,top(box)(Int64,top(mul_int)(top(box)(Int64,top(sub_int)(top(box)(Int64,top(add_int)(#s32::Int64,1))::Int64,1))::Int64,arrayref(top(getfield)(x::SubArray{Float64,1,Array{Float64,1},(Range1{Int64},)},:strides)::Array{Int64,1},1)::Int64))::Int64))::Int64)::Float64,arrayref(top(getfield)(x::SubArray{Float64,1,Array{Float64,1},(Range1{Int64},)},:parent)::Array{Float64,1},top(box)(Int64,top(add_int)(top(getfield)(x::SubArray{Float64,1,Array{Float64,1},(Range1{Int64},)},:first_index)::Int64,top(box)(Int64,top(mul_int)(top(box)(Int64,top(sub_int)(#s32::Int64,1))::Int64,arrayref(top(getfield)(x::SubArray{Float64,1,Array{Float64,1},(Range1{Int64},)},:strides)::Array{Int64,1},1)::Int64))::Int64))::Int64)::Float64))::Float64))::Float64,top(box)(Float64,top(sub_float)(arrayref(top(getfield)(y::SubArray{Float64,1,Array{Float64,1},(Range1{Int64},)},:parent)::Array{Float64,1},top(box)(Int64,top(add_int)(top(getfield)(y::SubArray{Float64,1,Array{Float64,1},(Range1{Int64},)},:first_index)::Int64,top(box)(Int64,top(mul_int)(top(box)(Int64,top(sub_int)(top(box)(Int64,top(add_int)(#s32::Int64,1))::Int64,1))::Int64,arrayref(top(getfield)(y::SubArray{Float64,1,Array{Float64,1},(Range1{Int64},)},:strides)::Array{Int64,1},1)::Int64))::Int64))::Int64)::Float64,arrayref(top(getfield)(y::SubArray{Float64,1,Array{Float64,1},(Range1{Int64},)},:parent)::Array{Float64,1},top(box)(Int64,top(add_int)(top(getfield)(y::SubArray{Float64,1,Array{Float64,1},(Range1{Int64},)},:first_index)::Int64,top(box)(Int64,top(mul_int)(top(box)(Int64,top(sub_int)(#s32::Int64,1))::Int64,arrayref(top(getfield)(y::SubArray{Float64,1,Array{Float64,1},(Range1{Int64},)},:strides)::Array{Int64,1},1)::Int64))::Int64))::Int64)::Float64))::Float64))::Float64
+        arrayset(top(getfield)(s::SubArray{Float64,1,Array{Float64,1},(Range1{Int64},)},:parent)::Array{Float64,1},#s3::Float64,top(box)(Int64,top(add_int)(top(getfield)(s::SubArray{Float64,1,Array{Float64,1},(Range1{Int64},)},:first_index)::Int64,top(box)(Int64,top(mul_int)(top(box)(Int64,top(sub_int)(#s32::Int64,1))::Int64,arrayref(top(getfield)(s::SubArray{Float64,1,Array{Float64,1},(Range1{Int64},)},:strides)::Array{Int64,1},1)::Int64))::Int64))::Int64)::Array{Float64,1}

In any case this is not very human readable. Will try to structure it and make some summary.

@alsam
Copy link
Author

alsam commented Dec 13, 2013

A first attempt to structure for the assignment s[i] = ..., i.e. for the writeback:

Array-based version:

-        arrayset   (s::Array{Float64,1},
                    #s3::Float64,
                    #s32::Int64
                    )::Array{Float64,1}

SubArray-based version:

+        arrayset   (top(getfield)(s::SubArray{Float64,1,Array{Float64,1},(Range1{Int64},)},:parent)::Array{Float64,1},
                    #s3::Float64,
                    top(box)(
                                Int64,top(add_int)(top(getfield)
                                (s::SubArray{Float64,1,Array{Float64,1},(Range1{Int64},)},:first_index)::Int64,
                                top(box)
                                    (Int64,
                                            top(mul_int)(
                                                top(box)(
                                                            Int64,top(sub_int)(#s32::Int64,1)
                                                        )::Int64,
                                                arrayref(top(getfield)
                                                    (s::SubArray{Float64,1,Array{Float64,1},
                                                        (Range1{Int64},)},
                                                        :strides
                                                    )::Array{Int64,1},1)::Int64
                                            )
                                    )::Int64)
                            )::Int64)::Array{Float64,1}

@simonster
Copy link
Member

Here's the assembly for the code at line 14, with some annotations that should be at least approximately correct, although I'm certainly not an expert here. Array case:

Source line: 14
        # Make sure x is in bounds
        lea     RCX, QWORD PTR [RSI - 2]
        cmp     RCX, QWORD PTR [R15 + 16]
        jae     343
        # Make sure y is in bounds
        lea     RCX, QWORD PTR [RSI - 1]
        cmp     RCX, QWORD PTR [RBX + 16]
        jae     329
        # Load pointer to data for x
        mov     RCX, QWORD PTR [R15 + 8]
        # Perform multiplication
        vmulsd  XMM0, XMM2, QWORD PTR [RCX + RDI]
        # Load pointer to data for y
        mov     RCX, QWORD PTR [RBX + 8]
        # Set y[i]
        vmovsd  QWORD PTR [RCX + RDX], XMM0

SubArray case:

Source line: 14
        # Load pointer to x.parent
        mov     R9, QWORD PTR [RAX + 8]
        # Make sure x.parent is defined
        test    R9, R9
        je      774
        # Load pointer to x.strides
        mov     RSI, QWORD PTR [RAX + 32]
        # Make sure x.strides is defined
        test    RSI, RSI
        je      761
        # Make sure length of x.strides > 0
        cmp     QWORD PTR [RSI + 16], 0
        je      765
        # Load pointer to data for x.strides
        mov     RDI, QWORD PTR [RSI + 8]
        # Multiply (i-1)*x.strides[1]
        lea     RSI, QWORD PTR [RCX - 2]
        imul    RSI, QWORD PTR [RDI]
        # Load x.first_index
        mov     RDI, QWORD PTR [RAX + 40]
        # Make sure x.parent[x.first_index + (i-1)*x.strides[1]] is in bounds
        lea     RSI, QWORD PTR [RDI + RSI - 1]
        cmp     RSI, QWORD PTR [R9 + 16]
        jae     734
        # Load pointer to y.parent
        mov     RDI, QWORD PTR [R13 + 8]
        # Make sure y.parent is defined
        test    RDI, RDI
        je      706
        # Load pointer to y.strides
        mov     RBX, QWORD PTR [R13 + 32]
        # Make sure y.strides is defined
        test    RBX, RBX
        je      693
        # Make sure length of y.strides > 0
        cmp     QWORD PTR [RBX + 16], 0
        je      697
        # Load pointer to data for y.strides
        mov     R8, QWORD PTR [RBX + 8]
        # Multiply (i-1)*y.strides[1]
        lea     RBX, QWORD PTR [RCX - 1]
        imul    RBX, QWORD PTR [R8]
        # Load y.first_index
        mov     RDX, QWORD PTR [R13 + 40]
        # Make sure y.parent[y.first_index + (i-1)*y.strides[1]] is in bounds
        lea     RBX, QWORD PTR [RDX + RBX - 1]
        cmp     RBX, QWORD PTR [RDI + 16]
        jae     666
        # Load pointer to data for x
        mov     RDX, QWORD PTR [R9 + 8]
        # Perform multiplication
        vmulsd  XMM0, XMM2, QWORD PTR [RDX + 8*RSI]
        # Load pointer to data for y
        mov     RDX, QWORD PTR [RDI + 8]
        # Set y[i]
        vmovsd  QWORD PTR [RDX + 8*RBX], XMM0

Obviously for SubArrays we're doing a lot of work to check/load things that don't actually change between iterations of the loop. I am not sure if there's anything we can do to change this in the general case without compiler optimizations.

@alsam
Copy link
Author

alsam commented Dec 13, 2013

Thanks for your analysis and for the annotated assembly listing.
Seems for general SubArray case it is hard if possible to improve performance significantly.

@RauliRuohonen
Copy link

A lot of the overhead stems from x.strides being a Vector. Maybe it could be replaced by some kind of immutable type FixedVector{Int, N}, which would end up being "inlined" in the SubArray type and access to it would yield the same code as using fields like x.stride1, x.stride2, x.stride3, ... would? You could probably implement such a thing using the C interface stuff (Ptr{T}, unsafe_load and friends).

@tknopp
Copy link
Contributor

tknopp commented Dec 14, 2013

I think the strides should be of type NTuple{Dim,Int64}. As tuples are immutable one has to calculate the strides using ntuple:

strides = ntuple(D, d-> ... )

@RauliRuohonen
Copy link

NTuples work for the most part, but for some reason they are not "inlined" into immutable types:

immutable Strides4
    stride1 :: Int
    stride2 :: Int
    stride3 :: Int
    stride4 :: Int
end

immutable MySubArray1
    strides :: Strides4
end

immutable MySubArray2
    strides :: NTuple{4, Int}
end 

println(sizeof(MySubArray1))
println(sizeof(MySubArray2))

prints 32 and 8, i.e. 4 "inlined" ints for MySubArray1 and one pointer to 4 ints for MySubArray2. Using NTuples is probably the right choice anyway, but the compiler should inline them like other immutables. Doing that, using immutable instead of type for SubArrays, and removing all the pointless null checks (should there be @notnull like @inbounds?) would cut off a lot of the overhead.

@Keno
Copy link
Member

Keno commented Dec 14, 2013

The inlining work is ongoing. It will happen very soon.

@StefanKarpinski
Copy link
Member

I've argued that immutables should not be allowed to be incompletely initialized since that would make all null checks go away for them and allow inlining much more easily.

@ViralBShah
Copy link
Member

That is a reasonable thing to do that would nudge everyone towards higher performance.

@JeffBezanson
Copy link
Member

see also #3496, #3503

@alsam
Copy link
Author

alsam commented Dec 24, 2013

Thanks to all for the feedback!

I added a new comment to the thread Fortran-like arrays...
https://groups.google.com/forum/#!topic/julia-dev/NOF6MA6tb9Y

that inspired this issue.

@alsam
Copy link
Author

alsam commented Dec 26, 2013

SubArray performance would benefit significantly if Julia had partial specialization akin C++.
The type has a signature
type SubArray{T,N,A<:AbstractArray,I<:(RangeIndex...,)} <: AbstractArray{T,N}

where N is dimension.
It would be great if we might specialize for the most common cases N = 1,2,3.

@StefanKarpinski
Copy link
Member

Julia already does this.

@timholy
Copy link
Member

timholy commented Dec 26, 2013

Does most of your faster performance arise simply from the fact that you don't allow arbitrary strides? getindex(S::SubArray, ...) has to do an extra lookup and multiplication for each argument.

And to expand on what Stefan said, not only does Julia allow it, but check out the code in subarray.jl; you'll see it's already used heavily.

@alsam
Copy link
Author

alsam commented Dec 28, 2013

Thanks for the hints on partial specialization. Julia is such an excellent language - flexible and expressive - very well designed.

Yes, of cause, as FArray class is specialized only for a particular task - to have shifted indices to allow Fortran-like arrays starting from negative (or zero) bases so it can be optimized better for the task.
Recently I specialized the class even more for the most common N = 1,2,3 dimension and got more performance boost.

The latest score - Array code (not sifted indices) - 0.44s, FArray code with shifted indices - 0.73s.
Thus the penalty is only 70%.
Happy New Year!

@timholy
Copy link
Member

timholy commented Dec 14, 2014

Just happened to stumble across this. With the new SubArrays in 0.4, if you add @inbounds around your loop in mykern, the performance gap has gone from being 400% to 40%. Obviously it would be nice to regain even that 40%, but I'd say we're well on our way here.

@alsam
Copy link
Author

alsam commented Jun 5, 2015

Thanks Tim for your efforts.
Just noticed and got an occasion to measure it again.
Very strange, my measurements show huge performance gap, otherwise I'd rather switch to standard SubArrays way as 40% penalty looks attractive. The code

function mykern(r, s, a, x, y)
    const nsteps = 250 
    istep = 0

    ifirst = minimum(r)
    ilast  = maximum(r)
    while istep < nsteps #&& t < tmax

        @inbounds for i = ifirst : ilast-1
            s[i] = a * (x[i+1] - x[i]) / (y[i+1] - y[i]) # line #10
        end

        @inbounds for i = ifirst : ilast
            x[i] = a * s[i-1]                            # line #14
        end

        @inbounds for i = ifirst : ilast
            y[i] = a * x[i-1]
        end

        @inbounds for i = ifirst : ilast-1
            s[i] = s[i] - (x[i+1] - x[i]) / (y[i+1] - y[i])
        end
        istep=istep + 1
    end 
    s
end 

const siz = 10^4
const a = e^pi - pi^e

s = rand(siz)
x = rand(siz)
y = rand(siz)

s1 = sub(s, 1:(siz - 1))
x1 = sub(x, 1:(siz - 1))
y1 = sub(y, 1:(siz - 1))

s2 = sub(s, 2:siz)
x2 = sub(x, 2:siz)
y2 = sub(y, 2:siz)

s3 = sub(s, 4:siz)

# warming up
mykern( 2:(siz - 3), s, a, x,  y  )
mykern( 2:(siz - 3), s, a, x1, y1 )
mykern( 3:(siz - 2), s, a, x2, y2 )
# @time mykern( 2:(siz - 3), s, a, x,  y )

# the real test

println("follow up: mykern( 2:(siz - 3), s, a, x, y )")
@time mykern( 2:(siz - 3), s, a, x,  y  )
println("follow up: mykern( 2:(siz - 3), s1, a, x1, y1 )")
@time mykern( 2:(siz - 3), s1, a, x1, y1 )
println("follow up: mykern( 3:(siz - 2), s2, a, x2, y2 )")
@time mykern( 3:(siz - 2), s2, a, x2, y2 )
@time mykern( 3:(siz - 2), s1, a, x2, y2 )
@time mykern( 5:(siz - 5), s3, a, x2, y2 )
julia shifted3.jl 
follow up: mykern( 2:(siz - 3), s, a, x, y )
  17.038 milliseconds (165 allocations: 11149 bytes)
follow up: mykern( 2:(siz - 3), s1, a, x1, y1 )
   9.040 seconds      (73071 k allocations: 1115 MB, 1.11% gc time)
follow up: mykern( 3:(siz - 2), s2, a, x2, y2 )
   8.964 seconds      (73056 k allocations: 1115 MB, 0.94% gc time)
   8.449 seconds      (73056 k allocations: 1115 MB, 0.88% gc time)
   8.462 seconds      (73026 k allocations: 1114 MB, 0.89% gc time)

What's wrong with the code above?
Thanks!

@timholy
Copy link
Member

timholy commented Jun 5, 2015

My results:

follow up: mykern( 2:(siz - 3), s, a, x, y )
  11.795 milliseconds (6 allocations: 192 bytes)
follow up: mykern( 2:(siz - 3), s1, a, x1, y1 )
 108.014 milliseconds (38263 allocations: 1119 KB)
follow up: mykern( 3:(siz - 2), s2, a, x2, y2 )
  16.299 milliseconds (6 allocations: 192 bytes)
  16.692 milliseconds (6 allocations: 192 bytes)
  22.398 milliseconds (6 allocations: 192 bytes)

Are you by any chance running julia 0.3? This is all only on julia 0.4.

@ScottPJones
Copy link
Contributor

That has to be 0.4, unless somebody backported my @time macro changes...

@timholy
Copy link
Member

timholy commented Jun 5, 2015

Yeah, good point. What is your julia version, anyway?

@ScottPJones
Copy link
Contributor

Also, with that large number of allocations (~73 million, ~1GB), it doesn't surprise me that it's slow...
I'd look into that first off.

@alsam
Copy link
Author

alsam commented Jun 5, 2015

Your timing is good. I wish I had the same.
My julia version:

               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "help()" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.4.0-dev+5208 (2015-06-04 23:18 UTC)
 _/ |\__'_|_|_|\__'_|  |  Commit e9fa25b* (0 days old master)
|__/                   |  x86_64-linux-gnu

@timholy
Copy link
Member

timholy commented Jun 5, 2015

That's why I assumed julia 0.3, since SubArray construction was type-unstable on 0.3.

Also @alsam, your "warmup" doesn't really warm up the case where a subarray is the 2nd argument of mykern. So it could be computationcompilation, but if so there's still something wrong with your julia (compilation shouldn't be that slow). (EDIT: that's the reason my second call is >100ms rather than ~15ms.) But do try

include("shifted3.jl")
include("shifted3.jl")

in the REPL and see if the 2nd call is faster.

@alsam
Copy link
Author

alsam commented Jun 5, 2015

Just tried to

include("shifted3.jl")
include("shifted3.jl")

The performance is also bad unfortunately. The experiment with rotating parameters to mykern in order to make subarray the 1st argument didn't improve performance as well. Maybe something wrong with julia installation. Will retry with nightly binary and at home (another installation). Thanks!

@mbauman
Copy link
Member

mbauman commented Jun 5, 2015

Since this is more general than just SubArray and I wanted a scratchpad to document my tests, I've moved this to #11595. Thank you so much for noticing this and bringing it to my attention, @alsam!

@alsam
Copy link
Author

alsam commented Jun 7, 2015

but with nightly access to b[-10] triggers an exception due to checkbounds

You can call Base.sub_unsafe directly to bypass the checkbounds.

Thanks @timholy for the hint.
Unfortunately it doesn't work for me:

julia> a = [-10:10];
WARNING: [a] concatenation is deprecated; use collect(a) instead
[...]
julia> b = Base.sub_unsafe(a, 12:21);
julia> b[-10];
ERROR: BoundsError: attempt to access 0-dimensional SubArray{Int64,0,Array{Int64,1},Tuple{Int64},1}:
1
  at index [-10]
 in throw_boundserror at abstractarray.jl:146
 in checkbounds at abstractarray.jl:166
 in getindex at subarray2.jl:4

Julia version:

Version 0.4.0-dev+5236 (2015-06-06 15:02 UTC)
Commit 01abdc3* (0 days old master)
julia> @code_llvm b[-10]
...
  %10 = icmp eq i32 %2, 1
  br i1 %10, label %fail, label %pass2

fail:                                             ; preds = %top
  %11 = getelementptr %jl_value_t** %1, i64 1
  call void @jl_bounds_error_tuple_int(%jl_value_t** %11, i64 0, i64 1)
  unreachable
...

So even sub_unsafe checks bounds.

@alsam alsam closed this as completed Jun 7, 2015
@alsam alsam reopened this Jun 7, 2015
@alsam
Copy link
Author

alsam commented Jun 7, 2015

Sorry, pressed the wrong button and closed the issue.
Reopened it.
Thanks a lot for your efforts!

@KristofferC
Copy link
Member

This seems to do what worked in 0.3.9.

julia> a = [-10:10];

julia> b = sub(a, 12:21);

julia> Base.unsafe_getindex(b, -10)
-10

@alsam
Copy link
Author

alsam commented Jun 7, 2015

Sure, it works in 0.3.9, even conventional b[-10].

julia> a = [-10:10];

julia> b = sub(a, 12:21);

julia> b[-10]
-10

julia> b[-11]
ERROR: BoundsError()
 in getindex at subarray.jl:194

Thanks @KristofferC for noticing it.

@yuyichao
Copy link
Contributor

yuyichao commented Jun 7, 2015

@alsam I think you need unsafe_getindex on 0.4

   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "help()" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.4.0-dev+5236 (2015-06-06 15:02 UTC)
 _/ |\__'_|_|_|\__'_|  |  Commit 01abdc3* (1 day old master)
|__/                   |  x86_64-unknown-linux-gnu

julia> a = [-10:10;];

julia> b = sub(a, 12:21);

julia> Base.unsafe_getindex(b, -10)
-10

@alsam
Copy link
Author

alsam commented Jun 7, 2015

Copy that.

               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "help()" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.4.0-dev+5236 (2015-06-06 15:02 UTC)
 _/ |\__'_|_|_|\__'_|  |  Commit 01abdc3* (1 day old master)
|__/                   |  x86_64-linux-gnu

julia> a = [-10:10;];

julia> b = sub(a, 12:21);

julia> Base.unsafe_getindex(b, -10)
-10

julia> Base.unsafe_getindex(b, -11)
ERROR: BoundsError: attempt to access 21-element Array{Int64,1}:
 -10
...

Thanks @yuyichao for the useful hint!

@KristofferC
Copy link
Member

@alsam I meant that what I wrote does on 0.4 what used to work on 0.3.9. In fact, @yuyichao posted the exact same code as me :P

It is interesting how unsafe_getindex is not propagating the unsafeness all the way down but I am sure this has been discussed somewhere else.

@yuyichao
Copy link
Contributor

yuyichao commented Jun 7, 2015

In fact, @yuyichao posted the exact same code as me :P

:) and that's exactly what I did in my julia session. Just want to include the version info and point out that it is not unsafe_sub rather unsafe_getindex.

@alsam
Copy link
Author

alsam commented Jun 7, 2015

Thank you @KristofferC and @yuyichao for clarification.
Got it :)
Just curious, what is the best way to wrap unsafe_getindex to get convenient short sugared syntax b[-10] instead of lengthy unsafe_getindex as 0.3.9 used to have it? Have some ideas in mind but maybe you might propose better way?

@KristofferC
Copy link
Member

The code below is probably fit for an insanity wolf meme hehe.

julia> import Base: getindex

julia> const getindex = Base.unsafe_getindex
Warning: imported binding for getindex overwritten in module Main
unsafe_getindex (generic function with 19 methods)

julia> a = [-10:10;];

julia> b = sub(a, 12:21);

julia> b[-10]
-10

But seriously, you should probably be able to do the same but only for the SubArray getindex methods.

@alsam
Copy link
Author

alsam commented Jun 7, 2015

Excellent, thanks @KristofferC !

@KristofferC
Copy link
Member

Note, you should do something like

getindex(V::SubArray, I::Int64...) = Base.unsafe_getindex(V, I...)
getindex(V::SubArray, I::Union(Range{Int64},Array{Int64,1},Int64,Colon)...) = Base.unsafe_getindex(V, I...)
getindex{T,N,P,IV,LD}(V::SubArray{T,N,P,IV,LD}, I::Union(AbstractArray{T,N},Colon,Real)...) = Base.unsafe_getindex(V, I...)

instead to keep the boundschecking on for non subarray arrays.

@alsam
Copy link
Author

alsam commented Jun 7, 2015

Cool, thanks a lot @KristofferC :))
Just checked, worked as expected:

julia> b[-10]
-10

julia> b[-11]
ERROR: BoundsError: attempt to access 21-element Array{Int64,1}:

@timholy
Copy link
Member

timholy commented Jun 7, 2015

Chiming in late here, but yes, indexing changed a lot for #10525, so my old instincts are out of date. Glad y'all figured stuff out.

@alsam
Copy link
Author

alsam commented Jun 8, 2015

Let me summarize the performance, with the following code below - thanks to @KristofferC, @timholy, @yuyichao for the hints

getindex(V::SubArray, I::Int...) = Base.unsafe_getindex(V, I...)
getindex(V::SubArray, I::Union(Range{Int},Array{Int,1},Int,Colon)...) = Base.unsafe_getindex(V, I...)
getindex{T,N,P,IV}(V::SubArray{T,N,P,IV}, I::Union(Real, AbstractArray, Colon)...) = Base.unsafe_getindex(V, to_index(I)...)
getindex{T,N,P,IV,LD}(V::SubArray{T,N,P,IV,LD}, I::Union(AbstractArray{T,N},Colon,Real)...) = Base.unsafe_getindex(V, I...)
setindex!(V::SubArray, v, I::Int...) = Base.unsafe_setindex!(V, v, I...)


#import Base: getindex, setindex!
#const getindex = Base.unsafe_getindex
#const setindex! = Base.unsafe_setindex!

the performance is:

follow up: mykern( 2:(siz - 3), s, a, x, y )
  12.896 milliseconds (166 allocations: 11485 bytes)
follow up: mykern( 2:(siz - 3), s1, a, x1, y1 )
  81.872 milliseconds (16694 allocations: 713 KB)
follow up: mykern( 3:(siz - 2), s2, a, x2, y2 )
  19.551 milliseconds (6 allocations: 192 bytes)
  19.710 milliseconds (6 allocations: 192 bytes)
  19.322 milliseconds (6 allocations: 192 bytes)

that is the performance penalty is about 40% as @timholy predicted.
That is excellent. Interesting that for 0.3.9 didn't observe performance penalty at all for sub indexing:

follow up: 
mykern( 2:(siz - 3), s, a, x, y )
elapsed time: 0.013599747 seconds (13888 bytes allocated)
follow up: mykern( 2:(siz - 3), s1, a, x1, y1 )
elapsed time: 0.170130061 seconds (938696 bytes allocated)
follow up: mykern( 3:(siz - 2), s2, a, x2, y2 )
elapsed time: 0.124578779 seconds (88 bytes allocated)
elapsed time: 0.114813522 seconds (88 bytes allocated)
elapsed time: 0.113371351 seconds (88 bytes allocated)

@timholy
Copy link
Member

timholy commented Jun 8, 2015

@alsam, you're still not noticing that 0.3.9 has a factor of ten performance penalty for sub indexing. (0.12... seconds is 120 milliseconds, not 12 millseconds).

#11617 makes it a lot better (and better than on 0.3), but it's still not back to where we were a couple weeks ago.

@alsam
Copy link
Author

alsam commented Jun 8, 2015

Yes, @timholy , you are quite right, thanks for noticing and for the link #11617 .

@alsam
Copy link
Author

alsam commented Jun 8, 2015

Ok, let me summarize:
Just a few lines of code emulate Fortran-like arrays quite efficiently:

getindex(V::SubArray, I::Int64...) = Base.unsafe_getindex(V, I...)
getindex(V::SubArray, I::Union(Range{Int64},Array{Int64,1},Int64,Colon)...) = Base.unsafe_getindex(V, I...)
getindex{T,N,P,IV}(V::SubArray{T,N,P,IV}, I::Union(Real, AbstractArray, Colon)...) = Base.unsafe_getindex(V, to_index(I)...)
setindex!(V::SubArray, v, I::Int64...) = Base.unsafe_setindex!(V, v, I...)

macro shifted_array(T, r)
    :(sub(Array($T, length($r)), -minimum($r) + 2 : length($r)))
end

and usage:

julia> x = @shifted_array(Float64, -10:10)
10-element SubArray{Float64,1,Array{Float64,1},Tuple{UnitRange{Int64}},1}:
 6.90782e-310
...
julia> x[-10]
6.9078200935909e-310

julia> x[-11]
ERROR: BoundsError: attempt to access 21-element Array{Float64,1}:
...
julia> x[0]
6.90782011166424e-310

julia> x[10]
6.9078200930178e-310

julia> x[11]
ERROR: BoundsError: attempt to access 21-element Array{Float64,1}:

Penalty for a bigger example from 1d acoustics is quite reasonable - ~40% as expected:
timings, for base line:

593.164 milliseconds (40 allocations: 236 KB)

for sub version that uses macro @shifted_array:

 874.729 milliseconds (19507 allocations: 540 KB)

Thanks!

mbauman added a commit that referenced this issue Jun 9, 2015
This resolves the core issue behind SubArray's performance as reported in #5117.  Unfortunately, without #7799, the surface syntax isn't very nice since it's a ton of unsafe_ function calls.

Using the test case from #5117 (comment) :

* Julia pre-10525 (with `@inbounds`)

```
follow up: mykern( 2:(siz - 3), s, a, x, y )
   9.484 milliseconds (157 allocations: 10893 bytes)
follow up: mykern( 2:(siz - 3), s1, a, x1, y1 )
  68.632 milliseconds (16353 allocations: 718 KB)
follow up: mykern( 3:(siz - 2), s2, a, x2, y2 )
  16.753 milliseconds (6 allocations: 192 bytes)
  15.493 milliseconds (6 allocations: 192 bytes)
  17.036 milliseconds (6 allocations: 192 bytes)
```

* This branch (with `unsafe_getindex`/`unsafe_setindex!`):

```
follow up: mykern( 2:(siz - 3), s, a, x, y )
  10.046 milliseconds (158 allocations: 11469 bytes)
follow up: mykern( 2:(siz - 3), s1, a, x1, y1 )
  70.425 milliseconds (19097 allocations: 827 KB)
follow up: mykern( 3:(siz - 2), s2, a, x2, y2 )
  14.263 milliseconds (6 allocations: 192 bytes)
  16.470 milliseconds (6 allocations: 192 bytes)
  16.711 milliseconds (6 allocations: 192 bytes)
```
@alsam
Copy link
Author

alsam commented Jun 10, 2015

It is fantastic, but with the latest Julia Version 0.4.0-dev+5293 (2015-06-10 09:21 UTC) and @inline-ed getindex/setindex!:

@inline getindex(V::SubArray, I::Int64...) = Base.unsafe_getindex(V, I...)
@inline getindex(V::SubArray, I::Union(Range{Int64},Array{Int64,1},Int64,Colon)...) = Base.unsafe_getindex(V, I...)
@inline getindex{T,N,P,IV}(V::SubArray{T,N,P,IV}, I::Union(Real, AbstractVector, Colon)...) = Base.unsafe_getindex(V, to_index(I)...)
@inline setindex!(V::SubArray, v, I::Int64...) = Base.unsafe_setindex!(V, v, I...)

(please note, that @inline getindex{T,N,P,IV}(V::SubArray{T,N,P,IV}, I::Union(Real, AbstractVector ... follow c7e2b33 .)
sub version works faster in some cases:

julia main_sub.jl
...
333.805 milliseconds (19501 allocations: 539 KB)

vs.

julia main.jl
...
633.032 milliseconds (40 allocations: 236 KB)

but this is due to unsafe_getindex/unsafe_setindex!
if we define

import Base: getindex, setindex!
const getindex = Base.unsafe_getindex
const setindex! = Base.unsafe_setindex!

then we will get

julia main.jl 
Warning: imported binding for getindex overwritten in module Main
Warning: imported binding for setindex! overwritten in module Main
...
 198.278 milliseconds (40 allocations: 236 KB)

as expected, i.e. regular array version is faster than sub one. As byproduct check bounds is quite costly, without it the regular array version is >3x faster for this particular case.

@mbauman
Copy link
Member

mbauman commented Jun 10, 2015

I've seen some strange performance effects due to using Base.f() in function definitions. I never fully tracked it down, but perhaps try instead:

import Base: getindex, unsafe_getindex
@inline getindex(V::SubArray, I::Int64...) = unsafe_getindex(V, I...)
...

We're planning on reverting back to the old SubArray semantics, where bounds checks are deferred to the parent array (which is then affected by @inbounds if it's an Array). See #11641.

@alsam
Copy link
Author

alsam commented Jun 10, 2015

Great, it would be nice to have old SubArray semantics reverted, thanks @mbauman !

@alsam
Copy link
Author

alsam commented Jun 16, 2015

Confirm that with the last julia Version 0.4.0-dev+5394 (2015-06-15 23:17 UTC) even without redefining @inline getindex(V::SubArray, I::Int64...) = unsafe_getindex(V, I...) the sub trick to achieve negative indexing works out of the box and checks bounds correctly:

julia> macro shifted_array(T, r)
           :(sub(Array($T, length($r)), -minimum($r) + 2 : length($r)))
       end

julia> z = @shifted_array(Float64, -7:7);

julia> z[-7]
6.92464944583943e-310

julia> z[-8]
ERROR: BoundsError: attempt to access 15-element Array{Float64,1}:
 6.92465e-310

The last score for 1d acoustics code

baseline:

 320.611 milliseconds (40 allocations: 236 KB)

vs. sub version:

483.243 milliseconds (19501 allocations: 539 KB)

so the penalty is quite reasonable.
Thanks @mbauman, @timholy, @KristofferC, @yuyichao !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster
Projects
None yet
Development

No branches or pull requests