Add support for @simd #5355

Merged
merged 9 commits on Mar 31, 2014

Conversation

ArchRobison
Contributor

This pull request enables the LLVM loop vectorizer. It's not quite ready for production; I'd like feedback and help fixing some issues. The overall design is explained in this comment on issue #4786, except that it no longer relies on the "banana interface" mentioned in that comment.

Here is an example that it can vectorize when a is of type Float32, and x and y are of type Array{Float32,1}:

function saxpy( a, x, y )
    @simd for i=1:length(x)
        @inbounds y[i] = y[i]+a*x[i];
    end
end

I've seen the vectorized version run 3x faster than the unvectorized version when data fits in cache. When AVX can be enabled, the results are likely even better.

Programmers can put the @simd macro in front of one-dimensional for loops that have ranges of the form m:n, where the type of the loop index supports < and +. The decoration guarantees that the loop does not rely on wrap-around behavior and that the loop iterations are safe to execute in parallel, even if chunks are done in lockstep.
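
Here is another minimal example of the intended usage pattern (a hypothetical function with the same shape as the saxpy example above):

function scale( a, x )
    @simd for i=1:length(x)
        @inbounds x[i] = a*x[i];
    end
end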

The patch implements type-based alias analysis, which may help LLVM optimize better in general, and is essential for vectorization. The name "type-based alias analysis" is a bit of a misnomer, since it's really based on hierarchically partitioning memory. I've implemented it for Julia assuming that type-punning is never done for parts of data structures that users cannot access directly, but that user data can be type-punned freely.

Problems that I seek advice on:

  • The @simd macro is not found. Currently I have to do the following within the REPL:
include("base/simdloop.jl")
using SimdLoop.@simd

I tried to copy the way @printf is defined/exported, but something is wrong with my patch. What?

  • LLVM 3.3 disallows attaching metadata to a block, so I've attached it to an instruction in the block. It's kind of ad-hoc, but seems to work. Is there a better way to do it?
  • An alternative to attaching metadata is to eliminate src/llvm-simdloop.cpp and instead rely on LLVM's auto-vectorization capability, which inserts memory dependence tests. That indeed works for the saxpy example above, i.e. it vectorizes without the support of src/llvm-simdloop.cpp. However, @simd would still be necessary to transform the loop into a form such that LLVM can compute a trip count.
  • An alternative to the trip-count issue is to eliminate @simd altogether and instead somehow ensure that m:n is lowered to a form for which LLVM can compute a trip count.
  • I'm a neophyte at writing macros, so base/simdloop.jl could use a review by an expert.

Apologies for the useless comment:

This file defines two entry points:

I just noticed it, but being late on a Friday, I'll fix it later. It's supposed to say that one entry point is for marking simd loops and the other is for later lowering marked loops.

Thanks to @simonster for his information on enabling the loop vectorizer. It was a big help to get me going.

@simonster
Member

Amazing!

@jiahao
Member

jiahao commented Jan 10, 2014

😺

@johnmyleswhite
Member

💯

@JeffBezanson
Sponsor Member

Amazing, I look forward to reading this in detail. Even just the TBAA part is great to have.

@@ -175,6 +175,9 @@ using .I18n
using .Help
push!(I18n.CALLBACKS, Help.clear_cache)

# SIMD loops
include("simdloop.jl")
Member

I think you might need a

importall .SimdLoop

here?

Contributor Author

Thanks! Now added.
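
For reference, the relevant lines in base/sysimg.jl would then read roughly as follows (a sketch combining the hunk above with the suggested import; the exact surrounding context may differ):

# SIMD loops
include("simdloop.jl")
importall .SimdLoop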

@lindahua
Contributor

Eagerly looking forward to this.

@ViralBShah
Member

Likewise. Waiting for this to land.

@ArchRobison
Contributor Author

One feature of the pull request is that it enables auto-vectorization of some loops without @simd. But it's quirky, and the underlying reason for the quirkiness needs discussion because with a small change, we might be able to enable wider use of auto-vectorization in Julia. Consider the following example:

function saxpy( a, x, y )
    for i in 1:length(x)
        @inbounds y[i] = y[i]+a*x[i];
    end
end

LLVM will not auto-vectorize it because it cannot compute a trip count. Now change 1:length(x) to (1:length(x))+0. Then (with the current PR) the example does vectorize!

The root issue is that the documented way Julia lowers for loops works just fine for the vectorizer, but there is an undocumented optimization that gets in the way. If a loop has the form for i in a:b, then it is custom-lowered differently. (See 'for' in src/julia-syntax.scm.) The custom lowering likely helps compilation time by short-cutting a lot of analysis and transformation. Regrettably, it puts the loop in a form where LLVM cannot compute a trip count. Here's a sketch of the form (I'm abstracting out some details):

i = a
while i<=b 
    ...
    i = i+1

Assume a and b are of type Int. LLVM cannot compute a trip count because the loop is an infinite loop if b == typemax(Int). The "no signed wrap" flag (see #3929) would let LLVM rule out this possibility. So I think we should consider one of two changes to the short-cut lowering of for loops:

  • Somehow set the "no signed wrap" flag on the right add instruction, by using an intrinsic per the suggestion of @simonster.
  • Change the lowering to:
i = a
while i<b+1
    ...
    i = i+1
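
For reference, the wrap-around behavior underlying the trip-count problem:

julia> typemax(Int) + 1 == typemin(Int)
true

so with the current lowering, i<=b can never become false when b == typemax(Int).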

I think an annotation such as @simd is essential to trickier cases where run-time memory disambiguation is impractical. But I think we should consider whether the "short cut" lowering of for loops should be more friendly to auto-vectorization.

Comments?

@simonster
Member

This seems kind of like a bug in the current lowering, since for i = typemax(Int):typemax(Int); end should probably not be an infinite loop. Changing the lowering to i < b+1 would cause a loop ending in typemax(Int) not to be executed at all, which is still not quite right (although if the current behavior is acceptable, this seems equally acceptable). If we care about handling loops ending in typemax(Int), it seems like we could lower to:

if b >= a
  i = a
  while i != b+1
      ...
      i = i+1
  end
end

Can LLVM compute a trip count in that case?

@JeffBezanson
Sponsor Member

Wow, it is quite satisfying that the shortcut hack works worse than the general case :)
This is indeed a bug.

It looks to me like @simonster's solution is the only one that will handle the full range. However, the Range1 type used in the general case can only have up to typemax(Int) elements. The special-case lowering could mimic that:

n = b-a+1
# error if range too big
c = 0
while c < n
    i = c+a
    ...
    c = c+1
end

@StefanKarpinski
Sponsor Member

If you move the check to the end of the loop, then the fact that it's typemax doesn't matter:

i = a - 1
goto check
while true
    # body
    label check
    i < b || break
    i += 1
end

Edit: fix starting value.

@StefanKarpinski
Sponsor Member

If you're willing to have an additional branch, then you can avoid the subtraction at the beginning.

@ArchRobison
Contributor Author

Is the short-cut expected to be semantically equivalent to the long path? E.g., how finicky should we be about which signatures are expected for the types of the bounds? If I understand correctly, the lowering at this point happens before type inference. Do we have any measurements of what the short-cut buys in terms of JIT+execution time or code space? I'm wondering whether the short-cut could be removed and whatever savings it provided could be made up elsewhere in the compilation chain.

Here are some tricky examples to consider in proposing shortcuts/semantics:

for i=0.0:.1:.25   # Fun with floating-point rounding.  Trip count should be 3.
       println(i)
end
for j=typemin(Int):typemin(Int)+1   # Tripcount should be 2.
       println(j)
end
for k=typemax(Int)-1:typemax(Int) # Tripcount should be 2
       println(k)
end

All of these deliver the correct (or at least obvious :-)) results with the long path, but may go astray with some shortcut solutions.

Besides user expectations, something else to consider is the path through the rest of the compilation chain. I suspect that the loop optimizations will almost invariably transform a test-at-top loop into a test-at-bottom loop wrapped in a zero-trip guard, i.e. something like this:

if (loop-test) {
      loop-preheader (compute loop invariants, initialize induction variables)
      do {
          loop body
      } while(loop-test);
}

So if we lower a loop into this form in the first place for semantic reasons, we're probably not creating any extra code bloat since the compiler was going to do it anyway.

@StefanKarpinski
Sponsor Member

Maybe we should remove the special-case handling altogether? At this point, with range objects being immutable types and the compiler being quite smart about them, I suspect the special case may no longer be necessary. It originally was very necessary because neither of those things was true.

@simonster
Member

Without special lowering, we have to make a function call to colon, which has to call the Range1 constructor. This appears to have noticeable overhead if the time to execute the loop is short. Consider:

function f(A)
    c = 0.0
    for i = 1:10000000
        for j = 1:length(A)
            @inbounds c += A[j]
        end
    end
    c
end

function g(A)
    c = 0.0
    for i = 1:10000000
        rg = 1:length(A)
        for j = rg
            @inbounds c += A[j]
        end
    end
    c
end

The only difference here should be that f(A) gets the special lowering whereas g(A) does not. For A = rand(5), after compilation, f(A) is consistently almost twice as fast:

julia> @time f(A);
elapsed time: 0.03747795 seconds (64 bytes allocated)

julia> @time f(A);
elapsed time: 0.037112331 seconds (64 bytes allocated)

julia> @time g(A);
elapsed time: 0.066732369 seconds (64 bytes allocated)

julia> @time g(A);
elapsed time: 0.066190191 seconds (64 bytes allocated)

If A = rand(100), the difference is almost non-existent, but I don't think we should deoptimize small loops. OTOH, if we could fully inline colon and the optimizer can elide the non-negative length check for Range1 construction, maybe this would generate the same code as @JeffBezanson's proposal.

@JeffBezanson
Sponsor Member

Getting rid of the special case would be great. I'll explore what extra inlining might get us here.

@JeffBezanson
Sponsor Member

LLVM seems to generate far more compact code with these definitions:

start(r::Range1) = r.start
next{T}(r::Range1{T}, i) = (i, oftype(T, i+1))
done(r::Range1, i) = i==(r.start+r.len)

With that plus full inlining I think we will be ok without the special case. Just need to make sure it can still vectorize the result.
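
For reference, the general (non-shortcut) lowering of for i in r expands roughly to the following via the iteration protocol (a sketch, not the literal front-end output):

s = start(r)
while !done(r, s)
    (i, s) = next(r, s)
    # loop body
end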

@JeffBezanson
Sponsor Member

Another idea: use the Range1 type only for integers, and have it store start and stop instead of length. That way the start and stop values can simply be accepted with no checks, and the length method can throw an overflow error if the length can't be represented as an Int. The reason for this is that computing the length is the hard part, and you often don't need it.

Otherwise we are faced with the following:

  1. Check stop<start, set length to 0 if so
  2. Compute checked_add(checked_sub(stop,start),1) to check for over-long ranges
  3. Call Range1 constructor, which must check length<0 in case somebody calls the constructor directly

So there are 3 layers of checks, the third of which is redundant when called from colon. We could have a hidden unsafe constructor that elides check (3), for use by colon, but that's kind of a hack and only addresses a small piece.

More broadly, it looks like a mistake to try to use the same type for integer and floating-point ranges. Floats need the start/step/length representation, but to the extent you want to write start:step:stop with integers, you're better off keeping those same three numbers since you get more range.
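
A rough sketch of what such an integer-only range could look like (hypothetical names; not a concrete proposal):

immutable IntRange1{T<:Integer}
    start::T
    stop::T           # stored as given; no checks at construction
end

# length is where overflow can happen, so check it only when asked for
# (promotion details glossed over in this sketch)
function length(r::IntRange1)
    r.stop < r.start && return 0
    checked_add(checked_sub(r.stop, r.start), 1)
end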

@ArchRobison
Contributor Author

I verified that the auto-vectorizer can vectorize this example, which I believe is equivalent to the code after "full inlining" of @JeffBezanson's changes to Range1.

function saxpy( a, x, y )
    r = 1:length(x)
    s = r.start
    while !(s==(r.start+r.len))
        (i,s) = (s,oftype(Int,s+1))
        @inbounds y[i] = y[i]+a*x[i];
    end
end

@ArchRobison
Contributor Author

By the way, it's probably good to limit the shortcut to integer loops, or at least avoid any schemes that rely on floating-point induction variables. Otherwise round-off can cause surprises. Here's a surprise with the current Julia:

a=2.0^53
b=a+2
r = a:b
for i in r        # Performs 3 iterations as expected
    println(i)
end
for i in a:b      # Infinite loop
    println(i)
end

@JeffBezanson
Sponsor Member

Clearly we need to just remove the special case. That will be a great change.

JeffBezanson added a commit that referenced this pull request Jan 17, 2014
this fixes some edge-case loops that the special lowering did not
handle correctly.

colon() now checks for overflow in computing the length, which avoids
some buggy Range1s that used to be possible.

this required some changes to make sure Range1 is fast enough:
specialized start, done, next, and a hack to avoid one of the checks and
allow better inlining.

in general performance is about the same, but a few cases are actually
faster, since Range1 is now faster (comprehensions used Range1 instead
of the special-case lowering, for example). also, more loops should be
vectorizable when the appropriate LLVM passes are enabled. all that
plus better correctness and a simpler front-end, and I'm sold.
@StefanKarpinski
Sponsor Member

More broadly, it looks like a mistake to try to use the same type for integer and floating-point ranges. Floats need the start/step/length representation, but to the extent you want to write start:step:stop with integers, you're better off keeping those same three numbers since you get more range.

This seems quite sensible. I believe this also addresses things like ranges of Char, BigInt, and other non-traditional types. There was another example recently, which I don't recall.

@ArchRobison
Contributor Author

Where in the manual should I document @simd? It's fundamentally about relaxing control flow, so doc/manual/control-flow.rst is a logical place. However, @simd is a bit esoteric and might be a distraction there. It's different from the parallel programming model, so doc/manual/parallel-computing.rst doesn't seem like the right place either. Should I give @simd its own section in the manual?

@ivarne
Sponsor Member

ivarne commented Jan 22, 2014

I would expect to find something like @inbounds and @simd in a performance chapter. They are both about making the user do something that ideally would be the compiler's job.

How about performance-tips.rst?

@jiahao
Member

jiahao commented Jan 22, 2014

I like the idea of a new "performance tweaks" chapter

@simonster
Member

If we're still planning to implement #2299, I suspect we'll eventually need a whole chapter just for SIMD.

@tknopp
Contributor

tknopp commented Jan 22, 2014

@simonster Hopefully not. The autovectorizer of LLVM is pretty good, and I have doubts that hand-written SIMD code is always faster. In my experience, writing a simple matrix-vector multiplication in C with autovectorization is as fast as the SIMD-optimized Eigen routines (I was using gcc when I tested this).

@lindahua
Contributor

I agree that when this lands, #2299 might be less urgent than before. Still, there are plenty of cases where explicit use of SIMD instructions is desired.

Recent advances in compiler technology have made compilers more intelligent, and they are now able to detect and vectorize simple loops (e.g. mapping and simple reduction, or sometimes matrix multiplication patterns).

However, they are still not smart enough to automatically vectorize more complex computations: for example, image filtering, small matrix algebra (where an entire matrix fits in a small number of AVX registers, and one can finish an 8x8 matrix multiplication in less than 100 CPU cycles using carefully crafted SIMD code), as well as transcendental functions, etc.

@jakebolewski
Member

@ArchRobison that article was fantastic!

@vchuravy
Sponsor Member

Recently there has been work on enabling interleaved memory accesses [1] in LLVM. I am wondering how best to use this in combination with the SIMD work.

[1] http://reviews.llvm.org/rL239291

@ArchRobison
Contributor Author

I see the feature is off by default. Maybe we could enable it with -O? My initial take is that the poster child for vectorizing interleaved memory access is complex arithmetic, but typically that involves complex multiplications which will require more work in LLVM to vectorize.
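
For concreteness, this is the kind of stride-2 access pattern the interleaved-access work targets (a hypothetical example, not from this PR), e.g. complex data stored as [re1, im1, re2, im2, ...]:

# scale interleaved complex data by a real factor (assumes even length)
function scale_interleaved!(a::Float64, z::Vector{Float64})
    @inbounds for k = 1:2:length(z)-1
        z[k]   *= a    # real parts: stride-2 accesses
        z[k+1] *= a    # imaginary parts
    end
    return z
end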

@jackmott

I would like to add a vote for some method of doing SIMD by hand, whether as part of a standard library or as a language feature. Probably 90% of the potential benefit of SIMD is not going to be realized with automatic vectorization, and compilers aren't going to bridge that gap significantly, ever. Consider, for example, the implementation of common noise functions like Perlin noise. These involve dozens of steps, a few branches, and lookup tables: things the compilers won't be figuring out in my lifetime. My hand-written SIMD achieved a 3-5x speedup (128- vs 256-bit-wide varieties) over what the latest compilers manage to do automatically, and I am a complete novice. There is a whole universe of applications (games, image processing, video streaming, video editing, physics and number theory research) where programmers are forced to drop down to C or accept code that is 3x-10x slower than it needs to be. With 512-bit-wide SIMD coming onto the market it is too powerful to ignore, and adding good support for SIMD immediately differentiates your language from the other new languages out there, which mostly ignore SIMD.

@iamed2
Contributor

iamed2 commented Jan 19, 2016

@jackmott You may be able to manually vectorize using llvmcall, but that would require knowledge of LLVM IR.
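
For a taste of what that looks like, here is essentially the standard introductory scalar llvmcall example (no SIMD yet; the IR body refers to the arguments as %0 and %1):

llvm_add(x::Int32, y::Int32) = Base.llvmcall("""
    %3 = add i32 %1, %0
    ret i32 %3""", Int32, Tuple{Int32,Int32}, x, y)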

@eschnett
Contributor

I've been wanting to write a small library based on NTuple and llvmcall for some time...

@JeffBezanson
Sponsor Member

That would be awesome. Would be great to have simd types and operations within easy reach.

@JeffBezanson
Sponsor Member

We could reopen #2299

@eschnett
Contributor

Here we go:

julia> workspace(); using SIMD; code_native(sqrt, (Vec{4,Float64},))
    .section    __TEXT,__text,regular,pure_instructions
Filename: SIMD.jl
Source line: 0
    pushq   %rbp
    movq    %rsp, %rbp
Source line: 186
    vsqrtpd (%rsi), %ymm0
    vextractf128    $1, %ymm0, %xmm1
Source line: 5
    vmovhpd %xmm1, 24(%rdi)
    vmovlpd %xmm1, 16(%rdi)
    vmovhpd %xmm0, 8(%rdi)
    vmovlpd %xmm0, (%rdi)
    movq    %rdi, %rax
    popq    %rbp
    vzeroupper
    retq

This is with Julia master, using LLVM 3.7.1. LLVM seems to be a bit confused about how to store an array to memory, leading to the ugly vmov sequence in the end, but the actual vectorization works like a charm. See https://github.com/eschnett/SIMD.jl for the proof of concept.

@vchuravy
Sponsor Member

@eschnett I assume I am too quick, but SIMD.jl is still empty ;)

@eschnett
Contributor

Thank you, forgot to push after adding the code.

@eschnett
Contributor

@JeffBezanson I notice that Julia tuples are mapped to LLVM arrays, not LLVM vectors. To generate SIMD instructions, one has to convert via a series of extractvalue and insertelement instructions. Unfortunately, it turns out that LLVM (3.7, x86-64) is not good at optimizing these, on occasion leading to cumbersome generated code that breaks vectors into scalars and re-assembles them.

Is there a chance to represent tuples as LLVM vectors instead?

I'm currently representing SIMD types as bitstypes in Julia, since these can be efficiently bitcast to LLVM vector types. That leads to efficient code, but is more complex on the Julia side.

@ArchRobison
Contributor Author

I'm on sabbatical (four more days!) and largely ignoring email and GitHub. But apropos of this issue, I have an extant LLVM patch that fixes the "cumbersome code" problem that Erik observed. The patch was developed after I discovered from experience that mapping tuples to LLVM vectors was not going to work well.


@yuyichao
Contributor

It would be nice if we had a standardized type for LLVM vectors, since they might be necessary to (c)call some vector math libraries.

@eschnett
Contributor

@ArchRobison I'm looking forward to trying your patch.

For the record, this is how a simple loop (summing an array of Float64) currently looks:

L224:
    vmovq   %rdx, %xmm0
    vmovq   %rbx, %xmm1
    vunpcklpd   %xmm1, %xmm0, %xmm0 ## xmm0 = xmm0[0],xmm1[0]
    vmovq   %rdi, %xmm1
    vmovq   %rsi, %xmm2
    vunpcklpd   %xmm2, %xmm1, %xmm1 ## xmm1 = xmm1[0],xmm2[0]
    vinsertf128 $1, %xmm1, %ymm0, %ymm0
    vaddpd  (%rcx), %ymm0, %ymm0
    vextractf128    $1, %ymm0, %xmm1
    vpextrq $1, %xmm1, %rsi
    vmovq   %xmm1, %rdi
    vpextrq $1, %xmm0, %rbx
    vmovq   %xmm0, %rdx
    addq    $32, %rcx
    addq    $-4, %rax
    jne L224

Only the add instructions are real; the move, extract, unpack, and insert instructions are strictly redundant.

@JeffBezanson
Sponsor Member

I recall some problems in mapping tuples to vectors, very likely involving alignment, calling convention, and/or bugs in LLVM. It's clear that only a small subset of tuple types can potentially be vector types, so there's ambiguity about whether a given tuple will be a struct or vector or array, which can cause subtle bugs interoperating with native code.

@ArchRobison
Contributor Author

I had the mapping from tuples to vectors working, with all the fixes for alignment. It was messily context-sensitive. But that wasn't the show-stopper. What killed it was that it hurt performance as often as it helped. My conclusion was that the mapping to vectors needs to happen much later in the compilation pipeline, when LLVM can be sure it will likely pay off. On Monday, when I return to the office after my 9-week absence, I'll track down the review status of my LLVM patch. (Its context is probably bit-rotted by now.)

  • Arch


@eschnett
Contributor

@ArchRobison Did you have time to look for the patch?

@ArchRobison
Contributor Author

Yes, and I updated it this morning per suggestions from LLVM reviewers while I was out. The patch has two parts:

http://reviews.llvm.org/D14185
http://reviews.llvm.org/D14260

@eschnett
Contributor

@ArchRobison I'm currently generating SIMD code like this:

julia> @code_llvm Vec{2,Float64}(1) + Vec{2,Float64}(2)

define void @"julia_+_23864.1"(%Vec.12* sret, %Vec.12*, %Vec.12*) #0 {
top:
  %3 = getelementptr inbounds %Vec.12, %Vec.12* %1, i64 0, i32 0
  %4 = load [2 x double], [2 x double]* %3, align 8
  %5 = getelementptr inbounds %Vec.12, %Vec.12* %2, i64 0, i32 0
  %6 = load [2 x double], [2 x double]* %5, align 8
  %arg1arr_0.i = extractvalue [2 x double] %4, 0
  %arg1_0.i = insertelement <2 x double> undef, double %arg1arr_0.i, i32 0
  %arg1arr_1.i = extractvalue [2 x double] %4, 1
  %arg1.i = insertelement <2 x double> %arg1_0.i, double %arg1arr_1.i, i32 1
  %arg2arr_0.i = extractvalue [2 x double] %6, 0
  %arg2_0.i = insertelement <2 x double> undef, double %arg2arr_0.i, i32 0
  %arg2arr_1.i = extractvalue [2 x double] %6, 1
  %arg2.i = insertelement <2 x double> %arg2_0.i, double %arg2arr_1.i, i32 1
  %res.i = fadd <2 x double> %arg1.i, %arg2.i
  %res_0.i = extractelement <2 x double> %res.i, i32 0
  %resarr_0.i = insertvalue [2 x double] undef, double %res_0.i, 0
  %res_1.i = extractelement <2 x double> %res.i, i32 1
  %resarr.i = insertvalue [2 x double] %resarr_0.i, double %res_1.i, 1
  %7 = getelementptr inbounds %Vec.12, %Vec.12* %0, i64 0, i32 0
  store [2 x double] %resarr.i, [2 x double]* %7, align 8
  ret void
}

That is:

  • a sequence of extractvalue/insertelement to convert the Julia tuple/LLVM array to an LLVM vector
  • a single LLVM vector operation (here: add)
  • a sequence of extractelement/insertvalue to convert the LLVM vector back to a LLVM array/Julia tuple

With your patches, would this still be a good way to proceed?
Or should this be a sequence of scalar operations instead, omitting the insert-/extractelement statements?

@ArchRobison
Contributor Author

Yes and no. The patch http://reviews.llvm.org/D14260 deals with optimizing the store. I ran your example through (using %Vec.12 = type { [2 x double] }), and the store was indeed optimized to:

  %res.i = fadd <2 x double> %arg1.i, %arg2.i
  %7 = bitcast %Vec.12* %0 to <2 x double>*
  store <2 x double> %res.i, <2 x double>* %7, align 8
  ret void

But the load sequence was not optimized. The problem is that http://reviews.llvm.org/D14185 targets the situation where the tuple code is still fully scalar LLVM IR (such as this example from the unit tests), not partially vectorized code as in your example. For what you are doing, is it practical to generate fully scalar LLVM IR? Or do we need to consider adding another instruction-combining transform to LLVM?

@eschnett
Contributor

Yes, emitting scalar operations would be straightforward to do.

In the past -- with much older versions of LLVM, and/or with GCC -- it was important to emit arithmetic operations as vector operations, since they would otherwise not be synthesized. It seems newer versions of LLVM are much better at this, so this might be the way to go.
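
For example, a fully scalar 4-wide add would just be element-wise tuple code (a sketch; per-element scalar IR like this is what the combiner patches are meant to turn back into a single <4 x double> operation):

vadd(a::NTuple{4,Float64}, b::NTuple{4,Float64}) =
    (a[1]+b[1], a[2]+b[2], a[3]+b[3], a[4]+b[4])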

@eschnett
Contributor

Yay! Success!

@ArchRobison Your patch D14260, applied to LLVM 3.7.1, with Julia's master branch and my LLVM-vector version of SIMD, is generating proper SIMD vector instructions without the nonsensical scalarization.

Here are two examples of generated AVX2 code (with bounds checking disabled; keeping it enabled still vectorizes the code, but has two additional branches at every loop iteration):

Adding two arrays:

L176:
    movq    (%r15), %rdx
Source line: 766
    vmovupd (%rcx,%rdx), %ymm0
Source line: 458
    movq    (%rbx), %rsi
Source line: 419
    vaddpd  (%rcx,%rsi), %ymm0, %ymm0
Source line: 803
    vmovupd %ymm0, (%rcx,%rdx)
    movq    %r14, -64(%rbp)
Source line: 62
    addq    $32, %rcx
    addq    $-4, %rax
    jne L176

Calculating the sum of an array:

L128:
    vaddpd  (%rcx), %ymm0, %ymm0
Source line: 55
    addq    $32, %rcx
    addq    $-4, %rax
    jne L128

Accessing the array elements in the first kernel is still too complicated. I assume that LLVM needs to be told that the two arrays don't overlap with the array descriptors. Also, some loop unrolling is called for.

Thanks a million!

@ArchRobison
Contributor Author

Good to hear it worked. Was that just D14260, or D14260 and D14185 (http://reviews.llvm.org/D14185)? (Logically the two diffs belong together, but LLVM review formalities caused the split.)

@eschnett
Contributor

This was only D14260. D14185 didn't apply, so I tried without it, and it worked.

eschnett added a commit to eschnett/julia that referenced this pull request Feb 7, 2016
Arch Robison proposed the patch <http://reviews.llvm.org/D14260> "Optimize store of "bitcast" from vector to aggregate" for LLVM. This patch applies cleanly to LLVM 3.7.1. It seems to be the last missing puzzle piece on the LLVM side to allow generating efficient SIMD instructions via `llvmcall` in Julia. For an example package, see e.g. <https://github.com/eschnett/SIMD.jl>.

Some discussion relevant to this PR is in JuliaLang#5355. @ArchRobison, please comment.

Julia stores tuples as LLVM arrays, whereas LLVM SIMD instructions require LLVM vectors. The respective conversions are unfortunately not always optimized out unless the patch above is applied, leading to a cumbersome sequence of instructions to disassemble and reassemble a SIMD vector. An example is given here <eschnett/SIMD.jl#1 (comment)>.

Without this patch, the loop kernel looks like (x86-64, AVX2 instructions):

```
    vunpcklpd   %xmm4, %xmm3, %xmm3 # xmm3 = xmm3[0],xmm4[0]
    vunpcklpd   %xmm2, %xmm1, %xmm1 # xmm1 = xmm1[0],xmm2[0]
    vinsertf128 $1, %xmm3, %ymm1, %ymm1
    vmovupd 8(%rcx), %xmm2
    vinsertf128 $1, 24(%rcx), %ymm2, %ymm2
    vaddpd  %ymm2, %ymm1, %ymm1
    vpermilpd   $1, %xmm1, %xmm2 # xmm2 = xmm1[1,0]
    vextractf128    $1, %ymm1, %xmm3
    vpermilpd   $1, %xmm3, %xmm4 # xmm4 = xmm3[1,0]
Source line: 62
    vaddsd  (%rcx), %xmm0, %xmm0
```

Note that the SIMD vector is kept in register `%ymm1`, but is unnecessarily scalarized into registers `%xmm{0,1,2,3}` at the end of the kernel, and re-assembled in the beginning.

With this patch, the loop kernel looks like:

```
L192:
	vaddpd	(%rdx), %ymm1, %ymm1
Source line: 62
	addq	%rsi, %rdx
	addq	%rcx, %rdi
	jne	L192
```

which is perfect.
eschnett added further commits to eschnett/julia referencing this pull request on Feb 7 and Feb 8, 2016, with the same commit message as above.