
Simple reductions of Vectors of ForwardDiff numbers fail to use SIMD #98

Closed
KristofferC opened this issue Feb 2, 2016 · 50 comments

@KristofferC (Collaborator)

A standard reduction function like:

function simd_sum{T}(x::Vector{T})
    s = zero(T)
    @simd for i in eachindex(x)
        @inbounds s = s + x[i]
    end
    return s
end

which vectorizes with Vector{Float64} but fails to vectorize with ForwardDiff numbers. For example:

@code_llvm simd_sum(rand(Float64, 10))

G = ForwardDiff.GradientNumber{4, Float64, NTuple{4, Float64}}
G_vec = G[rand(G) for i = 1:10]
@code_llvm simd_sum(G_vec)

It is possible that there is nothing that can be done on the ForwardDiff side to solve this but I thought it could be interesting to track nonetheless.

@KristofferC (Collaborator, Author)

Unfortunately, I am not well versed enough in LLVM IR to be able to pinpoint exactly what the problem is...

@KristofferC (Collaborator, Author)

This generates very similar code and also fails to vectorize:

function tuple_simd_sum{T}(x::Vector{NTuple{4, T}})
    s = (0.0, 0.0, 0.0, 0.0)
    @inbounds @simd for i in eachindex(x)
        x_i = x[i]
        s = (s[1] + x_i[1], s[2] + x_i[2], s[3] + x_i[3], s[4] + x_i[4])
    end
    return s
end

tuple_vec = [(rand(), rand(), rand(), rand()) for i = 1:20]

@code_llvm tuple_simd_sum(tuple_vec)

@mlubin (Contributor) commented Feb 2, 2016

If moving from NTuple to another storage type for the epsilon components will make it easier to get access to SIMD, I'd say it's worth considering.

@KristofferC (Collaborator, Author)

I don't know what other storage type that would be. A vector of GradientNumber is already packed efficiently in contiguous memory so I don't think the storage type is wrong, it is just that it is translated to LLVM IR in a non optimal way.

There are some patches pending to LLVM that might help JuliaLang/julia#5355 (comment)

It would be interesting to experiment with https://github.com/eschnett/SIMD.jl when it is mature enough.

@jrevels (Member) commented Feb 3, 2016

Using NTuple storage for epsilon components will allow us to manually leverage SIMD using llvmcall, such that Partials{N,T,NTuple{N,T}} (in an upcoming refactor, just Partials{N,T}) behaves like SIMD.jl's Vec{N,T}.

I think manually calling SIMD instructions might be the most consistent way to resolve this issue (instead of having to rely on the @simd heuristics in Base).
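Roughly, the idea is something like this sketch (placeholder names Vec4/vadd, assuming Float64 partials of width 4; not actual ForwardDiff code):

const Vec4 = NTuple{4,VecElement{Float64}}  # lowers to an LLVM <4 x double>

# Elementwise add of two 4-wide partials via a single LLVM fadd.
@inline vadd(a::Vec4, b::Vec4) = Base.llvmcall("""
    %res = fadd <4 x double> %0, %1
    ret <4 x double> %res
    """, Vec4, Tuple{Vec4,Vec4}, a, b)

a = (VecElement(1.0), VecElement(2.0), VecElement(3.0), VecElement(4.0))
b = (VecElement(5.0), VecElement(6.0), VecElement(7.0), VecElement(8.0))
vadd(a, b)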

@KristofferC (Collaborator, Author)

I made a small proof of concept using SIMD.jl's Vec for the dual part at 3fcd796. I basically threw out the whole PartialVec just to get it working as fast as possible.

Doing something like:

using ForwardDiff
using SIMD

function simd_sum{T}(x::Vector{T})
    s = zero(T)
    @simd for i in eachindex(x)
        @inbounds s = s + x[i]
    end
    return s
end

vec = [ForwardDiff.GradientNumber{4, Float64, Vec{4, Float64}}(1.0, ForwardDiff.Partials(Vec{4, Float64}((rand(), rand(), rand(), rand())))) for i in 1:500]

@code_native simd_sum(vec)

shows a bunch of vectorized operations being used. It is, however, about twice as slow as the original code, hehe. Maybe my implementation is not perfect, or SIMD.jl needs a bit more battle experience. Still pretty interesting.

@KristofferC (Collaborator, Author)

I made another proof of concept of getting the operations on the partial numbers SIMD accelerated.

First I started on a package SIMDVectors.jl: https://github.com/KristofferC/SIMDVectors.jl It uses JuliaLang/julia#15244 to generate vectorized operations for a type similar to a tuple.

I then made a GradientNumber and Partials that use this SIMDVector as the Partials.data field: https://gist.github.com/KristofferC/56594a9c57d6f27df769

The result is quite nicely generated SIMD code:

include("simd_fdiff.jl")
using SIMDVectors


a = GradientNumber(rand(), Partials(load(SIMDVector{9}, rand(9))))
b = GradientNumber(rand(), Partials(load(SIMDVector{9}, rand(9))))

@code_native a + b

.
    vaddpd  16(%rdx), %xmm5, %xmm5
    vaddpd  32(%rdx), %xmm4, %xmm4
    vaddpd  48(%rdx), %xmm3, %xmm3
    vaddpd  64(%rdx), %xmm2, %xmm2
    vaddsd  80(%rdx), %xmm1, %xmm1
    vaddsd  (%rdx), %xmm0, %xmm0
.

@code_native a * b

julia> @code_native a * b
.
    vmulpd  32(%rsi), %xmm1, %xmm10
    vmulpd  48(%rsi), %xmm1, %xmm11
    vmulpd  64(%rsi), %xmm1, %xmm8
Source line: 6
    vmulsd  80(%rsi), %xmm0, %xmm5
Source line: 34
    vmovsd  (%rdx), %xmm6           # xmm6 = mem[0],zero
Source line: 8
    vmovddup    %xmm6, %xmm7    # xmm7 = xmm6[0,0]
    vmulpd  16(%rdx), %xmm7, %xmm1
    vmulpd  32(%rdx), %xmm7, %xmm2
    vmulpd  48(%rdx), %xmm7, %xmm3
    vmulpd  64(%rdx), %xmm7, %xmm7
Source line: 6
    vmulsd  80(%rdx), %xmm6, %xmm4
Source line: 8
    vaddpd  %xmm1, %xmm9, %xmm1
    vaddpd  %xmm2, %xmm10, %xmm2
    vaddpd  %xmm3, %xmm11, %xmm3
    vaddpd  %xmm7, %xmm8, %xmm7
Source line: 6
    vaddsd  %xmm4, %xmm5, %xmm4
    vmulsd  %xmm6, %xmm0, %xmm0
.

@KristofferC (Collaborator, Author)

Using SIMD.jl with the new #15244 I get this for a multiplication of two gradient numbers:

julia> @code_native a*b
.
    vbroadcastsd    %xmm1, %ymm2
    vmulpd  32(%rsi), %ymm2, %ymm2
    vbroadcastsd    %xmm0, %ymm3
    vmulpd  32(%rdx), %ymm3, %ymm3
    vaddpd  %ymm2, %ymm3, %ymm2
Source line: 65
    vmulsd  %xmm1, %xmm0, %xmm0
.

which uses lovely AVX 256-bit instructions. The example in the OP also runs faster.

@jrevels (Member) commented Apr 8, 2016

This is so cool. I suppose now the race is on for me to wrap up #102 before JuliaLang/julia#15244 lands.

@KristofferC (Collaborator, Author)

There are of course a few caveats: SIMD vectors are limited to length 16, and SIMD.jl has no support for types other than C types. I think the best thing would be to let the user choose Vec or Tuple as the backend storage for Partials when creating the gradient function, defaulting to Tuple of course.

It shouldn't be that much code since only simple operations are needed on Partials.

@KristofferC (Collaborator, Author)

I guess that means no nested differentiation either...

@mlubin (Contributor) commented Apr 8, 2016

SIMD + nested differentiation is not a use case that I care too much about

@mlubin (Contributor) commented Apr 8, 2016

So how much faster will a multiplication of two 16-length gradient numbers be? 4x faster?

@KristofferC (Collaborator, Author)

Theoretically yes, but I'm segfaulting now when I run @benchmark, so I can't actually test it properly.

@dbeach24 (Contributor)

Couldn't SIMDVector still be used for the "innermost" partials of a nested differentiation?

@KristofferC (Collaborator, Author)

The best would be if whatever is used to hold the partials would just degenerate to a normal tuple when the element type is not something SIMD can work on, as in https://github.com/KristofferC/SIMDVectors.jl#user-defined-number-types. SIMD.jl is better than that package, though, so adding such support there would be nice.

@dbeach24 (Contributor) commented Oct 3, 2016

@jrevels I was just reading about and experimenting with Julia 0.5's SIMD support. Are there any technical hurdles to implementing SIMD optimizations for the Partials type at this time?

As I think you suggested earlier, it seems like Partials{N,T} could be defined as:

immutable Partials{N,T}
    values::NTuple{N,VecElement{T}}
end

With the corresponding tupexpr functions specialized to emit LLVM SIMD opcodes for the cases where T is a bitstype.

  • Would there be any undue overhead for cases where T is not a bitstype?
  • Do you expect compile-times to slow due to the increased type nesting?

@KristofferC (Collaborator, Author)

Note (quoting the Julia docs for VecElement):

However, since the automatic vectorization cannot be relied upon, future use will mostly be via libraries that use llvmcall.

This is what SIMD.jl is doing.

@dbeach24 (Contributor) commented Oct 3, 2016

@KristofferC Understood. Also looked at SIMD.jl.

I thought @jrevels expressed the opinion that he would prefer to insert the llvmcall code directly into ForwardDiff.jl, rather than rely on an external dependency. (At least that's how I interpreted his above comment on Feb 3.)

Is SIMD.jl mature enough to be used as a dependency at this time?

@KristofferC (Collaborator, Author)

Since we do not require many fancy operations on the partials, it should be possible to just roll our own. When I experimented with this I got random crashes when trying to benchmark, but perhaps the situation has improved.
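The partials mostly need elementwise add and a scalar broadcast multiply, so a hand-rolled version could be quite small. A sketch of the broadcast multiply (placeholder names, Float64 partials of width 4 assumed; not actual ForwardDiff code):

const Vec4 = NTuple{4,VecElement{Float64}}

# Broadcast-multiply a scalar into a 4-wide partials tuple: splat, then one fmul.
@inline vscale(a::Float64, v::Vec4) = Base.llvmcall("""
    %ins = insertelement <4 x double> undef, double %0, i32 0
    %splat = shufflevector <4 x double> %ins, <4 x double> undef, <4 x i32> zeroinitializer
    %res = fmul <4 x double> %splat, %1
    ret <4 x double> %res
    """, Vec4, Tuple{Float64,Vec4}, a, v)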

@dbeach24 (Contributor) commented Oct 3, 2016

Attempting to answer my first question about this approach: what is the cost of defining the partials as NTuple{N, VecElement{T}}?

I created a branch of ForwardDiff.jl where the Partials are defined in this way (for Julia 0.5 only). No SIMD optimizations have been added (yet) -- only the storage of the partials is changed. For 1st derivatives, the performance seems to be unchanged (expected). However, I'm seeing a 2-4x slowdown when using nested Duals, and this is certainly problematic. This problem will need to be solved before it's even worth thinking about introducing the SIMD operations.

It would be really great to change the definition of the Partials{N,T} type based on the value of the type parameters (i.e. if T is a bitstype, wrap it in VecElement; otherwise, just use a normal NTuple{N,T}).

Is there any good way to do this?

Otherwise, we may need to have multiple types of Dual, or (alternatively) allow a Dual's value and partials types to be different.

Thoughts?
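One direction I can think of (just a sketch, untested, Julia 0.5 syntax, illustrative names): carry the concrete storage as an extra type parameter and pick it from T with a helper at construction time.

# D is the concrete storage for the partials, chosen per element type.
immutable Partials{N,T,D}
    values::D
end

# Use VecElement-wrapped storage only for bitstypes; fall back to a plain tuple.
partials_storage{N,T}(::Type{Val{N}}, ::Type{T}) =
    isbits(T) ? NTuple{N,VecElement{T}} : NTuple{N,T}

This mirrors the old GradientNumber{N,T,C} layout, though by itself it would not address the nested-Dual slowdown described above.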

@KristofferC (Collaborator, Author)

What is the problem with using VecElement if T is not a bitstype? It should degenerate to exactly the same LLVM representation as a tuple of T for non-SIMD-supported T.

@dbeach24 (Contributor) commented Oct 4, 2016

The problem with nesting Dual within VecElement seems to be performance. (The code still appears to run correctly.) It's possible that I've done something wrong in my code, but I'm only seeing the performance issue when computing 2nd (or higher) derivatives.

@KristofferC (Collaborator, Author)

How does the generated code look for something really simple, like adding two nested dual numbers, with VecElement / Tuple?

@dbeach24 (Contributor) commented Oct 5, 2016

@KristofferC Please see the attached files, which show the generated code and timings for sizes of {1,3,6,9} x {1,3,6,9}. In the 1x1, 3x1, 6x1, and 9x1 cases, there is virtually no difference in the generated code. However, as the number of partials in the nested dual is increased, the generated code becomes considerably lengthier (with the 9x9 case being quite lengthy indeed).

This seems to imply that VecElement has a cost, and that, sadly, that cost appears in exactly the place we were hoping to create the performance boost: when the innermost Dual has high dimension. My theory is that the trouble comes from nesting the inner-Duals inside a VecElement container. If we can somehow peel away this wrapping for the outer-Duals, but leave it intact for the innermost Duals (i.e. where we need it), then perhaps we can still achieve the desired optimization.

dualtest.jl.txt
output_diff_sidebyside.txt
output_vecel_disabed.txt
output_vecel_enabled.txt

UPDATE: It appears that the generated code for (m by n; n > 1) duals is suffering from unnecessary allocations and memory copy calls.

@KristofferC (Collaborator, Author)

Hopefully that works. However, I don't think there is any inherent reason why nested Vecs shouldn't work just as well, so if you already have some code around that shows the problem, perhaps a Base issue would be good?

@dbeach24 (Contributor) commented Oct 9, 2016

Update: My above approach won't work. Thanks @KristofferC for encouraging me to further distill the problem and file an issue.

@dbeach24 (Contributor)

Feel free to read JuliaLang/julia#18857. It's possible that introducing the llvmcall optimizations will be sufficient to improve the performance, but it's starting to feel like a bit of a reach...

@vchuravy

One of the problems with using Vec without lowering to the correct SIMD instructions is that LLVM has to effectively unpack and repack it at every step. More or less, you tell LLVM that it has to store this data in a particular way and place (the SIMD registers), and then you move it out of and pack it back into these registers when doing work on it (and that is quite inefficient).

I have been working on supporting vector intrinsics in base Julia (JuliaLang/julia#18470), but I haven't made much progress on that recently. (The problem is what to do in the case where LLVM is not available.) The goal of that work was to prevent people from having to roll their own llvmcall implementations. (LLVM IR is not stable.)

In the meantime I would recommend using SIMD.jl; it is a very nice package and it has already solved several of the hard problems.

@jrevels (Member) commented Oct 11, 2016

@dbeach24 Just caught up with this issue after returning from vacation, thanks for looking into it. I'm a SIMD noob, and haven't really been following recent developments in Julia/the ecosystem (thanks for your insight, @vchuravy).

Upon revisiting this, I've just found that LLVM's SLP auto-vectorizer now seems to work in some cases for Julia v0.5. Lo and behold:

➜  data julia -O3 -q
julia> using ForwardDiff: Dual

julia> a = Dual(1., (2., 3.))
Dual(1.0,2.0,3.0)

julia> b = Dual(4., (5., 6.))
Dual(4.0,5.0,6.0)

julia> @code_llvm a + b

define void @"julia_+_70898"(%Dual* noalias sret, %Dual*, %Dual*) #0 {
top:
  %3 = getelementptr inbounds %Dual, %Dual* %1, i64 0, i32 1, i32 0, i64 1
  %4 = load double, double* %3, align 8
  %5 = getelementptr inbounds %Dual, %Dual* %2, i64 0, i32 1, i32 0, i64 1
  %6 = load double, double* %5, align 8
  %7 = fadd double %4, %6
  %8 = bitcast %Dual* %1 to <2 x double>*
  %9 = load <2 x double>, <2 x double>* %8, align 8
  %10 = bitcast %Dual* %2 to <2 x double>*
  %11 = load <2 x double>, <2 x double>* %10, align 8
  %12 = fadd <2 x double> %9, %11
  %13 = bitcast %Dual* %0 to <2 x double>*
  store <2 x double> %12, <2 x double>* %13, align 8
  %14 = getelementptr inbounds %Dual, %Dual* %0, i64 0, i32 1, i32 0, i64 1
  store double %7, double* %14, align 8
  ret void
}

julia> @code_native a + b
    .section    __TEXT,__text,regular,pure_instructions
Filename: dual.jl
    pushq   %rbp
    movq    %rsp, %rbp
Source line: 114
    vmovsd  16(%rsi), %xmm0         ## xmm0 = mem[0],zero
    vaddsd  16(%rdx), %xmm0, %xmm0
Source line: 182
    vmovupd (%rsi), %xmm1
    vaddpd  (%rdx), %xmm1, %xmm1
    vmovupd %xmm1, (%rdi)
    vmovsd  %xmm0, 16(%rdi)
    movq    %rdi, %rax
    popq    %rbp
    retq
    nopw    %cs:(%rax,%rax)

If I don't use -O3, then the LLVM IR doesn't show any vector instructions, but the generated assembly is basically the same (this benefit is probably target-specific). EDIT: I got a vaddsd mixed up with a vaddpd; you do, in fact, need -O3 to get it to vectorize.

Note that the example in this issue's initial description also now auto-vectorizes for me (in the LLVM IR + native assembly with -O3, and only in the native assembly without -O3).

@dbeach24 (Contributor)

@jrevels Very interesting, indeed. If Julia's auto-vectorizer is generally smart enough to vectorize the innermost partials of a nested Dual type, then perhaps there is no point in a manual optimization effort?

@KristofferC (Collaborator, Author)

Also noted at the end of https://github.com/JuliaArrays/StaticArrays.jl/blob/master/README.md#speed.

However, I don't think the auto-vectorizer ever generates the more advanced SIMD stuff like 4x Float64.

@jrevels (Member) commented Oct 11, 2016

However, I don't think the auto-vectorizer ever generates the more advanced SIMD stuff like 4x Float64.

It's working for me (also works for the simd_sum example):

julia> a = Dual(1., (2., 3., 4., 5.))
Dual(1.0,2.0,3.0,4.0,5.0)

julia> b = copy(a)
Dual(1.0,2.0,3.0,4.0,5.0)

julia> @code_llvm a + b

define void @"julia_+_70874"(%Dual.66* noalias sret, %Dual.66*, %Dual.66*) #0 {
top:
  %3 = getelementptr inbounds %Dual.66, %Dual.66* %1, i64 0, i32 1, i32 0, i64 3
  %4 = load double, double* %3, align 8
  %5 = getelementptr inbounds %Dual.66, %Dual.66* %2, i64 0, i32 1, i32 0, i64 3
  %6 = load double, double* %5, align 8
  %7 = fadd double %4, %6
  %8 = bitcast %Dual.66* %1 to <4 x double>*
  %9 = load <4 x double>, <4 x double>* %8, align 8
  %10 = bitcast %Dual.66* %2 to <4 x double>*
  %11 = load <4 x double>, <4 x double>* %10, align 8
  %12 = fadd <4 x double> %9, %11
  %13 = bitcast %Dual.66* %0 to <4 x double>*
  store <4 x double> %12, <4 x double>* %13, align 8
  %14 = getelementptr inbounds %Dual.66, %Dual.66* %0, i64 0, i32 1, i32 0, i64 3
  store double %7, double* %14, align 8
  ret void
}

It can even vectorize operations on the values + partials simultaneously, if the lengths work out correctly:

julia> a = Dual(1., 2., 3., 4.)
Dual(1.0,2.0,3.0,4.0)

julia> b = copy(a)
Dual(1.0,2.0,3.0,4.0)

julia> @code_llvm a + b

define void @"julia_+_71231"(%Dual.74* noalias sret, %Dual.74*, %Dual.74*) #0 {
top:
  %3 = bitcast %Dual.74* %1 to <4 x double>*
  %4 = load <4 x double>, <4 x double>* %3, align 8
  %5 = bitcast %Dual.74* %2 to <4 x double>*
  %6 = load <4 x double>, <4 x double>* %5, align 8
  %7 = fadd <4 x double> %4, %6
  %8 = bitcast %Dual.74* %0 to <4 x double>*
  store <4 x double> %7, <4 x double>* %8, align 8
  ret void
}

If Julia's auto-vectorizer is generally smart enough to vectorize the innermost partials of a nested Dual type, then perhaps there is no point in a manual optimization effort?

Possibly. I guess it could depend on the context-sensitivity of the auto-vectorizer (it's LLVM's SLP vectorizer, BTW, not Julia's) - I don't know if it works in every case. So far, though, it seems to work where I'd expect it to work. My original goal was to vectorize arithmetic operations on Partials, which theoretically are perfect candidates for SLP vectorization, but it might be possible to leverage SIMD in more advanced ways depending on how you're using Dual numbers.

Anyway, do we consider this issue closed since it now works with -O3?

@jrevels (Member) commented Oct 11, 2016

vectorize the innermost partials of a nested Dual type

Just had to make sure this works, and it does (soooo cool that this works):

julia> a = Dual(Dual(1., 2.), Dual(3., 4.))
Dual(Dual(1.0,2.0),Dual(3.0,4.0))

julia> b = copy(a)
Dual(Dual(1.0,2.0),Dual(3.0,4.0))

julia> @code_llvm a + b

define void @"julia_+_70918"(%Dual.66* noalias sret, %Dual.66*, %Dual.66*) #0 {
top:
  %3 = bitcast %Dual.66* %1 to <4 x double>*
  %4 = load <4 x double>, <4 x double>* %3, align 8
  %5 = bitcast %Dual.66* %2 to <4 x double>*
  %6 = load <4 x double>, <4 x double>* %5, align 8
  %7 = fadd <4 x double> %4, %6
  %8 = bitcast %Dual.66* %0 to <4 x double>*
  store <4 x double> %7, <4 x double>* %8, align 8
  ret void
}

@vchuravy

This line %3 = bitcast %Dual.66* %1 to <4 x double>* is really cool to see. The only benefit of using SIMD.jl is now that we could have these optimisations without -O3 (I have to check which passes we are running with -O3).

I don't think we currently can define a convert method between <4 x double> and a struct with llvmcall. And reinterpret only works for bitstypes.

@dbeach24 (Contributor) commented Oct 12, 2016

I tried various operations on nested duals, and it appears that the SIMD instructions appear in more complex cases as well -- as long as I specify -O3.

I did not see <4 x double>, but merely <2 x double> instructions for @jrevels's example above. Does the compiled LLVM depend in any way on my native architecture (core i7)?

UPDATE: Apparently <4 x double> works just fine if I compile my own Julia instead of using the prebuilt 0.5.0 release...

@vchuravy

There are many factors that are at play here:

  • Are you compiling from source?
  • Are you using your system LLVM? (we patch LLVM to improve vectorization)
  • Hardware support in the form of AVX and AVX2 is necessary, but even then the LLVM cost model might decide otherwise.
  • What is the precise make of your CPU? cat /proc/cpuinfo on Linux will give you this information (take a look at flags).

It is a shame that we can't do:

immutable Double
    x::Float64
    y::Float64
end

reinterpret(NTuple{2,VecElement{Float64}}, Double(1.0, 2.0))

but this works:

reinterpret(NTuple{2,VecElement{Float64}}, [Double(1.0, 2.0)])

see JuliaLang/julia#10140

@KristofferC (Collaborator, Author)

Now we just need to lobby to enable the vector passes with default -O level :)

@KristofferC (Collaborator, Author)

Would be interesting to run the benchmarks with and without -O3 to see if it gives any cool speedups.

@KristofferC (Collaborator, Author)

-O3 on left:

[benchmark screenshot]

Hessian and gradient seem faster?

@jrevels (Member) commented Oct 12, 2016

Yeah, I saw similar perf differences on my machine. Wonder what's causing the slowdown for some of the benchmarks? It could be from some other pass enabled via -O3 besides the SLP vectorizer (though the same case can be made for the speed-ups). Is the alias analysis pass the only other -O3-specific pass?

@KristofferC (Collaborator, Author)

Perhaps the way forward is just to document this and close the issue? I'm pretty mind-blown by how good the code generated with -O3 is... When did this start happening (or have we just not tried it before)? A speedup of almost a factor of 3 on Rosenbrock is pretty great, no?

@jrevels (Member) commented Oct 13, 2016

Perhaps the way forward is just to document this and close the issue?

I agree with this.

When did this start happening (or have we just not tried it before)?

I played around with trying to get the SLP vectorizer to actually trigger/emit vectorized instructions for ForwardDiff's old number types back in the days of Julia v0.3/early v0.4, but never succeeded. When @dbeach24 started reviving discussion here, I remembered that I should try again with the new Dual type and Julia v0.5.

Speedup of Rosenbrock with almost a factor of 3 is pretty great, no?

Yeah, I was really excited to see that 😃

@mlubin (Contributor) commented Oct 13, 2016

Is there any way we can add a test for regressions here?

@jrevels (Member) commented Oct 13, 2016

We could have a bunch of tests like:

@test contains(sprint(io ->  Base.code_llvm(io, +, (typeof(a), typeof(b)))), "fadd <4 x double>")

Would this be too fragile?

@dbeach24 (Contributor)

It would fail if you forgot to run with -O3. It would also fail if you're using the prebuilt Julia-0.5.0 for Mac, since (as I recently discovered) that doesn't include support for <4 x double> operations.

@jrevels (Member) commented Oct 13, 2016

Well yeah, we'd have to put -O3 in the Travis script to catch SIMD regressions. The LLVM IR generated should be the same either way, no (i.e. it's just that the native code won't include SIMD)? Do we emit different IR based on the architecture? I thought it was LLVM's job to abstract over the architecture.

@jrevels (Member) commented Oct 13, 2016

For local testing, we might be able to get the optimization level from within Julia, then only run the SIMD tests if it's 3.
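Something along these lines might work (a sketch, assuming Base.JLOptions().opt_level reports the -O level; the Dual values and the expected string are just illustrative):

using Base.Test
using ForwardDiff: Dual

# Only run the SIMD codegen checks when Julia was started with -O3.
if Base.JLOptions().opt_level >= 3
    a = Dual(1., (2., 3., 4., 5.))
    b = Dual(4., (5., 6., 7., 8.))
    @test contains(sprint(io -> Base.code_llvm(io, +, (typeof(a), typeof(b)))),
                   "fadd <4 x double>")
end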

@KristofferC (Collaborator, Author) commented Oct 13, 2016

The IR is also different depending on whether the sysimg is locally compiled or not.
