Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Portable packed SIMD vector types #2366

Closed
wants to merge 17 commits into from
Closed

Portable packed SIMD vector types #2366

wants to merge 17 commits into from

Conversation

gnzlbg
Copy link
Contributor

@gnzlbg gnzlbg commented Mar 22, 2018

This RFC extends Rust with portable packed SIMD vector types.

Rendered.


Acknowledgments

It would not have happened without the invaluable contributions of @alexcrichton and @BurntSushi , the excellent feedback given by the many reviewers of the initial drafts: @sunfish, @nagisa, @eddyb, and @hsivonen, amongst others, all the people using stdsimd, filling issues, and submitting PRs, and @huonw's work on the rust-lang-nursery/simd crate.

@gnzlbg gnzlbg force-pushed the ppv branch 4 times, most recently from b28a35a to 47ba7e7 Compare March 22, 2018 14:44

* `{Add,Sub,Mul,Div,Rem}<RHS=Self,Output=Self>`,
`{Add,Sub,Mul,Div,Rem}Assign<RHS=Self>`: vertical (lane-wise) arithmetic
operations.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's probably worth describing how behavior of these operators is different (or not different) from the same operators for non-vector arithmetic types - e.g. wrapping vs panicking on overflow.
(Same applies to bit-shift traits.)

Copy link
Contributor Author

@gnzlbg gnzlbg Mar 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I'm on this. @sunfish also requested to know the behavior of Div/Rem when dividing by zero, so I'll add that as well, and also to mention that Div and Rem are provided as a convenience, but that most (or no) hardware actually offers intrinsics for these.

Copy link
Contributor Author

@gnzlbg gnzlbg Mar 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From looking at rustc's implementation this is the behavior that they seem to have (I need to add stdsimd test to exercise these everywhere though), but this is what LLVM says that it guarantees:

  • on integer vectors overflowing results in the correct mathematical result modulo 2^n (see exceptions below)

  • on floating point vectors overflowing and division by zero has the same behavior as that of f32 or f64 (+-INFINITY for overflow, and some NaN for division by zero IIRC).

  • on integer vectors we directly map division and rem to LLVM's {s,u}{div,rem} which state that if any element is divided by zero or the operation overflows these operations result in undefined behavior.

So this is what we are currently doing. Now the question: can we do better?

  • For Add, Sub, Mul, Shl, Shr: we could use nuw/nsw to generate a poison value and get the same behavior than for the scalar integers (that is, a panic). We could then expose wrapping arithmetic via wrapping_{add,sub,mul,shl,shr}. This might be a bit of a performance foot-gun, so the alternative could be to just expose the wrapping methods instead.

  • For Div and Rem: we could expose these as unsafe fn {div, rem}(self, o: Self) -> Self methods with a precondition on division by zero, and provide an implementation in the traits that checks that no element is zero before performing the operation (panicking on precondition violation). This is "expensive": one vector comparison + one horizontal reduction, and whether it is better than not providing these operations at all is debatable. Another alternative might be to not expose these operations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@petrochenkov I've gone ahead and updated the RFC which what I consider the "sane" arithmetic and shift operator semantics:

  • the arithmetic and shift traits behave like the operators for the scalars. That is, for integers they panic on overflow and division by zero, and for floats they produce the same results as the scalar.

  • the integer vectors also implement a set of wrapping_{add,sub,...} methods that perform modulo 2^n arithmetic, just like the scalar integers do

  • the integer vectors also implement two unsafe wrapping_{div,rem}_unchecked methods that perform no checks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's sane here is a hard question.
On one hand, the only reason for SIMD's existence is improving performance and uNxM + uNxM should ideally turn into a single hardware SIMD addition instruction, but checked operations can't be lowered into a single instruction.
On the other hand, consistency with scalar arithmetic, error detection and everything...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that any SIMD proposal that does not mesh with the rest of the language is not going to be "good enough". So what I meant with "sane" was the same defaults than the rest of the std library, while at the same time offering easy-to-reach more efficient operations with slightly different semantics. While the std library puts these typically in core::intrinsics, the portable SIMD vector types have these as inherent methods, making them more "first class".

IMO the SIMD std facilities have two jobs: protecting users from easy mistakes, and allowing them to write efficient code. These two jobs are at constant tension.

I don't think that unsafe operations should become the default, but this does not mean that they should be hard to reach for either.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the other hand, consistency with scalar arithmetic, error detection and everything...

Being consistent with scalar arithmetic would have overflow checking for + (etc.) only in debug mode, so release mode would still lower it to a single instruction.

I think being consistent with scalar math on this point would be OK, but making + check for overflow in release mode would probably not be OK in terms of performance expectations.

If overflow is checked in debug mode, wrapping_foo variants are needed. E.g. the Rust SIMD code in Firefox for checking if a UTF-16 string contains right-to-left characters currently relies on wrapping underflow for - on u16x8.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hsivonen I've added that overflow checking only happens in debug mode.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW "release/debug" is not the right distinction for overflow checks. They're a separate switch (-C overflow-checks=[on|off]) that can be flipped independently of debug info, optimization level, etc. -- it's just by default tied to debug_assertions which in turn is by default tied to release/debug.

returns `true` if all lanes compare `true`. It is equivalent to
`a.eq(b).all()`.
* `PartialOrd<Self>`: compares two vectors lexicographically.
* `From/Into` lossless casts between vectors with the same number of lanes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about From conversions [{et}{lw}; {nl}] -> {et}{wl}x{nl} (arrays) and {et}{wl}^{nl} -> {et}{wl}x{nl} (tuples) ?

Copy link
Contributor Author

@gnzlbg gnzlbg Mar 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't given these much thought until now, so I don't know if adding these is worth it. In any case, this is how they could be implemented:

  • For arrays, one could implement these on top of the {load,store}_unaligned APIs without any issues.

  • For tuples, it depends. IIRC the layout of tuple elements is unspecified to allow for things like reordering tuple elements to reduce their size. There was, however, a variadic RFC built on tuples that required the layout of the tuples to be fully specified to allow for things like slicing tuples. Iff the layout of tuples with multiple elements of the same type would be specified as being the same layout as that of an array/slice, then we could implement these for tuples with unaligned memory load/stores as well. Otherwise, we'll probably need to copy the tuple elements into an array first, and then perform the memory loads (and vice-versa for stores).

Having said this, we don't need to add these now either. Those interested in trying these out could add them to stdsimd behind an unstable feature gate. If they turn out to be useful a mini-FCP might be enough to stabilize these later on.

Copy link
Contributor

@Centril Centril Mar 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough; On the question of "is it worth it" I'd like to note my opinion that it would help with ergonomics so I'd be 👍 for those impls.

Could you perhaps note the possibility of adding these somewhere in the RFC (like the Unresolved questions) for future reference?

@Centril Centril added the T-libs-api Relevant to the library API team, which will review and decide on the RFC. label Mar 22, 2018
text/0000-ppv.md Outdated

##### Semantics for floating-point numbers

* `eq`: yields `true` if both operands are not a `QNAN` and `self` is equal to

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why quiet NaN? Surely any NaN should result in false?

And what does "self is equal to other" mean anyway? Is it a bitwise comparison? Then that doesn't match scalar float comparisons wrt positive/negative zero.


I don't think this section (and others like it) is needed or useful at all. It's enough to simply state that each lane is compared as in the scalar case. I assume that's what you're going for anyway (if not, that's a big problem IMO)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Signaling NaN may result in the CPU trapping to the OS without giving the program a choice. LLVM recently tightened up their assumptions regarding the default floating-point environment, so only qNaN should be generated anyway.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eternaleye That's multiple kinds of false

  • signalling NaNs do not trap. Operations on them raise a floating point exception, which is nothing like language exceptions or traps. The specific exception sNaN operations raise, (invalid operation) is also raised by sqrt(-1) and myriad other operations and is handled by returning a qNaN.
  • some LLVM people have been confused about this too but they're now finally tidying up their stuff to explicitly not consider signalling NaNs UB-producing
  • the LangRef change you link does not say sNaN is not produced, it simply spells out some consequences of assuming the program does not change the default rounding mode, exception handling, etc. -- one consequences of that is that qNaN and sNaN are indistinguishable so LLVM doesn't need to preserve signalling-ness, but you can still get an sNaN bit pattern in many way.
  • finally, in any case this issue affects vectors and scalars equally so even if what you wrote was true it still would not explain the divergence here

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rkruppe Ah, sorry - thanks for clarifying that for me!

Copy link
Contributor Author

@gnzlbg gnzlbg Mar 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly, I just documented here what we are already doing, and QNAN is what LLVM guarantees. Why? I don't know. cc'ing here @sunfish, @chandlerc - they might know why LLVM guarantee this. While what you mention makes sense, I don't have a global view of what all architectures that LLVM supports actually do.

Copy link
Contributor Author

@gnzlbg gnzlbg Mar 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this section (and others like it) is needed or useful at all. It's enough to simply state that each lane is compared as in the scalar case.

I need to check whether the way we are comparing floating-point vectors is actually the same way in which we are handling the scalar case. If not, then arguably we should, but maybe we can't for some reason. From an implementation point-of-view, the underlying LLVM intrinsics work on both scalars and vectors, so we ought to be doing the exact same thing for vectors as for scalars, but as mentioned, I haven't checked.

In any case, defining these to have the same semantics as the operations for scalar vectors makes sense.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As stated earlier, quite a few LLVM developers (and consequently the LangRef) was really confused about sNaN for a long time. The reference to QNAN specifically in the LangRef probably comes from that confusion. Certainly there is no remotely justifiable reason to treat sNaN differently here, and IEEE 754 does not do so.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rkruppe

I don't think this section (and others like it)

Could you ping me in the other sections where wording like this is unnecessary?

text/0000-ppv.md Outdated
than two times the number of lanes in the input vector. The length of the
resulting vector equals the number of indices provided.

Given a vector with `N` lanes, the indices in range `[0, N)` refer to the `N` elements in the vector. In the two-vector version, the indices in range `[N, 2*N)` refer to elements in the second vector.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens on out-of-range indices? Compile time error?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, out-of-range indices are a compile-time error. Right now they are a monomorphization time error but before stabilization we should turn these into MIR typeck errors.

text/0000-ppv.md Outdated
this in one form or another, but if one future backend does not, this RFC can be
implemented on top of the architecture specific types.

## Zero-overhead requirement for backends

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The compiler engineer in me is a bit worried by the tone of this section. Obviously we want to make nice source code execute as efficiently as possible, but this is also true of the rest of the language and, in the end, very much best-effort. It is simply not possible to guarantee optimality in any useful sense (cf. the "full employment theorems" for compiler writers 😄).

Copy link
Contributor Author

@gnzlbg gnzlbg Mar 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shhh. We all want to be employed full time on Rust. Don't spoil this!


Kidding aside. Yes. This is all best effort. The good thing is that if there are compiler backend issues that must be resolved, in most cases we can resolve them in Rust - either in the compiler or the library. Sometimes this might not be enough, but at least for stdsimd our way has always been to diagnose issues first, fill backend bug reports afterwards, and in the mean time, add library or rustc workarounds. As the backends and rustc have been fixed (often by ourselves) we have consistently been able to remove library workarounds.

While I actually wanted to write that the best thing we can do here is only to strive for optimality, even though we can never achieve it, I actually think that's not true. For many trivial SIMD operations we can often reliably generate optimal code. For non trivial operations we just have to keep trying. We will obviously never be perfect on all cases, but being good enough on most cases suffices.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rkruppe I've reworded this section

text/0000-ppv.md Outdated
called scalable Vectors or scalable vectors. These include, amongst others, NecSX,
ARM SVE, RISC-V Vectors. These architectures have traditionally relied on
auto-vectorization combined with support for explicit vectorization annotations,
but newer architectures like ARM SVE and RISC-V introduce explicit vectorization

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SVE, sure, but RISC-V? Leaving aside that neither the vector extension nor compiler support for it is anywhere near finished, the architects have gone on record stating that they vastly prefer auto-vectorization for targeting their architecture.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I actually expect that intrinsics will wind up existing, but I'm still confused that this section implying they already exist. Although, for that matter, there are some aspects of RISC-V that might make assembly more attractive than intrinsics, but whatever.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the architects have gone on record stating that they vastly prefer auto-vectorization for targeting their architecture.

This is true, but it is also worth remarking that ARM SVE architects prefer auto-vectorization as well, 2) that they ended up providing explicit intrinsics that map to the ISA instructions anyways, and 3) that at least compared to "packed" SIMD vector models the RISC-V vector ISA is not that far away from the ARM SVE one.

In any case, time will tell. This paragraph does not suggest that the current programming model applies 1:1 to "scalable" vectors in any way, but rather that scalable vectors would require a slightly different programming model, and that this model does not really interfere that much with non-scalable vectors. Nothing more, nothing less.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again my main question/confusion here is the implication someone already went and defined intrinsics for the RISC-V vector extension.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again my main question/confusion here is the implication someone already went and defined intrinsics for the RISC-V vector extension.

I'll reword this section then. Technically, RISC-V does not even have SIMD support yet - the second revision of the vector extension ISA is only a proposal that is still being iterated on in the workshops.

text/0000-ppv.md Outdated
The `shuffle!` macro returns a new vector that contains a shuffle of the elements in
one or two input vectors. That is, there are two versions:

* `shuffle!(vec, [indices...])`: one-vector version

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What form do the indices take? Can they be arbitrary constant expressions, or do they have to be literals? Obviously we want the former but this should be mentioned, since a natural way to implement a macro would be to match on literals.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can they be arbitrary constant expressions, or do they have to be literals?

I'll mention that arbitrary constant expressions are allowed.

@scottmcm
Copy link
Member

Minor request: Please define the m types in the guide explanation. I think I know what they are, and [iuf] are obvious, but since there's no m16 type in rust, it would be good to describe what it is and why it exists and why there isn't just boolx4.

What are the layout guarantees of these types? Are they repr(C)? Could they offer .as_array(&self) -> &[T;N] (or even Deref to arrays)?

@gnzlbg
Copy link
Contributor Author

gnzlbg commented Mar 23, 2018

@rkruppe and @hsivonen I think this commit incorporates all of your feedback, but it would be great if you could review the changes in case I missed or misinterpreted something.

@scottmcm I've added more documentation about vector masks to the guide-level explanation in this commit. Let me know what you think.

@gnzlbg
Copy link
Contributor Author

gnzlbg commented Mar 23, 2018

@scottmcm

What are the layout guarantees of these types? Are they repr(C)? Could they offer .as_array(&self) -> &[T;N] (or even Deref to arrays)?

Those are really good questions. The short answer is that, as the ABI section mentions, their layout is unspecified.

The ABI section does however explicitly forbids their use on extern functions, which means that right now they are definitely not repr(C).

Also, neither this RFC nor AFAICT the std::arch RFC defines what the behavior of this is:

union A {
    data: [f32; 4],
    vec: f32x4,
}
let x: [f32; 4] = unsafe { A { vec: f32x4::splat(0.) }.data };

so for all practical purposes this is also currently unspecified behavior (triple-check: @alexcrichton is this statement correct or is this behavior specified somewhere?).

The only thing this RFC (and IIRC the std::arch RFC) specifies is the behavior of mem::transmute from one vector type to another vector type. For everything else one needs to use the memory load/store API ({load,store}_{aligned,unaligned}{,_unchecked}).

A future RFC specifying the layout and ABI of these types will certainly have to properly answer all these questions. Right now, there are a just a couple of things that could be done differently ABI-wise and that might produce a better ABI than what we currently have. So the purpose of leaving the ABI unspecified in these initial RFCs is to not limit those efforts in any way.

text/0000-ppv.md Outdated
# Unresolved questions
[unresolved]: #unresolved-questions

### Interaction with scalable vectors

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit but I think "scalable" is very much an Arm-ism here. The more neutral and common term is probably "Cray-style".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I agree. In the introduction I mention:

packed: means that these vectors have a compile-time fixed size. It is the opposite of scalable or "Cray vectors", which are SIMD vector types with a dynamic size, that is, whose size is only known at run-time.

and I consistently used the term "scalable" throughout the RFC because I thought "Cray vectors" would be a less approachable term. I don't like the term "scalable" much either TBH.

text/0000-ppv.md Outdated
The floating-point semantics follow the semantics of `min` and `max` for the
scalar `f32` and `f64` types. That is:

If either operand is a `NaN`, returns the other non-NaN operand. Returns `NaN`

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gnzlbg This section also unnecessarily describes behavior that matches the scalar case.

@alexcrichton
Copy link
Member

@gnzlbg

so for all practical purposes this is also currently unspecified behavior (triple-check: @alexcrichton is this statement correct or is this behavior specified somewhere?).

I'd personally expect such union-using behavior to work and operate as intended, I think like the non-portable SIMD types we'll want to at least define how large these are for unsafe operations. (aka is it safe to transmute f32x4 to __m128?

@gnzlbg
Copy link
Contributor Author

gnzlbg commented Mar 23, 2018

I'd personally expect such union-using behavior to work and operate as intended, I think like the non-portable SIMD types we'll want to at least define how large these are for unsafe operations. (aka is it safe to transmute f32x4 to __m128?

I think there are a couple of issues.

The first one is whether one can transmute a vector type into an equally-sized non-vector type like an array or a tuple. The std::arch RFC specified that the vector types are Sized and their size, and this RFC specifies this as well. So to me this answer is a clear yes.

The other issue is whether these types are layout compatible with arrays and tuples. That is, whether we guarantee the following assert to pass everywhere:

union A { arr: [f32; 4],  vec: f32x4, tup: (f32,f32,f32,f32) }
let x: [f32; 4] = unsafe { A { vec: f32x4::new(0., 1., 2., 3.) }.arr };
assert_eq!(x[2], 2.);  // OK?
let y: (f32,f32,f32,f32) = unsafe { A { vec: f32x4::new(0., 1., 2., 3.) }.tup };
assert_eq!(y.2, 2.);  // OK?

These two cases currently work on all platforms, but this is incidental. For example, the layout of tuples is unspecified, so we can't guarantee this for tuples AFAICT.

The layout of arrays is specified so we could do this for arrays. For that, we probably need to write down that a vector type has the same layout as an array or something like this.


Also, it is worth pointing out that transmute is a semantic memcpy. While transmuting between vector times is zero-overhead in some cases, transmute from vector to array and vice-versa always requires an aligned/unaligned memory load/store. That's ok, but worth keeping this in mind.

* `indices` must be a `const` array of type `[usize; N]` where `N` is any
power-of-two in range `(0, 2 * {vec,vec0,vec1}::lanes()]`.
* the values of `indices` must be in range `[0, N)` for the one-vector version,
and in range `[0, 2N)` for the two-vector version.
Copy link
Contributor

@fbstj fbstj Mar 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this precondition statically checked when possible? is it dynamically enforced when necessary? if not, what are the results of indexing outside the range?

EDIT: oh, answered here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The line below this one states:

On precondition violation a type error is produced.

The current implementation produces a monomorphization time error, but that can be fixed.

@alexcrichton
Copy link
Member

@gnzlbg ok cool! I wonder, would it be possible to use #[repr(transparent)] to make them newtype wrappers? If that's the case then we could get a rock-solid guarantee about the stable API

@gnzlbg
Copy link
Contributor Author

gnzlbg commented Aug 30, 2018

ok cool! I wonder, would it be possible to use #[repr(transparent)] to make them newtype wrappers? If that's the case then we could get a rock-solid guarantee about the stable API

The problem is that if we make them newtypes then f32x4(Simd<[f32; 4]>) is not Simd<[f32; 4]>. This can be problematic, for example, if you have Simd<[*mut f32; 4]> and load it from memory you get a Simd<[f32; 4]> which is a different type than f32x4 (which is a new type around Simd<T>). I can try to explore in a branch if the newtype approach can work. There might be other problems with the generic mask methods, shuffle macro (where the Simd<T> is differently accessed than NewType(Simd<T>)), etc. But maybe this all works out, one would just need to try it out.

@alexcrichton
Copy link
Member

@gnzlbg ah ok. I do suspect though that trying to work with Simd<[T; N]> will likely effectively block any stabilization until const generics is stabilized at least, and it itself will likely be pretty contentious.

@gnzlbg
Copy link
Contributor Author

gnzlbg commented Sep 3, 2018

I do suspect though that trying to work with Simd<[T; N]> will likely effectively block any stabilization until const generics is stabilized at least, and it itself will likely be pretty contentious

I understand that such a type definition could be contentious, but it is very common in the ecosystem (e.g. SmallVec<[T; N]>, ArrayVec<[T; N]>, GenericArray<[T; N]>. Also, I don't think const generics will improve the API nor the implementation of Simd<[T; N]> much... if at all... in the same way that one doesn't need const generics to use SmallVec.

@alexcrichton
Copy link
Member

@gnzlbg and I had some discussion on IRC (pasted below), but unfortunately not a ton of conclusions :(

11:10 < gnzlbg> acrichto: the only improvement I can think of that cost generics 
                would add to the ppv RFC is that instead of Simd<[T; N]> we will be 
                able to write Simd<T, N>
11:11 < gnzlbg> but that will need a lot beyond just woboats const generics proposal
11:11 <@acrichto> gnzlbg: right yeah, but it's also sort of unknown waters for simd 
                  and languages, I'm not sure if we'd be able to put it in libstd
11:11 <@acrichto> all I remember is like dozens of threads about this
11:11 <@acrichto> and would have to scour them again for arguments one way or another
11:11 < gnzlbg> my best proposal here is to expose Simd<[T; N]> in nightly only via 
                a ::__rustc_Simd<T>
11:12 < gnzlbg> the second best is to use new types to hide that
11:12 <@acrichto> I don't think the way we treat typedfes can hide the generic param
11:12 <@acrichto> in that it's just the same type to rustc
11:12 <@acrichto> and can affect things like inference and method resolution and such
11:12 < gnzlbg> yes, the error will say ::__rustc_Simd<...>
11:13 < gnzlbg> typedefs go away, so essentially we would be stabilizing a 
                __rustc_Simd<T> type, but making it illegal to use directly
11:13 <@acrichto> yeah but I'm not sure we could even do that
11:13 <@acrichto> like you could probably still use it somehow
11:13 <@acrichto> in a stable fashion
11:14 < gnzlbg> i mean, we can stabiliza Simd<T>
11:14 <@acrichto> in that our stability never was rock solid really, it's always 
                  best effort
11:14 < gnzlbg> its just a type
11:14 < gnzlbg> without methods
11:14 < gnzlbg> like Simd<T> it cannot do anything
11:14 <@acrichto> oh sure, but that's not what I'm worried about
11:14 <@acrichto> I'm worried about the interactions with the tyep arameter
11:14 <@acrichto> across blocks and such
11:14 < gnzlbg> the type is actually Simd<T: __rustc_Simd_array>
11:14 <@acrichto> I don't really have a great articulation
11:14 < gnzlbg> where the __rustc_Simd_array is a sealed trait
11:14 < gnzlbg> so its not exposed
11:14 <@acrichto> I just know that not having generics is super simple
11:14 <@acrichto> if verbose
11:14 < gnzlbg> ah yes
11:14 <@acrichto> and is much easier to stomach stabilizing
11:14 < gnzlbg> yeah 
11:15 < gnzlbg> I agree
11:15 < gnzlbg> i think that ideally we would only stabilize the 128-bit wide types
11:15 < gnzlbg> and see how that goes first
11:16 <@acrichto> we could try?
11:16 < gnzlbg> the 8-64 bit wide, and 256-512 bit wide types could be stabilized 
                later, but with 128-bit portable simd one can do a lot already
11:16 <@acrichto> I can't really personally render a judgement on whether a type 
                  parameter will work
11:16 <@acrichto> I'm just trying to act as a proxy for all the folks who commented 
                  on historical threads
11:16 <@acrichto> where I thought there was enough pushback to generics to not do it
11:16 <@acrichto> I could be misremembering though
11:16 < gnzlbg> i recall something like that
11:17 < gnzlbg> for this, the element type and length need to be part of the type if 
                we ever want to support packed vectors of pointers 
11:17 < gnzlbg> but if we are not stabilizing that, then its not necessary
11:17 < gnzlbg> and I'd try to avoid that initially
11:18 <@acrichto> I could see that yeah
11:22 < gnzlbg> acrichto: I'll try to make the current type aliases newtypes. It 
                might not be that much work - but it would be nice if the libs team 
                could discuss this
11:22 < gnzlbg> there are a couple of libraries in the ecosystem that do this 
                (SmallVec, ArrayVec, etc. come to mind)
11:22 < gnzlbg> and just because Simd<T> is generic, it does not mean that it 
                accepts every T
11:23 < gnzlbg> Simd<T: sealed_trait> lets us control exactly which Ts it can 
                accept, and stabilize these independently of one another
11:23 < gnzlbg> So Simd<[f32; 4]> could be stabilized without stabilizing Simd<[f32; 
                8]> etc.
11:24 < gnzlbg> I don't know how that would interact with type inference though
11:24 < gnzlbg> and I see that it could be problematic
11:35 <@acrichto> gnzlbg: from an api-design perspective in a vacuum I think we'd 
                  all agree that generics are the way to go
11:35 <@acrichto> (no way we'd ever say Vec<T> shouldn't exist)
11:35 <@acrichto> but SIMD I thought was a special case
11:36 <@acrichto> in the sense that the libs team probably wouldn't have many 
                  thoughts on this
11:36 <@acrichto> other than "let's consult those using lots of SIMD"
11:36 < gnzlbg> i'm open to better suggestions
11:36 <@acrichto> and it may be the case now that with the platform apis the genrics 
                  are fine
11:37 < gnzlbg> as in, we didn't use generics at first in std::simd, it just turned 
                out that I didn't figure out a way to implement some features 
                without them
11:37 < gnzlbg> and I still don't know how to implement them without them
11:37 < gnzlbg> but maybe others do / will
11:37 < gnzlbg> or in other words, the library doesn't use generics because it makes 
                the api nicer, but because afaik it currently cannot be implemented 
                without them
11:39 < gnzlbg> i wish i did not had to use generics actually, because that required 
                re-implementing the whole library :/
11:39 < gnzlbg> this is why i would be fine with hiding generics in the API
11:45 < rkruppe> gnzlbg: i think we talked about it before but remind me, what were 
                 these features that you couldn't figure out how to do without 
                 generics?
11:47 < gnzlbg> rkruppe: packed vectors of pointers
11:48 < rkruppe> going from <n x T*> to <n x T> or what
11:48 < gnzlbg> the other way around
11:48 < gnzlbg> ah no yeah, that way
11:48 < gnzlbg> <n x T*> to <n x T>, <n x T**> to <n x T*>, etc.
11:49 < rkruppe> that screams "associated type" at me
11:49 < gnzlbg> when going from <n x T**> to <n x T*> what do you put in the 
                associated type ?
11:50 < rkruppe> `<VectorType::Element as Deref>::Target`. well not literally Deref 
                 because raw pointers, but like Deref
11:51 < rkruppe> i.e. a helper trait implemented on pointers such that `<*const 
                 T>::Pointee = T`
11:51 < gnzlbg> so yes, kind of, since the result is also a vector
11:52 < gnzlbg> that's pretty much how it is currently implemented
11:52 < rkruppe> to be clear we're talking about the signature of gather?
11:52 < gnzlbg> yes, that type is used in that signature
11:53 < rkruppe> oh hm i think i see the problem
11:53 < gnzlbg> the thing is now you need to implement this trait for some concrete 
                types
11:53 < gnzlbg> *concrete vector types
11:53 < gnzlbg> and you have *mut f32 and *const f32 which deref into f32x2, f32x4, 
                ... depending on the vector length
11:54 < gnzlbg> so all of this are different types and can have different impls, so 
                far so good
11:54 < gnzlbg> the leafs are concrete types, so a pf32x4 will dereference into a 
                f32x4, which is nice
11:55 < gnzlbg> but now we wanted to also support dereferencing a ppf32x4 into a 
                pf32x4
11:55 < gnzlbg> (I think sunfish eddyb you and maybe some others talked about this, 
                that supporting vectors of pointers to pointers would be nice)
11:56 < gnzlbg> and that gives you an infinite amount of vector types that you have 
                to support
11:56 < rkruppe> `ppf32x4` is `Simd<[*const *const f32; 4]>`?
11:56 < gnzlbg> for example
11:56 < gnzlbg> but we can have `Simd<[*mut *const f32; 4]>` (all of these 
                combinations work
11:56 < rkruppe> i was actually thinking `ptrx4<T>` where T can be for example `f32` 
                 or `*mut *const f32`
11:57 < rkruppe> well maybe two variants of that type if you want both mut and const 
                 at the outer layer
11:57 < gnzlbg> i'd have to check the implementation but I think what I ended up 
                doing is having a x4<T> vector type where T can be anything
11:57 < gnzlbg> so you can have x4<*mut *const *mut *const f32>
11:58 < gnzlbg> this is all an internal implementation detail anyways, none of this 
                can be observed from outside
11:59 < rkruppe> yes that is how i understood the current solution
11:59 < gnzlbg> https://github.com/rust-lang-nursery/packed_simd/blob/master/src/vPtr.rs#L5
11:59 < gnzlbg> those types are actually exposed
11:59 < rkruppe> i am just trying to understand why it was necessary
12:00 < gnzlbg> and build on the x4 types (which are internal)
12:00 < gnzlbg> yeah, rkruppe so it was necessary to implement vector of pointers 
                for all pointers
12:00 < gnzlbg> not only for pointers to a non-pointer value, but also for pointers 
                to pointers
12:02 < rkruppe> the impls you linked seem to be generic over the pointee type tho? 
                 as in, are there really different impls for e.g. `x4<*const T>` and 
                 `x4<*const *const >`?
12:02 < rkruppe> [...] and `x4<*const *const T>`?
12:02 < gnzlbg> hmm no, they are only generic over the first *mut/*const T
12:02 < gnzlbg> depending on the first *mut/*const you can only read, or read/write
12:03 < gnzlbg> so the api of Simd<[*mut T; N]> is different than that of 
                Simd<[*const T; N]>
12:03 < gnzlbg> whether T is an f32, or another pointer (e.g. *mut f32) is irrelevant
12:04 < rkruppe> so then i don't understand why you can't have one concrete type 
                 (generic over `T`) corresponding to `Simd<[*const T; 4]>` and 
                 another corresponding to `Simd<[*mut T; 4]>`
12:05 < gnzlbg> rkruppe: we have, those are the types exposed
12:05 < gnzlbg> type cptrx2<T> = Simd<[*const T; 2]>
12:05 < rkruppe> i mean something like `struct mutptr4<T>(...);` and `struct 
                 constptr4<T>(...);`
12:06 < gnzlbg> yep, those are cptrx4 and mptrx4
12:06 < rkruppe> again i want to understand why you moved to having x4<T> / 
                 Simd<ARRAY> at all
12:06 < rkruppe> i fear the signature of gather/scatter would involve an associated 
                 type from hell but it still seems like it should be *possible*
12:07 < gnzlbg> rkruppe: ah because the generic impl that dereferences into the 
                target type is easier if mutptrx4 and f32x4 are the same type
12:07 < rkruppe> `fn gather(self) -> <Self as GatherResult>::It` where `It` is 
                 either another pointer vector or a vector of ints/floats
12:08 < gnzlbg> the impls for the pointer types are generic, there is only one 
                impl<T> for cptrx4<T> { ... } and the target type is just an 
                associated type Target = Simd<[T; 4]>
12:08 < rkruppe> yes, that is in many ways much nicer, but i don't see why it's 
                 *necessary*
12:08 < rkruppe> oh well
12:08 < gnzlbg> if T is just f32 at the leaf, then Target == f32x4 automatically, 
                while if T is *mut T, then Target is Simd<[*mut T; 4]>
12:08 < rkruppe> doesn't matter
12:08 < gnzlbg> it isn't necessary if you add all the impls automatically
12:08 < gnzlbg> *manually
12:09 < gnzlbg> the problem is, if you add a generic impl for all pointer types, and 
                you add an impl for the leaf primitive types
12:09 < gnzlbg> they will conflict
12:09 < rkruppe> no point rewriting the library once again and having the length as 
                 an explicit parameter is better for the future anyway
12:10 < gnzlbg> one might be able to work around this using a separate trait, but 
                then you want to have an impl for all T's that are pointers, and one 
                impl for all Ts that are not pointer
12:10 < rkruppe> gnzlbg: if you're talking the impl that provides gather, there 
                 would just be one`impl<T> cptr4<T>` and the selection between 
                 gather returning another vector pointer or a vector of leaf types 
                 would be made with associated types dispatched on T
12:11 < gnzlbg> i am talking about the impls for the trait that is used to find the 
                associated type
12:11 < rkruppe> that trait is implemented for `*const T`, `*mut T`, and the 
                 float/int types?
12:13 < gnzlbg> that might work I think
12:14 < gnzlbg> looking at the impl of gather currently it dispatches to a concrete 
                type
12:16 < gnzlbg> but i guess one can also add a trait with an associated type and 
                implement it for the vectors of pointers
12:18 < gnzlbg> rkruppe so the problem is that particular trait
12:18 < gnzlbg> lets call it trait VecDeref { type Target; } 
12:19 < gnzlbg> I can add  an impl<T> VecDeref for cptrx2<*mut T> { type Target = 
                cptrx2<T>; } 
12:19 < gnzlbg> and then I can add the manual impls for the concrete types, impl 
                VecDeref for cptrx2<f32> { type Target = f32x2; } 
12:20 < gnzlbg> so no, I don't think there is a problem, that should just work
12:21 < gnzlbg> I guess that can be retroffitted without many changes
12:21 < gnzlbg> so maybe we don't need to expose the Simd<[T; N]> type at all

@alexcrichton
Copy link
Member

oh also gisted a bit with @rkruppe which I didn't get a chance to follow over the weekend!

@glaebhoerl
Copy link
Contributor

FWIW my recollection of the internals thread(s) I participated in about this is that people were concerned about it being premature to speculate about Simd<[T; N]>, it wasn't clear whether or how well it could work, it'd be more conservative and prudent to just stabilize f32x4 et al first, unknown unknowns, and that sort of thing. Process oriented and meta-level arguments. I don't recall there having been specific technical objections to Simd<[T; N]>. "I don't remember that there were" is not the same as "I remember that there weren't", of course.

If @gnzlbg has implemented it and it turned out that it not only works but works better than the alternatives, then that answers some of those questions, I think?

@gnzlbg
Copy link
Contributor Author

gnzlbg commented Sep 5, 2018

@glaebhoerl My main worry is that I don't know the answer to the following questions:

  • Can we put impl Foo { } blocks behind feature flags ? (to only stabilize some?)
  • Can we put impl Trait for Foo {} impls behind distinct feature flags ? (to only stabilize some?)

so I don't know whether it is possible to stabilize some part of the implementation without stabilizing all of it (and instantaneously stabilizing all future additions).

(EDIT: these questions are important because given a Simd<T>, what you can do with it depends on the concrete T provided, which traits it implements, etc. These are all implementation details of the library, but affect what the user can do with a Simd<T>)

If the answer to both questions is that yes, those things are possible, and trait impls behind feature flags do not affect type inference on stable, and on nightly when that particular feature flag is not active, then I think all my concerns would be resolved (cc @eddyb ?) .

@alexcrichton mentioned that similar discussions about generics have been had, and that other points about these issues have been raised. It might be worth it to dig those up: could someone share a link?

@alexcrichton
Copy link
Member

@gnzlbg yes impl Foo { ... } blocks (and methods) can be separately stabilized. We cannot, however, selectively stabilize impl Trait for Foo { .. }.

@gnzlbg
Copy link
Contributor Author

gnzlbg commented Sep 5, 2018

We cannot, however, selectively stabilize impl Trait for Foo { .. }.

So this might be the killer. To explain the problem, Simd<T> does not accept every T, it does not accept every array. For example Simd<[f32; 4]> is ok, but Simd<[f32; 3]> (and many others) is not.

The way that's implemented in the library is just using a trait to constrain T in the Simd implementation: Simd<T: sealed::SimdArray>. The private SimdArray trait is implemented in the library for some array times, like [f32; 4], but also some generic blanket impls like impl<T> SimdArray for [*const T; 4] where ... and similar.

This is problematic, because there are parts of the library that we don't want to stabilize at first, like 512-bit wide vectors, maybe vectors of pointers, etc. but we still want to be able to use and improve on nightly. The problem is that we can't do that, because we can't mark the impl Trait for Foo as an unstable impl behind a feature gate.

AFAICT const generics won't help. If we had a Simd<T, const N: usize> type we still have similar problems (e.g. restricting the Ns allowed, restricting T to not allow pointers at first, etc. ).

The only idea I have is that we could use conditional compilation to only put in core the pieces that are "ready" for stabilization, and that people would need to use the packed_simd crate for the "unstable" pieces. But that would mean that every time that we move something new to core it could become pretty much instastable, skipping the release train, which is a no go because the release train has detected many issues in the past when this was in std::simd.

@hsivonen
Copy link
Member

hsivonen commented Sep 6, 2018

@gnzlbg

Sorry about the slow response time. Looks great! Thank you. (I wish we could get safe wrappers around endianness-exposing bitcasts of vectors of the same bit width but different lane configuration sooner than later, though.)

PartialOrd<Self>: compares two vectors lexicographically.

Is it endianness-dependent? (Asking out of curiosity, not out of complaint, because endianness-exposing bitcasts were excluded from the RFC but endianness-dependent extract, new, etc. were included.)

@ralfbiedert

/// # Panics
///
/// If `slice.len() != Self::lanes()` or `&slice[0]` is not
/// aligned to an `align_of::<Self>()` boundary.
pub fn write_aligned(self, slice: &mut [element_type]);

Out of curiosity, have you checked if the compiler is smart enough to hoist the panic condition checks out of the loop when used with exact_chunks?

@gnzlbg
Copy link
Contributor Author

gnzlbg commented Sep 6, 2018

Is it endianness-dependent?

No. AFAIK one cannot create endian-dependent behavior with packed_simd unless one writes unsafe code to e.g. transmute an u64 into an [u32; 2] or similar. These types of issues also affect primitive types like u64->[u32; 2] too, so they are not specific to vector types.

Out of curiosity, have you checked if the compiler is smart enough to hoist the panic condition checks out of the loop when used with exact_chunks?

Most of the recent work in the packed_simd examples and benchmarks has gone into beating ISPC in all benchmarks and manually removing these checks has played a non-negligible role in that. So what we have ended up doing is using the unsafe _unchecked methods pretty much in all kernels, but changing their implementation to include debug_assert!s so that they are checked in debug builds. That is, a method like vec.write_to_slice_aligned_unchecked(&[...]) does check that the target slice is large enough to hold vec and also suitably aligned if debug-assertions = true.

Having said this, there have been many changes that have improved performance in the recent weeks, so when all benchmark-related PRs have been merged it might be worth it to test this again by just enabling debug-assertions in release mode and running all benchmarks.

@alexcrichton
Copy link
Member

@gnzlbg ah yeah that's why I was wondering if newtype wrappers would work, but I unfortunately don't know of a great solution :(

@Centril Centril added A-simd SIMD related proposals & ideas A-types-libstd Proposals & ideas introducing new types to the standard library. labels Nov 22, 2018
@Centril Centril added the T-lang Relevant to the language team, which will review and decide on the RFC. label Feb 21, 2019
@gnzlbg
Copy link
Contributor Author

gnzlbg commented Jul 16, 2019

I'm closing this for now:

  • packed_simd has been updated with many new APIs since, and the RFC would need updating with those.

  • packed_simd is being updated to use const generics, which results in a couple of minor stylistic changes to the API (e.g. Simd<T, N> instead of Simd<[T; N]>). That work is not done, and it is unclear whether all currently-implemented and proposed APIs play well with const-generics.

This is kind of blocked on how much time I've got, and I don't think we should advance this RFC until both issues are resolved, so I'm closing this for now to reduce the RFC repo queue. Once that work is done, I'll update the RFC and re-open it or re-submit it if the changes are significant (I hope not).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-simd SIMD related proposals & ideas A-types-libstd Proposals & ideas introducing new types to the standard library. T-lang Relevant to the language team, which will review and decide on the RFC. T-libs-api Relevant to the library API team, which will review and decide on the RFC.
Projects
None yet
Development

Successfully merging this pull request may close these issues.