-
Notifications
You must be signed in to change notification settings - Fork 289
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Vectorized field trait #216
Conversation
It looks like the build can’t compile |
Yep the latest nightly seems to break main. I'll look into it |
Looks like we'll need to wait for a change in num -- see rust-num/num-bigint/issues/218. It affects lots of users so hopefully it will come soon. |
src/field/packed_field.rs
Outdated
|
||
fn broadcast(x: Self::FieldType) -> Self; | ||
|
||
fn from_arr(arr: &[Self::FieldType]) -> Self; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe call this from_slice
since it technically takes a slice? Or have it take [Self::FieldType; WIDTH]
Similarly for the method below, I'd either call it to_vec
or have it return [Self::FieldType; WIDTH]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, this is already on the to-do list 😊
src/field/packed_crandall_avx2.rs
Outdated
|
||
#[derive(Copy, Clone)] | ||
#[repr(C, align(4))] | ||
pub struct PackedCrandallAVX2(pub [u64; 4]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm would it be simpler to wrap the __m256i
? Seems like we convert to it rather than doing anything directly with the [u64; 4]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an awful trick to stop Rust assuming 32-byte alignment. I’ve added the following comment to explain it:
// PackedCrandallAVX2 wraps an array of four u64s, with the new and get methods to convert that
// array to and from __m256i, which is the type we actually operate on. This indirection is a
// terrible trick to change PackedCrandallAVX2's alignment.
// We'd like to be able to cast slices of CrandallField to slices of PackedCrandallAVX2. Rust
// aligns __m256i to 32 bytes but CrandallField has a lower alignment. That alignment extends to
// PackedCrandallAVX2 and it appears that it cannot be lowered with #[repr(C, blah)]. It is
// important for Rust not to assume 32-byte alignment, so we cannot wrap __m256i directly.
// There are two versions of vectorized load/store instructions on x86: aligned (vmovaps and
// friends) and unaligned (vmovups etc.). The difference between them is that aligned loads and
// stores are permitted to segfault on unaligned accesses. Historically, the aligned instructions
// were faster, and although this is no longer the case, compilers prefer the aligned versions if
// they know that the address is aligned. Using aligned instructions on unaligned addresses leads to
// bugs that can be frustrating to diagnose. Hence, we can't have Rust assuming alignment, and
// therefore PackedCrandallAVX2 wraps [u64; 4] and not __m256i.
Please let me know if you can think of a nicer way to do this 😊
c9a8dc3
to
f284b35
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Update:
I’m breaking this massive PR up into a few smaller ones to ease review. This one is for the
PackedField
andPackable
traits, as well as theSingleton
struct. APackedField
is a wrapper around a vector of field elements, permitting a subset of field operations.Packable
extendsField
with the defaultPackedField
for a particularField
. It defaults toSingleton
a dummy container for a scalar, and is specialized per platform.Original comment:
This PR introduces
PackedField
s, which permit SIMD field arithmetic. The default packing defers to scalar code, but can be overridden for particular architecture/field combinations.I implemented Crandall arithmetic in AVX2, and rewrote
fft_classic
in vector as a test case. The vectorized FFT is 30-ish% faster (needs more thorough benchmarks) on Skylake.While functional, this branch needs a cleanup before it can be reviewed. There’s some debug logging/unused imports/confusing names/undocumented tricks. But I welcome big-picture comments!