Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Various Fixes and enhancements in x86 intrinsics #1594

Merged
merged 9 commits into from
Jun 29, 2024

Conversation

sayantn
Copy link
Contributor

@sayantn sayantn commented Jun 24, 2024

  • Updated the x86-intel.xml to be a more recent update of Intel Intrinsics Guide (v3.6.8)
  • Added functionality to auto-generate a missing-x86.md for ease of implementation
  • Updated disassembly to allow windows-gnu targets (they have the same implementation as linux targets, as that uses objdump from binutils, and binutils is a dependency of GCC). Add the x86_64-pc-windows-gnu target in CI
  • Fixed some of the stream intrinsics.
  • modified floating-point reduce-add and reduce-mul intrinsics to NOT use simd_reduce_add_unordered and simd_reduce_mul_unordered as Intel specifies a strict associativity. Follow GCC and hand-implement the associativity ourselves (_mm512_reduce_add_ps and friends are setting fast-math flags they should not set #1533)
  • Fixed _load_mask32 etc in AVX512BW (they should have taken a __mmask32/__mmask64 pointer, but took u32/u64 pointer)
  • As moves never modify any flags, add preserves_flags to the asm! blocks for moves
  • Fix _mm_loadu_si64 (it had target-feature sse, but needs sse2), _mm256_extract_epi64, _mm256_extract_epi32, _mm256_cvtsi256_si32 (these had target-feature avx2, but need avx).
  • Fixed _mm_cvtt intrinsics (they were actually calling vcvtss2si, when they should call vcvttss2si)
  • Removed all MMX support from stdarch-verify, and made the target-feature verification stricter
  • Implemented the missing intrinsics mentioned in Missing x86 vendor intrinsics (SSE2, SSE 4.1, AVX2) #1178 with feature-gate simd-x86-updates (Tracking Issue for Missing BMI1, AVX2, SSE2, SSE4.1, SSE4a and TBM intrinsics rust#126936)
  • Fixed _mm512_kunpackb
  • Modified the reduce-max and reduce-min intrinsics to preserve associativity specified by Intel and to use the comparison function they described (which is NOT maxnum from LLVM)
  • Bumped the OS in CI Docker containers to Ubunto 24.04 (except for in armv7-unknown-linux-gnueabihf and x86_64-unknown-linux-gnu-emulated)

Modifying fma has been moved to #1597
Masked load/stores are on standby due to rust-lang/rust#126919

@rustbot
Copy link
Collaborator

rustbot commented Jun 24, 2024

r? @Amanieu

rustbot has assigned @Amanieu.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

Updated the intrinsics list from version 3.4 to 3.6.8. Added a missing-x86.md file to track progress.
fixed reduce-add and reduce-mul. and load/store of mask32 and mask64. added preserves-flags to mov asm. fixed the missing list. fixed `_mm_loadu_si64`. Added `assert_instr`
Added some tests, Fixed incorrect target-features, and verification code for target-features. Removed all MMX support from verification.
`_mm512_kunpackb` was implemented wrong, and `simd_reduce_max` uses `maxnum` for comparison, which adheres to IEEE754, but Intel specifically says that they do NOT adhere to IEEE754 for NaNs, which can give wrong results
crates/core_arch/src/x86/avx2.rs Outdated Show resolved Hide resolved
crates/core_arch/src/x86/sse41.rs Outdated Show resolved Hide resolved
crates/core_arch/src/x86/sse41.rs Outdated Show resolved Hide resolved
crates/stdarch-verify/tests/x86-intel.rs Outdated Show resolved Hide resolved
crates/core_arch/src/x86/sse41.rs Outdated Show resolved Hide resolved
crates/core_arch/src/x86/avx2.rs Outdated Show resolved Hide resolved
@sayantn sayantn force-pushed the avx512-fixes branch 4 times, most recently from 231b968 to 1c7aafe Compare June 29, 2024 12:20
@sayantn sayantn force-pushed the avx512-fixes branch 6 times, most recently from 2be8efe to a58f1ee Compare June 29, 2024 14:35
Fixed x86_64-apple-darwin freezing.
Bump all docker to Ubuntu-24.04 (except for emulated and armv7)
/// must be aligned on a 32-byte boundary or a general-protection exception may be generated. To
/// minimize caching, the data is flagged as non-temporal (unlikely to be used again soon)
///
/// [Intel's documentation](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm256_stream_load_si256)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This (and all other AVX2 non-temporal operations) should get the same safety comment that the older non-temporal stores have. See e.g. here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked and only non-temporal stores have special memory orderings on x86. x86 non-temporal loads work just like normal loads.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Amanieu told that that doesn't apply to streaming loads, only streaming stores.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I didn't realize non-temporal loads even are a thing. More nightmare waiting to happen, I guess...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants