Implement all x86 vendor intrinsics #40
Comments
cc @BurntSushi @gnzlbg, I've opened this up and moved |
Could you edit the guide to suggest unsafe functions for the intrinsics? #21 |
For those wishing to implement intrinsics above SSE2, make sure you're running your tests with |
You can use `RUSTFLAGS="-C target-feature=+avx2"` to enable a particular extension. Note, however, that a CPU that supports the extension is needed to run the tests. To develop tests for a different architecture (e.g. develop for ARM from x86) you can use cross-compilation. To run the tests... travis is an option. I don't know if there is a better option though. |
It looks like travis only runs SSE2 and below with our current config. I wonder if their machines support AVX... |
@AdamNiederer oh that's actually a bug! I think I see what's going on though, I'll submit a fix. |
@alexcrichton https://github.com/rust-lang-nursery/stdsimd/blob/master/ci/run.sh probably needs to set |
Added in #45. Let's see what Travis has to say about it. EDIT: The build is failing, but those same 20 tests were failing for me on my Ivy Bridge box last night. I think LLVM might be spitting out wider versions of 128 or 64-wide instructions on CPUs which support them. It also looks like travis supports AVX2 🎉 |
@gnzlbg oh I'm going to add @AdamNiederer thanks! I'll look into the failures and see if I can fix them. |
Interested in helping out with this. Figured I'd start super small with |
Hello, I've given a try at |
Post #81 SSE 4.2 should be covered. |
@dlrobertson Awesome! I've updated the checklist. |
I've got an implementation for |
What is the plan with FMA, is there a reason behind omitting it in the above list? |
Here are some intrinsics that are in the TODO, but are already implemented.
sse
sse2
sse3
ssse3
avx
avx2
|
@p32blo updated! |
|
This post should add how to document the intrinsics. |
@rroohhh it should be part of AVX2 although we might want to implement it in its own module. |
@alexcrichton this issue's topic is quite long and hard to browse, could you please use something like the mechanism described in this comment, to allow collapsing individual sections? Something like this
Code for the above: <details><summary>Something like this</summary><p>
<< This line break is necessary!
- [ ] Some intrinsic
</p></details> |
@alexcrichton Could you please check off the following tasks in the SSE section?
For |
@nominolo done! |
`_mm_cvtsd_f64`, `_mm_cvtsd_si64x` and `_mm_cvttsd_si64x`. See rust-lang#40.
@alexcrichton long story short:
|
The SVML is just a bunch of inlining-friendly assembly-level subroutines which use SSE/AVX instructions to compute higher-level mathematical primitives. I'm pretty sure it's "just another library", otherwise. It's heavily optimized for Intel CPUs, much like ICC. I'm also pretty sure it's not open-source or readily available. |
@alexcrichton , sse instructions are split into 3 folders: i586, i686 and x86_64. How should I know where to put an implementation for |
@crypto-universe @AdamNiederer ok cool, thanks for the info! Sounds like I should omit those intrinsics. I've updated the OP to omit the SVML intrinsics. @crypto-universe oh the division between those modules is somewhat non-important now. The main one is that |
Ok I think this is effectively "done enough" that we can close and follow up with more specific issues if need be. Thanks so much for everyone's help on this! |
Is this the right place to mention that core::arch is missing RISC-V support or should I open a tracking bug? (I'm specifically interested in adding support for the equivalent of rdtsc). |
We generally try to stick to vendor-specified intrinsics, e.g. SSE intrinsics and ARM NEON intrinsics. AFAIK RISC-V doesn't have any target-specific intrinsics defined in GCC or Clang. |
Ough. Thanks. I can see your reasoning, but that raises the bar by orders of magnitude and pushes the problem to all clients of core::arch :( |
You can always just use inline assembly if you really want a specific instruction... |
That's literally what "pushes the problem to all clients" means. |
Probably best to just open a new issue where it can get eyes and discussion. The tail end of a long-closed issue isn't a good way to bring your problem to light. |
@Amanieu It doesn't look like there are any RISC-V intrinsics in llvm/clang yet, but there is some recent work in that area: https://www.sifive.com/blog/risc-v-vector-extension-intrinsic-support |
Those are actually much trickier than it seems since they involve scalable vectors with a size not known at compile-time. This requires special support in the compiler. The same issue applies to the ARM SVE intrinsics. |
Out of interest and because it has recently become relevant: ["VMX"] would be helpful. |
Maybe it's worth opening a separate issue for each target feature? For example, I wanted to use Is there a reason why streaming load intrinsics were omitted? |
Please open a new issue if there are any missing intrinsics. |
This is intended to be a tracking issue for implementing all vendor intrinsics in this repository.
This issue is also intended to be a guide for documenting the process of adding new vendor intrinsics to this crate.
If you decide to implement a set of vendor intrinsics, please check the list below to make sure somebody else isn't already working on them. If an item is not checked off and has no name next to it, feel free to comment that you'd like to implement it!
At a high level, each vendor intrinsic should correspond to a single exported Rust function with an appropriate `target_feature` attribute. Here's an example for `_mm_adds_epi16`:

Let's break this down:

- `#[inline]` is added because vendor intrinsic functions generally should always be inlined: the intent of a vendor intrinsic is to correspond to a single particular CPU instruction. A vendor intrinsic that is compiled into an actual function call could be quite disastrous for performance.
- The `#[target_feature(enable = "sse2")]` attribute instructs the compiler to generate code with the `sse2` target feature enabled, regardless of the target platform. That is, even if you're compiling for a platform that doesn't support `sse2`, the compiler will still generate code for `_mm_adds_epi16` as if `sse2` support existed. Without this attribute, the compiler might not generate the intended CPU instruction.
- The `#[cfg_attr(test, assert_instr(paddsw))]` attribute indicates that when we're testing the crate we'll assert that the `paddsw` instruction is generated inside this function, ensuring that the SIMD intrinsic truly is an intrinsic for the instruction!
- Types follow the vendor interface (with C types such as `int64_t` translated to `i64` in Rust).
- The goal is to compile a call to `_mm_adds_epi16` down to a single particular CPU instruction. As such, the implementation typically defers to a compiler intrinsic (in this case, `paddsw`) when one is available. More on this below as well.
- The function is `unsafe` due to the usage of `#[target_feature]`.
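The attribute breakdown above can be pictured with a small sketch. This is not the crate's actual source: `adds_epi16_sketch` and `saturating_add_lanes` are hypothetical names, the body simply forwards to the already-stabilized `_mm_adds_epi16` from `std::arch` rather than binding a compiler intrinsic, and `assert_instr` is only mentioned in a comment because it is a crate-internal test macro. A scalar fallback is included, as an assumption, so the snippet also builds on non-x86_64 targets.

```rust
// Illustrative sketch only: `adds_epi16_sketch` and `saturating_add_lanes`
// are hypothetical names, not functions from the crate.
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::{__m128i, _mm_adds_epi16, _mm_loadu_si128, _mm_storeu_si128};

/// Saturating add of eight packed `i16` lanes, compiled with SSE2 enabled.
#[cfg(target_arch = "x86_64")]
#[inline]
#[target_feature(enable = "sse2")]
// Inside the crate, `#[cfg_attr(test, assert_instr(paddsw))]` would also go here.
pub unsafe fn adds_epi16_sketch(a: __m128i, b: __m128i) -> __m128i {
    // Forward to the std intrinsic; the crate itself would call the
    // underlying compiler intrinsic at this point.
    _mm_adds_epi16(a, b)
}

/// Safe convenience wrapper over plain arrays (x86_64 path).
#[cfg(target_arch = "x86_64")]
pub fn saturating_add_lanes(a: [i16; 8], b: [i16; 8]) -> [i16; 8] {
    let mut out = [0i16; 8];
    // SAFETY: SSE2 is part of the x86_64 baseline, and the unaligned
    // load/store intrinsics touch exactly the 16 bytes of each array.
    unsafe {
        let va = _mm_loadu_si128(a.as_ptr() as *const __m128i);
        let vb = _mm_loadu_si128(b.as_ptr() as *const __m128i);
        _mm_storeu_si128(out.as_mut_ptr() as *mut __m128i, adds_epi16_sketch(va, vb));
    }
    out
}

/// Scalar fallback so the sketch compiles on other architectures.
#[cfg(not(target_arch = "x86_64"))]
pub fn saturating_add_lanes(a: [i16; 8], b: [i16; 8]) -> [i16; 8] {
    let mut out = [0i16; 8];
    for i in 0..8 {
        out[i] = a[i].saturating_add(b[i]);
    }
    out
}
```

Calling `saturating_add_lanes` with a lane holding `i16::MAX` and an addend of `1` leaves that lane at `i16::MAX`, which is the saturating behaviour the `s` in `adds` refers to.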
Once a function has been added, you should also add at least one test for basic functionality. Here's an example for `_mm_adds_epi16`:

Note that `#[simd_test]` is the same as `#[test]`; it's just a custom macro that enables the target feature in the test and generates a wrapper to ensure the feature is available on the local CPU as well.

Finally, once that's done, send a PR!
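For the testing step described above, here is a hand-written approximation of the kind of wrapper `#[simd_test]` is said to generate. It is only a sketch under assumptions: `adds_epi16_basic_check` and `run_adds_check_if_supported` are hypothetical names, and the runtime check uses the standard `is_x86_feature_detected!` macro, which may differ from what the crate's macro actually expands to.

```rust
// Illustrative sketch only; the function names here are hypothetical.
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Body of the "test": compiled with SSE2 enabled, like the intrinsic itself.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse2")]
unsafe fn adds_epi16_basic_check() -> bool {
    let a = _mm_set1_epi16(i16::MAX);
    let b = _mm_set1_epi16(1);
    let r = _mm_adds_epi16(a, b);
    // Every lane should have saturated at i16::MAX instead of wrapping.
    let eq = _mm_cmpeq_epi16(r, _mm_set1_epi16(i16::MAX));
    // Each of the 16 bytes contributes one mask bit when all lanes match.
    _mm_movemask_epi8(eq) == 0xFFFF
}

/// Wrapper: run the check only if the local CPU reports the feature.
/// Returns `None` when the check had to be skipped.
#[cfg(target_arch = "x86_64")]
pub fn run_adds_check_if_supported() -> Option<bool> {
    if is_x86_feature_detected!("sse2") {
        // SAFETY: the feature was detected at runtime just above.
        Some(unsafe { adds_epi16_basic_check() })
    } else {
        None
    }
}

/// On other architectures the check is always skipped.
#[cfg(not(target_arch = "x86_64"))]
pub fn run_adds_check_if_supported() -> Option<bool> {
    None
}
```

Splitting the feature-gated body from the runtime-detecting wrapper keeps the `unsafe` call in one place and lets the check be skipped rather than crash on a CPU that lacks the extension.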
Writing the implementation
An implementation of an intrinsic (so far) generally has one of three shapes:

- The intrinsic is written in terms of other Rust code. For example, the `_mm_add_epi16` intrinsic (note the missing `s` in `add`) is implemented via `simd_add(a, b)`, which compiles down to LLVM's cross platform SIMD vector API.
- The intrinsic maps to a compiler intrinsic: write an `extern` block to bring that intrinsic into scope and then call it. The example above (`_mm_adds_epi16`) uses this approach.
- Some intrinsics need extra handling, e.g. `_mm_cmpestri` (make sure to look at the `constify_imm8!` macro).

References
All Intel intrinsics can be found here: https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=5236
The compiler intrinsics available to us through LLVM can be found here: https://gist.github.com/anonymous/a25d3e3b4c14ee68d63bd1dcb0e1223c
The Intel vendor intrinsic API can be found here: https://gist.github.com/anonymous/25d752fda8521d29699a826b980218fc
The Clang header files for vendor intrinsics can also be incredibly useful. When in doubt, Do What Clang Does:
https://github.com/llvm-mirror/clang/tree/master/lib/Headers
TODO
["AVX2"]
_mm256_stream_load_si256
_mm_broadcastsi128_si256
["MMX"]
_mm_srli_pi16
_mm_srl_pi16
_mm_mullo_pi16
_mm_slli_si64
_mm_mulhi_pi16
_mm_srai_pi16
_mm_srli_si64
_mm_and_si64
_mm_cvtsi32_si64
_mm_cvtm64_si64
_mm_andnot_si64
_mm_packs_pu16
_mm_madd_pi16
_mm_cvtsi64_m64
_mm_cmpeq_pi16
_mm_sra_pi32
_mm_cvtsi64_si32
_mm_cmpeq_pi8
_mm_srai_pi32
_mm_sll_pi16
_mm_srli_pi32
_mm_slli_pi16
_mm_srl_si64
_mm_empty
_mm_srl_pi32
_mm_slli_pi32
_mm_or_si64
_mm_sll_si64
_mm_sra_pi16
_mm_sll_pi32
_mm_xor_si64
_mm_cmpeq_pi32
["SSE"]
_mm_free
_mm_storeu_si16
_mm_loadu_si16
_mm_loadu_si64
_mm_malloc
_mm_storeu_si64
["SSE2"]
_mm_loadu_si32
_mm_storeu_si32
["SSE4.1"]
_mm_stream_load_si128
previous description of this issue