-
Notifications
You must be signed in to change notification settings - Fork 214
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use REP MOVSQ/STOSQ on x86_64 #365
Conversation
Thanks for this! I forget, but do we already have tests for memset/memcmp/etc? If not, could you add some as part of this PR? Additionally, do you have some benchmark numbers for how these perform? |
97ad0fa
to
012085a
Compare
I updated this PR to add I ran a bunch of trials, results are below, the main takeaways are:
memcpy
memset
memcmp
|
Nice! Thos are some pretty slick wins and also nice find that |
This claims to close my issue, but isn't this only about x86, while the performance problems seem to be happening across the board? (Especially on WASM it seems rather bad atm) |
We can leave it open for other platforms, but FWIW there's not really much else we can do for wasm. The bulk memory proposal fixes this issue, however, because the |
If you use the WASI target it does a loop copying 32-bit values rather than individual bytes because it uses the musl implementation in the wasilibc then. I haven't done any real benchmarking, but I'd expect that to be faster. But yeah the bulk memory proposal also fixes that issue if you use that target feature. |
7a52621
to
7321f5c
Compare
Also, to confirm, do we have tests for this in-repo? If not could some be added? |
I would want better test coverage than what we currently have. I'm planning to add a bunch before this CL is ready for review again. |
src/mem/x86_64.rs
Outdated
@@ -0,0 +1,69 @@ | |||
use super::c_int; | |||
|
|||
// On recent Intel processors, "rep movsb" and "rep stosb" have been enhanced to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does this implementation fare on non-intel implementations of x86_64?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been investigating performance on AMD hardware (the only other x86 platform where anyone cares about performance). This has led me to modify the implementation. When I have some time, I'll post the results and update this comment to clarify the impact on AMD as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On Intel fast string support is detectable by two flags in CPUID and one enable bit in IA32_MISC_ENABLE, if necessary. AMD may provide the same detection tools.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So it looks like virtually all newish AMD and Intel processors support some sort of "REP MOVS enhancement" (i.e. rep movs is somehow better than a normal loop). However, if the ermsb feature flag isn't present (like on all AMD processors) then rep movsq
seems better than rep movsb
. With ermsb the two variants are about the same speed.
Given this, I just implemented the rep movsq
version unconditionally without any CPUID checking. Variants that use rep movsb
when Intel's ERMSB/FSRM feature is enabled could be added later, but there doesn't seem to be much of a gain (at least with the benchmarks I'm running here).
These implementations assume that the direction flag is not set, which could not be always the case |
Signed-off-by: Joe Richey <joerichey@google.com>
Signed-off-by: Joe Richey <joerichey@google.com>
This allows comparing the "normal" implementations to the implementations provided by this crate. Signed-off-by: Joe Richey <joerichey@google.com>
Per the |
The assembly generated seems correct: https://rust.godbolt.org/z/GGnec8 Signed-off-by: Joe Richey <joerichey@google.com>
Even C doesn't tolerate DF being set generally. There are two legitimate uses of it which I have encountered. One is memmove(), and one is code dumps in backtraces, where you need to be wary of hitting page/permission boundaries (see https://github.com/xen-project/xen/blob/master/xen/arch/x86/traps.c#L175-L204). For performance, things are very tricky, and once size does not fit all. Presumably here we're talking about If alignment information is available at compile time, then Frankly, study a popular libc and follow their lead. A lot of time and effort has gone into optimising them generally across multiple generations of processor. Alternatively, if you do feel like doing feature-based dispatch, that will get better results if you can pick the optimum algorithm for the CPU you're on. |
i wonder why memmove got faster than memcpy lol |
Signed-off-by: Joe Richey <joerichey@google.com>
@alexcrichton the tests have been added so this is now ready for final review and merging. This implementation sticks with the For performance numbers, see my link in #365 (comment) |
Signed-off-by: Joe Richey <joerichey@google.com>
Thanks again for this! This all looks great to me. As one final thing, though, I'm not sure if the |
Signed-off-by: Joe Richey <joerichey@google.com>
Signed-off-by: Joe Richey <joerichey@google.com>
Done (for tests and builds), also the testcrate enables the |
Hm yeah ideally that would change but that's probably best left to another PR, thanks again! |
This change is needed for compiler-builtins to check for this feature when implementing memcpy/memset. See: rust-lang/compiler-builtins#365 The change just does compile-time detection. I think that runtime detection will have to come in a follow-up CL to std-detect. Like all the CPU feature flags, this just references rust-lang#44839 Signed-off-by: Joe Richey <joerichey@google.com>
Add compiler support for LLVM's x86_64 ERMSB feature This change is needed for compiler-builtins to check for this feature when implementing memcpy/memset. See: rust-lang/compiler-builtins#365 Without this change, the following code compiles, but does nothing: ```rust #[cfg(target_feature = "ermsb")] pub unsafe fn ermsb_memcpy() { ... } ``` The change just does compile-time detection. I think that runtime detection will have to come in a follow-up CL to std-detect. Like all the CPU feature flags, this just references rust-lang#44839 Signed-off-by: Joe Richey <joerichey@google.com>
in principle the PR uses rust-lang/compiler-builtins#365 to improve the performance
78: using of the asm feature to improve the performance of basic functions r=jbreitbart a=stlankes - PR uses rust-lang/compiler-builtins#365 to improve the performance - fix broken CI and build the bootloader on windows correctly Co-authored-by: Stefan Lankes <slankes@eonerc.rwth-aachen.de>
78: using of the asm feature to improve the performance of basic functions r=jbreitbart a=stlankes - PR uses rust-lang/compiler-builtins#365 to improve the performance - fix broken CI and build the bootloader on windows correctly Co-authored-by: Stefan Lankes <slankes@eonerc.rwth-aachen.de>
78: using of the asm feature to improve the performance of basic functions r=jbreitbart a=stlankes - PR uses rust-lang/compiler-builtins#365 to improve the performance - fix broken CI and build the bootloader on windows correctly Co-authored-by: Stefan Lankes <slankes@eonerc.rwth-aachen.de>
78: using of the asm feature to improve the performance of basic functions r=jbreitbart a=stlankes - PR uses rust-lang/compiler-builtins#365 to improve the performance - fix broken CI and build the bootloader on windows correctly Co-authored-by: Stefan Lankes <slankes@eonerc.rwth-aachen.de>
* mem: Move mem* functions to separate directory Signed-off-by: Joe Richey <joerichey@google.com> * memcpy: Create separate memcpy.rs file Signed-off-by: Joe Richey <joerichey@google.com> * benches: Add benchmarks for mem* functions This allows comparing the "normal" implementations to the implementations provided by this crate. Signed-off-by: Joe Richey <joerichey@google.com> * mem: Add REP MOVSB/STOSB implementations The assembly generated seems correct: https://rust.godbolt.org/z/GGnec8 Signed-off-by: Joe Richey <joerichey@google.com> * mem: Add documentations for REP string insturctions Signed-off-by: Joe Richey <joerichey@google.com> * Use quad-word rep string instructions Signed-off-by: Joe Richey <joerichey@google.com> * Prevent panic when compiled in debug mode Signed-off-by: Joe Richey <joerichey@google.com> * Add tests for mem* functions Signed-off-by: Joe Richey <joerichey@google.com> * Add build/test with the "asm" feature Signed-off-by: Joe Richey <joerichey@google.com> * Add byte length to Bencher Signed-off-by: Joe Richey <joerichey@google.com>
The referenced issue in compiler-builtins (rust-lang/compiler-builtins#365) has been merged.
The referenced issue in compiler-builtins (rust-lang/compiler-builtins#365) has been merged.
Addresses part of #339 (see also rust-osdev/cargo-xbuild#77)
The implementations for these functions are quite simple and (on many recent processors) are the fastest way to implement
memcpy
/memmove
/memset
. The implementations in MUSL and in the Linux kernel were used for inspiration.Benchmarks for the memory functions were also added in this PR, and can be invoked by running
cargo bench --package=testcrate
. The results of running this benchmarks on different hardware show that the qword-based variants are almost always faster than the byte-based variants.While the implementations for
memcmp
/bcmp
could be made faster though use of SIMD intrinsics, usingrep cmpsb
/rep cmpsq
makes them slower, so they are left as-is in this PR.Note that #164 added some optimized versions for memcmp on ARM.