Non memcmp slice comparison optimization #113654

krtab · 2023-07-13T12:15:48Z

This PR:

Changes the implementation for [T] slice comparison when T: !BytewiseEq. The previous implementation used zip which is not as optimized as it could be. Performance improvements are for example 20% when testing that [Some(0_u64); 4096].as_slice() == [Some(0_u64); 4096].as_slice().
On musl (and maybe others), memcmp is equivalent to the new code added in the previous point. On such platforms it is better to never call memcmp because the compiler can leverage T's alignment to generate better code than for memcmp's *const u8.

I only realized when running ./x.py test that platform specific code is not allowed in core. However, I would argue that this may be acceptable because:

It does not affect portability so much, both implementations work fine, and this is only a matter of performance
Deferring to libc's memcmp on the ground that it is more optimized than our generic one is already a kind of platform specific assumption. Arguably, rather than a cfg(not() making it opt-out, using memcmp should be opt-in.

This is likely more a draft than something ready to merge given the discussions that it must go through, but here is to starting these discussions.

The previous implementation was not optimized properly by the compiler, which didn't leverage the fact that both length were equal.

rustbot · 2023-07-13T12:15:56Z

r? @joshtriplett

(rustbot has picked a reviewer for you, use r? to override)

krtab · 2023-07-13T12:19:18Z

This is loosely related to #113576 so I'm cross linking these two issues for the record.

Musl implementation is equivalent to our general case implementation. Not using it: - Allows for better optimization leveraging alignment information - Prevents a `call` - Likely favors inlining and further optimization

Noratrieb · 2023-07-13T12:52:54Z

library/core/src/slice/cmp.rs

-            return false;
-        }
+// and `memcmp` is not equivalent to our generic case
+#[cfg(not(target_env = "musl"))]


that cfg needs a clear comment (saying that musl's memcmp is not more optimized)

Thx for reviewing.
I'm planning on commenting more here and in pal.rs but I'm waiting to see how the debate on cfg in core goes.

Noratrieb · 2023-07-13T12:55:06Z

Let's run compiler benchmarks here too @bors try @rust-timer queue

bors · 2023-07-13T12:55:16Z

⌛ Trying commit f566552 with merge 982256677362bea322fd9fda00d76e80e2d72534...

bors · 2023-07-13T14:11:44Z

☀️ Try build successful - checks-actions
Build commit: 982256677362bea322fd9fda00d76e80e2d72534 (982256677362bea322fd9fda00d76e80e2d72534)

rust-timer · 2023-07-13T18:15:48Z

Finished benchmarking commit (982256677362bea322fd9fda00d76e80e2d72534): comparison URL.

Overall result: ❌✅ regressions and improvements - ACTION NEEDED

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please fix the regressions and do another perf run. If the next run shows neutral or positive results, the label will be automatically removed.

@bors rollup=never
@rustbot label: -S-waiting-on-perf +perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

	mean	range	count
Regressions ❌ (primary)	0.5%	[0.5%, 0.6%]	3
Regressions ❌ (secondary)	-	-	0
Improvements ✅ (primary)	-0.9%	[-1.5%, -0.5%]	17
Improvements ✅ (secondary)	-	-	0
All ❌✅ (primary)	-0.7%	[-1.5%, 0.6%]	20

Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

	mean	range	count
Regressions ❌ (primary)	1.4%	[1.1%, 1.8%]	2
Regressions ❌ (secondary)	-	-	0
Improvements ✅ (primary)	-2.2%	[-2.4%, -2.1%]	2
Improvements ✅ (secondary)	-	-	0
All ❌✅ (primary)	-0.4%	[-2.4%, 1.8%]	4

Cycles

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

	mean	range	count
Regressions ❌ (primary)	-	-	0
Regressions ❌ (secondary)	-	-	0
Improvements ✅ (primary)	-2.9%	[-5.0%, -1.2%]	10
Improvements ✅ (secondary)	-	-	0
All ❌✅ (primary)	-2.9%	[-5.0%, -1.2%]	10

Binary size

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

	mean	range	count
Regressions ❌ (primary)	0.2%	[0.1%, 0.3%]	3
Regressions ❌ (secondary)	-	-	0
Improvements ✅ (primary)	-0.5%	[-2.2%, -0.0%]	28
Improvements ✅ (secondary)	-	-	0
All ❌✅ (primary)	-0.4%	[-2.2%, 0.3%]	31

Bootstrap: 658.224s -> 655.457s (-0.42%)

Noratrieb · 2023-07-13T18:29:43Z

the bitmaps benchmark has been really noisy recently so at least half these improvements may not actually be improvements

krtab · 2023-07-20T08:01:27Z

I am willing to provide more benchmarks, especially on musl if needed, but I'd like to be sure that this won't be turned down on the ground of "no arch specific code in core" and before any performance consideration before doing so. Let me know what you feel is lacking for this to move forward.

krtab · 2023-10-17T14:51:14Z

#114382 makes the musl part of this PR irrelevant. To clarify the process, the T: !BytewiseEq is moved to its own PR: #116846.

Delete architecture-specific memchr code in std::sys Currently all architecture-specific memchr code is only used in `std::io`. Most of the actual `memchr` capacity exposed to the user through the slice API is instead implemented in `core::slice::memchr`. Hence this commit deletes `memchr` from `std::sys[_common]` and replace calls to it by calls to `core::slice::memchr` functions. This deletes `(r)memchr` from the list of symbols linked to libc. The interest of putting architecture specific code back in core is linked to the discussion to be had in rust-lang#113654

…bilee Delete architecture-specific memchr code in std::sys Currently all architecture-specific memchr code is only used in `std::io`. Most of the actual `memchr` capacity exposed to the user through the slice API is instead implemented in `core::slice::memchr`. Hence this commit deletes `memchr` from `std::sys[_common]` and replace calls to it by calls to `core::slice::memchr` functions. This deletes `(r)memchr` from the list of symbols linked to libc. The interest of putting architecture specific code back in core is linked to the discussion to be had in rust-lang#113654

…=<try> A more efficient slice comparison implementation for T: !BytewiseEq (This is a follow up PR on rust-lang#113654) This PR changes the implementation for `[T]` slice comparison when `T: !BytewiseEq`. The previous implementation using zip was not optimized properly by the compiler, which didn't leverage the fact that both length were equal. Performance improvements are for example 20% when testing that `[Some(0_u64); 4096].as_slice() == [Some(0_u64); 4096].as_slice()`.

…=the8472 A more efficient slice comparison implementation for T: !BytewiseEq (This is a follow up PR on rust-lang#113654) This PR changes the implementation for `[T]` slice comparison when `T: !BytewiseEq`. The previous implementation using zip was not optimized properly by the compiler, which didn't leverage the fact that both length were equal. Performance improvements are for example 20% when testing that `[Some(0_u64); 4096].as_slice() == [Some(0_u64); 4096].as_slice()`.

…bilee Delete architecture-specific memchr code in std::sys Currently all architecture-specific memchr code is only used in `std::io`. Most of the actual `memchr` capacity exposed to the user through the slice API is instead implemented in `core::slice::memchr`. Hence this commit deletes `memchr` from `std::sys[_common]` and replace calls to it by calls to `core::slice::memchr` functions. This deletes `(r)memchr` from the list of symbols linked to libc. The interest of putting architecture specific code back in core is linked to the discussion to be had in rust-lang#113654

A more efficient slice comparison implementation for T: !BytewiseEq

be64090

The previous implementation was not optimized properly by the compiler, which didn't leverage the fact that both length were equal.

rustbot assigned joshtriplett Jul 13, 2023

krtab changed the title ~~Non memcmp slice comparison otpimization~~ Non memcmp slice comparison optimization Jul 13, 2023

krtab force-pushed the slice_compare_no_memcmp_musl branch from 0659f11 to f566552 Compare July 13, 2023 12:26

Do not use memcmp on musl

f566552

Musl implementation is equivalent to our general case implementation. Not using it: - Allows for better optimization leveraging alignment information - Prevents a `call` - Likely favors inlining and further optimization

Noratrieb reviewed Jul 13, 2023

View reviewed changes

This comment has been minimized.

Sign in to view

rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Jul 13, 2023

This comment has been minimized.

Sign in to view

rustbot added perf-regression Performance regression. and removed S-waiting-on-perf Status: Waiting on a perf run to be completed. labels Jul 13, 2023

krtab mentioned this pull request Jul 24, 2023

Delete architecture-specific memchr code in std::sys #114016

Merged

dtolnay removed the T-bootstrap Relevant to the bootstrap subteam: Rust's build system (x.py and src/bootstrap) label Sep 27, 2023

krtab mentioned this pull request Oct 17, 2023

A more efficient slice comparison implementation for T: !BytewiseEq #116846

Merged

krtab closed this Oct 17, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Non memcmp slice comparison optimization #113654

Non memcmp slice comparison optimization #113654

Uh oh!

krtab commented Jul 13, 2023

Uh oh!

rustbot commented Jul 13, 2023

Uh oh!

krtab commented Jul 13, 2023

Uh oh!

Noratrieb Jul 13, 2023

Uh oh!

krtab Jul 13, 2023

Uh oh!

Noratrieb commented Jul 13, 2023

Uh oh!

This comment has been minimized.

bors commented Jul 13, 2023

Uh oh!

bors commented Jul 13, 2023

Uh oh!

This comment has been minimized.

rust-timer commented Jul 13, 2023

Uh oh!

Noratrieb commented Jul 13, 2023

Uh oh!

krtab commented Jul 20, 2023

Uh oh!

krtab commented Oct 17, 2023

Uh oh!

Uh oh!

Non memcmp slice comparison optimization #113654

Non memcmp slice comparison optimization #113654

Uh oh!

Conversation

krtab commented Jul 13, 2023

Uh oh!

rustbot commented Jul 13, 2023

Uh oh!

krtab commented Jul 13, 2023

Uh oh!

Noratrieb Jul 13, 2023

Choose a reason for hiding this comment

Uh oh!

krtab Jul 13, 2023

Choose a reason for hiding this comment

Uh oh!

Noratrieb commented Jul 13, 2023

Uh oh!

This comment has been minimized.

bors commented Jul 13, 2023

Uh oh!

bors commented Jul 13, 2023

Uh oh!

This comment has been minimized.

rust-timer commented Jul 13, 2023

Overall result: ❌✅ regressions and improvements - ACTION NEEDED

Instruction count

Max RSS (memory usage)

Cycles

Binary size

Uh oh!

Noratrieb commented Jul 13, 2023

Uh oh!

krtab commented Jul 20, 2023

Uh oh!

krtab commented Oct 17, 2023

Uh oh!

Uh oh!