Asserts the maximum value that can be returned from `Vec::len` #131193

Conversation
Have you figured out which expression fails to optimize? Maybe unchecked arithmetic could provide a more localized fix than adding an assume everywhere lengths are involved.

Well, let's see what the perf impact is, maybe the assume is fine.

@bors try @rust-timer queue
Asserts the maximum value that can be returned from `Vec::len`

Currently, casting `Vec<i32>` to `Vec<u32>` takes O(1) time:

```rust
// See <https://godbolt.org/z/hxq3hnYKG> for assembly output.
pub fn cast(vec: Vec<i32>) -> Vec<u32> {
    vec.into_iter().map(|e| e as _).collect()
}
```

But the generated assembly is not the same as the identity function, which prevents us from casting `Vec<Vec<i32>>` to `Vec<Vec<u32>>` within O(1) time:

```rust
// See <https://godbolt.org/z/7n48bxd9f> for assembly output.
pub fn cast(vec: Vec<Vec<i32>>) -> Vec<Vec<u32>> {
    vec.into_iter()
        .map(|e| e.into_iter().map(|e| e as _).collect())
        .collect()
}
```

This change tries to fix the problem. You can see the comparison here: <https://godbolt.org/z/jdManrKvx>.
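The first snippet from the description can be exercised directly; here is a minimal runnable version (the `main` function and the sample values are added for illustration):

```rust
// Element-wise i32 -> u32 cast; the in-place collect specialization
// lets this reuse the original allocation.
pub fn cast(vec: Vec<i32>) -> Vec<u32> {
    vec.into_iter().map(|e| e as _).collect()
}

fn main() {
    let v = vec![-1i32, 0, 1];
    let c = cast(v);
    // -1i32 reinterprets as u32::MAX under `as` casting.
    assert_eq!(c, vec![u32::MAX, 0, 1]);
    println!("{:?}", c);
}
```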
r? the8472 (oh, you did that already)

Fixing this seems desirable, but putting the check in

I think @scottmcm may have looked at optimizing on these invariants at some point.
I suspect the multiplication and division here, but I'm not very sure.

If performance is a concern, there is also a possible optimization to precompute the length upper bound, maybe something like:

```rust
pub trait SizedTypeProperties: Sized {
    const IS_ZST: bool = size_of::<Self>() == 0;
    const LAYOUT: Layout = Layout::new::<Self>();
    const MAX_VEC_LEN: usize =
        if Self::IS_ZST { usize::MAX } else { isize::MAX as usize / size_of::<Self>() };
}

impl<T> Vec<T> {
    pub fn len(&self) -> usize {
        unsafe { core::hint::assert_unchecked(self.len <= T::MAX_VEC_LEN) };
        self.len
    }
}
```

Also, there still is a remaining
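For illustration, a self-contained version of that sketch which compiles outside the standard library (the blanket impl and the toy `main` are assumptions of this sketch, not std API; the `LAYOUT` const is omitted since it needs `alloc::Layout`):

```rust
use std::mem::size_of;

// Hypothetical trait mirroring the proposal above: precompute the
// length upper bound as an associated const, so it is free at runtime.
trait SizedTypeProperties: Sized {
    const IS_ZST: bool = size_of::<Self>() == 0;
    const MAX_VEC_LEN: usize = if Self::IS_ZST {
        usize::MAX // a Vec of ZSTs can hold up to usize::MAX elements
    } else {
        // Allocations are capped at isize::MAX bytes.
        isize::MAX as usize / size_of::<Self>()
    };
}

impl<T> SizedTypeProperties for T {}

fn main() {
    assert_eq!(<() as SizedTypeProperties>::MAX_VEC_LEN, usize::MAX);
    assert_eq!(<u64 as SizedTypeProperties>::MAX_VEC_LEN, isize::MAX as usize / 8);
    println!("{}", <u64 as SizedTypeProperties>::MAX_VEC_LEN);
}
```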
It's not, since the maximum is the size in bytes but the len field is tracking the number of entries.
☀️ Try build successful - checks-actions
Finished benchmarking commit (0fb2e37): comparison URL.

Overall result: ❌✅ regressions and improvements - ACTION NEEDED

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @bors rollup=never

Instruction count: This is a highly reliable metric that was used to determine the overall result at the top of this comment.

Max RSS (memory usage): Results (primary 1.9%). This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

Cycles: Results (primary 1.4%, secondary 4.2%). This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

Binary size: Results (primary 0.1%, secondary 0.1%). This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

Bootstrap: 772.929s -> 774.321s (0.18%)
Well, that's mostly a regression, so a more localized or lightweight fix would be preferable.
The size is constant (so no need for a zero check) and the mul is unchecked so it already asserts it doesn't overflow. So I suspect it's probably some other arithmetic.
Indeed, I had exactly code for this in

You might get less of an impact by doing this change as

```rust
let len = self.len;
intrinsics::assume(len <= T::MAX_SLICE_LEN);
len
```

No branch, no duplicate projecting, and no

Though really, the good way to do this is to allow
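A compilable sketch of that load-once/assume/return pattern on a toy type (`MyVec`, `MAX_LEN`, and the use of the stable `core::hint::assert_unchecked` in place of the internal `intrinsics::assume` are all illustrative):

```rust
struct MyVec {
    len: usize,
}

// Stand-in for T::MAX_SLICE_LEN; the value for a 4-byte element type.
const MAX_LEN: usize = isize::MAX as usize / 4;

impl MyVec {
    fn len(&self) -> usize {
        // Load the field once, assume the invariant on the local,
        // then return it: no branch and no duplicate projection.
        let len = self.len;
        // SAFETY: a well-formed MyVec never stores more than MAX_LEN entries.
        unsafe { core::hint::assert_unchecked(len <= MAX_LEN) };
        len
    }
}

fn main() {
    let v = MyVec { len: 3 };
    assert_eq!(v.len(), 3);
    println!("{}", v.len());
}
```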
💔 Test failed - checks-actions
It seems that the codegen result of

```llvm
; Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(argmem: read, inaccessiblemem: write) uwtable
define void @noop(ptr noalias nocapture noundef readonly align 4 dereferenceable(12) %v) unnamed_addr #0 personality ptr @rust_eh_personality {
start:
  %0 = getelementptr inbounds i8, ptr %v, i32 8
  %_2.i = load i32, ptr %0, align 4, !alias.scope !2, !noundef !5
  %.not = icmp eq i32 %_2.i, 0
  br i1 %.not, label %bb3, label %"_ZN5alloc3vec16Vec$LT$T$C$A$GT$4push17h7892da45aee13302E.exit"

"_ZN5alloc3vec16Vec$LT$T$C$A$GT$4push17h7892da45aee13302E.exit": ; preds = %start
  %1 = add i32 %_2.i, -1
  %2 = load i32, ptr %v, align 4, !noundef !5
  %cond.i = icmp ult i32 %1, %2
  tail call void @llvm.assume(i1 %cond.i)
  br label %bb3

bb3: ; preds = %start, %"_ZN5alloc3vec16Vec$LT$T$C$A$GT$4push17h7892da45aee13302E.exit"
  ret void
}
```
With

```llvm
; Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(argmem: read, inaccessiblemem: write)
define dso_local void @noop(ptr noalias nocapture noundef readonly align 4 dereferenceable(12) %v) unnamed_addr #0 {
start:
  %0 = getelementptr inbounds i8, ptr %v, i32 8
  %_2.i = load i32, ptr %0, align 4, !alias.scope !1, !noundef !4
  %.not = icmp eq i32 %_2.i, 0
  br i1 %.not, label %bb3, label %bb7

bb3: ; preds = %start, %bb7
  ret void

bb7: ; preds = %start
  %1 = add i32 %_2.i, -1
  %2 = load i32, ptr %v, align 4, !noundef !4
  %cond.i = icmp ult i32 %1, %2
  tail call void @llvm.assume(i1 %cond.i)
  br label %bb3
}
```

It seems the order of the last two basic blocks is reversed with

I see two fixes for codegen tests:
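One common way to make codegen tests tolerant of basic-block ordering is FileCheck's `CHECK-DAG` directive, which matches lines in any order; a hedged sketch (the function body and the directive lines below are illustrative, not the actual test from the repository):

```rust
// Rust codegen tests are Rust files carrying FileCheck directives in
// comments. CHECK-DAG lines may match in any relative order, so a
// reversed block layout no longer fails the test.
// CHECK-LABEL: @noop
// CHECK-DAG: tail call void @llvm.assume
// CHECK-DAG: ret void
#[no_mangle]
pub fn noop(v: &mut Vec<u32>) {
    v.pop();
}
```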
Hi @the8472, I have fixed the CI failure; would you like to try again?
@bors r+
☀️ Test successful - checks-actions
Finished benchmarking commit (303e8bd): comparison URL.

Overall result: ❌ regressions - please read the text below

Our benchmarks found a performance regression caused by this PR.

Next Steps:

@rustbot label: +perf-regression

Instruction count: This is the most reliable metric that we have; it was used to determine the overall result at the top of this comment. However, even this metric can sometimes exhibit noise.

Max RSS (memory usage): Results (primary -0.0%, secondary 2.2%). This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

Cycles: Results (primary -0.9%). This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

Binary size: Results (primary 0.0%, secondary -0.1%). This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

Bootstrap: 762.237s -> 765.312s (0.40%)
The post-merge regression seems to be much larger than in the previous runs. Is this regression expected? It seems to be mostly in opt builds, where it was probably expected, but it also appears in check builds.
I'm surprised the pre-merge one was so small. The similar things I'd tried in the past, like #122926 (comment) and #123575 (comment), both had results like the post-merge one here. Interesting that for the opt-full regression on image, the biggest icount one, the wall time came out way better due to permuting the codegen schedule:
Thanks for taking a look. I agree that this is something that is expected to regress (especially opt) performance, while potentially giving better codegen.

@rustbot label: +perf-regression-triaged
FWIW, I didn't end up merging my previous attempts at these things, so I don't know if the icount change is acceptable here. We don't have good data about this kind of "slower to compile but hopefully faster at runtime" change. This being entirely red even for check builds -- which wouldn't have the same expected codegen impact -- is less than great.