-
Notifications
You must be signed in to change notification settings - Fork 12.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Vec::push
should load a passed-by-pointer argument *after* reserve_for_push
#115242
Comments
Would using coldcc for reserve_for_push help? |
Looking at what happens in each version (LLVM IR greatly simplified/pseudo-ified for exposition)... 1.71: https://godbolt.org/z/9rY5K3v8c We generate: define void @push(ptr %vec, ptr %array) {
%alloca = alloca [16 x i8]
call void @memcpy(%alloca, %array)
call void @Vec::push(%vec, %alloca)
ret void
} After inlining: define void @push(ptr %vec, ptr %array) {
%alloca = alloca [16 x i8]
call void @memcpy(%alloca, %array)
call_reserve:
call void @Vec::reserve_for_push(%vec)
exit:
call void @memcpy(%vec.buf, %alloca)
ret void
} Then instcombine optimizes out the intermediate alloca, and we end up with a memcpy directly from the argument to the vec: define void @push(ptr %vec, ptr %array) {
call void @Vec::reserve_for_push(%vec)
exit:
call void @memcpy(%vec.buf, %array)
ret void
} 1.72: https://godbolt.org/z/T17znjnxG We generate: define void @push(ptr %vec, ptr %array) {
%alloca = alloca [16 x i8]
%tmp = load %array
store %array, %alloca
call void @Vec::push(%vec, %alloca)
ret void
} After inlining: define void @push(ptr %vec, ptr %array) {
%alloca = alloca [16 x i8]
%tmp = load %array
store %array, %alloca
%tmp2 = load %alloca
call_reserve:
call void @Vec::reserve_for_push(%vec)
exit:
store %tmp2, %vec.buf
ret void
} Then SROA optimizes out the intermediate alloca, and we're left with one of the loads in the entry block: define void @push(ptr %vec, ptr %array) {
%tmp2 = load %array
call_reserve:
call void @Vec::reserve_for_push(%vec)
exit:
store %tmp2, %vec.buf
ret void
} nightly: https://godbolt.org/z/Tb38hGPjE Unlike 1.71 and 1.72, before inlining there is no intermediate alloca. We directly pass the array ptr to define void @push(ptr %vec, ptr %array) {
call void @Vec::push(%vec, %array)
ret void
} I believe this would result in ideal codegen*...except, argpromotion converts the array to be passed by value: define void @push(ptr %vec, ptr %array) {
%tmp = load %array
call void @Vec::push(%vec, %tmp)
ret void
} Then, when define void @push(ptr %vec, ptr %array) {
%tmp = load %array
call_reserve:
call void @Vec::reserve_for_push(%vec)
exit:
store %tmp, %vec.buf
ret void
} * because, before argpromotion, `Vec::push` LLVM IRdefine internal fastcc void @"_ZN5alloc3vec16Vec$LT$T$C$A$GT$4push17h69513e07f1ac3cddE"(ptr noalias noundef align 8 dereferenceable(24) %self, ptr noalias nocapture noundef align 1 dereferenceable(16) %value) unnamed_addr #0 personality ptr @rust_eh_personality {
start:
%0 = getelementptr inbounds %"alloc::vec::Vec<[u8; 16]>", ptr %self, i64 0, i32 1
%_4 = load i64, ptr %0, align 8
%1 = getelementptr inbounds { ptr, i64 }, ptr %self, i64 0, i32 1
%2 = load i64, ptr %1, align 8
%_3 = icmp eq i64 %_4, %2
br i1 %_3, label %bb1, label %bb3
bb3: ; preds = %bb1, %start
%self1 = load ptr, ptr %self, align 8
%count = load i64, ptr %0, align 8
%end = getelementptr inbounds [16 x i8], ptr %self1, i64 %count
%3 = load <16 x i8>, ptr %value, align 1
store <16 x i8> %3, ptr %end, align 1
%4 = load i64, ptr %0, align 8
%5 = add i64 %4, 1
store i64 %5, ptr %0, align 8
ret void
bb1: ; preds = %start
call fastcc void @"_ZN5alloc7raw_vec19RawVec$LT$T$C$A$GT$16reserve_for_push17h2c9a8b8fbc06deffE"(ptr noalias noundef nonnull align 8 dereferenceable(16) %self, i64 noundef %_4)
br label %bb3
} Note the adjacent: %3 = load <16 x i8>, ptr %value, align 1
store <16 x i8> %3, ptr %end, align 1 So, to resolve this, I believe LLVM either:
The latter makes sense to me even just in terms of compile time (argpromotion should do much less work after inlining), but maybe there's a good reason the pipeline is ordered that way. |
Actually, argpromotion may not be the culprit. It only looks that way because our godbolt examples export In real programs, this wouldn't be the case, and argpromotion would be able to change the signature all the way up to the original value being pushed, which will either be:
|
@the8472 Yeah, that's what I was thinking about when I said tweaking calling conventions 🙂 I tried it out today, and it does make the push look nicer -- no more _ZN5alloc7raw_vec19RawVec$LT$T$C$A$GT$16reserve_for_push17h78533106871c2e4bE:
.seh_proc _ZN5alloc7raw_vec19RawVec$LT$T$C$A$GT$16reserve_for_push17h78533106871c2e4bE
push r11
.seh_pushreg r11
push r10
.seh_pushreg r10
push r9
.seh_pushreg r9
push r8
.seh_pushreg r8
push rdi
.seh_pushreg rdi
push rsi
.seh_pushreg rsi
push rdx
.seh_pushreg rdx
push rcx
.seh_pushreg rcx
sub rsp, 184
.seh_stackalloc 184
movaps xmmword ptr [rsp + 160], xmm5
.seh_savexmm xmm5, 160
movaps xmmword ptr [rsp + 144], xmm4
.seh_savexmm xmm4, 144
movaps xmmword ptr [rsp + 128], xmm3
.seh_savexmm xmm3, 128
movaps xmmword ptr [rsp + 112], xmm2
.seh_savexmm xmm2, 112
movaps xmmword ptr [rsp + 96], xmm1
.seh_savexmm xmm1, 96
movaps xmmword ptr [rsp + 80], xmm0
.seh_savexmm xmm0, 80
.seh_endprologue So I'm not confident it's the right choice, but I tossed it up in #115256 to see what perf says regardless. As @reinerp mentioned in #97544 (comment), this probably wants something that doesn't have to save-and-restore the simd registers, but |
Oh, for extra "fun", this is I was wondering why it wasn't reproing for me locally, and it was because I was trying with |
Looks like the opt-level dependence is because there's no argpromotion at opt-level=2 (https://rust.godbolt.org/z/f44xTqnG4), so my hypothesis was right. (but this is no solace for real code, as per #115242 (comment))
I wouldn't worry too much about this. Unlike |
Ah, thanks for the pointer, @erikdesjardins . It originally looked horrible locally, but it seems that's just because Windows is sadness -- I opened https://discourse.llvm.org/t/conv-c-and-conv-preservemost-mix-badly-on-windows-x64/73054?u=scottmcm to ask about it -- and if I look at calls from the SysV ABI instead it works great.
EDIT: Ah, looking at this I think I did the convention add PR wrong last time (#97512). PR up where I found the right place to put the "don't do this on windows" conversion: #115260 |
…Lapkin Use `preserve_mostcc` for `extern "rust-cold"` As experimentation in rust-lang#115242 has shown looks better than `coldcc`. Notably, clang exposes `preserve_most` (https://clang.llvm.org/docs/AttributeReference.html#preserve-most) but not `cold`, so this change should put us on a better-supported path. And *don't* use a different convention for cold on Windows, because that actually ends up making things worse. (See comment in the code.) cc tracking issue rust-lang#97544
Since the preserve_mostcc PR has landed I tried to apply #115256 locally but it doesn't appear to work. The register still gets spilled vec_push:
.cfi_startproc
subq $24, %rsp
.cfi_def_cfa_offset 32
movups (%rsi), %xmm0
movq 16(%rdi), %rsi
cmpq 8(%rdi), %rsi
jne .LBB2_2
movaps %xmm0, (%rsp)
callq _ZN5alloc7raw_vec19RawVec$LT$T$C$A$GT$16reserve_for_push17hf9b0ac0abb2ba3edE
movaps (%rsp), %xmm0
movq 16(%rdi), %rsi
.LBB2_2:
movq (%rdi), %rax
movq %rsi, %rcx
shlq $4, %rcx
movups %xmm0, (%rax,%rcx)
incq %rsi
movq %rsi, 16(%rdi)
addq $24, %rsp
.cfi_def_cfa_offset 8
retq define internal preserve_mostcc void @"_ZN5alloc7raw_vec19RawVec$LT$T$C$A$GT$16reserve_for_push17hf9b0ac0abb2ba3edE"
[...]
; Function Attrs: nonlazybind uwtable
define void @vec_push(ptr noalias nocapture noundef align 8 dereferenceable(24) %vec, ptr noalias nocapture noundef readonly align 1 dereferenceable(16) %data) unnamed_addr #2 personality ptr @rust_eh_personality {
start:
%data.val = load <16 x i8>, ptr %data, align 1
tail call void @llvm.experimental.noalias.scope.decl(metadata !18)
%0 = getelementptr inbounds %"alloc::vec::Vec<[u8; 16]>", ptr %vec, i64 0, i32 1
%_4.i = load i64, ptr %0, align 8, !alias.scope !18, !noundef !4
%1 = getelementptr inbounds { ptr, i64 }, ptr %vec, i64 0, i32 1
%2 = load i64, ptr %1, align 8, !alias.scope !18, !noundef !4
%_3.i = icmp eq i64 %_4.i, %2
br i1 %_3.i, label %bb1.i, label %"_ZN5alloc3vec16Vec$LT$T$C$A$GT$4push17ha4b516e01385232bE.exit"
bb1.i: ; preds = %start
; call alloc::raw_vec::RawVec<T,A>::reserve_for_push
tail call preserve_mostcc void @"_ZN5alloc7raw_vec19RawVec$LT$T$C$A$GT$16reserve_for_push17hf9b0ac0abb2ba3edE"(ptr noalias noundef nonnull align 8 dereferenceable(16) %vec, i64 noundef %_4.i)
%count.pre.i = load i64, ptr %0, align 8, !alias.scope !18
br label %"_ZN5alloc3vec16Vec$LT$T$C$A$GT$4push17ha4b516e01385232bE.exit"
"_ZN5alloc3vec16Vec$LT$T$C$A$GT$4push17ha4b516e01385232bE.exit": ; preds = %start, %bb1.i
%count.i = phi i64 [ %count.pre.i, %bb1.i ], [ %_4.i, %start ]
%self1.i = load ptr, ptr %vec, align 8, !alias.scope !18, !nonnull !4, !noundef !4
%end.i = getelementptr inbounds [16 x i8], ptr %self1.i, i64 %count.i
store <16 x i8> %data.val, ptr %end.i, align 1, !noalias !18
%3 = add i64 %count.i, 1
store i64 %3, ptr %0, align 8, !alias.scope !18
ret void
} |
( Inspired by @erikdesjardins 's example in #115212 (comment) )
When pushing a
[u8; 16]
(which is passed indirectly)On nightly today we load
data
to an xmm register, then spill it to stack to callreserve_for_push
, then load it again https://godbolt.org/z/4jTK4r3KoWe should make the stack adjustment unnecessary here -- ideally by loading it after the
reserve_for_push
, but other things like tweaking calling conventions could help too.The text was updated successfully, but these errors were encountered: