Optimization for bitwise-copyable struct's clone
#100294
Comments
@DianQK Can you have a look?
There might be a couple of things going on here; here's one portion reduced further:

```llvm
; from the rust example, this is what `clone` emits
define void @missed_opt(ptr noalias sret([16 x i8]) align 4 dereferenceable(16) %p_out, ptr align 4 dereferenceable(16) %p_in) unnamed_addr {
start:
  %p_in_u32s = getelementptr inbounds i8, ptr %p_in, i64 8
  %p_out_u32s = getelementptr inbounds i8, ptr %p_out, i64 8
  %in_u8s = load <8 x i8>, ptr %p_in, align 4
  store <8 x i8> %in_u8s, ptr %p_out, align 4
  %in_u32s = load <2 x i32>, ptr %p_in_u32s, align 4
  store <2 x i32> %in_u32s, ptr %p_out_u32s, align 4
  ret void
}
```
```llvm
; manual reordering makes the optimization happen
define void @good_opt(ptr noalias sret([16 x i8]) align 4 dereferenceable(16) %p_out, ptr align 4 dereferenceable(16) %p_in) unnamed_addr {
start:
  %p_in_u32s = getelementptr inbounds i8, ptr %p_in, i64 8
  %p_out_u32s = getelementptr inbounds i8, ptr %p_out, i64 8
  %in_u8s = load <8 x i8>, ptr %p_in, align 4
  %in_u32s = load <2 x i32>, ptr %p_in_u32s, align 4
  store <8 x i8> %in_u8s, ptr %p_out, align 4
  store <2 x i32> %in_u32s, ptr %p_out_u32s, align 4
  ret void
}
```
```llvm
; for reference, this is the `Copy` version in Rust
define void @reference(ptr noalias sret([16 x i8]) align 4 dereferenceable(16) %_0, ptr align 4 dereferenceable(16) %v) unnamed_addr {
start:
  tail call void @llvm.memcpy.p0.p0.i64(ptr noundef nonnull align 4 dereferenceable(16) %_0, ptr noundef nonnull align 4 dereferenceable(16) %v, i64 16, i1 false)
  ret void
}
```
```llvm
declare void @llvm.memcpy.p0.p0.i64(ptr noalias nocapture writeonly, ptr noalias nocapture readonly, i64, i1 immarg) #1
```

The resulting assembly:

```asm
missed_opt: # @missed_opt
        mov rax, rdi
        mov rcx, qword ptr [rsi]
        mov qword ptr [rdi], rcx
        mov rcx, qword ptr [rsi + 8]
        mov qword ptr [rdi + 8], rcx
        ret
good_opt: # @good_opt
        mov rax, rdi
        vmovups xmm0, xmmword ptr [rsi]
        vmovups xmmword ptr [rdi], xmm0
        ret
reference: # @reference
        mov rax, rdi
        vmovups xmm0, xmmword ptr [rsi]
        vmovups xmmword ptr [rdi], xmm0
        ret
```

Manually grouping the Rust repro: https://godbolt.org/z/dozrn683f

edit: reduced a bit more
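For context, here is a minimal Rust sketch matching the shape of the IR above. The original report's struct is not shown in this thread, so the field layout below is an assumption inferred from the 16-byte, align-4 accesses: a non-`Copy` struct whose derived `Clone` copies field by field (like `missed_opt`), next to a `Copy` version that lowers to a single `memcpy` (like `reference`).

```rust
// Hypothetical reproduction, inferred from the IR above (not the exact
// struct from the original report): 16 bytes, align 4, with an 8-byte
// field at offset 0 and two u32s at offset 8.
#[derive(Clone, Debug, PartialEq)]
#[repr(C)]
struct ClonedOnly {
    bytes: [u8; 8],
    words: [u32; 2],
}

// Same layout, but also Copy: per the `reference` function in the
// thread, this version is lowered to a single 16-byte memcpy.
#[derive(Clone, Copy, Debug, PartialEq)]
#[repr(C)]
struct AlsoCopy {
    bytes: [u8; 8],
    words: [u32; 2],
}

fn main() {
    let a = ClonedOnly { bytes: [1; 8], words: [2, 3] };
    let b = a.clone(); // derived Clone: one load/store pair per field
    let c = AlsoCopy { bytes: [1; 8], words: [2, 3] };
    let d = c; // bitwise copy
    assert_eq!(a, b);
    assert_eq!(c, d);
}
```

Semantically the two copies are identical; the issue is only about the code the field-by-field `Clone` lowering produces.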
I believe
Purely for my own understanding: what makes the transformation difficult? It seems like an access of (I know nothing about how LLVM optimizes, and I realize this sounds potentially quadratic; I'm just curious what makes something easy or hard).
Don't mind me, I have a lot of LLVM knowledge to learn too. :) Here's my initial checklist of things to verify:

In fact, these checks might not be as tricky as they seem. One challenging issue is that
@dtcxzyw I would like to know how InstCombine typically handles
Sorry, that's not exactly right; we need to consider UB. I'll update later.
Thanks for the details :) I really wouldn't expect it to actually go to
I think that by default, this:

```llvm
%p_in_u32s = getelementptr inbounds i8, ptr %p_in, i64 8
%p_out_u32s = getelementptr inbounds i8, ptr %p_out, i64 8
%in_u8s = load <8 x i8>, ptr %p_in, align 4
store <8 x i8> %in_u8s, ptr %p_out, align 4
%in_u32s = load <2 x i32>, ptr %p_in_u32s, align 4
store <2 x i32> %in_u32s, ptr %p_out_u32s, align 4
```

would not be allowed to be rearranged into this:

```llvm
%p_in_u32s = getelementptr inbounds i8, ptr %p_in, i64 8
%p_out_u32s = getelementptr inbounds i8, ptr %p_out, i64 8
%in_u8s = load <8 x i8>, ptr %p_in, align 4
%in_u32s = load <2 x i32>, ptr %p_in_u32s, align 4
store <8 x i8> %in_u8s, ptr %p_out, align 4
store <2 x i32> %in_u32s, ptr %p_out_u32s, align 4
```

Because

So maybe this is just a missed chance to make use of
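To make the aliasing point concrete, here is a small sketch of why the two orderings are only interchangeable when the source and destination cannot overlap (this is my own illustration, not code from the thread): with overlapping regions, the store-then-load order reads back data the first store just wrote.

```rust
// Copy 16 bytes in two 8-byte halves, where dst overlaps src by 8 bytes.
// Illustrative only: helper names and the overlap layout are made up.

// Mirrors missed_opt: store the first half before loading the second,
// so the second load can observe the first store when regions overlap.
fn copy_store_then_load(buf: &mut [u8; 24], src: usize, dst: usize) {
    let lo: [u8; 8] = buf[src..src + 8].try_into().unwrap();
    buf[dst..dst + 8].copy_from_slice(&lo); // store first half
    let hi: [u8; 8] = buf[src + 8..src + 16].try_into().unwrap(); // may see that store!
    buf[dst + 8..dst + 16].copy_from_slice(&hi);
}

// Mirrors good_opt: both loads happen before either store.
fn copy_load_then_store(buf: &mut [u8; 24], src: usize, dst: usize) {
    let lo: [u8; 8] = buf[src..src + 8].try_into().unwrap();
    let hi: [u8; 8] = buf[src + 8..src + 16].try_into().unwrap();
    buf[dst..dst + 8].copy_from_slice(&lo);
    buf[dst + 8..dst + 16].copy_from_slice(&hi);
}

fn main() {
    let init: [u8; 24] = std::array::from_fn(|i| i as u8);
    let (mut a, mut b) = (init, init);
    copy_store_then_load(&mut a, 0, 8); // dst overlaps src's second half
    copy_load_then_store(&mut b, 0, 8);
    assert_ne!(a, b); // with overlap, the two orderings disagree
}
```

With `noalias` on the destination (as `sret` implies in the IR above), this overlap cannot happen, which is exactly what would license the reordering.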
Just playing around with the original example, which was a much bigger struct with an array (always memcpy'd) in the middle and at the end. I'm trying to see if there is any way I can manually reorder to get it to merge the loads and stores in some way; so far no luck: https://llvm.godbolt.org/z/6K9TPf1rn
I forgot that I was removing
I think this should be done in the backend, or after obtaining some target information. For me, just from the IR perspective, the heuristic here is that we can merge two loads and two stores.
Godbolt link
Related to this

According to the discussion, this may be something that LLVM could optimize.