
Optimization for bitwise-copyable struct's clone #100294

Closed
CrazyboyQCD opened this issue Jul 24, 2024 · 11 comments
Labels: llvm:optimizations, missed-optimization, question

@CrazyboyQCD

Godbolt link
Related to this.
According to the discussion, this may be something that LLVM could optimize.
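
For context, the struct involved looks roughly like this (a hypothetical reconstruction; the field layout is inferred from the IR quoted in the comments below):

```rust
// Hypothetical repro: 8 x u8 followed by 2 x u32, 16 bytes total.
#[derive(Clone)]
pub struct Foo {
    pub a: [u8; 8],
    pub b: [u32; 2],
}

// The derived `clone` copies field by field (separate load/store pairs),
// while the same struct with `#[derive(Copy, Clone)]` compiles to a single
// 16-byte `llvm.memcpy`.
pub fn clone_it(v: &Foo) -> Foo {
    v.clone()
}
```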

@dtcxzyw (Member) commented Jul 24, 2024

@DianQK Can you have a look?

@tgross35 commented Jul 24, 2024

There might be a couple of things going on here; here's one portion, reduced further:

```llvm
; from the rust example, this is what `clone` emits
define void @missed_opt(ptr noalias sret([16 x i8]) align 4 dereferenceable(16) %p_out, ptr align 4 dereferenceable(16) %p_in) unnamed_addr {
start:
  %p_in_u32s = getelementptr inbounds i8, ptr %p_in, i64 8
  %p_out_u32s = getelementptr inbounds i8, ptr %p_out, i64 8
  %in_u8s = load <8 x i8>, ptr %p_in, align 4
  store <8 x i8> %in_u8s, ptr %p_out, align 4
  %in_u32s = load <2 x i32>, ptr %p_in_u32s, align 4
  store <2 x i32> %in_u32s, ptr %p_out_u32s, align 4
  ret void
}

; manual reordering makes the optimization happen
define void @good_opt(ptr noalias sret([16 x i8]) align 4 dereferenceable(16) %p_out, ptr align 4 dereferenceable(16) %p_in) unnamed_addr {
start:
  %p_in_u32s = getelementptr inbounds i8, ptr %p_in, i64 8
  %p_out_u32s = getelementptr inbounds i8, ptr %p_out, i64 8
  %in_u8s = load <8 x i8>, ptr %p_in, align 4
  %in_u32s = load <2 x i32>, ptr %p_in_u32s, align 4
  store <8 x i8> %in_u8s, ptr %p_out, align 4
  store <2 x i32> %in_u32s, ptr %p_out_u32s, align 4
  ret void
}

; reference, this is the `Copy` version in Rust
define void @reference(ptr noalias sret([16 x i8]) align 4 dereferenceable(16) %_0, ptr align 4 dereferenceable(16) %v) unnamed_addr {
start:
  tail call void @llvm.memcpy.p0.p0.i64(ptr noundef nonnull align 4 dereferenceable(16) %_0, ptr noundef nonnull align 4 dereferenceable(16) %v, i64 16, i1 false)
  ret void
}

declare void @llvm.memcpy.p0.p0.i64(ptr noalias nocapture writeonly, ptr noalias nocapture readonly, i64, i1 immarg) #1
```
```asm
missed_opt:                             # @missed_opt
        mov     rax, rdi
        mov     rcx, qword ptr [rsi]
        mov     qword ptr [rdi], rcx
        mov     rcx, qword ptr [rsi + 8]
        mov     qword ptr [rdi + 8], rcx
        ret
good_opt:                               # @good_opt
        mov     rax, rdi
        vmovups xmm0, xmmword ptr [rsi]
        vmovups xmmword ptr [rdi], xmm0
        ret
reference:                              # @reference
        mov     rax, rdi
        vmovups xmm0, xmmword ptr [rsi]
        vmovups xmmword ptr [rdi], xmm0
        ret
```

Manually grouping the loads together is enough of a change that LLVM treats it the same as the memcpy. I think this reordering should be allowed by noalias?

Rust repro: https://godbolt.org/z/dozrn683f
IR version: https://llvm.godbolt.org/z/E7fMzGqTb

edit: reduced a bit more

@DianQK (Member) commented Jul 24, 2024

I believe store and memcpy are basically interchangeable. However, merging multiple stores into a single memcpy would be more challenging. I would consider doing this transformation during the codegen phase in Rust instead.
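
As a point of comparison (not necessarily where the fix belongs): for a type that is also Copy, a hand-written clone returning *self already lowers to the memcpy form shown in the reference function above. A minimal sketch, reusing the hypothetical Foo from the issue description:

```rust
#[derive(Copy)]
pub struct Foo {
    pub a: [u8; 8],
    pub b: [u32; 2],
}

impl Clone for Foo {
    fn clone(&self) -> Self {
        // Copying `*self` compiles to a single `llvm.memcpy` rather than
        // the per-field load/store pairs that a field-wise clone emits.
        *self
    }
}
```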

@tgross35
Purely for my own understanding - what makes the transformation difficult? It seems like an access of ptr..(ptr+N) and an access at (ptr+M)..(ptr+W) should be fairly easy to identify as contiguous if you can verify N == M. And then, if contiguous loads correspond to contiguous stores at the same addresses, they could be represented as a single copy operation.

(I know nothing about how LLVM optimizes and realize this sounds potentially quadratic, just curious what makes something easy or hard).
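
A sketch of the check described above, with each memory access summarized as a (base, offset, size) triple (hypothetical types, nothing from LLVM's actual API):

```rust
#[derive(Clone, Copy, PartialEq, Eq)]
struct Access {
    base: usize, // stand-in for the underlying object/pointer identity
    offset: u64, // byte offset from the base
    size: u64,   // bytes accessed
}

/// Two load/store pairs form one larger copy when the second pair starts
/// exactly where the first ends, on both the source and the destination.
fn mergeable(src0: Access, dst0: Access, src1: Access, dst1: Access) -> bool {
    src0.base == src1.base
        && dst0.base == dst1.base
        && src0.size == dst0.size
        && src1.size == dst1.size
        && src0.offset + src0.size == src1.offset // "N == M" on the source side
        && dst0.offset + dst0.size == dst1.offset // and on the destination side
}
```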

@DianQK (Member) commented Jul 25, 2024

> Purely for my own understanding - what makes the transformation difficult? […]

Don't mind me, I have a lot of LLVM knowledge to learn too. :)

Here's my initial checklist of things to verify:

1. N == M (destination and source)
2. There are no other write operations between `copy(..N)` and `copy(M..)` (reads are allowed)
3. Ignore violations
4. Merge align?
5. Handle the logic for multiple consecutive copies, possibly using a work list (I believe this is linear; sketched below)

In fact, these checks might not be as tricky as they seem. One challenging issue is that LLVM has far fewer optimizations for memcpy than for store, so wherever we merge into memcpy, we need to address that gap. Transforming memcpy back into stores, by contrast, requires no checks; it's just that such transforms are applied less often where they're needed.
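
A sketch of item 5 above: one linear pass over the copy operations, sorted by source offset, assuming checks 1-3 have already passed (hypothetical types, not LLVM's):

```rust
struct CopyOp {
    src_off: u64,
    dst_off: u64,
    len: u64,
}

/// Merge runs of adjacent copies in a single linear pass. Assumes `copies`
/// is sorted by `src_off` and that no other writes intervene.
fn merge_adjacent(copies: &[CopyOp]) -> Vec<CopyOp> {
    let mut merged: Vec<CopyOp> = Vec::new();
    for c in copies {
        match merged.last_mut() {
            // The next copy starts exactly where the previous one ends on
            // both the source and destination sides: extend the run.
            Some(prev)
                if prev.src_off + prev.len == c.src_off
                    && prev.dst_off + prev.len == c.dst_off =>
            {
                prev.len += c.len;
            }
            _ => merged.push(CopyOp {
                src_off: c.src_off,
                dst_off: c.dst_off,
                len: c.len,
            }),
        }
    }
    merged
}
```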

@DianQK (Member) commented Jul 25, 2024

@dtcxzyw I would like to know how InstCombine typically handles memcpy. Thanks~

@DianQK (Member) commented Jul 25, 2024

> I believe store and memcpy are basically interchangeable.

Sorry, that's not exactly right; we need to consider UB. I'll update later.

@tgross35 commented Jul 25, 2024

> Don't mind me, I have a lot of LLVM knowledge to learn too. :)
>
> Here's my initial checklist of things to verify: […]
>
> In fact, these checks might not be as tricky as they seem. […]

Thanks for the details :) I really wouldn't expect it to emit an actual call to the memcpy routine unless heuristics say the copy is large enough to be worth it (which I'm sure is already implemented). But combining loads and stores before other passes run seems like there may be some wins.

> I believe store and memcpy are basically interchangeable.
>
> Sorry, that's not exactly right; we need to consider UB. I'll update later.

I think that by default, this:

```llvm
  %p_in_u32s = getelementptr inbounds i8, ptr %p_in, i64 8
  %p_out_u32s = getelementptr inbounds i8, ptr %p_out, i64 8
  %in_u8s = load <8 x i8>, ptr %p_in, align 4
  store <8 x i8> %in_u8s, ptr %p_out, align 4
  %in_u32s = load <2 x i32>, ptr %p_in_u32s, align 4
  store <2 x i32> %in_u32s, ptr %p_out_u32s, align 4
```

would not be allowed to be rearranged into this:

```llvm
  %p_in_u32s = getelementptr inbounds i8, ptr %p_in, i64 8
  %p_out_u32s = getelementptr inbounds i8, ptr %p_out, i64 8
  %in_u8s = load <8 x i8>, ptr %p_in, align 4
  %in_u32s = load <2 x i32>, ptr %p_in_u32s, align 4
  store <8 x i8> %in_u8s, ptr %p_out, align 4
  store <2 x i32> %in_u32s, ptr %p_out_u32s, align 4
```

Because p_out could point somewhere into p_in_u32s. However, this example has noalias on p_out (in the original Rust example p_in has it too; I think only one of the two arguments needs it here). I think noalias should allow that reordering, when beneficial, since nothing LLVM doesn't know about touches p_in.

So maybe this is just a missed chance to make use of noalias?
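
To make the hazard concrete, here's a small safe-Rust stand-in (hypothetical, mirroring the 8+8-byte layout above) where the destination overlaps the source's tail, so the two orderings give different results:

```rust
// in = buf[0..16], out = buf[8..24]; the two ranges overlap in buf[8..16].
fn interleaved(buf: &mut [u8; 24]) {
    let lo: [u8; 8] = buf[0..8].try_into().unwrap();
    buf[8..16].copy_from_slice(&lo); // store to out[0..8] clobbers in[8..16]
    let hi: [u8; 8] = buf[8..16].try_into().unwrap(); // loads clobbered bytes
    buf[16..24].copy_from_slice(&hi);
}

fn grouped(buf: &mut [u8; 24]) {
    let lo: [u8; 8] = buf[0..8].try_into().unwrap();
    let hi: [u8; 8] = buf[8..16].try_into().unwrap(); // both loads first
    buf[8..16].copy_from_slice(&lo);
    buf[16..24].copy_from_slice(&hi);
}

fn main() {
    let mut a: [u8; 24] = core::array::from_fn(|i| i as u8);
    let mut b = a;
    interleaved(&mut a);
    grouped(&mut b);
    assert_ne!(a, b); // observable difference, so the reorder needs noalias
}
```

With noalias on the output pointer this overlap can't happen, which is why the reordering should be legal there.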

@tgross35
Just playing around with the original example, which was a much bigger struct with an array (always a memcpy) in the middle and at the end. I'm trying to see whether there's any way to manually reorder things so it merges the loads and stores; so far no luck: https://llvm.godbolt.org/z/6K9TPf1rn

@DianQK (Member) commented Jul 26, 2024

I forgot that I had removed noalias in my alive2 check. When transforming memcpy into a load, we have to drop noundef: https://alive2.llvm.org/ce/z/u3iLZF. However, I'm not sure whether noundef is useful in this particular context.

@DianQK (Member) commented Jul 26, 2024

> I believe store and memcpy are basically interchangeable.
>
> Sorry, that's not exactly right; we need to consider UB. I'll update later.
>
> I think that by default, this: […] would not be allowed to be rearranged into this: […] because p_out could point somewhere into p_in_u32s. […] So maybe this is just a missed chance to make use of noalias?

I think this should be done in the backend, or after obtaining some target information. For me, just from the IR perspective, the heuristic here is that we can merge two loads and two stores.

@EugeneZelenko added the question label on Sep 18, 2024