
Optimization for bitwise-copyable struct's clone #100294

Closed
CrazyboyQCD opened this issue Jul 24, 2024 · 11 comments
Labels: llvm:optimizations, missed-optimization, question

@CrazyboyQCD

Godbolt link
Related to this.
According to the discussion, this may be something that LLVM could optimize.
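
For context, the struct involved looks roughly like this (a hypothetical reconstruction; the field layout is inferred from the IR quoted in the comments below):

```rust
// Hypothetical repro: 8 x u8 followed by 2 x u32, 16 bytes total.
#[derive(Clone)]
pub struct Foo {
    pub a: [u8; 8],
    pub b: [u32; 2],
}

// The derived `clone` copies field by field (separate load/store pairs),
// while the same struct with `#[derive(Copy, Clone)]` compiles to a single
// 16-byte `llvm.memcpy`.
pub fn clone_it(v: &Foo) -> Foo {
    v.clone()
}
```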

@dtcxzyw (Member) commented Jul 24, 2024

@DianQK Can you have a look?

@tgross35 commented Jul 24, 2024

There might be a couple of things going on here; here's one portion, reduced further:

```llvm
; from the rust example, this is what `clone` emits
define void @missed_opt(ptr noalias sret([16 x i8]) align 4 dereferenceable(16) %p_out, ptr align 4 dereferenceable(16) %p_in) unnamed_addr {
start:
  %p_in_u32s = getelementptr inbounds i8, ptr %p_in, i64 8
  %p_out_u32s = getelementptr inbounds i8, ptr %p_out, i64 8
  %in_u8s = load <8 x i8>, ptr %p_in, align 4
  store <8 x i8> %in_u8s, ptr %p_out, align 4
  %in_u32s = load <2 x i32>, ptr %p_in_u32s, align 4
  store <2 x i32> %in_u32s, ptr %p_out_u32s, align 4
  ret void
}

; manual reordering makes the optimization happen
define void @good_opt(ptr noalias sret([16 x i8]) align 4 dereferenceable(16) %p_out, ptr align 4 dereferenceable(16) %p_in) unnamed_addr {
start:
  %p_in_u32s = getelementptr inbounds i8, ptr %p_in, i64 8
  %p_out_u32s = getelementptr inbounds i8, ptr %p_out, i64 8
  %in_u8s = load <8 x i8>, ptr %p_in, align 4
  %in_u32s = load <2 x i32>, ptr %p_in_u32s, align 4
  store <8 x i8> %in_u8s, ptr %p_out, align 4
  store <2 x i32> %in_u32s, ptr %p_out_u32s, align 4
  ret void
}

; reference, this is the `Copy` version in Rust
define void @reference(ptr noalias sret([16 x i8]) align 4 dereferenceable(16) %_0, ptr align 4 dereferenceable(16) %v) unnamed_addr {
start:
  tail call void @llvm.memcpy.p0.p0.i64(ptr noundef nonnull align 4 dereferenceable(16) %_0, ptr noundef nonnull align 4 dereferenceable(16) %v, i64 16, i1 false)
  ret void
}

declare void @llvm.memcpy.p0.p0.i64(ptr noalias nocapture writeonly, ptr noalias nocapture readonly, i64, i1 immarg) #1
```
```asm
missed_opt:                             # @missed_opt
        mov     rax, rdi
        mov     rcx, qword ptr [rsi]
        mov     qword ptr [rdi], rcx
        mov     rcx, qword ptr [rsi + 8]
        mov     qword ptr [rdi + 8], rcx
        ret
good_opt:                               # @good_opt
        mov     rax, rdi
        vmovups xmm0, xmmword ptr [rsi]
        vmovups xmmword ptr [rdi], xmm0
        ret
reference:                              # @reference
        mov     rax, rdi
        vmovups xmm0, xmmword ptr [rsi]
        vmovups xmmword ptr [rdi], xmm0
        ret
```

Manually grouping the loads together is enough of a change that LLVM treats it the same as the memcpy. I think this reordering should be allowed by noalias?

Rust repro: https://godbolt.org/z/dozrn683f
IR version: https://llvm.godbolt.org/z/E7fMzGqTb

edit: reduced a bit more

@DianQK (Member) commented Jul 24, 2024

I believe store and memcpy are basically interchangeable. However, merging multiple stores into a single memcpy would be more challenging. I would consider doing this transformation during the codegen phase in Rust instead.
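
As a point of comparison (not necessarily where the fix belongs): for a type that is also Copy, a hand-written clone returning *self already lowers to the memcpy form shown in the reference function above. A minimal sketch, reusing the hypothetical Foo from the issue description:

```rust
#[derive(Copy)]
pub struct Foo {
    pub a: [u8; 8],
    pub b: [u32; 2],
}

impl Clone for Foo {
    fn clone(&self) -> Self {
        // Copying `*self` compiles to a single `llvm.memcpy` rather than
        // the per-field load/store pairs that a field-wise clone emits.
        *self
    }
}
```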

@tgross35
Purely for my own understanding - what makes the transformation difficult? It seems like an access of ptr..(ptr+N) and an access at (ptr+M)..(ptr+W) should be fairly easy to identify as contiguous if you can verify N == M. And then, if contiguous loads correspond to contiguous stores at the same addresses, they could be represented as a single copy operation.

(I know nothing about how LLVM optimizes and realize this sounds potentially quadratic, just curious what makes something easy or hard).
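
A sketch of the check described above, with each memory access summarized as a (base, offset, size) triple (hypothetical types, nothing from LLVM's actual API):

```rust
#[derive(Clone, Copy, PartialEq, Eq)]
struct Access {
    base: usize, // stand-in for the underlying object/pointer identity
    offset: u64, // byte offset from the base
    size: u64,   // bytes accessed
}

/// Two load/store pairs form one larger copy when the second pair starts
/// exactly where the first ends, on both the source and the destination.
fn mergeable(src0: Access, dst0: Access, src1: Access, dst1: Access) -> bool {
    src0.base == src1.base
        && dst0.base == dst1.base
        && src0.size == dst0.size
        && src1.size == dst1.size
        && src0.offset + src0.size == src1.offset // "N == M" on the source side
        && dst0.offset + dst0.size == dst1.offset // and on the destination side
}
```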

@DianQK (Member) commented Jul 25, 2024

> Purely for my own understanding - what makes the transformation difficult? […]

Don't mind me, I have a lot of LLVM knowledge to learn too. :)

Here's my initial checklist of things to verify:

1. N == M (destination and source)
2. There are no other write operations between `copy(..N)` and `copy(M..)` (reads are allowed)
3. Ignore violations
4. Merge align?
5. Handle the logic for multiple consecutive copies, possibly using a work list (I believe this is linear; sketched below)

In fact, these checks might not be as tricky as they seem. One challenging issue is that LLVM has far fewer optimizations for memcpy than for store, so wherever we merge into memcpy, we need to address that gap. Transforming memcpy back into stores, by contrast, requires no checks; it's just that such transforms are applied less often where they're needed.
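
A sketch of item 5 above: one linear pass over the copy operations, sorted by source offset, assuming checks 1-3 have already passed (hypothetical types, not LLVM's):

```rust
struct CopyOp {
    src_off: u64,
    dst_off: u64,
    len: u64,
}

/// Merge runs of adjacent copies in a single linear pass. Assumes `copies`
/// is sorted by `src_off` and that no other writes intervene.
fn merge_adjacent(copies: &[CopyOp]) -> Vec<CopyOp> {
    let mut merged: Vec<CopyOp> = Vec::new();
    for c in copies {
        match merged.last_mut() {
            // The next copy starts exactly where the previous one ends on
            // both the source and destination sides: extend the run.
            Some(prev)
                if prev.src_off + prev.len == c.src_off
                    && prev.dst_off + prev.len == c.dst_off =>
            {
                prev.len += c.len;
            }
            _ => merged.push(CopyOp {
                src_off: c.src_off,
                dst_off: c.dst_off,
                len: c.len,
            }),
        }
    }
    merged
}
```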

@DianQK (Member) commented Jul 25, 2024

@dtcxzyw I would like to know how InstCombine typically handles memcpy. Thanks~

@DianQK (Member) commented Jul 25, 2024

> I believe store and memcpy are basically interchangeable.

Sorry, that's not exactly right; we need to consider UB. I'll update later.

@tgross35 commented Jul 25, 2024

> Don't mind me, I have a lot of LLVM knowledge to learn too. :)
>
> Here's my initial checklist of things to verify: […]
>
> In fact, these checks might not be as tricky as they seem. […]

Thanks for the details :) I really wouldn't expect it to emit an actual call to the memcpy routine unless heuristics say the copy is large enough to be worth it (which I'm sure is already implemented). But combining loads and stores before other passes run seems like there may be some wins.

> I believe store and memcpy are basically interchangeable.
>
> Sorry, that's not exactly right; we need to consider UB. I'll update later.

I think that by default, this:

```llvm
  %p_in_u32s = getelementptr inbounds i8, ptr %p_in, i64 8
  %p_out_u32s = getelementptr inbounds i8, ptr %p_out, i64 8
  %in_u8s = load <8 x i8>, ptr %p_in, align 4
  store <8 x i8> %in_u8s, ptr %p_out, align 4
  %in_u32s = load <2 x i32>, ptr %p_in_u32s, align 4
  store <2 x i32> %in_u32s, ptr %p_out_u32s, align 4
```

would not be allowed to be rearranged into this:

```llvm
  %p_in_u32s = getelementptr inbounds i8, ptr %p_in, i64 8
  %p_out_u32s = getelementptr inbounds i8, ptr %p_out, i64 8
  %in_u8s = load <8 x i8>, ptr %p_in, align 4
  %in_u32s = load <2 x i32>, ptr %p_in_u32s, align 4
  store <8 x i8> %in_u8s, ptr %p_out, align 4
  store <2 x i32> %in_u32s, ptr %p_out_u32s, align 4
```

Because p_out could point somewhere into p_in_u32s. However, this example has noalias on p_out (in the original Rust example p_in has it too; I think only one of the two arguments needs it here). I think noalias should allow that reordering, when beneficial, since nothing LLVM doesn't know about touches p_in.

So maybe this is just a missed chance to make use of noalias?
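
To make the hazard concrete, here's a small safe-Rust stand-in (hypothetical, mirroring the 8+8-byte layout above) where the destination overlaps the source's tail, so the two orderings give different results:

```rust
// in = buf[0..16], out = buf[8..24]; the two ranges overlap in buf[8..16].
fn interleaved(buf: &mut [u8; 24]) {
    let lo: [u8; 8] = buf[0..8].try_into().unwrap();
    buf[8..16].copy_from_slice(&lo); // store to out[0..8] clobbers in[8..16]
    let hi: [u8; 8] = buf[8..16].try_into().unwrap(); // loads clobbered bytes
    buf[16..24].copy_from_slice(&hi);
}

fn grouped(buf: &mut [u8; 24]) {
    let lo: [u8; 8] = buf[0..8].try_into().unwrap();
    let hi: [u8; 8] = buf[8..16].try_into().unwrap(); // both loads first
    buf[8..16].copy_from_slice(&lo);
    buf[16..24].copy_from_slice(&hi);
}

fn main() {
    let mut a: [u8; 24] = core::array::from_fn(|i| i as u8);
    let mut b = a;
    interleaved(&mut a);
    grouped(&mut b);
    assert_ne!(a, b); // observable difference, so the reorder needs noalias
}
```

With noalias on the output pointer this overlap can't happen, which is why the reordering should be legal there.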

@tgross35
Just playing around with the original example, which was a much bigger struct with an array (always a memcpy) in the middle and at the end. I'm trying to see whether there's any way to manually reorder things so it merges the loads and stores; so far no luck: https://llvm.godbolt.org/z/6K9TPf1rn

@DianQK (Member) commented Jul 26, 2024

I forgot that I had removed noalias in my alive2 check. When transforming memcpy into a load, we have to drop noundef: https://alive2.llvm.org/ce/z/u3iLZF. However, I'm not sure whether noundef is useful in this particular context.

@DianQK (Member) commented Jul 26, 2024

> I believe store and memcpy are basically interchangeable.
>
> Sorry, that's not exactly right; we need to consider UB. I'll update later.
>
> I think that by default, this: […] would not be allowed to be rearranged into this: […] because p_out could point somewhere into p_in_u32s. […] So maybe this is just a missed chance to make use of noalias?

I think this should be done in the backend, or after obtaining some target information. For me, just from the IR perspective, the heuristic here is that we can merge two loads and two stores.

@EugeneZelenko added the question label on Sep 18, 2024