mem::swap the obvious way for types smaller than the SIMD optimization's block size #52051

Merged — 2 commits merged into rust-lang:master on Jul 22, 2018

Conversation

@scottmcm (Member, Author) commented Jul 4, 2018

LLVM isn't able to remove the alloca for the unaligned block in the post-SIMD tail in some cases, so doing this helps SRoA work in cases where it currently doesn't. Found in the replace_with RFC discussion.

Examples of the improvements:

swapping `[u16; 3]` takes 1/3 fewer instructions and no stackalloc
```rust
type Demo = [u16; 3];
pub fn swap_demo(x: &mut Demo, y: &mut Demo) {
    std::mem::swap(x, y);
}
```

nightly:

```asm
_ZN4blah9swap_demo17ha1732a9b71393a7eE:
.seh_proc _ZN4blah9swap_demo17ha1732a9b71393a7eE
	sub	rsp, 32
	.seh_stackalloc 32
	.seh_endprologue
	movzx	eax, word ptr [rcx + 4]
	mov	word ptr [rsp + 4], ax
	mov	eax, dword ptr [rcx]
	mov	dword ptr [rsp], eax
	movzx	eax, word ptr [rdx + 4]
	mov	word ptr [rcx + 4], ax
	mov	eax, dword ptr [rdx]
	mov	dword ptr [rcx], eax
	movzx	eax, word ptr [rsp + 4]
	mov	word ptr [rdx + 4], ax
	mov	eax, dword ptr [rsp]
	mov	dword ptr [rdx], eax
	add	rsp, 32
	ret
	.seh_handlerdata
	.section	.text,"xr",one_only,_ZN4blah9swap_demo17ha1732a9b71393a7eE
	.seh_endproc
```

this PR:

```asm
_ZN4blah9swap_demo17ha1732a9b71393a7eE:
	mov	r8d, dword ptr [rcx]
	movzx	r9d, word ptr [rcx + 4]
	movzx	eax, word ptr [rdx + 4]
	mov	word ptr [rcx + 4], ax
	mov	eax, dword ptr [rdx]
	mov	dword ptr [rcx], eax
	mov	word ptr [rdx + 4], r9w
	mov	dword ptr [rdx], r8d
	ret
```
`replace_with` optimizes down much better

Inspired by rust-lang/rfcs#2490,

```rust
fn replace_with<T, F>(x: &mut Option<T>, f: F)
    where F: FnOnce(Option<T>) -> Option<T>
{
    *x = f(x.take());
}

pub fn inc_opt(mut x: &mut Option<i32>) {
    replace_with(&mut x, |i| i.map(|j| j + 1));
}
```

Rust 1.26.0:

```asm
_ZN4blah7inc_opt17heb0acb64c51777cfE:
	mov	rax, qword ptr [rcx]
	movabs	r8, 4294967296
	add	r8, rax
	shl	rax, 32
	movabs	rdx, -4294967296
	and	rdx, r8
	xor	r8d, r8d
	test	rax, rax
	cmove	rdx, rax
	setne	r8b
	or	rdx, r8
	mov	qword ptr [rcx], rdx
	ret
```

Nightly (better thanks to ScalarPair, maybe?):

```asm
_ZN4blah7inc_opt17h66df690be0b5899dE:
	mov	r8, qword ptr [rcx]
	mov	rdx, r8
	shr	rdx, 32
	xor	eax, eax
	test	r8d, r8d
	setne	al
	add	edx, 1
	mov	dword ptr [rcx], eax
	mov	dword ptr [rcx + 4], edx
	ret
```

This PR:

```asm
_ZN4blah7inc_opt17h1426dc215ecbdb19E:
	xor	eax, eax
	cmp	dword ptr [rcx], 0
	setne	al
	mov	dword ptr [rcx], eax
	add	dword ptr [rcx + 4], 1
	ret
```

Where that add is beautiful -- using an addressing mode to not even need to explicitly go through a register -- and the remaining imperfection is well-known (#49420 (comment)).

@rust-highfive (Collaborator) commented:

r? @KodrAus

(rust_highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Jul 4, 2018
@scottmcm scottmcm added the WG-llvm Working group: LLVM backend code generation label Jul 4, 2018
@kennytm (Member) commented Jul 4, 2018

I don't think this has anything to do with SIMD, as the longer codegen is still produced when commenting out the #[repr(simd)].

OTOH, changing this line:

```rust
let mut t: UnalignedBlock = mem::uninitialized();
```

to

```rust
let mut t = UnalignedBlock(0,0,0,0);
```

is sufficient to get this ASM on godbolt:

```asm
example::swap_i48:
  mov eax, dword ptr [rdi]
  movzx ecx, word ptr [rdi + 4]
  movzx edx, word ptr [rsi + 4]
  mov word ptr [rdi + 4], dx
  mov edx, dword ptr [rsi]
  mov dword ptr [rdi], edx
  mov word ptr [rsi + 4], cx
  mov dword ptr [rsi], eax
  ret
```
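
For context, the helper types involved look roughly like this (a reconstruction from this thread, not the verbatim libcore source); the `UnalignedBlock` temporary used for the sub-32-byte tail is the alloca LLVM fails to remove:

```rust
// Rough reconstruction from this discussion, not the verbatim libcore source.
// In libcore the 32-byte `Block` additionally carries `#[repr(simd)]` (hence
// the "SIMD" in the PR title); that attribute needs a nightly feature, so it
// is omitted here.
struct Block(u64, u64, u64, u64);          // 32-byte chunk swapped by the main loop
struct UnalignedBlock(u64, u64, u64, u64); // temporary for the < 32-byte tail

// The tail of `swap_nonoverlapping_bytes` does, roughly:
//     let mut t: UnalignedBlock = mem::uninitialized();  // the line changed above
//     // then memcpys the remaining bytes: x -> t, y -> x, t -> y
```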

@scottmcm (Member, Author) commented Jul 4, 2018

Interesting, @kennytm. Also really disturbing, since initializing to undef should in all situations allow the optimizer more freedom than initializing to a concrete value. It makes me wonder if the whole "swapping in u64s" approach actually has undef problems, perhaps in padding or similar...

(Agreed that it's not actually about the SIMD, but about the tail-too-small-to-SIMD part of the method.)

@scottmcm scottmcm changed the title Don't use SIMD optimizations in mem::swap for types smaller than the block size mem::swap the obvious way for types smaller than the SIMD optimization's block size Jul 4, 2018
@kennytm (Member) commented Jul 4, 2018

I've also checked the inc_opt case, and my minor change in #52051 (comment) is not sufficient to reproduce the ultimate optimized code.

@scottmcm (Member, Author) commented Jul 4, 2018

@kennytm Can you share your godbolt? When I try the change locally, I get worse codegen as it zero-inits the part of the stack that the memcpys don't write.

ASM with zero-init for the unaligned block:

```asm
_ZN4blah9swap_demo17ha1732a9b71393a7eE:
.seh_proc _ZN4blah9swap_demo17ha1732a9b71393a7eE
	sub	rsp, 32
	.seh_stackalloc 32
	.seh_endprologue
	xorps	xmm0, xmm0
	movups	xmmword ptr [rsp + 16], xmm0
	movups	xmmword ptr [rsp + 6], xmm0
	movzx	eax, word ptr [rcx + 4]
	mov	word ptr [rsp + 4], ax
	mov	eax, dword ptr [rcx]
	mov	dword ptr [rsp], eax
	movzx	eax, word ptr [rdx + 4]
	mov	word ptr [rcx + 4], ax
	mov	eax, dword ptr [rdx]
	mov	dword ptr [rcx], eax
	mov	eax, dword ptr [rsp]
	mov	dword ptr [rdx], eax
	movzx	eax, word ptr [rsp + 4]
	mov	word ptr [rdx + 4], ax
	add	rsp, 32
	ret
	.seh_handlerdata
	.section	.text,"xr",one_only,_ZN4blah9swap_demo17ha1732a9b71393a7eE
	.seh_endproc
```

As for `inc_opt`, that illustrates another advantage: the memcpys in `swap_nonoverlapping_bytes` are always align 1, whereas this PR keeps the alignment information. I suspect that's what makes the difference, since I see `mov r8, qword ptr [rcx]` with the zero-init (or on nightly), which is probably an indication that LLVM isn't willing to split up the unaligned load and thus can't fold as much away.

@kennytm (Member) commented Jul 4, 2018

@scottmcm https://godbolt.org/g/1vgFXw

Including inc_opt in the same snippet spoils the optimization though 🤔.

@scottmcm (Member, Author) commented Jul 4, 2018

Using an array of four `0_u64`s instead of a struct of them also seems to break the optimization, putting in the explicit stack zeroing like I'm seeing locally: https://godbolt.org/g/smQfre

Given how subtle all these changes seem to be, I'm liking the "just do it the obvious way" code even more.

@kennytm (Member) commented Jul 4, 2018

Could we modify `swap_nonoverlapping` instead of creating a new function `swap_nonoverlapping_one`?

```rust
unsafe fn swap_nonoverlapping<T>(x: *mut T, y: *mut T, count: usize) {
    if count == 1 && mem::size_of::<T>() < 32 {
        let z = read(x);
        copy_nonoverlapping(y, x, 1);
        write(y, z);
    } else {
        let x = x as *mut u8;
        let y = y as *mut u8;
        let len = mem::size_of::<T>() * count;
        swap_nonoverlapping_bytes(x, y, len)
    }
}
```

@scottmcm (Member, Author) commented Jul 7, 2018

@kennytm I originally had the code in `mem::swap` directly, but that meant the `32` was rather out of context, so I did the `pub(crate)` function. I'm not a huge fan of putting the `count == 1` check in `swap_nonoverlapping`, since it feels wrong to bother checking in cases where it's not a compile-time constant, and both sides of the `if`/`else` would need to get codegened...
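
Concretely, the shape under discussion looks roughly like this (a sketch assembled from this thread, not the exact diff): the `32` stays next to the block size it refers to, and `mem::swap` calls the helper unconditionally, so there is no runtime `count == 1` branch to codegen.

```rust
use std::mem;
use std::ptr::{copy_nonoverlapping, read, swap_nonoverlapping, write};

// Sketch of the PR's approach, assembled from the discussion above; the exact
// libcore code may differ in its details.
pub(crate) unsafe fn swap_nonoverlapping_one<T>(x: *mut T, y: *mut T) {
    // Types smaller than the 32-byte block used by the byte/SIMD path are
    // swapped directly through one typed temporary, which keeps alignment
    // information and lets SRoA/mem2reg eliminate the stack slot.
    if mem::size_of::<T>() < 32 {
        let z = read(x);
        copy_nonoverlapping(y, x, 1);
        write(y, z);
    } else {
        swap_nonoverlapping(x, y, 1);
    }
}

// ...and `mem::swap` then boils down to calling
// `swap_nonoverlapping_one(x as *mut T, y as *mut T)` on its two `&mut T`s.
```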

@stokhos stokhos added S-waiting-on-team Status: Awaiting decision from the relevant subteam (see the T-<team> label). and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jul 21, 2018
@stokhos commented Jul 21, 2018

Ping from triage! @rust-lang/libs anyone got time to review this PR?

@stokhos stokhos added the T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. label Jul 21, 2018
@alexcrichton (Member) commented:

@bors: r+

@bors (Contributor) commented Jul 21, 2018

📌 Commit 1f73144 has been approved by alexcrichton

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-team Status: Awaiting decision from the relevant subteam (see the T-<team> label). labels Jul 21, 2018
@alexcrichton (Member) commented:

I agree with @scottmcm that we should keep this as a separate function for now, but if the need arises we can always merge it with the original swap_nonoverlapping function!

kennytm added a commit to kennytm/rust that referenced this pull request Jul 21, 2018
mem::swap the obvious way for types smaller than the SIMD optimization's block size

@bors (Contributor) commented Jul 22, 2018

⌛ Testing commit 1f73144 with merge c069aa6bbc1d3c7a183764f584cd1ce8c0508957...

@bors (Contributor) commented Jul 22, 2018

💔 Test failed - status-appveyor

@bors bors added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. labels Jul 22, 2018
LLVM isn't able to remove the alloca for the unaligned block in the SIMD tail in some cases, so doing this helps SRoA work in cases where it currently doesn't.  Found in the `replace_with` RFC discussion.
@scottmcm (Member, Author) commented:

Does the AppVeyor build use a different LLVM than the in-tree one? I just rebased this locally on Windows and the test passes fine...

@kennytm (Member) commented Jul 22, 2018

@scottmcm Are you targeting i686-pc-windows-msvc?

@kennytm kennytm added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jul 22, 2018
Smaller platforms don't merge the loads the same way.
@scottmcm (Member, Author) commented:

Thanks, @kennytm! That makes sense -- it probably doesn't unify the loads/stores on 32-bit. I've pushed an update to the test to not try to run it everywhere, since it depends on optimizations.

@alexcrichton (Member) commented:

@bors: r+

@bors (Contributor) commented Jul 22, 2018

📌 Commit c9482f7 has been approved by alexcrichton

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Jul 22, 2018
kennytm added a commit to kennytm/rust that referenced this pull request Jul 22, 2018
mem::swap the obvious way for types smaller than the SIMD optimization's block size

bors added a commit that referenced this pull request Jul 22, 2018
Rollup of 11 pull requests

Successful merges:

 - #51807 (Deprecation of str::slice_unchecked(_mut))
 - #52051 (mem::swap the obvious way for types smaller than the SIMD optimization's block size)
 - #52465 (Add CI test harness for `thumb*` targets. [IRR-2018-embedded])
 - #52507 (Reword when `_` couldn't be inferred)
 - #52508 (Document that Unique::empty() and NonNull::dangling() aren't sentinel values)
 - #52521 (Fix links in rustdoc book.)
 - #52581 (Avoid using `#[macro_export]` for documenting builtin macros)
 - #52582 (Typo)
 - #52587 (Add missing backtick in UniversalRegions doc comment)
 - #52594 (Run the error index tool against the sysroot libdir)
 - #52615 (Added new lines to .gitignore.)
@bors bors merged commit c9482f7 into rust-lang:master Jul 22, 2018
@scottmcm scottmcm deleted the swap-directly branch August 15, 2018 03:18
bors added a commit to rust-lang-ci/rust that referenced this pull request Jun 6, 2023
…iler-errors

Use `load`+`store` instead of `memcpy` for small integer arrays

I was inspired by rust-lang#98892 to see whether, rather than making `mem::swap` do something smart in the library, we could update MIR assignments like `*_1 = *_2` to do something smarter than `memcpy` for sufficiently-small types that doing it inline is going to be better than a `memcpy` call in assembly anyway.  After all, special code may help `mem::swap`, but if the "obvious" MIR can just result in the correct thing that helps everything -- other code like `mem::replace`, people doing it manually, and just passing around by value in general -- as well as makes MIR inlining happier since it doesn't need to deal with all the complicated library code if it just sees a couple assignments.

LLVM will turn the short, known-length `memcpy`s into direct instructions in the backend, but that's too late for it to be able to remove `alloca`s.  In general, replacing `memcpy`s with typed instructions in the middle-end -- even for `memcpy.inline`, where it knows it won't be a function call -- is hard [due to poison propagation issues](https://rust-lang.zulipchat.com/#narrow/stream/187780-t-compiler.2Fwg-llvm/topic/memcpy.20vs.20load-store.20for.20MIR.20assignments/near/360376712).  So because we know more about the type invariants -- these are typed copies -- rustc can emit something more specific, allowing LLVM to `mem2reg` away the `alloca`s in some situations.

rust-lang#52051 previously did something like this in the library for `mem::swap`, but it ended up regressing when MIR inlining was enabled (rust-lang@cbbf06b), so this has been suboptimal on stable for ≈5 releases now.

The code in this PR is narrowly targeted at just integer arrays in LLVM, but works via a new method on the [`LayoutTypeMethods`](https://doc.rust-lang.org/nightly/nightly-rustc/rustc_codegen_ssa/traits/trait.LayoutTypeMethods.html) trait, so specific backends based on cg_ssa can enable this for more situations over time, as we find them.  I don't want to try to bite off too much in this PR, though.  (Transparent newtypes and simple things like the 3×usize `String` would be obvious candidates for a follow-up.)

Codegen demonstrations: <https://llvm.godbolt.org/z/fK8hT9aqv>
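
(For reference, the Rust behind those demos presumably looks something like the following; this is reconstructed from the symbol names and types in the IR below, not copied from the godbolt link.)

```rust
// Reconstructed from the symbol names and types shown below; the exact godbolt
// sources may differ. `swap_rgb48` swaps a 6-byte `[u16; 3]`;
// `replace_short_array` is the `mem::replace` demo on a 12-byte `[u32; 3]`.
pub fn swap_rgb48(x: &mut [u16; 3], y: &mut [u16; 3]) {
    std::mem::swap(x, y);
}

pub fn replace_short_array(r: &mut [u32; 3], v: [u32; 3]) -> [u32; 3] {
    std::mem::replace(r, v)
}
```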

Before:
```llvm
define void @swap_rgb48_old(ptr noalias nocapture noundef align 2 dereferenceable(6) %x, ptr noalias nocapture noundef align 2 dereferenceable(6) %y) unnamed_addr #1 {
  %a.i = alloca [3 x i16], align 2
  call void @llvm.lifetime.start.p0(i64 6, ptr nonnull %a.i)
  call void @llvm.memcpy.p0.p0.i64(ptr noundef nonnull align 2 dereferenceable(6) %a.i, ptr noundef nonnull align 2 dereferenceable(6) %x, i64 6, i1 false)
  tail call void @llvm.memcpy.p0.p0.i64(ptr noundef nonnull align 2 dereferenceable(6) %x, ptr noundef nonnull align 2 dereferenceable(6) %y, i64 6, i1 false)
  call void @llvm.memcpy.p0.p0.i64(ptr noundef nonnull align 2 dereferenceable(6) %y, ptr noundef nonnull align 2 dereferenceable(6) %a.i, i64 6, i1 false)
  call void @llvm.lifetime.end.p0(i64 6, ptr nonnull %a.i)
  ret void
}
```
Note it going to stack:
```nasm
swap_rgb48_old:                         # @swap_rgb48_old
        movzx   eax, word ptr [rdi + 4]
        mov     word ptr [rsp - 4], ax
        mov     eax, dword ptr [rdi]
        mov     dword ptr [rsp - 8], eax
        movzx   eax, word ptr [rsi + 4]
        mov     word ptr [rdi + 4], ax
        mov     eax, dword ptr [rsi]
        mov     dword ptr [rdi], eax
        movzx   eax, word ptr [rsp - 4]
        mov     word ptr [rsi + 4], ax
        mov     eax, dword ptr [rsp - 8]
        mov     dword ptr [rsi], eax
        ret
```

Now:
```llvm
define void @swap_rgb48(ptr noalias nocapture noundef align 2 dereferenceable(6) %x, ptr noalias nocapture noundef align 2 dereferenceable(6) %y) unnamed_addr #0 {
start:
  %0 = load <3 x i16>, ptr %x, align 2
  %1 = load <3 x i16>, ptr %y, align 2
  store <3 x i16> %1, ptr %x, align 2
  store <3 x i16> %0, ptr %y, align 2
  ret void
}
```
still lowers to `dword`+`word` operations, but has no stack traffic:
```nasm
swap_rgb48:                             # @swap_rgb48
        mov     eax, dword ptr [rdi]
        movzx   ecx, word ptr [rdi + 4]
        movzx   edx, word ptr [rsi + 4]
        mov     r8d, dword ptr [rsi]
        mov     dword ptr [rdi], r8d
        mov     word ptr [rdi + 4], dx
        mov     word ptr [rsi + 4], cx
        mov     dword ptr [rsi], eax
        ret
```

And as a demonstration that this isn't just `mem::swap`: a `mem::replace` on a small array (`replace` hasn't used `swap` since rust-lang#83022), which used to be `memcpy`s in LLVM, changes in IR to
```llvm
define void @replace_short_array(ptr noalias nocapture noundef sret([3 x i32]) dereferenceable(12) %0, ptr noalias noundef align 4 dereferenceable(12) %r, ptr noalias nocapture noundef readonly dereferenceable(12) %v) unnamed_addr #0 {
start:
  %1 = load <3 x i32>, ptr %r, align 4
  store <3 x i32> %1, ptr %0, align 4
  %2 = load <3 x i32>, ptr %v, align 4
  store <3 x i32> %2, ptr %r, align 4
  ret void
}
```
but that lowers to reasonable `dword`+`qword` instructions still
```nasm
replace_short_array:                    # @replace_short_array
        mov     rax, rdi
        mov     rcx, qword ptr [rsi]
        mov     edi, dword ptr [rsi + 8]
        mov     dword ptr [rax + 8], edi
        mov     qword ptr [rax], rcx
        mov     rcx, qword ptr [rdx]
        mov     edx, dword ptr [rdx + 8]
        mov     dword ptr [rsi + 8], edx
        mov     qword ptr [rsi], rcx
        ret
```