Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't implement mem::replace with mem::swap. #83022

Merged
merged 1 commit into from
Mar 13, 2021

Conversation

m-ou-se
Copy link
Member

@m-ou-se m-ou-se commented Mar 11, 2021

swap is a complicated operation, so this changes the implementation of replace to use read and write instead.

See #83019.

I wrote there:

Implementing the simpler operation (replace) with the much more complicated operation (swap) doesn't make a whole lot of sense. replace is just read+write, and the primitive for moving out of a &mut. swap is for doing that to two &mut at the same time, which is both more niche and more complicated (as shown by swap_nonoverlapping_bytes).

This could be especially interesting for Option<VeryLargeStruct>::take(), since swapping such a large structure with swap_nonoverlapping_bytes is going to be much less efficient than ptr::write()'ing a None.

But also for small values where swap just reads/writes using temporary variable, this makes a replace or take operation simpler:
image

@rust-highfive
Copy link
Collaborator

r? @kennytm

(rust-highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Mar 11, 2021
@m-ou-se
Copy link
Member Author

m-ou-se commented Mar 11, 2021

@bors try @rust-timer queue

@rust-timer
Copy link
Collaborator

Awaiting bors try build completion.

@rustbot label: +S-waiting-on-perf

1 similar comment
@rust-timer

This comment has been minimized.

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Mar 11, 2021
@bors
Copy link
Contributor

bors commented Mar 11, 2021

⌛ Trying commit bf27819 with merge 9acd471bcf643c584381c0ff22e4d38a7617f313...

@rust-timer

This comment has been minimized.

@eddyb
Copy link
Member

eddyb commented Mar 11, 2021

FWIW I just tested this PR against Rust-GPU and it appears to handle the for i in 0..n motivation, from #83019's description.

@bors
Copy link
Contributor

bors commented Mar 11, 2021

☀️ Try build successful - checks-actions
Build commit: 9acd471bcf643c584381c0ff22e4d38a7617f313 (9acd471bcf643c584381c0ff22e4d38a7617f313)

@rust-timer
Copy link
Collaborator

Queued 9acd471bcf643c584381c0ff22e4d38a7617f313 with parent 5c6d3bf, future comparison URL.

@nagisa
Copy link
Member

nagisa commented Mar 11, 2021

since swapping such a large structure with swap_nonoverlapping_bytes is going to be much less efficient than ptr::write()'ing a None.

I don't believe that's true – assigning a None is a disciminant write, but ptr::write will need to copy over all the size_of::<Option<VeryLargeStruct>>() bytes of the type regardless of whether the value is a None or not, won't it?

@nagisa
Copy link
Member

nagisa commented Mar 11, 2021

x86_64, aarch64 assembly-level comparison.

@m-ou-se
Copy link
Member Author

m-ou-se commented Mar 11, 2021

@nagisa Nope, that just writes the tag of the Option, and nothing else. (Just like write()'ing a MaybeUninit::uninit() writes nothing at all.)

But looks like llvm is also smart enough to do that with the more compicated swap code: https://godbolt.org/z/TM1bcv

@camelid camelid added I-slow Issue: Problems and improvements with respect to performance of generated code. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Mar 11, 2021
@khyperia
Copy link
Contributor

khyperia commented Mar 11, 2021

I don't believe that's true – assigning a None is a disciminant write, but ptr::write will need to copy over all the size_of::<Option<VeryLargeStruct>>() bytes of the type regardless of whether the value is a None or not, won't it?

x86_64, aarch64 assembly-level comparison.

Looking at the x86_64 assembly, changing it to be this shows off how the ptr::write version optimizes the None case, where the swap version doesn't.

pub fn replace_large(dest: &mut Option<[u8; 127]>) -> Option<[u8; 127]> {
    replace(dest, None)
}

But looks like llvm is also smart enough to do that with the more compicated swap code

Not smart enough in all cases, I guess? No idea how this case is different than yours, though.

edit: discovered this weird thing, I mistakenly thought it was relevant but I missed something ugh. big shrug~

godbolt link

pub fn speedy_fast(x: &mut Option<[u64; 10]>) -> Option<[u64; 10]> {
    std_replace(x, None)
}
// copied from std::mem::replace, just to be clear what's happening (same behavior with std::mem::replace)
fn std_replace(dest: &mut Option<[u64; 10]>, mut src: Option<[u64; 10]>) -> Option<[u64; 10]> {
    std::mem::swap(dest, &mut src);
    src
}

// manually inlined, should be the same? but isn't
pub fn big_slow_write(x: &mut Option<[u64; 10]>) -> Option<[u64; 10]> {
    let mut blah = None;
    std::mem::swap(x, &mut blah);
    blah
}

@rust-timer
Copy link
Collaborator

Finished benchmarking try commit (9acd471bcf643c584381c0ff22e4d38a7617f313): comparison url.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. Please note that if the perf results are neutral, you should likely undo the rollup=never given below by specifying rollup- to bors.

Importantly, though, if the results of this run are non-neutral do not roll this PR up -- it will mask other regressions or improvements in the roll up.

@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Mar 11, 2021
@m-ou-se
Copy link
Member Author

m-ou-se commented Mar 11, 2021

🎉

Quite a significant improvement! Everything is green, except for one +174% that pushed one average into the red. I suppose that's just a random outlier:

image

@m-ou-se
Copy link
Member Author

m-ou-se commented Mar 11, 2021

r? @eddyb

@rust-highfive rust-highfive assigned eddyb and unassigned kennytm Mar 11, 2021
@klensy
Copy link
Contributor

klensy commented Mar 11, 2021

Can it rerun perf without try?

@rust-timer queue include=46449

@rust-timer
Copy link
Collaborator

Insufficient permissions to issue commands to rust-timer.

@nagisa
Copy link
Member

nagisa commented Mar 12, 2021

@rust-timer queue include=46449

I sure have permissions, lets see what happens.

EDIT: Looks like no, you can't rerun perf withouta another try build :(

@nagisa
Copy link
Member

nagisa commented Mar 12, 2021

If nothing else, optimizing a simpler chunk of code is easier on the backend. While I sympathize with @SimonSapin's point about centralization of the unsafe code, I think replace and similar such heavily used functions benefit from such targeted simplifications a… lot.

The instruction count/walltime regression on the incr-patched: empty 3072 test appears to be almost entirely within LLVM's LLVM_lto_optimize and similar areas, so I don't expect much difference at all from a rerun. My guess here is that LLVM simply ends up making different decisions (e.g. giving replace or something similar a higher inline score) when optimizing the newly patched code and thus hitting one of its non-linear areas of code with a significantly larger amount of code.

@bors
Copy link
Contributor

bors commented Mar 12, 2021

☀️ Try build successful - checks-actions
Build commit: 6537b60dde68a3362e54e06e38e45445cd1e920d (6537b60dde68a3362e54e06e38e45445cd1e920d)

@rust-timer
Copy link
Collaborator

Queued 6537b60dde68a3362e54e06e38e45445cd1e920d with parent 0cc64a3, future comparison URL.

@rust-timer
Copy link
Collaborator

Finished benchmarking try commit (6537b60dde68a3362e54e06e38e45445cd1e920d): comparison url.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. Please note that if the perf results are neutral, you should likely undo the rollup=never given below by specifying rollup- to bors.

Importantly, though, if the results of this run are non-neutral do not roll this PR up -- it will mask other regressions or improvements in the roll up.

@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Mar 12, 2021
@nagisa
Copy link
Member

nagisa commented Mar 12, 2021

Well, we're good to go, I guess!

@bors r+

@bors
Copy link
Contributor

bors commented Mar 12, 2021

📌 Commit bf27819 has been approved by nagisa

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Mar 12, 2021
@SimonSapin
Copy link
Contributor

For what it’s worth I wasn’t saying in #83019 that we shouldn’t change anything, only that the current situation was not entirely accidental.

These macro-level results look great. Since this effectively undoes #40454 for replace it might be interesting to also look at micro-level effects, especially for large values. Unfortunately I don’t find code to reproduce the benchmarks in #40454.

@bors
Copy link
Contributor

bors commented Mar 12, 2021

⌛ Testing commit bf27819 with merge 46a934a...

@bors
Copy link
Contributor

bors commented Mar 13, 2021

☀️ Test successful - checks-actions
Approved by: nagisa
Pushing 46a934a to master...

@bors bors added the merged-by-bors This PR was explicitly merged by bors. label Mar 13, 2021
@bors bors merged commit 46a934a into rust-lang:master Mar 13, 2021
@rustbot rustbot added this to the 1.52.0 milestone Mar 13, 2021
@m-ou-se m-ou-se deleted the mem-replace-no-swap branch March 16, 2021 19:02
@jordens
Copy link

jordens commented Apr 21, 2021

Thanks for this fix! Example where this (I think) was painful: taking and replacing packet size data in Options:
https://godbolt.org/z/5dPh8eoWa

@m-ou-se
Copy link
Member Author

m-ou-se commented Apr 21, 2021

Thanks for this fix! Example where this (I think) was painful: taking and replacing packet size data in Options:
https://godbolt.org/z/5dPh8eoWa

Nice, that's a huge difference! Even with --opt-level set to z or s, it makes a big difference.

@scottmcm scottmcm mentioned this pull request Jun 24, 2022
bors added a commit to rust-lang-ci/rust that referenced this pull request Jun 6, 2023
…iler-errors

Use `load`+`store` instead of `memcpy` for small integer arrays

I was inspired by rust-lang#98892 to see whether, rather than making `mem::swap` do something smart in the library, we could update MIR assignments like `*_1 = *_2` to do something smarter than `memcpy` for sufficiently-small types that doing it inline is going to be better than a `memcpy` call in assembly anyway.  After all, special code may help `mem::swap`, but if the "obvious" MIR can just result in the correct thing that helps everything -- other code like `mem::replace`, people doing it manually, and just passing around by value in general -- as well as makes MIR inlining happier since it doesn't need to deal with all the complicated library code if it just sees a couple assignments.

LLVM will turn the short, known-length `memcpy`s into direct instructions in the backend, but that's too late for it to be able to remove `alloca`s.  In general, replacing `memcpy`s with typed instructions is hard in the middle-end -- even for `memcpy.inline` where it knows it won't be a function call -- is hard [due to poison propagation issues](https://rust-lang.zulipchat.com/#narrow/stream/187780-t-compiler.2Fwg-llvm/topic/memcpy.20vs.20load-store.20for.20MIR.20assignments/near/360376712).  So because we know more about the type invariants -- these are typed copies -- rustc can emit something more specific, allowing LLVM to `mem2reg` away the `alloca`s in some situations.

rust-lang#52051 previously did something like this in the library for `mem::swap`, but it ended up regressing during enabling mir inlining (rust-lang@cbbf06b), so this has been suboptimal on stable for ≈5 releases now.

The code in this PR is narrowly targeted at just integer arrays in LLVM, but works via a new method on the [`LayoutTypeMethods`](https://doc.rust-lang.org/nightly/nightly-rustc/rustc_codegen_ssa/traits/trait.LayoutTypeMethods.html) trait, so specific backends based on cg_ssa can enable this for more situations over time, as we find them.  I don't want to try to bite off too much in this PR, though.  (Transparent newtypes and simple things like the 3×usize `String` would be obvious candidates for a follow-up.)

Codegen demonstrations: <https://llvm.godbolt.org/z/fK8hT9aqv>

Before:
```llvm
define void `@swap_rgb48_old(ptr` noalias nocapture noundef align 2 dereferenceable(6) %x, ptr noalias nocapture noundef align 2 dereferenceable(6) %y) unnamed_addr #1 {
  %a.i = alloca [3 x i16], align 2
  call void `@llvm.lifetime.start.p0(i64` 6, ptr nonnull %a.i)
  call void `@llvm.memcpy.p0.p0.i64(ptr` noundef nonnull align 2 dereferenceable(6) %a.i, ptr noundef nonnull align 2 dereferenceable(6) %x, i64 6, i1 false)
  tail call void `@llvm.memcpy.p0.p0.i64(ptr` noundef nonnull align 2 dereferenceable(6) %x, ptr noundef nonnull align 2 dereferenceable(6) %y, i64 6, i1 false)
  call void `@llvm.memcpy.p0.p0.i64(ptr` noundef nonnull align 2 dereferenceable(6) %y, ptr noundef nonnull align 2 dereferenceable(6) %a.i, i64 6, i1 false)
  call void `@llvm.lifetime.end.p0(i64` 6, ptr nonnull %a.i)
  ret void
}
```
Note it going to stack:
```nasm
swap_rgb48_old:                         # `@swap_rgb48_old`
        movzx   eax, word ptr [rdi + 4]
        mov     word ptr [rsp - 4], ax
        mov     eax, dword ptr [rdi]
        mov     dword ptr [rsp - 8], eax
        movzx   eax, word ptr [rsi + 4]
        mov     word ptr [rdi + 4], ax
        mov     eax, dword ptr [rsi]
        mov     dword ptr [rdi], eax
        movzx   eax, word ptr [rsp - 4]
        mov     word ptr [rsi + 4], ax
        mov     eax, dword ptr [rsp - 8]
        mov     dword ptr [rsi], eax
        ret
```

Now:
```llvm
define void `@swap_rgb48(ptr` noalias nocapture noundef align 2 dereferenceable(6) %x, ptr noalias nocapture noundef align 2 dereferenceable(6) %y) unnamed_addr #0 {
start:
  %0 = load <3 x i16>, ptr %x, align 2
  %1 = load <3 x i16>, ptr %y, align 2
  store <3 x i16> %1, ptr %x, align 2
  store <3 x i16> %0, ptr %y, align 2
  ret void
}
```
still lowers to `dword`+`word` operations, but has no stack traffic:
```nasm
swap_rgb48:                             # `@swap_rgb48`
        mov     eax, dword ptr [rdi]
        movzx   ecx, word ptr [rdi + 4]
        movzx   edx, word ptr [rsi + 4]
        mov     r8d, dword ptr [rsi]
        mov     dword ptr [rdi], r8d
        mov     word ptr [rdi + 4], dx
        mov     word ptr [rsi + 4], cx
        mov     dword ptr [rsi], eax
        ret
```

And as a demonstration that this isn't just `mem::swap`, a `mem::replace` on a small array (since replace doesn't use swap since rust-lang#83022), which used to be `memcpy`s in LLVM changes in IR
```llvm
define void `@replace_short_array(ptr` noalias nocapture noundef sret([3 x i32]) dereferenceable(12) %0, ptr noalias noundef align 4 dereferenceable(12) %r, ptr noalias nocapture noundef readonly dereferenceable(12) %v) unnamed_addr #0 {
start:
  %1 = load <3 x i32>, ptr %r, align 4
  store <3 x i32> %1, ptr %0, align 4
  %2 = load <3 x i32>, ptr %v, align 4
  store <3 x i32> %2, ptr %r, align 4
  ret void
}
```
but that lowers to reasonable `dword`+`qword` instructions still
```nasm
replace_short_array:                    # `@replace_short_array`
        mov     rax, rdi
        mov     rcx, qword ptr [rsi]
        mov     edi, dword ptr [rsi + 8]
        mov     dword ptr [rax + 8], edi
        mov     qword ptr [rax], rcx
        mov     rcx, qword ptr [rdx]
        mov     edx, dword ptr [rdx + 8]
        mov     dword ptr [rsi + 8], edx
        mov     qword ptr [rsi], rcx
        ret
```
@joboet joboet mentioned this pull request Mar 16, 2024
fmease added a commit to fmease/rust that referenced this pull request Mar 16, 2024
Optimize `ptr::replace`

rust-lang#83022 optimized `mem::replace` to reduce the number of `memcpy`s. `ptr::replace`, which is [documented to behave just like `mem::replace`](https://doc.rust-lang.org/nightly/std/ptr/fn.replace.html), was not optimized however, leading to [worse code](https://godbolt.org/z/T3hdEEdfe) and missed optimizations. This PR simply forwards `ptr::replace` to `mem::replace` to take advantage of the better implementation.
rust-timer added a commit to rust-lang-ci/rust that referenced this pull request Mar 17, 2024
Rollup merge of rust-lang#122601 - joboet:ptr_replace, r=workingjubilee

Optimize `ptr::replace`

rust-lang#83022 optimized `mem::replace` to reduce the number of `memcpy`s. `ptr::replace`, which is [documented to behave just like `mem::replace`](https://doc.rust-lang.org/nightly/std/ptr/fn.replace.html), was not optimized however, leading to [worse code](https://godbolt.org/z/T3hdEEdfe) and missed optimizations. This PR simply forwards `ptr::replace` to `mem::replace` to take advantage of the better implementation.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
I-slow Issue: Problems and improvements with respect to performance of generated code. merged-by-bors This PR was explicitly merged by bors. S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. T-libs Relevant to the library team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.