Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve ptr::read code in debug builds #81163

Closed
RalfJung opened this issue Jan 18, 2021 · 31 comments
Closed

Improve ptr::read code in debug builds #81163

RalfJung opened this issue Jan 18, 2021 · 31 comments
Assignees
Labels
A-mir-opt Area: MIR optimizations A-mir-opt-inlining Area: MIR inlining A-raw-pointers Area: raw pointers, MaybeUninit, NonNull I-slow Issue: Problems and improvements with respect to performance of generated code. T-libs Relevant to the library team, which will review and decide on the PR/issue.

Comments

@RalfJung
Copy link
Member

RalfJung commented Jan 18, 2021

In #80290, some people raised concerns about the quality of the code that ptr::write compiles to in debug builds. Given that, to my knowledge, reads are much more common than writes, I would think that one should be much more concerned with the code that ptr::read compiles to -- and currently, there's going to be quite a few function calls in there, so without inlining, that code will be pretty slow.

ptr::read could be improved with techniques similar to what I did for ptr::write (call intrinsics directly, and inline everything else by hand). This would result in (something like) the following implementation: (EDIT see below for why this is wrong)

pub const unsafe fn read<T>(src: *const T) -> T {
    // We are calling the intrinsics directly to avoid function calls in the generated code
    // as `intrinsics::copy_nonoverlapping` is a wrapper function.
    extern "rust-intrinsic" {
        fn copy_nonoverlapping<T>(src: *const T, dst: *mut T, count: usize);
    }

    // For the same reason, we also side-step `mem::MaybeUninit` and use a custom `union` directly.
    #[repr(C)]
    union MaybeUninit<T> {
        init: mem::ManuallyDrop<T>,
        uninit: (),
    }

    let mut tmp: MaybeUninit<T> = MaybeUninit { uninit: () };
    // SAFETY: the caller must guarantee that `src` is valid for reads.
    // `src` cannot overlap `tmp` because `tmp` was just allocated on
    // the stack as a separate allocated object.
    //
    // `MaybeUninit` is repr(C), so we can assume `init` is at offset 0, justifying the pointer
    // casts.
    //
    // Finally, since we just wrote a valid value into `tmp`, it is guaranteed
    // to be properly initialized.
    unsafe {
        copy_nonoverlapping(src, &mut tmp as *mut _ as *mut T, 1);
        mem::transmute_copy(&tmp.init)
    }
}

However, here we have the extra difficulty that read is (unstably) a const fn, so the above implementation is rejected. &tmp.init can be replaced by &mut tmp.init and that works (or we wait for a bootstrap bump so we can make use of #80418), but transmute_copy is non-const, so there's still more work to be done. (transmute does not work since the compiler does not recognize that T and ManuallyDrop<T> have the same size.)

I will stop here, but if someone else strongly cares about ptr::read performance/codesize in debug builds, feel free to pick this up and drive it to completion.

Cc @bjorn3 @therealprof @usbalbin

@RalfJung
Copy link
Member Author

Also Cc #80377

@RalfJung
Copy link
Member Author

Actually, transmute_copy is implemented using read, so scratch that plan.^^
I don't think there is a way to do this without calling ManuallyDrop::into_inner.

@camelid camelid added T-libs Relevant to the library team, which will review and decide on the PR/issue. A-raw-pointers Area: raw pointers, MaybeUninit, NonNull I-slow Issue: Problems and improvements with respect to performance of generated code. labels Jan 18, 2021
@nagisa
Copy link
Member

nagisa commented Jan 18, 2021

I don't think there is a way to do this without calling ManuallyDrop::into_inner.

If all of ManuallyDrop, MaybeUninit and ptr::read are defined in the same crate, the fields of the wrappers could be made pub to ptr::read.

@RalfJung
Copy link
Member Author

Yeah, that would work... but we'd have to make them public to the entire ptr module I think, so there is a significant risk of this being misused down the road. IOW, this is a bit too hacky for my taste.^^

@therealprof
Copy link
Contributor

Any chance of implementing those lowest level functions exclusively with intrinsics? It seems non-ideal to rely on a (non-existing in debug mode) optimizer to turn this into a decent binary code.

I would argue that the most fundamental building blocks to low-level applications should never cause unnecessary overhead, even in debug mode.

@RalfJung
Copy link
Member Author

RalfJung commented Jan 19, 2021

Intrinsics carry a significant implementation cost in various layers of the compiler (typechecking, CTFE/Miri, codegen), so IMO we should keep the number of intrinsics to the absolute minimum necessary.

It seems non-ideal to rely on a (non-existing in debug mode) optimizer to turn this into a decent binary code.

This is not entirely true; some MIR simplification passes do run even in debug mode.

@therealprof
Copy link
Contributor

Intrinsics carry a significant implementation cost in various layers of the compiler (typechecking, CTFE/Miri, codegen), so IMO we should keep the number of intrinsics to the absolute minimum necessary.

Fair enough. I still think write, read, write_volatile and read_volatile qualify as essentials for a systems programming language.

This is not entirely true; some MIR simplification passes do run even in debug mode.

If we can get efficient implementations for the mentioned functions another way that would be great.

@RalfJung
Copy link
Member Author

I still think write, read, write_volatile and read_volatile qualify as essentials for a systems programming language.

That might be a good enough justification to make the fields of MaybeUninit and ManuallyDrop "public enough" to avoid function calls in read.

OTOH, the volatile operations have to be intrinsics anyway, so potentially much of the code could be shared between read_volatile and a read intrinsic.

@alecmocatta
Copy link
Contributor

I may be off the mark here, but might we see net perf gains with certain cheap wrapper functions (e.g. ptr::copy_nonoverlapping) just being marked #[inline(always)]? Might be worth testing before expending effort on manual inlining.

@bjorn3
Copy link
Member

bjorn3 commented Jan 19, 2021

#[inline(always)] won't help for cg_clif until the MIR inliner is enabled by default. It also requires the LLVM inliner to do more work, thus causing debug builds to become a bit slower.

@RalfJung
Copy link
Member Author

I thought inline(always) only affects release build? If it also affects debug builds, that might still be better overall than some of the more hacky alternatives.

@bjorn3
Copy link
Member

bjorn3 commented Jan 19, 2021

#[inline(always)] is also enabled when optimizations are disabled. Except when -Cno-prepopulate-passes is used.

For the old pass manager:

llvm::LLVMRustAddAlwaysInlinePass(builder, config.emit_lifetime_markers);

For the new pass manager:
MPM.addPass(AlwaysInlinerPass(EmitLifetimeMarkers));

@alecmocatta
Copy link
Contributor

It doesn't seem implausible to me that #[inline(always)] might be a net gain for debug build times, while also improving perf of debug binaries. The llvm.memcpy intrinsic might be cheaper for LLVM to reason about than a call to the external ptr::copy_nonoverlapping, possibly making up for the increased cost to the inliner.

@bjorn3
Copy link
Member

bjorn3 commented Jan 19, 2021

I mean slower compared to manual inlining.

@alecmocatta
Copy link
Contributor

Sure, #[inline(always)] may have a (I would assume very small) cost over manual inlining. The potential "net gain" I'm referring to is from automatically inlining in every place where ptr::copy_nonoverlapping is called (including user crates and applications) rather than a limited number of manual inlinings in std.

@bjorn3
Copy link
Member

bjorn3 commented Jan 19, 2021

ptr::copy_nonoverlapping is a thin wrapper around the intrinsic named the same. I would very much like all intrinsic wrappers to become re-exports of the intrinsics themself. This would eliminate the need for inlining but would require some way to change the "abi" of intrinsics from "rust-intrinsic" to "Rust" to make them indistinguishable from real functions.

@RalfJung
Copy link
Member Author

RalfJung commented Jan 19, 2021

I would very much like all intrinsic wrappers to become re-exports of the intrinsics themself.

FWIW, copy_nonoverlapping was such a reexport until we added the debug assertions. Now that the debug assertions are disabled, if we feel like we don't want them back anyway, we could change it back to a reeexport.

This would eliminate the need for inlining but would require some way to change the "abi" of intrinsics from "rust-intrinsic" to "Rust" to make them indistinguishable from real functions.

I think the main issue here is that using re-exports makes the intrinsics stably callable at std::intrinsics::X. Rust stability is not fine-grained enough to let us stabilize the reexport without stabilizing the thing that is being reexported.

@oli-obk
Copy link
Contributor

oli-obk commented Jan 20, 2021

I don't think we should be trying to solve this at the library level. We are getting rather close to making MIR inlining work everywhere, and in my personal opinion we should look into improving debug mode by adding more mir-opt-level=1 optimizations that do not affect debug info negatively. There are still various operations that require intrinsics and function calls at the code level, but have equivalent basic MIR statements available that we could optimize to. I mean... ptr::write is equivalent to a single MIR statement: StatementKind::Assign(Place(dest, [Deref]), Rvalue::Use(Operand::Move(source)). If we can figure this out properly (optimize away forget, lower copy_nonoverlapping to StatementKind::Assign), then we may actually see better debug compile-times as well as smaller binary code than we did before.

@oli-obk oli-obk added A-mir-opt Area: MIR optimizations A-mir-opt-inlining Area: MIR inlining labels Jan 20, 2021
@RalfJung
Copy link
Member Author

RalfJung commented Jan 20, 2021

If we can figure this out properly (optimize away forget, lower copy_nonoverlapping to StatementKind::Assign), then we may actually see better debug compile-times as well as smaller binary code than we did before.

forget is already removed by #79049. So the only thing that's left is this optimization (and I keep repeating this but somehow it seems to get lost all the time^^).

@RalfJung
Copy link
Member Author

Anyway, this issue is about ptr::read, not ptr::write. Let's stay on topic please.

@wesleywiser
Copy link
Member

I started working on that optimization last night.

@wesleywiser wesleywiser self-assigned this Jan 20, 2021
@tmiasko
Copy link
Contributor

tmiasko commented Jan 20, 2021

Lowering copy_nonoverlapping to an assignment removes the dead cleanup pad (MIR building currently doesn't understand that some intrinsics don't unwind). It still introduces an extra alloca for src during LLVM code generation because src has its address taken, unlike what happened with move_val_init.

@RalfJung
Copy link
Member Author

RalfJung commented Jan 21, 2021

FWIW, the MaybeUninit and ManuallyDrop functions called by ptr::read are all already inline(always).

So together with #81238, ptr::read doesn't call any non-inline(always) function any more.

bors added a commit to rust-lang-ci/rust that referenced this issue Feb 13, 2021
directly expose copy and copy_nonoverlapping intrinsics

This effectively un-does rust-lang#57997. That should help with `ptr::read` codegen in debug builds (and any other of these low-level functions that bottoms out at `copy`/`copy_nonoverlapping`), where the wrapper function will not get inlined. See the discussion in rust-lang#80290 and rust-lang#81163.

Cc `@bjorn3` `@therealprof`
@RalfJung
Copy link
Member Author

ptr::read should have gotten better now in debug builds, since mem::copy_nonoverlapping is directly exposing the intrinsic.

@wesleywiser

I started working on that optimization last night.

What is the status of that? Both ptr::read and ptr::write would now benefit from this.

@therealprof
Copy link
Contributor

ptr::read should have gotten better now in debug builds, since mem::copy_nonoverlapping is directly exposing the intrinsic.

I can confirm that I'm seeing slight binary size reductions (< 1% in the best case) when comparing some STM32F0 applications using rustc 1.52.0-nightly (8e54a2113 2021-02-13) to either 1.50 or a previous nightly from a days ago in debug binaries.

@wesleywiser
Copy link
Member

I started working on that optimization last night.

What is the status of that? Both ptr::read and ptr::write would now benefit from this.

@RalfJung I opened #81344 with that optimization but completely forgot that we approved an MCP a few months ago to add a MIR StatementKind::CopyNonOverlapping. That work is happening in #77511. Once that is finished, I'll redo #81344 if it's still relevant.

flip1995 pushed a commit to flip1995/rust-clippy that referenced this issue Feb 25, 2021
directly expose copy and copy_nonoverlapping intrinsics

This effectively un-does rust-lang/rust#57997. That should help with `ptr::read` codegen in debug builds (and any other of these low-level functions that bottoms out at `copy`/`copy_nonoverlapping`), where the wrapper function will not get inlined. See the discussion in rust-lang/rust#80290 and rust-lang/rust#81163.

Cc `@bjorn3` `@therealprof`
@mati865
Copy link
Contributor

mati865 commented Mar 14, 2021

FYI #77511 has landed.

@jrmuizel
Copy link
Contributor

@wesleywiser do you still plan on redoing #81344?

@wesleywiser
Copy link
Member

Thanks for the reminder! I've opened #83785 which shows some small, positive improvements to compilation time.

@RalfJung
Copy link
Member Author

read currently calls the intrinsic directly

// We are calling the intrinsics directly to avoid function calls in the generated code
// as `intrinsics::copy_nonoverlapping` is a wrapper function.
extern "rust-intrinsic" {
#[rustc_const_stable(feature = "const_intrinsic_copy", since = "1.63.0")]
fn copy_nonoverlapping<T>(src: *const T, dst: *mut T, count: usize);
}

Is there anything left to do in this issue?

@RalfJung
Copy link
Member Author

For now, considering this fixed by #87827.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-mir-opt Area: MIR optimizations A-mir-opt-inlining Area: MIR inlining A-raw-pointers Area: raw pointers, MaybeUninit, NonNull I-slow Issue: Problems and improvements with respect to performance of generated code. T-libs Relevant to the library team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests