Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Placement by return #2884

Open
wants to merge 15 commits into
base: master
Choose a base branch
from

Conversation

PoignardAzur
Copy link

@PoignardAzur PoignardAzur commented Mar 17, 2020

@burdges
Copy link

burdges commented Mar 17, 2020

I've questions about https://github.com/rust-lang/rfcs/blob/c34989cc81c0dbd96ca65a50fe987ad09f5a6251/text/0000-placement-by-return.md#guide-level-explanation

Is there any good reason why good() works but bad() does not work? Isn't this really just complex bug in rustc? If the return is Sized then could we guarantee no copies so long as the return only gets bound once and gets used directly by the caller?

We cannot improve the situation for bad_dst() though because only bad_dst() can allocate the space for passing between returns_dst() and takes_dst(). It appears this problem exists for Sized types too, so they too requires the new_with, etc. proposed by this RFC.

@PoignardAzur
Copy link
Author

Is there any good reason why good() works but bad() does not work? Isn't this really just complex bug in rustc? If the return is Sized then could we guarantee no copies so long as the return only gets bound once and gets used directly by the caller?

There's no fundamental reason, no.

It just requires more complex analysis (GCE versus GCE + NRVO), and rules that have to be nailed down as to when NRVO does or doesn't apply. This RFC tries to be minimalistic.

@CAD97
Copy link

CAD97 commented Mar 19, 2020

After reading the RFC, I get the understanding that guaranteed copy elision doesn't actually guarantee copy elision for small (roughly register sized) types? This seems problematic, as the guarantee is no longer a guarantee, as it doesn't apply in some cases (though cases where the difference is basically not measurable).

@rpjohnst
Copy link

The downsides of future-possibility Split the discriminant from the payload can be mitigated by applying that split only at function boundaries.

Rust already has two separate ideas of how values are represented- one for passing/returning them, and another for storing in memory when they have their address taken- that's the whole idea behind #[repr(transparent)]. This works because you can't take the address of a value "in motion" like that to begin with.

So if you want to return DST enum variants, you can return the discriminant in a register, and the actual unwrapped variant values the usual way (i.e. via a return pointer for large ones, or yet another register for small ones). Now you can "emplace" any or all of those variants using the same techniques as structs and arrays.

@RustyYato
Copy link

RustyYato commented Mar 19, 2020

@CAD97 also because placement new works by basically passing a hidden out pointer, I don't think that RVO is actually an optimization for types smaller than a word.

On the main proposal, it would be best if we get the infallible case for RVO from a function, because can build some useful apis around that, like: Box::new_with, and maybe even Vec::push_with etc. But I don't think we need to support DST's in the initial proposal. It looks complex, and that could hinder the simpler core proposal from going anywhere.

That said, we could also make some more explicit apis for Box and Vec to allow you to do the allocation before creating the value, using typechecked out pointers. This also allows you to handle !Sized types, but with an unsafe api

https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=602847046fde9cc25ec28d7b33b5fd5a

These explicit apis are harder to use, but they also completely side-step the issue of how to do fallible operations, because the "RVO"ed value isn't returned in the Result. It also makes it super easy to see what's going on, whereas implicit RVO can silently break if you aren't careful (especially because we won't have NRVO even with this proposal).

With the code in the playground, you could write something like this:

let big_value = Box::try_write_new(|uninit| {
    let value = some_fallible_op()?;
    let mut init = uninit.write(BigValue::new(value));
    init.modify(); // we can now safely use `BigValue` through `init`
    Ok(init)
})?;

Which is pretty nice. This will work if we have 3 guarantees,

  • RVO is guaranteed if the value is created on the return from a function/closure
  • Uninit::write is guaranteed to not copy a RVO value, but directly write it into the backing pointer
  • Uninit::write is guaranteed to not copy a value that is constructed at with at the call to write, , but will directly write it into the backing pointer (i.e. uninit.write([0_u8; 1024]) won't copy a kilobyte of data)

I think that these guarantees are simple enough to make (especially since they seem to already work for all the simple cases I tried), so we could easily move forward with a proposal. With this, we can support fallible construction, and the simple case of Box::new_with(BigValue::new).

We can support more elaborate schemes in the future, after we have some sort of basic RVO guarantees.

@scottmcm
Copy link
Member

Awesome to see another stab at this!

Passing a closure is essentially the same thing as passing an initializer list, so it should have the same performance as C++ emplacement.

This is a really clever point 👍

The _with suffix is based on functions like get_or_insert_with. They're basically filling the same role as C++ emplace_ methods.

One thing I've seen from that in C++ is that the usual advice became to essentially just always use emplace. Do we expect the same would happen here? If so, the extra length of _with || would seem unfortunate. (Not that I have a good idea to avoid churn here.)

Hmm, is there some way we could "overload" Box::new and such to take either T or impl FnOnce()->T? I'm not sure if we could get that through coherence (since the output of the FnOnce is an output type), but I also think it's not something that can be violated in stable today. (Roughly I'm thinking that people writing constructors could take impl BikeshedMe<T> to opt-in to supporting this, instead of needing new methods everywhere.

The design for this proposal is meant to allow already-idiomatic "constructors" [..] to be allocated directly into a container.

When I first read this design choice it felt clearly-correct, but I ended up wondering more about it over the course of reading the RFC. I wonder if a different kind of constructor could be less overall churn (somehow) than new versions of anything that puts things in a Bx or vector or whatever. Maybe there'd be a way that only the types large enough to really care about this would have to make the new kind of constructor, but ordinary things wouldn't.

Although as I type that, I guess any generic wrapper-constructor (like Some(x)) would also need to support this new kind of constructor too, so maybe that wouldn't be better anyway...

@scottmcm scottmcm added T-lang Relevant to the language team, which will review and decide on the RFC. T-libs-api Relevant to the library API team, which will review and decide on the RFC. labels Mar 20, 2020
@Ixrec
Copy link
Contributor

Ixrec commented Mar 20, 2020

One thing I've seen from that in C++ is that the usual advice became to essentially just always use emplace. Do we expect the same would happen here? If so, the extra length of _with || would seem unfortunate. (Not that I have a good idea to avoid churn here.)

Interesting. What I heard was that emplacement should theoretically always be at least as fast as insertion, and it is often faster, but there are also a bunch of cases where emplacement is actually slower, won't even compile, introduces an exception safety issue or even introduces subtle UB (Effective Modern C++ covers all this in Item 42). So we just write whichever makes the code more readable unless we care enough to benchmark it.

However, most of the reasons for those annoying cases (implicit type conversions, implicit expensive copies, new making raw pointers, imperfect perfect forwarding) simply don't exist in Rust, so it's plausible that "always use emplacement" would end up being true or closer to true for Rust regardless.

@burdges
Copy link

burdges commented Mar 20, 2020

If you're worried about length then Box::with works fine. I think Box::new should always box the closure, but you could investigate reviving the box keyword for closures.

text/0000-placement-by-return.md Outdated Show resolved Hide resolved
text/0000-placement-by-return.md Outdated Show resolved Hide resolved
text/0000-placement-by-return.md Outdated Show resolved Hide resolved
@kennytm
Copy link
Member

kennytm commented Mar 20, 2020

Since you mentioned "generator" here, the RFC could be simplified a lot (hiding the state machine) if it can be explained in terms of #2033 generator.

From what I understand, every return dst_object; gets desugared into something like

let layout = Layout::for_value(&dst_object);
let slot = yield layout;
slot.copy_from_nonoverlapping(&raw dst_object as _, layout.size());
return <*mut Dst>::from_raw_parts(slot, ptr::metadata(&raw dst_object));

so e.g.

fn with_multiple_exit_points() -> [i32] {
    while keep_going() {
        if got_cancellation() {
            return [];
        }
    }
    let n = 100;
    [1; n]
}

gets desugared as #2033 generators like

fn with_multiple_exit_points___gen() 
    -> impl Generator<*mut u8, Yield=Layout, Return=*mut [i32]> 
{
    move |_: *mut u8| {
        while keep_going() {
            if got_cancellation() {
                // BEGIN REWRITE -----------------------------------------------
                let slot = yield Layout::array::<i32>(0).unwrap();
                let p = slot as *mut i32;
                return ptr::slice_from_raw_parts_mut(p, 0);
                // END REWRITE -------------------------------------------------
            }
        }
        let n = 100;
        // BEGIN REWRITE -----------------------------------------------
        let slot = yield Layout::array::<i32>(n).unwrap();
        let p = slot as *mut i32;
        for i in 0..n {
            unsafe {
                p.add(i).write(1);
            }
        }
        return ptr::slice_from_raw_parts_mut(p, n);
        // END REWRITE -------------------------------------------------
    }
}

and then read_unsized_return_with could be implemented by wrapping that generator...,

pub struct ReadUnsizedReturnWithFinish<G> {
    layout: Layout,
    generator: G,
}

impl<T, G> ReadUnsizedReturnWithFinish<G>
where 
    T: ?Sized, 
    G: Generator<*mut u8, Yield = Layout, Return = *mut T> + Unpin,
{
    fn new(mut generator: G) -> Self {
        match Pin::new(&mut generator).resume(ptr::null_mut()) {
            GeneratorState::Yielded(layout) => Self { layout, generator },
            _ => panic!("generator completed without returning a layout"),
        }
    }
    pub fn finish(mut self, slot: *mut u8) -> *mut T {
        match Pin::new(&mut self.generator).resume(slot) {
            GeneratorState::Complete(ptr) => ptr,
            _ => panic!("generator returned too many layouts"),
        }
    }
    pub fn layout(&self) -> Layout {
        self.layout
    }
}

which can be used directly in the desugaring...

fn function_that_calls_my_function() -> str {
    println!("Hi there!");
    my_function()
}

fn function_that_calls_my_function___gen() 
    -> impl Generator<*mut u8, Yield = Layout, Return = *mut str> 
{
    move |_: *mut u8| {
        println!("Hi there!");
        // BEGIN REWRITE -----------------------------------------------
        let state = ReadUnsizedReturnWithFinish::new(my_function___gen());
        let slot = yield state.layout();
        return state.finish(slot);
        // END REWRITE -------------------------------------------------
    }
}

and write_unsized_return_with can similarly be implemented by being such a generator

fn write_unsized_return_with___gen<T: ?Sized>(layout: Layout, f: impl FnOnce(*mut u8) -> *mut T) 
    -> impl Generator<*mut u8, Yield = Layout, Return = *mut T> 
{
    move |_: *mut u8| f(yield layout)
}

@PoignardAzur
Copy link
Author

PoignardAzur commented Mar 21, 2020

@CAD97

This seems problematic, as the guarantee is no longer a guarantee, as it doesn't apply in some cases

That's a good question.

On a specification level, GCE would take the form of guarantees that any code observing the address of the variable being returned (eg trait methods) would "see" the address stay the same. This might have implications for eg Pin types.

However, upholding these guarantees for register-sized types would probably be terrible for performance (compared to just passing the return value in the rax register).


@rpjohnst

Rust already has two separate ideas of how values are represented- one for passing/returning them, and another for storing in memory when they have their address taken- that's the whole idea behind #[repr(transparent)]. This works because you can't take the address of a value "in motion" like that to begin with.

Can you elaborate?

In particular, how do you think the following code should be handled, layout-wise?

fn bar() -> Result<SomeType, Error> {
    let my_data : Result<SomeType, Error> = get_data();
    
    foo(&my_data);
    my_data
}

@ KrishnaSannasi

With the code in the playground, you could write something like this:

You can already do something similar (and safer) using the current proposal:

let big_value = || {
    let value = some_fallible_op()?;
    Ok(Box::new_with(|| BigValue::new(value))
}()

The hard case is when some_fallible_op and BigValue::new are the same function.


@scottmcm

One thing I've seen from that in C++ is that the usual advice became to essentially just always use emplace. Do we expect the same would happen here? If so, the extra length of _with || would seem unfortunate. (Not that I have a good idea to avoid churn here.)

Yeah, I had the same reaction.

It gets even worse if we nail down semantics for placement return of Result; then you get Box::new_with_result(|| actual_data)? instead of Box::new(actual_data?) .

One workaround would be to use the macros I proposed (eg box!(actual_data)).

But the fundamental problem is that a permanent solution would require non-transparent semantics; that is, a way to write a function thay says "I'm pretending to take a X, but I actually take a closure returning X" (in such a way that they're indistiguishable in calling code); something like lazy arguments in D.

I'm not sure the Rust community is willing to accept that kind of feature.

@rpjohnst
Copy link

Can you elaborate?

In particular, how do you think the following code should be handled, layout-wise?

fn bar() -> Result<SomeType, Error> {
    let my_data : Result<SomeType, Error> = get_data();
    
    foo(&my_data);
    my_data
}

Result<SomeType, Error> has a fixed memory layout that overlaps SomeType and Error, and encodes which one is live as a tag or an otherwise-invalid bit pattern. This fixed memory layout enables any code with a pointer to Result<SomeType, Error> to use it reliably.

But get_data and bar return a Result<SomeType, Error> by move, so there can be no source-level pointers around to rely on that layout. As long as the value is back in that layout for foo(&my_data), we can use a second fixed representation to communicate between callees and callers.

Assume, for example, that SomeType and Error are large enough that the usual ABI for returning them by value is for the caller to pass in a pointer for the callee to write to. Then, bar works like this:

  • As a hidden parameter, receive two pointers from its caller- one *mut SomeType and one *mut Error.
  • Allocate stack space for Result<SomeType, Error>'s memory layout.
  • Compute two pointers into that stack space, corresponding to the locations of SomeType and Error in a Result.
  • Call get_data with those two pointers. It will fill in exactly one of them, and place a discriminant for which one in a fixed register.
  • Encode that discriminant in the allocated stack space.
  • Call foo with a pointer to that stack slot.
  • Match on my_data and copy either a SomeType or Error through one of the pointers received from the caller. Mark which one with a value in a fixed register.

We can specialize this for other representations of SomeType and/or Error. If it's small enough, we skip its hidden pointer argument, the callee places it in a second fixed register, and the caller writes that register into memory if necessary.

Now if someone writes Box::try_with(get_data), try_with can work like this:

  • As a hidden parameter, receive a *mut Error pointer.
  • Allocate heap space for SomeType.
  • Call get_data with a *mut SomeType to the new allocation and the *mut Error parameter. It will either write a SomeType directly to the heap, or an Error directly through the pointer, and place a discriminant in a register.
  • Match on the discriminant:
    • If Ok, place an Ok discriminant in a register, and the newly-constructed Box<SomeType> in a second register.
    • If Err, free the heap allocation, and place an Err discriminant in a register.

This also suggests a more precise framing for @CAD97's question about guarantees. We can restrict the guarantee of "has the same address before and after the return move" to types whose function-return ABI goes through memory. With a specified ABI, this would become a real answer; today it is a less-vague version of "roughly register sized."

@PoignardAzur
Copy link
Author

PoignardAzur commented Mar 21, 2020

@rpjohnst So if I'm understanding your proposed semantics, you're saying that SomeType would be treated as a contiguous block of memory when being passed to a function, and as a pair of placement pointers when being returned?

That would be one possible design, yeah. The downside is that it clashes with NRVO in non-obvious ways. In any case, it's beyond the scope of this RFC.

@comex
Copy link

comex commented Mar 22, 2020

I don't agree that it makes sense to punt fallible allocator support.

After all, we surely want to support them eventually. Any design that supports fallible allocators can almost certainly be easily modified to support infallible ones, but the reverse isn't true. If this RFC is accepted, but we later discover that supporting fallible allocators requires a completely different design, we'll end up having to maintain two new sets of APIs in all the collections, on top of the existing non-placement-aware APIs. One set of duplicate APIs will already be a (well-justified) burden for language learners; there's no need to add another!

Therefore, we shouldn't accept the RFC without at least collectively agreeing that fallibility can be done with the same base design. But if we've managed to agree on that, it's not much further of a step to just put it in the RFC.

For what it's worth, I think fallibility can be done with the same base design, in the way @rpjohnst and others have been discussing.

@PoignardAzur
Copy link
Author

@kennytm There are pros and cons to using the generator syntax like you mention. On one hand, it's a little more concise; on the other hand, it requires keeping another hypothetical feature in mind; I'm not sure this helps readability.

On the other other hand, I'm thinking if I wrote this RFC from the ground up now, I would use the yield syntax, so maybe I should do that anyway; I'm just not sure the readability improvements (if any) are worth rewriting half the RFC. It's understandable as it is.

I'll fix the typos you mentioned.

@comex To be fair, I think this RFC already does an adequate job showing that fallible emplacement isn't a dead end. It already includes two possible strategies for fallible emplacement, and mentions some pros and cons for each strategy.

I'm going to explore the solution space a little deeper, but I don't think this RFC should commit to a future strategy or spend too much of its complexity budget on detailing a future extension.

@clarfonthey
Copy link
Contributor

So, I'm not 100% able to follow this discussion, but from a user perspective, it's not super great to have there be a huge performance gap between Box::with(|| [0; 10000]) and Box::new([0; 10000]). From an ergonomics perspective, being able to do something like Box::with(SomeType::new) is cool, but it seems a bit too magical that return values are special and parameters are not.

IMHO, I'd rather be explicit if the magic isn't applied uniformly. In other words, the closure you pass in should take an argument of &mut MaybeUninit<Self>.

@RustyYato
Copy link

So, I'm not 100% able to follow this discussion, but from a user perspective, it's not super great to have there be a huge performance gap between Box::with(|| [0; 10000]) and Box::new([0; 10000])

I don't think that we can fix this without changing the semanics of Box::new([0; 10000]). This is because [0; 10000] is evaluated before Box::new is called. i.e. before the allocation is even available. This means that you must copy the array from the caller into the new allocation. The only way around this is to delay evaluation of [0; 10000] until we have an allocation we can directly write into. This is precisely the abstraction that closures provide.

IMHO, I'd rather be explicit if the magic isn't applied uniformly. In other words, the closure you pass in should take an argument of &mut MaybeUninit.

This is kinda of what I proposed earlier, but in a more type-safe way. If we just passed a &mut MaybeUninit<Self>, you would have no way of enforcing the write. But with the way I described in the linked comment, you do.

being able to do something like Box::with(SomeType::new) is cool, but it seems a bit too magical that return values are special and parameters are not.

Note: LLVM already performs this optimization, all we are doing here is making the guarantee that this optimization will always (reliably) be performed.

@clarfonthey
Copy link
Contributor

clarfonthey commented Mar 27, 2020

You are right about the ordering-- I didn't even consider that part.

I guess that my main concern is the semantics of movement here. Assuming we omit the "bad" case and boil stuff down to:

Box::with(|| {
    let x = initial_state;
    f(&mut x);
    x
})

Then conceptually, I see no difference between this and:

let x = initial_state;
f(&mut x);
Box::new(x)

Basically, you're arguing that there's a semantic ordering between function calls, and that Rust should automagically coerce function types into this special generator type. Whereas, I feel like it is more like semantically changing the meaning of bindings, and how the compiler could specially interpret let bindings, function parameters, and return values as referring to the same memory addresses for non-Copy objects.

Because really, what you're proposing is treating it like this for a small subset of use cases. But the main issue is that this is adding new APIs that would advise against this being applied to other use cases in the future.

Obviously, it makes sense to not strictly define that Rust never moves data around ever, but given the way Rust's ownership system works, I can't see a clear way to explain to a user why return values are special and parameters are not, especially when we have to specifically access the code inside the function in order to make the special changes that permit this behaviour.

Not 100% sure if what I'm saying makes sense, but it's mostly where I'm at right now as far as thinking about this.

EDIT: Essentially, you're saying that the provided closure is the one that must be modified to make the no-copy elision work, and not ever Box::new.

@CAD97
Copy link

CAD97 commented Mar 27, 2020

Whereas, I feel like it is more like semantically changing the meaning of bindings, and how the compiler could specially interpret let bindings, function parameters, and return values as referring to the same memory addresses for non-Copy objects.

Note that this proposal explicitly does not guarantee NRVO, so your example would not do anything emplaced, and x would still semantically be a stack location. (Guaranteed) RVO only kicks in when you return an "emplacable expression," which is a struct/array/tuple literal or another emplacable function call. An intervening binding will inhibit this guarantee.

Yes, this does create a theoretical performance penalty for extracting a temporary value. But in practice, this shouldn't bite many people; if Box::with(|| { [large; array] }) is guaranteed to RVO, then Box::with(|| { let x = [large; array]; x }) should rather easily optimize to the same code; it's just the guarantee that is lost, not the actual optimization.

I don't think we should try to guarantee NRVO for your example. Why? Because NRVO is complicated to specify, and it's easily rewritable to not need it:

let mut boxed = Box::with(|| initial_state);
f(&mut *boxed);
boxed

In fact, we could probably get away without actually doing anything special for Box::with to guarantee the copy elision on our side, and just ensure LLVM does it, because it's already pretty good at eliminating the memcpy with that API. Guaranteeing it on our side is probably better, as it works with less inlining information and without optimizations, but this is just formalizing the pattern that already works.

(Edit: to be clear, I'm not against having nrvo and friends as an optimization; I just think guaranteeing it is not necessary.)

@burdges
Copy link

burdges commented Mar 27, 2020

Aside from guarantees, there are more complex cases cases for NVRO too, so an #[nrvo] attribute could enforce both guarantees and express the desired NRVO structure.

fn foo(..) -> ( #[nrvo] Big1 , Result<#[nrvo] Big2,io::Error> ) { }

fn bar(..) ->  Result<#[nrvo] Big2,io::Error> {
    let mut ret = Err(..);
    let b = Box::with(|| { 
        let mut a; 
        (a,ret) = foo(..);
        a
    });
    if Some(a) = ret { a.attach_insert("foo", b); }
    ret
}

In foo, we do placement return for a Big1 into one caller provided buffer and for a Big2 into a separate caller provided buffer. We treat the Big2 buffer as a "dissociated" internals for a Result<Big2,io::Error>, our caller should not move anything if they unpack the Result themselves with inlinable methods, but might require moves if they pass the Result<Big2,io::Error> to non-inlined code.

In bar, we allocate for the Big1 but pass through the placed Big2 and "dissociated" result, and then call some Big2 methods that attached the Big1.

I suppose an #[nrvo] attribute might mostly encourage hard to optimize code, but worth mentioning.

As an aside, we'd exploit better NRVO optimizations, and "dissociated" NRVO enums, lots in cryptography where folks often return types like Option<[u64; 5]>. We'll keep this [u64; 5] on the stack, but we currently eat too much stack space on smart cards, etc. if we return it through multiple layers. We should zero any stack afterward, which improves performance even on heavier machines.

@PoignardAzur
Copy link
Author

PoignardAzur commented Mar 27, 2020

You are right about the ordering-- I didn't even consider that part.

I guess that my main concern is the semantics of movement here. Assuming we omit the "bad" case and boil stuff down to:

Box::with(|| {
    let x = initial_state;
    f(&mut x);
    x
})

Then conceptually, I see no difference between this and:

let x = initial_state;
f(&mut x);
Box::new(x)

There is a difference, because at least some part of the code in Box::new must execute before initial_state can be computed. Before Box::new is called, the space in memory where initial_state will be stored doesn't even exist.

Of course, one could argue that in your second snippet, the compiler could reorder the calls so that Box::new is called at the beginning of the function, and initial_state is emplaced into the result; that would essentially be NRVO on steroids. And, personally speaking, I do want that feature to be added to Rust eventually.

But the semantics aren't trivial. Box::new could theoretically have any number of side-effects; even use interior mutability to change the value of initial_state under the developer's feet. Which means reordering Box::new and initial_state could silently change the observable behavior of the program, which is a huge deal-breaker.

tl;dr The fundamental reason the RFC is written that way is that space must be allocated before data can be emplaced into it; and returning data from closures that are called inside of the functions allocating the data is the simplest way Rust can express that (yet).

@comex
Copy link

comex commented Sep 27, 2021

In release mode it doesn't. -Zshare-generics is enabled by default when optimizations are disabled. This causes all instantiations of generic functions defined in the local crate to be exported such that upstream crates can reuse them. When optimizations are enabled, -Zshare-generics is disabled by default and as such all instantiations of generic functions are marked as internal.

I see…

But even in debug mode, rustc is apparently able to differentiate between a closure whose type can escape to downstream crates and one which can't, and mark only the latter internal. I don't know where that logic lives or how it works. But at first glance it seems like the same could be applied to other generic instantiations involving the closure's type, not just the closure implementation itself.

However, it's a moot point if sharing is disabled in release mode anyway.

@kennytm
Copy link
Member

kennytm commented Oct 15, 2021

🤔 The RFC and comment mentioned async trait very few times and no one actually elaborated on it.


So I was thinking about the "dyn async traits" blog series (1, 2, 3, 4, 5). As of nikomatsakis/dyner#2 the dynamic version is always returning a Pin<Box<dyn Future, Global>>.

In terms of this RFC we want to make the dynamic async fn returns an unsized dynamic future instead of a Box:

impl<'a, Item> AsyncIter for dyn AsyncIter<Item = Item> + 'a {
    type Item = Item;

    type Next<'this> = dyn Future<Output = Option<Item>> + 'this  // <- associate with unsized type
    where 
        Self: 'this,
        Item: 'this;
    
    fn next(&mut self) -> Self::Next<'_> { // <- use unsized return
        let (data_ptr, vtable) = (self as *mut _).to_raw_parts();
        (vtable.next)(data_ptr) // <- delegate to function returning DST
    }
}

The question is how do we represent the type of vtable.next?, What is a function pointer returning an unsized type?

struct AsyncIter_VTable {
    ...
    next: fn(*mut ()) -> dyn Future<Output = Option<Item>> + '_, // ??????
}

In this RFC such functions becomes a state machine structure, whose size and operations depend entirely on the function itself. AFAIK the size of this structure is close to the actual stack size needed, so we can alloca(3) it on stack. If it overflows it is inevitable.

struct ErasedUnsizedFn<Args, Return: ?Sized> {
    self_layout: Layout,
    drop: unsafe fn(*mut ()),
    init: unsafe fn(*mut (), Args),
    start: unsafe fn(*mut ()) -> Layout,
    finish: unsafe fn(*mut (), *mut u8) -> &mut Return,
}

struct AsyncIter_VTable {
    ...
    next: &'static ErasedUnsizedFn<(*mut (),), dyn Future<Output = Option<Item>> + '_>,
}

impl<'a, Item> AsyncIter for dyn AsyncIter<Item = Item> + 'a {
    ...

    fn next(&mut self) -> Self::Next<'_> {
        let (data_ptr, vtable) = (self as *mut _).to_raw_parts();
        unsafe {
            let gen: *mut () = alloca(vtable.next.self_layout);
            (vtable.next.init)(gen, (data_ptr,));
            defer!((vtable.next.drop)(gen));
            let layout = (vtable.next.start)(gen);
            write_unsized_return_with(layout, |slot: *mut u8| {
                (vtable.next.finish)(gen, slot)
            })
        }
    }
}

The coroutine nature of this RFC forces the arguments Args in the function pointer fn(Args) -> dyn X to be placed inside the state machine on stack. This is fine for AsyncIter::next() which only needs 8 bytes for &mut Self, but not fine for e.g. hyper::service::Service::call() which takes an arbitrarily large Request input.

For async fn use making next a coroutine is probably just overkill. The Self::Next<'_> type is always sized for concrete impls, and the compiler generates the body of T::next(), so we could instead rewrite the vtable entry as

struct AsyncIter_VTable {
    ...
    next_vtable: &'static Future_VTable,
    next: fn(*mut u8, *mut ()),
    //       ^~~~~~~  ^~~~~~~
    //        slot     self
    //
    // for large struct this is ABI-compatible with fn(*mut ()) -> S.
    // for small struct fitting into registers we may need to add a shim.
}

impl<'a, Item> AsyncIter for dyn AsyncIter<Item = Item> + 'a {
    ...
    fn next(&mut self) -> Self::Next<'_> {
        let (data_ptr, vtable) = (self as *mut _).to_raw_parts();
        unsafe {
            write_unsized_return_with(vtable.next_vtable.layout, |slot: *mut u8| {
                (vtable.next)(slot, data_ptr);
                &mut *<*mut _>::from_raw_parts(slot, vtable.next_vtable)
            })
        }
    }
}

@mindstorm38
Copy link

What is the status of this RFC? Has another one taken over since the last message? These optimizations seem very important for the language, maybe it's not the ideal solution, but I'd like to know what the state of the art is.

@CAD97
Copy link

CAD97 commented Aug 29, 2022

MIR optimization has continued to improve, and I think "destination propagation" serves some of the intended benefit of placement by return. The async trait case is making its own progress as well, though there the discussion is currently mostly around a "dyn*" mechanism rather than a two phase function splitting.

Guaranteeing placement remains a complicated question with a lot of moving parts, so at the time being it seems more beneficial that people's effort is focused on providing move elision best effort where it is easier to provide rather than the legwork to guarantee it where the cost benefit trade-off is more complicated.

@mindstorm38
Copy link

Can you give any link to talks/resources about MIR optimization? I guess that destination propagation is the "hidden-way" of doing what this RFC suggests, am I right?

@scottmcm
Copy link
Member

@mindstorm38 You could check out https://rustc-dev-guide.rust-lang.org/mir/passes.html, or the code in https://github.com/rust-lang/rust/tree/master/compiler/rustc_mir_transform/src.

@y86-dev
Copy link

y86-dev commented Sep 18, 2022

I am currently trying to remove unsafe from the linux kernel's rust API to initialize a Mutex<T>. I found this RFC and hope to add some feedback as to how this proposal could be extended to allow exactly that. As I have quiet a few things to talk about, I have created this blog post: https://y86-dev.github.io/blog/return-value-optimization/placement-by-return.html

@ssokolow
Copy link

ssokolow commented Sep 19, 2022

I am currently trying to remove unsafe from the linux kernel's rust API to initialize a Mutex<T>. I found this RFC and hope to add some feedback as to how this proposal could be extended to allow exactly that. As I have quiet a few things to talk about, I have created this blog post: https://y86-dev.github.io/blog/return-value-optimization/placement-by-return.html

I'm very much not the kind of person to be talking about the technicals of anything unsafe-related, but I do have some thoughts I'd like to share on the developer ergonomics and language feel side of things.

#[in_place]

The optimizer would of course still be allowed to do RVO on other functions, but it would not be guaranteed.

Disadvantages

Users will have to manually specify the attribute, new users could end up confused (without the attribute, they would be in the dark, this can be viewed as better or worse, depending on how you want to see it). This would also require existing code which wants RVO to be augmented with the attribute.

On the one hand, I can see such an attribute fitting in like pub, const, or the long-ago-reserved option of using become in place of return for the guaranteed tail call optimization that proved harder than expected to achieve.

On the other hand, I'm reminded of both the noise of things like Java's public static void main and how #[inline] becomes noise and pointless boilerplate in the mind of anyone who has the missing_inline_in_public_items clippy lint enabled. (Something I generally enable because, from what I've read, the ability or inability to choose to inline has an outsized effect on how much the optimizer can do with a codebase.)

...made worse by how I often find #[inline] and then #[must_use] being noise stuck on top of every infallible (i.e. not returning Result or Option) public function in my codebase... but not noise I can put out of conscious thought because Clippy will complain if I let it become rote and put it on things that return a #[must_use] type.

(And, given that, even for CLI utilities, I need to put most of my code in lib.rs, and use pub fn, and tests/ to get access to CARGO_TARGET_TMPDIR to implement filesystem mocking in what are conceptually unit tests without pulling in a whole new dependency, that happens a lot.)

There could be a new lint that would warn when a value returned by a #[in_place] function is copied and if the place it is copied to is the parameter of a crate-local function the compiler could suggest adding lazy to that parameter.

Given how many misunderstand how to use black_box in their benchmarks, I think that would be a footgun unless the lint were perfect, in which case it would call into question why we can't use something more concise to ask for it and let the compiler do the rest.

pub fn try_new(#[pin_in_place] lazy data: T) -> Result<Pin<Self>, AllocErr>;

See my prior comment about my concern with Rust unnecessarily becoming too public static void main-y.

@y86-dev
Copy link

y86-dev commented Sep 19, 2022

Continuing discussion on zulip.

@Jules-Bertholet
Copy link
Contributor

I posted a sketch of an alternative to the generator design, that addresses some of the pain points of this RFC for functions that return unsized types, on IRLO.

@burdges
Copy link

burdges commented Nov 5, 2022

Rust should maybe add some method resembling

pub fn replace_with<T,U,F>(dest: &mut T, f: F) -> U where F: FnOnce(T) -> (T,U)

that makes reset methods unnecessary by ensuring zero copies.

There is considerable boilerplate like the *reset functions in the digest crate and its hashers, which could be reduced if this following code contains no copies of hasher state:

hasher.replace_with( |h| (H::new(), h.finalize()) )

@nbdd0121
Copy link

nbdd0121 commented Nov 5, 2022

Such function would need to abort if f panics. There are crates implementing this already, e.g. https://docs.rs/replace_with.

@shepmaster
Copy link
Member

And #1736

@Nadrieril
Copy link
Member

Nadrieril commented Feb 23, 2023

For the "return Result" usecase, I wanted to point out a possible angle. Notice that the desugaring into two functions is equivalent to desugaring into a single function with signature fn(...) -> impl ReadUnsizedReturnWithFinish<T>, where ReadUnsizedReturnWithFinish was defined as:

pub trait ReadUnsizedReturnWithFinish<T: ?Sized> {
    pub fn layout(&self) -> Layout;
    pub fn finish(self, &mut MaybeUninit<T>);
}

My understanding is that most of the RFC is about adding compiler magic so that functions that return impl ReadUnsizedReturnWithFinish<T> can be disguised as functions that return T.

In that reading, the solution to manage Result is straightforward: we want fn(...) -> Result<impl ReadUnsizedReturnWithFinish<T>, E>. This way the function first returns the desired variant and the layout of the payload, then provides a continuation to initialize the payload. E.g.:

impl Box<T> {
    fn emplace(x: impl ReadUnsizedReturnWithFinish<T>) -> Box<T>;
}
fn foo() -> Result<impl ReadUnsizedReturnWithFinish<T>, E> { ... }
fn use_foo() -> Result<(), E> {
    let box: Box<T> = Box::emplace(foo()?);
    ...
}

More generally, this RFC feels pretty similar to async at a fundamental level: it introduces a new effect that we want to make as convenient and transparent as possible, but of course this is difficult when mixed with other existing effects like Result, iterators, etc. I'm hoping this angle could help finding more general solutions. In fact, just like with async, we can already define ReadUnsizedReturnWithFinish and APIs around it to experiment.

@NobodyXu
Copy link

this RFC feels pretty similar to async at a fundamental level

I remember there is another (pre-)RFC proposing to implement this using generator, which is the implementation detail of async fn.

@xmh0511
Copy link

xmh0511 commented Jun 7, 2023

This RFC seems to only guarantee the copy elision in the return value of functions, however, a generalized guaranteed copy elision, in terms of C++, will include the following case:

let v = S{field1:xxx, field: yyy};

This directly initialized the field1 and field2 of v(result object) with xxx and yyy, respectively. Instead of creating an temporary object and copy/memcpy the temporary object to v.

@digama0
Copy link
Contributor

digama0 commented Jun 7, 2023

There is no need for an RFC to handle that case, as that temporary is not observable (unlike in C++). It can be eliminated as an optimization without language changes.

@James-E-A
Copy link

James-E-A commented Apr 23, 2024

The following operations should be guaranteed zero-copy: … Array and struct literals, including struct literals ending with an unsized value.

Should tuples be added to that? Or are they already considered a "struct literal"?

(Guaranteed) RVO only kicks in when you return an "emplacable expression," which is a struct/array/tuple literal or another emplacable function call.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Libs-Tracked Libs issues that are tracked on the team's project board. T-lang Relevant to the language team, which will review and decide on the RFC. T-libs-api Relevant to the library API team, which will review and decide on the RFC.
Projects
None yet
Development

Successfully merging this pull request may close these issues.