Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Modifications to an l-value that an r-value may be referencing #1830

Closed
josh11b opened this issue Jul 29, 2022 · 18 comments
Closed

Modifications to an l-value that an r-value may be referencing #1830

josh11b opened this issue Jul 29, 2022 · 18 comments
Labels
leads question A question for the leads team

Comments

@josh11b
Copy link
Contributor

josh11b commented Jul 29, 2022

In (not currently accepted) proposal #821 , a let may either store an address (like a const & in C++) or a copy of the original value. This leads to questions about the following code:

var a: T = ...;
let b: T = a;
// Is this allowed?
a.Mutate();
// Is this allowed?
F(&a, b);

Questions:

  • Do we prevent mutations to a while there is an outstanding let that may be referencing it? Under what circumstances? For example, do we only prevent mutations that can be seen in a local analysis of the function.
  • In situations where mutations are not prevented at compile-time, what is their effect? Is it considered a bug? Does it have defined behavior? Do mutations to a affect the value of b?
  • Is the behavior different for uncopyable types T that must always be implemented using a const reference?
  • Is the behavior different for T == i32, where many people might expect let to make a copy?

Approaches:

  • Undefined behavior, no diagnostic required. Mistakes which can be seen by looking at a single function in isolation would hopefully be diagnosed at compile time. Goal is to give the optimizer the most freedom to produce the fastest code. Like other undefined behavior, the hope is that sanitizers would be able to detect and report these kinds of errors.
  • Unspecified but not undefined behavior. Mutations to the l-value may or may not be reflected in the let. Two variations based on whether local modifications would be diagnosed. It could be an error to depend upon the behavior when the l-value is mutated, and possibly a sanitizer run in debug mode would be allowed to treat any such modification as a bug to be reported.
  • Safe, but restrictive behavior. Mutations to or escaping the address of an l-value is illegal in the non-lexical scope between the declaration of a let and the last usage of that name.
  • Safe, a let is conceptually a copy that may be optimized away, but uncopyable types are not supported or have additional restrictions.
  • Behavior could be dependent on whether a type implements an interface that marks it as particularly cheap or good to copy eagerly, or keep in registers. So for marked types, let is a copy and for others it is a reference and there are restrictions on usage. Concern is being able to mark types accurately and comprehensively (for example, we might want a generic type to only be marked for certain argument values).
  • Defined behavior, dependent on whether the type is marked, has copy or const reference semantics, no restrictions on usage.

One general concern is how this would work with C++ interop. It would be nice to be able to pass a let to a C++ function taking a const & parameter. Choices: (a) lets are always const &, (b) passing a let to a const & causes the let to be "pinned" giving it an address (maybe just for the duration of the call?).

@chandlerc
Copy link
Contributor

@josh11b and @mconst and I talked a whole bunch about this.

I was really not seeing options that I was happy with or that felt satisfying. My inclination was to proceed with an "unspecified" model above, but allowing diagnosing local modification on a somewhat "best effort" case rather than in a super rigorous way. My hope was to get realistic code built up in Carbon with this model, and then evaluate the practical impact of the risk of bugs and the practical cost of the many different options for introducing safety. The down side is that it is a really expensive experiment to run. I think that left mconst and josh11b quite uncomfortable (and many others as well).

After discussing a bunch of other variations on experiments we could run, a specific new idea emerged that I found much more compelling. It is very close to a previous one, but for various reasons was (for me at least) not at all obvious until much later.

The key idea is to make the initializer for let bindings be the special thing, and give it safe but restrictive behavior with explicit escape hatches both for "copy" (to make things safe) and "ref" (to explicitly opt into the unsafe semantics).

The result would be:

// An L-value that is never escaped or mutated.
var never_escaped: T = ...;

// An L-value that is escaped.
var escaped: T = ...;
escape.MutateMe();

// ✅ - we can prove this is safe.
let safe: T = never_escaped;

// ✅ - a copy makes this safe.
let safe_copy: T = copy escaped;

// ✅ - explicitly opts into potentially unsafe reference semantics.
let unsafe_ref: T = ref escaped;

// ❌ - We didn't prove this as safe, so must choose a copy or ref.
let error: T = escaped;

For expressions used as function arguments, we have a different model:

fn F(x: T);

// ✅ - Trivially safe
F(never_escaped);

// ✅ - Allowed although risky because the scope is so small and use cases pervasive.
F(escaped);

Basically, function arguments don't require explicit ref. (And var doesn't require an explicit copy.)

We suspect that this is the right initial position. The very limited scope of function input parameters makes them much less worrying to default into potentially risky reference semantics. But the scope of let bindings makes that much more likely to be a problem, and so we require it to be explicit.

This doesn't change the semantics of the R-values themselves though -- they still are allowed make copies if legal and useful. Otherwise they behave exactly as a const & in C++ would (no UB here).

But converting from L-value to R-value may require an explicit ref for escaped L-values based on the context. My initial idea for how to select contexts is based on the local lexical scope impacted. Expression scopes wouldn't require an explicit marker, but statements and blocks would. And you can always make this trivial by creating a copy first, here I'm suggesting with a copy unary operator[1].

The result I think will preserve the useful parts of the experiment I was hoping to run with fairly minimal effect. It's mostly intuition and guess work, but I feel like people will actually add the required copy and ref markers. Even if they switch to var instead of o copy, I think we'll still be able to analyze the code and understand that. So we can evaluate whether we should try to improve the ergonomics here by making copy implicit some of the time or all of the time, or making ref implicit some or all of the time.

We'll also know exactly how to consider analogous changes with arguments if we want to.

So I think this gives us a pretty solid direction to explore here w/ a minimal risk of constantly dodging bug risks, but still providing reasonable ergonomics, especially on the interop boundary into C++ where const & parameters will accept the same kinds of arguments as R-value input parameters in Carbon do.

I think I've recalled all of this end state from the discussion correctly, but @mconst, please let me know if I missed anything here.

PS / [1]: Regarding the copy operator, I happen to like this spelling a bit more than .clone() because the latter for me (coming from C++) evokes thoughts of heap allocation and a virtual function. But the naming isn't the most important thing here other than keeping it as short as possible.

PPS: I think we could re-use copy and ref markers with very analogous (or identical) functionality in contexts like lambda captures.

@mconst
Copy link
Contributor

mconst commented Jul 30, 2022

Thanks for writing this all up!

I think this is a big improvement over the previous design -- even if we end up making changes later, this seems usable enough that we could at least try it out for a while and see how it works. I kind of suspect we'll end up wanting to make copy implicit sometimes (at least for types like i32), but that's easy to do later. And the implicit ref on function arguments still makes me a little uncomfortable, but it's certainly convenient, and I can't think of too many practical situations where it would actually be a problem. We can always revisit once we get some more data on how these things are used.

@mconst
Copy link
Contributor

mconst commented Jul 30, 2022

This doesn't change the semantics of the R-values themselves though -- they still are allowed make copies if legal and useful. Otherwise they behave exactly as a const & in C++ would (no UB here).

Hmm, if this is just unspecified, does that make it harder for sanitizers to report an error if you accidentally modify an lvalue while it has a live rvalue binding somewhere? I'd kind of like sanitizers to be able to unambiguously report this as illegal, even if they can't prove that your code depends on seeing the old or new value.

@chandlerc
Copy link
Contributor

This doesn't change the semantics of the R-values themselves though -- they still are allowed make copies if legal and useful. Otherwise they behave exactly as a const & in C++ would (no UB here).

Hmm, if this is just unspecified, does that make it harder for sanitizers to report an error if you accidentally modify an lvalue while it has a live rvalue binding somewhere? I'd kind of like sanitizers to be able to unambiguously report this as illegal, even if they can't prove that your code depends on seeing the old or new value.

I definitely want this to be OK to sanitize.

But I don't think we should introduce any actual UB here, as I don't think there is (nearly) enough optimization benefit to matter.

@JamesJCode
Copy link

Just getting my head around this; in your examples, what is the default behaviour where no initialization specifier(?) is present:

// ✅ - we can prove this is safe.
let safe: T = never_escaped;

// ❌ - We didn't prove this as safe, so must choose a copy or ref.
let error: T = escaped;

It feels like by-ref-but-the-compiler-will-check-my-homework, as opposed to using ref or copy explicitly meaning don't-nag-me-please-I've-got-this.

@chandlerc
Copy link
Contributor

Just getting my head around this; in your examples, what is the default behaviour where no initialization specifier(?) is present:

// ✅ - we can prove this is safe.
let safe: T = never_escaped;

// ❌ - We didn't prove this as safe, so must choose a copy or ref.
let error: T = escaped;

It feels like by-ref-but-the-compiler-will-check-my-homework, as opposed to using ref or copy explicitly meaning don't-nag-me-please-I've-got-this.

Basically, yes.

@JamesJCode
Copy link

JamesJCode commented Jul 30, 2022

Makes sense - thanks. One last (honest!) question about function parameters. Is it intended to allow this?:

fn F(x: T);

F(copy escaped);

I could get a similar end result by declaring fn F(var x: T);, but then I'd have the added effect of allowing mutability of my local copy of x in the function, where I might want the immutable property of a default let binding.

@mconst
Copy link
Contributor

mconst commented Jul 30, 2022

Yup, that's allowed!

@JamesJCode
Copy link

I like it! FWIW it gets a +1 from me.

@foonathan
Copy link

Is copy really necessary to be part of the language? If ref is implicit for function parameters, and copies for rvalues, the following should work as well, shouldn't it?

fn Copy[T](x : T) -> T { return x; }

let safe_copy: T = Copy(escaped);

(Modulo any syntax issues)

@chandlerc
Copy link
Contributor

Is copy really necessary to be part of the language? If ref is implicit for function parameters, and copies for rvalues, the following should work as well, shouldn't it?

fn Copy[T](x : T) -> T { return x; }

let safe_copy: T = Copy(escaped);

(Modulo any syntax issues)

At the moment, yes... But I can both imagine changes to the semantics of return that would make this not the case, and I can imagine it being useful purely for syntactic reasons to simplify this to an operator. So it seems good to carve out a super explicit and unambiguous operation here rather than rely on the return semantics.

@zygoloid
Copy link
Contributor

zygoloid commented Aug 3, 2022

Where would we write the keyword in a for? Some options: for HERE1 (HERE2 pattern HERE3 in HERE4 range) { ... }. I think I don't like HERE4 because it looks like the whole range is copied all at once, but it's perhaps the most consistent option. I think for might be another context where we want ref to be the default.

I wonder if there's a better keyword for the "unspecified" behavior -- ref sounds like it would guarantee reference semantics.

@mconst
Copy link
Contributor

mconst commented Aug 3, 2022

That's a good question. HERE4 does feel like the most consistent to me, and it reads pretty well with ref (assuming that's not the default); I think it's just copy that makes it feel weird. And we might not even need to allow copy at all here. The situations in which you'd want it seem very rare, and if it actually comes up you can always just write let y: auto = copy x; in the loop body instead.

(In particular, if you wanted a copy so you can mutate the loop variable without affecting the underlying range, you can just use var, which doesn't need an explicit copy. Conversely, if you wanted to mutate the elements of the range itself, that doesn't really work anyway -- you'd want to bind the loop variable as a mutable reference, which we don't support. I'm guessing we'll want to allow an addr binding mode for that situation, which would give you a pointer to each element rather than the element itself; and in that case, if you want to copy the element before mutating it, doing the copy manually seems like the way to go.)

As for the meaning of ref, I think we're pretty close to being able to say that ref really does guarantee immutable reference semantics, and the compiler choosing to make a copy instead is just an optimization. I believe the only case where valid code can tell the difference right now is by passing that value to a C++ const & and then taking its address: if you do that multiple times, you might observe different addresses.

@mconst
Copy link
Contributor

mconst commented Aug 4, 2022

If we go with this approach to for, I think the way we'd define it is to say that a ref annotation on a range also applies to the elements of that range. That seems like a pretty natural rule, and it's similar to how references to containers work in Rust.

We wouldn't need any special rules for copy at all. If you write copy range, it would in fact copy the entire range just like you'd expect. To copy the individual elements, you'd either use var on the loop variable or you'd write let y: auto = copy x; inside the loop, both of which make the semantics very clear.

@zygoloid
Copy link
Contributor

zygoloid commented Aug 9, 2022

This idea seems promising, and worth writing up as a proposal.

@chandlerc
Copy link
Contributor

This idea seems promising, and worth writing up as a proposal.

We can't really. ;] It's dependent on another one.

I'm going to try to fold this one into the values and pointers proposal.

@jonmeow jonmeow added the leads question A question for the leads team label Aug 10, 2022
@jonmeow
Copy link
Contributor

jonmeow commented Aug 11, 2022

@chandlerc Do you want an issue to track the need for a decision here, or are you comfortable saying that #821 should address this?

@chandlerc
Copy link
Contributor

@chandlerc Do you want an issue to track the need for a decision here, or are you comfortable saying that #821 should address this?

Let's assume it will for now. Closing this as decided as well. We can always reopen if #821 (or a subsequent proposal) surfaces concerns or changes direciton.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
leads question A question for the leads team
Projects
None yet
Development

No branches or pull requests

7 participants