-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFC for a formalized notion on where to enforce reference propertes in MIR #2631
Closed
Closed
Changes from all commits
Commits
Show all changes
8 commits
Select commit
Hold shift + click to select a range
33c3423
Initial draft version
HeroicKatora b848b1e
Clear up copied raw reference description
HeroicKatora e955729
Get into size invariant vs deref invariant
HeroicKatora 8cc87c8
Add the lint example text
HeroicKatora c740377
Clarify location and purpose of analysis pass
HeroicKatora 0175ea3
Analysis overhead is a drawback
HeroicKatora 4751c5a
Clarify alignment invariant requirements
HeroicKatora 0866313
Temporarily remove confusing example in reference section
HeroicKatora File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,292 @@ | ||
- Feature Name: raw_reference_tracking | ||
- Start Date: 2019-01-21 | ||
- RFC PR: (leave this empty) | ||
- Rust Issue: (leave this empty) | ||
|
||
# Summary | ||
[summary]: #summary | ||
|
||
Formalizes an analysis during HIR->MIR lowering to track references on which | ||
the compiler should not enforce reference properties. This complements MIR | ||
level changes through which taking the address of a subobject pointed to by a | ||
pointer does not accidentally require or guarantee reference properties on | ||
those pointers. | ||
|
||
# Motivation | ||
[motivation]: #motivation | ||
|
||
In current Rust semantics to obtain a pointer to some place, one creates a | ||
reference and casts it to a pointer. Since we strictly speaking attach even to | ||
this temporary reference some additional invariants, this may not be currently | ||
sound in MIR. These invariants (among them alignedness and dereferencability) | ||
are not required for pointers. Since any inspection of a type's fields involves | ||
an intermediate reference to the extracted field, it is impossible to soundly | ||
retrieve a pointer to such. | ||
|
||
A [similarly motivated RFC](https://github.com/rust-lang/rfcs/pull/2582) exists | ||
that tries to approach this problem by adding a MIR operation that performs a | ||
direct pointer to pointer-to-field conversion in a few defined cases. This | ||
leaves open the question of a formalized approach to the problem. | ||
|
||
While exposing such properties through the type system would not be backwards | ||
compatible, it could still be possible to attach such properties internally. | ||
Through this system, we can also provide better warnings and give the necessary | ||
foundation to properly discuss extended guarantees. | ||
|
||
# Guide-level explanation | ||
[guide-level-explanation]: #guide-level-explanation | ||
|
||
Each reference (`&` or `&mut`) will have an internal–that is, not observable by | ||
rust code–tracking state named `raw`. While a reference is `raw`, it will be | ||
represented as a pointer in syntax-to-MIR lowering and the value to which | ||
dereferences need not be valid. The act of unactivating the raw status of a | ||
reference will be called `settling` for the purpose of this document, how this | ||
can happen will be discussed below. When a reference with active `raw` status | ||
is converted to a pointer, this will be called a raw pointer cast and would be | ||
a no-op in MIR. | ||
|
||
``` | ||
// The result of this borrow is a raw reference. | ||
let x = unsafe { &packed.field as *const _ }; | ||
``` | ||
|
||
In short, a reference will keep its raw status as long as its pointed-to content | ||
is not accessed. Since the tracking of references should stay function local, | ||
passing a reference as an argument to another function or returning it will | ||
unset the `raw` status. Note that raw references can only originate from unsafe | ||
code. Thus, when a raw reference is cast to a pointer in safe code, this is an | ||
indicator for potentially incorrect usage of unsafe and a warning could be | ||
emitted. | ||
|
||
Since this solution works (invisbly) at type-level, we keep the composability | ||
properties that code relies on. In particular, we can decompose expressions | ||
into subexpressions: | ||
|
||
``` | ||
let x = unsafe { | ||
// Desugared version. `raw` is reference with `raw` status. | ||
let raw = &(*packed).field; | ||
// Cast with active raw status, raw pointer cast. | ||
raw as *const _ | ||
}; | ||
``` | ||
|
||
This also propagates out of unsafe blocks, even though a raw reference from a | ||
pointer can only be created within an unsafe block. This proposal explicitely | ||
does not intends stabilize the guarantee of the following code not invoking | ||
undefined behaviour but nevertheless proposes to not exploit it in MIR until | ||
better tools are created in the surface language to differentiate the reference | ||
requirements. | ||
|
||
``` | ||
let x = unsafe { &packed.field }; | ||
let x = x as *const _; | ||
``` | ||
|
||
Since this is the only use of `x`, and `x` is never settle but originated in an | ||
unsafe block, this is a strong code smell. Specifically, it would be strictly | ||
safer for this reference to never leave the unsafe code block. By tracking the | ||
origin of a raw reference, a lint could be emitted to suggest moving the pointer | ||
cast inside. | ||
|
||
# Reference-level explanation | ||
[reference-level-explanation]: #reference-level-explanation | ||
|
||
The `raw` status of a reference can only originate from a borrow expression. | ||
`raw` status can be active, or inactive. The reference result of a borrow | ||
expression `& (*t).field_path` will have the `raw` status set when at least one | ||
of these is true (where the borrow expression is after desugaring of deref | ||
magic): | ||
|
||
* `t` has pointer type. | ||
* `t` is itself a raw reference. | ||
* `field_path` contains fields of a union or packed struct. | ||
|
||
The raw status of a reference is conserved across copy and move. When a | ||
reference is used in any expression except `Copy`, `Move` or as the pointer in | ||
a borrow expression from which another raw reference originates, then it it | ||
settled. This implies that the analysis is local, as any return settles the | ||
reference and no returned reference is raw. This provides enough power to MIR | ||
to settle most reference that are used as references while upholding the raw | ||
status for the purpose of getting a raw reference to a subfield. | ||
|
||
Assume `packed: *const T`, `(*packed).field` is unaligned. All examples are | ||
assumed to be inside a single unsafe block each for the safe of brevity. | ||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This example (with the three code blocks) is confusing and not consistent with the rest of the text and will be reworked. Sorry for not noticing this before the PR... |
||
``` | ||
// TODO: expand on example above | ||
``` | ||
|
||
When a reference value is initialized or assigned to from multiple sources, it | ||
will have raw status when all sources have raw status. Otherwise, the | ||
references with raw status are settled before that initialization or | ||
assignment. | ||
|
||
On the MIR-level, reference with `raw` status are temporarily represented as a | ||
pointer, and convert it to an actual reference only when it is settled. This | ||
requires a MIR operation that can borrow a place as a pointer directly. Such an | ||
operation has been proposed and is currently in merge period already as pull | ||
request #2582. Raw references will only have their size invariants enforced, | ||
i.e. they are required to point to a large enough–and unique for mutable | ||
references–place in memory. The alignment inviarant and the requirement for the | ||
invariants on the pointed-to value are not inspected for raw references. | ||
|
||
Using the `raw` status, it is possible to provide additional lints for unsafe | ||
code which creates references but could transform them into pointers | ||
immediately. When a raw reference is only used to perform a raw cast into a | ||
pointer, this cast should likely be performed immediately. Since the analysis | ||
here is local, any raw reference must have originated in the same function. A | ||
synthetic property can be computed which checks if a reference is only used in | ||
the creation of other raw references or raw pointer casts. Where this property | ||
is the upper bound of the following monotonic function on the lattice of all | ||
references: | ||
|
||
* `unused(A) < unused(B)` where `B = A` appears in the function. | ||
* `unused(A) < unused(B)` where `B = &(*A).fields`. | ||
|
||
Now, here is an example how this lint could be useful. The following function | ||
may accidentally exhibit undefined behaviour because it creates a reference to | ||
the struct pointed to. This may have been an accident after refactoring or | ||
simply an incorrect assumption by the programmer. | ||
|
||
``` | ||
#[repr(packed)] | ||
struct Foo { _layout: u8, a: u16, b: u16, }; | ||
fn not_very_defined(input: *mut A, which: bool) -> *mut usize { | ||
// reference properties unused in all other usage. | ||
let whole = unsafe { &*input }; | ||
if which { | ||
// Cast to pointer, raw unused. | ||
&whole.a | ||
} else { | ||
// Cast to pointer, raw unused. | ||
&whole.b | ||
} | ||
} | ||
``` | ||
|
||
However, this does not necessarily stay undefined behaviour if the MIR level | ||
changes to create a raw reference get applied. None of the created references | ||
are ever settled as `whole` is only used to create the other two, and those get | ||
cast to pointers immediately. The assignment of a reference to a variable could | ||
however be avoided by putting the field access into the unsafe block as well. | ||
|
||
``` | ||
warning: local unsafe reference never used as reference | ||
| | ||
3 | fn not_very_defined(input: *mut A, which: bool) -> *mut usize { | ||
4 | let whole = unsafe { &*input }; | ||
| ^^^^^^^ | ||
5 | if which { | ||
|
||
note: derived reference coerced to pointer before use here | ||
5 | if which { | ||
6 | &whole.a | ||
| ^^^^^^^^ | ||
7 | } else { | ||
|
||
note: derived reference coerced to pointer before use here | ||
7 | } else { | ||
8 | &whole.b | ||
| ^^^^^^^^ | ||
9 | } | ||
|
||
note: the compiler can consider the reference to point to initialized memory | ||
note: consider casting original reference to pointer immediately | ||
note: to avoid this warning, explicitely cast to reference | ||
4 | let whole = unsafe { &*input as &_ }; | ||
|
||
``` | ||
|
||
# Drawbacks | ||
[drawbacks]: #drawbacks | ||
|
||
This proposal complicates reasoning about unsafe code. It is no longer clear | ||
from a single value expresion alone whether `& (*place)` will be enforced as | ||
reference type, or simply treated like a pointer. | ||
|
||
Even though we don't make such guarantees to stable code yet and only provide | ||
this as a temporary means to make user code defined at all, this could be | ||
misunderstood as a promise to programmers. | ||
|
||
The (lint) analysis may occur a non-trivial overhead especially for very large | ||
functions that mix unsafe code with safe code without adding many explicit type | ||
annotations. | ||
|
||
# Rationale and alternatives | ||
[rationale-and-alternatives]: #rationale-and-alternatives | ||
|
||
An alternative would be to create dedicated new surface level syntax that only | ||
works on raw pointers to create another raw pointer using a place-like | ||
expression. This design is preferable over a new syntax for creating temporary | ||
raw references for some reasons: | ||
|
||
Old code that relied on `&packed.field as *const_` and incorrectly from lack of | ||
explicit other concepts inferred from this that any temporary reference | ||
creation eventually cast to pointer is defined should be considered. This | ||
reasoning would gain an explicit approval as long as the type is never | ||
explicitly cast to/ascribed with/used as a reference. | ||
|
||
A new surface level syntax would require deeper changes. Most uncomfortably, it | ||
could require changing type resolution by adding new types. This would very | ||
unlikely be backwards compatible while offering all necessary guarantees to | ||
current code. In contrast, the mechanisms proposed here are a middle ground that | ||
achieves some of the definedness guarantees internally without making them | ||
stable. But it also enables lints that could help detect and eliminate mistakes | ||
that could become strictly undefined if the internal mechanisms were reverted | ||
again. | ||
|
||
Also, it should be noted that the invariants enforced on raw references differ | ||
from both references and simple pointers. Instead of reusing the pointer | ||
representation, we could also provide a new MIR-level type `&raw` or `&uninit` | ||
on which the size invariant is applied but not the invariants of the pointed to | ||
type. | ||
|
||
# Prior art | ||
[prior-art]: #prior-art | ||
|
||
An [alternate proposal](https://github.com/rust-lang/rfcs/pull/2582) addresses | ||
the same underlying concern als with MIR changes. These are not exclusive and | ||
MIR-only changes are generally preferable to surface language changes. This | ||
proposal tries to make these changes more consistent at both levels. | ||
|
||
# Unresolved questions | ||
[unresolved-questions]: #unresolved-questions | ||
|
||
Settling a raw reference is the one semantic case that creates a reference | ||
'out of thin air', through `unsafe` blocks. For the sake of a programmatic | ||
undefined behaviour sanitizer, the settling of raw references would thus be a | ||
primary injection point. This opportunity could make it worth creating a | ||
separate instruction rather than using `&(*ptr)`. | ||
|
||
# Future possibilities | ||
[future-possibilities]: #future-possibilities | ||
|
||
Unsafe code that only performs reborrowing to raw references and casts all raw | ||
references to pointers before they are settled does not actually execute any | ||
code that must be specified as unsafe. Getting a pointer to a field from a | ||
pointer to the containing struct should not, for example. Another example would | ||
be a statement that gets the pointer to a field of a packed struct from a | ||
reference to that struct. Since this internally only translates to safe code | ||
and the size invariant of the packed struct is already guaranteed, there is no | ||
intrinsic reason why it requires `unsafe`. However, the path+borrow expression | ||
of such code currently requires the use of `unsafe`. | ||
|
||
``` | ||
let field = &(*foo).bar as *const _; | ||
let field = &packed.field as *mut _; | ||
``` | ||
|
||
The notion of `raw` references could be explicitely brought into language level. | ||
Either by making such code safe as long as it does not settle any raw | ||
references or by introducing `&raw` to both worlds. This could improve | ||
ergonomics by removing unecessary `unsafe` code blocks, and further strengthen | ||
the justification requirements for the usage of `unsafe`. Such a concept could | ||
however require a stronger foundation than this proposal alone. | ||
|
||
Should Rust instead (or additionally) introduce an explicit surface level | ||
syntax for reborrowing pointers as other pointer (complementing the MIR | ||
operator for raw references), the extra information already computed by the | ||
`raw` status can be used to offer additional lints that suggest transitioning | ||
to this new syntax. The raw reference term helps not mistakenly apply such | ||
lints to unsafe blocks were the reference output was desired. |
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to be suggesting that the compiler ascribes significance to the boundaries of an unsafe block, which is untrue. The unit of unsafety with regards to optimization is traditionally the entire module that contains an
unsafe
block.IIUC, the other RFC does treat this as UB, but only because of the intermediate
let
.unsafe { &packed.field } as *const _
should be allowed.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand that it explicitely states this. But
unsafe { }
is a block expression with it's own type-inference and all, so I don't see the significant difference to other block statements such asif-else
in Example 2.If the definedness does not extend to other block statements, I find this a very dangerous special case in #2582 . And as noted by @RalfJung , this seems to not be the intention, here in an Edit under Example 2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because it's at a block boundary, it becomes a true reference value - it's somewhat equivalent to a function return (and therefore is UB)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would be my interpretation of the other rfc as well. However, in this rfc I intended to keep the
raw
status around for all basic blocks of the current function. Thus, function return makes it a true reference value but the block boundary would not yet invoke undefined behaviour.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's fair, and I don't think that's ureasonable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So whether or not such a program has UB depends on a type that was written down far away (if we imagine this function was a bit longer). I would never accept such a program during code review, and always ask people to add the explicit cast at the place where it is needed. Otherwise, (a) it is way too hard to understand why this program is correct, and (b) someone changing the return type might not even be aware that they just introduced UB.