Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Validity of integers and floating point #71

Closed
RalfJung opened this issue Jan 10, 2019 · 99 comments
Closed

Validity of integers and floating point #71

RalfJung opened this issue Jan 10, 2019 · 99 comments
Labels
A-validity Topic: Related to validity invariants S-writeup-needed Status: Ready for a writeup and no one is assigned

Comments

@RalfJung
Copy link
Member

RalfJung commented Jan 10, 2019

Discussing the validity invariant of integer and floating point types.

Clearly, every possible bit pattern is allowed. For integers they all have a distinct and meaningful interpretation, and we have a safe NOP-conversion between f32 and u32, and f64 and u64, through to_bits and from_bits.

The remaining open question is: is it ever allowed to have an uninitialized bit in an integer or floating point value? We could reasonably decide either way. Also, when an integer is partially uninitialized, does that "infect" the entire integer or do we exactly preserve which byte is initialized?

2022-09-07: This has now pretty much been answered.

@RalfJung RalfJung added active discussion topic A-validity Topic: Related to validity invariants labels Jan 10, 2019
@RalfJung
Copy link
Member Author

Arguments for allowing uninitialized bits:

  • There does not seem to be a reasonable optimization that is broken by this.
  • It does not contradict anything we currently tell LLVM.
  • It simplifies writing code with mem::unintiailized()/MaybeUninit::uninitialized() when the type is known to consist of integers only.

Arguments for disallowing uninitialized bits:

  • If we allow uninitialized bits, we have to define what happens with arithmetic operations when an input contains an uninitialized bit. Is that insta-UB? Is the result fully uninitialized? Does uninitializedness somehow propagate bitwise? These are hard choices, because it is easy to break arithmetic equations here. By disallowing uninitialized bits, we successfully avoid the question.
  • "No uninitialized data outside of unions" is a nice and and easy to teach principle.
  • Ruling out uninitialized bits in integers and floating points makes it easier to find bugs where people accidentally have such variables not properly initialized: A bug-finding tool can flag any occurrence of uninitialized memory in integers/floating points as a bug immediately. Also see https://www.youtube.com/watch?v=yG1OZ69H_-o: The fact that C makes integer overflow on unsigned variables defined to wrap-around actually has lead to real-world bugs (caused by unintended overflow) not being detected by bug-finding tools, because you cannot just mark every overflow as a bug. If there is a rare exception to a common principle ("integer overflow is a bug", "uninitialized data is a bug"), then it is preferable to have the programmer state their intent of violating that principle explicitly (use explicitly overflowing arithmetic operations like wrapping_add, use types like MaybeUninit for explicitly handling uninitialized data).

@RalfJung
Copy link
Member Author

Here's an example of a piece of code that relies on uninitialized integers (and raw pointers, and AtomicPtr) being valid: https://github.com/carllerche/bytes/blob/456221d16521cf54cea0e6569669e47120a1b738/src/bytes.rs#L1810

@RalfJung
Copy link
Member Author

RalfJung commented Jan 30, 2019

An interesting use of uninitialized bits in integers is in crossbeam's AtomicCell: for types of certain sizes, it will use the AtomicU* types.

But this means that with a type like (u8, u16), it will use AtomicU32 and thus carry uninitialized bits in a u32. Also see crossbeam-rs/crossbeam#315.

Cc @stjepang

@gnzlbg
Copy link
Contributor

gnzlbg commented Jan 30, 2019

If we allow uninitialized bits, we have to define what happens with arithmetic operations when an input contains an uninitialized bit. Is that insta-UB? Is the result fully uninitialized? Does uninitializedness somehow propagate bitwise? These are hard choices, because it is easy to break arithmetic equations here. By disallowing uninitialized bits, we successfully avoid the question.

If we were to allow uninitialized bits, it might be reasonable to say, at least initially, that any operation that's not a read or a write is UB. That would allow us to define those operations later on. For example, that would mean that adding an initialized integer with an uninitialized one is UB, but later on, we could define that to something else, like the result of that operation being uninitialized.

That is, we wouldn't need to answer these hard questions right now.


I wonder whether it is possible to allow uninitialized bits later on in a backwards compatible way or whether we do have to make this decision right now ?

I think we have to make this decision right now, because, e.g. if we forbid uninitialized bits, all unsafe code assumes that integers are always initialized, and we can't change that later to support uninitialized bits without breaking that assumption.

In the same way, if we allow uninitialized bits, we can't later on disallow them without breaking code that uses them.


"No uninitialized data outside of unions" is a nice and and easy to teach principle.

Agreed.

A bug-finding tool can flag any occurrence of uninitialized memory in integers/floating points as a bug immediately.

I don't think it can flag every occurence, for example, extern fn foo() -> MaybeUninit<i32>, but it would be able to do so for all code that the tool is able to instrument at least.

@hanna-kruppe
Copy link

I think we have to make this decision right now, because, e.g. if we forbid uninitialized bits, all unsafe code assumes that integers are always initialized, and we can't change that later to support uninitialized bits without breaking that assumption.

I don't see how. If a type's validity invariant rules out certain initialized bit patterns, unsafe code can use those bit patterns for its own purposes, which will clash with later allowing those bit patterns in the validity invariant. However, uninitialized bits -- even if they are valid -- cannot be detected by the program: reading them is either UB or perhaps produces some string or zeros or ones (possibly a non-deterministic one). So that kind of counter-example is right out.

Moreover, because uninitialized bits will never be safe, unsafe code can't be fed them from unknown/external sources. IOW the only way a library/module/etc. will ever see uninitialized bits is if it produces them itself or obtains them from its "trusted base" (e.g., a function whose functional correctness is relevant to the soundness of the library/module/etc.), in which case it's UB today and its problem is an internal bug, not the interface with the rest of the world.

@nikomatsakis
Copy link
Contributor

I would be inclined to permit uninitialized data in integers.

My reasoning is as follows:

I think that backwards compatibility around mem::uninitialized is a real concern. It is a very common pattern to allocate an uninitialized [u8] array in practice, and I am reluctant to declare all of that pre-existing code UB (even if we deprecate uninitialized).

Second, I think that the crossbeam example which @RalfJung raised earlier feels like the kind of tricky thing that people shouldn't have to fret about. In particular, if you have uninitialized padding bits in structs and things like that which are known to be less than word size, I think you should be able to treat them as integers for convenience, and it seems like that would be Insta-UB under this proposal.

Basically, I think people are likely to do things (like crossbeam) that wind up using uninitialized bits in integers but which don't necessarily "feel" like cases where MaybeUninit should be required. I also think that people will commonly reason about uninitialized data informally and hence write code that is UB -- of course, a tool that detects such usage can help, but it's also nice if "common things" people write are not UB even if they neglect to run the tool.

That said, I find the most compelling argument against permitting uninitialized bits to be that it would allow us to declare such uses an error. But it seems like we could still have a sort of "lint", where we say "ah, you are using uninitialized data outside of a MaybeUninit struct -- while not technically UB, it is recommended to use MaybeUninit.

One meta-question:

Suppose that I do have a clever algorithm that makes use of uninitialized integers. For example, a trick I have used in a past life was to have an integer set that had O(1) initialization cost regardless of its capacity. This worked by having two vectors of integers. One of them started out with size N but had uninitialized data. The other started out as size 0 but had initialized data. To add to the set, you checked one against the other (I can give details if desired). The key point is that you had to read and use an uninitialized integer in the process and compare it against initialized data -- is reading such an integer UB?

@hanna-kruppe
Copy link

hanna-kruppe commented Jan 31, 2019

The crossbeam example and the "O(1) initialization" integer set, as well as all other clever uses of uninitalized memory that I'm aware of (i.e., not just allocating uninitialized memory and initializing it at your own pace before using it) require operating on uninitialized bits. So if we want to allow them, we need to not only allow reading uninitalized memory but also make arithmetic and comparisons on them defined (and reasonably deterministic! undef results are worse than useless). That is not possible at all with current LLVM (except by initializing all memory, or pretending to do so in ways that will block lots of unrelated optimizations) and even if/when freeze and poison is adopted, it'll have some negative codegen impact and I see no way to restrict that impact only to modules that really need it.

So even though I agree that it would be best to support these things, I do not see a reasonable way to achieve that.

@RalfJung
Copy link
Member Author

The crossbeam example and the "O(1) initialization" integer set, as well as all other clever uses of uninitalized memory that I'm aware of (i.e., not just allocating uninitialized memory and initializing it at your own pace before using it) require operating on uninitialized bits.

Notice that this only applies to the compare_exchange part of crossbeam. Loads and stores do not operate on uninitialized data and hence are mostly fine.

So we have a two-level discussion here:

  • Can we load/store uninitialized bits into integers?
  • If yes, is there any other operation we can perform on them?

I'd prefer the answer to the first question to be "no" so that the second question doesn't even need answering, but unfortunately @nikomatsakis has some pretty good arguments. ;)
Just one word on these: at least under some proposed poison semantics for LLVM, poison acts on a value. That means if you load a (u8, u16) into a u32, if the padding was indeed uninitialized, you now have a fully uninitialized u32 -- if one byte of the memory is poison, the entire u32 becomes poison. So that would still not allow was crossbeam does. This behavior is, from what I know, one of the reasons why the proposal isn't officially adopted yet -- but if you don't do this, you lose some other optimizations.
However, this question of the "granularity" of poison/uninit (per-value, per-byte, per-bit) is somewhat orthogonal to whether or not a u32 must be initialized, and there are uses of uninitialized integers that never run into this.

Coming to the second question, as @rkruppe said these patterns (I think you are talking about https://research.swtch.com/sparse, right?) are still not legal. Comparing an uninitialized integer with anything is either UB or produces an uninitialized boolean, branching on which is UB. But once LLVM has freeze, we could make them legal very easily and (I think) without bad impact elsewhere by adding a freeze intrinsic and telling people to use that.

@hanna-kruppe
Copy link

But once LLVM has freeze, we could make them legal very easily and (I think) without bad impact elsewhere by adding a freeze intrinsic and telling people to use that.

Hm, if getting people to use such an intrinsic is acceptable (I think the biggest source of worry is people reasoning naively about uninitalized as "initialized to an arbitrary bit string" and not even checking), then we can already build such an intrinsic today, it just has to do something that all LLVM optimizations have to assume could initialized the memory (e.g., some inline asm). This will have some unfortunate impact on optimizations unrelated to uninitialized memory, but it will still be localized to uses of that intrinsic.

@RalfJung
Copy link
Member Author

RalfJung commented Jan 31, 2019

I think the biggest source of worry is people reasoning naively about uninitalized as "initialized to an arbitrary bit string" and not even checking

Fair, but I see no model that makes this work.

This is actually also a reason why I'd prefer to not allow uninitialized data in integers -- that may be more surprising, but it is easier to explain and very concise: "No uninitialized data outside MaybeUninit".

If we allow uninitialized bits but then almost all operations are UB, it becomes something more like "No uninitialized data outside MaybeUninit, except if it is an integer or a raw pointer and you are not actually looking at the data, just loading and storing" and then we still have to explain that x * 0 is also UB even though it doesn't really have to "look at" x -- and then people will be "screw this, that's too complicated, I will just do whatever and write some tests".

Basically, we are breaking expectations anyway, and maybe it is better to break them more but in simpler ways, than to figure out how to break them as little as possible while still breaking them in ways that are much more complicated to explain. That seems plausible to me. Not sure if it makes any sense.^^

@CAD97
Copy link

CAD97 commented Feb 1, 2019

If uninitialized bits in an integer are made instantly invalid, is it possible to do the crossbeam::AtomicCell trick of making an atomic (u8, u16) by treating it as AtomicU32 at all? Is it possible to do that minus the CAS operation?

It seems like it should be possible to use an atomic (u8, u16) on a platform supporting 32 bit atomics. The only way I can see without requiring valid(T) \in initialized(u32) (and thus a very unsafe AtomicCell) is to provide some sort of mem::initialize_padding(&T).

I'm personally on the side of making uninitialized integers invalid so long as we don't lose anything (but mem::uninitialized) for it.

@Amanieu
Copy link
Member

Amanieu commented Feb 1, 2019

The case of AtomicCell<(u8, u16)> does indeed depend on UB at the moment. C++ solved this by essentially making it the responsibility of std::atomic<T> to magically ignore padding bytes (see this page, at the bottom). Note that this magic only works for structs: compare_exchange_strong is not guaranteed to ever converge with unions.

I've been toying with the idea of a clear_padding_bytes intrinsic that would support this use case, as well as things like writing a struct to a file without leaking information through the padding bytes:

/// Writes 0 to all padding bytes in `val` and returns a mutable slice of the bytes in `val`.
fn clear_padding_bytes<T>(val: &mut T) -> &mut [u8];

@RalfJung
Copy link
Member Author

RalfJung commented Feb 1, 2019

If uninitialized bits in an integer are made instantly invalid, is it possible to do the crossbeam::AtomicCell trick of making an atomic (u8, u16) by treating it as AtomicU32 at all? Is it possible to do that minus the CAS operation?

No.

Notice though that the state of this trick (even the load/store part) is dubious in LLVM as well -- poison is infecting the entire value when just one of the bytes loaded from memory is poisoned, and there are proposals to replace undef by poison.

C++ solved this by essentially making it the responsibility of std::atomic to magically ignore padding bytes

Well, "solved". As usual with C++, it is somewhat unclear what this actually means in an operational way. I would not call this a solution.
One possible solution is to say that values get frozen before being compared, that at least removes the UB -- but it doesn't guarantee that a compare-exchange loop will ever terminate as they could get frozen to a different value each time. Or maybe they actually get frozen in memory, but that conflicts tons of real optimizations.

Writes 0 to all padding bytes in val and returns a mutable slice of the bytes in val.

I don't think this is implementable -- at run-time there is no way to distinguish padding bytes from initialized bytes. But a version of this which just picks arbitrary bit patterns for all uninitialized and padding bytes would be implementable, that's exactly what freeze does.

@Amanieu
Copy link
Member

Amanieu commented Feb 1, 2019

I don't think this is implementable -- at run-time there is no way to distinguish padding bytes from initialized bytes.

I don't see what the problem is? The compiler knows the layout of type T and therefore knows where all the padding holes are.

Well, "solved". As usual with C++, it is somewhat unclear what this actually means in an operational way. I would not call this a solution.

In an operational sense this would mean setting the padding bytes to 0 on all input values to an atomic operation. This will cause padding bytes to be "ignored" by the compare_exchange hardware instruction since they will always be 0.

@RalfJung
Copy link
Member Author

RalfJung commented Feb 1, 2019

The compiler knows the layout of type T and therefore knows where all the padding holes are.

Oh, you mean statically -- well, for enums that'll require dynamic checks as well. And of course this is hopeless for unions.

In an operational sense this would mean setting the padding bytes to 0 on all input values to an atomic operation.

That would also guarantee that the value you read has 0 for all padding bytes, which I am fairly sure they do not want to guarantee. And anyway I see no way to implement this behavior even remotely efficiently.

I think a more reasonable operational version amounts to saying that you freeze both values before comparing -- that at least avoids comparing uninitialized bytes, and it makes compiling to a simple CAS correct. But it allows spurious comparison failures and makes no guarantees that your retrying CAS loop will ever succeed, because you could see a different frozen value each time around the loop.

Also this assumes the atomic operation knows the correct type, whereas AFAIK for LLVM atomic operations only work on integer types -- at which point you cannot know which bytes are padding.

@ghost
Copy link

ghost commented Feb 3, 2019

But it allows spurious comparison failures and makes no guarantees that your retrying CAS loop will ever succeed, because you could see a different frozen value each time around the loop.

I believe spurious comparison failures can be fixed like so:

// Assume `T` can be transmuted into `usize`.
fn compare_and_swap(&self, current: T, new: T) -> T {
    // Freeze `current` and `new` and transmute them into `usize`s.
    let mut current: usize = freeze_and_transmute(current);
    let new: usize = freeze_and_transmute(new);

    loop {
        unsafe {
            // `previous` is already frozen because we only store frozen values into `inner`.
            let previous = self.inner.compare_and_swap(current, new, SeqCst);

            // If `previous` and `current` are byte-equal, then CAS succeeded.
            if previous == current {
                return transmute(previous);
            }

            // If `previous` and `current` are semantically equal, but differ in uninitialized bits...
            let previous_t: T = transmute(previous);
            let current_t: T = transmute(current);
            if previous_t == current_t {
                // Then try again, but substitute `current` for `previous`.
                current = previous;
                continue;
            }

            // Otherwise, CAS failed and we return `previous`.
            return transmute(previous);
        }
    }
}

Now it's still possible to have a spurious failure in the first iteration of the loop, but the second one will definitely succeed (unless the atomic was concurrently modified). In fact, this is exactly how CAS in AtomicCell<T> already works, except we don't have a way to freeze values.

@RalfJung
Copy link
Member Author

RalfJung commented Feb 4, 2019

previous is already frozen because we only store frozen values into inner.

I think this is the key invariant here -- if you can make that work, then yes I can think there is a consistent semantics here.
This requires making sure the contents are initialized, and also freezing before writing in store (and any other modifying instruction).

However, notice that you have AtomicCell::get_mut, so I do not believe it is possible to maintain the invariant that inner will always be frozen.

@gnzlbg
Copy link
Contributor

gnzlbg commented Feb 4, 2019

We seem to be trying to answer two different questions here that are entangled.

One is whether integers and such can be uninitialized or not.

The other one, which is most fundamental, is whether we want to support using uninitialized memory outside unions or not, in general. I think we should answer this question first, and use the answer to drive the rest of the design.

We can't think of allowing uninitialized memory on integers without also considering that the validity of integers and raw pointers is going to be alike, so we are also allowing uninitialized memory on raw pointers. Raw pointers can point to DSTs, so we also need to be thinking whether we want to allow uninitialized memory in pointers to DSTs (for the whole pointer, some part of it, etc.).

I agree with @nikomatsakis that a lot of code is using mem::uninitialized today and we should weight the impact on that. I don't share their reluctance (yet), because I think deprecating mem::uninitialized doesn't break the world (the compiler and tools are going to be the same as the day before the deprecation). People code will continue working "as is", and they will have time to upgrade. I don't know what the cost of the upgrade will be, but reasoning about mem::uninitialized is hard, and the rule "no uninitialized memory outside unions" turns something that is hard into something that's relatively easy. It also potentially simplifies our "rules" for all other types (e.g. raw pointers, integers, etc.), and there is value in that.

The issue that some algorithms require uninitialized memory has shown up. The question that hasn't been answered yet AFAICT is whether MaybeUninit can be used to implement those algorithms or not. If it can't, then IMO that might hint at a design flaw in MaybeUninit (or maybe we need more helper types, e.g., AtomicMaybeUninit or similar), but that doesn't necessarily imply that banning uninitialized memory outside unions is a bad choice.

@ghost
Copy link

ghost commented Feb 4, 2019

However, notice that you have AtomicCell::get_mut, so I do not believe it is possible to maintain the invariant that inner will always be frozen.

Just to clarify for everyone else following this thread: @RalfJung and I discussed this and the conclusion was that we'll simply remove get_mut() if that is necessary to make AtomicCell sound.

@RalfJung
Copy link
Member Author

RalfJung commented Feb 5, 2019

About whether uninitialized data is okay in integers at all: we discussed this at the all-hands.

The general consensus seems to be that we should permit uninitialized data in integers and raw pointers. There is just too much existing code doing stuff like

let x: [u8; 256] = mem::uninitialized();
// go on

or

let x: SomeFfiStruct = mem::uninitialized();
// go on

Both of these patterns would be insta-UB if we disallow uninitialized integers. That doesn't seem worth the benefit of better error-checking with Miri.

Incidentally, that matches what Miri already currently implements, mostly for pragmatic reasons (libstd already violated the rules about uninitialized integers when I wrote the checks -- I think I have since moved it to MaybeUninit).

@gnzlbg
Copy link
Contributor

gnzlbg commented Feb 5, 2019

Both of these patterns would be insta-UB if we disallow uninitialized integers. That doesn't seem worth the benefit of better error-checking with Miri.

Was it also discussed whether this was worth the benefit of a simpler and more teachable correctness model for unsafe code ? (independently of whether this model can be better checked with miri or not?).

think I have since moved it to MaybeUninit).

I think so too.

@RalfJung
Copy link
Member Author

RalfJung commented Feb 5, 2019

Is "integers must be initialized" really that much simpler than "integers are allowed to not be initialized"?

@gnzlbg
Copy link
Contributor

gnzlbg commented Feb 5, 2019

Is "integers must be initialized" really that much simpler than "integers are allowed to not be initialized"?

No, but I do think that "uninitialized memory is only allowed inside unions" is much much simpler and teachable than all other alternatives that are currently being discussed.

@CAD97
Copy link

CAD97 commented Feb 7, 2019

Another (drastic?) option to consider that allows let x: u32 = mem::uninitialized();:

Disallow uninitialized (poison) bits in integers (and pointers?). Make mem::uninitialized return freeze(poison). This makes mem::uninitialized behave as most initially intuit (a constant nondeterministic bit string) and gains us the "(true) uninitialized memory only inside unions (or padding)" property.

This may degrade some performance around uses of mem::uninitialized, but the intent is to move those over to mem::MaybeUninit anyway, right? And as previously mentioned, mem::uninitialized is a very dangerous tool to start with.

Also it could degrade talking about "uninitialized memory", because it then introduces both the "frozen uninitialized" behind mem::uninitialized and "poison uninitialized" behind mem::MaybeUninit and padding as being called "uninitialized" in the language. This could be just documented on the deprecated big-fat-warning-label mem::uninitialized for most of the points back, though. Example:

mem::uninitialized

Deprecated, do not use. Use mem::MaybeUninit for its clearer semantics instead.

For safety reasons, this function now returns "frozen" memory -- memory set to
nondeterministic but consistent bits -- rather than truly uninitialized memory.
Otherwise, merely storing the returned value in a variable would immediately
break the validity invariants of any type except mem::MaybeUninit.

@RalfJung
Copy link
Member Author

RalfJung commented Feb 12, 2019

@CAD97 Good idea! Basically, once we have rust-lang/rust#58363, we could reimplement mem::uninitialized in terms of MaybeUninit::initialized and then call ptr::freeze. That would make all the existing code that creates uninitialized integer arrays or FFI structs valid even if we decide that integers must be properly initialized. (I guess we could say we require them to be "frozen".)

@Amanieu suggested we should allow uninitialized values in integers only for "backwards compatibility reasons". This would have a similar effect. It might, however, incur a performance cost on such existing code. Porting code to MaybeUninit enables it use "unfrozen memory" and gain back the performance.

However, I suspect people will still want to keep memory unfrozen when e.g. calling a known Read::read implementation. That would mean that if we disallow uninitialized integers, we have to make the validity invariant of references shallow. (I am in favor of that anyway, but this is a contentious point.)

@gnzlbg
Copy link
Contributor

gnzlbg commented Feb 12, 2019

Good idea! Basically, once we have rust-lang/rust#58363, we could reimplement mem::uninitialized in terms of MaybeUninit::initialized and then call ptr::freeze.

We should definitely do this.

That would mean that if we disallow uninitialized integers, we have to make the validity invariant of references shallow. (I am in favor of that anyway, but this is a contentious point.)

+1.

There is just too much existing code doing stuff like

let x: [u8; 256] = mem::uninitialized();

The freeze fix makes this "work" for Rust 2015 and Rust 2018, but unless I missed something the plan is still is to deprecate mem::uninitialized. If that's the case, we could prevent Rust 2021 crates from accessing it (even though the std lib still needs to ship it for inter-edition compatibility).

So I find the argument that "integers and pointers shall support uninitialized bit-patterns because there is too much code using mem::uninitialized()" weak. In Rust < 2021 uninitialized would be changed to freeze, and in Rust >= 2021 there could be no mem::uninitialized.

@Gankra
Copy link

Gankra commented Jul 3, 2022

For those who might be a bit confused because the conversation bounced around to lots of different places, a lot of the interesting cases are spelled out very explicitly in MaybeUninit's docs. For instance, the "I want to Read into an uninitialized buffer" case is explicitly called out as incorrect here:

https://doc.rust-lang.org/std/mem/union.MaybeUninit.html#incorrect-usages-of-this-method-1

@RalfJung
Copy link
Member Author

RalfJung commented Jul 3, 2022

@Gankra That is more a question for #77, and I happen to think that we should allow the examples shown in the docs...^^ (and Miri currently allows them).

@Gankra
Copy link

Gankra commented Jul 3, 2022

In that issue when someone asked about the exact situation I referenced you told them it was off topic and to discuss it in this issue.

@RalfJung
Copy link
Member Author

RalfJung commented Jul 3, 2022

You are quoting me incorrectly. :)
That is a reply to this example which uses uninit ints by-value. However, the examples you pointed to above do not do that, they only use uninit memory behind a reference. So they are two very different (but related questions).

@RalfJung
Copy link
Member Author

RalfJung commented Jul 3, 2022

So, no, this is not "the exact situation" you referenced. It is a very different situation.
To wit: that other example errors in Miri, the examples you referenced do not.

@RalfJung
Copy link
Member Author

RalfJung commented Jul 5, 2022

rust-lang/rust#98919 asks the lang team to decide that uninit integers (and floats and raw pointers) are UB.

@5225225
Copy link

5225225 commented Sep 7, 2022

98919 was closed with "yes, uninit ints/floats/raw pointers are instant UB to move" so the answer to

The remaining open question is: is it ever allowed to have an uninitialized bit in an integer or floating point value?

is "no", so we don't need to answer how it behaves. Should this get closed, then? Not sure when UCG issues get closed, but this looks like it's been decided by the lang team.

@RalfJung
Copy link
Member Author

RalfJung commented Sep 7, 2022

They get closed when we have done the writeup.

... which we haven't done in years...

@Aaron1011
Copy link
Member

From reading through this thread, there seems to have been a pretty dramatic reversal from the original loose consensus of "it isn't worth effort/complexity to make uninitialized integers UB". As far as I can tell, this was partially due to the fact that LLVM is making its optimization passes more conservative, and will require an explicit noundef to keep the codegen that we currently expect.

I'm not trying to re-litigate the already-completed FCP. However, I'm concerned that this is going to be difficult to teach - I suspect people will fall back to their (incorrect) mental model of "a u8 can hold any initialized byte, and an uninitialized byte is 'just an arbitrary byte', so an uninitialized u8 is fine".

Is there a concrete example of an optimization we can point to that explains why this behavior is useful? Ideally, it would:

  • Break if the any of the values involved are initialized
  • Work if the values came from frozen uninitialized memory
  • Have some sort of performance issue if we made all MaybeUninit::assume_unit have an implicit freeze - I think many people are going to wonder why we don't "just" do this.

My overall concern is that people might (implicitly) decide that this is too "weird" or "theoretical" without a concrete justification for this "weirdness", and write incorrect code that never gets run under Miri. Of course, undefined behavior is defined by the behavior of the Abstract Machine, but I think being able to say "this 'weirdness' is required for the following useful snippet to be optimized well" will really help to reinforce that in practice.

@CAD97
Copy link

CAD97 commented Oct 25, 2022

The obvious example is the Itanium Not a Thing, where using uninitialized data will actually cause a CPU-level trap.

The usual concrete example of uninitialized memory is reading from MADV_FREEd pages returning nondeterministic results based on what the state of the page allocation is.

AIUI, the main benefit of assume_init not doing a freeze is being able to optimize it out. If you do *p = freeze(*p), that isn't a removable noöp read/write pair, because it could actually change the state of memory. memcpy elision is very important, and freeze basically means to not do that.

This probably wouldn't be that big of a deal for just assume_init, since those will probably be comparatively rare — but the real problem is normal pointer reads. Introducing a freeze on every pointer read is almost certainly a nonstarter.

There's also the fact that we will replace by-value passing with by-ref passing when the value is large. This can be optimized into a continued move — but if one of the steps is freeze, that forces an actual memcpy at that point, rather than continued reference passing.

And it's not a super strong motivator, but it'd be an annoying footgun if MaybeUninit::assume_init froze, but Box::assume_init doesn't (because it can't; freeze is a by-value operation). At least part of the reasoning is simpler overall semantics of the language.

@thomcc
Copy link
Member

thomcc commented Oct 25, 2022

I don't really find your examples that convincing, TBH:

  • That's not quite how NaT works on Itanium -- a load of uninit data into a NaT register is considered initialized from from the persective of NaT. I also don't think we should base our semantics on what would be good for a mostly-unused architecture (and I say that as someone who owned an Itanium in the past).
  • We can already eliminiate *a = freeze(*a) usually (due to &mut T and &T), and memcpy in most cases (where src and dst cannot overlap).
  • MADV_FREE is not relevant for uninitialized values on the stack, which is what this is largely talking about.
  • For the same reason, your statement about being unable to remove by-ref passing seems incorrect to me.

@nbdd0121
Copy link

One example is running the code under Valgrind. Since freeze generates no actual instruction, this info is not conveyed to Valgrind. For the "uninitialized u8 is fine" model to work with Valgrind we would have to generate a lot of calls of VALGRIND_MAKE_MEM_DEFINED to make Valgrind happy.

@RalfJung
Copy link
Member Author

RalfJung commented Oct 25, 2022

"a u8 can hold any initialized byte, and an uninitialized byte is 'just an arbitrary byte', so an uninitialized u8 is fine"

Note that this is wrong even if we ignore uninit memory -- u8 cannot hold a byte with provenance. If you try to put provenance into u8, it will be stripped. As a consequence, a memcpy implemented with u8 is incorrect. I don't think there is any way to salvage the model of u8 being a universal byte type -- that ship sailed before Rust even started.

@digama0
Copy link

digama0 commented Oct 25, 2022

The model of u8 being a universal byte type does actually exist in C/C++, so I wouldn't call it a nonstarter exactly, but it leads to a lot of weird special cases where memcpy of char* is magic compared to any other type.

But I assume that people who make the assumption that u8 can store all possible AM bytes would make the same assumption about u16, so the C/C++ solution isn't desirable even for those people.

@RalfJung
Copy link
Member Author

RalfJung commented Oct 25, 2022

The model of u8 being a universal byte type does actually exist in C/C++,

Not quite -- clang adds noundef for char arguments in C/C++. See llvm/llvm-project#56551 for how that is justified. At last for C++, their argument seems solid. For C, the standard is too murky to tell.

@thomcc
Copy link
Member

thomcc commented Oct 25, 2022

At this point I've mostly given up on this topic and accepted that it's going to be UB, but I do feel like it is very close to a breaking change -- it feels like there's almost no way that someone could predict in, say, 2016 that uninitialized [u8; N] would be UB, given the existence of mem::uninitialized() (and I don't think our 0x01-initialization fixes this, since IIUC it's still considered UB, for example).

Even after that, for a long time (for most of the period this thread has been open, for example), the concensus was on the side of "this is probably fine" too... It does feel like soon after LLVM adds optimizations based on noundef opinions changed to it needing to be UB. That's pretty frustrating, and is part of why it's still very hard (or impossible) to write unsafe code that can last for a long time. This is part of why I try to support the work to hammer these things out, so that in the future we can't make the decision based on what's most convenient at the time, and can instead have firmer semantics that can be programmed against.

(Separately, I do (strongly) think there is a need to support an unsafe MaybeUninit::<T>::freeze(self) -> T function though, but that's separate as it would not produce uninit bytes from the perspective of the AM)

@RalfJung
Copy link
Member Author

RalfJung commented Oct 25, 2022

If someone had designed a proper op.sem in 2016, I think they would have quickly come to the conclusion that at least this is an unanswered question and we should clearly tell people to avoid doing this until there is an answer. But sadly unsafe Rust started with a pre-rigorous phase and it'll take a while for that to shake out. This was definitely predictable when this paper got published, but it took a while for that realization to spread and even longer until suggesting 'uninit integers are UB' seemed like a proposal that has realistic chances of being accepted in Rust. I thought for a long time that maybe we can allow 'fully uninit integers' without too much cost, but then it turns out that even that already blocks a bunch of optimizations in safe code that we really want. I know it's painful but I also don't know what else we could have done, other than stabilizing MaybeUninit sooner and pushing people harder to use it. Maybe the lesson is that we should have been more aggressive at pushing the notion that uninit integers are UB -- I was going slow here for fear of losing everyone if I go too quickly, and due to having a small hope that maybe this can be avoided (which was clearly a misjudgement on my side).

Regarding optimizations, note that this is less "LLVM becoming more conservative" and more "LLVM fixing optimizations that caused real-world miscompilations". I don't know why you framed this in such a strange way, @Aaron1011.
@nikic would know better than me where those noundef annotations are crucial, but @scottmcm did have a concrete example during the invalid-value-lint-FCP.

@Aaron1011
Copy link
Member

Aaron1011 commented Oct 25, 2022

@RalfJung: I didn't mean to imply that the reasons for noundef were purely theoretical, or that there were no actual miscompilations caused by the current behavior. I only meant that LLVM is going to (correctly) stop performing certain optimizations on the IR we currently emit.

@RalfJung
Copy link
Member Author

RalfJung commented Oct 25, 2022

We can already eliminiate *a = freeze(*a) usually (due to &mut T and &T)

I don't think we can, what does this have to do with references?

MADV_FREE is not relevant for uninitialized values on the stack, which is what this is largely talking about.

So should we say that uninit integers are fine only on the stack but not on the heap? That sounds quite terrible.

We could declare MADV_FREE memory entirely unsupported in Rust. But this is going down the train of not having poison/undef in the language, which I think is he wrong direction -- there's a reason basically every compiler has such a concept, they are crucial for getting good codegen.

I also think a magic assume_init would have caused people to think that this is how all uninit memory works, so when people use unions or raw pointers they would then still get things wrong. There's a benefit to having a clear line that can be taught consistently.

This is part of why I try to support the work to hammer these things out, so that in the future we can't make the decision based on what's most convenient at the time, and can instead have firmer semantics that can be programmed against.

I think it's a bit unfair to say that we are doing this based on what's "most convenient". If we had declared that integer types can hold uninit memory, we would have had bugreports saying "I wrote this safe code and codegen didn't produce the results I wanted", and we'd have to explain each time that it's because other people write unsafe code that we wanted to allow and we can't have our cake and eat it, too. There's no easy answer here, and "make safe code pay in performance for things unsafe code sometimes wants to do" is not obviously a good outcome, either.

I fully agree though that we need to have firmer semantics that people can rely on, that's the entire point of the vast majority of my work on Rust.

@nikic
Copy link

nikic commented Oct 25, 2022

I'm not trying to re-litigate the already-completed FCP. However, I'm concerned that this is going to be difficult to teach - I suspect people will fall back to their (incorrect) mental model of "a u8 can hold any initialized byte, and an uninitialized byte is 'just an arbitrary byte', so an uninitialized u8 is fine".

FWIW, I think it's kind of the contrary: I find the MaybeUninit model that Rust ended up adopting to be very easy to understand, because you explicitly say where uninitialized memory is possible and where it's not. There's a very clear and easy to reason about dividing line. You don't have to think about which types support uninitialized values and which don't, and what behavior you get when you try to perform operations on uninitialized values.

Anyway, regarding freeze, I think something that people don't realize is that freeze poison is, for most practical purposes, equivalent to zeroinitializer, because that's what LLVM will convert it into. Making mem::uninitialized() return frozen memory is effectively equivalent to returning mem::zeroed(). (And this is kinda what we ended up doing, just with a different, safer fixed bit pattern.)

Doing a freeze in assume_init() would have similar effects. In the lucky case, it would zero-initialize everything that has not been explicitly initialized. In the unlucky case (which is actually more likely for non-trivial initialization) it would perform a memcpy. The whole point of using MaybeUninit is to prevent these things from happening.

There is, at present, no way to support "uninitialized memory" in LLVM that is both efficient (i.e. does not end up just initializing memory anyway) and does not leak things like undef/poison semantics into Rust's semantics (with all the weirdness that comes with it, like x == x returning false for an uninitialized u8). And I don't think there is any interest in making this possible on the LLVM side either.

So just to spell this out clearly, I believe the options you get (from an LLVM perspective) are:

  • There is no efficient uninitialized memory.
  • There are poison semantics in the Rust AM.
  • Uninitialized memory is nicely encapsulated in MaybeUninit.

@JakobDegen
Copy link
Contributor

Closing as answered

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-validity Topic: Related to validity invariants S-writeup-needed Status: Ready for a writeup and no one is assigned
Projects
None yet
Development

No branches or pull requests