
Meaning of Undefined and Justification for UB #253

Closed
chorman0773 opened this issue Oct 18, 2020 · 74 comments
Labels
A-abstract-machine Topic: concerning the abstract machine in general (as opposed to any specific part of it) C-terminology Category: Discussing terminology -- which term to use, how to define it, adding it to the glossary

Comments

@chorman0773
Contributor

From various responses, I am confused about the meaning of Undefined Behaviour in Rust. Coming from a C++ background, and having done extensive personal research on undefined behaviour, I understand the term to be literal: behaviour which is not defined. In C++ and C it is explicitly specified as "behaviour for which this international standard imposes no requirements". In a number of specifications I have written, I have adopted similar wording. As far as I can tell, Rust does not explicitly define the term, so I assumed it has the same meaning (and it seems to have that same meaning). In particular this definition permits an implementation which assigns some meaning to undefined behaviour while still conforming to the standard/specification (as an example, see clang and gcc with union type-punning in C++). However, a comment on #84 leads me to believe this would not be valid in Rust. If so, would it be reasonable to provide an explicit definition for the term, and is there a particular reason why a restricted interpretation of the term is beneficial to Rust?

One point, I've noticed that UB has to be justified by the optimizations it enables. I would add that undefined behaviour was never intended to be a key to optimizations; it just happens that, as a result of its definition and the conformance clauses of the mentioned standards, optimizations that assume UB doesn't occur are permitted. Rather, the original intent, at least from what I can determine, was to provide an escape hatch for portions of the standard that either cannot be specified or are deliberately left unspecified, because some reasonable implementation would not be able to provide a particular behaviour. If this is in fact the case in UCG, would it be reasonable to extend this justification to include reasonable implementations, not just optimizations, that are enabled as a result of the undefined behaviour?

@Diggsey

Diggsey commented Oct 18, 2020

UB is the same in Rust as it is in C++.

In particular this definition permits an implementation which assigns some meaning to undefined behaviour

A compiler implementation could specify what happens for some subset of programs which have UB according to the Rust language. However, this is out of scope when it comes to specifying Rust itself, and it does not mean that the program itself becomes valid Rust.

One point, I've noticed that UB has to be justified by the optimizations it enables. I would add that undefined behaviour was never intended to be a key to optimizations, it just happens that as a result of its definition

Even from the beginning, the C/C++ standards left things undefined specifically to allow compilers to translate code to more efficient machine code. For example, this is why int can be different sizes, and it infects almost every part of those standards. You're right that they were not originally designed to specify an "abstract machine" - that became a necessity later as compilers started being more aggressive - but optimization, and generating more efficient code, drove a large part of the decision-making from the very start.

There's no reason to leave something as UB unless it allows for some optimization, because UB has a significant cost. Instead, if no optimizations can be enabled, it would be better to specify the behaviour, or define a range of reasonable behaviours, but leave the exact choice up to the implementation.

@RalfJung
Member

RalfJung commented Oct 18, 2020

Thank you for moving this discussion to a separate thread!

As far as I can tell, Rust does not explicitly define the term

There is a definition of UB in our glossary. This definition coincides with how modern C/C++ compilers interpret UB in their respective languages. (I should add though that the UCG glossary represents UCG consensus, not Rust-wide consensus.)

There is also an excellent blog post by Raph Levien that goes a bit into the history of UB. According to that post, UB in C/C++ used to be more about "we do not want to restrict what hardware does" than about enabling optimizations, but this meaning has shifted over time. In my opinion, UB is a terrible word for how the term is used today, I think something like "language contract" or so is much clearer, but I'm afraid we are probably stuck with it. The concept itself however is great: it is a way for the programmer to convey information to the compiler that the compiler would have no way to infer itself. However, problems arise when the programmer does not realize what information they are conveying. This happens a lot in C/C++ (when a programmer writes + they might not want to convey "I carefully checked that this will not overflow"), and also happens in Rust with some of the more subtle UB, in particular around validity and aliasing.
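
For concreteness, here is a minimal Rust sketch of the "conveying information" point (unchecked_add stands in for the explicit way to make the no-overflow promise; it was an unstable method at the time of this discussion):

fn sum(a: i32, b: i32) -> i32 {
    // In Rust, `+` does not tell the compiler "I checked that this cannot
    // overflow": overflow panics in debug builds and wraps in release
    // builds, and both outcomes are defined. The explicit way to convey
    // the no-overflow promise is the unsafe unchecked operation, which is
    // UB on overflow:
    //     unsafe { a.unchecked_add(b) }
    a + b
}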

@RalfJung
Member

RalfJung commented Oct 18, 2020

One point, I've noticed that UB has to be justified by the optimizations it enables. I would add that undefined behaviour was never intended to be a key to optimizations, it just happens that as a result of its definition

Historically, UB might not have started as being primarily for optimizations, but over the last few decades that has certainly been the case. To give one example, strict aliasing is UB in C, and that UB has only one purpose: more optimizations. (Specifically, the story I was told is that C compilers needed to be able to compete with Fortran compilers. Fortran has very strong aliasing guarantees, and the only way they saw to make C competitive was to also have some aliasing guarantees in C.)

In Rust, without the historical baggage of C/C++, we use UB only for optimizations. There are better ways to handle platform and implementation differences, as @Diggsey mentioned. For example, we have little-endian and big-endian platforms, and this is handled by having an explicit parameter in the Rust Abstract Machine defining endianness. So it is not UB to do something byte-level with multi-byte integer types, but results differ per platform. Such differences should obviously be kept to a minimum to make code maximally portable, but there can be good reasons to introduce them. Likewise, integer overflow is defined to either raise a panic or produce two's-complement overflowing results (and in practice this is controlled by compiler flags). In such cases it is important to precisely specify what all the possible cases are, so that programmers can make their code correct with respect to all Rust implementations. This is what sets such platform/implementation differences apart from UB.
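
A small sketch of the difference: the result below depends on an Abstract Machine parameter (endianness), but is fully defined for each choice of that parameter, so it is not UB:

fn main() {
    let bytes = 0x1234_5678u32.to_ne_bytes();
    if cfg!(target_endian = "little") {
        assert_eq!(bytes, [0x78, 0x56, 0x34, 0x12]);
    } else {
        assert_eq!(bytes, [0x12, 0x34, 0x56, 0x78]);
    }

    // Overflow is likewise defined: a panic or two's-complement wrapping,
    // selected by compiler flags -- never UB.
    assert_eq!(i32::MAX.wrapping_add(1), i32::MIN);
}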

In particular this definition permits an implementation which assigns some meaning to undefined behaviour

A compiler implementation could specify what happens for some subset of programs which have UB according to the Rust language. However, this is out of scope when it comes to specifying Rust itself, and it does not mean that the program itself becomes valid Rust.

To add to this, the purpose of the UCG (unsafe-code-guidelines WG) is to specify Rust, not to specify a particular Rust implementation. Basically, the long-term goal of the UCG is to produce something akin to (but better than ;) the C/C++ spec. As far as the spec and therefore the UCG is concerned, programs with UB are just wrong, period. This is the same as in C/C++: the spec does not discuss any such implementation-specific guarantees.

Some members of the lang team have also expressed a preference in the past of not making any extra promises in rustc [the implementation] for things that are UB in Rust [the language]. They want to avoid fragmenting the language into dialects that only work with some implementations. Worse, since there is only one implementation currently, there is a huge risk of any such implementation-specific promise becoming a de-facto guarantee that the entire ecosystem relies on.

Therefore, as far as the rust-lang organization is concerned, programs with UB are beyond salvaging. They are not subject to stability guarantees (or any guarantees really) and they need to be fixed. Implementations could assign meaning to UB programs, but rustc [the implementation] does not. In fact it would be healthier for the ecosystem if alternative implementations (once they exist) do not do so, either, since any such guarantee is an ecosystem split -- programs that run fine in one implementation do not run fine in another. Effectively, if an implementation makes such a promise, then it implements a different language, with a different Abstract Machine. That's why I talked about "dialects".

In practice, rustc [the implementation] will do what it can to help programmers even if their programs have UB, provided it does not compromise UB-free programs. Usually the goal here is to make the programmer aware of the problem so that they can fix their code. Sometimes we even temporarily take back changes that "break" UB programs until UB-free ways to do things are possible; this happened around unwinding for extern functions. (I put "break" in quotes because, with my spec hat on, UB programs are already broken; the compiler did not do anything wrong.) We do not just ignore the needs of users that have UB in their code, but the goal of that conversation is always to find a non-UB way for them to do what they need to do. None of this really changes the fundamental stance on UB, in particular as far as the spec and the UCG are concerned.

@RalfJung RalfJung added C-terminology Category: Discussing terminology -- which term to use, how to define it, adding it to the glossary A-abstract-machine Topic: concerning the abstract machine in general (as opposed to any specific part of it) labels Oct 18, 2020
@digama0

digama0 commented Oct 18, 2020

Does rust have a category corresponding to C/C++'s "implementation defined" then? It sounds like we would want to avoid it, and as long as "rust = rustc" it's a bit difficult to distinguish implementation defined from plain old defined behavior.

@RalfJung
Member

RalfJung commented Oct 18, 2020

Does rust have a category corresponding to C/C++'s "implementation defined" then? It sounds like we would want to avoid it, and as long as "rust = rustc" it's a bit difficult to distinguish implementation defined from plain old defined behavior.

Not yet, mostly for the reason you mentioned. I think such questions will come up way later in the process.

The IMO more interesting other "kind of behavior" to talk about is unspecified behavior, which is closely related. There was an attempt to define it that failed to reach consensus. (That PR should likely be closed and a new one started.) The only real difference between "unspecified" and "implementation-defined" is that for the latter, implementations need to document which choice they make -- so once we've nailed down what "unspecified behavior" means, we have pretty much also covered "implementation-defined"; we just need to decide on a case-by-case basis whether implementations ought to document a choice (and guarantee that choice for future versions of the implementation) or not.

@chorman0773
Contributor Author

There's no reason to leave something as UB unless it allows for some optimization

Well, aside from permitting implementations with behaviour that diverges from possible specifications. This was a primary reason why signed integer overflow was undefined in C/C++: there were too many possible behaviours, depending on the machine architecture and the signed integer format (which was unspecified until C++20, and I believe C2x does the same). Trapping is strictly not a part of the C or C++ standard, so whenever a potential behaviour is to trap, the behaviour is necessarily undefined.

I would still say it's a good idea to define the term somewhere, so as to avoid issues with interpretation.

However, this is out of scope when it comes to specifying Rust itself, and it does not mean that the program itself becomes valid Rust.

I do agree. However, it is always a good idea to acknowledge that a particular implementation may promise to respond to particular UB in a particular way, and that many implementations may all agree on this meaning (returning to my type-punning unions example), so it is possible to exploit that known extension (one of my personal rules of UB says that Known Compiler Extensions are fine). From the response I got, it seems like it's illegal to "define" undefined behaviour in Rust, even though it may be necessary to implement the specification itself. In several places in my implementation of libcore, I transmute between &[T], which is technically repr(Rust), and a repr(Rust) type RawSlice<T>; this is strictly UB in Rust, but RawSlice<T> is lang-itemed to define the layout of &[T], so with that knowledge it's fine. Compiler-support and standard libraries basically get free rein, because if something isn't defined by the language, they can just add an extension that defines it. One example: as a companion to the strict definition, I normally include a note giving a non-exhaustive list of resulting behaviours, and mention there that an implementation may assign meaning to the undefined behaviour. In my C++ API, in the section defining the term, I have the following note:

Note - Valid responses to undefined behaviour include (but are not limited to) assigning meaning to it,
discarding the construct (and potentially surrounding code that leads to it), ignoring the behaviour (potentially causing further issues with well-defined constructs), and causing or reporting an error. - End Note

It may be reasonable to include such a note, or some acknowledgement that a valid implementation may assign actual meaning to the behaviour. In general, yes, you shouldn't invoke undefined behaviour, but sometimes (especially when writing libraries to support the specification/standard) it can become unavoidable.
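
For illustration, a sketch of the kind of code in question. RawSlice here is hypothetical, standing in for whatever internal type an implementation lang-items to the layout of &[T]; outside the implementation that makes that promise, this transmute relies on unspecified layout:

use core::mem;

// Hypothetical mirror of the implementation's slice layout: pointer + length.
#[repr(C)]
struct RawSlice<T> {
    data: *const T,
    len: usize,
}

fn decompose(s: &[u8]) -> (*const u8, usize) {
    // Sound only if the implementation itself guarantees this layout;
    // for everyone else, the layout of &[T] is unspecified.
    let raw: RawSlice<u8> = unsafe { mem::transmute(s) };
    (raw.data, raw.len)
}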

Worse, since there is only one implementation, there is a huge risk of any such promise becoming a de-facto guarantee that the entire ecosystem relies on.

This is one of the reasons I am working on lccc, so that rustc's implementation does not become the de-facto standard.

it would be healthier for the ecosystem if alternative implementations do not do so, either, since any such guarantee is an ecosystem split.

Sometimes yes, there are some instances when this becomes necessary. For example, low level code may need further guarantees the language does not provide, and that is why such extensions exist. Many of lccc's extensions are inherited from the fact that it's designed to be a compiler for Rust, C, and C++ and to successfully operate on code that would work with gcc or clang (in particular, support of libstdc++ and libc++ is a goal because of binary compatibility), so many of the rules implemented for Rust are loosened where C or C++ has weaker requirements, and vice versa.

In Rust, without the historical baggage of C/C++, we use UB only for optimizations. There are better ways to handle platform and implementation differences

As I mentioned above, sometimes it is infeasible to do so, as it requires adding additional requirements to the spec that may end up incredibly vague (see the signed overflow example). If you talk about "trapping", how do you define that so as to cover all possible ways traps can occur and be handled? Further, strict aliasing exists because the meaning of values (and pointers; see the fact that reinterpret_cast<T*>(&u), where u has a different type U, may not be a bitwise identity) may change arbitrarily depending on types. I have noted that as a result of strict aliasing, it is possible to compile C and C++ code to JVM bytecode, with a (substantial) support library, without having to emulate memory. Without it, while still maintaining "typed memory", it becomes substantially harder. The union rules exist for the same reason. Both are examples of implementation differences that are not possible to effect within the bounds of defined-but-unspecified behaviour (again, primarily because it would be incredibly stupid to talk about what the heck trapping is, does, and means). Arguably it's better to say something is UB that you have to avoid like the plague than to add 5 new sections discussing something novel, only for the result to be incredibly vague and completely unreasonable to work with.

@digama0

digama0 commented Oct 18, 2020

@RalfJung

The IMO more interesting other "kind of behavior" to talk about is unspecified behavior, which is closely related. There was an attempt to define it that failed to reach consensus. (That PR should likely be closed and a new one started.) The only real difference between "unspecified" and "implementation-defined" is that for the latter, implementations need to document which choice they make -- so once we've nailed down what "unspecified behavior" means, we have pretty much also covered "implementation-defined"; we just need to decide on a case-by-case basis whether implementations ought to document a choice (and guarantee that choice for future versions of the implementation) or not.

I don't see why undocumented things would ever be a good idea. Unstable things might be, and really I think most of rustc is currently in that category: all the nightly features are clearly not UB but also not among the (very few!) actually stable and defined behaviors that are in the UCG document.

@chorman0773

Well, aside from permitting implementations with behaviour that diverges from possible specifications. This was a primary reason why signed integer overflow was undefined in C/C++: there were too many possible behaviours, depending on the machine architecture and the signed integer format (which was unspecified until C++20, and I believe C2x does the same). Trapping is strictly not a part of the C or C++ standard, so whenever a potential behaviour is to trap, the behaviour is necessarily undefined.

I agree with Ralf that this was a huge misstep on the part of the C/C++ committees. This really should have been implementation defined behavior (or platform-specific behavior), not undefined behavior. Making things like signed overflow UB makes things much more hazardous for the programmer, and when you couple it with the newly re-imagined UB as license for optimization you have a recipe for disaster.

I do agree. However, it is always a good idea to acknowledge that a particular implementation may promise to respond to particular UB in a particular way, and that many implementations may all agree on this meaning (returning to my type-punning unions example), so it is possible to exploit that known extension (one of my personal rules of UB says that Known Compiler Extensions are fine).

My interpretation is that it is allowed for compilers to extend the language, but it is discouraged, because we would much rather incorporate those extensions into the language itself or come up with some suitable alternative that doesn't require creating language dialects. In particular, if you do fewer optimizations than rustc, or are doing something that matches better with C semantics, and as a result can (and are willing to) make more guarantees about behavior that would normally be undefined, I don't think that would be a problem. But programmers won't really be able to rely on it unless they write only for your compiler.

From the response I got, it seems like it's illegal to "define" undefined behaviour in Rust, even though it may be necessary to implement the specification itself. In several places in my implementation of libcore, I transmute between &[T], which is technically repr(Rust), and a repr(Rust) type RawSlice<T>; this is strictly UB in Rust, but RawSlice<T> is lang-itemed to define the layout of &[T], so with that knowledge it's fine. Compiler-support and standard libraries basically get free rein, because if something isn't defined by the language, they can just add an extension that defines it.

This is not UB, this is dependence on unspecified behavior. All types in Rust have a layout, and if you write code for the layout that actually occurs, then that is not UB. So as long as you are willing to live with the lack of stability, you can depend on the layout of repr(Rust) things, as long as you don't guess the type layouts incorrectly (possibly because you are the standard library and thus have control over such things).

This is one of the reasons I am working on lccc, so that rustc's implementation does not become the de-facto standard.

I for one am glad you are doing so. It is easy to get into a mindset that is aligned to the single implementation, and accidentally equate rustc behaviors to Rust behaviors, and I hope that a re-implementation will shake things up.

As I mentioned above, sometimes it is infeasible to do so, as it requires adding additional requirements to the spec that may end up incredibly vague (see the signed overflow example). If you talk about "trapping", how do you define that so as to cover all possible ways traps can occur and be handled?

I think implementation defined behavior or platform-specific behavior handles this well; on a particular platform or with implementation context, you can say more about what exactly a "trap" entails, for example, and the main spec doesn't have to touch it.

Further, strict aliasing exists because the meaning of values (and pointers; see the fact that reinterpret_cast<T*>(&u), where u has a different type U, may not be a bitwise identity) may change arbitrarily depending on types.

I don't think the meaning can change arbitrarily, at least in Rust. Also AFAIK transmute is always a bitwise identity, although it may still be undecided what it does to shadow state (SB treats it as a no-op right now).

Arguably it's better to say something is UB that you have to avoid like the plague than to add 5 new sections discussing something novel, only for the result to be incredibly vague and completely unreasonable to work with.

One additional desideratum that rust has for its UB is that it should be dynamically checkable, using Miri. I'm not totally sold on this being an iron rule, but it is definitely a major improvement on the C/C++ situation where there are mines around every corner and no way to know that you have stepped on one until it is far too late. So that is an additional reason why we might not want to throw everything into the UB-bucket, if it involves a property that is not (easily) decidable.

@chorman0773
Contributor Author

This is not UB, this is dependence on unspecified behavior.

I don't know how correct it is; however, the current Rustonomicon explicitly mentioned transmuting between non-repr(C) types (which I have taken to specifically mean repr(Rust), since transmuting is one of the exact purposes of repr(transparent)) as UB. Apparently this has changed and is no longer the case, but it did previously include that.

This really should have been implementation defined behavior (or platform-specific behavior), not undefined behavior.

My question is how you define trapping. As I mentioned, trapping is outside the bounds of the C++ standard, so it wouldn't be an acceptable choice for unspecified behaviour. On ARM, signed integer overflow causes a hardware trap, so if we simply exclude trapping behaviour, that either requires extra code to support ARM, or ARM is no longer a valid target.

and the main spec doesn't have to touch it.

It would come down to being defined, since a trap would explicitly interact with observable behaviour and with whether or not such observable behaviour occurs. The requirement would have to be extraordinarily vague, which is worse than the current status quo; at least we know that overflow is something not to touch.

I don't think the meaning can change arbitrarily, at least in Rust.

In C++ and C, the standard acknowledges implementations where that is the case. Right now, Rust is impossible (or at least unreasonably difficult) to implement in such an environment. This isn't necessarily an issue for practical implementations, as many of these are theoretical, or things done "for fun" (see the JVM implementation I mentioned).

it is definitely a major improvement on the C/C++ situation where there are mines around every corner and no way to know that you have stepped on one until it is far too late.

My question then becomes: would you rather something be specified as UB, or be so vaguely specified (because the actual specification is unreasonable or impossible) that it's possible to derive a valid interpretation where it is, in fact, UB? The latter means the point itself is UB, because compilers love the "best" possible interpretation, as we have established. Going to the trapping example: if it's left to the platform to decide what "trapping" is, what if the decision is that a result that traps is UB? How would you define "trapping" to cover implementations that do trap but handle traps in particular ways, or may not even have the ability to handle traps, or where whether a trap can be handled depends on arbitrary state, etc., such that there isn't a valid interpretation where the result is UB, or effectively UB? I did bring this up in #84 (though I will concede it was off-topic there), where the layout rules of enums are unspecified, but with niche optimizations it was possible to divine an interpretation where unspecified behaviour (relying on the unspecified layout of repr(Rust) enums) could be elevated to undefined behaviour. Specifying something as unspecified behaviour that can result in undefined behaviour is the same as calling it undefined, except now it's hidden behind interpreting the specification with a language-lawyer hat on, which is less fun for regular programmers, I'm sure.

@Diggsey

Diggsey commented Oct 18, 2020

would you rather something be specified as UB, or be so vaguely specified (because the actual specification is unreasonable or impossible) that it's possible to derive a valid interpretation where it is, in fact, UB

UB is by definition the vaguest possible specification. Unless there is a reason for the UB to exist, then a less vague specification is always better IMO, even if it is still quite vague.

@digama0

digama0 commented Oct 18, 2020

@chorman0773

I don't think the meaning can change arbitrarily, at least in Rust.

In C++ and C, the standard acknowledges implementations where that is the case. Right now, Rust is impossible (or at least unreasonably difficult) to implement in such an environment. This isn't necessarily an issue for practical implementations, as many of these are theoretical, or things done "for fun" (see the JVM implementation I mentioned).

Well, Rust doesn't have reinterpret_cast, so it doesn't really matter what this does. Rust has transmute, and this does a bitwise reinterpretation of the value. I'm not exactly sure what about this makes it impossible to implement, unless you are talking about typed memory (in which case, yes, you are probably limited by the ease of doing a bitcast in the typed memory).

Going to the trapping example: if it's left to the platform to decide what "trapping" is, what if the decision is that a result that traps is UB? How would you define "trapping" to cover implementations that do trap but handle traps in particular ways, or may not even have the ability to handle traps, or where whether a trap can be handled depends on arbitrary state, etc., such that there isn't a valid interpretation where the result is UB, or effectively UB?

I think it should be a valid option for an implementation to say that an instance of implementation defined behavior is in fact undefined behavior. Overflow seems like a good candidate for that. You can perhaps set flags so that overflow traps (with attendant definition of what this entails on the machine state), or wraps, or is UB.

Of course, if you are writing portable code, then as long as any implementation defines it as UB, you the programmer can't trust it to be anything more than that; but you could use #[cfg] flags and such to do the right thing on multiple platforms or implementations.
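
A sketch of that pattern (the operations here are defined stand-ins; the point is the mechanism of selecting per target statically, rather than relying on any one implementation's treatment of UB):

#[cfg(target_pointer_width = "64")]
fn add_index(a: usize, b: usize) -> usize {
    // On this target we might rely on documented wrapping behaviour.
    a.wrapping_add(b)
}

#[cfg(not(target_pointer_width = "64"))]
fn add_index(a: usize, b: usize) -> usize {
    // Elsewhere, fall back to a fully checked version.
    a.checked_add(b).expect("index overflow")
}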

I did bring this up in #84 (though I will concede it was off-topic there), where the layout rules of enums are unspecified, but with niche optimizations it was possible to divine an interpretation where unspecified behaviour (relying on the unspecified layout of repr(Rust) enums) could be elevated to undefined behaviour. Specifying something as unspecified behaviour that can result in undefined behaviour is the same as calling it undefined, except now it's hidden behind interpreting the specification with a language-lawyer hat on, which is less fun for regular programmers, I'm sure.

Right now, writing code for repr(Rust) is very hazardous, for exactly this reason. It's not literally UB if you get it right, but it may as well be for the programmer, because very little about it is stably guaranteed. Instead, there are things like the layout API that allow you to access this information in a more portable way; ideally this would be good enough that there is no reason to make risky guesses about the layout, because the safe and stable alternatives are in place.
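
For example, instead of guessing what the compiler chose for a repr(Rust) type, the actual choice can be queried (a minimal sketch; the values are correct for the current compilation but remain unspecified across compiler versions):

use std::alloc::Layout;
use std::mem;

struct Foo {
    a: u8,
    b: u32,
}

fn main() {
    let layout = Layout::new::<Foo>();
    println!("size = {}, align = {}", layout.size(), layout.align());
    // Equivalent queries through mem:
    assert_eq!(layout.size(), mem::size_of::<Foo>());
    assert_eq!(layout.align(), mem::align_of::<Foo>());
}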

@Diggsey

UB is by definition the vaguest possible specification. Unless there is a reason for the UB to exist, then a less vague specification is always better IMO, even if it is still quite vague.

I disagree. UB (or the "language contract" as Ralf says) is a contract between the programmer and the compiler. A vague definition helps neither party, and may in fact lead to a miscommunication, which is a failure of the spec. A clear UB is at least informative for the programmer (and may additionally simplify their mental model, so it's not necessarily a negative), and it enables more optimizations for the compiler (and a simpler model is also good for the compiler writer to avoid bugs).

@RalfJung
Member

RalfJung commented Oct 18, 2020

Without having time right now to respond to all the points:

In general, yes, you shouldn't invoke undefined behaviour, but sometimes (especially when writing libraries to support the specification/standard) it can become unavoidable.

If the Rust standard library ever invokes UB, that is a critical bug -- please report it if you find such a case. It is certainly avoidable to do so, and it is a huge problem if we do so for all the usual reasons that UB is bad. (There are some known cases of this, but we do consider those bugs that we want to resolve, and efforts/discussions are underway to achieve that.) I think this approach is necessary for building a reliable foundation of the ecosystem. (We could of course do things that are UB in Rust [the language] but not UB in rustc [the compiler]. For the reasons mentioned above, there are no such things.)

It is true that some of our docs are or have been imprecise about the distinction between UB and unspecified behavior, and also sometimes about the distinction between library-level UB and language-level UB. I am trying to fix such cases as I see them.

I disagree. UB (or the "language contract" as Ralf says) is a contract between the programmer and the compiler. A vague definition helps neither party, and may in fact lead to a miscommunication, which is a failure of the spec. A clear UB is at least informative for the programmer (and may additionally simplify their mental model, so it's not necessarily a negative), and it enables more optimizations for the compiler (and a simpler model is also good for the compiler writer to avoid bugs).

I think what @Diggsey meant is not that we should be vague about something being UB or not, but that saying "X is UB" is vague about what happens when X occurs. More vague than any other thing we could say.

@chorman0773
Contributor Author

I'm not exactly sure what about this makes it impossible to implement, unless you are talking about typed memory

The JVM implementation relies on typed memory and strict aliasing to avoid having to emulate memory.

If the Rust standard library ever invokes UB, that is a critical bug

Inherently, the careful use of UB is inevitable in a standard library; but as mentioned, the fact that it's the standard library means it can do this if it wants, it just needs to get the compiler to do what it needs. Generally, it is impossible to fully implement the standard library in the language itself: sometimes this means the use of compiler intrinsics, sometimes the use of things strictly specified as UB.

We could of course do things that are UB in Rust [the language] but not UB in rustc [the compiler]. For the reasons mentioned above, there are no such things.

This is the UB I am referring to here: UB in the language, but which an extension of the particular compiler permits.

UB is by definition the vaguest possible specification. Unless there is a reason for the UB to exist, then a less vague specification is always better IMO, even if it is still quite vague.

At least a specification of UB is not vague about the fact that it is UB, which is what I was referring to. It's worse if it's not outright said "don't do X, X is UB", and instead you have "the behaviour of X is unspecified" constrained in the vaguest possible way, where a valid interpretation of the constraints allows X to have UB; that means X is UB, it just doesn't outright say it. This is worse, as I say, because it's harder to realise that it is UB.

I would add that the signed integer overflow UB actually has had real performance benefits for actual code in the field. In a CppCon talk, which I could probably look up if people wanted it, there was some rather hot code that was using unsigned as a loop control variable, which had a performance regression when it was compiled on x86_64; the regression was fixed by changing unsigned to int. This is one of the reasons why the "Signed Integers are 2's Complement" proposal that was approved for C++20 explicitly elected not to define signed integer overflow when it had the chance to.

also sometimes about the distinction between library-level UB and language-level UB

I'm sure I've made my position on this clear, but for completeness: I really hate the distinction, because it makes it easier to reason about UB (and rule number 1 in my rules for UB is "Do not reason about UB"). The biggest footgun in C++ is not when people don't know about some arbitrary piece of UB; it's when people think they are smarter than the compiler and try to justify a particular kind of UB (I would know, I've tried this before. It didn't end well, hence my rules of UB).

@scottmcm
Member

One point, I've noticed that UB has to be justified by the optimizations it enables. I would add that undefined behaviour was never intended to be a key to optimizations

I'm not convinced by that. Certainly for some things it was more about portability, but I think optimizations have been core from the beginning.

My go-to example: One of the very first things that people wanted compilers to do was register allocation for local variables. Without that optimization things would have to be loaded and stored to the stack all over the place, which would be terrible for runtime performance. But doing that requires making certain things undefined behaviour -- int a, b; can't go in registers if (&a)[1] = 2; is defined to update b.
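
A rough Rust rendering of the same idea (deliberately exhibiting UB, to show what the rule forbids):

fn main() {
    let a = 1i32;
    let b = 2i32;
    let p = &a as *const i32;
    // UB: locals have no guaranteed relative placement (they may live only
    // in registers), so indexing past `a` hoping to reach `b` is undefined.
    // Exactly that freedom is what lets the compiler keep both in registers.
    let _v = unsafe { *p.add(1) };
    let _ = b;
}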

@digama0

digama0 commented Oct 18, 2020

My go-to example: One of the very first things that people wanted compilers to do was register allocation for local variables. Without that optimization things would have to be loaded and stored to the stack all over the place, which would be terrible for runtime performance. But doing that requires making certain things undefined behaviour -- int a, b; can't go in registers if (&a)[1] = 2; is defined to update b.

But couldn't this be handled the same way rust does layout optimization? That is, if you are lucky and guess the compiler's playbook then you can safely update b this way but if you miss and hit the wrong thing then it's UB. (And if b is in a register then of course you will miss.)

@Lokathor
Contributor

if you are lucky and guess

A lot of things can happen if you're lucky and guess.

Specifically, the outcome of UB might be what you expect. It's always possible that the UB doesn't come back to bite you when using a particular compiler, on a particular set of flags, on a particular ... and so on. But what exactly happens is always up in the air, which is why as a user of the language/compiler you need to avoid UB if you want reliable compilations.

But in that particular example with a and b, I can't imagine much good happening. You can't really reason about b (e.g., eliminating a duplicate load, or holding off on a store, etc.) if you're also allowed to access it via a.

@comex

comex commented Oct 18, 2020

But couldn't this be handled the same way rust does layout optimization? That is, if you are lucky and guess the compiler's playbook then you can safely update b this way but if you miss and hit the wrong thing then it's UB. (And if b is in a register then of course you will miss.)

In early C compilers? Yeah, it probably could be handled that way.

You might already realize this, but it couldn't be handled that way in modern compilers without needlessly sacrificing optimization potential. As a simple example:

if b >= 0 {
    do_something_with(&mut a);
    if b < 0 {
        do_something_else();
    }
}

Assuming b did not have its address taken, the compiler would like to delete the second if as dead code, since the condition can never pass. (This kind of useless code often shows up after inlining other functions.) But under the "guessing the compiler's playbook is OK" rule, if b happens to get spilled to the stack, do_something_with would then be allowed to reach out and touch b by indexing from a, making the optimization illegal.

@comex

comex commented Oct 19, 2020

@chorman0773

Trapping is strictly not a part of the C or C++ standard, so whenever a potential behaviour is to trap, the behaviour is necessarily undefined.

It didn't have to be; it could have been implementation-defined. For example, while the C standard makes most kinds of overflow either undefined or well-defined, there is one exception. If you cast from a larger integer type to a smaller one and the value can't fit into the smaller type, the standard says: "either the result is implementation-defined or an implementation-defined signal is raised." (C11 6.3.1.3.3) This gives the implementation an extraordinary amount of flexibility, while not going all the way to "undefined behavior".

On ARM, signed integer overflow causes a hardware trap

It does not.

@chorman0773
Contributor Author

That is, if you are lucky and guess the compiler's playbook

Yeah, that's a brilliant idea. It works fairly well in debug, so what could go wrong?

Sarcasm aside, the only time it's OK to use UB is if you are in a situation with a particular compiler, or a particular set of compilers, and you know that the compiler assigns a particular meaning to the particular undefined behaviour: either because you are very closely tied to the compiler (standard library or compiler-support library) or because you have a documented extension (again, see my union example). "Guessing" what the compiler does falls under reasoning about UB.

But doing that requires making certain things undefined behaviour -- int a, b; can't go in registers if (&a)[1] = 2; is defined to update b.

I will concede that is likely the reason for that UB, but a decent amount of UB in C and C++ has justification beyond that.

"either the result is implementation-defined or an implementation-defined signal is raised." (C11 6.3.1.3.3)

A signal wouldn't be the same as a trap; trapping doesn't need to result in a signal.

It does not.

Huh, I thought ARM was one of the examples where signed integer overflow is trapped at a hardware level (I do know such processors exist).

@RalfJung
Member

RalfJung commented Oct 22, 2020

@chorman0773

I would still say it's a good idea to define the term somewhere, so as to avoid issues with interpretation.

I am always in favor of defining terms. :) As mentioned before, UB is defined in our glossary; if you have suggestions for improving that definition, please let us know!

It may be reasonable to include such a note, or some acknowledgement that a valid implementation may assign actual meaning to the behaviour. In general, yes, you shouldn't invoke undefined behaviour, but sometimes (especially when writing libraries to support the specification/standard) it can become unavoidable.

As noted above, we do not want to encourage implementations to actually do that. Also, I strongly disagree with it being unavoidable. In Rust, we are avoiding relying on "UB in the spec but the compiler guarantees a specific behavior" (modulo bugs), so we have constructive evidence that it is possible to build a language that way. And this is the way I (and I think I am not alone in the UCG and the lang team in thinking so) would prefer other Rust implementations to go as well. Certainly I see no reason that we should explicitly cater to another approach. (We shouldn't explicitly forbid it, either, but nobody has been suggesting that.)

I do not think it is the role of a spec to point out that one could derive other, adjusted specifications for it. That is just obviously true for every document. These derived specifications are separate dialects of Rust. The purpose of the UCG is to spec out the "main language", not to figure out the design space for dialects. At least, I personally have little interest in developing such dialects, and I think the UCG has enough on its plate without that additional mandate. And finally, discussion of such dialects should, even when it occurs, be kept strictly separate from the "main language". We should not mix up what ends up in a Rust spec and what ends up in the spec of some derived language dialects that some hypothetical future implementations might choose to implement instead.

Sometimes yes, there are some instances when this becomes necessary. For example, low level code may need further guarantees the language does not provide, and that is why such extensions exist.

Again I disagree that this is necessary. So far the approach of the lang team and UCG has always been to instead work with the people writing that low-level code, figure out their needs, and see how we can accommodate them without creating language dialects. I firmly believe that this is the better strategy, and I see no reason to think that it would not work. Both sides (language designers and low-level programmers) gain a lot when we can avoid splitting off a "low-level dialect" of Rust.

Further, strict aliasing exists because the meaning of values (and pointers; see the fact that reinterpret_cast<T*>(&u), where u has a different type U, may not be a bitwise identity) may change arbitrarily depending on types.

There were versions of the C spec before strict aliasing. So no, that is not the reason. C could just specify that when the types of stores and loads do not match, the bit-pattern is interpreted at the other type. C provides ways to do this, e.g. through memcpy, so the spec anyway has to allow that possibility and decide how much it wants to say about it (which is not a lot, but that's fine).

Literally the only reason C has strict aliasing rules is to enable more optimizations. If they removed strict aliasing from the spec, there wouldn't be any gaps or open questions created by that. (There'd be a lot of open questions removed actually.^^) This is also demonstrated by the fact that all major compilers have a flag like -fno-strict-aliasing that opts into "C without strict aliasing", where they guarantee that type-punning loads just re-interpret the bits at the new type (which might be UB or not, subject to the usual rules for type punning that the language needs to have anyway).
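
In Rust terms, a sketch of what "-fno-strict-aliasing" semantics look like by default: re-interpreting bits at another type is simply defined, subject only to the usual alignment and validity rules:

fn pun(x: u32) -> f32 {
    // Defined: interpret the bits of a u32 as an f32.
    f32::from_bits(x)
}

fn pun_load(p: &u32) -> f32 {
    // A type-punning load through a casted pointer is also fine: there is
    // no type-based aliasing rule to violate.
    unsafe { *(p as *const u32 as *const f32) }
}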

My question is how you define trapping

There could be all sorts of things you can say without saying it is UB -- things like aborting program execution, or signal handlers (which the standard does talk about). A program that traps will not arbitrarily jump into otherwise dead regions of code, which UB might well do; I don't think it would be too hard to come up with a reasonable list of possible behaviors here.

Inherently, the careful use of UB is inevitable in a standard library

You keep saying that, but it is just not true.^^ Rust proves otherwise (modulo bugs).

I would add that the signed integer overflow UB actually has had real performance benefits for actual code in the field. In a CppCon talk, which I could probably look up if people wanted it, there was some rather hot code that was using unsigned as a loop control variable, which had a performance regression when it was compiled on x86_64; the regression was fixed by changing unsigned to int. This is one of the reasons why the "Signed Integers are 2's Complement" proposal that was approved for C++20 explicitly elected not to define signed integer overflow when it had the chance to.

FWIW, in Rust this particular example does not carry over -- the reason signed integer overflow UB helped here is that people use int for array index arithmetic on a 64-bit machine. In Rust, nobody would do that; you'd use usize, and then there is no longer any performance benefit from making overflow UB.

@digama0

I don't see why undocumented things would ever be a good idea. Unstable things might be, and really I think most of rustc is currently in that category: all the nightly features are clearly not UB but also not among the (very few!) actually stable and defined behaviors that are in the UCG document.

(This was about unspecified vs implementation-defined behavior.)
I think there is little point to precisely documenting how rustc currently happens to lay out its structs and enums... but if someone wants to do that work, and if others find it useful, sure. :)

I think it should be a valid option for an implementation to say that an instance of implementation defined behavior is in fact undefined behavior.

I don't think that would be a good idea. At that point programmers have to basically treat this as UB. So this is effectively equivalent to saying it is UB but some implementations making stronger guarantees about it, which is not a good idea for all the reasons mentioned before.

In fact, if you take it as a given that implementations may guarantee specific behavior for UB, then if you allow implementation-defined behavior to be "implemented as UB" you just made it equivalent to UB. So no, I strongly disagree, "UB" should not be on the list of things an implementation may choose from for unspecified or implementation-defined behavior.


I should add that I think there are examples of UB that are not motivated by optimizations but by "there's literally nothing better we can say". For example, taking a random integer, casting that to a function pointer, and calling the function. There is no reasonable way, in the Abstract Machine, to bound the behavior of such a program. But those cases are by far in the minority for UB, both in C/C++ and in Rust. It would be trivial to precisely describe Rust, C, and C++, and to have a Miri-like checker for them, if this was the only kind of UB that we had.
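
A minimal sketch of that last category:

fn main() {
    let addr: usize = 0xDEAD_BEEF;
    // The Abstract Machine can say nothing bounded about this call: the
    // integer simply is not (in general) a function.
    let f: fn() = unsafe { std::mem::transmute(addr) };
    f(); // UB: calling a function pointer forged from an arbitrary integer
}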

@digama0

digama0 commented Oct 22, 2020

I don't think that would be a good idea. At that point programmers have to basically treat this as UB. So this is effectively equivalent to saying it is UB but some implementations making stronger guarantees about it, which is not a good idea for all the reasons mentioned before.

I agree that for programmers aiming for full portability to any conforming implementation, this may as well be UB. However, it differs from UB in that you can use it selectively if you happen to know more about the particular implementation, e.g. using #[cfg] to use an operation only on a platform/implementation for which it makes sense, and avoiding it on "generic" implementations where it is UB, without loss of generality.

In fact, if you take it as a given that implementations may guarantee specific behavior for UB, then if you allow implementation-defined behavior to be "implemented as UB" you just made it equivalent to UB. So no, I strongly disagree, "UB" should not be on the list of things an implementation may choose from for unspecified or implementation-defined behavior.

I take your point. Although it makes me wonder what the role of #[cfg] is in the language then: if an operation makes sense for some configurations but not others, and is #[cfg]'d in only on platforms or implementations where it is defined, then is that a valid Rust program? It seems that by your reasoning such an operation would have to be defined as UB in the abstract machine, meaning that the full program, including the #[cfg]s, can't be considered a valid Rust program, even though it only exercises valid implementation-defined behavior (which we have agreed can't be called such because some implementations make it UB).

@RalfJung
Member

RalfJung commented Oct 22, 2020

I take your point. Although it makes me wonder what the role of #[cfg] is in the language then: if an operation makes sense for some configurations but not others, and is #[cfg]'d in only on platforms or implementations where it is defined, then is that a valid Rust program? It seems that by your reasoning such an operation would have to be defined as UB in the abstract machine, meaning that the full program, including the #[cfg]s, can't be considered a valid Rust program, even though it only exercises valid implementation-defined behavior (which we have agreed can't be called such because some implementations make it UB).

The Abstract Machine has parameters, for things like pointer size and endianness. cfg lets a program query those parameters so that a single program can run for multiple different choices of those parameters.

Also, I'd say UB is a dynamic, run-time concept, and as such always refers for a program "after cfg expansion"; in that sense cfg is similar to macros -- by the time we think of a program running on the Abstract Machine, both have already been expanded away. For a pre-expansion program we can only ask "for which choices of cfg flags is this program UB". There we can exploit that some cfg flags correlate with Abstract Machine parameters (see above).
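
For instance, a program can query those parameters and be correct for every instantiation of the Machine (a sketch):

fn machine_parameters() -> String {
    // cfg! reads parameters of the Abstract Machine instance the program
    // is compiled for; each instantiation is fully defined, just different.
    let endian = if cfg!(target_endian = "little") { "little" } else { "big" };
    format!("{}-endian, {}-bit pointers", endian, usize::BITS)
}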

@digama0

digama0 commented Oct 22, 2020

I agree with all of the above. My question is: what if there is an operation which is UB for some configurations and not others (for example, calling a CPU intrinsic)? Does the abstract machine need to know about all these configurations in order to specify them? I was hoping that this could be classed under "implementation-defined behavior" or "platform-dependent behavior", so that the abstract language doesn't need to contain the union of all quirks from the platforms it has ever been compiled for.

@comex

comex commented Oct 22, 2020

There were versions of the C spec before strict aliasing.

Nitpick: No there weren’t. Strict aliasing was already in the first standardized version of C, C89, though most people didn’t know about it until GCC started enforcing it in 2001.

Edit: But it is true that it exists solely to enable compiler optimizations.

@chorman0773
Contributor Author

if you have suggestions for improving that definition, please let us know!

For the more informative UCG, I think it's good as it is. When Rust does get around to writing a proper specification, something more akin to what C and C++ have would be appropriate, i.e. something like:

Behaviour for which this specification imposes no limitations.

As noted above, we do not want to encourage implementations to actually do that.

It's not necessarily encouraging implementations to do that; it's giving an example of what can happen. I also equally mention that the construct can be ignored/evaluated as-is, potentially interfering with other well-defined constructs. C and C++ both include a note that a valid response to UB is to assign it some arbitrary meaning. It equally means: if you really want to use this construct, you shouldn't, but seek out your compiler's documentation first, as it may say you can.

And this is the way I (and I think I am not alone in the UCG and the lang team in thinking so) would prefer other Rust implementations to go as well.

In lccc, we inherit some things that aren't UB from C and/or C++, usually because we don't care enough about the particular optimizations to add further tags saying when certain things are UB and when they are well-defined (conversely, there are some things in C and C++ that are well-defined under lccc because Rust says they are, and I don't want to duplicate rules across the languages). This isn't horribly new; gcc has a bunch of extensions to both C and C++ that exist in one primarily because the other allows the same (gcc lets you type-pun with unions in C++, and clang does as well, primarily because gcc does).
For example:

#[repr(C)]
struct StandardLayout {
    num: i32,
    other_field: f32,
}

fn do_thing(v: &mut StandardLayout) {
    let x = &mut v.num;
    // Cast the pointer to the first field back to the containing struct,
    // then project to a sibling field.
    let _y = unsafe { &mut (*(x as *mut i32 as *mut StandardLayout)).other_field };
}

In SB, that is UB, because you have exceeded the provenance of x. In lccc, however, it's well-defined because of pointer-interconvertibility and reachability rules. Specifically, you can reach v.other_field from v.num because you can reach *v from v.num (as they are pointer-interconvertible). I actually intend to exploit this in my implementation of RefCell, so I can reduce Ref and RefMut to the equivalent of a single pointer (knowing that what is effectively the outer RefCell<T> can be reached from a pointer to the inner T, and the ref-count can be reached from that; the actual RefCell<T> is a repr(Rust) wrapper around the inner value, where I promise the same amount C++ does for things that aren't standard-layout types, which is absolutely nothing). For things like these, the optimization benefit from keeping them undefined is limited, so documenting the extensions makes more sense than hiding them (especially since you could derive them from the IR specification).

You keep saying that, but it is just not true.^^ Rust proves otherwise (modulo bugs).

Fair, I will concede that point. However, standard libraries do sometimes use UB in the language proper, either because they have to or to be efficient/clever; libc++ and libstdc++ are definite examples (I can't remember exactly where, but I remember seeing some instances). As mentioned, the Rust standard library implementation for lccc will make use of pointer-interconvertibility for manual layout optimizations. More to the point, standard libraries are in a privileged position where they can make things not UB because they want to do something that is. Same with compiler-support libraries, which are even less likely to be able to avoid UB; that is why they exist. Neither libunwind nor libgcc_s is particularly well-defined when it comes to unwinding (I can't really think of a way to implement stack unwinding at all absent some undefined behaviour, aside from using pure assembly, certainly not for Itanium). This is why I consider standard and compiler-support libraries some of the only exceptions to my otherwise absolute rules of UB, which include good ideas such as "Do not reason about UB" and "Do not rationalize UB".

FWIW, in Rust this particular example does not carry over -- the reason signed integer overflow UB helped here is that people use int for array index arithmetic on a 64-bit machine.

Indeed, though the exact case was that it was using unsigned, which is what caused the performance regression (replacing it with int fixed it).
usize is kind of an annoying type in Rust. It works well on well-behaved platforms like x86-64 and x86. However, saying that the size type (and index type) is the same size as a pointer is kind of wasteful on older platforms like the 65816 (where the maximum object size is 65535 bytes, but pointers are 24 bits), and it leaves some holes open, since Rust effectively says all distinct object representations of integer types represent distinct, valid values.

Edit: But it is true that it exists solely to enable compiler optimizations.

reinterpret_cast in C++ notes that pointers of different types can be represented differently (in particular, std::bit_cast<T*>(u) and reinterpret_cast<T*>(u) are not required to have the same value, or even the same object representation). If there is a difference in representation between an int* and a float*, how would you suggest implementing an access of type float to an object of type int? The example I use for this is the JVM implementation, where strict aliasing allows me to avoid emulating memory. The ability to enable differing implementations is lesser for strict aliasing than for, say, signed overflow, but it is present.

Also, I'd say UB is a dynamic, run-time concept

It likely is in Rust, though that may depend (especially if Rust introduces implementation-reserved identifiers, which would be nice, since I want to have a synthetic crate in lccc filled with implementation details). In general, it's not. Examples of this include the prohibition in C++ against instantiating standard library templates with incomplete types (yes, the C++ compiler is allowed to format your hard drive when translating the program, which is arguably hilarious). As I say, UB is literal, and it doesn't particularly matter when the UB happens. I also use it in my API as an "escape hatch" from the conformance clause (which states a conforming implementation must issue a diagnostic for ill-formed programs, but I want my implementation details, and C++ isn't that brilliant when it comes to that).

@chorman0773
Contributor Author

Adding to this

I think there is little point to precisely documenting how rustc currently happens to lay out its structs and enums

I agree, this is a poor idea

I don't think that would be a good idea. At that point programmers have to basically treat this as UB

There is a term for this: conditionally-supported behaviour. It requires implementations to document when they do not support the behaviour. There are some cases where it can actually be useful; for example, I would like to look into making volatile access to address 0 conditionally-supported with implementation-defined results, as it can be an asset to embedded devs. Unspecified behaviour certainly should not be permitted to include undefined behaviour. I do agree most programmers should treat conditionally-supported behaviour as something not to touch (whether being unsupported means UB or a compile error), and it doesn't give a huge benefit over just being UB and letting the compiler decide whether it wants to give you a particular behaviour.
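As a sketch of the embedded use case, assuming a hypothetical implementation that documents volatile access to address 0 as conditionally-supported (under the ordinary Rust rules this is simply UB, since 0 is never a valid address):

use core::ptr;

// Hypothetical: read the reset vector that some MCUs place at address 0x0.
// Only meaningful on an implementation that documents this access;
// on rustc today it is undefined behaviour.
unsafe fn read_reset_vector() -> u32 {
    ptr::read_volatile(0x0 as *const u32)
}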

@digama0

digama0 commented Oct 23, 2020

You have given a bunch of horror-story examples of terrible uses of UB in C/C++, and I don't find them particularly compelling for adoption in Rust. UB at translation time is just really obvious compiler-developer-pandering. We already try very hard to be able to find all uses of UB at runtime, so if there were compile-time UB I would expect nothing less than a "dynamic checker" for that too; but dynamic checking at compile time is just compile-time checking, so it ends up as part of the compiler's workings, and so it's not UB after all.

I don't think that the stock C/C++ wording

Behaviour for which this specification imposes no limitations.

is very good either, because it does not at all elucidate the way in which UB is used, as a dynamic concept of a "stuck state" in the abstract machine. In fact, I would be happy with just such a description:

We say that a program has undefined behavior when there is a possible execution trace that ends in a "stuck state", which is where the Rust Abstract Machine has no valid execution step to take (either regular steps or I/O). Compilers are only required to preserve the behavior of programs which do not exercise undefined behavior, and must not introduce undefined behavior, but specific implementations may choose to assign meaning to programs with undefined behavior (with or without attendant documentation).
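For concreteness, a small program with such a stuck trace (my own illustration, not part of the proposed wording):

fn main() {
    let p = {
        let x = 42i32;
        &x as *const i32
    }; // `x` is deallocated here, so `p` dangles
    // The Abstract Machine has no valid step for this load (the memory is
    // gone), so execution gets stuck: the program has undefined behavior.
    let _v = unsafe { *p };
}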

@digama0

digama0 commented Oct 23, 2020

In SB, that is UB, because you have exceeded the provenance of x. In lccc, however, it's well-defined because of pointer-interconvertibility and reachability rules.

To head off Ralf's exasperated comment: This is fine and your prerogative as the designer of lccc, but not the business of the UCG.

usize is kind of an annoying type in Rust. It works well on well-behaved platforms, like x86-64 and x86. However, saying that the size type (and index type) is the same size as a pointer is kind of wasteful on older platforms like the 65816 (the maximum object size is 65535 bytes, but pointers are 24 bits), and leaves some holes open, since Rust effectively says all distinct object representations of integer types represent distinct, valid values.

I think it is a deliberate choice of Rust not to attempt to accommodate older architectures that differ considerably from modern hardware. We've all seen that C suffers greatly from the baggage it carries from that era, and no one wants to keep carrying that forward if the processors are no longer in use.

reinterpret_cast in C++ notes that pointers of different types can be represented differently (in particular, std::bit_cast<T*>(u) and reinterpret_cast<T*>(u) are not required to have the same value, or even the same object representation). If there is a difference in representation between an int* and a float*, how would you suggest implementing an access of type float to an object of type int? The example I use for this is the JVM implementation, where strict aliasing allows me to avoid emulating memory. The ability to enable differing implementations is lesser for strict aliasing than for, say, signed overflow, but it is present.

To emulate Rust in the JVM, you almost certainly have to emulate memory. You might be able to do various kinds of program analysis to hoist values out of memory but that's all subject to the as-if rule, and the R-AM works on flat, untyped memory. (Personally, I think that C++'s casting mechanisms are far too complicated. Rust has a simple and intuitive model of memory, even if it makes it harder to concretely represent the memory in other ways.)

Also, I'd say UB is a dynamic, run-time concept

It likely is in Rust, though that may depend (especially if Rust introduces implementation-reserved identifiers, which would be nice, since I want to have a synthetic crate in lccc filled with implementation details).

Why don't you just reserve a crate on crates.io? The standard library and rustc are all stuffed in the std crate, so you could do something similar.

I also use it in my API as an "escape hatch" from the conformance clause (which states a conforming implementation must issue a diagnostic for ill-formed programs, but I want my implementation details, and C++ isn't that brilliant when it comes to that).

This is more interesting. That conformance clause doesn't currently exist in Rust AFAIK, and it does seem odd to me that we should require that you give a diagnostic for use of lccc extensions of Rust. But this is probably best suited for its own issue.

@RalfJung
Member

RalfJung commented Oct 23, 2020

@comex

Nitpick: No there weren’t. Strict aliasing was already in the first standardized version of C, C89, though most people didn’t know about it until GCC started enforcing it in 2001.

I stand corrected; thanks for pointing that out.

@chorman0773

When Rust does get around to writing a proper specification, something more akin to what C and C++ have

I honestly don't think "Behavior for which this specification imposes no limitations" is very informative, given how often it misleads people, and it is more useful to talk explicitly about the Abstract Machine and that the implementation expects the programmer to uphold its side of the contract. That is, in my opinion, a better framing and phrasing of UB than what C and C++ do.

We can clarify that as a consequence, there are no limitations to the behavior of a program that violates said contract. In fact I think we already say that:

If it turns out the program does have undefined behavior, the contract is void, and the program produced by the compiler is essentially garbage (in particular, it is not bound by any specification; the program does not even have to be well-formed executable code).

But if you think it is helpful to explicitly say "no limitations" and not just "garbage", that is fine for me, too.

But anyway that is a separate bikeshed.^^

I also equally mention that the construct can be ignored/evaluated as-is, potentially interfering with other well-defined constructs. C and C++ both state, in a note, that a valid response to UB is to assign it some arbitrary meaning. It equally means: if you really want to use this construct, you shouldn't, but seek out your compiler's documentation first, as it may say you can.

I would say that this is a case of C/C++ encouraging implementations to assign some arbitrary meaning, which I think we should not do for Rust. But this is getting extremely subjective and we clearly have different positions here, so I doubt we will resolve the dispute by repeating our positions. ;) We'll probably have to agree to disagree, and when it comes to wording the final standard, there'll be more people involved and we can see what they think.

In lccc, we inherit some things that aren't UB from C and/or C++, usually because we don't care enough about the particular optimizations to add further tags saying when certain things are UB and when they are well-defined (conversely, there are some things in C and C++ that are well-defined under lccc because Rust says they are and I don't want to duplicate the rules across languages).

I obviously cannot stop you from doing whatever you want with your own project. I think I stated my point for why providing such guarantees to Rust code on some implementations risks an ecosystem split. On the other hand, having a unified semantics with C/C++ does require some very different trade-offs.

What I do not understand is how you think this should affect the UCG. Doing better than C/C++ is explicitly one of my goals, so I'd be quite opposed to any attempt to unify UB with those languages.

However, standard libraries do sometimes use UB in the language proper, either because they have to, or to be efficient/clever. libc++ and libstdc++ are definite examples (I can't remember exactly where, but I remember seeing some).

Yes, this definitely sometimes happens, it's just something we'd like to avoid in Rust proper. Again I cannot tell you how to build your own compiler, so if you think this is a good strategy, I will respectfully disagree and we can go our separate ways. ;)

Looks like you are set on defining a Rust dialect that makes some extra guarantees. I am not terribly happy about that but respect your decision. Again I am not sure how this should impact UCG work -- as long as we don't want to define any behavior that you need to be UB, you should be good, right?

This first came up around validity of references, but given that you must support C-style pointers that point to garbage (I don't think it is UB in C to have a bool* that points to 0x11), your IR clearly has to be able to support "pointers that point to invalid data". In fact, you must certainly support this for Rust raw pointers. Therefore Rust permitting this possibility for references should not impose any new constraints on your language design. Am I missing something?
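To illustrate the distinction in a couple of lines (creating vs. using an invalid pointer):

fn main() {
    let p = 0x11 as *const bool; // fine: raw pointers may dangle or point to garbage
    // let r: &bool = unsafe { &*p }; // would be UB: a reference must point to a valid bool
    let _ = p;
}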

More to the point, standard libraries are in a privileged position where they can make things not UB because they want to do something that is.

I'd phrase this more carefully... the compiler has to have a uniform notion of UB across all code (otherwise things like inlining are broken). So what standard libraries can do is exploit the knowledge that something is not really UB in the actual language implemented by this compiler, even though the language spec documents it as UB. This is very similar to exploiting knowledge about unspecified implementation details. Code outside the standard library could in principle do the same, but then it would be tied to only work with a particular version of the compiler.

IOW, the privilege of the standard library comes solely from being compatible with exactly one version of the compiler, and being able to rely on undocumented aspects of the compiler because it is maintained by the same people. In contrast, user code has to be compatible with a wide range of compiler versions.

If there is a difference in representation between an int* and a float*, how would you suggest implementing an access of type float to an object of type int?

At least in C, it is legal to do union-based type punning under some circumstances. So the answer is "the same as that".

In C++, the answer is "the same as a reinterpret_cast from int to float".
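For comparison, the by-value reinterpretation that is fully defined in Rust for these types (a minimal sketch):

// Reinterpret the bytes of an i32 as an f32 by value, instead of accessing
// one in-memory object at two different types.
fn pun(x: i32) -> f32 {
    f32::from_bits(x as u32)
    // equivalently: unsafe { core::mem::transmute::<i32, f32>(x) }
}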

It likely is in Rust, though that may depend (especially if Rust introduces implementation-reserved identifiers, which would be nice, since I want to have a synthetic crate in lccc filled with implementation details). In general, it's not. Examples of this include the prohibition in C++ against instantiating standard library templates with incomplete types (yes, the C++ compiler is allowed to format your hard drive when translating the program, which is arguably hilarious). As I say, UB is literal, and it doesn't particularly matter when the UB happens. I also use it in my API as an "escape hatch" from the conformance clause (which states a conforming implementation must issue a diagnostic for ill-formed programs, but I want my implementation details, and C++ isn't that brilliant when it comes to that).

I think that's just C++ being silly.^^ There is also some UB in the C preprocessor if I recall. But that is, on a formal/technical level, very different from the kind of UB that is used for optimizations, so they should really not use the same term.

@digama0

To head off Ralf's exasperated comment:

Sorry for that. I tried to tone it down, but clearly not enough. Maybe I should take a break from this thread; I have stated my case.

@chorman0773
Contributor Author

Box is special in a stable way

I moved and generalized that particular lang item into an unstable (but not quite: the impl on Box is stable to use via operators) DerefMove trait. I believe discussions are already underway to make that part of Rust.

@RalfJung
Member

RalfJung commented Nov 9, 2020

The comment left indicates that when it was written it was at the very least considered undefined

Indeed, and the comment also indicates that this is considered a bug in the standard library, precisely because not even the standard library may cause UB. In rustc, libstd is not privileged wrt UB, and any place that acts differently is a bug. The "privileged knowledge" part here explains why this is not a P-high bug that needs fixing immediately (it argues for why this bug is currently unlikely to negatively impact users), but it does not make this any less of a bug.

This is very different from saying "it is okay for libstd to do something like this". It is not okay, and this particular bug is on track to be fixed by this RFC. Once that RFC is implemented, this FIXME will finally disappear. I have been waiting for that for a long time. :)

@chorman0773
Contributor Author

considered a bug in the standard library

As far as I can tell, it was done intentionally, perhaps to satisfy a requirement that is impossible or grossly inefficient otherwise. Even if it is considered a bug, it may be a necessary one. I have not looked at the RFC, but given the choice is to change the language, not the implementation, I stand by what I said. Standard libraries will frequently do things that require explicit compiler support to be efficient, not necessarily because it would be impossible otherwise (though as mentioned, a reason for something being in the standard library is that it's impossible to implement in the language itself). For example, while not UB, clang and gcc define an intrinsic to implement std::make_integer_sequence<T, N> in fewer than O(N) template instantiations (I think it's O(log N); lccc does it internally in O(1)).

@RalfJung
Member

RalfJung commented Nov 9, 2020

intentionally

Yes, because at the time there was no better way (the original code predates even clarifying the validity invariant). But the fact that there is a "FIXME" indicates quite clearly that this is considered a hack, not a proper solution.

I have not looked at the RFC but given the choice is to change the language, not the implementation

The RFC is in fact a libs-only change.

@comex

comex commented Nov 9, 2020

Standard libraries will frequently do things that require explicit compiler support to be efficient, not necessarily because it would be impossible otherwise (though as mentioned, a reason for something being in the standard library is that it's impossible to implement in the language itself). For example, while not UB, clang and gcc define an intrinsic to implement std::make_integer_sequence<T, N> in fewer than O(N) template instantiations (I think it's O(log N); lccc does it internally in O(1)).

They do, and Rust uses lang items for that purpose. However:

  • As you noted, that is different from UB.

  • While I don't think this is settled policy, some people, including me, believe that Rust should have a goal of making libstd not privileged; anything that requires compiler magic should be in libcore, and for anything that can't be expressed without compiler magic (such as Box), suitable stable language functionality should be added to make it expressible (such as a DerefMove trait). Personally I'd prefer if libcore itself had the essential compiler magic parts split out into a third crate. But that's off topic.

@chorman0773
Contributor Author

Rust uses lang items for that purpose

That and intrinsics, indeed. Though this still raises the question of what the difference is between an unstable lang item/intrinsic/language feature and undefined behaviour explicitly given meaning by an extension. As mentioned, lccc has distinct lang items from rustc (though it shares some when they relate to the same feature; for example, #[lang = "sized"] is still used to define the Sized trait), and it also has different intrinsics (again sharing common ones, though some are renamed). It's all simply privileged code, doing privileged things that you can't do portably (and maybe not at all, depending on whether it's documented).

anything that requires compiler magic should be in libcore

That's fair on the requirement side (with a cursory look through the documentation, it looks like Box may be the only thing outstanding). However, compiler magic should always be available to optimize code, particularly when you, as the standard library writer, are also aware that the compiler itself may not optimize something as well. Sometimes this involves exploiting something that is actually UB, but that the compiler does not treat as such (in the example, that a reference to uninitialized memory isn't allowed, or at least probably wasn't at the time). This isn't even technically limited to standard libraries. For example, in the lccc code itself, written in C++, I conditionally insert a call to a builtin used to optimize slice::from_raw_parts{,_mut} to make those same optimizations.
In fact, this is one of the reasons I think the compiler should be discoverable by Rust code, so that when available, programs can selectively use non-portable constructs as optimizations, and fall back on other similar constructs, or on portable ones.

@RalfJung
Member

Though this still begs the question of what the difference between an unstable lang item/intrinsic/language feature, and undefined behaviour explicitly provided by an extension.

I am not sure why you cannot accept the fact that the rustc devs and the people specifying Rust consider it an antipattern to explicitly define any part of UB via any means.^^ Adding an unstable intrinsic is a tiny extension, whereas saying that integers may be undefined is a global change that fundamentally alters the Abstract Machine of the language that rustc actually implements (and that rustc optimizations have to be checked against).

@chorman0773
Contributor Author

Adding an unstable intrinsic is a tiny extension

For something like reachability of pointers, I would consider that similarly small. In particular, an intrinsic can be created to get a pointer to an enclosing repr(C) structure from a pointer to its first element (in fact, lccc does have such an intrinsic, ::__lccc::builtins::cxx::__builtin_launder). lccc simply considers a pointer cast to do the same, primarily because it's a fundamental side effect of the implementation (the as operator for pointers becomes the IR operation convert reinterpret, which also implements reinterpret_cast from C++, so as a side effect the semantics become the union of the valid operations). The intermediate representation itself is to have a specification, which is used by frontends that implement various languages, and by optimizers and code generators, to ensure that the latter two can be independent of the source language without being incorrect, and that the frontend can be independent of the requested optimizations and code generation.
Saying this definition applies is a rather trivial extension, considering that the implementation is already enjoined from making optimizations that contradict it, as sketched below. Portable code cannot rely on this, but code that knows this applies can exploit it, much the same as code that knows a particular intrinsic exists can exploit it.
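Concretely, something of this shape (illustrative names only; under lccc's extension a plain cast has the same effect, while under the UCG rules this access pattern is UB):

#[repr(C)]
struct Outer {
    first: u32, // offset 0: shares its address with the whole struct
    rest: [u8; 12],
}

// Hypothetical intrinsic-style helper: recover a pointer to the enclosing
// repr(C) struct from a pointer to its first field.
unsafe fn enclosing(first: *mut u32) -> *mut Outer {
    first as *mut Outer
}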

As another example, miri is an implementation that provides definitions for certain kinds of undefined behaviour; it just happens that the definition is that miri traps in those cases. This is a valid definition, though, and it is employed by all forms of dynamic analysis tools.
This is why I prefer the C++ definition ([intro.abstract] clauses 4 and 5), that the implementation is not limited in how it chooses to evaluate a program that has undefined behaviour, rather than saying implementations can assume it cannot happen. The former allows a much broader interpretation, as the latter can be (and is) derived as a consequence of the former, but the former also concedes that implementations can assign meaning, including by trapping on detection.

@bjorn3
Member

bjorn3 commented Nov 10, 2020

For something like reachability of pointers, I would consider that similarly small. In particular, an intrinsic can be created to get a pointer to an enclosing repr(C) structure from a pointer to its first element (in fact, lccc does have such an intrinsic, ::__lccc::builtins::cxx::__builtin_launder).

For such an intrinsic to have the semantics you want, you would need to modify the abstract machine a lot more than you seem to think. For example, adding the try intrinsic is not a simple modification: it requires adding the concept of unwinding to the Rust abstract machine. Another example is the atomic intrinsics; those require adding support for a weak memory model. Sure, many intrinsics are small incremental additions that don't influence the rest of the abstract machine. That would be intrinsics like transmute (which, except for the size check, is implementable using only unions) or checked math (implementable by first checking for overflow before performing the operation, as sketched below), but some intrinsics require fundamental changes to the Rust abstract machine.
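For instance, a checked add written as ordinary code, with no intrinsic and no new abstract machine support (a sketch, not libcore's actual implementation):

fn checked_add_i32(a: i32, b: i32) -> Option<i32> {
    // Do the arithmetic in a wider type; overflow occurred iff the exact
    // result falls outside i32's range.
    let wide = a as i64 + b as i64;
    if wide < i32::MIN as i64 || wide > i32::MAX as i64 {
        None
    } else {
        Some(wide as i32)
    }
}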

The former allows a much broader interpretation, as the latter can be (and is) a derived consequence of the former, but the former also concedes that implementations can assign meaning, including by trapping on detection.

If the implementation can assume that UB doesn't happen, it is free to do anything it wants for things that would be UB, including inserting traps, as those things "wouldn't happen" anyway.

@RalfJung
Member

RalfJung commented Nov 10, 2020

In fact, Miri is an excellent example for how an implementation can provide explicit guarantees for what happens on UB (raise an error) while still conforming with the formal spec which says "the implementation may assume that UB does not occur". Miri is just rather defensive about that assumption and double-checks instead of just trusting the programmer.

The wording for UB given in the UCG glossary totally allows for this possibility. In my view, "the implementation is not limited in how it chooses to evaluate a program that has undefined behaviour" is a consequence of "the implementation may assume UB does not happen", not vice versa. My proposed wording is a logical implication ("if the program does not exhibit UB, then the implementation will realize the program behavior"), and as usual with implications, it imposes no restrictions on what happens when the antecedent is false -- in this concrete case, that means imposing no restrictions on what happens when the program has UB. But I think viewing UB as a proof obligation explains much better why it is so useful, and how programmers can work with it (by checking the proof obligations everywhere). This works particularly well in Rust, where we expect each unsafe operation to state the requirements imposed on the caller -- those are the exact same kind of proof obligation!

@chorman0773
Contributor Author

For such intrinsic to have the semantics you want, you would need to modify the abstract machine a lot more than you seem to think

Fair. However, in the presence of such an intrinsic, making pointer casts semantically have the same effect is not a fundamental change; it's a choice in how to implement the latter. The intrinsic mentioned is not a question of how it affects the abstract machine of Rust: it exists, and it is impossible for it not to exist (because of how name resolution of builtins is defined). The builtin exists to satisfy a requirement of the C++ abstract machine, std::launder, and because it needs to affect optimizations, the IR needs to be aware of it. As a result and side effect, it can be named from Rust code that uses the lccc_intrinsic_crate feature (and even if it couldn't, the same feature could be used to call the ::__lccc::xir! macro, used in libcore to emit explicit IR).

It requires adding the concept of unwinding to the rust abstract machine

Didn't the Rust abstract machine have that model before rustc added that intrinsic? Also, I presume we can ignore in this argument any and all intrinsics whose existence is implied by the standard library itself. The standard library itself is part of the abstract machine, after all. The fact that core::intrinsics::transmute exists has no bearing on the semantics of the abstract machine (beyond being nameable, as an alternative spelling), because core::mem::transmute is what is defined.

it is free to do anything it wants for things that would be UB, including inserting traps

... including to assign a particular, well-defined meaning to it? My point isn't that it can break some optimization performed by the compiler, but that such an optimization, in the case of lccc, would be valid anyway.

In my view "the implementation is not limited in how it chooses to evaluate a program that has undefined behaviour" is a consequence of "the implementation may assume UB does not happen", not vice versa

Similarly fair. It can be viewed either way, and the former has definitely come to imply the latter. The fact that UB never happens is one of the best truths a compiler writer has at their disposal. However, just as math can be built upon by removing restrictions, so too can programming languages. It's still a Rust implementation because it fits within the behaviour prescribed by the Rust abstract machine (provided that behaviour can be worked out).

@bjorn3
Member

bjorn3 commented Nov 10, 2020

It requires adding the concept of unwinding to the rust abstract machine

Didn't the rust abstract machine have that model before rustc added that intrinsic?

The intrinsic has existed ever since rust-lang/rust@c35b2bd. Before that, it directly used a function written in LLVM ir, which you could also consider a kind of intrinsic.

Also, I presume we can ignore in this argument any and all intrinsics whose existence is implied by the standard library itself.

Almost all intrinsics are implied by the standard library itself.

The standard library itself is part of the abstract machine, after all.

No, it is not. The Rust abstract machine defines how MIR is executed. The standard library merely exposes some parts of the Rust abstract machine that can't directly be accessed in a stable way using the user-facing language that lowers to MIR. Saying that the standard library is part of the Rust abstract machine is like saying that all existing unstable code that compiles with the current version of rustc is part of the abstract machine because it can access intrinsics. It is like saying that all C/C++ code is part of the C/C++ abstract machine because it can call intrinsics.

@chorman0773
Contributor Author

This works particularly well in Rust, where we expect each unsafe operation to state the requirements imposed on the caller -- those are the exact same kind of proof obligation!

This is also an expectation in how I document anything. However, if a function of mine accepts a raw pointer that must be valid, I don't say that I may assume it is; I write something like

Preconditions: ptr shall point to an object or be a null pointer

or

Preconditions: ptr shall point to an object, point past the end of an object, or be a null pointer.

(with the last part optional). This is sufficient to express that the result is undefined behaviour, according to the library in question, if ptr does not satisfy any of those conditions.
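In Rust convention, the same obligation is typically spelled as a # Safety section on the unsafe function; a minimal sketch:

/// Reads the value behind `ptr`, or returns 0 for a null pointer.
///
/// # Safety
///
/// Preconditions: `ptr` shall point to a live `i32` or be a null pointer.
unsafe fn read_or_zero(ptr: *const i32) -> i32 {
    if ptr.is_null() { 0 } else { unsafe { *ptr } }
}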

@chorman0773
Contributor Author

chorman0773 commented Nov 10, 2020

It is like saying that all C/C++ code is part of the C/C++ abstract machine because it can call intrinsics.

The C++ standard library is part of the C++ abstract machine. It is defined as part of the document that says "The semantic descriptions in this document define a parameterized nondeterministic abstract machine." ([intro.abstract] clause 1).
The standard library isn't just another user-defined library; it's a fundamental portion of how Rust is specified. core::mem::zeroed is what introduces the ability to initialize values with an all-zero bit pattern, not the core::intrinsics::init function that implements it, because the latter isn't part of the Rust language, it's part of the implementation in rustc. Similarly, the __builtin_launder functions in gcc and clang, and the fully qualified version in lccc (::__lccc::builtins::cxx::__builtin_launder), are not part of the C++ abstract machine; std::launder is, and the former implement the latter.
The definition of ::core::mem::drop is not

fn drop<T>(x: T){}

It is

Disposes of a value. This does so by calling the argument's implementation of Drop.

How drop(x) drops the owned x is entirely the business of the implementation, not of the standard library definition, but the semantics is that x is moved or copied into the function call, then dropped. I can implement drop similarly as

fn drop<T>(mut x: T) {
    // Run the destructor in place, then prevent the automatic drop at the
    // end of scope so it does not run twice.
    unsafe { core::ptr::drop_in_place(&mut x); }
    core::mem::forget(x);
}

This would be a stupid, but perfectly valid implementation for core::mem::drop.

@bjorn3
Member

bjorn3 commented Nov 10, 2020

core::mem::zeroed is what introduces the ability to initialize values with an all-zero bit pattern, not the core::intrinsics::init function that implements it, because the latter isn't part of the rust language, it's part of the implementation in rustc.

core::mem::zeroed is defined to be equivalent to MaybeUninit::zeroed().assume_init() and in fact is implemented this way. MaybeUninit::zeroed() is implemented without intrinsics as:

let mut u = MaybeUninit::<T>::uninit();
// SAFETY: `u.as_mut_ptr()` points to allocated memory.
unsafe {
    u.as_mut_ptr().write_bytes(0u8, 1);
}
u

where MaybeUninit::uninit is implemented as MaybeUninit { uninit: () }, so this is a bad example as it can be implemented as regular code.

In my opinion, the fact that intrinsics are an implementation detail used to implement certain standard library functions doesn't mean that the Rust abstract machine includes the standard library. The Rust abstract machine is solely implemented by the Rust compiler. It includes stable parts defining how stable code works, and it includes unstable parts used to implement the standard library. The standard library simply depends on certain unstable parts of the Rust abstract machine, the same way it depends on certain unstable language features like specialization. These unstable parts can differ from compiler version to compiler version, or even from compiler to compiler.

The specific LLVM version is not a part of the rust abstract machine, not even an unstable part, as the same rustc version can be compiled against a wide variety of LLVM versions. This means that it is not ok for the standard library to depend on UB that just so happens to not cause a miscompilation on a specific LLVM version. Thanks to rustc_codegen_cranelift (cg_clif) (author here) even the existence of LLVM itself is not part of the rust abstract machine. Not even an unstable part.

The only part of the standard library that could be considered part of the rust abstract machine is stdarch (core::arch). This contains a lot of platform intrinsics for simd that directly use llvm intrinsics. Every other bit of the standard library is completely agnostic to the codegen backend. Because of this combined with the fact that all functions in stdarch are marked as #[inline] and thus only codegened when used, cg_clif is able to compile the standard library without changing it in any way despite using Cranelift instead of LLVM.

Besides, not giving the standard library a privileged position makes it easier to understand how it works and makes it safer to just copy snippets from it into your own code.

@chorman0773
Contributor Author

chorman0773 commented Nov 10, 2020

core::mem::zeroed is defined to be equivalent to MaybeUninit::zeroed().assume_init()

Exactly: defined. It isn't necessarily implemented in terms of it (in fact, in lccc it's the other way around: MaybeUninit::zeroed() is implemented in terms of core::mem::zeroed).

The abstract machine is the sum of the behaviour specified by the specification. If the specification includes the standard library, then the standard library is part of that abstract machine. The standard library isn't part of a program; it's something that exists because of the specification, and has its semantics defined by the specification, so its semantics fall definitively under the abstract machine, even if those semantics can be perfectly replicated in user-written code. Absent that part of the specification, it wouldn't be a violation of the as-if clause to not provide the standard library. The implementation of the standard library or compiler has absolutely no bearing on the abstract machine.

Saying that the compiler defines the abstract machine is a great way to reduce the possibility of a competing implementation, something I am very much against. The compiler should have an argumentative position, in saying what can and cannot be done, but should not have the position of defining the behaviour. Intrinsics and lang items would fall under a specification like "The implementation may provide any number of unspecified unstable features, with unspecified semantics when enabled by a crate-level attribute declaring the feature. If the implementation does not support the feature or enabling features, the program is ill-formed."

Besides, not giving the standard library a privileged position makes it easier to understand how it works and makes it safer to just copy snippets from it into your own code.

Inherently, it has privilege, because it can access private implementation details such as intrinsics and lang items. I certainly wouldn't want to just abuse extensions without saying anything; I'd include something that mentions the extension and its non-portability. But people already cannot simply copy just anything from the standard library, because it may be feature-gated. For example, code that uses the pointer-interconvertibility rule would have this

// SAFETY: This is sound because lccc permits casts between pointers to *pointer-interconvertible* objects,
// And we know we have a borrow-locked field of `WideType` *pointer-interconvertible* with *narrow. 
// This is an extension and not portable to other implementations.
// stdlib can do this because it expects only to be running on lccc. 
// Note: See C++ Standard [expr.static.cast], clause 13, as well as xlang ir specification, [expr.convert.strong] and [expr.derive.reachability] for details on the validity of this cast.
let wide = unsafe { &mut *(narrow as *mut WideType) };

@bjorn3
Member

bjorn3 commented Nov 10, 2020

Exactly: defined. It isn't necessarily implemented in terms of it (in fact, in lccc it's the other way around: MaybeUninit::zeroed() is implemented in terms of core::mem::zeroed).

In a regular crate, you may also have one function defined to be equivalent to another. Whether it is or not is simply an implementation detail, not a part of the Rust abstract machine.

The standard library isn't part of a program; it's something that exists because of the specification, and has its semantics defined by the specification, so its semantics fall definitively under the abstract machine, even if those semantics can be perfectly replicated in user-written code.

The standard library is not an intrinsic part of the Rust language. You can very well use Rust without it, albeit only on nightly rustc versions. In fact, I know of at least one past alternative standard library, called lrs-lang. While it uses internal compiler interfaces, cg_llvm, cg_clif, and miri wouldn't have to be changed to make it run. It would only need to be updated for the latest version of all the unstable interfaces it uses.

Saying that the compiler defines the abstract machine is a great way to reduce the possibility of a competing implementation, something I am very much against.

What I am saying is that the abstract machine is kind of split in two parts: a stable part that all Rust code can use, and an unstable part that is used to implement the standard library. Both parts need to have well-defined semantics, but the semantics of the unstable part may change between rustc versions. lccc may decide to have a different unstable part and would thus need to change the standard library. That doesn't mean that the standard library is allowed to circumvent the Rust abstract machine. It can only use well-defined interfaces like intrinsics.

Inherently, it has privilege, because it can access private implementation details such as intrinsics and lang items.

Yes, it has extra privileges in that it can use unstable interfaces. It just shouldn't do things that normal code isn't allowed to do, except through those unstable interfaces. The data structure implementations don't need intrinsics; they can just use stable functions. This means that copy-paste should just work (after removing stability attributes). If those data structures were to use knowledge about the specific codegen backend to, for example, violate the aliasing rules in such a way that the codegen backend doesn't miscompile them, then copy-paste will cause problems for users later down the line. As such, it shouldn't violate those rules, even if it technically can. This is what I mean when I say it is not OK for the standard library to depend on UB. It is completely fine to use implementation-defined intrinsics, but don't ever cause UB.

@chorman0773
Contributor Author

chorman0773 commented Nov 10, 2020

Whether it is or not is simply an implementation detail, not a part of the Rust abstract machine.

If the crate defines as part of its API that the functions are equivalent, then that definition is not an implementation detail. It's not part of the Rust abstract machine either, but it's part of the public API specification for that crate, just as the standard library is, and should be, a part of the Rust specification. Of course, the specification does not bind a particular implementation, and it is up to the particular implementation whether or not to write one in terms of the other, or both in terms of the same thing, or as completely independent implementations. And if one is in terms of the other, it's also up to the particular implementation which way around that is done.

The standard library is not an intrinsic part of the Rust language. You can very well use Rust without it, albeit only on nightly rustc versions

By saying this, it can be deduced that the existence of libcore, liballoc, and libstd, or the content thereof, is optional to provide. In fact, the opposite is true: they are only unavailable upon explicit request. Unless you specifically have #![no_std] or likewise #![no_core] in a crate root, all 3 must be provided and be in the extern prelude; this is part of the specification (core must always be provided, absent the use of #![no_core]). Further, the content and semantics of these crates are defined by the same specification.
As stated, with C++, we can deduce that standard library semantics are part of the abstract machine, because they are not excluded from the definition in [intro.abstract] clause 1 ("The semantic descriptions in this document define a parameterized nondeterministic abstract machine"; it does not single out particular semantic descriptions or qualify the sections, so the semantic descriptions in [library] through [thread], which make up "the standard library", fall under the same). If Rust chooses the same, which implies that the as-if rule extends to that content, then the same reasoning applies.

It is completely fine to use implementation defined intrinsics, but don't ever cause UB.

An implementation can introduce undefined behaviour to a program, provided it still behaves as if it did not. llvm does this when it hoists certain operations that may be undefined out of loops, and gives limited meaning (by producing poison instead of immediate undefined behaviour) to those operations. Doing the same optimization manually in the source would be UB.
Here I'm not even talking about cases where the codegen just happens not to miscompile; I'm talking about honestly well-defined extensions, a promise that the same code will never start to miscompile on this compiler. Copy-paste from the standard library is already non-portable, because the same unstable features may not be available, or may be implemented differently, so copy-pasting code that clearly contains an extension (written as such) will not alter this behaviour.
Part of this is done to satisfy zero-cost abstractions: a manual implementation could not possibly be more efficient than the provided one. If we admit that the compiler can change its implemented abstract machine by introducing unstable features, then the ability to assign meaning to undefined behaviour seems a natural extension of that.
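To illustrate the hoisting point with a sketch (using Rust's unchecked_add as a stand-in for an operation whose overflow is UB; the function itself is hypothetical):

// The unchecked add runs only when n > 0, so the source has no UB for n == 0.
unsafe fn scaled(base: i32, step: i32, n: usize) -> i32 {
    let mut acc = base;
    for _ in 0..n {
        acc = unsafe { base.unchecked_add(step) }; // UB if base + step overflows
    }
    acc
}
// A backend may still compute `base + step` ahead of the loop, even when
// n == 0, because its IR demotes the overflow to a deferred "poison" value
// that is UB only if used. Writing that hoisted form by hand in the source
// would introduce UB the original program did not have.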

@bjorn3
Member

bjorn3 commented Nov 10, 2020

An implementation can introduce undefined behaviour to a program, provided it still behaves as if it did not.

Then it is not UB.

llvm does this when it hoists certain operations that may be undefined out of loops, and gives limited meaning (by producing poison instead of immediate undefined behaviour) to those operations. Doing the same optimization manually in the source would be UB.

It is only UB to use poison in a way that can't propagate the poison further (a math operation, for example, simply propagates it). The mere existence of a poison value is not UB.

If we admit that the compiler can change its implemented abstract machine by introducing unstable features, then the ability to assign meaning to undefined behaviour seems a natural extension of that.

Unstable features require an explicit opt-in and only work on nightly. Using undefined behaviour that has been assigned a meaning by a specific implementation doesn't, however, require an opt-in. If you copy-paste code that uses an unstable feature, it will simply not compile when the unstable feature is not available. If you copy-paste code that uses its knowledge of a meaning assigned to certain UB, then it will still compile on an implementation that assigns that UB no meaning; instead, it may or may not behave in strange ways at runtime. This is much, much worse than not compiling.

@chorman0773
Contributor Author

Then it is not UB.

Yes, yes it is. The compiler doesn't decide what behaviour is undefined; it only gets to decide what to do about it (though it has effectively unlimited choice in that). A crucial point of [intro.abstract], clause 1, is that the implementation does not have to implement or even emulate the abstract machine, only that it must emulate its observable behaviour. This is the as-if rule, not the first sentence, which merely says the abstract machine exists.

It is only UB to use poison in a way that can't propagate the poison further (a math operation, for example, simply propagates it). The mere existence of a poison value is not UB.

If the particular operation caused, for example, signed overflow in C, then the programmer could not write the same optimization by hand, even though llvm performed it, because the resulting transformed program behaved as if it was evaluated strictly, wrt. its observable behaviour.

If you copy-paste code that uses an unstable feature, it will simply not compile when the unstable feature is not available

Features can change meaning without changing names, or even changing syntax. This can happen because of an evolution of the feature, or because two features were written independently (which is why lccc qualifies its feature names not explicitly taken from rustc, to reduce the chance of someone else writing a same-named feature, unless they are implementing the same one from lccc). If you used the lang_items feature, you could certainly run into strange results, especially if the compiler isn't written to reject unknown lang items (or if the lang item name is common, but has distinct meanings).
For example, the rustc definition of core::ptr::drop_in_place, which is

#[lang = "drop_in_place"]
pub unsafe fn drop_in_place<T: ?Sized>(ptr: *mut T) {}

has an entirely different meaning on lccc, because on lccc the lang item simply designates the function itself; it doesn't result in special behaviour.

A possibly reasonable thing would be to warn on these casts, similarly to gcc's -pedantic. So #![allow(lccc::pedantic)] and inversely #![deny(lccc::pedantic)] could be used to control this (in fact, I intend something similar, though the details are far from worked out, and it's one of the least important things on the roadmap). In that case, copying the code would give you (absent the use of either an explicit allow or deny) something like (on the standard CLI, which uses GNU-style warnings): "This code has undefined behaviour in the rust language, and is provided only as an extension. It is unlikely to be portable to other compilers (note: from -Wextend-provenance, which is enabled by default. This may be disabled with #[allow(lccc::extend_provenance)])". (Likewise, with deny/forbid it would instead mention -Werror=extend-provenance.)

@bjorn3
Member

bjorn3 commented Nov 10, 2020

If the particular operation caused, for example, signed overflow in C, then the programmer could not write the same optimization by hand, even though llvm performed it, because the resulting transformed program behaved as if it was evaluated strictly, wrt. its observable behaviour.

LLVM has a special flag to forbid signed overflow. If the optimization would cause signed overflow, then it has to remove this flag. The fact that clang doesn't expose a way to remove this flag doesn't mean that removing it in an optimization is UB.

Features can change meaning without changing names, or even changing syntax.

True. In that case the feature is technically still available, it just has different behaviour. What I am mainly concerned about is stable code that doesn't have access to the unstable features.

A possibly reasonable thing would be to warn on these casts, similarly to gcc's -pedantic.

The cast itself is completely valid. It is just when you dereference it that there is a problem. This dereference could happen arbitrarily far away from the cast. A lint for this without false positives would need to use a theorem prover. If you also want to avoid false negatives, you will have to solve the halting problem.

@chorman0773
Contributor Author

LLVM has a special flag to forbid signed overflow.

add nsw. The result is poison if the addition overflows the signed range. The nsw flag isn't removed; the result is simply demoted from immediate UB to poison. The optimizer does not have to remove the flag, because if the optimization occurs and would introduce new UB, the resulting poison is unused.

This dereference could happen arbitrarily far away from the cast.

The particular example would be if you dereference the pointer resulting from this cast. Also, dereferencing a miscast pointer can already happen very far away. miri could be run to detect that, as it does not implement the same behaviour (however, it would have to be run on the rustc libstd, as miri would not accept a significant portion of the standard library used by lccc, primarily because there is no one-to-one correspondence between MIR and XIR). I didn't say it's necessarily a perfect idea, but I do agree it's better than code just silently devolving. I also rarely copy stuff from standard libraries, because I know they do some things that you probably shouldn't ever attempt in well-behaved code. I would presume the big obvious SAFETY comment that says "This is an extension and not portable" would be sufficient to inform people of this fact, and the warning from copying the code verbatim would reinforce it.
As mentioned, the extension exists even if I don't flat out say it does, because black-box code could perform the xir convert reinterpret op, which performs the exact operation I'm saying `as` does (because there isn't a point to implementing it differently; the optimizations are still enjoined).

@DemiMarie

I haven’t read the discussion, but I personally prefer Ada’s term for UB: “erroneous execution”.

@mohtasham9

I think it should be a valid option for an implementation to say that an instance of implementation defined behavior is in fact undefined behavior. Overflow seems like a good candidate for that. You can perhaps set flags so that overflow traps (with attendant definition of what this entails on the machine state), or wraps, or is UB.

Of course, if you are writing portable code, then as long as any implementation defines it as UB you the programmer can't trust it to be anything more than that, but you could use #[cfg] flags and such to do the right thing on multiple platforms or implementations.
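A sketch of that pattern; the cfg key here is hypothetical, not one any implementation currently defines:

// Hypothetical cfg: an implementation that documents wrapping overflow
// could set --cfg overflow_wraps; portable code must not assume it.
#[cfg(overflow_wraps)]
fn add_index(a: usize, b: usize) -> usize {
    a.wrapping_add(b)
}

#[cfg(not(overflow_wraps))]
fn add_index(a: usize, b: usize) -> usize {
    a.checked_add(b).expect("index overflow")
}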

@chorman0773
Contributor Author

chorman0773 commented May 9, 2021 via email

@MikailBag

I think it should be a valid option for an implementation to say that an instance of implementation defined behavior is in fact undefined behavior.

But what is the difference between "undefined behavior" and "implementation-defined behavior which can in fact be undefined behavior"? In both cases the behavior can be undefined, and undefined behavior can be replaced with any other behavior, so the two terms seem equivalent.

Additionally, it means that all operations with implementation-defined semantics have to be unsafe, including integer arithmetic operators, which is a massive breaking change.

@chorman0773
Contributor Author

I'm closing this ahead of triage.

IMO, the question it poses is answered, and the comments have also gotten a bit messy (sorry about that).
