[RFC] Introduce a mid-level IR (MIR) in the compiler that will drive borrowck, trans #1211

Merged: 1 commit merged on Aug 14, 2015

nikomatsakis
Contributor

nikomatsakis commented Jul 14, 2015

This proposal describes a mid-level IR that I believe we should use in the compiler. This is purely an implementation detail and should not affect the language, though it may make many language extensions and analyses easier to implement; the most notable of these is non-lexical lifetimes.

Rendered

nikomatsakis added the T-compiler label (relevant to the compiler team, which will review and decide on the RFC) Jul 14, 2015
@nikomatsakis
Contributor Author

cc @rust-lang/compiler

@arielb1
Contributor

arielb1 commented Jul 14, 2015

Sounds nice. However, as written this makes even rvalues be non-SSA - we may want to be smarter on that front.

We would want to do at least these optimizations on the MIR, to prevent codegen regressions:
* RVO (of course)
* NRVO (because we essentially do it in our codegen for non-nested return-s).
* some kind of constant-propagation
* some kind of move-elimination, like we do in match today

of it to make quality error messages.
3. This representation should encode drops, panics, and other
scope-dependent items explicitly.
4. This representation does not have to be well-typed Rust, though it

Contributor

well-typed Rust? The representation allows for some unsafe operations (e.g. unrestricted downcasts, unchecked indexing, calling unsafe functions) but should type-check.

@eddyb
Member

eddyb commented Jul 14, 2015

@arielb1 I expect pure constant expressions to have a single value in the MIR, modulo associated constant projections.

| [LVALUE...LVALUE]
| CONSTANT
| LEN(LVALUE) // load length from a slice, see section below
| BOX // malloc for builtin box, see section below

Contributor

shouldn't this also need the adjustments?

@nikomatsakis
Contributor Author

@arielb1

> However, as written this makes even rvalues be non-SSA - we may want to be smarter on that front.

I think what you mean by this is that simple things like foo(4) would introduce a (non-SSA) temporary? This is true. I don't think it's worth having a separate class of "SSA" temporaries -- I'd personally rather just do our optimizations in the older style, with kill sets. This simplifies the IR by not having more than one kind of lvalue. However, I could be persuaded otherwise. (The truth is, this is kind of a minor detail in the end. I expect us to evolve the MIR over time, and if we find that distinguishing spilled, mutable temporaries from other rvalues is worthwhile, that's fine.)
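To make the point about temporaries concrete, here is a minimal sketch of a lowering in which call arguments must be lvalues, so even the constant in foo(4) is spilled into a fresh temporary. The types `Expr`, `Stmt`, and `Lowerer` are hypothetical and much simpler than anything the RFC or rustc defines.

```rust
/// Source-level expression (hypothetical, heavily simplified).
enum Expr {
    Const(i64),
    Var(String),
    Call(String, Vec<Expr>),
}

/// One lowered statement of the form `dest = rvalue`.
#[derive(Debug, PartialEq)]
enum Stmt {
    AssignConst { dest: String, value: i64 },
    AssignCall { dest: String, func: String, args: Vec<String> },
}

/// Accumulates lowered statements and hands out temporary names.
struct Lowerer {
    next_tmp: usize,
    stmts: Vec<Stmt>,
}

impl Lowerer {
    fn new() -> Self {
        Lowerer { next_tmp: 0, stmts: Vec::new() }
    }

    fn fresh(&mut self) -> String {
        let name = format!("tmp{}", self.next_tmp);
        self.next_tmp += 1;
        name
    }

    /// Lower `expr` and return the name of the lvalue holding its result.
    fn lower(&mut self, expr: &Expr) -> String {
        match expr {
            // A variable is already an lvalue.
            Expr::Var(name) => name.clone(),
            // A constant is spilled into a (non-SSA) temporary.
            Expr::Const(value) => {
                let dest = self.fresh();
                self.stmts.push(Stmt::AssignConst { dest: dest.clone(), value: *value });
                dest
            }
            // Arguments are lowered first, so the call only mentions lvalues.
            Expr::Call(func, args) => {
                let args: Vec<String> = args.iter().map(|a| self.lower(a)).collect();
                let dest = self.fresh();
                self.stmts.push(Stmt::AssignCall { dest: dest.clone(), func: func.clone(), args });
                dest
            }
        }
    }
}
```

Lowering `foo(4)` with this sketch yields `tmp0 = 4` followed by `tmp1 = foo(tmp0)`; a separate class of SSA temporaries would only change how `tmp0` is represented.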

One thing the current MIR does not make as explicit as it
could is when something is *moved*. For by-value uses of a value, the
code must still consult the type of the value to decide if that is a
move or not. This could be made more explicit in the IR.

Contributor

In a world with drop calls explicitly encoded into the MIR, whether something is moved as opposed to copied doesn't matter at all; either the value will be explicitly dropped, or it won't. This is true with either the current embedded drop flags or explicit stack-based drop flags. Or am I missing something?

Member

Eliding memcpy calls for moves, perhaps.

Contributor Author

@eefriedman

> In a world with drop calls explicitly encoded into the MIR, whether something is moved as opposed to copied doesn't matter at all; either the value will be explicitly dropped, or it won't. This is true with either the current embedded drop flags or explicit stack-based drop flags. Or am I missing something?

For one thing, the MIR as I've described it thus far is allowed to DROP things that may have been moved. I'm assuming a later pass that determines precisely what needs to be dropped and inserts code to prevent double drops; this will be a type-based, control-flow-sensitive analysis, and hence it makes sense to do it after the MIR is built.
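A rough sketch of the decision such a later pass would make for each DROP, assuming a three-state initialization lattice per variable. `Init`, `join`, `DropAction`, and `elaborate_drop` are hypothetical names for illustration, not part of the RFC.

```rust
/// Initialization state of one variable at a program point.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Init {
    Yes,   // definitely still holds a value
    No,    // definitely moved out
    Maybe, // differs between incoming control-flow paths
}

/// Merge the states flowing in from two predecessor blocks.
fn join(a: Init, b: Init) -> Init {
    if a == b { a } else { Init::Maybe }
}

/// What the pass could rewrite a `DROP(x)` into.
#[derive(Debug, PartialEq)]
enum DropAction {
    Unconditional, // keep a plain drop
    Remove,        // nothing left to drop
    CheckFlag,     // consult a runtime drop flag
}

fn elaborate_drop(state: Init) -> DropAction {
    match state {
        Init::Yes => DropAction::Unconditional,
        Init::No => DropAction::Remove,
        Init::Maybe => DropAction::CheckFlag,
    }
}
```

If one arm of an `if` moves `x` and the other does not, the join after the `if` is `Maybe`, and only then is a dynamic check needed.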

@eddyb Right. If we explicitly encode moves, a pass could look for copies where the source is not used anymore after the copy and turn that into a move. LLVM doesn't do that for calls, most likely because it sees the address as significant when you pass a pointer to a function.
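A minimal sketch of that pass, restricted to straight-line code and using a hypothetical `Stmt` type: a copy becomes a move when no later statement reads its source. It is conservative, since any later read of the same name blocks the rewrite.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Stmt {
    Copy { dest: &'static str, src: &'static str },
    Move { dest: &'static str, src: &'static str },
    Use(&'static str),
}

/// Does `stmt` read the variable `name`?
fn reads(stmt: &Stmt, name: &str) -> bool {
    match stmt {
        Stmt::Copy { src, .. } | Stmt::Move { src, .. } => *src == name,
        Stmt::Use(used) => *used == name,
    }
}

/// Turn each copy whose source is never read afterwards into a move.
fn copies_to_moves(stmts: &[Stmt]) -> Vec<Stmt> {
    stmts
        .iter()
        .enumerate()
        .map(|(i, stmt)| match stmt {
            Stmt::Copy { dest, src } if !stmts[i + 1..].iter().any(|s| reads(s, src)) => {
                Stmt::Move { dest: *dest, src: *src }
            }
            other => other.clone(),
        })
        .collect()
}
```

A real version would have to work on the control-flow graph and account for borrows of the source, which this sketch ignores.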

@Aatch
Contributor

Aatch commented Jul 15, 2015

Looks good to me. SSA probably isn't worth it at this level: it's brilliant for lower-level optimisations, but it's somewhat more complex to build, and whatever we want to do can probably be handled with dataflow analysis and similar. As this is an internal thing, I'm not too bothered as long as we get something in this direction. The details can be changed later.

I think that unsafe blocks are inappropriate for an MIR. I also think the property that unsafe doesn't actually change the way Rust works (it merely allows some otherwise-disallowed operations) is something worth maintaining.

to figure out on its own how to do unwinding at that point. Because
the MIR doesn't "desugar" fat pointers, we include a special rvalue
`LEN` that extracts the length from an array value whose type matches
`[T]` or `[T;n]` (in the latter case, it yields a constant). Using

Contributor

Allowing LEN on fixed-size arrays seems like it just pointlessly complicates the MIR at the expense of possibly making it slightly easier to construct.

Contributor

I don't think it complicates the MIR at all. If anything it's simpler, as it doesn't require an extra rule for fixed-size arrays. Also, we still need to bounds-check fixed-size arrays, so disallowing LEN on them would mean a separate path for no obvious reason.
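The "one path for both" argument can be sketched roughly as follows. This is a hypothetical model: `ArrayTy`, `Len`, `eval_len`, and `bounds_check` are made-up names, and the check is rendered as text only to keep the example small.

```rust
/// The two array-like types the RFC text mentions (element type omitted).
#[derive(Debug, PartialEq)]
enum ArrayTy {
    Slice,        // [T]
    Fixed(usize), // [T; n]
}

/// Result of evaluating `LEN(lvalue)`.
#[derive(Debug, PartialEq)]
enum Len {
    Const(usize), // known at compile time
    Runtime,      // loaded from the fat pointer
}

fn eval_len(ty: &ArrayTy) -> Len {
    match ty {
        ArrayTy::Fixed(n) => Len::Const(*n),
        ArrayTy::Slice => Len::Runtime,
    }
}

/// Render the bounds check for `a[index]`; the same code path serves both types.
fn bounds_check(ty: &ArrayTy, index: &str) -> String {
    match eval_len(ty) {
        Len::Const(n) => format!("assert {} < {}", index, n),
        Len::Runtime => format!("assert {} < LEN(a)", index),
    }
}
```

The builder always emits a LEN-based check; whether the length folds to a constant is decided in one place.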

@eefriedman
Contributor

Have you thought about how serialization for MIR will work?

its contents (it is not yet initialized).

Note that having this kind of builtin box code is a legacy thing. The
more generalized protocol that [RFC 809][809] specifies works in

I think you meant to link to RFC 809 here.

@michaelwoerister
Member

I'm all for giving this a try. The direction of the proposed design makes sense to me. Many of the details will become more clear when working on the implementation. Only the HIR trait thing from the prototype sounds a bit too clever for my taste but since it's not even part of the RFC really ...

Regarding debuginfo, the things that come to mind in the context of the MIR are source locations, scope information, and memory locations of local variables/arguments.

Source Locations
For every LLVM IR statement, we want to know which piece of source code it originated from. So far, trans has read this information from the AST. I imagine that there will be some way to find out about the span of a given MIR statement. One thing that warrants special consideration in this respect is the spans of compiler-generated instructions, especially drop calls. We have to assign some span to them (LLVM crashes otherwise) and currently we are using a heuristic that tries to find the closing brace of the enclosing block. This is something that would best be taken care of during the lowering step to MIR.

Scope Information
LLVM and debuginfo not only want to know about the source location of every machine instruction, they also need to know about the scope that instruction is part of (so the debugger knows which variables are visible when stopped at a given position in the program). This is what we do so far: When starting to translate a function, we build a "scope map" by walking the AST of the function. This map maps every NodeId in the function to the corresponding debuginfo descriptor for the scope the node is contained in. The scope descriptor tree is built up as the AST is traversed, also taking care of implicit scopes introduced by let-statements.
Again, if it is possible to map from an MIR statement back to the node that introduced it, there's no need to do things differently. But the scope tree could also be built before lowering, with each MIR statement then linked to the scope tree node it belongs to.

Local Variables and Arguments
For these we need to know the alloca that stores them. The current, non-SSA setup indeed seems to be a good match for this. Let LLVM worry about this stuff :)
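As an illustration of the second option under Scope Information, here is a small sketch in which each statement carries the index of its scope-tree node, and the variables visible at a statement are found by walking parent links. `Scope`, `Stmt`, and `visible_vars` are hypothetical names, not existing compiler types.

```rust
/// A node in the scope tree; `parent` is an index into the same slice.
struct Scope {
    parent: Option<usize>,
    vars: Vec<&'static str>,
}

/// A lowered statement that remembers which scope it belongs to.
struct Stmt {
    text: &'static str,
    scope: usize,
}

/// Variables a debugger could show when stopped at `stmt`, innermost scope first.
fn visible_vars(scopes: &[Scope], stmt: &Stmt) -> Vec<&'static str> {
    let mut vars = Vec::new();
    let mut current = Some(stmt.scope);
    while let Some(id) = current {
        vars.extend(scopes[id].vars.iter().copied());
        current = scopes[id].parent;
    }
    vars
}
```

With a function scope holding `arg` and a nested `let` scope holding `x`, a statement in the inner scope sees `x` and `arg`, while one in the outer scope sees only `arg`.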

@nikomatsakis
Contributor Author

@Aatch

> I think that unsafe blocks are inappropriate for an MIR. I also think the property that unsafe doesn't actually change the way Rust works (it merely allows some otherwise-disallowed operations) is something worth maintaining.

The main question was whether it'd be worth CHECKING that property (that is, checking what operations are disallowed) on MIR. The reason to consider doing that is that it would be easier, since all derefs and calls are made very explicit.

@nikomatsakis
Contributor Author

@eefriedman

> Have you thought about how serialization for MIR will work?

Not deeply. I don't foresee any particular difficulties. It should be much easier than serializing the AST, since there are no side-tables to be concerned with, and all internal links are, well, internal. That said, I'd like to define a canonical textual format for testing purposes (in an ideal world, we'd be able to supply MIR inputs directly to the compiler so we can skip early stages of the pipeline when testing).

@nikomatsakis
Contributor Author

@arielb1

> We would want to do at least these optimizations on the MIR, to prevent codegen regressions:
>
>   • RVO (of course)

The existence of the "ReturnValue" lvalue allows us to do RVO, modulo the next bullet.

>   • NRVO (because we essentially do it in our codegen for non-nested return-s).

So, the main problem here that I see is aggregates. That is, if you have

v = Struct { x: ..., y: ... }

it gets converted into:

tmpx = ...;
tmpy = ...;
v = Struct { x: tmpx, y: tmpy }

which is obviously not what trans would produce. However, there are a lot of advantages to starting out with this form. But after safety checks are done, as I describe in the RFC, it is pretty easy to convert this to:

v.x = ...;
v.y = ...;

I'm assuming this would run after safety analyses but also after drops are rewritten to be more minimal, since I think there are some cases where you might wind up with double-frees if you're not careful.
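The first half of that rewrite can be sketched as a pass that splits an aggregate assignment into per-field assignments. The `Stmt` type and `expand_aggregates` are hypothetical; the sketch keeps the temporaries, so forwarding each temporary's initializer directly into `v.x` would be a further step.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Stmt {
    /// `dest = src`
    Assign { dest: String, src: String },
    /// `dest = Struct { field: src, ... }`, stored as (field, src) pairs
    Aggregate { dest: String, fields: Vec<(String, String)> },
}

/// Replace every aggregate assignment with one assignment per field.
fn expand_aggregates(stmts: Vec<Stmt>) -> Vec<Stmt> {
    let mut out = Vec::new();
    for stmt in stmts {
        match stmt {
            Stmt::Aggregate { dest, fields } => {
                for (field, src) in fields {
                    out.push(Stmt::Assign { dest: format!("{}.{}", dest, field), src });
                }
            }
            other => out.push(other),
        }
    }
    out
}
```

Applied to `v = Struct { x: tmpx, y: tmpy }`, this produces `v.x = tmpx; v.y = tmpy`.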

>   • some kind of constant-propagation

This is why I separated out constants into their own thing. We can simplify constants and also rewrite MIR expressions as we choose.

>   • some kind of move-elimination, like we do in match today

This can conceivably be expressed by rewriting to reference the original lvalue.

(Overall, I'm not sure how much optimization it makes sense to do on the MIR vs leaving it to LLVM -- we'll have to work out that trade-off. Certainly though we've found that doing optimizations in trans can be quite helpful for execution and compilation time so it's easy to see that the same will be true of the MIR. And I'm trying to think beyond LLVM as well, in which case doing more in the MIR would be helpful for portability -- especially Rust-specific things that would require custom LLVM passes or code anyway.)

@arielb1
Contributor

arielb1 commented Jul 17, 2015

@nikomatsakis

That would convert into something like

v.x = ...;
v.y = ...;
v = Struct { x: v.x, y: v.y };

Your last example wasn't valid IR (v wasn't initialized). This can be handled the right way in translation anyway.

@nikomatsakis
Contributor Author

@arielb1

> Your last example wasn't valid IR (v wasn't initialized). This can be handled the right way in translation anyway.

I was assuming that past a certain point we would enforce looser restrictions on what's valid or invalid.

@bkoropoff

This looks very clean and a lot easier to work with. I'm definitely in favor. I had a few thoughts about the kind of desugaring we might want to do at this level and how it would interact with region and borrow checking:

Closures

Closures seem like a natural candidate for desugaring, since they are nearly equivalent to an anonymous struct with a trait impl. One subtlety is that assignment to a non-mut by-value upvar ought to be rejected, even though this would be translated into an assignment through mut self or &mut self, which would be accepted. We'd need to track this one way or another.
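A hand-written illustration of the equivalence, assuming a stand-in trait `CallMut` (hypothetical, since the real `Fn*` traits cannot be implemented by hand on stable Rust):

```rust
/// Stand-in for `FnMut`; hypothetical, for illustration only.
trait CallMut<Args> {
    type Output;
    fn call_mut(&mut self, args: Args) -> Self::Output;
}

/// The "anonymous struct": one field per by-value upvar.
struct Bump {
    count: u32,
}

impl CallMut<(u32,)> for Bump {
    type Output = u32;
    fn call_mut(&mut self, (step,): (u32,)) -> u32 {
        // Always accepted here: the assignment goes through `&mut self`.
        self.count += step;
        self.count
    }
}

fn with_closure() -> u32 {
    // Dropping `mut` from `count` makes the closure body an error,
    // a rule the struct version above has no way to express.
    let mut count = 0;
    let mut bump = move |step: u32| {
        count += step;
        count
    };
    bump(2);
    bump(3)
}

fn with_struct() -> u32 {
    let mut bump = Bump { count: 0 };
    bump.call_mut((2,));
    bump.call_mut((3,))
}
```

Both functions compute the same value; the difference is only in which assignments the checker would allow, which is the information that would need to be tracked after desugaring.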

CPS and friends

If we ever want Rust to support generators, async/await, coroutines, etc., this seems like the right place to do it. I've played around with writing a CPS transformation with pure macro rules and found several constructs that would be sound when doing region/borrow analysis in direct style but are not expressible in safe Rust after translation. Doing it at the MIR level after performing region/borrow checking would solve the issue nicely. On the other hand, the transformation also introduces trait bounds (e.g. Send for async/await) and moves that are not present in the source. And, of course, any non-trivial transformation complicates good error reporting. What kind of IR to IR transformations can we reasonably accommodate here?
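For a sense of what such a transformation would have to produce, here is a hand-written state machine for a hypothetical generator that yields 0, 1, ..., limit - 1. The enum and its states are made up for illustration; nothing here reflects an actual design.

```rust
/// Each variant is one suspension point plus the locals live across it.
enum CountTo {
    Start { limit: u32 },
    Running { next: u32, limit: u32 },
    Done,
}

impl Iterator for CountTo {
    type Item = u32;

    fn next(&mut self) -> Option<u32> {
        match *self {
            // First resume: initialize the loop counter, then continue.
            CountTo::Start { limit } => {
                *self = CountTo::Running { next: 0, limit };
                self.next()
            }
            // Loop body: yield the current value and save the next one.
            CountTo::Running { next, limit } if next < limit => {
                *self = CountTo::Running { next: next + 1, limit };
                Some(next)
            }
            // Loop exit.
            CountTo::Running { .. } => {
                *self = CountTo::Done;
                None
            }
            CountTo::Done => None,
        }
    }
}
```

Borrows that are held across a yield point would have to live inside the state enum, which is the kind of construct that is hard to express in safe source-level Rust.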

Lints

Do we allow pluggable lints at this level? It seems like some of the ones used by Servo (e.g. checking that GC roots are used properly) would need to operate on the MIR. Maybe I'm wrong and the HIR is enough.

@nikomatsakis
Contributor Author

@bkoropoff

Regarding upvars, the MIR actually has a richer type system than the source language, and it includes &uniq pointers, which cover the case of non-mutable upvars.

Regarding CPS transform, I agree this is the place to do it, and we'll have to do some work to produce good error messages. I think we'll gain some more experience in that regard with mapping closures etc (we've made some progress, but we definitely produce some suboptimal error messages in borrowck today, such as those that talk about "borrowing" when the borrow is implicit in the syntax today).

Regarding lints, I think it might make sense for some of them to operate on the MIR, but that's a long way off. @brson has also expressed interest in being able to write front-ends that generate MIR directly. So it seems plausible to me that we might sometime want to standardize a lowered Rust representation that can be consumed externally.

@nikomatsakis
Contributor Author

Hear ye, hear ye. This RFC is entering final comment period.

@bkoropoff

@nikomatsakis

&uniq helps in some cases, but move captures are still a problem since an upvar may not be behind a reference that can be marked uniq independently of the others. Obviously it's nothing a little hidden metadata plumbing can't fix. I also vaguely recall some special-case handling of Fn traits in trait selection that allows picking the auto-generated impls as candidates in spite of ununified type variables that would otherwise cause problems. There are probably other edge cases where closures don't quite behave exactly like a struct + impl that we'll need to be wary of.

@nikomatsakis
Contributor Author

@bkoropoff The special unification logic in trait selection is independent of the MIR (which doesn't really touch on trait selection), but you're right, I was forgetting about the rules to prevent assignments to moved upvars. What a pain. I should have pushed harder for mutpocalypse. :) In any case, to actually model that properly does require just a bit more extension of the type system: basically marking fields that cannot be directly assigned, even when reached uniquely (I've thought about proposing something similar from time to time -- obviously now it'd have to be more of a lint). As you say, not a big deal, but you're right that it has to be handled.

@bkoropoff

@nikomatsakis Would trait selection still occur before desugaring closures to a struct + impl? I guess that's fine then, I was just hoping we could eliminate as much special case handling as possible.

@arielb1
Contributor

arielb1 commented Jul 25, 2015

@bkoropoff

Most of the complications closures bring are type-system issues, and most of these (e.g. consider_unification_despite_ambiguity) occur only during typeck (i.e. before the MIR).

Assignments to non-mut locals are already special-cased, and that's not something the MIR can really help with. The TyTuple/TyStruct/TyEnum/TyClosure distinction will remain in the MIR - we should try to handle these as uniformly as possible though.

@nikomatsakis
Contributor Author

On Fri, Jul 24, 2015 at 06:47:58PM -0700, Brian Koropoff wrote:

> @nikomatsakis Would trait selection still occur before desugaring closures to a struct + impl? I guess that's fine then, I was just hoping we could eliminate as much special case handling as possible.

Yes, trait selection still occurs before desugaring. Trait selection
is pretty orthogonal to the MIR really, but yes it will still require
some amount of special case handling. That said, I'm getting very
excited lately about the idea of an internal "type IR" that should
play a similar role of formalizing and simplifying the type system.
More on that soon.

@arielb1
Contributor

arielb1 commented Jul 29, 2015

Don't we already have a type IR?

@nikomatsakis
Contributor Author

@arielb1 I'll try to write up what I'm talking about :) It's pretty orthogonal to this proposal.

@qwertie

qwertie commented Jul 31, 2015

I'd suggest using some kind of standard format as a text representation - either a subset of Rust itself, or LES. That way nobody has to go to the trouble of designing a new syntax.

nikomatsakis added the final-comment-period label (will be merged/postponed/closed in ~10 calendar days unless new substantial objections are raised) Jul 31, 2015
@RalfJung
Member

RalfJung commented Aug 5, 2015

I like this a lot! From a formal verification standpoint, this language is much better suited than the original AST. Fewer constructs, and more things explicit, it's almost like what I dreamed of ;-)

Now, from a purely practical perspective, there's one thing I do not understand: What is the relationship to the recently accepted HIR? I'm surprised that the only relationship mentioned is that the HIR trait here is not related. Skimming over the HIR RFC, the goals also seem to be fairly similar: Lowering of high-level sugar to fewer primitives, to ease processing. My impression is that the final pipeline will be "AST -> HIR -> MIR -> LLVM", with some desugaring happening on the first arrow, and other things waiting for the second arrow. Will there be anything that works on the HIR directly? Or will it be the case that the HIR is only constructed to be immediately lowered to MIR?

@eddyb
Member

eddyb commented Aug 5, 2015

@RalfJung It's possible that everything outside of function bodies may be kept around in HIR form.
A strategy for constants that can be used by both the MIR and [T; N], allowing proper handling of associated constants in polymorphic contexts, is yet to be chosen, but one of the options involves holding a HIR expression tree (or an ID to one) and some type bindings for it.

@arielb1
Contributor

arielb1 commented Aug 5, 2015

@RalfJung @eddyb

HIR is compiled to tables/metadata and MIR. MIR is mostly supposed to replace ast::Block. We will also need some form of ConstExpr and are still deliberating on the best way to implement it.

Type checking works on HIR + tables/metadata.

@RalfJung
Member

RalfJung commented Aug 5, 2015

So what's the reason not to compile the AST directly to MIR+tables? Is there anything interesting happening on that intermediate stage?

(I'm not trying to suggest that HIR has no place in this world; I'm just trying to figure out the reason behind your design decisions here.)

@arielb1
Contributor

arielb1 commented Aug 5, 2015

@RalfJung

The HIR is supposed to abstract over macros and name resolution. The new process should be:

  • parse: text -> AST
  • expansion: AST -> expanded AST + hygiene-info
  • resolution: expanded AST + hygiene-info -> def-map
  • HIR creation: expanded AST + def-map -> HIR
  • type checking: HIR -> tcx tables + MIR
  • late analysis: tcx tables + MIR -> more tcx tables
  • translation: tcx tables + MIR -> LLVM IR

Type checking is a big enough step to deserve its own IR.

@yazaddaruvala

@arielb1

Thanks, that's a pretty simple but thorough list for someone who's curious but for whom rustc development is completely opaque.

Similarly I was hoping you could expand on it a bit. I've heard in the past that one way to improve code-gen speed is for rustc to optimize the amount of LLVM IR it creates. I'm not at all suggesting it happen in this implementation but this seems like a great refactor to help with that, so I'm sure you guys are keeping it in mind.

I'm just kind of curious where these IR reductions could or would take place in your list above, or whether it will be more piecemeal and happen in small increments at every level as appropriate.

@arielb1
Contributor

arielb1 commented Aug 12, 2015

@yazaddaruvala

  • parse: text -> AST
    Implemented in syntax::parse, the generated AST is in syntax::ast. A standard context-sensitive linear scan tokenizer and LL(k) parser (IIRC k<5).
  • expansion: AST -> expanded AST + hygiene-info
    Macro expansion (the rest of syntax). I don't actually understand this very well (I think @nrc understands it best). At the end of this phase, the fully-formed program AST is generated.
  • resolution: expanded-AST + hygiene-info -> def-map
    This creates a map from paths in code (e.g. mem::transmute, ::std::fmt::Display, Vec, local variables, even Trait::method) to their definition. Trait items, fields, and methods (foo.bar, <T as Trait>::method, T::Item) are handled during type-checking instead.
  • HIR creation
    Not implemented yet. This should create an HIR that abstracts over syntactic distinctions (e.g. constants vs. local variables).
  • type checking
    rustc_typeck. The most complicated phase. This determines the type of every expression in a program and ensures that traits can always be satisfied. It also resolves trait item/field accesses and method calls. We are planning on emitting a MIR after this phase is over, that contains a concrete CFG.
  • late analysis
    These are various analyses run over the code to ensure soundness and gather information required for translation. For example, borrow checking (rustc_borrowck) ensures that non-Copy values are indeed not copied and &mut references not aliased, while match checking (rustc::middle::check_match) ensures there are no missing corner cases in match expressions. Lints are also run here - this is why you don't get them if your program contains a type error. Because these are run on a program known to be essentially intact, these can do rather deep analysis relatively simply.
  • translation
    rustc_trans. This pass creates LLVM IR representing the program. It also monomorphizes (expands) generics into concrete instances. Also, it does do some basic optimizations (e.g. RVO). Because of these and the combined complexities of the AST and LLVM (and Rust control flow), this pass is rather more complicated than it should be. The main purpose of the MIR is to simplify it and allow the optimizations to be more general.

@nikomatsakis
Contributor Author

It's official. The compiler subteam has decided to accept this RFC. (As of this writing, there are a few missing votes, but @Aatch has expressed support in thread, and @pnkfelix has expressed support in person.)

nikomatsakis merged commit bd7f40c into rust-lang:master Aug 14, 2015
nikomatsakis added a commit that referenced this pull request Aug 14, 2015
Centril added the A-IR (proposals relating to intermediate representations) and A-borrowck (borrow checker related proposals & ideas) labels Nov 23, 2018