
Replace TokenMap with an abstraction that matches reality #9403

Closed
Tracked by #10370
matklad opened this issue Jun 25, 2021 · 8 comments
Assignees
Labels
C-Architecture Big architectural things which we need to figure up-front (or suggestions for rewrites :0) ) E-hard fun A technically challenging issue with high impact

Comments

@matklad
Member

matklad commented Jun 25, 2021

AKA, @matklad has been misunderstanding how macro expansion works this whole time.

Background: originally, I thought about the macro expansion process as transforming a stream of tokens into a different stream of tokens:

macro_rules! id {
    ($($id:tt)*) => { $($id)* };
}

fn main() {
  let foo = 92;
  id!(foo)
}

Here, I thought that the token foo gets translated from the macro call site to the macro expansion site.

This motivated the TokenMap and related abstractions. The idea is that we assign ids to tokens (=tokens have identity), and track those ids through macro expansion. Yesterday, having looked at https://doc.rust-lang.org/stable/proc_macro/struct.Span.html, I concluded that this is not, in fact, how the world works.
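The token-identity idea can be sketched like so (all types here are hypothetical simplifications for illustration, not the actual rust-analyzer API):

```rust
use std::collections::HashMap;

// Hypothetical simplified types; the real rust-analyzer TokenMap is richer.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct TokenId(u32);

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct TextRange {
    start: u32,
    end: u32,
}

/// Maps token ids (token *identity*) back to the source ranges they came from.
#[derive(Default)]
struct TokenMap {
    ranges: HashMap<TokenId, TextRange>,
}

impl TokenMap {
    fn insert(&mut self, id: TokenId, range: TextRange) {
        self.ranges.insert(id, range);
    }
    /// Follow a token from the expansion back to the call site.
    fn range_of(&self, id: TokenId) -> Option<TextRange> {
        self.ranges.get(&id).copied()
    }
}

fn main() {
    let mut map = TokenMap::default();
    // `foo` in `id!(foo)` occupies some range in the original file.
    map.insert(TokenId(0), TextRange { start: 40, end: 43 });
    // As long as the expansion preserves token identity, the lookup works...
    assert_eq!(map.range_of(TokenId(0)), Some(TextRange { start: 40, end: 43 }));
    // ...but a macro that re-creates its tokens loses the ids entirely.
    assert_eq!(map.range_of(TokenId(1)), None);
}
```

The proc macros below show why this identity-based scheme is fragile.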

Consider these two procedural macros:

use proc_macro::{Group, Ident, Punct, TokenStream, TokenTree};

#[proc_macro]
pub fn id(args: TokenStream) -> TokenStream {
    args
}

#[proc_macro]
pub fn id2(args: TokenStream) -> TokenStream {
    clone_stream(args)
}

fn clone_stream(ts: TokenStream) -> TokenStream {
    ts.into_iter().map(clone_tree).collect()
}

fn clone_tree(t: TokenTree) -> TokenTree {
    match t {
        TokenTree::Group(orig) => {
            let mut new = Group::new(orig.delimiter(), clone_stream(orig.stream()));
            new.set_span(orig.span());
            TokenTree::Group(new)
        }
        TokenTree::Ident(orig) => TokenTree::Ident(Ident::new(&orig.to_string(), orig.span())),
        TokenTree::Punct(orig) => {
            let mut new = Punct::new(orig.as_char(), orig.spacing());
            new.set_span(orig.span());
            TokenTree::Punct(new)
        }
        TokenTree::Literal(orig) =>  { ... },
    }
}

I believe their semantics are the same -- from rustc's point of view, they produce equivalent outputs. The implementation of id2 completely erases token identity, though.

So, bad news: we need to rewrite the TokenMap-based stuff to use something else (and I don't know what that something else would be). Good news: I think this should make weird cases like include! work in a more out-of-the-box way.

cc @jonas-schievink , @edwin0cheng

@matklad matklad added E-hard fun A technically challenging issue with high impact labels Jun 25, 2021
@jonas-schievink
Contributor

Hmm, isn't the issue here just that most of our span-related APIs are stubs (for example, the .span() calls all return TokenId::unspecified() instead of the actual TokenId)? We already use TokenId as our Span type, so in theory if we fill out those APIs, this should Just Work™, right?

@matklad
Member Author

matklad commented Jun 25, 2021

Kind of — we can redefine the id to mean the identity of a span (a pair of location and hygiene) rather than the identity of some original token. But today, TokenId is very much tied to the identity of the token itself.

@matklad matklad added the C-Architecture Big architectural things which we need to figure up-front (or suggestions for rewrites :0) ) label Jul 4, 2021
@matklad
Member Author

matklad commented Jul 4, 2021

So, let me summarize the issue again. At the moment, macro expansion in rust-analyzer works by mapping tokens to tokens. If you have a recursive macro, you can start with the token in the original invocation, then map it to the token in the next invocation, then repeat, etc. The core primitives are the map_token_up and map_token_down functions:

https://github.com/rust-analyzer/rust-analyzer/blob/e5c1c8cf2fcfae3e15c8bcf5256e84cad3bd3436/crates/hir_expand/src/lib.rs#L348-L385

rustc works differently. Each token contains (lo, hi) byte positions pointing to some span in some original source code. There's no "level-by-level" mapping for source positions. There also isn't a direct mapping between the tokens.
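The rustc model can be sketched like this (simplified, hypothetical types; real rustc spans are interned and carry more data):

```rust
// Hypothetical simplified model of rustc-style spans: every token carries
// (lo, hi) byte offsets into some original source file, and expansion just
// copies spans around; there is no per-level token-to-token mapping.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct Span { lo: u32, hi: u32 }

#[derive(Clone, Debug)]
struct Token { text: String, span: Span }

/// A macro expansion that, like most declarative macros, reuses the
/// call-site tokens verbatim: the output spans still point at the input.
fn expand_id(input: &[Token]) -> Vec<Token> {
    input.to_vec()
}

fn main() {
    let call_site = vec![Token { text: "foo".into(), span: Span { lo: 40, hi: 43 } }];
    let expansion = expand_id(&call_site);
    // The expanded token's span points straight into the original file;
    // no chain of per-expansion maps is needed to recover it.
    assert_eq!(expansion[0].span, Span { lo: 40, hi: 43 });
}
```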

Note that in practice the models are mostly compatible, as, typically, each expansion token has the same span as some invocation/definition token. This doesn't hold universally, though; the two exceptions are:

  • the include! macro, which, in the rustc model, simply produces tokens with spans pointing into the included file. In rust-analyzer, as all those tokens are manufactured out of thin air, the token mapping infra breaks down.
  • the Span::join nightly rustc API: we can't express this API in rust-analyzer. Although this API isn't stable, it reflects the underlying model implemented in rustc.

So, to solve the problem, we need to change the primitive operations of map_token_up/down to something that matches the rustc model. I think we need a map that maps TextRanges of the expansion to FileRanges. That is, it doesn't make the Origin distinction, and instead directly tracks original sources from non-macro files. The primitive operation would then be a mapping of ranges.
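A sketch of that proposed primitive (names and types hypothetical): map a TextRange inside an expansion directly to a FileRange in a real, non-macro file.

```rust
// Hypothetical simplified types for illustration.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct FileId(u32);

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct TextRange { start: u32, end: u32 }

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct FileRange { file: FileId, range: TextRange }

/// One entry: this slice of the expansion text comes from that file range.
struct ExpansionMap {
    entries: Vec<(TextRange, FileRange)>,
}

impl ExpansionMap {
    /// Find the original FileRange covering a range in the expansion.
    fn map_range_up(&self, range: TextRange) -> Option<FileRange> {
        self.entries
            .iter()
            .find(|(exp, _)| exp.start <= range.start && range.end <= exp.end)
            .map(|&(_, orig)| orig)
    }
}

fn main() {
    let map = ExpansionMap {
        entries: vec![(
            TextRange { start: 0, end: 3 },
            FileRange { file: FileId(1), range: TextRange { start: 40, end: 43 } },
        )],
    };
    // A range inside the expansion maps straight to a non-macro file.
    let orig = map.map_range_up(TextRange { start: 0, end: 3 }).unwrap();
    assert_eq!(orig.file, FileId(1));
    // Ranges with no originating source map to nothing.
    assert_eq!(map.map_range_up(TextRange { start: 10, end: 12 }), None);
}
```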

@flodiebold
Member

What does Span::join do?

@matklad
Member Author

matklad commented Jul 4, 2021

Added the link: join takes two spans from the same file and returns the span covering both of them.
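The semantics can be sketched on simplified (lo, hi) spans (hypothetical types; the real nightly API lives on proc_macro::Span):

```rust
// Joining two spans from the same file yields the smallest span covering
// both; this is the operation the TokenMap identity model cannot express.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct Span { file: u32, lo: u32, hi: u32 }

fn join(a: Span, b: Span) -> Option<Span> {
    // Like the nightly API, joining spans from different files fails.
    if a.file != b.file {
        return None;
    }
    Some(Span { file: a.file, lo: a.lo.min(b.lo), hi: a.hi.max(b.hi) })
}

fn main() {
    let a = Span { file: 0, lo: 10, hi: 20 };
    let b = Span { file: 0, lo: 30, hi: 40 };
    // The joined span covers both inputs, including the gap between them.
    assert_eq!(join(a, b), Some(Span { file: 0, lo: 10, hi: 40 }));
    assert_eq!(join(a, Span { file: 1, lo: 0, hi: 5 }), None);
}
```

Note that the result is a brand-new span, not one that existed on any input token, which is why a token-identity map has nothing to resolve it to.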

@edwin0cheng
Member

edwin0cheng commented Jul 4, 2021 via email

@matklad matklad mentioned this issue Jul 12, 2021
7 tasks
@jonas-schievink jonas-schievink mentioned this issue Aug 16, 2021
4 tasks
@matklad
Member Author

matklad commented Aug 17, 2021

Thinking about this, I suggest the following model (a bit like what we have today, a bit different):

  • There are somewhat separate syntax trees and token trees. Syntax trees are not built directly from token trees. Token trees are a separate vocabulary, dedicated to macro expansion. If a syntax tree corresponds to a macro-expanded file, syntax tree tokens can be mapped to token tree tokens.
  • Token tree tokens don't have identity, but they have associated data. Representation-wise, this is actually close to what we have today -- leaf tokens carry an id, and there's a side table which maps ids to the data.
  • The data for each token includes a FileRange and hygiene info. I am still fuzzy about what hygiene is, exactly.

This data should support two operations:

First, given a token in the syntax tree, find the FileRange corresponding to that token. This is an infallible operation. Note that we need hygiene info here; just taking the raw FileRange from the tokens might be wrong. When displaying diagnostics, we don't want to point to the macro definition, we want to point to the macro call. Example:

macro_rules! m { () => { 1 + () } } // we don't want an error here

fn f() { m!() } // we want the error here

here, the FileRange info alone won't allow the diagnostic to point to m!(); we need something else (hygiene info) for that.

Other than that, this should be a straightforward operation.
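The side table described above can be sketched like this (hypothetical names; "call_site" is a very rough stand-in for real hygiene info):

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct TokenId(u32);

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct FileRange { file: u32, start: u32, end: u32 }

#[derive(Clone, Copy, Debug)]
struct TokenData {
    /// Where the token's text originally came from.
    range: FileRange,
    /// Rough stand-in for hygiene: the macro call this token was expanded
    /// from, if any. Diagnostics should point here, not at the definition.
    call_site: Option<FileRange>,
}

#[derive(Default)]
struct SpanTable {
    data: HashMap<TokenId, TokenData>,
}

impl SpanTable {
    /// The range to show in diagnostics: prefer the call site over the
    /// token's own (definition-side) range.
    fn diagnostic_range(&self, id: TokenId) -> Option<FileRange> {
        let d = self.data.get(&id)?;
        Some(d.call_site.unwrap_or(d.range))
    }
}

fn main() {
    let mut table = SpanTable::default();
    // A token produced by `m!()`: its own range points into the macro
    // definition, but its call site is the `m!()` invocation.
    let def = FileRange { file: 0, start: 25, end: 26 };
    let call = FileRange { file: 0, start: 60, end: 64 };
    table.data.insert(TokenId(0), TokenData { range: def, call_site: Some(call) });
    // The diagnostic lands on `m!()`, not inside the macro definition.
    assert_eq!(table.diagnostic_range(TokenId(0)), Some(call));
}
```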

Second, we need the reverse -- given a range in a file, select the (deepest, in terms of expansion) syntax tree that contains this range. One way to do this would be to look at all the macro expansions in the universe. This is essentially the approach used by the old RLS: it has a giant table with all spans and binary-searches for the relevant offset there.
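That RLS-style lookup can be sketched as one global, sorted table of span entries, binary-searched by offset (types and names hypothetical):

```rust
// One row of the global table: this slice of a file belongs to that expansion.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct SpanEntry {
    start: u32,
    end: u32,
    expansion: u32, // which macro expansion this span belongs to
}

/// Entries sorted by `start`; find an entry containing `offset`.
fn lookup(table: &[SpanEntry], offset: u32) -> Option<SpanEntry> {
    // Binary search for the entries starting at or before `offset`...
    let idx = table.partition_point(|e| e.start <= offset);
    // ...then scan backwards for one that actually covers the offset
    // (the reverse scan finds the innermost of nested spans first).
    table[..idx].iter().rev().find(|e| offset < e.end).copied()
}

fn main() {
    let table = vec![
        SpanEntry { start: 0, end: 10, expansion: 0 },
        SpanEntry { start: 40, end: 43, expansion: 1 },
    ];
    // Offset 41 falls inside the second expansion's span.
    assert_eq!(lookup(&table, 41).map(|e| e.expansion), Some(1));
    // Offset 20 is covered by no expansion at all.
    assert_eq!(lookup(&table, 20), None);
}
```

The cost is that the table spans the whole project, which is exactly the global search the next paragraph tries to avoid.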

In rust-analyzer, we want to avoid that global search. So we make the following modification to the algorithm: we first identify the macro expansion the range might come from, and then look for the relevant span only in that expansion. In practice this means one of two cases:

  • the range is inside a syntactic macro invocation: m!( let ident = 92; ). We can drill down into this expansion.
  • the range is inside an include!d file; we can find the relevant include! (we would need an index for that).

Note that this approach is meaningfully less powerful. In theory, a procedural macro can create a token with a span pointing anywhere in your project. With the RLS model, we'll find that span. With the rust-analyzer model, we won't.

A less hypothetical scenario is when a proc macro (e.g., a parser generator) reads its input from an external file (a grammar in a specific meta syntax) and uses spans from that file. If the user then runs "find all references" on a production in the grammar, in theory we should be able to handle that (and even find references between grammar rules themselves, if the spans are set up properly). In the proposed system, we'd need some extra bit of info which tells us that the grammar file relates to a particular macro expansion.

A similar effect can be observed today with a hidden include:

macro_rules! m { () => { include!("foo.rs") } }

fn f() {
  m!()
}

In this situation, when we invoke "find usages" in the foo.rs file, we need to somehow understand that it is included (indirectly) by the m! macro. In the proposed model, we just won't handle that case by default. We could have an "eagerly expand all the macros" mode, though.

@lnicola lnicola mentioned this issue Sep 27, 2021
4 tasks
bors bot added a commit that referenced this issue Sep 27, 2021
10378: fix: Implement most proc_macro span handling methods r=jonas-schievink a=jonas-schievink

This closes #10368 – some APIs are still missing, but they are either for unstable features or require #9403

bors r+

Co-authored-by: Jonas Schievink <jonasschievink@gmail.com>
@Veykril Veykril self-assigned this Nov 25, 2022
@Veykril Veykril mentioned this issue Jun 19, 2023
7 tasks
@Veykril Veykril mentioned this issue Sep 11, 2023
5 tasks
@Veykril
Member

Veykril commented Dec 4, 2023

Fixed in #15959

@Veykril Veykril closed this as completed Dec 4, 2023