
Replace TokenMap with an abstraction that matches reality #9403

Closed
Tracked by #10370
matklad opened this issue Jun 25, 2021 · 8 comments
Assignees
Labels
C-Architecture Big architectural things which we need to figure up-front (or suggestions for rewrites :0) ) E-hard fun A technically challenging issue with high impact

Comments

@matklad
Member

matklad commented Jun 25, 2021

AKA, @matklad has been misunderstanding how macro expansion works this whole time.

Background: originally, I thought about the macro expansion process as transforming a stream of tokens into a different stream of tokens:

macro_rules! id {
    ($($id:tt)*) => { $($id)* };
}

fn main() {
  let foo = 92;
  id!(foo)
}

Here, I thought that the token foo gets translated from the macro call site to the macro expansion site.

This motivated the TokenMap and related abstractions. The idea is that we assign ids to tokens (=tokens have identity), and track those ids through macro expansion. Yesterday, having looked at https://doc.rust-lang.org/stable/proc_macro/struct.Span.html, I concluded that this is not, in fact, how the world works.
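The token-identity idea can be sketched like so (all types here are hypothetical simplifications for illustration, not the actual rust-analyzer API):

```rust
use std::collections::HashMap;

// Hypothetical simplified types; the real rust-analyzer TokenMap is richer.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct TokenId(u32);

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct TextRange {
    start: u32,
    end: u32,
}

/// Maps token ids (token *identity*) back to the source ranges they came from.
#[derive(Default)]
struct TokenMap {
    ranges: HashMap<TokenId, TextRange>,
}

impl TokenMap {
    fn insert(&mut self, id: TokenId, range: TextRange) {
        self.ranges.insert(id, range);
    }
    /// Follow a token from the expansion back to the call site.
    fn range_of(&self, id: TokenId) -> Option<TextRange> {
        self.ranges.get(&id).copied()
    }
}

fn main() {
    let mut map = TokenMap::default();
    // `foo` in `id!(foo)` occupies some range in the original file.
    map.insert(TokenId(0), TextRange { start: 40, end: 43 });
    // As long as the expansion preserves token identity, the lookup works...
    assert_eq!(map.range_of(TokenId(0)), Some(TextRange { start: 40, end: 43 }));
    // ...but a macro that re-creates its tokens loses the ids entirely.
    assert_eq!(map.range_of(TokenId(1)), None);
}
```

The proc macros below show why this identity-based scheme is fragile.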

Consider these two procedural macros:

use proc_macro::{Group, Ident, Punct, TokenStream, TokenTree};

#[proc_macro]
pub fn id(args: TokenStream) -> TokenStream {
    args
}

#[proc_macro]
pub fn id2(args: TokenStream) -> TokenStream {
    clone_stream(args)
}

fn clone_stream(ts: TokenStream) -> TokenStream {
    ts.into_iter().map(clone_tree).collect()
}

fn clone_tree(t: TokenTree) -> TokenTree {
    match t {
        TokenTree::Group(orig) => {
            let mut new = Group::new(orig.delimiter(), clone_stream(orig.stream()));
            new.set_span(orig.span());
            TokenTree::Group(new)
        }
        TokenTree::Ident(orig) => TokenTree::Ident(Ident::new(&orig.to_string(), orig.span())),
        TokenTree::Punct(orig) => {
            let mut new = Punct::new(orig.as_char(), orig.spacing());
            new.set_span(orig.span());
            TokenTree::Punct(new)
        }
        TokenTree::Literal(orig) =>  { ... },
    }
}

I believe their semantics are the same -- from rustc's point of view, they produce equivalent outputs. The implementation of id2 completely erases token identity, though.

So, bad news: we need to rewrite the TokenMap-based stuff to use something else (and I don't know what that something else would be). Good news: I think this should make weird cases like include! work in a more out-of-the-box way.

cc @jonas-schievink , @edwin0cheng

@matklad matklad added E-hard fun A technically challenging issue with high impact labels Jun 25, 2021
@jonas-schievink
Contributor

Hmm, isn't the issue here just that most of our span-related APIs are stubs (for example, the .span() calls all return TokenId::unspecified() instead of the actual TokenId)? We already use TokenId as our Span type, so in theory if we fill out those APIs, this should Just Work™, right?

@matklad
Member Author

matklad commented Jun 25, 2021

Kind of — we can redefine the id to mean the identity of a span (a pair of location and hygiene) rather than the identity of some original token. But today, TokenId is very much tied to the identity of the token itself.

@matklad matklad added the C-Architecture Big architectural things which we need to figure up-front (or suggestions for rewrites :0) ) label Jul 4, 2021
@matklad
Member Author

matklad commented Jul 4, 2021

So, let me summarize the issue again. At the moment, macro expansion in rust-analyzer works by mapping tokens to tokens. If you have a recursive macro, you can start with the token in the original invocation, then map it to the token in the next invocation, then repeat, etc. The core primitives are the map_token_up and map_token_down functions:

https://github.com/rust-analyzer/rust-analyzer/blob/e5c1c8cf2fcfae3e15c8bcf5256e84cad3bd3436/crates/hir_expand/src/lib.rs#L348-L385

rustc works differently. Each token contains (lo, hi) byte positions pointing to some span in some original source code. There's no "level-by-level" mapping for source positions. There also isn't a direct mapping between the tokens.
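The rustc model can be sketched like this (simplified, hypothetical types; real rustc spans are interned and carry more data):

```rust
// Hypothetical simplified model of rustc-style spans: every token carries
// (lo, hi) byte offsets into some original source file, and expansion just
// copies spans around; there is no per-level token-to-token mapping.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct Span { lo: u32, hi: u32 }

#[derive(Clone, Debug)]
struct Token { text: String, span: Span }

/// A macro expansion that, like most declarative macros, reuses the
/// call-site tokens verbatim: the output spans still point at the input.
fn expand_id(input: &[Token]) -> Vec<Token> {
    input.to_vec()
}

fn main() {
    let call_site = vec![Token { text: "foo".into(), span: Span { lo: 40, hi: 43 } }];
    let expansion = expand_id(&call_site);
    // The expanded token's span points straight into the original file;
    // no chain of per-expansion maps is needed to recover it.
    assert_eq!(expansion[0].span, Span { lo: 40, hi: 43 });
}
```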

Note that in practice the models are mostly compatible, as, typically, each expansion token has the same span as some invocation/definition token. This doesn't hold universally, though; the two exceptions are:

  • the include! macro, which, in the rustc model, simply produces tokens with spans pointing into the included file. In rust-analyzer, as all those tokens are manufactured out of thin air, the token mapping infra breaks down.
  • the Span::join nightly rustc API: we can't express this API in rust-analyzer. Although this API isn't stable, it reflects the underlying model implemented in rustc.

So, to solve the problem, we need to change the primitive operations of map_token_up/down to something that matches the rustc model. I think we need a map that maps TextRanges of the expansion to FileRanges. That is, it doesn't make the Origin distinction, and instead directly tracks original sources from non-macro files. The primitive operation would then be a mapping of ranges.
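A sketch of that proposed primitive (names and types hypothetical): map a TextRange inside an expansion directly to a FileRange in a real, non-macro file.

```rust
// Hypothetical simplified types for illustration.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct FileId(u32);

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct TextRange { start: u32, end: u32 }

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct FileRange { file: FileId, range: TextRange }

/// One entry: this slice of the expansion text comes from that file range.
struct ExpansionMap {
    entries: Vec<(TextRange, FileRange)>,
}

impl ExpansionMap {
    /// Find the original FileRange covering a range in the expansion.
    fn map_range_up(&self, range: TextRange) -> Option<FileRange> {
        self.entries
            .iter()
            .find(|(exp, _)| exp.start <= range.start && range.end <= exp.end)
            .map(|&(_, orig)| orig)
    }
}

fn main() {
    let map = ExpansionMap {
        entries: vec![(
            TextRange { start: 0, end: 3 },
            FileRange { file: FileId(1), range: TextRange { start: 40, end: 43 } },
        )],
    };
    // A range inside the expansion maps straight to a non-macro file.
    let orig = map.map_range_up(TextRange { start: 0, end: 3 }).unwrap();
    assert_eq!(orig.file, FileId(1));
    // Ranges with no originating source map to nothing.
    assert_eq!(map.map_range_up(TextRange { start: 10, end: 12 }), None);
}
```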

@flodiebold
Member

What does Span::join do?

@matklad
Member Author

matklad commented Jul 4, 2021

Added the link: join takes two spans from the same file and returns the span covering both of them.
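The semantics can be sketched on simplified (lo, hi) spans (hypothetical types; the real nightly API lives on proc_macro::Span):

```rust
// Joining two spans from the same file yields the smallest span covering
// both; this is the operation the TokenMap identity model cannot express.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct Span { file: u32, lo: u32, hi: u32 }

fn join(a: Span, b: Span) -> Option<Span> {
    // Like the nightly API, joining spans from different files fails.
    if a.file != b.file {
        return None;
    }
    Some(Span { file: a.file, lo: a.lo.min(b.lo), hi: a.hi.max(b.hi) })
}

fn main() {
    let a = Span { file: 0, lo: 10, hi: 20 };
    let b = Span { file: 0, lo: 30, hi: 40 };
    // The joined span covers both inputs, including the gap between them.
    assert_eq!(join(a, b), Some(Span { file: 0, lo: 10, hi: 40 }));
    assert_eq!(join(a, Span { file: 1, lo: 0, hi: 5 }), None);
}
```

Note that the result is a brand-new span, not one that existed on any input token, which is why a token-identity map has nothing to resolve it to.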

@edwin0cheng
Member

edwin0cheng commented Jul 4, 2021 via email

@matklad matklad mentioned this issue Jul 12, 2021
7 tasks
@jonas-schievink jonas-schievink mentioned this issue Aug 16, 2021
4 tasks
@matklad
Member Author

matklad commented Aug 17, 2021

Thinking about this, I suggest the following model (a bit like what we have today, a bit different):

  • There are somewhat separate syntax trees and token trees. Syntax trees are not built directly from token trees. Token trees are a separate vocabulary, dedicated to macro expansion. If a syntax tree corresponds to a macro-expanded file, syntax tree tokens can be mapped to token tree tokens.
  • Token tree tokens don't have identity, but they have associated data. Representation-wise, this is actually close to what we have today -- leaf tokens carry an id, and there's a side table which maps ids to the data.
  • The data for each token includes a FileRange and hygiene info. I am still fuzzy about what hygiene is, exactly.

This data should support two operations:

First, given a token in the syntax tree, find the FileRange corresponding to that token. This is an infallible operation. Note that we need hygiene info here; just taking the raw FileRange from the tokens might be wrong. When displaying diagnostics, we don't want to point to the macro definition, we want to point to the macro call. Example:

macro_rules! m { () => { 1 + () } } // we don't want an error here

fn f() { m!() } // we want the error here

here, the FileRange info alone won't allow the diagnostic to point to m!(); we need something else (hygiene info) for that.

Other than that, this should be a straightforward operation.
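The side table described above can be sketched like this (hypothetical names; "call_site" is a very rough stand-in for real hygiene info):

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct TokenId(u32);

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct FileRange { file: u32, start: u32, end: u32 }

#[derive(Clone, Copy, Debug)]
struct TokenData {
    /// Where the token's text originally came from.
    range: FileRange,
    /// Rough stand-in for hygiene: the macro call this token was expanded
    /// from, if any. Diagnostics should point here, not at the definition.
    call_site: Option<FileRange>,
}

#[derive(Default)]
struct SpanTable {
    data: HashMap<TokenId, TokenData>,
}

impl SpanTable {
    /// The range to show in diagnostics: prefer the call site over the
    /// token's own (definition-side) range.
    fn diagnostic_range(&self, id: TokenId) -> Option<FileRange> {
        let d = self.data.get(&id)?;
        Some(d.call_site.unwrap_or(d.range))
    }
}

fn main() {
    let mut table = SpanTable::default();
    // A token produced by `m!()`: its own range points into the macro
    // definition, but its call site is the `m!()` invocation.
    let def = FileRange { file: 0, start: 25, end: 26 };
    let call = FileRange { file: 0, start: 60, end: 64 };
    table.data.insert(TokenId(0), TokenData { range: def, call_site: Some(call) });
    // The diagnostic lands on `m!()`, not inside the macro definition.
    assert_eq!(table.diagnostic_range(TokenId(0)), Some(call));
}
```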

Second, we need the reverse -- given a range in a file, select the (deepest, in terms of expansion) syntax tree that contains this range. One way to do this would be to look at all the macro expansions in the universe. This is essentially the approach used by the old RLS: it has a giant table with all spans and binary-searches for the relevant offset there.
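That RLS-style lookup can be sketched as one global, sorted table of span entries, binary-searched by offset (types and names hypothetical):

```rust
// One row of the global table: this slice of a file belongs to that expansion.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct SpanEntry {
    start: u32,
    end: u32,
    expansion: u32, // which macro expansion this span belongs to
}

/// Entries sorted by `start`; find an entry containing `offset`.
fn lookup(table: &[SpanEntry], offset: u32) -> Option<SpanEntry> {
    // Binary search for the entries starting at or before `offset`...
    let idx = table.partition_point(|e| e.start <= offset);
    // ...then scan backwards for one that actually covers the offset
    // (the reverse scan finds the innermost of nested spans first).
    table[..idx].iter().rev().find(|e| offset < e.end).copied()
}

fn main() {
    let table = vec![
        SpanEntry { start: 0, end: 10, expansion: 0 },
        SpanEntry { start: 40, end: 43, expansion: 1 },
    ];
    // Offset 41 falls inside the second expansion's span.
    assert_eq!(lookup(&table, 41).map(|e| e.expansion), Some(1));
    // Offset 20 is covered by no expansion at all.
    assert_eq!(lookup(&table, 20), None);
}
```

The cost is that the table spans the whole project, which is exactly the global search the next paragraph tries to avoid.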

In rust-analyzer, we want to avoid that global search. So we make the following modification to the algorithm: we first identify the macro expansion the range might come from, and then look for the relevant span only in that expansion. In practice this means one of two cases:

  • the range is inside a syntactic macro invocation: m!( let ident = 92; ). We can drill down into this expansion.
  • the range is inside an include!d file; we can find the relevant include! (we would need an index for that).

Note that this approach is meaningfully less powerful. In theory, a procedural macro can create a token with a span pointing anywhere in your project. With the RLS model, we'll find that span. With the rust-analyzer model, we won't.

A less hypothetical scenario is when a proc macro (e.g., a parser generator) reads its input from an external file (a grammar in a specific meta syntax) and uses spans from that file. If the user then runs "find all references" on a production in the grammar, in theory we should be able to handle that (and even find references between grammar rules themselves, if the spans are set up properly). In the proposed system, we'd need some extra bit of info which tells us that the grammar file relates to a particular macro expansion.

A similar effect can be observed today with a hidden include:

macro_rules! m { () => { include!("foo.rs") } }

fn f() {
  m!()
}

In this situation, when we invoke "find usages" in the foo.rs file, we need to somehow understand that it is included (indirectly) by the m! macro. In the proposed model, we just won't handle that case by default. We could have an "eagerly expand all the macros" mode, though.

@lnicola lnicola mentioned this issue Sep 27, 2021
4 tasks
bors bot added a commit that referenced this issue Sep 27, 2021
10378: fix: Implement most proc_macro span handling methods r=jonas-schievink a=jonas-schievink

This closes #10368 – some APIs are still missing, but they are either for unstable features or require #9403

bors r+

Co-authored-by: Jonas Schievink <jonasschievink@gmail.com>
@Veykril Veykril self-assigned this Nov 25, 2022
@Veykril Veykril mentioned this issue Jun 19, 2023
7 tasks
@Veykril Veykril mentioned this issue Sep 11, 2023
5 tasks
@Veykril
Member

Veykril commented Dec 4, 2023

Fixed in #15959

@Veykril Veykril closed this as completed Dec 4, 2023