
Output of the lexer #42

Open · mattheww opened this issue Mar 4, 2024 · 4 comments
Labels: A-syntax-lexing (Area: Related to the Lexing Chapter of the "Source code and Rust syntax tree" Section)

mattheww commented Mar 4, 2024

If the plan is to document the lexer separately from the grammar, there are choices to make about how to describe its output.

One choice is whether the lexer's output for punctuation should be described as consisting of fine-grained (single-character) tokens (like the ones procedural macros use), compound tokens (like the ones the tt fragment specifier in macros-by-example uses), or some mixture.

If the lexer doesn't use compound tokens, there's then a choice of how to allow the lexer's clients to distinguish (say) && from & & in cases where it's necessary.

There's also a choice of whether lifetimes use fine-grained or compound tokens.
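For concreteness, here is a sketch in which the same character sequences have to be read both ways (these lines compile on current stable Rust):

```rust
fn main() {
    let x = 1;
    let r: &&i32 = &&x;                // in the type and the borrow, `&&` acts as two `&` tokens
    let lazy = true && false;          // in a lazy boolean expression, `&&` is a single operator
    let v: Vec<Vec<i32>> = Vec::new(); // `>>` has to close two generic argument lists
    let _ = (r, lazy, v);
}
```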

Clients of the lexer

There are three parts of Rust which consume the lexer's output:

  • the main Rust parser
  • the Procedural Macros system
  • the "Macros by example" system

The proc-macro system works in terms of fine-grained tokens; macros-by-example (when using the tt fragment specifier) works in terms of compound tokens.

The parser could be specified using either fine-grained or compound tokens, or a mixture of the two.

I think in practice this means that, whatever choice is made for the main description of the lexer, the "Lexing/tokenization" chapter should have additional text describing how to convert its output to the other form(s), which the chapters for macros could refer to.

Input to the parser

(I'm assuming that the intent is to accurately describe rustc's current behaviour, for the "descriptive" view of the spec.)

Using compound tokens

If the grammar is defined in terms of compound tokens, there are rules which need to match both a single-character token and a compound token beginning with that character:

  • && in BorrowExpression, ReferencePattern, and ReferenceType

  • || in ClosureExpression

  • < in many places (I think everywhere it appears, except comparison expressions)

  • > in many places (I think everywhere it appears, except comparison expressions)

Brian Leibig's grammar gives a reasonable idea of how this approach might end up.

With this approach, the description of how the parsed source is converted to an AST would also have to deal with this complication.

Using fine-grained tokens

If the grammar is defined in terms of fine-grained tokens, its rules will sometimes need to make a distinction depending on whether punctuation tokens were separated by whitespace.

For example:

  • a << b must be accepted as a left-shift expression, but a < < b must not
  • a && b must be taken as a LazyBooleanExpression, not treated as ambiguous with a & &b
    • but a &- b must be treated as equivalent to a & -b

I think that for <, >, and | it's sufficient to know whether there is spacing before the next token, but the second example above shows that for & what matters is whether it's immediately followed by another &.
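A sketch of those cases in source form (the commented-out line is the one rustc rejects):

```rust
fn main() {
    let (a, b) = (4i32, 1i32);
    let _shift = a << b;       // unspaced `<<`: a left-shift expression
    // let _bad = a < < b;     // rejected: spaced `< <` is not a shift operator
    let _lazy = true && false; // unspaced `&&`: a LazyBooleanExpression
    let _mask = a &- b;        // lexes as `&` `-`: equivalent to `a & -b`
}
```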

One approach might be to describe the lexer's output as including explicit whitespace tokens (and presumably introduce some convenience notation in the grammar so that they don't usually have to be mentioned).

Another approach might be to say that there are two distinct tokens used to represent each relevant punctuation character, as suggested by this comment:

You distinguish "> followed by >" as one token and "> followed by something else" as another kind of token.

A third approach might be to use some form of attribute grammar, and say that the input tokens have attributes like "Spacing" and "Is immediately followed by another &".
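As a minimal sketch of the second and third approaches, with hypothetical type and field names (this is not rustc's representation, only an illustration of attaching such attributes to fine-grained tokens):

```rust
// Hypothetical: a fine-grained punctuation token carrying
// "Spacing"-like attributes.
struct PunctToken {
    /// The single punctuation character, e.g. '&' or '<'.
    ch: char,
    /// Whether the next token follows with no intervening whitespace.
    joint_with_next: bool,
    /// For `&`: whether the next character is another `&`.
    followed_by_same_char: bool,
}
```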

With any of these approaches, I think the description of how the parsed source is converted to an AST would remain reasonably simple.

Rejected joined forms when using fine-grained tokens

There are some cases where a parse which might naturally be allowed must instead be rejected when the punctuation characters are not separated by whitespace.

I know of the following:

  • SomeStruct { field:::std::i32::MAX } must be rejected, not treated as equivalent to SomeStruct { field: ::std::i32::MAX }

  • fn f(s:::std::string::String) must be rejected, not treated as equivalent to fn f(s: ::std::string::String)

I expect there are more of a similar sort.

This is perhaps an argument for using a compound token for :: even if fine-grained tokens are used in other cases.
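The first rejected form above, side by side with the accepted spelling (a sketch; the deprecated ::std::i32::MAX constant still compiles):

```rust
#![allow(deprecated)] // ::std::i32::MAX is deprecated in favour of i32::MAX

struct SomeStruct { field: i32 }

fn main() {
    // Accepted: the field's `:` and the path's leading `::` are separate tokens.
    let _ok = SomeStruct { field: ::std::i32::MAX };
    // Rejected: in `field:::std`, the first two colons lex as one `::` token,
    // leaving no `:` for the field initializer:
    // let _bad = SomeStruct { field:::std::i32::MAX };
}
```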

There's also this:

  • a <- b must be rejected, not taken as equivalent to a < -b

but perhaps that's really analogous to a && b (an ambiguity is being resolved in favour of an obsolete unstable feature).

Input to procedural macros

Procedural macros see fine-grained Punct tokens, which have a Spacing property which indicates whether there was whitespace before the following token.

Lifetime-or-label ('foo) is represented as a Punct token with joint spacing followed by an Ident token.

The documentation as of Rust 1.76 has this to say about Punct's Spacing property:

Joint

[…] in token streams parsed from source code, the compiler will only set spacing to Joint in the following cases.

  • When a Punct is immediately followed by another Punct without a whitespace. E.g. + is Joint in += and ++.

  • When a single quote ' is immediately followed by an identifier without a whitespace. E.g. ' is Joint in 'lifetime.

This list may be extended in the future to enable more token combinations.

Alone

[…] In token streams parsed from source code, the compiler will set spacing to Alone in all cases not covered by the conditions for Joint above.
E.g. + is Alone in + =, +ident and +().
In particular, tokens not followed by anything will be marked as Alone.
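This behaviour can be observed with the proc-macro2 crate, whose standalone parser mirrors the compiler's Spacing rules (a sketch; it assumes proc-macro2 has been added as a dependency):

```rust
use proc_macro2::{TokenStream, TokenTree};

fn main() {
    let ts: TokenStream = "a && b & & c".parse().unwrap();
    for tt in ts {
        if let TokenTree::Punct(p) = tt {
            println!("{} -> {:?}", p.as_char(), p.spacing());
        }
    }
    // Expected output: the first `&` is Joint; the remaining three are Alone.
}
```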

Input to "Macros by example" macros

By-example macros using the tt fragment specifier see the following combinations of punctuation as compound tokens:

  <=
  ==
  !=
  >=
  &&
  ||
  ..
  ...
  ..=
  ::
  ->
  <-
  =>
  <<
  >>
  +=
  -=
  *=
  /=
  %=
  ^=
  &=
  |=
  <<=
  >>=

Lifetime-or-label ('foo) is represented as a single token.
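A small macros-by-example sketch showing the effect: it counts how many tt fragments its input splits into.

```rust
macro_rules! count_tts {
    () => { 0usize };
    ($head:tt $($rest:tt)*) => { 1usize + count_tts!($($rest)*) };
}

fn main() {
    assert_eq!(count_tts!(a && b), 3);  // unspaced `&&` is one compound tt
    assert_eq!(count_tts!(a & & b), 4); // spaced `& &` is two tts
}
```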

Sources of information

The current rustc implementation

(As of rustc 1.76)

The lower-level lexer in rustc_lexer emits fine-grained tokens for punctuation (but compound tokens for lifetime-or-label). It emits explicit tokens representing whitespace.

The higher-level lexer in rustc_parse::lexer emits compound tokens for punctuation. It represents whitespace as 'Spacing' information describing the relationship to the following token.

The parser breaks those compound tokens up again where necessary.

(As I understand it, there is consensus that ideally rustc would switch to using fine-grained tokens internally.)

The breaking-up takes place in the following functions in rustc_parse, which might be used to audit for cases where a grammar based on compound tokens would need special cases.

  • expect_and()
  • expect_or()
  • eat_lt()
  • expect_lt()
  • expect_gt()
  • eat_plus()

In particular, rustc takes care to split += when the + appears in generic bounds, but I think that isn't currently making any difference (I don't think a following = can ever parse).
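Some source forms that compile only because of this splitting (a sketch; the comments name the splitter function that plausibly handles each case):

```rust
fn f() -> Vec<Vec<u8>> { Vec::new() } // `>>` split into two closing `>`s (expect_gt)
fn g(x: &&u8) -> u8 { **x }           // `&&` split into two `&`s in the reference type (expect_and)

fn main() {
    let c = || true;                  // `||` split to parse a zero-parameter closure (expect_or)
    assert_eq!(g(&&7), 7);
    assert!(c());
    let _ = f();
}
```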

Currently maintained documentation

The Reference

The Lexical structure chapter of the Reference uses compound tokens for punctuation and lifetime-or-label.

The grammar used in the Reference is written in a mixed style.

In most places it's written as if it's working with fine-grained tokens; notation like && can be taken as representing two & tokens without intervening whitespace.

In three cases (BorrowExpression, ReferencePattern, and ClosureExpression) it lists both the fine-grained and the compound tokens explicitly, sometimes with an explanation in the text.

It treats LIFETIME_OR_LABEL as a single token.

The Ferrocene spec

The Ferrocene spec's lexical description and grammar are close to the Reference's.

The lexical part was forked from the Reference sometime in late 2021, and has no interesting additions.

Its grammar doesn't include the separately-listed compound tokens that the Reference has.

Older formalisations

Rustypop

Rustypop from 2016 used fine-grained tokens to deal with <<, >>, &&, and ||.

Otherwise it used compound tokens (in particular, for <=, <<=, <-, >=, >>=, and ::).

Its README describes its approach to dealing with multiple-punctuation symbols in the grammar.

Brian Leibig's grammar

Brian Leibig's grammar (last updated in 2017) uses compound tokens throughout.

It's the only grammar of that sort I've found that tries to deal with all cases where compound tokens need to be accepted as if they were multiple fine-grained ones.

wg-grammar

The wg-grammar grammar from 2019 appears to have used compound tokens for both punctuation and lifetimes.

I don't think that project got as far as dealing with the resulting complications. rust-lang/wg-grammar#47 has a little discussion.

rust-lang/wg-grammar#3 includes extensive discussion on how the lexer might be modelled.

One result of that discussion was https://github.com/CAD97/rust-lexical-spec, which would have output fine-grained punctuation tokens and explicit whitespace tokens.

mattheww (Author) commented Mar 4, 2024

"Accept tuple.0.0 as tuple indexing"

There is one other case where rustc currently breaks up a single token while parsing: if x is a tuple or tuple-like struct, notation like x.0.1 is permitted.

rustc currently analyses that as three tokens: x, ., and 0.1, where the last is a floating-point literal.

Effectively the grammar allows Expression . FLOAT_LITERAL as a kind of two-deep FieldExpression.
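For example (a sketch):

```rust
struct Pair((u8, u8), u8);

fn main() {
    let x = Pair((1, 2), 3);
    // Lexes as `x` `.` `0.1`, where `0.1` is a float literal; the parser
    // accepts it as the nested tuple indexing (x.0).1.
    assert_eq!(x.0.1, 2);
}
```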

This isn't currently documented in the Reference (rust-lang/reference#847), or anywhere else as far as I know.

Procedural and by-example macros both also see the third token as a single numeric literal. Procedural macros can return a token stream of this form (using a Literal::f32_unsuffixed) which the parser will accept.

The history of this feature is in rust-lang/rust#71322.

So I think there's a choice of whether to continue to treat this as an unusual exception, or to introduce a more fine-grained tokenisation of some numeric literals.

m-ou-se (Member) commented Mar 14, 2024

In the C++ standard, there are two types of tokens: "preprocessing tokens" [lex.pptoken] and "tokens" [lex.token]. In an early phase, the source files are tokenized into preprocessing tokens, and only a few phases later will they be converted into tokens.

It sounds like we should do something similar for Rust.

Would it work if the first phase produces fine-grained tokens, and another phase (before parsing) converts them to compound tokens? It'd be helpful to have very clear names for each type of token, and to start working on their exact definitions.
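A minimal sketch of what that joining phase might look like, with hypothetical names and only `&` handled (not any real implementation):

```rust
#[derive(Clone, Copy)]
enum Fine {
    Punct(char, /* joint with next: */ bool),
    Other,
}

enum Compound {
    AndAnd,
    And,
    Other,
}

fn join(tokens: &[Fine]) -> Vec<Compound> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < tokens.len() {
        match tokens[i] {
            // An `&` joint with a following `&` becomes one `&&` token.
            Fine::Punct('&', true)
                if matches!(tokens.get(i + 1), Some(Fine::Punct('&', _))) =>
            {
                out.push(Compound::AndAnd);
                i += 2;
            }
            Fine::Punct('&', _) => {
                out.push(Compound::And);
                i += 1;
            }
            _ => {
                out.push(Compound::Other);
                i += 1;
            }
        }
    }
    out
}

fn main() {
    // `&&x`: the two `&`s are joint, so they join into one compound token.
    let toks = [Fine::Punct('&', true), Fine::Punct('&', false), Fine::Other];
    assert_eq!(join(&toks).len(), 2);
}
```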

(I think the tuple.0.0 thing is unrelated to this. That even shows up as a floating point literal to proc macros, so I think the only reasonable way of dealing with that is in the parsing phase, by accepting a floating point literal there.)

mattheww (Author) commented
I suggest the practical way to start is to write enough of the lexer chapter to describe how to produce a sequence of tokens that carries all the information that the rest of the spec ought to need, then go on to drafting other chapters.

Then any decisions about how to convert that sequence into the forms that the grammar and macros need can be made together with writing those chapters.

It might be clearer then how much of the description of how to construct (say) the tokens that proc macros see belongs in the Lexer chapter and how much in the Proc-macros chapter.

I've made a start on a description that's more rigorous than the one in the Reference, and I think "fine-grained tokens with explicit tokens for whitespace" is going to be a good form to aim for at first.
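For illustration, output in that form might be shaped like this (hypothetical names, not the actual definitions used there):

```rust
// Hypothetical shape for "fine-grained tokens with explicit whitespace tokens".
enum FineToken {
    Whitespace,
    Punct(char),
    LifetimeOrLabel(String),
    Identifier(String),
    Literal(String),
    // ... comments, delimiters, and so on
}
```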

JoelMarcey added the A-syntax-lexing label and removed the t-spec meeting discussion labels Mar 20, 2024
mattheww (Author) commented
I have put that description at https://github.com/mattheww/lexeywan
