Inconsistent literal escaping in proc macros #60495

Open
petrochenkov opened this issue May 3, 2019 · 3 comments
Labels

  • A-frontend · Area: Compiler frontend (errors, parsing and HIR)
  • A-macros · Area: All kinds of macros (custom derive, macro_rules!, proc macros, ..)
  • A-proc-macros · Area: Procedural macros
  • C-bug · Category: This is a bug.
  • T-compiler · Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@petrochenkov
Contributor

petrochenkov commented May 3, 2019

Proc macros operate on tokens, including string/character/byte-string/byte literal tokens, which they can get from various sources.

  • Source 1: Lexer.
    This is the most reliable source: the token is passed to a macro exactly as it was written in source code.
    "C" will be passed as "C", but the same C in escaped form "\x43" will be passed as "\x43".
    Proc macros can observe the difference because ToString (the only way to get the literal contents in the proc macro API) also prints the literal exactly as written.
  • Source 2: Proc macro API.
    Literal::string(s: &str) creates a string literal whose contents are approximately s.
    The precise token (returned by ToString) will contain:
    • escape_debug(s) for string literals (Literal::string)
    • escape_unicode(s) for character literals (Literal::character)
    • escape_default(s) for byte string literals (Literal::byte_string)
  • Source 3: Recovered from non-attribute AST
    The AST is pretty-printed first, then re-tokenized.
    The precise token (returned by ToString) will contain:
    • precise s for raw AST strings
    • escape_debug(s) for non-raw AST strings
    • escape_default(s) for AST characters, bytes and byte strings (both raw and non-raw)
  • Source 4: Recovered from attribute AST
    The token is recovered ad hoc, without pretty-printing.
    The precise token (returned by ToString) will contain:
    • precise s for raw AST strings
    • escape_default(s) for non-raw AST strings, AST characters, bytes and byte strings (both raw and non-raw)

EDIT: Doc comments also go through escape_debug when they are converted to #[doc = "content"] tokens for proc macros.
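To make the difference between the three escaping flavors concrete, here is a minimal sketch. It assumes that the escape_debug, escape_default and escape_unicode named above behave like the standard library `str` methods of the same names; the helper `escape_three_ways` is hypothetical and exists only for illustration.

```rust
/// Hypothetical helper: render `s` with each of the three standard library
/// escaping flavors, in the order (escape_debug, escape_default, escape_unicode).
fn escape_three_ways(s: &str) -> (String, String, String) {
    (
        s.escape_debug().to_string(),
        s.escape_default().to_string(),
        s.escape_unicode().to_string(),
    )
}

fn main() {
    // U+00E9 (a printable non-ASCII character) followed by a newline.
    let (debug, default, unicode) = escape_three_ways("\u{e9}\n");
    println!("escape_debug:   {}", debug);
    println!("escape_default: {}", default);
    println!("escape_unicode: {}", unicode);
}
```

For this input the three results differ: escape_debug keeps the printable non-ASCII character and escapes only the newline, escape_default also turns the non-ASCII character into a \u{...} escape, and escape_unicode turns every character into a \u{...} escape. The same string data can therefore reach a macro as three different token texts depending on where the token came from.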

It would be nice to

  • Figure out what escaping we actually want (perhaps none?) and document the motivation behind the escaping choices.
  • Get rid of the escaping differences between token sources, so that at least literals of the same kind are escaped identically.
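Until the escaping is unified, one way a macro author might sidestep the differences is to compare the values that literals denote rather than their token text. The sketch below is an assumption-laden illustration, not something proposed in this issue: `decode_str_literal` is a hypothetical helper, it is not part of the proc macro API, and it handles only non-raw string literals with a common subset of escapes.

```rust
/// Hypothetical helper (not part of the proc macro API): decode the text of a
/// non-raw string literal, quotes included, into the value it denotes.
/// Returns None for anything outside the supported subset of escapes.
fn decode_str_literal(text: &str) -> Option<String> {
    let inner = text.strip_prefix('"')?.strip_suffix('"')?;
    let mut out = String::new();
    let mut chars = inner.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            out.push(c);
            continue;
        }
        match chars.next()? {
            'n' => out.push('\n'),
            'r' => out.push('\r'),
            't' => out.push('\t'),
            '0' => out.push('\0'),
            '\\' => out.push('\\'),
            '\'' => out.push('\''),
            '"' => out.push('"'),
            'x' => {
                // Exactly two hex digits; string literals allow only 0x00..=0x7F.
                let hi = chars.next()?.to_digit(16)?;
                let lo = chars.next()?.to_digit(16)?;
                let value = hi * 16 + lo;
                if value > 0x7F {
                    return None;
                }
                out.push(char::from_u32(value)?);
            }
            'u' => {
                // `\u{...}`: collect hex digits up to the closing brace.
                if chars.next()? != '{' {
                    return None;
                }
                let mut digits = String::new();
                loop {
                    match chars.next()? {
                        '}' => break,
                        d => digits.push(d),
                    }
                }
                let value = u32::from_str_radix(&digits, 16).ok()?;
                out.push(char::from_u32(value)?);
            }
            _ => return None,
        }
    }
    Some(out)
}

fn main() {
    // The two spellings from the issue differ as text but denote the same value.
    let plain = decode_str_literal(r#""C""#);
    let escaped = decode_str_literal(r#""\x43""#);
    println!("{:?} {:?} equal: {}", plain, escaped, plain == escaped);
}
```

Comparing decoded values makes "C" and "\x43" indistinguishable, which is the property a macro would want if it should not depend on which source produced the token.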
@petrochenkov petrochenkov added A-frontend Area: Compiler frontend (errors, parsing and HIR) A-macros Area: All kinds of macros (custom derive, macro_rules!, proc macros, ..) A-parser Area: The parsing of Rust source code to an AST and removed A-parser Area: The parsing of Rust source code to an AST labels May 3, 2019
@jonas-schievink jonas-schievink added the C-bug Category: This is a bug. label May 3, 2019
@petrochenkov
Contributor Author

#60506 addresses the raw byte string literal case (no escaping should happen in that case).

bors added a commit that referenced this issue May 9, 2019
Keep original literal tokens in AST

The original literal tokens (`token::Lit`) are kept in AST until lowering to HIR.

The tokens are kept together with their lowered "semantic" representation (`ast::LitKind`), so the size of `ast::Lit` is increased (this also increases the size of meta-item structs used for processing built-in attributes).
However, the size of `ast::Expr` stays the same.

The intent is to remove the "semantic" representation from AST eventually and keep literals as tokens until lowering to HIR (at least), and I'm going to work on that, but it would be good to land this sooner to unblock progress on the [lexer refactoring](#59706).

Fixes a part of #43081 (literal tokens that are passed to proc macros are always precise, including hexadecimal numbers, strings with their original escaping, etc)
Fixes a part of #60495 (everything except for proc macro API doesn't need escaping anymore)
This also makes it possible to eliminate a certain hack from the lexer (https://rust-lang.zulipchat.com/#narrow/stream/131828-t-compiler/topic/pretty-printing.20comments/near/165005357).

cc @matklad
bors added a commit that referenced this issue May 12, 2019
Keep original literal tokens in AST
@petrochenkov petrochenkov self-assigned this Mar 14, 2020
@jonas-schievink jonas-schievink added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Apr 21, 2020
@Aaron1011 Aaron1011 added the A-proc-macros Area: Procedural macros label May 21, 2020
@petrochenkov
Contributor Author

petrochenkov commented Apr 3, 2022

Current status of the escaping in use:

  • proc macro server
    • fn byte_string - escape_default
    • fn from_internal - DocComment - escape_debug
    • fn string - escape_debug
    • fn character - escape_unicode
  • parser
    • fn to_lit_token - ByteStr/Byte/Str/Char - escape_default
    • fn inlined_next_desugared - DocComment - no escaping, raw literal is used instead
  • pretty-printer
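For the byte string case, the following sketch shows what escape_default-style escaping produces. It assumes the escape_default listed for `fn byte_string` behaves like `std::ascii::escape_default` applied to each byte; `byte_string_literal_text` is a hypothetical helper, not the compiler's actual code.

```rust
use std::ascii;

/// Hypothetical helper: build the source text of a byte string literal by
/// escaping every byte with `std::ascii::escape_default`.
fn byte_string_literal_text(bytes: &[u8]) -> String {
    let escaped: String = bytes
        .iter()
        .flat_map(|&b| ascii::escape_default(b)) // each byte becomes 1..=4 ASCII bytes
        .map(char::from)
        .collect();
    format!("b\"{}\"", escaped)
}

fn main() {
    // A printable byte, a non-ASCII byte, and a newline.
    println!("{}", byte_string_literal_text(b"C\xff\n"));
}
```

Printable ASCII bytes pass through unchanged, while non-ASCII bytes become \xNN escapes and control characters such as newline become their short escapes, so the resulting token text generally differs from whatever spelling was used in the original source.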

@petrochenkov
Contributor Author

Some discussion about the motivation and requirements for this escaping can be found in #95343 (comment) and below.
