Inconsistent literal escaping in proc macros #60495

petrochenkov · 2019-05-03T00:18:18Z

Proc macros operate on tokens, including string/character/byte-string/byte literal tokens, which they can get from various sources.

Source 1: Lexer.
This is the most reliable source: the token is passed to the macro exactly as it was written in the source code.
"C" will be passed as "C", but the same C in escaped form "\x43" will be passed as "\x43".
Proc macros can observe the difference because ToString (the only way to get the literal's contents in the proc macro API) also prints the literal exactly as written.
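
As a rough illustration outside the proc macro API, the built-in `stringify!` macro also renders a literal token as text. On current compilers it is expected to keep the spelling used in the source, although its exact output is not a documented guarantee. The helper name `spellings` is made up for this sketch.

```rust
/// Returns the token text of the same one-character string written two ways.
/// `spellings` is a hypothetical helper, not part of any API.
fn spellings() -> (&'static str, &'static str) {
    // `stringify!` turns the tokens it receives into text without evaluating them.
    (stringify!("C"), stringify!("\x43"))
}

fn main() {
    // As values, the two literals are indistinguishable.
    assert_eq!("C", "\x43");

    // As tokens, their text is expected to differ, which is the kind of
    // difference a proc macro observes through `ToString`.
    let (plain, escaped) = spellings();
    println!("plain:   {}", plain);
    println!("escaped: {}", escaped);
}
```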
Source 2: Proc macro API.
Literal::string(s: &str) creates a string literal containing the data s, approximately.
The precise token (returned by ToString) will contain:
- escape_debug(s) for string literals (Literal::string)
- escape_unicode(s) for character literals (Literal::character)
- escape_default(s) for byte string literals (Literal::byte_string)
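
For reference, these escaping routines differ visibly on non-ASCII and control characters. The sketch below uses the standard library functions of the same names (`str::escape_debug`, `str::escape_default`, `char::escape_unicode`, `std::ascii::escape_default`). It is an assumption that they match what the compiler applies internally, and the helper `escape_bytes` is invented for illustration.

```rust
use std::ascii;

/// Escapes each byte with `std::ascii::escape_default`.
/// Hypothetical helper for this sketch only.
fn escape_bytes(bytes: &[u8]) -> String {
    bytes
        .iter()
        .flat_map(|&b| ascii::escape_default(b))
        .map(char::from)
        .collect()
}

fn main() {
    let s = "é\n";
    // Keeps printable Unicode, escapes the newline: é\n
    println!("escape_debug:   {}", s.escape_debug());
    // Escapes everything outside printable ASCII: \u{e9}\n
    println!("escape_default: {}", s.escape_default());
    // Always uses the \u{...} form, even for ASCII: \u{43}
    println!("escape_unicode: {}", 'C'.escape_unicode());
    // Byte-oriented escaping uses \x.. for non-ASCII bytes: \xff\n
    println!("bytes:          {}", escape_bytes(b"\xff\n"));
}
```

The same value can therefore end up with quite different token text depending on which routine a given literal kind goes through.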
Source 3: Recovered from non-attribute AST.
The AST is pretty-printed first and then re-tokenized.
The precise token (returned by ToString) will contain:
- precise s for raw AST strings
- escape_debug(s) for non-raw AST strings
- escape_default(s) for AST characters, bytes and byte strings (both raw and non-raw)
Source 4: Recovered from attribute AST.
This is an ad-hoc recovery without pretty-printing.
The precise token (returned by ToString) will contain:
- precise s for raw AST strings
- escape_default(s) for non-raw AST strings, AST characters, bytes and byte strings (both raw and non-raw)
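
To make the mismatch between the two AST sources concrete, here is a minimal model of the non-raw string case only. The type `AstSource` and the function `string_token_text` are hypothetical names for this sketch, and the standard library escaping methods stand in for whatever the compiler calls internally.

```rust
/// Where a non-raw string literal token was recovered from.
/// Hypothetical type modelling only the two AST cases described in the issue.
#[derive(Clone, Copy)]
enum AstSource {
    NonAttribute,
    Attribute,
}

/// Token text for a non-raw string with value `s`, following the rules
/// listed for each source (escape_debug vs. escape_default).
fn string_token_text(source: AstSource, s: &str) -> String {
    let escaped = match source {
        AstSource::NonAttribute => s.escape_debug().to_string(),
        AstSource::Attribute => s.escape_default().to_string(),
    };
    format!("\"{}\"", escaped)
}

fn main() {
    let value = "é";
    // Same value, two different token texts depending on the source.
    println!("{}", string_token_text(AstSource::NonAttribute, value));
    println!("{}", string_token_text(AstSource::Attribute, value));
}
```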

EDIT: Doc comments also go through escape_debug when converted to #[doc = "content"] tokens for proc macros.

It would be nice to:

- Figure out what escaping we actually want (perhaps none?) and document the motivation behind the escaping choices.
- Get rid of the escaping differences between token sources, so that at least literals of the same kind are escaped identically.

petrochenkov · 2019-05-03T11:49:47Z

#60506 addresses the raw byte string literal case (no escaping should happen in that case).

@matklad

Keep original literal tokens in AST

The original literal tokens (`token::Lit`) are kept in AST until lowering to HIR.

The tokens are kept together with their lowered "semantic" representation (`ast::LitKind`), so the size of `ast::Lit` is increased (this also increases the size of meta-item structs used for processing built-in attributes). However, the size of `ast::Expr` stays the same.

The intent is to remove the "semantic" representation from AST eventually and keep literals as tokens until lowering to HIR (at least), and I'm going to work on that, but it would be good to land this sooner to unblock progress on the [lexer refactoring](#59706).

Fixes a part of #43081 (literal tokens that are passed to proc macros are always precise, including hexadecimal numbers, strings with their original escaping, etc)
Fixes a part of #60495 (everything except for proc macro API doesn't need escaping anymore)
This also allows to eliminate a certain hack from the lexer (https://rust-lang.zulipchat.com/#narrow/stream/131828-t-compiler/topic/pretty-printing.20comments/near/165005357).

cc @matklad

petrochenkov · 2022-04-03T21:54:42Z

Current status of the escaping in use:

proc macro server
- fn byte_string - escape_default
- fn from_internal - DocComment - escape_debug
- fn string - escape_debug
- fn character - escape_unicode
parser
- fn to_lit_token - ByteStr/Byte/Str/Char - escape_default
- fn inlined_next_desugared - DocComment - no escaping, raw literal is used instead
pretty-printer
- fn print_string - escape_debug - used for inline asm only (Audit string literals in inline assembly syntax for correct unescaping #95625)

petrochenkov · 2022-04-04T17:26:18Z

Some discussion about the motivation and requirements for this escaping can be found in #95343 (comment) and below.

petrochenkov mentioned this issue May 3, 2019

introduce unescape module #60261

Merged

jonas-schievink added the C-bug Category: This is a bug. label May 3, 2019

petrochenkov mentioned this issue May 3, 2019

Preserve byte string literal raw-ness in AST #60506

Closed

petrochenkov mentioned this issue May 9, 2019

Keep original literal tokens in AST #60679

Merged

petrochenkov self-assigned this Mar 14, 2020

jonas-schievink added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Apr 21, 2020

Aaron1011 added the A-proc-macros Area: Procedural macros label May 21, 2020

malobre mentioned this issue Jan 19, 2022

chore: add unicode task, remove python script rome/tools#1932

Merged

petrochenkov mentioned this issue Mar 26, 2022

Reduce unnecessary escaping in proc_macro::Literal::character/string #95343

Merged

petrochenkov mentioned this issue Apr 3, 2022

Audit string literals in inline assembly syntax for correct unescaping #95625

Open

petrochenkov mentioned this issue Apr 11, 2024

Improve escaping of byte, byte str, and c str proc-macro literals #123769

Merged
