Character and string token definitions need updating. #626

ehuss · 2019-06-26T16:07:07Z

There are multiple issues here. Some of this has changed in 1.37 via rust-lang/rust#60793.

- RAW_BYTE_STRING_LITERAL no longer allows bare CR (new in 1.37). (Input format #1459)
- The "raw string" and "raw byte string" text needs to be updated to say that CRLF is converted to LF (new in 1.37). (Input format #1459)
- Several tokens need their English text synced with the "Lexer" definition.
  - STRING_LITERAL indicates several rules (for example, isolated CRs are not allowed), but the text does not mention any of those restrictions.
  - CHAR_LITERAL says "single Unicode character…except U+0027", which is not complete.
  - RAW_STRING_LITERAL does not allow bare CRs.
  - BYTE_LITERAL escapes are not described.
  - BYTE_STRING_LITERAL restrictions are not described.
  - In general, just make sure they are all in sync!
- Typo in RAW_BYTE_STRING_CONTENT: it points to RAW_STRING_CONTENT when it should point to RAW_BYTE_STRING_CONTENT. (Fixes minor errors #818)
- I cannot find anywhere that mentions that CRLF in a string is converted to LF. Am I blind? (Input format #1459)
- The description for string continuations says "\ immediately before U+000A", but it can also be before CRLF. How should this be handled? I haven't looked at how it is implemented, but are all CRLFs translated everywhere? Should there just be a blanket statement somewhere about this, to avoid having to discuss it in every string literal definition? (Input format #1459)
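
To make the CRLF and bare-CR points above concrete, here is a minimal sketch. The helpers `normalize_crlf` and `has_bare_cr` are hypothetical names invented for illustration; this is not how rustc implements it, only one way to picture "CRLF becomes LF, and any CR left over is a bare CR". The last assertion shows a string continuation in ordinary Rust source.

```rust
/// Hypothetical helper: replaces every CRLF pair with a single LF.
fn normalize_crlf(src: &str) -> String {
    src.replace("\r\n", "\n")
}

/// Hypothetical helper: true if some CR is not immediately followed by LF.
fn has_bare_cr(src: &str) -> bool {
    let bytes = src.as_bytes();
    bytes
        .iter()
        .enumerate()
        .any(|(i, &b)| b == b'\r' && bytes.get(i + 1) != Some(&b'\n'))
}

fn main() {
    // A raw string written with a CRLF line ending inside it.
    let source = "r\"a\r\nb\"";
    assert_eq!(normalize_crlf(source), "r\"a\nb\"");

    // CRLF is not a bare CR; a lone CR is.
    assert!(!has_bare_cr("a\r\nb"));
    assert!(has_bare_cr("a\rb"));

    // String continuation: a backslash right before the line ending skips
    // the line ending and the leading whitespace of the next line.
    let s = "ab\
             cd";
    assert_eq!(s, "abcd");
}
```

If normalization is applied once to the whole input, a continuation before CRLF behaves the same as one before LF, which is the "blanket statement" option raised in the last item.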

I may be missing some things here. Everything needs a very thorough review to make sure it is correct and up to date with the changes from rust-lang/rust#60793.


ehuss · 2019-07-22T18:13:45Z

See also rust-lang/rust#62865

mattheww · 2024-01-22T21:38:47Z

rust-lang/rust#118699 (comment) should be helpful.

mattheww · 2024-01-22T21:39:02Z

~~The current description says that forms like 'a'b are acceptable as a BYTE_LITERAL with a suffix, but in fact they're rejected (to avoid confusion with two LIFETIME_LABEL tokens).~~

The current description says that forms like 'ab'c are acceptable as two LIFETIME_LABEL tokens, but in fact they're rejected ("character literal may only contain one codepoint"; the c is taken as a suffix).

Perhaps this could be documented via another reserved form.
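
As a rough way to see how quote-prefixed forms are split into tokens, the sketch below counts token trees with a small macro. `count_tts` is a hypothetical helper written for this illustration, not anything from the reference or the compiler. The `'ab'c` form from the comment above is left commented out because it is reported there as rejected.

```rust
// Hypothetical helper: counts the token trees it is given. A lifetime-like
// token such as `'a` and a character literal such as `'a'` each count as one.
macro_rules! count_tts {
    () => { 0usize };
    ($head:tt $($tail:tt)*) => { 1usize + count_tts!($($tail)*) };
}

fn main() {
    // Two lifetime-or-label tokens.
    assert_eq!(count_tts!('a 'b), 2);
    // One character literal.
    assert_eq!(count_tts!('a'), 1);
    // A character literal followed by an identifier.
    assert_eq!(count_tts!('a' b), 2);

    // Reported above as rejected ("character literal may only contain one
    // codepoint"), so it is not compiled here:
    // count_tts!('ab'c);
}
```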

mattheww · 2024-01-22T21:43:25Z

A form like b"\u{00a0}" is rejected at lexing time ("unicode escape in byte string").

But as it doesn't match either BYTE_STRING_LITERAL or RESERVED_TOKEN_DOUBLE_QUOTE, the current description says there's a valid tokenisation as the identifier b followed by "\u{00a0}".

So if we keep on with the current mechanism for documenting such rejected tokens, I think we'd need yet more reserved forms.

There are probably other similar cases. I think after rust-lang/rust#119172 a C string literal containing a NUL is one.
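
For the b"\u{00a0}" case above, the following sketch shows the accepted neighbours of that form: the same escape in an ordinary string literal, and byte escapes in a byte string literal. It only illustrates what compiles today; the rejected form stays in a comment.

```rust
fn main() {
    // In a string literal, \u{00a0} is one character encoded as two UTF-8 bytes.
    let s = "\u{00a0}";
    assert_eq!(s.chars().count(), 1);
    assert_eq!(s.as_bytes(), [0xC2u8, 0xA0u8]);

    // In a byte string literal, \x escapes denote single bytes.
    let one_byte = b"\xa0";
    assert_eq!(one_byte.len(), 1);
    assert_eq!(one_byte[0], 0xA0u8);

    // Spelling out the UTF-8 encoding by hand gives the same bytes as `s`.
    assert_eq!(&b"\xc2\xa0"[..], s.as_bytes());

    // Reported above as rejected at lexing time
    // ("unicode escape in byte string"), so it is not compiled here:
    // let bad = b"\u{00a0}";
}
```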

ehuss added the A-lexer Area: Lexical specification label Jun 26, 2019

ehuss mentioned this issue Aug 19, 2019

Normalize newlines when loading files rust-lang/rust#62948

Merged

ehuss mentioned this issue Sep 12, 2019

document behavior of \r\n in string literals #676

Merged

ehuss mentioned this issue Mar 11, 2020

Fix some typos, misformattings and small mistakes in the lexical structure reference. #778

Closed

ehuss mentioned this issue Mar 3, 2021

Raw strings don't preserve carriage return + newline sequences rust-lang/rust#82721

Closed

ehuss mentioned this issue Jun 6, 2021

Clarify "string continue" for (byte) string literals #1042

Merged

ehuss mentioned this issue Feb 20, 2023

Add \r\n to string_continue grammar #1332

Closed

mattheww mentioned this issue Jan 22, 2024

String literal expressions #1452

Merged

mattheww mentioned this issue Jan 28, 2024

Input format #1459

Merged
