Character and string token definitions need updating. #626

Open
5 of 6 tasks
ehuss opened this issue Jun 26, 2019 · 4 comments
Labels
A-lexer Area: Lexical specification

Comments

@ehuss
Contributor

ehuss commented Jun 26, 2019

There are multiple issues here. Some of this has changed in 1.37 via rust-lang/rust#60793.

  • RAW_BYTE_STRING_LITERAL no longer allows bare CR (new in 1.37). See Input format #1459.

  • The "raw string" and "raw byte string" descriptions need to be updated to say that CRLF is converted to LF (new in 1.37). See Input format #1459.

  • Several tokens need to sync the English text with the "Lexer" definition.

    • The STRING_LITERAL lexer rule encodes several restrictions (for example, isolated CRs are not allowed), but the text does not mention any of them.
    • CHAR_LITERAL says "single Unicode character…except U+0027", which is incomplete.
    • RAW_STRING_LITERAL does not allow bare CRs.
    • BYTE_LITERAL escapes are not described.
    • BYTE_STRING_LITERAL restrictions are not described.
    • In general, just make sure they are all in sync!
  • Typo in the RAW_BYTE_STRING_CONTENT rule: it points to RAW_STRING_CONTENT when it should point to RAW_BYTE_STRING_CONTENT. See Fixes minor errors #818.

  • I cannot find anywhere that mentions that CRLF in a string is converted to LF. Am I blind? See Input format #1459.

  • The description for string continuations says "\ immediately before U+000A", but it can also be before CRLF. How should this be handled? I haven't looked at how it is implemented, but are all CRLFs translated everywhere? Should there just be a blanket statement somewhere about this, to avoid having to discuss it in every string literal definition? See Input format #1459.

I may be missing some things here. Everything needs a very thorough review to make sure it is correct and up to date with the changes from rust-lang/rust#60793.
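
The CRLF and string-continuation points above can be illustrated with a small Rust sketch. This is only a model under stated assumptions: `normalize_crlf` and `has_bare_cr` are hypothetical helpers written for this example (they are not rustc or Reference APIs), and they assume a blanket rule where CRLF is translated to LF before tokenization. `continued` uses an ordinary string continuation.

```rust
/// Hypothetical model of a blanket input rule: every CRLF pair becomes a
/// single LF. A bare CR is left untouched so a later check can reject it.
fn normalize_crlf(source: &str) -> String {
    source.replace("\r\n", "\n")
}

/// Hypothetical check for a bare CR, meaning a CR that is not
/// immediately followed by LF.
fn has_bare_cr(source: &str) -> bool {
    let bytes = source.as_bytes();
    bytes
        .iter()
        .enumerate()
        .any(|(i, &b)| b == b'\r' && bytes.get(i + 1) != Some(&b'\n'))
}

/// A string continuation: the backslash, the line break, and the leading
/// whitespace on the following line are all skipped.
fn continued() -> &'static str {
    "foo\
     bar"
}
```

With such a blanket rule, a continuation written as `\` followed by CRLF would already have become `\` followed by LF by the time the literal rules apply, so the individual literal definitions would not need to mention CRLF.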

@ehuss ehuss added the A-lexer Area: Lexical specification label Jun 26, 2019
@ehuss
Contributor Author

ehuss commented Jul 22, 2019

See also rust-lang/rust#62865

@mattheww
Contributor

rust-lang/rust#118699 (comment) should be helpful.

@mattheww
Contributor

mattheww commented Jan 22, 2024

The current description says that forms like 'a'b are acceptable as a CHAR_LITERAL with a suffix, but in fact they're rejected (to avoid confusion with two LIFETIME_OR_LABEL tokens).

The current description says that forms like 'ab'c are acceptable as two LIFETIME_OR_LABEL tokens, but in fact they're rejected ("character literal may only contain one codepoint"; the c is taken as a suffix).

Perhaps this could be documented via another reserved form.
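
As a rough illustration of the ambiguity, here is a simplified Rust model of how a lexer could decide between a quoted form and a lifetime. Everything in it is an assumption made for this example: `QuoteForm` and `classify` are hypothetical names, only ASCII identifier characters are handled, escapes are ignored, and it is not the rustc implementation. It only shows the idea that a closing quote after the name makes the whole thing a quoted form, with any trailing identifier taken as a suffix.

```rust
/// Outcome of looking at input that starts with a single quote.
#[derive(Debug, PartialEq)]
enum QuoteForm {
    /// A quoted form such as 'a' or 'ab'; `suffix` is the identifier
    /// (possibly empty) that directly follows the closing quote.
    CharLike { codepoints: usize, suffix: String },
    /// A lifetime or label such as 'a, with no closing quote after the name.
    Lifetime(String),
}

/// Hypothetical, simplified classifier (ASCII identifiers only, no escapes).
fn classify(src: &str) -> Option<QuoteForm> {
    let rest = src.strip_prefix('\'')?;
    let is_ident = |c: char| c.is_ascii_alphanumeric() || c == '_';
    // Identifier-like run right after the opening quote.
    let name: String = rest.chars().take_while(|&c| is_ident(c)).collect();
    let after_name = &rest[name.len()..];
    match after_name.strip_prefix('\'') {
        // A closing quote follows: treat it as a quoted form plus suffix.
        Some(tail) => Some(QuoteForm::CharLike {
            codepoints: name.chars().count(),
            suffix: tail.chars().take_while(|&c| is_ident(c)).collect(),
        }),
        // No closing quote: a lifetime or label, if there is a name at all.
        None if !name.is_empty() => Some(QuoteForm::Lifetime(name)),
        None => None,
    }
}
```

Under this model, 'ab'c is a quoted form with two codepoints and suffix c, never a pair of lifetimes, which matches the error message quoted above.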

@mattheww
Contributor

A form like b"\u{00a0}" is rejected at lexing time ("unicode escape in byte string").

But as it doesn't match either BYTE_STRING_LITERAL or RESERVED_TOKEN_DOUBLE_QUOTE, the current description says there's a valid tokenisation as the identifier b followed by "\u{00a0}".

So if we keep on with the current mechanism for documenting such rejected tokens, I think we'd need yet more reserved forms.

There are probably other similar cases. I think after rust-lang/rust#119172 a C string literal containing a NUL is one.
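
For contrast with the rejected b"\u{00a0}" form, a short Rust sketch of the escapes that are accepted in each kind of literal. The function names are hypothetical and exist only for this example.

```rust
/// A byte string accepts `\x` escapes for any byte value,
/// including values above 0x7F.
fn byte_string_escape() -> &'static [u8] {
    b"\xa0"
}

/// In an ordinary string literal, `\u{00a0}` denotes U+00A0,
/// which is stored as two bytes in UTF-8.
fn unicode_escape_bytes() -> &'static [u8] {
    "\u{00a0}".as_bytes()
}
```

The two literals produce different bytes (one byte versus two), which shows why tokenising b"\u{00a0}" as the identifier b followed by a string literal would be a surprising result.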

@mattheww mattheww mentioned this issue Jan 28, 2024