Reject invalid string literal whitespace on unescape #793

Merged: jonmeow merged 8 commits into carbon-language:trunk from string-reject on Aug 30, 2021

@jonmeow (Contributor) commented Aug 27, 2021

This is based on the discussion on #732, which suggested that we should probably parse the invalid whitespace and then reject it as part of string validation, rather than having different parses. I worry that the question of "how is this parsed" may lead to subtly unexpected results if we aren't consistent, so I'm moving the logic from the lexer to the unescape library (and also adjusting the list of rejected whitespace).
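
A minimal Python sketch of the "parse first, reject during validation" idea described above. It is an illustration only, not the unescape library from this PR: the function name `unescape_string_literal`, the exception type, the escape table, and the choice of a raw tab as the rejected character are all assumptions made for the example.

```python
class InvalidStringLiteral(ValueError):
    """Raised when an already-lexed string literal fails validation."""


# Assumption: the PR only says the rejected list was adjusted; a raw tab is
# used here purely as an example of "invalid whitespace".
REJECTED_RAW_WHITESPACE = {"\t"}

# Assumption: a small, illustrative escape table.
SIMPLE_ESCAPES = {"n": "\n", "t": "\t", "\\": "\\", '"': '"'}


def unescape_string_literal(body):
    """Unescape the text between the quotes of a lexed string literal.

    The lexer is assumed to have accepted the literal already, so invalid
    raw whitespace is reported here instead of changing how the source lexes.
    """
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch in REJECTED_RAW_WHITESPACE:
            raise InvalidStringLiteral(
                f"invalid whitespace {ch!r} at offset {i}; use an escape sequence"
            )
        if ch == "\\":
            # An escape needs a following character that we know about.
            if i + 1 >= len(body) or body[i + 1] not in SIMPLE_ESCAPES:
                raise InvalidStringLiteral(f"invalid escape sequence at offset {i}")
            out.append(SIMPLE_ESCAPES[body[i + 1]])
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

With this shape, the escaped form `a\tb` unescapes to a string containing a tab, while a literal that contains a raw tab character is still lexed as one token and then rejected with a targeted error.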

@jonmeow jonmeow requested a review from a team as a code owner August 27, 2021 23:20
@google-cla google-cla bot added the "cla: yes" label Aug 27, 2021
@@ -65,7 +65,7 @@ UNDERSCORE "_"
identifier [A-Za-z_][A-Za-z0-9_]*
sized_type_literal [iuf][1-9][0-9]*
integer_literal [0-9]+
string_literal \"([^\\\"\n\t]|\\.)*\"
Contributor

Sorry, I missed this in #732:

I think we do want to disallow newlines here; both the toolchain implementation and #199 do that (#199 allows "characters other [...] vertical whitespace"). This is important in making """ string literals work: we want

var x: String = """
""";

to unambiguously be a block string literal, not three simple string literals "", "\n", "".

To match #199, we should disallow \v, \f, and \r here too.
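
To make the concern concrete, here is a small Python sketch. It uses Python's `re` module rather than the flex rule itself, and the two patterns are assumptions modeled on the `string_literal` rule in the diff: one that lets raw newlines into simple string literals, and one that excludes `\n`, `\v`, `\f`, and `\r`.

```python
import re

# Assumed patterns, for illustration only.
# Permissive: any character except backslash and quote, or an escape pair.
PERMISSIVE = re.compile(r'"(?:[^\\"]|\\.)*"')
# Strict: additionally excludes raw vertical whitespace.
STRICT = re.compile(r'"(?:[^\\"\n\v\f\r]|\\.)*"')


def split_into_simple_strings(pattern, text):
    """Cover `text` with back-to-back simple string literals.

    Returns the list of lexemes, or None if some position cannot start a
    simple string literal under `pattern`.
    """
    tokens = []
    pos = 0
    while pos < len(text):
        m = pattern.match(text, pos)
        if m is None:
            return None
        tokens.append(m.group())
        pos = m.end()
    return tokens


source = '"""\n"""'
print(split_into_simple_strings(PERMISSIVE, source))  # ['""', '"\n"', '""']
print(split_into_simple_strings(STRICT, source))      # None
```

Under the permissive pattern the input can be read as three simple string literals; under the strict pattern it cannot, which leaves the block string literal as the only reading.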

Contributor Author

Done, added a small change to the design phrasing.

Note that if we don't have implicit string concatenation (which I think was the plan?), this is unambiguous.

Contributor

Yes, it's unambiguous at the parsing level either way, and at the lexing level, a max munch rule would do the right thing here (we'd lex the """\n""" token because it's longer). I think the best argument for the change is probably that it improves the behavior when a " is accidentally omitted from the end of a string literal.
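
Both points in this comment can be sketched in a few lines of Python. This is a hedged illustration, not the project's lexer: the token patterns (including the shape of the block string rule) and the helper `longest_match` are assumptions made for the example.

```python
import re

# Assumed token shapes, for illustration only.
BLOCK_STRING = re.compile(r'"""\n(?:.|\n)*?"""')
PERMISSIVE_SIMPLE = re.compile(r'"(?:[^\\"]|\\.)*"')
STRICT_SIMPLE = re.compile(r'"(?:[^\\"\n\v\f\r]|\\.)*"')


def longest_match(rules, text, pos=0):
    """Max munch: return (rule_name, lexeme) for the longest match at `pos`."""
    best = None
    for name, pattern in rules:
        m = pattern.match(text, pos)
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best


# 1. Max munch picks the block string even when simple strings allow newlines,
#    because the block string lexeme (7 characters) beats '""' (2 characters).
rules = [("simple_string", PERMISSIVE_SIMPLE), ("block_string", BLOCK_STRING)]
print(longest_match(rules, '"""\n"""'))  # ('block_string', '"""\n"""')

# 2. A missing closing quote on the first line.
source = 'var a: String = "abc;\nvar b: String = "def";\n'
start = source.index('"')
# The permissive rule runs on to the next line's opening quote.
print(PERMISSIVE_SIMPLE.match(source, start).group())
# The strict rule fails at the first line, so the error stays local to it.
print(STRICT_SIMPLE.match(source, start))  # None
```

The second case is the practical benefit: once simple string literals cannot span lines, an unterminated literal is diagnosed on the line where the quote is missing instead of swallowing the start of the next statement.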

@jonmeow jonmeow requested a review from a team as a code owner August 30, 2021 16:22
@jonmeow jonmeow requested a review from zygoloid August 30, 2021 17:58
Contributor

@zygoloid zygoloid left a comment

Thanks!

@jonmeow jonmeow merged commit 36c9b28 into carbon-language:trunk Aug 30, 2021
@jonmeow jonmeow deleted the string-reject branch August 30, 2021 22:22
chandlerc pushed a commit that referenced this pull request Jun 28, 2022