Reject invalid string literal whitespace on unescape #793

jonmeow · 2021-08-27T23:20:33Z

This is based on the discussion in #732: we should probably parse the invalid whitespace as part of the string literal and then reject it during string validation, rather than having it produce a different parse. I worry that the question of "how is this parsed?" may lead to subtly unexpected results if we aren't consistent, so I'm moving the check from the lexer to the unescape library (and also adjusting the list of rejected whitespace).
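To illustrate the "parse first, reject during validation" approach, here is a minimal C++ sketch. It is not the actual unescape library: the name `UnescapeSketch`, the handled escape subset, and the choice of rejected characters (the raw `\n` and `\t` that the old lexer pattern excluded) are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical stand-in for an unescape step; not the real library API.
// Returns the unescaped contents, or std::nullopt if the literal is invalid.
std::optional<std::string> UnescapeSketch(std::string_view contents) {
  std::string result;
  for (std::size_t i = 0; i < contents.size(); ++i) {
    char c = contents[i];
    // Validation happens here rather than in the lexer. Assumption: reject
    // the raw characters the old lexer pattern excluded (\n and \t).
    if (c == '\n' || c == '\t') {
      return std::nullopt;
    }
    if (c != '\\') {
      result.push_back(c);
      continue;
    }
    // Handle a small, illustrative subset of escape sequences.
    if (++i == contents.size()) {
      return std::nullopt;  // Dangling backslash.
    }
    switch (contents[i]) {
      case 'n': result.push_back('\n'); break;
      case 't': result.push_back('\t'); break;
      case '"': result.push_back('"'); break;
      case '\\': result.push_back('\\'); break;
      default: return std::nullopt;  // Unknown escape in this sketch.
    }
  }
  return result;
}
```

With this shape, the lexer can accept a literal containing a raw tab as a single token, and the error is reported when the contents are unescaped, so the token boundaries do not depend on whether the whitespace is valid.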

zygoloid · 2021-08-28T02:04:18Z

executable_semantics/syntax/lexer.lpp

@@ -65,7 +65,7 @@ UNDERSCORE        "_"
 identifier    [A-Za-z_][A-Za-z0-9_]*
 sized_type_literal [iuf][1-9][0-9]*
 integer_literal   [0-9]+
-string_literal    \"([^\\\"\n\t]|\\.)*\"


Sorry, I missed this in #732:

I think we do want to disallow newlines here; both the toolchain implementation and #199 do that (#199 allows "characters other [...] vertical whitespace"). This is important in making """ string literals work: we want

var x: String = """
""";

to unambiguously be a block string literal, not three simple string literals "", "\n", "".

To match #199, we should disallow \v, \f, and \r here too.
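As a rough sketch of what this means at the lexing level, the following C++ helper stops a simple string literal at any vertical whitespace character. `IsVerticalWhitespace` and `ScanSimpleString` are hypothetical names, not code from the lexer, and the sketch assumes that horizontal whitespace such as a raw tab is left to a later validation step.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// The vertical whitespace characters named in the comment above.
bool IsVerticalWhitespace(char c) {
  return c == '\n' || c == '\v' || c == '\f' || c == '\r';
}

// Hypothetical scanner: given text starting at an opening quote, returns the
// length of the simple string literal, or std::nullopt if vertical whitespace
// or the end of input is reached before the closing quote.
std::optional<std::size_t> ScanSimpleString(std::string_view text) {
  if (text.empty() || text[0] != '"') return std::nullopt;
  for (std::size_t i = 1; i < text.size(); ++i) {
    char c = text[i];
    if (IsVerticalWhitespace(c)) return std::nullopt;
    if (c == '\\') {
      // An escape consumes the next character, which must also stay on
      // this line.
      if (i + 1 >= text.size() || IsVerticalWhitespace(text[i + 1])) {
        return std::nullopt;
      }
      ++i;
      continue;
    }
    if (c == '"') return i + 1;
  }
  return std::nullopt;
}
```

Because the scan gives up at the end of the line, a literal with a missing closing quote fails on its own line instead of swallowing the lines that follow.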

Done, added a small change to the design phrasing.

Note, if we don't have implicit string concatenation (which I think was the plan?) this is unambiguous.

Yes, it's unambiguous at the parsing level either way, and at the lexing level, a max munch rule would do the right thing here (we'd lex the """\n""" token because it's longer). I think probably the best argument for the change is to improve the behavior when a " is accidentally missed from the end of a string literal.
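The max munch point can be sketched in C++ as follows. The matchers are hypothetical and heavily simplified (no escape handling), and the simple-string matcher deliberately does not reject newlines, to mirror the scenario being discussed.

```cpp
#include <algorithm>
#include <cstddef>
#include <string_view>

// Hypothetical matcher: length of a simple string literal at the start of
// `text`, or 0 if none. Escapes are ignored to keep the sketch short, and
// newlines are not rejected.
std::size_t MatchSimpleString(std::string_view text) {
  if (text.empty() || text[0] != '"') return 0;
  for (std::size_t i = 1; i < text.size(); ++i) {
    if (text[i] == '"') return i + 1;
  }
  return 0;
}

// Hypothetical matcher: length of a block string literal (""" ... """) at
// the start of `text`, or 0 if none.
std::size_t MatchBlockString(std::string_view text) {
  constexpr std::string_view kFence = "\"\"\"";
  if (text.substr(0, kFence.size()) != kFence) return 0;
  std::size_t close = text.find(kFence, kFence.size());
  if (close == std::string_view::npos) return 0;
  return close + kFence.size();
}

// Max munch: of the rules that match at this position, take the longest.
std::size_t LongestStringToken(std::string_view text) {
  return std::max(MatchSimpleString(text), MatchBlockString(text));
}
```

For the input `"""`, newline, `""";` the simple-string rule matches only the first two quotes (length 2), while the block-string rule matches all seven characters, so the longer block-string token is chosen.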

zygoloid

Thanks!

zygoloid · 2021-08-30T19:10:01Z

Yes, it's unambiguous at the parsing level either way, and at the lexing level, a max munch rule would do the right thing here (we'd lex the """\n""" token because it's longer). I think probably the best argument for the change is to improve the behavior when a " is accidentally missed from the end of a string literal.

Reject invalid whitespace on unescape

53cb7cf

jonmeow requested review from zygoloid and geoffromer August 27, 2021 23:20

jonmeow requested a review from a team as a code owner August 27, 2021 23:20

google-cla bot added the "cla: yes" label (PR meets CLA requirements according to bot) Aug 27, 2021

zygoloid reviewed Aug 28, 2021

jonmeow added 3 commits August 30, 2021 16:14

Switch vertical whitespace parsing

37ff1fa

Merge branch 'trunk' into string-reject

4812e02

Fix design phrasing

63bd70b

jonmeow requested a review from a team as a code owner August 30, 2021 16:22

jonmeow added 2 commits August 30, 2021 16:23

Link whitespace

f3496fb

Classify horizontal and vertical whitespace

94a7ca2

jonmeow requested a review from zygoloid August 30, 2021 17:58

zygoloid approved these changes Aug 30, 2021

jonmeow added 2 commits August 30, 2021 21:25

merge

c55dbc8

merge

41514ed

jonmeow merged commit 36c9b28 into carbon-language:trunk Aug 30, 2021

jonmeow deleted the string-reject branch August 30, 2021 22:22
