Reject invalid string literal whitespace on unescape #793
@@ -65,7 +65,7 @@ UNDERSCORE "_"
identifier [A-Za-z_][A-Za-z0-9_]*
sized_type_literal [iuf][1-9][0-9]*
integer_literal [0-9]+
string_literal \"([^\\\"\n\t]|\\.)*\"
Sorry, I missed this in #732:
I think we do want to disallow newlines here; both the toolchain implementation and #199 do that (#199 allows "characters other [...] vertical whitespace"). This is important in making """ string literals work: we want
var x: String = """
""";
to unambiguously be a block string literal, not three simple string literals "", "\n", "".
To match #199, we should disallow \v\f\r here too.
Done, added a small change to the design phrasing.
Note, if we don't have implicit string concatenation (which I think was the plan?) this is unambiguous.
Yes, it's unambiguous at the parsing level either way, and at the lexing level, a max munch rule would do the right thing here (we'd lex the """\n""" token because it's longer). I think probably the best argument for the change is to improve the behavior when a " is accidentally missed from the end of a string literal.
Thanks!
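The two points in this thread (max munch picking the block string, and the missing-quote case) can be illustrated with a rough Python sketch. The patterns and the `longest_match` helper are hypothetical stand-ins written for this illustration; they are not the project's lexer rules.

```python
import re

# Illustrative patterns only (assumptions, not the real lexer rules).
# The permissive simple-string pattern lets a raw newline through;
# the strict one rejects it, as suggested in the review.
PERMISSIVE = {
    "block_string": re.compile(r'"""\n(?:[^\\]|\\.)*?"""', re.DOTALL),
    "simple_string": re.compile(r'"(?:[^\\"]|\\.)*"', re.DOTALL),
}
STRICT = {
    "block_string": PERMISSIVE["block_string"],
    "simple_string": re.compile(r'"(?:[^\\"\n]|\\.)*"'),
}


def longest_match(patterns, text, pos=0):
    """Return (name, lexeme) for the longest match at pos, or None."""
    best = None
    for name, pattern in patterns.items():
        m = pattern.match(text, pos)
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best


# Max munch: the block string wins over the shorter "" simple string.
block_source = '"""\n"""'
block_result = longest_match(PERMISSIVE, block_source)

# Missing closing quote on the first line.
broken_source = 'var a: String = "oops;\nvar b: String = "ok";'
start = broken_source.index('"')
permissive_result = longest_match(PERMISSIVE, broken_source, start)
strict_result = longest_match(STRICT, broken_source, start)
```

With the permissive pattern, the unterminated literal silently swallows the start of the next line; with the strict pattern nothing matches at that position, so an error can be reported on the line where the quote is actually missing.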
This is based on discussion on #732: that we should probably parse the invalid whitespace, then reject it as part of string validation, rather than having different parses. I worry the question of "how is this parsed" may lead to subtly unexpected results if we aren't consistent, so I'm switching the logic from the lexer to the unescape library (and also adjusting the list of rejected whitespace).
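As a rough sketch of the approach described above (lex permissively, then reject invalid whitespace while unescaping), something like the following could work. The function name, the escape table, and the rejected set are all assumptions for illustration; the rejected set here follows the review's reading of #199 (vertical whitespace) and may not match what the PR finally lands on.

```python
# Assumption: raw vertical whitespace is rejected; the PR's actual list may differ.
REJECTED_WHITESPACE = "\n\v\f\r"

# Assumption: a small illustrative escape table, not the full set of escapes.
SIMPLE_ESCAPES = {"n": "\n", "t": "\t", '"': '"', "\\": "\\"}


def unescape_string_literal(body):
    """Unescape the text between the quotes; return None if it is invalid.

    The lexer is assumed to have already accepted the token, so every
    string literal gets the same parse and validation happens in one place.
    """
    result = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch in REJECTED_WHITESPACE:
            # Raw whitespace of this kind must be written as an escape.
            return None
        if ch == "\\":
            i += 1
            if i >= len(body) or body[i] not in SIMPLE_ESCAPES:
                return None  # dangling backslash or unknown escape
            result.append(SIMPLE_ESCAPES[body[i]])
        else:
            result.append(ch)
        i += 1
    return "".join(result)
```

Returning `None` keeps the sketch short; a real implementation would more likely report a diagnostic with the offending position.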