Quoted identifiers and invalid lexing #25

alexcrichton · 2024-05-25T22:01:33Z

In #23 a test was added along the lines of:

(assert_malformed (module quote "(func $\"a\tb\")") "empty identifier")

This is asserting that a string-based identifier with invalid characters in the string is invalid, but I'm opening this issue to clarify the error emitted here. In the prototype implementation in wasm-tools of #23 the error looks like:

invalid character in string '\t'
                 --> <anon>:1:10
                  |
                1 | (func $"a    b")
                  |          ^

Specifically I'm curious about the lexical format here. Despite both implementations returning an error there's been bits about subtle differences between lexers in the past so I want to smooth this out ideally. I was under the impression that the token was going to be $"a\tb" (with the \t expanded) so during the lexing phase after seeing the $ the adjacent " means "activate the string parser". In doing so an error is generated. It looks like the spec interpreter ignores the string error and assumes the token ends just before the string, but in wasm-tools I was thinking of reporting the string parsing error.

Is there a specific behavior desired here? Is there a downside to reporting the string-parse-error?

The text was updated successfully, but these errors were encountered:

alexcrichton · 2024-05-25T22:05:09Z

(apologies I hit enter too soon when submitting this, I've now updated the description with the full text intended)

rossberg · 2024-05-25T22:36:35Z

The error you get depends on the implementation strategy. Either is fine. There is no requirement to report the same errors as the reference interpreter.

The reference interpreter does not switch to a different lexer for strings. Instead, well-formed strings are directly specified by a comprehensive regexp that follows the spec literally. Since this one isn't a lexically well-formed string, it does not match that regexp and hence the sequence isn't recognised as a $ followed by a string at all.

For a production implementation, a more specific error is probably preferable.

alexcrichton · 2024-05-25T22:54:01Z

Ok, thanks for confirming, I'll close this as answered.

tlively · 2024-05-27T03:50:00Z

I must be missing something, but why does this identifier cause an error? \t is a perfectly valid stringchar.

tlively · 2024-05-27T05:10:38Z

Ah, right, the \t is expanded into a literal tab when the string containing the module source is parsed, and a literal tab character is not valid in a string. It would have to be \\t to make it valid.

alexcrichton closed this as completed May 25, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Quoted identifiers and invalid lexing #25

Quoted identifiers and invalid lexing #25

alexcrichton commented May 25, 2024 •

edited

Loading

alexcrichton commented May 25, 2024

rossberg commented May 25, 2024

alexcrichton commented May 25, 2024

tlively commented May 27, 2024

tlively commented May 27, 2024

Quoted identifiers and invalid lexing #25

Quoted identifiers and invalid lexing #25

Comments

alexcrichton commented May 25, 2024 • edited Loading

alexcrichton commented May 25, 2024

rossberg commented May 25, 2024

alexcrichton commented May 25, 2024

tlively commented May 27, 2024

tlively commented May 27, 2024

alexcrichton commented May 25, 2024 •

edited

Loading