Rust lexer has to be able to handle singular quotes '/" #6091
In a valid Rust program, quotes occur in pairs to form character, string, or byte string literals. While the program is being typed, however, the lexer must still be able to handle unpaired quotes. In that case the lexer has to produce a token so that the NetBeans infrastructure relying on the token stream can work.
It is not a problem that the resulting token stream can't be parsed.
Concrete cases:
Removing parts of a single-quote structure. The pipe symbols are not part of the code. The area from the first pipe to the second pipe is selected and removed.
    fn main() { println!('|x'); } |
Entering a single quote ('). The pipe symbol is not part of the code; a single quote is entered at that position.
    fn main() { println!(|
This partially reverts b62889b and introduces a different fix that handles the "lonely" characters as explicit tokens.
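To illustrate the idea of "lonely" quotes becoming explicit tokens, here is a minimal hand-written sketch in Java. It is not the ANTLR grammar from this PR: the class name, the token kinds, and the pairing rule are all hypothetical, and it deliberately ignores lifetimes and byte/raw string prefixes. The point it shows is only that every input character ends up in some token, even when a quote has no partner.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the NetBeans/ANTLR lexer: an unpaired quote
// becomes a one-character error token instead of a lexer failure.
final class LonelyQuoteLexerSketch {

    enum Kind { STRING_LITERAL, CHAR_LITERAL, ERROR_DOUBLE_QUOTE, ERROR_SINGLE_QUOTE, OTHER }

    static final class Token {
        final Kind kind;
        final String text;
        Token(Kind kind, String text) { this.kind = kind; this.text = text; }
        @Override public String toString() { return kind + "[" + text + "]"; }
    }

    static List<Token> lex(String src) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            Kind kind;
            int end;
            if (c == '"' || c == '\'') {
                int close = findClosingQuote(src, i);
                if (close >= 0) {
                    // Paired quote: treat the whole span as a literal.
                    kind = (c == '"') ? Kind.STRING_LITERAL : Kind.CHAR_LITERAL;
                    end = close + 1;
                } else {
                    // Lonely quote: emit an explicit error token of length 1.
                    kind = (c == '"') ? Kind.ERROR_DOUBLE_QUOTE : Kind.ERROR_SINGLE_QUOTE;
                    end = i + 1;
                }
            } else {
                // Everything else is lumped together up to the next quote.
                kind = Kind.OTHER;
                end = i + 1;
                while (end < src.length() && src.charAt(end) != '"' && src.charAt(end) != '\'') {
                    end++;
                }
            }
            tokens.add(new Token(kind, src.substring(i, end)));
            i = end;
        }
        return tokens;
    }

    // Returns the index of the matching quote, or -1 if there is none.
    static int findClosingQuote(String src, int open) {
        char quote = src.charAt(open);
        for (int i = open + 1; i < src.length(); i++) {
            char c = src.charAt(i);
            if (c == '\\') {
                i++; // skip the escaped character
            } else if (c == quote) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // State of the first concrete case after the selection is removed.
        System.out.println(lex("fn main() { println!('"));
    }
}
```

Because the error tokens have their own kinds, an editor could still colour them distinctly while the rest of the token stream stays usable.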
Hey @matthiasblaesing, I could use some clarification here. I thought that the single quote (') deals with Rust lifetime rules: https://doc.rust-lang.org/nomicon/lifetimes.html Does this agree with your interpretation?
Totally not a Rust expert here, but to me the specification implies that
This is decidable with a lookahead of 2. This should be trivial for the code generation of ANTLR.
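As a rough illustration of what a lookahead of 2 could decide, the following Java sketch classifies a single quote by inspecting only the next two characters. The rule it encodes is an assumption on my side, not text from the Rust specification or from this PR: if the second character after the quote is a closing quote, treat it as a character literal; otherwise, if an identifier-like character follows, treat it as a lifetime or label; otherwise the quote is "lonely". Escape sequences such as '\n' are intentionally left out, and all names are hypothetical.

```java
// Hypothetical sketch: classify a single quote with two characters of lookahead.
final class SingleQuoteLookaheadSketch {

    enum QuoteKind { CHAR_LITERAL, LIFETIME_OR_LABEL, LONELY_QUOTE }

    // pos must be the index of a single quote in src.
    static QuoteKind classify(String src, int pos) {
        int la1 = pos + 1 < src.length() ? src.charAt(pos + 1) : -1;
        int la2 = pos + 2 < src.length() ? src.charAt(pos + 2) : -1;

        // 'x' : one plain character followed by the closing quote.
        if (la1 >= 0 && la1 != '\'' && la1 != '\\' && la2 == '\'') {
            return QuoteKind.CHAR_LITERAL;
        }
        // 'a  : quote followed by an identifier start, no closing quote after it.
        if (la1 >= 0 && (Character.isLetter(la1) || la1 == '_')) {
            return QuoteKind.LIFETIME_OR_LABEL;
        }
        // Anything else, including end of input, is an unpaired quote.
        return QuoteKind.LONELY_QUOTE;
    }

    public static void main(String[] args) {
        String lifetime = "&'a str";
        System.out.println(classify("'x');", 0));
        System.out.println(classify(lifetime, lifetime.indexOf('\'')));
        System.out.println(classify("println!('", 9));
    }
}
```

A grammar-based lexer would express the same decision through rule ordering and predicates rather than an explicit function; the sketch only makes the two-character window visible.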
I'll need a few days to look at this one, I'm afraid...
As a quick test, I cloned and opened the regex project (https://github.com/rust-lang/regex) and browsed some files. @matthiasblaesing, when opening this file: https://github.com/rust-lang/regex/blob/master/regex-syntax/src/unicode_tables/age.rs the lexer reported a syntax error, and the whole editor system broke (the Swing EDT behaves badly after that). This happens in master as well. The lexer reports an error, and this is catastrophic for the ANTLR4 Lexer Support module (the editor freezes). We may want to investigate what's going on in ANTLR4 Lexer Support (how to handle failures in the ANTLR4 lexers). I can't tell whether we want to merge this one before that or not.
Thank you for the test file. This is indeed a worst case. I think that we are seeing a problem in the NetBeans lexer infrastructure when characters outside the Basic Multilingual Plane are encountered. One problematic character in that file is the code point 66349. That code point is: https://www.compart.com/de/unicode/U+1032D In Java we normally deal with It seems that at least one problem is that the ANTLR lexer does not expect surrogate pairs, so we would need to recombine them first and feed the resulting int into the lexer. A quick test looks promising. However, the rendering is still off. I'll look further into this.
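The recombination step mentioned above can be sketched with standard `java.lang.Character` methods. This is only an illustration of turning UTF-16 surrogate pairs back into code points before handing them to a lexer; the class and method names are hypothetical, and how the NetBeans lexer input or the ANTLR character stream would actually be wired up is not shown.

```java
import java.util.Arrays;

// Hypothetical sketch: recombine UTF-16 surrogate pairs into int code points.
final class CodePointFeedSketch {

    static int[] toCodePoints(CharSequence text) {
        int[] buf = new int[text.length()];
        int n = 0;
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (Character.isHighSurrogate(c)
                    && i + 1 < text.length()
                    && Character.isLowSurrogate(text.charAt(i + 1))) {
                // Two chars form one code point outside the BMP.
                buf[n++] = Character.toCodePoint(c, text.charAt(i + 1));
                i += 2;
            } else {
                // BMP character (or an unpaired surrogate, passed through as-is).
                buf[n++] = c;
                i++;
            }
        }
        return Arrays.copyOf(buf, n);
    }

    public static void main(String[] args) {
        // U+1032D is encoded in UTF-16 as the pair D800 DF2D.
        System.out.println(Arrays.toString(toCodePoints("a\uD800\uDF2D")));
    }
}
```

One consequence worth keeping in mind is that token lengths counted in code points no longer match lengths counted in `char`s, so offsets would need to be translated consistently in one direction or the other.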
I'll take a look at how ANTLR4 Support handles exceptions as time permits. We may want to return some sort of "error token" or similar in ANTLR4 Support in these cases.
I think we can safely merge this one. Creating "error" tokens to handle mismatched single and double quotes is clever. This is the way to go, I think. Good to see them categorized in the Error category in RustTokenID.
Regarding Unicode support, it seems ANTLR 4 has specific ways to handle these codes (see https://github.com/antlr/antlr4/blob/dev/doc/unicode.md). We may want to tackle this in another PR, if we want to support this.
Thank you all for the review. I agree that the lexer problem with characters outside the BMP should be tackled outside this PR.