Better Error Message When Parsing Greek Question Mark (and similar confusing characters) #25957
Comments
A useful resource might be the list of Unicode "confusables": http://www.unicode.org/Public/security/revision-06/confusables.txt Going only by printable ASCII punctuation, the list contains 486 possibly confusing glyphs. For reference, the command I used to work that out was:
$ egrep -v '^[ ]*#' confusables.txt | egrep -v '^[ ]+$|^$' | egrep '00(2[1-9A-F]|3[A-F]|40|5[BCDF]|7[BCD])' | egrep ';[^0-9A-F]00.. ;' | egrep -v '^00' > confusables-ascii-punc.txt
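For anyone who would rather do this filtering in Rust than with egrep, here is a rough sketch. It assumes each data line has the layout `SOURCE ; TARGET ; TYPE # comment`, with code points written in hex. The function names are made up for illustration, and the filter (non-ASCII source, single ASCII punctuation target) is only in the spirit of the pipeline above, not an exact equivalent, so the resulting count may differ.

```rust
/// Parse one data line of a confusables-style file into
/// (source code points, target code points).
/// Assumed layout: `SOURCE ; TARGET ; TYPE # comment`, hex code points
/// separated by spaces. Returns None for comments, blank or malformed lines.
fn parse_confusable_line(line: &str) -> Option<(Vec<u32>, Vec<u32>)> {
    // Everything after the first '#' is a comment.
    let data = line.split('#').next()?.trim();
    if data.is_empty() {
        return None;
    }
    let mut fields = data.split(';').map(str::trim);
    let source = parse_code_points(fields.next()?)?;
    let target = parse_code_points(fields.next()?)?;
    Some((source, target))
}

/// Parse a whitespace-separated list of hex code points, e.g. "037E".
fn parse_code_points(field: &str) -> Option<Vec<u32>> {
    let points: Option<Vec<u32>> = field
        .split_whitespace()
        .map(|hex| u32::from_str_radix(hex, 16).ok())
        .collect();
    // An empty field is treated as malformed.
    points.filter(|p| !p.is_empty())
}

/// True when a source containing a non-ASCII code point maps to exactly
/// one printable ASCII punctuation character.
fn maps_to_ascii_punctuation(source: &[u32], target: &[u32]) -> bool {
    let non_ascii_source = source.iter().any(|&cp| cp > 0x7F);
    let single_punct_target = match target {
        [cp] => char::from_u32(*cp).map_or(false, |c| c.is_ascii_punctuation()),
        _ => false,
    };
    non_ascii_source && single_punct_target
}
```

Running `parse_confusable_line` over every line of the file and keeping the pairs for which `maps_to_ascii_punctuation` is true would yield a candidate table of look-alike punctuation.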
Note that the error is in the lexer, not the parser, so errors of the type "expected ... found" are right out. The parser only knows what to expect once everything has been converted into tokens. The most we can do here is make the lexer detect the confusable Unicode characters.
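To make the lexer-side idea concrete, here is a minimal sketch of a lookup that turns an unrecognized character into a friendlier description. The table contents, the function name `describe_unknown_char`, and the message wording are all hypothetical. This is not rustc's actual code or output, just an illustration of checking a character against a small list of known look-alikes.

```rust
/// Hypothetical table of look-alikes:
/// (confusable char, its Unicode name, ASCII char it resembles, ASCII name).
const CONFUSABLES: &[(char, &str, char, &str)] = &[
    ('\u{37E}', "Greek Question Mark", ';', "Semicolon"),
    ('\u{2212}', "Minus Sign", '-', "Hyphen-Minus"),
    ('\u{FF0C}', "Fullwidth Comma", ',', "Comma"),
];

/// Build an illustrative message for a character the lexer could not
/// turn into a token. The wording is an assumption, not rustc's output.
fn describe_unknown_char(c: char) -> String {
    // Always show the code point in the human-readable U+XXXX form.
    let mut msg = format!("unrecognized character U+{:04X}", c as u32);
    // If the character is a known look-alike, say what it resembles.
    if let Some(&(_, name, ascii, ascii_name)) =
        CONFUSABLES.iter().find(|entry| entry.0 == c)
    {
        msg.push_str(&format!(
            "; it looks like '{}' ({}), but it is {}",
            ascii, ascii_name, name
        ));
    }
    msg
}
```

A linear scan is enough for a table of a few hundred entries, since the lookup only happens on the error path.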
Willing to mentor this. You basically need to maintain an array of confusable characters and check against them where this error is emitted in The error would finally look something like:
I'll take it!
Let me know if you need help or clarification!
Current:
Do you know what the error message is here? It's that I'm using the Greek Question Mark instead of a semicolon, causing a parse error. And while there is an error message, it's extremely opaque.
A better error message would actually print the token that is in error directly in the error message, e.g. as @niconii suggested:
Using the human-readable U+37E over the Rust encoding thing (\u{37E}) would definitely help.
In the specific case of the Greek Question Mark, it looks exactly like a semicolon, and symbols that look like other symbols are pretty common in Unicode, but perfectly enumerable. If we had a table mapping such Unicode code points to the symbols they are mistaken for, we could display an even better notice message: