Add Unicode escaping #2461
Comments
I think this should only be added to
There is, however, one point that may need some discussion: invalid UTF-8 code points. Should we replace invalid UTF-8 sequences like surrogates? My guess would be yes, but this would mean that valid strings in languages like Java or the .NET languages might not be understood in Nit.
Other than that, since Unicode is limited to U+10FFFF, no \U escape sequence should support more than that.
Tool-wise, there will probably be one minor modification to the grammar if we want to support this kind of sequence, but this should not be too much of a hassle to implement. A PR will likely follow in the next couple of days.
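One problem I think we have to consider is that people are used to \u and \U to always take 4 digits (so they are limited to BMP or UCS-2/UTF-16 wydes). It is especially important for strings like "1\u00A0000\u00A0000" (1 000 000).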
This is linked with what I pointed out yesterday. IMO the spec should be
something along the lines of:
* Allow \(u|U)[0-9A-Fa-f]{1,6}
* Disallow characters above the Unicode maximum (0x10FFFF)
The only question remaining is what to do with surrogate pairs: should we
allow them?
In some languages (C# being one example, if my memory serves me right), \u
and \U have different masks: the capital one expects 8 digits while the
other expects 4.
This feels like an example of what not to do, if you ask me.
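
To make the proposed rule concrete, here is a minimal sketch in Python (not Nit, and not the actual implementation). The function name `unescape` and the `allow_surrogates` flag are hypothetical; the flag only exists because the surrogate question is still open in this thread. The sketch does not handle an escaped backslash in front of `u`, which a real lexer would have to.

```python
import re

MAX_CODE_POINT = 0x10FFFF  # Unicode maximum, as stated in the proposal

# Pattern from the proposal: \u or \U followed by 1 to 6 hex digits.
ESCAPE = re.compile(r"\\[uU]([0-9A-Fa-f]{1,6})")


def unescape(text, allow_surrogates=False):
    """Expand Unicode escapes in `text` (sketch of the proposed rule)."""

    def decode(match):
        value = int(match.group(1), 16)
        # Rule 2 of the proposal: nothing above the Unicode maximum.
        if value > MAX_CODE_POINT:
            raise ValueError(f"escape {match.group(0)} is above U+10FFFF")
        # Open question in the thread: surrogates are rejected unless
        # the (hypothetical) flag says otherwise.
        if 0xD800 <= value <= 0xDFFF and not allow_surrogates:
            raise ValueError(f"escape {match.group(0)} is a surrogate")
        return chr(value)

    return ESCAPE.sub(decode, text)
```

With this rule, `unescape(r"\u41")` and `unescape(r"\U000041")` both give `"A"`, while `unescape(r"\U110000")` raises.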
On 24 May 2017 10:22 am, "Jean-Christophe Beaupré" wrote:
One problem I think we have to consider is that people are used to \u and
\U to always take 4 digits (so they are limited to BMP or UCS-2/UTF-16
wydes). It is especially important for strings like "1\u00A0000\u00A0000"
(1 000 000).
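
The concern about "1\u00A0000\u00A0000" can be checked with a small Python sketch. It compares a fixed 4-digit rule with a greedy 1-to-6-digit rule; both regular expressions and the helper `expand` are illustrative assumptions, not code from Nit.

```python
import re

text = r"1\u00A0000\u00A0000"

# Fixed width: exactly 4 hex digits after \u.
fixed = re.compile(r"\\u([0-9A-Fa-f]{4})")
# Variable width: 1 to 6 hex digits, matched greedily.
variable = re.compile(r"\\u([0-9A-Fa-f]{1,6})")


def expand(pattern, s):
    """Replace every match of `pattern` in `s` with the code point it names."""
    return pattern.sub(lambda m: chr(int(m.group(1), 16)), s)


# "1", U+00A0, "000", U+00A0, "000": the intended "1 000 000".
fixed_result = expand(fixed, text)

# The greedy rule reads \u00A000 as U+A000 and leaves one "0" behind each time.
variable_result = expand(variable, text)
```

Under the fixed rule the string keeps its nine characters with two no-break spaces; under the greedy rule it silently becomes five characters with two U+A000 characters, which is the ambiguity described above.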
IMO, since you allow non-BMP code points, the behavior should be in sync with
What is the behavior of JS and Python on surrogate pairs and on \u vs \U?
JS and JSON were designed for UCS-2/UTF-16, so they simply handle them as UTF-16 prescribes. Furthermore, Sources:
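
For the Python half of the question, the behavior can be observed directly. This is a small sketch of what CPython 3 does with its own string literals, offered only as a reference point for the discussion; the helper `utf8_or_none` is a hypothetical name.

```python
# In Python 3 string literals, \u takes exactly 4 hex digits and \U exactly 8.
nbsp_then_zero = "\u00A00"   # U+00A0 followed by the digit "0": 2 code points
emoji = "\U0001F600"         # a single code point outside the BMP

# A surrogate pair written with \u is NOT combined into one code point:
# the string holds two lone surrogates.
pair = "\ud83d\ude00"


def utf8_or_none(s):
    """Return the UTF-8 encoding of `s`, or None if `s` cannot be encoded."""
    try:
        return s.encode("utf-8")
    except UnicodeEncodeError:
        return None


# Lone surrogates are legal inside a str but are rejected by the UTF-8 codec.
pair_bytes = utf8_or_none(pair)     # None
emoji_bytes = utf8_or_none(emoji)   # b"\xf0\x9f\x98\x80"

# Going through UTF-16 with the "surrogatepass" handler joins the pair.
joined = pair.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
```

So Python keeps \u and \U at fixed but different widths and treats surrogates as ordinary (if unencodable) code points, whereas a UTF-16 based language sees the same two \u escapes as one character.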
Nit should have literal Unicode escape sequences \u008B and \U0000080B added in escape_to_nit and unescape_nit. I assume the change in the lib will be automatically used by the Nit compiler and tools. #2459 (comment)
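
As an illustration of the escaping direction, here is a minimal Python sketch that emits the two forms shown above: 4 hex digits after \u for BMP code points and 8 after \U for the rest. The function `escape_non_ascii` is a hypothetical stand-in and says nothing about how `escape_to_nit` is actually written; leaving ASCII untouched is also an assumption.

```python
def escape_non_ascii(text):
    """Escape every non-ASCII character as \\uXXXX or \\UXXXXXXXX."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            # Assumption: plain ASCII is passed through unchanged.
            out.append(ch)
        elif cp <= 0xFFFF:
            # BMP code point: 4 hex digits, as in \u008B.
            out.append(f"\\u{cp:04X}")
        else:
            # Beyond the BMP: 8 hex digits, as in \U0000080B.
            out.append(f"\\U{cp:08X}")
    return "".join(out)
```

For example, `escape_non_ascii("a\u008Bb")` produces the seven-character text `a\u008Bb`, with a literal backslash.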