add custom JULIA normalization? #11
Comments
+1 |
Yeah, it seems like we really need this. Unfortunately, the standardized normalizations just don't cut it. |
A JULIA mode in base/utf8proc.jl or in libmojibake? utf8proc seems like the better place, though we would probably need to apply it before feeding code to julia-parser.scm. |
How realistic is it to actually upstream all the changes we've made to utf8proc? I would guess that a new normalization mode would be fairly easy to keep separate from other changes. |
What about this example (referring to the proposed new brackets in Julia):
a1 = ⟨c,d⟩ # canonical \langle and \rangle
a2 = ⟪c,d⟫ # using \lAngle and \rAngle (legibility preference)
a3 = ⟪c,d⟩ # unmatched brackets should throw parse error
b1 = "⟪c,d⟫"
b2 = "❰c,d❱" # dingbat angular brackets
b3 = "〈c,d〉" # full-width angular brackets U+3008, U+3009
I'd prefer normalizing the angular brackets for a1 and a2 so they parse, and leaving the chars in the literal strings untouched. This means the lexer/parser needs to control the normalization, at least for syntactically important symbols. Or is there a hook for that already? |
This could be handled without the parser needing to know about it by having a mapping from each bracket to its partner and raising an error if the parser finds an opening and closing bracket that are not a matching pair. If the Unicode code points of a pair are always adjacent, the check could just test for that. Of course, this still implies that normalization has to happen after that check, and thus after lexing at least. So the sequence would be: lex, check, normalize, parse. That seems like a lot of trouble just to prevent people from using unpaired Unicode brackets that happen to look similar. Maybe not worth it. |
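For illustration only, the lex, check, normalize sequence described above could be sketched as follows in Python. The tables `PAIRS` and `CANONICAL` and the function `normalize_brackets` are hypothetical names invented for this sketch and are not part of utf8proc, libmojibake, or the Julia parser. The "lexer" here is deliberately crude: it only tracks double-quoted string literals (with backslash escapes) so that brackets inside strings are left alone.

```python
# Hypothetical tables, invented for this sketch.
PAIRS = {"\u27e8": "\u27e9", "\u27ea": "\u27eb"}      # opener -> its closer: ⟨⟩ and ⟪⟫
CANONICAL = {"\u27ea": "\u27e8", "\u27eb": "\u27e9"}  # ⟪ -> ⟨ and ⟫ -> ⟩
CLOSERS = set(PAIRS.values())


def normalize_brackets(src: str) -> str:
    """Check bracket pairing, then map variant brackets to the canonical pair.

    Characters inside double-quoted string literals are copied unchanged.
    """
    out = []
    stack = []          # openers seen so far, in their original (unnormalized) form
    in_string = False
    i = 0
    while i < len(src):
        ch = src[i]
        if in_string:
            out.append(ch)
            if ch == "\\" and i + 1 < len(src):
                # Copy the escaped character verbatim so \" does not end the string.
                out.append(src[i + 1])
                i += 1
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
            out.append(ch)
        elif ch in PAIRS:
            stack.append(ch)
            out.append(CANONICAL.get(ch, ch))
        elif ch in CLOSERS:
            # The check runs on the original characters, before normalization.
            if not stack or PAIRS[stack.pop()] != ch:
                raise SyntaxError(f"unmatched bracket {ch!r} at offset {i}")
            out.append(CANONICAL.get(ch, ch))
        else:
            out.append(ch)
        i += 1
    if stack:
        raise SyntaxError(f"unclosed bracket {stack[-1]!r}")
    return "".join(out)
```

With these assumed tables, the a2 line would be rewritten to use ⟨ and ⟩, the a3 line would raise an error because ⟪ is closed by ⟩, and the string literals in b1 to b3 would pass through untouched. Because the pairing check looks at the original characters, normalization cannot hide a mismatched pair.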
It isn't so bad, actually. The (lex, check) part of that is already in place in my PR, albeit done manually and most likely not exhaustively. I was brought here wondering whether some of that work could be off-loaded to utf8proc, but it probably requires way too much finessing. So perhaps just an incremental change like so would work:
|
@StefanKarpinski, the changes so far aren't too radical. The first obstacle is that we need to get copyright assignments from all of the contributors in order for upstream to consider a patch. After that, I don't know what their patch-review process will be like, but I'm guessing it will be a bit on the slow side based on past interactions. |
@stevengj Have you been able to get a reply from them? I didn't get any. I can help with asking for copyright assignment if that would be useful; I'd rather not have to package libmojibake in addition to utf8proc in Fedora. :-) |
I think that copyright assignment is not a good idea; hopefully a contributor license agreement is all they actually require. Copyright assignment isn't even legally valid in many countries, e.g. Germany. |
You're right, Stefan, it actually seems to be just a contributor license. |
(Update: the current changes in libmojibake, mainly Unicode-7 support, have been submitted upstream with CLAs.) |
A quick note that Unicode provides a list of confusable characters as part of UAX 39, which also provides a list of recommendations for characters in identifier names given security concerns. |
@jiahao, I think we explicitly decided to reject these recommendations, along with NFKC normalization, in JuliaLang/julia#5434, in order to distinguish a wider array of mathematical symbols (e.g. |
Upon reflection, I think the best thing would be to make this pluggable, by allowing the caller to supply a custom mapping function that is applied to the codepoints after normalization. |
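As a rough illustration of the pluggable idea, here is a minimal Python sketch that applies a caller-supplied mapping to each code point after standard normalization. It uses Python's stdlib `unicodedata.normalize` in place of utf8proc; `normalize_custom`, `JULIA_MAP`, and `julia_mapping` are hypothetical names, and the two table entries are only examples of the kind of folding a JULIA mode might want, not a definitive list.

```python
import unicodedata
from typing import Callable


def normalize_custom(
    text: str,
    form: str = "NFC",
    mapping: Callable[[str], str] = lambda ch: ch,
) -> str:
    """Normalize `text` with a standard form, then map each code point.

    `mapping` is supplied by the caller and receives one character at a time.
    """
    normalized = unicodedata.normalize(form, text)
    return "".join(mapping(ch) for ch in normalized)


# Hypothetical, illustrative table: fold look-alikes that NFC leaves distinct.
JULIA_MAP = {
    "\u00b5": "\u03bc",  # MICRO SIGN -> GREEK SMALL LETTER MU
    "\u025b": "\u03b5",  # LATIN SMALL LETTER OPEN E -> GREEK SMALL LETTER EPSILON
}


def julia_mapping(ch: str) -> str:
    # Characters not in the table pass through unchanged.
    return JULIA_MAP.get(ch, ch)
```

Called as `normalize_custom(source, "NFC", julia_mapping)`, this keeps the library itself free of Julia-specific policy: the normalization form stays standard and the language-specific folding lives entirely in the caller's table.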
Providing a "reasonable" set of confusable mathematical characters won't be too crazy though. |
Closed by #89. |
For JuliaLang/julia#5903. If utf8proc can have LUMP, then libmojibake can have JULIA. Unless we want to keep this separate from libmojibake.