Move lexer punctuation confusables list to unicode-security crate and sync it with newest Unicode version. #70002

Open
crlf0710 opened this issue Mar 14, 2020 · 9 comments
Labels
A-unicode Area: Unicode C-enhancement Category: An issue proposing an enhancement or a PR with one. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@crlf0710
Member

Now that Unicode 13 is released, maybe we should bump the Unicode version.

@jonas-schievink
Contributor

#69929

@jonas-schievink jonas-schievink added A-unicode Area: Unicode C-enhancement Category: An issue proposing an enhancement or a PR with one. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. labels Mar 14, 2020
@Mark-Simulacrum
Member

cc @estebank @rust-lang/wg-diagnostics re:confusables list

@est31
Member

est31 commented Mar 14, 2020

It's a bit tough because that list isn't automatically generated. Or rather, it was generated at some point in the past, but it has been manually edited since then. Those amendments are valuable and should not be lost to an automatic regeneration. One could think about having two lists though: one automatically generated, the other holding the Rust-specific edits and improvements.
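To illustrate the two-list idea, here is a minimal sketch in which a regenerated table and a hand-maintained table are merged, with the manual entries taking precedence. The names `GENERATED`, `MANUAL`, and `merged_table`, and the sample entries, are hypothetical and are not taken from the compiler sources.

```rust
use std::collections::BTreeMap;

/// One entry: (confusable char, Unicode name, ASCII lookalike).
type Entry = (char, &'static str, char);

// Hypothetical stand-in for a table regenerated from Unicode data.
const GENERATED: &[Entry] = &[
    ('\u{FF1B}', "Fullwidth Semicolon", ';'),
    ('\u{2236}', "Ratio", ':'),
];

// Hypothetical stand-in for the hand-maintained, Rust-specific entries.
const MANUAL: &[Entry] = &[
    ('\u{FF0C}', "Fullwidth Comma", ','),
    ('\u{2236}', "Ratio (manual override)", ':'),
];

/// Merge both tables, keyed by the confusable character.
/// Manual entries are inserted last, so they win on conflict.
fn merged_table() -> BTreeMap<char, (&'static str, char)> {
    let mut table = BTreeMap::new();
    for &(c, name, ascii) in GENERATED.iter().chain(MANUAL.iter()) {
        table.insert(c, (name, ascii));
    }
    table
}
```

With this layout the generated table could be refreshed for each Unicode release without touching the manual one.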

@crlf0710
Member Author

I recently got a crate that includes the confusables.txt data, so I did a small analysis of the existing items in the aforementioned list. There are 178 items already covered by confusables.txt, 85 items not covered, and one duplicate item (Canadian Syllabics Final Middle Dot).
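An analysis of this kind can be sketched as a single pass that sorts each in-tree character into one of three buckets. The `Coverage` type and the `analyze` function are hypothetical, and treating a repeated character as a duplicate rather than counting it a second time is an assumption about how the numbers above were obtained.

```rust
use std::collections::HashSet;

/// Result of comparing an in-tree list against a reference set.
#[derive(Debug, Default, PartialEq)]
struct Coverage {
    covered: Vec<char>,
    not_covered: Vec<char>,
    duplicates: Vec<char>,
}

/// Classify every character of `in_tree` against `reference`
/// (for example, the source characters listed in confusables.txt).
fn analyze(in_tree: &[char], reference: &HashSet<char>) -> Coverage {
    let mut seen = HashSet::new();
    let mut result = Coverage::default();
    for &c in in_tree {
        if !seen.insert(c) {
            // Already encountered earlier in the in-tree list.
            result.duplicates.push(c);
        } else if reference.contains(&c) {
            result.covered.push(c);
        } else {
            result.not_covered.push(c);
        }
    }
    result
}
```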

@crlf0710
Member Author

The former two lists are here. Also cc @Manishearth to see whether the second list is of some value to further development of UTS39.

@Manishearth
Member

The in-tree list is specifically about things which are confusable with Rust syntax.

@crlf0710
Member Author

crlf0710 commented May 1, 2020

Yes, and Rust syntax here means most ASCII punctuation characters.

One major difference I see between this list and the UTS39 one is that most of the items here cover unintentional mistakes rather than security concerns. For example,

('，', "Fullwidth Comma", ','),

I use an IME daily, and a single Shift key press toggles between Latin mode and Han mode; the comma key produces a different one of these two characters in each mode, so it's really easy to type the wrong one unintentionally. So while this item is not in the UTS39 list, it is quite useful in practice.
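As a rough sketch of how such an entry helps in practice, the snippet below scans a source string and reports the first character that matches an entry in a small table shaped like the one quoted above. The table excerpt, the function name `first_confusable_hint`, and the message wording are hypothetical; this is not the compiler's actual diagnostic code.

```rust
/// Tiny excerpt in the same (char, name, ASCII) shape as the entry above.
const CONFUSABLES: &[(char, &str, char)] = &[
    ('\u{FF0C}', "Fullwidth Comma", ','),
];

/// Return a hint for the first confusable character in `src`, if any.
fn first_confusable_hint(src: &str) -> Option<String> {
    src.char_indices().find_map(|(pos, c)| {
        CONFUSABLES
            .iter()
            .find(|&&(confusable, _, _)| confusable == c)
            .map(|&(_, name, ascii)| {
                format!("byte {}: `{}` ({}) looks like `{}`", pos, c, name, ascii)
            })
    })
}
```

A comma typed in Han mode would then produce a hint pointing at the ASCII comma, while ordinary ASCII input yields `None`.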

@crlf0710 crlf0710 changed the title Update to Unicode 13. Move lexer punctuation confusables list to unicode-security crate. Mar 26, 2021
@crlf0710 crlf0710 added T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. and removed T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. labels Mar 26, 2021
@crlf0710 crlf0710 changed the title Move lexer punctuation confusables list to unicode-security crate. Move lexer punctuation confusables list to unicode-security crate and sync it with newest Unicode version. Mar 26, 2021
@crlf0710
Member Author

Updated the issue title to reflect the actual discussion topic here - the libs part is already done in #69929.


5 participants