ContextJ (RFC 5892) is Security Theater #776
Comments
It depends? 😊 I haven't looked into it; it probably depends a lot on what it ends up meaning for ToASCII and what the arguments are.
For a concrete example:
The simplest solution is that
For reference, I recently implemented a normalization standard for the Ethereum Name Service ecosystem. I used a combination of UTS-51 + UTS-46 + a significantly safer character set (banned punctuation, parens, brackets, vocalizations, obsolete, deprecated, ancient, reversed, turned, flipped, many ligatures, etc.) + an intelligent confusable system (that isn't just a warning system: eg. From my experience with the Unicode and RFC documentation, the primary source of confusion and bugs is the documentation itself. Many of these rules should be deprecated and the rules should be clarified and modernized. I think WHATWG made the correct decision with I think they should do the same with
It might make sense to give feedback to UTS46 for them to consider allowing more emoji through, but the general idea of leaving the problem of confusables entirely to the proprietary ToUnicode algorithms does not seem great: it's already problematic that those algorithms are not standardized, and this would make the problem a lot worse.
Something to consider before deciding whether to send feedback about UTS 46: Even before getting to ContextJ, which affects U+200D from https://www.unicode.org/Public/emoji/16.0/emoji-zwj-sequences.txt , the mapping stage removes U+FE0F, which occurs in multiple sequences there. That is, UTS 46 is incompatible with RGI emoji on a more fundamental level than ContextJ. Making it not so would be very messy implementation-wise considering how the fused mapping and normalization stage works in both ICU4C and ICU4X, so asking for RGI emoji support in UTS 46 would be a big request.
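To make the U+FE0F point concrete, here is a minimal sketch in Python. It assumes a hypothetical helper, `drop_ignored`, that stands in only for the part of the UTS 46 mapping stage that removes U+FE0F; it is not the real mapping table and does no normalization. The sample is the rainbow flag ZWJ sequence (U+1F3F3 U+FE0F U+200D U+1F308).

```python
VS16 = "\uFE0F"  # VARIATION SELECTOR-16, removed by the mapping stage
ZWJ = "\u200D"   # ZERO WIDTH JOINER


def drop_ignored(label: str) -> str:
    """Hypothetical stand-in for the mapping stage: only removes U+FE0F."""
    return label.replace(VS16, "")


# Rainbow flag: U+1F3F3 U+FE0F U+200D U+1F308
rainbow_flag = "\U0001F3F3\uFE0F\u200D\U0001F308"
mapped = drop_ignored(rainbow_flag)

# The mapped label is a different code point sequence from the input,
# before any joiner check has looked at the U+200D that is still in it.
print([f"U+{ord(c):04X}" for c in rainbow_flag])
print([f"U+{ord(c):04X}" for c in mapped])
```

Under these assumptions the label has already stopped matching the listed sequence by the time ContextJ would be evaluated, which is the sense in which the incompatibility sits below ContextJ.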
Fixing stuff on that level was supposed to be the way to mitigate the breaking change to the characters called deviation characters. Evidence suggests that it didn't go the way things were explained at the time of making the breaking change. On the flip side, if the TLDs enforced ICANN policy, registering any emoji in domains would not work. It appears that there are TLDs that do not enforce ICANN policy.
The emoji are registered as punycode, which is ASCII. However, URL parsing libraries are decoding the punycode, running UTS-46, spotting a ZWJ, and then failing due to CheckJoiners (see screenshot above). Either CheckJoiners should be false, or it should be false when decoding a name that is already punycode.
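The failure path described above can be sketched roughly as follows, in Python. This is a simplified illustration, not the UTS 46 algorithm: the function name `zwj_passes_contextj` is made up for this example, it covers only the U+200D half of the RFC 5892 Appendix A.2 rule (the preceding character must have canonical combining class Virama), and it skips mapping, normalization, and the ZWNJ rule entirely. The emoji label is an illustrative man + ZWJ + woman sequence.

```python
import unicodedata

ZWJ = "\u200D"
VIRAMA_CCC = 9  # canonical combining class "Virama"


def zwj_passes_contextj(label: str) -> bool:
    """Simplified ZWJ rule: each U+200D must directly follow a Virama."""
    for i, ch in enumerate(label):
        if ch == ZWJ:
            if i == 0 or unicodedata.combining(label[i - 1]) != VIRAMA_CCC:
                return False
    return True


# Illustrative emoji joiner sequence: MAN + ZWJ + WOMAN
emoji_label = "\U0001F468\u200D\U0001F469"

# The registered form is pure ASCII ("xn--" plus the Punycode body).
ascii_label = "xn--" + emoji_label.encode("punycode").decode("ascii")

# A parser decodes the ASCII form back to Unicode and then validates it.
decoded = ascii_label[4:].encode("ascii").decode("punycode")

# KA + VIRAMA + ZWJ + SSA: a Devanagari use of ZWJ that the rule accepts
devanagari_label = "\u0915\u094D\u200D\u0937"

print(ascii_label.isascii())                  # the stored label is ASCII
print(zwj_passes_contextj(decoded))           # emoji label is rejected
print(zwj_passes_contextj(devanagari_label))  # virama context is accepted
```

The check runs on the decoded form, so a label that is stored and resolved as plain ASCII is still rejected once it has been decoded.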
No, as also mentioned in #821 (comment) we want to preserve round-tripping.
By "No" you mean for the punycode-conditional CheckJoiners comment? Similar to the title of this thread, the "security" provided by that check is almost nothing. UTS-46 alone is insufficient for user safety. Just look at all of the stuff browsers do to guard the URL input field. For every 1 name caught with CheckJoiners, a gazillion confusables slip through the cracks and have to be addressed at a higher level of the stack.
In general we want to align with UTS46. There are a couple of places where we deviate for web compatibility, and hopefully we can narrow the scope of those over time. I think we are agreed that "ToUnicode" needs more work, and hopefully at some point it can be more properly standardized than it is now, though to some extent that is also clearly in the realm of user interfaces, which standards typically do not touch, at least not to a significant extent. However, that does not mean we want to differentiate between ASCII and non-ASCII inputs, or that we should allow all Unicode to go through because it becomes ASCII anyway. I'm therefore closing this as WONTFIX.
Is CheckJoiners/ContextJ set in stone or can it be debated? If so, I'd like to present some arguments.