
ContextJ (RFC 5892) is Security Theater #776

Closed
adraffy opened this issue Jun 15, 2023 · 9 comments

Comments

adraffy commented Jun 15, 2023

Is CheckJoiners/ContextJ set in stone or can it be debated? If so, I'd like to present some arguments.

annevk commented Jun 15, 2023

It depends? 😊 I haven't looked into it; it probably depends a lot on what it ends up meaning for ToASCII and what the arguments are.

adraffy commented Jun 15, 2023

  • There are 3K+ RGI emoji, and about a third of them involve ZWJ sequences. CheckJoiners privileges a few exotic characters (that can easily be enforced at the registrar level) over roughly 1350 emoji sequences that are used internationally by billions of people.

  • RFC 5892 is both outdated (2010) and misguided. AFAICT it is trying to allow ZW(N)J for typographical reasons, yet I don't think there is any ambiguity between a name with a joiner and one without.

    • Are there any registrars that allow a virama sequence both with and without ZWNJ as separate names? (No.)
    • How many actual domains benefit from this rule?
  • If you look across the internet, thousands of developer hours have been wasted deciding these choices one way or another, but at the end of the day, CheckJoiners is just a convoluted way to disallow 200C and 200D (see the sketch after this list).
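
As a sketch of what the ContextJ rule actually requires (this is not a library API; isVirama is a hypothetical helper backed by Unicode character data):

```js
// RFC 5892, Appendix A.2: U+200D (ZWJ) is allowed only when the immediately
// preceding code point is a virama (Canonical_Combining_Class = 9).
// isVirama(cp) is a hypothetical lookup backed by Unicode character data.
function zwjAllowed(codePoints, i) {
  return i > 0 && isVirama(codePoints[i - 1]);
}

// An emoji ZWJ sequence like U+1F468 U+200D U+1F4BB fails this check, because
// U+1F468 is not a virama, so CheckJoiners rejects the whole label.
// (The ZWNJ rule in Appendix A.1 adds a second branch based on joining types,
// but the virama condition is the same.)
```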


For a concrete example, take 1F468 200D 1F4BB (👨‍💻, "man technologist"):

  • This emoji was released in 2016 (7 years ago)
  • Major browsers don't agree on its validity: compare Chrome/Brave vs Safari/Firefox.
  • The punycode of this emoji is xn--1ugz855pfha
  • This emoji is invalid with CheckJoiners.
  • In some browsers, this encodes as xn--qq8hgf, which is wrong: 1F468 1F4BB is not the same as 1F468 200D 1F4BB.
  • Node.js recently switched to Ada, which implements the WHATWG URL Standard. This means that even if you correctly punycode the domain, a WHATWG URL implementation will prevent its use, even though the punycode is valid and the domain is DNS-compatible (see the sketch after this list).
    (screenshot: Node's URL parser rejecting the punycoded domain)
  • In general, the validity of URLs seems to change randomly between browser releases as libraries are periodically replaced and the standards aren't clear.
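
To make the failure concrete, here is a minimal sketch (Node.js with a WHATWG-compliant parser such as Ada; the ".example" suffix is illustrative and the exact error message varies by version):

```js
// U+1F468 U+200D U+1F4BB, registered on the wire as plain ASCII punycode.
const asciiLabel = 'xn--1ugz855pfha';

// A WHATWG URL parser decodes the punycode, applies UTS-46, hits the ZWJ,
// and fails CheckJoiners/ContextJ, so the whole URL is rejected.
try {
  const url = new URL(`https://${asciiLabel}.example/`);
  console.log('accepted:', url.hostname);
} catch (err) {
  console.log('rejected:', err.message);
}
```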

The simplest solution is for CheckJoiners to be false:

  • Any name with a joiner is already punycode.
  • UTS-46 provides poor guidance regarding spoofs and confusables and has forced developers to implement various parts of UAX-39 and their own logic to decide when to display punycode as Unicode.
  • UTS-46 advice about validating punycode is also strange because name validity is a registrar problem, not a resolution problem.
  • This is a disaster for the end-user because the rules are constantly changing, yet at the same time, there are thousands of confusables and mixed-script spoofs that slip right through the implemented standards.

For reference, I recently implemented a normalization standard for the Ethereum Name Service ecosystem. I used a combination of UTS-51 + UTS-46 + a significantly safer character set (banning punctuation, parens, brackets, vocalizations, obsolete, deprecated, ancient, reversed, turned, and flipped characters, many ligatures, etc.) + an intelligent confusable system (that isn't just a warning system: e.g., rn is a footgun confusable). Demo | Github

From my experience with the Unicode and RFC documentation, the primary source of confusion and bugs is the documentation itself. Many of these rules should be deprecated, and the rest should be clarified and modernized.

I think the WHATWG made the correct decision with hyphens (CheckHyphens=false) and finally broke away from archaic DNS rules.

I think they should do the same with CheckJoiners. If the WHATWG really wants to protect end-users, it should recommend UTS-51 RGI pre-processing and outright disallow ZW(N)J outside of emoji.

annevk commented Dec 2, 2024

It might make sense to give feedback to UTS46 for them to consider allowing more emoji through, but the general idea of leaving the problem of confusables entirely to proprietary ToUnicode algorithms does not seem great. It is already problematic that those algorithms are not standardized, and this would make the problem a lot worse.

hsivonen commented Dec 5, 2024

Something to consider before deciding whether to send feedback about UTS 46:

Even before getting to ContextJ, which affects U+200D from https://www.unicode.org/Public/emoji/16.0/emoji-zwj-sequences.txt, the mapping stage removes U+FE0F, which occurs in multiple sequences there.

That is, UTS 46 is incompatible with RGI emoji on a more fundamental level than ContextJ.

Making it not so would be very messy implementation-wise considering how the fused mapping and normalization stage works in both ICU4C and ICU4X, so asking for RGI emoji support in UTS 46 would be a big request.
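
To illustrate, a minimal round-trip sketch (assuming Node.js's node:url domainToASCII/domainToUnicode, which apply the URL Standard's UTS-46 parameters and, as documented, return an empty string for invalid domains):

```js
const { domainToASCII, domainToUnicode } = require('node:url');

// Does ToUnicode(ToASCII(x)) give back exactly x?
function roundTrips(label) {
  const ascii = domainToASCII(label);
  if (ascii === '') return false;          // rejected during mapping/validation
  return domainToUnicode(ascii) === label;
}

// U+2602 U+FE0F (umbrella with emoji presentation): no ZWJ involved, yet the
// mapping stage drops U+FE0F, so the selector never comes back out.
console.log(roundTrips('\u{2602}\u{FE0F}'));                  // false

// U+2764 U+FE0F U+200D U+1F525 ("heart on fire", an RGI ZWJ sequence): even if
// ContextJ/CheckJoiners were relaxed, the stripped U+FE0F alone breaks the
// round trip.
console.log(roundTrips('\u{2764}\u{FE0F}\u{200D}\u{1F525}')); // false
```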

hsivonen commented Dec 5, 2024

“that can easily be enforced at the registrar level”

Fixing things at that level was supposed to be the way to mitigate the breaking change to the characters called deviation characters. Evidence suggests it didn't go the way things were explained at the time the breaking change was made.

On the flip side, if the TLDs enforced ICANN policy, registering any emoji in domains would not work. It appears that there are TLDs that do not enforce ICANN policy.

adraffy commented Dec 6, 2024

“Even before getting to ContextJ, which affects U+200D from https://www.unicode.org/Public/emoji/16.0/emoji-zwj-sequences.txt, the mapping stage removes U+FE0F, which occurs in multiple sequences there.”

FE0E and FE0F don't affect which emoji is meant, though. All emoji are uniquely identifiable regardless of variation-selector placement, so those characters can (and probably should) be stripped for canonicalization.

Emoji are registered as punycode, which is ASCII. However, URL parsing libraries are decoding the punycode, running UTS-46, spotting a ZWJ, and then failing due to CheckJoiners (see the screenshot above).

Either CheckJoiners should be false, or it should be false when decoding a name that is already punycode (a sketch of the latter follows).
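
Purely as a sketch of that second option (nothing here is a real API; uts46ToAscii and its checkJoiners flag are hypothetical stand-ins for a ToASCII implementation with toggleable UTS-46 parameters):

```js
// Hypothetical: skip CheckJoiners only for labels that are already punycoded;
// raw Unicode input would still go through the full validity checks.
function toAsciiLabel(label, uts46ToAscii) {
  const alreadyPunycode = label.toLowerCase().startsWith('xn--');
  return uts46ToAscii(label, { checkJoiners: !alreadyPunycode });
}
```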

annevk commented Dec 6, 2024

No; as also mentioned in #821 (comment), we want to preserve round-tripping.

adraffy commented Dec 6, 2024

By "No" you mean for the punycode-conditional CheckJoiners comment?

As the title of this thread suggests, the "security" provided by that check is almost nothing. UTS-46 alone is insufficient for user safety; just look at everything browsers do to guard the URL input field.

For every 1 name caught with CheckJoiners, a gazillion confusables slip through the cracks and have to be addressed at a higher level of the stack.

annevk commented Dec 9, 2024

In general we want to align with UTS46. There are a couple of places where we deviate for web compatibility, and hopefully we can narrow the scope of those over time.

I think we agree that "ToUnicode" needs more work, and hopefully at some point it can be standardized more properly than it is now, though to some extent that is also clearly in the realm of user interfaces, which standards typically do not touch. At least not to a significant extent.

However, that does not mean we want to differentiate between ASCII and non-ASCII inputs, or that we should let all of Unicode through just because it becomes ASCII anyway.

I'm closing this as WONTFIX therefore.

annevk closed this as not planned on Dec 9, 2024.