Refusing a mix of numeric-only and BIDI domains #543
Comments
These domains are considered invalid because they don't meet the criteria from RFC 5893 Section 2. Specifically, the label "163" fails criterion 1, which requires the first character of a label to have a Bidi property of L, R, or AL. The digits 0-9 have a Bidi property of EN (European Number). The domain to ASCII algorithm sets the RFC 4920 Section 1 states:
So, the URL spec is doing the right thing here. The only 2 options for making these domains valid in terms of this spec, as far as I can tell, would be setting the |
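For illustration, here is a minimal Python sketch of the first-character check discussed in this comment. It only covers condition 1 of the Bidi Rule (RFC 5893 Section 2), not the full rule, and the helper names `first_char_class` and `passes_condition_1` are made up for this example. It relies on the standard library's `unicodedata.bidirectional`, whose answers depend on the Unicode version shipped with the Python build.

```python
import unicodedata

# Bidi classes allowed for the first character of a label
# (condition 1 of the Bidi Rule, RFC 5893 Section 2).
ALLOWED_FIRST = {"L", "R", "AL"}

def first_char_class(label: str) -> str:
    """Return the Bidi class of the first character of a non-empty label."""
    return unicodedata.bidirectional(label[0])

def passes_condition_1(label: str) -> bool:
    """Hypothetical helper: True if the label's first character is L, R, or AL."""
    return bool(label) and first_char_class(label) in ALLOWED_FIRST

# "\u064a" is ARABIC LETTER YEH, written as an escape to avoid bidi display issues.
for label in ["mail", "163", "com", "\u064a"]:
    print(repr(label), first_char_class(label), passes_condition_1(label))
```

The all-digit label "163" is the one that fails here, because ASCII digits have the class EN.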
I agree that the current spec disallows these URLs as invalid. My question was more along the lines of "Was it the spec's/author's intention to disallow them, or did the spec get written in a way that disallows them by accident?" Several sections in there, pointed out in this comment, seem to suggest that allowing such URLs for compatibility with existing deployments was considered. So, in other words, my question is not "Are they invalid?", but "Should they be invalid?" |
This might be the same underlying issue as #438. |
I think @TRowbotham has the correct analysis here and indeed it very much depends on how CheckBidi is used. To simplify from OP:
However, https://www.rfc-editor.org/rfc/rfc5893.html#section-2 (which UTS46 invokes) also says:
But that seems contradictory as a label that starts with an ASCII digit can never fulfill The Bidi Rule due to ASCII digits not having the correct Bidi property (they have EN according to https://unicode.org/reports/tr9/):
I'm not sure what to make of this. I would appreciate input from @achristensen07 @valenting @markusicu @macchiati @alvestrand. I would be somewhat inclined to set CheckBidi to false given that it matches most implementations, is more likely to match deployed content, and the bidi requirements appear contradictory, but I'm open to suggestions. |
This might also be more complicated still as, e.g., And then (All of this is only concerned with the ToASCII code path, for what it's worth.) |
This logic took quite a while to work out, including actually coding up the BIDI rule and running it through all the possible combinations of directions to make sure I had them all covered..... The "bidi rule" in RFC 5893 section 2 applies to a single label. So a label (not a domain name) can either obey the rule or not. The guarantees in the last two paragraphs are about the properties of a whole domain name. They are not part of the rule. The practical consequence is that if you want sanity in your display, you can never have <RTL-label>.3com.com - because that would probably display as 3.<RTL-label>com.com, which is confusing. So 1.ي should not be rejected, but 1.ي.3com should be. (inspect the order of the characters in that one!) |
@alvestrand so do I understand it correctly that IDNA2008 doesn't take a stance as to whether all labels in a domain need to obey The Bidi Rule? https://www.unicode.org/reports/tr46/#Validity_Criteria does which might explain the difference. https://www.rfc-editor.org/rfc/rfc5891.html#section-4.2.3.4 seems to only enforce The Bidi Rule upon individual labels containing characters whose property is R whereas UTS46 enforces it upon all labels in a domain as long as CheckBidi is true. I do think The Bidi Rule is somewhat confusing if that is the case as it itself states
which easily leads one to think it applies to all labels and has to be obeyed. Also, it's not clear to me how from The Bidi Rule enforced only upon labels containing characters whose property is R the guarantee follows that labels starting with an ASCII digit do not come after the RTL label. |
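To make the scoping difference in this comment concrete, here is a hedged Python sketch. `satisfies_bidi_rule` follows the six numbered conditions of RFC 5893 Section 2. The two `valid_*` functions are simplified stand-ins for the two readings described above (check only labels that contain RTL characters, versus check every label once the domain contains an RTL label). They are assumptions for illustration, not implementations of RFC 5891 or UTS46, and they expect U-labels as input. An "RTL label" is taken to be one containing a character of class R, AL, or AN.

```python
import unicodedata

RTL_ALLOWED = {"R", "AL", "AN", "EN", "ES", "CS", "ET", "ON", "BN", "NSM"}
LTR_ALLOWED = {"L", "EN", "ES", "CS", "ET", "ON", "BN", "NSM"}
RTL_CLASSES = {"R", "AL", "AN"}

def satisfies_bidi_rule(label: str) -> bool:
    """Check one label against conditions 1-6 of RFC 5893 Section 2."""
    classes = [unicodedata.bidirectional(ch) for ch in label]
    if not classes or classes[0] not in {"L", "R", "AL"}:
        return False  # condition 1
    # Conditions 3 and 6 look at the last character before any trailing NSM.
    end = len(classes)
    while classes[end - 1] == "NSM":
        end -= 1
    last = classes[end - 1]
    if classes[0] in {"R", "AL"}:  # RTL label: conditions 2, 3, 4
        return (all(c in RTL_ALLOWED for c in classes)
                and last in {"R", "AL", "EN", "AN"}
                and not ("EN" in classes and "AN" in classes))
    # LTR label: conditions 5, 6
    return all(c in LTR_ALLOWED for c in classes) and last in {"L", "EN"}

def is_rtl_label(label: str) -> bool:
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in label)

def valid_per_label_scope(domain: str) -> bool:
    """Assumed reading 1: only labels containing RTL characters are checked."""
    return all(satisfies_bidi_rule(l) for l in domain.split(".") if is_rtl_label(l))

def valid_whole_domain_scope(domain: str) -> bool:
    """Assumed reading 2: every label is checked once any label is RTL."""
    labels = domain.split(".")
    if not any(is_rtl_label(l) for l in labels):
        return True
    return all(satisfies_bidi_rule(l) for l in labels)

# Shaped like the domain from OP, with "\u064a" (ARABIC LETTER YEH) as the RTL label.
domain = "mail.163.com.\u064a.com"
print(valid_per_label_scope(domain), valid_whole_domain_scope(domain))
```

Under the second reading the label "163" is what causes the rejection, even though it contains no RTL characters itself.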
You are correct. IDNA2008 states only rules about single labels - this was a result of discussing the various ways in which labels can be put together into domain names. There is a very explicit discussion of "what can happen if you concatenate labels into domain names" in https://www.rfc-editor.org/rfc/rfc5893#section-5 - it ends with "Rather than trying to suggest rules that disallow all such undesirable situations, this document merely warns about the possibility, and leaves it to application developers to take whatever measures they deem appropriate to avoid problematic situations." TR46 was written by people who have far less DNS experience than the people who were involved in RFC 5891. The two groups did not agree at the time TR46 was first written, and while my impression is that TR46 has been revised to be more in line with IDNA2008 over time, I am not surprised that there are still cases where trying to interpret the two as saying exactly the same thing will fail. |
@alvestrand okay, but that still doesn't address my last paragraph about the purported guarantees from IDNA2008. |
The point is that no single entity can make that guarantee, as described in section 5. If you want to require that a certain application rejects domain names that don't obey the requirements, that's an application spec, not a DNS spec. Section 5 (and the "it follows" parts of section 2) are intended to give guidance on how to decide to reject such names. (I suspect that I'm a victim of knowing what I intended when I wrote it, and being unable to see where it's unclear; to me, I'm just repeating what I already wrote in the RFC. But I still hope it's understandable.) |
So why say they are guarantees? Given that IDNA2003 was implemented by user agents it does seem somewhat irresponsible that IDNA2008 didn't try to address them at all, but I guess that's water under the bridge. I guess I need input from @achristensen07 @valenting @markusicu @macchiati as to what exactly we'd like to enforce here. Banning numeric labels in domains containing RTL labels seems bad so I assume we want to change that part of UTS46. Enforcing The Bidi Rule for labels containing a character whose property is R seems realistic and implemented by all user agents. We could additionally try to enforce the second "guarantee" by not allowing a numeric label after an RTL label, but not sure. |
This comment was marked as off-topic.
My plan is to submit feedback to Unicode's April meeting to get this addressed. Draft:
If anyone here has suggestions for how to make this more concrete I'm all ears. cc @ricea |
As per whatwg/url#543. Not ready to merge until that has concluded.
The point on empty labels is actually not right. It's largely fallen out of use. My suggestion for a solution would be to add text in the URL standard as follows:
That should be the necessary and sufficient rules for ensuring that display of domain names using the Unicode bidi algorithm don't contain characters that "jump the dot". |
It's correct per how browsers and the URL Standard deal with it. The domain name is Thanks for suggesting a set of rules. I'll incorporate that in the feedback. |
This is the problematic statement in UTS#46: https://unicode.org/reports/tr46/#ProcessingStepBreak The problem is that a.b.c. is using the "preferred name syntax" from RFC 1035 section 2.3.1, where empty labels are disallowed - and UTS#46 is ignoring that. The grammar rule is "<label> ::= <letter> [ [ <ldh-str> ] <let-dig> ]" - this was relaxed to allow leading digits in RFC 1123 section 2.1, but there was never a relaxation of the rule that there should be at least one character. A competent DNS name processor should: a) disallow any domain name with two consecutive dots |
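As a rough illustration of the "preferred name syntax" point, the following Python sketch checks ASCII labels against the RFC 1035 section 2.3.1 label grammar with the RFC 1123 relaxation that allows a leading digit. The function name `is_preferred_syntax` and the regular expression are assumptions written for this example. It treats a trailing dot as producing an empty final label, which is the strict reading argued for in this comment, not what browsers or the URL Standard do.

```python
import re

# One label: starts and ends with a letter or digit, with letters, digits,
# or hyphens in between (RFC 1035 grammar plus the RFC 1123 leading-digit relaxation).
LABEL_RE = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?")

def is_preferred_syntax(domain: str) -> bool:
    """Hypothetical helper: True if every dot-separated label is non-empty,
    at most 63 characters long, and matches the label grammar."""
    return all(
        len(label) <= 63 and LABEL_RE.fullmatch(label) is not None
        for label in domain.split(".")
    )

for name in ["a.b.c", "a.b.c.", "a..b", "3com.com"]:
    print(name, is_preferred_syntax(name))
```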
This is all way before DNS gets involved and also has other applications (such as the same-origin policy) so it's not quite that simple, but it might well be better if UTS46 is not invoked with a trailing dot. They just need to make that clear I think. |
@alvestrand on reflection, it's not clear to me how your suggestion ends up allowing cases such as
|
The example of 1.ي is not covered by my suggested rule:
since the RTL domain is a top level domain in this case.
(the word "directly" makes it more obvious that 1.ي.foo.3.tld is allowed too) |
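The examples given in this thread (1.ي is fine, 1.ي.3com is not, 1.ي.foo.3.tld is fine) can be reproduced with a small Python sketch. This is only one reading of those examples, namely "a label starting with an ASCII digit must not directly follow an RTL label". It is not the proposed rule text itself, and the function names are invented for this illustration. An RTL label is assumed to be one containing a character of Bidi class R, AL, or AN.

```python
import unicodedata

RTL_CLASSES = {"R", "AL", "AN"}
ASCII_DIGITS = set("0123456789")

def is_rtl_label(label: str) -> bool:
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in label)

def digit_label_directly_after_rtl(domain: str) -> bool:
    """Hypothetical check: True if some label starting with an ASCII digit
    immediately follows (in text order) a label containing RTL characters."""
    labels = domain.split(".")
    return any(
        is_rtl_label(prev) and nxt[:1] in ASCII_DIGITS
        for prev, nxt in zip(labels, labels[1:])
    )

YEH = "\u064a"  # ARABIC LETTER YEH, escaped to avoid bidi display issues

for domain in ["1." + YEH, "1." + YEH + ".3com", "1." + YEH + ".foo.3.tld"]:
    # Print code points so the output is not itself reordered by the terminal.
    print(domain.encode("unicode_escape").decode("ascii"),
          digit_label_directly_after_rtl(domain))
```

Dropping the word "directly" would mean comparing each digit-initial label against every earlier label instead of only its immediate predecessor.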
@alvestrand but for that second now-reformulated case would the RTL labels still need to obey "The Bidi Rule"? And when you say "domains" do you mean "Bidi domain names" or all of them? |
@alvestrand if you could give this another look that would help. Otherwise I'll submit feedback without a specific recommendation. |
The two paragraphs in my suggested rule are AND, not OR; domain names need to satisfy both. All domain names with RTL labels are Bidi domain names. A "Bidi domain name" is a domain name that contains at least one RTL Domain names that don't contain RTL labels are out of scope for this recommendation. |
@alvestrand how does AND work for LTR labels solely consisting of EN code points? They would violate The Bidi Rule. |
Anne, I believe your conclusion in your first message is correct. That is, if a domain name contains any R, AL, or AN character then by condition 1, none of its labels can start with an EN character, e.g. [0-9].
But as you say, the wording of paragraph XX appears odd:
In a domain name consisting of only LDH labels (as defined in the Definitions document [RFC5890] <https://www.rfc-editor.org/rfc/rfc5890>) and labels that satisfy the rule, the requirements of Section 3 <https://www.rfc-editor.org/rfc/rfc5893.html#section-3> are satisfied as long as a label that starts with an ASCII digit does not come after a right-to-left label.
After all, 5890 defines the following.
The term "LDH code point" is defined in this document to refer to the code points associated with ASCII letters (Unicode code points 0041..005A and 0061..007A), digits (0030..0039), and the hyphen-minus (U+002D).
That means that a "domain name consisting of only LDH labels" can't have any right-to-left labels. So the condition "as long as a label that starts with an ASCII digit does not come after a right-to-left label" is by definition always true for such a domain name, because there can be no right-to-left labels in it.
|
Yeah, I guess we could accept it, but it seems unnecessarily constraining for RTL users and developers, and ends up rejecting domains known to exist (see OP). At the moment it also only matches WebKit, but unfortunately I haven't been able to get @ricea (Chromium) and @valenting (Gecko) to chime in thus far. |
OP had mail.163.com.xn----9mcjf9b4dbm09f.com - here, the RTL label is followed by an ASCII label that does not start with a digit. I don't see how that would fail the rule I suggested. To @annevk : All-numeric labels start with a number. No need to consider anything more about them; if they follow an RTL label, they make the domain name fail the rule. (Note: 1.ي.3.tld (the 3 is actually the subdomain of .tld) is an example of an all-numeric label. It will be rare for users to actually comprehend that.) |
@alvestrand because the |
The 163 label does not follow an RTL label, so while it violates the bidi rule for a label, it doesn't violate the domain name rule I proposed. I think I've said that several times too. Quoting RFC 5893 again:
Satisfying the requirements of section 3 should be the goal of a domain name verification filter. |
In #543 (comment) you suggested two rules and later clarified they are AND. One of the rules is that all labels adhere to The Bidi Rule. Could you please restate your rules in clearer terms? |
Adding in the AND and "immediately" from the suggested clarifications gives the following text:
The AND means that both kinds of domain will be accepted, it is "accept AND accept". I don't understand where the comprehension difficulty is, but then English is not my first language. |
Well, you also said:
But now you are saying domain names only need to satisfy one of the rules, right? (Which brings me back to my question about the lack of enforcement of The Bidi Rule on RTL labels with the second rule.) |
I was wrong when I said "domain names need to satisfy both". I wasn't reading my own proposed text. Sorry! Try N:
|
Thank you, that seems like an improvement, but LDH labels per https://datatracker.ietf.org/doc/html/rfc5890#section-2.3.1 contain A-labels and it seems that A-labels that are RTL labels should obey The Bidi Rule. So maybe LDH there should be LTR? |
No, LDH should be LDH, because there are LTR labels that don't obey the Bidi rule, and we need to not permit those. (See bullets 5 and 6 of the Bidi rule). I don't have proposed surrounding text for this rule, but it should probably say "this rule is evaluated after all A-labels have been converted to U-labels for testing" - meaning that xn-- labels should be decoded before evaluating; if we don't do that, explicit xn-- labels offer a way to sneak in bidi domains into unsuspecting places. |
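A small Python sketch of why decoding matters: an xn-- label is pure ASCII, so a check that looks for RTL characters sees nothing until the label is converted back to its U-label form. The sketch uses the standard library's raw `punycode` codec and skips all other IDNA processing and validation. The names `to_u_label` and `is_rtl_label` are hypothetical, and the A-label is built in the code by encoding a known character instead of being quoted from anywhere.

```python
import unicodedata

RTL_CLASSES = {"R", "AL", "AN"}

def is_rtl_label(label: str) -> bool:
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in label)

def to_u_label(label: str) -> str:
    """Hypothetical helper: decode an xn-- label with raw Punycode.
    No IDNA validation is performed; other labels are returned unchanged."""
    if label.lower().startswith("xn--"):
        return label[4:].encode("ascii").decode("punycode")
    return label

# Build an A-label from ARABIC LETTER YEH so the example does not depend on
# a hand-copied Punycode string.
a_label = "xn--" + "\u064a".encode("punycode").decode("ascii")

print(a_label)
print(is_rtl_label(a_label))              # checked without decoding
print(is_rtl_label(to_u_label(a_label)))  # checked after decoding
```

Running the bidi checks on the undecoded form is what would let an explicit xn-- label slip past them.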
Okay that makes sense. I think that precondition means that LTR in your second rule can be LDH as well (which guarantees ASCII). And to be clear, there is the (unstated) precondition that these domains are Bidi domain names, right? As presumably we will not impose these requirements on non-Bidi domain names. I think with that we'd recommend these changes to UTS 46:
I'd appreciate your review and of anyone else still paying attention. 😅 |
Thanks for the context! Yes, I think this is appropriate advice. |
Thank you @alvestrand for coming up with the recommendation, @vorner for raising this, and everyone else who helped move this along! I submitted the feedback to Unicode for their April 2023 meeting. The final comment can be found at the bottom of OP in #744. |
The presumption is not correct. Previously (at the time of the quoted comment), CheckBidi was Gecko currently (well after the quoted comment) implements CheckBidi (still (This comment is not meant as disagreement with the feedback relayed to the UTC.) |
Hello
Some time ago I was trying to figure out if the domains below were rejected by the Rust url crate, it is tracked here. It seems this is maybe accidentally disallowed by the standard. I was recommended to raise it here.
It's a bit old so I don't remember the exact details and would have to dig them up, I tried to describe it in this comment. I think the issue was the combination of numeric only label and BIDI label.
Now, my question is, should these be valid URLs? They certainly are valid domains, even though it might be discouraged to allow them and the URLs are (were at least when it was reported; I could provide new ones if needed) alive and reachable. Note that they are considered malware URLs, so be careful when handling them.