Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

can't parse urls starting with xn-- #438

Closed
Janpot opened this issue May 2, 2019 · 21 comments
Closed

can't parse urls starting with xn-- #438

Janpot opened this issue May 2, 2019 · 21 comments

Comments

@Janpot
Copy link

Janpot commented May 2, 2019

Can't seem to parse urls like http://xn--abc.com. This seems to work in browsers though.
I've been digging through the code and specs a bit.

It looks like tr46.toASCII returns an error. Digging further, it looks like it should implement this spec: https://www.unicode.org/reports/tr46/#Processing. But that seems to say:

Even if an error occurs, the conversion of the string is performed as much as is possible.

And it says

If the label starts with “xn--”:
Attempt to convert the rest of the label to Unicode according to Punycode [RFC3492]. If that conversion fails, record that there was an error, and continue with the next label. Otherwise replace the original label in the string by the results of the conversion.

The url spec seems to dictate (https://url.spec.whatwg.org/#idna)

If result is a failure value, validation error, return failure.

I feel like this should be possible though, tr46 seems quite ambiguous as to what's recoverable and what not.

I came across an example that renders and parses in the browser but seems to fail the parsing algorithm: http://xn--12cr4aua8bifvs3aljr6edb1al1vlg1a.blogspot.com (disclaimer: I am in no way connected to this url or the content of the site, it just passed by our systems)

In any case, I'm not super experienced in reading these specs, so take the previous with an appropriate grain of salt. It just seems strange to me that urls can render in a browser, but fail parsing them according to the spec.

EDIT:

forgot to mention, when I say "I've been digging through the code" I'm talking about https://github.com/jsdom/whatwg-url. FWIW, the node.js native url parser seems to behave the same way.

@annevk
Copy link
Member

annevk commented May 2, 2019

Thank you for reporting this. Unfortunately, I suspect you're correct and this is not something we have adequately tested for thus far.

Interesting cases (Live URL Viewer links for comparison purposes):

@macchiati I suspect this might require further adjustments to TR46 in due course, once we figure out the full details. Part of the problem here is that browsers have been slow on aligning with requirements in general and making the necessary adjustments.

@Janpot
Copy link
Author

Janpot commented May 2, 2019

Digging a bit further, and another cornercase that seems related to this is that host setter doesn't throw when set to an invalid value:

const x = new URL('http://example.com');
x.host = 'xn--a';
console.log(x.href);
// node: http://example.com/
// browser: http://xn--a/

While

const x = new URL('http://example.com');
x.href = 'http://xn--a/';
console.log(x.href);
// node: throws "Invalid URL: http://xn--a/"
// browser: http://xn--a/

Wouldn't it make sense to make all setters throw when it results in an invalid URL, not only the href setter. According to the spec, this behavior is only specified for the href setter. (Maybe this should be a separate issue?)

EDIT:

And here's another funny one:

const x = new URL('http://example.com');
x.host = 'xn--ß.com';
console.log(x.href);
// node: http://example.com/
// chrome: http://xn--%C3%9F.com/
// firefox: http://xn--xn---yna.com/ 
// safari: http://example.com/

Only firefox seems to be consistent withitself:

const x = new URL('http://example.com');
x.host = 'xn--ß.com';
const y = new URL('http://xn--ß.com');
console.log(x.href === y.href);
// node: exception on line 3
// chrome: exception on line 3
// firefox: true
// safari: exception on line 3

@annevk
Copy link
Member

annevk commented May 2, 2019

That's how the host setter deals with incorrect input for largely historical reasons. The input ends up getting ignored. It should be functionally equivalent otherwise though.

@Janpot
Copy link
Author

Janpot commented May 2, 2019

Ok I see. fwiw, I can't think of a situation where I want this to just ignore my input, rather than complain. But I guess that would be a breaking change by now.

@annevk
Copy link
Member

annevk commented May 2, 2019

Unfortunately it would be. There's been some talk about a dedicated host API, which will not have this problem.

@zackw
Copy link

zackw commented Oct 10, 2019

I'd like to point out that the current rev of the IDNA RFC [IDNA2008] encourages applications that do DNS lookup to be liberal in what they accept, and in particular to "rely on the assumption that names that are present in the DNS are valid" except for specific cases which are known to cause "serious problems". In particular, note the text at the end of section 5.4:

For all other strings, the lookup application MUST rely on the
presence or absence of labels in the DNS to determine the validity of
those labels and the validity of the characters they contain. If
they are registered, they are presumed to be valid; if they are not,
their possible validity is not relevant.

where "all other strings" means "all strings that have passed the sequence of checks for 'serious problems' described in sections 5.3 and 5.4".

Here are some examples of URLs that I have personally observed in the wild (during my research, which involves Web crawling) to contain hostnames which are formally invalid per some RFC or other, but do not rise to the level of a 'serious problem', and which I think should probably be accepted by the URL standard, if only for interop's sake:

http://r2---sn-gvbxgn-tt1s.googlevideo.com/
http://r9---sn-i3b7sn7d.googlevideo.com/
http://lgbt_grani.livejournal.com/
http://www.mi-ru_mo.bbs.fc2.com/
http://-friction-.tumblr.com/

@domenic
Copy link
Member

domenic commented Sep 30, 2020

I wonder if we should consider enshrining browsers' "ASCII fast path", where they don't perform ToASCII on ASCII inputs. In https://bugs.chromium.org/p/chromium/issues/detail?id=724018 @annevk seemed to think that was a bad idea, but I'm not sure I fully understood the negative consequences of that direction.

@annevk
Copy link
Member

annevk commented Oct 1, 2020

Yeah, I think that's probably needed given the number of existing systems that seem to rely on this to varying degrees. I think my concerns were mostly design-wise, that it seems somewhat bad to have a different set of restrictions on non-ASCII and ASCII input, e.g., with regards to length.

If it wasn't already the case it might also lead to certain security issues I suppose, as you can smuggle invalid xn-- sequences in that might trip up something downstream that is poorly configured. That browsers already allow this hopefully means the ecosystem is already robust against such surprises.

@rmisev
Copy link
Member

rmisev commented Oct 1, 2020

Example: http://xn--a.xn--nxa/
According to "ASCII fast path" it must pass. But I can't browse to this address. When entered in the address bar, Chrome converts it to http://xn--a.β/ and fails (because it treats http://xn--a.β/ invalid). I think if Chrome converts it to Unicode form, then must treat both equivalent. But actually it isn't:

  • new URL("http://xn--a.xn--nxa/") - succeeds
  • new URL("http://xn--a.β/") - throws

I think if URL's ASCII form is valid, then converted to Unicode form must by valid too. And vice versa - if URL is invalid in one form, then it must be invalid in other form too. Otherwise we got weird results. So I am the opposite of "ASCII fast path".

@annevk
Copy link
Member

annevk commented Oct 1, 2020

Well, but it does seem like Chrome (and similar for Firefox and Safari) attempts to browse to http://xn--a.xn--nxa/, right? Whereas for http://xn--a.β/ it does a search. So if you fiddled with your DNS to put something there it would probably work.

I agree that this leads to weirdly inconsistent rules though, so if we go down this path we should be very explicit about it and document these side effects.

@rmisev
Copy link
Member

rmisev commented Oct 1, 2020

Well, but it does seem like Chrome (and similar for Firefox and Safari) attempts to browse to http://xn--a.xn--nxa/, right? Whereas for http://xn--a.β/ it does a search.

Yes, you are right. Anyway converting valid URL to not valid (even visually) is weird.

@macchiati
Copy link

macchiati commented Oct 1, 2020 via email

@markusicu
Copy link

A fix is being proposed for tr46.

The fix proposed for UTS #46 is to detect "xn--" and "xn--ASCII-" as errors in a way that's equivalent with what IDNA2008 does. See https://www.unicode.org/L2/L2020/20240-utc165-properties-recs.pdf item F7 (on page 8).

More generally, the strategy is to report errors but still produce output (unless, for example, the Punycode string is ill-formed and thus not decodable), because different users/callers may ignore different types of errors.

@domenic
Copy link
Member

domenic commented Oct 1, 2020

Thanks @markusicu! However, what about xn-ASCII with no trailing dash? As seen in e.g. https://jsdom.github.io/whatwg-url/#url=aHR0cHM6Ly94bi0tYS5jb20v&base=YWJvdXQ6Ymxhbms=, browsers treat these sorts of URLs as parseable. Will the updated UTS 46 also produce output for those?

@markusicu
Copy link

what about xn-ASCII with no trailing dash? ... Will the updated UTS 46 also produce output for those?

I assume you mean xn--ASCII with double hyphen. The difference is that the additional hyphen in xn--ASCII- separates the "basic characters" (ASCII) from the actual Punycode encoding, and that is empty in this kind of label, which means that just Punycode-decoding it returns the ASCII part and you have an alternate encoding of the same label. (Punycode does not fail. IDNA2008 fails a round-trip check.)

Without the additional hyphen the "ASCII" substring is not actually ASCII at all but it's all-non-ASCII Punycode.

I don't think that UTS #46 is missing anything for those.
https://www.unicode.org/reports/tr46/#ProcessingStepConvertValidate

It might be ill-formed Punycode, and the spec says to just record an error for that label. If it's well-formed, then the decoded string is subjected to validation, which in turn might record an error if there is a disallowed character or something else wrong.

@TimothyGu
Copy link
Member

This issue demonstrates a need for URLs such as xn--x.com to be preserved as xn--x.com, despite the Punycode decoding error. However, to prevent reparse bugs, we need to treat Unicode and validly-encoded ASCII versions of a invalid label the same way. In other words:

  • Both xn--a-ecp.ru and a⒈.ru should have the same parsing result ( is a disallowed character according to UTS 46)
  • Both xn--a.xn--nxa and xn--a.β should have the same parsing result
  • It's unclear whether xn--é and xn--xn---epa should be allowed to have different parsing results (Double-encoded IDNA labels don't roundtrip #603)

I propose allowing ASCII labels with Punycode decoding errors to remain, but still forbid other types of UTS 46 error. So we have the following matrix:

Domain spec Chrome Firefox Safari proposal
xn--a-ecp.ru fail xn--a-ecp.ru xn--a-ecp.ru fail fail
a⒈.ru fail fail xn--a-q10i.ru fail fail
xn--a.xn--nxa fail xn--a.xn--nxa xn--a.xn--nxa xn--a.xn--nxa xn--a.xn--nxa
xn--a.β fail fail xn--a.xn--nxa fail xn--a.xn--nxa
xn--é fail fail xn--xn---epa fail fail
xn--xn---epa xn--xn---epa xn--xn---epa xn--xn---epa xn--xn---epa xn--xn---epa

There's already precedent (Safari) for treating Punycode decoding error differently from other UTS 46 failures, as one can see by comparing xn--a-ecp.ru against xn--a.xn--nxa. However, this also means we will need a UTS 46 modification to distinguish Punycode decoding errors from other types of errors.

One way to get this is adding a IgnoreInvalidPunycode boolean flag to UTS 46, and in Processing's xn-- step, change it to:

  1. Attempt to convert the rest of the label to Unicode according to Punycode [RFC3492]. If that conversion fails, record that there was an error if the label contains non-ASCII characters or if IgnoreInvalidPunycode is false, and continue with the next label. Otherwise replace the original label in the string by the results of the conversion.

@annevk
Copy link
Member

annevk commented Jan 10, 2023

@markusicu @macchiati thoughts on #438 (comment)? Especially for the xn--a.xn--nxa case where all implementations reportedly do the same thing, but UTS46 does not.

@annevk
Copy link
Member

annevk commented Jan 10, 2023

Hmm, it seems that only Chromium-based browsers still have a problem here studying the results of https://wpt.fyi/results/url/toascii.window.html so maybe no change is needed. @foolip are you all planning on fixing those remaining failures?

@foolip
Copy link
Member

foolip commented Jan 11, 2023

@ricea can you make a judgment about these failures and the linked bugs?

@ricea
Copy link

ricea commented Jan 12, 2023

@foolip I think we want to fix these.

I think the linked bug for these issues is slightly different, so I newly filed https://crbug.com/1406728

@annevk
Copy link
Member

annevk commented Jan 12, 2023

Thanks, I think that means we can close this out. @ricea your help and insights on #543 would be appreciated.

@annevk annevk closed this as completed Jan 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

10 participants