whatwg / url

URL Standard
https://url.spec.whatwg.org/
Other
532 stars 140 forks source link

can't parse urls starting with xn-- #438

Closed Janpot closed 1 year ago

Janpot commented 5 years ago

Can't seem to parse urls like http://xn--abc.com. This seems to work in browsers though. I've been digging through the code and specs a bit.

It looks like tr46.toASCII returns an error. Digging further, it looks like it should implement this spec: https://www.unicode.org/reports/tr46/#Processing. But that seems to say:

Even if an error occurs, the conversion of the string is performed as much as is possible.

And it says

If the label starts with “xn--”: Attempt to convert the rest of the label to Unicode according to Punycode [RFC3492]. If that conversion fails, record that there was an error, and continue with the next label. Otherwise replace the original label in the string by the results of the conversion.

The url spec seems to dictate (https://url.spec.whatwg.org/#idna)

If result is a failure value, validation error, return failure.

I feel like this should be possible though, tr46 seems quite ambiguous as to what's recoverable and what not.

I came across an example that renders and parses in the browser but seems to fail the parsing algorithm: http://xn--12cr4aua8bifvs3aljr6edb1al1vlg1a.blogspot.com (disclaimer: I am in no way connected to this url or the content of the site, it just passed by our systems)

In any case, I'm not super experienced in reading these specs, so take the previous with an appropriate grain of salt. It just seems strange to me that urls can render in a browser, but fail parsing them according to the spec.

EDIT:

forgot to mention, when I say "I've been digging through the code" I'm talking about https://github.com/jsdom/whatwg-url. FWIW, the node.js native url parser seems to behave the same way.

annevk commented 5 years ago

Thank you for reporting this. Unfortunately, I suspect you're correct and this is not something we have adequately tested for thus far.

Interesting cases (Live URL Viewer links for comparison purposes):

@macchiati I suspect this might require further adjustments to TR46 in due course, once we figure out the full details. Part of the problem here is that browsers have been slow on aligning with requirements in general and making the necessary adjustments.

Janpot commented 5 years ago

Digging a bit further, and another cornercase that seems related to this is that host setter doesn't throw when set to an invalid value:

const x = new URL('http://example.com');
x.host = 'xn--a';
console.log(x.href);
// node: http://example.com/
// browser: http://xn--a/

While

const x = new URL('http://example.com');
x.href = 'http://xn--a/';
console.log(x.href);
// node: throws "Invalid URL: http://xn--a/"
// browser: http://xn--a/

Wouldn't it make sense to make all setters throw when it results in an invalid URL, not only the href setter. According to the spec, this behavior is only specified for the href setter. (Maybe this should be a separate issue?)

EDIT:

And here's another funny one:

const x = new URL('http://example.com');
x.host = 'xn--ß.com';
console.log(x.href);
// node: http://example.com/
// chrome: http://xn--%C3%9F.com/
// firefox: http://xn--xn---yna.com/ 
// safari: http://example.com/

Only firefox seems to be consistent withitself:

const x = new URL('http://example.com');
x.host = 'xn--ß.com';
const y = new URL('http://xn--ß.com');
console.log(x.href === y.href);
// node: exception on line 3
// chrome: exception on line 3
// firefox: true
// safari: exception on line 3
annevk commented 5 years ago

That's how the host setter deals with incorrect input for largely historical reasons. The input ends up getting ignored. It should be functionally equivalent otherwise though.

Janpot commented 5 years ago

Ok I see. fwiw, I can't think of a situation where I want this to just ignore my input, rather than complain. But I guess that would be a breaking change by now.

annevk commented 5 years ago

Unfortunately it would be. There's been some talk about a dedicated host API, which will not have this problem.

zackw commented 5 years ago

I'd like to point out that the current rev of the IDNA RFC [IDNA2008] encourages applications that do DNS lookup to be liberal in what they accept, and in particular to "rely on the assumption that names that are present in the DNS are valid" except for specific cases which are known to cause "serious problems". In particular, note the text at the end of section 5.4:

For all other strings, the lookup application MUST rely on the presence or absence of labels in the DNS to determine the validity of those labels and the validity of the characters they contain. If they are registered, they are presumed to be valid; if they are not, their possible validity is not relevant.

where "all other strings" means "all strings that have passed the sequence of checks for 'serious problems' described in sections 5.3 and 5.4".

Here are some examples of URLs that I have personally observed in the wild (during my research, which involves Web crawling) to contain hostnames which are formally invalid per some RFC or other, but do not rise to the level of a 'serious problem', and which I think should probably be accepted by the URL standard, if only for interop's sake:

http://r2---sn-gvbxgn-tt1s.googlevideo.com/
http://r9---sn-i3b7sn7d.googlevideo.com/
http://lgbt_grani.livejournal.com/
http://www.mi-ru_mo.bbs.fc2.com/
http://-friction-.tumblr.com/
domenic commented 4 years ago

I wonder if we should consider enshrining browsers' "ASCII fast path", where they don't perform ToASCII on ASCII inputs. In https://bugs.chromium.org/p/chromium/issues/detail?id=724018 @annevk seemed to think that was a bad idea, but I'm not sure I fully understood the negative consequences of that direction.

annevk commented 4 years ago

Yeah, I think that's probably needed given the number of existing systems that seem to rely on this to varying degrees. I think my concerns were mostly design-wise, that it seems somewhat bad to have a different set of restrictions on non-ASCII and ASCII input, e.g., with regards to length.

If it wasn't already the case it might also lead to certain security issues I suppose, as you can smuggle invalid xn-- sequences in that might trip up something downstream that is poorly configured. That browsers already allow this hopefully means the ecosystem is already robust against such surprises.

rmisev commented 4 years ago

Example: http://xn--a.xn--nxa/ According to "ASCII fast path" it must pass. But I can't browse to this address. When entered in the address bar, Chrome converts it to http://xn--a.β/ and fails (because it treats http://xn--a.β/ invalid). I think if Chrome converts it to Unicode form, then must treat both equivalent. But actually it isn't:

I think if URL's ASCII form is valid, then converted to Unicode form must by valid too. And vice versa - if URL is invalid in one form, then it must be invalid in other form too. Otherwise we got weird results. So I am the opposite of "ASCII fast path".

annevk commented 4 years ago

Well, but it does seem like Chrome (and similar for Firefox and Safari) attempts to browse to http://xn--a.xn--nxa/, right? Whereas for http://xn--a.β/ it does a search. So if you fiddled with your DNS to put something there it would probably work.

I agree that this leads to weirdly inconsistent rules though, so if we go down this path we should be very explicit about it and document these side effects.

rmisev commented 4 years ago

Well, but it does seem like Chrome (and similar for Firefox and Safari) attempts to browse to http://xn--a.xn--nxa/, right? Whereas for http://xn--a.β/ it does a search.

Yes, you are right. Anyway converting valid URL to not valid (even visually) is weird.

macchiati commented 4 years ago

A fix is being proposed for tr46. @Markus Scherer markus.icu@gmail.com

On Thu, Oct 1, 2020, 05:49 Rimas Misevičius notifications@github.com wrote:

Well, but it does seem like Chrome (and similar for Firefox and Safari) attempts to browse to http://xn--a.xn--nxa/, right? Whereas for http://xn--a.β/ it does a search.

Yes, you are right. Anyway converting valid URL to not valid (even visually) is weird.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/whatwg/url/issues/438#issuecomment-702111155, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACJLEMADKJ4QIUHCHMDHA23SIR3EBANCNFSM4HJ5J26Q .

markusicu commented 4 years ago

A fix is being proposed for tr46.

The fix proposed for UTS #46 is to detect "xn--" and "xn--ASCII-" as errors in a way that's equivalent with what IDNA2008 does. See https://www.unicode.org/L2/L2020/20240-utc165-properties-recs.pdf item F7 (on page 8).

More generally, the strategy is to report errors but still produce output (unless, for example, the Punycode string is ill-formed and thus not decodable), because different users/callers may ignore different types of errors.

domenic commented 4 years ago

Thanks @markusicu! However, what about xn-ASCII with no trailing dash? As seen in e.g. https://jsdom.github.io/whatwg-url/#url=aHR0cHM6Ly94bi0tYS5jb20v&base=YWJvdXQ6Ymxhbms=, browsers treat these sorts of URLs as parseable. Will the updated UTS 46 also produce output for those?

markusicu commented 4 years ago

what about xn-ASCII with no trailing dash? ... Will the updated UTS 46 also produce output for those?

I assume you mean xn--ASCII with double hyphen. The difference is that the additional hyphen in xn--ASCII- separates the "basic characters" (ASCII) from the actual Punycode encoding, and that is empty in this kind of label, which means that just Punycode-decoding it returns the ASCII part and you have an alternate encoding of the same label. (Punycode does not fail. IDNA2008 fails a round-trip check.)

Without the additional hyphen the "ASCII" substring is not actually ASCII at all but it's all-non-ASCII Punycode.

I don't think that UTS #46 is missing anything for those. https://www.unicode.org/reports/tr46/#ProcessingStepConvertValidate

It might be ill-formed Punycode, and the spec says to just record an error for that label. If it's well-formed, then the decoded string is subjected to validation, which in turn might record an error if there is a disallowed character or something else wrong.

TimothyGu commented 3 years ago

This issue demonstrates a need for URLs such as xn--x.com to be preserved as xn--x.com, despite the Punycode decoding error. However, to prevent reparse bugs, we need to treat Unicode and validly-encoded ASCII versions of a invalid label the same way. In other words:

I propose allowing ASCII labels with Punycode decoding errors to remain, but still forbid other types of UTS 46 error. So we have the following matrix:

Domain spec Chrome Firefox Safari proposal
xn--a-ecp.ru fail xn--a-ecp.ru xn--a-ecp.ru fail fail fail
a⒈.ru fail fail xn--a-q10i.ru fail fail fail
xn--a.xn--nxa fail xn--a.xn--nxa xn--a.xn--nxa xn--a.xn--nxa xn--a.xn--nxa
xn--a.β fail fail xn--a.xn--nxa fail xn--a.xn--nxa
xn--é fail fail xn--xn---epa fail fail
xn--xn---epa xn--xn---epa xn--xn---epa xn--xn---epa xn--xn---epa xn--xn---epa

There's already precedent (Safari) for treating Punycode decoding error differently from other UTS 46 failures, as one can see by comparing xn--a-ecp.ru against xn--a.xn--nxa. However, this also means we will need a UTS 46 modification to distinguish Punycode decoding errors from other types of errors.

One way to get this is adding a IgnoreInvalidPunycode boolean flag to UTS 46, and in Processing's xn-- step, change it to:

  1. Attempt to convert the rest of the label to Unicode according to Punycode [RFC3492]. If that conversion fails, record that there was an error if the label contains non-ASCII characters or if IgnoreInvalidPunycode is false, and continue with the next label. Otherwise replace the original label in the string by the results of the conversion.
annevk commented 1 year ago

@markusicu @macchiati thoughts on https://github.com/whatwg/url/issues/438#issuecomment-842578214? Especially for the xn--a.xn--nxa case where all implementations reportedly do the same thing, but UTS46 does not.

annevk commented 1 year ago

Hmm, it seems that only Chromium-based browsers still have a problem here studying the results of https://wpt.fyi/results/url/toascii.window.html so maybe no change is needed. @foolip are you all planning on fixing those remaining failures?

foolip commented 1 year ago

@ricea can you make a judgment about these failures and the linked bugs?

ricea commented 1 year ago

@foolip I think we want to fix these.

I think the linked bug for these issues is slightly different, so I newly filed https://crbug.com/1406728

annevk commented 1 year ago

Thanks, I think that means we can close this out. @ricea your help and insights on #543 would be appreciated.