whatwg / url

URL Standard
https://url.spec.whatwg.org/

Punycode behavior for labels exceeding DNS length is ill-defined #824

Open hsivonen opened 7 months ago

hsivonen commented 7 months ago

What is the issue with the URL Standard?

The URL Standard, UTS 46, and RFC 3492 don’t specify interoperable behavior for Punycode encode and decode failures when a label is longer than what actually makes sense for DNS purposes.

If the input is too long, at some point an integer internal to the Punycode algorithm overflows. See https://datatracker.ietf.org/doc/html/rfc3492.html#section-6.4
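To make the overflow condition concrete, here is a minimal decoder sketch (not spec text) following the RFC 3492 pseudocode, with the section 6.4 "fail on overflow" guards and a simulated 32-bit unsigned integer; it handles lowercase input only, for brevity:

```python
# Minimal RFC 3492 Punycode decoder sketch (lowercase input only),
# showing the section 6.4 overflow guards with a 32-bit maxint.
MAXINT = 2**32 - 1
BASE, TMIN, TMAX, SKEW, DAMP = 36, 1, 26, 38, 700
INITIAL_BIAS, INITIAL_N = 72, 128

def _digit(c: str) -> int:
    if "a" <= c <= "z":
        return ord(c) - ord("a")
    if "0" <= c <= "9":
        return ord(c) - ord("0") + 26
    raise ValueError("invalid Punycode digit: %r" % c)

def _adapt(delta: int, numpoints: int, firsttime: bool) -> int:
    delta = delta // DAMP if firsttime else delta // 2
    delta += delta // numpoints
    k = 0
    while delta > ((BASE - TMIN) * TMAX) // 2:
        delta //= BASE - TMIN
        k += BASE
    return k + (BASE - TMIN + 1) * delta // (delta + SKEW)

def punycode_decode(s: str) -> str:
    pos = s.rfind("-")
    output = list(s[:pos]) if pos > 0 else []
    rest = s[pos + 1:] if pos >= 0 else s
    n, i, bias, idx = INITIAL_N, 0, INITIAL_BIAS, 0
    while idx < len(rest):
        oldi, w, k = i, 1, BASE
        while True:
            if idx >= len(rest):
                raise ValueError("truncated Punycode input")
            digit = _digit(rest[idx]); idx += 1
            if digit > (MAXINT - i) // w:        # RFC 3492 6.4: fail on overflow
                raise OverflowError("punycode overflow")
            i += digit * w
            t = TMIN if k <= bias else (TMAX if k >= bias + TMAX else k - bias)
            if digit < t:
                break
            if w > MAXINT // (BASE - t):         # second overflow guard
                raise OverflowError("punycode overflow")
            w *= BASE - t
            k += BASE
        bias = _adapt(i - oldi, len(output) + 1, oldi == 0)
        if i // (len(output) + 1) > MAXINT - n:  # third overflow guard
            raise OverflowError("punycode overflow")
        n += i // (len(output) + 1)
        i %= len(output) + 1
        output.insert(i, chr(n))                 # chr() itself rejects n > 0x10FFFF
        i += 1
    return "".join(output)
```

An input consisting of many maximal digits makes w and i grow until a guard trips, which is the behavior whose length precondition the specs leave open.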

One way to specify this would be to require that the internal integer size be 32 bits, but that can lead to denial-of-service attacks with unreasonably long inputs. (Apparently Chrome's fuzzers managed to time out when fuzzing Punycode.) For this reason, ICU4C has somewhat arbitrary length limits for the inputs to Punycode decode and encode. https://unicode-org.atlassian.net/browse/ICU-13727 https://searchfox.org/mozilla-central/rev/6bc0f370cc459bf79e1330ef74b21009a9848c91/intl/icu/source/common/punycode.cpp#173-176

The rationale from the issue is:

A well-formed label is limited to 63 bytes, which means at most 59 bytes after "xn--". However, we don't have any limit so far, and people sometimes use libraries and protocols with non-standard inputs.

Something 1000-ish seems like a reasonable compromise to keep n^2 tame and users happy even with somewhat unusual inputs.

The non-arbitrary tight bound would be to fail before decoding Punycode if the decoder input (not counting the xn-- prefix) would exceed 59 (ASCII) characters, and to fail during encoding if the encoder is about to output a 60th (ASCII) character (again not counting the xn-- prefix).
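As a sketch of what the tight bound would mean in practice (helper names are hypothetical, not from any spec):

```python
# Sketch of the tight DNS-derived bounds; names are hypothetical.
ACE_PREFIX = "xn--"
MAX_UNITS = 59  # 63-octet DNS label minus the 4-octet "xn--" prefix

def may_decode(label: str) -> bool:
    """Refuse to decode if the Punycode body exceeds 59 ASCII units."""
    body = label[len(ACE_PREFIX):] if label.startswith(ACE_PREFIX) else label
    return len(body) <= MAX_UNITS

class BoundedOutput:
    """Collects encoder output, failing before a 60th unit is emitted."""
    def __init__(self):
        self.units = []
    def emit(self, c: str) -> None:
        if len(self.units) >= MAX_UNITS:
            raise ValueError("label would exceed 63 octets including xn--")
        self.units.append(c)
```

The decode-side check is a pure precondition, while the encode-side check has to be interleaved with the algorithm, since the output length is not known from the input length alone.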

Using the tight bound would come pretty close to setting VerifyDNSLength to true (close, but not exact: it would still not place a limit on ASCII-only labels or on the domain name as a whole). Yet, the URL Standard sets VerifyDNSLength to false. This comes from https://github.com/whatwg/url/commit/3bec3b89c4deb10842ba6c464c700df47c268f17 , which does not state the motivation.

Without knowing the motivation for setting VerifyDNSLength to false, it’s hard to assess if placing the tight bounds on Punycode would work.

I think the specs should make the behavior here well defined even if it’s not a particularly pressing issue, since it only concerns labels that are too long for DNS anyway. (This probably belongs in UTS 46, but I’m filing this here for discussion before sending UTS 46 feedback.)

hsivonen commented 7 months ago

@annevk, do you recall the motivation for setting VerifyDNSLength to false?

@markusicu, do you recall if you had a more concrete reason than “people sometimes use libraries and protocols with non-standard inputs” for not using the tight bounds in ICU4C?

hsivonen commented 7 months ago

CC @valenting who removed the DNS-oriented length limit from Gecko, AFAICT, due to WPTs and not due to actual Web compat.

hsivonen commented 7 months ago

Hmm. Even with an error condition based on counting encode output units, there should still be a limit on encode input units to avoid denial of service. I haven’t even tried doing the math of what the tight bound there would be.

karwa commented 7 months ago

I think conceptually the idea is that we don't want to couple the hostnames of special URLs too tightly with DNS, although I don't think it is well defined (long-standing issue https://github.com/whatwg/url/issues/397).

That said, a length limit on the encode function to prevent overflow is a practical necessity. My library uses 32-bit integers, so to prevent overflow we refuse to encode input sequences longer than 3854 scalars. That's not even for DoS protection; it's just for correctness.

https://github.com/karwa/swift-url/blob/01ad5a103d14839a68c55ee556513e5939008e9e/Sources/IDNA/Punycode.swift#L185-L202
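As a sketch of such a guard (using Python's built-in punycode codec purely for illustration; the 3854 figure is the limit from the linked library, not a spec value):

```python
# Input-length guard before encoding. 3854 is the limit the linked
# swift-url implementation uses so its 32-bit arithmetic cannot
# overflow -- it is not a spec-mandated value.
MAX_ENCODE_INPUT = 3854

def guarded_encode(label: str) -> str:
    if len(label) > MAX_ENCODE_INPUT:
        raise ValueError("input too long for a 32-bit Punycode encoder")
    # Python's built-in codec uses unbounded integers, so the guard
    # here only models the limit; a 32-bit implementation needs it
    # for correctness, not just for DoS protection.
    return label.encode("punycode").decode("ascii")
```

The point of a correctness-motivated bound like this is that it depends on the implementation's integer width, which is why different implementations end up with different, non-interoperable limits absent a spec.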

I wouldn't necessarily mind a tighter bound to keep the n^2 encoding algorithm under control. I'm not sure it makes sense to use length limits from DNS, though.

valenting commented 7 months ago

I think one example why you wouldn't want to enforce a DNS limit on URLs is something like Tor onion URLs. As far as I can tell right now it's http://[56characters].onion - but since Tor doesn't use DNS to resolve this address, it shouldn't necessarily be limited by it.

annevk commented 7 months ago

397 is certainly an open question that's relevant here. I also don't think implementations ever enforced the limit for pure ASCII inputs. I'm not 100% positive that I also tested long non-ASCII inputs, but I would not be comfortable with having different length requirements for ASCII and non-ASCII.

hsivonen commented 7 months ago

Given that ASCII labels have O(n) behavior and non-ASCII labels have O(n^2) behavior, I think it's not good for an implementation to take the position that a) the length of ASCII labels is unconstrained and b) ASCII and non-ASCII labels have the same length (non-)requirements. It follows that either the specs give precise interoperable constraints or implementations make something up under https://infra.spec.whatwg.org/#algorithm-limits as ICU4C has done (or risk denial of service).

It’s a bit unclear to me what the foreseeable space of naming systems is.

397 points to NetBIOS, but, AFAICT, the URL Standard’s "forbidden domain code point" set does not appear to arise directly from NetBIOS. In any case, NetBIOS names seem to be limited to 15 ASCII characters, so that limit is a) irrelevant to non-ASCII labels and b) more constraining on length than DNS’s length constraint.

.onion naming uses an ASCII label before the onion label, so the concern of allowing longer cryptographic material in .onion naming is relevant to setting VerifyDNSLength to false but not relevant to what length constraints might be appropriate for Unicode labels.

annevk commented 7 months ago

I think agreeing on a maximum label length and a maximum number of labels would be reasonable. Presumably to be enforced after ToASCII is performed, but potentially possible to be inlined? Do we also want to agree on limits for other fields of a URL? All implementations probably have them.
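A sketch of what enforcing such limits after ToASCII could look like (the 63/253 values mirror DNS and UTS 46's VerifyDNSLength; whether the URL Standard should adopt exactly these numbers is the open question):

```python
# Hypothetical post-ToASCII check mirroring UTS 46 VerifyDNSLength:
# each label 1..63 octets; whole name, excluding any trailing dot,
# 1..253 octets.
MAX_LABEL_OCTETS = 63
MAX_NAME_OCTETS = 253

def verify_dns_length(ascii_domain: str) -> bool:
    name = ascii_domain[:-1] if ascii_domain.endswith(".") else ascii_domain
    if not 1 <= len(name) <= MAX_NAME_OCTETS:
        return False
    return all(1 <= len(label) <= MAX_LABEL_OCTETS
               for label in name.split("."))
```

Running the check after ToASCII keeps it independent of whether a label went through Punycode at all; inlining it into the Punycode steps would instead give the tight per-label bounds discussed earlier in the thread.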