hsivonen opened 7 months ago
@annevk, do you recall the motivation for setting `VerifyDNSLength` to `false`?
@markusicu, do you recall if you had a more concrete reason than “people sometimes use libraries and protocols with non-standard inputs” for not using the tight bounds in ICU4C?
CC @valenting, who removed the DNS-oriented length limit from Gecko, AFAICT due to WPTs rather than due to actual Web compat.
Hmm. Even with the error condition of counting encode output units, there should still be a limit on encode input units to avoid denial of service. I haven’t even tried doing the math of what the tight bound there would be.
I think conceptually the idea is that we don't want to couple the hostnames of special URLs too tightly with DNS, although I don't think it is well defined (long-standing issue https://github.com/whatwg/url/issues/397).
That said, a length limit on the encode function to prevent overflow is a practical necessity. My library uses 32-bit integers, so to prevent overflow we refuse to encode input sequences longer than 3854 scalars. That's not even for DoS protection; it's just for correctness.
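A guard like that can be sketched as follows. The wrapper name is hypothetical, and Python's built-in `punycode` codec stands in for whatever encoder a library actually uses; only the 3854-scalar bound comes from the comment above.

```python
# Illustrative pre-encode guard: refuse inputs long enough that the
# Punycode delta arithmetic could overflow a 32-bit integer.
# checked_punycode_encode is a hypothetical wrapper, not a real API.
MAX_INPUT_SCALARS = 3854

def checked_punycode_encode(label: str) -> str:
    if len(label) > MAX_INPUT_SCALARS:
        raise ValueError("input too long: Punycode deltas could overflow 32 bits")
    # Python's stdlib punycode codec stands in for the real encoder here.
    return label.encode("punycode").decode("ascii")
```

For example, `checked_punycode_encode("bücher")` yields `"bcher-kva"`, while a 4000-scalar input is rejected before any encoding work is done.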
I wouldn't necessarily mind a tighter bound to keep the n^2 encoding algorithm under control. I'm not sure it makes sense to use length limits from DNS, though.
I think one example why you wouldn't want to enforce a DNS limit on URLs is something like Tor onion URLs.
As far as I can tell, right now it's `http://[56characters].onion`, but since Tor doesn't use DNS to resolve this address, it shouldn't necessarily be limited by it.
Given that ASCII labels have O(n) behavior and non-ASCII labels have O(n^2) behavior, I think it's not good for an implementation to take the position that a) the length of ASCII labels is unconstrained and b) ASCII and non-ASCII labels have the same length (non-)requirements. It follows that either the specs give precise interoperable constraints or implementations make something up under https://infra.spec.whatwg.org/#algorithm-limits as ICU4C has done (or risk denial of service).
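A per-label policy along those lines can be sketched like this. The numeric caps are placeholders, NOT ICU4C's actual constants (those live in the linked `punycode.cpp`); the point is only that the encode and decode paths can get different limits while the ASCII fast path stays unconstrained.

```python
# Sketch of per-label caps in the spirit of ICU4C's arbitrary limits.
# ENCODE_MAX_INPUT and DECODE_MAX_INPUT are assumed values for illustration.
ENCODE_MAX_INPUT = 1000   # cap on scalars fed to Punycode encode (O(n^2) path)
DECODE_MAX_INPUT = 2000   # cap on ASCII chars fed to Punycode decode

def label_within_limits(label: str) -> bool:
    if label.startswith("xn--"):
        # Decode path: bound the input after the ACE prefix.
        return len(label) - len("xn--") <= DECODE_MAX_INPUT
    if label.isascii():
        # Pure-ASCII labels skip Punycode entirely (the O(n) path).
        return True
    # Encode path: bound the number of input scalars.
    return len(label) <= ENCODE_MAX_INPUT
```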
It’s a bit unclear to me what the foreseeable space of naming systems is. `.onion` naming uses an ASCII label before the `onion` label, so the concern of allowing longer cryptographic material in `.onion` naming is relevant to setting `VerifyDNSLength` to `false` but not relevant to what length constraints might be appropriate for Unicode labels.
I think agreeing on a maximum label length and a maximum number of labels would be reasonable. Presumably to be enforced after ToASCII is performed, but potentially possible to be inlined? Do we also want to agree on limits for other fields of a URL? All implementations probably have them.
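For reference, the DNS-style checks that `VerifyDNSLength=true` implies can be sketched like this, applied to the ToASCII output (63 octets per label, 253 for the whole name). Whether URLs should enforce these at all is exactly what is under discussion; the function name is illustrative.

```python
# DNS length checks on a ToASCII result (a sketch, not a spec requirement).
def verify_dns_length(ascii_domain: str) -> bool:
    name = ascii_domain.rstrip(".")          # drop a trailing root dot
    labels = name.split(".")
    # Each label must be 1..63 octets; ToASCII output is ASCII, so
    # character count equals octet count here.
    if any(not 1 <= len(label) <= 63 for label in labels):
        return False
    return len(name) <= 253                  # whole-name limit
```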
What is the issue with the URL Standard?
The URL Standard, UTS 46, and RFC 3492 don’t specify interoperable behavior for Punycode encode and decode failures when a label is longer than what actually makes sense for DNS purposes.
If the input is too long, at some point an integer internal to the Punycode algorithm overflows. See https://datatracker.ietf.org/doc/html/rfc3492.html#section-6.4
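The check RFC 3492 prescribes at that point (fail on overflow rather than wrap) can be sketched for 32-bit internals as follows; the function name is made up, and it covers just the `i = i + digit * w` accumulation step, not the full decoder.

```python
MAXINT = 2**32 - 1  # RFC 3492 only requires integers big enough for the input

def accumulate_delta(i: int, digit: int, w: int) -> int:
    # Mirrors the decoder step "let i = i + digit * w"; section 6.4 of
    # RFC 3492 requires failing when this would exceed maxint.
    if digit > (MAXINT - i) // w:
        raise OverflowError("Punycode overflow")
    return i + digit * w
```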
One way to specify this would be to mandate that the internal integer size be 32 bits, but that can lead to denial-of-service attacks with unreasonably long inputs. (Apparently Chrome’s fuzzers managed to time out when fuzzing Punycode.) For this reason, ICU4C has somewhat arbitrary length limits for the inputs to Punycode decode and encode. https://unicode-org.atlassian.net/browse/ICU-13727 https://searchfox.org/mozilla-central/rev/6bc0f370cc459bf79e1330ef74b21009a9848c91/intl/icu/source/common/punycode.cpp#173-176
The rationale from the issue is:
The non-arbitrary tight bound would be to fail before decoding Punycode if the decoder input (not counting the `xn--` prefix) would exceed 59 (ASCII) characters and to fail during encoding if the encoder (not counting the `xn--` prefix) is about to output a 60th (ASCII) character.

Using the tight bound would come pretty close to setting `VerifyDNSLength` to `true` (close, but not exactly: it would still not place a limit on ASCII-only labels or on the domain name as a whole). Yet, the URL Standard sets `VerifyDNSLength` to `false`. This comes from https://github.com/whatwg/url/commit/3bec3b89c4deb10842ba6c464c700df47c268f17 , which does not state motivation.

Without knowing the motivation for setting `VerifyDNSLength` to `false`, it’s hard to assess whether placing the tight bounds on Punycode would work.

I think the specs should make the behavior here well defined even if it’s not a particularly pressing issue, since it only concerns labels that are too long for DNS anyway. (This probably belongs in UTS 46, but filing this here for discussion before sending UTS 46 feedback.)
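The tight bounds described above could look roughly like this. Both helpers are hypothetical, Python's stdlib `punycode` codec stands in for a real encoder, and the encoder-side check is shown after the fact for brevity, whereas a real implementation would abort before emitting the 60th character.

```python
TIGHT_BOUND = 59  # max ASCII chars after "xn--" that still fit a 63-octet label

def decode_input_within_bound(label: str) -> bool:
    # Fail before decoding if the part after "xn--" exceeds 59 ASCII chars.
    assert label.startswith("xn--")
    return len(label) - len("xn--") <= TIGHT_BOUND

def encode_within_bound(label: str) -> str:
    # Simplified: checks the finished output instead of aborting mid-encode.
    out = label.encode("punycode").decode("ascii")
    if len(out) > TIGHT_BOUND:
        raise ValueError("encoded label would exceed 59 ASCII characters")
    return out
```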