whatwg / url

URL Standard
https://url.spec.whatwg.org/

Hoist "forbidden domain code point" check into "domain to ASCII" #818

Open hsivonen opened 7 months ago

hsivonen commented 7 months ago

What is the issue with the URL Standard?

When reading https://url.spec.whatwg.org/#concept-domain-to-ascii in isolation from https://url.spec.whatwg.org/#concept-host-parser (and without reading ICU4C's uts46.cpp first), it's not at all apparent that 1) STD3 rules are really a post-processing step to UTS 46 mapping, despite UTS 46 making them look like a pre-processing step, and that 2) the URL Standard's forbidden domain code point check is a similar but different post-processing step that takes place instead of STD3 post-processing.

The spec could be improved by hoisting the forbidden domain code point check from under https://url.spec.whatwg.org/#concept-host-parser into https://url.spec.whatwg.org/#concept-domain-to-ascii and adding a note that it is an ASCII filtering step that happens instead of STD3 filtering for compatibility with (whatever it is for compatibility with).

Even better if the Note listed what the difference between STD3 filtering and "forbidden domain code point" filtering is (16 rather surprising ASCII characters by my manual check) and the rationale for the differences.

annevk commented 7 months ago

This relates to #397. Moving the check seems like an obvious improvement we can make right away. The remainder is quite a bit harder to do.

hsivonen commented 6 months ago

After looking into this more, I think the right abstraction would be for UTS 46 to take an ASCII deny list instead of taking a boolean flag for STD3 rules.

What the deny list can modify should probably be constrained so that denying ASCII letters, digits, hyphen or full-stop would not be allowed. I think it would simplify data quite a bit if the caller of UTS 46 was not permitted to allow the ASCII space. (I am not aware of use cases for permitting ASCII space in domain name-like things, and the characteristics of the output get weird if space is allowed.)

But whether the rest of ASCII is allowed or denied could be customizable by the caller of UTS 46, and I think acting on that deny list should belong in the UTS 46 algorithms and not in the algorithms in URL.

So far, I'm not aware of more than two relevant configurations: the STD3 list (deny everything that I didn't list as must-allow above) and the WHATWG list ("forbidden domain code point"). So far, in the code I'm writing, I'm supporting only these two options.
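As an illustration of the two configurations described above, here is a minimal Python sketch, not taken from any implementation. The names `MUST_ALLOW`, `STD3_DENY_LIST`, `WHATWG_DENY_LIST`, and `validate_deny_list` are hypothetical. The WHATWG set is transcribed from my reading of the URL Standard's "forbidden domain code point" definition and should be checked against the spec text.

```python
import string

# Characters a caller may never deny, per the constraint proposed above:
# ASCII letters, digits, hyphen, and full stop.
MUST_ALLOW = frozenset(string.ascii_letters + string.digits + "-.")

ALL_ASCII = frozenset(chr(c) for c in range(0x80))

# STD3 list: deny every ASCII code point that is not must-allow.
STD3_DENY_LIST = ALL_ASCII - MUST_ALLOW

# WHATWG list (assumption: transcribed by hand from "forbidden domain code
# point"; verify against the current URL Standard before relying on it).
WHATWG_DENY_LIST = frozenset(
    [chr(c) for c in range(0x20)]   # C0 controls, U+0000 to U+001F
    + list(" #/:<>?@[\\]^|")        # remaining forbidden host code points
    + ["%", "\x7f"]                 # U+0025 and U+007F
)


def validate_deny_list(deny_list):
    """Check a caller-supplied deny list against the proposed constraints."""
    if not deny_list <= ALL_ASCII:
        raise ValueError("deny list must contain only ASCII code points")
    if deny_list & MUST_ALLOW:
        raise ValueError("letters, digits, hyphen and full stop cannot be denied")
    if " " not in deny_list:
        raise ValueError("ASCII space must always be denied")
    return deny_list
```

Under these definitions `"_"` is in `STD3_DENY_LIST` but not in `WHATWG_DENY_LIST`, while `"%"` is in both, and both lists pass `validate_deny_list`.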

I'm thinking of sending UTS 46 feedback to this effect. @annevk, what do you think?

annevk commented 6 months ago

Any kind of restructuring that's editorial but can lead to more efficient implementations seems fair game and I'm supportive of that being pursued.

hsivonen commented 5 months ago

For reference, I sent this feedback:

When implementing UTS 46, the most time-consuming wrong path was trying to design data structures for UTS 46 data assuming that the data needs to have distinct data entries for disallowed_STD3_valid and disallowed_STD3_mapped before discovering that these can be handled as valid and mapped with an ASCII deny list applied afterwards.

I suggest refactoring the spec so that:

1) disallowed_STD3_valid and disallowed_STD3_mapped become simply valid and mapped in the data and the spec says when to apply an ASCII deny list

2) instead of a boolean UseSTD3ASCIIRules the algorithm would take an ASCII deny list.

UTS 46 itself could define an STD3 ASCII deny list and the WHATWG URL Standard could use forbidden domain code point https://url.spec.whatwg.org/#forbidden-domain-code-point as an ASCII deny list parameter to UTS 46.

It would probably be appropriate to make informative remarks that a) putting ASCII letters, digits, or hyphen on the deny list would break things and b) in the validation phase, the ASCII period can be put on the deny list to handle that validity constraint as part of the ASCII deny list check.
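The "map first, then apply an ASCII deny list" shape proposed in this feedback can be sketched roughly as follows. This is a toy illustration under loud assumptions: `stand_in_mapping` uses NFKC plus lowercasing as a crude placeholder and is not the UTS 46 mapping, `map_then_filter` is a hypothetical name, and the WHATWG set is hand-transcribed from the URL Standard's "forbidden domain code point" definition.

```python
import string
import unicodedata

# Hypothetical deny lists. STD3: all ASCII except letters, digits, hyphen,
# and full stop. WHATWG: hand-transcribed "forbidden domain code point" set
# (an assumption; verify against the URL Standard).
MUST_ALLOW = frozenset(string.ascii_letters + string.digits + "-.")
STD3_DENY_LIST = frozenset(chr(c) for c in range(0x80)) - MUST_ALLOW
WHATWG_DENY_LIST = frozenset(
    [chr(c) for c in range(0x20)] + list(" #%/:<>?@[\\]^|") + ["\x7f"]
)


def stand_in_mapping(domain):
    # Placeholder for the UTS 46 mapping step. NFKC + lowercase is NOT the
    # real mapping; it is only here so that non-ASCII input can map to ASCII.
    return unicodedata.normalize("NFKC", domain).lower()


def map_then_filter(domain, deny_list):
    """Map without any STD3-specific statuses, then apply the deny list."""
    mapped = stand_in_mapping(domain)
    denied = sorted({ch for ch in mapped if ch in deny_list})
    if denied:
        raise ValueError(f"denied ASCII code points: {denied!r}")
    return mapped
```

For example, `map_then_filter("Ex\uff3fample.COM", WHATWG_DENY_LIST)` maps U+FF3F FULLWIDTH LOW LINE to `"_"` and accepts the result, whereas the same call with `STD3_DENY_LIST` raises `ValueError`. The mapping step itself never needs to know which list is in use, which is the point of the proposed refactoring.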