Open annevk opened 1 year ago
+1.
Tests that use trailing ASCII digit labels (or such a label followed by a dot) are not useful for browsers as that will trigger the IPv4 parser.
I think it's still valuable to test what a UTS46 implementation does for these kinds of inputs, even if the URL host parser will later interpret them as invalid IPv4 addresses rather than domain names. But as you say, it is non-trivial to detect these inputs because of limitations in the APIs exposed by browsers, so perhaps these tests should include a flag.
An idea - bearing in mind that URLs are part of a living standard (and things like the ends-in-a-number predicate have changed fairly recently), perhaps the tests should include a set of "URLX" flags for use by implementors of this standard? To mark inputs which are technically allowed by UTS46 but may not be usable in URLs.
The test for ToASCII("$") is marked P1 and V6, not U1. This also affects numerous tests with <, >, and =.
Status U1 is not used at all by the test files :(
I submitted the CheckBidi feedback and added it to OP.
@karwa for now I did not submit feedback around your comment above. In a couple of months we'll find out how this initial round went. If there's still issues with the tests after that we cannot resolve through a filter (as I did now) let's put a more coherent proposal together.
Hi @annevk, regarding
Steps don't always consider that domain labels can be empty, e.g., when CheckBidi is true the first subrule of "The Bidi Rule" inspects the first character of a label. I think that might also apply to CheckJoiners and potentially other steps. (I initially thought the problem here was VerifyDnsLength not being considered, but that check happens much later on in the processing model so it's something more fundamental.)
I am looking at the text of UTS46 and I don't see what should be changed. For CheckBidi and CheckJoiners, we just refer to the RFCs.
We have some checks like
but it's pretty obvious what to do when the label is empty, or has fewer than 4 characters.
Please clarify.
(FYI @macchiati)
The problem is that the RFCs assume they are passed a label that is not the empty string. So we shouldn't call into the RFC when that is not the case.
Looking at “the ContextJ rules”, https://www.rfc-editor.org/rfc/rfc5892.html#appendix-A processes a label with a pseudo-code loop of For All Characters
. On an empty label, this is an empty loop.
For CheckBidi, I see that IDNA2008 appears to trigger the rule only if the label contains RTL characters while UTS46 triggers it if the domain name contains RTL. (Although https://www.rfc-editor.org/rfc/rfc5893.html#section-2 says that it “applies to labels in Bidi domain names”.) And https://www.rfc-editor.org/rfc/rfc5893.html#section-2 just has “1. The first character must be ...”
It seems like we could clarify this in UTS46 with this insertion:
If CheckBidi, and if the domain name is a Bidi domain name, and if the label is not empty, then the label must satisfy ...
Alternative change: We could make this small insertion at the beginning of 4.1 Validity Criteria: “Each of the following criteria must be satisfied for a non-empty label”
It looks like a number of changes have been made in response to our feedback: https://unicode.org/reports/tr46/#Modifications. I haven't yet made the time to review in detail.
I'm wondering if we should be making use of the new IgnoreInvalidPunycode flag.
Feedback I submitted to be considered for the Unicode April 2023 meeting.
— https://github.com/whatwg/url/issues/733#issuecomment-1384085197