whatwg / url

URL Standard
https://url.spec.whatwg.org/

Meta: UTS46 feedback #744

Open annevk opened 1 year ago

annevk commented 1 year ago

Feedback I submitted to be considered for the Unicode April 2023 meeting.


Chromium will ship Nontransitional Processing soon: https://chromestatus.com/feature/5105856067141632. That covers all browser engines. I suggest taking that opportunity to simplify this document and its test suite and declare the transition period for which this conditional existed to be over.


Steps don't always consider that domain labels can be empty, e.g., when CheckBidi is true the first subrule of "The Bidi Rule" inspects the first character of a label. I think that might also apply to CheckJoiners and potentially other steps. (I initially thought the problem here was VerifyDnsLength not being considered, but that check happens much later on in the processing model so it's something more fundamental.)


Please change U+2260 (≠), U+226E (≮), and U+226F (≯) from disallowed_STD3_valid to valid.

These code points are not decomposed so they can never conflict with =, <, and >. And they are not inherently more confusing than any of the other allowed code points, which include hieroglyphics and emoji. These code points also work as-is in all browser engines (while < and > are forbidden) and on balance preference ought to be given to retaining compatibility so end users are not prevented from visiting websites or seeing subresources that might use these code points in their domain for one reason or another.
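A quick way to check the decomposition claim, as a sketch only: under NFC (the normalization form UTS46 processing is understood to apply here) each of the three code points should normalize to itself, whereas NFD would split off a combining U+0338. The helper name `survives_nfc` is hypothetical; `unicodedata` is Python's standard module.

```python
import unicodedata

# The three code points discussed above: U+2260, U+226E, U+226F.
CODE_POINTS = ["\u2260", "\u226e", "\u226f"]


def survives_nfc(ch: str) -> bool:
    """Return True if NFC normalization leaves the character unchanged."""
    return unicodedata.normalize("NFC", ch) == ch


for ch in CODE_POINTS:
    nfd = unicodedata.normalize("NFD", ch)
    # Show the NFC result next to the NFD decomposition for comparison.
    print(f"U+{ord(ch):04X} nfc_stable={survives_nfc(ch)} "
          f"nfd={[f'U+{ord(c):04X}' for c in nfd]}")
```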

For further background and discussion please see https://github.com/whatwg/url/issues/733.

Thank you!

https://github.com/whatwg/url/issues/733#issuecomment-1384085197


I have worked on importing IdnaTestV2.txt into web-platform-tests, the test framework used by all web browsers. The goal was to meet the requirements of the domain to ASCII algorithm specified at https://url.spec.whatwg.org/#idna with beStrict initialized to false.

As such, I attempted to filter out ToASCII statuses for UseSTD3ASCIIRules, CheckHyphens, and VerifyDnsLength, hoping that any statuses that are left would indicate a failure requirement.
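The filtering idea can be sketched roughly as follows. This is not the code from the pull request; it assumes a test's status column arrives as a bracketed, comma-separated string such as `[A4_1, V6]`, and the function name `remaining_statuses` and the `IGNORED_VERIFY_DNS_LENGTH` set are hypothetical. Only the A4_1 and A4_2 codes come from this comment; codes for the other flags would have to be added by whoever uses it.

```python
# Status codes to ignore for VerifyDnsLength, as named in this comment.
# Codes for UseSTD3ASCIIRules and CheckHyphens are deliberately left out,
# because this sketch does not assume what they are.
IGNORED_VERIFY_DNS_LENGTH = {"A4_1", "A4_2"}


def remaining_statuses(status_field: str, ignored: set[str]) -> list[str]:
    """Parse a status field like "[A4_1, V6]" and drop ignorable codes."""
    inner = status_field.strip().strip("[]")
    codes = [code.strip() for code in inner.split(",") if code.strip()]
    return [code for code in codes if code not in ignored]


def expects_failure(status_field: str, ignored: set[str]) -> bool:
    """A test is treated as a failure requirement if any status is left."""
    return bool(remaining_statuses(status_field, ignored))
```

A test whose statuses all fall in the ignored set would then be treated as expected to succeed, which is why tests carrying several unrelated statuses are awkward to filter automatically.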

You can find my work at https://github.com/web-platform-tests/wpt/pull/38080.

I ran into the following issues. Most of them relate to status annotation. IPv4 address confusion was the one issue that did not relate to statuses.

  • VerifyDnsLength is not P4, but rather A4_1 and A4_2.
  • Tests that use trailing ASCII digit labels (or such a label followed by a dot) are not useful for browsers, as that will trigger the IPv4 parser, which will then usually return failure because the input was not actually an IPv4 address string. This is a problem for a number of the A4_1 and A4_2 tests, and also for a large number of tests later on, such as ToASCII("xn--gl0as212a.8.") or ToASCII("1.27"). I wrote a filter to exclude them, but it would be better if they were adjusted slightly (e.g., made to contain one non-EN code point) so what they aim to test can also be tested in browsers. (Note that the IPv4 parser runs after domain to ASCII, but the web platform doesn't provide a way to invoke domain to ASCII on its own and probably never will.)
  • The test for ToASCII("$") is marked P1 and V6, not U1. This also affects numerous tests with <, >, and =. If they continue to have multiple statuses that will also make it impossible to filter them in an automated fashion. (This also applies to non-ASCII UseSTD3ASCIIRules code points, but I filed a separate request to remove those.)
  • NV8 is not used as a status.
  • A3 and X3 do not appear to be used as a status. (These are catered for by P4 presumably.)
  • CheckBidi is not V8. V8 does not appear to be used. You'd have to filter out all B1-6 statuses instead.
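For the IPv4 item above, a filter could look something like the sketch below. It is not the filter from the pull request; it follows one reading of the URL Standard's "ends in a number checker", and the function names are hypothetical.

```python
import string


def parses_as_ipv4_number(part: str) -> bool:
    """Rough equivalent of the IPv4 number parser not returning failure."""
    if part == "":
        return False
    digits = string.digits
    if len(part) >= 2 and part[:2] in ("0x", "0X"):
        part, digits = part[2:], string.hexdigits
    elif len(part) >= 2 and part[0] == "0":
        part, digits = part[1:], string.octdigits
    # An empty remainder (for example from "0x") counts as the number 0.
    return all(ch in digits for ch in part)


def ends_in_a_number(host: str) -> bool:
    """True if the host would be handed to the IPv4 parser."""
    parts = host.split(".")
    if parts[-1] == "":
        if len(parts) == 1:
            return False
        parts.pop()  # ignore a single trailing dot
    last = parts[-1]
    if last != "" and all(ch in string.digits for ch in last):
        return True
    return parses_as_ipv4_number(last)
```

With this sketch, inputs such as "xn--gl0as212a.8." and "1.27" are flagged, so a test harness could skip them rather than report a spurious failure.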

An issue reported against the URL Standard indicated that the current CheckBidi handling from UTS 46 is rather strict: https://github.com/whatwg/url/issues/543. Namely, domains containing RTL-labels cannot have labels consisting solely of ASCII digits preceding them (such labels are invalid per The Bidi Rule subrule 1). This ends up rejecting a number of domains in the wild and also seems unnecessarily restrictive for RTL users.

In that issue I worked with Harald Alvestrand (one of the editors of RFC 5893: Right-to-Left Scripts for Internationalized Domain Names for Applications (IDNA)) on a specific set of changes for UTS 46 that would remedy this issue, while still imposing the majority of Bidi-related requirements present in UTS 46 today.

The proposed changes are:

  1. Remove step 8 of https://unicode.org/reports/tr46/#Validity_Criteria as Validity Criteria only operates on a single label. (Although it somehow claims to have knowledge about the domain_name string as well...)
  2. Add a new step 5 to https://unicode.org/reports/tr46/#Processing. (Note that due to step 4 we will have U-labels.)

The new step 5 would read as follows:

  • If CheckBidi, and the domain_name string is a Bidi domain name, record there was an error if neither of the following conditions is true:
    • All labels in the domain_name string satisfy the 6 subrules of The Bidi Rule of RFC 5893, Section 2.
    • RTL labels in the domain_name string are immediately followed by an LDH label whose first code point is not of class EN and all labels in the domain_name string are either LDH labels or satisfy the 6 subrules of The Bidi Rule of RFC 5893, Section 2.
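The proposed step can be sketched in code as below. This is an illustration under several assumptions, not normative text: an RTL label is taken to be a label containing at least one character of Bidi class R, AL, or AN (as in RFC 5893); an LDH label is taken to be a non-empty label of ASCII letters, digits, and hyphens; an RTL label with no following label is treated as failing the second condition; and empty labels are treated as satisfying The Bidi Rule. All function names are hypothetical.

```python
import unicodedata

RTL_CLASSES = {"R", "AL", "AN"}


def bidi_class(ch: str) -> str:
    return unicodedata.bidirectional(ch)


def is_rtl_label(label: str) -> bool:
    return any(bidi_class(ch) in RTL_CLASSES for ch in label)


def is_ldh_label(label: str) -> bool:
    # Assumption: empty labels do not count as LDH labels.
    return label != "" and all(
        ch.isascii() and (ch.isalnum() or ch == "-") for ch in label
    )


def satisfies_bidi_rule(label: str) -> bool:
    """The six subrules of The Bidi Rule (RFC 5893, Section 2)."""
    if label == "":
        return True  # assumption: empty labels are not checked
    classes = [bidi_class(ch) for ch in label]
    first = classes[0]
    if first not in {"L", "R", "AL"}:  # subrule 1
        return False
    end = len(classes)
    while classes[end - 1] == "NSM":  # skip trailing NSM for subrules 3 and 6
        end -= 1
    last = classes[end - 1]
    if first in {"R", "AL"}:
        allowed = {"R", "AL", "AN", "EN", "ES", "CS", "ET", "ON", "BN", "NSM"}
        return (
            all(c in allowed for c in classes)  # subrule 2
            and last in {"R", "AL", "EN", "AN"}  # subrule 3
            and not ("EN" in classes and "AN" in classes)  # subrule 4
        )
    allowed = {"L", "EN", "ES", "CS", "ET", "ON", "BN", "NSM"}
    return all(c in allowed for c in classes) and last in {"L", "EN"}  # 5, 6


def bidi_step_has_error(labels: list[str], check_bidi: bool = True) -> bool:
    """Return True if the proposed step 5 would record an error."""
    if not check_bidi or not any(is_rtl_label(label) for label in labels):
        return False  # not a Bidi domain name, nothing to check
    if all(satisfies_bidi_rule(label) for label in labels):
        return False  # first condition holds
    rtl_followed_ok = all(
        i + 1 < len(labels)
        and is_ldh_label(labels[i + 1])
        and bidi_class(labels[i + 1][0]) != "EN"
        for i, label in enumerate(labels)
        if is_rtl_label(label)
    )
    all_ldh_or_valid = all(
        is_ldh_label(label) or satisfies_bidi_rule(label) for label in labels
    )
    return not (rtl_followed_ok and all_ldh_or_valid)
```

Under these assumptions, a digits-only label ahead of an RTL label (labels "1", a Hebrew label, "com") no longer records an error, while an RTL label directly followed by a label starting with a digit still does.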

Thank you for your consideration. This is probably the final IDNA-related issue from the URL Standard. Once all of them have been resolved I’ll work with browser implementers to ensure the changes (if any) get implemented so we can finally declare victory on IDNA interoperability.

karwa commented 1 year ago

+1.

Tests that use trailing ASCII digit labels (or such a label followed by a dot) are not useful for browsers as that will trigger the IPv4 parser.

I think it's still valuable to test what a UTS46 implementation does for these kinds of inputs, even if the URL host parser will later interpret them as invalid IPv4 addresses rather than domain names. But as you say, it is non-trivial to detect these inputs because of limitations in the APIs exposed by browsers, so perhaps these tests should include a flag.

An idea - bearing in mind that URLs are part of a living standard (and things like the ends-in-a-number predicate have changed fairly recently), perhaps the tests should include a set of "URLX" flags for use by implementors of this standard, to mark inputs which are technically allowed by UTS46 but may not be usable in URLs?

The test for ToASCII("$") is marked P1 and V6, not U1. This also affects numerous tests with <, >, and =.

Status U1 is not used at all by the test files :(

annevk commented 1 year ago

I submitted the CheckBidi feedback and added it to OP.

@karwa for now I did not submit feedback around your comment above. In a couple of months we'll find out how this initial round went. If there are still issues with the tests after that which we cannot resolve through a filter (as I did now), let's put a more coherent proposal together.

markusicu commented 1 year ago

Hi @annevk, regarding

Steps don't always consider that domain labels can be empty, e.g., when CheckBidi is true the first subrule of "The Bidi Rule" inspects the first character of a label. I think that might also apply to CheckJoiners and potentially other steps. (I initially thought the problem here was VerifyDnsLength not being considered, but that check happens much later on in the processing model so it's something more fundamental.)

I am looking at the text of UTS46 and I don't see what should be changed. For CheckBidi and CheckJoiners, we just refer to the RFCs.

We have some checks like

but it's pretty obvious what to do when the label is empty, or has fewer than 4 characters.

Please clarify.

(FYI @macchiati)

annevk commented 1 year ago

The problem is that the RFCs assume they are passed a label that is not the empty string. So we shouldn't call into the RFC when that is not the case.

markusicu commented 1 year ago

Looking at “the ContextJ rules”, https://www.rfc-editor.org/rfc/rfc5892.html#appendix-A processes a label with a pseudo-code loop of For All Characters. On an empty label, this is an empty loop.

For CheckBidi, I see that IDNA2008 appears to trigger the rule only if the label contains RTL characters while UTS46 triggers it if the domain name contains RTL. (Although https://www.rfc-editor.org/rfc/rfc5893.html#section-2 says that it “applies to labels in Bidi domain names”.) And https://www.rfc-editor.org/rfc/rfc5893.html#section-2 just has “1. The first character must be ...”

It seems like we could clarify this in UTS46 with this insertion:

If CheckBidi, and if the domain name is a Bidi domain name, and if the label is not empty, then the label must satisfy ...

markusicu commented 1 year ago

Alternative change: We could make this small insertion at the beginning of 4.1 Validity Criteria: “Each of the following criteria must be satisfied for a non-empty label”
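To illustrate why the non-empty guard matters, here is a minimal sketch: a literal reading of subrule 1 ("The first character must be ...") has no first character to inspect on an empty label, so an implementation either fails or has to invent an answer. The function names are hypothetical, and only subrule 1 is shown.

```python
import unicodedata


def first_character_ok(label: str) -> bool:
    """Literal reading of subrule 1; raises IndexError on an empty label."""
    return unicodedata.bidirectional(label[0]) in {"L", "R", "AL"}


def check_non_empty_labels(labels: list[str]) -> bool:
    """Apply the check only to non-empty labels, as the insertion suggests."""
    return all(first_character_ok(label) for label in labels if label != "")
```

With the guard, a domain with a trailing dot (labels "example", "com", "") passes this check instead of tripping over the empty final label.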

annevk commented 9 months ago

It looks like a number of changes have been made in response to our feedback: https://unicode.org/reports/tr46/#Modifications. I haven't yet made the time to review in detail.

domenic commented 8 months ago

I'm wondering if we should be making use of the new IgnoreInvalidPunycode flag.