whatwg / url

URL Standard
https://url.spec.whatwg.org/

Refusing a mix of numeric-only and BIDI domains #543

Open vorner opened 4 years ago

vorner commented 4 years ago

Hello

Some time ago I was trying to figure out why the domains below were rejected by the Rust url crate; it is tracked here. It seems they may be accidentally disallowed by the standard, and I was advised to raise it here.

It's a bit old, so I don't remember the exact details and would have to dig them up, but I tried to describe it in this comment. I think the issue was the combination of a numeric-only label and a BIDI label.

Now, my question is, should these be valid URLs? They certainly are valid domains, even though it might be discouraged to allow them, and the URLs are (or at least were when this was reported; I could provide new ones if needed) alive and reachable. Note that they are considered malware URLs, so be careful when handling them.

Parsing failed: invalid international domain name, http://mail.163.com.xn----9mcjf9b4dbm09f.com/iloystgnjfrgthteawvo/indexx.php
Parsing failed: invalid international domain name, http://shdedgelanimailnoticeborad.count.mail.163.com.xn----9mcjf9b4dbm09f.com/sitemap.html
Parsing failed: invalid international domain name, http://count.shdedgelanimailnoticeborad.count.mail.163.com.xn----9mcjf9b4dbm09f.com/bvv
Parsing failed: invalid international domain name, http://count.shdedgelanimailnoticeborad.count.mail.163.com.xn----9mcjf9b4dbm09f.com/index.php
Parsing failed: invalid international domain name, http://count.shdedgelanimailnoticeborad.count.mail.163.com.xn----9mcjf9b4dbm09f.com/iloystgnjfrgthteawvo/index.php

TRowbotham commented 4 years ago

These domains are considered invalid because they don't meet the criteria from RFC 5893 Section 2. Specifically, the label "163" fails criterion 1, which requires the first character of a label to have a Bidi property of L, R, or AL. The digits 0-9 have a Bidi property of EN (European Number) according to the DerivedBidiProps: "0030..0039 ; EN # Nd [10] DIGIT ZERO..DIGIT NINE".
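The Bidi classes involved can be checked with a minimal Python sketch that uses the standard library's `unicodedata.bidirectional`. The helper name `meets_condition_1` is made up for illustration, and only condition 1 of The Bidi Rule is modeled here, not the full rule.

```python
import unicodedata

# Bidi classes that condition 1 of The Bidi Rule (RFC 5893, Section 2)
# accepts for the first character of a label.
FIRST_CHAR_CLASSES = {"L", "R", "AL"}


def meets_condition_1(label: str) -> bool:
    """Hypothetical helper: check only condition 1 for a non-empty label."""
    return unicodedata.bidirectional(label[0]) in FIRST_CHAR_CLASSES


# Print the Bidi class of each label's first character and the verdict.
for label in ["mail", "163", "com"]:
    print(label, unicodedata.bidirectional(label[0]), meets_condition_1(label))
```

The label "163" is the only one of the three whose first character is not of class L, R, or AL.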

The domain to ASCII algorithm sets the CheckBidi option to true. Because the label does not meet the above criteria, Step 2 returns a failure value, which is rejected in Step 3. The host parser then returns failure, which in turn causes the URL parser to abort.

RFC 3490 Section 4.1 states:

If any step of the ToASCII operation fails on any label in a domain name, that domain name MUST NOT be used as an internationalized domain name.

So, the URL spec is doing the right thing here. The only 2 options for making these domains valid in terms of this spec, as far as I can tell, would be setting the CheckBidi option to false or allowing the options for the Unicode ToASCII steps to be user configurable.

vorner commented 4 years ago

I agree that the current spec disallows these URLs as invalid.

My question was more along the lines of „Was it the spec's/authors' intention to disallow them, or did the spec get written in a way that disallows them by accident?“

Several sections in there, pointed out in this comment, seem to suggest that allowing such URLs for compatibility with existing deployments was considered.

So, in other words, my question is not „Are they invalid“, but „Should they be invalid“?

domenic commented 3 years ago

This might be the same underlying issue as #438.

annevk commented 1 year ago

I think @TRowbotham has the correct analysis here and indeed it very much depends on how CheckBidi is used.

To simplify from OP:

However, https://www.rfc-editor.org/rfc/rfc5893.html#section-2 (which UTS46 invokes) also says:

In a domain name consisting of only LDH labels (as defined in the Definitions document [RFC5890]) and labels that satisfy the rule, the requirements of Section 3 are satisfied as long as a label that starts with an ASCII digit does not come after a right-to-left label.

But that seems contradictory as a label that starts with an ASCII digit can never fulfill The Bidi Rule due to ASCII digits not having the correct Bidi property (they have EN according to https://unicode.org/reports/tr9/):

The first character must be a character with Bidi property L, R, or AL. If it has the R or AL property, it is an RTL label; if it has the L property, it is an LTR label.

I'm not sure what to make of this.

I would appreciate input from @achristensen07 @valenting @markusicu @macchiati @alvestrand. I would be somewhat inclined to set CheckBidi to false given that it matches most implementations, is more likely to match deployed content, and the bidi requirements appear contradictory, but I'm open to suggestions.

annevk commented 1 year ago

This might also be more complicated still as, e.g., يa is rejected by all browsers. Which I think is due to mixing L and R character properties and would not be rejected without CheckBidi being true?

And then 1.ي is only rejected in WebKit. So Chromium can again be explained through an erroneous ASCII fast path. And it seems that Gecko has a different CheckBidi behavior when it comes to ASCII digits at least, perhaps due to the above contradiction. Or perhaps there is another check related to character properties unrelated to CheckBidi.

(All of this is only concerned with the ToASCII code path, for what it's worth.)

alvestrand commented 1 year ago

This logic took quite a while to work out, including actually coding up the BIDI rule and running it through all the possible combinations of directions to make sure I had them all covered.....

The "bidi rule" in RFC 5893 section 2 applies to a single label. So a label (not a domain name) can either obey the rule or not.

The guarantees in the last two paragraphs are about the properties of a whole domain name. They are not part of the rule.

The practical consequence is that if you want sanity in your display, you can never have <RTL-label>.3com.com - because that would probably display as 3.<RTL-label>com.com, which is confusing.

So 1.ي should not be rejected, but 1.ي.3com should be. (inspect the order of the characters in that one!)
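The domain-level check described here can be sketched in Python. This is only an illustration: it reads "come after" as "immediately follows" (an assumption), the function names are hypothetical, and it takes a label to be RTL when it contains at least one character of class R, AL, or AN, as RFC 5893 defines the term.

```python
import unicodedata

RTL_CLASSES = {"R", "AL", "AN"}
ASCII_DIGITS = set("0123456789")


def is_rtl_label(label: str) -> bool:
    """A label is RTL if it contains at least one R, AL, or AN character."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in label)


def digit_label_follows_rtl(domain: str) -> bool:
    """True if a label starting with an ASCII digit directly follows an RTL label."""
    labels = domain.split(".")
    for prev, cur in zip(labels, labels[1:]):
        # cur[:1] is "" for an empty label, which is never in the digit set.
        if is_rtl_label(prev) and cur[:1] in ASCII_DIGITS:
            return True
    return False


print(digit_label_follows_rtl("1.\u064a"))       # 1.ي
print(digit_label_follows_rtl("1.\u064a.3com"))  # 1.ي.3com
```

Under this reading only the second domain is flagged, because "3com" directly follows the RTL label while "1" precedes it.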

annevk commented 1 year ago

@alvestrand so do I understand it correctly that IDNA2008 doesn't take a stance as to whether all labels in a domain need to obey The Bidi Rule?

https://www.unicode.org/reports/tr46/#Validity_Criteria does which might explain the difference.

https://www.rfc-editor.org/rfc/rfc5891.html#section-4.2.3.4 seems to only enforce The Bidi Rule upon individual labels containing characters whose property is R whereas UTS46 enforces it upon all labels in a domain as long as CheckBidi is true.
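The difference between the two enforcement models can be made concrete with a Python sketch. It implements the six conditions of The Bidi Rule as listed in RFC 5893, Section 2, and then applies them under two policies. The policy functions and their names are hypothetical readings of RFC 5891 and UTS46, not code from either document, and the sketch skips UTS46 mapping and Punycode handling.

```python
import unicodedata

RTL_TRIGGER = {"R", "AL", "AN"}
RTL_ALLOWED = {"R", "AL", "AN", "EN", "ES", "CS", "ET", "ON", "BN", "NSM"}
LTR_ALLOWED = {"L", "EN", "ES", "CS", "ET", "ON", "BN", "NSM"}


def bidi_classes(label):
    return [unicodedata.bidirectional(ch) for ch in label]


def is_rtl_label(label):
    return any(c in RTL_TRIGGER for c in bidi_classes(label))


def satisfies_bidi_rule(label):
    classes = bidi_classes(label)
    # Condition 1: first character must be L, R, or AL (fails for empty labels).
    if not classes or classes[0] not in {"L", "R", "AL"}:
        return False
    # Conditions 3 and 6 look at the last character before any trailing NSM.
    end = len(classes)
    while classes[end - 1] == "NSM":
        end -= 1
    last = classes[end - 1]
    if classes[0] in {"R", "AL"}:
        return (all(c in RTL_ALLOWED for c in classes)           # condition 2
                and last in {"R", "AL", "EN", "AN"}              # condition 3
                and not ("EN" in classes and "AN" in classes))   # condition 4
    return (all(c in LTR_ALLOWED for c in classes)               # condition 5
            and last in {"L", "EN"})                             # condition 6


def per_label_policy(domain):
    """Assumed reading of RFC 5891: only labels with RTL characters are checked."""
    return all(satisfies_bidi_rule(l) for l in domain.split(".") if is_rtl_label(l))


def whole_domain_policy(domain):
    """Assumed reading of UTS46 CheckBidi: in a Bidi domain name, check every label."""
    labels = domain.split(".")
    if not any(is_rtl_label(l) for l in labels):
        return True
    return all(satisfies_bidi_rule(l) for l in labels)


for domain in ["1.\u064a", "\u064aa", "163.com"]:
    print(domain, per_label_policy(domain), whole_domain_policy(domain))
```

With these definitions, 1.ي passes the per-label policy but fails the whole-domain one, while يa fails both because an L character is not allowed in an RTL label.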

I do think The Bidi Rule is somewhat confusing if that is the case as it itself states

The following rule, consisting of six conditions, applies to labels in Bidi domain names.

which easily leads one to think it applies to all labels and has to be obeyed.

Also, it's not clear to me how the guarantee that labels starting with an ASCII digit do not come after the RTL label follows from The Bidi Rule when it is enforced only upon labels containing characters whose property is R.

alvestrand commented 1 year ago

You are correct. IDNA2008 states only rules about single labels - this was a result of discussing the various ways in which labels can be put together into domain names.

There is a very explicit discussion of "what can happen if you concatenate labels into domain names" in https://www.rfc-editor.org/rfc/rfc5893#section-5 - it ends with "Rather than trying to suggest rules that disallow all such undesirable situations, this document merely warns about the possibility, and leaves it to application developers to take whatever measures they deem appropriate to avoid problematic situations."

TR46 was written by people who have far less DNS experience than the people who were involved in RFC 5891. The two groups did not agree at the time TR46 was first written, and while my impression is that TR46 has been revised to be more in line with IDNA2008 over time, I am not surprised that there are still cases where trying to interpret the two as saying exactly the same thing will fail.

annevk commented 1 year ago

@alvestrand okay, but that still doesn't address my last paragraph about the purported guarantees from IDNA2008.

alvestrand commented 1 year ago

The point is that no single entity can make that guarantee, as described in section 5. Remember that IDNA2008 intends to impose requirements on people registering labels; it does not impose requirements on those who use domain names.

If you want to require that a certain application rejects domain names that don't obey the requirements, that's an application spec, not a DNS spec. Section 5 (and the "it follows" parts of section 2) are intended to give guidance on how to decide to reject such names.

(I suspect that I'm a victim of knowing what I intended when I wrote it, and being unable to see where it's unclear; to me, I'm just repeating what I already wrote in the RFC. But I still hope it's understandable.)

annevk commented 1 year ago

The point is that no single entity can make that guarantee, as described in section 5.

So why say they are guarantees?

Given that IDNA2003 was implemented by user agents it does seem somewhat irresponsible that IDNA2008 didn't try to address them at all, but I guess that's water under the bridge.

I guess I need input from @achristensen07 @valenting @markusicu @macchiati as to what exactly we'd like to enforce here. Banning numeric labels in domains containing RTL labels seems bad so I assume we want to change that part of UTS46.

Enforcing The Bidi Rule for labels containing a character whose property is R seems realistic and implemented by all user agents.

We could additionally try to enforce the second "guarantee" by not allowing a numeric label after an RTL label, but not sure.

zackw commented 1 year ago

I'm concerned that the URL spec might be moving in the direction of rejecting domain names that are actually in use in the wild. Gonna bring over my comment from #438:

Here are some examples of URLs that I have personally observed in the wild (during my research, which involves Web crawling) to contain hostnames which are formally invalid per some RFC or other, but do not rise to the level of a 'serious problem' (as the IDNA2008 RFC uses that term), and which I think should probably be accepted by the URL standard, if only for interop's sake:

http://r2---sn-gvbxgn-tt1s.googlevideo.com/
http://r9---sn-i3b7sn7d.googlevideo.com/
http://lgbt_grani.livejournal.com/
http://www.mi-ru_mo.bbs.fc2.com/
http://-friction-.tumblr.com/

None of these actually involve xn--, though. Should I file a new bug?

annevk commented 1 year ago

@zackw those are not rejected by the URL Standard. I'm going to mark your and my comment as off-topic. Feel free to file a new issue if anything is unclear.

annevk commented 1 year ago

My plan is to submit feedback to Unicode's April meeting to get this addressed. Draft:

Please change the processing model of CheckBidi to allow for more right-to-left domains.

Currently when CheckBidi is set to true and the input is determined to be a Bidi domain name it enforces all six subrules of The Bidi Rule https://www.rfc-editor.org/rfc/rfc5893.html#section-2 for each label of a domain. This has a couple of issues:

  • As discussed in https://github.com/whatwg/url/issues/543 subrule 1 alone ends up disallowing EN code point labels in such domain names (e.g., 1.ي is a fatal error). This seems unnecessarily constraining.
  • Subrule 1 also creates undefined behavior for empty string labels (e.g., for a domain such as ي.), as it imposes requirements upon a character that is not there. (If the expectation is that trailing dots are removed before ToASCII is invoked that could use clearer documentation or an assert somewhere.)
  • As discussed in the URL Standard issue referenced, one of the editors of IDNA2008 asserts The Bidi Rule was not aimed at client implementations, but rather at registries. While browsers have nevertheless been enforcing it to varying degrees, as suggested by UTS46, it's probably worth another close review to ensure this is actually what we want.

I don't have a recommendation here unfortunately as this is not my area of expertise. It's my hope Bidi experts on the committee can help out. One solution might be to not enforce subrule 1 for left-to-right labels, but do enforce that a label that starts with an EN code point cannot follow a right-to-left label.

If anyone here has suggestions for how to make this more concrete I'm all ears.

cc @ricea

alvestrand commented 1 year ago

The point on empty labels is actually not right. The string "a.b.c." (trailing dot) does not represent a DNS name with an empty label; it is a syntactic convention saying "we know c is a top level domain, don't try to append your search path elements to it in order to find it".

It's largely fallen out of use.

My suggestion for a solution would be to add text in the URL standard as follows:

The IDNA2008 standard, in RFC 5893, gives a rule for evaluating whether or not a single label is suitable for use in a BIDI domain name, and some advice for applications.

RFC 5893 defines the terms "RTL label", "LTR label", "Bidi domain name", and "Bidi rule".

Based on this advice, the following domain names will be accepted by the URL standard:

  • Domains containing only labels that obey the Bidi rule
  • Domains containing RTL labels followed by an LTR label consisting only of ASCII characters, where the first character is not a digit.

Those should be the necessary and sufficient rules for ensuring that domain names displayed using the Unicode bidi algorithm don't contain characters that "jump the dot".

annevk commented 1 year ago

It's correct per how browsers and the URL Standard deal with it. The domain name is x.. That is passed to UTS46 which splits on . to get labels. At that point they have an empty label they need to deal with. Currently they pass it as-is to IDNA2008's The Bidi Rule where it goes wrong. (And as I suggest we could instead omit the empty label before passing it on to UTS46.)

Thanks for suggesting a set of rules. I'll incorporate that in the feedback.

alvestrand commented 1 year ago

This is the problematic statement in UTS#46: https://unicode.org/reports/tr46/#ProcessingStepBreak "Break the string into labels at U+002E"

The problem is that a.b.c. is using the "preferred name syntax" from RFC 1035 section 2.3.1, where empty labels are disallowed - and UTS#46 is ignoring that.

The grammar rule is "

A competent DNS name processor should:

a) disallow any domain name with two consecutive dots
b) interpret a trailing dot as "this domain name is rooted at the DNS root", not as a trailing empty label
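The two behaviors being contrasted can be sketched in Python. The first function is a plain split at U+002E, which yields a trailing empty label for "a.b.c."; the second is a hypothetical stricter variant following points a) and b). Both function names are invented for illustration and neither is taken from UTS46 or the URL Standard.

```python
def split_labels(domain):
    """Naive split at U+002E: a trailing dot produces an empty last label."""
    return domain.split("\u002e")


def split_labels_rooted(domain):
    """Hypothetical variant: one trailing dot marks a rooted name and is dropped;
    any remaining empty label (such as from two consecutive dots) is an error."""
    if domain.endswith("."):
        domain = domain[:-1]
    labels = domain.split(".")
    if any(label == "" for label in labels):
        raise ValueError("empty label in domain name")
    return labels


print(split_labels("a.b.c."))
print(split_labels_rooted("a.b.c."))
```

Whether dropping the trailing dot before label processing is appropriate depends on the caller, since the trailing dot can matter elsewhere (for example in host comparison).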

annevk commented 1 year ago

This is all way before DNS gets involved and also has other applications (such as the same-origin policy) so it's not quite that simple, but it might well be better if UTS46 is not invoked with a trailing dot. They just need to make that clear I think.

annevk commented 1 year ago

@alvestrand on reflection, it's not clear to me how your suggestion ends up allowing cases such as 1.ي. Not all labels there obey The Bidi Rule as discussed. Perhaps it needs to be something like this:

alvestrand commented 1 year ago

The example of 1.ي is not covered by my suggested rule:

Domains containing RTL labels followed by an LTR label consisting only of ASCII characters, where the first character is not a digit.

since the RTL domain is a top level domain in this case. It is covered if reformulated as

Domains containing RTL labels where each RTL label is either the top level domain or directly followed by an LTR label consisting only of ASCII characters, where the first character is not a digit.

(the word "directly" makes it more obvious that 1.ي.foo.3.tld is allowed too)

annevk commented 1 year ago

@alvestrand but for that second now-reformulated case would the RTL labels still need to obey "The Bidi Rule"? And when you say "domains" do you mean "Bidi domain names" or all of them?

annevk commented 1 year ago

@alvestrand if you could give this another look that would help. Otherwise I'll submit feedback without a specific recommendation.

alvestrand commented 1 year ago

The two paragraphs in my suggested rule are AND, not OR; domain names need to satisfy both.

All domain names with RTL labels are Bidi domain names. Quoting RFC 5893 again:

A "Bidi domain name" is a domain name that contains at least one RTL label. (Note: This definition includes domain names containing only dots and right-to-left characters. Providing a separate category of "RTL domain names" would not make this specification simpler, so it has not been done.)

Domain names that don't contain RTL labels are out of scope for this recommendation.

annevk commented 1 year ago

@alvestrand how does AND work for LTR labels solely consisting of EN code points? They would violate The Bidi Rule.

macchiati commented 1 year ago

Anne, I believe your conclusion in your first message is correct. That is, if a domain name contains any R, AL, or AN character then by condition 1, none of its labels can start with an EN character, eg [0-9].

But as you say, the wording of paragraph XX appears odd:

In a domain name consisting of only LDH labels (as defined in the Definitions document [RFC5890 https://www.rfc-editor.org/rfc/rfc5890]) and labels that satisfy the rule, the requirements of Section 3 https://www.rfc-editor.org/rfc/rfc5893.html#section-3 are satisfied as long as a label that starts with an ASCII digit does not come after a right-to-left label.

After all, 5890 defines the following.

The term "LDH code point" is defined in this document to refer to the code points associated with ASCII letters (Unicode code points 0041..005A and 0061..007A), digits (0030..0039), and the hyphen-minus (U+002D).

That means that a "domain name consisting of only LDH labels" can't have any right-to-left labels. So the condition "as long as a label that starts with an ASCII digit does not come after a right-to-left label" is by definition always true for such a domain name, because there can be no right-to-left labels in it.

annevk commented 1 year ago

Yeah, I guess we could accept it, but it seems unnecessarily constraining for RTL users and developers, and ends up rejecting domains known to exist (see OP). At the moment it also only matches WebKit, but unfortunately I haven't been able to get @ricea (Chromium) and @valenting (Gecko) to chime in thus far.

alvestrand commented 1 year ago

OP had mail.163.com.xn----9mcjf9b4dbm09f.com - here, the RTL label is followed by an ASCII label that does not start with a digit. I don't see how that would fail the rule I suggested.

To @annevk : All-numeric labels start with a number. No need to consider anything more about them; if they follow an RTL label, they make the domain name fail the rule.

(Note: 1.ي.3.tld (the 3 is actually the subdomain of .tld) is an example of an all-numeric label. It will be rare for users to actually comprehend that.)

annevk commented 1 year ago

@alvestrand because the 163 label violates The Bidi Rule subrule 1 as we've said repeatedly in this thread. Mark just mentioned it again just now in his first paragraph.

alvestrand commented 1 year ago

The 163 label does not follow an RTL label, so while it violates the bidi rule for a label, it doesn't violate the domain name rule I proposed. I think I've said that several times too. Quoting RFC 5893 again:

o In a domain name consisting of only LDH labels (as defined in the Definitions document [RFC5890]) and labels that satisfy the rule, the requirements of Section 3 are satisfied as long as a label that starts with an ASCII digit does not come after a right-to-left label.

Satisfying the requirements of section 3 should be the goal of a domain name verification filter.

annevk commented 1 year ago

In https://github.com/whatwg/url/issues/543#issuecomment-1384923846 you suggested two rules and later clarified they are AND. One of the rules is that all labels adhere to The Bidi Rule.

Could you please restate your rules in clearer terms?

alvestrand commented 1 year ago

Adding in the AND and "immediately" from the suggested clarifications gives the following text:

The IDNA2008 standard, in RFC 5893, gives a rule for evaluating whether or not a single label is suitable for use in a BIDI domain name, and some advice for applications.

RFC 5893 defines the terms "RTL label", "LTR label", "Bidi domain name", and "Bidi rule".

Based on this advice, the following domain names will be accepted by the URL standard:

  • Domains containing only labels that obey the Bidi rule, AND
  • Domains containing RTL labels immediately followed by an LTR label consisting only of ASCII characters, where the first character is not a digit.

The AND means that both kinds of domain will be accepted, it is "accept AND accept".

I don't understand where the comprehension difficulty is, but then English is not my first language.

annevk commented 1 year ago

Well, you also said:

The two paragraphs in my suggested rule are AND, not OR; domain names need to satisfy both.

But now you are saying domain names only need to satisfy one of the rules, right? (Which brings me back to my question about the lack of enforcement of The Bidi Rule on RTL labels with the second rule.)

alvestrand commented 1 year ago

I was wrong when I said "domain names need to satisfy both". I wasn't reading my own proposed text. Sorry!

Try N:

The IDNA2008 standard, in RFC 5893, gives a rule for evaluating whether or not a single label is suitable for use in a BIDI domain name, and some advice for applications.

RFC 5893 defines the terms "RTL label", "LTR label", "Bidi domain name", and "Bidi rule".

Based on this advice, the following domain names will be accepted by the URL standard:

  • Domains containing only labels that obey the Bidi rule, AND
  • Domains containing RTL labels immediately followed by an LTR label consisting only of ASCII characters, where the first character is not a digit, and where all labels are either LDH labels or obey the Bidi rule.

annevk commented 1 year ago

Thank you, that seems like an improvement, but LDH labels per https://datatracker.ietf.org/doc/html/rfc5890#section-2.3.1 contain A-labels and it seems that A-labels that are RTL labels should obey The Bidi Rule. So maybe LDH there should be LTR?

alvestrand commented 1 year ago

No, LDH should be LDH, because there are LTR labels that don't obey the Bidi rule, and we need to not permit those. (See bullets 5 and 6 of the Bidi rule).

I don't have proposed surrounding text for this rule, but it should probably say "this rule is evaluated after all A-labels have been converted to U-labels for testing" - meaning that xn-- labels should be decoded before evaluating; if we don't do that, explicit xn-- labels offer a way to sneak in bidi domains into unsuspecting places.
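Decoding xn-- labels before evaluation could look roughly like the following Python sketch, which relies on the `punycode` codec that ships with Python. The function names are hypothetical, and the sketch deliberately skips the mapping and validation that UTS46 performs around this step.

```python
def to_u_label(label):
    """Decode an xn-- label with Python's built-in Punycode codec.
    Labels without the prefix are returned unchanged."""
    if label[:4].lower() == "xn--":
        return label[4:].encode("ascii").decode("punycode")
    return label


def decode_labels(domain):
    """Split a domain at dots and decode each label independently."""
    return [to_u_label(label) for label in domain.split(".")]


# Build an A-label from a known U-label, then decode it again.
a_label = "xn--" + "\u064a".encode("punycode").decode("ascii")
print(a_label, decode_labels("1." + a_label))
```

Running a Bidi check on the decoded labels rather than on the raw ASCII form is what prevents an explicit xn-- label from bypassing the check.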

annevk commented 1 year ago

Okay that makes sense. I think that precondition means that LTR in your second rule can be LDH as well (which guarantees ASCII).

And to be clear, there is the (unstated) precondition that these domains are Bidi domain names, right? As presumably we will not impose these requirements on non-Bidi domain names.

I think with that we'd recommend these changes to UTS 46:

  1. Remove step 8 of https://unicode.org/reports/tr46/#Validity_Criteria as Validity Criteria only operates on a single label. (Although it somehow claims to have knowledge about the domain_name string as well...)
  2. Add a new step 5 to https://unicode.org/reports/tr46/#Processing. (Note that due to step 4 we will have U-labels.)

    1. If CheckBidi, and the domain_name string is a Bidi domain name, record there was an error if neither of the following conditions is true:
      • All labels in the domain_name string satisfy the 6 subrules of The Bidi Rule of RFC 5893, Section 2.
      • RTL labels in the domain_name string are immediately followed by an LDH label whose first code point is not of class EN and all labels in the domain_name string are either LDH labels or satisfy the 6 subrules of The Bidi Rule of RFC 5893, Section 2.
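The proposed step can be sketched in Python as follows. Several points are assumptions made for this illustration: the function names are hypothetical, labels are taken to be U-labels already, an RTL label in last position is treated as having nothing to violate (following the earlier "either the top level domain or directly followed by" wording), and LDH labels are approximated by a regular expression based on RFC 5890.

```python
import re
import unicodedata

RTL_TRIGGER = {"R", "AL", "AN"}
RTL_ALLOWED = {"R", "AL", "AN", "EN", "ES", "CS", "ET", "ON", "BN", "NSM"}
LTR_ALLOWED = {"L", "EN", "ES", "CS", "ET", "ON", "BN", "NSM"}
LDH = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?")


def classes_of(label):
    return [unicodedata.bidirectional(ch) for ch in label]


def is_rtl_label(label):
    return any(c in RTL_TRIGGER for c in classes_of(label))


def is_ldh_label(label):
    return len(label) <= 63 and LDH.fullmatch(label) is not None


def satisfies_bidi_rule(label):
    cls = classes_of(label)
    if not cls or cls[0] not in {"L", "R", "AL"}:        # condition 1
        return False
    end = len(cls)
    while cls[end - 1] == "NSM":                          # skip trailing NSM
        end -= 1
    last = cls[end - 1]
    if cls[0] in {"R", "AL"}:                             # conditions 2 to 4
        return (all(c in RTL_ALLOWED for c in cls)
                and last in {"R", "AL", "EN", "AN"}
                and not ("EN" in cls and "AN" in cls))
    return all(c in LTR_ALLOWED for c in cls) and last in {"L", "EN"}  # 5, 6


def check_bidi_step(domain):
    """Return True if the proposed step would record no error."""
    labels = domain.split(".")
    rtl = [is_rtl_label(l) for l in labels]
    if not any(rtl):
        return True                      # not a Bidi domain name
    ok = [satisfies_bidi_rule(l) for l in labels]
    if all(ok):
        return True                      # first condition holds
    for i, label_is_rtl in enumerate(rtl):
        if label_is_rtl and i + 1 < len(labels):
            nxt = labels[i + 1]
            # Next label must be LDH and must not start with an EN code point.
            if not is_ldh_label(nxt) or unicodedata.bidirectional(nxt[0]) == "EN":
                return False
    return all(is_ldh_label(l) or good for l, good in zip(labels, ok))


for d in ["mail.163.com.\u064a.com", "1.\u064a", "\u064a.3com", "\u064aa.com"]:
    print(d, check_bidi_step(d))
```

Under these assumptions the numeric label "163" is accepted because it is an LDH label that does not directly follow an RTL label, while ي.3com and يa.com are still rejected.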

I'd appreciate your review and of anyone else still paying attention. 😅

alvestrand commented 1 year ago

Thanks for the context!

Yes, I think this is appropriate advice.

annevk commented 1 year ago

Thank you @alvestrand for coming up with the recommendation, @vorner for raising this, and everyone else who helped move this along! I submitted the feedback to Unicode for their April 2023 meeting. The final comment can be found at the bottom of OP in #744.

hsivonen commented 1 week ago

  • 1.xn--mhb errors (though does not error in Gecko, presumably it has CheckBidi set to false;

The presumption is not correct. Previously (at the time of the quoted comment), CheckBidi was true in Gecko, but Gecko invoked UTS 46 processing on a per-label basis, so the bidiness status of the domain as a whole did not end up affecting LTR labels.

Gecko currently (well after the quoted comment) implements CheckBidi (still true) as described in the Unicode 15.1 version of UTS 46, so if there is RTL anywhere in the domain as a whole, the domain is rejected if there is an LTR label that starts with an ASCII digit. (This also has the effect that domains like 1password.com and 9to5mac.com fall off the fastest path, because at the time a fast-path decision about the first label needs to be made, it's not yet known if there is going to be RTL in subsequent labels.)

(This comment is not meant as disagreement with the feedback relayed to the UTC.)