whatwg / url

URL Standard
https://url.spec.whatwg.org/

Web compatibility issue with various unknown (external) protocols like ed2k #815

Open evilpie opened 9 months ago

evilpie commented 9 months ago

What is the issue with the URL Standard?

After Firefox shipped its new URL Standard-conformant parser in Firefox 122, we have received multiple bug reports about external protocol handlers that no longer work.

The most common seems to be ed2k:, a protocol used for the eDonkey file-sharing network. It's notable because even the Wikipedia page contains URLs that aren't parseable by a WHATWG URL-conformant implementation.

Various other issues are related to handling of ://. For example, openimstoolkit://http://example.com is now parsed as openimstoolkit://http//example.com (note the missing : after http) (Bug 1876729). A similar issue happens for potplayer: (Bug 1876731).
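
Roughly what this looks like through the URL constructor of a conformant implementation (a sketch; the ed2k link below is a shortened, hypothetical example of the usual ed2k://|file|...| form, and exact error messages vary by engine):

```ts
try {
  new URL("ed2k://|file|Some_File.avi|14997504|965c013e991ee246d63d45ea71954c4d|/");
} catch (e) {
  // Throws: "|" is a forbidden host code point, so the click never reaches
  // the external protocol handler at all.
  console.log("ed2k link rejected:", (e as Error).message);
}

const mangled = new URL("openimstoolkit://http://example.com");
console.log(mangled.href); // "openimstoolkit://http//example.com"
// "http" is parsed as the host, the ":" is consumed as an (empty) port
// delimiter, and the wrapped URL survives only as the path "//example.com".
```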

annevk commented 9 months ago

Are these web compatibility issues or issues with extensions? It seems any website breakage would also impact Safari, and I haven't seen any reports of breakage there.

evilpie commented 9 months ago

These are issues with external applications that are supposed to be opened via external protocol handlers, not with extensions. I assume most users of, e.g., eDonkey are on Windows, so Safari might be affected less.

valenting commented 9 months ago

I think there are two different issues with these schemes.

  1. Schemes that wrap another URL with the intention of passing that URL to an external protocol handler.
  2. Schemes such as ed2k, which have a totally different definition. Even under RFC 3986 rules, ed2k://|file|The_Two_Towers-The_Purist_Edit-Trailer.avi|14997504|965c013e991ee246d63d45ea71954c4d|/ is not a valid URI, since | is not a valid host character, but as it happens it used to work. The question is: how much effort should we expend to keep these URLs functioning on the web?

While I don't have any experience with ed2k, I think it's also supposed to be passed to an external protocol handler. But that doesn't work unless the URL is successfully parsed.
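
For illustration, a rough sketch (not actual browser code) of why an unparseable link never reaches the handler: the href goes through the URL parser first, and the hand-off to the OS-registered handler only happens if parsing succeeds.

```ts
// Hypothetical link-activation flow; launchExternalHandler stands in for the
// OS-level protocol-handler dispatch.
function launchExternalHandler(serialized: string): void {
  console.log("handing off to external protocol handler:", serialized);
}

function followLink(href: string, baseURL: string): void {
  let url: URL;
  try {
    url = new URL(href, baseURL);
  } catch {
    return; // parse failure: the click is effectively dead
  }
  if (url.protocol !== "http:" && url.protocol !== "https:") {
    launchExternalHandler(url.href);
  }
}

// An ed2k link with "|" in the authority fails to parse, so nothing happens.
followLink("ed2k://|file|Some_File.avi|14997504|hash|/", "https://en.wikipedia.org/");
```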

karwa commented 9 months ago

The ed2k issue seems to come from U+007C (vertical bar) being listed as a forbidden host code point. Personally, I think it would be very low-risk to allow that character in opaque hostnames.

Failing that, it would be reasonable to at least percent-encode the character; it's quite possible that the application's processing would be tolerant of such a change.
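
As a hypothetical sketch of that fallback, the ed2k link quoted above would serialize with %7C in place of each vertical bar, and would only keep working if the external application tolerates (or decodes) the escaped form:

```ts
// Hypothetical: what a parser that percent-encodes "|" in opaque hosts
// (instead of failing) would produce for the ed2k example.
const rawAuthority =
  "|file|The_Two_Towers-The_Purist_Edit-Trailer.avi|14997504|965c013e991ee246d63d45ea71954c4d|";
const escapedAuthority = rawAuthority.replaceAll("|", "%7C");
console.log(`ed2k://${escapedAuthority}/`);
// ed2k://%7Cfile%7CThe_Two_Towers-The_Purist_Edit-Trailer.avi%7C14997504%7C965c013e991ee246d63d45ea71954c4d%7C/
```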

evilpie commented 9 months ago

BMO 1878295 has another example with vscode:///{'cmd':'openFile'} ({ and } are escaped).

edit: Live URL Viewer
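
For reference, this is roughly what a conformant parser produces for that example; { and } fall in the path percent-encode set, so they come back escaped while the rest of the path is left alone:

```ts
const u = new URL("vscode:///{'cmd':'openFile'}");
console.log(u.href); // vscode:///%7B'cmd':'openFile'%7D
```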

hsivonen commented 7 months ago

It's easy to see why space and the code points below it would be forbidden. It's easy to see why DELETE would be forbidden. Also, it's easy to see why square brackets (IPv6) are forbidden. It's easy to see why characters that occur before or after the host are forbidden. Why are ^, |, and % forbidden?

(Today I learned that Thunderbird expects the post-parse host to be able to contain %. However, Firefox has not allowed % since 2019, so chances are that it's not a web compat issue for % to be forbidden.)

annevk commented 7 months ago

Allowing % would mean that a re-parse could result in a different URL (e.g., %25aa). ^ conflicted with Firefox's Origin Attributes feature, if I remember correctly. Not sure about |. If you go through blame you might be able to tell.
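
A rough sketch of that re-parse concern (hypothetical code, assuming % were allowed raw in domains): the host parser percent-decodes its input, so a serialized host that still contains a raw % would get decoded again on re-parse and end up as a different host.

```ts
// Simplified stand-in for the percent-decoding step of the host parser.
const decodeHostOnce = (input: string): string =>
  input.replace(/%([0-9A-Fa-f]{2})/g, (_match: string, hex: string) =>
    String.fromCharCode(parseInt(hex, 16)));

const first = decodeHostOnce("ex%25aample.com"); // "ex%aample.com"
const second = decodeHostOnce(first);            // "ex\u00AAmple.com" -- a different host
console.log(first === second);                   // false: parse/serialize would not round-trip
```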

karwa commented 7 months ago

I suspect they are inherited from RFC 2396:

Other characters are excluded because gateways and other transport agents are known to sometimes modify such characters, or they are used as delimiters.

unwise = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"

Data corresponding to excluded characters must be escaped in order to be properly represented within a URI.

https://www.ietf.org/rfc/rfc2396.txt (2.4.3. Excluded US-ASCII Characters)

I further suspect the latter rationale ("they are used as delimiters") applies far more often than gateways or other transport agents actually modifying these characters in URLs.

But really, the idea of URLs escaping delimiter characters of popular enclosing document formats is inherently flawed. Consider that parentheses are allowed without escaping, and by some cruel irony are used by the Markdown document format specifically for delimiting URLs. Rust and Swift source code allow user-customisable delimiters for string literals (e.g. r#"..."# in Rust, where the number of #s is customisable so you can include unescaped # in the body), which is a much better solution.

At least for characters where there are not likely to be any web-related delimiter issues (vertical bar, curly braces, etc.), I think we can afford to be more relaxed and allow them to be used without escaping.

annevk commented 7 months ago

https://github.com/whatwg/url/pull/459 blocked <, >, and ^. https://github.com/whatwg/url/pull/589 blocked | (Windows drive letter re-parsing).

karwa commented 7 months ago

From reading those previous discussions:

^

458 seems to indicate that WebKit used to allow it. If I'm reading the Gecko bug report correctly, their implementation of origins included a separator character for internal flags (which just so happened to be ^). This is a rather strange design, and the bug report even notes that it wasn't the first time it was found to be problematic:

For backwards compatibility reasons, when the origin string was first given access to originAttributes, it was designed such that the trailing attributes block is optional. Namely, if no attributes are non-default, the origin will be written as it is in the spec. The separator character to distinguish between the core origin and originAttributes is ^, so an origin might look like https://twitter.com^userContextId=1, or like https://twitter.com if there were no origin attributes set.

This leads to the spoofing issue. If it was possible for a site to include a ^ in its bare origin string, that could cause issues with the origin logic, as it could be possible to imitate originAttribute from a non-attributed origin. It used to be that the separator used was !, but that was found to be spoofable in bug 1172080. That bug is also where the ^ character was chosen.

In my opinion, this seems like a rather weak justification for disallowing this character in all URLs. We could at least allow it in non-special URLs (i.e., opaque hostnames), since they do not have defined origins.


|

Okay, for file URLs it's fair enough, because this standard does actually define a meaning for this character in the hostname of a file URL. But that restriction shouldn't apply to non-file URLs. I think we can at least allow it in opaque hostnames, to solve the ed2k compatibility issue.

In general, it usually doesn't matter if we're overly restrictive for domains/special URLs (which is what browsers tend to care about), because those special characters often won't be registered to any actual domains. But when it comes to opaque hostnames (which browsers have had very spotty support for), it does matter a great deal, because they contain arbitrary content that will be processed in an arbitrary way. The changes which forbade these characters strike me as overly broad.

hsivonen commented 3 months ago

A couple of updates:

First, Gecko no longer needs to allow % in general, since there's now a Thunderbird/SeaMonkey-specific hack that permits two specific pseudo host names for which the percent issue was relevant. (I've been told at least one of these pseudo hosts comes from the 1990s.)

Second, the remaining Gecko deviation from the spec is that Gecko prohibits * and " in domain names in URLs. The most significant remaining reason is that those are the characters that are prohibited in file and directory names on Windows but allowed in domains by the URL Standard. I suspect there is more software that wants to create a directory on Windows whose name is a domain name taken from a URL, so allowing * and " at the URL level is going to lead to inconvenient results in more software than just Gecko.
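
A hypothetical guard, just to illustrate the concern: most of Windows' forbidden file-name characters are already excluded from hosts by the URL Standard, so * and " are exactly the two that software deriving a directory name from a host has to handle itself.

```ts
// Characters Windows forbids in file/directory names. Apart from '*' and '"',
// all of these are already forbidden host code points in the URL Standard.
const WINDOWS_FORBIDDEN_NAME_CHARS = /[<>:"\/\\|?*]/g;

// Hypothetical helper: turn a URL host into a usable Windows directory name.
function hostToDirectoryName(host: string): string {
  return host.replace(WINDOWS_FORBIDDEN_NAME_CHARS, "_");
}

console.log(hostToDirectoryName("*.example.org")); // "_.example.org"
```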

Additionally, there is a feature on the web platform that expects the asterisk not to be part of the normal domain-name value space, so that it's legitimate to use it as a wildcard: wildcard certificates. There are other things that deal with origins and explicitly don't allow wildcards. Prohibiting * in domain names has the useful property that it's legitimate to reject https://*.example.org/ when someone tries to use it as a wildcard origin in a place that does not actually support wildcards.

Does URL really need to allow * and " in domain names?

annevk commented 3 months ago

@hsivonen see #397. The rationale for allowing non-DNS domains is to accommodate non-DNS systems to the widest extent possible. We could potentially add further restrictions, but we might well run into issues, so it seems safer to allow the domain and potentially reject it at a layer further down.

hsivonen commented 3 months ago

@annevk, #397 talks about NetBIOS, but the asterisk and double quote are documented by Microsoft as prohibited in NetBIOS names even before the Windows 2000 NetBIOS alignment with DNS. Is the reality more permissive than Microsoft's documentation suggests?

annevk commented 3 months ago

I'm not sure. Note that RFC 3986 allows * as-is too, as far as I can tell: reg-name allows sub-delims, which include *. " is not mentioned, so it is probably disallowed there, though.

But since we cannot align it with DNS completely, and the partial DNS alignment is also somewhat weird for opaque URLs, I'd rather leave things as-is; any further change seems risky and not worth it.