domenic opened this issue 6 years ago
I'd lean toward (1), under the theory that there are likely registered schemes where percent-decoding and white-space stripping are inappropriate.
I'm puzzling over (my) characterisation of the WHATWG resolution and this issue came to mind. Some observations, in case it helps.
Let's look at the properties of parsed/resolved URLs: in a special URL, for instance, the path always starts with a `/` (or just a drive letter). These properties are natural consequences of the protocols.
For non-special URLs the parser/resolver uses the 'cannot-be-a-base-URL' flag to decide if the URL is a base URL. This amounts to the following: if the path starts with a `/`, then the URL can be used as a base URL. So `javascript:foo` is not considered a base URL, but `javascript:/foo` and `javascript://` are.
Note that resolving `foo` against `javascript:` throws an error, whereas resolving `foo` against `javascript:/` results in `javascript:/foo`.
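This behaviour can be observed directly with the URL API (a quick illustration, not spec text):

```ts
// A URL whose path does not start with "/" carries the cannot-be-a-base-URL
// flag (an "opaque path" in current spec terms), so it cannot serve as a base:
new URL('foo', 'javascript:');          // throws a TypeError

// Once the path starts with "/", the URL can act as a base:
new URL('foo', 'javascript:/').href;    // "javascript:/foo"
```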
I think it makes sense to define what is and what is not a base URL based on the protocol only. The protocol would then select one of a few options for parsing/resolving behaviour (e.g. opaque, like `javascript:` URLs). That requires a hardcoded list of protocols and their associated URL 'type' (i.e. parsing/resolving behaviour) though. It could also be useful to provide a way to manually register protocols to map to a certain parsing/resolving behaviour.
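As a rough sketch (the registry, `kindOf`, `registerScheme`, and the behaviour categories here are illustrative assumptions, not anything the spec defines), such a mapping could look like:

```ts
// Hypothetical registry mapping a scheme to a parsing/resolving behaviour.
type UrlKind = 'special' | 'hierarchical' | 'opaque';

const registry = new Map<string, UrlKind>([
  ['http', 'special'],
  ['https', 'special'],
  ['javascript', 'opaque'],
  ['data', 'opaque'],
]);

// Unregistered schemes could fall back to the current generic behaviour.
function kindOf(scheme: string): UrlKind {
  return registry.get(scheme.toLowerCase()) ?? 'hierarchical';
}

// The manual registration hook suggested above.
function registerScheme(scheme: string, kind: UrlKind): void {
  registry.set(scheme.toLowerCase(), kind);
}
```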
Just some ideas.
Having a largely protocol-agnostic parser is a design goal. Having to tweak the parser or getting different parser outcomes over time is far from ideal. (While at the moment this still happens due to convergence between implementations, my hope is that long term it won't.)
Completely agree; however, it does seem accurate to distinguish a few categories.
It is very strange to apply path normalisation to javascript URLs, for example.
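For instance, in a spec-conformant parser (the URL here is just an illustrative example):

```ts
// Dot-segment normalisation applies to any URL whose path starts with "/",
// even though ".." has no obvious meaning in a JavaScript program:
new URL('javascript:/a/../alert(1)').href;  // "javascript:/alert(1)"
```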
The same would be true for, say, `data:`, `news:`, `urn:`, `mailto:`.
I think there is a consistent, more general pattern here.
Maybe it makes sense to define a few "special exceptions" like `http:`, `https:`, etc. being treated uniquely, and the same for `javascript:`, `data:`, etc., but then also allow URLs to somehow specify that they want that mode of parsing explicitly, e.g. with a prefix, so `web-myscheme:` would work the same as `http[s]:` and `raw-myscheme:` would work the same as `javascript:` and `data:`.
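A rough sketch of how a parser could dispatch on such a prefix; `modeForScheme` and the `web-`/`raw-` prefixes are hypothetical examples, not anything specified:

```ts
type ParseMode = 'special' | 'opaque';

// Hypothetical: pick a parsing mode from the scheme itself, so that new
// schemes can opt into an existing behaviour without being hardcoded.
function modeForScheme(scheme: string): ParseMode {
  const s = scheme.toLowerCase();
  if (s === 'http' || s === 'https' || s.startsWith('web-')) {
    return 'special';  // parsed like http(s): host, port, path, query, fragment
  }
  if (s === 'javascript' || s === 'data' || s.startsWith('raw-')) {
    return 'opaque';   // everything after "scheme:" kept as an opaque string
  }
  return 'opaque';     // default for unknown schemes; this choice is debatable
}
```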
However, maybe it also makes sense to allow implementations that give value to specific URL schemes to interpret and parse them specially. I know that `hyper:` URLs (for the Hypercore stuff) actually use a hash instead of an address for the host. I think the hash will currently be parsed as a domain with the WHATWG spec, but that's not accurate to what it actually represents (it can't have a port, for example).
Of course, that would be awful in a way, because then different implementations would parse the same URL differently, so people couldn’t rely on manipulating URLs working the same way across implementations, which is what this spec is aiming to solve.
Maybe a good approach could be to establish a (limited) set of normalization rules that can be applied to URLs by implementations, enforcing specific normalization rules for certain URLs like `http[s]:`, but allowing implementations to choose among other normalization rules for their own URLs.
So, for example, the spec could allow implementations to change the port of URLs freely depending on the scheme without requiring it to be fetched and redirected (as long as they do it consistently); then e.g. `http:` would take away the port if it is `80` (enforced by the spec), and `hyper:` URLs would always take away the port in implementations that support it (allowed by the spec).
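The enforced half of this already exists for the special schemes; a quick check with the URL API:

```ts
// Default-port elision is mandatory for http(s): in the current spec:
new URL('http://example.com:80/').port;    // "" (port 80 is the default, so it is dropped)
new URL('http://example.com:8080/').port;  // "8080" (non-default ports are kept)

// Under the idea above, an implementation that understands hyper: could be
// allowed (though not required) to normalise the port for that scheme too.
```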
Some other modifications and normalizations could be handled in a similar way, being required for well-known URLs and allowed for other URLs.
The key here, I think, is that the set of normalization rules that can even be applied to URLs is well known beforehand and is not arbitrary, so it is possible for authors to enjoy consistent URL handling across implementations.
There are several URL types that are basically of the form `scheme:<some arbitrary data>`. For example, `data:`, `mailto:`, `javascript:`, and `urn:`.
The question is, how should software process these URLs? I see three main models:
1. Don't use the URL parser at all: just strip the leading `scheme:`, then look at everything after that.
2. Run the URL parser, serialize the result, strip the leading `scheme:`, and process everything after it as a single string. This is how the `data:` URL processor spec works and how `javascript:` URL processing is specced (although I don't think we have extensive tests in that area).
3. Run the URL parser and operate on the resulting components. Since `<some arbitrary data>` can contain `?`s or `#`s, you have to model that as allowing queries and fragments, and then processing `${path}?${query}#${fragment}`. Whereas (2) just lets you process the whole string at once.

An interesting example contrasting (2) and (3) is the following:
With `javascript://somehost/%0Aalert(1)`, model (2) hands the whole decoded body `//somehost/\nalert(1)` to the script engine, where it is interpreted as a comment followed by an alert; under (3), `somehost` would instead be treated as a host component rather than as part of the script, so the two models disagree about such `javascript:` URLs.
Another example is that `mailto:///d@domenic.me` is interpreted as containing a `<some data here>` of `///d@domenic.me` in (2) and a path of `/d@domenic.me` in (3). Maybe not relevant, since I doubt many mail clients will let you send email to such an address? There are probably more interesting examples of this sort.
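To make the contrast concrete, here is a rough sketch (not spec text; the helper names are made up) of how (2) and (3) might pull the data out of an already-parsed URL; details such as the `data:` processor excluding the fragment and the later percent-decoding step are omitted:

```ts
// Model (2): serialize the parsed URL and strip the leading "scheme:".
function dataPerModel2(url: URL): string {
  return url.href.slice(url.protocol.length); // url.protocol includes the trailing ":"
}

// Model (3): reassemble the data from the parsed components.
function dataPerModel3(url: URL): string {
  return url.pathname + url.search + url.hash; // "?"/"#" delimiters included when present
}

const mail = new URL('mailto:///d@domenic.me');
dataPerModel2(mail); // "///d@domenic.me"
dataPerModel3(mail); // "/d@domenic.me"

const js = new URL('javascript://somehost/%0Aalert(1)');
dataPerModel2(js);   // "//somehost/%0Aalert(1)" (decodes to a comment plus alert(1))
dataPerModel3(js);   // "/%0Aalert(1)" (somehost is held separately as the host)
```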
The purpose of this thread is to gather community thoughts on these scenarios, with an eye toward setting a precedent for future such schemes, and providing recommendations for software that processes such URLs (including both the web's specced `data:` and `javascript:`, and other schemes like `mailto:` or `urn:`).

If we decide (2) is better, we should provide better spec support for it, including helper operations and explicit recommendations to continue this pattern. If we decide (3) is better, we should do the same, and we should either explicitly note `data:` and `javascript:`'s processing models as legacy, or try to change them (which might be possible if interop is bad).

/ccing some people who might have thoughts: @mnot @jasnell @sleevi @masinter