domenic opened this issue 6 years ago
I'd lean toward (1), under the theory that there are likely registered schemes where percent-decoding and white-space stripping are inappropriate.
I'm puzzling over (my) characterisation of the WHATWG resolution and this issue came to mind. Some observations, in case it helps.
Let's look at the properties of parsed/resolved URLs: in a special URL, for instance, the path always starts with a `/` (or just a drive letter). These properties are natural consequences of the protocols.
For non-special URLs the parser/resolver uses the 'cannot-be-a-base-URL' flag to decide if the URL is a base URL. This amounts to the following: if the path starts with a `/`, then the URL can be used as a base URL. So `javascript:foo` is not considered a base URL, but `javascript:/foo` and `javascript://` are.
Note that resolving `foo` against `javascript:` throws an error, whereas resolving `foo` against `javascript:/` results in `javascript:/foo`.
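This behaviour can be observed directly with the URL API (a quick illustration, not spec text):

```ts
// A URL whose path does not start with "/" carries the cannot-be-a-base-URL
// flag (an "opaque path" in current spec terms), so it cannot serve as a base:
new URL('foo', 'javascript:');          // throws a TypeError

// Once the path starts with "/", the URL can act as a base:
new URL('foo', 'javascript:/').href;    // "javascript:/foo"
```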
I think it makes sense to define what is and what is not a base URL based on the protocol only. The protocol would then select one of a few options for parsing/resolving behaviour (e.g. opaque, like `javascript:` URLs). That requires a hardcoded list of protocols and their associated URL 'type' (i.e. parsing/resolving behaviour) though. It could also be useful to provide a way to manually register protocols to map to a certain parsing/resolving behaviour.
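As a rough sketch (the registry, `kindOf`, `registerScheme`, and the behaviour categories here are illustrative assumptions, not anything the spec defines), such a mapping could look like:

```ts
// Hypothetical registry mapping a scheme to a parsing/resolving behaviour.
type UrlKind = 'special' | 'hierarchical' | 'opaque';

const registry = new Map<string, UrlKind>([
  ['http', 'special'],
  ['https', 'special'],
  ['javascript', 'opaque'],
  ['data', 'opaque'],
]);

// Unregistered schemes could fall back to the current generic behaviour.
function kindOf(scheme: string): UrlKind {
  return registry.get(scheme.toLowerCase()) ?? 'hierarchical';
}

// The manual registration hook suggested above.
function registerScheme(scheme: string, kind: UrlKind): void {
  registry.set(scheme.toLowerCase(), kind);
}
```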
Just some ideas.
Having a largely protocol-agnostic parser is a design goal. Having to tweak the parser or getting different parser outcomes over time is far from ideal. (While at the moment this still happens due to convergence between implementations, my hope is that long term it won't.)
Completely agree; however, it does seem accurate to distinguish a few categories.
It is very strange to apply path normalisation to javascript URLs, for example.
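For instance, in a spec-conformant parser (the URL here is just an illustrative example):

```ts
// Dot-segment normalisation applies to any URL whose path starts with "/",
// even though ".." has no obvious meaning in a JavaScript program:
new URL('javascript:/a/../alert(1)').href;  // "javascript:/alert(1)"
```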
The same would be true for, say, `data:`, `news:`, `urn:`, `mailto:`.
I think there is a consistent, more general pattern here.
Maybe it makes sense to define a few "special exceptions" like `http:`, `https:`, etc. being treated uniquely, and the same for `javascript:`, `data:`, etc., but then also allow URLs to somehow specify that they want that mode of parsing explicitly, e.g. with a prefix, so `web-myscheme:` would work the same as `http[s]:` and `raw-myscheme:` would work the same as `javascript:` and `data:`.
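A rough sketch of how a parser could dispatch on such a prefix; `modeForScheme` and the `web-`/`raw-` prefixes are hypothetical examples, not anything specified:

```ts
type ParseMode = 'special' | 'opaque';

// Hypothetical: pick a parsing mode from the scheme itself, so that new
// schemes can opt into an existing behaviour without being hardcoded.
function modeForScheme(scheme: string): ParseMode {
  const s = scheme.toLowerCase();
  if (s === 'http' || s === 'https' || s.startsWith('web-')) {
    return 'special';  // parsed like http(s): host, port, path, query, fragment
  }
  if (s === 'javascript' || s === 'data' || s.startsWith('raw-')) {
    return 'opaque';   // everything after "scheme:" kept as an opaque string
  }
  return 'opaque';     // default for unknown schemes; this choice is debatable
}
```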
However, maybe it also makes sense to allow implementations that give value to specific URL schemes to interpret and parse them specially. I know that `hyper:` URLs (for the Hypercore stuff) actually use a hash instead of an address for the host. I think the hash will currently be parsed as a domain with the WHATWG spec, but that's not accurate to what it actually represents (it can't have a port, for example).
Of course, that would be awful in a way, because then different implementations would parse the same URL differently, so people couldn’t rely on manipulating URLs working the same way across implementations, which is what this spec is aiming to solve.
Maybe a good approach could be to establish a (limited) set of normalization rules that can be applied to URLs by implementations, enforcing specific normalization rules for certain URLs like `http[s]:`, but allowing implementations to choose among other normalization rules for their own URLs.
So, for example, the spec could allow implementations to change the port of URLs freely depending on the scheme without requiring it to be fetched and redirected (as long as they do it consistently); then e.g. `http:` would take away the port if it is `80` (enforced by the spec), and `hyper:` URLs would always take away the port in implementations that support it (allowed by the spec).
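The enforced half of this already exists for the special schemes; a quick check with the URL API:

```ts
// Default-port elision is mandatory for http(s): in the current spec:
new URL('http://example.com:80/').port;    // "" (port 80 is the default, so it is dropped)
new URL('http://example.com:8080/').port;  // "8080" (non-default ports are kept)

// Under the idea above, an implementation that understands hyper: could be
// allowed (though not required) to normalise the port for that scheme too.
```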
Some other modifications and normalizations could be handled in a similar way, being required for well-known URLs and allowed for other URLs.
The key here, I think, is that the set of normalization rules that can even be applied to URLs is well known beforehand and is not arbitrary, so it is possible for authors to enjoy consistent URL handling across implementations.
There are several URL types that are basically of the form `scheme:<some arbitrary data>`. For example, `data:`, `mailto:`, `javascript:`, and `urn:`.
The question is, how should software process these URLs? I see three main models:
1. Don't use the URL parser at all: just strip the leading `scheme:`, then look at everything after that.
2. Run the URL parser, serialize the result, strip the leading `scheme:`, and process everything after it as a single string. This is how the `data:` URL processor spec works and how `javascript:` URL processing is specced (although I don't think we have extensive tests in that area).
3. Run the URL parser and operate on the resulting components. Since `<some arbitrary data>` can contain `?`s or `#`s, you have to model that as allowing queries and fragments, and then processing `${path}?${query}#${fragment}`. Whereas (2) just lets you process the whole string at once.

An interesting example contrasting (2) and (3) is the following:
With `javascript://somehost/%0Aalert(1)`, model (2) hands the whole decoded body `//somehost/\nalert(1)` to the script engine, where it is interpreted as a comment followed by an alert; under (3), `somehost` would instead be treated as a host component rather than as part of the script, so the two models disagree about such `javascript:` URLs.
Another example is that `mailto:///d@domenic.me` is interpreted as containing a `<some data here>` of `///d@domenic.me` in (2) and a path of `/d@domenic.me` in (3). Maybe not relevant, since I doubt many mail clients will let you send email to such an address? There are probably more interesting examples of this sort.
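To make the contrast concrete, here is a rough sketch (not spec text; the helper names are made up) of how (2) and (3) might pull the data out of an already-parsed URL; details such as the `data:` processor excluding the fragment and the later percent-decoding step are omitted:

```ts
// Model (2): serialize the parsed URL and strip the leading "scheme:".
function dataPerModel2(url: URL): string {
  return url.href.slice(url.protocol.length); // url.protocol includes the trailing ":"
}

// Model (3): reassemble the data from the parsed components.
function dataPerModel3(url: URL): string {
  return url.pathname + url.search + url.hash; // "?"/"#" delimiters included when present
}

const mail = new URL('mailto:///d@domenic.me');
dataPerModel2(mail); // "///d@domenic.me"
dataPerModel3(mail); // "/d@domenic.me"

const js = new URL('javascript://somehost/%0Aalert(1)');
dataPerModel2(js);   // "//somehost/%0Aalert(1)" (decodes to a comment plus alert(1))
dataPerModel3(js);   // "/%0Aalert(1)" (somehost is held separately as the host)
```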
The purpose of this thread is to gather community thoughts on these scenarios, with an eye toward setting a precedent for future such schemes, and providing recommendations for software that processes such URLs (including both the web's specced `data:` and `javascript:`, and other schemes like `mailto:` or `urn:`).

If we decide (2) is better, we should provide better spec support for it, including helper operations and explicit recommendations to continue this pattern. If we decide (3) is better, we should do the same, and we should either explicitly note `data:` and `javascript:`'s processing models as legacy, or try to change them (which might be possible if interop is bad).

/ccing some people who might have thoughts: @mnot @jasnell @sleevi @masinter