whatwg / url

URL Standard
https://url.spec.whatwg.org/

Allow queries in relative references when the baseURL has an opaque path #668

Open karwa opened 2 years ago

karwa commented 2 years ago

The recent change which renamed "cannot-be-a-base" URLs to "URLs with opaque paths" reflects the reality that you can parse some relative references against them (e.g. fragment-only relative references).

However, it also exposes that the parser is overly restrictive: while you can use the API to set a query on a URL with an opaque path, you can't parse a relative reference which includes a query against such a URL.

Trying this out against the JSDOM implementation:

var url = new whatwgURL.URL("opq:hello");
console.log("Before - " + url.toString());

try {
  var relative = new whatwgURL.URL("?q=foo", url); 
  console.log("Worked! - " + relative.toString());
} catch(ex) {
  console.log("Error - " + ex);
}

url.search = "?q=foo";
console.log("After - " + url.toString());

Outputs the following:

[Log] Before - opq:hello
[Log] Error - TypeError: Invalid URL: ?q=foo
[Log] After - opq:hello?q=foo

I can't think of a reason why we would allow one way of writing this operation but not the other.

This bug is to ask whether there would be any objections to changing the parser to accept queries in relative references when the baseURL has an opaque path. If there aren't, I'd be happy to draft a PR and add some tests.

annevk commented 2 years ago

The reason for this is blob: and data: URLs and such. I also don't think that RFC 3986 necessarily allowed this for all URLs (URL schemes could "forbid" the question mark for instance), whereas it does make special mention of same-document references.

Edit: https://www.rfc-editor.org/rfc/rfc3986#section-5.2.2 seems to allow for this.
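
To make the RFC 3986 reading concrete, here is a minimal sketch of the empty-path branch of the section 5.2.2 resolution algorithm, using the component-splitting regular expression from RFC 3986 Appendix B. The function names `splitUri` and `resolveEmptyPathReference` are made up for this illustration, and the sketch deliberately rejects every reference that is not query-only or fragment-only. It models the RFC, not the WHATWG parser.

```javascript
// Component-splitting regular expression from RFC 3986, Appendix B.
const RFC3986_COMPONENTS =
  /^(([^:/?#]+):)?(\/\/([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?/;

// Hypothetical helper: split a URI reference into its five components.
// Components that are absent come back as `undefined`.
function splitUri(uri) {
  const m = RFC3986_COMPONENTS.exec(uri);
  return { scheme: m[2], authority: m[4], path: m[5], query: m[7], fragment: m[9] };
}

// Hypothetical helper: RFC 3986 section 5.2.2, restricted to references
// with no scheme, no authority and an empty path ("?query" or "#fragment").
function resolveEmptyPathReference(reference, base) {
  const r = splitUri(reference);
  const b = splitUri(base);
  if (r.scheme !== undefined || r.authority !== undefined || r.path !== "") {
    throw new RangeError("sketch only handles query-only and fragment-only references");
  }
  // T.path = Base.path; T.query = R.query if defined, else Base.query.
  const query = r.query !== undefined ? r.query : b.query;

  // Component recomposition, RFC 3986 section 5.3.
  let result = "";
  if (b.scheme !== undefined) result += b.scheme + ":";
  if (b.authority !== undefined) result += "//" + b.authority;
  result += b.path;
  if (query !== undefined) result += "?" + query;
  if (r.fragment !== undefined) result += "#" + r.fragment;
  return result;
}

console.log(resolveEmptyPathReference("?q=foo", "opq:hello"));
```

Under the RFC algorithm the base path is copied verbatim, so it never needs to be interpreted hierarchically for this kind of reference.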

annevk commented 1 year ago

@achristensen07 @valenting @ricea thoughts?

ricea commented 1 year ago

Agreed that RFC3986 seems to permit this. Given the low impact, this would be very low priority to implement.

valenting commented 1 year ago

Agreed. This seems like something that should work, but it's quite the corner-case.

annevk commented 1 year ago

Reading the RFC it also suggests that

so from that perspective I think we should probably not do this. Being consistent on path and query makes a lot of sense to me.

karwa commented 1 year ago

Since paths are opaque in those URLs, I think it is logical that we cannot resolve a relative path against them (they don't have a hierarchical structure that the standard recognises; we can't assume what they mean or how to combine the given relative path with the existing contents). A relative URL string without any leading delimiter is assumed to be a relative path.

I think the more appropriate consistency domain here is between query and fragment. They both require a leading delimiter, so we know unambiguously which component they refer to, and that they are absolute values rather than relative (both query and fragment are fully opaque; there is no concept of a "relative query" or "relative fragment").

As well as being consistent between query and fragment, it would improve consistency between URLs with opaque paths and those with hierarchical paths:

| Kind | base URL | "?b" relative to base | "#b" relative to base |
| --- | --- | --- | --- |
| Special | http://a/ | http://a/?b | http://a/#b |
| Path-only | test:/a | test:/a?b | test:/a#b |
| Opaque path | test:a | ERROR | test:a#b |

The error here appears to have an obvious solution.
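
The table can be regenerated with a short script. This is a sketch that assumes a runtime exposing the standard `URL` class as a global (for example a browser or a recent Node.js); `tryResolve` is a name invented here, and the result for the opaque-path row depends on the parser behaviour discussed in this issue.

```javascript
// Hypothetical helper: resolve `reference` against `base`, returning the
// string "ERROR" when the parser rejects the combination.
function tryResolve(reference, base) {
  try {
    return new URL(reference, base).href;
  } catch (err) {
    return "ERROR";
  }
}

// One base URL per row of the table.
const bases = {
  "Special": "http://a/",
  "Path-only": "test:/a",
  "Opaque path": "test:a",
};

const rows = Object.entries(bases).map(([kind, base]) => ({
  kind,
  base,
  query: tryResolve("?b", base),
  fragment: tryResolve("#b", base),
}));

console.table(rows);
```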

annevk commented 1 year ago

In practice, though, in URLs with opaque paths the query is treated as an extension of the path data and not as its own independent thing. And that's also the case in URLs with non-opaque paths. The path and query together determine what you get back from the server. It's only the fragment that's special and does something locally.

karwa commented 1 year ago

> The path and query together determine what you get back from the server. It's only the fragment that's special and does something locally.

Is that really something we can determine about all URLs with opaque paths? Enough to ban this operation? I mean, all of their components are opaque and have application-defined meaning - and we can't say which of their components have meaning to whom (if they even speak to a server) or for what purpose.

And even if we can determine that, I actually think it would be an argument in favour of supporting this - imagine I have a URL scheme for books, isbn:12345. Why should I be prohibited from building a table of contents which uses relative URL strings to refer to specific pages? e.g. ?page=24. I think that should be supported, the same way it works with non-opaque paths.

As for the path, I think our restrictions on modifying opaque paths are also questionable. The only reason I can think of is that the same input would behave very differently between URLs with opaque and non-opaque paths (replace vs append, respectively).

TimothyGu commented 1 year ago

I'm personally pretty satisfied with @annevk's explanation, that the path and query are considered parts of the same thing (response from the server). If a user would really like to modify the query, they can already do that using the .search setter, which works independently of relative URL parsing.
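
As an illustration of the .search workaround, the sketch below wraps it in a helper for the isbn: example from earlier in the thread. `resolveWithQueryFallback` is a hypothetical name, not part of any URL API; it assumes a global `URL` class, only falls back when the reference starts with `?`, and does not replicate the parser's stripping of leading whitespace.

```javascript
// Hypothetical helper: resolve `reference` against `base`, falling back to
// the .search and .hash setters when the parser rejects a query-only
// reference (as it currently does for bases with opaque paths).
function resolveWithQueryFallback(reference, base) {
  try {
    return new URL(reference, base).href;
  } catch (err) {
    // Only query-only references are handled; rethrow everything else.
    if (!reference.startsWith("?")) throw err;
  }

  // Split "?query#fragment" into its two parts.
  const hashIndex = reference.indexOf("#");
  const query = hashIndex === -1 ? reference : reference.slice(0, hashIndex);
  const fragment = hashIndex === -1 ? "" : reference.slice(hashIndex);

  const result = new URL(base); // throws if the base itself is invalid
  result.search = query;        // replaces any query on the base
  result.hash = fragment;       // an empty string removes the base's fragment
  return result.href;
}

console.log(resolveWithQueryFallback("?page=24", "isbn:12345"));
```

The fallback replaces the base's query and fragment rather than merging them, which mirrors how a query-only reference resolves against a base with a hierarchical path.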

> As for the path, I think our restrictions on modifying opaque paths are also questionable.

I don't think there's a technical limitation on why it's forbidden (though it's slightly tricky¹), but rather more of a web compatibility issue.

¹ To ensure the resulting URL still has an opaque path, you'd have to check that the new path doesn't start with a /, etc.
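
For reference, the existing restriction on paths can be observed through the pathname setter, which the URL Standard defines to return early when the URL has an opaque path. A minimal sketch, assuming a global `URL` class; the variable names are illustrative only.

```javascript
// URL with an opaque path: the pathname setter is a no-op.
const opaque = new URL("opq:hello");
opaque.pathname = "/world";

// URL with a hierarchical (list) path: the setter replaces the path.
const hierarchical = new URL("test:/hello");
hierarchical.pathname = "/world";

console.log(opaque.href);       // opq:hello
console.log(hierarchical.href); // test:/world
```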