whatwg / url

URL Standard
https://url.spec.whatwg.org/

Support relative URLs #531

Open sholladay opened 4 years ago

sholladay commented 4 years ago

The new URL() constructor currently requires at least one of its arguments to be an absolute URL.

new URL('./page.html', 'https://site.com/help/');  // OK
new URL('./page.html', '/help/');  // Uncaught TypeError: URL constructor: /help/ is not a valid URL.

That requirement is painful because determining which absolute URL to use as a base can be difficult or impossible in many circumstances. In a regular browser context, document.baseURI should be used. In Web Workers, self.location should be used. In Deno, window.location should be used but only if the --location command line option was used. In Node, there is no absolute URL to use. Trying to write isomorphic code that satisfies this requirement is quite error prone.
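To illustrate the kind of environment sniffing described above, here is a minimal sketch. The helper names guessBaseUrl and resolveUrl are hypothetical, not part of any standard API, and the env parameter exists only so the lookup can be mocked. It assumes globalThis is available, which is not true of older environments.

```javascript
// Hypothetical helper: look for a usable base URL on a global-like object.
// `env` defaults to the real global object but can be replaced in tests.
function guessBaseUrl(env = globalThis) {
  // Documents: document.baseURI takes a <base> element into account.
  if (env.document && typeof env.document.baseURI === 'string') {
    return env.document.baseURI;
  }
  // Environments that expose a location object on the global (e.g. workers).
  if (env.location && typeof env.location.href === 'string') {
    return env.location.href;
  }
  // No base URL could be found (e.g. Node.js).
  return undefined;
}

// Hypothetical wrapper: resolve against the guessed base when there is one.
// Without a base, new URL() still throws a TypeError for relative input.
function resolveUrl(input, env = globalThis) {
  const base = guessBaseUrl(env);
  return base === undefined ? new URL(input) : new URL(input, base);
}

const fakeDocumentEnv = { document: { baseURI: 'https://site.com/help/' } };
console.log(resolveUrl('./page.html', fakeDocumentEnv).href);
// 'https://site.com/help/page.html'
```

Every branch is another place for an environment to behave unexpectedly, which is the maintenance burden being described.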

Additionally, in many cases it would be useful to parse and resolve relative URLs against each other without knowing an absolute base URL ahead of time.

// Desired output - these currently do not work
new URL('/to', '/from').toString();  // '/to'
new URL('/to', '//from.com/').toString();  // '//from.com/to'

The lack of support for use cases involving only relative URLs is causing me to remove WHATWG URL from Ky, a popular HTTP request library, in favor of our own string replacement. See: https://github.com/sindresorhus/ky/pull/271

Desired API and whether to update the existing new URL() API or create a new API?

From my perspective, updating the new URL() constructor so it can handle a relative URL in the baseUrl argument would be ideal, i.e. remove the requirement for an absolute base in favor of simply parsing any missing URL parts as empty strings (as is currently done when a URL lacks a query, for example). But I understand that changing new URL() at this point may be difficult and it may be more practical to instead create a new API; perhaps new PartialURL() or split out the validation, parsing, and resolution algorithms into individual methods.

For my purposes, I need to at least be able to parse and serialize a relative URL, without having to provide an absolute base URL. A method that resolves two relative URLs against each other and returns the resulting relative URL would also be useful, e.g. URL.resolve('./from/index.html', './to') -> ./from/to.

annevk commented 4 years ago

Well, its purpose is to create a URL and those are by definition not relative. I could see wanting something specialized for path/query/fragment manipulation though. Are there any popular libraries that handle that we could draw inspiration from?

sholladay commented 4 years ago

Where is it defined that a URL must contain a scheme and a host in order to be a valid URL?

Even if such a definition exists, new URL() is the first API in the web ecosystem that I have encountered that has this limitation, making it quite surprising.

Beyond that, the WHATWG URL spec itself defines relative URLs...

https://url.spec.whatwg.org/#relative-url-string

As for existing implementations, see Node's url.parse() and url.resolve(), among others. I've used these extensively to manipulate URLs where the scheme and/or host is not known ahead of time and will be determined later by the end-user or browser, depending on where the URL is ultimately used.
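For reference, a small sketch of the legacy Node.js behaviour mentioned above. It assumes a Node.js runtime with CommonJS require; url.resolve() is a legacy API and is shown only to illustrate relative-to-relative resolution.

```javascript
// Legacy Node.js API: resolves a target against a base, neither of which
// has to be absolute.
const url = require('node:url');

const resolved = [
  url.resolve('/one/two/three', 'four'), // '/one/two/four'
  url.resolve('/from', '/to'),           // '/to'
];

console.log(resolved);
```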

annevk commented 4 years ago

It defines them as input (though only in the context of a base URL, which at least browsers always use), it doesn't define them as data structures. The data structure is defined at https://url.spec.whatwg.org/#url-representation (though it's fair to say that does make it seem like more is optional than in reality is optional; something to improve).

sholladay commented 4 years ago

I get that browsers need an absolute base URL to actually perform a request. And thus it makes sense for the URL specification to define what an absolute base URL is and discuss resolving relative URLs in the context of an absolute base URL, etc.

What doesn't make sense to me is why new URL() imposes this limitation. I cannot think of anything else on the web platform that does this. Even HTML's <base> tag supports relative URLs, despite the fact that it is specifically meant for defining the base URL.

I can see some value in an API that tests whether a URL is absolute. So perhaps part of the problem here is that new URL() actually does a lot of things: parsing, resolving, and validating. These could be broken down into separate methods. I don't think that is strictly necessary, though it would be one way to solve this.

annevk commented 4 years ago

Browsers only have a single URL parser that works as new URL() does (and as defined at https://url.spec.whatwg.org/#url-parsing). E.g., when parsing <base href> the location of the document is used. And in fact, the entirety of the web platform does this as it all builds upon this standard and its primitives.

sholladay commented 4 years ago

Browsers only have a single URL parser that works as new URL() does

Sure, as I said, it's completely reasonable that a browser needs to resolve to an absolute URL. But I'm not building a browser and I have a suspicion that most new URL() users aren't, either. I'm building software for the web platform that is environment agnostic and needs the same functionality as new URL() even if the scheme or host is not yet known. Use cases and relevant code linked to above.

mgiuca commented 4 years ago

To try and clarify this issue: it seems that you're not asking for a definitional change but an actual behavioural change to the Web-facing URL API.

Specifically, the changes you seem to be asking for are:

  1. If the base argument is not supplied, it defaults to document.location (the current page's URL), rather than the current behaviour which requires the url argument to be absolute if base is omitted.
  2. If the base argument is not absolute, it is first resolved against document.location (the current page's URL), rather than the current behaviour which unconditionally requires the base argument to be absolute.

So for example, if you executed these on https://github.com/whatwg/url/issues/531, all of the following are currently errors, and they would change to work as follows:

// Proposed API.
> new URL('to');
"https://github.com/whatwg/url/issues/to"

> new URL('to', '/from/');
"https://github.com/from/to"

> new URL('to', '//from.com/');
"https://from.com/to"

Technically, this is all feasible, but I don't think it's necessary or desirable. It's rather trivial to write code using the current API that behaves like this if you want it to:

// Current API.
> new URL('to', document.location);
"https://github.com/whatwg/url/issues/to"

> new URL('to', new URL('/from/', document.location));
"https://github.com/from/to"

> new URL('to', new URL('//from.com/', document.location));
"https://from.com/to"

I personally prefer not to change this. The current API forces you to be explicit about incorporating the current document's location, so it's clear to anyone reading the code that the current page's URL might leak into the result. When you don't use document.location as a base, it's a pure mathematical function of the inputs, and will produce the same output on any web page. That's a good property which I don't think we should break.

sholladay commented 4 years ago

No. I want to be able to parse and resolve relative URLs in an environment-agnostic way, for example on the server. It's completely unacceptable to rely on the DOM. The point of this issue is new functionality, which would behave exactly like new URL() does now, except it would support relative URLs in both arguments and it would return the resolved and parsed relative URL. That's it. I'm not asking for magical implicit resolution to an absolute URL. Just allow baseUrl to be relative and if it is relative, then return a relative URL.

I don't care if this is a change to the constructor or exposed as some new method.

mgiuca commented 4 years ago

Ohh, I see what you want now. (Tip: When filing a bug asking for a change to API behaviour, please give sample input and output so it's clear what you want.)

So am I right in thinking that this is what you want for my three examples:

// Proposed API.
> new URL('to');
"to"

> new URL('to', '/from/');
"/from/to"

> new URL('to', '//from.com/');
"//from.com/to"

(Noting that I'm using strings to represent the output above, but it would actually be a URL object.)

OK that makes sense. It does mean changing the URL object to allow representation of all kinds of relative URLs (scheme-relative, host-relative, path-relative, query-relative and fragment-relative). Though maybe that's helpful in explaining in general all of those different kinds of relative, which currently are not captured in the spec other than as details of the parser algorithm.

sholladay commented 4 years ago

To be fair, I referenced Node's url.resolve() as an example of an existing implementation that produces the expected output (approximately). But point taken. Yes, you are correct about the desired output.

This would be a massive help to a lot of libraries and tools, especially those that aim to be isomorphic.

masinter commented 4 years ago

For multipart/related, we invented a scheme "thismessage:". You could use "thismessage:/" as the base if you didn't have one, and remove it, if it is there, when done. https://www.iana.org/assignments/uri-schemes/uri-schemes.xhtml#thismessage

sholladay commented 4 years ago

Interesting. I did actually consider something exactly like that using invalid: as a scheme, but it's a hack and we'd like to avoid it. In Ky, we were able to use a regex string replacement for the query part of the URL, which also isn't great, but that was sufficient for the one place we still used new URL() - we removed all other usage of new URL() due to the aforementioned problems. There are other situations I've encountered, though, where something more complicated is needed. Parsing and resolving relative URLs is really something that should be built into the standard web APIs.

brainkim commented 4 years ago

Hi, I’m in a similar situation. I’m prototyping a bundler and I keep running into issues using the WHATWG URL class, specifically because it does not parse origin-relative URLs. The use-case is that I want to specify a common prefix for the public distribution of static files; for instance, the prefix can be the string "/static/", implying that the origin is the same origin as the server, but it can also be an absolute URL on a different origin ("https://mycdn.com/"). Some common operations I need include resolving relative and absolute URLs against this base, detecting if another URL is “outside” the base, and getting the relative path of a URL relative to the base, all of which could be done if an origin relative URL could be passed to the URL constructor, something like new URL("main.js", "/static/").

If anyone has any solutions, I’d love to hear about it. I’m loath to abandon the URL class completely because of all the work it does in parsing URLs, but right now I have a Frankenstein system with URLs, the path/posix module, and regexes that I’d like to abstract.

annevk commented 4 years ago

@brainkim for that specific case it seems you could work around this by using a fake origin such as https://fakehost.invalid and removing it later on.

Also, if we did something here it would not be by changing new URL(). The output of that has to be "complete" and useful in a wide variety of contexts that expect a scheme and such.
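The fake-origin workaround could be sketched roughly as follows. The function name resolveAgainstPrefix is hypothetical, and the sketch only covers the origin-relative case from the comment above: a prefix such as "/static/" or an absolute URL on another origin.

```javascript
// Placeholder origin that is stripped again after resolution.
const FAKE_ORIGIN = 'https://fakehost.invalid';

// Hypothetical helper: resolve `input` against `prefix`, where `prefix` may be
// origin-relative ("/static/") or absolute ("https://mycdn.com/").
function resolveAgainstPrefix(input, prefix) {
  const base = new URL(prefix, FAKE_ORIGIN);
  const url = new URL(input, base);
  // If the result still sits on the placeholder origin, return it without one.
  if (url.origin === FAKE_ORIGIN) {
    return url.pathname + url.search + url.hash;
  }
  return url.href;
}

console.log(resolveAgainstPrefix('main.js', '/static/'));           // '/static/main.js'
console.log(resolveAgainstPrefix('main.js', 'https://mycdn.com/')); // 'https://mycdn.com/main.js'
```

Note that the result is always normalized to an origin-relative or absolute form; path-relative input such as "../x" does not survive as path-relative output.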

brainkim commented 4 years ago

@annevk

I’m currently experimenting with using a custom protocol for the base (currently local:///) and it actually seems to be working out. It seems like it’s important to use 1 or 3 slashes so that the constructor does not interpret the first path part as a host. I still need posix path helpers to deal with pathname, and I have lots of code I’m not sure about like url.pathname.startsWith(publicPrefix.pathname) but this slowly seems to be turning into an acceptable solution.

Are there any thoughts on the fake protocol to use? I’m checking against https://en.wikipedia.org/wiki/List_of_URI_schemes to make sure I’m not stepping on well-known protocols. Maybe there is a very good reason not to use local:///? I’ve also considered internal:///, self:///, and relative:///? I want some name which indicates that the URL should be relative to the origin assigned to the server.
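The point about using one or three slashes can be checked with the existing API. This is a small sketch using the local: placeholder scheme from the comment above; the helper names describe and isUnderPrefix are hypothetical.

```javascript
// Show which part of the input ends up as host and which as path.
function describe(input) {
  const url = new URL(input);
  return { host: url.host, pathname: url.pathname };
}

console.log(describe('local:///static/main.js')); // { host: '', pathname: '/static/main.js' }
console.log(describe('local://static/main.js'));  // { host: 'static', pathname: '/main.js' }
console.log(describe('local:/static/main.js'));   // { host: '', pathname: '/static/main.js' }

// Hypothetical check for whether `input` stays under a public prefix.
// Scheme and host are compared directly because url.origin is the string
// 'null' for every non-special scheme.
function isUnderPrefix(input, prefix) {
  const prefixUrl = new URL(prefix, 'local:///');
  const url = new URL(input, prefixUrl);
  return url.protocol === prefixUrl.protocol &&
    url.host === prefixUrl.host &&
    url.pathname.startsWith(prefixUrl.pathname);
}

console.log(isUnderPrefix('js/main.js', '/static/'));    // true
console.log(isUnderPrefix('../secret.txt', '/static/')); // false
```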

masinter commented 4 years ago

You could use thismessage:/ which was set up exactly for this purpose when defining multipart/related

brainkim commented 4 years ago

@masinter Looks good. From https://www.w3.org/wiki/UriSchemes/thismessage:

defined for the sole purpose of resolving relative references within a multipart/related structure when no other base URI is specified

The “multipart form” part threw me off earlier but I think this is acceptable.

ghost commented 3 years ago

I hope @alwinb doesn’t mind me advertising their library here (nor anyone else, for that matter), but I recently found it through https://github.com/whatwg/url/issues/405#issuecomment-694786491, and it allows manipulating relative URLs and resolving them against other (relative or absolute) URLs in a way that complies with this specification.

It’s really simple, actually!

let url = new Url("../messages/goodbye.txt")
url = url.set({file: "hello.txt"})
console.log(url.host, [...url.dirs], url.file) // null, ["..", "messages"], "hello.txt"

console.log(new Url("https://example.com/things/index.html").goto(url).force().normalize().href) // "https://example.com/messages/hello.txt"

A couple notes:

Maybe this library can serve as inspiration of some kind for an API for the spec.

alwinb commented 3 years ago

@zamfofex thank you, that is a nice summary!

I think that the most important part is not the API though, but the model of URLs underneath.

The parser that is used in the standard at the moment, simply cannot support relative URLs (without major changes, at least). And after having worked on my library, I can understand why, because it was a really complicated and frustrating process to come up with something compliant that could! I'd forgive people for thinking that it cannot be done at all.

I'll sketch part of my solution, for the discussion here.


The force operation is one key part of the solution. Consider the issue of repeated slashes:

  1. http:foo/bar
  2. http:/foo/bar
  3. http://foo/bar
  4. http:///foo/bar

According to the standard all of these 'parse' (ie. parse-and-resolve) to the same URL. However, when 'parsed against a base URL' they behave differently. So you cannot just use:

or something like that, as a grammar, because then you'd fail to resolve correctly when a base URL is supplied. (I'm using square brackets for optional rules here). So you need to start off with a classic rule that has two slashes before the authority.

My first parser phase is very simple and parses them as such:

  1. (scheme"http") (dir"foo") (file"bar")
  2. (scheme"http") (path-root"/") (dir"foo") (file"bar")
  3. (scheme"http") (auth-string"foo") (path-root"/") (file"bar")
  4. (scheme"http") (auth-string "") (path-root"/") (dir"foo") (file"bar")

From there,

alwinb commented 3 years ago

I did a branch of jsdom/whatwg-url a while ago that uses a modular parsing/resolving algorithm, passes all of the tests (well, except 5/1305 that I was looking to get some help with) and has everything in place to start supporting relative URLs.

I did not post it because the changes are so large, as-is, that it would not be feasible to adopt them in the standard. I was thinking about a way to provide the same benefits incrementally and with less intrusive changes, so that it could be merged into the spec gracefully. However, I have the impression that even if I managed to do that, the changes would be resisted for reasons that are not technical but social and emotional. So I am leaving it here as is. I am disappointed by the situation; I hope it will work out eventually, because support for relative URLs would be very useful to people, and also because a modular/compositional approach enables you to talk with precision about the constituents that URLs are made of, improving the spec itself and all the discussions around it.

There have been good reasons why this has not been done before. It is a messy problem especially in combination with the different browser behaviours. I've built on that work and solved the issue, but as usual, there's more to it than solving the technical challenges.

Part of the discussion around this was in #479.

The branch, as-is, is here: https://github.com/alwinb/whatwg-url/tree/relative-urls. The readme is no longer accurate, sorry for that.

annevk commented 3 years ago

I think the main reason we have not made a lot of progress here is lack of browser-related use cases. Apart from browsers the API is only supported by Node.js. That's not enough for https://whatwg.org/working-mode#changes. Perhaps https://github.com/WICG/urlpattern will bring some change to this, but it's a bit too early to say. Now I might well be wrong and there is in fact a lot of demand for this inside the browser or by web developers using a library to solve this in browsers today. If someone knows that to be the case it would be great if they could relay that.

sholladay commented 3 years ago

Our use case is in the browser, I only mentioned other environments as an example of how it could benefit the larger community. Ky targets browsers primarily. We just don't want to specifically rely on the DOM or window. So we try to avoid referencing document.baseURI or window.location. That makes it difficult for us to use new URL() because it doesn't support relative URLs, which we are sometimes given as input because we are operating in a browser and relative URLs are a common occurrence in browser land.

annevk commented 3 years ago

Thanks for your reply Seth, could you perhaps go into some more detail as to why you want to avoid window.location and where these relative URLs are common?

masinter commented 3 years ago

you might check with @jyasskin for another use of relative URLs for browsers. Relative URLs were an important part of multipart/related capture of relationship of components in a saved web page. It was the reason for the invention of the "thismessage" scheme (for supplying a base when none was present.)

jyasskin commented 3 years ago

Re @masinter, web packages don't currently have any fields that allow relative URLs. If we change that, I don't think we'd need to expose the relative-ness to JavaScript—we'd just resolve them against the package's base URL, like we do for the relative URLs in HTML.

alwinb commented 3 years ago

I'm not completely sure I accurately understand the last comment, but I think that what @jyasskin calls 'exposing relative-ness' is just what this issue is asking for. It is asking for an addition to the API that exposes a parsed version of what is called a "relative reference" in the parlance of RFC 3986 (I usually call it a relative URL).

I'm arguing in favour of it because I would like the standard to define an analogue of "relative reference". This is not currently the case, so in places where relative references are useful or needed, people cannot refer to the standard for guidance.

@annevk points out that for such a change to be considered, they need examples where relative references are useful in a browser context, so we're looking for such use cases.

ti1024 commented 3 years ago

Thanks for your reply Seth, could you perhaps go into some more detail as to why you want to avoid window.location and where these relative URLs are common?

@annevk points out that for such a change to be considered, they need examples where relative references are useful in a browser context, so we're looking for such use cases.

I think that there are natural cases where generating relative URLs is useful in a web app.

Suppose that some component A generates a link to another component B which takes a query parameter. For example, component A is at http://example.com/inbox and component B is at http://example.com/message?id=<the ID of a message>.

One approach is to generate an absolute URL, so that the DOM will be like <a href="http://example.com/message?id=abcde">Open message</a>. But this introduces unnecessary dependency on the domain name. This causes inconveniences such as that the domain name has to be faked in unit tests.

Another approach is to generate a relative URL, so that the DOM will be like <a href="/message?id=abcde">Open message</a>, and leave the relative-to-absolute conversion to the browser. To do so, it would be useful to write code like

const url = new URL('/message');
url.searchParams.set('id', messageId);
const link = createElement('a');
link.href = url.href;
...

but this code does not currently work because new URL('/message') throws.

stevenvachon commented 3 years ago

@ti1024:

new URL('/message', location.href);

ti1024 commented 3 years ago

@stevenvachon That is exactly what I described as “One approach is to generate an absolute URL”, with the drawback I described.

sholladay commented 3 years ago

could you perhaps go into some more detail as to why you want to avoid window.location and where these relative URLs are common?

@annevk Sure. The reason we want to avoid using window.location is because it doesn't exist in Web Workers, among other environments. Web Workers do have self instead of window, though. Node.js doesn't have window or self. There are even environments where a window does exist but without a window.location, such as Deno. Newer environments have globalThis but older environments don't. There are so many special cases, it's a mess and difficult to maintain.

Relative URLs are common mainly in apps that target browsers. It's not uncommon to see something like fetch('/foo.jpg') or fetch('../constants.json'). We aim to make this work, while keeping the implementation of the Ky library as environment agnostic as possible.

Early versions of Ky were designed to pass URLs directly to fetch() without modifying them and without referencing window or document. That worked well because fetch() correctly handles relative URLs as input, and it resolves them against either document.baseURI (e.g. from the <base> HTML element), or window.location, depending on what is available. fetch() works as expected and we want Ky to work that way, too.

Then people requested a new feature where you can pass a searchParams object to Ky, and Ky will add those params to the input URL before calling fetch(). This is useful, for example, if you are creating a custom API client with ky.extend() and you always want to include a ?limit=100 param to limit the page size to 100 items in the response to every request that is sent with that client. When that feature was implemented, we had to decide how to apply the searchParams to the input URL, and for that we began using new URL() and its property setters, since it's easy to do myUrl.search = mySearchParams. That solution seemed good at the time, but later we realized that it broke relative URL support because new URL() lacks support for relative URLs. I tried to fix the regression by resolving the input URL against the document base, with new URL(input, document.baseURI). But that then caused problems for people using Ky in Web Workers, React Native on mobile devices, and Node.js. I then fixed that by guarding the document reference, although in hindsight that also needs a fallback to window.location, which itself needs to be guarded. You'd think that would be enough, but we had further complaints that our approach of referencing globals was too difficult to mock. The attempted fix for that then broke more stuff...

The point is, writing environment agnostic code that depends on window or document is pretty tricky in practice. And in the end, we were only doing that as a workaround for new URL()'s lack of support for relative URLs. So we dropped new URL() and resorted to regex-based string replacement of the search params instead, for now.
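As an illustration of the string-based fallback described above, here is a minimal sketch. It is not Ky's actual implementation; the function name withSearchParams is hypothetical, and it splits the input by hand instead of using a regex. Like myUrl.search = mySearchParams, it replaces any existing query.

```javascript
// Hypothetical helper: replace the query of a possibly-relative URL string
// without ever needing a base URL.
function withSearchParams(input, params) {
  // The fragment starts at the first '#'.
  const hashIndex = input.indexOf('#');
  const hash = hashIndex === -1 ? '' : input.slice(hashIndex);
  const beforeHash = hashIndex === -1 ? input : input.slice(0, hashIndex);

  // The query starts at the first '?' before the fragment.
  const queryIndex = beforeHash.indexOf('?');
  const path = queryIndex === -1 ? beforeHash : beforeHash.slice(0, queryIndex);

  // URLSearchParams handles the encoding of keys and values.
  const search = new URLSearchParams(params).toString();
  return path + (search ? '?' + search : '') + hash;
}

console.log(withSearchParams('../items?page=2#top', { limit: '100' }));
// '../items?limit=100#top'
```

The input is never normalized here, so "../items" stays path-relative, which is the property that new URL() cannot currently provide.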

annevk commented 3 years ago

How do relative URLs in fetch() work then? Does that mean that Deno and Node.js also have a base URL, but the fundamental problem in your case is that there is no consistent way to get at the base URL across host environments?

alwinb commented 3 years ago

My understanding is that it is not about a consistent way of getting the base URL but about trying to keep the base URL abstract to promote the modularity of the code.

annevk commented 3 years ago

Well, fetch() does not work with relative URLs. It needs a base URL when given a relative URL.

styfle commented 3 years ago

I think that new URL() will need to remain absolute for backwards compatibility since too many APIs accepting URL today depend on it.

Perhaps we need a new RelativeURL() to solve this use case with a similar API (has searchParams but not hostname for example).

Although I'm not sure relative is even the right term here because you might say /page is relative but you could also say these are relative too: ./page, ../page, or even page.

So I think what we want is something in between new URL() and new URLSearchParams(), perhaps new URLPathAndHash().

https://nodejs.org/api/url.html#url_url_strings_and_url_objects

sholladay commented 3 years ago

To me, a relative URL is any URL that doesn't have a scheme, such as /page.html and //site.com/page.html. In the wild, I've seen people refer to /page.html as an "absolute URL", but that doesn't make sense to me, except perhaps from the server's perspective.

I didn't mean to imply that you can fetch() relative URLs in Node or Deno. That generally wouldn't work. (Well, to be precise, in Deno you can actually opt-in to providing a value for window.location on the command line with the --location option. Then you could use fetch() with a relative URL.)

Rather, the problem is the combination of the following:

  1. We want to parse and manipulate URLs, thus we use new URL()
  2. The API design of new URL() forces us to use browser and DOM APIs to explicitly provide an absolute base URL, instead of leaving the base URL implicit as is done in every other web API that I'm familiar with, including fetch()
  3. The usage of those browser and DOM APIs makes it surprisingly difficult to write cross-platform (isomorphic) code, which is problematic for many use cases, such as an HTTP request library

alwinb commented 3 years ago

Well, fetch() does not work with relative URLs. It needs a base URL when given a relative URL.

No, it accepts a relative URL string as an argument. It then internally resolves that against an implicit base URL. So if you want to modify the argument with the URL API you need to retrieve that implicit base URL first. The point is that this would not be an issue with a URL API that supports relative URLs.

alwinb commented 3 years ago

As for what constitutes a relative URL, depending on how you model a URL/URL 'record' there are at least five parts that may be absent or present (protocol, authority, path, query, hash, but the path may be subdivided), so there are at least 2^5 = 32 'shapes' of URLs, out of which only a few would be considered absolute.

This might sound like it is very complex, but the general structure is fairly simple and RFC3986 specifies how to handle all of them. Moreover, the behaviour that is specified in the RFC can be adapted to express the behaviour of web browsers, as I've laid out in my projects.

annevk commented 3 years ago

@alwinb in order for fetch() to do its thing it needs an absolute URL. So I was wondering how that would work in Node or Deno. But reading the link in OP again it seems there is a non-standard version of fetch() that doesn't require an absolute URL. Now, that doesn't mean it would be possible to not require it for new URL() as I mentioned in https://github.com/whatwg/url/issues/531#issuecomment-681657589. A new API for relative URLs could be added as @styfle mentioned, but this would be quite a bit of effort (including in a best case scenario where it would be fairly straightforward) that has to be justified somehow.

jasnell commented 3 years ago

@annevk :

So I was wondering how that would work in Node or Deno.

The answer is, it wouldn't, not without additional context.

For those advocating for more support of relative URLs here, the challenges are that:

(a) It's impossible to interpret a relative URL reliably without at least a scheme because the semantic interpretation and normalization of the relative parts could change from one scheme to the next.

(b) The relative URL simply isn't usable by itself. In order to make use of it at all you need to know what it's relative to. In the browser, there's always the context of whatever page you're currently looking at. This gives the illusion that relative URLs like <a href="./foo">thing</a> are useful because that outer context is abstracted away. That context simply does not exist in Node.js but it's no less critical.

@sholladay makes the statement, "I want to be able to parse and resolve relative URLs in an environment-agnostic way" ... but that's impossible because you cannot separate the environment from either the parsing or the resolving! Sure, you could transform one relative URL fragment into another relative URL fragment, but neither are actually usable without the environment and context.
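The scheme dependence in point (a) can be observed with the existing API. In this sketch the same relative reference is resolved against two bases; the base URLs are arbitrary examples, and the backslash case is just one of several scheme-dependent behaviours.

```javascript
// One relative reference, written with a backslash.
const reference = 'a\\b';

// For special schemes such as https, the parser treats '\' like '/'.
const special = new URL(reference, 'https://example.com/');

// For other schemes, '\' is an ordinary path character.
const nonSpecial = new URL(reference, 'thismessage:/');

console.log(special.pathname);    // '/a/b'
console.log(nonSpecial.pathname); // '/a\\b' (a single backslash in the path)
```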

The way to move forward, here, I think, is for those advocating for a change here to provide clear answers to:

  1. How are you intending to use the relative URLs?
  2. Is a change to the URL API actually necessary, or is a secondary API workable? e.g. const fragment1 = new URLFragment('/foo/'); const fragment3 = new URLFragment('bar', fragment1). The difference here is that URLFragment could never be interpreted as a usable URL.

alwinb commented 3 years ago

@jasnell Yes you can separate the environment from the parsing, though sadly the 'type' of URL has to be passed as an argument to the parser (e.g. file, web, generic). I have completed the research on that, see e.g. this. Second, it is immensely useful to be able to abstract the context away, that is one of the main things we do as programmers!

Here is an example. You might have a web framework (maybe to build a webshop or what have you) that allows for 'plugins'. Such a 'plugin' would consist of a number of endpoints. Now you want to allow 'mounting' multiple instances of that plugin at different paths. The plugin's endpoints would be specified as relative URLs. The 'mount' operation instantiates the plugin and resolves the relative references. In other words, the plugin is parameterised by a base URL and the mount operation resolves them. That is immensely useful.

alwinb commented 3 years ago

About justifying the work, @annevk

To be fair, there is a lot of work that you do not have to justify anymore because I have already done it. The core problem: specifying a resolution algorithm that agrees with the current WHATWG standard – is a solved problem.

Let me stress that. The major technical issues have been solved.

What remains to be done is to ask for input from the wider community to come up with a couple of APIs around that and converge on a well liked one. This could be the most enjoyable part.

The problem with referring to the working mode to justify not making the additional effort is that the working mode conflicts with the explicitly stated goal to:

Align RFC 3986 and RFC 3987 with contemporary implementations and obsolete them in the process.

A large part of RFC3986 is dedicated to relative references. You cannot achieve the above goal without putting in an effort to also specify relative references in the WHATWG standard.

You can choose to drop that goal. But that puts the community in a difficult spot. There is a demand for working with relative references in a way that is compatible with web browsers, as is illustrated by e.g. this Node.js issue. Again, the current WHATWG URL API cannot be used for that. But what is worse is that there also is no way to recreate the tools yourself by looking at the standard! So you leave the community with a new impactful standard, but you take away the tools and do not provide the information to recover them.

Many of the issues that keep popping up on the issue board here are the result of that situation. So I say that addressing this issue, which has been almost completely solved anyway, will save you a lot of work in the future.

masinter commented 3 years ago

A use case that everyone relies on is HTML email. What is the base?

vwkd commented 2 years ago

Just to clarify: Using a placeholder domain doesn't cover all use cases. For example, a relative URL with dot segments like ./a/../b/../c?c=d&e=f#g would be resolved against the base without a way to recover it afterwards.
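To make that concrete, here is a small sketch of the placeholder workaround. The base https://placeholder.invalid/x/y is an arbitrary choice for illustration; any absolute URL would show the same effect.

```javascript
// Arbitrary placeholder base, used only so that new URL() accepts the input.
const PLACEHOLDER = 'https://placeholder.invalid/x/y';

const input = './a/../b/../c?c=d&e=f#g';
const resolved = new URL(input, PLACEHOLDER);

// Strip the placeholder origin again to get "something relative" back.
const recovered = resolved.pathname + resolved.search + resolved.hash;

console.log(recovered); // '/x/c?c=d&e=f#g'

// The dot segments are gone and the placeholder's '/x/' has leaked into the
// path, so the original reference cannot be reconstructed from the result.
console.log(recovered === input); // false
```

The query and fragment survive, but the path has been normalized and merged with the placeholder's path, so the result can no longer be distinguished from an input of 'c?c=d&e=f#g' or '/x/c?c=d&e=f#g'.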

I'd very much like to see a Web standard for a platform-agnostic way to deal with such relative URLs without resorting to brittle string manipulation. IMO, a standard that calls itself "URL" should support that.

ghost commented 2 years ago

@alwinb: The specification you wrote is beautiful. (And definitely deserves more recognition.) If you have the time, I think you should consider sending a pull request to incorporate some aspects of it into this repo.

It might be difficult to decide which parts, and how to incorporate them, though — I don’t think it makes sense to try to incorporate the whole thing at once.

Maybe it would be sensible to first try to amend the existing state machine algorithm with your grammar or parts thereof, and progressively replace parts of it over time.

I suppose the precursor to that would be to write the necessary machinery to include the grammar into the spec and to interleave it with the state machine algorithm.

Honestly, I really wish the WHATWG would give more value to the interest behind this, as well as to the effort @alwinb has put into consolidating a specification.

I acknowledge it can be difficult to assess the validity of interests across all the different WHATWG specs, and that’s why I think a concrete pull request would help show the applicability of the approach.

phawxby commented 2 years ago

I think the main reason we have not made a lot of progress here is lack of browser-related use cases. Apart from browsers the API is only supported by Node.js.

If I'm reading this right, it sounds like @annevk's justification for not wanting to make changes to the API is that it's not a problem in browsers, and who cares about Node. More than enough people in this issue and the related Node issue have said this is a problem. Whether or not you see the use case, many other people do have perfectly valid use cases.

I get why it may well be difficult or problematic, especially if it involves changing an active API, but the initial implementation was short-sighted and missed a core piece of functionality people actually need. This needs a course correction.

jasnell commented 2 years ago

If I'm reading this right, it sounds like @annevk's justification for not wanting to make changes to the API is that it's not a problem in browsers, and who cares about Node.

That's really not fair. The WG has been proactive about reaching out to Node.js on changes that impact URL.

A lot of this issue is based on a few fundamental misunderstandings about how the URL parser works. Let's go back to the original example in this issue:

new URL('./page.html', 'https://site.com/help/');  // OK
new URL('./page.html', '/help/'); // Error

The original post has the statement: "That requirement is painful because determining which absolute URL to use as a base can be difficult or impossible in many circumstances."

To be clear, it's not that the URL parser really needs to know an absolute URL to function. It needs to know what the protocol component of the URL is in order to make sense of the input, and it needs to know what it is normalizing the relative path against.

For instance, specifically in that second example, should the input be treated as an opaque path or hierarchical? Should it be interpreted as "special" or not? There are quite a few places in the parsing algorithm where that distinction is important.
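As a rough illustration of why the scheme matters, the sketch below resolves one relative input against three kinds of base. The scheme foo and the hosts are made up for the example; only the https and mailto behaviour is annotated with an expected result.

```javascript
// One relative input, three different kinds of base.
const input = '\\help\\page.html';

// Special scheme (https): backslashes are treated like forward slashes.
const special = new URL(input, 'https://site.com/docs/');

// Non-special scheme ("foo" is a made-up scheme): the backslash is not a
// path separator here, so the same input yields a different path.
const nonSpecial = new URL(input, 'foo://site.com/docs/');

// Base with an opaque path (mailto): relative resolution is not possible.
let opaqueError = null;
try {
  new URL(input, 'mailto:someone@example.com');
} catch (err) {
  opaqueError = err;
}

console.log(special.pathname); // '/help/page.html'
console.log(nonSpecial.pathname === special.pathname); // false
console.log(opaqueError instanceof TypeError); // true
```

Without a base, the parser has no way to choose between these three interpretations of '/help/' style inputs.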

That said, I do think there's a reasonable potential path forward. If we could add the option of passing an object as the second argument to new URL, then we could do something like new URL('/foo', { protocol: 'http:' }), that is, pass in a URL Record as the second argument to provide the missing detail. I think that would address the core need here with minimal changes to the API.

masinter commented 2 years ago

https://github.com/alwinb/url-specification/issues/16

looks like good progress @alwinb

karwa commented 2 years ago

I'm not a big fan of combining that with the existing API for absolute URLs. A lot of legacy libraries went that way, and it ends up having all kinds of problems - from poor performance to non-obvious semantics. For example, in a strongly-typed language, you can have a function which accepts a parameter of type URL; but if that single URL type supports both absolute URLs and relative references, pretty much anything (including "foo") counts as a URL, which is generally not what developers expect (at least for the sorts of applications in my domain, perhaps expectations on the web are different). I think the use-cases are distinct enough that they warrant a separate API.

Also, I'm not sure it's obvious that we need all of the quirky web-compatibility behaviour that the relative-string parser does, treating back-slashes like forward-slashes and such. For a lot of use-cases, you have better control over the inputs and can do perfectly fine with only a sanitised subset of that behaviour.

One thing that's worth noting though: currently you don't only need the scheme to know how to interpret a URL. For example, in this case, a correct interpretation also requires you to know the base URL's path:

// "C|" is not interpreted as a drive letter if the base path has a drive letter
(input: "/a/../C|/Windows", base: "file:///D:/Music")  --> "file:///D:/C|/Windows"

// Same input string, but base doesn't have a drive letter.
// Now "C|" is considered a drive letter.
(input: "/a/../C|/Windows", base: "file:///Dx/Music")  --> "file:///C:/Windows"

(This is #574, and hopefully fixable)

alwinb commented 2 years ago

Passing a fallback protocol to the parser to select certain behaviour for the otherwise ambiguous, scheme-less URLs does work, and this is what I have done so far.

But having to pass options around becomes cumbersome, and I can see how that would cause confusing problems. So I’ve taken on the challenge to structure things in a way that avoids that as much as possible. And there are some interesting things to note about this.

What works well for most issues is to loosen the constraints on URLs somewhat whilst modifying or combining them, and to enforce them later by calling a separate method to convert the (possibly) relative URL to an absolute/resolved URL, something that was suggested to me by @zamfofex.

The IETF standards have made this distinction between more and less constrained URLs before. The name for such a more tolerant and/or relative URL is a URIReference, or an IRIReference. The WHATWG equivalent to that would be slightly more tolerant still, as it would allow a few more code points and invalid percent-escape sequences in various components, so as to remain consistent with WHATWG URLs.

alwinb commented 2 years ago

Alright. I think I’m getting there. I am trying to get an implementation together that can serve as an API proposal. It may take a bit of time still, but I’ll do my best.