Add `URL.from(object)` constructor?

tabatkins commented 1 year ago

Currently URLs can only be created from a string, and then modified after that. If you have an object of URL parts you have to make a dummy URL and then modify each part in turn. However, the various bits of URLs have some special rules preventing arbitrary modification; for instance when setting protocol it runs the URL parser in a special way, and that restricts the protocol from being altered in certain ways. (See the "scheme state" state, step 2.1.)

I'm happy to assume that those restrictions are necessary to ensure a consistent data model for some reason, but it does mean that if you're wanting to construct a URL that you have entirely in parts, then your choice of dummy URL to initialize the URL object with affects whether or not you'll be able to create your desired final URL, even if a URL made of said parts would be perfectly valid and correctly parsable if originally presented in string form.

It would be helpful to have a way to construct a URL directly from parts, to avoid issues like this. Suggestion: a static URL.from(object) method, that takes a dictionary matching the modifiable bits of the URL class and returns a fresh URL set up appropriately.

This probably isn't quite trivial, since right now the only way to create a URL is by invoking the parser and then invoking the parser more for each bit, so there's a degree of statefulness in this, but hiding this complexity from authors seems worthwhile.

/cc @bakkot @ljharb, who asked about this in the WHATWG chat room

ljharb commented 1 year ago

+1, this would be awesome, and would avoid the arbitrary roadblocks that make the "make a dummy URL and Object.assign" approach frictionful.

annevk commented 1 year ago

This is a duplicate of #354. I'm happy to leave this open for a bit to see if someone can resolve the challenges pointed out in that issue though.

dead-claudia commented 1 year ago

@annevk @ljharb I have a couple ideas to fit it in without duplicating too much existing logic:

Set the initial state based i optiobs and break out where each component would terminate (typically, this is close to where the parser's buffer is currently read).
For the pointer in the basic URL parser, read it from the first of origin/host/etc, set the initial state accordingly, and switch to the next section as needed (typically, this is close to where the parser's buffer is currently read).

Do you follow? And if so, thoughts?

ljharb commented 1 year ago

I'm not sure I follow… if you mean, select properties in a specific order to get to a mutable instance, and then mutate it, that seems fine to me.

dead-claudia commented 1 year ago

I'm not sure I follow… if you mean, select properties in a specific order to get to a mutable instance, and then mutate it, that seems fine to me.

@ljharb More like this:

Read scheme, parse scheme, skip slashes
Read host, parse host, skip colon + port steps
Read port, set port (parsing not needed)
Read path (with optional query), parse path + query
Read explicit query via URLSearchParams and merge it
Return resolved URL instance

The difference between option 1 and option 2 is this:

Option 1: each "parse" step work as follows:

Start with the right state.
- Scheme: scheme start state
- Host: authority state, but disallow initial @
- Path (file:): file host state
- Path (non-file:): path state
- Fragment: fragment state with optional # prefix
- Username and password might be a bit tricky, as they don't cleanly map to a specific parsing state
Ending states must be reached
- Scheme: scheme state, trailing : optional
- Host: host state
- Path (file:): path state or end
- Path (non-file:): path state, query state, or end
- Fragment: end
- Username and password might be a bit tricky, as they don't cleanly map to a specific parsing state
Some state transitions are errors:
- Scheme to anything but scheme start
- Authority to anything but host/hostname
- Path/query to fragment
- Path to query if file:

Username and password might be a bit tricky, as they don't cleanly map to a specific parsing state.

Option 2:

Source is either a string or dictionary
It iterates through that list, but within the loop through that layer of indirection.
Instead of explicitly jumping directly into those states per option 1, it starts as normal (but setting the pointer to the first character of origin). Then, it reads each part in segments, switching through each source string.
- Example: in step 9 > scheme state > 2 in the basic URL parser (for dictionary options):
  - An EOF must follow the colon.
  - After setting the scheme, set the pointer to the first byte of host and the state to authority state.

Does this help?

tabatkins commented 1 year ago

I actually assume the easiest method would be to just build a URL string from the parts, then pass it to the URL constructor. That requires the least corner-casing or refactoring, as it would use the existing spec machinery exactly as-is. It's a little bit wasteful, but what matters is just the author-observable behavior; an impl could, if they wished, implement the parsing more directly so long as it matched behavior.

This wouldn't reduce the testing effort for the feature itself, but it would substantially reduce the unrelated testing effort caused by spec refactoring to accommodate this. In fact, it should reduce to zero.

Thinking like:

dictionary URLFromInit {
  required USVString protocol;
  USVString hostname;
  USVString username;
  USVString password;
  USVString port;
  (USVString or sequence<USVString>) pathname;
  (USVString or URLSearchParams) search;
  USVString hash;
};

partial interface URL {
  static URL from(URLFromInit init);
};

Then the algo would be (stealing directly from the URL serializer algo):

Let |output| be a string consisting of |protocol| concatenated with "://".
If |username| or |password| are present:
1. If |username| and |password| are not both present, then throw a SyntaxError.
2. Append |username| to |output|.
3. If |password| is not the empty string, append ":" to |output|, then append |password| to |output|.
4. Append "@" to |output|.
If |hostname| is present:
1. Append |hostname|, [=serialized=], to |output|.
2. If |port| is present, append ":" followed by |port| to |output|.
If |hostname| is not present, |path| is present and is a sequence of length 2 or greater, and the first item in |path| is the empty string, append "/." to |output|.
If |path| is present:
1. If |path| is a DOMString, set |path| to be a list containing |path|.
2. For each |segment| of |path|, append "/" followed by |segment| to |output|.
If |search| is present:
1. If |search| is a URLSearchParams, set |search| to the serialization of |search|.
2. Append "?", followed by |search|, to |output|.
If |hash| is present, append "#" followed by |hash| to |output|.
[Run the "new URL(url, base)" algo steps and return the result].

Sprinkle some appropriate encode steps in this and you're golden. I skipped them for brevity and because figuring out precisely where and what kind of escaping is needed is more work than I want to do for a proposal.

ljharb commented 1 year ago

That sounds like a solid approach to me, allowing for optimization as needed/discovered.

karwa commented 1 year ago

Overall, I'm in favour of adding this and would like us to discuss and resolve any issues so it can be implemented in a consistent way by JS and other URL libraries.

Programmers expect to parse a URL string on the web or in their native applications and to receive the same result, and that is why developers are creating libraries which implement the WHATWG URL standard's parser. I think developers have the same expectation when constructing a URL from a set of components, and so it is worth producing a specification for how that operation should behave and collaborating to ensure the implementation is robust and accounts for the various edge-cases.

Sprinkle some appropriate encode steps in this and you're golden. I skipped them for brevity and because figuring out precisely where and what kind of escaping is needed is more work than I want to do for a proposal.

I think we do need to think about it. Anne mentioned it in the previous issue before it was closed, so it seems like it was the blocking question:

in particular it would be hard to ensure that the individual parts cannot affect the other individual parts

I think I can answer this, and I don't think it's actually so hard.

Each component's percent-encode set already includes the delimiters of later components (e.g. the query set includes #, the path set is the query set plus ?, the userinfo set is the path set plus /, etc).

That means if we have some arbitrary string and encode it using, say, the path encode-set, it will never contain a naked ? or #. Therefore, if we place that escaped string in the path's position, it will not introduce additional delimiters and affect components after the path.

I wonder if this API should have an option which disables additional percent-escaping and fails instead. For instance, it might be important to me that:

URL({ ..., pathname = x }).pathname == x

If |hostname| is present: Append |hostname|, [=serialized=], to |output|. If |port| is present, append ":" followed by |port| to |output|.

The hostname would need to go through the host parser (which depends on the scheme and may fail), and the port would need to be validated to ensure it is a number. The other components are basically opaque so there's no validation to do.

If |path| is present: If |path| is a DOMString, set |path| to be a list containing |path|. For each |segment| of |path|, append "/" followed by |segment| to |output|.

The path will need to be simplified. For instance, it might contain . or .. components, Windows drive letters, etc. For a string, we would need to split it in to components, but this would also be the first API which exposes the path as a sequence of segments, and we would need to perform some additional escaping to ensure those segments are preserved as given.

For instance:

// If we're only given a string, "AC/DC" looks like 2 path segments.
// We have no way to tell the difference.
URL({ ..., pathname = "/bands/AC/DC" }).pathname == "/bands/AC/DC"

// But if the user tells us "AC/DC" should be 1 segment, we'd have to escape it.
URL({ ..., pathname = ["bands", "AC/DC"] }).pathname == "/bands/AC%2FDC"

It's not a significant problem (we'd just need to add U+002F / and U+005C \ to the path encode-set), but it's worth bearing in mind. We might also choose to escape %, since the user is almost certainly not giving us pre-escaped path segments.

This issue hasn't come up until now because the existing parser splits the string on / and \, so it never sees path segments containing those characters. But that also means we can just add it to the path encode-set without breaking anything (...🤞)

As I mentioned, this would be the first part of the JS URL API to expose the path as a collection of segments. In my survey of URL APIs, I found that surprisingly few libraries expose such a view. Of those that do, rust-url is notable because it implements this standard's interpretation of URLs, and its PathSegmentsMut skips . and .. segments rather than simplifying them. That's what I chose to do in my own library, as well.

So it's somewhat debatable what the following should return:

URL({ ..., pathname = ["foo", "..", "bar"] }).pathname  // "/foo/bar" or "/bar"?

Note that we cannot escape . or .. components, so "/foo/%2E%2E/bar" is not an option.

Jamesernator commented 10 months ago

So it's somewhat debatable what the following should return:

Note that we cannot escape . or .. components, so "/foo/%2E%2E/bar" is not an option.

The safest thing to do would be just to throw an error and force users to decide what to do.

Alternatively there could be an option for some possible builtin behaviours (throw being default, as it by far the safest):

enum URLFromRelativePathSegmentBehaviour {
    "throw",
    "omit", // ["foo", "..", "bar"] → "/foo/bar"
    "resolve", // ["foo", "..", "bar"] → "/bar"
}

dictionary URLFromOptions {
    URLFromRelativePathSegmentBehaviour relativePathSegment = "throw";
}

dead-claudia commented 10 months ago

If |path| is present: If |path| is a DOMString, set |path| to be a list containing |path|. For each |segment| of |path|, append "/" followed by |segment| to |output|. The path will need to be simplified.

It could be simplified, but it doesn't need simplified. Servers normally handle this through one of four ways:

Tolerate it. If it's never used as a file name, it's not a security problem. And it might even be a valid resource ID.
Reject it. This is as easy as checking the path against the regexp /(^|\/)\.\.?($|\/)/. Or if you really wanted to optimize it, just loop with a counter, increment on every ., fail if the counter's 2 on either / or end of path, and reset the counter after the check for /.
Resolve it (say, pathname = path.posix.normalize(url.pathname) in Node) and reject accesses resolved to the parent (pathname === ".." || pathname.startsWith("../")).
Resolve it against / and just swallow parent accesses (say, pathname = path.resolve("/", url.pathname) in Node).

1 also implies there must be a way to allow paths to be passed in raw as well, without modification.

whatwg / url

Add `URL.from(object)` constructor? #782