whatwg / urlpattern

URL Pattern Standard
https://urlpattern.spec.whatwg.org/
Other
159 stars 22 forks source link

Cannot capture default port in a named group #234

Open rotu opened 2 months ago

rotu commented 2 months ago

What is the issue with the URL Pattern Standard?

Using a :<name> token to capture the port of a URL fails if the URL is using the default port for that protocol (either explicitly or implicitly). Tested in Chrome.

// false!?!?
new URLPattern('http://example.com::port').test('http://example.com:80')

// works if you use a non-standard port
new URLPattern('http://example.com::port').test('http://example.com:81') 
// works if you use an asterisk instead of a name token
new URLPattern('http://example.com:*').test('http://example.com:80')
sisidovski commented 2 months ago

The spec itself seems to be correct, at least the port part is passed without any default port handling in the match algorithm https://urlpattern.spec.whatwg.org/#url-pattern-match. I'd say it's more like an implementation problem. Filed a bug for chromoium crbug.com/363027641.

I confirmed this happens on chromium and also the polyfill library. The default port is removed when processing match(). For chromium, it is removed when instantiating KURL. note: The basic URL parser removes the default port. Perhaps this is the cause in the polyfill if it uses new URL to handle input in the match().

sisidovski commented 2 months ago

Sorry, I found a problematic part in the spec. In the step 12.2.2.1 of match, url will be the result of URL Parser, which perform the basic URL parser. So the default port will be removed.

jeremyroman commented 2 months ago

Removing default ports from URLs is something that the URL API always does:

new URL('http://example.com:80').port
// ''
jeremyroman commented 2 months ago

So I think this is just a consequence of segment wildcards not matching empty strings. On the other hand, this pattern does work and captures an empty string for the port:

new URLPattern('http://example.com::port(.*)')

these are also options:

new URLPattern('http://example.com::port?')
new URLPattern('http://example.com::port*')

though if you truly want everything, you can just use *, since that will match the empty string and the exec API exposes the entire port as new URLPattern(...).exec(...).port.input.

rotu commented 2 months ago

Yes, there are workarounds. It’s probably better to not canonicalize away the default port because a “null port” in the URL spec does not semantically mean “no port” for special schemes.

rotu commented 2 months ago

another option would be to forbid name tokens in the protocol and port. This would make parsing considerably simpler without losing much in the way of expressiveness.

jeremyroman commented 2 months ago

Yes, there are workarounds. It’s probably better to not canonicalize away the default port because a “null port” in the URL spec does not semantically mean “no port” for special schemes.

Which special schemes, specifically? For all but "file", the default port is canonicalized away by the URL constructor, because it is equivalent to not specifying a port. A null port, on the other hand, can't be specified for such schemes.

new URL('http://example.com:80/').port
// ''
new URL('http://example.com:80/').toString()
// 'http://example.com/'

another option would be to forbid name tokens in the protocol and port. This would make parsing considerably simpler without losing much in the way of expressiveness.

What do we gain by doing so (assuming, for the sake of argument, that it would be web-compatible to)? e.g., this seems potentially useful and consistent with how other components work:

new URLPattern({protocol: 'web\\+:webprotocol'}).exec('web+foo://foo').protocol.groups
// {webprotocol: 'foo'}
rotu commented 2 months ago

Yes, there are workarounds. It’s probably better to not canonicalize away the default port because a “null port” in the URL spec does not semantically mean “no port” for special schemes.

Which special schemes, specifically? For all but "file", the default port is canonicalized away by the URL constructor, because it is equivalent to not specifying a port. A null port, on the other hand, can't be specified for such schemes.

By a "null port" I do mean an unspecified port. as per the spec:

If url’s port is url’s scheme’s default port, then set url’s port to null.

Removing the port makes sense when you're creating a canonicalizing a concrete URL and you prefer brevity. It does not make sense for matching.

I'd expect the meaning of new URLPattern('*://*:80') and new URLPattern({port:80}) to be "any URL that uses port 80" not "any URL that uses port 80 but is NOT http", which is the current semantics.

new URL('http://example.com:80/').port
// ''
new URL('http://example.com:80/').toString()
// 'http://example.com/'

another option would be to forbid name tokens in the protocol and port. This would make parsing considerably simpler without losing much in the way of expressiveness.

What do we gain by doing so (assuming, for the sake of argument, that it would be web-compatible to)? e.g., this seems potentially useful and consistent with how other components work:

new URLPattern({protocol: 'web\\+:webprotocol'}).exec('web+foo://foo').protocol.groups
// {webprotocol: 'foo'}

I originally thought you example would not work since new URLPattern('web\\+:x://bar').test('web+foo://bar') is false. I now think this is instead broken because Chrome parses non-special URLs incorrectly ("//bar" is treated as a pathname instead of as a hostname, whereas URLPattern correctly treats bar as the hostname)

The main advantage is that you can infer that the initial colon is the protocol delimiter.

e.g. new URLPattern('hello:world') or new URLPattern('data:text/plain,*'), both of which current error with "Relative constructor string ... must have a base URL passed as the second argument.".

The reason I don't think it makes sense to put capture groups in protocol or port is because these have atomic meaning. That is the scheme is web+foo, and so the "foo" part is not exactly useful on its own in the same way a path component or URL search param value would be.

jeremyroman commented 2 months ago

Re. port canonicalization: there are fundamentally a few options here, right:

On the web I would expect this distinction to come up less, since I suspect almost all URLs to be http/https/ws/wss URLs anyway, especially the case of http://foo.com/ and https://foo.com/ describing the same logical resource but non-default ports not necessarily doing so.

Re. segment identifier in port and protocol, I agree that it would be unusual to do so, but there is some value in as much consistency between different components as possible, if only for comprehension. I'm a little bit more amenable to the idea of making it a difference only in the shortest syntax for expressing it in the string shorthand (e.g., maybe you have to write {:foo} to get that in the protocol and port components because the parser/tokenizer defaults the other way), but I think a more depth in what exactly that would entail in terms of

  1. how easy the syntax is to explain to developers
  2. how much it complicates the spec & implementations
  3. whether this change breaks existing usage (there are users using the browser impl in Chromium browsers and polyfills in other browsers, for instance)

I'm not sure I'll have the time soon to make such an exploration, so in practice someone else would have to dig into it (I may be able to help gathering field data from Chrome if we get something that seems likely to be web-compatible and want to confirm) if you want to make a change in that direction.

rotu commented 2 months ago

The more I look at this, the more I think that the issue is that we can't base any logic on protocol before having a concrete URL and that the constructor string should always be parsed similarly to a non-special URL.

That is, the step compute protocol matches a special scheme flag is conceptually impossible because that depends on the input to exec which has not been specified yet. It could be both special and non-special for the same URLPattern object. (*://example.com might match http://example.com but also web+custom://example.com)

there are fundamentally a few options here, right:

  1. match the port component pattern against the port as specified

This would mean that pattern.exec(url) and pattern.exec(new URL(url).href) are inequivalent, which seems BAD.

  1. match the port component pattern against the canonicalized port (empty if default) (current behavior)

This causes the behavior to change based on whether the pattern has a special protocol.

  1. match the port component pattern against the canonicalized port (resolved to the actual port value, even if it was not specified in the URL)

I think I'm fully behind this one. It seems like it has the fewest corner cases.

  1. potentially match the port pattern more than once if it's the default port, and reason about the consequences of that

This seems impossibly strange. For instance, URLPattern(p).exec('http://example.com').port.input will be either '80' or '', and both are misleading.


Part of this issue is also, as mentioned, that "just a consequence of segment wildcards not matching empty strings". Was this intentional? If so, why?