Open rotu opened 2 months ago
The spec itself seems to be correct, at least the port part is passed without any default port handling in the match algorithm https://urlpattern.spec.whatwg.org/#url-pattern-match. I'd say it's more like an implementation problem. Filed a bug for chromoium crbug.com/363027641.
I confirmed this happens on chromium and also the polyfill library. The default port is removed when processing match()
. For chromium, it is removed when instantiating KURL. note: The basic URL parser removes the default port. Perhaps this is the cause in the polyfill if it uses new URL
to handle input in the match().
Sorry, I found a problematic part in the spec. In the step 12.2.2.1 of match, url
will be the result of URL Parser, which perform the basic URL parser. So the default port will be removed.
Removing default ports from URLs is something that the URL API always does:
new URL('http://example.com:80').port
// ''
So I think this is just a consequence of segment wildcards not matching empty strings. On the other hand, this pattern does work and captures an empty string for the port:
new URLPattern('http://example.com::port(.*)')
these are also options:
new URLPattern('http://example.com::port?')
new URLPattern('http://example.com::port*')
though if you truly want everything, you can just use *
, since that will match the empty string and the exec
API exposes the entire port as new URLPattern(...).exec(...).port.input
.
Yes, there are workarounds. It’s probably better to not canonicalize away the default port because a “null port” in the URL spec does not semantically mean “no port” for special schemes.
another option would be to forbid name tokens in the protocol and port. This would make parsing considerably simpler without losing much in the way of expressiveness.
Yes, there are workarounds. It’s probably better to not canonicalize away the default port because a “null port” in the URL spec does not semantically mean “no port” for special schemes.
Which special schemes, specifically? For all but "file
", the default port is canonicalized away by the URL
constructor, because it is equivalent to not specifying a port. A null port, on the other hand, can't be specified for such schemes.
new URL('http://example.com:80/').port
// ''
new URL('http://example.com:80/').toString()
// 'http://example.com/'
another option would be to forbid name tokens in the protocol and port. This would make parsing considerably simpler without losing much in the way of expressiveness.
What do we gain by doing so (assuming, for the sake of argument, that it would be web-compatible to)? e.g., this seems potentially useful and consistent with how other components work:
new URLPattern({protocol: 'web\\+:webprotocol'}).exec('web+foo://foo').protocol.groups
// {webprotocol: 'foo'}
Yes, there are workarounds. It’s probably better to not canonicalize away the default port because a “null port” in the URL spec does not semantically mean “no port” for special schemes.
Which special schemes, specifically? For all but "
file
", the default port is canonicalized away by theURL
constructor, because it is equivalent to not specifying a port. A null port, on the other hand, can't be specified for such schemes.
By a "null port" I do mean an unspecified port. as per the spec:
If url’s port is url’s scheme’s default port, then set url’s port to null.
Removing the port makes sense when you're creating a canonicalizing a concrete URL and you prefer brevity. It does not make sense for matching.
I'd expect the meaning of new URLPattern('*://*:80')
and new URLPattern({port:80})
to be "any URL that uses port 80" not "any URL that uses port 80 but is NOT http", which is the current semantics.
new URL('http://example.com:80/').port // '' new URL('http://example.com:80/').toString() // 'http://example.com/'
another option would be to forbid name tokens in the protocol and port. This would make parsing considerably simpler without losing much in the way of expressiveness.
What do we gain by doing so (assuming, for the sake of argument, that it would be web-compatible to)? e.g., this seems potentially useful and consistent with how other components work:
new URLPattern({protocol: 'web\\+:webprotocol'}).exec('web+foo://foo').protocol.groups // {webprotocol: 'foo'}
I originally thought you example would not work since new URLPattern('web\\+:x://bar').test('web+foo://bar')
is false. I now think this is instead broken because Chrome parses non-special URLs incorrectly ("//bar
" is treated as a pathname instead of as a hostname, whereas URLPattern
correctly treats bar
as the hostname)
The main advantage is that you can infer that the initial colon is the protocol delimiter.
e.g. new URLPattern('hello:world')
or new URLPattern('data:text/plain,*')
, both of which current error with "Relative constructor string ... must have a base URL passed as the second argument.".
The reason I don't think it makes sense to put capture groups in protocol or port is because these have atomic meaning. That is the scheme is web+foo
, and so the "foo" part is not exactly useful on its own in the same way a path component or URL search param value would be.
Re. port canonicalization: there are fundamentally a few options here, right:
"http{s}?://foo.com/*"
today), though you can get somewhat close in a verbose way "http{s}?://foo.com:(80|443)/*"
)On the web I would expect this distinction to come up less, since I suspect almost all URLs to be http/https/ws/wss URLs anyway, especially the case of http://foo.com/
and https://foo.com/
describing the same logical resource but non-default ports not necessarily doing so.
Re. segment identifier in port and protocol, I agree that it would be unusual to do so, but there is some value in as much consistency between different components as possible, if only for comprehension. I'm a little bit more amenable to the idea of making it a difference only in the shortest syntax for expressing it in the string shorthand (e.g., maybe you have to write {:foo}
to get that in the protocol and port components because the parser/tokenizer defaults the other way), but I think a more depth in what exactly that would entail in terms of
I'm not sure I'll have the time soon to make such an exploration, so in practice someone else would have to dig into it (I may be able to help gathering field data from Chrome if we get something that seems likely to be web-compatible and want to confirm) if you want to make a change in that direction.
The more I look at this, the more I think that the issue is that we can't base any logic on protocol
before having a concrete URL and that the constructor string should always be parsed similarly to a non-special URL.
That is, the step compute protocol matches a special scheme flag is conceptually impossible because that depends on the input to exec
which has not been specified yet. It could be both special and non-special for the same URLPattern
object. (*://example.com
might match http://example.com
but also web+custom://example.com
)
there are fundamentally a few options here, right:
- match the port component pattern against the port as specified
This would mean that pattern.exec(url)
and pattern.exec(new URL(url).href)
are inequivalent, which seems BAD.
- match the port component pattern against the canonicalized port (empty if default) (current behavior)
This causes the behavior to change based on whether the pattern has a special protocol.
- match the port component pattern against the canonicalized port (resolved to the actual port value, even if it was not specified in the URL)
I think I'm fully behind this one. It seems like it has the fewest corner cases.
- potentially match the port pattern more than once if it's the default port, and reason about the consequences of that
This seems impossibly strange. For instance, URLPattern(p).exec('http://example.com').port.input
will be either '80' or '', and both are misleading.
Part of this issue is also, as mentioned, that "just a consequence of segment wildcards not matching empty strings". Was this intentional? If so, why?
What is the issue with the URL Pattern Standard?
Using a
:<name>
token to capture the port of a URL fails if the URL is using the default port for that protocol (either explicitly or implicitly). Tested in Chrome.