whatwg / urlpattern

URL Pattern Standard
https://urlpattern.spec.whatwg.org/

delimiter characters not included in component parts #227

Open rotu opened 1 month ago

rotu commented 1 month ago

What is the issue with the URL Pattern Standard?

In the URL standard, delimiters are included in their respective parts (hash, protocol, and search):

e.g. the getters on new URL("http://example.com:80/bar?baz=bang#id") preserve the delimiters:
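As a minimal sketch of that behavior (the variable names are illustrative and not taken from the original comment), the getters on a parsed URL return each component together with its delimiter:

```javascript
// Parse the URL from the example above and read the delimited getters.
const exampleUrl = new URL("http://example.com:80/bar?baz=bang#id");

const delimitedParts = {
  protocol: exampleUrl.protocol, // trailing ":" is part of the value
  search: exampleUrl.search,     // leading "?" is part of the value
  hash: exampleUrl.hash,         // leading "#" is part of the value
};

console.log(delimitedParts);
// { protocol: 'http:', search: '?baz=bang', hash: '#id' }
```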

On the other hand, URLPattern objects omit these delimiters:
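A rough sketch of the same check against URLPattern follows. It assumes a runtime where URLPattern may or may not exist, so it feature-detects first; `readPatternParts` is a hypothetical helper name, and the values in the comments are what the behavior described in this issue would give.

```javascript
// URLPattern is not available in every runtime, so feature-detect it first.
function readPatternParts(input) {
  if (typeof URLPattern === "undefined") {
    return null; // No URLPattern implementation in this environment.
  }
  const pattern = new URLPattern(input);
  return {
    protocol: pattern.protocol,
    search: pattern.search,
    hash: pattern.hash,
  };
}

const patternParts = readPatternParts("http://example.com:80/bar?baz=bang#id");

// Where URLPattern exists, the delimiters are expected to be absent:
// protocol "http", search "baz=bang", hash "id".
console.log(patternParts);
```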

rotu commented 2 weeks ago

These all have different names in the URL spec to differentiate the undelimited value from the delimited:
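For illustration, the URL Standard's model calls the undelimited values scheme, query, and fragment, while the API getters protocol, search, and hash return the delimited forms. A minimal sketch of deriving one from the other, where `toModelValues` is a hypothetical helper and not part of any API:

```javascript
// Hypothetical helper: derive the undelimited model values (scheme, query,
// fragment) from the delimited API getters (protocol, search, hash).
function toModelValues(url) {
  return {
    scheme: url.protocol.replace(/:$/, ""), // drop one trailing ":"
    query: url.search.replace(/^\?/, ""),   // drop one leading "?"
    fragment: url.hash.replace(/^#/, ""),   // drop one leading "#"
  };
}

const modelValues = toModelValues(new URL("http://example.com/bar?baz=bang#id"));
console.log(modelValues);
// { scheme: 'http', query: 'baz=bang', fragment: 'id' }
```

This sketch cannot tell a missing query or fragment from an empty one, because the getters return an empty string in both cases.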

jeremyroman commented 2 weeks ago

I'm fairly confident the bottom line here is that it's too late to make a breaking change that would affect a very large fraction of usage of this API (essentially every use that specifies, or checks input on the result of exec for, any of those three components). And Chrome detects usage of the URLPattern constructor on about 2% of page loads, suggesting that it's unlikely to be web-compatible to make a large breaking change.

Is there some other resolution (such as clarifying that this discrepancy exists in the text of the standard, or adding some sort of developer warning to tooling), that you're seeking?

There is some shaving of this rough edge (e.g., new URLPattern({protocol:'https:'}) is the same as new URLPattern({protocol:'https'})), at least.

I'm not sure whether the scheme/protocol, query/search, etc. distinction you mention in the WHATWG URL standard is intentional or not (@annevk might know offhand?). I suspect it's more of a historical artifact than anything else, but I'm not sure.

rotu commented 2 weeks ago

I'm fairly confident the bottom line here is that it's too late to make a breaking change that would affect a very large fraction of usage of this API

I would think now is the time to make breaking changes if any. It's not yet implemented in Firefox and Safari and the WPT conformance coverage is still scanty.

that would affect a very large fraction of usage of this API (essentially every use that specifies, or checks input on the result of exec for, any of those three components)

How prevalent is this actually?

Here's my attempt at figuring it out, which suggests "not very": https://sourcegraph.com/search?q=context:global+content:%22URLPattern%22+and+%28content:.protocol.input+or+content:hash.input+or+content:search.input%29&patternType=keyword&case=yes&sm=0

Most of the usage seems to prefer the .test method instead of the .exec method, and so would be unaffected by such a change.

Is there some other resolution (such as clarifying that this discrepancy exists in the text of the standard, or adding some sort of developer warning to tooling), that you're seeking?

I think that clarifying language is not extremely helpful here. This is the type of subtle inconsistency that you probably don't notice until it actually bites you by writing something like if (somePattern.protocol === someUrl.protocol).

There is some shaving of this rough edge

URL canonicalizes this too: u = new URL('http://foo'); u.protocol='https'; console.assert(u.protocol === 'https:'). Frankly, I could go either way on which is "correct", but I'd rather go with the convention of the more mature standard.
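A small sketch of the setter behavior mentioned here, using only the URL API (the variable names are illustrative): assigning the protocol with or without the trailing colon ends up with the same delimited getter value.

```javascript
// Assign the scheme without the trailing ":".
const withoutColon = new URL("http://foo");
withoutColon.protocol = "https";

// Assign the scheme with the trailing ":".
const withColon = new URL("http://foo");
withColon.protocol = "https:";

// Both getters return the delimited form.
console.log(withoutColon.protocol, withColon.protocol); // https: https:
```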

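For query and hash, this normalization creates even more problems, since ? and # get stripped from the pattern but are permitted unescaped in the URL. So the following examples unintuitively produce non-matches:

  • new URLPattern("http://bar##baz").exec("http://bar##baz")
  • new URLPattern("http://bar??baz").exec("http://bar??baz")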

It makes sense in these cases that canonicalization should be idempotent (give the same result if you do it twice). But if the query can legally contain ?, there's no way to tell whether the first occurrence came from the delimiter or the query itself!
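To make the idempotence point concrete, here is a sketch under the assumption that canonicalization strips a single leading delimiter when one is present. `stripOneLeading` is a hypothetical stand-in for that step, not the spec's algorithm.

```javascript
// Hypothetical helper mirroring "strip one leading delimiter if present".
function stripOneLeading(value, delimiter) {
  return value.startsWith(delimiter) ? value.slice(delimiter.length) : value;
}

const strippedOnce = stripOneLeading("??baz", "?");        // "?baz"
const strippedTwice = stripOneLeading(strippedOnce, "?");  // "baz"

// Applying the step twice gives a different result than applying it once.
console.log(strippedOnce === strippedTwice); // false

// The URL API keeps the extra delimiter as part of the component.
const doubledQuery = new URL("http://bar??baz");
const doubledHash = new URL("http://bar##baz");
console.log(doubledQuery.search, doubledHash.hash); // ??baz ##baz
```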

annevk commented 2 weeks ago

That the names are distinct between the model and the API is indeed largely due to history. I do think it's a compelling point that if an accessor looks equivalent but behaves differently it might lead to confusion and more complicated code.

jeremyroman commented 1 week ago

I'm fairly confident the bottom line here is that it's too late to make a breaking change that would affect a very large fraction of usage of this API

I would think now is the time to make breaking changes if any. It's not yet implemented in Firefox and Safari and the WPT conformance coverage is still scanty.

I would rather change now than later, for sure.

But from a Chromium perspective, I think we'll have a much harder time shipping changes if they break web content, regardless of whether other engines ship URLPattern. And because it's feature-detectable and polyfillable, its absence in other engines doesn't guarantee that developers don't rely on present behavior. And that reliance wouldn't only affect Chromium-based browser users, since if another engine shipped the feature it would cause code that was feature-detecting it to start taking a path that may assume today's behavior.
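As a sketch of the kind of feature-detecting code this concern is about (`makeProtocolMatcher` is a hypothetical name, and the fallback is a deliberately simple assumption rather than a real polyfill):

```javascript
// Hypothetical matcher factory: uses native URLPattern when present,
// otherwise falls back to comparing against the URL protocol getter.
function makeProtocolMatcher(protocol) {
  if (typeof URLPattern !== "undefined") {
    // This path relies on today's behavior: the protocol is given
    // without its trailing ":".
    const pattern = new URLPattern({ protocol });
    return (input) => pattern.test(input);
  }
  // Fallback path: URL.protocol includes the trailing ":".
  return (input) => new URL(input).protocol === `${protocol}:`;
}

const isHttps = makeProtocolMatcher("https");
console.log(isHttps("https://example.com/"), isHttps("http://example.com/"));
```

Code like this switches to the first branch as soon as an engine ships URLPattern, which is why a change in behavior could affect users of engines that do not implement it today.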

that would affect a very large fraction of usage of this API (essentially every use that specifies, or checks input on the result of exec for, any of those three components)

How prevalent is this actually?

Here's my attempt at figuring it out, which suggests "not very": https://sourcegraph.com/search?q=context:global+content:%22URLPattern%22+and+%28content:.protocol.input+or+content:hash.input+or+content:search.input%29&patternType=keyword&case=yes&sm=0

Most of the usage seems to prefer the .test method instead of the .exec method, and so would be unaffected by such a change.

Chrome's use counters do agree that test is more popular (I'm not sure why both are so much less than the constructor; is 2% of the web constructing patterns but never evaluating them?):

But checking the result of exec was the second thing I mentioned; the former (specifying any of those components) is almost surely the more frequent of the two. Even when using test (or another API that integrates URL patterns, such as speculation rules or declarative service worker routing), if the meaning of a pattern like {hash: ''} were to change, that would very likely break content (this might be avoidable by taking care in canonicalization, but I'm not confident). The same applies if anything examines the component properties of the URLPattern object.

Is there some other resolution (such as clarifying that this discrepancy exists in the text of the standard, or adding some sort of developer warning to tooling), that you're seeking?

I think that clarifying language is not extremely helpful here. This is the type of subtle inconsistency that you probably don't notice until it actually bites you by writing something like if (somePattern.protocol === someUrl.protocol).

Unfortunately I don't think even including the delimiter characters makes this a valid thing to do in generality.

There is some shaving of this rough edge

URL canonicalizes this too: u = new URL('http://foo'); u.protocol='https'; console.assert(u.protocol === 'https:'). Frankly, I could go either way on which is "correct", but I'd rather go with the convention of the more mature standard.

Given it is canonicalized, is the concern exclusively what is read when you access the component properties on the URLPattern object (but still storing something else internally, as the URL spec does)? Can this be done without implications for anything else?

On the one hand, consistency with URL is nice if we can avoid knock-on consequences. But I've always found it weird that URL stores without a delimiter internally and then adds it back when you ask, and I worry that exposing something more different from the real internal state is going to add more complications for URL patterns than it does for URLs.

For query and hash, this normalization creates even more problems, since ? and # get stripped from the pattern but are permitted unescaped in the URL. So the following examples unintuitively produce non-matches:

  • new URLPattern("http://bar##baz").exec("http://bar##baz")
  • new URLPattern("http://bar??baz").exec("http://bar??baz")

It makes sense in these cases that canonicalization should be idempotent (give the same result if you do it twice). But if the query can legally contain ?, there's no way to tell whether the first occurrence came from the delimiter or the query itself!

Interesting; I didn't know that multiple delimiters ended up canonicalized away like that. I agree it's surprising (if a bit contrived in practice); do you know if it happens for a reason related to the delimiters being included in the component parts? URL also normalizes those delimiters away and only adds them back in the setter, so I wonder if it's a separate issue.

rotu commented 1 week ago

But checking the result of exec was the second thing I mentioned; the former (specifying any of those components) is almost surely the more frequent of the two.

As long as they lead to the same canonicalization, the trailing : should be indistinguishable:

For instance, new URLPattern({protocol:"http:"}) is identical in behavior to new URLPattern({protocol:"http"}). The behavior differs only where the component itself begins with the same character as the delimiter.
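A guarded sketch of that equivalence, assuming URLPattern may be absent from the runtime (`protocolsCanonicalizeEqually` is a hypothetical helper name):

```javascript
// Compare the canonicalized protocol of two patterns that differ only in
// whether the trailing ":" was supplied. Returns null without URLPattern.
function protocolsCanonicalizeEqually() {
  if (typeof URLPattern === "undefined") {
    return null;
  }
  const withDelimiter = new URLPattern({ protocol: "http:" });
  const withoutDelimiter = new URLPattern({ protocol: "http" });
  return withDelimiter.protocol === withoutDelimiter.protocol;
}

const sameProtocol = protocolsCanonicalizeEqually();
console.log(sameProtocol);
```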

Unfortunately I don't think even including the delimiter characters makes this a valid thing to do in generality.

You're right. Perhaps I'm for deprecating these getters entirely... Having the URLPattern almost implement the URL interface seems like asking for trouble.

Also, there's no way to use the individual components of a URLPattern. If it uses some pattern language syntax, it still needs to be interpreted. (There's no pattern.protocol.exec method, nor a way to get pattern.protocol as a RegExp object, for instance).

But I've always found it weird that URL stores without a delimiter internally and then adds it back when you ask, and I worry that exposing something more different from the real internal state is going to add more complications for URL patterns than it does for URLs.

Agreed. Though for a URL object, those properties have setters and are useful for parsing a URL string. With a URLPattern, the subproperties are only useful for debugging the URLPattern itself.

Interesting; I didn't know that multiple delimiters ended up canonicalized away like that. I agree it's surprising (if a bit contrived in practice); do you know if it happens for a reason related to the delimiters being included in the component parts? URL also normalizes those delimiters away and only adds them back in the setter, so I wonder if it's a separate issue.

I don't know. I thought URL normalized the URL (e.g. that new URL('http://foo/?#').href would give "http://foo") but now I see that's not the case (which I think is basically what you're saying). It seems http://foo/?# and http://foo/ are indistinguishable from a URL object and from the standpoint of a URLPattern.
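A short sketch of what the URL API reports for these two inputs (variable names are illustrative): the serialization keeps the empty delimiters, while the search and hash getters return the same empty strings for both.

```javascript
const bareUrl = new URL("http://foo/");
const emptyPartsUrl = new URL("http://foo/?#");

// href keeps the empty "?" and "#".
console.log(bareUrl.href);       // http://foo/
console.log(emptyPartsUrl.href); // http://foo/?#

// The component getters do not distinguish the two URLs.
const sameSearch = bareUrl.search === emptyPartsUrl.search;
const sameHash = bareUrl.hash === emptyPartsUrl.hash;
console.log(sameSearch, sameHash); // true true
```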

According to RFC 3986, these URLs should be considered different:

Normalization should not remove delimiters when their associated component is empty unless licensed to do so by the scheme specification. For example, the URI "http://example.com/?" cannot be assumed to be equivalent to any of the examples above. Likewise, the presence or absence of delimiters within a userinfo subcomponent is usually significant to its interpretation. The fragment component is not subject to any scheme-based normalization; thus, two URIs that differ only by the suffix "#" are considered different regardless of the scheme.