whatwg / urlpattern

URL Pattern Standard
https://urlpattern.spec.whatwg.org/

Search params matching #150

Open bathos opened 3 years ago

bathos commented 3 years ago

I’m pretty confused by the search params matching behavior.

First I’d note that WHATWG URL seems to (effectively?) define some (narrow) canonicalization rules via the URLSearchParams interface:

  1. "foo=" and "foo" are both representations of the same parameter, [ "foo", "" ], whose canonical form includes the equals sign
  2. percent-encoded bytes are interpreted as UTF-8 code unit sequences (or pass through as-is if they are not valid UTF-8) and if the characters thus encoded fall within the ASCII range and are not generic URL-syntax characters, they are canonicalized to their ASCII representations

Thus the search params / query strings of the following three URLs are “the same” and their canonical form is the first:

`${ new URL("https://example.test/?foo=").searchParams }`;
// → "foo="
`${ new URL("https://example.test/?foo").searchParams }`;
// → "foo="
`${ new URL("https://example.test/?%66oo").searchParams }`
// → "foo="

URLPattern doesn’t appear to consider them equivalent:

new URLPattern({ search: "foo" }).exec(origin + "?foo");
// → { hash, hostname, inputs, ... }
new URLPattern({ search: "foo" }).exec(origin + "?foo=");
// → null
new URLPattern({ search: "foo" }).exec(origin + "?%66oo");
// → null

This seems unfortunate to me — I’d rather not have to “think about” a representation distinction URLSearchParams decided has no meaning in itself (like representing the number 1 as 0x01 or 1e0). It seems to follow from this that the obvious way to say “bind (any) value” for a query param doesn’t work:

new URLPattern({ search: "foo=:foo" }).exec(origin + "?foo=xxx")?.search.groups.foo;
// → "xxx"
new URLPattern({ search: "foo=:foo" }).exec(origin + "?foo")?.search.groups.foo;
// → undefined

However this behavior is consistent with how URL proper works in that url.href doesn’t return the canonicalized version unless you “do something,” e.g. url.searchParams.delete("random"). And it’s not unreasonable to say that if you want the canonicalization behavior of URLSearchParams, you should first pass the exec input through URL and ensure it’s in that form. After all, one may want additional canonicalization behavior like sorting keys or other application-specific semantics that USP is agnostic to.
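A minimal sketch of that "pass the input through URL first" step, assuming the form URLSearchParams serializes is the one you want to match against. The helper name normalizeSearch is hypothetical. Note that this serialization is form-urlencoded, so it may also rewrite other characters (for example, spaces become "+").

```javascript
// Hypothetical helper: rewrite the query of `input` into the form that
// URLSearchParams serializes, leaving the rest of the URL alone.
function normalizeSearch(input) {
  const url = new URL(input);
  // Parse the raw query into name/value pairs, then serialize it back.
  url.search = new URLSearchParams(url.search).toString();
  return url.href;
}

console.log(normalizeSearch("https://example.test/?foo"));
// → "https://example.test/?foo="
console.log(normalizeSearch("https://example.test/?%66oo"));
// → "https://example.test/?foo="
```

The returned string could then be handed to exec, so the pattern only has to deal with one spelling of each parameter.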

But... the canonical representation of an empty parameter value also doesn’t work:

new URLPattern({ search: "foo=:foo" }).exec(origin + "?foo=")?.search.groups.foo;
// → undefined

Note that search params representing boolean values often operate just like boolean content attributes in HTML: foo being present with any value is foo: true while foo being absent is foo: false, and the canonical true is the empty string.
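A minimal sketch of that boolean reading, using URLSearchParams directly rather than URLPattern. The helper name flag is hypothetical.

```javascript
// Hypothetical helper: a param counts as "true" when it is present with any
// value (including the empty string), and "false" when it is absent.
function flag(input, name) {
  return new URL(input).searchParams.has(name);
}

console.log(flag("https://example.test/?foo", "foo"));   // → true
console.log(flag("https://example.test/?foo=", "foo"));  // → true
console.log(flag("https://example.test/?bar=1", "foo")); // → false
```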

There probably is a way to write a pattern that’s actually able to match all values including the empty value, but it seems pretty surprising to me if URLPattern has no awareness of USP possessing structure. Search params aren’t hierarchical, they are a list of key-value pairs, and it’s unclear to me how to use URLPattern to match on params without writing patterns that are more rather than less complex than equivalent RegExps, especially when you want to match while permitting arbitrary params that could appear between others.

Apologies if this has already been discussed and I just missed it. Matching search params is super important for my use cases and I’m struggling a bit as URLPattern has made them more difficult to match for me so far rather than easier to match.

domenic commented 3 years ago

https://github.com/whatwg/url/issues/491 seems related. Basically I would not trust URLSearchParams for anything. Not sure what the full implications of that are for your issue though.

wanderview commented 3 years ago

Just to clarify some terminology, URLPattern is based on URL canonicalization. So search values are encoded and unparsed.

URLSearchParams parses and decodes a search string, but generally that is not called the "canonical" representation. It's more of a processed or parsed output. This may conceptually match what you think of as "canonical search param values", but it's not really defined that way. It's just what URLSearchParams does.

All that being said, URLPattern is not really structured currently to match search parameters well. A pattern string approach does not work well given the undefined ordering of parameters, etc.

We considered trying to create some kind of additional API for this. For example, we could add some way to specify separate patterns to match against a param name and a pattern to match against param values, etc. It's unclear, though, if this would really be useful or an improvement over URLSearchParams. We decided to wait to see if it was really something needed by developers before trying to create such an API.

For now I recommend using URLSearchParams to parse, inspect, and manually match query params.
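One way that manual matching might look, as a sketch only: the helper matchParams and the shape of its expected argument are hypothetical, not part of URLPattern or URLSearchParams. It ignores parameter order and any extra parameters, and it only looks at the first value of a repeated name.

```javascript
// Hypothetical helper: check that every name in `expected` is present in the
// query of `input`. Each expectation is one of:
//   null    - any value (including the empty string) is accepted
//   string  - the decoded value must equal it exactly
//   RegExp  - the decoded value must match (assumes no g/y flag)
// Returns an object of matched values, or null when something is missing.
function matchParams(input, expected) {
  const params = new URL(input).searchParams;
  const groups = {};
  for (const [name, test] of Object.entries(expected)) {
    const value = params.get(name); // first value, or null if absent
    if (value === null) return null;
    const ok =
      test === null ||
      (test instanceof RegExp ? test.test(value) : test === value);
    if (!ok) return null;
    groups[name] = value;
  }
  return groups;
}

console.log(
  matchParams("https://example.test/?utm_source=x&bar=2&foo=1", {
    foo: "1",
    bar: /^\d+$/,
  })
);
// → { foo: "1", bar: "2" }
console.log(matchParams("https://example.test/?foo", { foo: null }));
// → { foo: "" }
console.log(matchParams("https://example.test/?hello", { foo: null }));
// → null
```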

Also, there is some discussion about this in:

https://github.com/WICG/urlpattern/discussions/60

Let's keep this issue open to track interest in search param matching.

(In regards to Domenic's comment about whatwg/url#491, I've never actually run into that problem in practice and I expect it's not a problem for most sites.)

bathos commented 3 years ago

@domenic That is for sure related to the first stuff I described. I agree that the “touch anything on USP and stuff can change” is spooky, though the post-touch result is almost always exactly what I want. In particular, I very much want ?a=&b= and ?a&b to both be understood during matching as “two parameters, having the keys "a" and "b", whose values are the empty string”.

The remainder of the issue is unrelated to that I think. It mostly concerns the challenge of expressing patterns that match specific parameters but do not preclude the presence of others or demand a specific order.

bathos commented 3 years ago

@wanderview thanks for clarifying. I realize there are a few “layers” to URL grammars (URL itself; the overlaid grammars associated with specific schemes; interpretation of percent encoded bytes that has varied over time and by context even within the same scheme; probably more stuff like that that I’m not aware of). I assumed that URLPattern would be aiming for strong alignment with URL/USP (and was unaware that USP, which has served me so well for years, was considered a controversial API), but your explanation makes sense.

It is a tough problem given the extent that application-level semantics tend to play a much bigger role in the interpretation of search params than in path segment bindings (which are almost always just “I wanna name this string”) — an API surface that suits one case might be pretty poor for another. Very often applications bring their own overlaid micro grammars to the search params party whether they realize/formalize it or not (e.g. “a positive integer”, “one of the following specific enumerated string values,” “this param can only appear singularly ... this one can appear multiple times, but is understood as an unordered set ... this one can appear multiple times but is understood as an ordered list”, etc). Most* out-of-the-box clientside routers run from all this screaming and say you’re on your own, and I wouldn’t expect URLPattern to get into any of that territory either.

However it would be swell if there’s a path to making it friendlier for the first-pass matching step in “we don’t care about remainder params” cases, which I think is safe to say is the typical scenario. In particular I think this is important not because of my own use cases so much as because it is easy to write a search pattern that works with a single initial param ... and that breaks as soon as, say, someone in marketing sends the link out with a utm param appended.

* Most — but pour one out for the majestic Angular 1 era lib ui-router — it dove into all of that head first and came up with a fistful of pearls.

wanderview commented 3 years ago

Well, I think a single param match can kind of be written like this:

new URLPattern({ search: '{*&}?q=foo{&*}?' })

But I admit it's ugly.

I'm not saying what URLPattern offers is good, but more that we did not try to solve this problem yet. With more feedback this is something we can improve in the future with an API addition.
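For comparison, a rough plain-RegExp version of the same "q=foo anywhere in the query" check. This is a swapped-in technique rather than URLPattern, and the names singleParam and rawQuery are hypothetical. Like URLPattern, it works on the raw, undecoded search string.

```javascript
// Matches the exact pair q=foo at the start of the query or after "&",
// and requires it to end at "&" or at the end of the query.
const singleParam = /(?:^|&)q=foo(?:&|$)/;

// Hypothetical helper: the query of `input` without its leading "?".
const rawQuery = (input) => new URL(input).search.slice(1);

console.log(singleParam.test(rawQuery("https://example.test/?a=1&q=foo&utm=x")));
// → true
console.log(singleParam.test(rawQuery("https://example.test/?q=foobar")));
// → false
```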

Sayan751 commented 2 years ago

Currently, the pattern {&}?{p1=:p1val(foo|bar)}?{&}?{p2=:p2val(fizz|buzz)}? does not match p2=fizz&p1=foo. It would be cool to support that. I know it might be a long shot, still just putting it out here.

Ayc0 commented 1 week ago

A bit related I think: I'm trying to tinker with URLPattern but I didn't find any good way to check multiple search params at once. For instance, if you want to check that you have foo=1 and bar=2, the URL could look like ?foo=1&bar=2, or ?bar=2&foo=1. But often, we can use a lot of other params too and if you want to check for those 2 + any other, you can have ?*&foo=1&*&bar=2&* and any possible variations of those.

When I test in Chrome, the following code doesn't seem to work:

new URLPattern({ search: 'foo=:foo'}).exec(new URL('?hello&foo=1', window.location))
// null

And search matches don't seem to be stopping at &. So with the following code, I have in the match 1&hello and not just 1:

new URLPattern({ search: 'foo=:foo'}).exec(new URL('?foo=1&hello', window.location))?.search.groups.foo
// '1&hello'

Maybe there is something I'm missing — I just started to play with it