Base URL inheritance gives unintuitive results

domenic commented 1 year ago

(For background see https://github.com/WICG/nav-speculation/issues/259; in particular this is attempting to formalize the proposal at https://github.com/WICG/nav-speculation/issues/259#issuecomment-1507870169. cc @jeremyroman)

Problem

When constructing a URLPattern from a base URL, we have found it surprising how many components get inherited. The particularly problematic ones are search and hash. Consider code such as:

const pattern1 = new URLPattern("/wp-login.php", document.baseURI);

// or:

const pattern2 = new URLPattern({ pathname: "/wp-login.php", baseURL: document.baseURI });

if (!pattern1.test(userInput)) {
  // They're not targeting the login page for our blog. OK to proceed!
}

Can you spot the bug? There are actually two!

If document.baseURL is something like https://example.com/, then this sort of check gets defeated by a userInput derived from a relative URL such as /wp-login.php?bypass or /wp-login.php#bypass.
If document.baseURL is something like https://example.com/?utm_source=blah, then this sort of check fails even for userInput derived from /wp-login.php.

This is because these URLPattern instances inherited the empty-string search and hash components from the base URL:

const noBaseURL = new URLPattern("/wp-login.php");
console.assert(noBaseURL.search === "*");
console.assert(noBaseURL.hash === "*");

const withBaseURL1 = new URLPattern("/wp-login.php", "https://example.com/");
console.assert(withBaseURL1.search === "");
console.assert(withBaseURL1.hash === "");

const withBaseURL2 = new URLPattern("/wp-login.php", "https://example.com/?utm_source=blah");
console.assert(withBaseURL2.search === "utm_source=blah");
console.assert(withBaseURL2.hash === "");

Non-path/search/hash cases

Another instance that some might find unexpected is the following:

const pattern = new URLPattern({ protocol: "https", baseURL: document.baseURI });

This pattern inherits all non-protocol components from the base URL. Whereas, some might expect that since we overrode such an "early" component, we'd only get matching on protocol, not the rest.

This is especially problematic in environments where the base URL argument is implicit, e.g. speculation rules, service worker static routing, or any framework which attempts to specialize itself to "the current page", such as

function addRouteHandlerForThisPage(urlPatternString, handler) {
  routerTable.set(new URLPattern(urlPatternString, document.baseURI), handler);
}

In such environments, you might provide a URL pattern such as

{
  "href_matches": { "protocol": "https" }
}

or

condition: {
  urlPattern: { protocol: "https" }
}

and expect this to match all https:// URLs. But instead, since it inherits all non-protocol components from the base URL, it essentially only matches the base URL itself.

Proposed solution

Whenever a URLPattern input has a given component, do not (by default) inherit any components from the base URL that are "later" than that component. "Later" means mostly the usual order: protocol, hostname, port, pathname, search, hash. For example:
- If the input has a protocol component, do not inherit any components
- If the input has a pathname component, do not inherit the search and hash components
For username and password, we define those to be "later" than protocol, hostname, and port (and define password to be "later" than username). For example:
- If the input has a hostname component, do not inherit port, pathname, search, hash, username, and password
- If the input has a username component, do not inherit pathname, search, hash, password
Add an option, baseURLInheritance, which lets you control this behavior
- The new default, per the above, is "auto"
- "all" gives the current behavior
- Maybe, you could also do an array of components to inherit, e.g. ["port", "search"]. We can leave that out for now minus compelling use cases.

Then:

const pattern1 = new URLPattern("/wp-login.php", "https://example.com/");
console.assert(pattern1.protocol === "https");
console.assert(pattern1.username === "");
console.assert(pattern1.password === "");
console.assert(pattern1.hostname === "example.com");
console.assert(pattern1.port === "");
console.assert(pattern1.pathname === "/wp-login.php");
console.assert(pattern1.search === "*");
console.assert(pattern1.hash === "*");

const pattern2 = new URLPattern("/wp-login.php", "https://example.com/?utm_source=blah");
// Same as pattern1, including:
console.assert(pattern2.search === "*");
console.assert(pattern2.hash === "*");

const pattern3 = new URLPattern("/wp-login.php", "https://example.com/?utm_source=blah#heading");
// Same as pattern1, including:
console.assert(pattern3.search === "*");
console.assert(pattern3.hash === "*");

const pattern4 = new URLPattern("/wp-login.php?user=foo", "https://example.com/");
// Same as pattern1 before search. Then:
console.assert(pattern4.search === "user=foo");
console.assert(pattern4.hash === "*");

const pattern5 = new URLPattern("/wp-login.php?user=foo#bar", "https://example.com/");
// Same as pattern1 before search. Then:
console.assert(pattern5.search === "user=foo");
console.assert(pattern5.hash === "bar");

const pattern6 = new URLPattern("/wp-login.php", "https://example.com/?utm_source=blah#heading", { baseURLInheritance: "all" });
console.assert(pattern6.search === "utm_source=blah");
console.assert(pattern6.hash === "heading");

Considerations

Compat

There might be some compat impact here. I suspect it is low, as we'd need to:

Be using URLPattern
...with a base URL containing a given component
...and a pattern that does not contain that component
...executed against an input where it makes a difference

"using URLPattern" is swinging between 0.02% and 0.05% of page views, so we're in a good place to start with. We plan to add use counters to see how often this conjunction happens in the wild.

Alternatives considered

We could make this new behavior opt-in instead of opt-out. I would strongly prefer to make it opt-out, because: I think it's more sensible and less likely to lead to bugs; and, I think it's the best default for various web platform features, such as speculation rules or service worker scope matching, which hope to base themselves on URL patterns.
We could just tell people to always write their patterns like "/wp-login.php?*#*. I think that would be a sad place to end up.
We could restrict this logic to just pathname -> search -> hash. I think this is reasonable as those are the most problematic components. This would mean not solving the { protocol: "https" } problem, but maybe that's OK.

Server-side considerations

From my experience, most server-side uses of URLPattern are pathname-focused. They probably aren't overly impacted by this. Any confirmation from the server-side community would be helpful.

Who does the work?

Jeremy and I are happy to volunteer on the implementation/spec/test work. Jeremy is working on the use counter.

bathos commented 1 year ago

I gave up on adopting URLPattern largely due to the behaviors described here, so I would be very happy to see them change.

jeremyroman commented 1 year ago

I think a second change is required to kill the requirement for ?*#* (or when there's a wildcard beforehand, \\?*#*). Namely, in the string constructor format, omitted search/hash parts are assumed to be empty rather than wildcard. Consider the following, which has no base URL:

new URLPattern('https://example.com/wp-login.php').test('https://example.com/wp-login.php?a=1')
// false

We would have to also make these default to wildcard, or default to wildcard if an earlier component is present (functionally, this distinction only really applies to the case of a URL pattern which is just a hash), in order to avoid this. Authors would then have to explicitly write ?# if they wanted to require emptiness, or we could provide an option.

Also, is there a compelling use case for "all" inheritance? Authors can work around it by explicitly copying in the parameters they are interested in from the base URL if necessary, and I'm not sure it's common enough to be worth adding complexity for, because I'm having some difficulty coming up with cases where you'd want it.

wanderview commented 1 year ago

Right, this isn't really an "assumption" as much as just what the URL parser does. It produces an empty string for these components.

We can deviate from that, but just be aware that we are doing so.

domenic commented 1 year ago

Also, is there a compelling use case for "all" inheritance? Authors can work around it by explicitly copying in the parameters they are interested in from the base URL if necessary, and I'm not sure it's common enough to be worth adding complexity for, because I'm having some difficulty coming up with cases where you'd want it.

@wanderview seemed to think it was important, in the last paragraph of https://github.com/WICG/nav-speculation/issues/259#issuecomment-1485203938. I admit I can't think of any use case myself.

domenic commented 1 year ago

I think a second change is required to kill the requirement for ?*#* (or when there's a wildcard beforehand, \\?*#*). Namely, in the string constructor format, omitted search/hash parts are assumed to be empty rather than wildcard.

Somehow I didn't realize this. I swear I saw some version of the string format defaulting to *, but maybe it was the object format instead...

What do you think of making the wildcard-for-unset-components apply the same way uniformly? Regardless of base URL vs. no base URL, string format vs. object format. Then new URLPattern('https://') would become the equivalent of today's new URLPattern('https://*:*/*\\?*#*').

I don't know if that's a good idea, since we haven't run into much confusion for patterns of that sort. But it is more elegant, maybe?

wanderview commented 1 year ago

What do you think of making the wildcard-for-unset-components apply the same way uniformly? Regardless of base URL vs. no base URL, string format vs. object format. Then new URLPattern('https://') would become the equivalent of today's new URLPattern('https://:/\?#*').

I'm not opposed to it necessarily. The example of https:// in particular doesn't make sense with current URLPattern behavior. URL() will throw on that string, but URLPattern basically parses it with a somewhat non-useful result.

I think the tricky part is when you extend this to strings that will parse as URLs. We've now deviated from the URL parser behavior which just seems risky to me.

But again, not opposed in general.

Have you given thought how to make these backward incompatible changes? Is the intent to measure volume of usage to see if its safe to just change them? Or do you expect to create some new API surface while deprecating this one?

jeremyroman commented 1 year ago

Any change in this direction does lose the kinda nice property that you can just escape a URL properly and it becomes a URL pattern that matches exactly that URL. Unfortunately I think that property is probably less useful than matching expectations in other ways.

Very anecdotally I think for standard URLs I would expect to need to specify the protocol, hostname and path, as those are the parts that are always visible in a URL string. So I don't think it's that bad to expect to write *://*/* if that's what you really want. Query and hash are the ones where I really would expect them to be omitted in cases where authors really did want them. Probably also username and password, now that I think about it, and they similarly require an optional character to introduce them (@). Port is debatable (and so maybe that everything-pattern should really be *://*:*/*, but at least it's not *://*:*@*:*/*\?*#*), because it's very seldom specified and it's usually significant that it isn't.

By contrast, in https://, it's not obvious to me whether the hostname is absent, or present but empty (though as it turns out, an empty hostname isn't a valid URL anyway). On the other hand, it's clear to me that there is no username, password, query or hash in https://example.com/wp-login.php, and if you wanted to make them explicitly empty the URL syntax allows you to do that as https://:@example.com/wp-login.php?#, as silly as that looks.

One difference between doing any-earlier-specified vs wildcard-if-omitted for the string format stuff is the behavior of things like https://example.com/wp-login.php#foo; should that match https://example.com/wp-login.php?a=1#foo? I'm not sure.

If we want to make a change of this type, I think we would ideally evaluate any cases in the wild that would be adversely affected, hopefully there are few-to-none, and then just make a breaking change.

jeremyroman commented 1 year ago

So for the string format, we can compare the following three policies:

(a) status quo (b) search and hash are * if absent (c) similar logic to base URLs: if no later component is specified, then wildcard (except for port, which is never wildcarded if host is specified since these form a logical unit of authority and I don't think users would expect http://localhost to mean http://localhost:8080)

Some examples with no base URL:

https:
(a) protocol: "https", username: "", password: "", hostname: "", port: "", pathname: "", search: "", hash: ""
(b) protocol: "https", username: "", password: "", hostname: "", port: "", pathname: "", search: "*", hash: "*"
(c) protocol: "https", username: "*", password: "*", hostname: "*", port: "*", pathname: "*", search: "*", hash: "*"

https://example.com
(a) protocol: "https", username: "", password: "", hostname: "example.com", port: "", pathname: "/", search: "", hash: ""
(b) protocol: "https", username: "", password: "", hostname: "example.com", port: "", pathname: "/", search: "*", hash: "*"
(c) protocol: "https", username: "*", password: "*", hostname: "example.com", port: "", pathname: "*", search: "*", hash: "*"

https://example.com?bar
(a) protocol: "https", username: "", password: "", hostname: "example.com", port: "", pathname: "/", search: "bar", hash: ""
(b) protocol: "https", username: "", password: "", hostname: "example.com", port: "", pathname: "/", search: "bar", hash: "*"
(c) protocol: "https", username: "*", password: "*", hostname: "example.com", port: "", pathname: "/", search: "bar", hash: "*"

https://example.com/foo#baz
(a) protocol: "https", username: "", password: "", hostname: "example.com", port: "", pathname: "/foo", search: "", hash: "baz"
(b) protocol: "https", username: "", password: "", hostname: "example.com", port: "", pathname: "/foo", search: "*", hash: "baz"
(c) protocol: "https", username: "*", password: "*", hostname: "example.com", port: "", pathname: "/foo", search: "", hash: "baz"

(c) is probably not too bad, though (b) is a smaller behavior change.

domenic commented 1 year ago

(c) similar logic to base URLs: if no later component is specified, then wildcard (except for port, which is never wildcarded if host is specified since these form a logical unit of authority and I don't think users would expect http://localhost to mean http://localhost:8080)

I would extend this same logic to username and password since they really shouldn't be showing up in modern URLs anyway. Like port, if people want to wildcard those, they should be explicit.

https://example.com

For (c), this differs from https://example.com/ with a trailing slash, right? I think that's OK...

One difference between doing any-earlier-specified vs wildcard-if-omitted for the string format stuff is the behavior of things like https://example.com/wp-login.php#foo; should that match https://example.com/wp-login.php?a=1#foo? I'm not sure.

I tend to think this should not match.

jeremyroman commented 1 year ago

re. username and password, I agree they shouldn't be ordinarily showing up, but if I'm an author and I say "okay, you can can prefetch everything except https://example.com/_admin/*", then if UGC contains https://bypass:password@example.com/_admin/change_admin I would be surprised to learn that it was let through. It's not really intended to be ironclad but I think my point is that this would be surprising to me.

jeremyroman commented 1 year ago

For (c), we can of course define it, but the most direct interpretation of what you suggested would seem to imply that specifying the / means you provided a path and not means you didn't (unless you did provide a query or hash).

(Tangentially, it's nifty to learn that more stuff is optional in URLs than I realized. https:html.spec.whatwg.org#parsing-html-fragments works.)

jeremyroman commented 11 months ago

The Chromium use counters for this are in stable now, and are both registering virtually zero (let's say <0.01%). Therefore I think this change is web-compatible.

annevk commented 11 months ago

This change seems reasonable overall, but given @jeremyroman's example above around username/password we should make sure there are no surprises. I guess writing out all the variants and asserts for a test and then reviewing that might be the easiest way to ensure that?

jeremyroman commented 11 months ago

A similar username/password question arises for the base URL inheritance side of things, I discovered while tweaking Chromium unit tests in a draft CL.

new URLPattern({pathname: "/_admin/*", baseURL: document.baseURI}) does today and will after these changes end up taking the username and password of the base URL (typically empty). Thus it will fail to match URLs which are the same but do specify an explicit username or password.

Incidentally, to my mild surprise, the username and password do contribute to document.baseURI (even though Chrome and Firefox, at least, don't render it in the address bar).

We could leave this to authors as a case they need to consider when using URL patterns for blocking from suspicious content, or we could extend the defaults to always wildcard username and password (at a cost of making it necessary to explicitly make them empty by writing :@ or username: "", password: "" otherwise). I don't love adding extra corner cases, but I also don't expect authors to think of this and it seems likely wildcarding is usually what is wanted.

It is possible that, conversely, an author is using patterns to allow something specific but since servers typically ignore usernames/passwords they aren't expecting (AFAIK) that seems less risky. And it's consistent with them defaulting to "*" when no base URL is used in the dictionary form, which is the most low-level/default way of using URLPattern.

I agree that writing out the interesting variants (ideally all of them, but I don't know how large that combinatorial explosion is) would be an interesting way of looking for additional surprises.

domenic commented 11 months ago

I tend toward treating username/password in similar ways for both types of defaulting: base URL inheritance, and wildcard-if-something-earlier-is-wildcard. I think that suggests in your example, they would still be inherited, since the wildcard only occurs in the path (/_admin/*).

I don't think this is the highest-stakes decision though, because URLs using usernames/passwords are non-conformant, and (I think?) on a path toward removal. (Mike West, purposefully not pinging him, has expressed cheerful sentiments about removing support for fetching such URLs in the past.)

I would, in fact, be very OK with encouraging URLPattern-using specs to turn themselves off, or at least turn URL pattern matching off, when used on documents with such URLs.

annevk commented 11 months ago

I think it really depends on the use case what ends up being bad:

Being able to bypass a service worker by including a username in a request: clearly bad.
Being able to bypass dictionary compression by including a username in a request: clearly fine, maybe even desirable per @domenic's deprecation rationale.
Being able to bypass a CSP check by including a username in a request: clearly bad.
?

I think since username and password are so unexpected, wildcarding them unless we explicitly know intent (i.e., they are explicitly set) seems very reasonable and probably even necessary.

jeremyroman commented 11 months ago

Without any additional change to username/password, here's what Chromium does today vs with proposed changes, including every variation of including/excluding a typical fixed part, and then the patterns from the WPT test suite.

https://gist.github.com/jeremyroman/c712860aed8080b5524feb8118a5105a

Edit: some dictionary-style examples were incorrect right but have been corrected

jeremyroman commented 10 months ago

Similar diffs for wildcarding username/password unless they're explicit (i.e., never inherit them from the base URL, and in the string format wildcard them unless specified).

https://gist.github.com/jeremyroman/8d99c4667490800dbb3c46d3d26b4885

Even though this adds more logic/rules, I do think it seems likely to make for better defaults. If other behavior is desirable we can always add options in the future, but I'm hopeful they won't prove that necessary in practice.

snaptopixel commented 7 months ago

When Chrome recently updated, customers started reporting issues with a webapp that uses URLPattern and it was due to this change. Hash is important in SPAs and I don't think it should be "wildcarded" like this by default.

In my case, I have a tabbed UI which sets the hash as tabs are clicked. When users load the app the first time (without a hash), I have a redirect that will add the hash causing the first tab to display. With this change it causes an infinite-loop, since test now returns true for the base url without the hash (my "redirect" route) and infinitely tries to redirect.

jeremyroman commented 7 months ago

Motivating this change, we noticed that several developers were surprised that, e.g., /foo did not suffice to match URLs like /foo?utm_source=... or /foo#more, and I did try to identify cases where this was likely to cause breakage; Chrome's data showed that only a very small fraction of existing page loads applied patterns in a way that would have started matching where it previously did not (as in your case, unfortunately).

In your case I hope that adapting to this change would be relatively straightforward in a way that is backward-compatible with any prior implementations of URLPattern, such as:

changing the string syntax /foo to /foo# when you want to expressly match pages with no or absent hash (this will also force the search section to be absent/empty, since search is less specific than hash; /foo?*# would keep search wildcarded)
Using a pattern like new URLPattern({hash: ''}) or an explicit location.hash.length < 2 check to look for an empty hash separately.
Depending on the structure of your routing system, you might be able to check "more specific" routes first and only check "less specific" routes if none of those match, leaving this redirect as a fallback case not only for empty hash, but also any unrecognized hash.

If your code looks like (1), it's possible it might have previously not worked correctly if a search query was present and not stripped (and some pages receive these unexpectedly, like the fbclid parameter when following links from Facebook).

snaptopixel commented 7 months ago

Thanks @jeremyroman I had tried adding the # last night and that works for now. I'd like to update the router to be more resilient overall, but this is an acceptable mitigation.

IMHO, it's more surprising that it would match things I'm not explicit about but I see the other perspective as well. In my view with SPAs the url is part of the state and foo/bar is not the same thing as foo/bar#first-tab or foo/bar?action=edit

whatwg / urlpattern