whatwg / urlpattern

URL Pattern Standard
https://urlpattern.spec.whatwg.org/
Other
155 stars 22 forks source link

Globstar `**` causes exponential runtime behavior #237

Open johnp opened 1 day ago

johnp commented 1 day ago

What is the issue with the URL Pattern Standard?

I've recently come across this bug report in the popular urlpattern-polyfill library. Under "§ 2.2. Converting part lists to regular expressions", a pattern such as

new URLPattern({ hostname: '**.google.com' })

is converted to the regex

/^((?:.*)*)\.google\.com$/u

. This regular expression runs in exponential time under V8, and even breaks on sufficiently long input in SpiderMonkey with the error "InternalError: too much recursion". Of the major JS engines, only WebKit is able to execute it in linear time. The globstar pattern is very common in many globbing languages/libraries, for example minimatch. Software wanting to adopt the standardized URLPattern in exchange for the underspecified space of globbing dialects, will eventually have to deal with end user's code subtly breaking due to this (example).

To avoid this, and under the assumption that there is not much value in emitting ((?:.*)*), would it maybe be possible to interpret globstars as unnamed groups, i.e. (.*)? I suspect that there was likely a reason not to support this pattern, but I haven't been able to find any discussion around the topic of globstars in relation to this spec, so I wanted to at least raise it publicly.

Interestingly, while the exponential runtime behavior happens with the polyfill, the native V8 URLPattern implementation does not appear to suffer the same issue.

jeremyroman commented 5 hours ago

Interesting. The way a series of modifiers works is, in a different way, also interesting.

This tokenizes as asterisk asterisk char(.) char(g) char(o) ..., which gets turned into:

It seems not too hard to just drop the modifier when we have a full wildcard. Would that be enough? I think you could still make it exponential in such regex engines by writing ([^])* or similar, but it'd at least be less of a footgun.

Honestly I'm not sure of the utility of applying any of the modifiers to full-wildcard parts. Is that syntax (writing **, *? or *+) useful in any way, or is it always exactly equivalent to *, including the result of trying to match it. I think it's equivalent so we could just drop the modifier, or maybe even reject such patterns since the author might be trying to do something else (would need investigation if that's web-compatible to do even if we wanted to).