whatwg / urlpattern

URL Pattern Standard
https://urlpattern.spec.whatwg.org/

Attribute to show the regular expression usage. #191

Closed yoshisatoyanagisawa closed 9 months ago

yoshisatoyanagisawa commented 11 months ago

URLPattern is used in several proposals now:

URLPattern supports regular expressions, but executing arbitrary user-provided regular expressions in trusted areas of the browser is a security concern. It has been suggested to prohibit regular expressions in some APIs that use URLPattern:

To avoid unexpected regular expression use in URLPattern, a new attribute that indicates whether a pattern uses regular expressions has been considered in https://github.com/WICG/urlpattern/issues/182#issuecomment-1716988239. With this flag, web developers can detect unexpected regular expression usage themselves.

An alternative solution is an option that raises an error when a regular expression is used in the pattern. However, whether regular expressions are denied depends on the API using URLPattern: some APIs may allow them and others may not. We should not have such a thing in the regular path.
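The two designs described above can be contrasted in a minimal sketch. Everything here is an assumption for illustration: `PatternLike`, `registerRoute`, `createPattern`, and the `disallowRegExp` option are hypothetical, the attribute name `hasRegExpGroups` is borrowed from the naming discussion later in this thread, and the `usesRegExp` argument stands in for what a real pattern parser would determine.

```typescript
// Hypothetical stand-in for a parsed pattern; only the flag matters here.
interface PatternLike {
  readonly source: string;
  readonly hasRegExpGroups: boolean; // assumed attribute name
}

// Hypothetical option for the "raise on regular expression" alternative.
interface CreateOptions {
  disallowRegExp?: boolean;
}

// Design B: creation itself fails when the option is set.
// `usesRegExp` is supplied by the caller only because this sketch has no parser.
function createPattern(
  source: string,
  usesRegExp: boolean,
  options: CreateOptions = {},
): PatternLike {
  if (options.disallowRegExp && usesRegExp) {
    throw new TypeError(`Regular expressions are not allowed: ${source}`);
  }
  return { source, hasRegExpGroups: usesRegExp };
}

// Design A: each consuming API inspects the flag and applies its own policy.
function registerRoute(pattern: PatternLike): string {
  if (pattern.hasRegExpGroups) {
    throw new TypeError(`Pattern "${pattern.source}" uses a regular expression`);
  }
  return `registered ${pattern.source}`;
}
```

With the attribute, the policy lives in the consuming API (`registerRoute`), so an API that accepts regular expressions simply never checks the flag. With the option, the policy has to be threaded into pattern creation, which is the "regular path" concern raised above.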

Jamesernator commented 11 months ago

The concern here is just ReDoS, right? It seems unfortunate to prohibit patterns like /(one|two)/ in places like service workers. Instead of restricting all regexps, would it be possible to limit only those that contain potential backtracking?

domenic commented 11 months ago

That's not the concern, actually. The issue is that several of these features run in browser processes (usually the network process) which don't have a JavaScript engine, and thus don't have a regular expression engine. Bringing regexp support to the network process would require bringing a whole copy of V8, which as you can imagine, is not trivial.

(I'll also note that browser security doesn't make a distinction of the sort you're discussing, between "dangerous" untrusted input and "non-dangerous" untrusted input. Any untrusted input at all falls afoul of the rule of 2, at least in Chromium.)

annevk commented 11 months ago

This concern applies to server operators too, e.g., with compression dictionaries. This again leads me to think that having some kind of subset would be a good idea.

It seems like another way of stating the constraint here is that there's no interest in implementing a regular expression engine in a safe language.

Jamesernator commented 11 months ago

> in implementing a regular expression engine in a safe language.

You mean in an unsafe language right?

> Any untrusted input at all falls afoul of the rule of 2, at least in Chromium.)

The article you've linked mentions that Chromium does have a trusted regex library, RE2. Would it be viable in Chromium to limit regexps in URLPatterns to some common subset that is shared with RE2? From the RE2 docs it does seem like a fairly large subset should be viable.

@annevk I presume Firefox doesn't have this limitation as any regex engine for this purpose could just be written in Rust right?

domenic commented 11 months ago

I can't say for certain on behalf of the involved teams, but my suspicion is we're not interested in exposing two dialects of regular expressions to the web platform (the standardized JS one, and the non-standardized RE2 one).

annevk commented 11 months ago

@Jamesernator I meant what I wrote. If there were interest in redoing a web platform regular expression engine in a safe language, you'd meet the rule of 2.

Last I checked, SpiderMonkey uses V8's regular expression engine, so they're in the same boat. WebKit has its own, but it is also unsafe.

And yeah, exposing two separate implementations seems very risky and not tenable in the long term.

jeremyroman commented 11 months ago

It probably could be done (if only because the regexp use case here is more limited, since URLs can reasonably be expected to be shorter than larger haystacks). But in the short term it doesn't seem likely that any implementer (let alone all) is interested in carving out a more tailored subset of ECMAScript regexes, and then specifying, documenting, and implementing that in a safe language.

At the moment just allowing the things outside of a regexp group (i.e., fixed parts, ? and * wildcards) seems likely to be the most pragmatic compromise, even though a larger subset is in principle possible.
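As a rough illustration of that compromise, an API could accept only patterns that contain no regexp group. The sketch below is a simplification and not the specification's parser: `containsRegExpGroup` is a hypothetical helper that assumes a regexp group is introduced by any `(` that is not escaped with a backslash, and it ignores every other detail of pattern parsing.

```typescript
// Hypothetical helper: returns true if the pattern string contains an
// unescaped "(", which this sketch treats as the start of a regexp group.
function containsRegExpGroup(pattern: string): boolean {
  for (let i = 0; i < pattern.length; i++) {
    const ch = pattern.charAt(i);
    if (ch === "\\") {
      i++; // skip the character that the backslash escapes
      continue;
    }
    if (ch === "(") {
      return true;
    }
  }
  return false;
}

// Fixed text, named groups, and wildcards pass; a regexp group does not.
const candidates = ["/books/:id", "/files/*", "/(one|two)/"];
const accepted = candidates.filter((p) => !containsRegExpGroup(p));
```

Under these assumptions, `accepted` keeps `/books/:id` and `/files/*` and drops `/(one|two)/`, the example pattern mentioned earlier in the thread.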

domenic commented 10 months ago

Anyone want to bikeshed the name here? I think we should use "RegExp" (instead of e.g. Regex) since that's what JavaScript uses. With that as a base, some ideas so far:

jeremyroman commented 10 months ago

Slight preference for referring to the parsed state rather than tokens in the input, but I could live with any of those. I think hasRegExpGroups is my narrow favorite, with requiresRegExp as a runner-up.