Matching `(` and `)` in a regexp matcher requires escaping

whatwg / urlpattern

URL Pattern Standard

https://urlpattern.spec.whatwg.org/

Other

160 stars 22 forks source link

Matching `(` and `)` in a regexp matcher requires escaping #177

Open lucacasonato opened 1 year ago

lucacasonato commented 1 year ago

Fails, even though it's valid:

const pattern = new URLPattern({ pathname: '/([()])' });
Uncaught TypeError: tokenizer error: invalid regex: nested groups must start with ? (at char 1)

OK:

const pattern = new URLPattern({ pathname: '/([\\(\\)])' });

This is because while tokenizing the pattern, we think the second ( is a nested group rather than just a char in a regexp character class.

Fixing this will make the tokenizer more complicated. Is it worth it?

lucacasonato commented 1 year ago

Actually I think we need to fix this for sure. Consider these two patterns:

// valid regexp, but pattern tokenizer thinks there is a nested group that isn't closed
const pattern = new URLPattern({ pathname: '/([(?])' });

// valid regexp group, valid urlpattern, except that this throws because "Invalid regular expression: /^(?:/([))\]\)$/u: Unterminated character class" 
const pattern = new URLPattern({ pathname: '/([)])' });

wanderview commented 1 year ago

Yea, seems like a real problem to me. I'm not sure when I will have the bandwidth to look at this, though. Do you have a proposed fix?

lucacasonato commented 1 year ago

I think we have to keep track of [ and ] in the tokenizer while parsing regexp tokens, and ignore all ( and ) while between a [ and ] in that regexp. This can be a simple boolean as character classes can't be nested (no need to keep track of depth).

jeremyroman commented 1 year ago

character classes can't be nested (no need to keep track of depth).

Is this true? Given #178 we'll support syntax like [\d--[07]] (every digit except 0 and 7) as part of the UnicodeSets regexp mode.