whatwg / urlpattern

URL Pattern Standard
https://urlpattern.spec.whatwg.org/
Other
154 stars 20 forks source link

Matching `(` and `)` in a regexp matcher requires escaping #177

Open lucacasonato opened 1 year ago

lucacasonato commented 1 year ago

Fails, even though it's valid:

const pattern = new URLPattern({ pathname: '/([()])' });
Uncaught TypeError: tokenizer error: invalid regex: nested groups must start with ? (at char 1)

OK:

const pattern = new URLPattern({ pathname: '/([\\(\\)])' });

This is because while tokenizing the pattern, we think the second ( is a nested group rather than just a char in a regexp character class.

Fixing this will make the tokenizer more complicated. Is it worth it?

lucacasonato commented 1 year ago

Actually I think we need to fix this for sure. Consider these two patterns:

// valid regexp, but pattern tokenizer thinks there is a nested group that isn't closed
const pattern = new URLPattern({ pathname: '/([(?])' });

// valid regexp group, valid urlpattern, except that this throws because "Invalid regular expression: /^(?:/([))\]\)$/u: Unterminated character class" 
const pattern = new URLPattern({ pathname: '/([)])' });
wanderview commented 1 year ago

Yea, seems like a real problem to me. I'm not sure when I will have the bandwidth to look at this, though. Do you have a proposed fix?

lucacasonato commented 1 year ago

I think we have to keep track of [ and ] in the tokenizer while parsing regexp tokens, and ignore all ( and ) while between a [ and ] in that regexp. This can be a simple boolean as character classes can't be nested (no need to keep track of depth).

jeremyroman commented 12 months ago

character classes can't be nested (no need to keep track of depth).

Is this true? Given #178 we'll support syntax like [\d--[07]] (every digit except 0 and 7) as part of the UnicodeSets regexp mode.