yhirose / cpp-peglib

A single file C++ header-only PEG (Parsing Expression Grammars) library
MIT License

Case insensitive literal not working with backreferences #216

Closed mingodad closed 2 years ago

mingodad commented 2 years ago

See discussion here and the examples tested on cpp-peglib playground.

yhirose commented 2 years ago

@mingodad, could you put the smallest possible PEG grammar here, so that I can reproduce it on my machine easily? Thanks!

ChrisHixon commented 2 years ago

I'm seeing various corruption in the error message with this grammar on the playground:

ROOT          <- CONTENT !.
CONTENT       <- (ELEMENT / TEXT)*
ELEMENT       <- $(STAG CONTENT ETAG)
STAG          <- '<'  < $tag<TAGNAME> > '>'
ETAG          <- '</' < $tag > '>'
TAGNAME <- 'a' / 'b'i
TEXT          <- (![<] .)+

Input: <a>foo</A>

On Firefox, the error I'm currently seeing with the above grammar/input:

1:9 syntax error, unexpected 'A', expecting 'd tota % success fail definition 13 4 '.

It seems more likely to happen when the i suffix is added to the literals in TAGNAME, but I've seen corruption in simpler cases. Minor edits to TAGNAME change the corruption, even things like altering the number of spaces. I see corruption in both Chromium and Firefox, even after refreshing, clearing cookies and local data, etc.

The command line lint seems to always show the error I believe is the proper error (with lots of variations on the TAGNAME): 1:9: syntax error, unexpected 'A', expecting 'a'.

I'll see if I can narrow it down to simpler grammar any...

ChrisHixon commented 2 years ago

This is about as simple as I can get it and still see consistent corruption:

ROOT          <- CONTENT !.
CONTENT       <- (ELEMENT / TEXT)*
ELEMENT       <- $(STAG CONTENT ETAG)
STAG          <- '<'  < $tag<"a"> > '>'
ETAG          <- '</' < $tag > '>'
TEXT          <- (![<] .)+

Input: <a>foo</A>

Most of the time the error is: 1:9 syntax error, unexpected 'A', expecting 'd '. Occasionally it is: 1:9 syntax error, unexpected 'A', expecting 's) i'.

yhirose commented 2 years ago

@ChrisHixon, thanks for the problem report. I fixed it at 3c2a53c79b7642a547127b31e102526be72206e5.

yhirose commented 2 years ago

@mingodad, I would like to make sure I understand what you are mentioning here.

The current cpp-peglib backreference behavior is an exact match against the captured string, the same as backreferences in regular expressions.

image

If you are suggesting that this example should succeed, I am not sure that is correct. Could you explain more clearly?
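To make the regular-expression comparison concrete, here is a minimal sketch that is not from the thread. It uses `std::regex` from the C++ standard library rather than cpp-peglib, and the helper name `tags_match` and the pattern are assumptions chosen to mirror the `<a>foo</A>` input above. The closing tag is a backreference to the captured opening tag, so without any case-insensitive flag it has to repeat the captured text exactly.

```cpp
#include <regex>
#include <string>

// Hypothetical helper for illustration only: returns true when `input` is
// "foo" wrapped in an opening tag and a closing tag with the same name.
// The closing tag is written as \1, a backreference to the text captured
// by the first group (the opening tag name).
bool tags_match(const std::string& input) {
  // Default ECMAScript syntax, no icase flag: matching is case sensitive.
  static const std::regex re(R"(<([a-z]+)>foo</\1>)");
  return std::regex_match(input, re);
}
```

Under these assumptions, `tags_match("<a>foo</a>")` succeeds, while `tags_match("<a>foo</A>")` fails because the backreference expects the captured `a`, which corresponds to the `expecting 'a'` error reported by the command line lint.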

mingodad commented 2 years ago

Now that you have shown it with a regex, I can see your point. On the same topic, it would be nice to have a case-insensitive character class, [a-z]i, for grammars where identifiers are case insensitive (SQL, Pascal, ...).
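As a rough illustration of what a case-insensitive range such as [a-z]i could mean, here is a small sketch. It is not cpp-peglib code and does not describe how any library implements the feature; the function name `in_range_icase` and the chosen semantics (a character matches if either its lowercase or its uppercase form falls inside the range) are assumptions.

```cpp
#include <cctype>

// Hypothetical semantics for a single range [lo-hi] with an `i` suffix:
// accept `c` when its lowercase or uppercase form lies within the range.
// ASCII letters are assumed; other characters are compared unchanged.
bool in_range_icase(char c, char lo, char hi) {
  const unsigned char uc = static_cast<unsigned char>(c);
  const int lower = std::tolower(uc);
  const int upper = std::toupper(uc);
  const int l = static_cast<unsigned char>(lo);
  const int h = static_cast<unsigned char>(hi);
  return (l <= lower && lower <= h) || (l <= upper && upper <= h);
}
```

With this definition, `in_range_icase('Q', 'a', 'z')` accepts the uppercase letter, while a digit such as `'5'` is still rejected by the same range.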

mingodad commented 2 years ago

Here is an example on the peggy playground https://peggyjs.org/online.html (also implemented here https://github.com/mingodad/peg):

start = name_char+ 
name_char =
     [a-z0-9$_]i* [ \t\n]

Input:

one
Two
One

yhirose commented 2 years ago

@mingodad, thanks for the response. I'll close this issue. Could you open a separate issue for the [...]i operator?