westes / flex

The Fast Lexical Analyzer - scanner generator for lexing in C and C++

Error lexing single/double quoted string #577

Closed mingodad closed 1 year ago

mingodad commented 1 year ago

See also discussion here https://github.com/BenHanson/lexertl14/issues/13

Testing this "(?:\\.|[^"\n\r])*"|'(?:\\.|[^'\n\r])*' on https://regex101.com/ gives this:

match,group,is_participating,start,end,content
1,0,yes,0,5,"one"
2,0,yes,6,11,'two'
3,0,yes,12,16,"\\"
4,0,yes,17,28,"BACKSLASH"
5,0,yes,29,33,'\\'
6,0,yes,34,45,'BACKSLASH'
7,0,yes,46,53,'three'

Testing this "(?:\\.|[^"\\\n\r])*"|'(?:\\.|[^'\\\n\r])*' on https://regex101.com/ gives this:

match,group,is_participating,start,end,content
1,0,yes,0,5,"one"
2,0,yes,6,11,'two'
3,0,yes,12,16,"\\"
4,0,yes,17,28,"BACKSLASH"
5,0,yes,29,33,'\\'
6,0,yes,34,45,'BACKSLASH'
7,0,yes,46,53,'three'

Test string:

"one"
'two'
"\\" "BACKSLASH"
'\\' 'BACKSLASH'
'three'

But with flex it gives this:

    int num_lines = 0, num_spaces = 0, num_strings = 0;
%option noyywrap

%%
\n      ++num_lines; ++num_spaces;
[ \t\r]+  ++num_spaces;
\"(\\.|[^\"\n\r])*\"|'(\\.|[^'\n\r])*'  ++num_strings;

%%

int main()
{
    yylex();
    printf( "# of lines = %d, # of num_spaces = %d, # of num_strings = %d\n",
        num_lines, num_spaces, num_strings );
}

Output:

cat test.string | ./test-str
BACKSLASH"BACKSLASHthree'# of lines = 4, # of num_spaces = 4, # of num_strings = 5

With the backslash excluded from the character classes, flex gives this:

    int num_lines = 0, num_spaces = 0, num_strings = 0;
%option noyywrap

%%
\n      ++num_lines; ++num_spaces;
[ \t\r]+  ++num_spaces;
\"(\\.|[^\\\"\n\r])*\"|'(\\.|[^\\'\n\r])*'  ++num_strings;

%%

int main()
{
    yylex();
    printf( "# of lines = %d, # of num_spaces = %d, # of num_strings = %d\n",
        num_lines, num_spaces, num_strings );
}

Output:

cat test.string | ./test-str
# of lines = 5, # of num_spaces = 7, # of num_strings = 7

BenHanson commented 1 year ago

Yes, the second form you have in flex there is the correct one.
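
To illustrate why the second form behaves differently under longest-match rules, here is a rough sketch in C. It is not flex's actual implementation: it only simulates the double-quoted half of the rule by tracking the set of live states by hand, and the names `longest_string_match` and `class_allows_backslash` are made up for this illustration. In the first form a backslash can be consumed either by `\\.` or by the character class, so the scanner may read `\"` as an escaped quote and carry on to a later closing quote.

```c
#include <assert.h>
#include <stddef.h>

/* States of a hand-built automaton for  "(\\.|[^"\n\r])*"  */
enum { IN_STRING = 1 << 0, AFTER_BACKSLASH = 1 << 1, ACCEPT = 1 << 2 };

/*
 * Length of the longest prefix of s that is a complete double-quoted
 * string, or 0 if there is none.
 * class_allows_backslash != 0: the class is [^"\n\r]    (first form)
 * class_allows_backslash == 0: the class is [^\\"\n\r]  (second form)
 */
size_t longest_string_match(const char *s, int class_allows_backslash)
{
    assert(s != NULL);
    if (s[0] != '"')
        return 0;

    unsigned states = IN_STRING;  /* live states after the opening quote */
    size_t longest = 0;

    for (size_t i = 1; s[i] != '\0' && states != 0; ++i) {
        char c = s[i];
        unsigned next = 0;

        if (states & AFTER_BACKSLASH) {
            /* the '.' of \\. : any character except newline */
            if (c != '\n')
                next |= IN_STRING;
        }
        if (states & IN_STRING) {
            if (c == '"') {
                next |= ACCEPT;               /* closing quote */
            } else if (c == '\\') {
                next |= AFTER_BACKSLASH;      /* start of \\. */
                if (class_allows_backslash)
                    next |= IN_STRING;        /* backslash taken by the class */
            } else if (c != '\n' && c != '\r') {
                next |= IN_STRING;            /* ordinary character */
            }
        }
        /* ACCEPT has no outgoing transitions. */

        if (next & ACCEPT)
            longest = i + 1;                  /* longest match seen so far */
        states = next;
    }
    return longest;
}
```

Under these assumptions, for the input `"\\" "BACKSLASH"` the call with `class_allows_backslash` set to 1 returns 6 (the match swallows `"\\" "`), and with 0 it returns 4 (just `"\\"`). That is consistent with the stray `BACKSLASH` text that flex echoed with the first form.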

mingodad commented 1 year ago

The thing is, if both use the same strategy (DFA), why does one work and the other not? I mean, the regex engines work with both regexes, but flex doesn't.

mingodad commented 1 year ago

With this regex \"(?:\\.|[^\"\n\r])*\"|'(?:\\.|[^'\n\r])*' and this one \"(?:\\.|[^\"\n\r\\])*\"|'(?:\\.|[^'\n\r\\])*' on https://regex101.com/, the following flavours recognize all strings: Java 8, Golang, Python, JavaScript and PCRE.

BenHanson commented 1 year ago

Those other implementations aren't using DFA/leftmost longest.
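
As a rough illustration of the difference, here is a minimal C sketch of how a backtracking engine might handle the first form of the double-quoted pattern. It assumes a greedy star, alternatives tried left to right, and the first overall success returned. The names `match_body` and `first_string_match` are hypothetical, and this is not the code of any of the engines mentioned.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Match the remainder  (\\.|[^"\n\r])*"  starting at s[i] the way a
 * backtracking engine would: greedy star, alternatives tried left to
 * right, first overall success wins.
 * Returns the end offset of the match, or 0 on failure.
 */
static size_t match_body(const char *s, size_t i)
{
    size_t end;

    /* Alternative 1: \\.  (a backslash, then any character except newline) */
    if (s[i] == '\\' && s[i + 1] != '\0' && s[i + 1] != '\n') {
        end = match_body(s, i + 2);
        if (end != 0)
            return end;
    }
    /* Alternative 2: [^"\n\r]  (this class also accepts a backslash) */
    if (s[i] != '\0' && s[i] != '"' && s[i] != '\n' && s[i] != '\r') {
        end = match_body(s, i + 1);
        if (end != 0)
            return end;
    }
    /* No further iteration of the star: the closing quote must follow. */
    return s[i] == '"' ? i + 1 : 0;
}

/* Length of the match found at the start of s, or 0 if there is none. */
size_t first_string_match(const char *s)
{
    assert(s != NULL);
    return s[0] == '"' ? match_body(s, 1) : 0;
}
```

For the input `"\\" "BACKSLASH"` this sketch returns 4: the `\\.` alternative is tried first and consumes both backslashes, the closing quote then matches, and the engine stops at that first success without looking for a longer one. A longest-match scanner given the same pattern would keep going to the later quote.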

There are actually multiple regex flavours, but let's keep things simple.