westes / flex

The Fast Lexical Analyzer - scanner generator for lexing in C and C++
Other
3.55k stars 529 forks source link

C-99 comment rule flawed #609

Closed ogdenpm closed 8 months ago

ogdenpm commented 10 months ago

In section A.4.3 of the documentation, the C99 comment pattern is given as ("/*"([^*]|"*"[^/])*"*/")|("/"(\\\n)*"/"[^\n]*)

The first part of this for recognising / ... / patterns appears to be incorrect, specifically

/* some text **/ would fail as the * would match `""[^/] , preventing the/ `matching. Note I tried inserting a / between the "" and the [^/] i.e. to check for a single not followed by a / i.e. `("/"([^]|""/[^/])"/")` but flex generates an unrecognized rule error

zmajeed commented 9 months ago

You're right - I think gobbling any trailing stars fixes the regex

"/*"([^*]|"*"[^/])*"*"+"/"

or to make it easier to read with whitespace ignored

(?x: "/*" ( [^*] | "*"[^/] )* "*"+"/" )

I'll make the change if you think this is right

ogdenpm commented 9 months ago

Zartaj I think this would fail with /* text * more text / The intermediate * will not match. If the \ option worked e.g. “”[^/] the solution would be trivial Mark

Regards Mark Ogden


From: Zartaj Majeed @.> Sent: Wednesday, December 13, 2023 1:06:15 PM To: westes/flex @.> Cc: Mark Ogden @.>; Author @.> Subject: Re: [westes/flex] C-99 comment rule flawed (Issue #609)

You're right - I think gobbling any trailing stars fixes the regex

"/"([^]|""[^/])"*"+"/"

or to make it easier to read with whitespace ignored

(?x: "/" ( [^] | ""[^/] ) ""+"/" )

I'll make the change if you think this is right

— Reply to this email directly, view it on GitHubhttps://github.com/westes/flex/issues/609#issuecomment-1853884166, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AAZ6DD4KMTPVYT2CNUPNXDDYJGR4PAVCNFSM6AAAAAA7P2CSQ2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQNJTHA4DIMJWGY. You are receiving this because you authored the thread.Message ID: @.***>

zmajeed commented 9 months ago

Sorry - that had a typo - I corrected it above - it should have been

(?x: "/*" ( [^*] | "*"[^/] )* "*"+"/" )

This accepts

/* text ** more text */

But it has another problem - stemming from the original regex - where it accepts invalid comments like

/* comment 1 **/ invalid missing comment start */

This is because for the intermediate **/ input, "*"[^/] consumes ** then [^*] consumes / and the lexer merrily continues to a successful match at the very last '*/`

zmajeed commented 9 months ago

I feel this cannot be done with Flex regex - it seems to require some lookahead to avoid prematurely consuming the star from a comment end delimiter */

I'm actually surprised that every single basic regex solution I found online is wrong! Some of these posts are decades old

The Flex doc also has a FAQ on matching C comments - there it only has couple example patterns that are clearly labelled wrong and doesn't purport to offer a working regex - that's probably what needs to be done for this section too

The trailing context "*"/[^/] you tried can't be used inside group parentheses - that's why you got the error - I've always avoided using it for this and other limitations

ogdenpm commented 9 months ago

Zartag I guess this is one case where the non greedy glob works.

(?s:”/”.?”*/“) Mark

Regards Mark Ogden


From: Zartaj Majeed @.> Sent: Thursday, December 14, 2023 1:22:55 PM To: westes/flex @.> Cc: Mark Ogden @.>; Author @.> Subject: Re: [westes/flex] C-99 comment rule flawed (Issue #609)

I feel this cannot be done with Flex regex - it seems to require some lookahead to avoid prematurely consuming the star from a comment end delimiter */

I'm actually surprised that every single basic regex solution I found online is wrong! Some of these posts are decades old

The Flex doc also has a FAQ on matching C comments - there it only has couple example patterns that are clearly labelled wrong and doesn't purport to offer a working regex - that's probably what needs to be done for this section too

The trailing context "*"/[^/] you tried can't be used inside group parentheses - that's why you got the error - I've always avoided using it for this and other limitations

— Reply to this email directly, view it on GitHubhttps://github.com/westes/flex/issues/609#issuecomment-1855844479, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AAZ6DD7JBYYXI745K4LNRUTYJL4S7AVCNFSM6AAAAAA7P2CSQ2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQNJVHA2DINBXHE. You are receiving this because you authored the thread.Message ID: @.***>

ogdenpm commented 9 months ago

Zartag Just had a thought (?xs:”/” ([^/] | “” )/“) Might work as the / is longer than “” Mark

Regards Mark Ogden


From: @. @.> Sent: Thursday, December 14, 2023 2:05:06 PM To: westes/flex @.>; westes/flex @.> Cc: Author @.***> Subject: Re: [westes/flex] C-99 comment rule flawed (Issue #609)

Zartag I guess this is one case where the non greedy glob works.

(?s:”/”.?”*/“) Mark

Regards Mark Ogden


From: Zartaj Majeed @.> Sent: Thursday, December 14, 2023 1:22:55 PM To: westes/flex @.> Cc: Mark Ogden @.>; Author @.> Subject: Re: [westes/flex] C-99 comment rule flawed (Issue #609)

I feel this cannot be done with Flex regex - it seems to require some lookahead to avoid prematurely consuming the star from a comment end delimiter */

I'm actually surprised that every single basic regex solution I found online is wrong! Some of these posts are decades old

The Flex doc also has a FAQ on matching C comments - there it only has couple example patterns that are clearly labelled wrong and doesn't purport to offer a working regex - that's probably what needs to be done for this section too

The trailing context "*"/[^/] you tried can't be used inside group parentheses - that's why you got the error - I've always avoided using it for this and other limitations

— Reply to this email directly, view it on GitHubhttps://github.com/westes/flex/issues/609#issuecomment-1855844479, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AAZ6DD7JBYYXI745K4LNRUTYJL4S7AVCNFSM6AAAAAA7P2CSQ2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQNJVHA2DINBXHE. You are receiving this because you authored the thread.Message ID: @.***>

ogdenpm commented 9 months ago

Oops The first part should be [^*] otherwise / will not be allowed Mark

Regards Mark Ogden


From: @. @.> Sent: Thursday, December 14, 2023 2:42:27 PM To: westes/flex @.>; westes/flex @.> Cc: Author @.***> Subject: Re: [westes/flex] C-99 comment rule flawed (Issue #609)

Zartag Just had a thought (?xs:”/” ([^/] | “” )/“) Might work as the / is longer than “” Mark

Regards Mark Ogden


From: @. @.> Sent: Thursday, December 14, 2023 2:05:06 PM To: westes/flex @.>; westes/flex @.> Cc: Author @.***> Subject: Re: [westes/flex] C-99 comment rule flawed (Issue #609)

Zartag I guess this is one case where the non greedy glob works.

(?s:”/”.?”*/“) Mark

Regards Mark Ogden


From: Zartaj Majeed @.> Sent: Thursday, December 14, 2023 1:22:55 PM To: westes/flex @.> Cc: Mark Ogden @.>; Author @.> Subject: Re: [westes/flex] C-99 comment rule flawed (Issue #609)

I feel this cannot be done with Flex regex - it seems to require some lookahead to avoid prematurely consuming the star from a comment end delimiter */

I'm actually surprised that every single basic regex solution I found online is wrong! Some of these posts are decades old

The Flex doc also has a FAQ on matching C comments - there it only has couple example patterns that are clearly labelled wrong and doesn't purport to offer a working regex - that's probably what needs to be done for this section too

The trailing context "*"/[^/] you tried can't be used inside group parentheses - that's why you got the error - I've always avoided using it for this and other limitations

— Reply to this email directly, view it on GitHubhttps://github.com/westes/flex/issues/609#issuecomment-1855844479, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AAZ6DD7JBYYXI745K4LNRUTYJL4S7AVCNFSM6AAAAAA7P2CSQ2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQNJVHA2DINBXHE. You are receiving this because you authored the thread.Message ID: @.***>

ogdenpm commented 9 months ago

Zartag I found a solution on the internet

(?xs: “/” ([^] | “”+[^/])”+”/“) Mark

Regards Mark


From: @. @.> Sent: Thursday, December 14, 2023 2:45:20 PM To: westes/flex @.>; westes/flex @.> Cc: Author @.***> Subject: Re: [westes/flex] C-99 comment rule flawed (Issue #609)

Oops The first part should be [^*] otherwise / will not be allowed Mark

Regards Mark Ogden


From: @. @.> Sent: Thursday, December 14, 2023 2:42:27 PM To: westes/flex @.>; westes/flex @.> Cc: Author @.***> Subject: Re: [westes/flex] C-99 comment rule flawed (Issue #609)

Zartag Just had a thought (?xs:”/” ([^/] | “” )/“) Might work as the / is longer than “” Mark

Regards Mark Ogden


From: @. @.> Sent: Thursday, December 14, 2023 2:05:06 PM To: westes/flex @.>; westes/flex @.> Cc: Author @.***> Subject: Re: [westes/flex] C-99 comment rule flawed (Issue #609)

Zartag I guess this is one case where the non greedy glob works.

(?s:”/”.?”*/“) Mark

Regards Mark Ogden


From: Zartaj Majeed @.> Sent: Thursday, December 14, 2023 1:22:55 PM To: westes/flex @.> Cc: Mark Ogden @.>; Author @.> Subject: Re: [westes/flex] C-99 comment rule flawed (Issue #609)

I feel this cannot be done with Flex regex - it seems to require some lookahead to avoid prematurely consuming the star from a comment end delimiter */

I'm actually surprised that every single basic regex solution I found online is wrong! Some of these posts are decades old

The Flex doc also has a FAQ on matching C comments - there it only has couple example patterns that are clearly labelled wrong and doesn't purport to offer a working regex - that's probably what needs to be done for this section too

The trailing context "*"/[^/] you tried can't be used inside group parentheses - that's why you got the error - I've always avoided using it for this and other limitations

— Reply to this email directly, view it on GitHubhttps://github.com/westes/flex/issues/609#issuecomment-1855844479, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AAZ6DD7JBYYXI745K4LNRUTYJL4S7AVCNFSM6AAAAAA7P2CSQ2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQNJVHA2DINBXHE. You are receiving this because you authored the thread.Message ID: @.***>

zmajeed commented 9 months ago

Yep - this works - now the middle part of the regex “*”+[^*/] only works for runs of stars that don't end at an end delimiter - this means an end delimiter can only be matched by the last part of the regex - and matching won't continue past the first end delimiter

Can test with grep

grep -Po '(?x: \/\* ([^*] | \*+[^*/])* \*+\/)' <<EOF
/* text ** more text */
/* some text **/
/* comment 1 *//* comment 2 */
/* comment 3 */    /* comment 4 */
/* comment 5 */
/* comment 6 */ invalid comment missing start after comment 6 */
/* comment 7 **/ invalid comment missing start after comment 7 */
/* comment 8 // **/ invalid comment missing start after comment 8 */
EOF
/* text ** more text */
/* some text **/
/* comment 1 */
/* comment 2 */
/* comment 3 */
/* comment 4 */
/* comment 5 */
/* comment 6 */
/* comment 7 **/
/* comment 8 // **/

and multiline comments

grep -z -Po '(?x: \/\* ([^*] | \*+[^*/])* \*+\/)' <<EOF
/* multiline
comment 9 */
EOF
/* multiline
comment 9 */

Incidentally the dotall flag (?s:) is not needed since there's no dot in the regex

zmajeed commented 9 months ago

The C++ comment regex doesn't account for newline escapes in the comment body either - it fails for

grep -z -Po '(?x: \/ (\\ \n)* \/ [^\n]* )' <<EOF
not comment before /\\
/ multiline split delimiter \\
> comment 1
> not comment after
> EOF
/\
/ multiline split delimiter \

Adding another match for escaped newlines after comment start fixes it

grep -z -Po '(?x: \/ (\\ \n)* \/ ( (\\ \n) | [^\n] )* )' <<EOF
not comment before /\\
/ multiline split delimiter \\
comment 1
not comment after
EOF
/\
/ multiline split delimiter \
comment 1
zmajeed commented 9 months ago

So the full correct regex for C and C++ comments is

("/*"([^*]|"*"[^*/])*"*"+"/")|("/"(\\\n)*"/"((\\\n)|[^\n])*)

or

(?x: ( "/*" ( [^*] | "*"+ [^*/] )* "*"+ "/" ) | ( "/" (\\ \n)* "/" ( (\\ \n) | [^\n] )* ) )
zmajeed commented 9 months ago

Fix in PR 614, https://github.com/westes/flex/pull/614

westes commented 8 months ago

fixed by #614