westes / flex

The Fast Lexical Analyzer - scanner generator for lexing in C and C++

Char class in comment gives m4 "ERROR: end of file in string" #553

Closed gitamohr closed 1 year ago

gitamohr commented 1 year ago

The following input to flex 2.6.4 gives an m4 error:

%%
A { return 'A'; }
    /*
     * Bug: [[:alnum:]_]
     */
%%
> flex bug.ll
/bin/m4:stdin:1315: ERROR: end of file in string

The error disappears if I remove the underscore character from the comment, like * Bug: [[:alnum:]]

Mightyjo commented 1 year ago

Do you get the same error if you misspell a character class name anywhere else in your lexer?

-Joe

gitamohr commented 1 year ago

Well, the contents of a comment should not affect the output. But FWIW using that char class in the grammar works fine:

%%
[[:alnum:]_] { return 'Z'; }
    /*
     * Bug: [[:alnum:]]
     */
%%

(If I add the underscore back into the comment like [[:alnum:]_] the error returns.)

Mightyjo commented 1 year ago

Okay, that's weird. I notice you named the file .ll. Is that for C++? (It shouldn't matter, but weird is weird.) Are any command-line switches or other options needed to reproduce this?

gitamohr commented 1 year ago

No, that's just what the file happens to be named. I renamed it to bug.l. I'm invoking flex with no arguments other than the input file. I built flex 2.6.4 into /usr/local with ./configure && make && sudo make install. Here's a complete terminal session repro. I tried to reduce the repro as much as I could:

> cat bug.l
%%
A { return 'A'; }
    /*
     * Bug: [[:alnum:]_]
     */
%%
> flex bug.l
/bin/m4:stdin:1315: ERROR: end of file in string
> flex -V
flex 2.6.4

And just to say it, this isn't just a spurious error; flex's output is truncated and invalid. I can work around it by modifying my comments, but it seems like a bona fide bug in flex's comment handling, so I wanted to report it.

gitamohr commented 1 year ago

Here's a slightly more reduced repro. This fails:

%%
A { return 'A'; }
    /* [[:alnum:]_] */
%%

This works:

%%
A { return 'A'; } /* [[:alnum:]_] */
%%

gitamohr commented 1 year ago

There is something crucial about having additional characters after the [:alnum:] character class expression in the comment. Having characters before it (like [_[:alnum:]]) works fine. The particular characters that follow don't seem to matter: I've tried whitespace, letters, digits, and special characters. The character class name doesn't matter either -- I've tried :alpha:, :digit:, and even :bogus:, and they all reproduce the error.

Mightyjo commented 1 year ago

First, sorry for saying [[:alnum:]_] was a misspelling. I was holding on to a false notion that character class names in flex included the outer square brackets. Probably because of the next thing.

Second, I found the problem but I can't fix it right now. It's peculiar to comment handling, as you noticed. Flex wraps comments in its customized M4 quotes, which happen to be [[ and ]]. Because the character classes aren't being scanned and replaced in the comments, M4 is reading the brackets around them as quotation marks. This is usually okay when they are balanced (e.g. [[:alnum:]]). It leads to the error you saw when they look like unbalanced quotes (e.g. [[:alnum:]_]).
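
A rough way to see why only [[:alnum:]_] trips m4 is to count quote depth the way a quote-aware scanner would. The sketch below is a simplified model, not flex's or m4's actual code. It assumes that [[ and ]] are the active quote delimiters, that scanning starts outside any quote, and that a stray ]] outside a quote is treated as plain text. The function name open_quote_depth is made up for illustration.

```python
def open_quote_depth(text, lquote="[[", rquote="]]"):
    """Return how many quotes are still open after scanning text.

    Simplified model (an assumption, not m4's real scanner): an opening
    delimiter always nests one level deeper, and a closing delimiter only
    counts while at least one quote is open.
    """
    depth = 0
    i = 0
    while i < len(text):
        if text.startswith(lquote, i):
            depth += 1
            i += len(lquote)
        elif depth > 0 and text.startswith(rquote, i):
            depth -= 1
            i += len(rquote)
        else:
            i += 1
    return depth


# The three comment bodies discussed in this thread.
for sample in ("[[:alnum:]]", "[[:alnum:]_]", "[_[:alnum:]]"):
    state = "unterminated" if open_quote_depth(sample) else "balanced"
    print(f"{sample!r}: {state}")
```

Under this model only the middle sample leaves a quote open, which would be consistent with the "end of file in string" report. The other two correspond to the cases reported as working above.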

Options:

Sorry the comment quoting makes this edge case complicated.

gitamohr commented 1 year ago

No worries -- thanks for taking a look. I can easily work around this. For what it's worth, this example works in version 2.5.39, so the bug was introduced somewhere between 2.5.39 and 2.6.4.

Mightyjo commented 1 year ago

Drat! Now it's a bug instead of an oddity.

Mightyjo commented 1 year ago

Think I found it. Mainly for my reference when writing a test & patch: we aren't escaping m4qstart and m4qend in the COMMENT_DISCARD condition the same way we are in COMMENT. I think that's the source of this. I'll write tests based on the cases above, thanks for those!

Mightyjo commented 1 year ago

Nope, none of that worked.

@gitamohr, exactly what example did you test in 2.5.39? I'm trying to reproduce a working test from your comments above and finding no differences between 2.5.39 and HEAD.

%%
g {; } /* after action comment [[:alnum:]_] */
h {; }
    /* after action comment [[:alnum:]_] */
%%

Flex accepts g but dies on h.

Here's what's up: The comment after the h action is scanned as ... I don't know what. Could be a comment, could be an action. Looks like it just gets echoed a byte at a time either way.

However! The following construction works for long comments in 2.5.39 and HEAD:

i {; } /*

Outcomes: I'm adding tests for multiline comments with unmatched brackets to tests/quotes.l. I'll include the g and i constructions only for now.

gitamohr commented 1 year ago

I just tried the shortest example from above:

> cat bug.l
%%
A { return 'A'; }
    /* [[:alnum:]_] */
%%

> flex bug.l 
/bin/m4:stdin:1315: ERROR: end of file in string

> flex -V
flex 2.6.4

> /old/flex bug.l

> /old/flex -V
flex 2.5.39

cheers!

Mightyjo commented 1 year ago

In this thread: I show myself to be an idiot. I have my trusty, old "2.5.39" folder connected to the 2.6.4 tag for some reason.

Beg your pardon. Be back with better results shortly.

Mightyjo commented 1 year ago

Well, I'm back where we started. I see the issue, but I can't fix it for a while.

> cat bug.l
%%
A { return 'A'; }
    /* [[:alnum:]_] */
%%

Flex sees the comment after A's action as a "CODE_COMMENT". Those aren't m4 quoted the same way as other comments because quoting them causes other problems. Until we get rid of the m4 dependency, I can't change the behavior back to what you came to expect in 2.5.39 without breaking other functionality.

That said, you can use the constructions I provided above instead. I'm about done with the tests for them so we'll notice before losing any more comment functionality.
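
For anyone working around this in their own scanners, a small pre-check can flag comments that might confuse m4 before flex is run. This is a sketch under assumptions and is not part of flex. The helper name risky_comments is hypothetical, it treats every /* ... */ span as a comment (including any that sit inside string literals), and it uses a simple non-nesting heuristic: a [[ with no later ]] in the same comment. It is deliberately conservative, so it also flags same-line comments that this thread reports as working.

```python
import re

# Any C-style comment, possibly spanning several lines.
COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)
# A "[[" that is closed by a later "]]" (non-nesting approximation).
BALANCED = re.compile(r"\[\[.*?\]\]", re.DOTALL)


def risky_comments(source):
    """Yield (line_number, comment) for comments with an unclosed '[['."""
    for match in COMMENT.finditer(source):
        comment = match.group(0)
        # Drop the balanced pairs, then look for an opener that is left over.
        leftover = BALANCED.sub("", comment)
        if "[[" in leftover:
            line = source.count("\n", 0, match.start()) + 1
            yield line, comment


if __name__ == "__main__":
    spec = "%%\nA { return 'A'; }\n    /* [[:alnum:]_] */\n%%\n"
    for line, comment in risky_comments(spec):
        print(f"line {line}: {comment}")
```

Rewording a flagged comment so that it no longer contains an unclosed [[ (for example, writing [_[:alnum:]] instead) matches the workaround described earlier in the thread.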

gitamohr commented 1 year ago

Yep no worries, as I've mentioned this is no real impediment; just something I noticed.

westes commented 1 year ago

fixed by #557

matthew-wozniczka commented 7 months ago

Any idea if https://stackoverflow.com/questions/78157667/error-end-of-file-in-string-error-coming-from-m4-when-using-flex is related to this?