westes / flex

The Fast Lexical Analyzer - scanner generator for lexing in C and C++

Flex segfaults after reading EOF in `input()` #636

Open nxg opened 5 months ago

nxg commented 5 months ago

The program below works as expected when reading from stdin, but segfaults when it is instead lexing a buffer.

The key thing about this example is that one of the rules uses input() to gobble from "!" to EOF (yes, it looks as if I could use a "!".* pattern, but that doesn't produce the intended results in the real case; the lexer needs to balance braces, and if I hit EOF when trying to do that, I want to recover gracefully).

When run reading from stdin, I get:

$ flex -o eof.c eof.lex
$ cc -o eof eof.c
$ echo -n 'one two!three four' | ./eof
word:<one>
-> 1
-> 2
word:<two>
-> 1
buf=<three four>
-> 3

That's fine, but when I instead ./eof 'one two !three four', which scans the contents of a buffer set up by yy_scan_string, I get identical program output, followed by a segfault inside yy_get_next_buffer.

I can't work out which part of the flex manual is telling me I should expect that to happen.

The sequence of events seems to be that the lexer is finding its way to the end of file, as expected (and an <<EOF>> action confirms this), but not stopping there, despite the presence of the noyywrap option, and collapsing when it can't find a ‘next’ buffer.

Points:

Program:

ALPHABETIC  [a-zA-Z]
WS      [^a-zA-Z!]

%option noyywrap nounput

%%

{ALPHABETIC}+   {
    printf("word:<%s>\n", yytext);
    return 1;
}
{WS}+   {
    return 2;
}

"!"         {  // gobble to end of input
    char buf[80];
    for (int idx=0; (buf[idx] = input()); idx++) /* empty */ ;
    printf("buf=<%s>\n", buf);
    // YY_FLUSH_BUFFER; /* makes no difference */
    return 3;
}

%%
int main(int argc, char** argv)
{
    switch (argc) {
      case 1: break;
      case 2:
        yy_scan_string(argv[1]);
        break;
      default:
        fprintf(stderr, "Usage: %s [string]\n", argv[0]);
        exit(1);
    }

    int token;
    while ((token = yylex()) != 0) {
        printf("-> %d\n", token);
    }
}

Mightyjo commented 5 months ago

I can't find a spot in the docs that explains this behavior clearly. The best hints I could find are in the sections on multiple buffers, yywrap, and EOF rules.

You need an <<EOF>> rule that calls yyterminate or sets up the next buffer. That rule will take the place of yywrap in your use case.

I'm away from my computer but I'll post an example when I'm back.
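
For reference, a minimal sketch of the kind of rule described above might look like the following. It is an illustration under stated assumptions, not a confirmed fix: the token rules are simplified from the program in the report, the "!" rule that calls input() is deliberately left out, and nothing in this thread establishes that the explicit <<EOF>> rule alone prevents the crash once input() has consumed EOF. The calls used (yy_scan_string, yy_delete_buffer, yyterminate) are standard flex API.

```lex
%option noyywrap nounput noinput

%{
#include <stdio.h>
%}

%%

[a-zA-Z]+    { printf("word:<%s>\n", yytext); return 1; }
[^a-zA-Z]+   { return 2; }

<<EOF>>      {
    /* Explicit end-of-input handling. There is no further buffer
       to switch to here, so end this yylex() call; the caller
       sees a return value of 0. */
    yyterminate();
}

%%

int main(int argc, char **argv)
{
    YY_BUFFER_STATE buf = NULL;

    /* With one argument, scan a copy of that string instead of stdin. */
    if (argc == 2)
        buf = yy_scan_string(argv[1]);

    int token;
    while ((token = yylex()) != 0)
        printf("-> %d\n", token);

    /* Release the buffer created by yy_scan_string. */
    if (buf)
        yy_delete_buffer(buf);
    return 0;
}
```

A scanner that has more input queued would, instead of calling yyterminate(), switch to the next buffer inside the <<EOF>> action (for example with yy_switch_to_buffer) and let scanning continue.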

nxg commented 5 months ago

Thanks for clarifying.

In case it's useful when thinking about the docs: my mental model, when writing what I did, was that when I arrive at EOF using input(), I'm doing so ‘legitimately’ (i.e., as opposed to my being illegitimately creative with yyinput, or something like that). On that basis I presumed yywrap would Do The Right Thing, and that when flex subsequently asked for more input from input(), it would be calmly told ‘no’.

Or, put another way, my mental model is that flex is itself using input() to get input, or something equivalent to that, so that I'm working in concert with it if I read from it separately.

If those are bad intuitions, it might be useful for the docs to disabuse the reader fairly explicitly.